@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge
phorge phabricator
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Adjust the Drydock allocator to limit each pending lease to one allocating resource

Summary:
Ref T13677. Currently, one lease may cause multiple resources to allocate simultaneously if it starts allocating one, then wakes up from a yield later on and still sees no available resources.

This is never desired -- or, at least, produces desirable behavior only entirely by accident. Normally, it causes an excess of resources to allocate.

This is not a catastrophic problem: the extra resources usually get used sooner or later or cleaned up; and the total amount of badness is limited by overall resource allocation limits.

However, this behavior is also suppressed by an artificial "25% of current pool size" growth limit throttle which I intend to remove. Removing this throttle without fixing the allocator behavior could make this "too many resources" problem worse.

Change the allocator so that a lease that has started allocating a resource won't allocate another resource until the first resource leaves the "pending" state.

This also fixes some general oddness with the allocator and attempts to simplify the structure.

Test Plan:
- Ran 8 taskmasters.
- Destroyed all resources and leases.
- Leased 4 working copies.
- Saw exactly 4 resources build and lease, all simultaneously.
- Destroyed all resources and leases.
- Leased 32 working copies.
- Saw exactly 32 resources build and lease, approximately 8 at a time (limited by taskmasters).
- Destroyed all leases (but not resources).
- Leased 32 working copies, saw them satisfied by existing resources.

Subscribers: PHID-OPKG-gm6ozazyms6q6i22gyam

Maniphest Tasks: T13677

Differential Revision: https://secure.phabricator.com/D21807

+330 -186
+329 -185
src/applications/drydock/worker/DrydockLeaseUpdateWorker.php
··· 188 188 189 189 // First, try to find a suitable open resource which we can acquire a new 190 190 // lease on. 191 - $resources = $this->loadResourcesForAllocatingLease($blueprints, $lease); 191 + $resources = $this->loadAcquirableResourcesForLease($blueprints, $lease); 192 + 193 + list($free_resources, $used_resources) = $this->partitionResources( 194 + $lease, 195 + $resources); 196 + 197 + $resource = $this->leaseAnyResource($lease, $free_resources); 198 + if ($resource) { 199 + return $resource; 200 + } 201 + 202 + // We're about to try creating a resource. If we're already creating 203 + // something, just yield until that resolves. 204 + 205 + $this->yieldForPendingResources($lease); 206 + 207 + // We haven't been able to lease an existing resource yet, so now we try to 208 + // create one. We may still have some less-desirable "used" resources that 209 + // we'll sometimes try to lease later if we fail to allocate a new resource. 210 + 211 + $resource = $this->newLeasedResource($lease, $blueprints); 212 + if ($resource) { 213 + return $resource; 214 + } 215 + 216 + // We haven't been able to lease a desirable "free" resource or create a 217 + // new resource. Try to lease a "used" resource. 218 + 219 + $resource = $this->leaseAnyResource($lease, $used_resources); 220 + if ($resource) { 221 + return $resource; 222 + } 223 + 224 + // If this lease has already triggered a reclaim, just yield and wait for 225 + // it to resolve. 226 + $this->yieldForReclaimingResources($lease); 192 227 193 - // If no resources exist yet, see if we can build one. 194 - if (!$resources) { 195 - $usable_blueprints = $this->removeOverallocatedBlueprints( 196 - $blueprints, 197 - $lease); 228 + // Try to reclaim a resource. This will yield if it reclaims something. 229 + $this->reclaimAnyResource($lease, $blueprints); 198 230 199 - // If we get nothing back here, some blueprint claims it can eventually 200 - // satisfy the lease, just not right now. This is a temporary failure, 201 - // and we expect allocation to succeed eventually. 202 - if (!$usable_blueprints) { 203 - $blueprints = $this->rankBlueprints($blueprints, $lease); 231 + // We weren't able to lease, create, or reclaim any resources. We just have 232 + // to wait for resources to become available. 233 + 234 + $lease->logEvent( 235 + DrydockLeaseWaitingForResourcesLogType::LOGCONST, 236 + array( 237 + 'blueprintPHIDs' => mpull($blueprints, 'getPHID'), 238 + )); 239 + 240 + throw new PhabricatorWorkerYieldException(15); 241 + } 242 + 243 + private function reclaimAnyResource(DrydockLease $lease, array $blueprints) { 244 + assert_instances_of($blueprints, 'DrydockBlueprint'); 245 + 246 + $blueprints = $this->rankBlueprints($blueprints, $lease); 204 247 205 - // Try to actively reclaim unused resources. If we succeed, jump back 206 - // into the queue in an effort to claim it. 207 - foreach ($blueprints as $blueprint) { 208 - $reclaimed = $this->reclaimResources($blueprint, $lease); 209 - if ($reclaimed) { 210 - $lease->logEvent( 211 - DrydockLeaseReclaimLogType::LOGCONST, 212 - array( 213 - 'resourcePHIDs' => array($reclaimed->getPHID()), 214 - )); 248 + // Try to actively reclaim unused resources. If we succeed, jump back 249 + // into the queue in an effort to claim it. 215 250 216 - throw new PhabricatorWorkerYieldException(15); 217 - } 218 - } 251 + foreach ($blueprints as $blueprint) { 252 + $reclaimed = $this->reclaimResources($blueprint, $lease); 253 + if ($reclaimed) { 219 254 220 255 $lease->logEvent( 221 - DrydockLeaseWaitingForResourcesLogType::LOGCONST, 256 + DrydockLeaseReclaimLogType::LOGCONST, 222 257 array( 223 - 'blueprintPHIDs' => mpull($blueprints, 'getPHID'), 258 + 'resourcePHIDs' => array($reclaimed->getPHID()), 224 259 )); 225 260 261 + // Yield explicitly here: we'll be awakened when the resource is 262 + // reclaimed. 263 + 226 264 throw new PhabricatorWorkerYieldException(15); 227 265 } 266 + } 267 + } 268 + 269 + private function yieldForPendingResources(DrydockLease $lease) { 270 + // See T13677. If this lease has already triggered the allocation of 271 + // one or more resources and they are still pending, just yield and 272 + // wait for them. 273 + 274 + $viewer = $this->getViewer(); 275 + 276 + $phids = $lease->getAllocatedResourcePHIDs(); 277 + if (!$phids) { 278 + return null; 279 + } 280 + 281 + $resources = id(new DrydockResourceQuery()) 282 + ->setViewer($viewer) 283 + ->withPHIDs($phids) 284 + ->withStatuses( 285 + array( 286 + DrydockResourceStatus::STATUS_PENDING, 287 + )) 288 + ->setLimit(1) 289 + ->execute(); 290 + if (!$resources) { 291 + return; 292 + } 293 + 294 + $lease->logEvent( 295 + DrydockLeaseWaitingForActivationLogType::LOGCONST, 296 + array( 297 + 'resourcePHIDs' => mpull($resources, 'getPHID'), 298 + )); 299 + 300 + throw new PhabricatorWorkerYieldException(15); 301 + } 302 + 303 + private function yieldForReclaimingResources(DrydockLease $lease) { 304 + $viewer = $this->getViewer(); 228 305 229 - $usable_blueprints = $this->rankBlueprints($usable_blueprints, $lease); 306 + $phids = $lease->getReclaimedResourcePHIDs(); 307 + if (!$phids) { 308 + return; 309 + } 230 310 231 - $exceptions = array(); 232 - foreach ($usable_blueprints as $blueprint) { 233 - try { 234 - $resources[] = $this->allocateResource($blueprint, $lease); 311 + $resources = id(new DrydockResourceQuery()) 312 + ->setViewer($viewer) 313 + ->withPHIDs($phids) 314 + ->withStatuses( 315 + array( 316 + DrydockResourceStatus::STATUS_ACTIVE, 317 + )) 318 + ->setLimit(1) 319 + ->execute(); 320 + if (!$resources) { 321 + return; 322 + } 235 323 236 - // Bail after allocating one resource, we don't need any more than 237 - // this. 238 - break; 239 - } catch (Exception $ex) { 240 - // This failure is not normally expected, so log it. It can be 241 - // caused by something mundane and recoverable, however (see below 242 - // for discussion). 324 + $lease->logEvent( 325 + DrydockLeaseWaitingForReclamationLogType::LOGCONST, 326 + array( 327 + 'resourcePHIDs' => mpull($resources, 'getPHID'), 328 + )); 243 329 244 - // We log to the blueprint separately from the log to the lease: 245 - // the lease is not attached to a blueprint yet so the lease log 246 - // will not show up on the blueprint; more than one blueprint may 247 - // fail; and the lease is not really impacted (and won't log) if at 248 - // least one blueprint actually works. 330 + throw new PhabricatorWorkerYieldException(15); 331 + } 249 332 250 - $blueprint->logEvent( 251 - DrydockResourceAllocationFailureLogType::LOGCONST, 252 - array( 253 - 'class' => get_class($ex), 254 - 'message' => $ex->getMessage(), 255 - )); 333 + private function newLeasedResource( 334 + DrydockLease $lease, 335 + array $blueprints) { 336 + assert_instances_of($blueprints, 'DrydockBlueprint'); 256 337 257 - $exceptions[] = $ex; 258 - } 338 + $usable_blueprints = $this->removeOverallocatedBlueprints( 339 + $blueprints, 340 + $lease); 341 + 342 + // If we get nothing back here, some blueprint claims it can eventually 343 + // satisfy the lease, just not right now. This is a temporary failure, 344 + // and we expect allocation to succeed eventually. 345 + 346 + // Return, try to lease a "used" resource, and continue from there. 347 + 348 + if (!$usable_blueprints) { 349 + return null; 350 + } 351 + 352 + $usable_blueprints = $this->rankBlueprints($usable_blueprints, $lease); 353 + 354 + $new_resources = $this->newResources($lease, $usable_blueprints); 355 + if (!$new_resources) { 356 + // If we were unable to create any new resources, return and 357 + // try to lease a "used" resource. 358 + return null; 359 + } 360 + 361 + $new_resources = $this->removeUnacquirableResources( 362 + $new_resources, 363 + $lease); 364 + if (!$new_resources) { 365 + // If we make it here, we just built a resource but aren't allowed 366 + // to acquire it. We expect this to happen if the resource prevents 367 + // acquisition until it activates, which is common when a resource 368 + // needs to perform setup steps. 369 + 370 + // Explicitly yield and wait for activation, since we don't want to 371 + // lease a "used" resource. 372 + 373 + throw new PhabricatorWorkerYieldException(15); 374 + } 375 + 376 + $resource = $this->leaseAnyResource($lease, $new_resources); 377 + if ($resource) { 378 + return $resource; 379 + } 380 + 381 + // We may not be able to lease a resource even if we just built it: 382 + // another process may snatch it up before we can lease it. This should 383 + // be rare, but is not concerning. Just try to build another resource. 384 + 385 + // We likely could try to build the next resource immediately, but err on 386 + // the side of caution and yield for now, at least until this code is 387 + // better vetted. 388 + 389 + throw new PhabricatorWorkerYieldException(15); 390 + } 391 + 392 + private function partitionResources( 393 + DrydockLease $lease, 394 + array $resources) { 395 + 396 + assert_instances_of($resources, 'DrydockResource'); 397 + $viewer = $this->getViewer(); 398 + 399 + $lease_statuses = array( 400 + DrydockLeaseStatus::STATUS_PENDING, 401 + DrydockLeaseStatus::STATUS_ACQUIRED, 402 + DrydockLeaseStatus::STATUS_ACTIVE, 403 + ); 404 + 405 + // Partition resources into "free" resources (which we can try to lease 406 + // immediately) and "used" resources, which we can only to lease after we 407 + // fail to allocate a new resource. 408 + 409 + // "Free" resources are unleased and/or prefer reuse over allocation. 410 + // "Used" resources are leased and prefer allocation over reuse. 411 + 412 + $free_resources = array(); 413 + $used_resources = array(); 414 + 415 + foreach ($resources as $resource) { 416 + $blueprint = $resource->getBlueprint(); 417 + 418 + if (!$blueprint->shouldAllocateSupplementalResource($resource, $lease)) { 419 + $free_resources[] = $resource; 420 + continue; 259 421 } 260 422 261 - if (!$resources) { 262 - // If one or more blueprints claimed that they would be able to 263 - // allocate resources but none are actually able to allocate resources, 264 - // log the failure and yield so we try again soon. 423 + $leases = id(new DrydockLeaseQuery()) 424 + ->setViewer($viewer) 425 + ->withResourcePHIDs(array($resource->getPHID())) 426 + ->withStatuses($lease_statuses) 427 + ->setLimit(1) 428 + ->execute(); 429 + if (!$leases) { 430 + $free_resources[] = $resource; 431 + continue; 432 + } 433 + 434 + $used_resources[] = $resource; 435 + } 436 + 437 + return array($free_resources, $used_resources); 438 + } 439 + 440 + private function newResources( 441 + DrydockLease $lease, 442 + array $blueprints) { 443 + assert_instances_of($blueprints, 'DrydockBlueprint'); 444 + 445 + $resources = array(); 446 + $exceptions = array(); 447 + foreach ($blueprints as $blueprint) { 448 + $caught = null; 449 + try { 450 + $resources[] = $this->allocateResource($blueprint, $lease); 265 451 266 - // This can happen if some unexpected issue occurs during allocation 267 - // (for example, a call to build a VM fails for some reason) or if we 268 - // raced another allocator and the blueprint is now full. 452 + // Bail after allocating one resource, we don't need any more than 453 + // this. 454 + break; 455 + } catch (Exception $ex) { 456 + $caught = $ex; 457 + } catch (Throwable $ex) { 458 + $caught = $ex; 459 + } 269 460 270 - $ex = new PhutilAggregateException( 271 - pht( 272 - 'All blueprints failed to allocate a suitable new resource when '. 273 - 'trying to allocate lease ("%s").', 274 - $lease->getPHID()), 275 - $exceptions); 461 + if ($caught) { 462 + // This failure is not normally expected, so log it. It can be 463 + // caused by something mundane and recoverable, however (see below 464 + // for discussion). 465 + 466 + // We log to the blueprint separately from the log to the lease: 467 + // the lease is not attached to a blueprint yet so the lease log 468 + // will not show up on the blueprint; more than one blueprint may 469 + // fail; and the lease is not really impacted (and won't log) if at 470 + // least one blueprint actually works. 276 471 277 - $lease->logEvent( 278 - DrydockLeaseAllocationFailureLogType::LOGCONST, 472 + $blueprint->logEvent( 473 + DrydockResourceAllocationFailureLogType::LOGCONST, 279 474 array( 280 - 'class' => get_class($ex), 281 - 'message' => $ex->getMessage(), 475 + 'class' => get_class($caught), 476 + 'message' => $caught->getMessage(), 282 477 )); 283 478 284 - throw new PhabricatorWorkerYieldException(15); 479 + $exceptions[] = $caught; 285 480 } 481 + } 286 482 287 - $resources = $this->removeUnacquirableResources($resources, $lease); 288 - if (!$resources) { 289 - // If we make it here, we just built a resource but aren't allowed 290 - // to acquire it. We expect this during routine operation if the 291 - // resource prevents acquisition until it activates. Yield and wait 292 - // for activation. 293 - throw new PhabricatorWorkerYieldException(15); 294 - } 483 + if (!$resources) { 484 + // If one or more blueprints claimed that they would be able to allocate 485 + // resources but none are actually able to allocate resources, log the 486 + // failure and yield so we try again soon. 295 487 296 - // NOTE: We have not acquired the lease yet, so it is possible that the 297 - // resource we just built will be snatched up by some other lease before 298 - // we can acquire it. This is not problematic: we'll retry a little later 299 - // and should succeed eventually. 488 + // This can happen if some unexpected issue occurs during allocation 489 + // (for example, a call to build a VM fails for some reason) or if we 490 + // raced another allocator and the blueprint is now full. 491 + 492 + $ex = new PhutilAggregateException( 493 + pht( 494 + 'All blueprints failed to allocate a suitable new resource when '. 495 + 'trying to allocate lease ("%s").', 496 + $lease->getPHID()), 497 + $exceptions); 498 + 499 + $lease->logEvent( 500 + DrydockLeaseAllocationFailureLogType::LOGCONST, 501 + array( 502 + 'class' => get_class($ex), 503 + 'message' => $ex->getMessage(), 504 + )); 505 + 506 + return null; 507 + } 508 + 509 + return $resources; 510 + } 511 + 512 + 513 + private function leaseAnyResource( 514 + DrydockLease $lease, 515 + array $resources) { 516 + assert_instances_of($resources, 'DrydockResource'); 517 + 518 + if (!$resources) { 519 + return null; 300 520 } 301 521 302 522 $resources = $this->rankResources($resources, $lease); 303 523 304 524 $exceptions = array(); 305 525 $yields = array(); 306 - $allocated = false; 526 + 527 + $allocated = null; 307 528 foreach ($resources as $resource) { 308 529 try { 309 - $resource = $this->newResourceForAcquisition($resource, $lease); 310 530 $this->acquireLease($resource, $lease); 311 - $allocated = true; 531 + $allocated = $resource; 312 532 break; 313 533 } catch (DrydockResourceLockException $ex) { 314 534 // We need to lock the resource to actually acquire it. If we aren't 315 535 // able to acquire the lock quickly enough, we can yield and try again 316 536 // later. 317 537 $yields[] = $ex; 538 + } catch (DrydockSlotLockException $ex) { 539 + // This also just indicates we ran into some kind of contention, 540 + // probably from another lease. Just yield. 541 + $yields[] = $ex; 318 542 } catch (DrydockAcquiredBrokenResourceException $ex) { 319 543 // If a resource was reclaimed or destroyed by the time we actually 320 - // got around to acquiring it, we just got unlucky. We can yield and 321 - // try again later. 544 + // got around to acquiring it, we just got unlucky. 322 545 $yields[] = $ex; 323 546 } catch (PhabricatorWorkerYieldException $ex) { 324 547 // We can be told to yield, particularly by the supplemental allocator ··· 329 552 } 330 553 } 331 554 332 - if (!$allocated) { 333 - if ($yields) { 334 - throw new PhabricatorWorkerYieldException(15); 335 - } else { 336 - throw new PhutilAggregateException( 337 - pht( 338 - 'Unable to acquire lease "%s" on any resource.', 339 - $lease->getPHID()), 340 - $exceptions); 341 - } 555 + if ($allocated) { 556 + return $allocated; 557 + } 558 + 559 + if ($yields) { 560 + throw new PhabricatorWorkerYieldException(15); 342 561 } 562 + 563 + throw new PhutilAggregateException( 564 + pht( 565 + 'Unable to acquire lease "%s" on any resource.', 566 + $lease->getPHID()), 567 + $exceptions); 343 568 } 344 569 345 570 ··· 426 651 * the lease. 427 652 * @task allocator 428 653 */ 429 - private function loadResourcesForAllocatingLease( 654 + private function loadAcquirableResourcesForLease( 430 655 array $blueprints, 431 656 DrydockLease $lease) { 432 657 assert_instances_of($blueprints, 'DrydockBlueprint'); ··· 438 663 ->withTypes(array($lease->getResourceType())) 439 664 ->withStatuses( 440 665 array( 441 - DrydockResourceStatus::STATUS_PENDING, 442 666 DrydockResourceStatus::STATUS_ACTIVE, 443 667 )) 444 668 ->execute(); ··· 558 782 // If this resource was allocated as a pending resource, queue a task to 559 783 // activate it. 560 784 if ($resource->getStatus() == DrydockResourceStatus::STATUS_PENDING) { 785 + 786 + $lease->addAllocatedResourcePHIDs( 787 + array( 788 + $resource->getPHID(), 789 + )); 790 + $lease->save(); 791 + 561 792 PhabricatorWorker::scheduleTask( 562 793 'DrydockResourceUpdateWorker', 563 794 array( ··· 636 867 DrydockLease $lease) { 637 868 $viewer = $this->getViewer(); 638 869 639 - // If this lease is marked as already in the process of reclaiming a 640 - // resource, don't let it reclaim another one until the first reclaim 641 - // completes. This stops one lease from reclaiming a large number of 642 - // resources if the reclaims take a while to complete. 643 - $reclaiming_phid = $lease->getAttribute('drydock.reclaimingPHID'); 644 - if ($reclaiming_phid) { 645 - $reclaiming_resource = id(new DrydockResourceQuery()) 646 - ->setViewer($viewer) 647 - ->withPHIDs(array($reclaiming_phid)) 648 - ->withStatuses( 649 - array( 650 - DrydockResourceStatus::STATUS_ACTIVE, 651 - DrydockResourceStatus::STATUS_RELEASED, 652 - )) 653 - ->executeOne(); 654 - if ($reclaiming_resource) { 655 - return null; 656 - } 657 - } 658 - 659 870 $resources = id(new DrydockResourceQuery()) 660 871 ->setViewer($viewer) 661 872 ->withBlueprintPHIDs(array($blueprint->getPHID())) ··· 752 963 $blueprint->getClassName(), 753 964 'acquireLease()')); 754 965 } 755 - } 756 - 757 - private function newResourceForAcquisition( 758 - DrydockResource $resource, 759 - DrydockLease $lease) { 760 - 761 - // If the resource has no leases against it, never build a new one. This is 762 - // likely already a new resource that just activated. 763 - $viewer = $this->getViewer(); 764 - 765 - $statuses = array( 766 - DrydockLeaseStatus::STATUS_PENDING, 767 - DrydockLeaseStatus::STATUS_ACQUIRED, 768 - DrydockLeaseStatus::STATUS_ACTIVE, 769 - ); 770 - 771 - $leases = id(new DrydockLeaseQuery()) 772 - ->setViewer($viewer) 773 - ->withResourcePHIDs(array($resource->getPHID())) 774 - ->withStatuses($statuses) 775 - ->setLimit(1) 776 - ->execute(); 777 - if (!$leases) { 778 - return $resource; 779 - } 780 - 781 - // If we're about to get a lease on a resource, check if the blueprint 782 - // wants to allocate a supplemental resource. If it does, try to perform a 783 - // new allocation instead. 784 - $blueprint = $resource->getBlueprint(); 785 - if (!$blueprint->shouldAllocateSupplementalResource($resource, $lease)) { 786 - return $resource; 787 - } 788 - 789 - // If the blueprint is already overallocated, we can't allocate a new 790 - // resource. Just return the existing resource. 791 - $remaining = $this->removeOverallocatedBlueprints( 792 - array($blueprint), 793 - $lease); 794 - if (!$remaining) { 795 - return $resource; 796 - } 797 - 798 - // Try to build a new resource. 799 - try { 800 - $new_resource = $this->allocateResource($blueprint, $lease); 801 - } catch (Exception $ex) { 802 - $blueprint->logEvent( 803 - DrydockResourceAllocationFailureLogType::LOGCONST, 804 - array( 805 - 'class' => get_class($ex), 806 - 'message' => $ex->getMessage(), 807 - )); 808 - 809 - return $resource; 810 - } 811 - 812 - // If we can't actually acquire the new resource yet, just yield. 813 - // (We could try to move forward with the original resource instead.) 814 - $acquirable = $this->removeUnacquirableResources( 815 - array($new_resource), 816 - $lease); 817 - if (!$acquirable) { 818 - throw new PhabricatorWorkerYieldException(15); 819 - } 820 - 821 - return $new_resource; 822 966 } 823 967 824 968
+1 -1
src/applications/drydock/worker/DrydockWorker.php
··· 260 260 // Mark the lease as reclaiming this resource. It won't be allowed to start 261 261 // another reclaim as long as this resource is still in the process of 262 262 // being reclaimed. 263 - $lease->setAttribute('drydock.reclaimingPHID', $resource->getPHID()); 263 + $lease->addReclaimedResourcePHIDs(array($resource->getPHID())); 264 264 265 265 // When the resource releases, we we want to reawaken this task since it 266 266 // should (usually) be able to start building a new resource right away.