@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge
phorge phabricator
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Fix unbounded expansion of allocating resource pool

Summary:
Ref T9252. I think there's a more complex version of this problem discussed elsewhere, but here's what we hit today:

- 5 commits land at the same time and trigger 5 builds.
- All of them go to acquire a working copy.
- Working copies have a limit of 1 right now, so 1 of them gets the lease on it.
- The other 4 all trigger allocation of //new// working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.
- The 4 pending working copies will never activate without manual intervention, so these 4 builds are stuck forever.

To fix this, prevent WorkingCopies from giving out leases until they activate. So now the leases won't acquire until we know the working copy is good, which solves the first problem.

However, this creates a secondary problem:

- As above, all 5 go to acquire a working copy.
- One gets it.
- The other 4 trigger allocations, but no longer acquire leases. This is an improvement.
- Every time the leases update, they trigger another allocation, but never acquire. They trigger, say, a few thousand allocations.
- Eventually the first build finishes up and the second lease acquires the working copy. After some time, all of the builds finish.
- However, they generated an unboundedly large number of pending working copy resources during this time.

This is technically "okay-ish", in that it did work correctly, it just generated a gigantic mess as a side effect.

To solve this, at least for now, provide a mechanism to impose allocation rate limits and put a cap on the number of allocating resources of a given type. As hard-coded, this the greater of "1" or "25% of the active resources in the pool".

So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out. This prevents us from getting runaway queues of limitless size.

This also imposes a total active working copy resource limit of 1, which incidentally also fixes the problem, although I expect to raise this soon.

These mechanisms will need refinement, but the basic idea is:

- Resources which aren't sure if they can actually activate should wait until they do activate before allowing leases to acquire them. I'm fairly confident this rule is a reasonable one.
- Then we limit how many bookkeeping side effects Drydock can generate once it starts encountering limits.

Broadly, some amount of mess is inevitable because Drydock is allowed to try things that might not work. In an extreme case we could prevent this mess by setting all these limits at "1" forever, which would degrade Drydock to effectively be a synchronous, blocking queue.

The idea here is to put some amount of slack in the system (more than zero, but less than infinity) so we get the performance benefits of having a parallel, asyncronous system without a finite, manageable amount of mess.

Numbers larger than 0 but less than infinity are pretty tricky, but I think rules like "X% of active resources" seem fairly reasonable, at least for resources like working copies.

Test Plan:
Ran something like this:

```
for i in `seq 1 5`; do sh -c '(./bin/harbormaster build --plan 10 rX... &) &'; done;
```

Saw 5 plans launch, acquire leases, proceed in an orderly fashion, and eventually finish successfully.

Reviewers: hach-que, chad

Reviewed By: chad

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14236

+106 -4
+67
src/applications/drydock/blueprint/DrydockBlueprintImplementation.php
··· 19 19 return array(); 20 20 } 21 21 22 + public function getViewer() { 23 + return PhabricatorUser::getOmnipotentUser(); 24 + } 25 + 22 26 23 27 /* -( Lease Acquisition )-------------------------------------------------- */ 24 28 ··· 308 312 $lease_status, 309 313 DrydockLeaseStatus::STATUS_ACTIVE)); 310 314 } 315 + } 316 + 317 + 318 + /** 319 + * Apply standard limits on resource allocation rate. 320 + * 321 + * @param DrydockBlueprint The blueprint requesting an allocation. 322 + * @return bool True if further allocations should be limited. 323 + */ 324 + protected function shouldLimitAllocatingPoolSize( 325 + DrydockBlueprint $blueprint) { 326 + 327 + // TODO: If this mechanism sticks around, these values should be 328 + // configurable by the blueprint implementation. 329 + 330 + // Limit on total number of active resources. 331 + $total_limit = 1; 332 + 333 + // Always allow at least this many allocations to be in flight at once. 334 + $min_allowed = 1; 335 + 336 + // Allow this fraction of allocating resources as a fraction of active 337 + // resources. 338 + $growth_factor = 0.25; 339 + 340 + $resource = new DrydockResource(); 341 + $conn_r = $resource->establishConnection('r'); 342 + 343 + $counts = queryfx_all( 344 + $conn_r, 345 + 'SELECT status, COUNT(*) N FROM %T WHERE blueprintPHID = %s', 346 + $resource->getTableName(), 347 + $blueprint->getPHID()); 348 + $counts = ipull($counts, 'N', 'status'); 349 + 350 + $n_alloc = idx($counts, DrydockResourceStatus::STATUS_PENDING, 0); 351 + $n_active = idx($counts, DrydockResourceStatus::STATUS_ACTIVE, 0); 352 + $n_broken = idx($counts, DrydockResourceStatus::STATUS_BROKEN, 0); 353 + $n_released = idx($counts, DrydockResourceStatus::STATUS_RELEASED, 0); 354 + 355 + // If we're at the limit on total active resources, limit additional 356 + // allocations. 357 + $n_total = ($n_alloc + $n_active + $n_broken + $n_released); 358 + if ($n_total >= $total_limit) { 359 + return true; 360 + } 361 + 362 + // If the number of in-flight allocations is fewer than the minimum number 363 + // of allowed allocations, don't impose a limit. 364 + if ($n_alloc < $min_allowed) { 365 + return false; 366 + } 367 + 368 + $allowed_alloc = (int)ceil($n_active * $growth_factor); 369 + 370 + // If the number of in-flight allocation is fewer than the number of 371 + // allowed allocations according to the pool growth factor, don't impose 372 + // a limit. 373 + if ($n_alloc < $allowed_alloc) { 374 + return false; 375 + } 376 + 377 + return true; 311 378 } 312 379 313 380 }
+21 -2
src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php
··· 29 29 public function canAllocateResourceForLease( 30 30 DrydockBlueprint $blueprint, 31 31 DrydockLease $lease) { 32 + $viewer = $this->getViewer(); 33 + 34 + if ($this->shouldLimitAllocatingPoolSize($blueprint)) { 35 + return false; 36 + } 37 + 38 + // TODO: If we have a pending resource which is compatible with the 39 + // configuration for this lease, prevent a new allocation? Otherwise the 40 + // queue can fill up with copies of requests from the same lease. But 41 + // maybe we can deal with this with "pre-leasing"? 42 + 32 43 return true; 33 44 } 34 45 ··· 36 47 DrydockBlueprint $blueprint, 37 48 DrydockResource $resource, 38 49 DrydockLease $lease) { 50 + 51 + // Don't hand out leases on working copies which have not activated, since 52 + // it may take an arbitrarily long time for them to acquire a host. 53 + if (!$resource->isActive()) { 54 + return false; 55 + } 39 56 40 57 $need_map = $lease->getAttribute('repositories.map'); 41 58 if (!is_array($need_map)) { ··· 320 337 } 321 338 322 339 private function loadRepositories(array $phids) { 340 + $viewer = $this->getViewer(); 341 + 323 342 $repositories = id(new PhabricatorRepositoryQuery()) 324 - ->setViewer(PhabricatorUser::getOmnipotentUser()) 343 + ->setViewer($viewer) 325 344 ->withPHIDs($phids) 326 345 ->execute(); 327 346 $repositories = mpull($repositories, null, 'getPHID'); ··· 353 372 } 354 373 355 374 private function loadHostLease(DrydockResource $resource) { 356 - $viewer = PhabricatorUser::getOmnipotentUser(); 375 + $viewer = $this->getViewer(); 357 376 358 377 $lease_phid = $resource->getAttribute('host.leasePHID'); 359 378
+9
src/applications/drydock/storage/DrydockResource.php
··· 293 293 } 294 294 } 295 295 296 + public function isActive() { 297 + switch ($this->getStatus()) { 298 + case DrydockResourceStatus::STATUS_ACTIVE: 299 + return true; 300 + } 301 + 302 + return false; 303 + } 304 + 296 305 public function logEvent($type, array $data = array()) { 297 306 $log = id(new DrydockLog()) 298 307 ->setEpoch(PhabricatorTime::getNow())
+9 -2
src/applications/drydock/worker/DrydockLeaseUpdateWorker.php
··· 535 535 // If this lease has been acquired but not activated, queue a task to 536 536 // activate it. 537 537 if ($lease->getStatus() == DrydockLeaseStatus::STATUS_ACQUIRED) { 538 - PhabricatorWorker::scheduleTask( 538 + $this->queueTask( 539 539 __CLASS__, 540 540 array( 541 541 'leasePHID' => $lease->getPHID(), ··· 691 691 ->setStatus(DrydockLeaseStatus::STATUS_BROKEN) 692 692 ->save(); 693 693 694 - $lease->scheduleUpdate(); 694 + $this->queueTask( 695 + __CLASS__, 696 + array( 697 + 'leasePHID' => $lease->getPHID(), 698 + ), 699 + array( 700 + 'objectPHID' => $lease->getPHID(), 701 + )); 695 702 696 703 $lease->logEvent( 697 704 DrydockLeaseActivationFailureLogType::LOGCONST,