@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge

Make cluster repositories more resistant to freezing

Summary:
Ref T10860. This allows us to recover if the connection to the database is lost during a push.

Previously, if we lost the connection to the master database during a push, we froze the repository. This is very safe, but not very operator-friendly, since an operator then has to unfreeze it manually.

We don't need to be quite this aggressive about freezing things. The repository state is still consistent after we've "upgraded" the lock by setting `isWriting = 1`, so we're actually fine even if we lost the global lock.

Instead of freezing the repository immediately, wait in a loop for up to five minutes for the master to come back up. If it recovers, we can release the lock and everything will be OK again.
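
The wait-and-retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: `release_with_retry()` and its injected `$now` and `$sleep` callables are hypothetical names introduced so the loop can be exercised without a database, and `RuntimeException` stands in for the connection exceptions the real code catches.

```php
<?php
// Hypothetical helper, not part of Phorge: call $release until it
// succeeds or $duration seconds have passed on the injected clock.
function release_with_retry(
  callable $release,
  int $duration,
  callable $now,
  callable $sleep) {

  $try_until = $now() + $duration;
  $attempts = 0;

  while ($now() <= $try_until) {
    $attempts++;
    try {
      $release();
      return $attempts;
    } catch (RuntimeException $ex) {
      // Assumed to be a transient connection failure; try again.
    }
    $sleep(1);
  }

  throw new Exception(
    "Failed to release lock after {$duration} second(s).");
}

// Simulated clock, plus a release operation that fails twice.
$clock = 0;
$failures_left = 2;

$attempts = release_with_retry(
  function () use (&$failures_left) {
    if ($failures_left > 0) {
      $failures_left--;
      throw new RuntimeException('connection lost');
    }
  },
  300,
  function () use (&$clock) { return $clock; },
  function ($seconds) use (&$clock) { $clock += $seconds; });

echo "Released after {$attempts} attempt(s).\n";
```

Passing the clock and the sleep function in as parameters is only a convenience for this sketch; it lets the deadline logic run instantly in a test.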

Basically, the changes are:

- If we can't release the lock at first, sit in a loop trying really hard to release it for a while.
- Add a unique lock identifier so we can be certain we're only releasing //our// lock no matter what else is going on.
- Do the version reads on the same connection holding the lock, so we can be sure we haven't lost the lock before we do that read.
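
The second point, releasing only our own lock, can be sketched with an in-memory stand-in for the version row. This is an illustration under stated assumptions: `VersionRow`, `new_lock_owner()`, `will_write()`, and `did_write()` are hypothetical names, the real change does this with SQL against the working copy version table, and `bin2hex(random_bytes(6))` is swapped in here for the patch's `Filesystem::readRandomCharacters(12)`.

```php
<?php
// Hypothetical in-memory stand-in for one row of the working copy
// version table; the real code reads and writes this row in MySQL.
final class VersionRow {
  public $repositoryVersion = 0;
  public $isWriting = false;
  public $lockOwner = null;
}

// Build an owner token in a "<pid>.<random>" shape. The random part is
// generated differently from the patch (see the note above).
function new_lock_owner() {
  return getmypid().'.'.bin2hex(random_bytes(6));
}

// Mark the row as being written and record who holds the durable lock.
function will_write(VersionRow $row, $owner) {
  $row->isWriting = true;
  $row->lockOwner = $owner;
}

// Release the durable lock only if the row is still in the state this
// writer left it in, including the owner token. Returns true on release.
function did_write(VersionRow $row, $old_version, $new_version, $owner) {
  $matches =
    $row->isWriting &&
    $row->repositoryVersion === $old_version &&
    $row->lockOwner === $owner;

  if (!$matches) {
    return false;
  }

  $row->repositoryVersion = $new_version;
  $row->isWriting = false;
  $row->lockOwner = null;
  return true;
}

$row = new VersionRow();
$mine = new_lock_owner();
will_write($row, $mine);

// A release attempt with the wrong token leaves the row untouched.
$stale = did_write($row, 0, 7, 'someone-else');
$ok = did_write($row, 0, 7, $mine);
```

Because the owner token is part of the match condition, a process that reconnects after an outage cannot clear a lock that some other writer has since taken.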

Test Plan:
- Added a `sleep(10)` after accepting the write but before releasing the lock so I could run `mysqld stop` and force this issue to occur.
- Pushed like this:

```
$ echo D >> record && git commit -am D && git push
[master 707ecc3] D
1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 254 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
BEGIN SLEEP
```

- Here, I stopped `mysqld` from the CLI in another terminal window.

```
END SLEEP
# CRITICAL. Failed to release cluster write lock!
# The connection to the master database was lost while receiving the write.
# This process will spend 300 more second(s) attempting to recover, then give up.
```

- Here, I started `mysqld` again.

```
# RECOVERED. Link to master database was restored.
# Released cluster write lock.
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
2cbf87c..707ecc3 master -> master
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10860

Differential Revision: https://secure.phabricator.com/D15792

+183 -40
+2
resources/sql/autopatches/20160424.locks.1.sql
···
+ALTER TABLE {$NAMESPACE}_repository.repository_workingcopyversion
+  ADD lockOwner VARCHAR(255) COLLATE {$COLLATE_TEXT};
+108 -21
src/applications/diffusion/protocol/DiffusionRepositoryClusterEngine.php
···

   private $repository;
   private $viewer;
+  private $logger;
+
   private $clusterWriteLock;
   private $clusterWriteVersion;
-  private $logger;
+  private $clusterWriteOwner;


 /* -( Configuring Synchronization )---------------------------------------- */
···
     $device = AlmanacKeys::getLiveDevice();
     $device_phid = $device->getPHID();

+    $table = new PhabricatorRepositoryWorkingCopyVersion();
+    $locked_connection = $table->establishConnection('w');
+
     $write_lock = PhabricatorRepositoryWorkingCopyVersion::getWriteLock(
       $repository_phid);
+
+    $write_lock->useSpecificConnection($locked_connection);

     $lock_wait = phutil_units('2 minutes in seconds');

···
       throw new Exception(
         pht(
           'An previous write to this repository was interrupted; refusing '.
-          'new writes. This issue resolves operator intervention to resolve, '.
+          'new writes. This issue requires operator intervention to resolve, '.
           'see "Write Interruptions" in the "Cluster: Repositories" in the '.
           'documentation for instructions.'));
     }
···
       throw $ex;
     }

+    $pid = getmypid();
+    $hash = Filesystem::readRandomCharacters(12);
+    $this->clusterWriteOwner = "{$pid}.{$hash}";
+
     PhabricatorRepositoryWorkingCopyVersion::willWrite(
+      $locked_connection,
       $repository_phid,
       $device_phid,
       array(
         'userPHID' => $viewer->getPHID(),
         'epoch' => PhabricatorTime::getNow(),
         'devicePHID' => $device_phid,
-      ));
+      ),
+      $this->clusterWriteOwner);

     $this->clusterWriteVersion = $max_version;
     $this->clusterWriteLock = $write_lock;
···
     $device = AlmanacKeys::getLiveDevice();
     $device_phid = $device->getPHID();

-    // NOTE: This means we're still bumping the version when pushes fail. We
-    // could select only un-rejected events instead to bump a little less
-    // often.
+    // It is possible that we've lost the global lock while receiving the push.
+    // For example, the master database may have been restarted between the
+    // time we acquired the global lock and now, when the push has finished.
+
+    // We wrote a durable lock while we were holding the global lock,
+    // essentially upgrading our lock. We can still safely release this upgraded
+    // lock even if we're no longer holding the global lock.
+
+    // If we fail to release the lock, the repository will be frozen until
+    // an operator can figure out what happened, so we try pretty hard to
+    // reconnect to the database and release the lock.
+
+    $now = PhabricatorTime::getNow();
+    $duration = phutil_units('5 minutes in seconds');
+    $try_until = $now + $duration;
+
+    $did_release = false;
+    $already_failed = false;
+    while (PhabricatorTime::getNow() <= $try_until) {
+      try {
+        // NOTE: This means we're still bumping the version when pushes fail. We
+        // could select only un-rejected events instead to bump a little less
+        // often.
+
+        $new_log = id(new PhabricatorRepositoryPushEventQuery())
+          ->setViewer(PhabricatorUser::getOmnipotentUser())
+          ->withRepositoryPHIDs(array($repository_phid))
+          ->setLimit(1)
+          ->executeOne();
+
+        $old_version = $this->clusterWriteVersion;
+        if ($new_log) {
+          $new_version = $new_log->getID();
+        } else {
+          $new_version = $old_version;
+        }
+
+        PhabricatorRepositoryWorkingCopyVersion::didWrite(
+          $repository_phid,
+          $device_phid,
+          $this->clusterWriteVersion,
+          $new_version,
+          $this->clusterWriteOwner);
+        $did_release = true;
+        break;
+      } catch (AphrontConnectionQueryException $ex) {
+        $connection_exception = $ex;
+      } catch (AphrontConnectionLostQueryException $ex) {
+        $connection_exception = $ex;
+      }

-    $new_log = id(new PhabricatorRepositoryPushEventQuery())
-      ->setViewer(PhabricatorUser::getOmnipotentUser())
-      ->withRepositoryPHIDs(array($repository_phid))
-      ->setLimit(1)
-      ->executeOne();
+      if (!$already_failed) {
+        $already_failed = true;
+        $this->logLine(
+          pht('CRITICAL. Failed to release cluster write lock!'));
+
+        $this->logLine(
+          pht(
+            'The connection to the master database was lost while receiving '.
+            'the write.'));
+
+        $this->logLine(
+          pht(
+            'This process will spend %s more second(s) attempting to '.
+            'recover, then give up.',
+            new PhutilNumber($duration)));
+      }

-    $old_version = $this->clusterWriteVersion;
-    if ($new_log) {
-      $new_version = $new_log->getID();
+      sleep(1);
+    }
+
+    if ($did_release) {
+      if ($already_failed) {
+        $this->logLine(
+          pht('RECOVERED. Link to master database was restored.'));
+      }
+      $this->logLine(pht('Released cluster write lock.'));
     } else {
-      $new_version = $old_version;
+      throw new Exception(
+        pht(
+          'Failed to reconnect to master database and release held write '.
+          'lock ("%s") on device "%s" for repository "%s" after trying '.
+          'for %s second(s). This repository will be frozen.',
+          $this->clusterWriteOwner,
+          $device->getName(),
+          $this->getDisplayName(),
+          new PhutilNumber($duration)));
     }

-    PhabricatorRepositoryWorkingCopyVersion::didWrite(
-      $repository_phid,
-      $device_phid,
-      $this->clusterWriteVersion,
-      $new_log->getID());
+    // We can continue even if we've lost this lock, everything is still
+    // consistent.
+    try {
+      $this->clusterWriteLock->unlock();
+    } catch (Exception $ex) {
+      // Ignore.
+    }

-    $this->clusterWriteLock->unlock();
     $this->clusterWriteLock = null;
+    $this->clusterWriteOwner = null;
   }
+1
src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php
···

   public function writeClusterEngineLogMessage($message) {
     parent::writeError($message);
+    $this->getErrorChannel()->update();
   }

   protected function identifyRepository() {
+15
src/applications/diffusion/ssh/DiffusionSSHWorkflow.php
···
     return $this;
   }

+  protected function getCurrentDeviceName() {
+    $device = AlmanacKeys::getLiveDevice();
+    if ($device) {
+      return $device->getName();
+    }
+
+    return php_uname('n');
+  }
+
+  protected function getTargetDeviceName() {
+    // TODO: This should use the correct device identity.
+    $uri = new PhutilURI($this->proxyURI);
+    return $uri->getDomain();
+  }
+
   protected function shouldProxy() {
     return (bool)$this->proxyURI;
   }
+22 -11
src/applications/repository/storage/PhabricatorRepositoryWorkingCopyVersion.php
···
   protected $devicePHID;
   protected $repositoryVersion;
   protected $isWriting;
+  protected $lockOwner;
   protected $writeProperties;

   protected function getConfiguration() {
···
         'repositoryVersion' => 'uint32',
         'isWriting' => 'bool',
         'writeProperties' => 'text?',
+        'lockOwner' => 'text255?',
       ),
       self::CONFIG_KEY_SCHEMA => array(
         'key_workingcopy' => array(
···
    * by default.
    */
   public static function willWrite(
+    AphrontDatabaseConnection $locked_connection,
     $repository_phid,
     $device_phid,
-    array $write_properties) {
+    array $write_properties,
+    $lock_owner) {
+
     $version = new self();
-    $conn_w = $version->establishConnection('w');
     $table = $version->getTableName();

     queryfx(
-      $conn_w,
+      $locked_connection,
       'INSERT INTO %T
         (repositoryPHID, devicePHID, repositoryVersion, isWriting,
-          writeProperties)
+          writeProperties, lockOwner)
         VALUES
-          (%s, %s, %d, %d, %s)
+          (%s, %s, %d, %d, %s, %s)
         ON DUPLICATE KEY UPDATE
           isWriting = VALUES(isWriting),
-          writeProperties = VALUES(writeProperties)',
+          writeProperties = VALUES(writeProperties),
+          lockOwner = VALUES(lockOwner)',
       $table,
       $repository_phid,
       $device_phid,
       0,
       1,
-      phutil_json_encode($write_properties));
+      phutil_json_encode($write_properties),
+      $lock_owner);
   }


···
     $repository_phid,
     $device_phid,
     $old_version,
-    $new_version) {
+    $new_version,
+    $lock_owner) {
+
     $version = new self();
     $conn_w = $version->establishConnection('w');
     $table = $version->getTableName();
···
       $conn_w,
       'UPDATE %T SET
         repositoryVersion = %d,
-        isWriting = 0
+        isWriting = 0,
+        lockOwner = NULL
       WHERE
         repositoryPHID = %s AND
         devicePHID = %s AND
         repositoryVersion = %d AND
-        isWriting = 1',
+        isWriting = 1 AND
+        lockOwner = %s',
       $table,
       $new_version,
       $repository_phid,
       $device_phid,
-      $old_version);
+      $old_version,
+      $lock_owner);
   }
+35 -8
src/infrastructure/util/PhabricatorGlobalLock.php
···
 final class PhabricatorGlobalLock extends PhutilLock {

   private $conn;
+  private $isExternalConnection = false;

   private static $pool = array();

···
    */
   public function useSpecificConnection(AphrontDatabaseConnection $conn) {
     $this->conn = $conn;
+    $this->isExternalConnection = true;
     return $this;
   }

···
     $max_allowed_timeout = 2147483;
     queryfx($conn, 'SET wait_timeout = %d', $max_allowed_timeout);

+    $lock_name = $this->getName();
+
     $result = queryfx_one(
       $conn,
       'SELECT GET_LOCK(%s, %f)',
-      $this->getName(),
+      $lock_name,
       $wait);

     $ok = head($result);
     if (!$ok) {
-      throw new PhutilLockException($this->getName());
+      throw new PhutilLockException($lock_name);
     }
+
+    $conn->rememberLock($lock_name);

     $this->conn = $conn;
   }

   protected function doUnlock() {
-    queryfx(
-      $this->conn,
-      'SELECT RELEASE_LOCK(%s)',
-      $this->getName());
+    $lock_name = $this->getName();
+
+    $conn = $this->conn;
+
+    try {
+      $result = queryfx_one(
+        $conn,
+        'SELECT RELEASE_LOCK(%s)',
+        $lock_name);
+      $conn->forgetLock($lock_name);
+    } catch (Exception $ex) {
+      $result = array(null);
+    }
+
+    $ok = head($result);
+    if (!$ok) {
+      // TODO: We could throw here, but then this lock doesn't get marked
+      // unlocked and we throw again later when exiting. It also doesn't
+      // particularly matter for any current applications. For now, just
+      // swallow the error.
+    }

-    $this->conn->close();
-    self::$pool[] = $this->conn;
     $this->conn = null;
+
+    if (!$this->isExternalConnection) {
+      $conn->close();
+      self::$pool[] = $conn;
+    }
+    $this->isExternalConnection = false;
   }

 }
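
A lock that may run on a caller-supplied connection has to decide at unlock time whether that connection is its own to close. The sketch below is a minimal illustration of that ownership rule, assuming hypothetical `FakeConnection` and `SketchLock` classes that are not part of Phorge. It reads the ownership flag before resetting it, so a connection passed in through `useSpecificConnection()` is left open for the caller.

```php
<?php
// Hypothetical stand-in for a database connection; not a Phorge class.
final class FakeConnection {
  public $closed = false;

  public function close() {
    $this->closed = true;
  }
}

// Hypothetical lock that tracks whether its connection was supplied by
// the caller, loosely modeled on the isExternalConnection flag.
final class SketchLock {
  private $conn;
  private $isExternalConnection = false;

  public function useSpecificConnection(FakeConnection $conn) {
    $this->conn = $conn;
    $this->isExternalConnection = true;
    return $this;
  }

  public function lock() {
    // Open our own connection only if the caller did not provide one.
    if (!$this->conn) {
      $this->conn = new FakeConnection();
    }
    return $this->conn;
  }

  public function unlock() {
    $conn = $this->conn;

    // Read the flag before clearing state, so the check below still
    // reflects who owns the connection.
    $is_external = $this->isExternalConnection;

    $this->conn = null;
    $this->isExternalConnection = false;

    if (!$is_external) {
      $conn->close();
    }
  }
}

// A caller-supplied connection should survive unlock().
$shared = new FakeConnection();
$external_lock = new SketchLock();
$external_lock->useSpecificConnection($shared);
$external_lock->lock();
$external_lock->unlock();

// A connection the lock opened itself should be closed by unlock().
$internal_lock = new SketchLock();
$own = $internal_lock->lock();
$internal_lock->unlock();

var_dump($shared->closed, $own->closed);
```

Copying the flag into a local variable is one way to keep the reset and the check independent of each other's order; performing the check before the reset would work equally well.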