Never sever non-cluster database; write more read-only documentation

+5 -2

src/applications/config/controller/PhabricatorConfigClusterDatabasesController.php

···

   $rows = array();
   foreach ($databases as $database) {
+    $messages = array();
+
     if ($database->getIsMaster()) {
       $role_icon = id(new PHUIIconView())
         ->setIcon('fa-database sky')
···
     } else {
       $health_icon = id(new PHUIIconView())
         ->setIcon('fa-times red');
+      $messages[] = pht(
+        'UNHEALTHY: This database has failed recent health checks. Traffic '.
+        'will not be sent to it until it recovers.');
     }

     $health_count = pht(
···
       ' ',
       $health_count,
     );
-
-    $messages = array();

     $conn_message = $database->getConnectionMessage();
     if ($conn_message) {

+167 -22

src/docs/user/cluster/cluster_databases.diviner

···
 Phabricator can not currently be configured into a multi-master mode, nor can
 it be configured to automatically promote a replica to become the new master.

+If you lose the master, Phabricator can degrade automatically into read-only
+mode and remain available, but can not fully recover without operational
+intervention unless the master recovers on its own.
+

 Setting up MySQL Replication
 ============================
···
 `mysql.pass`) are used only to provide defaults.

 Once you've configured this option, restart Phabricator for the changes to take
-effect, then continue to "Monitoring and Testing" to verify the configuration.
+effect, then continue to "Monitoring Replicas" to verify the configuration.


-Monitoring and Testing
-======================
+Monitoring Replicas
+===================

 You can monitor replicas in {nav Config > Cluster Databases}. This interface
 shows you a quick overview of replicas and their health, and can detect some
 common issues with replication.

-TODO: Write more stuff here.
+The table on this page shows each database and current status.
+
+NOTE: This page runs its diagnostics //from the web server that is serving the
+request//. If you are recovering from a disaster, the view this page shows
+may be partial or misleading, and two requests served by different servers may
+see different views of the cluster.
+
+**Connection**: Phabricator tries to connect to each configured database, then
+shows the result in this column. If it fails, a brief diagnostic message with
+details about the error is shown. If it succeeds, the column shows a rough
+measurement of latency from the current webserver to the database.
+
+**Replication**: This is a summary of replication status on the database. If
+things are properly configured and stable, the replicas should be actively
+replicating and no more than a few seconds behind master, and the master
+should //not// be replicating from another database.
+
+To report this status, the user Phabricator is connecting as must have the
+`REPLICATION CLIENT` privilege (or the `SUPER` privilege) so it can run the
+`SHOW SLAVE STATUS` command. The `REPLICATION CLIENT` privilege only enables
+the user to run diagnostic commands so it should be reasonable to grant it in
+most cases, but it is not required. If you choose not to grant it, this page
+can not show any useful diagnostic information about replication status but
+everything else will still work.
+
+If a replica is more than a second behind master, this page will show the
+current replication delay. If the replication delay is more than 30 seconds,
+it will report "Slow Replication" with a warning icon.
+
+If replication is delayed, data is at risk: if you lose the master and can not
+later recover it (for example, because a meteor has obliterated the datacenter
+housing the physical host), data which did not make it to the replica will be
+lost forever.
+
+Beyond the risk of data loss, any read-only traffic sent to the replica will
+see an older view of the world which could be confusing for users: it may
+appear that their data has been lost, even if it is safe and just hasn't
+replicated yet.
+
+Phabricator will attempt to prevent clients from seeing out-of-date views, but
+sometimes sending traffic to a delayed replica is the best available option
+(for example, if the master can not be reached).
+
+**Health**: This column shows the result of recent health checks against the
+server. After several checks in a row fail, Phabricator will mark the server
+as unhealthy and stop sending traffic to it until several checks in a row
+later succeed.
+
+Note that each web server tracks database health independently, so if you have
+several servers they may have different views of database health. This is
+normal and not problematic.
+
+For more information on health checks, see "Unreachable Masters" below.
+
+**Messages**: This column has additional details about any errors shown in the
+other columns. These messages can help you understand or resolve problems.
+
+
+Testing Replicas
+================
+
+To test that your configuration can survive a disaster, turn off the master
+database. Do this with great ceremony, making a cool explosion sound as you
+run the `mysqld stop` command.
+
+If things have been set up properly, Phabricator should degrade to a temporary
+read-only mode immediately. After a brief period of unresponsiveness, it will
+degrade further into a longer-term read-only mode. For details on how this
+works internally, see "Unreachable Masters" below.
+
+Once satisfied, turn the master back on. After a brief delay, Phabricator
+should recognize that the master is healthy again and recover fully.
+
+Throughout this process, the {nav Cluster Databases} console will show a
+current view of the world from the perspective of the web server handling the
+request. You can use it to monitor state.
+
+You can perform a more narrow test by enabling `cluster.read-only` in
+configuration. This will put Phabricator into read-only mode immediately
+without turning off any databases.
+
+You can use this mode to understand which capabilities will and will not be
+available in read-only mode, and make sure any information you want to remain
+accessible in a disaster (like wiki pages or contact information) is really
+accessible.
+
+See the next section, "Degradation to Read-Only Mode", for more details about
+when, why, and how Phabricator degrades.
+
+If you run custom code or extensions, they may not accommodate read-only mode
+properly. You should specifically test that they function correctly in
+read-only mode and do not prevent you from accessing important information.
+

 Degradation to Read-Only Mode
 =============================
···

   - you turn it on explicitly;
   - you configure cluster mode, but don't set up any masters;
-  - the master is misconfigured and unsafe to write to; or
-  - the master is unreachable.
+  - the master can not be reached while handling a request; or
+  - recent attempts to connect to the master have consistently failed.

 When Phabricator is running in read-only mode, users can still read data and
 browse and clone repositories, but they can not edit, update, or push new
···
 be more convenient than turning it on explicitly during the course of
 operations work.

-Before writing to a master, Phabricator will verify that the host is not
-configured as a replica. This is a safety feature to prevent data loss if your
-MySQL and Phabricator configurations disagree about replica configuration. If
-your `master` is currently replicating from another host, Phabricator will
-treat it as a `replica` instead and implicitly degrade into read-only mode.
-
-Finally, if Phabricator is unable to reach the master, it will degrade into
-read-only mode. For details on how Phabricator determines that a master is
-unreachable, see "Unreachable Masters" below.
-
-If a master becomes unreachable, this normally corresponds to loss of the
-master host, a severed network link, or some other sort of disaster.
-Phabricator will degrade and continue operating in read-only mode until the
-master recovers or operations personnel can assess the situation and intervene.
+If Phabricator is unable to reach the master database, it will degrade into
+read-only mode automatically. See "Unreachable Masters" below for details on
+how this process works.

 If you end up in a situation where you have lost the master and can not get it
 back online (or can not restore it quickly) you can promote a replica to become
···
 Promoting a Replica
 ===================

-TODO: Write this, too.
+TODO: Write this section.


 Unreachable Masters
···
 This section describes how Phabricator determines that a master has been lost,
 marks it unreachable, and degrades into read-only mode.

-TODO: For now, it doesn't.
+Phabricator degrades into read-only mode automatically in two ways: very
+briefly in response to a single connection failure, or more permanently in
+response to a series of connection failures.
+
+In the first case, if a request needs to connect to the master but is not able
+to, Phabricator will temporarily degrade into read-only mode for the remainder
+of that request. The alternative is to fail abruptly, but Phabricator can
+sometimes degrade successfully and still respond to the user's request, so it
+makes an effort to finish serving the request from replicas.
+
+If the request was a write (like posting a comment) it will fail anyway, but
+if it was a read that did not actually need to use the master it may succeed.
+
+This temporary mode is intended to recover as gracefully as possible from brief
+interruptions in service (a few seconds), like a server being restarted, a
+network link becoming temporarily unavailable, or brief periods of load-related
+disruption. If the anomaly is temporary, Phabricator should recover immediately
+(on the next request once service is restored).
+
+This mode can be slow for users (they need to wait on connection attempts to
+the master which fail) and does not reduce load on the master (requests still
+attempt to connect to it).
+
+The second way Phabricator degrades is by running periodic health checks
+against databases, and marking them unhealthy if they fail over a longer period
+of time. This mechanism is very similar to the health checks that most HTTP
+load balancers perform against web servers.
+
+If a database fails several health checks in a row, Phabricator will mark it as
+unhealthy and stop sending all traffic (except for more health checks) to it.
+This improves performance during a service interruption and reduces load on the
+master, which may help it recover from load problems.
+
+You can monitor the status of health checks in the {nav Cluster Databases}
+console. The "Health" column shows how many checks have run recently and
+how many have succeeded.
+
+Health checks run every 3 seconds, and 5 checks in a row must fail or succeed
+before Phabricator marks the database as unhealthy or healthy, so it will
+generally take about 15 seconds for a database to change state after it goes
+down or comes up.
+
+If all of the recent checks fail, Phabricator will mark the database as
+unhealthy and stop sending traffic to it. If the master was the database that
+was marked as unhealthy, Phabricator will actively degrade into read-only mode
+until it recovers.
+
+This mode only attempts to connect to the unhealthy database once every few
+seconds to see if it is recovering, so performance will be better on average
+(users rarely need to wait for bad connections to fail or time out) and the
+database will receive less load.
+
+Once all of the recent checks succeed, Phabricator will mark the database as
+healthy again and continue sending traffic to it.
+
+Health checks are tracked individually for each web server, so some web servers
+may see a host as healthy while others see it as unhealthy. This is normal, and
+can accurately reflect the state of the world: for example, the link between
+datacenters may have been lost, so hosts in one datacenter can no longer see
+the master, while hosts in the other datacenter still have a healthy link to
+it.


 Backups
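
The health check rules described in the "Unreachable Masters" documentation (a check every 3 seconds, 5 consistent results in a row before the state flips) can be illustrated with a minimal sketch. This is not the code in `PhabricatorDatabaseHealthRecord`: the class name `SketchHealthRecord`, its methods, and the in-memory window are all hypothetical, and the real implementation may store and evaluate results differently.

```php
<?php

// Hypothetical sketch for illustration only; this is not Phabricator's
// PhabricatorDatabaseHealthRecord.
final class SketchHealthRecord {
  // Values taken from the diff: 5 required events, 3 seconds between checks.
  const REQUIRED_EVENT_COUNT = 5;
  const CHECK_FREQUENCY = 3;

  private $isHealthy = true;
  private $recent = array();

  public function recordCheck($did_succeed) {
    $this->recent[] = (bool)$did_succeed;

    // Keep only the most recent results.
    $this->recent = array_slice($this->recent, -self::REQUIRED_EVENT_COUNT);

    // Do not change state until the window is full.
    if (count($this->recent) < self::REQUIRED_EVENT_COUNT) {
      return $this;
    }

    $successes = count(array_filter($this->recent));
    if ($successes === 0) {
      // Every recent check failed: mark the database unhealthy.
      $this->isHealthy = false;
    } else if ($successes === self::REQUIRED_EVENT_COUNT) {
      // Every recent check succeeded: mark the database healthy again.
      $this->isHealthy = true;
    }

    // A mixed window leaves the current state unchanged.
    return $this;
  }

  public function getIsHealthy() {
    return $this->isHealthy;
  }

  // Approximate time for a database to change state after it goes down
  // or comes back up.
  public static function getSecondsToChangeState() {
    return self::REQUIRED_EVENT_COUNT * self::CHECK_FREQUENCY;
  }
}
```

Under these assumptions, a single successful check after an outage does not restore traffic; the window has to fill with successes first, which is where the "about 15 seconds" figure in the documentation comes from.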

+2

src/infrastructure/cluster/PhabricatorDatabaseHealthRecord.php

···
    * the state.
    */
   public function getRequiredEventCount() {
+    // NOTE: If you change this value, update the "Cluster: Databases" docs.
     return 5;
   }

···
    * Seconds to wait between health checks.
    */
   public function getHealthCheckFrequency() {
+    // NOTE: If you change this value, update the "Cluster: Databases" docs.
     return 3;
   }


+46 -10

src/infrastructure/cluster/PhabricatorDatabaseRef.php

···
   const REPLICATION_SLOW = 'replica-slow';

   const KEY_REFS = 'cluster.db.refs';
+  const KEY_INDIVIDUAL = 'cluster.db.individual';

   private $host;
   private $port;
···
   private $pass;
   private $disabled;
   private $isMaster;
+  private $isIndividual;

   private $connectionLatency;
   private $connectionStatus;
···
     return $this->replicaDelay;
   }

+  public function setIsIndividual($is_individual) {
+    $this->isIndividual = $is_individual;
+    return $this;
+  }
+
+  public function getIsIndividual() {
+    return $this->isIndividual;
+  }
+
   public static function getConnectionStatusMap() {
     return array(
       self::STATUS_OKAY => array(
···
     return $refs;
   }

+  public static function getLiveIndividualRef() {
+    $cache = PhabricatorCaches::getRequestCache();
+
+    $ref = $cache->getKey(self::KEY_INDIVIDUAL);
+    if (!$ref) {
+      $ref = self::newIndividualRef();
+      $cache->setKey(self::KEY_INDIVIDUAL, $ref);
+    }
+
+    return $ref;
+  }
+
   public static function newRefs() {
     $refs = array();

···
   }

   public function isSevered() {
+    // If we only have an individual database, never sever our connection to
+    // it, at least for now. It's possible that using the same severing rules
+    // might eventually make sense to help alleviate load-related failures,
+    // but we should wait for all the cluster stuff to stabilize first.
+    if ($this->getIsIndividual()) {
+      return false;
+    }
+
     if ($this->didFailToConnect) {
       return true;
     }
···
     $refs = self::getLiveRefs();

     if (!$refs) {
-      $conf = PhabricatorEnv::newObjectFromConfig(
-        'mysql.configuration-provider',
-        array(null, 'w', null));
-
-      return id(new self())
-        ->setHost($conf->getHost())
-        ->setPort($conf->getPort())
-        ->setUser($conf->getUser())
-        ->setPass($conf->getPassword())
-        ->setIsMaster(true);
+      return self::getLiveIndividualRef();
     }

     $master = null;
···
     }

     return null;
+  }
+
+  public static function newIndividualRef() {
+    $conf = PhabricatorEnv::newObjectFromConfig(
+      'mysql.configuration-provider',
+      array(null, 'w', null));
+
+    return id(new self())
+      ->setHost($conf->getHost())
+      ->setPort($conf->getPort())
+      ->setUser($conf->getUser())
+      ->setPass($conf->getPassword())
+      ->setIsIndividual(true)
+      ->setIsMaster(true);
   }

   public static function getReplicaDatabaseRef() {
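
The behavioral change in `isSevered()` can be seen in isolation with a small stand-in class. This is only a sketch of the two branches visible in the hunk above: `SketchDatabaseRef` and `setDidFailToConnect()` are hypothetical names, and the real method goes on to perform further checks that the diff elides and that are not reproduced here.

```php
<?php

// Hypothetical stand-in for illustration; not the real PhabricatorDatabaseRef.
final class SketchDatabaseRef {
  private $isIndividual = false;
  private $didFailToConnect = false;

  public function setIsIndividual($is_individual) {
    $this->isIndividual = $is_individual;
    return $this;
  }

  // Hypothetical setter; the diff only shows this property being read.
  public function setDidFailToConnect($did_fail) {
    $this->didFailToConnect = $did_fail;
    return $this;
  }

  public function isSevered() {
    // A lone, non-cluster database is never severed.
    if ($this->isIndividual) {
      return false;
    }

    // A cluster database that failed to connect is severed.
    if ($this->didFailToConnect) {
      return true;
    }

    // The real method continues with checks the diff elides; this sketch
    // simply stops here.
    return false;
  }
}

$individual = (new SketchDatabaseRef())
  ->setIsIndividual(true)
  ->setDidFailToConnect(true);

$clustered = (new SketchDatabaseRef())
  ->setDidFailToConnect(true);

var_dump($individual->isSevered()); // bool(false)
var_dump($clustered->isSevered());  // bool(true)
```

With the same connection failure recorded on both refs, only the clustered one reports itself as severed, which matches the commit title: a non-cluster install keeps trying its only database rather than refusing to use it.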
