Make repository synchronization safer when leaders are ambiguous

Summary:
Ref T4292. Right now, repository versions only get marked when a write happens.

This potentially creates a problem: if I pushed all the sync code to `secure` and enabled `secure002` as a repository host, the daemons would create empty copies of all the repositories on that host.

Usually, this would be fine. Most repositories have already received a write on `secure001`, so that working copy has a version and is a leader.

However, if a write arrived for a rarely-used repository (say, rKEYSTORE) that hadn't received any write recently, it might be routed to `secure002` at random. We'd then try to figure out whether `secure002` has the most up-to-date copy of the repository.

We wouldn't be able to: no write has ever been recorded for this repository, so we have no information about which node holds the data. The old code could guess wrong, decide that `secure002` is a leader, and accept the write. Since this would bump the version on `secure002`, that would //make// it an authoritative leader, and `secure001` would synchronize from it passively (or on the next read or write), which would potentially destroy data.
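
To make the failure mode concrete, here is a minimal sketch of the old decision, reduced to plain arrays. The function name `oldShouldFetch` and the use of host names as array keys are hypothetical and only for illustration; the defaults of `0` mirror the pre-change code shown in the diff.

```php
<?php

// Hypothetical reduction of the pre-change logic. $versions maps a device
// name to its recorded version; a device with no record defaults to 0.
function oldShouldFetch(array $versions, string $device): bool {
  $this_version = $versions[$device] ?? 0;
  $max_version = $versions ? max($versions) : 0;

  // Fetch only when some device is known to be strictly ahead of us.
  return $max_version > $this_version;
}

// No version records exist at all, so the comparison is 0 > 0. The empty
// copy on secure002 skips the fetch and is treated as up to date.
var_dump(oldShouldFetch(array(), 'secure002'));
```

With no records, both sides of the comparison default to the same value, so an empty working copy is indistinguishable from a leader.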

Instead:

- Refuse to continue in situations like this.
- When a repository is on exactly one device, mark it as a leader with version "0".
- When a repository is created into a cluster service, mark its version as "0" on all devices (they're all leaders, since the repository is empty).

This should mean that we won't lose data no matter how much weird stuff we run into.
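
The rules above can be sketched as a small decision function. This is a simplified illustration, not the code in this change: `chooseSyncAction`, its string return values, and the plain-array inputs are all hypothetical stand-ins for the version records and Almanac bindings used in the real method.

```php
<?php

// Hypothetical reduction of the new rules. $versions maps device name to
// recorded version (only devices with a record appear); $devices lists the
// devices actively bound to the cluster service.
function chooseSyncAction(
  array $versions,
  array $devices,
  string $device): string {

  if ($versions) {
    // Normal case: leaders are the devices holding the maximum version.
    // A device with no record counts as -1, so it is behind version 0.
    $this_version = $versions[$device] ?? -1;
    return (max($versions) > $this_version) ? 'fetch' : 'leader';
  }

  if (count($devices) > 1) {
    // Several copies and no version data: refuse to guess.
    throw new Exception('Ambiguous leader: no version information.');
  }

  if (!in_array($device, $devices, true)) {
    throw new Exception('Device is not bound to the service.');
  }

  // Exactly one device in service: it must be the leader, so the caller
  // would record version 0 for it.
  return 'mark-zero';
}
```

For example, `chooseSyncAction(array(), array('secure001'), 'secure001')` returns `'mark-zero'`, while passing two devices with no version records throws. A newly created repository with version `0` recorded on every device yields `'leader'` on each of them.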

Test Plan:
- In single-node mode, used `repository update` to verify that `0` was written properly.
- With multiple nodes, used `repository update` to verify that we refuse to continue.
- Created a new repository, verified versions were initialized correctly.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4292

Differential Revision: https://secure.phabricator.com/D15761

epriestley 10 years ago 287e761f 6edf181a

+106 -23

2 changed files


src/applications/repository/editor/PhabricatorRepositoryEditor.php

@@ -683,6 +683,10 @@
       $object->save();
     }
 
+    if ($this->getIsNewObject()) {
+      $object->synchronizeWorkingCopyAfterCreation();
+    }
+
     return $xactions;
   }
 

+102 -23

src/applications/repository/storage/PhabricatorRepository.php

@@ -1932,7 +1932,7 @@
       return null;
     }
 
-    $bindings = $service->getBindings();
+    $bindings = $service->getActiveBindings();
     if (!$bindings) {
       throw new Exception(
         pht(
@@ -1954,10 +1954,6 @@
 
     $uris = array();
     foreach ($bindings as $binding) {
-      if ($binding->getIsDisabled()) {
-        continue;
-      }
-
       $iface = $binding->getInterface();
 
       // If we're never proxying this and it's locally satisfiable, return
@@ -2228,6 +2224,37 @@
 
 
   /**
+   * Synchronize repository version information after creating a repository.
+   *
+   * This initializes working copy versions for all currently bound devices to
+   * 0, so that we don't get stuck making an ambiguous choice about which
+   * devices are leaders when we later synchronize before a read.
+   *
+   * @task sync
+   */
+  public function synchronizeWorkingCopyAfterCreation() {
+    if (!$this->shouldEnableSynchronization()) {
+      return;
+    }
+
+    $repository_phid = $this->getPHID();
+
+    $service = $this->loadAlmanacService();
+    if (!$service) {
+      throw new Exception(pht('Failed to load repository cluster service.'));
+    }
+
+    $bindings = $service->getActiveBindings();
+    foreach ($bindings as $binding) {
+      PhabricatorRepositoryWorkingCopyVersion::updateVersion(
+        $repository_phid,
+        $binding->getDevicePHID(),
+        0);
+    }
+  }
+
+
+  /**
    * @task sync
    */
   public function synchronizeWorkingCopyBeforeRead() {
@@ -2255,34 +2282,91 @@
     if ($this_version) {
       $this_version = (int)$this_version->getRepositoryVersion();
     } else {
-      $this_version = 0;
+      $this_version = -1;
     }
 
     if ($versions) {
+      // This is the normal case, where we have some version information and
+      // can identify which nodes are leaders. If the current node is not a
+      // leader, we want to fetch from a leader and then update our version.
+
       $max_version = (int)max(mpull($versions, 'getRepositoryVersion'));
+      if ($max_version > $this_version) {
+        $fetchable = array();
+        foreach ($versions as $version) {
+          if ($version->getRepositoryVersion() == $max_version) {
+            $fetchable[] = $version->getDevicePHID();
+          }
+        }
+
+        $this->synchronizeWorkingCopyFromDevices($fetchable);
+
+        PhabricatorRepositoryWorkingCopyVersion::updateVersion(
+          $repository_phid,
+          $device_phid,
+          $max_version);
+      }
+
+      $result_version = $max_version;
     } else {
-      $max_version = 0;
-    }
+      // If no version records exist yet, we need to be careful, because we
+      // can not tell which nodes are leaders.
+
+      // There might be several nodes with arbitrary existing data, and we have
+      // no way to tell which one has the "right" data. If we pick wrong, we
+      // might erase some or all of the data in the repository.
+
+      // Since this is dangeorus, we refuse to guess unless there is only one
+      // device. If we're the only device in the group, we obviously must be
+      // a leader.
+
+      $service = $this->loadAlmanacService();
+      if (!$service) {
+        throw new Exception(pht('Failed to load repository cluster service.'));
+      }
 
-    if ($max_version > $this_version) {
-      $fetchable = array();
-      foreach ($versions as $version) {
-        if ($version->getRepositoryVersion() == $max_version) {
-          $fetchable[] = $version->getDevicePHID();
-        }
+      $bindings = $service->getActiveBindings();
+      $device_map = array();
+      foreach ($bindings as $binding) {
+        $device_map[$binding->getDevicePHID()] = true;
+      }
+
+      if (count($device_map) > 1) {
+        throw new Exception(
+          pht(
+            'Repository "%s" exists on more than one device, but no device '.
+            'has any repository version information. Phabricator can not '.
+            'guess which copy of the existing data is authoritative. Remove '.
+            'all but one device from service to mark the remaining device '.
+            'as the authority.',
+            $this->getDisplayName()));
       }
 
-      $this->synchronizeWorkingCopyFromDevices($fetchable);
+      if (empty($device_map[$device->getPHID()])) {
+        throw new Exception(
+          pht(
+            'Repository "%s" is being synchronized on device "%s", but '.
+            'this device is not bound to the corresponding cluster '.
+            'service ("%s").',
+            $this->getDisplayName(),
+            $device->getName(),
+            $service->getName()));
+      }
 
+      // The current device is the only device in service, so it must be a
+      // leader. We can safely have any future nodes which come online read
+      // from it.
       PhabricatorRepositoryWorkingCopyVersion::updateVersion(
         $repository_phid,
         $device_phid,
-        $max_version);
+        0);
+
+      $result_version = 0;
     }
 
     $read_lock->unlock();
 
-    return $max_version;
+    return $result_version;
   }
 
 
@@ -2399,15 +2483,10 @@
     }
 
     $device_map = array_fuse($device_phids);
-    $bindings = $service->getBindings();
+    $bindings = $service->getActiveBindings();
 
     $fetchable = array();
     foreach ($bindings as $binding) {
-      // We can't fetch from disabled nodes.
-      if ($binding->getIsDisabled()) {
-        continue;
-      }
-
       // We can't fetch from nodes which don't have the newest version.
       $device_phid = $binding->getDevicePHID();
       if (empty($device_map[$device_phid])) {
