More recovery docs · corporate.fm/hobbes@9f3c57e

+82

1 changed file



internals/recovery.md

···

Because `TLog` servers store important transaction logs, they are not destroyed immediately after a recovery.
They live until all of their logs have been applied by `Storage` servers and are then destroyed once they are no longer needed.
In practice, this is usually only a few seconds.

## Recovery Path

### Generation failure

Each Transaction Plane generation is managed by a `Manager` server.
The `Manager` is spawned by the lead `Coordinator` and receives heartbeat messages from `ServerSupervisor` processes on all nodes.
If the `Manager` learns that a server has died, or stops hearing from a node, it kills itself.
Remember that various roles within the Transaction Plane also kill themselves if their requests to each other time out,
so a failure of any server within the Transaction Plane is sure to eventually reach the `Manager`.

If the node hosting the `Manager` (and therefore also the lead `Coordinator`) fails,
the Control Plane will choose a new lead `Coordinator` via distributed consensus.
It will in turn spawn a new `Manager`.

The new `Manager` will then initiate a Recovery.

### Increment generation

The new `Manager` will first attempt to increment the generation number in the Control Plane state.
This generation number serves as a fencing token;
it ensures that only one Recovery can take place at a time.

### Collection

#### Slots

On each node, a `ServerSupervisor` process regularly pings the `Coordinator`s to see if there is a new `Manager`.
When it learns of a new `Manager`, the `ServerSupervisor` will send a ping informing the `Manager` of its available slots.
Slots are spaces where new servers can be spawned.

The `Manager` must wait until it has collected enough slots on enough nodes to meet the fault tolerance requirements of the cluster.
For example, if the cluster uses triple replication, the `Manager` must receive at least 3 `TLog` slots on 3 different nodes to proceed.

#### Old TLogs

The `ServerSupervisor` will also collect information about the cluster which it then forwards to all servers from the previous generation.
Stateless servers will kill themselves immediately upon finding out about a new generation,
but the `TLog` servers will instead ping the new `Manager` to inform it about their state.
The `TLog`s store information without which the recovery cannot proceed.

The `Manager` must collect enough `TLog` servers from the previous generation to proceed.
Specifically, at least one `TLog` from each `TLog` team must be available.
If an entire `TLog` team were missing, that would exceed the fault tolerance of the cluster;
the cluster could no longer remain available without data loss.

### Recovery

To recover, the `Manager` first analyzes the information sent by the previous generation's `TLog`s.
The ping message includes two important values:

- `durable_version`: the version of the largest committed batch the `TLog` has persisted
- `known_committed_version`: the largest version that the `TLog` *knows* was committed to every `TLog` in the generation

The `Manager` then uses values from all surviving `TLog`s to compute two values:

- `min_dv`: the smallest durable version, i.e. the largest version that was replicated to all *surviving* `TLog`s
- `max_kcv`: the largest version that was *known* to be replicated to all `TLog`s, including those which did not survive

The `min_dv` will be used as the **recovery version**:
any batches *above* this version were not fully replicated to all `TLog`s,
and because mutations are sharded, some mutations from those batches could be missing.
These partially-committed batches will be discarded;
this is safe because their transactions cannot possibly have returned to the client
(an important guarantee).

The `max_kcv` will be used as an exclusive lower bound: mutations above it will be copied and re-replicated to the new generation of `TLog`s.
The need to copy these versions is subtle:
batches in the range of `(max_kcv, min_dv]` are fully replicated on all *surviving* `TLog`s,
but may not have been replicated to the `TLog`s that are down (temporarily or permanently).
This means that batches in this range *may not* have reached full fault tolerance.
If we *knew* a batch was not fully replicated, we could discard it, but because some `TLog`s have been lost, there is no way to know.

Because completing a recovery means permanently committing those versions,
it is important that we re-replicate them back to the full fault tolerance requirements of the cluster.
To understand why, imagine that a batch is only replicated to one `TLog` in a team of three,
and then the other two go down.
During recovery, this batch will still be present, so it will make it into the new generation.
Now imagine that, after the recovery is complete, that one `TLog` goes down and the *other two* come back up.
At this point, we have lost the only copy of those mutations, meaning the cluster cannot remain available.
But the cluster is *supposed* to be able to tolerate such a fault;
to restore fault tolerance, we re-replicate these batches.
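
The version arithmetic described in this section can be sketched in a few lines of Python.
This is only an illustration, not the project's implementation: the `TLogPing` and `RecoveryPlan` types, the `team` field, the `plan_recovery` and `classify` functions, and the three label strings are all hypothetical names chosen for this example.
Only `durable_version`, `known_committed_version`, `min_dv`, and `max_kcv` come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TLogPing:
    """State reported by one surviving TLog (hypothetical shape)."""
    team: int                     # hypothetical team identifier
    durable_version: int          # largest committed batch this TLog persisted
    known_committed_version: int  # largest version known committed to every TLog


@dataclass(frozen=True)
class RecoveryPlan:
    min_dv: int   # recovery version: batches above it are discarded
    max_kcv: int  # exclusive lower bound of the range to re-replicate


def plan_recovery(pings: list[TLogPing], teams: set[int]) -> RecoveryPlan:
    """Compute min_dv and max_kcv from the pings of surviving TLogs."""
    # Collection precondition: at least one TLog from every team has reported.
    missing = teams - {p.team for p in pings}
    if missing:
        raise ValueError(f"no surviving TLog for teams {sorted(missing)}")

    min_dv = min(p.durable_version for p in pings)
    max_kcv = max(p.known_committed_version for p in pings)
    return RecoveryPlan(min_dv=min_dv, max_kcv=max_kcv)


def classify(version: int, plan: RecoveryPlan) -> str:
    """Say what happens to a batch at `version` under the given plan."""
    if version > plan.min_dv:
        return "discard"          # not on every surviving TLog
    if version > plan.max_kcv:
        return "re-replicate"     # in (max_kcv, min_dv]
    return "known-committed"      # known to be on every TLog already


if __name__ == "__main__":
    pings = [TLogPing(0, 12, 9), TLogPing(0, 11, 10), TLogPing(1, 14, 8)]
    plan = plan_recovery(pings, teams={0, 1})
    for v in (10, 11, 12):
        print(v, classify(v, plan))
```

With the three example pings, `min_dv` is 11 and `max_kcv` is 10, so version 12 is discarded, version 11 is re-replicated, and version 10 is treated as already fully replicated.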
