Because `TLog` servers store important transaction logs, they are not destroyed immediately after a recovery.
They live until `Storage` servers have applied all of their logs, and are destroyed once they are no longer needed.
In practice, this usually takes only a few seconds.

## Recovery Path

### Generation failure

Each Transaction Plane generation is managed by a `Manager` server.
The `Manager` is spawned by the lead `Coordinator` and receives heartbeat messages from `ServerSupervisor` processes on all nodes.
If the `Manager` learns that a server has died, or stops hearing from a node, it kills itself.
Remember that the various roles within the Transaction Plane also kill themselves if their requests to each other time out,
so the failure of any server within the Transaction Plane will eventually reach the `Manager`.

If the node hosting the `Manager` (and therefore also the lead `Coordinator`) fails,
the Control Plane will choose a new lead `Coordinator` via distributed consensus.
The new lead `Coordinator` will in turn spawn a new `Manager`.

The new `Manager` will then initiate a Recovery.
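
The heartbeat check described in this section can be sketched as follows. This is a minimal illustration, not the project's implementation: the `HeartbeatMonitor` class, its method names, and the timeout value are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat seen from each node."""

    timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember when this node's ServerSupervisor last checked in.
        self.last_seen[node] = now

    def silent_nodes(self, now: float) -> list[str]:
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(
            node
            for node, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_s
        )

    def generation_failed(self, now: float) -> bool:
        # One silent node is enough: the Manager would kill itself,
        # which ends the generation and leads to a Recovery.
        return bool(self.silent_nodes(now))
```

In this sketch the `Manager` would call `generation_failed` periodically and terminate as soon as it returns `True`.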

### Increment generation

The new `Manager` will first attempt to increment the generation number in the Control Plane state.
This generation number serves as a fencing token;
it ensures that only one Recovery can take place at a time.
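
One way to picture the fencing behavior is as a compare-and-set on the generation number. The sketch below assumes that mechanism; the text only says the `Manager` "attempts to increment" the number. `ControlPlaneState` and `try_increment` are hypothetical names, and a local lock stands in for the Control Plane's consensus.

```python
import threading


class ControlPlaneState:
    """Hypothetical stand-in for the consensus-backed Control Plane state."""

    def __init__(self, generation: int = 0) -> None:
        self._generation = generation
        self._lock = threading.Lock()  # stands in for distributed consensus

    def read_generation(self) -> int:
        with self._lock:
            return self._generation

    def try_increment(self, expected: int) -> bool:
        # Succeeds only if nobody else has advanced the generation since
        # the caller read it (assumed compare-and-set semantics).
        with self._lock:
            if self._generation != expected:
                return False
            self._generation = expected + 1
            return True
```

If two would-be `Manager`s both read generation 7, only the first call to `try_increment(7)` succeeds; the second sees a stale value and must not start a Recovery.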

### Collection

#### Slots

On each node, a `ServerSupervisor` process regularly pings the `Coordinator`s to see if there is a new `Manager`.
When it learns of a new `Manager`, the `ServerSupervisor` will send a ping informing the `Manager` of its available slots.
Slots are spaces where new servers can be spawned.

The `Manager` must wait until it has collected enough slots on enough nodes to meet the fault tolerance requirements of the cluster.
For example, if the cluster uses triple replication, the `Manager` must receive at least 3 `TLog` slots on 3 different nodes to proceed.
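
The `TLog` part of this check can be sketched as below. The representation of a slot offer as a `(node, role)` pair and the function name are assumptions made for illustration; the real message format is not described here.

```python
def enough_tlog_slots(offers: list[tuple[str, str]], replication_factor: int) -> bool:
    """Return True once TLog slots have been offered on enough distinct nodes.

    `offers` holds hypothetical (node, role) pairs, one per available slot.
    """
    # Several slots on the same node add no fault tolerance,
    # so count distinct nodes rather than slots.
    nodes_with_tlog_slot = {node for node, role in offers if role == "tlog"}
    return len(nodes_with_tlog_slot) >= replication_factor
```

With triple replication, three `TLog` slots spread over only two nodes are not enough; the `Manager` keeps waiting until a third node reports a slot.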

#### Old TLogs

The `ServerSupervisor` will also collect information about the cluster, which it then forwards to all servers from the previous generation.
Stateless servers will kill themselves immediately upon finding out about a new generation,
but the `TLog` servers will instead ping the new `Manager` to inform it about their state.
The `TLog`s store information without which the recovery cannot proceed.

The `Manager` must collect enough `TLog` servers from the previous generation to proceed.
Specifically, at least one `TLog` from each `TLog` team must be available.
If an entire `TLog` team were missing, that would exceed the fault tolerance of the cluster;
the cluster could no longer remain available without data loss.
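
The "at least one `TLog` per team" rule can be sketched as a simple set check. The team layout as a mapping from team name to member IDs, and the function name, are hypothetical.

```python
def missing_teams(teams: dict[str, set[str]], reporting: set[str]) -> list[str]:
    """Return the TLog teams from which no member has reported in.

    `teams` maps a hypothetical team name to the IDs of its TLogs;
    `reporting` holds the IDs of old TLogs that have pinged the new Manager.
    """
    return sorted(team for team, members in teams.items() if not (members & reporting))
```

In this sketch the `Manager` can proceed only when `missing_teams` returns an empty list.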

### Recovery

To recover, the `Manager` first analyzes the information sent by the previous generation's `TLog`s.
Each `TLog`'s ping message includes two important values:

- `durable_version`: the version of the largest committed batch the `TLog` has persisted
- `known_committed_version`: the largest version that the `TLog` *knows* was committed to every `TLog` in the generation

The `Manager` then uses the values from all surviving `TLog`s to compute two values:

- `min_dv`: the smallest `durable_version`, i.e. the largest version that was replicated to all *surviving* `TLog`s
- `max_kcv`: the largest `known_committed_version`, i.e. the largest version that was *known* to be replicated to all `TLog`s, including those which did not survive
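
Computing the two bounds is a minimum and a maximum over the surviving reports. The sketch below uses a hypothetical `TLogReport` record; only the two version fields come from the text above, and the sample versions are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLogReport:
    """Hypothetical shape of the state an old TLog reports to the Manager."""

    tlog_id: str
    durable_version: int
    known_committed_version: int


def recovery_bounds(reports: list[TLogReport]) -> tuple[int, int]:
    """Return (min_dv, max_kcv) computed from the surviving TLogs' reports."""
    if not reports:
        raise ValueError("cannot recover without any surviving TLog reports")
    min_dv = min(r.durable_version for r in reports)
    max_kcv = max(r.known_committed_version for r in reports)
    return min_dv, max_kcv


# Invented sample data: three surviving TLogs.
reports = [
    TLogReport("tlog-a", durable_version=105, known_committed_version=98),
    TLogReport("tlog-b", durable_version=103, known_committed_version=100),
    TLogReport("tlog-c", durable_version=107, known_committed_version=99),
]
min_dv, max_kcv = recovery_bounds(reports)  # (103, 100)
```

Note that `max_kcv` cannot exceed `min_dv`: a version known to be committed on every `TLog` is, by definition, durable on each survivor.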

The `min_dv` will be used as the **recovery version**:
any batches *above* this version were not fully replicated to all `TLog`s,
and because mutations are sharded, some mutations from those batches could be missing.
These partially committed batches will be discarded;
this is safe because their transactions cannot possibly have returned to the client
(an important guarantee).

The `max_kcv` will be used as a lower bound above which mutations will be copied and re-replicated to the new generation of `TLog`s.
The need to copy these versions is subtle:
batches in the range `(max_kcv, min_dv]` are fully replicated on all *surviving* `TLog`s,
but may not have been replicated to the `TLog`s that are down (temporarily or permanently).
This means that batches in this range *may not* have reached full fault tolerance.
If we *knew* a batch was not fully replicated, we could discard it, but because some `TLog`s have been lost, there is no way to know.
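
Taken together, the two bounds split the version space into three ranges. The sketch below classifies a batch version accordingly; the function name and the returned labels are hypothetical, and treating versions at or below `max_kcv` as needing no copy is inferred from `max_kcv` being the lower bound for copying.

```python
def classify_batch(version: int, min_dv: int, max_kcv: int) -> str:
    """Decide what a recovery does with a batch at `version` (illustrative labels)."""
    if version > min_dv:
        # Missing from at least one surviving TLog: partially committed, discard.
        return "discard"
    if version > max_kcv:
        # In (max_kcv, min_dv]: on every survivor, but possibly not on lost TLogs.
        return "copy"
    # At or below max_kcv: known to have reached every TLog in the old generation.
    return "known_replicated"
```

For example, with `min_dv = 103` and `max_kcv = 100`, version 104 is discarded, versions 101 through 103 are copied to the new `TLog`s, and version 100 is already known to be fully replicated.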

Because completing a recovery means permanently committing those versions,
it is important that we re-replicate them back to the full fault tolerance requirements of the cluster.
To understand why, imagine that a batch is only replicated to one `TLog` in a team of three,
and then the other two go down.
During recovery, this batch will still be present, so it will make it into the new generation.
Now imagine that, after the recovery is complete, that one `TLog` goes down and the *other two* come back up.
At this point, we have lost the only copy of those mutations, meaning the cluster cannot remain available.
But the cluster is *supposed* to be able to tolerate such a fault;
to restore fault tolerance, we re-replicate these batches.