# Recovery

Recovery is the process by which a Hobbes cluster rebuilds its Transaction Plane after failures have occurred.
Some servers within the Transaction Plane, like the `Sequencer`, are singletons;
if they fail, the cluster can no longer serve requests.
Others, like the `TLog`, are redundant by design,
but they must still be replaced or the cluster will lose its fault tolerance over time.

The Transaction Plane is complex and has many roles that must work together to commit transactions with Hobbes's strong guarantees.
Much of this coordination also happens over the network,
which gives error handling an enormous surface for complexity and bugs.
To avoid this complexity, servers in the Transaction Plane do not handle errors at all.

For example, if a `CommitBuffer`'s request to a `TLog` timed out,
it would have to perform complex logic to return to a consistent state.
Instead, the `CommitBuffer` simply kills itself,
and the Recovery process returns the Transaction Plane to a consistent state by rebuilding it completely.
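
The fail-fast behaviour described above can be illustrated with a minimal sketch. It is not Hobbes code: the class shapes, the `push_to_tlog` callable, and the `TLogTimeout` and `RoleTerminated` exceptions are all hypothetical names chosen for illustration, and a real role would presumably exit its process rather than flip a flag.

```python
class TLogTimeout(Exception):
    """Raised by the hypothetical TLog client when a push times out."""


class RoleTerminated(Exception):
    """Signals that a role has shut itself down and must be replaced by Recovery."""


class CommitBuffer:
    def __init__(self, push_to_tlog):
        # push_to_tlog is a hypothetical callable that sends one batch to a TLog.
        self._push_to_tlog = push_to_tlog
        self.alive = True

    def commit(self, batch):
        if not self.alive:
            raise RoleTerminated("CommitBuffer is dead; waiting for Recovery")
        try:
            self._push_to_tlog(batch)
        except TLogTimeout as exc:
            # No retry and no rollback: the role gives up and terminates.
            self.alive = False
            raise RoleTerminated("TLog push timed out") from exc


def flaky_push(batch):
    # Stand-in for a network call that always times out.
    raise TLogTimeout()


buffer = CommitBuffer(flaky_push)
try:
    buffer.commit(["set a=1"])
except RoleTerminated:
    terminated = True
else:
    terminated = False
```

The point of the sketch is what is absent: the `except` branch contains no recovery logic of its own, so the only path back to a consistent state is an external rebuild.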

### Stateless vs Stateful roles

Most of the roles in the Transaction Plane are stateless on disk (that is, they exist only in memory) and can be trivially replaced.
However, each generation of the Transaction Plane *does* need to hold some persistent state (the transaction log),
and this data is stored on `TLog` servers.
This data is persistent (it must not be lost), but it is also transient:
the log can be safely truncated once mutations have been applied to the relevant `Storage` servers' on-disk state.

Because `TLog` servers store important transaction logs, they are not destroyed immediately after a recovery.
They live until `Storage` servers have applied all of their logs, and only then are they destroyed.
In practice, this usually takes only a few seconds.
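
The lifecycle of an old-generation `TLog` can be sketched as follows. This is a toy model under stated assumptions, not the actual implementation: the `seal`, `ack_applied`, and `can_destroy` names are hypothetical, versions are plain integers, and every `Storage` server is assumed to need every mutation (the text above only says the *relevant* servers must apply them).

```python
class TLog:
    def __init__(self, storage_ids):
        # Assumes at least one Storage server id is given.
        self.entries = []  # list of (version, mutation), in version order
        # Highest version each Storage server has applied to its on-disk state.
        self.applied = {sid: 0 for sid in storage_ids}
        self.sealed = False

    def append(self, version, mutation):
        if self.sealed:
            raise RuntimeError("sealed TLog accepts no new entries")
        self.entries.append((version, mutation))

    def ack_applied(self, storage_id, version):
        self.applied[storage_id] = max(self.applied[storage_id], version)
        # Everything at or below the slowest server's version is safe to drop.
        durable = min(self.applied.values())
        self.entries = [(v, m) for v, m in self.entries if v > durable]

    def seal(self):
        # Hypothetical marker: a recovery has started a new generation.
        self.sealed = True

    def can_destroy(self):
        # An old-generation TLog may go away only once its log is empty.
        return self.sealed and not self.entries


tlog = TLog(["s1", "s2"])
tlog.append(1, "set a=1")
tlog.append(2, "set b=2")
tlog.seal()

tlog.ack_applied("s1", 2)
pending_after_s1 = len(tlog.entries)
destroyable_early = tlog.can_destroy()

tlog.ack_applied("s2", 2)
destroyable_late = tlog.can_destroy()
```

Truncation here is driven by the slowest `Storage` server, which is why the sealed `TLog` stays alive after `s1` catches up and becomes destroyable only after `s2` does as well.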