Add recovery doc

garrison 906b8cb6 ba0dab8d

+29
internals/recovery.md
+ # Recovery
+
+ Recovery is the process by which a Hobbes cluster rebuilds its Transaction Plane after failures have occurred.
+ Some servers within the Transaction Plane, like the `Sequencer`, are singletons;
+ if they fail the cluster can no longer serve requests.
+ Others, like the `TLog`, are redundant by design,
+ but they must still be replaced or the cluster will lose its fault tolerance over time.
+
+ The Transaction Plane is complex and has many roles which must work together to commit transactions with Hobbes's strong guarantees.
+ Much of this coordination also happens over the network,
+ meaning error handling has an enormous complexity and bug surface.
+ To avoid this complexity, servers in the Transaction Plane do not handle errors at all.
+
+ For example, if a `CommitBuffer`'s request to a `TLog` timed out,
+ it would have to perform complex logic to return to a consistent state.
+ Instead, the `CommitBuffer` simply kills itself,
+ and the Recovery process returns the Transaction Plane to a consistent state by rebuilding it completely.
+
+ ### Stateless vs Stateful roles
+
+ Most of the roles in the Transaction Plane are stateless on disk (that is, they exist only in-memory) and can be trivially replaced.
+ However, each generation of the Transaction Plane *does* need to hold some persistent state (the transaction log),
+ and this data is stored on `TLog` servers.
+ This data is persistent (it must not be lost), but it is also transient:
+ the log can be safely truncated once mutations have been applied to the relevant `Storage` servers' on-disk state.
+
+ Because `TLog` servers store important transaction logs, they are not destroyed immediately after a recovery.
+ They live until all of their logs have been applied by `Storage` servers and are then destroyed once they are no longer needed.
+ In practice, this is usually only a few seconds.
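
The behaviour this document describes can be illustrated with a small toy model. The sketch below is not Hobbes code. `Cluster`, `RoleDied`, `reachable`, `recover`, and `storage_applied` are hypothetical names invented for illustration, and a single in-memory `TLog` per generation stands in for the redundant set of real servers. It shows two ideas from the text: a `CommitBuffer` that stops itself on a `TLog` timeout instead of repairing its state, and an old-generation `TLog` that is kept after recovery until `Storage` has applied everything it holds.

```python
from dataclasses import dataclass, field


class RoleDied(Exception):
    """Raised when a role stops itself instead of handling an error."""


@dataclass
class TLog:
    generation: int
    entries: list = field(default_factory=list)  # (version, mutation) pairs
    reachable: bool = True  # hypothetical switch used to simulate a timeout

    def append(self, version, mutation):
        if not self.reachable:
            raise TimeoutError("TLog did not respond")
        self.entries.append((version, mutation))


@dataclass
class CommitBuffer:
    tlog: TLog

    def commit(self, version, mutation):
        try:
            self.tlog.append(version, mutation)
        except TimeoutError as err:
            # Fail-stop: no retry and no local repair, the role just dies.
            raise RoleDied("CommitBuffer") from err


class Cluster:
    def __init__(self):
        self.generation = 0
        self.old_tlogs = []       # TLogs of earlier generations, not yet drained
        self.applied_version = 0  # highest version Storage has applied on disk
        self._build_plane()

    def _build_plane(self):
        # Stateless roles are simply recreated for the new generation.
        self.generation += 1
        self.tlog = TLog(self.generation)
        self.commit_buffer = CommitBuffer(self.tlog)

    def recover(self):
        # Keep the previous TLog around; rebuild everything else from scratch.
        self.old_tlogs.append(self.tlog)
        self._build_plane()

    def storage_applied(self, version):
        # Storage reports progress; drop old TLogs that are fully applied.
        self.applied_version = max(self.applied_version, version)
        self.old_tlogs = [
            t for t in self.old_tlogs
            if any(v > self.applied_version for v, _ in t.entries)
        ]


def demo():
    cluster = Cluster()
    cluster.commit_buffer.commit(1, "set a=1")
    cluster.tlog.reachable = False  # simulate the TLog timing out
    try:
        cluster.commit_buffer.commit(2, "set b=2")
    except RoleDied:
        cluster.recover()
    after_recovery = (cluster.generation, len(cluster.old_tlogs))
    cluster.commit_buffer.commit(2, "set b=2")  # the new generation accepts it
    cluster.storage_applied(1)  # Storage catches up with the old log
    return after_recovery, len(cluster.old_tlogs)
```

In this model `demo()` returns `((2, 1), 0)`: recovery moves the cluster to generation 2 while holding on to one old `TLog`, and that `TLog` is dropped once `Storage` reports version 1 as applied. The resubmission of the failed commit in `demo()` is only there to show that the rebuilt plane works; how the real system handles in-flight commits across a recovery is not covered by the document.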