Add generational read version rfd · corporate.fm/hobbes@1700f68

+43

1 changed file

rfds/00005_generational_read_version.md

# Generational Read Version

In FoundationDB, transactions are not allowed to span generations.
During recovery, the current database version is advanced by 100 million versions (about 100 seconds).
Because the MVCC window is only 5 million versions (about 5 seconds),
this effectively kills any transactions that were in progress during the recovery;
they will fail with `transaction_too_old`.

There are a couple of reasons for this design.

For one, the new generation comes with a completely new set of stateless processes,
including a new Resolver.
The new Resolver does not have the last 5M versions of writes in memory,
so it can't perform concurrency control for any transactions spanning the recovery.
Fast-forwarding the database version fixes this:
these transactions will never reach the Resolver because post-recovery they will always be considered too old.

The second reason is considerably more subtle.
On Storage servers,
writes are applied eagerly to the in-memory ptree before they are actually considered committed (fully replicated).
This is safe because clients cannot receive a read version to read these mutations until after the CommitProxy acks the version
(to the Master a.k.a. Sequencer), which only happens once the batch is fully replicated.
This eager fetching effectively pipelines the commit process, saving latency;
by the time a client is able to receive a new read version, the mutations are likely already on the Storage servers,
so the extra hop from TLogs to Storage is effectively nullified.

The performance improvement comes at the cost of some complexity:
during a recovery, uncommitted mutations can be lost (an unavoidable property of a fault-tolerant system),
and this means Storage servers must be able to "undo" uncommitted mutations.
FDB solves this elegantly:
due to its MVCC model, the last 5M versions of mutations are stored *only* in memory on Storage servers,
and are not written to disk.
When a Storage server notices that a recovery has occurred, it undoes all in-memory mutations by literally just killing itself.
After restarting, it will no longer read mutations past the end of the previous generation before moving on,
effectively skipping the uncommitted versions.

(Note that this means the range of partially-committed batches on TLogs cannot be allowed to exceed 5M versions,
which is an incredibly subtle gotcha.)

So the problem is that if we do *not* fast-forward during recovery,
a new transaction with a read version from the new generation's Sequencer could attempt to read uncommitted mutations from the *last* generation.
That is, if we did not fast-forward, the generations' version ranges could *overlap* and there would be confusion about which mutations are being read.
The fast-forward fixes this because it effectively skips over those versions in the "history" of the database.
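
The version arithmetic behind both reasons can be sketched in a few lines of Python.
This is an illustrative model only, not code from FoundationDB or this repository:
the names (`OldGeneration`, `first_version_of_new_generation`, `is_too_old`, `generations_overlap`) are hypothetical,
the rate of one million versions per second is inferred from the figures above,
and the choice to apply the jump to the last version the old Sequencer handed out is an assumption.

```python
from dataclasses import dataclass

# Assumed rate, inferred from "100 million versions (about 100 seconds)".
VERSIONS_PER_SECOND = 1_000_000
MVCC_WINDOW = 5 * VERSIONS_PER_SECOND      # 5M versions
FAST_FORWARD = 100 * VERSIONS_PER_SECOND   # 100M versions


@dataclass(frozen=True)
class OldGeneration:
    """Hypothetical summary of the generation that just ended."""
    last_committed: int  # highest fully replicated version
    last_assigned: int   # highest version handed out; versions in
                         # (last_committed, last_assigned] may be uncommitted


def first_version_of_new_generation(old: OldGeneration, fast_forward: bool) -> int:
    """Where the new generation starts, with or without the recovery jump."""
    if fast_forward:
        return old.last_assigned + FAST_FORWARD
    # Counterfactual: simply continue right after the last committed version.
    return old.last_committed + 1


def is_too_old(read_version: int, current_version: int) -> bool:
    """A transaction is too old once its read version leaves the MVCC window."""
    return current_version - read_version > MVCC_WINDOW


def generations_overlap(old: OldGeneration, new_start: int) -> bool:
    """True if the new generation could reuse versions that the old
    generation may have applied to Storage servers but never committed."""
    return new_start <= old.last_assigned
```

With the jump, every read version from the old generation is at least 100M versions behind the new generation's first version,
so `is_too_old` holds for all of them and `generations_overlap` is false.
Without it, an old read version can still sit inside the 5M window and the two version ranges can overlap.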
