Data is split into contiguous (ordered by key) shards and distributed (and replicated) amongst Storage servers.
If a Storage server fails or becomes unresponsive,
the Transaction Plane will issue commands to re-replicate its shards to restore fault tolerance.

## Transaction Path

Before a transaction is started, the client must be aware of the topology of the cluster,
particularly the pids of various servers in the Transaction Plane.
This information is requested from the Coordinators as they are the ultimate authority and have static registered names within the cluster.
Once obtained, the information can be aggressively cached on each node as it changes rarely (only after a recovery).

### Read version

To begin a transaction, the client needs to get a read version at which all reads will be performed.
The client chooses a `BeginBuffer` server and sends a `get_read_version()` request.

The `BeginBuffer` batches a set of requests before sending its own request for a read version to the `Sequencer`.
The `Sequencer` replies with a strictly monotonic read version guaranteed to be higher than any committed version
(this ensures strict serializability of transactions).
Additionally, the `BeginBuffer` asks the `TLog` servers of its own generation whether any of them has been locked,
which would indicate that a recovery is taking place.
If no recovery is in progress or has taken place, the `BeginBuffer` returns the read version to all clients in the batch.
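
The batching and liveness check described above can be sketched as a small pure state machine.
This is only an illustration under stated assumptions: the module, its fields, and the reply atoms (`:unlocked`, `:locked`) are hypothetical and not taken from the project.
The threshold is passed in by the caller; the `begin_buffer.ex` excerpt in this change compares the number of replies against the generation's `replication_factor`.

```elixir
defmodule ReadVersionBatchSketch do
  # Hypothetical batch state; none of these field names come from the project.
  defstruct waiting: [], read_version: nil, unlocked_acks: MapSet.new(), locked: false

  def new, do: %__MODULE__{}

  # Queue a client until the batch can be answered.
  def enqueue(%__MODULE__{} = batch, client),
    do: %{batch | waiting: [client | batch.waiting]}

  # Record the single read version obtained from the Sequencer for the whole batch.
  def put_read_version(%__MODULE__{} = batch, version),
    do: %{batch | read_version: version}

  # Record one TLog's answer to the "have you been locked?" probe.
  # A MapSet is used so that a duplicate reply from the same TLog counts once.
  def record_tlog_reply(%__MODULE__{} = batch, tlog_id, :unlocked),
    do: %{batch | unlocked_acks: MapSet.put(batch.unlocked_acks, tlog_id)}

  def record_tlog_reply(%__MODULE__{} = batch, _tlog_id, :locked),
    do: %{batch | locked: true}

  # Decide what to tell the waiting clients (returned in arrival order).
  def resolve(%__MODULE__{locked: true, waiting: waiting}, _required_acks),
    do: {:error, :recovery_detected, Enum.reverse(waiting)}

  def resolve(%__MODULE__{read_version: nil}, _required_acks), do: :pending

  def resolve(%__MODULE__{} = batch, required_acks) do
    if MapSet.size(batch.unlocked_acks) >= required_acks do
      {:ok, batch.read_version, Enum.reverse(batch.waiting)}
    else
      :pending
    end
  end
end
```

Keeping the decision in a pure function like `resolve/2` makes the "answer the batch only once enough TLogs report unlocked" rule easy to test apart from any process or messaging code.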

### Performing reads

Before performing a read, the client needs to ask a `CommitBuffer` for information about the relevant shard and where it's stored.
This shard information can be aggressively cached as shards are moved rarely.
If a Storage server turns out to no longer be serving the requested shard (i.e. the cache is stale),
the client will ask the `CommitBuffer` for updated shard information.
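
A minimal sketch of the cache-then-refresh behaviour described above.
Everything here is an assumption made for illustration: `fetch_fun` stands in for asking a `CommitBuffer`, `read_fun` stands in for a `Storage` read, the `{:error, :wrong_shard}` reply is a made-up signal for a stale location, and the cache is keyed by individual key rather than by shard range to keep the example short.

```elixir
defmodule ShardCacheSketch do
  # Returns {reply, updated_cache}.
  def read(cache, key, fetch_fun, read_fun) do
    {storage, cache} = locate(cache, key, fetch_fun)

    case read_fun.(storage, key) do
      {:error, :wrong_shard} ->
        # The cached location is stale: refresh it once and retry the read.
        fresh = fetch_fun.(key)
        {read_fun.(fresh, key), Map.put(cache, key, fresh)}

      reply ->
        {reply, cache}
    end
  end

  # Use the cached location if present, otherwise ask the authority and remember the answer.
  defp locate(cache, key, fetch_fun) do
    case Map.fetch(cache, key) do
      {:ok, storage} ->
        {storage, cache}

      :error ->
        storage = fetch_fun.(key)
        {storage, Map.put(cache, key, storage)}
    end
  end
end
```

The sketch retries only once; a real client would also have to decide what to do if the refreshed location is stale as well.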

To perform reads, the client sends read requests to `Storage` servers.
The requests include the read version as well as the Transaction Plane generation,
which is needed to prevent reads of uncommitted data following a recovery.
Reads are effectively performed against a consistent snapshot of the database at the read version.

Reads which cross shard boundaries are split and sent to their respective Storage servers.
The results are joined.
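
Splitting a range read at shard boundaries can be sketched as follows.
The shard layout is an assumption for illustration: a list of half-open key ranges `{first, last, storage}` ordered by key, with binary keys compared by Elixir's built-in term ordering.
The module and function names are hypothetical.

```elixir
defmodule ShardSplitSketch do
  # Clip the requested range {from, to} against every shard and keep the
  # non-empty overlaps, tagged with the storage server that should serve them.
  def split({from, to}, shards) when from < to do
    for {first, last, storage} <- shards,
        lo = max(from, first),
        hi = min(to, last),
        lo < hi,
        do: {storage, lo, hi}
  end

  # Because shards are ordered by key, concatenating the per-shard results
  # in shard order yields the rows in key order.
  def join(results_in_shard_order), do: Enum.concat(results_in_shard_order)
end
```

For example, with shards `[{"a", "g", :s1}, {"g", "n", :s2}, {"n", "z", :s3}]`, a read of `{"e", "p"}` is split into three sub-reads, one per server, while a read of `{"b", "c"}` goes to `:s1` only.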

When reads are performed, the keys or key ranges read are tracked internally as read conflicts.
If and when the transaction is committed, these read conflicts will be used to perform concurrency control.

### Performing writes

Writes are simply buffered internally such that they can be sent when the transaction is committed.

Write conflicts are also tracked in a manner similar to read conflicts, again to be used for concurrency control.
This may seem redundant (we already have the writes themselves),
but in some cases advanced clients may choose to manipulate the write conflicts independently to improve performance or selectively weaken consistency guarantees.
However, such manipulation is dangerous and not generally recommended.
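
The client-side bookkeeping for reads and writes can be pictured with a small sketch.
The struct, its fields, and the function names are hypothetical, and conflicts are tracked per key only (no key ranges) to keep the example short.

```elixir
defmodule TxnBufferSketch do
  # Hypothetical client-side transaction state; field names are illustrative only.
  defstruct read_version: nil,
            writes: %{},
            read_conflicts: MapSet.new(),
            write_conflicts: MapSet.new()

  def new(read_version), do: %__MODULE__{read_version: read_version}

  # Every key read is remembered as a read conflict.
  def note_read(%__MODULE__{} = txn, key),
    do: %{txn | read_conflicts: MapSet.put(txn.read_conflicts, key)}

  # A write is only buffered locally; its key is also recorded as a write conflict.
  def set(%__MODULE__{} = txn, key, value) do
    %{
      txn
      | writes: Map.put(txn.writes, key, value),
        write_conflicts: MapSet.put(txn.write_conflicts, key)
    }
  end

  # What would be handed over at commit time: the buffered writes plus both conflict sets.
  def commit_payload(%__MODULE__{} = txn) do
    %{
      read_version: txn.read_version,
      writes: txn.writes,
      read_conflicts: Enum.sort(txn.read_conflicts),
      write_conflicts: Enum.sort(txn.write_conflicts)
    }
  end
end
```

Keeping `write_conflicts` separate from `writes` is what would let an advanced client adjust one without the other, with the risks noted above.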
lib/servers/begin_buffer.ex
# We use the replication_factor for the current generation to do the liveness check
%TLogGeneration{replication_factor: replication_factor} = hd(state.cluster.tlog_generations)

# TODO: handle teams
case length(state.check_locked_reply_ids) >= replication_factor do
  true ->
    read_version = state.batch_read_version