Restore a rocks database from object storage
implementation, format stuff, etc.

in general when there's a question of how to handle something: try to match the rocksdb behaviour.


## backup files

see the rocksdb [wiki](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB) and the authoritative rocks backup implementation: [`utilities/backup/backup_engine.cc`](https://github.com/facebook/rocksdb/blob/main/utilities/backup/backup_engine.cc).

The format is line-oriented text. yay.

The incremental backup files are stored like this:

```
meta/
  1                                # meta file for backup 1
  2                                # meta file for backup 2
private/
  1/                               # per-backup files (CURRENT, MANIFEST, OPTIONS, WALs)
  2/
shared_checksum/
  000007_2894567812_590.sst        # SSTs shared across backups, funny names
```

The `meta` files contain everything you need to restore one backup: namely, a list of files to copy.
None of it is too complicated.

Schema v2 added a `schema_version 2.N` line at the top but is otherwise backwards compatible with v1 (which starts with the timestamp line), so both are supported:

```
[schema_version 2.N]     # absent for v1
<timestamp>
<sequence_number>
[metadata <hex>]         # optional app metadata
[<field> <value>]*       # unknown fields, skippable unless ni:: prefixed
<file_count>
<path> [field value]*    # many lines like this (<file_count> of them)
```

File fields include `crc32` (actually crc32c), `size`, `temp`, and `ni::excluded`.
Fields starting with `ni::` (non-ignorable) must make parsing fail unless the parser specifically recognizes them. yay forward compat.
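A rough sketch of what parsing one file line can look like, std only. The `FileEntry` type and `parse_file_line` function are made-up names for illustration, not the actual `eat-rocks` API, and it assumes single-space separators, paths without spaces, decimal `crc32` values, and `true`/`false` for `ni::excluded`. `temp` is simply skipped as an ignorable field here.

```rust
/// One parsed `<path> [field value]*` line (illustrative type, not the real one).
#[derive(Debug, PartialEq)]
pub struct FileEntry {
    pub path: String,
    pub crc32c: Option<u32>,
    pub size: Option<u64>,
    pub excluded: bool,
}

pub fn parse_file_line(line: &str) -> Result<FileEntry, String> {
    // Assumption: tokens are space separated and the path itself has no spaces.
    let mut tokens = line.split(' ').filter(|t| !t.is_empty());
    let path = tokens.next().ok_or("empty file line")?.to_string();
    let mut entry = FileEntry { path, crc32c: None, size: None, excluded: false };

    while let Some(field) = tokens.next() {
        let value = tokens
            .next()
            .ok_or_else(|| format!("field `{}` has no value", field))?;
        match field {
            // Named crc32 in the file, but the value is a crc32c (assumed decimal).
            "crc32" => {
                let crc = value
                    .parse::<u32>()
                    .map_err(|e| format!("bad crc32: {}", e))?;
                entry.crc32c = Some(crc);
            }
            "size" => {
                let size = value
                    .parse::<u64>()
                    .map_err(|e| format!("bad size: {}", e))?;
                entry.size = Some(size);
            }
            // The one non-ignorable field this sketch recognizes.
            "ni::excluded" => {
                entry.excluded = match value {
                    "true" => true,
                    "false" => false,
                    other => return Err(format!("bad ni::excluded value `{}`", other)),
                };
            }
            // Any other ni:: field is a hard failure.
            f if f.starts_with("ni::") => {
                return Err(format!("unrecognized non-ignorable field `{}`", f));
            }
            // Everything else (including `temp` in this sketch) is skipped.
            _ => {}
        }
    }
    Ok(entry)
}
```

A caller would then reject any entry with `excluded == true` before starting to copy files.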

To avoid file name collisions, rocks puts `_<checksum>_<size>` suffixes on files in the `shared_checksum/` folder.
These are just for uniqueness.
During restore you strip the suffix and write to `<name before first underscore>.<ext>`, without interpreting the checksum or size parts.
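The renaming can be sketched in a few lines of std-only Rust. `restored_name` is a hypothetical helper name, and it assumes the original file name itself contains no underscore (true for names like `000007.sst`).

```rust
/// Recover the restore-time file name from a `shared_checksum/` entry,
/// e.g. `000007_2894567812_590.sst` becomes `000007.sst`.
/// Returns `None` if the name has no extension or no `_` suffix.
pub fn restored_name(shared_name: &str) -> Option<String> {
    // Split off the extension first so dots never confuse the suffix logic.
    let (stem, ext) = shared_name.rsplit_once('.')?;
    // Everything after the first underscore is the uniqueness suffix.
    let (base, _suffix) = stem.split_once('_')?;
    if base.is_empty() {
        return None;
    }
    Some(format!("{}.{}", base, ext))
}
```

Note the suffix is dropped without being parsed, matching the "just for uniqueness" point above.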


### exclusion zone

rocks has fancy multi-backup functionality where one backup can reference files living in another backup, to avoid redundant copies. files referenced this way get marked `excluded`.

these are outside the scope of `eat-rocks`, and we error if any files in the meta are `excluded`, since we wouldn't know where to look for them.


### get `CURRENT`

rocks itself uses a file called `CURRENT` as its entrypoint to the db.
when restoring, we write all other files first, then atomically rename the new `CURRENT` into place, so a partial restore never leaves a db that looks complete. (just following rocks here)
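A minimal sketch of the "write everything else, then rename `CURRENT` last" step, using only `std::fs`. The function name `install_current` and the temp file name `CURRENT.tmp` are assumptions for illustration. It relies on `rename` being atomic when source and destination are on the same filesystem, which holds on POSIX systems.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Install `CURRENT` as the final restore step: write to a temporary name,
/// flush to disk, then rename into place.
pub fn install_current(db_dir: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical temp name; it lives in the same directory so the rename
    // never crosses a filesystem boundary.
    let tmp = db_dir.join("CURRENT.tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    // Make sure the bytes are durable before the name becomes visible.
    file.sync_all()?;
    fs::rename(&tmp, db_dir.join("CURRENT"))
}
```

If the process dies before the rename, the directory has no `CURRENT` at all, so nothing will mistake it for a usable db.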


## with integrity

rocksdb (accidentally?) doesn't emit a `size` value in the meta for files, for... some reason.
so even though there's an unconditional validation check for it, in practice it never actually runs.

we do reliably get the crc32c from rocks though, and `eat-rocks` will check it by default.
passing `--no-verify` (or setting `RestoreOptions::verify` to false) disables the check.
i'm not sure why you'd want to though -- unlike a restore from the local filesystem, downloading from object storage means the full contents get streamed through us, so checking the crc is basically free.

The meta file field is called `crc32` but the values are actually crc32c.
grep for "WART" in the rocksdb source for the story. `crc32c` with `crc32c_append` for streaming works for us.
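To show why streaming verification is cheap, here is a dependency-free sketch of the append pattern. It is a stand-in, not what `eat-rocks` runs: this `crc32c_append` is a slow bit-at-a-time implementation written for illustration (same argument shape as the function named above), and `verify_chunks` is a hypothetical helper.

```rust
/// Reflected CRC-32C (Castagnoli) polynomial.
const POLY: u32 = 0x82F6_3B78;

/// Extend a running CRC-32C with more bytes. Start from 0 for the first chunk.
/// Bit-at-a-time: easy to check by eye, far slower than a hardware-backed crate.
pub fn crc32c_append(crc: u32, data: &[u8]) -> u32 {
    // Undo the final inversion of the previous call to get the raw state back.
    let mut state = !crc;
    for &byte in data {
        state ^= u32::from(byte);
        for _ in 0..8 {
            state = if state & 1 != 0 { (state >> 1) ^ POLY } else { state >> 1 };
        }
    }
    !state
}

/// Fold chunks into a checksum as they arrive, then compare with the value
/// taken from the meta file.
pub fn verify_chunks<'a, I>(chunks: I, expected: u32) -> Result<(), String>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let actual = chunks
        .into_iter()
        .fold(0u32, |crc, chunk| crc32c_append(crc, chunk));
    if actual == expected {
        Ok(())
    } else {
        Err(format!("crc32c mismatch: expected {}, got {}", expected, actual))
    }
}
```

The checksum only ever needs the current chunk and a `u32` of state, so it fits naturally into a download loop without buffering the whole file.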


## all together (concurrency)

the concurrency limit is applied in two places -- idk if this was a great idea but hey.
`futures::stream::StreamExt::buffer_unordered` limits how many files we're asking `object_store` to work on at any given time, and `object_store::limit::LimitStore` wraps the actual `ObjectStore` backend to apply the same limit at a lower level.


## plz don't ignore

non-ignorable fields (`ni::` prefix) cause hard failures in both header and file parsing if they're not recognized.
Unknown ignorable fields are silently skipped.
`ni::excluded` is the only recognized `ni::` field currently.
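For the header side of that rule, a rough std-only sketch follows. `Header` and `parse_header` are made-up names, and the sketch assumes that the first header line without a space is the file count (every field line has a `<name> <value>` shape). It treats v1 and v2 headers the same way, which may be looser than what rocksdb does for v1.

```rust
#[derive(Debug, PartialEq)]
pub struct Header {
    pub timestamp: i64,
    pub sequence_number: u64,
    pub app_metadata_hex: Option<String>,
    pub file_count: usize,
}

/// Consume header lines from `lines`, leaving the iterator at the first file line.
pub fn parse_header<'a, I>(lines: &mut I) -> Result<Header, String>
where
    I: Iterator<Item = &'a str>,
{
    let mut line = lines.next().ok_or("empty meta file")?;
    // v2 starts with a schema_version line; v1 starts with the timestamp.
    if line.starts_with("schema_version ") {
        line = lines.next().ok_or("missing timestamp")?;
    }
    let timestamp = line
        .parse::<i64>()
        .map_err(|e| format!("bad timestamp: {}", e))?;
    let sequence_number = lines
        .next()
        .ok_or("missing sequence number")?
        .parse::<u64>()
        .map_err(|e| format!("bad sequence number: {}", e))?;

    let mut app_metadata_hex = None;
    loop {
        let line = lines.next().ok_or("missing file count")?;
        match line.split_once(' ') {
            // Assumption: no space means this is the <file_count> line.
            None => {
                let file_count = line
                    .parse::<usize>()
                    .map_err(|e| format!("bad file count: {}", e))?;
                return Ok(Header { timestamp, sequence_number, app_metadata_hex, file_count });
            }
            Some(("metadata", hex)) => app_metadata_hex = Some(hex.to_string()),
            // No header-level ni:: fields are recognized in this sketch.
            Some((name, _)) if name.starts_with("ni::") => {
                return Err(format!("unrecognized non-ignorable header field `{}`", name));
            }
            // Unknown ignorable field: skip silently.
            Some(_) => {}
        }
    }
}
```

After it returns, the caller reads exactly `file_count` more lines from the same iterator as file entries.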


## don't test me

unit tests go in the modules of the things they test (normal rust).

`object_store::memory::InMemory` is great for stubbing out object storage with arbitrary contents.

there are some neat end-to-end tests in `tests/e2e.rs` which hopefully validate the whole thing, down to actually generating real backups from rocksdb (via the `rocksdb` rust crate) and restoring them with both our implementation and rocks'.

the meta file parser is pretty simple/small but hey why not fuzz it -- try:

```bash
rustup run nightly cargo fuzz run parse_meta
```