Tune ClickHouse schema after performance review

Apply findings from a review of the ClickHouse community-wisdom
performance docs against the actual write/read patterns:

- Drop monthly partitioning on both tables (PARTITION BY tuple()).
  Backfill writes ~10k posts spanning years per user in batches of 25;
  monthly partitioning would fan each batch across many partitions and
  trigger the "too many parts" failure mode at our small total volume
  (~50 GB lifetime), with no benefit since no query filters by date
  alone.
- LowCardinality(String) on post_author_did in both tables.
- Minmax skip index on post_snapshots.post_created_at to support the
  optional ?after= filter without putting microsecond timestamps in the
  sort key.
- Switch post_snapshots engine to the two-argument form
  ReplacingMergeTree(snapshot_taken_at, is_deleted) for native deletion
  semantics.
- Top-25 query now uses FROM post_snapshots FINAL. Without it, multiple
  snapshot versions of the same post_uri (between merges) split into
  separate GROUP BY buckets and return the same post twice. FINAL is
  cheap here because the query always filters on the leading order-key
  column.
- engagement_events still does NOT use FINAL — countIf over
  briefly-duplicated rows is within the accepted error budget.
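
The partition fan-out described in the first bullet can be illustrated
with a small simulation. This is only a sketch: the post timestamps, the
arrival order, and the helper names (monthly_partition,
partitions_per_batch) are assumptions made up for illustration, not
taken from the backfill code.

```python
from datetime import datetime, timedelta


def monthly_partition(ts: datetime) -> int:
    """Mimic toYYYYMM(): 2023-04-11 -> 202304."""
    return ts.year * 100 + ts.month


def partitions_per_batch(timestamps, batch_size=25):
    """Return how many distinct monthly partitions each insert batch touches."""
    counts = []
    for start in range(0, len(timestamps), batch_size):
        batch = timestamps[start:start + batch_size]
        counts.append(len({monthly_partition(ts) for ts in batch}))
    return counts


# Hypothetical author: 10,000 posts spread evenly over roughly five years.
# Assumption: posts arrive in no particular date order, modelled here with
# a fixed stride permutation (7919 is coprime with 10,000).
first = datetime(2021, 1, 1)
step = timedelta(hours=4, minutes=23)
posts = [first + step * ((i * 7919) % 10_000) for i in range(10_000)]

counts = partitions_per_batch(posts)
print("batches:", len(counts))
print("partitions touched per batch (min, max):", min(counts), max(counts))
```

Every partition a batch touches becomes at least one new part on
insert, so with monthly partitioning the part count grows with the
spread of dates inside each batch. With PARTITION BY tuple() each batch
produces a single part regardless of the dates it contains.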

No T64 codec on the count columns. T64 expects sparse data or small
ranges within a block; like counts on Bluesky mix small and large
values in any granule, which is the wrong distribution. Default LZ4 is
fine and we can revisit with a real benchmark if compression becomes a
problem.
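
As a rough illustration of why the value distribution matters for T64,
the sketch below models only the bit-cropping idea: the codec keeps the
bits that can differ between the smallest and largest value it sees.
It is a simplified model with made-up sample values, not ClickHouse's
implementation, and it says nothing about how LZ4 would do on the same
data.

```python
def kept_bits(block):
    """Number of low bits that can differ between the block's min and max."""
    return (min(block) ^ max(block)).bit_length()


# Hypothetical blocks of 64 like counts each.
narrow = [1024 + i for i in range(64)]        # values confined to a small range
mixed = [3, 12, 0, 48_000, 7, 2, 950, 1] * 8  # quiet posts next to a viral one

print("narrow block keeps", kept_bits(narrow), "bits per value")
print("mixed block keeps", kept_bits(mixed), "bits per value")
```

A single large value in a block forces every value in it to keep the
wide representation, which is the situation the paragraph above
describes for like counts.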

Also rejected: putting is_deleted in the post_snapshots ORDER BY. Doing
so would give tombstones a different key tuple from the originals they
replace, breaking the ReplacingMergeTree collapse mechanism entirely.
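
The effect of the key choice can be sketched with a toy model of the
collapse step. The collapse function below is a hypothetical stand-in
for what a merge does (keep the row with the largest version per sort
key); it does not model the later physical removal of the tombstone
itself, and the sample rows are invented.

```python
def collapse(rows, key_columns, version_column="snapshot_taken_at"):
    """Keep, per sorting-key tuple, only the row with the largest version."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in latest or row[version_column] >= latest[key][version_column]:
            latest[key] = row
    return list(latest.values())


original = {
    "post_author_did": "did:plc:alice",
    "post_uri": "at://alice/post/1",
    "snapshot_taken_at": 1,
    "is_deleted": 0,
}
tombstone = dict(original, snapshot_taken_at=2, is_deleted=1)
rows = [original, tombstone]

# Key as committed: the tombstone shares the key and replaces the original.
good = collapse(rows, ["post_author_did", "post_uri"])
# Rejected key: is_deleted differs, so the two rows never meet.
bad = collapse(rows, ["post_author_did", "post_uri", "is_deleted"])

print("rows left with (author, uri) key:", len(good))
print("rows left with is_deleted in the key:", len(bad))
```

With is_deleted in the key, the original row survives alongside its
tombstone, so a reader filtering on is_deleted = 0 would still see the
deleted post.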

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tao Bojlén 2 months ago 6d6598f3 28a37f6e

+64 -22

1 changed file

docs/superpowers/specs/2026-04-11-skystar-bluesky-design.md

@@ -221,22 +221,43 @@
 ```sql
 CREATE TABLE post_snapshots (
     post_uri String,
-    post_author_did String,
+    post_author_did LowCardinality(String),
     post_text String,
     post_created_at DateTime64(6),
     snapshot_likes UInt32,
     snapshot_reposts UInt32,
     snapshot_quotes UInt32,          -- populated from getPosts.quoteCount
     snapshot_taken_at DateTime64(6), -- per-post watermark
-    is_deleted UInt8 DEFAULT 0
-) ENGINE = ReplacingMergeTree(snapshot_taken_at)
+    is_deleted UInt8 DEFAULT 0,
+    INDEX idx_created post_created_at TYPE minmax GRANULARITY 4
+) ENGINE = ReplacingMergeTree(snapshot_taken_at, is_deleted)
 ORDER BY (post_author_did, post_uri)
-PARTITION BY toYYYYMM(post_created_at);
+PARTITION BY tuple();
 ```
 
-`ReplacingMergeTree` means re-running backfill or post-delete tombstones for
-the same `(post_author_did, post_uri)` collapses to the latest version on
-merge.
+`ReplacingMergeTree` collapses rows with the same ORDER BY tuple
+`(post_author_did, post_uri)` to the latest version (largest
+`snapshot_taken_at`). The two-argument engine form enables native deletion
+semantics: a tombstone row written with `is_deleted=1` causes ClickHouse to
+physically remove the row on the next merge, and `FINAL` queries skip
+deleted rows at read time. `is_deleted` is **not** part of the order key —
+if it were, tombstones would have a different key tuple from the originals
+they replace and the collapse would never happen.
+
+`LowCardinality(String)` on `post_author_did` turns every column read into
+a dictionary lookup; tens of thousands of distinct authors is well within
+the LowCardinality sweet spot.
+
+The `idx_created` minmax skip index lets the optional `?after=` filter
+prune granules without putting a microsecond timestamp in the sort key
+(which would destroy the sparse primary index).
+
+`PARTITION BY tuple()` (i.e. no partitioning) is deliberate. Backfill
+inserts ~10,000 posts per user spanning years of history in batches of 25;
+monthly partitioning would fan each batch across many partitions and
+trigger the "too many parts" failure mode. At our total data size (~50 GB
+lifetime) and with no query that filters by date alone, partitioning buys
+us nothing.
 
 #### `engagement_events`
 
@@ -247,25 +268,32 @@
 ```sql
 CREATE TABLE engagement_events (
     post_uri String,
-    post_author_did String,         -- denormalized for fast author filter
+    post_author_did LowCardinality(String), -- denormalized for fast author filter
     actor_did String,
-    rkey String,                    -- the engagement record's rkey
-    kind LowCardinality(String),    -- 'like' | 'repost' | 'quote'
-    event_created_at DateTime64(6), -- when the actor created the engagement
+    rkey String,                            -- the engagement record's rkey
+    kind LowCardinality(String),            -- 'like' | 'repost' | 'quote'
+    event_created_at DateTime64(6),         -- when the actor created the engagement
     ingested_at DateTime64(6) DEFAULT now64(6)
 ) ENGINE = ReplacingMergeTree(ingested_at)
 ORDER BY (post_author_did, kind, post_uri, actor_did, rkey)
-PARTITION BY toYYYYMM(event_created_at);
+PARTITION BY tuple();
 ```
 
 `ReplacingMergeTree` keyed on the natural unique identifier
 `(post_author_did, kind, post_uri, actor_did, rkey)` makes inserts
 idempotent: replaying the same event on Jetstream reconnect collides with
-itself and collapses on merge. We do **not** use `FINAL` in queries; the
-brief transient overcount during the merge window is accepted (see §10).
+itself and collapses on merge. We do **not** use `FINAL` on
+`engagement_events` — `countIf` over briefly-duplicated rows produces a
+slight transient overcount, which is within the accepted error budget (see
+§10). This is different from `post_snapshots`, where duplicate rows would
+cause GROUP BY bucket splitting that returns the same post twice; on
+`post_snapshots` we do use `FINAL`.
 
 The order key also makes "top posts for one author, one kind" a sequential
 scan over a tight slab of disk, supporting single-digit-millisecond queries.
+`post_author_did` is `LowCardinality(String)` for the same dictionary-lookup
+benefits as on `post_snapshots`. `PARTITION BY tuple()` for the same
+"too many parts" reasons.
 
 ### How counts are computed
 
@@ -288,28 +316,42 @@
 ```sql
 SELECT
   s.post_uri,
-  any(s.post_text) AS text,
-  any(s.post_created_at) AS created_at,
+  s.post_text,
+  s.post_created_at,
   s.snapshot_likes
     + countIf(e.kind='like' AND e.event_created_at > s.snapshot_taken_at)
     AS likes,
   s.snapshot_reposts
     + countIf(e.kind='repost' AND e.event_created_at > s.snapshot_taken_at)
     AS reposts
-FROM post_snapshots s
-LEFT JOIN engagement_events e
+FROM post_snapshots FINAL AS s
+LEFT JOIN engagement_events AS e
   ON e.post_uri = s.post_uri
   AND e.post_author_did = s.post_author_did
 WHERE s.post_author_did = ?
   AND s.is_deleted = 0
   AND (? IS NULL OR s.post_created_at >= ?) -- ?after= filter
-GROUP BY s.post_uri, s.snapshot_likes, s.snapshot_reposts, s.snapshot_taken_at
-ORDER BY {likes|reposts} DESC, created_at DESC
+GROUP BY s.post_uri, s.post_text, s.post_created_at,
+         s.snapshot_likes, s.snapshot_reposts, s.snapshot_taken_at
+ORDER BY {likes|reposts} DESC, s.post_created_at DESC
 LIMIT 25;
 ```
 
-ClickHouse pushes the `post_author_did = ?` predicate down both sides of the
-join because both tables are ordered by `post_author_did` first.
+`FROM post_snapshots FINAL` is necessary for correctness: between merges,
+multiple snapshot versions of the same `post_uri` can coexist (re-backfill,
+tombstones), and a naive GROUP BY would split them into separate buckets
+and return the same post twice. `FINAL` deduplicates at query time and
+applies the `is_deleted` marker. Per the ClickHouse docs, `FINAL` is cheap
+when the query filters on the leading order-key column — we always filter
+on `post_author_did` first, so `FINAL` only has to scan that one author's
+tight slab (a few granules at most for ~10k posts).
+
+We do **not** use `FINAL` on `engagement_events`. `countIf` over briefly-
+duplicated rows from Jetstream replays produces a small transient overcount
+which is within the accepted error budget (see §10).
+
+ClickHouse pushes the `post_author_did = ?` predicate down both sides of
+the join because both tables are ordered by `post_author_did` first.
 
 ### SQLite — metadata (Lucid models)
 
