Tune ClickHouse schema after performance review

Apply findings from a review of the ClickHouse community-wisdom
performance docs against the actual write/read patterns:
- Drop monthly partitioning on both tables (PARTITION BY tuple()).
Backfill writes ~10k posts per user, spanning years, in batches of 25;
monthly partitioning would fan each batch across many partitions and
trigger the "too many parts" failure mode even at our small total
volume (~50 GB lifetime), with no benefit since no query filters by
date alone.
- LowCardinality(String) on post_author_did in both tables.
- Minmax skip index on post_snapshots.post_created_at to support the
optional ?after= filter without putting microsecond timestamps in the
sort key.
- Switch post_snapshots engine to the two-argument form
ReplacingMergeTree(snapshot_taken_at, is_deleted) for native deletion
semantics.
- Top-25 query now uses FROM post_snapshots FINAL. Without it, multiple
snapshot versions of the same post_uri (between merges) split into
separate GROUP BY buckets and return the same post twice. FINAL is
cheap here because the query always filters on the leading order-key
column.
- engagement_events still does NOT use FINAL: countIf over briefly
duplicated rows is within the accepted error budget.
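For orientation, a minimal sketch of the post_snapshots shape the items
above imply. The column types, the like_count column, the index name,
the GRANULARITY value, and the exact ORDER BY tuple are assumptions for
illustration and are not copied from the migration:

```sql
-- Sketch only: everything marked "assumed" is hypothetical.
CREATE TABLE post_snapshots
(
    post_author_did   LowCardinality(String),
    post_uri          String,
    post_created_at   DateTime64(6),   -- microsecond precision, kept out of the sort key
    snapshot_taken_at DateTime,        -- assumed type; used as the version column
    like_count        UInt32,          -- assumed count column; no T64, default LZ4
    is_deleted        UInt8,           -- 1 marks a tombstone row
    -- assumed index name and granularity; supports the optional ?after= filter
    INDEX idx_post_created_at post_created_at TYPE minmax GRANULARITY 4
)
ENGINE = ReplacingMergeTree(snapshot_taken_at, is_deleted)
PARTITION BY tuple()                        -- no partitioning
ORDER BY (post_author_did, post_uri);       -- assumed key; is_deleted deliberately absent
```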

Rejected: T64 codec on the count columns. T64 expects sparse data or
small ranges within a block; like counts on Bluesky mix small and large
values in any granule, which is the wrong distribution. Default LZ4 is
fine and we can revisit with a real benchmark if compression becomes a
problem.

Also rejected: putting is_deleted in the post_snapshots ORDER BY. Doing
so would give tombstones a different key tuple from the originals they
replace, breaking the ReplacingMergeTree collapse mechanism entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>