fix: index site.standard.document with embedded leaflet content
This commit fixes indexing of documents published via site.standard.document
that contain embedded pub.leaflet.content. This was a multi-layered debugging
saga worth documenting.
## The Problem
Search for "set of shared standards" returned nothing, despite the document
existing at lab.leaflet.pub/3md4qsktbms24.
## Root Causes (in order of discovery)
1. **Fly secret overriding fly.toml** - TAP_SIGNAL_COLLECTION was set as a
Fly secret with the wrong value ("pub.leaflet.document"), which overrode
the correct value in fly.toml. Fly secrets take precedence over the env
vars declared in fly.toml.
Fix: `fly secrets unset TAP_SIGNAL_COLLECTION -a leaflet-search-tap`
2. **Extractor missing content.pages path** - The extractor only checked
`record.pages` (pub.leaflet.document) but not `record.content.pages`
(site.standard.document with embedded pub.leaflet.content). The content
structure differs between lexicons.
3. **Platform detection from collection only** - site.standard.document
gets platform="other" from collection detection, but the actual platform
(leaflet/pckt/offprint) should be inferred from the publication's basePath.
4. **Use-after-free in indexer** - base_path was read from a query result,
but row.text() returns memory that is freed by result.deinit(). Fixed by
copying the value to a stack buffer before the result is released.
5. **Startup timeout** - The HTTP server started only after slow
initialization, causing Fly proxy timeouts. Reordered to start HTTP first,
then initialize services in the background.
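The idea behind root cause 3 can be sketched roughly as below. This is an illustrative Python sketch, not the Zig code in backend/src/indexer.zig. The names `platform_from_collection`, `infer_platform`, and `KNOWN_HOSTS` are hypothetical, and so is the rule that `pub.leaflet.*` collections map to "leaflet". Only the leaflet.pub → leaflet mapping comes from this commit, and hosts for other platforms would need to be supplied by the caller.

```python
from typing import Mapping, Optional

# Only this mapping is stated in the commit; others are left to the caller.
KNOWN_HOSTS = {"leaflet.pub": "leaflet"}


def platform_from_collection(collection: str) -> str:
    """Collection-only detection (assumed rule): generic collections give "other"."""
    if collection.startswith("pub.leaflet."):
        return "leaflet"
    return "other"


def infer_platform(
    collection: str,
    base_path: Optional[str],
    known_hosts: Mapping[str, str] = KNOWN_HOSTS,
) -> str:
    """Prefer the collection; fall back to the publication's basePath host."""
    platform = platform_from_collection(collection)
    if platform != "other" or not base_path:
        return platform
    # Strip an optional scheme and any path, keeping just the host.
    host = base_path.split("://", 1)[-1].split("/", 1)[0].lower()
    for suffix, name in known_hosts.items():
        # Match the domain itself or any subdomain of it.
        if host == suffix or host.endswith("." + suffix):
            return name
    return "other"
```

With this shape, a site.standard.document whose publication has basePath "lab.leaflet.pub" resolves to "leaflet" rather than "other".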
## The Saga of Failure Modes
- Repeatedly trying to curl from TAP container (which has no curl)
- Hallucinating TAP API endpoints that don't exist
- Falling back to backfill scripts instead of fixing root cause
- Not verifying Fly secrets vs fly.toml precedence
- Writing custom PDS client when zat already has everything needed
## How Content Extraction Now Works
1. TAP receives site.standard.document (signal collection)
2. Extractor tries: textContent → pages → content.pages
3. If content is still empty and content.$type == "pub.leaflet.content",
fetch pub.leaflet.document from the PDS as a fallback
4. Indexer infers platform from basePath (leaflet.pub → leaflet, etc.)
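The lookup order in steps 2 and 3 can be sketched as follows. This is an illustrative Python sketch under assumptions, not the code in backend/src/extractor.zig or backend/src/tap.zig. `collect_plaintext`, `extract_text`, and the `fetch_leaflet_document` callback are hypothetical names. Treating page text as any string stored under a "plaintext" key is a simplification made for the sketch.

```python
from typing import Any, Callable, Optional


def collect_plaintext(node: Any) -> list[str]:
    """Recursively gather every string stored under a "plaintext" key."""
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "plaintext" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_plaintext(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_plaintext(item))
    return found


def extract_text(
    record: dict,
    fetch_leaflet_document: Optional[Callable[[], dict]] = None,
) -> str:
    """Try textContent, then pages, then content.pages, then the PDS fallback."""
    text = record.get("textContent")
    if isinstance(text, str) and text.strip():
        return text

    content = record.get("content")
    nested_pages = content.get("pages") if isinstance(content, dict) else None
    for pages in (record.get("pages"), nested_pages):
        joined = "\n".join(collect_plaintext(pages))
        if joined.strip():
            return joined

    # Still empty: fetch the underlying document only for embedded leaflet content.
    is_leaflet = isinstance(content, dict) and content.get("$type") == "pub.leaflet.content"
    if is_leaflet and fetch_leaflet_document is not None:
        fetched = fetch_leaflet_document()
        return "\n".join(collect_plaintext(fetched.get("pages")))
    return ""
```

Passing the fetch as a callback keeps the network call out of the extractor, so the ordering can be tested without a PDS.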
## Files Changed
- tap/fly.toml: signal collection → site.standard.document
- backend/src/extractor.zig: check content.pages, add test
- backend/src/tap.zig: PDS fetch fallback for empty content
- backend/src/indexer.zig: platform detection from basePath, fix UAF
- backend/src/main.zig: start HTTP first, init services in background
- backend/src/db/mod.zig: split init for staged startup
- scripts/backfill-pds: same content.pages fix
- docs/tap.md: updated documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>