docs: update pipeline diagram and hub docs for all 8 flows · zzstoatzz.io/my-prefect-server@d62e087

+88 -40

2 changed files

expand all

README.md

docs

hub.md

+23 -18

README.md

··· 1 - personal data pipeline that digests my github and [tangled.org](https://tangled.org) activity, scores items by importance, and generates an LLM-curated briefing. self-hosted on a single hetzner VM (k3s) running prefect OSS. 1 + personal data pipeline and intelligence layer. digests github, [tangled.org](https://tangled.org), and bluesky activity, scores items, generates LLM-curated briefings, and maintains phi's long-term memory. self-hosted on a single hetzner VM (k3s) running prefect OSS. 2 2 3 3 [hub](https://hub.waow.tech) · [grafana](https://prefect-metrics.waow.tech/d/executive-overview/executive-overview?orgId=1&from=now-6h&to=now&timezone=browser) 4 4 5 5 ``` 6 6 github API ──┐ 7 - ├──► ingest ──► raw_github_issues ──┐ 8 - tangled PDS ─┘ (hourly) raw_tangled_items ──┤ 9 - ▼ 10 - transform (dbt) 11 - [on ingest ✓] 12 - │ 13 - ▼ 14 - hub_action_items 15 - (top 200) 16 - │ 17 - ┌──────────────┼──────────┐ 18 - ▼ ▼ ▼ 19 - brief /api/cards hub UI 20 - [on transform ✓] 21 - │ 22 - ▼ 23 - briefing.json 7 + ├──► ingest ──► raw_github_issues ──┐ 8 + tangled PDS ─┤ (hourly) raw_tangled_items ─┤ 9 + bluesky PDS ─┤ raw_likes + raw_liked_posts │ 10 + phi (tpuf) ─┘ raw_phi_observations ─┘ 11 + │ 12 + transform (dbt) ◄──────┘ 13 + [on ingest ✓] 14 + │ 15 + ┌───────────────┼───────────────┐ 16 + ▼ ▼ ▼ 17 + brief compact hub UI 18 + [on transform ✓] [on transform ✓] 19 + │ │ 20 + ▼ ▼ 21 + briefing.json TurboPuffer 22 + (phi-users-*) 23 + 24 + morning ──► TurboPuffer + semble 25 + (daily 8am CT) 26 + 27 + rebuild-atlas ──► Cloudflare Pages 28 + (every 6h) 24 29 ``` 25 30 26 31 see [docs/hub.md](docs/hub.md) for the full pipeline breakdown.

+65 -22

docs/hub.md

··· 1 1 # hub 2 2 3 - action item dashboard at [hub.waow.tech](https://hub.waow.tech). aggregates issues and PRs from github and [tangled.org](https://tangled.org), scores them by importance, and generates a daily briefing with an LLM. 3 + action item dashboard and intelligence pipeline at [hub.waow.tech](https://hub.waow.tech). aggregates issues and PRs from github and [tangled.org](https://tangled.org), scores them by importance, generates a daily briefing with an LLM, and maintains phi's long-term memory. 4 4 5 5 ## data sources 6 6 7 - a single `ingest` flow runs hourly on cron and fetches both data sources concurrently, then writes to DuckDB sequentially (same process = no single-writer lock contention). downstream flows (transform, brief) are event-driven via deployment triggers — they only run when upstream completes. 7 + a single `ingest` flow runs hourly on cron and fetches all data sources concurrently, then writes to DuckDB sequentially (same process = no single-writer lock contention). downstream flows (transform, brief, compact) are event-driven via deployment triggers — they only run when upstream completes. 8 8 9 9 **github** — fetches notifications (issues + PRs) and open items authored by `zzstoatzz` via the search API. each issue is cached by repo+number for 24h. persists to `raw_github_issues`. 10 10 11 11 **tangled.org** — fetches issues, PRs, and comments from the PDS (`pds.zzstoatzz.io`) via AT Protocol's `com.atproto.repo.listRecords`. no auth needed — records are public. targets repos: zat, zlay, plyr.fm, at-me, pollz, typeahead. persists to `raw_tangled_items`. 12 12 13 + **bluesky likes** — fetches nate's recent likes from the PDS via `app.bsky.feed.like` records, then batch-resolves post content via `app.bsky.feed.getPosts` (25 per call). persists to `raw_likes` and `raw_liked_posts`. 14 + 15 + **phi memory** — reads phi's TurboPuffer namespaces (`phi-users-*`) to snapshot observations and interactions into DuckDB for dbt processing. persists to `raw_phi_observations` and `raw_phi_interactions`. 16 + 13 17 ## pipeline 14 18 15 19 ``` 16 20 github API ──┐ 17 - ├──► ingest ──► raw_github_issues ──┐ 18 - tangled PDS ─┘ (hourly) raw_tangled_items ──┤ 19 - ▼ 20 - transform (dbt) 21 - [on ingest ✓] 22 - │ 23 - ▼ 24 - hub_action_items 25 - (mart, top 200) 26 - │ 27 - ┌──────────────┼──────────────┐ 28 - ▼ ▼ ▼ 29 - brief /api/cards.json +page.svelte 30 - [on transform ✓] (SSR loader) 31 - │ 32 - ▼ 33 - briefing.json ──► /api/briefing.json 21 + ├──► ingest ──► raw_github_issues ──┐ 22 + tangled PDS ─┤ (hourly) raw_tangled_items ─┤ 23 + bluesky PDS ─┤ raw_likes + raw_liked_posts │ 24 + phi (tpuf) ─┘ raw_phi_observations ─┘ 25 + │ 26 + transform (dbt) ◄──────┘ 27 + [on ingest ✓] 28 + │ 29 + ┌───────────────┼───────────────┐ 30 + ▼ ▼ ▼ 31 + brief compact hub UI 32 + [on transform ✓] [on transform ✓] /api/cards.json 33 + │ │ +page.svelte 34 + ▼ ▼ 35 + briefing.json TurboPuffer 36 + /api/briefing (phi-users-*) 37 + ``` 38 + 39 + two additional flows run independently: 40 + 41 + ``` 42 + morning (daily 8am CT) ──► TurboPuffer (tag maintenance) + semble (curation) 43 + rebuild-atlas (every 6h) ──► Cloudflare Pages (leaflet-search atlas) 34 44 ``` 35 45 36 46 ## flows 37 47 38 48 | deployment | trigger | what it does | 39 49 |---|---|---| 40 - | `diagnostics` | cron `*/5 * * * *` | prints system info — canary for worker health | 41 - | `ingest` | cron `0 * * * *` | fetches github notifications + authored items and tangled.org items concurrently, persists both to DuckDB sequentially | 42 - | `transform` | on `ingest` completion | dbt build: staging → scoring → mart. concurrency limit 1. runs under python 3.13 (dbt-core compat) | 50 + | `diagnostics` | cron `*/5 * * * *` (inactive) | prints system info — canary for worker health | 51 + | `ingest` | cron `0 * * * *` | fetches github, tangled.org, bluesky likes, and phi memory concurrently, resolves liked post content, persists all to DuckDB sequentially | 52 + | `transform` | on `ingest` completion | dbt build: staging → enrichment → mart. concurrency limit 1. runs under python 3.13 (dbt-core compat) | 43 53 | `brief` | on `transform` completion | loads top 200 scored items, sends to claude haiku 4.5 via pydantic-ai, writes `briefing.json`. cached by items content hash (skips LLM when data unchanged) | 54 + | `compact` | on `transform` completion | synthesizes per-user relationship summaries from phi's observations + interactions. extracts new observations from liked posts (LLM). writes summaries to TurboPuffer (`phi-users-*`). cached by observations content hash | 55 + | `morning` | cron `0 13 * * *` (8am CT) | tag maintenance (dedup, merge, relationship discovery) + agentic semble curation (promotes observations to public cosmik cards). runs 1h before phi's daily reflection | 56 + | `rebuild-atlas` | cron `0 */6 * * *` | rebuilds the leaflet-search 2D semantic map (UMAP + HDBSCAN on TurboPuffer embeddings), deploys to Cloudflare Pages | 44 57 | `cleanup` | cron `0 2 * * 0` | deletes old terminal flow runs (completed, failed, cancelled, crashed) older than 30 days | 45 58 46 59 all flows run in the `kubernetes-pool` work pool. code is pulled at runtime via `git clone` from tangled.sh (github fallback). deps install via `uv run --with 'my-prefect-server @ git+...'`. deployments are registered by CI on every push to main. ··· 53 66 |---|---|---| 54 67 | `stg_github_issues` | view | dedup `raw_github_issues` by (repo, number), keep most recent fetch | 55 68 | `stg_tangled_items` | view | dedup `raw_tangled_items` by `at_uri`, exclude comments, keep most recent | 69 + | `stg_likes` | view | dedup `raw_likes` by `subject_uri` | 70 + | `stg_liked_posts` | view | dedup `raw_liked_posts` by `subject_uri`, join with like timestamps | 71 + | `stg_phi_observations` | view | dedup `raw_phi_observations` by observation id | 72 + | `stg_phi_interactions` | view | dedup `raw_phi_interactions` by interaction id | 56 73 | `int_github_issues_scored` | table | scoring: recency (30-day decay) x engagement (log scale) x label multiplier (bug=1.5) x contributor weight | 57 74 | `int_tangled_items_scored` | table | scoring: recency (30-day decay) x 0.5 (no engagement data) x contributor weight | 75 + | `int_phi_user_profiles` | table | aggregates per-user observation + interaction counts for compact flow | 58 76 | `hub_action_items` | mart | union of both scored tables, ordered by `importance_score` desc, limit 200 | 59 77 60 78 contributor weights come from the `known_contributors` seed (zzstoatzz + zzstoatzz.io at 2.0x). ··· 70 88 5. writes a structured `Briefing` (headline, 4 themed sections with accent colors, icons, priority) to `briefing.json` 71 89 72 90 briefing model is defined in `packages/mps/src/mps/briefing.py`. 91 + 92 + ## compact 93 + 94 + the `compact` flow fires in parallel with `brief` when `transform` completes. it maintains phi's long-term memory in TurboPuffer. 95 + 96 + **relationship summaries** — for each user phi has interacted with (from `int_phi_user_profiles`), loads their observations and recent interactions from DuckDB, resolves their bluesky profile, and sends to claude haiku to synthesize a dense relationship summary. writes to `phi-users-{handle}` namespaces as `kind="summary"` records. cached by observations content hash — skips the LLM when data unchanged. 97 + 98 + **liked post observations** — loads recently resolved liked posts from DuckDB (`raw_liked_posts`), groups by author, queries TurboPuffer for existing knowledge about each author, searches pub-search for their publications, then extracts 1-3 atomic observations per author via LLM. uses ADD/UPDATE/NOOP reconciliation to avoid duplicating what phi already knows. writes to `phi-users-{author_handle}` namespaces as `kind="observation"` records. 99 + 100 + the result: observations from likes are indistinguishable from observations phi creates during conversations — same schema, same namespaces, same reconciliation logic. 101 + 102 + ## morning 103 + 104 + the `morning` flow runs daily at 8am CT (1h before phi's reflection). it has two halves: 105 + 106 + **tag maintenance (phases 1-3)** — mechanical operations on TurboPuffer: 107 + 1. collect all tags across all `phi-users-*` namespaces 108 + 2. embed tags and identify near-duplicates via LLM (e.g., "atproto" / "at protocol" / "AT Protocol") 109 + 3. apply merges: rewrite tags in TurboPuffer, discover inter-tag relationships, store in `phi-tag-relationships` namespace 110 + 111 + **agentic curation (phase 4)** — assembles phi's recent observations, episodic knowledge, existing cosmik cards, and tag relationships into a context bundle. sends to an LLM that decides what (if anything) deserves promotion to semble as a public cosmik card. executes the plan: creates `network.cosmik.card` records on phi's PDS, which semble's firehose subscriber auto-indexes for semantic search. 112 + 113 + ## atlas 114 + 115 + the `rebuild-atlas` flow runs every 6h. it clones leaflet-search, runs the build-atlas script (PCA → UMAP → HDBSCAN on TurboPuffer document embeddings), produces `atlas.json`, and deploys the static site to Cloudflare Pages via wrangler. 73 116 74 117 ## frontend 75 118

Configure Feed

Configure Feed