A tool for tailing a labeler's firehose, rehydrating the labeled content, and storing records for future analysis of moderation decisions.

docs: Add comprehensive README

Comprehensive documentation covering:
- Feature overview and architecture
- Quick start with Docker
- Complete configuration reference
- Database schema documentation
- Development guidelines
- Safety features and warnings

Includes examples for common tasks and monitoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

+272
README.md
# Skywatch Tail

A high-performance label capture and content hydration service for Bluesky moderation systems. Subscribe to a labeler's firehose, capture label events, and automatically hydrate associated content for machine learning training.

## Features

- **Real-time Label Capture**: Subscribe to any Bluesky labeler's firehose via WebSocket
- **Automatic Content Hydration**: Fetch full post records and user profiles for labeled content
- **Intelligent Filtering**: Optionally filter labels by type to capture only what you need
- **Cursor Persistence**: Resume from where you left off after restarts
- **Automatic Reconnection**: Exponential backoff reconnection (1s-30s) for stability
- **DuckDB Storage**: Embedded analytics database optimized for ML pipelines
- **Docker Ready**: Containerized deployment with volume persistence
- **Type-Safe**: Full TypeScript implementation with Zod validation

## Architecture

```
Firehose → Label Event → Filter → Store Label → Hydration Queue
                                       ↓               ↓
                                    DuckDB ← [Post/Profile Fetch]
```

### Components

- **Firehose Subscriber**: WebSocket client with DAG-CBOR decoding
- **Label Filter**: Configurable allow-list for label types
- **Hydration Services**: Automatic post and profile data fetching
- **Hydration Queue**: Async queue with deduplication
- **Repository Layer**: Clean database abstraction for all entities

## Quick Start

### Prerequisites

- Docker and Docker Compose
- Bluesky account with app password
- Access to a labeler firehose endpoint

### Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd skywatch-tail
```

2. Copy the example environment file:
```bash
cp .env.example .env
```

3. Configure your `.env` file:
```env
# Bluesky Credentials
BSKY_HANDLE=your-handle.bsky.social
BSKY_PASSWORD=your-app-password

# Labeler Configuration
WSS_URL=wss://your-labeler.com/xrpc/com.atproto.label.subscribeLabels

# Optional: Filter specific labels
CAPTURE_LABELS=spam,hate-speech,csam

# Logging
LOG_LEVEL=info
```

4. Start with Docker Compose:
```bash
docker-compose up -d
```

### Local Development

Install dependencies with Bun:
```bash
bun install
```

Run in development mode:
```bash
bun run dev
```

Run tests:
```bash
bun test
```

## Configuration

All configuration is managed via environment variables:

### Required

- `BSKY_HANDLE`: Your Bluesky handle
- `BSKY_PASSWORD`: App password (not your main password)
- `WSS_URL`: Labeler firehose WebSocket URL

### Optional

- `PDS`: Bluesky PDS host (default: `bsky.social`)
- `CAPTURE_LABELS`: Comma-separated list of label values to capture
- `DB_PATH`: Path to DuckDB database file (default: `./data/skywatch.duckdb`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `HYDRATE_BLOBS`: Enable blob download (default: `false`) - Phase 4
- `BLOB_STORAGE_TYPE`: Storage backend for blobs (`local` or `s3`) - Phase 4
- `BLOB_STORAGE_PATH`: Local path for blob storage - Phase 4

### S3 Configuration (Phase 4, Optional)

- `S3_BUCKET`: S3 bucket name
- `S3_REGION`: AWS region
- `AWS_ACCESS_KEY_ID`: AWS access key
- `AWS_SECRET_ACCESS_KEY`: AWS secret key

## Database Schema

### Labels Table
Stores raw label events from the firehose.

- `id`: Auto-incrementing primary key
- `uri`: AT-URI or DID of labeled content
- `cid`: Content identifier (optional)
- `val`: Label value (e.g., "spam")
- `neg`: Negation flag
- `cts`: Created timestamp
- `exp`: Expiration timestamp (optional)
- `src`: Labeler DID

### Posts Table
Hydrated post data for labeled content.

- `uri`: AT-URI (primary key)
- `did`: Author DID
- `text`: Post content
- `facets`: Rich text annotations (JSON)
- `embeds`: Embedded content (JSON)
- `langs`: Language codes (JSON)
- `tags`: Hashtags (JSON)
- `created_at`: Post creation timestamp
- `is_reply`: Reply flag

### Profiles Table
Hydrated user profile data.

- `did`: User DID (primary key)
- `handle`: User handle
- `display_name`: Display name
- `description`: Bio/description

### Blobs Table (Phase 4)
Image and video blob metadata.

- `post_uri`: Associated post URI
- `blob_cid`: Blob content identifier
- `sha256`: Cryptographic hash
- `phash`: Perceptual hash
- `storage_path`: Local or S3 path (if downloaded)
- `mimetype`: Content type

## Label Filtering

Filter labels by providing a comma-separated list in `CAPTURE_LABELS`:

```env
CAPTURE_LABELS=spam,hate-speech,csam,scam
```

If not set, all labels are captured.

## Data Persistence

### Cursor Persistence
The application saves its position in the firehose to `data/cursor.txt`. On restart, it resumes from this cursor, preventing duplicate processing.

### Database Persistence
The DuckDB database is stored in the `data/` directory, which is mounted as a Docker volume. Your data persists across container restarts.
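
The cursor persistence described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `loadCursor`, `saveCursor`, and `withCursor` are hypothetical helper names, and the sketch assumes the cursor file holds a single decimal integer that is passed back to the labeler through the `cursor` query parameter of `com.atproto.label.subscribeLabels`.

```typescript
import { existsSync, mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical helper: read the last saved cursor.
// Returns undefined when the file is missing or does not hold a plain integer.
function loadCursor(path: string): number | undefined {
  if (!existsSync(path)) return undefined;
  const raw = readFileSync(path, "utf8").trim();
  if (!/^\d+$/.test(raw)) return undefined;
  const value = Number(raw);
  return Number.isSafeInteger(value) ? value : undefined;
}

// Hypothetical helper: persist the cursor.
// Writes to a temporary file first, then renames it over the target so a
// crash mid-write does not leave a truncated cursor file behind.
function saveCursor(path: string, cursor: number): void {
  mkdirSync(dirname(path), { recursive: true });
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, String(cursor), "utf8");
  renameSync(tmp, path);
}

// Hypothetical helper: add the saved cursor to the subscription URL.
// With no cursor, the URL is returned unchanged.
function withCursor(wssUrl: string, cursor: number | undefined): string {
  if (cursor === undefined) return wssUrl;
  const url = new URL(wssUrl);
  url.searchParams.set("cursor", String(cursor));
  return url.toString();
}
```

In this sketch, startup would call `withCursor(WSS_URL, loadCursor("data/cursor.txt"))` before opening the WebSocket, and `saveCursor` would run after each label event has been stored. How often the real service flushes its cursor is not stated here, so treat the save frequency as a design choice.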

## Monitoring

Logs are output in structured JSON format (production) or pretty-printed (development).

Key log events:
- `Firehose connected`: Successfully connected to labeler
- `Firehose disconnected`: Connection lost, will auto-reconnect
- `Received label`: Label captured and stored
- `Post hydrated successfully`: Post data fetched
- `Profile hydrated successfully`: Profile data fetched

## Development

### Project Structure

```
skywatch-tail/
├── src/
│   ├── config/          # Environment validation
│   ├── database/        # Schema and repositories
│   ├── firehose/        # WebSocket subscriber
│   ├── hydration/       # Content hydration services
│   ├── logger/          # Pino logger setup
│   └── index.ts         # Main entry point
├── tests/
│   ├── integration/     # Database integration tests
│   └── unit/            # Unit tests
├── data/                # Database and cursor storage
├── docker-compose.yml
├── Dockerfile
└── .env.example
```

### Running Tests

```bash
# All tests
bun test

# Specific test file
bun test tests/unit/decoder.test.ts

# Watch mode
bun test --watch
```

### Database Access

Access the DuckDB database directly:

```bash
# Open an interactive DuckDB shell
duckdb data/skywatch.duckdb

# Query labels (run one SQL statement and exit)
duckdb data/skywatch.duckdb -c "SELECT val, COUNT(*) FROM labels GROUP BY val;"

# Query recent posts
duckdb data/skywatch.duckdb -c "SELECT uri, text FROM posts ORDER BY created_at DESC LIMIT 10;"
```

## Roadmap

- [x] Phase 1: Core infrastructure (Docker, config, database, logging)
- [x] Phase 2: Firehose connection and label capture
- [x] Phase 3: Content hydration (posts and profiles)
- [ ] Phase 4: Blob processing (image/video hashing and storage)
- [ ] Phase 5: Rate limiting and optimization
- [ ] Phase 6: Comprehensive testing
- [ ] Phase 7: Documentation

## Safety Features

### Blob Hydration
By default, `HYDRATE_BLOBS` is `false`. This prevents accidental download of potentially harmful content (CSAM, graphic violence, etc.) while still capturing cryptographic and perceptual hashes for ML training.

Only enable blob download if:
1. You understand the legal and safety implications
2. You have proper content storage policies in place
3. You're operating in a jurisdiction where possessing such content for moderation is legal

## License

See LICENSE file for details.

## Contributing

Contributions welcome. Please ensure all tests pass before submitting PRs.

## Support

For issues and feature requests, please use the GitHub issue tracker.