# Product Requirements Document (PRD)

This document outlines the requirements for the Skywatch Capture application. It serves as a reference for developers, designers, and stakeholders to ensure that the product meets the needs of its users.

`labels.uri` is the URI against which the label is applied. It can take two forms: a reference to a post in the form of an at-uri, such as `at://did:plc:7i7s4avtaolnrgc3ubcoqrq3/app.bsky.feed.post/3lf5u32pxwk2f`, or a reference to a user in the form of a DID, such as `did:plc:piwuaowuiykzaare644i5fre`.

`labels.val` is the label value being emitted.
`labels.neg` is a boolean indicating whether this label is a negation label, overwriting a previous label.
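
The two forms of `labels.uri` can be told apart with a simple check before hydration. The following is a minimal sketch, not the application's actual code; `LabelSubject` and `classifyLabelUri` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: decide whether a label targets a post or an account.
type LabelSubject =
  | { kind: "post"; uri: string; did: string; rkey: string }
  | { kind: "profile"; did: string }
  | { kind: "other"; uri: string };

function classifyLabelUri(uri: string): LabelSubject {
  // A bare DID means the label applies to the account itself.
  if (uri.startsWith("did:")) {
    return { kind: "profile", did: uri };
  }
  // An at-uri pointing at an app.bsky.feed.post record means a post label.
  const match = /^at:\/\/(did:[^/]+)\/app\.bsky\.feed\.post\/([^/]+)$/.exec(uri);
  if (match) {
    return { kind: "post", uri, did: match[1], rkey: match[2] };
  }
  // Anything else (for example another record collection) is left unclassified.
  return { kind: "other", uri };
}
```

Returning an explicit `other` case keeps unexpected URIs visible to the caller instead of silently treating them as posts.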

## Core Use Case

The primary purpose of this application is to subscribe to a Bluesky labeler's firehose, capture all emitted label events, hydrate the associated data (posts and user profiles), and store this comprehensive dataset in a local database. This data is intended for future use in training machine learning classifiers for content moderation.

## Functional Requirements

- **Firehose Subscription:** Connect to and process a DAG-CBOR encoded firehose from a specified Bluesky labeler service.
- **Data Hydration:** For each label received, fetch the full context of the labeled content.
  - **Post Hydration:** If the label URI is an `at-uri` (post), fetch the full `app.bsky.feed.post` record and store the following fields: `did`, `text`, `facets`, `embeds`, `langs`, `tags`, `createdAt`, and reply status.
  - **Profile Hydration:** If the label URI is a `did` (user), fetch the full `app.bsky.actor.profile` record and store the `displayName` and `description`. Additionally, resolve and store the user's `handle`.
- **Image & Blob Handling:**
  - An option (`HYDRATE_BLOBS`) must be provided to control whether to download image/video blobs. This is a safety feature for users labeling sensitive content.
  - In all cases, both a **SHA-256 (cryptographic) hash** and a **perceptual hash (pHash)** of any referenced image blobs must be captured to ensure compatibility with various moderation toolkits.
  - If `HYDRATE_BLOBS` is true, the application must support storing the downloaded blobs either on the local filesystem or in an AWS S3 bucket, configurable via environment variables.
- **Data Storage:**
  - All captured and hydrated data should be stored in a DuckDB database file.
  - The database schema should be structured to link labels to their hydrated content.
- **Filtering:** The user must be able to optionally provide a comma-separated list of labels to capture (`CAPTURE_LABELS`). If provided, any label not in this list will be ignored.
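
The filtering requirement above could be sketched as follows. This is an illustrative sketch only; `parseCaptureLabels` and `shouldCapture` are hypothetical names, and treating an empty or unset `CAPTURE_LABELS` as "capture everything" is an assumption based on the variable being optional.

```typescript
// Parse the raw CAPTURE_LABELS value into an allow-list.
// Returns null when no filter is configured (assumed to mean "capture all").
function parseCaptureLabels(raw: string | undefined): Set<string> | null {
  const values = (raw ?? "")
    .split(",")
    .map((value) => value.trim())
    .filter((value) => value.length > 0);
  return values.length > 0 ? new Set(values) : null;
}

// Decide whether a label value should be captured under the allow-list.
function shouldCapture(val: string, allowList: Set<string> | null): boolean {
  return allowList === null || allowList.has(val);
}
```

For example, `shouldCapture("spam", parseCaptureLabels("spam,hate-speech"))` is `true`, while a label value outside the list is ignored.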

## Technical Requirements

- **Language/Runtime:** Use TypeScript with Bun.
- **Containerization:** The application must be containerized using Docker. The DuckDB database file must be stored on a volume outside the container to ensure data persistence. A `docker-compose.yml` file should be provided to manage services.
- **Key Libraries:**
  - `@atcute/cbor` and `@atcute/car` for parsing the firehose.
  - `@atproto/api` for all Bluesky API interactions.
  - `pino` and `pino-pretty` for logging.
  - `dotenv` for environment variable management.
- **Portability:** The application should be designed to be portable and easily configurable for use by other moderation services or researchers.
- **Rate Limits:** Hydration requests must stay within Bluesky API rate limits, for example by throttling requests and retrying with backoff when the API responds with HTTP 429.
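
One common way to stay within rate limits is exponential backoff around each hydration call. The sketch below is an assumption about how this could be done, not a stated design decision; `backoffDelayMs`, `withBackoff`, the default delays, and the caller-supplied `isRateLimited` predicate are all hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run `task`, retrying with exponential backoff while the caller-supplied
// predicate reports that the failure was a rate-limit response.
async function withBackoff<T>(
  task: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-rate-limit errors or once the attempt budget is spent.
      if (!isRateLimited(err) || attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

How a rate-limit error is recognized depends on the HTTP client in use, which is why the check is passed in rather than hard-coded.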

## Configuration

The application will be configured via a `.env` file with the following variables:

```env
# Bluesky Credentials
BSKY_HANDLE=your-bluesky-handle.bsky.social
BSKY_PASSWORD=your-app-password

# Bluesky PDS and Labeler URL
PDS=bsky.social
WSS_URL=wss://your-labeler-service.com/xrpc/com.atproto.label.subscribeLabels

# Blob & Image Handling
HYDRATE_BLOBS=false # Set to true to download images/videos
BLOB_STORAGE_TYPE=local # 'local' or 's3'
BLOB_STORAGE_PATH=./data/blobs # Path for local storage

# S3 Configuration (only required if BLOB_STORAGE_TYPE is 's3')
S3_BUCKET=your-s3-bucket-name
S3_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key

# Database
DB_PATH=./data/skywatch.duckdb

# Filtering (Optional)
# Comma-separated list of labels to capture, e.g., "spam,hate-speech"
CAPTURE_LABELS=

# Logging
LOG_LEVEL=info
```
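
The variables above could be validated once at startup so that misconfiguration fails early. The following is a minimal sketch under stated assumptions: `AppConfig` and `loadConfig` are hypothetical names, only a subset of the variables is covered, and the fallback values simply mirror the sample file rather than any documented defaults.

```typescript
// Hypothetical typed view of a subset of the .env variables.
interface AppConfig {
  wssUrl: string;
  hydrateBlobs: boolean;
  blobStorageType: "local" | "s3";
  dbPath: string;
  captureLabels: string[];
}

// Validate an environment map (for example process.env) into an AppConfig.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const wssUrl = env.WSS_URL;
  if (!wssUrl) throw new Error("WSS_URL is required");

  const blobStorageType = env.BLOB_STORAGE_TYPE ?? "local";
  if (blobStorageType !== "local" && blobStorageType !== "s3") {
    throw new Error("BLOB_STORAGE_TYPE must be 'local' or 's3'");
  }
  // S3 settings are only required when S3 storage is selected.
  if (blobStorageType === "s3" && !env.S3_BUCKET) {
    throw new Error("S3_BUCKET is required when BLOB_STORAGE_TYPE is 's3'");
  }

  return {
    wssUrl,
    hydrateBlobs: (env.HYDRATE_BLOBS ?? "false").trim().toLowerCase() === "true",
    blobStorageType,
    dbPath: env.DB_PATH ?? "./data/skywatch.duckdb",
    captureLabels: (env.CAPTURE_LABELS ?? "")
      .split(",")
      .map((label) => label.trim())
      .filter((label) => label.length > 0),
  };
}
```

Taking the environment as a parameter, rather than reading a global directly, keeps the function easy to test; at runtime it would be called with the process environment after the `.env` file has been loaded.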

## Data Schema

The database will contain the following tables:

#### `labels`
Stores the raw label event data.
- `id` (INTEGER, Primary Key, Auto-incrementing)
- `uri` (TEXT) - The `at-uri` or `did` of the labeled content.
- `cid` (TEXT) - The CID of the specific record version.
- `val` (TEXT) - The label value (e.g., "spam").
- `neg` (BOOLEAN) - If the label is a negation.
- `cts` (DATETIME) - Timestamp of label creation.
- `exp` (DATETIME, nullable) - Expiration timestamp of the label.
- `src` (TEXT) - The DID of the labeler.

#### `posts`
Stores hydrated data for labeled posts. Linked to `labels.uri`.
- `uri` (TEXT, Primary Key)
- `did` (TEXT) - Author of the post.
- `text` (TEXT)
- `facets` (JSON)
- `embeds` (JSON)
- `langs` (JSON)
- `tags` (JSON)
- `createdAt` (DATETIME)
- `is_reply` (BOOLEAN)

#### `profiles`
Stores hydrated data for labeled user accounts. Linked to `labels.uri`.
- `did` (TEXT, Primary Key)
- `handle` (TEXT)
- `displayName` (TEXT)
- `description` (TEXT)

#### `blobs`
Stores information about image blobs found in posts.
- `post_uri` (TEXT) - Foreign key to `posts.uri`.
- `blob_cid` (TEXT) - CID of the blob.
- `sha256` (TEXT) - Cryptographic hash for exact file matching.
- `phash` (TEXT) - Perceptual hash for finding visually similar images.
- `storage_path` (TEXT, nullable) - Local or S3 path if downloaded.
- `mimetype` (TEXT)
- PRIMARY KEY (`post_uri`, `blob_cid`)
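
The `sha256` column can be filled from the raw blob bytes with the runtime's built-in crypto module. This is a minimal sketch; `sha256Hex` is a hypothetical helper name, and storing the digest as lowercase hex is an assumption. The perceptual hash is not shown because it requires decoding the image with a separate library.

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 digest of a blob's bytes as a lowercase hex string.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}
```

Because the hash is computed over the exact bytes, two blobs match only when the files are byte-for-byte identical, which is the case the perceptual hash is meant to complement.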

## Lexicons
The following Bluesky lexicons are necessary for this tool:

### `com.atproto.label.subscribeLabels`
Skywatch emits a DAG-CBOR encoded firehose of moderation decisions at `wss://ozone.skywatch.blue/xrpc/com.atproto.label.subscribeLabels`.
Each label carried by the stream follows the `label` definition below (an excerpt from `com.atproto.label.defs`):

```json
"label": {
  "type": "object",
  "description": "Metadata tag on an atproto resource (eg, repo or record).",
  "required": ["src", "uri", "val", "cts"],
  "properties": {
    "ver": {
      "type": "integer",
      "description": "The AT Protocol version of the label object."
    },
    "src": {
      "type": "string",
      "format": "did",
      "description": "DID of the actor who created this label."
    },
    "uri": {
      "type": "string",
      "format": "uri",
      "description": "AT URI of the record, repository (account), or other resource that this label applies to."
    },
    "cid": {
      "type": "string",
      "format": "cid",
      "description": "Optionally, CID specifying the specific version of 'uri' resource this label applies to."
    },
    "val": {
      "type": "string",
      "maxLength": 128,
      "description": "The short string name of the value or type of this label."
    },
    "neg": {
      "type": "boolean",
      "description": "If true, this is a negation label, overwriting a previous label."
    },
    "cts": {
      "type": "string",
      "format": "datetime",
      "description": "Timestamp when this label was created."
    },
    "exp": {
      "type": "string",
      "format": "datetime",
      "description": "Timestamp at which this label expires (no longer applies)."
    },
    "sig": {
      "type": "bytes",
      "description": "Signature of dag-cbor encoded label."
    }
  }
},
```
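
The schema above maps naturally onto a TypeScript type with a runtime guard for the required fields. The sketch below is illustrative: `Label`, `isLabel`, and `applyLabel` are hypothetical names, `sig` is omitted for brevity, and keying active labels by `src`, `uri`, and `val` is an assumption about how a consumer might interpret negation.

```typescript
// Fields from the label definition; only src, uri, val, and cts are required.
interface Label {
  ver?: number;
  src: string;
  uri: string;
  cid?: string;
  val: string;
  neg?: boolean;
  cts: string;
  exp?: string;
}

// Runtime check that a decoded value has the required label fields.
function isLabel(value: unknown): value is Label {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.src === "string" &&
    typeof v.uri === "string" &&
    typeof v.val === "string" &&
    typeof v.cts === "string" &&
    (v.neg === undefined || typeof v.neg === "boolean")
  );
}

// Track currently active labels: a negation removes the earlier label,
// any other label adds or replaces it.
function applyLabel(active: Map<string, Label>, label: Label): void {
  const key = `${label.src}|${label.uri}|${label.val}`;
  if (label.neg === true) {
    active.delete(key);
  } else {
    active.set(key, label);
  }
}
```

The raw events, including negations, would still be stored as-is in the `labels` table; a view like `applyLabel` is only needed when the current label state matters.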

### `app.bsky.feed.post`
Posts are structured as follows:

```json
{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "description": "Record containing a Bluesky post.",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300,
            "description": "The primary post content. May be an empty string, if there are embeds."
          },
          "entities": {
            "type": "array",
            "description": "DEPRECATED: replaced by app.bsky.richtext.facet.",
            "items": { "type": "ref", "ref": "#entity" }
          },
          "facets": {
            "type": "array",
            "description": "Annotations of text (mentions, URLs, hashtags, etc)",
            "items": { "type": "ref", "ref": "app.bsky.richtext.facet" }
          },
          "reply": { "type": "ref", "ref": "#replyRef" },
          "embed": {
            "type": "union",
            "refs": [
              "app.bsky.embed.images",
              "app.bsky.embed.video",
              "app.bsky.embed.external",
              "app.bsky.embed.record",
              "app.bsky.embed.recordWithMedia"
            ]
          },
          "langs": {
            "type": "array",
            "description": "Indicates human language of post primary text content.",
            "maxLength": 3,
            "items": { "type": "string", "format": "language" }
          },
          "labels": {
            "type": "union",
            "description": "Self-label values for this post. Effectively content warnings.",
            "refs": ["com.atproto.label.defs#selfLabels"]
          },
          "tags": {
            "type": "array",
            "description": "Additional hashtags, in addition to any included in post text and facets.",
            "maxLength": 8,
            "items": { "type": "string", "maxLength": 640, "maxGraphemes": 64 }
          },
          "createdAt": {
            "type": "string",
            "format": "datetime",
            "description": "Client-declared timestamp when this post was originally created."
          }
        }
      }
    },
    "replyRef": {
      "type": "object",
      "required": ["root", "parent"],
      "properties": {
        "root": { "type": "ref", "ref": "com.atproto.repo.strongRef" },
        "parent": { "type": "ref", "ref": "com.atproto.repo.strongRef" }
      }
    },
    "entity": {
      "type": "object",
      "description": "Deprecated: use facets instead.",
      "required": ["index", "type", "value"],
      "properties": {
        "index": { "type": "ref", "ref": "#textSlice" },
        "type": {
          "type": "string",
          "description": "Expected values are 'mention' and 'link'."
        },
        "value": { "type": "string" }
      }
    },
    "textSlice": {
      "type": "object",
      "description": "Deprecated. Use app.bsky.richtext instead -- A text segment. Start is inclusive, end is exclusive. Indices are for utf16-encoded strings.",
      "required": ["start", "end"],
      "properties": {
        "start": { "type": "integer", "minimum": 0 },
        "end": { "type": "integer", "minimum": 0 }
      }
    }
  }
}
```
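
A fetched post record can be flattened into a row for the `posts` table described in the Data Schema section. This is a sketch under assumptions: `PostRecord`, `PostRow`, and `toPostRow` are hypothetical names, the JSON columns are serialized with `JSON.stringify`, and `is_reply` is derived from the presence of the `reply` field.

```typescript
// Subset of the app.bsky.feed.post record that the posts table stores.
interface PostRecord {
  text: string;
  createdAt: string;
  facets?: unknown[];
  embed?: unknown;
  langs?: string[];
  tags?: string[];
  reply?: { root: unknown; parent: unknown };
}

// Row shape matching the posts table; JSON columns are held as strings.
interface PostRow {
  uri: string;
  did: string;
  text: string;
  facets: string | null;
  embeds: string | null;
  langs: string | null;
  tags: string | null;
  createdAt: string;
  is_reply: boolean;
}

// Serialize an optional value for a JSON column, using null when absent.
function toJsonColumn(value: unknown): string | null {
  return value === undefined ? null : JSON.stringify(value);
}

function toPostRow(uri: string, did: string, record: PostRecord): PostRow {
  return {
    uri,
    did,
    text: record.text,
    facets: toJsonColumn(record.facets),
    embeds: toJsonColumn(record.embed), // the lexicon field is singular: `embed`
    langs: toJsonColumn(record.langs),
    tags: toJsonColumn(record.tags),
    createdAt: record.createdAt,
    is_reply: record.reply !== undefined,
  };
}
```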

For posts, the `app.bsky.embed.images` lexicon is of particular interest. The blob reference can be used to retrieve the image from the PDS, after which the image can be hashed or saved to storage.

```json
{
  "lexicon": 1,
  "id": "app.bsky.embed.images",
  "description": "A set of images embedded in a Bluesky record (eg, a post).",
  "defs": {
    "main": {
      "type": "object",
      "required": ["images"],
      "properties": {
        "images": {
          "type": "array",
          "items": { "type": "ref", "ref": "#image" },
          "maxLength": 4
        }
      }
    },
    "image": {
      "type": "object",
      "required": ["image", "alt"],
      "properties": {
        "image": {
          "type": "blob",
          "accept": ["image/*"],
          "maxSize": 1000000
        },
        "alt": {
          "type": "string",
          "description": "Alt text description of the image, for accessibility."
        },
        "aspectRatio": {
          "type": "ref",
          "ref": "app.bsky.embed.defs#aspectRatio"
        }
      }
    },
    "view": {
      "type": "object",
      "required": ["images"],
      "properties": {
        "images": {
          "type": "array",
          "items": { "type": "ref", "ref": "#viewImage" },
          "maxLength": 4
        }
      }
    },
    "viewImage": {
      "type": "object",
      "required": ["thumb", "fullsize", "alt"],
      "properties": {
        "thumb": {
          "type": "string",
          "format": "uri",
          "description": "Fully-qualified URL where a thumbnail of the image can be fetched. For example, CDN location provided by the App View."
        },
        "fullsize": {
          "type": "string",
          "format": "uri",
          "description": "Fully-qualified URL where a large version of the image can be fetched. May or may not be the exact original blob. For example, CDN location provided by the App View."
        },
        "alt": {
          "type": "string",
          "description": "Alt text description of the image, for accessibility."
        },
        "aspectRatio": {
          "type": "ref",
          "ref": "app.bsky.embed.defs#aspectRatio"
        }
      }
    }
  }
}
```
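
Collecting the image blob references from a post's `embed` and turning them into download URLs might look like the sketch below. It assumes the record is in its JSON form, where a blob appears as `{ "$type": "blob", "ref": { "$link": "<cid>" }, "mimeType": "...", "size": ... }`; client libraries may expose the reference as a CID object instead. `imageBlobs` and `getBlobUrl` are hypothetical helpers, and `pdsHost` is assumed to be the host of the author's PDS, resolved elsewhere.

```typescript
// JSON representation of a blob reference inside a record.
interface BlobRef {
  $type: "blob";
  ref: { $link: string };
  mimeType: string;
  size: number;
}

// Collect image blobs from an embed, including images nested in recordWithMedia.
function imageBlobs(embed: unknown): BlobRef[] {
  if (typeof embed !== "object" || embed === null) return [];
  const e = embed as { $type?: string; images?: { image: BlobRef }[]; media?: unknown };
  if (e.$type === "app.bsky.embed.images") {
    return (e.images ?? []).map((entry) => entry.image);
  }
  if (e.$type === "app.bsky.embed.recordWithMedia") {
    return imageBlobs(e.media);
  }
  return [];
}

// Build a com.atproto.sync.getBlob URL for a blob owned by `did`.
function getBlobUrl(pdsHost: string, did: string, cid: string): string {
  const params = new URLSearchParams({ did, cid });
  return `https://${pdsHost}/xrpc/com.atproto.sync.getBlob?${params.toString()}`;
}
```

Each returned `BlobRef` supplies the `blob_cid` and `mimetype` values for the `blobs` table, and the URL is only fetched when hashing or `HYDRATE_BLOBS` requires the bytes.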

### `app.bsky.actor.profile`

```json
{
  "lexicon": 1,
  "id": "app.bsky.actor.profile",
  "defs": {
    "main": {
      "type": "record",
      "description": "A declaration of a Bluesky account profile.",
      "key": "literal:self",
      "record": {
        "type": "object",
        "properties": {
          "displayName": {
            "type": "string",
            "maxGraphemes": 64,
            "maxLength": 640
          },
          "description": {
            "type": "string",
            "description": "Free-form profile description text.",
            "maxGraphemes": 256,
            "maxLength": 2560
          },
          "pronouns": {
            "type": "string",
            "description": "Free-form pronouns text.",
            "maxGraphemes": 20,
            "maxLength": 200
          },
          "website": { "type": "string", "format": "uri" },
          "avatar": {
            "type": "blob",
            "description": "Small image to be displayed next to posts from account. AKA, 'profile picture'",
            "accept": ["image/png", "image/jpeg"],
            "maxSize": 1000000
          },
          "banner": {
            "type": "blob",
            "description": "Larger horizontal image to display behind profile view.",
            "accept": ["image/png", "image/jpeg"],
            "maxSize": 1000000
          },
          "labels": {
            "type": "union",
            "description": "Self-label values, specific to the Bluesky application, on the overall account.",
            "refs": ["com.atproto.label.defs#selfLabels"]
          },
          "joinedViaStarterPack": {
            "type": "ref",
            "ref": "com.atproto.repo.strongRef"
          },
          "pinnedPost": {
            "type": "ref",
            "ref": "com.atproto.repo.strongRef"
          },
          "createdAt": { "type": "string", "format": "datetime" }
        }
      }
    }
  }
}
```
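
Because the profile record uses the fixed key `self`, it can be addressed directly once the account's DID is known. The sketch below is illustrative only: `profileRecordUrl` and `toProfileRow` are hypothetical helpers, `pdsHost` is assumed to be the account's PDS host resolved elsewhere, and the handle is assumed to be resolved separately and passed in.

```typescript
// Fields of app.bsky.actor.profile that the profiles table stores.
interface ProfileRecord {
  displayName?: string;
  description?: string;
}

// Row shape matching the profiles table.
interface ProfileRow {
  did: string;
  handle: string | null;
  displayName: string | null;
  description: string | null;
}

// Build a com.atproto.repo.getRecord URL for the account's profile record.
function profileRecordUrl(pdsHost: string, did: string): string {
  const params = new URLSearchParams({
    repo: did,
    collection: "app.bsky.actor.profile",
    rkey: "self", // from the lexicon's "key": "literal:self"
  });
  return `https://${pdsHost}/xrpc/com.atproto.repo.getRecord?${params.toString()}`;
}

// Flatten a profile record into a row; absent optional fields become null.
function toProfileRow(did: string, handle: string | null, record: ProfileRecord): ProfileRow {
  return {
    did,
    handle,
    displayName: record.displayName ?? null,
    description: record.description ?? null,
  };
}
```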