 - Run `PERLDS_RUN_REFERENCE_DIFF=1 prove -lv t/reference-differential.t` to exercise the same harness from the test suite.
 - Run `PERLDS_RUN_REFERENCE_DIFF=1 prove -lv t/reference-differential-plc.t` to run the PLC-specific reference comparison from the test suite.

+Metrics and observability:
+
+- `perlds` now exposes Prometheus-compatible metrics at `/metrics`.
+- Set `metrics_token` to require `Authorization: Bearer <token>` for scrapes.
+- The main runtime signals cover XRPC request counts and latency, websocket subscriptions and emitted frames, crawler notifications, blob ingress and egress bytes, and timings for the highest-signal store operations.
+- Detailed operator documentation lives in `docs/METRICS.md`.
+
 Relay / crawler discovery:

 - Configure `hostname` to the public host name you want relays to crawl, for example `pds.example.com`. This should be the host, not the full URL.
+72
docs/METRICS.md
+# Metrics
+
+`perlds` now exposes Prometheus-style metrics at `/metrics`.
+
+## Security
+
+- If `metrics_token` is configured, the endpoint requires `Authorization: Bearer <token>`.
+- If `metrics_token` is omitted, the endpoint is public.
+- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer.
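The access rule described in the Security list can be sketched as a small predicate. This is an illustrative Python sketch only, not the `perlds` implementation: the function name `is_scrape_authorized` is hypothetical, and treating an empty token as "omitted" and using a constant-time comparison are assumptions rather than behaviour confirmed by this change.

```python
import hmac
from typing import Optional


def is_scrape_authorized(metrics_token: Optional[str],
                         authorization_header: Optional[str]) -> bool:
    """Decide whether a request to /metrics should be served.

    Hypothetical helper: mirrors the documented rule, not the real code.
    """
    # No token configured (assumed to include the empty string):
    # the endpoint is public.
    if not metrics_token:
        return True
    # A token is configured, so a header is mandatory.
    if authorization_header is None:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    # HTTP authentication scheme names are case-insensitive.
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison (an assumption about good practice here)
    # avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.strip().encode(),
                               metrics_token.encode())
```

With no token configured every request passes; with a token configured only a matching `Authorization: Bearer <token>` header does.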
+
+## Main Metrics
+
+- `perlds_xrpc_requests_total`
+ Counts HTTP XRPC requests by method, NSID, endpoint type, and status.
+- `perlds_xrpc_request_duration_seconds`
+ Histogram for HTTP XRPC latency with the same labels.
+- `perlds_subscription_connections_total`
+ Counts websocket subscription opens by NSID.
+- `perlds_subscription_active`
+ Gauge of active websocket subscriptions by NSID.
+- `perlds_subscription_closes_total`
+ Counts websocket closes by NSID and close code.
+- `perlds_subscription_frames_total`
+ Counts emitted websocket frames by NSID, frame type, and encoding.
+- `perlds_subscription_bytes_total`
+ Counts emitted websocket bytes by NSID and encoding.
+- `perlds_subscription_duration_seconds`
+ Histogram of websocket lifetime by NSID.
+- `perlds_crawler_requests_total`
+ Counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
+- `perlds_crawler_request_duration_seconds`
+ Histogram of outbound crawler request latency.
+- `perlds_blob_ingress_bytes_total`
+ Counts uploaded blob bytes by MIME type.
+- `perlds_blob_egress_bytes_total`
+ Counts downloaded blob bytes by MIME type.
+- `perlds_store_operations_total`
+ Counts instrumented SQLite-backed store operations by operation and status.
+- `perlds_store_operation_duration_seconds`
+ Histogram of instrumented store operation duration.
+- `perlds_build_info`
+ Static build/service info gauge.
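To make the metric list above concrete, here is a rough Python sketch that reads a scrape body in the Prometheus text format and totals one counter by a label. The sample body is invented for illustration, and the label names (`method`, `nsid`, `status`) are assumptions: the document names the label dimensions but not their exact keys. The parser is deliberately simplified and ignores timestamps and escape sequences inside label values.

```python
import re
from collections import defaultdict

# Invented sample; label keys and values are assumptions, not real output.
SAMPLE = """\
# TYPE perlds_xrpc_requests_total counter
perlds_xrpc_requests_total{method="GET",nsid="com.atproto.sync.getRepo",status="200"} 40
perlds_xrpc_requests_total{method="GET",nsid="com.atproto.sync.getRepo",status="500"} 2
perlds_xrpc_requests_total{method="POST",nsid="com.atproto.repo.createRecord",status="200"} 8
perlds_subscription_active{nsid="com.atproto.sync.subscribeRepos"} 3
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def sum_by_label(samples, metric, label):
    """Total one metric's samples, grouped by a single label's value."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


by_status = sum_by_label(parse_samples(SAMPLE), "perlds_xrpc_requests_total", "status")
```

For anything beyond a quick check, a maintained Prometheus client library is likely a better choice than a hand-rolled parser like this one.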
+
+## Current Store Coverage
+
+The store metrics currently cover the highest-signal operations on the live path:
+
+- transactions
+- event append and event stream reads
+- event high-watermark reads
+- blob put/get
+- label put/list
+- record list
+- repo CAR export
+
+This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.
+
+## Suggested Alerts
+
+- high error rate on `perlds_xrpc_requests_total`, broken down by status
+- sustained increase in `perlds_xrpc_request_duration_seconds`
+- non-zero `perlds_subscription_active` with no corresponding growth in `perlds_subscription_frames_total`
+- crawler errors from `perlds_crawler_requests_total{result="error"}`
+- sustained imbalance between `perlds_blob_ingress_bytes_total` and `perlds_blob_egress_bytes_total`
+- persistent growth in `perlds_store_operation_duration_seconds`
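Two of the suggested alerts can be expressed as checks over a pair of consecutive scrapes. The Python sketch below is one possible reading, under stated assumptions: counters are compared as deltas between scrapes, a counter that goes backwards is treated as a process restart, and the status label is assumed to hold HTTP status codes, with `5xx` counted as errors. All function names are hypothetical, and in practice these conditions would more likely live in Prometheus alerting rules.

```python
def counter_delta(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two scrapes."""
    # Assumption: a lower value means the process restarted and the
    # counter began again from zero, so the new value is the increase.
    return current - previous if current >= previous else current


def xrpc_error_ratio(prev_by_status: dict, curr_by_status: dict) -> float:
    """Share of requests in the interval whose status starts with '5'.

    Both arguments map a status string to a cumulative request count.
    """
    total = 0.0
    errors = 0.0
    for status, value in curr_by_status.items():
        delta = counter_delta(prev_by_status.get(status, 0.0), value)
        total += delta
        if status.startswith("5"):
            errors += delta
    return errors / total if total else 0.0


def subscriptions_stalled(active: float, frames_prev: float, frames_curr: float) -> bool:
    """True when subscriptions are open but no frames were emitted."""
    return active > 0 and counter_delta(frames_prev, frames_curr) == 0


ratio = xrpc_error_ratio({"200": 100, "500": 1}, {"200": 190, "500": 11})
```

In this example the interval saw 100 requests, 10 of them errors, so `ratio` works out to 0.1. The threshold that should trigger an alert is a deployment decision that this document does not specify.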
+
+## Example Scrape
+
+```sh
+curl -H 'Authorization: Bearer YOUR_TOKEN' \
+ http://127.0.0.1:7755/metrics
+```
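For scripted checks, the same scrape can be sketched with only the Python standard library. The helper names `build_scrape_request` and `scrape` are hypothetical; the URL and the `YOUR_TOKEN` placeholder are taken from the `curl` example above.

```python
import urllib.request
from typing import Optional


def build_scrape_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for /metrics, adding the bearer token if given."""
    request = urllib.request.Request(url, method="GET")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def scrape(url: str, token: Optional[str] = None, timeout: float = 5.0) -> str:
    """Fetch the metrics body as text; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(build_scrape_request(url, token), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `scrape("http://127.0.0.1:7755/metrics", "YOUR_TOKEN")` against a running instance should return the same text that the `curl` command prints.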