perlsky is a Perl 5 implementation of an AT Protocol Personal Data Server.
# Metrics

`perlsky` now exposes Prometheus-style metrics at `/metrics`.

## Security

- If `metrics_token` is configured, the endpoint requires `Authorization: Bearer <token>`.
- If `metrics_token` is omitted, the endpoint is public.
- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer (see the sketch below).

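One way to do the reverse-proxy restriction, as a minimal sketch assuming nginx fronts perlsky on `127.0.0.1:7755` (the allowlisted address is a placeholder):

```nginx
# Hypothetical nginx fragment: only the monitoring host may scrape /metrics.
location /metrics {
    allow 192.0.2.10;   # example Prometheus host; replace with yours
    deny  all;
    proxy_pass http://127.0.0.1:7755;
}
```

Combining this with `metrics_token` gives defense in depth: the proxy limits who can reach the endpoint, and the token limits who can read it.
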
## Main Metrics

- `perlsky_xrpc_requests_total`
  Counts HTTP XRPC requests by method, NSID, endpoint type, and status.
- `perlsky_xrpc_request_duration_seconds`
  Histogram of HTTP XRPC latency with the same labels.
- `perlsky_xrpc_errors_total`
  Counts rendered XRPC failures by method, NSID, endpoint type, status, and error code.
- `perlsky_xrpc_unhandled_exceptions_total`
  Counts true unhandled exceptions on XRPC routes by method, NSID, and endpoint type.
- `perlsky_subscription_connections_total`
  Counts websocket subscription opens by NSID.
- `perlsky_subscription_active`
  Gauge of active websocket subscriptions by NSID.
- `perlsky_subscription_closes_total`
  Counts websocket closes by NSID and close code.
- `perlsky_subscription_frames_total`
  Counts emitted websocket frames by NSID, frame type, and encoding.
- `perlsky_subscription_bytes_total`
  Counts emitted websocket bytes by NSID and encoding.
- `perlsky_subscription_duration_seconds`
  Histogram of websocket lifetime by NSID.
- `perlsky_crawler_requests_total`
  Counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
- `perlsky_crawler_request_duration_seconds`
  Histogram of outbound crawler request latency.
- `perlsky_blob_ingress_bytes_total`
  Counts uploaded blob bytes by MIME type.
- `perlsky_blob_egress_bytes_total`
  Counts downloaded blob bytes by MIME type.
- `perlsky_store_operations_total`
  Counts instrumented SQLite-backed store operations by operation and status.
- `perlsky_store_operation_duration_seconds`
  Histogram of instrumented store operation duration.
- `perlsky_service_proxy_requests_total`
  Counts local and upstream `app.bsky.*` proxy requests by NSID, source, and status.
- `perlsky_service_proxy_request_duration_seconds`
  Histogram of service-proxy request latency with the same labels.
- `perlsky_service_proxy_local_post_index_cache_access_total`
  Counts request-local hits, process-cache hits, and rebuilds for the local post index.
- `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
  Histogram of local post-index rebuild time.
- `perlsky_service_proxy_local_post_index_entries`
  Gauge of local post-index entry counts by kind.
- `perlsky_service_proxy_local_post_resolution_total`
  Counts how local post lookups were resolved: request cache, shared index, store, or non-local bypass.
- `perlsky_service_proxy_profile_record_cache_total`
  Counts local profile record cache hits and misses.
- `perlsky_repo_resolution_total`
  Counts repo/DID resolution paths, including request-cache reuse versus fallback scans.
- `perlsky_build_info`
  Static build/service info gauge.

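A few illustrative PromQL queries over these metrics. Label names such as `status` and `nsid` follow the descriptions above, and the `_bucket` suffix assumes standard Prometheus histogram series; verify both against your live `/metrics` output:

```promql
# Overall XRPC request rate by status over the last 5 minutes.
sum by (status) (rate(perlsky_xrpc_requests_total[5m]))

# p95 XRPC latency per NSID, from the duration histogram.
histogram_quantile(0.95,
  sum by (nsid, le) (rate(perlsky_xrpc_request_duration_seconds_bucket[5m])))

# Active websocket subscriptions per NSID right now.
sum by (nsid) (perlsky_subscription_active)
```
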
## Current Store Coverage

The store metrics currently cover the highest-signal operations on the live path:

- transactions
- event append and event stream reads
- event high-watermark reads
- blob put/get
- label put/list
- record list
- repo CAR export

This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.

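To see whether one of these hot paths is degrading, the per-operation latency histogram is the first stop (the `operation` label name mirrors the description above; treat it as an assumption to verify):

```promql
# p95 latency per instrumented store operation.
histogram_quantile(0.95,
  sum by (operation, le) (rate(perlsky_store_operation_duration_seconds_bucket[5m])))
```
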
## Suggested Alerts

- high error rate on `perlsky_xrpc_requests_total`
- spikes in `perlsky_xrpc_errors_total` for a specific `nsid` or `error`
- any growth in `perlsky_xrpc_unhandled_exceptions_total` (see the rule sketch below)
- sustained increase in `perlsky_xrpc_request_duration_seconds`
- non-zero `perlsky_subscription_active` with no corresponding frame growth
- crawler errors from `perlsky_crawler_requests_total{result="error"}`
- large ingress with low egress (or vice versa) on the blob byte counters
- persistent growth in store latency histograms
- sustained `result="rebuild"` growth in `perlsky_service_proxy_local_post_index_cache_access_total`
- high `p95` in `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
- unexpected growth in `source="list_scan"` for `perlsky_repo_resolution_total`

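Two of these expressed as a Prometheus alerting-rule sketch; the thresholds and `for` durations are placeholders to tune against your traffic, not recommendations:

```yaml
groups:
  - name: perlsky
    rules:
      # Any unhandled exception deserves attention.
      - alert: PerlskyUnhandledExceptions
        expr: increase(perlsky_xrpc_unhandled_exceptions_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "perlsky reported unhandled XRPC exceptions in the last 5m"

      # Outbound requestCrawl calls failing for a sustained period.
      - alert: PerlskyCrawlerErrors
        expr: rate(perlsky_crawler_requests_total{result="error"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "perlsky requestCrawl calls are failing"
```
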
## Prometheus

The repo includes a checked-in example scrape job at [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml).

On the live VPS we scrape every `15s` rather than more aggressively, to avoid adding load while Prometheus is already remote-writing to Grafana Cloud.

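If you are not using the checked-in file, a minimal scrape job might look like the following sketch; the token and target are placeholders, and [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml) remains the authoritative example:

```yaml
scrape_configs:
  - job_name: perlsky
    scrape_interval: 15s            # matches the live-VPS cadence above
    authorization:
      type: Bearer
      credentials: YOUR_TOKEN       # only needed when metrics_token is set
    static_configs:
      - targets: ["127.0.0.1:7755"]
```
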
## Grafana

The repo includes:

- [ops/grafana/perlsky-dashboard.json](../ops/grafana/perlsky-dashboard.json): overview dashboard for XRPC, service-proxy, store, subscription, and blob metrics
- [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml): example provisioned Prometheus data source
- [ops/grafana/perlsky-dashboard-provider.yml](../ops/grafana/perlsky-dashboard-provider.yml): example dashboard provider that watches a dashboard directory

The dashboard expects a Prometheus data source. When provisioning, either keep the checked-in `uid` from the example data source or update the dashboard's `${DS_PROMETHEUS}` mapping during import.

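For orientation, a provisioned Prometheus data source in Grafana generally looks like this sketch; the `uid` and `url` here are examples only, and the checked-in [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml) is the file to actually use:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus      # must match the dashboard's ${DS_PROMETHEUS} mapping
    access: proxy
    url: http://127.0.0.1:9090
```
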
## Sentry

Prometheus is still the main place to watch rates and latency. If you also configure `sentry_dsn`, perlsky will report unhandled XRPC exceptions to Sentry with request metadata and Perl stack frames. That works well as a complement to:

- `perlsky_xrpc_errors_total` for handled request failures
- `perlsky_xrpc_unhandled_exceptions_total` for internal 500-class failures

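When triaging a Sentry event, the matching metrics view is the unhandled-exception rate for the affected NSID:

```promql
# Spikes here should line up with Sentry event volume.
sum by (nsid) (rate(perlsky_xrpc_unhandled_exceptions_total[5m]))
```
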
## Example Scrape

```sh
curl -H 'Authorization: Bearer YOUR_TOKEN' \
  http://127.0.0.1:7755/metrics
```
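
If `metrics_token` is not configured, the endpoint is public and the same request works without the header:

```sh
curl http://127.0.0.1:7755/metrics
```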