# Metrics

perlsky now exposes Prometheus-style metrics at `/metrics`.
## Security

- If `metrics_token` is configured, the endpoint requires `Authorization: Bearer <token>`.
- If `metrics_token` is omitted, the endpoint is public.
- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer, as sketched below.
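For the proxy-layer restriction, a minimal sketch assuming nginx fronts perlsky on the `127.0.0.1:7755` address used in the example scrape below; the allowed CIDR is illustrative:

```nginx
# Inside the server block that proxies to perlsky:
# serve /metrics only to the internal scrape network.
location /metrics {
    allow 10.0.0.0/8;   # illustrative internal range
    deny  all;
    proxy_pass http://127.0.0.1:7755;
}
```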
## Main Metrics

- `perlsky_xrpc_requests_total`: Counts HTTP XRPC requests by method, NSID, endpoint type, and status.
- `perlsky_xrpc_request_duration_seconds`: Histogram for HTTP XRPC latency with the same labels.
- `perlsky_xrpc_errors_total`: Counts rendered XRPC failures by method, NSID, endpoint type, status, and error code.
- `perlsky_xrpc_unhandled_exceptions_total`: Counts true unhandled exceptions on XRPC routes by method, NSID, and endpoint type.
- `perlsky_subscription_connections_total`: Counts websocket subscription opens by NSID.
- `perlsky_subscription_active`: Gauge of active websocket subscriptions by NSID.
- `perlsky_subscription_closes_total`: Counts websocket closes by NSID and close code.
- `perlsky_subscription_frames_total`: Counts emitted websocket frames by NSID, frame type, and encoding.
- `perlsky_subscription_bytes_total`: Counts emitted websocket bytes by NSID and encoding.
- `perlsky_subscription_duration_seconds`: Histogram of websocket lifetime by NSID.
- `perlsky_crawler_requests_total`: Counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
- `perlsky_crawler_request_duration_seconds`: Histogram of outbound crawler request latency.
- `perlsky_blob_ingress_bytes_total`: Counts uploaded blob bytes by MIME type.
- `perlsky_blob_egress_bytes_total`: Counts downloaded blob bytes by MIME type.
- `perlsky_store_operations_total`: Counts instrumented SQLite-backed store operations by operation and status.
- `perlsky_store_operation_duration_seconds`: Histogram of instrumented store operation duration.
- `perlsky_service_proxy_requests_total`: Counts local and upstream `app.bsky.*` proxy requests by NSID, source, and status.
- `perlsky_service_proxy_request_duration_seconds`: Histogram for service-proxy request latency with the same labels.
- `perlsky_service_proxy_local_post_index_cache_access_total`: Counts request-local hits, process-cache hits, and rebuilds for the local post index.
- `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`: Histogram of local post-index rebuild time.
- `perlsky_service_proxy_local_post_index_entries`: Gauge of local post-index entry counts by kind.
- `perlsky_service_proxy_local_post_resolution_total`: Counts how local post lookups were resolved: request cache, shared index, store, or non-local bypass.
- `perlsky_service_proxy_profile_record_cache_total`: Counts local profile record cache hits and misses.
- `perlsky_repo_resolution_total`: Counts repo/DID resolution paths, including request-cache reuse versus fallback scans.
- `perlsky_build_info`: Static build/service info gauge.
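These follow standard Prometheus conventions, so the usual rate and quantile queries apply. A sketch; the exact `nsid` and `status` label names are assumptions based on the label descriptions above:

```promql
# Share of XRPC requests returning 5xx per NSID over 5m
# (assumes `status` holds the HTTP status code).
sum by (nsid) (rate(perlsky_xrpc_requests_total{status=~"5.."}[5m]))
  / sum by (nsid) (rate(perlsky_xrpc_requests_total[5m]))

# p95 XRPC latency per NSID.
histogram_quantile(0.95,
  sum by (le, nsid) (rate(perlsky_xrpc_request_duration_seconds_bucket[5m])))
```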
## Current Store Coverage
The store metrics currently cover the highest-signal operations on the live path:
- transactions
- event append and event stream reads
- event high-watermark reads
- blob put/get
- label put/list
- record list
- repo CAR export
This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.
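Given that coverage, per-operation store latency is one query away; a sketch, assuming the standard `_bucket` histogram series and the `operation` label described above:

```promql
# p95 store operation latency by operation over 5m.
histogram_quantile(0.95,
  sum by (le, operation) (rate(perlsky_store_operation_duration_seconds_bucket[5m])))
```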
## Suggested Alerts

- high error rate on `perlsky_xrpc_requests_total` (see the rule sketch after this list)
- spikes in `perlsky_xrpc_errors_total` for a specific `nsid` or `error`
- any growth in `perlsky_xrpc_unhandled_exceptions_total`
- sustained increase in `perlsky_xrpc_request_duration_seconds`
- non-zero `perlsky_subscription_active` with no corresponding frame growth
- crawler errors from `perlsky_crawler_requests_total{result="error"}`
- large ingress with low egress or vice versa on blob byte counters
- persistent growth in store latency histograms
- sustained `result="rebuild"` growth in `perlsky_service_proxy_local_post_index_cache_access_total`
- high `p95` in `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
- unexpected growth in `source="list_scan"` for `perlsky_repo_resolution_total`
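As a starting point, the first and third suggestions could be written as Prometheus alerting rules. A minimal sketch; the thresholds, durations, and `status` label are assumptions:

```yaml
groups:
  - name: perlsky
    rules:
      - alert: PerlskyXrpcHighErrorRate
        # Illustrative threshold: more than 5% of XRPC requests failing for 10m.
        expr: |
          sum(rate(perlsky_xrpc_requests_total{status=~"5.."}[5m]))
            / sum(rate(perlsky_xrpc_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
      - alert: PerlskyUnhandledExceptions
        # Any unhandled exception on an XRPC route deserves a look.
        expr: increase(perlsky_xrpc_unhandled_exceptions_total[10m]) > 0
        labels:
          severity: warn
```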
## Prometheus

The repo includes a checked-in example scrape job at `ops/prometheus/perlsky.yml`.

On the live VPS we scrape every 15s rather than more aggressively, to avoid adding pressure while Prometheus is already remote-writing to Grafana Cloud.
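The checked-in file is authoritative; for orientation, a scrape job along those lines might look like this (the `authorization` block applies only when `metrics_token` is set):

```yaml
scrape_configs:
  - job_name: perlsky
    scrape_interval: 15s
    metrics_path: /metrics
    # Only needed when metrics_token is configured.
    authorization:
      credentials: YOUR_TOKEN
    static_configs:
      - targets: ["127.0.0.1:7755"]
```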
## Grafana

The repo includes:

- `ops/grafana/perlsky-dashboard.json`: overview dashboard for XRPC, service-proxy, store, subscription, and blob metrics
- `ops/grafana/prometheus-datasource.yml`: example provisioned Prometheus data source
- `ops/grafana/perlsky-dashboard-provider.yml`: example dashboard provider that watches a dashboard directory

The dashboard expects a Prometheus data source. When provisioning, either keep the checked-in `uid` from the example data source or update the dashboard's `${DS_PROMETHEUS}` mapping during import.
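For reference, a Grafana-provisioned data source pins the `uid` the dashboard binds to. A sketch with illustrative values; the checked-in `ops/grafana/prometheus-datasource.yml` may differ:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # The dashboard's ${DS_PROMETHEUS} mapping must resolve to this uid.
    uid: prometheus
    url: http://127.0.0.1:9090
    access: proxy
    isDefault: true
```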
## Sentry

Prometheus is still the main place to watch rates and latency. If you also configure `sentry_dsn`, perlsky will report unhandled XRPC exceptions to Sentry with request metadata and Perl stack frames. That works well as a complement to:

- `perlsky_xrpc_errors_total` for handled request failures
- `perlsky_xrpc_unhandled_exceptions_total` for internal 500-class failures
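The config file format isn't shown in this doc; assuming a YAML-style config, the two observability keys named above would sit side by side, purely as an illustration:

```yaml
# Illustrative only: the key names come from this doc, the file layout is assumed.
metrics_token: YOUR_TOKEN
sentry_dsn: https://examplePublicKey@o0.ingest.sentry.io/0
```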
## Example Scrape

```sh
curl -H 'Authorization: Bearer YOUR_TOKEN' \
  http://127.0.0.1:7755/metrics
```