# Metrics

`perlsky` now exposes Prometheus-style metrics at `/metrics`.

## Security

- If `metrics_token` is configured, the endpoint requires a matching `Authorization: Bearer <token>` header.
- If `metrics_token` is omitted, the endpoint is public.
- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer.

## Main Metrics

- `perlsky_xrpc_requests_total`: counts HTTP XRPC requests by method, NSID, endpoint type, and status.
- `perlsky_xrpc_request_duration_seconds`: histogram of HTTP XRPC latency with the same labels.
- `perlsky_xrpc_errors_total`: counts rendered XRPC failures by method, NSID, endpoint type, status, and error code.
- `perlsky_xrpc_unhandled_exceptions_total`: counts true unhandled exceptions on XRPC routes by method, NSID, and endpoint type.
- `perlsky_subscription_connections_total`: counts websocket subscription opens by NSID.
- `perlsky_subscription_active`: gauge of active websocket subscriptions by NSID.
- `perlsky_subscription_closes_total`: counts websocket closes by NSID and close code.
- `perlsky_subscription_frames_total`: counts emitted websocket frames by NSID, frame type, and encoding.
- `perlsky_subscription_bytes_total`: counts emitted websocket bytes by NSID and encoding.
- `perlsky_subscription_duration_seconds`: histogram of websocket lifetime by NSID.
- `perlsky_crawler_requests_total`: counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
- `perlsky_crawler_request_duration_seconds`: histogram of outbound crawler request latency.
- `perlsky_blob_ingress_bytes_total`: counts uploaded blob bytes by MIME type.
- `perlsky_blob_egress_bytes_total`: counts downloaded blob bytes by MIME type.
- `perlsky_store_operations_total`: counts instrumented SQLite-backed store operations by operation and status.
- `perlsky_store_operation_duration_seconds`: histogram of instrumented store operation duration.
- `perlsky_service_proxy_requests_total`: counts local and upstream `app.bsky.*` proxy requests by NSID, source, and status.
- `perlsky_service_proxy_request_duration_seconds`: histogram of service-proxy request latency with the same labels.
- `perlsky_service_proxy_local_post_index_cache_access_total`: counts request-local hits, process-cache hits, and rebuilds for the local post index.
- `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`: histogram of local post-index rebuild time.
- `perlsky_service_proxy_local_post_index_entries`: gauge of local post-index entry counts by kind.
- `perlsky_service_proxy_local_post_resolution_total`: counts how local post lookups were resolved (request cache, shared index, store, or non-local bypass).
- `perlsky_service_proxy_profile_record_cache_total`: counts local profile record cache hits and misses.
- `perlsky_repo_resolution_total`: counts repo/DID resolution paths, including request-cache reuse versus fallback scans.
- `perlsky_build_info`: static build/service info gauge.
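A few PromQL sketches against these series make the labels concrete. These are starting points only: the `nsid`, `result`, and `source` labels appear elsewhere in this page, but the `status` label name and its numeric format are assumptions here, so check the live `/metrics` output for the exact label set.

```promql
# XRPC request rate over 5 minutes, split by NSID.
sum by (nsid) (rate(perlsky_xrpc_requests_total[5m]))

# Error ratio, assuming a `status` label carrying numeric HTTP statuses.
sum(rate(perlsky_xrpc_requests_total{status=~"5.."}[5m]))
  / sum(rate(perlsky_xrpc_requests_total[5m]))

# p95 XRPC latency from the histogram's conventional `_bucket` series.
histogram_quantile(0.95,
  sum by (le) (rate(perlsky_xrpc_request_duration_seconds_bucket[5m])))
```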
## Current Store Coverage

The store metrics currently cover the highest-signal operations on the live path:

- transactions
- event append and event stream reads
- event high-watermark reads
- blob put/get
- label put/list
- record list
- repo CAR export

This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.

## Suggested Alerts

- high error rate on `perlsky_xrpc_requests_total`
- spikes in `perlsky_xrpc_errors_total` for a specific `nsid` or `error`
- any growth in `perlsky_xrpc_unhandled_exceptions_total`
- sustained increase in `perlsky_xrpc_request_duration_seconds`
- non-zero `perlsky_subscription_active` with no corresponding frame growth
- crawler errors from `perlsky_crawler_requests_total{result="error"}`
- large ingress with low egress, or vice versa, on the blob byte counters
- persistent growth in store latency histograms
- sustained `result="rebuild"` growth in `perlsky_service_proxy_local_post_index_cache_access_total`
- high `p95` in `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
- unexpected growth in `source="list_scan"` for `perlsky_repo_resolution_total`

A few of these are sketched as concrete rules under Example Alert Rules at the end of this page.

## Prometheus

The repo includes a checked-in example scrape job at [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml). On the live VPS we scrape every `15s` rather than more aggressively, to avoid adding pressure while Prometheus is already remote-writing to Grafana Cloud.

## Grafana

The repo includes:

- [ops/grafana/perlsky-dashboard.json](../ops/grafana/perlsky-dashboard.json): overview dashboard for XRPC, service-proxy, store, subscription, and blob metrics
- [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml): example provisioned Prometheus data source
- [ops/grafana/perlsky-dashboard-provider.yml](../ops/grafana/perlsky-dashboard-provider.yml): example dashboard provider that watches a dashboard directory

The dashboard expects a Prometheus data source. When provisioning, either keep the checked-in `uid` from the example data source or update the dashboard's `${DS_PROMETHEUS}` mapping during import.

## Sentry

Prometheus remains the main place to watch rates and latency. If you also configure `sentry_dsn`, perlsky reports unhandled XRPC exceptions to Sentry with request metadata and Perl stack frames. That works well as a complement to:

- `perlsky_xrpc_errors_total` for handled request failures
- `perlsky_xrpc_unhandled_exceptions_total` for internal 500-class failures

## Example Scrape

```sh
curl -H 'Authorization: Bearer YOUR_TOKEN' \
  http://127.0.0.1:7755/metrics
```
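For Prometheus itself, the checked-in job in [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml) is the reference. As a sketch only, a minimal job of the same shape looks like this; the target address is taken from the curl example above and the bearer token matches the scheme described under Security:

```yaml
scrape_configs:
  - job_name: perlsky
    # Matches the 15s interval used on the live VPS.
    scrape_interval: 15s
    metrics_path: /metrics
    # Only needed when metrics_token is configured on the server.
    authorization:
      type: Bearer
      credentials: YOUR_TOKEN
    static_configs:
      - targets: ["127.0.0.1:7755"]
```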
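## Example Alert Rules

A sketch of two of the suggested alerts as Prometheus alerting rules. The thresholds, durations, and severity labels are placeholders to tune, the `service` label name on the crawler metric is an assumption, and the repo does not ship this file.

```yaml
groups:
  - name: perlsky
    rules:
      # Any growth in unhandled exceptions is alertable on its own.
      - alert: PerlskyUnhandledExceptions
        expr: increase(perlsky_xrpc_unhandled_exceptions_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Unhandled XRPC exceptions for {{ $labels.nsid }}"

      # Sustained outbound requestCrawl failures.
      - alert: PerlskyCrawlerErrors
        expr: rate(perlsky_crawler_requests_total{result="error"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "requestCrawl errors (service {{ $labels.service }})"
```

Rules files can be validated with `promtool check rules` before loading them.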