perlsky is a Perl 5 implementation of an AT Protocol Personal Data Server.
# Metrics

`perlsky` now exposes Prometheus-style metrics at `/metrics`.

## Security

- If `metrics_token` is configured, the endpoint requires `Authorization: Bearer <token>`.
- If `metrics_token` is omitted, the endpoint is public.
- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer (see the sketch below).

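One way to do the reverse-proxy restriction, as a minimal sketch assuming nginx fronts perlsky on `127.0.0.1:7755` (the allowlisted address is a placeholder):

```nginx
# Hypothetical nginx fragment: only the monitoring host may scrape /metrics.
location /metrics {
    allow 192.0.2.10;   # example Prometheus host; replace with yours
    deny  all;
    proxy_pass http://127.0.0.1:7755;
}
```

Combining this with `metrics_token` gives defense in depth: the proxy limits who can reach the endpoint, and the token limits who can read it.
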
## Main Metrics

- `perlsky_xrpc_requests_total`
  Counts HTTP XRPC requests by method, NSID, endpoint type, and status.
- `perlsky_xrpc_request_duration_seconds`
  Histogram of HTTP XRPC latency with the same labels.
- `perlsky_xrpc_errors_total`
  Counts rendered XRPC failures by method, NSID, endpoint type, status, and error code.
- `perlsky_xrpc_unhandled_exceptions_total`
  Counts true unhandled exceptions on XRPC routes by method, NSID, and endpoint type.
- `perlsky_subscription_connections_total`
  Counts websocket subscription opens by NSID.
- `perlsky_subscription_active`
  Gauge of active websocket subscriptions by NSID.
- `perlsky_subscription_closes_total`
  Counts websocket closes by NSID and close code.
- `perlsky_subscription_frames_total`
  Counts emitted websocket frames by NSID, frame type, and encoding.
- `perlsky_subscription_bytes_total`
  Counts emitted websocket bytes by NSID and encoding.
- `perlsky_subscription_duration_seconds`
  Histogram of websocket lifetime by NSID.
- `perlsky_crawler_requests_total`
  Counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
- `perlsky_crawler_request_duration_seconds`
  Histogram of outbound crawler request latency.
- `perlsky_blob_ingress_bytes_total`
  Counts uploaded blob bytes by MIME type.
- `perlsky_blob_egress_bytes_total`
  Counts downloaded blob bytes by MIME type.
- `perlsky_store_operations_total`
  Counts instrumented SQLite-backed store operations by operation and status.
- `perlsky_store_operation_duration_seconds`
  Histogram of instrumented store operation duration.
- `perlsky_service_proxy_requests_total`
  Counts local and upstream `app.bsky.*` proxy requests by NSID, source, and status.
- `perlsky_service_proxy_request_duration_seconds`
  Histogram of service-proxy request latency with the same labels.
- `perlsky_service_proxy_local_post_index_cache_access_total`
  Counts request-local hits, process-cache hits, and rebuilds for the local post index.
- `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
  Histogram of local post-index rebuild time.
- `perlsky_service_proxy_local_post_index_entries`
  Gauge of local post-index entry counts by kind.
- `perlsky_service_proxy_local_post_resolution_total`
  Counts how local post lookups were resolved: request cache, shared index, store, or non-local bypass.
- `perlsky_service_proxy_profile_record_cache_total`
  Counts local profile record cache hits and misses.
- `perlsky_repo_resolution_total`
  Counts repo/DID resolution paths, including request-cache reuse versus fallback scans.
- `perlsky_build_info`
  Static build/service info gauge.

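A few illustrative PromQL queries over these metrics. Label names such as `status` and `nsid` follow the descriptions above, and the `_bucket` suffix assumes standard Prometheus histogram series; verify both against your live `/metrics` output:

```promql
# Overall XRPC request rate by status over the last 5 minutes.
sum by (status) (rate(perlsky_xrpc_requests_total[5m]))

# p95 XRPC latency per NSID, from the duration histogram.
histogram_quantile(0.95,
  sum by (nsid, le) (rate(perlsky_xrpc_request_duration_seconds_bucket[5m])))

# Active websocket subscriptions per NSID right now.
sum by (nsid) (perlsky_subscription_active)
```
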
## Current Store Coverage

The store metrics currently cover the highest-signal operations on the live path:

- transactions
- event append and event stream reads
- event high-watermark reads
- blob put/get
- label put/list
- record list
- repo CAR export

This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.

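To see whether one of these hot paths is degrading, the per-operation latency histogram is the first stop (the `operation` label name mirrors the description above; treat it as an assumption to verify):

```promql
# p95 latency per instrumented store operation.
histogram_quantile(0.95,
  sum by (operation, le) (rate(perlsky_store_operation_duration_seconds_bucket[5m])))
```
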
## Suggested Alerts

- high error rate on `perlsky_xrpc_requests_total`
- spikes in `perlsky_xrpc_errors_total` for a specific `nsid` or `error`
- any growth in `perlsky_xrpc_unhandled_exceptions_total` (see the rule sketch below)
- sustained increase in `perlsky_xrpc_request_duration_seconds`
- non-zero `perlsky_subscription_active` with no corresponding frame growth
- crawler errors from `perlsky_crawler_requests_total{result="error"}`
- large ingress with low egress (or vice versa) on the blob byte counters
- persistent growth in store latency histograms
- sustained `result="rebuild"` growth in `perlsky_service_proxy_local_post_index_cache_access_total`
- high `p95` in `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
- unexpected growth in `source="list_scan"` for `perlsky_repo_resolution_total`

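Two of these expressed as a Prometheus alerting-rule sketch; the thresholds and `for` durations are placeholders to tune against your traffic, not recommendations:

```yaml
groups:
  - name: perlsky
    rules:
      # Any unhandled exception deserves attention.
      - alert: PerlskyUnhandledExceptions
        expr: increase(perlsky_xrpc_unhandled_exceptions_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "perlsky reported unhandled XRPC exceptions in the last 5m"

      # Outbound requestCrawl calls failing for a sustained period.
      - alert: PerlskyCrawlerErrors
        expr: rate(perlsky_crawler_requests_total{result="error"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "perlsky requestCrawl calls are failing"
```
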
## Prometheus

The repo includes a checked-in example scrape job at [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml).

On the live VPS we scrape every `15s` rather than more aggressively, to avoid adding load while Prometheus is already remote-writing to Grafana Cloud.

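If you are not using the checked-in file, a minimal scrape job might look like the following sketch; the token and target are placeholders, and [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml) remains the authoritative example:

```yaml
scrape_configs:
  - job_name: perlsky
    scrape_interval: 15s            # matches the live-VPS cadence above
    authorization:
      type: Bearer
      credentials: YOUR_TOKEN       # only needed when metrics_token is set
    static_configs:
      - targets: ["127.0.0.1:7755"]
```
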
## Grafana

The repo includes:

- [ops/grafana/perlsky-dashboard.json](../ops/grafana/perlsky-dashboard.json): overview dashboard for XRPC, service-proxy, store, subscription, and blob metrics
- [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml): example provisioned Prometheus data source
- [ops/grafana/perlsky-dashboard-provider.yml](../ops/grafana/perlsky-dashboard-provider.yml): example dashboard provider that watches a dashboard directory

The dashboard expects a Prometheus data source. When provisioning, either keep the checked-in `uid` from the example data source or update the dashboard's `${DS_PROMETHEUS}` mapping during import.

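For orientation, a provisioned Prometheus data source in Grafana generally looks like this sketch; the `uid` and `url` here are examples only, and the checked-in [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml) is the file to actually use:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus      # must match the dashboard's ${DS_PROMETHEUS} mapping
    access: proxy
    url: http://127.0.0.1:9090
```
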
## Sentry

Prometheus is still the main place to watch rates and latency. If you also configure `sentry_dsn`, perlsky will report unhandled XRPC exceptions to Sentry with request metadata and Perl stack frames. That works well as a complement to:

- `perlsky_xrpc_errors_total` for handled request failures
- `perlsky_xrpc_unhandled_exceptions_total` for internal 500-class failures

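When triaging a Sentry event, the matching metrics view is the unhandled-exception rate for the affected NSID:

```promql
# Spikes here should line up with Sentry event volume.
sum by (nsid) (rate(perlsky_xrpc_unhandled_exceptions_total[5m]))
```
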
## Example Scrape

```sh
curl -H 'Authorization: Bearer YOUR_TOKEN' \
  http://127.0.0.1:7755/metrics
```
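
If `metrics_token` is not configured, the endpoint is public and the same request works without the header:

```sh
curl http://127.0.0.1:7755/metrics
```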