# Metrics

`perlsky`, a Perl 5 implementation of an AT Protocol Personal Data Server, now exposes Prometheus-style metrics at `/metrics`.

## Security

- If `metrics_token` is configured, the endpoint requires `Authorization: Bearer <token>`.
- If `metrics_token` is omitted, the endpoint is public.
- For internet-facing deployments, prefer setting `metrics_token` and/or restricting `/metrics` at the reverse proxy layer.

## Main Metrics

- `perlsky_xrpc_requests_total`
  Counts HTTP XRPC requests by method, NSID, endpoint type, and status.
- `perlsky_xrpc_request_duration_seconds`
  Histogram for HTTP XRPC latency with the same labels.
- `perlsky_xrpc_errors_total`
  Counts rendered XRPC failures by method, NSID, endpoint type, status, and error code.
- `perlsky_xrpc_unhandled_exceptions_total`
  Counts true unhandled exceptions on XRPC routes by method, NSID, and endpoint type.
- `perlsky_subscription_connections_total`
  Counts websocket subscription opens by NSID.
- `perlsky_subscription_active`
  Gauge of active websocket subscriptions by NSID.
- `perlsky_subscription_closes_total`
  Counts websocket closes by NSID and close code.
- `perlsky_subscription_frames_total`
  Counts emitted websocket frames by NSID, frame type, and encoding.
- `perlsky_subscription_bytes_total`
  Counts emitted websocket bytes by NSID and encoding.
- `perlsky_subscription_duration_seconds`
  Histogram of websocket lifetime by NSID.
- `perlsky_crawler_requests_total`
  Counts outbound `com.atproto.sync.requestCrawl` calls by crawler service and result.
- `perlsky_crawler_request_duration_seconds`
  Histogram of outbound crawler request latency.
- `perlsky_blob_ingress_bytes_total`
  Counts uploaded blob bytes by MIME type.
- `perlsky_blob_egress_bytes_total`
  Counts downloaded blob bytes by MIME type.
- `perlsky_store_operations_total`
  Counts instrumented SQLite-backed store operations by operation and status.
- `perlsky_store_operation_duration_seconds`
  Histogram of instrumented store operation duration.
- `perlsky_service_proxy_requests_total`
  Counts local and upstream `app.bsky.*` proxy requests by NSID, source, and status.
- `perlsky_service_proxy_request_duration_seconds`
  Histogram for service-proxy request latency with the same labels.
- `perlsky_service_proxy_local_post_index_cache_access_total`
  Counts request-local hits, process-cache hits, and rebuilds for the local post index.
- `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
  Histogram of local post-index rebuild time.
- `perlsky_service_proxy_local_post_index_entries`
  Gauge of local post-index entry counts by kind.
- `perlsky_service_proxy_local_post_resolution_total`
  Counts how local post lookups were resolved: request cache, shared index, store, or non-local bypass.
- `perlsky_service_proxy_profile_record_cache_total`
  Counts local profile record cache hits and misses.
- `perlsky_repo_resolution_total`
  Counts repo/DID resolution paths, including request-cache reuse versus fallback scans.
- `perlsky_build_info`
  Static build/service info gauge.

## Current Store Coverage

The store metrics currently cover the highest-signal operations on the live path:

- transactions
- event append and event stream reads
- event high-watermark reads
- blob put/get
- label put/list
- record list
- repo CAR export

This is enough to understand the hot PDS paths under load without trying to wrap every SQLite call in the codebase.
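As a sketch of how these histograms read in PromQL, here is a p95 over the instrumented store operations. It assumes the standard Prometheus client conventions (`_bucket` series with `le` labels) and that the operation dimension is exposed as an `operation` label; check the live `/metrics` output for the exact label names.

```promql
# p95 duration of instrumented store operations, per operation,
# over a 5-minute window. `operation` is an assumed label name
# for the "by operation" dimension described above.
histogram_quantile(
  0.95,
  sum by (le, operation) (
    rate(perlsky_store_operation_duration_seconds_bucket[5m])
  )
)
```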
## Suggested Alerts

- high error rate on `perlsky_xrpc_requests_total`
- spikes in `perlsky_xrpc_errors_total` for a specific `nsid` or `error`
- any growth in `perlsky_xrpc_unhandled_exceptions_total`
- sustained increase in `perlsky_xrpc_request_duration_seconds`
- non-zero `perlsky_subscription_active` with no corresponding frame growth
- crawler errors from `perlsky_crawler_requests_total{result="error"}`
- large ingress with low egress or vice versa on blob byte counters
- persistent growth in store latency histograms
- sustained `result="rebuild"` growth in `perlsky_service_proxy_local_post_index_cache_access_total`
- high `p95` in `perlsky_service_proxy_local_post_index_rebuild_duration_seconds`
- unexpected growth in `source="list_scan"` for `perlsky_repo_resolution_total`

Example expressions for a couple of these are sketched at the end of this document.

## Prometheus

The repo includes a checked-in example scrape job at [ops/prometheus/perlsky.yml](../ops/prometheus/perlsky.yml).

On the live VPS we scrape every `15s` rather than more aggressively, to avoid adding pressure while Prometheus is already remote-writing to Grafana Cloud.

## Grafana

The repo includes:

- [ops/grafana/perlsky-dashboard.json](../ops/grafana/perlsky-dashboard.json): overview dashboard for XRPC, service-proxy, store, subscription, and blob metrics
- [ops/grafana/prometheus-datasource.yml](../ops/grafana/prometheus-datasource.yml): example provisioned Prometheus data source
- [ops/grafana/perlsky-dashboard-provider.yml](../ops/grafana/perlsky-dashboard-provider.yml): example dashboard provider that watches a dashboard directory

The dashboard expects a Prometheus data source. When provisioning, either keep the checked-in `uid` from the example data source or update the dashboard's `${DS_PROMETHEUS}` mapping during import.

## Sentry

Prometheus is still the main place to watch rates and latency. If you also configure `sentry_dsn`, perlsky will report unhandled XRPC exceptions to Sentry with request metadata and Perl stack frames. That works well as a complement to:

- `perlsky_xrpc_errors_total` for handled request failures
- `perlsky_xrpc_unhandled_exceptions_total` for internal 500-class failures

## Example Scrape

```sh
curl -H 'Authorization: Bearer YOUR_TOKEN' \
  http://127.0.0.1:7755/metrics
```
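## Example Alert Expressions

For the suggested alerts above, expressions along these lines are a reasonable starting point. The metric names and the `result` label come from this document; the windows and thresholds are illustrative, not tuned rules.

```promql
# Any unhandled XRPC exceptions in the last 10 minutes.
# These should normally stay at zero.
increase(perlsky_xrpc_unhandled_exceptions_total[10m]) > 0

# Outbound crawler requests currently failing, using the
# `result="error"` label documented above.
rate(perlsky_crawler_requests_total{result="error"}[5m]) > 0
```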