Policy Engine Design Research: Declarative, Deterministic, Transport-Agnostic Replication Policy for AT Protocol#

Executive Summary#

This document maps the design space for a policy engine that governs AT Protocol account replication in P2PDS. The engine must be declarative (policies are data, not code), deterministic (same inputs yield same conclusions everywhere), transport-agnostic (outcomes matter, not mechanisms), and account-centric (DIDs are the primitive, not raw blocks).

The research covers policy language options, set reconciliation primitives, compliance verification, group agreement, failure handling, resource accounting, prior art, and architecture sketches. It concludes with a recommended starting point: a minimal JSON-based policy document published as an atproto record, evaluated locally by each node, using the existing RASL verification layer for compliance checking.


1. Policy Language / Representation#

The Design Space#

The core question: how do you express rules like "node X must hold all blocks for DID Y" or "99.9% uptime required" as evaluable data?

Option A: JSON/CBOR Constraint Documents#

Description: Policies are structured JSON or CBOR objects with a fixed schema, interpreted by a purpose-built evaluator. The simplest possible approach.

{
  "$type": "org.p2pds.policy",
  "version": 1,
  "type": "mutual-aid",
  "members": [
    { "did": "did:plc:alice", "node": "did:plc:alice-node" },
    { "did": "did:plc:bob", "node": "did:plc:bob-node" },
    { "did": "did:plc:carol", "node": "did:plc:carol-node" }
  ],
  "rules": {
    "replication": {
      "strategy": "full",
      "minCopies": 2,
      "subjects": ["did:plc:alice", "did:plc:bob", "did:plc:carol"]
    },
    "verification": {
      "interval": "30m",
      "method": "rasl-sampling",
      "sampleSize": 50
    },
    "sync": {
      "maxLag": "5m"
    }
  }
}

Pros:

  • Trivially serializable as atproto records (JSON maps to atproto lexicon records, CBOR is native to atproto repos)
  • No new runtime dependency -- evaluated by application code
  • Version-controlled, diff-friendly, human-readable
  • Deterministic by construction: a fixed schema with enumerated strategies leaves no room for nondeterminism
  • Smallest possible implementation surface

Cons:

  • Limited expressiveness -- every new rule type requires schema changes and new evaluator code
  • No composition: cannot combine policies from different sources without custom merge logic
  • No formal semantics -- the evaluator is the semantics, which risks divergence across implementations

Fit assessment: Best starting point. Covers mutual aid and basic SLA scenarios. Expressiveness limits are a feature at this stage -- they prevent over-engineering.

Option B: Datalog-Style Logic Programs#

Description: Policies expressed as Datalog rules over facts derived from system state. Datalog is a restricted subset of Prolog that always terminates and has well-defined semantics (least fixed-point).

% Facts (derived from system state)
member(alice). member(bob). member(carol).
home_node(alice, alice_node). home_node(bob, bob_node). home_node(carol, carol_node).
node_holds(alice_node, alice). node_holds(alice_node, bob).
node_holds(bob_node, alice). node_holds(bob_node, carol).

% Policy rule: a DID is compliant when at least two nodes
% other than its home node hold its blocks
compliant(DID) :-
  member(DID), home_node(DID, Home),
  #count { N : node_holds(N, DID), N != Home } >= 2.

violation(DID) :- member(DID), not compliant(DID).

Pros:

  • Formally well-defined: evaluation always terminates, results are deterministic given the same facts
  • Composable: rules from different sources can be combined (union of rule sets)
  • Well-studied: decades of database and logic programming research
  • Can express recursive relationships (e.g., transitive trust chains)

Cons:

  • Requires embedding a Datalog engine (e.g., a JS implementation or WASM module)
  • Serializing Datalog programs as atproto records is awkward (text blob or custom AST format)
  • Higher barrier for policy authors -- most people cannot read Datalog
  • Numerical constraints (latency < 500ms, uptime > 99.9%) require extensions to standard Datalog (aggregates, arithmetic)

Fit assessment: Excellent formal properties, but premature for v1. Good target for v2 if the JSON approach proves too limiting.

Option C: Rego (Open Policy Agent)#

Description: Rego is OPA's purpose-built policy language. It is declarative, supports partial evaluation, and has been widely adopted for Kubernetes, CI/CD, and API gateway policies.

package p2pds.replication

default compliant = false

compliant {
  input.copies >= data.policy.minCopies
  input.sync_lag_seconds <= data.policy.maxLagSeconds
  input.verification.passed == true
}

Pros:

  • Designed specifically for policy evaluation
  • Deterministic: given the same input document and data, evaluation produces the same result
  • Rich ecosystem: tooling, testing frameworks, playground
  • Handles complex nested data well (JSON-native)
  • Can be compiled to WASM for portable, sandboxed execution

Cons:

  • Heavy dependency: OPA is a Go binary; WASM builds add ~5MB
  • Rego is its own language with a learning curve
  • Not atproto-native: policies would need to be stored as opaque text blobs in records
  • OPA's bundle distribution model assumes a central control plane, which conflicts with P2P architecture

Fit assessment: OPA is well-proven but brings significant complexity. The WASM compilation path is interesting for ensuring all nodes run the exact same evaluator. Consider for v2/v3 if policy complexity demands it.

Option D: CUE Language#

Description: CUE is a constraint-based configuration language where types, values, and constraints exist on a single continuum. Combining CUE values is associative, commutative, and idempotent -- ideal for merging policies from multiple sources.

policy: {
  version: 1
  type: "mutual-aid"
  replication: {
    minCopies: >=2 & <=5
    strategy: "full" | "partial"
    subjects: [...string]
  }
  sync: maxLag: <=300 // seconds
}

Pros:

  • Merge is mathematically well-defined: a & b always produces the same result regardless of order
  • Constraints are first-class: >=2 & <=5 is a type, not a runtime check
  • Deterministic by construction (lattice-based evaluation)
  • Good fit for "multiple parties each add constraints" scenario

Cons:

  • Go-only implementation; no mature JS/TS runtime
  • Not widely adopted outside Google-adjacent projects
  • Serialization as atproto records would require a custom mapping
  • The lattice model can produce confusing error messages when constraints conflict

Fit assessment: Theoretically elegant for multi-party policy composition. Impractical today due to runtime availability. Worth watching.

Option E: CEL (Common Expression Language)#

Description: CEL is Google's expression language used in Kubernetes ValidatingAdmissionPolicy. It evaluates deterministically, is designed to be embedded, and has a well-defined type system.

// Is this node compliant?
node.copies >= policy.minCopies &&
  node.syncLag <= policy.maxLag &&
  node.verification.lastPassed

Pros:

  • Designed for embedding in other systems
  • Deterministic, terminating (no loops, no side effects)
  • Small, well-defined: no surprise behaviors
  • JS/TS implementations exist (cel-js)
  • Good fit for "evaluate a boolean expression against structured input"

Cons:

  • Expression language only -- no rule composition or inference
  • Less expressive than Datalog or Rego for complex policies
  • Still requires a runtime dependency

Fit assessment: Good middle ground between raw JSON and full policy language. Could be used to express compliance predicates within a JSON policy document: the policy defines thresholds, and a CEL expression defines how to combine them.
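
For illustration, a sketch of how such an embedded predicate might look in a v2 policy document. The compliancePredicate field and the node.* input names are hypothetical, not part of any existing schema:

// Hypothetical v2 policy fragment: thresholds live in plain JSON fields, and a CEL
// expression string (evaluated by an embedded runtime such as cel-js) combines them.
const policyWithCel = {
  $type: "org.p2pds.policy",
  version: 2,
  rules: {
    replication: { minCopies: 2 },
    sync: { maxLagSeconds: 300 },
  },
  // Hypothetical field: a CEL predicate evaluated against per-node observations.
  compliancePredicate:
    "node.copies >= policy.rules.replication.minCopies && " +
    "node.syncLagSeconds <= policy.rules.sync.maxLagSeconds && " +
    "node.verification.lastPassed",
};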

Option F: AWS IAM / Zanzibar-Style Relationship Model#

Description: Instead of general-purpose rules, model policy as relationships between entities (nodes, DIDs, groups) with permissions derived from relationship traversal.

{
  "definition": {
    "group": {
      "relations": { "member": { "this": {} } },
      "permissions": {
        "replicate": { "computedUserset": { "relation": "member" } }
      }
    }
  }
}

Pros:

  • Proven at scale (Google Zanzibar handles trillions of relationships)
  • Natural fit for "who can do what to whose data"
  • Efficient: relationship checks scale with graph depth, which is typically only one to three hops

Cons:

  • Designed for authorization ("can X do Y to Z?"), not obligation ("X must do Y for Z")
  • Doesn't naturally express SLA metrics, latency bounds, or storage quotas
  • Would need significant adaptation for replication policy

Fit assessment: Wrong abstraction for the core problem. Replication policy is about obligations and compliance, not permissions. However, relationship-based modeling could complement the policy engine for "who is allowed to join this group" type decisions.

Recommendation#

Start with Option A (JSON constraint documents) stored as atproto records. The schema is the policy language. Keep the door open for Option E (CEL expressions) as inline predicates within JSON documents for v2. Option B (Datalog) is the long-term target if policy complexity grows significantly.

The key insight: the policy language does not need to be Turing-complete, and it should not be. Determinism and termination are easier to guarantee when expressiveness is bounded. A fixed JSON schema with enumerated rule types is a deliberately limited language, and that limitation is a strength.


2. Set Reconciliation as Policy Primitive#

The Core Insight#

The fundamental assertion in replication policy is set membership: "node X holds all blocks for DID Y." This is a set containment claim:

blocks(node_X, did_Y) ⊇ blocks(source_pds, did_Y)

If we can efficiently verify this set relationship, we can verify policy compliance. Set reconciliation protocols exist precisely for this purpose.

Existing Set Reconciliation Approaches#

Negentropy (Range-Based Set Reconciliation)#

How it works: Both parties sort their items by (timestamp, ID). The initiator partitions its set into ranges and computes a fingerprint (incremental XOR hash) for each range. The responder compares fingerprints and recursively splits ranges where fingerprints differ until individual missing items are identified.

Communication complexity: O(d * log(n)) where d is the number of differences and n is the total set size. When sets are nearly identical (common case for synced repos), this is very efficient.

Relevance to P2PDS: Negentropy was designed for Nostr relay synchronization -- a remarkably similar problem (syncing sets of content-addressed events between nodes). The key difference: Nostr events have natural timestamps, while atproto blocks in an MST don't. However, atproto commits have revisions (TIDs), which provide a total ordering.

Integration approach: Rather than reconciling raw block CIDs, reconcile at the commit level. Each node maintains a set of (rev, commit_cid) pairs for each DID. Range-based reconciliation identifies which commits are missing, and then the node fetches the missing commit's blocks.
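
A minimal sketch of that framing, assuming each node can list its (rev, commitCid) pairs per DID. The XOR-of-hashes fingerprint below is an illustrative stand-in for Negentropy's incremental range fingerprints, not the actual wire protocol:

import { createHash } from "node:crypto";

interface CommitEntry {
  rev: string;        // TID revision -- the total ordering range-based reconciliation needs
  commitCid: string;  // commit CID at that revision
}

// Illustrative range fingerprint: XOR of SHA-256 hashes of each (rev, cid) pair.
// Order-independent, so both sides compute the same value for the same set of commits.
function rangeFingerprint(entries: CommitEntry[]): Buffer {
  const acc = Buffer.alloc(32);
  for (const e of entries) {
    const h = createHash("sha256").update(`${e.rev}:${e.commitCid}`).digest();
    for (let i = 0; i < h.length; i++) acc[i] ^= h[i];
  }
  return acc;
}

// If fingerprints over a rev range differ, split the range and recurse; once ranges are
// small, exchange the entries directly and fetch blocks only for the missing commits.
function rangesMatch(local: CommitEntry[], remote: CommitEntry[]): boolean {
  return rangeFingerprint(local).equals(rangeFingerprint(remote));
}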

Invertible Bloom Lookup Tables (IBLTs)#

How it works: An IBLT is a probabilistic data structure that supports insertion, deletion, lookup, and -- crucially -- listing of elements. Two parties each construct an IBLT of their set, XOR them together, and the resulting IBLT can be "peeled" to recover the symmetric difference.

Communication complexity: O(d) where d is the number of differences, but requires knowing d in advance to size the table correctly. If d is underestimated, decoding fails.

Relevance to P2PDS: IBLTs are optimal when the difference size is known or bounded. For repos that sync frequently, the number of new commits since last sync is typically small and predictable.

CertainSync / Rateless IBLTs (2024-2025)#

How it works: Extends IBLTs to be "rateless" -- the encoder produces an infinite stream of coded symbols, and the decoder can succeed as soon as enough symbols arrive to cover the actual difference. No need to estimate d in advance.

Communication complexity: Near-optimal -- approaches the information-theoretic lower bound.

Relevance to P2PDS: Eliminates the main drawback of IBLTs (requiring pre-knowledge of difference size). Still very new and implementations are limited.

Minisketch (BCH-Based)#

How it works: Uses BCH error-correcting codes to encode set sketches. Two parties exchange sketches; the difference between sketches can be decoded to recover the symmetric difference.

Communication complexity: O(d) with a constant factor determined by the sketch capacity.

Relevance to P2PDS: Used by Bitcoin's Erlay protocol for transaction relay. Highly efficient for small differences. Limited to elements that can be represented as fixed-size integers (requires hashing CIDs to a fixed size).

Set Algebra for Policy#

Replication policies can be expressed as set algebra:

  • Full replication: blocks(node_A, did_X) = blocks(authoritative_source, did_X) for all members
  • Minimum copies: |{ node : blocks(node, did_X) ⊇ blocks(source, did_X) }| >= minCopies for each subject DID
  • Partial replication: blocks(node_A, did_X) ⊇ selected_collections(did_X) (only certain record collections)

The existing codebase already tracks block CIDs per DID in the replication_blocks table (see src/replication/sync-storage.ts). This is the foundation for set-based verification.

Proving Set Containment Without Full Transfer#

Several approaches for proving A ⊇ B without transferring either full set:

  1. Fingerprint comparison: Compute a fingerprint (hash) of the set. If fingerprints match, sets are equal with high probability. Fast but only proves equality, not containment.

  2. Merkle proof: The atproto MST (Merkle Search Tree) already provides this. The repo root CID is a commitment to the entire record set. If node X has root CID R for did Y, and R matches the authoritative root, X has all the data. This is exactly what the existing Layer 0 verification does.

  3. Bloom filter challenge: The verifier constructs a Bloom filter of their set and sends it. The prover checks membership of random elements. False positives make this a probabilistic proof.

  4. Random sampling (current approach): Randomly select k block CIDs and check if the node can serve them. This is what the existing RASL verification layer does (see src/replication/verification.ts).

Recommendation#

Use commit-level Negentropy for sync, MST root comparison for equality proof, and RASL sampling for ongoing verification. This layered approach matches the existing architecture:

  • Layer 0 (commit root) = MST root CID comparison = set equality proof
  • Layer 1 (RASL sampling) = probabilistic set containment proof
  • Layer 2 (future: Negentropy) = efficient difference computation for sync

The policy engine doesn't need to know how reconciliation happens. It checks the outcome: "does node X's root CID for DID Y match the authoritative root CID?" This is the transport-agnostic principle in action.


3. Compliance Verification#

The Verification Spectrum#

From weakest to strongest:

  1. Self-reporting: Node claims "I have all blocks for DID X." (Trust-based, trivially gameable)
  2. Peer-reporting: Other nodes report on a node's behavior. (Social trust, Sybil-vulnerable)
  3. Spot-check sampling: Random block challenges via RASL. (Existing implementation)
  4. Full set verification: Compare complete block sets via reconciliation. (Expensive but definitive)
  5. Cryptographic proof-of-storage: Node proves it holds data without transferring it. (Filecoin-style, heavyweight)

What P2PDS Already Has#

The existing codebase implements a layered verification system (see src/replication/verification.ts):

  • Layer 0 (Commit Root): Fetch the root CID via RASL endpoint. A 200 with correct bytes proves the remote serves the current repo head. This is a set equality check (if roots match, MSTs match, records match).
  • Layer 1 (RASL Sampling): Fetch random blocks via HTTP, compare with local copy. Content-addressed retrieval is unforgeable: correct bytes = peer has the data.
  • Layer 2 (Bitswap): Stub -- future IPFS network verification.
  • Layer 3 (MST Path Proof): Stub -- future sync.getRecord + CAR verification.

This is already a solid foundation. The policy engine needs to:

  1. Define what verification is required (which layers, how often, what sample sizes)
  2. Record verification results
  3. Evaluate compliance based on results over time

Challenge-Response Protocols#

For stronger guarantees without the full Filecoin treatment:

Lightweight Proof-of-Data-Possession (PDP): Filecoin recently introduced PDP for "hot storage" verification. It uses standard SHA-256 hashing -- no specialized hardware required. For a 1 GiB dataset with 256 KiB chunks, each proof requires only 12 hashes (~384 bytes) and 12 hash computations.

Adaptation for P2PDS: The atproto MST already provides Merkle proofs. To challenge a node:

  1. Pick a random record path (e.g., app.bsky.feed.post/3k2abc...)
  2. Request the MST path proof via com.atproto.sync.getRecord
  3. Verify the proof against the known root CID

This is essentially Layer 3 of the existing verification architecture. The MST is the Merkle tree, and atproto already defines the proof protocol. No new cryptography needed.
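
A hedged sketch of what a Layer 3 challenge could look like. It assumes the target node exposes the standard XRPC endpoint; the proof-walking step is left as a comment because it would reuse the codebase's existing CAR/MST utilities:

// Hypothetical Layer 3 challenge: ask a node for an MST path proof for one record.
// com.atproto.sync.getRecord returns a CAR containing the record block plus the MST
// nodes linking it to the commit root.
async function challengeRecordProof(
  nodeBaseUrl: string,
  did: string,
  collection: string,
  rkey: string,
): Promise<Uint8Array> {
  const params = new URLSearchParams({ did, collection, rkey });
  const res = await fetch(`${nodeBaseUrl}/xrpc/com.atproto.sync.getRecord?${params}`);
  if (!res.ok) throw new Error(`challenge failed: HTTP ${res.status}`);
  return new Uint8Array(await res.arrayBuffer());
  // Not shown: parse the returned CAR and walk the MST path, checking that each block
  // hashes to the CID its parent references and that the path ends at the known
  // authoritative root CID -- this would reuse the codebase's existing CAR utilities.
}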

Verification Scheduling#

The policy should define:

  • verificationInterval: How often to run verification (e.g., every 30 minutes)
  • verificationSampleSize: How many blocks/records to check per run
  • verificationLayers: Which layers are required (e.g., [0, 1] for basic, [0, 1, 3] for strong)
  • complianceWindow: How many consecutive failures before non-compliance (e.g., 3)
  • gracePeriod: Time after joining before verification starts (e.g., 1 hour)
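
As a sketch, these scheduling parameters might sit alongside the existing verification block in the policy document (field names follow the list above; this is not a settled schema):

// Sketch: how the scheduling parameters above might appear in a policy document.
const verificationRules = {
  verificationInterval: "30m",   // how often to run verification
  verificationSampleSize: 50,    // blocks/records checked per run
  verificationLayers: [0, 1],    // use [0, 1, 3] for the stronger MST-proof variant
  complianceWindow: 3,           // consecutive failures before non-compliance
  gracePeriod: "1h",             // no verification for the first hour after joining
};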

Recommendation#

Extend the existing layered verification to be policy-driven. The verification infrastructure exists. What's missing is:

  1. Policy-defined verification parameters (currently hardcoded in DEFAULT_VERIFICATION_CONFIG)
  2. Compliance history tracking (currently only tracks last verification result)
  3. Compliance evaluation logic (pass/fail over a time window)

4. Group Agreement / Policy Distribution#

The Challenge#

Policy is not useful unless all parties agree on it. In a P2P system, there's no central authority to impose policy. Nodes must:

  1. Discover policies that apply to them
  2. Agree to be bound by a policy
  3. Detect when policy changes
  4. Handle the transition period during policy updates

Policy as Atproto Records#

The most natural approach for P2PDS: publish policies as atproto records in a new lexicon namespace.

Lexicon: org.p2pds.policy
Record: {
  $type: "org.p2pds.policy",
  version: 1,
  name: "mutual-aid-group-1",
  description: "Three-node mutual aid cluster",
  rules: { ... },
  members: [ ... ],
  createdAt: "2025-01-01T00:00:00Z",
  updatedAt: "2025-01-01T00:00:00Z"
}

Policy Lifecycle:

  1. Creation: A node publishes a policy record to its repo
  2. Discovery: Other nodes discover it via record listing (similar to how PeerDiscovery works now with org.p2pds.manifest)
  3. Acknowledgment: Each member publishes an org.p2pds.policy.ack record referencing the policy
  4. Activation: Policy becomes active when all members have acknowledged
  5. Update: Publisher creates new version; members must re-acknowledge
  6. Revocation: Publisher deletes the record or sets status to "revoked"

Lexicon: org.p2pds.policy.ack
Record: {
  $type: "org.p2pds.policy.ack",
  policy: "at://did:plc:alice/org.p2pds.policy/mutual-aid-1",
  policyVersion: 1,
  accepted: true,
  acceptedAt: "2025-01-01T00:01:00Z"
}

Multi-Party Agreement#

Simple approach (sufficient for v1): One node is the policy "author." It publishes the policy. Each other member acknowledges it. The policy is active when all listed members have acknowledged.
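
A sketch of that activation rule, assuming the record shapes outlined above (the author field is an assumption, derived from which repo published the ack rather than stored in the record itself):

// Sketch of the v1 activation rule: a policy is active once every listed member
// has published an ack record for this exact policy version.
interface PolicyAck {
  author: string;         // DID that published the ack (derived from the record's repo)
  policy: string;         // AT-URI of the policy record
  policyVersion: number;
  accepted: boolean;
}

function isPolicyActive(
  policyUri: string,
  policyVersion: number,
  memberDids: string[],
  acks: PolicyAck[],
): boolean {
  return memberDids.every((did) =>
    acks.some(
      (a) =>
        a.author === did &&
        a.policy === policyUri &&
        a.policyVersion === policyVersion &&
        a.accepted,
    ),
  );
}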

More decentralized approach (v2): Policy proposals are published as records. Members vote by publishing ack records. A quorum threshold (e.g., 2/3 of members) activates the policy. This mirrors DAO governance without a blockchain.

Signature-based approach (v3): The policy document includes a field for member signatures (using atproto signing keys). A policy is valid only when it contains valid signatures from all/quorum members. This is a multi-sig scheme, similar to how DAOs use multi-sig wallets.

Policy Versioning and Migration#

When policy changes, nodes need a migration path:

  1. Grace period: New policy published with effectiveAt timestamp in the future. Nodes have until then to comply.
  2. Compatibility window: Both old and new policy are valid during transition. Nodes evaluate compliance against the more lenient of the two.
  3. Breaking changes: If new policy adds members or increases minCopies, the grace period must be long enough for sync to complete.

Discovery Mechanisms#

How does a node learn about policies that involve it?

  1. Direct configuration: Node operator configures policy URIs in .env or config file (simplest, similar to current REPLICATE_DIDS)
  2. Record scanning: Node periodically checks other nodes' repos for org.p2pds.policy records that mention its DID
  3. PubSub notification: Policy changes announced via IPFS gossipsub topics
  4. DID document service endpoint: Add a #p2pds_policy service entry to DID documents pointing to the policy

Recommendation#

Start with direct configuration + policy-as-atproto-records. The operator configures a list of policy AT-URIs. The node fetches and evaluates them. Publishing and acknowledgment create an auditable record in each participant's repo.


5. Failure Handling and Incentives#

Failure Modes#

  1. Temporary downtime: Node is offline for minutes to hours (network issues, restarts)
  2. Extended outage: Node is offline for days (hardware failure, migration)
  3. Sync lag: Node is online but behind on syncing (slow network, overloaded)
  4. Verification failure: Node claims to hold data but fails spot checks (data corruption, incomplete sync)
  5. Abandonment: Node permanently disappears
  6. Byzantine behavior: Node deliberately serves incorrect data

Response Spectrum#

For Mutual Aid#

The goal is resilience, not punishment. Graduated responses:

Level 0 - OK:        All checks passing
Level 1 - Degraded:  1-2 consecutive verification failures
                     Action: Log warning, increase verification frequency
Level 2 - At Risk:   3+ consecutive failures OR >1hr sync lag
                     Action: Alert group members, other nodes begin
                             pre-emptive replication of at-risk data
Level 3 - Failed:    >24hr offline OR persistent verification failures
                     Action: Remove from compliance count, redistribute
                             replication obligations to remaining nodes
Level 4 - Abandoned: >7 days with no heartbeat
                     Action: Remove from group membership, recalculate
                             replication targets

Rebalancing: When a node fails, the remaining nodes must absorb its obligations. If the policy requires minCopies=2 and one of three nodes fails, the remaining two nodes each now hold all data (they already did -- they just become the only holders). The policy now shows "at risk" because there's no redundancy margin.

For SaaS / SLA#

The goal is measurable compliance for reputation or payment. Stricter thresholds:

Metric              | Threshold | Measurement
--------------------|-----------|--------------------------------
Uptime              | 99.9%     | Heartbeat checks
Sync latency        | < 5min    | Time from commit to local sync
Block serve latency | < 500ms   | RASL response time
Verification pass   | 100%      | Layer 0 + Layer 1 must pass

Compliance is calculated over a rolling window (e.g., 30 days). A node earns reputation/payment proportional to its compliance score.

Distinguishing Temporary vs. Permanent Failure#

This is fundamentally a timeout problem. The classic approach:

  1. Heartbeat: Nodes periodically announce presence (e.g., by updating their org.p2pds.peer record's createdAt timestamp, or via IPFS pubsub).
  2. Exponential backoff categorization:
    • < 5 min: Probably transient (network blip)
    • 5 min - 1 hour: Likely restart or deployment
    • 1 hour - 24 hours: Possible hardware issue
    • > 24 hours: Likely abandoned

  3. Self-reporting on recovery: When a node comes back, it announces its return and catches up on sync. The policy engine should have a "recovery" state that gives the node time to re-sync before enforcing full compliance.

Incentive Mechanisms#

For mutual aid, the incentive is reciprocal: "I hold your data because you hold mine." This is BitTorrent's tit-for-tat insight applied to storage. Key difference: unlike BitTorrent (where choking is instantaneous), storage obligations persist over time. You can't "unchoke" someone's data.

Trust score: Each node maintains a score for each peer based on:

  • Verification pass rate over time
  • Sync lag average
  • Uptime percentage
  • Duration of membership

This score is locally computed (not broadcast) and can inform local decisions like "should I replicate data for this node if it's not part of my policy?"

Recommendation#

Implement graduated failure handling with configurable thresholds in the policy document. Start simple:

  • offlineGracePeriod: How long before a node is marked non-compliant (e.g., "1h")
  • recoveryGracePeriod: How long a returning node has to catch up (e.g., "30m")
  • maxConsecutiveFailures: Verification failures before non-compliance (e.g., 3)

The policy document defines thresholds; the local evaluator applies them to observed state.
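
A sketch of how the evaluator might apply those thresholds to observed peer state; the ObservedPeer fields and the level names are illustrative assumptions, not an existing API:

// Sketch: classify a peer against policy-defined thresholds.
type FailureLevel = "ok" | "degraded" | "at-risk" | "failed";

interface ObservedPeer {
  offlineForMs: number;                     // time since last heartbeat
  consecutiveVerificationFailures: number;
  recoveringForMs: number | null;           // set while a returning node is catching up
}

interface FailureThresholds {
  offlineGracePeriodMs: number;
  recoveryGracePeriodMs: number;
  maxConsecutiveFailures: number;
}

function classifyPeer(peer: ObservedPeer, t: FailureThresholds): FailureLevel {
  // A recovering peer gets time to re-sync before full compliance is enforced.
  if (peer.recoveringForMs !== null && peer.recoveringForMs < t.recoveryGracePeriodMs) {
    return "degraded";
  }
  if (peer.offlineForMs > t.offlineGracePeriodMs) return "failed";
  if (peer.consecutiveVerificationFailures >= t.maxConsecutiveFailures) return "at-risk";
  if (peer.consecutiveVerificationFailures > 0) return "degraded";
  return "ok";
}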


6. Resource Accounting#

What to Track#

Resource               | Unit          | Source                              | Verifiable?
-----------------------|---------------|-------------------------------------|-------------------------------------------
Storage used           | Bytes per DID | Local measurement                   | Self-reported, verifiable via block count
Block count            | Count per DID | Local DB query                      | Verifiable via set reconciliation
Sync latency           | Seconds       | Time between commit and local sync  | Self-reported
Verification pass rate | Percentage    | Verification history                | Recorded by verifier
Uptime                 | Percentage    | Heartbeat history                   | Observed by peers
Bandwidth consumed     | Bytes in/out  | Local measurement                   | Self-reported only

The Self-Reporting Problem#

In a decentralized system, most metrics are self-reported. A dishonest node can claim 100% uptime, zero latency, and perfect verification scores.

Mitigations:

  1. Cross-verification: Node A's claim about serving DID X is verified by Node B fetching blocks from Node A via RASL. The existing verification layer already does this.

  2. Statistical sampling: Don't verify everything -- verify enough to make cheating statistically improbable. If a node holds only a fraction p of the blocks, it passes a k-block random sample with probability roughly p^k: a node holding 90% of the blocks passes a 50-block check only about 0.5% of the time (0.9^50 ≈ 0.005). (This is the existing RASL sampling approach.)

  3. Peer-observed metrics: Uptime and latency can be measured by peers. If three nodes all agree that Node D has been offline for 2 hours, that's more credible than Node D's self-report.

  4. Commit-reveal for sync timing: To prevent a node from claiming fast sync while actually syncing lazily:

    • When a commit occurs, the authoritative PDS publishes a commitment (hash of commit CID + timestamp)
    • Replicating nodes must publish their own receipt within the sync window
    • The receipt includes the commit CID, proving they actually synced
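
A sketch of the commitment/receipt check under that scheme; the record shapes and field names here are hypothetical:

import { createHash } from "node:crypto";

// Published by the authoritative PDS when a commit lands.
interface SyncCommitment {
  did: string;
  commitment: string;    // sha256(commitCid + ":" + committedAt), hex-encoded
  committedAt: string;   // ISO timestamp
}

// Published by a replicating node once it has synced that commit.
interface SyncReceipt {
  did: string;
  commitCid: string;     // revealing the CID shows the node actually saw the commit
  syncedAt: string;
}

function receiptMatchesCommitment(c: SyncCommitment, r: SyncReceipt): boolean {
  const recomputed = createHash("sha256")
    .update(`${r.commitCid}:${c.committedAt}`)
    .digest("hex");
  return r.did === c.did && recomputed === c.commitment;
}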

Resource Accounting in the Policy#

{
  "accounting": {
    "trackStorage": true,
    "trackSyncLatency": true,
    "trackUptime": true,
    "reportingInterval": "1h",
    "retentionPeriod": "30d"
  }
}

Resource accounting data could be stored in a local SQLite table (extending the existing replication_state schema) and optionally published as atproto records for auditability.

Recommendation#

Track storage and verification locally in SQLite. Measure uptime and sync latency via peer observation. Defer bandwidth accounting to v2. The existing SyncStorage class already tracks per-DID sync state, block counts, and verification timestamps. Extending it with a time-series table for compliance history is straightforward.
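
A sketch of what such a time-series table might look like (table and column names are assumptions, not the existing schema):

// Sketch of a compliance-history table that could extend the existing SQLite schema.
const CREATE_COMPLIANCE_HISTORY = `
  CREATE TABLE IF NOT EXISTS compliance_history (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    policy_uri  TEXT NOT NULL,      -- AT-URI of the policy being evaluated
    subject_did TEXT NOT NULL,      -- DID whose replication was checked
    checked_at  INTEGER NOT NULL,   -- unix ms
    layer       INTEGER NOT NULL,   -- verification layer (0, 1, 3, ...)
    passed      INTEGER NOT NULL,   -- 0 or 1
    sync_lag_ms INTEGER,            -- null when not measured
    detail      TEXT                -- optional JSON (sampled CIDs, errors, ...)
  );
  CREATE INDEX IF NOT EXISTS idx_compliance_history_lookup
    ON compliance_history (policy_uri, subject_did, checked_at);
`;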


7. Existing Systems / Prior Art#

Filecoin: The Heavyweight Approach#

Relevant patterns:

  • Deal-making protocol: Storage provider and client agree on terms (price, duration, replication factor) before storage begins. Analogous to our policy agreement.
  • Sector sealing: Data is encoded uniquely per-provider to prevent Sybil attacks (claiming multiple copies from one physical copy). P2PDS doesn't need this -- content-addressing already handles deduplication.
  • WindowPoSt: Periodic proof that data is still stored, checked every 24-hour window. Analogous to our periodic verification.
  • Proof of Data Possession (PDP): Lightweight SHA-256 based verification for hot storage. Most applicable to P2PDS.

What to take: The deal-making lifecycle (propose, accept, active, expired, faulted). The periodic windowed verification model. The graduated fault handling (fault -> recovery period -> penalty/termination).

What to skip: The heavyweight cryptography (SNARKs, VDFs), the blockchain settlement layer, sector sealing. These solve problems (preventing Sybil storage, trustless payment) that P2PDS handles differently (social trust + content-addressed verification).

Ceramic / ComposeDB: Decentralized Data with Sync#

Relevant patterns:

  • Streams: Append-only logs of events, similar to atproto commit sequences
  • StreamTypes: Define validation rules for streams (who can write, what schema). Analogous to our policy types.
  • Event-based sync: Nodes cooperate to distribute events to all interested consumers
  • Historical sync: Nodes can sync data from before they joined the network

What to take: The concept of "interest" -- nodes declare which data they're interested in, and the network routes accordingly. This maps to our manifest records.

What to skip: Ceramic's consensus layer (RAFT-based) and blockchain anchoring. Our content-addressed verification provides similar guarantees without consensus.

Kubernetes: Declarative Desired-State Reconciliation#

Relevant patterns:

  • Spec vs. Status: Every resource has a spec (desired state) and status (observed state). Controllers continuously reconcile status toward spec.
  • Reconciliation loop: Observe -> Diff -> Act, repeated. Idempotent.
  • Conditions: Status includes machine-readable conditions (Ready, Progressing, Degraded) with timestamps and reasons.
  • ValidatingAdmissionPolicy: CEL expressions evaluate constraints declaratively, in-process, without external webhooks.

What to take: The spec/status pattern is directly applicable. Policy = spec. Observed compliance = status. The reconciliation loop is our sync + verify cycle. Conditions give us a model for graduated compliance states.

Policy Spec (desired state):
  - Node A holds DID X, DID Y
  - Verification every 30m
  - minCopies = 2

Compliance Status (observed state):
  - Node A: DID X synced (rev abc), DID Y synced (rev def)
  - Last verification: 10m ago, passed
  - Conditions:
    - type: Compliant, status: True, reason: AllChecksPassing
    - type: InSync, status: True, reason: SyncedWithin5m

This is the most directly applicable pattern in the entire research. Kubernetes solved declarative desired-state reconciliation at scale. The same pattern works for replication policy.

Open Policy Agent (OPA): General-Purpose Policy Engine#

Relevant patterns:

  • Decision as data: Policy evaluation produces a JSON decision document, not a side effect
  • Bundle distribution: Policies packaged as bundles, distributed to evaluators
  • Partial evaluation: Compile policy partially when some inputs are known, evaluate the rest at runtime

What to take: The separation of "policy definition" from "policy evaluation" from "policy enforcement." In P2PDS terms: define policy (atproto records), evaluate compliance (local engine), enforce (sync/rebalance actions).

What to skip: The centralized bundle server model. In P2PDS, policies are discovered via atproto records, not pushed from a central authority.

BitTorrent: Tit-for-Tat Incentives#

Relevant patterns:

  • Choking algorithm: Upload to peers that upload to you. Reciprocity drives cooperation.
  • Optimistic unchoking: Periodically try new peers to discover better partners.
  • Rarest-first strategy: Prioritize replicating the rarest pieces to maximize network resilience.

What to take: The principle that reciprocity is a sufficient incentive for cooperation in voluntary networks. In mutual aid, "I host your data because you host mine" is the storage equivalent of tit-for-tat.

Adaptation: In BitTorrent, choking is instantaneous (stop uploading). In P2PDS, the equivalent would be "stop syncing new commits for a non-reciprocating peer" -- but you can't delete already-stored data without violating the policy. This asymmetry means tit-for-tat works differently for storage vs. bandwidth.

Hypercore / Dat: Sparse Replication#

Relevant patterns:

  • Sparse mode: Only download blocks that are explicitly requested. The Want/Have protocol.
  • Selective sync: Use a .datdownload file to specify which files to sync.
  • Signed Merkle tree: Each entry in the feed is authenticated by the feed author.

What to take: The concept of selective replication -- not every node needs every record. A policy could specify that Node A replicates app.bsky.feed.post records but not app.bsky.feed.like records. This is "collection-level partial replication."

Relevance to policy: Enables more flexible policies: "hold all posts but not all likes" is a valid replication strategy that reduces storage costs while preserving the most important data.

Nostr: Relay Policy and Negentropy Sync#

Relevant patterns:

  • NIP-11 (Relay Information Document): Relays publish their policies (what events they accept, retention periods, limitations) as a machine-readable JSON document.
  • NIP-77 (Negentropy Syncing): Efficient set reconciliation between relays using the Negentropy protocol.
  • Event filtering: Relays apply policies to decide which events to store and serve.

What to take: NIP-11 is remarkably close to what P2PDS needs -- a machine-readable policy document published by each node. The Negentropy integration shows that range-based set reconciliation works well for syncing content-addressed data between nodes.

Direct applicability: Nostr's relay-relay sync via Negentropy is the closest existing analog to P2PDS node-node repo sync. The main difference: Nostr events are independent, while atproto repos have a Merkle tree structure that provides stronger integrity guarantees.


8. Architecture Sketch#

Core Abstractions#

+-----------+     +-----------+     +------------+     +------------+
|  Policy   |---->| Obligation|---->|Verification|---->| Compliance |
| (desired  |     | (what each|     | (checking  |     | (did they  |
|  state)   |     |  node must|     |  that work |     |  do it?)   |
|           |     |  do)      |     |  was done) |     |            |
+-----------+     +-----------+     +------------+     +------------+
      |                |                  |                   |
      v                v                  v                   v
  atproto          local DB         RASL/IPFS/sync       local DB +
  records       (obligation         (verification         atproto
               schedule)             infrastructure)       records

Policy: The declarative document defining desired state. Published as atproto records. Immutable per version.

Obligation: The derived per-node work items. "Node A must sync DID X, DID Y, DID Z." Computed from policy by the local evaluator. Stored in local DB.

Verification: The process of checking that obligations are met. Uses the existing layered verification system. Transport-agnostic.

Compliance: The result of evaluating verification history against policy requirements. "Node A is compliant / non-compliant / degraded." Stored locally and optionally published.

The Evaluation Loop#

Directly inspired by Kubernetes controllers:

                    +-------------------------------+
                    |  OBSERVE                      |
          +---------+  - Fetch policies             |
          |         |  - Check sync state           |
          |         |  - Run verification           |
          |         +--------+----------------------+
          |                  |
          |                  v
          |         +--------+----------------------+
          |         |  EVALUATE                     |
          |         |  - Derive obligations         |
          |         |  - Compare desired vs actual  |
          |         |  - Determine compliance       |
          |         +--------+----------------------+
          |                  |
          |                  v
          |         +--------+----------------------+
          |         |  ACT                          |
          +<--------+  - Trigger sync               |
                    |  - Update status              |
                    |  - Notify/alert               |
                    |  - Rebalance                  |
                    +-------------------------------+

This loop runs on a timer (e.g., every 5 minutes) and also in response to events (new commit observed, peer heartbeat missed, verification completed).
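
A sketch of the loop's wiring, using the Policy, Obligation, and ComplianceStatus interfaces defined under Key Interfaces below; the engine and manager method names are assumptions about a future API, not the existing ReplicationManager surface:

// Sketch of the reconciliation loop. Interfaces here are assumed, not existing code.
interface PolicyEngineLike {
  fetchPolicies(): Promise<Policy[]>;
  deriveObligations(policies: Policy[]): Obligation[];
  evaluateCompliance(obligations: Obligation[]): Promise<ComplianceStatus[]>;
}

interface ReplicationManagerLike {
  syncObligations(obligations: Obligation[]): Promise<void>;
  reportCompliance(status: ComplianceStatus[]): Promise<void>;
}

async function reconcileOnce(engine: PolicyEngineLike, manager: ReplicationManagerLike) {
  const policies = await engine.fetchPolicies();                // OBSERVE
  const obligations = engine.deriveObligations(policies);       // desired state
  await manager.syncObligations(obligations);                   // ACT: move toward desired
  const status = await engine.evaluateCompliance(obligations);  // EVALUATE: observed vs desired
  await manager.reportCompliance(status);                       // ACT: report / alert
}

// Timer-driven like a Kubernetes resync period; callers can also invoke reconcileOnce()
// directly on events (new commit observed, heartbeat missed, verification completed).
function startReconciler(engine: PolicyEngineLike, manager: ReplicationManagerLike) {
  setInterval(() => void reconcileOnce(engine, manager), 5 * 60 * 1000);
}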

Where State Lives#

State                  | Storage                                                | Why
-----------------------|--------------------------------------------------------|--------------------------------------------------
Policy documents       | Atproto records (org.p2pds.policy)                     | Auditable, discoverable, signed by author
Policy acknowledgments | Atproto records (org.p2pds.policy.ack)                 | Auditable proof of agreement
Derived obligations    | Local SQLite                                           | Ephemeral, recomputable from policy
Sync state             | Local SQLite (existing replication_state)              | Operational state, changes frequently
Verification results   | Local SQLite (new table)                               | Time-series data, queried for compliance
Compliance status      | Local SQLite + optionally atproto records              | Local for evaluation, published for transparency
Peer heartbeats        | Local SQLite (observed) + atproto records (published)  | Both self-reported and peer-observed

Integration with Existing Codebase#

The policy engine slots in between the existing ReplicationManager and the sync/verification infrastructure:

Current:
  Config (REPLICATE_DIDS) -> ReplicationManager -> syncAll() -> verify()

With Policy Engine:
  PolicyEngine -> evaluatePolicies() -> derive obligations
       |
       v
  ReplicationManager -> syncObligations() -> verify() -> reportCompliance()
       ^
       |
  PolicyDiscovery -> fetch org.p2pds.policy records from peers

The REPLICATE_DIDS config becomes a fallback / bootstrap mechanism. Once policies are discovered, they take precedence.

Key Interfaces#

/** A policy document (deserialized from atproto record) */
interface Policy {
  version: number;
  type: "mutual-aid" | "sla" | "custom";
  members: PolicyMember[];
  rules: PolicyRules;
  effectiveAt: string;
  expiresAt?: string;
}

interface PolicyMember {
  did: string;        // The DID being served
  nodeId: string;     // The node responsible (could be same DID or different)
}

interface PolicyRules {
  replication: {
    strategy: "full" | "partial";
    minCopies: number;
    subjects: string[];  // DIDs to replicate
    collections?: string[];  // Optional: only these collections
  };
  verification: {
    interval: string;    // Duration like "30m"
    layers: number[];    // Which verification layers [0, 1, 3]
    sampleSize: number;
  };
  sync: {
    maxLag: string;      // Duration like "5m"
  };
  compliance: {
    offlineGracePeriod: string;
    recoveryGracePeriod: string;
    maxConsecutiveFailures: number;
  };
}

/** The obligation derived for a specific node from a policy */
interface Obligation {
  policyUri: string;     // AT-URI of the policy record
  nodeId: string;        // This node's DID
  subjectDid: string;    // DID to replicate
  strategy: "full" | "partial";
  collections?: string[];
  verificationInterval: number;  // ms
  maxSyncLag: number;            // ms
}

/** Compliance status for a node within a policy */
interface ComplianceStatus {
  policyUri: string;
  nodeId: string;
  status: "compliant" | "degraded" | "non-compliant" | "unknown";
  obligations: ObligationStatus[];
  lastEvaluated: string;
}

interface ObligationStatus {
  subjectDid: string;
  synced: boolean;
  lastSyncRev: string | null;
  syncLag: number | null;        // ms
  verificationPassed: boolean;
  consecutiveFailures: number;
  lastVerified: string | null;
}
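
As a sketch, obligation derivation from a policy could look like the following. It assumes the simple rule that a node replicates every subject DID it is not the home node for, which leans on the DID-to-node mapping question flagged in the open questions:

/** Sketch: derive this node's obligations from a policy. */
function deriveObligations(policyUri: string, policy: Policy, selfNodeId: string): Obligation[] {
  const ownDids = new Set(
    policy.members.filter((m) => m.nodeId === selfNodeId).map((m) => m.did),
  );
  return policy.rules.replication.subjects
    .filter((did) => !ownDids.has(did))
    .map((did) => ({
      policyUri,
      nodeId: selfNodeId,
      subjectDid: did,
      strategy: policy.rules.replication.strategy,
      collections: policy.rules.replication.collections,
      verificationInterval: parseDuration(policy.rules.verification.interval),
      maxSyncLag: parseDuration(policy.rules.sync.maxLag),
    }));
}

// Minimal parser for the duration strings ("30m", "5m", "1h") used in the policy schema.
function parseDuration(d: string): number {
  const m = /^(\d+)(ms|s|m|h|d)$/.exec(d);
  if (!m) throw new Error(`unparseable duration: ${d}`);
  const unit: Record<string, number> = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return Number(m[1]) * unit[m[2]];
}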

Minimum Viable Architecture#

For v1, the simplest implementation that's still useful:

  1. Policy document: JSON object matching the schema above, stored as an org.p2pds.policy record
  2. Policy evaluator: A function that takes a Policy + current SyncState[] and returns ComplianceStatus
  3. Policy-aware ReplicationManager: Instead of iterating REPLICATE_DIDS, iterates derived obligations
  4. Compliance reporting: Log-level reporting + updated manifest records

No new dependencies. No new languages. Just structured JSON and TypeScript evaluation logic.


9. Open Questions#

  1. Policy authority: Who gets to create policies? Anyone who lists your DID? Only you? Only DIDs you've explicitly authorized? This intersects with consent/authorization, which is a separate design problem.

  2. Policy conflicts: What if two policies require different replication strategies for the same DID? Precedence rules? Union of requirements? Error?

  3. DID-to-node mapping: The CLAUDE.md notes this as an open problem. A single DID can have multiple nodes. A single node can serve multiple DIDs. The policy needs to address which nodes are obligated, not just which DIDs.

  4. Storage quotas: How do you limit the total storage a node commits to? If 100 policies each require 1GB, the node needs 100GB. Who enforces limits?

  5. Collection-level policies: Can a policy specify "replicate app.bsky.feed.post but not app.bsky.feed.like"? This requires MST-aware partial replication, which the current sync (full repo CAR fetch) doesn't support.

  6. Cross-policy verification: If a node participates in multiple policies, should verification be shared (verify once, count for all policies) or independent (each policy verifies separately)?

  7. Privacy: Publishing policies as atproto records makes group membership public. Is this always desirable? Some groups might want private policies.

  8. Bootstrapping: How does a new node catch up on existing policy state? It needs to discover policies, acknowledge them, sync all required data, and pass verification before being counted as compliant.

  9. Clock skew: Compliance evaluation depends on timestamps (sync lag, verification intervals). How much clock skew is tolerable between nodes?

  10. Incentive alignment for verification: The verifier and the verified are both group members. What prevents collusion (node A "verifies" node B without actually checking)?


The Simplest Useful Thing: Mutual Aid with Full Replication#

Scenario: Three nodes (Alice, Bob, Carol) each run a P2PDS instance. Each node holds complete replicas of all three members' repos. The policy is: "every member's data is held by at least 2 other members."

Implementation plan:

Step 1: Define the Policy Schema#

Create a Lexicon definition for org.p2pds.policy with the minimal fields:

{
  "$type": "org.p2pds.policy",
  "version": 1,
  "type": "mutual-aid",
  "name": "my-cluster",
  "members": [
    { "did": "did:plc:alice" },
    { "did": "did:plc:bob" },
    { "did": "did:plc:carol" }
  ],
  "rules": {
    "replication": {
      "strategy": "full",
      "minCopies": 2
    },
    "verification": {
      "interval": "30m",
      "sampleSize": 50
    },
    "sync": {
      "maxLag": "5m"
    }
  }
}

Step 2: Publish and Discover#

  • On startup, each node publishes its policy record to its own repo
  • Each node fetches policy records from all configured peer DIDs (extending existing PeerDiscovery)
  • Derive obligations: "I must replicate all member DIDs except my own"

Step 3: Evaluate Compliance#

Add a PolicyEvaluator class:

class PolicyEvaluator {
  evaluate(policy: Policy, syncStates: SyncState[]): ComplianceStatus {
    // For each subject DID in the policy:
    //   1. Check if we've synced recently (syncLag < maxLag)
    //   2. Check if verification passed
    //   3. Determine compliance status
    // Return aggregate compliance
  }
}
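
One way the evaluate step could be fleshed out, assuming a SyncState shape roughly like the fields the existing SyncStorage tracks (field names below are assumptions) and reusing the ComplianceStatus/ObligationStatus interfaces from the architecture sketch:

// Assumed shape of per-DID sync state, loosely mirroring what SyncStorage tracks.
interface SyncStateLike {
  did: string;
  lastSyncRev: string | null;
  lastSyncAt: number | null;               // unix ms
  lastVerificationPassed: boolean;
  consecutiveVerificationFailures: number;
  lastVerifiedAt: number | null;           // unix ms
}

function evaluatePolicy(
  policyUri: string,
  selfNodeId: string,
  obligations: Obligation[],
  syncStates: Map<string, SyncStateLike>,  // keyed by subject DID
  maxConsecutiveFailures: number,          // from policy.rules.compliance
  now = Date.now(),
): ComplianceStatus {
  const statuses: ObligationStatus[] = obligations.map((o) => {
    const s = syncStates.get(o.subjectDid);
    const syncLag = s?.lastSyncAt != null ? now - s.lastSyncAt : null;
    return {
      subjectDid: o.subjectDid,
      synced: syncLag !== null && syncLag <= o.maxSyncLag,
      lastSyncRev: s?.lastSyncRev ?? null,
      syncLag,
      verificationPassed: s?.lastVerificationPassed ?? false,
      consecutiveFailures: s?.consecutiveVerificationFailures ?? 0,
      lastVerified: s?.lastVerifiedAt != null ? new Date(s.lastVerifiedAt).toISOString() : null,
    };
  });

  const allOk = statuses.every((s) => s.synced && s.verificationPassed);
  const hardFailure = statuses.some((s) => s.consecutiveFailures >= maxConsecutiveFailures);
  return {
    policyUri,
    nodeId: selfNodeId,
    status: allOk ? "compliant" : hardFailure ? "non-compliant" : "degraded",
    obligations: statuses,
    lastEvaluated: new Date(now).toISOString(),
  };
}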

This replaces the hardcoded REPLICATE_DIDS config with policy-driven obligation derivation.

Step 4: Report Compliance#

  • Update manifest records with compliance status
  • Log compliance changes
  • (Future: publish compliance attestations as atproto records)

What This Gets You#

  • Declarative: Policy is a JSON document, not imperative code
  • Deterministic: Given the same policy + sync states, any node computes the same compliance result
  • Transport-agnostic: Policy checks sync state and verification results, not how blocks were transferred
  • Account-centric: Policy lists DIDs, not block CIDs or IPFS hashes
  • Publishable: Policy lives in atproto repos, auditable by anyone
  • Minimal: No new dependencies, no new languages, ~200-300 lines of TypeScript

What It Doesn't Handle (Yet)#

  • Multi-party policy negotiation (v2: add acknowledgment records)
  • SLA metrics and reputation (v2: add time-series tracking)
  • Partial/collection-level replication (v2: needs MST-aware sync)
  • Failure rebalancing (v2: automated obligation redistribution)
  • CEL/Datalog predicates (v3: if JSON schema proves too limiting)

This starting point establishes the core abstraction (policy -> obligation -> verification -> compliance) and the integration pattern (policy-driven ReplicationManager). Everything else is additive.