# Action Plan to Calibrate the KLBR Agent Memory Architecture

## Executive summary

KLBR already has a strong conceptual shape: a semantic four-layer memory hierarchy, classifier-first routing, L1 time-windowed retrieval, ANN candidate generation followed by cross-encoder reranking, provenance backlinks, and an append-only archive. The immediate problem is not missing ideas; it is missing **operational artifacts**. The highest-value next steps are to freeze the lifecycle contract and schemas, build a reproducible benchmark harness, establish a flat-memory baseline, and only then calibrate routing, time windows, ANN depth, reranker thresholds, backlink policies, and consolidation triggers. That sequencing matches what the most relevant literature says matters in assistant memory systems: LongMemEval explicitly decomposes performance into indexing, retrieval, and reading; PerLTQA isolates routing/classification as a separate task and reports strong gains from BERT-style classifiers; BEIR shows why retrieve-then-rerank is usually better than one-stage dense retrieval but more expensive; and AgentPoison and MEXTRA show that memory quality work should not be separated from security and privacy work.

For implementation, the safest path is **prototype locally, benchmark aggressively, and delay backend commitment until access patterns are measured**. The backend landscape has shifted since many early SQLite-vector assumptions were written down: SQLite now has an official `vec1` extension that provides ANN search using IVFADC and trained models, while `sqlite-vec` has added ANN-related code and benchmarking support but still describes itself as pre-v1.
Qdrant is a strong default when filtered search, persistence, and operational simplicity matter; Milvus is better when distributed scale is already a requirement; Faiss remains the best research control because it is a high-performance library rather than a full database; and DiskANN becomes compelling only when SSD-scale vector search is truly needed.

My recommended default for the next 8–12 weeks is this: use a **flat baseline first** with one strong open embedding model, one fast reranker, and either SQLite `vec1` or Qdrant; collect traceable evaluation data; then add the KLBR-specific mechanisms one at a time in the order of routing, L1 windowing, backlinks, and consolidation. That will let you answer the only questions that matter right now: whether layering actually helps your query mix, whether backlinks improve exact recovery enough to justify storage amplification, and whether the classifier meaningfully reduces cost without hurting recall.

## Assumptions and gap inventory

This report assumes that the only currently specified design elements are the ones in your prompt: the semantic layers, classifier-first routing, L1 time-window narrowing, ANN-plus-rerank retrieval, provenance backlinks, and append-only archives. It also assumes that **no executable code, schemas, representative traces, model selections, calibrated thresholds, concurrency rules, namespace model, deletion semantics, or benchmark scorecards** have yet been provided. Under that assumption, the first goal is to convert KLBR from a concept into a system with falsifiable interfaces and measurable behavior.
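One way to make those interfaces falsifiable early is to pin down a minimal record shape. The sketch below is an illustration of the event-sourced assumption recommended later in this report, not the KLBR spec: every field name (`record_id`, `provenance`, `superseded_by`, and so on) is a placeholder to be replaced by the frozen schema pack.

```python
# Minimal sketch of an event-sourced memory record and tombstone.
# All field names are assumptions, not the KLBR schema contract.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    namespace: str            # e.g. a user/agent scoped key
    layer: int                # 1 = immutable episodic event, 2-4 = derived views
    text: str
    created_at: float         # epoch seconds
    provenance: tuple = ()    # record_ids of lower-layer evidence
    superseded_by: Optional[str] = None

@dataclass(frozen=True)
class Tombstone:
    record_id: str
    deleted_at: float

def affected_derived(records, tombstone):
    """Ids of higher-layer records whose provenance cites the deleted
    record and which therefore need asynchronous re-materialization."""
    return [r.record_id for r in records
            if tombstone.record_id in r.provenance]
```

Even a toy model like this makes delete propagation testable: tombstoning an L1 event immediately yields the list of derived memories to invalidate, which is exactly the lifecycle path the schema-freeze task below must make executable.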
| Missing artifact | Temporary assumption | Why the gap matters now | First concrete deliverable |
|---|---|---|---|
| Service code and interface contracts | Retrieval and consolidation are not yet end-to-end reproducible | No benchmark or regression can be trusted without deterministic replay | Minimal reference pipeline with ingest, search, rerank, consolidate, and trace export |
| Storage schemas | Memories are versioned records with backlinks and timestamps, but fields are unspecified | No reliable migration, deletion, or provenance accounting is possible | Schema/IDL for `MemoryRecord`, `MemoryEdge`, `Namespace`, `Tombstone`, `RetrievalTrace`, `ConsolidationJob` |
| Query and conversation traces | No production-like workload exists yet | Thresholds and backend choices will otherwise be tuned on toy data | Gold trace set with 200–500 sessions and 1,000+ labeled queries |
| Model roster | Embedding, classifier, and reranker are not yet fixed | Every threshold depends on score scale and model behavior | Frozen model roster for one benchmark season |
| Thresholds | No calibrated `K`, score thresholds, or expansion policies exist | Routing and traversal behavior will be unstable and non-comparable | Sweep plan plus scorecard and confidence intervals |
| Concurrency model | Reads and writes are logically concurrent, but consistency rules are unspecified | Archival writes, consolidation jobs, and deletion propagation can race | Read/write contract, snapshot semantics, and job queue policy |
| Namespace design | User, agent, team, and memory scopes are unspecified | Multi-tenant contamination and bad deletes become likely | Namespace key plan, auth boundaries, and per-namespace indexes |
| Deletion and supersession semantics | Archive is append-only, but not yet erasable | Privacy, corrections, and policy-driven removal cannot be enforced | Tombstone + redaction design with derived-memory invalidation |
| Metrics and SLOs | No quality or latency gates exist | Engineering work cannot converge | Benchmark scorecard with pass/fail thresholds |

Two of those gaps are more dangerous than they may look: **namespace design** and **deletion semantics**. Production memory systems increasingly make both explicit. LangGraph’s long-term memory is organized by namespace and key rather than as one undifferentiated store, and OpenAI’s memory controls emphasize that users should be able to inspect, delete, or disable memory explicitly. KLBR should copy that governance instinct even if its internal design is more sophisticated than those product patterns.

A good working assumption for the prototype is an **event-sourced memory model**: L1 episodic records are immutable source events; L2–L4 are versioned derived views; every derived memory has provenance edges to lower-level evidence; deletes create tombstones immediately and schedule asynchronous re-materialization of any affected higher-layer memories. That assumption is technically conservative and is the cleanest way to preserve both provenance and erasure semantics.

## Prioritized calibration backlog

The table below is the practical backlog I would run. Owners are role-based placeholders rather than named people.

| Priority | Task and owner | Required inputs | Recommended tools | Datasets and experiments | Measurement plan and analysis | Deliverable and success gate | Effort |
|---|---|---|---|---|---|---|---|
| P0 | **Freeze schemas and lifecycle contract** — Owner: Memory engineer + tech lead | Current design doc, sample memory records, privacy requirements, target API surface | JSON Schema or Protocol Buffers for records; SQLite migrations for local dev; LangGraph namespace pattern as reference; OpenAI-style delete controls as product reference | Simulate create, archive, supersede, tombstone, restore, and provenance queries across all four layers | Validate schema coverage; migration success; deterministic replay; delete propagation lag; storage amplification formula `(raw archive + active layers + embeddings + edges + indexes) / raw episodic bytes` | Signed-off schema pack for `MemoryRecord`, `MemoryEdge`, `Namespace`, `Tombstone`, `RetrievalTrace`; success = every lifecycle path is executable in tests | 24–40 hours |
| P0 | **Build a gold trace and benchmark harness** — Owner: Evaluation engineer | 200–500 conversation sessions, 1,000+ labeled questions, benchmark adapters | LongMemEval, LoCoMo, PerLTQA for assistant-memory behavior; BEIR and MTEB for retriever-only studies; YCSB for backend load generation | Create a mixed suite: benchmark data + KLBR-native synthetic traces + hand-labeled internal traces; stratify by episodic, pattern, trait, core, update, abstention, temporal | Use per-query labels, confusion buckets, bootstrap 95% CIs, and paired comparisons against baseline | Reproducible benchmark runner with frozen splits; success = nightly run produces a complete scorecard in one command | 40–80 hours |
| P0 | **Establish the flat-memory baseline** — Owner: Retrieval engineer | Gold traces, initial schema, candidate embedding and reranker models | Embeddings: multilingual-E5, BGE-M3, jina-embeddings-v3. Rerankers: `cross-encoder/ms-marco-MiniLM-L6-v2` baseline and `bge-reranker-v2-m3` stretch. Retrieve–rerank design is directly aligned with DPR, monoBERT, and Sentence Transformers documentation | Ablate exact flat search vs ANN flat search; compare bi-encoder only vs bi-encoder + cross-encoder; run on BEIR, PerLTQA, LongMemEval, and LoCoMo | Recall@k, nDCG@k, MRR, answer F1, exact match, p50/p95 latency, token cost, and cost-quality Pareto plots | Frozen baseline that KLBR must beat; success = stable metrics with three repeated runs and no unexplained variance | 1–2 weeks |
| P1 | **Calibrate classifier-first routing** — Owner: Applied ML engineer | Labeled query→layer targets from benchmarks and hand annotation | Start with logistic regression over the same embedding vectors in scikit-learn and export to ONNX Runtime for deployment; if macro-F1 misses target, try a small BERT/MiniLM classifier. PerLTQA specifically supports classifier-first decomposition and reports BERT-based routing gains | Compare no-classifier, heuristic rules, embedding-logistic baseline, and small transformer classifier; sweep entropy and top-1 margin gates | Macro-F1, macro-recall by layer, expected calibration error, misroute cost, end-to-end latency saved, end-to-end answer delta | Deployable router model + fallback rules; success = better end-to-end cost-quality than “search all layers equally,” with no statistically significant recall loss | 24–48 hours for baseline, 1 week if transformer classifier is needed |
| P1 | **Calibrate L1 time-window policy** — Owner: Retrieval engineer | Timestamped traces, temporal labels, parser for explicit time expressions | Temporal parser + index partitioning by day/week/month; backend filtering through SQLite metadata, Qdrant payload filters, or analogous store filters. Qdrant’s docs specifically highlight the need to pair vector indexes with payload indexes for filtered search | Sweep default windows `{3, 7, 14, 30, 60, 180}` days; explicit-temporal vs default; expanding-window retry; parallel L1+L2 vs sequential fallback | Temporal consistency, Recall@k on episodic queries, reranker score uplift after window expansion, p95 latency, index fan-out | L1 retrieval policy with exact sweep report; success = clear frontier showing best recall/latency operating point, plus a fallback policy for ambiguous temporal queries | 1 week |
| P1 | **Calibrate ANN K, reranker thresholds, backlink traversal, and consolidation triggers together** — Owner: Retrieval engineer + research scientist | Flat baseline, routing model, L1 window policy, provenance edge schema | ANN backends under test; rerankers above; LightMem and Letta sleep-time patterns as references for offline consolidation strategy | Ablations: flat vs layered; ANN `K` in `{20, 50, 100, 200}`; reranker threshold sweeps by percentile and score margin; backlink depth `{0,1,2}`; traversal only on top-1 vs top-m high-confidence hits; consolidation trigger modes `{time, count, density}` | Recall@k, provenance precision, provenance coverage, storage amplification, consolidation lag, contradiction rate, answer F1, latency p50/p95/p99 | Decision memo on whether layering and backlinks outperform the flat baseline enough to justify extra complexity; success = layered KLBR beats flat baseline on LongMemEval/LoCoMo with acceptable storage cost | 2–3 weeks |
| P1 | **Run the backend and concurrency sweep** — Owner: Platform engineer | Fixed query mix, expected write rate, namespace cardinality, deployment constraints | Backends below: SQLite `vec1`, `sqlite-vec`, Qdrant HNSW, Faiss, DiskANN, Milvus. Use YCSB-style workloads for read-heavy, write-heavy, and mixed patterns. SQLite WAL behavior matters if SQLite remains in scope | Benchmark namespace isolation, read/write contention, filtered search, batch ingest, snapshot reads during consolidation, delete propagation, restart recovery | Throughput, p50/p95/p99 latency, persistence/recovery behavior, concurrent reader/writer behavior, operational overhead, dollar cost per million queries | Chosen backend for the next six months with a migration plan; success = evidence-backed backend decision instead of premature commitment | 1–2 weeks |
| P1 | **Implement deletion, supersession, privacy, and security tests** — Owner: Security engineer + platform engineer | Schema/lifecycle contract, redaction requirements, namespace plan | AgentPoison for poisoning scenarios and MEXTRA for memory leakage evaluation. OpenAI memory controls are a useful product-level pattern for user-visible deletion and disablement | Tests: poison a memory shard, poison a derived summary, attempt cross-namespace retrieval, issue deletes on raw and derived records, test re-derivation after delete | Attack success rate, private-memory extraction rate, cross-tenant leakage, deletion propagation lag, stale-summary rate after user correction | Threat model + mitigation checklist + test suite; success = no known critical leak path in benchmarked scenarios and deletes visible to retrieval immediately | 1–2 weeks |
| P2 | **Turn the benchmark suite into CI/CD gates** — Owner: DevEx / MLOps engineer | Frozen benchmark harness, versioned artifacts, model registry | GitHub Actions or equivalent; nightly benchmark jobs; artifact store; MLPerf only if hardware inference becomes a deployment bottleneck rather than a semantic-memory bottleneck | Run fast smoke tests on pull requests; full suites nightly and on release branches; drift alarms on metrics and cost | Delta vs baseline, statistical significance, regression triage, artifact version traceability | CI/CD benchmark pipeline with merge gates; success = regressions are caught before release and every reported score is reproducible | 24–48 hours for initial automation, 1 additional week for hardening |

The logic of this backlog is deliberate. **Do not tune layered retrieval before flat retrieval is strong**; **do not tune thresholds before models are frozen**; **do not tune concurrency before lifecycle semantics are explicit**; and **do not ship memory before delete and poisoning tests exist**. LongMemEval, LoCoMo, and PerLTQA all punish systems that optimize one narrow axis while leaving the rest underspecified.

## Backend and model choices

The most practical model plan is to use a **two-tier roster**: one low-risk baseline stack and one higher-ambition stack. For embeddings, multilingual E5 is a strong conservative baseline because it is well documented and available in multiple sizes; BGE-M3 is more ambitious because it supports dense, sparse, and multi-vector retrieval in one model and handles long inputs up to 8,192 tokens; jina-embeddings-v3 is also strong for long-context multilingual retrieval and offers task-specific adapters. For reranking, a fast English baseline such as `cross-encoder/ms-marco-MiniLM-L6-v2` is still useful for quick design loops, while `bge-reranker-v2-m3` is the better multilingual and higher-accuracy option when latency permits.

For routing, start with the simplest deployable thing that can win: **logistic regression over the same embedding vectors** used for retrieval, exported to ONNX Runtime.
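The entropy and top-1 margin gates that the backlog sweeps can be sketched independently of the classifier itself. The thresholds below are placeholders to be set by the sweep, not calibrated values, and the function name is illustrative.

```python
# Sketch of the confidence gates for classifier-first routing, assuming the
# router exposes one probability per layer. Gate values are placeholders.
import math

def route(layer_probs, entropy_gate=1.0, margin_gate=0.2):
    """Return the winning layer, or None to fall back to searching all layers."""
    ranked = sorted(layer_probs.items(), key=lambda kv: kv[1], reverse=True)
    # Shannon entropy of the router distribution: high entropy = low confidence.
    entropy = -sum(p * math.log(p) for _, p in ranked if p > 0)
    top1 = ranked[0][1]
    top2 = ranked[1][1] if len(ranked) > 1 else 0.0
    if entropy > entropy_gate or (top1 - top2) < margin_gate:
        return None  # low confidence: search all layers instead of routing
    return ranked[0][0]
```

Keeping the gates as a separate pure function makes the "no-classifier vs gated-classifier" ablation trivial: the fallback path is just `route(...) is None`.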
That keeps training and deployment simple, lets you score on CPU, and makes ablations easy. If that misses the target, switch to a compact BERT/MiniLM classifier. PerLTQA is the strongest direct evidence here because it treats memory classification as a first-class task and reports that BERT-based classifiers outperform LLM-based routing on that subproblem.

The backend comparison below is intentionally **qualitative and workload-dependent**. The ratings are expected relative behavior for KLBR-like filtered semantic retrieval, synthesized from the algorithms and deployment models in the cited primary sources. They should be validated against your own traces before any long-term commitment.

| Backend | Relative latency | Relative throughput | Scalability | Persistence | Concurrency | Cost profile | Operational maturity | Best KLBR fit | Evidence basis |
|---|---|---:|---|---|---|---|---|---|---|
| **SQLite vec1** | Low to medium on single-node workloads if the trained ANN model matches the data well | Medium | Single-node, medium-scale | Strong, inherits SQLite durability | Many readers, effectively one writer in WAL mode under normal SQLite semantics | Very low infra cost | Medium: official SQLite extension, but very new | Best for local prototype and early pilot when transactional simplicity matters | Official SQLite `vec1` provides ANN via IVFADC and training support; SQLite WAL supports concurrent readers with one writer |
| **sqlite-vec** | Medium for small-to-medium local workloads; verify carefully at scale | Medium | Single-node, medium-scale | Strong, inherits SQLite durability | Same SQLite profile as above | Very low infra cost | Low to medium: flexible and fast-moving, but repo still says pre-v1 | Good for rapid prototyping and experiments; less ideal as the long-term contract today | Repo describes pure-C local vector search, metadata/partition columns, and recent ANN additions, but also labels itself pre-v1 |
| **Qdrant HNSW** | Low with strong filtered-search support | High | Single-node to distributed cluster | Strong, with in-memory or memmap storage | Good service-level concurrency | Moderate | High | Best default production candidate for KLBR if filtered retrieval, persistence, and manageable ops all matter | Qdrant documents HNSW-style vector indexing, payload indexes for filters, memmap storage, and distributed deployment |
| **DiskANN** | Low at very large scale if SSD-backed index layout matches workload | High for very large search sets | Excellent for billion-scale or SSD-first settings | Application-managed or service-wrapped | Good if integrated well, but more engineering-heavy | Low hardware cost per vector at scale, higher engineering cost | Medium | Best if KLBR grows into SSD-scale candidates or very large cold tiers | DiskANN paper shows billion-point search on 64 GB RAM + SSD; Microsoft repo now emphasizes scalable, cost-effective ANN with filters and dynamic changes |
| **Milvus** | Low to medium depending on deployment and index choice | High | Excellent; cloud-native and distributed | Strong with separate storage/compute | Strong service-level concurrency | Higher infra and ops cost | High | Best only if distributed scale or managed-cloud style architecture is already required | Milvus docs describe a cloud-native, disaggregated architecture and support for multiple indexes including HNSW, Faiss, and DiskANN families |
| **Faiss** | Very low in optimized in-process setups | Very high in-process | Excellent as a library; service concerns are externalized | Supports index I/O, but not a full database contract by itself | App-managed | Low software cost, higher engineering cost | High as a research library | Best as the offline evaluation control and for custom services, not as KLBR’s only product backend | Faiss is a high-performance library for similarity search, supports multiple index types and I/O, and can handle datasets that do not fit in RAM |

If I had to choose one starting stack **today**, I would use **multilingual-E5-base or BGE-M3 for embeddings, MiniLM cross-encoder for the first design loop, Qdrant for the first operational pilot, and SQLite vec1 as the local deterministic reference**. That combination gives you one service backend, one embedded local backend, one conservative retrieval baseline, and one higher-ceiling retrieval option. Only move to Milvus when you know you need distributed operations, and only move to DiskANN when your measured cold tier is large enough that SSD density is the dominant economic constraint.
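Whichever backend wins the sweep, the comparison is only meaningful if every candidate is measured with the same trace replay. A toy harness sketch, assuming a backend object exposing a `.search(query)` method (that interface and the trace format are assumptions, not a KLBR contract):

```python
# Replay a query trace against any backend exposing .search(query) and
# report p50/p95/p99 latency. Backend interface is an assumed placeholder.
import time
from statistics import quantiles

def replay(backend, queries):
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        backend.search(q)
        latencies.append(time.perf_counter() - t0)
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Running the same `replay` over each candidate backend, with the same namespace cardinality and filter mix, is what turns the qualitative table above into an evidence-backed decision memo.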
## Measurement and CI/CD runbook

Benchmarking should mirror the structure used in LongMemEval: **indexing, retrieval, and reading** are separate parts of the system and should be measured separately. For KLBR, that means at least five score groups: retrieval quality, answer quality, provenance quality, systems performance, and security/privacy. BEIR and MTEB are the right tools to screen retrievers and embeddings before they are embedded in the full assistant-memory stack; LongMemEval, LoCoMo, and PerLTQA are the right tools to test the full memory architecture; YCSB is the right storage-style workload generator for backend stress; and MLPerf matters only if model inference on the target hardware becomes the bottleneck rather than memory design itself.

| Metric family | Concrete metrics | How to use it |
|---|---|---|
| Retrieval quality | Recall@k, nDCG@k, MRR, layer hit-rate, router macro-F1 | Judge whether the right memory candidates are even entering the final stage |
| Answer quality | Exact match, token F1, temporal consistency, contradiction rate, abstention accuracy | Judge whether the chosen memories lead to correct responses |
| Provenance quality | Provenance precision, provenance coverage, backlink yield, support sufficiency | Judge whether backlinks recover real evidence rather than noise |
| Systems performance | p50/p95/p99 end-to-end latency, ingest throughput, reranker latency share, consolidation lag, recovery time | Judge whether the architecture is operable |
| Economic footprint | Storage amplification, embedding cost, reranker cost, cost per successful answer | Judge whether the architecture is sustainable |
| Security and privacy | Attack success rate, leakage extraction rate, cross-namespace contamination rate, delete propagation lag | Judge whether KLBR is safe enough to expose |

The most important sweep ranges should be explicit and finite. My recommended initial grid is: L1 default windows `{3, 7, 14, 30, 60, 180}` days; ANN candidate sizes `{20, 50, 100, 200}`; reranker acceptance by percentile and by score margin; backlink depth `{0,1,2}` with depth `1` as the default; classifier confidence using entropy and margin gates; and consolidation triggers across `{time-based nightly, count-based every N inserts, density-based cluster threshold}`. Avoid “magic thresholds.” Every threshold should have a sweep report, reliability plot, and error bucket analysis.

```mermaid
flowchart LR
    A[Gold traces and benchmark adapters] --> B[Schema validation and deterministic ingest]
    B --> C[Flat retrieval baseline]
    C --> D[Router ablation]
    D --> E[L1 time-window sweep]
    E --> F[ANN K sweep]
    F --> G[Reranker threshold sweep]
    G --> H[Backlink policy sweep]
    H --> I[Consolidation trigger sweep]
    I --> J[Security and deletion tests]
    J --> K[Final scorecard and operating point]
```

```mermaid
flowchart TD
    PR[Pull request or nightly build] --> UT[Unit tests and schema migration tests]
    UT --> IR[Deterministic ingest replay]
    IR --> BENCH[Offline benchmark suite]
    BENCH --> LME[LongMemEval, LoCoMo, PerLTQA]
    BENCH --> RET[BEIR and MTEB]
    BENCH --> SYS[YCSB-style backend load]
    BENCH --> SEC[AgentPoison and MEXTRA]
    LME --> SCORE[Unified scorecard with bootstrap CIs]
    RET --> SCORE
    SYS --> SCORE
    SEC --> SCORE
    SCORE --> GATE{Regression gate passed}
    GATE -->|Yes| REL[Release candidate]
    GATE -->|No| BLOCK[Block merge and open triage issue]
```

The runbook should be executed in this exact order:

1. **Freeze the record and edge schema.** Until record identity, timestamps, provenance edges, supersession edges, and tombstones are explicit, every later benchmark can be invalidated by implementation drift.
2. **Build the gold set before building the hierarchy.** Use benchmark adapters plus a small hand-labeled KLBR-native set that includes exact episodic queries, vague pattern questions, stable trait identity questions, user updates, and abstentions. LongMemEval and LoCoMo make these distinctions concrete.
3. **Train one flat baseline and never delete it.** It should remain in CI forever as the control arm. Dense-retrieve-plus-rerank is the right baseline because that pattern is well grounded in DPR, monoBERT, and modern retrieve-rerank practice.
4. **Calibrate routing independently.** Treat routing as its own supervised problem before feeding it into end-to-end retrieval. That is exactly how PerLTQA structures the problem.
5. **Calibrate L1 windows on temporal subsets only.** Do not let non-temporal queries dominate the search. Measure retrieval recall and latency as separate outcomes.
6. **Introduce backlinks only after you can score them.** A backlink policy without provenance precision and support coverage metrics is just another uncontrolled retrieval path.
7. **Benchmark backends on your actual access pattern.** YCSB-style mixes are useful, but also replay your real benchmark trace because vector filters, namespace cardinality, and consolidation writes matter more than generic KV throughput.
8. **Do not mark the system production-ready until delete and poisoning tests pass.** AgentPoison and MEXTRA show why this is a first-order requirement for memory systems rather than an afterthought.

## Roadmap and optional hardware exploration

A sensible roadmap is **12 weeks for the semantic-memory core**, with a stretch path to 16–20 weeks if you decide to include deeper backend scaling work, formal privacy hardening, or hardware-tier experiments.
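Several of the release gates in the roadmap table that follows hinge on whether a metric delta against the permanent flat baseline is statistically real. A minimal paired-bootstrap sketch for the scorecard comparisons; the function and variable names are illustrative, not a KLBR API:

```python
# 95% CI on mean(candidate - baseline) over per-query paired scores,
# via bootstrap resampling of the paired deltas. Names are placeholders.
import random

def paired_bootstrap_ci(baseline, candidate, iters=2000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible scorecards
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n  # resample with replacement
        for _ in range(iters)
    )
    return means[int(0.025 * iters)], means[int(0.975 * iters)]
```

A gate then reads naturally: a candidate passes only if the CI lower bound on its answer-F1 delta is above zero (or above the agreed regression budget).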
| Window | Milestone | Main outputs | Release gate |
|---|---|---|---|
| Weeks 1–2 | Artifact freeze | Schemas, lifecycle contract, namespace plan, tombstone semantics, benchmark harness scaffold | All lifecycle tests pass locally |
| Weeks 3–4 | Flat baseline | Embedding/reranker bakeoff, exact and ANN flat baselines, first scorecard | Flat baseline reproducible and stable |
| Weeks 5–6 | Router and time windows | Routing classifier, entropy gates, L1 partitions, temporal sweep report | Router reduces cost without recall collapse |
| Weeks 7–8 | Backlinks and consolidation | Provenance edges, traversal policy, consolidation job runner, trigger sweep report | Layered design beats flat baseline on agreed subsets |
| Weeks 9–10 | Backend sweep | SQLite vec1 vs sqlite-vec vs Qdrant vs Faiss service wrapper vs Milvus/DiskANN candidates | Backend decision memo signed off |
| Weeks 11–12 | Security and CI/CD | Delete propagation tests, poisoning and leakage tests, nightly regression gates | No critical open privacy/security blocker |
| Weeks 13–16 | Stretch path | Real-trace hardening, multi-tenant rollout prep, larger-scale load tests | Internal pilot ready |
| Weeks 17–20 | Optional systems co-design | Only if justified: cold-tier disaggregation, PMem/NVM studies, accelerator placement | Decision memo on whether hardware-aware work is worth doing |

The hardware and memory-system simulators you mentioned should stay **out of scope until backend measurements prove that hardware tiers are the bottleneck**. If that day comes, use gem5 for full-system architecture studies and coherence-aware experiments, DRAMsim3 or Ramulator 2.0 for DRAM-controller and memory-standard exploration, and NVMain for DRAM/NVM hybrid studies. These are excellent tools, but they are for the later question of *where the software stack should live*, not for the current question of *whether the semantic architecture is correct*.

A good decision rule is simple: if your bottlenecks are still dominated by retrieval errors, reranker cost, consolidation policy, or delete propagation, do **not** start a hardware simulation track. If you later discover that memory-mapped vector indexes, NUMA effects, accelerator-side reranking, or cold-tier SSD/DAX placement dominate p95 latency or cost, then the hardware tools become worthwhile.

## Risks, mitigations, and selected references

The main risk is **calibration drift**. A layered memory system can look better simply because it hides mistakes in flattering summaries, while exact episodic failures become harder to notice. Mitigate that by keeping a permanent flat baseline, requiring provenance precision metrics for every backlink policy, and separately tracking update correctness and temporal consistency, the two long-memory failure modes that LongMemEval and LoCoMo make particularly visible.

The second risk is **synthetic-routing bias**. It is fine to bootstrap the classifier with synthetic queries, but do not freeze the routing model on synthetic data alone. PerLTQA’s decomposition strongly supports routing as a cheap first stage, but production routing needs real query traces, active-learning refreshes, and calibration checks.

The third risk is **governance debt**. Append-only memory is excellent for provenance and debugging, but unsafe without namespace boundaries, tombstones, and derived-view invalidation. The closest product analogues, LangGraph memory namespaces and OpenAI memory controls, show why user-scoped storage and explicit deletion must be designed up front, not retrofitted later.

The fourth risk is **security under adversarial memory use**. AgentPoison and MEXTRA show that long-term memory can be poisoned or extracted even when the surrounding application seems benign. The mitigation is to make trust metadata, source provenance, namespace isolation, redaction, and benchmarked security tests part of the mainline roadmap rather than the security backlog nobody reaches.

Selected primary references for the KLBR workstream are below.

- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.
- LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents.
- PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering.
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
- MTEB: Massive Text Embedding Benchmark.
- Dense Passage Retrieval for Open-Domain Question Answering.
- Multi-Stage Document Ranking with BERT.
- SQLite `vec1` official documentation.
- `sqlite-vec` repository and release state.
- Qdrant indexing, filtering, storage, and distributed deployment docs.
- Milvus architecture and component docs.
- Faiss documentation and repository.
- DiskANN paper and repository.
- LightMem and Letta sleep-time references.
- AgentPoison and MEXTRA.
- YCSB and MLPerf Inference.