Long-Term Agent Memory, Deployed as a Single Rust Binary

Lucas Vium

Most current approaches to agent memory are overbuilt by default — either relying on ever-growing context windows that degrade with length, or on vector stores that cannot distinguish current facts from outdated ones. Neither captures the structure required for useful long-term memory: what happened, when it was true, how it changed, and why.

Memory is not a retrieval problem. It is a temporal one. The challenge is not finding relevant information — it is knowing which version of a fact is still valid, what changed, and when.

We present a long-term memory system — an event-sourced temporal knowledge store — that addresses this at the systems level. The core shift is treating memory as a system of evolving facts rather than static retrieval. It extracts structured, temporally grounded facts, tracks changes through an append-only ledger, consolidates knowledge into higher-level observations, and retrieves through four parallel strategies fused into a unified ranking. The entire system compiles to a single Rust binary backed by a single PostgreSQL instance.

Architecture

The system runs as a single static binary. All components execute in-process, including cross-encoder reranking via ONNX Runtime. PostgreSQL handles vector search (pgvector), full-text search, graph traversal, and temporal queries. There is no separate vector database, no graph database, no cache layer, and no inter-service communication. Deployment is one binary, one database.

Core mechanisms

Fact extraction with temporal grounding

Conversations are decomposed into atomic facts with resolved entities and absolute timestamps. Relative expressions ("last summer," "two weeks ago") are normalized at ingestion. A separate extraction pass identifies typed relationships between entities.
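As a minimal sketch of the normalization step, the function below resolves a few relative expressions against the ingestion date, using days-since-epoch as the absolute timestamp. The helper name, the supported phrases, and the coarse handling of seasonal expressions are illustrative assumptions, not the system's actual grammar.

```rust
/// Days since the Unix epoch, used here as a minimal absolute timestamp.
type EpochDay = i64;

/// Resolve a relative temporal expression against the ingestion date.
/// Returns None for phrases this sketch does not cover.
/// (Illustrative only: the real normalizer is assumed to be far richer.)
fn resolve_relative(expr: &str, ingested_at: EpochDay) -> Option<EpochDay> {
    match expr {
        "yesterday" => Some(ingested_at - 1),
        "last week" => Some(ingested_at - 7),
        "two weeks ago" => Some(ingested_at - 14),
        // Seasonal phrases are genuinely fuzzy; a real resolver would pick a
        // date range, not a point. This sketch just anchors roughly a year back.
        "last summer" => Some(ingested_at - 365),
        _ => None,
    }
}

fn main() {
    let ingested_at: EpochDay = 20_000; // hypothetical ingestion day
    assert_eq!(resolve_relative("two weeks ago", ingested_at), Some(19_986));
    assert_eq!(resolve_relative("next month", ingested_at), None);
    println!("ok");
}
```

The key property is that every stored fact carries an absolute timestamp, so downstream temporal queries never need to re-interpret relative language.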

Append-only relationship ledger

All changes are recorded as new entries. Nothing is overwritten. Each record captures what changed, when it was observed, when it was true in the world, and why it changed. Current state is derived from the latest valid record. Historical state is computed via timestamp filtering. This effectively turns memory into an event-sourced system — the full ledger is not a snapshot, but the complete evolution of each relationship.
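The derivation of current and historical state can be sketched as a filter over the ledger: restrict to entries whose validity had begun by the query time, then take the latest. The record fields below are illustrative, not the production schema.

```rust
/// One append-only ledger entry. Nothing is ever overwritten;
/// a change in the world appends a new entry. (Fields are a sketch,
/// not the production schema.)
#[derive(Clone, Debug, PartialEq)]
struct LedgerEntry {
    subject: &'static str,
    predicate: &'static str,
    object: &'static str,
    valid_from: i64,  // when the fact became true in the world (epoch days)
    observed_at: i64, // when the system learned it
}

/// State "as of" a point in time: the latest entry whose validity had begun.
/// Passing the current time yields current state; passing a past time
/// yields historical state via the same query.
fn state_as_of<'a>(
    ledger: &'a [LedgerEntry],
    subject: &str,
    predicate: &str,
    at: i64,
) -> Option<&'a LedgerEntry> {
    ledger
        .iter()
        .filter(|e| e.subject == subject && e.predicate == predicate && e.valid_from <= at)
        .max_by_key(|e| e.valid_from)
}

fn main() {
    let ledger = vec![
        LedgerEntry { subject: "ana", predicate: "employer", object: "AcmeCo",  valid_from: 100, observed_at: 105 },
        LedgerEntry { subject: "ana", predicate: "employer", object: "Initech", valid_from: 400, observed_at: 401 },
    ];
    // Historical query: employer at day 200 is still the earlier record.
    assert_eq!(state_as_of(&ledger, "ana", "employer", 200).unwrap().object, "AcmeCo");
    // Current query: the latest valid record wins.
    assert_eq!(state_as_of(&ledger, "ana", "employer", 500).unwrap().object, "Initech");
    println!("ok");
}
```

Because both queries run over the same append-only data, no update path can destroy history: the old employer record remains available for temporal questions after the new one arrives.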

Automatic consolidation

A background process aggregates low-level facts into higher-level observations — repeated complaints about layovers become "prefers direct flights." Observations are versioned: new evidence creates updated versions while preserving prior ones.
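A simplified version of the consolidation trigger can be sketched as an evidence count with versioning. The real step that abstracts repeated facts into a higher-level statement is generative; this sketch only surfaces the dominant repeated fact once a threshold is crossed. The threshold, types, and field names are assumptions.

```rust
use std::collections::HashMap;

/// A versioned observation. New evidence creates a new version;
/// prior versions are preserved (elsewhere), never overwritten.
#[derive(Debug, PartialEq)]
struct Observation {
    statement: String,
    version: u32,
    evidence_count: usize,
}

/// Promote the most-repeated low-level fact into an observation once it
/// reaches `threshold` occurrences. (Sketch: the production system is
/// assumed to summarize evidence generatively rather than echo the fact.)
fn consolidate(facts: &[&str], threshold: usize, prior: Option<&Observation>) -> Option<Observation> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for f in facts {
        *counts.entry(*f).or_insert(0) += 1;
    }
    let (fact, count) = counts.into_iter().max_by_key(|&(_, c)| c)?;
    if count < threshold {
        return None;
    }
    let version = prior.map_or(1, |o| o.version + 1);
    Some(Observation { statement: fact.to_string(), version, evidence_count: count })
}

fn main() {
    let facts = ["dislikes layovers", "dislikes layovers", "booked aisle seat", "dislikes layovers"];
    let v1 = consolidate(&facts, 3, None).unwrap();
    assert_eq!((v1.version, v1.evidence_count), (1, 3));
    // New evidence re-runs consolidation and yields version 2; v1 is kept.
    let v2 = consolidate(&facts, 3, Some(&v1)).unwrap();
    assert_eq!(v2.version, 2);
    println!("ok");
}
```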

Four-way parallel retrieval

Retrieval is executed as four parallel strategies: semantic search (pgvector, HNSW), lexical search (tsvector with entity and temporal tokens), meta-path graph traversal (Forward Push propagation with internal rank fusion), and temporal filtering (date-constrained BFS spreading). Results are fused via Reciprocal Rank Fusion and reranked using a local cross-encoder on ONNX Runtime.
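The fusion step can be sketched directly: each strategy contributes a ranked list of fact IDs, and Reciprocal Rank Fusion scores every ID by summing 1/(k + rank) across the lists it appears in. The constant k = 60 is the value from the original RRF formulation; the system's actual constant and ID types are assumptions here.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists of fact IDs.
/// Each list contributes 1 / (k + rank) per ID, with ranks starting at 1;
/// IDs appearing high in many lists accumulate the largest scores.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i as f64) + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> =
        scores.into_iter().map(|(id, s)| (id.to_string(), s)).collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Hypothetical outputs of the four strategies (fact IDs, best first).
    let semantic = vec!["f1", "f2", "f3"];
    let lexical = vec!["f2", "f1"];
    let graph = vec!["f2"];
    let temporal = vec!["f3", "f2"];
    let fused = rrf_fuse(&[semantic, lexical, graph, temporal], 60.0);
    // "f2" appears in all four lists, so it fuses to the top.
    assert_eq!(fused[0].0, "f2");
    println!("ok");
}
```

In the full pipeline the fused list would then be passed to the cross-encoder for reranking; RRF only needs ranks, not comparable scores, which is what makes it a good fit for fusing heterogeneous strategies.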

Latency

The embedding API call is the only external dependency and the dominant source of latency. With a local embedding model, total recall latency drops to approximately 30–70ms.

Stage                              p50    p95
Query embedding (external API)      30     50
4-way retrieval (PostgreSQL)        15     40
Cross-encoder reranking (ONNX)      10     25
Total                               60    120

Recall latency (milliseconds)

This places retrieval well within the latency budget of interactive agents.

Evaluation

We evaluate on LongMemEval-s, a benchmark for long-term conversational memory consisting of 500 questions across six categories, each backed by a conversation history averaging 115k+ tokens. We use GPT-5.2 for extraction and answer generation, and text-embedding-3-small (384d) for embeddings.

System         Overall   User   Asst   Pref  Update  Temporal  Multi
Ciresk            84.8   98.6   85.7   76.7    92.3      89.5   69.9
HydraDB           84.0  100.0   98.2   89.7    91.0      83.5   64.6
Full-context*     60.2   81.4   94.6   20.0    78.2      45.1   44.3

LongMemEval-s accuracy (%) — GPT-5.2 backbone · *Full-context baseline uses GPT-4o

The system achieves 84.8% overall accuracy, comparable to HydraDB (84.0%) — a system backed by a dedicated graph database and vector store — while requiring significantly less infrastructure. The strongest gains appear in temporal reasoning (89.5% vs 83.5%) and knowledge updates (92.3% vs 91.0%), the categories where the append-only ledger and temporal grounding contribute most directly. HydraDB retains an advantage on preference recall and assistant-generated content. Both systems substantially outperform the full-context baseline. Preference inference remains the weakest signal, particularly under sparse interaction histories, where repeated evidence is limited.

Discussion

Most systems that reach this level of accuracy rely on multiple specialized databases, message queues, and cross-service coordination. The operational cost of that architecture is rarely discussed, but it is substantial, and it compounds with every additional product that needs memory.

This system demonstrates that competitive long-term memory is achievable with a single binary and a single database — without sacrificing retrieval quality or recall speed. The architecture eliminates cross-service coordination entirely, reducing both deployment friction and failure surface. Memory does not require a distributed system — only a temporal one.

Availability

The system is available as a managed API and as a self-hosted binary. We provide evaluation traces and latency breakdowns on request. Contact hello@ciresk.com.