
Benchmarks

Honest numbers, including the unfavorable ones.

Publishing numbers is better than not publishing them, even when some are below state of the art. This page is the current published run. It is honest about what we measure, what we don't, and where the comparisons are and aren't fair.

Run context

Benchmark: LongMemEval (ICLR 2025), oracle split, 500 instances
Retriever: semantic_search() with BAAI/bge-small-en-v1.5 (384-dim, asymmetric with BGE query prefix)
Indexing: Each LongMemEval session flattened to user-turn text, embedded as one memory. No LLM-based extraction.
Hardware: Mac M2 Pro (16 GB), Linux i7-13700K (62 GB + RTX 2080 Ti), Windows Ryzen 7 5800X (64 GB + RX 6600 XT). See hardware table below.
Date: 2026-05-31

About the oracle split

The LongMemEval oracle split only provides the evidence sessions for each question. There is no filler noise. Every instance's haystack equals its answer-session set (verified: 500 of 500). Maximum sessions per instance is 6; median is 2; 176 of 500 instances have exactly 1 session.

That structure means recall_any@10 = 1.00 is structurally guaranteed for any retriever that returns all sessions — with k=10 and a maximum corpus of 6, you cannot miss. The same applies to nDCG@10.

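The number that needs a careful reading is recall_all@1 = 0.352. A top-1 result holds a single session, so recall_all@1 can succeed only on the 176 single-session instances, and 176 / 500 = 0.352; the 324 multi-session instances cannot satisfy it by construction. That makes it structurally bounded as well, not a ranking signal. The hard retrieval challenge, filler noise, lives in the LongMemEval-S and -M splits. The retrieval-only LongMemEval-S run is reported further down this page; end-to-end scoring on those splits requires pinning a local generation model and is the priority follow-up.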
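The structural bounds described above can be checked with a few lines of code. This is a minimal sketch that assumes the usual binary, per-instance definitions of recall_any@k and recall_all@k; the function names and session ids are illustrative and not taken from the Potluck bench code.

```python
from typing import Sequence, Set


def recall_any_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if at least one gold session appears in the top k, else 0.0."""
    return 1.0 if gold & set(ranked[:k]) else 0.0


def recall_all_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 only if every gold session appears in the top k, else 0.0."""
    return 1.0 if gold <= set(ranked[:k]) else 0.0


# Oracle-style instance: the haystack is exactly the gold set,
# so any ordering of it is a valid retriever output.
ranked = ["s2", "s1"]
gold = {"s1", "s2"}

any_at_10 = recall_any_at_k(ranked, gold, 10)   # cannot miss: k exceeds corpus size
all_at_10 = recall_all_at_k(ranked, gold, 10)   # same bound
all_at_1 = recall_all_at_k(ranked, gold, 1)     # two gold sessions cannot fit in one slot
single_all_at_1 = recall_all_at_k(["s1"], {"s1"}, 1)  # single-session instance
```

Under these definitions, recall_all@1 on the oracle split can only reflect how many instances have exactly one session, whatever the embedder does.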

Accuracy — LongMemEval oracle

Embedder behavior is identical across hardware, so the accuracy numbers are the same on all three machines. The structurally-bounded scores are listed for transparency, not as a flex.

LongMemEval oracle accuracy results for the Potluck memory retriever
| Metric | Value | Notes |
| --- | --- | --- |
| recall_any@10 | 1.00 | Structurally guaranteed on the oracle split (see note) |
| recall_all@10 | 1.00 | Structurally guaranteed on the oracle split |
| recall_all@1 | 0.352 | Equals 176 / 500, the single-session share: a top-1 result holds one session, so the 324 multi-session instances cannot satisfy it |
| nDCG@10 | 1.00 | Same structural bound as recall_any@10 |
| Errors | 0 / 500 on every machine | Identical accuracy across all three runs |

Latency — by machine

Same benchmark, same 500 instances, run end-to-end on three different boxes. The Linux row uses the RTX 2080 Ti through CUDA; the Windows row is CPU fallback with the RX 6600 XT idle. See the gaps section for the history of both.

| Machine | Spec / backend | p50 | p95 | Wall time |
| --- | --- | --- | --- | --- |
| Mac M2 Pro | 16 GB unified memory · macOS · Metal backend | 10.6 ms | 35.0 ms | 44.8 s |
| Linux media server | i7-13700K · 62 GB · RTX 2080 Ti · CUDA 13.0 (driver 580) | 17.8 ms | 20.8 ms | 64.6 s |
| Windows workstation | Ryzen 7 5800X · 64 GB · RX 6600 XT (CPU fallback; ROCm not configured) | 18.9 ms | 21.7 ms | 141.4 s |

Cross-machine peer retrieval (measured)

The Mac issues a memory query. The Tailscale-managed WireGuard mesh routes it to a household peer (Linux or Windows). The peer's sidecar runs semantic_search against its own local bge-small embedding store. The result returns over the same tunnel. End-to-end latency below.

The peer endpoint is POST /peer/memory/search, gated behind POTLUCK_PEER_MEMORY_ENABLED=1 on the machine being queried (opt-in, default off). Trust model: mesh-IP reachability is the credential — the peer_access_middleware on the sidecar already rejects non-loopback callers from /memory and all other strictly-local paths; /peer/* is the only surface peers can reach, and within that surface the new endpoint is the only memory path.
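The gating rules in the paragraph above can be sketched as a single pure function. This is an illustration under stated assumptions, not the actual peer_access_middleware: the function name is hypothetical, only IPv4 mesh addresses are handled, and treating every other /peer/* route as reachable by peers is a guess. The one outside fact used is that Tailscale assigns IPv4 mesh addresses from the 100.64.0.0/10 range.

```python
import ipaddress
from typing import Mapping

# Tailscale hands out IPv4 mesh addresses from the CGNAT range.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")
PEER_MEMORY_PATH = "/peer/memory/search"
PEER_MEMORY_FLAG = "POTLUCK_PEER_MEMORY_ENABLED"


def is_request_allowed(path: str, remote_ip: str, env: Mapping[str, str]) -> bool:
    """Hypothetical access check mirroring the rules described in the text."""
    # The peer memory endpoint is opt-in and off by default.
    if path == PEER_MEMORY_PATH and env.get(PEER_MEMORY_FLAG) != "1":
        return False
    addr = ipaddress.ip_address(remote_ip)
    # Loopback callers keep access to strictly-local paths such as /memory.
    if addr.is_loopback:
        return True
    # Everyone else must come from the mesh and stay inside /peer/*.
    return addr in TAILNET_V4 and path.startswith("/peer/")
```

For example, a mesh caller at 100.101.102.103 would be refused on /memory regardless of the flag, and admitted on /peer/memory/search only when the queried machine has set POTLUCK_PEER_MEMORY_ENABLED=1.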

30 samples per cell, warm embedder. Tailscale-managed WireGuard between peers. The baseline row uses an essentially empty store; the peer rows are populated with the indicated number of synthetic memories so the retriever has real work to do.

| Path | p50 | p95 | min |
| --- | --- | --- | --- |
| Mac local (baseline, empty query) | 14.8 ms | 18.3 ms | 7.6 ms |
| Mac → Linux peer (10 memories on peer) | 35.9 ms | 40.7 ms | 18.1 ms |
| Mac → Linux peer (100 memories on peer) | 36.1 ms | 41.2 ms | 19.1 ms |
| Mac → Linux peer (500 memories on peer) | 36.3 ms | 46.0 ms | 20.9 ms |
| Mac → Windows peer (10 memories on peer) | 43.7 ms | 50.7 ms | 39.7 ms |
| Mac → Windows peer (100 memories on peer) | 44.9 ms | 55.6 ms | 39.7 ms |
| Mac → Windows peer (500 memories on peer) | 49.3 ms | 64.3 ms | 45.5 ms |

Linux peer p50 stays remarkably flat at ~36 ms as the store grows from 10 to 500 memories — the embedder cost dominates and the linear scan over a few hundred vectors is negligible. Windows peer p50 runs ~8–13 ms higher (~44 ms at small stores, ~49 ms at 500) because the embedder runs on CPU rather than CUDA. Cross-machine overhead vs Mac local: ~21 ms to Linux, ~30 ms to Windows. Day-to-day variance is real: across three runs spanning evening and morning, the same Linux probe returned p50s between 35 ms and 45 ms, tracking Tailscale path conditions. Even at the worst measurement on file, cross-machine retrieval feels instant for any interactive workflow. Where the linear scan starts to matter is characterized below.
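For readers who want to reproduce the p50 / p95 / min columns from raw timings, here is a minimal sketch. It assumes the nearest-rank percentile definition; the page does not say which percentile method the bench uses, and the sample values below are made up.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Collapse one table cell's worth of latency samples into its three columns."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "min": min(samples_ms),
    }


# 30 illustrative samples (21.0 ms .. 50.0 ms), matching the 30-per-cell sample count.
samples = [float(ms) for ms in range(21, 51)]
summary = summarize(samples)
```

With 30 samples, a nearest-rank p95 is the 29th-fastest value, so it is decided by the two slowest calls in the cell. That is one reason tail numbers move more than medians from run to run.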

Scaling envelope: where the linear scan starts to bite

The retriever is a brute-force cosine scan over every embedding in the store — no ANN index. For typical stores (hundreds to a couple thousand memories) this is fine; the embedder cost dominates and the scan is essentially free. The question is where that stops being true. Same Linux peer, same query, populated stepwise at 1K, 5K, and 10K synthetic memories.

| Path | p50 | p95 | min |
| --- | --- | --- | --- |
| Mac → Linux peer (1,000 memories on peer) | 42.0 ms | 51.1 ms | 21.3 ms |
| Mac → Linux peer (5,000 memories on peer) | 40.6 ms | 116.5 ms | 32.0 ms |
| Mac → Linux peer (10,000 memories on peer) | 57.6 ms | 131.9 ms | 47.2 ms |

p50 holds at ~40 ms through 5K, then jumps ~15 ms by 10K — the embedder still dominates median latency, but the scan is now contributing visibly. p95 is where the linear cost shows up first: ~50 ms at 1K doubles to ~115 ms at 5K and ~130 ms at 10K, as the tail samples catch the scan doing real work. ANN indexing becomes worth prioritizing somewhere between 5K and 10K memories per user. Below that, the brute-force scan is the right choice — it's simpler, deterministic, and free.
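To make the cost model concrete, a brute-force cosine scan looks roughly like the sketch below. It is written in pure Python for readability and is not the Potluck implementation; a real store would hold 384-dimensional bge-small vectors and would likely use a vectorized library, and the function names here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_scan(
    query: Sequence[float], store: Sequence[Sequence[float]], k: int = 10
) -> List[Tuple[float, int]]:
    """Score every stored vector against the query and return the top k as (score, index)."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(store)]
    # Highest score first; ties broken by index so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Tiny 3-dimensional store standing in for real embeddings.
store = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
top = linear_scan([1.0, 0.1, 0.0], store, k=2)
top_ids = [idx for _, idx in top]
```

Every query touches every vector, so the work grows linearly with the number of memories times the embedding dimension. That linear term is what surfaces in the p95 column once the store reaches the thousands.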

Correctness: recall preserved across the mesh

A separate probe verifies the architecture preserves retrieval accuracy. The peer is populated with 10 known memories, each semantically distinct, then queried 10 times from the Mac via /peer/memory/search with the natural-language question matching each memory. Expected behavior: top-1 is the right memory every time.

| Metric | Value | Note |
| --- | --- | --- |
| recall@1 | 1.00 | Top-1 hit on every query. The expected memory was always the first result. |
| recall@3 | 1.00 | All 10 queries returned the right memory in their top 3. |
| recall@10 | 1.00 | No misses across 10 distinct semantic queries. |

This is the expected behavior given that semantic_search is deterministic and the wire transports JSON. The point of this probe is the negative: confirming the architecture introduces no silent corruption, no truncation, no surprise reordering.

The correctness probe was run independently against both the Linux and Windows peers. Both returned recall@1 = recall@3 = recall@10 = 1.00 across 10 distinct semantic queries, confirming the architecture is symmetric across the mesh: no wire-path corruption, no platform-specific retrieval drift.

By question type (recall_any@10)

Same structural bound as the headline — included for completeness, not as a flex.

| Question type | n | recall_any@10 |
| --- | --- | --- |
| knowledge-update | 78 | 1.00 |
| multi-session | 133 | 1.00 |
| single-session-assistant | 56 | 1.00 |
| single-session-preference | 30 | 1.00 |
| single-session-user | 70 | 1.00 |
| temporal-reasoning | 133 | 1.00 |

Internal Eval B — Potluck custom corpus

42 memories, 25 queries. Exercises the actual product retrieval path (RRF keyword-semantic fusion, Tier-0 pinned injection, cross-project isolation). 2026-05-28 baseline; will be re-published with each release.

| Metric | Value |
| --- | --- |
| recall@5 (keyword only) | 0.48 |
| recall@5 (RRF, default) | 0.72 |
| Lift from RRF | +0.24 |
| Cross-project leak rate | 0 / 25 queries |
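For context on the RRF rows, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below shows the standard formulation with the commonly used constant k = 60; whether Potluck uses that constant, or weights its keyword and semantic lists equally, is an assumption here, and the memory ids are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists with reciprocal rank fusion; best fused score first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score, then by id so ties resolve deterministically.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


keyword_hits = ["m3", "m1", "m7"]    # illustrative keyword ranking
semantic_hits = ["m1", "m4", "m3"]   # illustrative semantic ranking
fused = rrf_fuse([keyword_hits, semantic_hits])
```

An item that appears near the top of both lists (m1 here) outranks one that tops a single list, which is the mechanism behind a lift over keyword-only retrieval.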

LongMemEval-S — the noise-heavy split

The oracle results above are structurally bounded. The S split is where retrieval actually has to work. Each of the 500 instances includes ~48 sessions (median; range 38–62), with the gold-evidence sessions (typically 1–2, at most 6 per the oracle statistics above) hidden among them. This is the split Zep, Graphiti, and other systems benchmark on for comparable accuracy numbers.

We ran the same retriever (BAAI/bge-small-en-v1.5, 33M params, plain semantic search — no graph extraction, no LLM-based memory extraction) against this split on the Mac M2 Pro.

| Metric | Value | Notes |
| --- | --- | --- |
| recall_any@10 | 0.980 | At least one gold session in top-10 across 500 instances |
| recall_all@10 | 0.938 | All gold sessions for an instance present in top-10 |
| nDCG@10 | 0.885 | Discounted-gain ranking quality |
| p50 latency | 12.9 ms | Per retrieval call; slight bump over oracle (48 sessions vs 3) |
| p95 latency | 39.8 ms | Tail expands modestly with corpus size |
| Errors | 0 / 500 | Wall time 369 s on Mac M2 Pro Metal |
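As a reading aid for the nDCG@10 row, here is a minimal sketch of nDCG with binary relevance, where a session is either gold or not. That LongMemEval scores relevance this way is an assumption, and the function names and session ids are illustrative.

```python
import math
from typing import Sequence, Set


def dcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Discounted cumulative gain: each gold hit at position p contributes 1 / log2(p + 1)."""
    return sum(
        1.0 / math.log2(pos + 1)
        for pos, session_id in enumerate(ranked[:k], start=1)
        if session_id in gold
    )


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """DCG divided by the DCG of an ideal ranking that puts all gold sessions first."""
    ideal_hits = min(len(gold), k)
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg_at_k(ranked, gold, k) / ideal if ideal else 0.0


gold = {"g1", "g2"}
perfect = ndcg_at_k(["g1", "g2", "x1"], gold, 10)        # gold sessions ranked first
late = ndcg_at_k(["x1", "g1", "x2", "g2"], gold, 10)     # gold sessions at positions 2 and 4
```

Both example rankings score 1.0 on recall_all@10, but the second loses roughly a third of its nDCG because the gold sessions sit lower. That sensitivity to position is why nDCG@10 can trail recall on a noisy split.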

By question type (recall_any@10)

| Question type | n | recall_any@10 |
| --- | --- | --- |
| knowledge-update | 78 | 1.000 |
| multi-session | 133 | 1.000 |
| single-session-assistant | 56 | 0.964 |
| single-session-preference | 30 | 0.933 |
| single-session-user | 70 | 0.986 |
| temporal-reasoning | 133 | 0.962 |

What this number is and what it isn’t.

Zep’s Graphiti paper reports ~0.87 recall@10 on LongMemEval-S using their full graph-extraction pipeline. Our number above is ~0.98 recall_any@10 on the same dataset using a 33M-parameter embedder and plain semantic search.

That delta is real but the comparison is partially apples-to-oranges. Zep’s published number may be end-to-end (retrieval + generation + an LLM judge scoring the answer). Ours is retrieval-only — we measure whether the right session ranks in top-10, not whether the answer comes out right. The honest framing: a small embedder gets most of the way there on the retrieval-layer task without graph extraction; what graph extraction adds beyond that is downstream generation context, not retrieval recall.

The other caveat: bge-small was trained on broad web data that may overlap with LongMemEval’s academic distribution. Our internal Eval B is the more honest dogfood-distribution check.

Comparison to Mem0 / Letta / Zep

LongMemEval-S now provides a partial point of comparison (see the section above). Other published benchmarks remain non-comparable for the reasons below.

| System | Published | Comparable? | Why / why not |
| --- | --- | --- | --- |
| Zep / Graphiti | recall@10 ≈ 0.87 on LongMemEval-S (graph extraction) | Partial | Same dataset; our retrieval-only number is 0.98 recall_any@10 (see the S split section above). The methodology gap: Zep’s published number may be end-to-end (retrieval + generation + LLM judge); ours is retrieval-only. Apples-to-oranges to overall task accuracy, but the retrieval delta is real. |
| Mem0 | LOCOMO + custom benchmarks; no LongMemEval | No | No LongMemEval numbers published. |
| Letta / MemGPT | In-context window management benchmarks | No | Different problem shape: context-window management, not vector retrieval over a persistent store. |

Local model inference

Throughput across machines.

The retrieval workloads above were dispatch-bound. Inference at 7-8B parameters is the opposite: compute-bound, where GPUs actually earn their keep. We ran four models across the same three machines and the story is materially different.

Run context

Runtime: llama-cpp-python 0.3.23 / 0.3.24 (built from source on Linux + Windows for CUDA / Vulkan)
Models: Phi-3-mini 3.8B, Qwen 2.5 7B, Llama 3.1 8B (all Q4_K_M); Llama 3.1 70B (Q3_K_S). Pinned by SHA-256.
Prompts: 10 fixed prompts (2 short Q&A, 3 code-gen, 3 summarization, 2 long-context). Seed 42, temp 0.7, top_p 0.9. 3 runs per cell, report median.
Hardware: Same three machines as above. Backends switch per platform: Metal / CUDA / Vulkan.
Date: 2026-06-01

Adapter footnote: llama-cpp-python prefix-matches the KV cache across calls. Without an explicit model.reset() between runs, TTFT on run 2+ measures cache hits, not cold prompt evaluation. Documented here because it’s not in the upstream API docs and is a real footgun for benchmark writers.
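The effect described in the footnote can be shown with a toy model. The class below is a stand-in written for this page, not llama-cpp-python code: it only imitates the idea of reusing a matching prompt prefix, and counts evaluated tokens instead of measuring time. The only name borrowed from the footnote is reset().

```python
from typing import List


class PrefixCachingStub:
    """Toy model that remembers the last prompt, like a prefix-matched KV cache."""

    def __init__(self) -> None:
        self._cached: List[int] = []

    def reset(self) -> None:
        """Forget the cached prompt so the next call starts cold."""
        self._cached = []

    def eval_prompt(self, tokens: List[int]) -> int:
        """Return how many tokens had to be evaluated rather than served from cache."""
        shared = 0
        for old, new in zip(self._cached, tokens):
            if old != new:
                break
            shared += 1
        self._cached = list(tokens)
        return len(tokens) - shared


def evaluated_per_run(
    model: PrefixCachingStub, tokens: List[int], runs: int, reset_between: bool
) -> List[int]:
    """Repeat the same prompt and record the prompt-eval work done on each run."""
    counts = []
    for _ in range(runs):
        if reset_between:
            model.reset()
        counts.append(model.eval_prompt(tokens))
    return counts


prompt = list(range(100))  # a 100-token prompt
warm = evaluated_per_run(PrefixCachingStub(), prompt, runs=3, reset_between=False)
cold = evaluated_per_run(PrefixCachingStub(), prompt, runs=3, reset_between=True)
```

In the toy, only the first of the three un-reset runs does any prompt evaluation. A TTFT median taken over such runs would mostly describe cache hits, which is the trap the footnote warns about.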

Machines

Mac: M2 Pro · Metal · 16 GB unified memory
Linux: i7-13700K + RTX 2080 Ti · CUDA 13 · 11 GB VRAM, 62 GB RAM
Windows: Ryzen 7 5800X + RX 6600 XT · Vulkan · 8 GB VRAM, 64 GB RAM

Generation throughput — tokens per second

Median across the 10-prompt suite. Higher is better. The story changes from row to row and column to column — that is the finding.

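| Model | Mac (tok/s) | Linux (tok/s) | Windows (tok/s) | Notes |
| --- | --- | --- | --- | --- |
| Phi-3-mini 3.8B Q4_K_M | 19.1 | 59.9 | 16.4 | |
| Qwen 2.5 7B Q4_K_M | 12.4 | 43.0 | 20.4 | |
| Llama 3.1 8B Q4_K_M | 11.6 | 40.1 | 20.9 | |
| Llama 3.1 70B Q3_K_S | won’t fit | 1.3 | won’t fit | Linux runs with 25 of 81 layers on GPU + 56 on CPU. TTFT 3.78 s. Unusable for chat. |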

Three findings worth surfacing

1. There is no universal winner.

The same RX 6600 XT that beats the Mac M2 Pro by 65–80% on Qwen 7B and Llama 8B loses to it by about 14% on Phi-3 mini 3.8B. Our reading is that Apple Silicon’s unified-memory design favors the smallest model, while discrete GPU VRAM bandwidth wins once models cross 7B. The right hardware depends on the model size, not the brand.

2. AMD/Vulkan is competitive with NVIDIA/CUDA at this scale.

The RTX 2080 Ti delivers ~2× the throughput of the RX 6600 XT on 7-8B models, but it also costs ~2× as much on the used market. Price-per-token, AMD’s consumer-tier card is not behind. The “you need NVIDIA for local inference” reflex is a generation out of date for this model class.

3. 70B is the local-only ceiling on this hardware tier.

Llama 3.1 70B at Q3_K_S won’t load in 16 GB Mac unified memory or 8 GB AMD VRAM. On the Linux box with 62 GB system RAM + 11 GB VRAM, it runs with 25 of 81 layers offloaded to GPU and the rest on CPU. The result is 1.3 tok/s with 3.8 s time-to-first-token. Technically possible. Unusable for chat. The gap between consumer-hardware ceiling and a usable 70B is ~10–20× — and a single consumer card doesn’t close it. Pooled compute does.

And 70B isn’t even the local-model frontier in 2026. DeepSeek V3 and R1 (671B MoE, ~37B active) at Unsloth’s most aggressive Q1.58 quant need ~131 GB of unified memory or 192 GB+ of system RAM — a hardware tier above any of the three machines benched here. People do run these models on consumer hardware; they run them on Mac Studio M3 Ultras and 192 GB+ DDR5 workstations. The cooperative-compute argument isn’t abstract. It’s whichever model your hardware can’t fit.
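The percentages and multiples quoted in findings 1 and 2 follow directly from the throughput table. A short sketch that recomputes them, using only the published tokens-per-second medians (the dictionary layout and helper name are illustrative):

```python
from typing import Dict

# Median generation throughput (tok/s) copied from the table above.
TOK_PER_S: Dict[str, Dict[str, float]] = {
    "Phi-3-mini 3.8B": {"mac": 19.1, "linux": 59.9, "windows": 16.4},
    "Qwen 2.5 7B": {"mac": 12.4, "linux": 43.0, "windows": 20.4},
    "Llama 3.1 8B": {"mac": 11.6, "linux": 40.1, "windows": 20.9},
}


def ratio(model: str, machine: str, baseline: str) -> float:
    """Throughput of `machine` relative to `baseline` for one model, rounded to 2 places."""
    row = TOK_PER_S[model]
    return round(row[machine] / row[baseline], 2)


# RX 6600 XT (Windows) relative to the Mac M2 Pro.
windows_vs_mac = {model: ratio(model, "windows", "mac") for model in TOK_PER_S}
# RTX 2080 Ti (Linux) relative to the RX 6600 XT (Windows).
linux_vs_windows = {model: ratio(model, "linux", "windows") for model in TOK_PER_S}
```

The Windows-to-Mac ratios come out at 0.86, 1.65, and 1.80, which is the crossover in finding 1. The Linux-to-Windows ratios are 2.11 and 1.92 on the 7-8B models, matching the ~2× in finding 2, and 3.65 on Phi-3 mini.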

What the memory bench measures and what it doesn't

It measures:

  • Whether the bge-small embedder can rank evidence sessions correctly when given only those sessions.
  • Per-call retrieval latency on Apple Silicon (the reference hardware tier).
  • Absence of crashes or OOMs on the full 500-instance suite.

It does not measure:

  • Retrieval quality against filler noise. The oracle run cannot show that; the LongMemEval-S section above covers it for the S split.
  • End-to-end answer accuracy (needs an LLM judge and a generation model).
  • The full get_relevant_memories() path with RRF and ambient-preference injection. Oracle split has no ambient preferences, so this would add noise without measuring anything.
  • Latency on hardware beyond the three machines in the latency table above.
  • Memory store scalability past ~10,000 entries (no ANN index yet).

Known gaps to fix before next publish

  1. MCP latency bench: p50 / p95 from memory_search tool-call to result, against a 500-memory store.
  2. NVIDIA on Linux via CUDA: enabled 2026-06-01 (driver 580 / CUDA 13.0 on the RTX 2080 Ti). Real but modest: p50 22.2 → 17.8 ms (~20%), p95 34.6 → 20.8 ms (~40%). Not an order-of-magnitude win; it is the same dispatch-bound dynamic as AMD, but less severe since CUDA dispatch overhead is lower. Wall time actually went up (35.6 → 64.6 s) because GPU init / context setup is paid again on every cold per-instance run; that overhead disappears in a long-lived production sidecar.
  3. AMD on Windows via DirectML — investigated 2026-06-01 (RX 6600 XT). torch-directml runs the bge-small embedder on the GPU with full recall preserved, but p50 is break-even with CPU (17.9 ms vs 18.9 ms). The 33M-parameter model at batch=1 is dispatch-bound, not compute-bound: per-call kernel-dispatch and host↔device transfer roughly equal the CPU forward pass. ROCm/HIP is N/A — gfx1032 is dropped from the current HIP SDK. AMD will become useful again at the inference benchmark, where 7-8B-param models are compute-bound.
  4. Cross-machine peer retrieval shipped 2026-06-02 — measured latency + correctness above. Open follow-ups: (a) signed-nonce challenge replacing mesh-IP-as-credential in v0.5, (b) per-project sharing flags instead of all-or-nothing per-machine opt-in, (c) full latency-at-scale measurements against Windows + bigger store sizes (1K, 5K, 10K memories) to characterize the linear-scan ceiling, (d) graceful federation — a peer-aware retriever that automatically merges results from multiple opted-in peers.
  5. Approximate-nearest-neighbor indexing. The current linear embedding scan shows up in tail latency from ~5,000 memories (p95 ~115 ms) and in median latency by ~10,000; see the scaling envelope above.
  6. Publish the Eval B corpus for full methodology transparency.
  7. Inference benchmark next runs: a dedicated long-context bench (8K–32K input tokens), batched-inference throughput, multi-user concurrent serving on the same machine, and Llama 70B at quants we haven’t tested (Q4_K_M, Q5_K_M). The cooperative-compute story needs the multi-tenant numbers eventually.
  8. Inference hardware extensions: 24 GB + 48 GB cards (e.g. RTX 3090, 4090, A6000) to find the ceiling where a single consumer card runs 70B usably. Apple Silicon higher tiers (M3 Max, M4 Pro) to know the real Mac ceiling.

How to reproduce

make bench from the repo root runs the full 500-instance suite. Results land in bench/results/<timestamp>/. The embedder (BAAI/bge-small-en-v1.5) downloads on first run from HuggingFace (~33 MB). No OpenAI API key, no GPU required.

LongMemEval data is used under the original MIT license. Vendored at bench/longmemeval/data/longmemeval_oracle.json. Source: github.com/xiaowu0162/LongMemEval.

For the inference bench: make bench-inference MODEL=bench/models/<name>.gguf. Models are pinned by SHA-256 in bench/models/SHA256SUMS.txt and downloaded on first run via the verification helper. Phi-3-mini-3.8B, Qwen-2.5-7B, and Llama-3.1-8B are ~2-5 GB each at Q4_K_M; Llama-3.1-70B-Q3_K_S is ~29 GB.
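Pinning by SHA-256 amounts to hashing the downloaded file and comparing against a recorded digest. The sketch below shows that check under the assumption that bench/models/SHA256SUMS.txt uses the common `sha256sum` line format (`<hex digest>  <filename>`); the function names are hypothetical and this is not the repo's verification helper.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB model files never load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text: str) -> Dict[str, str]:
    """Parse sha256sum-style lines into {filename: lowercase hex digest}."""
    sums: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*' on the filename.
        sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def verify(model_path: Path, sums_path: Path) -> bool:
    """True only if the model file is listed and its digest matches the pinned value."""
    expected = parse_sums(sums_path.read_text()).get(model_path.name)
    return expected is not None and sha256_of(model_path) == expected
```

A call such as `verify(Path("bench/models/<name>.gguf"), Path("bench/models/SHA256SUMS.txt"))` would return False both for a corrupted download and for a file that is missing from the sums list, so an unpinned model never passes silently.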

Refresh cadence

Re-run before each release tag. This page is updated by hand after a human reviews the numbers — no automated publishing.
