Lab report · Measured 2026-06

Honest numbers, including the unfavorable ones.

Publishing numbers is better than not publishing them, even when some are below state of the art. This page is the current published run. It is honest about what we measure, what we don't, and where the comparisons are and aren't fair.

10.6 ms

P50 · Mac local retrieval

0 / 500

Errors · oracle split, all machines

Platforms benchmarked

0.352

recall_all@1 · the meaningful number

Run context

Benchmark: LongMemEval (ICLR 2025), oracle split, 500 instances
Retriever: semantic_search() with BAAI/bge-small-en-v1.5 (384-dim, asymmetric with BGE query prefix)
Indexing: Each LongMemEval session flattened to user-turn text, embedded as one memory. No LLM-based extraction.
Hardware: Mac M2 Pro (16 GB), Linux i7-13700K (62 GB + RTX 2080 Ti), Windows Ryzen 7 5800X (64 GB + RX 6600 XT). See hardware table below.
Date: 2026-05-31

About the oracle split

The LongMemEval oracle split only provides the evidence sessions for each question. There is no filler noise. Every instance's haystack equals its answer-session set (verified: 500 of 500). Maximum sessions per instance is 6; median is 2; 176 of 500 instances have exactly 1 session.

That structure means recall_any@10 = 1.00 is structurally guaranteed for any retriever that returns all sessions, with k=10 and a maximum corpus of 6, you cannot miss. The same applies to nDCG@10.

The meaningful number is recall_all@1 = 0.352: across the 324 multi-session instances, the embedder ranks all required sessions at position 1 about 35% of the time. The hard retrieval challenge, filler noise, lives in the LongMemEval-S and -M splits, which require pinning a local generation model. That run is the priority follow-up.

Accuracy: LongMemEval oracle

Embedder behavior is identical across hardware, so the accuracy numbers are the same on all three machines. The structurally-bounded scores are listed for transparency, not as a flex.

Table I · LongMemEval oracle accuracyMeasured · 2026-05-31

LongMemEval oracle accuracy results for the Potluck memory retriever
Metric	Value	Notes
recall_any@10	1.00	Structurally guaranteed on the oracle split (see note)
recall_all@10	1.00	Structurally guaranteed on the oracle split
recall_all@1	0.352	Meaningful ranking signal across the 324 multi-session instances
nDCG@10	1.00	Same structural bound as recall_any@10
Errors	0 / 500 on every machine	Identical accuracy across all three runs

recall_any@10 = 1.00 and nDCG@10 = 1.00 are structurally guaranteed by the oracle split’s k=10 vs. max-6-session design, not a retrieval win · see “About the oracle split” above.

Latency: by machine

Same benchmark, same 500 instances, run end-to-end on three different boxes. Both x86 machines have GPUs available but unused, see the gaps section.

Table II · Latency by machineMeasured · 2026-05-31

Machine	Spec / backend	p50	p95	Wall time
Mac M2 Pro	16 GB unified memory · macOS · Metal backend	10.6 ms	35.0 ms	44.8 s
Linux media server	i7-13700K · 62 GB · RTX 2080 Ti · CUDA 13.0 (driver 580)	17.8 ms	20.8 ms	64.6 s
Windows workstation	Ryzen 7 5800X · 64 GB · RX 6600 XT (CPU fallback, ROCm not configured)	18.9 ms	21.7 ms	141.4 s

GPUs on both x86 machines sat idle this run · CUDA (Linux) and DirectML (Windows) follow-ups are logged under Known limits below.

Cross-machine peer retrieval (measured)

The Mac issues a memory query. The Tailscale-managed WireGuard mesh routes it to a household peer (Linux or Windows). The peer's sidecar runs semantic_search against its own local bge-small embedding store. The result returns over the same tunnel. End-to-end latency below.

The peer endpoint is POST /peer/memory/search, gated behind POTLUCK_PEER_MEMORY_ENABLED=1 on the machine being queried (opt-in, default off). Trust model: mesh-IP reachability is the credential. The peer_access_middleware on the sidecar already rejects non-loopback callers from /memory and all other strictly-local paths; /peer/* is the only surface peers can reach, and within that surface the new endpoint is the only memory path.

Table III · Cross-machine peer retrieval, baselineMeasured · 2026-06-02

Path	p50	p95	min
Mac local (baseline, empty query)	14.8 ms	18.3 ms	7.6 ms
Mac → Linux peer (10 memories on peer)	35.9 ms	40.7 ms	18.1 ms
Mac → Linux peer (100 memories on peer)	36.1 ms	41.2 ms	19.1 ms
Mac → Linux peer (500 memories on peer)	36.3 ms	46.0 ms	20.9 ms
Mac → Windows peer (10 memories on peer)	43.7 ms	50.7 ms	39.7 ms
Mac → Windows peer (100 memories on peer)	44.9 ms	55.6 ms	39.7 ms
Mac → Windows peer (500 memories on peer)	49.3 ms	64.3 ms	45.5 ms

30 samples per cell, warm embedder · Tailscale-managed WireGuard between peers · baseline row uses an empty store, peer rows populated with the indicated memory count.

Linux peer p50 stays remarkably flat at ~36 ms as the store grows from 10 to 500 memories, the embedder cost dominates and the linear scan over a few hundred vectors is negligible. Windows peer p50 runs ~8-13 ms higher (~44 ms at small stores, ~49 ms at 500) because the embedder runs on CPU rather than CUDA. Cross-machine overhead vs Mac local: ~21 ms to Linux, ~30 ms to Windows. Day-to-day variance is real: across three runs spanning evening and morning, the same Linux probe returned p50s between 35 ms and 45 ms, tracking Tailscale path conditions. Even at the worst measurement on file, cross-machine retrieval feels instant for any interactive workflow. Where the linear scan starts to matter is characterized below.

Scaling envelope: where the linear scan starts to bite

The retriever is a brute-force cosine scan over every embedding in the store, no ANN index. For typical stores (hundreds to a couple thousand memories) this is fine; the embedder cost dominates and the scan is essentially free. The question is where that stops being true.

Table IV · Mesh scaling envelopeMeasured · 2026-06-02

Path	p50	p95	min
Mac → Linux peer (1,000 memories on peer)	42.0 ms	51.1 ms	21.3 ms
Mac → Linux peer (5,000 memories on peer)	40.6 ms	116.5 ms	32.0 ms
Mac → Linux peer (10,000 memories on peer)	57.6 ms	131.9 ms	47.2 ms

Same Linux peer, same query · store populated stepwise at 1,000 / 5,000 / 10,000 synthetic memories.

p50 holds at ~40 ms through 5K, then jumps ~15 ms by 10K, the embedder still dominates median latency, but the scan is now contributing visibly. p95 is where the linear cost shows up first: ~50 ms at 1K doubles to ~115 ms at 5K and ~130 ms at 10K, as the tail samples catch the scan doing real work. ANN indexing becomes worth prioritizing somewhere between 5K and 10K memories per user. Below that, the brute-force scan is the right choice: it's simpler, deterministic, and free.

Correctness: recall preserved across the mesh

A separate probe verifies the architecture preserves retrieval accuracy. The peer is populated with 10 known memories, each semantically distinct, then queried 10 times from the Mac via /peer/memory/search with the natural-language question matching each memory. Expected behavior: top-1 is the right memory every time.

Table V · Mesh correctnessMeasured · 2026-06-02

Metric	Value	Note
recall@1	1.00	Top-1 hit on every query. The expected memory was always the first result.
recall@3	1.00	All 10 queries returned the right memory in their top 3.
recall@10	1.00	No misses across 10 distinct semantic queries.

Expected result: semantic_search is deterministic and the wire carries JSON, so a clean sweep confirms no silent corruption, no truncation, no reordering · probe repeated independently against Linux and Windows peers; both returned recall@1 = recall@3 = recall@10 = 1.00 across 10 distinct queries, confirming the architecture is symmetric across the mesh.

By question type (recall_any@10)

Table VI · Oracle recall by question typeMeasured · 2026-05-31

Question type	n	recall_any@10
knowledge-update	78	1.00
multi-session	133	1.00
single-session-assistant	56	1.00
single-session-preference	30	1.00
single-session-user	70	1.00
temporal-reasoning	133	1.00

Same structural bound as the oracle headline · included for completeness, not as a flex.

Internal Eval B: Potluck custom corpus

42 memories, 25 queries. Exercises the actual product retrieval path (RRF keyword-semantic fusion, Tier-0 pinned injection, cross-project isolation). 2026-05-28 baseline; will be re-published with each release.

Table VII · Eval B custom corpusMeasured · 2026-05-28

Metric	Value
recall@5 (keyword only)	0.48
recall@5 (RRF, default)	0.72
Lift from RRF	+0.24
Cross-project leak rate	0 / 25 queries

LongMemEval-S: the noise-heavy split

The oracle results above are structurally bounded. The S split is where retrieval actually has to work. Each of the 500 instances includes ~48 sessions (median; range 38-62), with the 1-2 gold-evidence sessions hidden among them. This is the split Zep, Graphiti, and other systems benchmark on for comparable accuracy numbers.

We ran the same retriever (BAAI/bge-small-en-v1.5, 33M params, plain semantic search, no graph extraction, no LLM-based memory extraction) against this split on the Mac M2 Pro.

Table VIII · LongMemEval-S retrieval quality

Metric	Value	Notes
recall_any@10	0.980	At least one gold session in top-10 across 500 instances
recall_all@10	0.938	All gold sessions for an instance present in top-10
nDCG@10	0.885	Discounted-gain ranking quality
p50 latency	12.9 ms	Per retrieval call; slight bump over oracle (48 sessions vs 3)
p95 latency	39.8 ms	Tail expands modestly with corpus size
Errors	0 / 500	Wall time 369 s on Mac M2 Pro Metal

By question type (recall_any@10)

Table IX · LongMemEval-S recall by question type

Question type	n	recall_any@10
knowledge-update	78	1.000
multi-session	133	1.000
single-session-assistant	56	0.964
single-session-preference	30	0.933
single-session-user	70	0.986
temporal-reasoning	133	0.962

What this number is and what it isn’t.

Zep’s Graphiti paper reports ~0.87 recall@10 on LongMemEval-S using their full graph-extraction pipeline. Our number above is ~0.98 recall_any@10 on the same dataset using a 33M-parameter embedder and plain semantic search.

That delta is real but the comparison is partially apples-to-oranges. Zep’s published number may be end-to-end (retrieval + generation + an LLM judge scoring the answer). Ours is retrieval-only, we measure whether the right session ranks in top-10, not whether the answer comes out right. The honest framing: a small embedder gets most of the way there on the retrieval-layer task without graph extraction; what graph extraction adds beyond that is downstream generation context, not retrieval recall.

The other caveat: bge-small was trained on broad web data that may overlap with LongMemEval’s academic distribution. Our internal Eval B is the more honest dogfood-distribution check.

Comparison to Mem0 / Letta / Zep

LongMemEval-S now provides a partial point of comparison (see the section above). Other published benchmarks remain non-comparable for the reasons below.

Table X · Comparability: Mem0, Letta, Zep

System	Published	Comparable?	Why / why not
Zep / Graphiti	recall@10 ≈ 0.87 on LongMemEval-S (graph extraction)	Partial	Same dataset, our retrieval-only number is 0.98 recall_any@10 (see S split section below). The methodology gap: Zep’s published number may be end-to-end (retrieval + generation + LLM judge); ours is retrieval-only. Apples-to-oranges to overall task accuracy, but the retrieval delta is real.
Mem0	LOCOMO + custom benchmarks; no LongMemEval	No	No LongMemEval numbers published.
Letta / MemGPT	In-context window management benchmarks	No	Different problem shape, context-window management, not vector retrieval over a persistent store.

Local model inference

Throughput across machines.

Same retrieval workloads were dispatch-bound. Inference at 7-8B parameters is the opposite, compute-bound, where GPUs actually earn their keep. We ran the same four models across the same three machines and the story is materially different.

Run context

Runtime: llama-cpp-python 0.3.23 / 0.3.24 (built from source on Linux + Windows for CUDA / Vulkan)
Models: Phi-3-mini 3.8B, Qwen 2.5 7B, Llama 3.1 8B (all Q4_K_M); Llama 3.1 70B (Q3_K_S). Pinned by SHA-256.
Prompts: 10 fixed prompts (2 short Q&A, 3 code-gen, 3 summarization, 2 long-context). Seed 42, temp 0.7, top_p 0.9. 3 runs per cell, report median.
Hardware: Same three machines as above. Backends switch per platform: Metal / CUDA / Vulkan.
Date: 2026-06-01

Machines

Mac: M2 Pro · Metal · 16 GB unified memory
Linux: i7-13700K + RTX 2080 Ti · CUDA 13 · 11 GB VRAM, 62 GB RAM
Windows: Ryzen 7 5800X + RX 6600 XT · Vulkan · 8 GB VRAM, 64 GB RAM

Generation throughput: tokens per second

Median across the 10-prompt suite. Higher is better. The story changes from row to row and column to column, that is the finding.

Table XI · Generation throughput, tokens/secMeasured · 2026-06-01

Model	Mac	Linux	Windows	Notes
Phi-3-mini 3.8B Q4_K_M	19.1	69.2	16.2
Qwen 2.5 7B Q4_K_M	12.4	45.9	15.9
Llama 3.1 8B Q4_K_M	11.6	43.0	2.1	Windows: the 8B exceeds the 6600 XT 8 GB VRAM at 8K context and spills to CPU.
Llama 3.1 70B Q3_K_S	won’t fit	2.4	won’t fit	Linux runs with about 20 of 80 layers on GPU, the rest on CPU, at a 2,048-token context. Unusable for chat.

Adapter footnote: llama-cpp-python prefix-matches the KV cache across calls. Without an explicit model.reset() between runs, TTFT on run 2+ measures cache hits, not cold prompt evaluation. Not in the upstream API docs, a real footgun for benchmark writers.

Three findings worth surfacing

1. There is no universal winner.

The RX 6600 XT edges the Mac M2 Pro on Qwen 7B (15.9 vs 12.4 tok/s, ~28% faster), trails it on the smaller Phi-3 mini 3.8B (16.2 vs 19.1), and collapses on Llama 8B (2.1 vs 11.6) once the model no longer fits the card’s 8 GB of VRAM and spills to CPU. Apple’s unified-memory cache wins the small model, the AMD card wins the 7B that fits its VRAM, and neither wins the 8B the same way. The right hardware depends on whether the model fits the card, not the brand.

2. AMD/Vulkan is viable up to its VRAM ceiling.

On Qwen 7B, the model both cards run cleanly, the RTX 2080 Ti delivers ~2.9× the throughput of the RX 6600 XT (45.9 vs 15.9 tok/s) for roughly 2× the used-market price, so cost per token is in the same range. The catch is capacity: the 2080 Ti’s 11 GB holds the 8B model, the 6600 XT’s 8 GB does not. The “you need NVIDIA for local inference” reflex is a generation out of date for models that fit, and VRAM size, not vendor, decides which models those are.

3. 70B is the local-only ceiling on this hardware tier.

Llama 3.1 70B at Q3_K_S won’t load in 16 GB Mac unified memory or 8 GB AMD VRAM. On the Linux box with 62 GB system RAM and 11 GB VRAM, it runs with about 20 of its 80 layers offloaded to the GPU and the rest on CPU, at a 2,048-token context. The result is 2.4 tok/s. Technically possible, unusable for chat. The gap between the consumer-hardware ceiling and a usable 70B is roughly 10 to 20×, and a single consumer card does not close it. Pooled compute does.

And 70B isn’t even the local-model frontier in 2026. DeepSeek V3 and R1 (671B MoE, ~37B active) at Unsloth’s most aggressive Q1.58 quant need ~131 GB of unified memory or 192 GB+ of system RAM, a hardware tier above any of the three machines benched here. People do run these models on consumer hardware; they run them on Mac Studio M3 Ultras and 192 GB+ DDR5 workstations. The shared-compute argument isn’t abstract. It’s whichever model your hardware can’t fit.

What the memory bench measures and what it doesn't

It measures:

Whether the bge-small embedder can rank evidence sessions correctly when given only those sessions.
Per-call retrieval latency on Apple Silicon (the reference hardware tier).
Absence of crashes or OOMs on the full 500-instance suite.

It does not measure:

Retrieval quality against filler noise. That requires LongMemEval-S or -M.
End-to-end answer accuracy (needs an LLM judge and a generation model).
The full get_relevant_memories() path with RRF and ambient-preference injection. Oracle split has no ambient preferences, so this would add noise without measuring anything.
Latency on non-Apple-Silicon hardware. Windows / Linux CPU performance will be slower.
Memory store scalability past ~10,000 entries (no ANN index yet).

Known gaps to fix before next publish

Known limits · read this before quoting us

MCP latency bench: p50 / p95 from memory_search tool-call to result, against a 500-memory store.
NVIDIA on Linux via CUDA, enabled 2026-06-01 (driver 580 / CUDA 13.0 on the RTX 2080 Ti). Real but modest: p50 22.2 → 17.8 ms (~20%), p95 34.6 → 20.8 ms (~40%). Not the order-of-magnitude win, same dispatch-bound dynamic as AMD, but less severe since CUDA dispatch overhead is lower. Wall time actually went up (35.6 → 64.6 s) because of GPU init / context setup cost amortized across cold per-instance runs; that overhead disappears in a long-lived production sidecar.
AMD on Windows via DirectML, investigated 2026-06-01 (RX 6600 XT). torch-directml runs the bge-small embedder on the GPU with full recall preserved, but p50 is break-even with CPU (17.9 ms vs 18.9 ms). The 33M-parameter model at batch=1 is dispatch-bound, not compute-bound: per-call kernel-dispatch and host↔device transfer roughly equal the CPU forward pass. ROCm/HIP is N/A; gfx1032 is dropped from the current HIP SDK. AMD will become useful again at the inference benchmark, where 7-8B-param models are compute-bound.
Cross-machine peer retrieval shipped 2026-06-02, measured latency + correctness above. Open follow-ups: (a) signed-nonce challenge replacing mesh-IP-as-credential in v0.5, (b) per-project sharing flags instead of all-or-nothing per-machine opt-in, (c) full latency-at-scale measurements against Windows + bigger store sizes (1K, 5K, 10K memories) to characterize the linear-scan ceiling, (d) graceful federation, a peer-aware retriever that automatically merges results from multiple opted-in peers.
Approximate-nearest-neighbor indexing. The current linear embedding scan degrades past ~10,000 memories.
Publish the Eval B corpus for full methodology transparency.
Inference benchmark next runs: a dedicated long-context bench (8K-32K input tokens), batched-inference throughput, multi-user concurrent serving on the same machine, and Llama 70B at quants we haven’t tested (Q4_K_M, Q5_K_M). The shared-compute story needs the multi-tenant numbers eventually.
Inference hardware extensions: 24 GB + 48 GB cards (e.g. RTX 3090, 4090, A6000) to find the ceiling where a single consumer card runs 70B usably. Apple Silicon higher tiers (M3 Max, M4 Pro) to know the real Mac ceiling.

How to reproduce

make bench from the repo root runs the full 500-instance suite. Results land in bench/results/<timestamp>/. The embedder (BAAI/bge-small-en-v1.5) downloads on first run from HuggingFace (~33 MB). No OpenAI API key, no GPU required.

LongMemEval data is used under the original MIT license. Vendored at bench/longmemeval/data/longmemeval_oracle.json. Source: github.com/xiaowu0162/LongMemEval.

For the inference bench: make bench-inference MODEL=bench/models/<name>.gguf. Models are pinned by SHA-256 in bench/models/SHA256SUMS.txt and downloaded on first run via the verification helper. Phi-3-mini-3.8B, Qwen-2.5-7B, and Llama-3.1-8B are ~2-5 GB each at Q4_K_M; Llama-3.1-70B-Q3_K_S is ~29 GB.

Refresh cadence

Re-run before each release tag. This page is updated by hand after a human reviews the numbers, no automated publishing.

Read the rest of the operating documents.

Roadmap Manifesto