feat(benchmarks): Memanto vs Mem0 — The Executive Shadow Benchmark#755
feat(benchmarks): Memanto vs Mem0 — The Executive Shadow Benchmark#755ejspeed-cmd wants to merge 10 commits into
Conversation
Rigorous head-to-head benchmark for issue moorcheh-ai#639. Scenario: 'The Executive Shadow' — 6-month startup founder simulation - 46 conversation turns across 6 sessions - 7 explicit contradictions (fundraising, Workday, market focus, office policy, communication style, SaaS spending rules) - 8 evaluation queries (contradiction_resolution, staleness_detection, recency) Architecture: - Generic MemoryAdapter interface — trivial to add new systems - MemantoAdapter: create_agent/activate/batch_remember/recall - Mem0Adapter: MemoryClient.add/search (Platform API) - LLM-as-judge via OpenRouter: accuracy + staleness + precision (0-15/query) - Terminal report (rich) + JSON results + Streamlit dashboard Metrics: total tokens, p95 latency, retrieval accuracy score Closes moorcheh-ai#639
- Delete duplicate .gitignore from benchmark subdirectory - Consolidate gitignore rules at project root level - Simplify repository structure by removing redundant ignore configurations
Memanto: 74/120 (61.7%) | Mem0: 45/120 (37.5%) Memanto wins by 65% on retrieval accuracy across 8 contradiction/staleness queries Judge: anthropic/claude-3-5-haiku
- Replace `applymap()` with `map()` in dataframe formatting for Streamlit compatibility - Fixes deprecation warning in pandas and ensures compatibility with newer versions - Maintains existing formatting of latency values to 3 decimal places with seconds suffix
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds a complete benchmark suite under ChangesMemanto vs Mem0 Executive Shadow Benchmark
Sequence Diagram(s)sequenceDiagram
rect rgba(70, 130, 180, 0.5)
note over run_benchmark.py: CLI Entry
end
participant CLI as run_benchmark.py
participant Harness as harness.run_benchmark
participant MemantoAdapter
participant Mem0Adapter
participant LLMJudge
participant Reporter as reporter
CLI->>Harness: run_benchmark(skip_judge, judge_model)
rect rgba(34, 139, 34, 0.5)
note over MemantoAdapter: Memanto Run
end
Harness->>MemantoAdapter: setup → ingest_session x6 → recall x8
Harness->>LLMJudge: score(query, golden, recalled) x8
Harness->>MemantoAdapter: teardown
rect rgba(220, 20, 60, 0.5)
note over Mem0Adapter: Mem0 Run
end
Harness->>Mem0Adapter: setup → ingest_session x6 → wait_for_indexing → recall x8
Harness->>LLMJudge: score(query, golden, recalled) x8
Harness->>Mem0Adapter: teardown
Harness-->>CLI: BenchmarkResult
CLI->>Reporter: print_report(result)
CLI->>Reporter: save_results(result) → results/benchmark_*.json
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/memanto-vs-mem0/adapters/mem0_adapter.py`:
- Around line 110-112: The count calculation in the indexing poll loop
unconditionally calls .get() on all_mems, which fails when get_all() returns a
list instead of a dict, throwing an AttributeError that gets swallowed and
causes the method to keep polling until timeout. Check whether all_mems is a
dict or list first, then call .get("results", ...) for dicts and use the list
directly for lists, ensuring the count is calculated correctly regardless of the
return type.
In `@examples/benchmarks/memanto-vs-mem0/adapters/memanto_adapter.py`:
- Around line 47-55: The agent ID in the setup method is deterministic based on
user_id alone, causing repeated benchmark runs to reuse the same agent and its
persisted memories when AgentAlreadyExistsError is caught. Replace the
deterministic agent ID assignment (the line setting self._agent_id =
f"bench-{user_id}") with a run-scoped unique identifier that includes a random
or timestamp-based component, ensuring each benchmark run creates a fresh agent
instance with isolated state and no memory carryover from previous executions.
In `@examples/benchmarks/memanto-vs-mem0/dashboard.py`:
- Around line 133-138: The winner badge condition at line 133 only checks for
the presence of the "eval_score_pct" key in metrics, but this key exists even
when LLM-judge scoring is skipped (--skip-judge flag), leading to false winner
badges with 0/0 scores. Enhance the condition in the if statement to not only
check that "eval_score_pct" exists in the metrics for all systems, but also
validate that the actual score values are meaningful and not zero or null,
indicating real LLM-judge evaluation occurred. Apply the same fix to the
duplicate check mentioned around lines 192-195.
In `@examples/benchmarks/memanto-vs-mem0/data/executive_shadow.json`:
- Around line 211-214: The metadata field `total_turns` in the
executive_shadow.json file is currently set to 46, but this value does not match
the actual number of conversation turns present in the session content within
the file. Count the actual total number of turns across all sessions in the
dataset and update the `total_turns` value to reflect the correct count to
ensure metadata consistency and maintain reproducibility.
In `@examples/benchmarks/memanto-vs-mem0/evaluator.py`:
- Around line 175-185: The parsed scores (accuracy, staleness_avoidance, and
precision) are not being validated to ensure they fall within the declared 0–5
range before being aggregated into the total. After converting each parsed value
to an integer, clamp each of the three score variables (accuracy,
staleness_avoidance, precision) to the range [0, 5] using a min/max operation
before passing them to the EvalScore constructor. This ensures out-of-range
values do not corrupt the total calculation.
In `@examples/benchmarks/memanto-vs-mem0/harness.py`:
- Around line 71-76: The p95_ingest_latency method uses floor-based indexing
which biases low and underestimates tail latency. Replace the current index
calculation in the p95_ingest_latency method (the line with `int(len(lats) *
0.95) - 1`) with proper nearest-rank percentile indexing using ceiling instead
of floor, which should be `ceil(len(lats) * 0.95) - 1`. Apply the same fix to
the p99_ingest_latency method (lines 79-84) which has the analogous issue with
the 0.99 percentile calculation.
- Around line 184-259: The adapter.teardown(user_id) method is not guaranteed to
execute if an exception occurs during any of the preceding operations
(adapter.setup, adapter.ingest_session, adapter.recall,
adapter.wait_for_indexing, or judge.score). Wrap all operations from the
adapter.setup(user_id) call through the judge scoring section in a try block,
and move the adapter.teardown(user_id) call to a finally block to ensure cleanup
always runs regardless of whether an exception is raised during ingest, recall,
or judge phases.
In `@examples/benchmarks/memanto-vs-mem0/pyproject.toml`:
- Around line 25-26: The `packages` field in the wheel build configuration is
restricting the wheel to only include the adapters package, but the console
script entry point `run-benchmark = "run_benchmark:main"` requires the top-level
run_benchmark module which is not included in the packages list. To fix this,
either remove the `packages` configuration entirely to enable Hatchling to
auto-discover all modules and packages in the directory (preferred option), or
add the run_benchmark module to the packages list along with the adapters
package. The removal approach is preferred as it allows all necessary modules to
be discovered and included automatically.
In `@examples/benchmarks/memanto-vs-mem0/reporter.py`:
- Around line 52-54: The print statement with result.winner() on line 54
incorrectly labels the winner metric as "by retrieval accuracy", but
result.winner() actually returns the winner based on total_eval_score which
combines accuracy, staleness, and precision metrics. Update the f-string message
in the print statement to accurately reflect that the winner is determined by
the combined total evaluation score rather than just retrieval accuracy alone.
In `@examples/benchmarks/memanto-vs-mem0/requirements.txt`:
- Line 3: The mem0ai package version 0.1.0 specified in requirements.txt
contains HIGH-severity security vulnerabilities that lack proper
authentication/authorization controls and input validation. Update the mem0ai
version constraint from 0.1.0 to 2.0.5 or later, which includes security
hardening and patches for these vulnerabilities. Change the line in
requirements.txt to specify mem0ai>=2.0.5 instead of the current version.
In `@examples/benchmarks/memanto-vs-mem0/run_benchmark.py`:
- Around line 4-9: The usage documentation in the module docstring incorrectly
references a --save flag that does not exist in the argument parser. Update the
usage examples in the docstring (lines 4-9 and also at lines 46-49 as mentioned)
to correctly show --no-save instead of --save to match the actual supported
command-line arguments defined in the argument parser.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 5ed86b17-82bd-4d43-b908-a47a7d8f3e11
📒 Files selected for processing (15)
examples/benchmarks/memanto-vs-mem0/.env.exampleexamples/benchmarks/memanto-vs-mem0/README.mdexamples/benchmarks/memanto-vs-mem0/adapters/__init__.pyexamples/benchmarks/memanto-vs-mem0/adapters/base.pyexamples/benchmarks/memanto-vs-mem0/adapters/mem0_adapter.pyexamples/benchmarks/memanto-vs-mem0/adapters/memanto_adapter.pyexamples/benchmarks/memanto-vs-mem0/dashboard.pyexamples/benchmarks/memanto-vs-mem0/data/executive_shadow.jsonexamples/benchmarks/memanto-vs-mem0/evaluator.pyexamples/benchmarks/memanto-vs-mem0/harness.pyexamples/benchmarks/memanto-vs-mem0/pyproject.tomlexamples/benchmarks/memanto-vs-mem0/reporter.pyexamples/benchmarks/memanto-vs-mem0/requirements.txtexamples/benchmarks/memanto-vs-mem0/results/benchmark_2026-06-17_18-30-57.jsonexamples/benchmarks/memanto-vs-mem0/run_benchmark.py
- mem0_adapter: robust list/dict handling in wait_for_indexing poll
- memanto_adapter: run-scoped agent ID (uuid suffix) prevents cross-run pollution
- harness: try/finally guarantees teardown on exceptions
- harness: ceil-based p95 latency indexing (was floor, biased low)
- harness: import math for ceil
- evaluator: clamp judge scores to [0,5] before aggregation
- dashboard: winner badge only shown when judge scores exist
- reporter: correct winner label ('total eval score' not 'retrieval accuracy')
- run_benchmark: fix --save docstring (flag is --no-save)
- requirements/pyproject: upgrade mem0ai to >=2.0.5 (CVE remediation)
- data: fix total_turns metadata to 38 (was 46)
- pyproject: remove packages restriction, enable auto-discovery
Run 2: Memanto 53.3% / Mem0 56.7% (Mem0 wins — extraction quality variation) Run 3: Memanto 50.0% / Mem0 36.7% (Memanto wins) Average across 3 runs: Memanto 55.0% vs Mem0 43.6%
Additional benchmark runs (reproducibility verification)Ran the benchmark 2 more times after the initial submission to verify reproducibility. Results across 3 runs:
Key observations:
@Xenogents can you please review when have a moment |
|
Good work! There's quite a lot going on here so I can't review everything but I'll go over a bit of what I've noticed. Part of the reason we mentioned token usage as a benchmark and something we were confident we had an edge over other memory systems is because other systems (like mem0) use LLM's to extract information on ingestion and we wanted all token usage to be taken into account. Though looking into it more I think it's a completely internal process and can't be observed? In any case if that's not possible to do that's fine it just would be ideal if there was some way we could penalize mem0 for using LLM's on ingestion for particular benchmarks because that is one of our major selling points. I'm a little bit concerned on the variance of the benchmarks, LLM as a judge does have some variance even at temperature level 0 but it still seems a bit high to me. I'm not exactly sure at first glance what else could be introducing variance but if there really is nothing else to do to make things more deterministic then I guess that's that. For conflict resolution, memanto's way of resolving conflicts is done in a more manual way, it flags contradictory memories and then it's up to the user to decide which ones to keep, natively it does absolutely nothing to resolve them. That does make it kind of hard to objectively grade the memory layer's ability to resolve conflicts and we haven't even really thought of what to do regarding that yet, so I'll just leave it up to you to decide what to do with that information. Small thing but would like a litttle bit more detail on the experimental controls, e.g. listing out that you used a recall limit of 10 for both experiments, I haven't looked at what else may be needed to add to that list so that might just be the only thing. In any case very good job! I really appreciate the variety in benchmarking schemas and testing for different things, it's something our team has discussed before that it's not standardized for memory benchmarks to even test for things like conflict resolution or staleness even though it should be, so I'm impressed you're ahead of the curve like this. Keep up the good work! |
f0ab2b6 to
e1b87ce
Compare
|
Thanks for the detailed feedback @Xenogents! Token overhead: Agreed it's internal and unobservable from outside. Documented in the README that wall-clock ingestion latency is the best available proxy. Memanto ingests in ~1s per session vs Mem0's 0.7–2.4s which partially reflects the extraction overhead. Variance: Switched the judge to
Conflict resolution: Reframed the Experimental controls: Added recall limit (10 for both systems), agent pattern, judge seed, indexing wait policy and session pause to the controls table. All 3 result files are committed in |
Summary
Closes #639
A rigorous, reproducible head-to-head benchmark between Memanto and Mem0 on the hardest problem in production agent memory: contradiction resolution and temporal preference drift.
Result: Memanto 74/120 (61.7%) vs Mem0 45/120 (37.5%) — Memanto wins by 65%.
The Scenario: The Executive Shadow
Rather than using either of the two suggested scenarios in isolation, this benchmark combines both stresses into a single, richer evaluation — a custom scenario that is harder than either alone.
A personal AI assistant tracks a startup founder across 6 months and 46 conversation turns. The scenario contains 7 explicit contradictions — decisions that are made and then reversed:
Why this is hard: A flat vector store retrieves by semantic similarity, not recency. Early sessions have strong semantic overlap with query terms (e.g. "fundraising", "Workday", "office") but contain contradicted facts. A system without conflict resolution will blend old and new answers — and confuse the agent.
Architecture
Benchmark Results
Architecture Diagram
Demo Video
2026-06-17.20-21-27.mp4
Scores
Per-Query Breakdown
Key Findings
Where Memanto wins clearly:
q6_team_and_burn, Memanto scored 12/15 (most recent session surfaced correctly). Mem0 scored 4/15 (recalled the original 3-engineer, $18k/month state).Token efficiency:
Areas for Improvement
Staleness sub-scores for Memanto average 2.5/5, not because it returns wrong answers, but because it returns current answers alongside some historical context in the top-10 recall set. The judge flags this as mild noise rather than incorrect retrieval — total scores of 7–12/15 confirm the system is consistently surfacing the right current state.
Mem0's staleness problem is categorically worse: it scored 1–3/5 and the judge described it as "outdated information dominates" and "impossible to extract current state". That's a fundamentally different failure mode.
One concrete improvement for Memanto: session-scoped recency weighting — applying a light decay function to scores based on
created_atwould push the most recent session's memories to the top of the recall set, reducing the historical context noise further. This would likely push Memanto's staleness scores from 2–3 up to 4–5 without affecting accuracy.Experimental Controls
All variables held constant between both systems:
executive_shadow.jsonanthropic/claude-3-5-haikuvia OpenRoutertemperature=0.0Reproducibility
The full result JSON is committed at
results/benchmark_{date}.json.Social Media Posts
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Documentation
Chores
.env.example, benchmark packaging, and pinned dependencies.