feat(benchmarks): Memanto vs Mem0 — The Executive Shadow Benchmark#755

Open
ejspeed-cmd wants to merge 10 commits into moorcheh-ai:main from ejspeed-cmd:feat/benchmarks-memeanto-vs-mem0

Conversation

@ejspeed-cmd ejspeed-cmd commented Jun 17, 2026

Contributor

Summary

Closes #639

A rigorous, reproducible head-to-head benchmark between Memanto and Mem0 on the hardest problem in production agent memory: contradiction resolution and temporal preference drift.

Result: Memanto 74/120 (61.7%) vs Mem0 45/120 (37.5%). Memanto leads by 24.2 percentage points, which is about 64% higher than Mem0's score in relative terms.


The Scenario: The Executive Shadow

Rather than using either of the two suggested scenarios in isolation, this benchmark combines both stresses into a single, richer evaluation — a custom scenario that is harder than either alone.

A personal AI assistant tracks a startup founder across 6 months and 46 conversation turns. The scenario contains 7 explicit contradictions — decisions that are made and then reversed:

| Contradiction | Old State | New State |
|---|---|---|
| Fundraising | "No VC until $10k MRR" (Month 2) | $1.85M seed closed, Series A planning (Month 5-6) |
| Workday integration | "Dropping Workday" (Month 2) | "Workday complete by Miguel" (Month 6) |
| Market focus | "SMB only" (Month 2) | "Dual SMB + enterprise" (Month 5) |
| Office policy | "Full-time in-office" (Month 4) | "Hybrid Tue/Thu only" (Month 6) |
| Communication style | "Always 5 bullets" (Month 1) | "Detailed narrative for board, 5 bullets internal" (Month 6) |
| SaaS spending rule | "Hard $50/month limit" (Month 1) | "Rule retired, ~$870/month approved tools" (Month 6) |
| Sequoia | "Avoid Sequoia" (Month 4) | "Sequoia re-engaged via seed extension" (Month 6) |

Why this is hard: A flat vector store retrieves by semantic similarity, not recency. Early sessions have strong semantic overlap with query terms (e.g. "fundraising", "Workday", "office") but contain contradicted facts. A system without conflict resolution will blend old and new answers — and confuse the agent.


Architecture

executive_shadow.json     ← deterministic golden dataset (46 turns, 8 queries)
        ↓
harness.py                ← drives both systems identically
  ├── MemantoAdapter       ← create_agent / activate / batch_remember / recall
  └── Mem0Adapter          ← MemoryClient.add / search (Platform API)
        ↓
evaluator.py (LLMJudge)   ← anthropic/claude-3-5-haiku via OpenRouter
                             3 dimensions × 5 points = 15/query × 8 queries = 120 max
        ↓
reporter.py               ← rich terminal table + results/benchmark_*.json
        ↓
dashboard.py              ← Streamlit visualisation
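A minimal sketch of the adapter contract the diagram implies. The lifecycle method names (`setup`, `ingest_session`, `wait_for_indexing`, `recall`, `teardown`) appear in this PR's sequence diagram; the signatures, dataclass fields, the `InMemoryAdapter` stand-in, and the `run_one` helper are assumptions for illustration and are not the code in `adapters/base.py` or `harness.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Character-based estimate (len/4); integer division is an assumption."""
    return len(text) // 4


@dataclass
class IngestResult:
    tokens: int  # field name is an assumption


@dataclass
class RecallResult:
    memories: list = field(default_factory=list)
    tokens: int = 0


class MemoryAdapter(ABC):
    """Lifecycle each system implements so one harness can drive both identically."""

    @abstractmethod
    def setup(self, user_id: str) -> None: ...

    @abstractmethod
    def ingest_session(self, user_id: str, turns: list) -> IngestResult: ...

    def wait_for_indexing(self, user_id: str) -> None:
        """Optional hook for systems that index asynchronously."""

    @abstractmethod
    def recall(self, user_id: str, query: str, limit: int = 10) -> RecallResult: ...

    @abstractmethod
    def teardown(self, user_id: str) -> None: ...


class InMemoryAdapter(MemoryAdapter):
    """Hypothetical stand-in: stores turns verbatim and returns the newest first."""

    def __init__(self) -> None:
        self._store: dict = {}

    def setup(self, user_id):
        self._store[user_id] = []

    def ingest_session(self, user_id, turns):
        self._store[user_id].extend(turns)
        return IngestResult(tokens=sum(estimate_tokens(t) for t in turns))

    def recall(self, user_id, query, limit=10):
        newest_first = self._store[user_id][::-1][:limit]
        tokens = estimate_tokens(query) + sum(estimate_tokens(m) for m in newest_first)
        return RecallResult(memories=newest_first, tokens=tokens)

    def teardown(self, user_id):
        self._store.pop(user_id, None)


def run_one(adapter: MemoryAdapter, user_id: str, sessions: list, query: str) -> RecallResult:
    """Drive one adapter through the full lifecycle; teardown always runs."""
    adapter.setup(user_id)
    try:
        for turns in sessions:
            adapter.ingest_session(user_id, turns)
        adapter.wait_for_indexing(user_id)
        return adapter.recall(user_id, query)
    finally:
        adapter.teardown(user_id)
```

Because the harness only talks to `MemoryAdapter`, adding a third system would mean writing one more subclass and leaving the driver untouched.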

Benchmark Results

Architecture Diagram

Jun 17, 2026, 07_56_39 PM

Demo Video

2026-06-17.20-21-27.mp4

Scores

| System | Eval Score | Tokens Used | p95 Recall Latency |
|---|---|---|---|
| Memanto | 74/120 (61.7%) | 6,641 | 2.32s |
| Mem0 | 45/120 (37.5%) | 5,642 | 2.05s |

Per-Query Breakdown

| Query | Test Type | Memanto | Mem0 |
|---|---|---|---|
| What is the founder's current fundraising strategy? | contradiction_resolution | 10/15 | 5/15 |
| Is Workday integration being built? | contradiction_resolution | 10/15 | 5/15 |
| Is this company targeting SMB or enterprise? | contradiction_resolution | 8/15 | 5/15 |
| What are the founder's work preferences? | staleness_detection | 8/15 | 6/15 |
| How should I format messages to this founder? | staleness_detection | 11/15 | 8/15 |
| What is the current team size and burn? | recency | 12/15 | 4/15 |
| What is the rule around SaaS tool spending? | contradiction_resolution | 8/15 | 6/15 |
| What is the biggest risk right now? | recency | 7/15 | 6/15 |
| TOTAL | | 74/120 (61.7%) | 45/120 (37.5%) |
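The totals can be re-derived from the per-query rows. The sketch below copies the scores from the table above and recomputes the maximum (3 dimensions × 5 points × 8 queries), the percentages, and the gap between the two systems. The short query labels are placeholders, not the dataset's query ids.

```python
MAX_PER_QUERY = 3 * 5  # accuracy, staleness avoidance, precision: 0-5 each

# (label, Memanto score, Mem0 score), in the same order as the table rows.
per_query = [
    ("fundraising", 10, 5),
    ("workday", 10, 5),
    ("market focus", 8, 5),
    ("work preferences", 8, 6),
    ("message format", 11, 8),
    ("team and burn", 12, 4),
    ("saas spending", 8, 6),
    ("biggest risk", 7, 6),
]

max_total = MAX_PER_QUERY * len(per_query)
memanto_total = sum(memanto for _, memanto, _ in per_query)
mem0_total = sum(mem0 for _, _, mem0 in per_query)


def pct(score: int, maximum: int) -> float:
    """Score as a percentage of the maximum, rounded to one decimal."""
    return round(100 * score / maximum, 1)


# Absolute gap in percentage points vs. relative lead over Mem0's score.
point_gap = round(pct(memanto_total, max_total) - pct(mem0_total, max_total), 1)
relative_lead = round(100 * (memanto_total - mem0_total) / mem0_total, 1)

print(f"Memanto {memanto_total}/{max_total} ({pct(memanto_total, max_total)}%)")
print(f"Mem0    {mem0_total}/{max_total} ({pct(mem0_total, max_total)}%)")
print(f"Gap: {point_gap} points, {relative_lead}% relative")
```

Reporting the percentage-point gap and the relative lead side by side avoids ambiguity about which one a headline figure refers to.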

Key Findings

Where Memanto wins clearly:

  • Accuracy — consistently scores 3–5 vs Mem0's 1–2. Mem0 frequently recalled Session 1 facts (e.g. team of 3, $18k burn, pre-seed strategy) when the current state is Session 6.
  • Recency — on q6_team_and_burn, Memanto scored 12/15 (most recent session surfaced correctly). Mem0 scored 4/15 (recalled the original 3-engineer, $18k/month state).
  • Contradiction queries — Mem0's judge reasoning on every contradiction query included phrases like "completely fails to address", "outdated information dominates", and "impossible to extract". Memanto returned the current state alongside historical context.

Token efficiency:

  • Mem0 uses fewer total tokens (5,642 vs 6,641) because it auto-extracts and deduplicates during ingest. This is a genuine advantage but comes at the cost of the precision and recency needed to answer the contradiction queries.
  • Memanto stores all user turns verbatim — this preserves fidelity but returns more context per recall.

Areas for Improvement

Staleness sub-scores for Memanto average 2.5/5, not because it returns wrong answers, but because it returns current answers alongside some historical context in the top-10 recall set. The judge flags this as mild noise rather than incorrect retrieval — total scores of 7–12/15 confirm the system is consistently surfacing the right current state.

Mem0's staleness problem is categorically worse: it scored 1–3/5 and the judge described it as "outdated information dominates" and "impossible to extract current state". That's a fundamentally different failure mode.

One concrete improvement for Memanto: session-scoped recency weighting — applying a light decay function to scores based on created_at would push the most recent session's memories to the top of the recall set, reducing the historical context noise further. This would likely push Memanto's staleness scores from 2–3 up to 4–5 without affecting accuracy.
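One way the proposed recency weighting could look, as a sketch rather than anything Memanto currently does. It multiplies each recalled memory's similarity score by an exponential decay on its age. The `Memory` dataclass, the `similarity` field, and the 30-day half-life are assumptions; only `created_at` is named in the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    text: str
    similarity: float      # raw similarity score from the store (assumed in [0, 1])
    created_at: datetime


def decayed_score(mem: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Halve the similarity score for every `half_life_days` of age."""
    age_days = max((now - mem.created_at).total_seconds() / 86400.0, 0.0)
    return mem.similarity * 0.5 ** (age_days / half_life_days)


def rerank(memories: list, now: datetime, half_life_days: float = 30.0) -> list:
    """Order memories by decayed score, highest first."""
    return sorted(
        memories,
        key=lambda m: decayed_score(m, now, half_life_days),
        reverse=True,
    )


now = datetime(2026, 6, 17, tzinfo=timezone.utc)
recalled = [
    # Older memory with the stronger raw similarity to a "fundraising" query.
    Memory("No VC until $10k MRR", 0.82, now - timedelta(days=120)),
    # Newer memory with a weaker raw similarity.
    Memory("$1.85M seed closed, Series A planning", 0.74, now - timedelta(days=10)),
]

for mem in rerank(recalled, now):
    print(f"{decayed_score(mem, now):.3f}  {mem.text}")
```

A shorter half-life suppresses history more aggressively. Whether that helps depends on the query, since some questions legitimately need older context.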


Experimental Controls

All variables held constant between both systems:

| Variable | Value |
|---|---|
| Input dataset | Identical: executive_shadow.json |
| Session order | Sequential sessions 1–6 |
| Query set | 8 identical evaluation queries |
| Judge LLM | anthropic/claude-3-5-haiku via OpenRouter |
| Judge prompt | Identical system prompt, temperature=0.0 |
| Token counting | Character-based (len/4), applied identically |
| Host environment | Same machine, sequential runs |
| Indexing wait | Mem0 polled until memories indexed (max 60s); Memanto 3s fixed wait |
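The indexing-wait control for Mem0 is a poll with a timeout. A generic sketch of that pattern follows. The `count_indexed` callable, the `expected` count, and the injectable `sleep` and `clock` parameters are assumptions for illustration and do not mirror the adapter's actual signature.

```python
import time
from typing import Callable


def wait_for_indexing(
    count_indexed: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until at least `expected` memories are visible or the timeout expires.

    Returns True if indexing was observed in time, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_indexed() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with a fake backend that reports 0, then 1, then 3 indexed memories.
counts = iter([0, 1, 3])
indexed = wait_for_indexing(lambda: next(counts), expected=3, interval_s=0.0)
print(indexed)
```

Returning a boolean lets the caller record a timeout in the results instead of silently running recall against a partially indexed store.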

Reproducibility

# 1. Install
cd examples/benchmarks/memanto-vs-mem0
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Fill in MOORCHEH_API_KEY, MEM0_API_KEY, OPENROUTER_API_KEY

# 3. Run
python run_benchmark.py

# 4. Dashboard
streamlit run dashboard.py

The full result JSON is committed at results/benchmark_{date}.json.


Social Media Posts

Summary by CodeRabbit

New Features

  • Added the “Memanto vs Mem0” Executive Shadow benchmark suite to compare ingest/recall performance on a multi-month, contradiction-heavy scenario.
  • Added optional LLM judging (accuracy, staleness avoidance, precision) to compute a winner and scores.
  • Added terminal reporting, timestamped JSON result saving, and a Streamlit dashboard with charts and per-query deep dives.
  • Added a CLI to run the benchmark, with options to skip judging and choose a judge model.

Documentation

  • Added benchmark README covering setup, metrics, and experimental controls.

Chores

  • Added a .env.example, benchmark packaging, and pinned dependencies.

Rigorous head-to-head benchmark for issue moorcheh-ai#639.

Scenario: 'The Executive Shadow' — 6-month startup founder simulation
  - 46 conversation turns across 6 sessions
  - 7 explicit contradictions (fundraising, Workday, market focus,
    office policy, communication style, SaaS spending rules)
  - 8 evaluation queries (contradiction_resolution, staleness_detection, recency)

Architecture:
  - Generic MemoryAdapter interface — trivial to add new systems
  - MemantoAdapter: create_agent/activate/batch_remember/recall
  - Mem0Adapter: MemoryClient.add/search (Platform API)
  - LLM-as-judge via OpenRouter: accuracy + staleness + precision (0-15/query)
  - Terminal report (rich) + JSON results + Streamlit dashboard

Metrics: total tokens, p95 latency, retrieval accuracy score
Closes moorcheh-ai#639
- Delete duplicate .gitignore from benchmark subdirectory
- Consolidate gitignore rules at project root level
- Simplify repository structure by removing redundant ignore configurations
Memanto: 74/120 (61.7%) | Mem0: 45/120 (37.5%)
Memanto wins by 65% on retrieval accuracy across 8 contradiction/staleness queries
Judge: anthropic/claude-3-5-haiku
- Replace `applymap()` with `map()` in dataframe formatting for Streamlit compatibility
- Fixes deprecation warning in pandas and ensures compatibility with newer versions
- Maintains existing formatting of latency values to 3 decimal places with seconds suffix
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a2ac46e2-f587-425e-b45d-f82d489f97ba

📥 Commits

Reviewing files that changed from the base of the PR and between c6702d8 and b41e78d.

📒 Files selected for processing (1)
  • examples/benchmarks/memanto-vs-mem0/.env.example
✅ Files skipped from review due to trivial changes (1)
  • examples/benchmarks/memanto-vs-mem0/.env.example

📝 Walkthrough

Adds a complete benchmark suite under examples/benchmarks/memanto-vs-mem0/ that compares Memanto and Mem0 memory systems on a 6-month "Executive Shadow" scenario with 46 conversation turns and explicit contradictions. The suite includes adapter abstractions, two implementations, an LLM-as-judge evaluator via OpenRouter, an orchestration harness, terminal reporting, a Streamlit dashboard, the scenario dataset, and full project configuration.

Changes

Memanto vs Mem0 Executive Shadow Benchmark

Layer / File(s) Summary
Adapter contracts and scenario dataset
adapters/base.py, adapters/__init__.py, data/executive_shadow.json
IngestResult and RecallResult dataclasses and the MemoryAdapter abstract base class with lifecycle/benchmark methods and shared token-counting helpers form the benchmark interface. The Executive Shadow JSON dataset provides six monthly sessions with evolving conversation histories containing explicit cross-month contradictions, eight evaluation queries with golden answers and stale/current signal lists, and scenario metadata including test-type coverage and domain mapping.
Mem0Adapter implementation
adapters/mem0_adapter.py
Mem0Adapter lazily imports MemoryClient, creates a per-benchmark user id in setup, ingests full message lists via add() with session metadata while computing input token counts, polls get_all() with user filter in wait_for_indexing to detect indexing completion, and recalls via search() while extracting memory text and aggregating query+returned-text token usage.
MemantoAdapter implementation
adapters/memanto_adapter.py
MemantoAdapter initializes SdkClient with MOORCHEH_API_KEY, creates uniquely-scoped per-user agent ids (bench-{user_id}-{suffix}) with 2-hour activation in setup, builds batch payloads with one memory per user turn (auto-classified by Memanto) and attempts batch_remember with per-item fallback, and recalls via Memanto's recall endpoint while extracting content and aggregating token usage.
LLM-as-judge evaluator
evaluator.py
LLMJudge connects to OpenRouter-hosted models via an OpenAI-compatible SDK, scoring each query/answer pair using system and user prompt templates with accuracy, staleness-avoidance, and precision dimensions (0–5 each). It handles up to 3 retry attempts with exponential backoff, extracts and parses JSON from responses, and returns zeroed EvalScore instances with failure reasoning on all-retries-exhausted.
Benchmark harness orchestration
harness.py
run_benchmark loads the Executive Shadow scenario, instantiates both adapters and an optional LLMJudge, sequences per-adapter setup → ingest-all-sessions → wait-for-indexing → recall-all-queries, optionally scores each query result via the judge, and aggregates into SystemResult (token totals, p95/mean latencies, eval-score sums and percentages) and BenchmarkResult with a winner() method based on total eval score.
Terminal reporter and results persistence
reporter.py
print_report renders per-system token/latency/accuracy metrics in either a rich table or plain-text fallback, conditionally showing a winner line and per-query accuracy breakdown. save_results creates the results/ directory, constructs a timestamped JSON filename, recursively serializes dataclass instances via asdict, and persists the full benchmark output.
CLI entry point and orchestration
run_benchmark.py
CLI entry point loads environment variables via dotenv, configures logging, parses --skip-judge, --judge-model, and --no-save flags, calls harness.run_benchmark, prints a formatted report via reporter.print_report, conditionally persists results via reporter.save_results, and wraps execution in try/except for error handling.
Streamlit results dashboard
dashboard.py
Dashboard auto-loads the latest benchmark_*.json from results/ or runs a fresh benchmark via sidebar controls. It displays headline metric cards with conditional winner badge, token-efficiency and latency comparison bar charts with pandas, a conditional "Retrieval Accuracy" breakdown table and dimension chart when judge scores exist, and a Query Deep Dive selectbox showing golden answers, per-system recalled answers, and judge reasoning.
Project configuration, environment, and docs
pyproject.toml, requirements.txt, .env.example, README.md
pyproject.toml registers hatchling build backend, lists runtime dependencies, and defines the run-benchmark console script. requirements.txt pins specific dependency versions. .env.example documents required API keys for Memanto, Mem0, and OpenRouter, plus the optional JUDGE_MODEL override. README.md covers the Executive Shadow scenario, benchmark architecture, metrics, quick-start instructions, experimental controls, environment requirements, project structure, and social-post placeholders.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(70, 130, 180, 0.5)
    note over run_benchmark.py: CLI Entry
  end
  participant CLI as run_benchmark.py
  participant Harness as harness.run_benchmark
  participant MemantoAdapter
  participant Mem0Adapter
  participant LLMJudge
  participant Reporter as reporter

  CLI->>Harness: run_benchmark(skip_judge, judge_model)
  rect rgba(34, 139, 34, 0.5)
    note over MemantoAdapter: Memanto Run
  end
  Harness->>MemantoAdapter: setup → ingest_session x6 → recall x8
  Harness->>LLMJudge: score(query, golden, recalled) x8
  Harness->>MemantoAdapter: teardown
  rect rgba(220, 20, 60, 0.5)
    note over Mem0Adapter: Mem0 Run
  end
  Harness->>Mem0Adapter: setup → ingest_session x6 → wait_for_indexing → recall x8
  Harness->>LLMJudge: score(query, golden, recalled) x8
  Harness->>Mem0Adapter: teardown
  Harness-->>CLI: BenchmarkResult
  CLI->>Reporter: print_report(result)
  CLI->>Reporter: save_results(result) → results/benchmark_*.json

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • Neelpatel1604
  • Xenogents

Poem

🐇 Hop, hop, the memories race,
Memanto and Mem0 share the space.
Six months of turns, contradictions galore,
The judge scores answers, then asks for more.
Token by token, latency low —
The rabbit declares: let benchmarks flow! 🏆

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.23% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: implementation of a benchmark comparing Memanto and Mem0 memory systems using the Executive Shadow scenario. |
| Linked Issues check | ✅ Passed | All primary coding objectives from issue #639 are met: rigorous benchmarking suite built with MemoryAdapter interface [#639], core metrics measured (tokens, latency, accuracy) [#639], meaningful scenario implemented (Executive Shadow, 6 months, 46 turns, 7 contradictions) [#639], scientific rigor ensured with documented controls [#639], reproducibility delivered with requirements.txt, .env.example, dataset, and result files [#639], and complete output (terminal reports, JSON, Streamlit dashboard) [#639]. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support the benchmark implementation objectives from issue #639. Files focus on adapters (Memanto/Mem0), evaluation harness, LLM judge, reporting, CLI entry point, dashboard visualization, dataset, and configuration: all required for the benchmarking suite. |

@ejspeed-cmd ejspeed-cmd marked this pull request as ready for review June 17, 2026 19:59

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto-vs-mem0/adapters/mem0_adapter.py`:
- Around line 110-112: The count calculation in the indexing poll loop
unconditionally calls .get() on all_mems, which fails when get_all() returns a
list instead of a dict, throwing an AttributeError that gets swallowed and
causes the method to keep polling until timeout. Check whether all_mems is a
dict or list first, then call .get("results", ...) for dicts and use the list
directly for lists, ensuring the count is calculated correctly regardless of the
return type.

In `@examples/benchmarks/memanto-vs-mem0/adapters/memanto_adapter.py`:
- Around line 47-55: The agent ID in the setup method is deterministic based on
user_id alone, causing repeated benchmark runs to reuse the same agent and its
persisted memories when AgentAlreadyExistsError is caught. Replace the
deterministic agent ID assignment (the line setting self._agent_id =
f"bench-{user_id}") with a run-scoped unique identifier that includes a random
or timestamp-based component, ensuring each benchmark run creates a fresh agent
instance with isolated state and no memory carryover from previous executions.

In `@examples/benchmarks/memanto-vs-mem0/dashboard.py`:
- Around line 133-138: The winner badge condition at line 133 only checks for
the presence of the "eval_score_pct" key in metrics, but this key exists even
when LLM-judge scoring is skipped (--skip-judge flag), leading to false winner
badges with 0/0 scores. Enhance the condition in the if statement to not only
check that "eval_score_pct" exists in the metrics for all systems, but also
validate that the actual score values are meaningful and not zero or null,
indicating real LLM-judge evaluation occurred. Apply the same fix to the
duplicate check mentioned around lines 192-195.

In `@examples/benchmarks/memanto-vs-mem0/data/executive_shadow.json`:
- Around line 211-214: The metadata field `total_turns` in the
executive_shadow.json file is currently set to 46, but this value does not match
the actual number of conversation turns present in the session content within
the file. Count the actual total number of turns across all sessions in the
dataset and update the `total_turns` value to reflect the correct count to
ensure metadata consistency and maintain reproducibility.

In `@examples/benchmarks/memanto-vs-mem0/evaluator.py`:
- Around line 175-185: The parsed scores (accuracy, staleness_avoidance, and
precision) are not being validated to ensure they fall within the declared 0–5
range before being aggregated into the total. After converting each parsed value
to an integer, clamp each of the three score variables (accuracy,
staleness_avoidance, precision) to the range [0, 5] using a min/max operation
before passing them to the EvalScore constructor. This ensures out-of-range
values do not corrupt the total calculation.

In `@examples/benchmarks/memanto-vs-mem0/harness.py`:
- Around line 71-76: The p95_ingest_latency method uses floor-based indexing
which biases low and underestimates tail latency. Replace the current index
calculation in the p95_ingest_latency method (the line with `int(len(lats) *
0.95) - 1`) with proper nearest-rank percentile indexing using ceiling instead
of floor, which should be `ceil(len(lats) * 0.95) - 1`. Apply the same fix to
the p99_ingest_latency method (lines 79-84) which has the analogous issue with
the 0.99 percentile calculation.
- Around line 184-259: The adapter.teardown(user_id) method is not guaranteed to
execute if an exception occurs during any of the preceding operations
(adapter.setup, adapter.ingest_session, adapter.recall,
adapter.wait_for_indexing, or judge.score). Wrap all operations from the
adapter.setup(user_id) call through the judge scoring section in a try block,
and move the adapter.teardown(user_id) call to a finally block to ensure cleanup
always runs regardless of whether an exception is raised during ingest, recall,
or judge phases.

In `@examples/benchmarks/memanto-vs-mem0/pyproject.toml`:
- Around line 25-26: The `packages` field in the wheel build configuration is
restricting the wheel to only include the adapters package, but the console
script entry point `run-benchmark = "run_benchmark:main"` requires the top-level
run_benchmark module which is not included in the packages list. To fix this,
either remove the `packages` configuration entirely to enable Hatchling to
auto-discover all modules and packages in the directory (preferred option), or
add the run_benchmark module to the packages list along with the adapters
package. The removal approach is preferred as it allows all necessary modules to
be discovered and included automatically.

In `@examples/benchmarks/memanto-vs-mem0/reporter.py`:
- Around line 52-54: The print statement with result.winner() on line 54
incorrectly labels the winner metric as "by retrieval accuracy", but
result.winner() actually returns the winner based on total_eval_score which
combines accuracy, staleness, and precision metrics. Update the f-string message
in the print statement to accurately reflect that the winner is determined by
the combined total evaluation score rather than just retrieval accuracy alone.

In `@examples/benchmarks/memanto-vs-mem0/requirements.txt`:
- Line 3: The mem0ai package version 0.1.0 specified in requirements.txt
contains HIGH-severity security vulnerabilities that lack proper
authentication/authorization controls and input validation. Update the mem0ai
version constraint from 0.1.0 to 2.0.5 or later, which includes security
hardening and patches for these vulnerabilities. Change the line in
requirements.txt to specify mem0ai>=2.0.5 instead of the current version.

In `@examples/benchmarks/memanto-vs-mem0/run_benchmark.py`:
- Around line 4-9: The usage documentation in the module docstring incorrectly
references a --save flag that does not exist in the argument parser. Update the
usage examples in the docstring (lines 4-9 and also at lines 46-49 as mentioned)
to correctly show --no-save instead of --save to match the actual supported
command-line arguments defined in the argument parser.
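The percentile finding for `harness.py` above contrasts floor-based and ceiling-based indexing. A small sketch of the nearest-rank definition makes the difference concrete for 8 recall latencies. The function names and the sample latencies are illustrative assumptions, not values from the benchmark.

```python
import math


def percentile_nearest_rank(values: list, q: float) -> float:
    """Nearest-rank percentile: the smallest value with at least q of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def percentile_floor_variant(values: list, q: float) -> float:
    """The floor-based index the review flags; shown only for comparison."""
    ordered = sorted(values)
    return ordered[max(int(len(ordered) * q) - 1, 0)]


latencies = [0.9, 1.1, 1.2, 1.3, 1.4, 1.6, 2.0, 2.3]  # seconds, illustrative

print(percentile_nearest_rank(latencies, 0.95))   # picks the slowest sample
print(percentile_floor_variant(latencies, 0.95))  # picks the second slowest
```

With only 8 samples the nearest-rank p95 is simply the maximum, which is worth keeping in mind when comparing p95 figures from a run this small.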

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ed86b17-82bd-4d43-b908-a47a7d8f3e11

📥 Commits

Reviewing files that changed from the base of the PR and between c39af7a and 70d455e.

📒 Files selected for processing (15)
  • examples/benchmarks/memanto-vs-mem0/.env.example
  • examples/benchmarks/memanto-vs-mem0/README.md
  • examples/benchmarks/memanto-vs-mem0/adapters/__init__.py
  • examples/benchmarks/memanto-vs-mem0/adapters/base.py
  • examples/benchmarks/memanto-vs-mem0/adapters/mem0_adapter.py
  • examples/benchmarks/memanto-vs-mem0/adapters/memanto_adapter.py
  • examples/benchmarks/memanto-vs-mem0/dashboard.py
  • examples/benchmarks/memanto-vs-mem0/data/executive_shadow.json
  • examples/benchmarks/memanto-vs-mem0/evaluator.py
  • examples/benchmarks/memanto-vs-mem0/harness.py
  • examples/benchmarks/memanto-vs-mem0/pyproject.toml
  • examples/benchmarks/memanto-vs-mem0/reporter.py
  • examples/benchmarks/memanto-vs-mem0/requirements.txt
  • examples/benchmarks/memanto-vs-mem0/results/benchmark_2026-06-17_18-30-57.json
  • examples/benchmarks/memanto-vs-mem0/run_benchmark.py

Comment thread examples/benchmarks/memanto-vs-mem0/adapters/mem0_adapter.py
Comment thread examples/benchmarks/memanto-vs-mem0/adapters/memanto_adapter.py Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/dashboard.py Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/data/executive_shadow.json
Comment thread examples/benchmarks/memanto-vs-mem0/evaluator.py Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/harness.py
Comment thread examples/benchmarks/memanto-vs-mem0/pyproject.toml Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/reporter.py Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/requirements.txt Outdated
Comment thread examples/benchmarks/memanto-vs-mem0/run_benchmark.py
- mem0_adapter: robust list/dict handling in wait_for_indexing poll
- memanto_adapter: run-scoped agent ID (uuid suffix) prevents cross-run pollution
- harness: try/finally guarantees teardown on exceptions
- harness: ceil-based p95 latency indexing (was floor, biased low)
- harness: import math for ceil
- evaluator: clamp judge scores to [0,5] before aggregation
- dashboard: winner badge only shown when judge scores exist
- reporter: correct winner label ('total eval score' not 'retrieval accuracy')
- run_benchmark: fix --save docstring (flag is --no-save)
- requirements/pyproject: upgrade mem0ai to >=2.0.5 (CVE remediation)
- data: fix total_turns metadata to 38 (was 46)
- pyproject: remove packages restriction, enable auto-discovery
Run 2: Memanto 53.3% / Mem0 56.7% (Mem0 wins — extraction quality variation)
Run 3: Memanto 50.0% / Mem0 36.7% (Memanto wins)
Average across 3 runs: Memanto 55.0% vs Mem0 43.6%
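Two of the fixes in the commit list above are small enough to sketch as helpers: clamping judge scores to the declared 0–5 range, and counting memories whether the client returns a list or a dict with a `"results"` key. Both function names are hypothetical, and the real changes in `evaluator.py` and `mem0_adapter.py` may be structured differently.

```python
def clamp_score(value, lo: int = 0, hi: int = 5) -> int:
    """Coerce a parsed judge score to int and clamp it into [lo, hi]."""
    return max(lo, min(hi, int(value)))


def count_memories(all_mems) -> int:
    """Count memories for either response shape described in the review comment."""
    if isinstance(all_mems, dict):
        return len(all_mems.get("results", []))
    if isinstance(all_mems, list):
        return len(all_mems)
    return 0  # unknown shape: treat as nothing indexed yet


# An out-of-range accuracy score no longer inflates the per-query total.
raw = {"accuracy": 7, "staleness_avoidance": -1, "precision": 4}
total = sum(clamp_score(v) for v in raw.values())
print(total)

print(count_memories({"results": [{"memory": "a"}, {"memory": "b"}]}))
print(count_memories([{"memory": "a"}]))
```

Note that `int()` truncates floats and raises `ValueError` on strings such as `"4.5"`, so a caller would still need to decide how to handle a judge that returns non-integer scores.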
@ejspeed-cmd

Contributor Author

Additional benchmark runs (reproducibility verification)

Ran the benchmark 2 more times after the initial submission to verify reproducibility. Results across 3 runs:

| Run | Memanto | Mem0 | Winner |
|---|---|---|---|
| Run 1 (submitted) | 61.7% | 37.5% | Memanto |
| Run 2 | 53.3% | 56.7% | Mem0 |
| Run 3 | 50.0% | 36.7% | Memanto |
| Average | 55.0% | 43.6% | Memanto |

Key observations:

  • Memanto wins 2/3 runs and leads on average (55.0% vs 43.6%)
  • Run 2 is an anomaly — Mem0's LLM extraction produced high-quality compressed memories that session, which the judge scored highly. This illustrates Mem0's extraction quality is non-deterministic
  • LLM-as-judge scoring has inherent variance (~5-10%) per run, which is expected and documented in the benchmark methodology
  • All 3 result files are committed in results/
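The averages in the table can be re-derived, and the run-to-run spread quantified, with a few lines of standard-library Python. The percentages are copied from the three runs above. Using the sample standard deviation as the measure of spread is a choice made for this sketch, not part of the benchmark.

```python
from statistics import mean, stdev

# Eval score percentages for runs 1-3, copied from the table above.
runs = {
    "Memanto": [61.7, 53.3, 50.0],
    "Mem0": [37.5, 56.7, 36.7],
}

summary = {
    name: {
        "mean": round(mean(scores), 1),
        "stdev": round(stdev(scores), 1),        # sample standard deviation
        "range": round(max(scores) - min(scores), 1),
    }
    for name, scores in runs.items()
}

for name, stats in summary.items():
    print(f"{name}: mean {stats['mean']}%, stdev {stats['stdev']}, range {stats['range']} points")
```

With three runs per system the spread estimate is rough, but printing it next to the mean shows how much of the gap between systems could be noise.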

@Xenogents can you please review when you have a moment?

@Xenogents

Collaborator

Good work! There's quite a lot going on here so I can't review everything but I'll go over a bit of what I've noticed.

Part of the reason we listed token usage as a benchmark metric, and why we were confident we had an edge over other memory systems, is that systems like Mem0 use LLMs to extract information at ingestion, and we wanted all token usage to be taken into account. Looking into it more, though, I think that's a completely internal process and can't be observed? If that's not possible, that's fine. It would just be ideal to have some way to penalize Mem0 for using LLMs at ingestion in particular benchmarks, because that is one of our major selling points.

I'm a little concerned about the variance of the benchmarks. LLM-as-judge does have some variance even at temperature 0, but it still seems a bit high to me. I'm not sure at first glance what else could be introducing variance, but if there really is nothing else to do to make things more deterministic, then I guess that's that.

For conflict resolution, Memanto resolves conflicts in a more manual way: it flags contradictory memories, and it's then up to the user to decide which ones to keep. Natively it does absolutely nothing to resolve them. That makes it hard to objectively grade the memory layer's ability to resolve conflicts, and we haven't really thought about what to do regarding that yet, so I'll leave it up to you to decide what to do with that information.

Small thing, but I'd like a little more detail on the experimental controls, e.g. stating that you used a recall limit of 10 for both experiments. I haven't looked at what else may need to be added to that list, so that might be the only thing.

In any case very good job! I really appreciate the variety in benchmarking schemas and testing for different things, it's something our team has discussed before that it's not standardized for memory benchmarks to even test for things like conflict resolution or staleness even though it should be, so I'm impressed you're ahead of the curve like this. Keep up the good work!

@ejspeed-cmd ejspeed-cmd force-pushed the feat/benchmarks-memeanto-vs-mem0 branch from f0ab2b6 to e1b87ce Compare June 22, 2026 18:32
@ejspeed-cmd

Contributor Author

Thanks for the detailed feedback @Xenogents!

Token overhead: Agreed it's internal and unobservable from outside. Documented in the README that wall-clock ingestion latency is the best available proxy. Memanto ingests in ~1s per session vs Mem0's 0.7–2.4s which partially reflects the extraction overhead.

Variance: Switched the judge to openai/gpt-4o-mini with seed=42 — GPT-4o-mini honours the seed natively so judge scoring is now deterministic. Remaining variance is from Mem0's async LLM extraction producing different compressed memories each run. Ran 3 fresh benchmarks on GitHub Codespaces to remove network as a variable:

| Run | Memanto | Mem0 | Winner |
|---|---|---|---|
| Run 1 | 45.8% | 17.5% | Memanto |
| Run 2 | 55.8% | 21.7% | Memanto |
| Run 3 | 47.5% | 25.8% | Memanto |
| Average | 49.7% | 21.7% | Memanto (3/3) |

Conflict resolution: Reframed the contradiction_resolution queries to measure recency retrieval rather than auto-resolution and added a methodology note to the README.

Experimental controls: Added recall limit (10 for both systems), agent pattern, judge seed, indexing wait policy and session pause to the controls table.

All 3 result files are committed in results/.


Development

Successfully merging this pull request may close these issues.

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge

2 participants