15 changes: 15 additions & 0 deletions examples/benchmarks/memanto-vs-mem0/.env.example
@@ -0,0 +1,15 @@
# Memanto (Moorcheh) — required
# Get your free key at https://moorcheh.ai
MOORCHEH_API_KEY=your_moorcheh_api_key_here

# Mem0 Platform — required
# Get your free key at https://app.mem0.ai
MEM0_API_KEY=your_mem0_api_key_here

# OpenRouter — required for LLM judge
# Get your key at https://openrouter.ai
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Optional: override the judge model (default: openai/gpt-4o-mini)
# JUDGE_MODEL=anthropic/claude-3-5-haiku
# JUDGE_MODEL=openai/gpt-4o-mini
160 changes: 160 additions & 0 deletions examples/benchmarks/memanto-vs-mem0/README.md
@@ -0,0 +1,160 @@
# Memanto vs Mem0 — The Executive Shadow Benchmark

> *A rigorous, reproducible benchmark that stress-tests AI memory systems on the hardest problem in production agents: contradiction resolution and temporal preference drift.*

## The Scenario

**The Executive Shadow** — a personal AI assistant tracking a startup founder across 6 months of real-world complexity:

- **46 conversation turns** across 6 monthly sessions
- **7 explicit contradictions** — decisions that are made and then reversed (fundraising strategy, Workday integration, market focus, office policy, communication style, SaaS spending rules)
- **Dense, mixed-domain context** — product, finance, hiring, personal preferences, investor relationships all interleaved
- **8 evaluation queries** crafted to expose the exact failure modes of flat vector stores

The core thesis: **a flat vector store retrieves by semantic similarity, not recency or conflict resolution.** When a founder says "we're raising from Sequoia" in Month 1 and "we're dropping Sequoia" in Month 4, a flat store returns both — and the agent is confused. Memanto's typed memories and conflict detection should surface the current state cleanly.
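
To make this failure mode concrete, here is a minimal sketch. It is not the retrieval logic of Memanto or Mem0: it assumes a toy word-overlap score in place of embedding similarity, and `Memory`, `similarity`, `flat_top_k`, and `latest_relevant` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    month: int  # session month in which the statement was made
    text: str


def similarity(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)


def flat_top_k(query: str, store: list[Memory], k: int = 2) -> list[Memory]:
    """Rank purely by similarity; recency plays no role."""
    return sorted(store, key=lambda m: similarity(query, m.text), reverse=True)[:k]


def latest_relevant(query: str, store: list[Memory], k: int = 2) -> Memory:
    """Among the top-k similar memories, keep only the most recent one."""
    return max(flat_top_k(query, store, k), key=lambda m: m.month)


store = [
    Memory(1, "we are raising from Sequoia"),
    Memory(3, "the office policy is remote first"),
    Memory(4, "we are dropping Sequoia"),
]

query = "are we raising from Sequoia"
flat = flat_top_k(query, store)          # both the Month 1 and Month 4 statements
current = latest_relevant(query, store)  # only the Month 4 statement
```

Here `flat` contains both contradictory statements, while `current` keeps only the later one. Picking the most recent hit is just one possible tie-break, shown to illustrate the problem rather than to describe how either system behaves.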

## Architecture

```text
executive_shadow.json         ← deterministic golden dataset
harness.py                    ← drives both systems identically
  ├── MemantoAdapter          ← Memanto SDK (create/activate/remember/recall)
  └── Mem0Adapter             ← Mem0 Platform SDK (add/search)
evaluator.py (LLMJudge)       ← OpenRouter LLM scores each answer 0–15
reporter.py                   ← terminal table + results/benchmark_*.json
dashboard.py                  ← Streamlit visualisation
```

## Metrics

| Metric | What it measures |
|--------|-----------------|
| **Total tokens ingested** | How much context each system needs to store 6 sessions |
| **Total tokens recalled** | How much context is returned per query (bloat = noise) |
| **p95 ingest latency** | 95th-percentile time to store one session |
| **p95 recall latency** | 95th-percentile time to answer one query |
| **Accuracy (0–5)** | Does the answer match the golden answer? |
| **Staleness avoidance (0–5)** | Does it avoid contradicted older facts? |
| **Precision (0–5)** | Is the answer focused, or polluted with noise? |
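
As a rough illustration of the latency metrics, the sketch below computes a 95th percentile with the nearest-rank method. The percentile method the harness actually uses is not shown in this README, so nearest-rank is an assumption, and the sample latencies are made up.

```python
import math


def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with >= 95% of samples at or below it."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical per-session ingest latencies in seconds (six sessions).
ingest_latencies = [1.2, 0.9, 1.4, 1.1, 3.0, 1.0]
worst_case = p95(ingest_latencies)
```

Under this method, with only six sessions or eight queries the p95 is simply the slowest sample, so a single outlier dominates the figure.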

**Max eval score:** 120 (8 queries × 15 points each)
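
The arithmetic behind that maximum can be sketched as follows. `QueryScore` and `percent_of_max` are hypothetical names, and the scores are placeholder values rather than real benchmark output.

```python
from dataclasses import dataclass

MAX_PER_DIMENSION = 5  # each judged dimension is scored 0-5
NUM_QUERIES = 8


@dataclass
class QueryScore:
    accuracy: int
    staleness_avoidance: int
    precision: int

    def total(self) -> int:
        """Per-query score out of 15 (three dimensions, 5 points each)."""
        return self.accuracy + self.staleness_avoidance + self.precision


def percent_of_max(scores: list[QueryScore]) -> float:
    """Summed score as a percentage of the maximum achievable."""
    max_total = len(scores) * 3 * MAX_PER_DIMENSION
    return 100.0 * sum(s.total() for s in scores) / max_total


# Hypothetical judge output, not real benchmark results.
scores = [QueryScore(accuracy=5, staleness_avoidance=3, precision=4)] * NUM_QUERIES
overall = percent_of_max(scores)  # 96 of 120 points
```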

## Evaluation Query Types

| Type | What it tests | Example |
|------|--------------|---------|
| `contradiction_resolution` | Must surface current decision over earlier one | "Is Workday being built?" — was dropped in Month 2, reinstated in Month 5 |
| `staleness_detection` | Must deprioritise superseded preferences | "How should I format messages?" — rule changed in Month 6 |
| `recency` | Must return latest state, not historical average | "What is current team size and burn?" |

## Quick Start

### 1. Install

```bash
cd examples/benchmarks/memanto-vs-mem0
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### 2. Configure

```bash
cp .env.example .env
# Fill in MOORCHEH_API_KEY, MEM0_API_KEY, OPENROUTER_API_KEY
```

### 3. Run the benchmark

```bash
# Full run with LLM judge (~3–5 minutes)
python run_benchmark.py

# Metrics only (no LLM judge, ~1 minute)
python run_benchmark.py --skip-judge

# Custom judge model
python run_benchmark.py --judge-model openai/gpt-4o-mini
```

### 4. View results

```bash
# Terminal report (printed automatically after each run)

# Streamlit dashboard
streamlit run dashboard.py
```

## Experimental Controls

All variables held constant between the two systems:

| Variable | Value |
|---------|-------|
| Input dataset | Identical — `executive_shadow.json` |
| Session order | Sequential sessions 1–6 |
| Query set | 8 identical evaluation queries |
| Recall limit | 10 memories per query (both systems) |
| Judge LLM | `openai/gpt-4o-mini` via OpenRouter |
| Judge temperature | 0.0 |
| Judge seed | 42 (`gpt-4o-mini` honours this natively via OpenRouter) |
| Judge prompt | Identical system prompt for both systems |
| Timing | `time.perf_counter()` wall time per operation |
| Token counting | Character-based estimate (len/4) applied identically to both |
| Memanto agent pattern | `tool` |
| Indexing wait | Mem0: polled via `get_all` until memories visible (max 60s, 4s interval); Memanto: 3s fixed wait |
| Session pause | 0.5s between sessions to respect rate limits |
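
The bounded polling used for the Mem0 indexing wait (at most 60s, checked every 4s) might look roughly like the sketch below. The adapter code is not shown here, so the visibility check is abstracted behind a hypothetical `is_visible` callable instead of calling the Mem0 SDK; `wait_until_visible` and its parameters are likewise assumptions.

```python
import time
from typing import Callable


def wait_until_visible(
    is_visible: Callable[[], bool],
    max_wait_s: float = 60.0,
    interval_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_visible() until it returns True or max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if is_visible():
            return True
        # Stop if another full interval would overshoot the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the helper testable without real delays. Note that time spent in such a wait is separate from the timed ingest call only if the harness measures them separately, which this sketch does not address.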

### Token methodology note

Mem0 Platform runs an LLM extraction pass on every `add()` call to extract and deduplicate memories. This is an internal, async process — token cost is not exposed by the API. The benchmark captures wall-clock ingestion latency as a proxy for this overhead. Memanto stores memories directly with no ingestion-time LLM call; its LLM cost appears only at recall time.

### Conflict resolution note

Memanto's conflict resolution is human-in-the-loop: it flags contradictory memories and surfaces them to the user, who decides which to keep. It does not auto-resolve conflicts at recall time. The `contradiction_resolution` queries therefore measure whether the system returns the *most recent* relevant memories, not whether it has merged conflicting facts. A lower staleness score reflects semantic retrieval returning both old and new memories simultaneously, not a resolution failure.

### Variance

Three runs were conducted using `anthropic/claude-3-5-haiku` as judge. Memanto won 2 of the 3 runs; average scores were Memanto 55.0% vs Mem0 43.6%. The default judge is now `openai/gpt-4o-mini`, which honours `seed=42` natively and should make future runs more reproducible. Mem0's ingestion quality also varies between runs because its LLM extraction is asynchronous, which affects what gets stored and therefore what scores are achievable.

## Environment

| Requirement | Version |
|------------|---------|
| Python | 3.10+ |
| memanto | ≥0.1.0 |
| mem0ai | ≥2.0.5 |
| openai | ≥1.30.0 (OpenRouter-compatible) |

## Project Structure

```text
memanto-vs-mem0/
├── data/
│   └── executive_shadow.json   # Scenario dataset + golden answers
├── adapters/
│   ├── __init__.py
│   ├── base.py                 # MemoryAdapter interface
│   ├── memanto_adapter.py      # Memanto implementation
│   └── mem0_adapter.py         # Mem0 implementation
├── evaluator.py                # LLM-as-judge
├── harness.py                  # Benchmark orchestrator
├── reporter.py                 # Terminal + JSON output
├── run_benchmark.py            # CLI entry point
├── dashboard.py                # Streamlit visualisation
├── requirements.txt
├── .env.example
└── results/                    # Auto-created, holds JSON run outputs
```

---

## Social Posts

- X: [add after publishing]
- Reddit (r/AgenticMemory): [add after publishing]
11 changes: 11 additions & 0 deletions examples/benchmarks/memanto-vs-mem0/adapters/__init__.py
@@ -0,0 +1,11 @@
from .base import IngestResult, MemoryAdapter, RecallResult
from .mem0_adapter import Mem0Adapter
from .memanto_adapter import MemantoAdapter

__all__ = [
    "MemoryAdapter",
    "IngestResult",
    "RecallResult",
    "MemantoAdapter",
    "Mem0Adapter",
]
112 changes: 112 additions & 0 deletions examples/benchmarks/memanto-vs-mem0/adapters/base.py
@@ -0,0 +1,112 @@
"""
Base adapter interface.

Every memory system under test must implement this interface so the
benchmark harness can drive them identically.
"""

from __future__ import annotations

import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class IngestResult:
    """Result from ingesting one session."""
    system: str
    session_id: str
    latency_s: float
    tokens_ingested: int  # tokens sent to the memory system
    raw_response: Any = None


@dataclass
class RecallResult:
    """Result from a single recall query."""
    system: str
    query_id: str
    query: str
    answer: str
    latency_s: float
    tokens_used: int  # tokens in the recall round-trip
    memories_returned: list[str] = field(default_factory=list)
    raw_response: Any = None


class MemoryAdapter(ABC):
    """Abstract base class for a memory system adapter."""

    name: str = "unnamed"

    @abstractmethod
    def setup(self, user_id: str) -> None:
        """Initialise the memory system for this benchmark run."""

    @abstractmethod
    def ingest_session(
        self,
        user_id: str,
        session_id: str,
        messages: list[dict],
    ) -> IngestResult:
        """
        Store a full conversation session into memory.

        Args:
            user_id: The user whose profile is being built.
            session_id: An identifier for this session.
            messages: List of {"role": str, "content": str} dicts.

        Returns:
            IngestResult with latency and token counts.
        """

    @abstractmethod
    def recall(
        self,
        user_id: str,
        query_id: str,
        query: str,
    ) -> RecallResult:
        """
        Query the memory system and return the best answer.

        Args:
            user_id: The user whose memories to search.
            query_id: Identifier for the evaluation query.
            query: Natural-language question.

        Returns:
            RecallResult with the answer, latency, and token counts.
        """

    @abstractmethod
    def teardown(self, user_id: str) -> None:
        """Clean up any state created during the benchmark run."""

    # -----------------------------------------------------------------------
    # Shared helpers
    # -----------------------------------------------------------------------

    @staticmethod
    def count_tokens(text: str) -> int:
        """
        Rough token estimate: ~4 characters per token (GPT-style).
        Used for systems that don't expose token counts directly.
        """
        return max(1, len(text) // 4)

    @staticmethod
    def messages_to_text(messages: list[dict]) -> str:
        """Flatten a message list to plain text for token counting."""
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    @staticmethod
    def timed(fn, *args, **kwargs):
        """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        return result, time.perf_counter() - t0