Benchmarks

The gp-bench crate provides a comprehensive benchmark suite for GraphPalace, measuring recall accuracy, pathfinding performance, and throughput. It includes deterministic test data generators and structured result reporting.

Benchmark Categories

Category	Target	Comparison
Recall	≥96.6%	MemPalace's LongMemEval benchmark
Pathfinding	≥90.9% success	STAN_X v8's Semantic A*
Throughput	See §Performance Targets	Raw operations/sec

Recall Benchmarks

The recall benchmark measures how accurately GraphPalace retrieves previously stored memories — the core promise inherited from MemPalace's verbatim storage philosophy.

Methodology

Generate a test palace with N drawers containing diverse, realistic content
Query with paraphrased versions of stored content (never the exact text)
Score: a query is a "hit" if the correct drawer appears in the top-k results
Recall = hits / total_queries

use gp_bench::{RecallBenchmark, TestPalaceGenerator};

let generator = TestPalaceGenerator::new(42); // deterministic seed
let palace = generator.generate(1000);        // 1,000 drawers

let bench = RecallBenchmark::new()
    .k(10)                    // top-10 retrieval
    .num_queries(500)         // 500 paraphrased queries
    .similarity_threshold(0.0); // no minimum score cutoff

let result = bench.run(&palace)?;
println!("Recall@10: {:.1}%", result.recall * 100.0);
println!("Mean Reciprocal Rank: {:.3}", result.mrr);
println!("Avg latency: {:.1}ms", result.avg_latency_ms);

Scales

Scale	Drawers	Wings	Rooms	Purpose
Tiny	100	2	5	Quick sanity check
Small	1,000	5	20	Default benchmark
Medium	10,000	10	50	Realistic workload
Large	100,000	20	100	Stress test

Comparison Approach

To compare against MemPalace's 96.6% LongMemEval result:

Use the same query types: single-session (retrieve from one conversation), multi-session (retrieve across conversations), knowledge-graph (entity relationship queries), temporal (time-bounded queries)
Paraphrase queries using synonym substitution and sentence restructuring
Report recall@k for k ∈ {1, 3, 5, 10}
Measure with and without pheromone boosting to isolate stigmergy's contribution

Result Structure

pub struct RecallResult {
    pub recall: f64,           // Recall@k
    pub mrr: f64,              // Mean Reciprocal Rank
    pub precision: f64,        // Precision@k
    pub avg_latency_ms: f64,   // Average query latency
    pub p50_latency_ms: f64,   // Median latency
    pub p99_latency_ms: f64,   // 99th percentile latency
    pub total_queries: usize,
    pub hits: usize,
    pub misses: usize,
    pub scale: BenchScale,
    pub pheromone_boosted: bool,
}

Pathfinding Benchmarks

The pathfinding benchmark measures Semantic A*'s ability to find routes through the palace graph, targeting STAN_X v8's 90.9% success rate.

Methodology

Generate a palace with cross-wing tunnels and inter-room halls
Select random start/goal pairs at varying graph distances
Run Semantic A* with the default 40/30/30 cost weights
Measure success rate, path cost, latency, and nodes expanded

use gp_bench::PathfindingBenchmark;

let bench = PathfindingBenchmark::new()
    .num_pairs(200)           // 200 start/goal pairs
    .max_iterations(10_000)   // per-path iteration budget
    .include_cross_wing(true); // include cross-wing (harder) pairs

let result = bench.run(&palace)?;
println!("Success rate: {:.1}%", result.success_rate * 100.0);
println!("Avg path cost: {:.3}", result.avg_cost);
println!("Avg latency: {:.1}ms", result.avg_latency_ms);
println!("Avg nodes expanded: {:.0}", result.avg_nodes_expanded);

With vs Without Pheromones

A key question: does stigmergy actually help pathfinding?

// Run twice: pristine palace (no pheromones) vs primed palace (after swarm exploration)
let cold_result = bench.run(&cold_palace)?;

// Run a swarm for 100 cycles to build up pheromone trails
swarm.run(&mut warm_palace, 100)?;
let warm_result = bench.run(&warm_palace)?;

println!("Cold success: {:.1}%, Warm success: {:.1}%",
    cold_result.success_rate * 100.0,
    warm_result.success_rate * 100.0);
println!("Cold latency: {:.1}ms, Warm latency: {:.1}ms",
    cold_result.avg_latency_ms,
    warm_result.avg_latency_ms);

Result Structure

pub struct PathfindingResult {
    pub success_rate: f64,          // Fraction of pairs where path was found
    pub avg_cost: f64,              // Average path cost (successful only)
    pub avg_latency_ms: f64,        // Average time per search
    pub p50_latency_ms: f64,
    pub p99_latency_ms: f64,
    pub avg_nodes_expanded: f64,    // Average nodes expanded per search
    pub avg_path_length: f64,       // Average edges in found paths
    pub total_pairs: usize,
    pub successes: usize,
    pub failures: usize,
    pub pheromone_state: String,    // "cold" or "warm"
}

Throughput Benchmarks

Raw operation throughput for capacity planning:

use gp_bench::ThroughputBenchmark;

let bench = ThroughputBenchmark::new()
    .warmup_iterations(100)
    .measurement_iterations(1000);

let result = bench.run(&mut palace)?;
println!("Insert:  {:.0} drawers/sec", result.insert_rate);
println!("Search:  {:.0} queries/sec", result.search_rate);
println!("Decay:   {:.0} nodes/sec", result.decay_rate);
println!("Export:  {:.1}ms for {} nodes", result.export_ms, result.node_count);
println!("Import:  {:.1}ms for {} nodes", result.import_ms, result.node_count);

Operations Measured

Operation	What	Target
Insert	Create drawer + compute embedding + index	>1,000/sec (in-memory)
Search	Semantic search top-10	<50ms (HNSW), <200ms (brute-force)
Navigate	A* pathfinding	<200ms (cached), <500ms (uncached)
Decay	One full pheromone decay cycle	<500ms for 10K edges
Export	Serialize entire palace to JSON	<5s for 100K drawers
Import	Deserialize and load from JSON	<10s for 100K drawers

Test Data Generators

The TestPalaceGenerator creates deterministic, reproducible test palaces:

pub struct TestPalaceGenerator {
    seed: u64,
    domain_mix: Vec<(&'static str, f64)>,  // domain name, proportion
}

impl TestPalaceGenerator {
    /// Create a generator with a fixed seed for reproducibility
    pub fn new(seed: u64) -> Self;

    /// Generate a palace with the given number of drawers
    pub fn generate(&self, num_drawers: usize) -> GraphPalace;

    /// Generate with specific structure
    pub fn generate_structured(
        &self,
        num_wings: usize,
        rooms_per_wing: usize,
        closets_per_room: usize,
        drawers_per_closet: usize,
    ) -> GraphPalace;
}

Default domain mix:

Domain	Proportion	Content Style
Climate Science	20%	Temperature, CO₂, sea level data
Economics	20%	GDP, inflation, trade statistics
Astrophysics	15%	Stars, galaxies, cosmic measurements
Epidemiology	15%	Disease prevalence, mortality rates
Materials Science	15%	Material properties, synthesis methods
General Knowledge	15%	Wikipedia-style factual statements

Each drawer gets:

Unique, realistic content (~50-200 words)
Deterministic mock embedding (seeded)
Assigned wing/room/closet based on domain
Random importance score (0.3–1.0)
Source tag from ["api", "conversation", "file", "agent"]

How to Run Benchmarks

Cargo Bench

# Run all benchmarks
cd rust && cargo bench --package gp-bench

# Run specific benchmark
cargo bench --package gp-bench -- recall
cargo bench --package gp-bench -- pathfinding
cargo bench --package gp-bench -- throughput

# With specific scale
BENCH_SCALE=medium cargo bench --package gp-bench -- recall

# Generate HTML report (via criterion)
cargo bench --package gp-bench -- --output-format=criterion
# Report at: target/criterion/report/index.html

Programmatic

use gp_bench::{run_all_benchmarks, BenchConfig};

let config = BenchConfig {
    scale: BenchScale::Small,
    recall_k: 10,
    pathfinding_pairs: 200,
    throughput_iterations: 1000,
    output_json: true,
};

let report = run_all_benchmarks(config)?;
println!("{}", serde_json::to_string_pretty(&report)?);

Report Format

Benchmark results are output as structured JSON for integration with CI/CD:

{
  "timestamp": "2026-04-13T10:30:00Z",
  "graphpalace_version": "0.1.0",
  "scale": "small",
  "recall": {
    "recall_at_10": 0.968,
    "mrr": 0.891,
    "avg_latency_ms": 12.3,
    "p99_latency_ms": 45.1,
    "total_queries": 500,
    "pheromone_boosted": true
  },
  "pathfinding": {
    "success_rate": 0.925,
    "avg_cost": 2.341,
    "avg_latency_ms": 34.7,
    "pheromone_state": "warm"
  },
  "throughput": {
    "insert_per_sec": 4521.0,
    "search_per_sec": 892.0,
    "decay_ms": 12.4,
    "export_ms": 234.5,
    "import_ms": 456.7
  },
  "comparison": {
    "mempalace_recall": 0.966,
    "graphpalace_recall": 0.968,
    "recall_delta": "+0.2%",
    "stanx_pathfinding": 0.909,
    "graphpalace_pathfinding": 0.925,
    "pathfinding_delta": "+1.6%"
  }
}

Interpreting Results

Recall@10 ≥ 96.6%: Matches or exceeds MemPalace's LongMemEval score. The verbatim storage philosophy (never summarize) plus embedding search should achieve this by design.
Pathfinding success ≥ 90.9%: Matches STAN_X v8. Pheromone-boosted ("warm") runs should exceed cold runs by 5-15%.
Latency under targets: Search <50ms, pathfinding <500ms. If exceeded, check palace size vs benchmark scale.
Throughput: In-memory backend should exceed targets easily. Kuzu backend numbers reflect real I/O.

CI Integration

The benchmark suite can be run in CI to catch performance regressions:

# .github/workflows/graphpalace-ci.yml
- name: Run benchmarks
  run: |
    cd rust
    BENCH_SCALE=tiny cargo bench --package gp-bench -- --output-format=json > bench-results.json
    # Compare against baseline
    python scripts/check_regression.py bench-results.json baseline.json --threshold 5%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarks

Benchmark Categories

Recall Benchmarks

Methodology

Scales

Comparison Approach

Result Structure

Pathfinding Benchmarks

Methodology

With vs Without Pheromones

Result Structure

Throughput Benchmarks

Operations Measured

Test Data Generators

How to Run Benchmarks

Cargo Bench

Programmatic

Report Format

Interpreting Results

CI Integration

FilesExpand file tree

benchmarks.md

Latest commit

History

benchmarks.md

File metadata and controls

Benchmarks

Benchmark Categories

Recall Benchmarks

Methodology

Scales

Comparison Approach

Result Structure

Pathfinding Benchmarks

Methodology

With vs Without Pheromones

Result Structure

Throughput Benchmarks

Operations Measured

Test Data Generators

How to Run Benchmarks

Cargo Bench

Programmatic

Report Format

Interpreting Results

CI Integration