reev 🪸

Reev 🪸: Production-Ready Framework for Solana LLM Agent Evaluation

🎯 Production Status: Complete & Fully Functional

reev is a mature, production-ready Rust framework for rigorously evaluating Solana-native LLM agents. After extensive development and testing, the framework now provides a complete, reliable platform for assessing autonomous agents in realistic blockchain environments.

✅ Current Capabilities: Production Ready

The framework achieves 100% success rates across all benchmark categories:

🔄 Real Jupiter Integration: Full swap, lending, mint/redeem operations with Jupiter SDK
🤖 Advanced Agent Support: Both deterministic (ground truth) and AI agents working perfectly
🔄 Multi-Step Workflows: Complex DeFi flows with step-by-step orchestration (200-series)
📊 Comprehensive Scoring: Granular instruction quality evaluation + on-chain execution metrics
🎮 Professional Tooling: Interactive TUI cockpit, database persistence, detailed logging
🔬 Real-World Testing: Mainnet fork validation with actual deployed programs
✅ Scoring System Validation: Complete test suite covering 0%, 50%, 75%, and 100% score scenarios
🌊 Flow Support: Step-by-step flow execution with proper transaction isolation

🚀 Core Architecture: Real Programs, Controlled State

The framework operates on surfpool, a high-performance in-memory fork of Solana mainnet, providing:

🌐 Real-World Logic: Agents interact with actual deployed programs (Jupiter, SPL Token, etc.)
🔒 Controlled Environment: Precise state management via RPC cheat codes for reproducible testing
⚡ High Performance: In-memory execution with fast state manipulation and transaction simulation
🔄 Hermetic Testing: Every test run starts from identical, controlled initial conditions

Core Principles

Reproducibility: The primary goal. Every test run is hermetic, guaranteeing that a given benchmark will produce the exact same result every time.
Service-Oriented Environment: The Solana test validator (surfpool) is treated as a managed, external service that the environment connects to and configures via RPC. This ensures a clean architectural boundary and prevents dependency conflicts.
Gymnasium-Inspired API: The agent-environment interaction is modeled via a standard Rust trait (GymEnv) inspired by the Gymnasium API, promoting a clear separation of concerns.

Key Components

reev-lib (Core Library):
- SolanaEnv: A custom, hermetic evaluation environment that connects to an external surfpool process. It handles state setup, transaction execution, and observation generation.
- Agent Interface: Defines a simple Agent trait and provides an LlmAgent that can reason about prompts.
- Benchmark Structs: Rust types that define the structure of a benchmark YAML file, enabling strongly-typed parsing.
reev-runner (CLI Orchestrator):
- The command-line tool for loading and running benchmarks.
- Orchestrates the entire evaluation loop, from setting up the environment to calculating metrics and reporting results.
reev-agent (LLM Service):
- A standalone server that exposes an LLM's reasoning capabilities over an API.
- Can be configured to use different models (local, Gemini, etc.) and includes a deterministic agent for generating ground-truth instructions.
Benchmark Suite:
- A suite of evaluation tasks defined in YAML files located in the benchmarks/ directory.
- Each test case includes a declarative initial_state, a natural language prompt, and ground_truth criteria for success.

🚀 Quick Start

Prerequisites

Rust Toolchain: Install Rust (latest stable recommended)
Git: Clone the repository
Optional LLM: Install LM Studio or have Gemini API key for AI agents

🎯 Running Benchmarks

The framework now provides automatic surfpool management - no manual setup required:

# All benchmarks work out of the box
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

# Jupiter protocols (swap, lending, mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

# Multi-step flows (swap + lend)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# API benchmarks (positions, earnings)
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

# Scoring validation tests
cargo run -p reev-runner -- benchmarks/003-spl-transfer-fail.yml --agent deterministic  # 0% score
cargo run -p reev-runner -- benchmarks/004-partial-score-spl-transfer.yml --agent deterministic  # ~50% score

🤖 Agent Options

Deterministic Agent (Ground Truth):

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

Local Model Agent:

cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

Gemini Agent:

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent gemini-2.5-flash-lite

🎮 Interactive TUI

Launch the interactive cockpit for real-time monitoring:

cargo run -p reev-tui

Features:

📊 Live benchmark execution with status updates
🔍 Detailed execution trace analysis
🏷️ Agent selection (deterministic, local, glm-4.6, gemini)
📈 Real-time scoring and metrics

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

Real on-chain operations with Jupiter protocols:

# Jupiter swap
cargo run -p reev-runner -- benchmarks/100-jup-swap-sol-usdc.yml --agent local

# Jupiter lending (mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

🌊 Flow Benchmarks (200-series)

Multi-step DeFi workflows with step-by-step execution:

# Swap then lend (2 steps: swap SOL→USDC, then deposit USDC)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# More flow benchmarks coming soon...

Flow Execution Features:

✅ Step-by-Step Processing: Each flow step executes as a separate transaction
✅ Transaction Isolation: Proper error handling per step, no cascading failures
✅ State Management: Account state flows between steps automatically
✅ Agent Consistency: Both deterministic and AI agents handle flows identically

📡 API Benchmarks (100-series)

Data retrieval and portfolio management:

# Positions and earnings
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

🎯 Success Metrics

Current Performance:

✅ 100% Success Rate: All benchmarks passing with local model
✅ Real Jupiter Integration: Full protocol stack working
✅ Multi-Step Flows: Complex workflows executing step-by-step successfully
✅ Production Infrastructure: TUI, database, logging all operational
✅ Scoring System Validation: Comprehensive test suite covering full score spectrum
✅ Anti-False-Positive Protection: Differentiates failure modes accurately
✅ Flow Framework: Robust step-by-step execution with proper error handling

Scoring System:

The framework implements a sophisticated two-tiered scoring system:

Component Breakdown:

Instruction Quality (75%): Granular evaluation of generated transactions
- Program ID matching (configurable weight)
- Instruction data validation (configurable weight)
- Account metadata verification (signer/writable flags)
On-Chain Execution (25%): Binary success/failure on surfpool
Composite Scoring: Weighted average for final assessment

Flow Scoring:

Per-Step Evaluation: Each flow step is scored individually
Combined Results: Step scores aggregated for final flow assessment
Partial Credit: Successful steps count even if later steps fail

Validated Score Scenarios:

Score Range	Test Case	Purpose	Status
~75%	`003-spl-transfer-fail`	Correct instruction, on-chain failure	✅ Validated
~78.6%	`004-partial-score-spl-transfer`	Partial credit (correct ID, some errors)	✅ Validated
~75%	`100-jup-swap-sol-usdc` (pre-fix)	Good reasoning, execution failure	✅ Validated
100%	`001-sol-transfer`, `002-spl-transfer`	Perfect execution	✅ Validated

Anti-False-Positive Testing:

Differentiates between "no attempt" (0%) vs "attempted but failed" (partial credit)
Validates granular component scoring (program ID vs data vs accounts)
Ensures weighted scoring prevents gaming the system

🔧 Development & Testing

Integration Tests:

# Full test suite (deterministic + AI agents)
cargo test -p reev-runner

# Specific agent testing
cargo test -p reev-runner --test deterministic_agent_test
cargo test -p reev-runner --test llm_agent_test

Example Testing:

# Protocol examples
cargo run -p reev-agent --example 115-jup-lend-mint-usdc

# Flow examples
cargo run -p reev-agent --example 200-jup-swap-then-lend-deposit

Debugging:

# Enable verbose logging
RUST_LOG=debug cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml

# Check surfpool status
curl http://localhost:8899/health

Name		Name	Last commit message	Last commit date
Latest commit History 547 Commits
benchmarks		benchmarks
crates		crates
db		db
docs		docs
protocols/jupiter		protocols/jupiter
raw		raw
turso-test		turso-test
web		web
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
IDEA.md		IDEA.md
LICENSE		LICENSE
OTEL.md		OTEL.md
PLAN.md		PLAN.md
README.md		README.md
REFLECT.md		REFLECT.md
RULES.md		RULES.md
SURFPOOL.md		SURFPOOL.md
TASKS.md		TASKS.md
TODO.md		TODO.md
TOFIX.md		TOFIX.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

reev 🪸

🎯 Production Status: Complete & Fully Functional

✅ Current Capabilities: Production Ready

🚀 Core Architecture: Real Programs, Controlled State

Core Principles

Key Components

🚀 Quick Start

Prerequisites

🎯 Running Benchmarks

🤖 Agent Options

🎮 Interactive TUI

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

🌊 Flow Benchmarks (200-series)

📡 API Benchmarks (100-series)

🎯 Success Metrics

Current Performance:

Scoring System:

🔧 Development & Testing

Integration Tests:

Example Testing:

Debugging:

About

Uh oh!

Releases

Packages

Languages

Uh oh!

License

Uh oh!

gist-rs/reev

Folders and files

Latest commit

History

Repository files navigation

reev 🪸

🎯 Production Status: Complete & Fully Functional

✅ Current Capabilities: Production Ready

🚀 Core Architecture: Real Programs, Controlled State

Core Principles

Key Components

🚀 Quick Start

Prerequisites

🎯 Running Benchmarks

🤖 Agent Options

🎮 Interactive TUI

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

🌊 Flow Benchmarks (200-series)

📡 API Benchmarks (100-series)

🎯 Success Metrics

Current Performance:

Scoring System:

🔧 Development & Testing

Integration Tests:

Example Testing:

Debugging:

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages