Reev ๐ชธ: Production-Ready Framework for Solana LLM Agent Evaluation
reev is a mature, production-ready Rust framework for rigorously evaluating Solana-native LLM agents. After extensive development and testing, the framework now provides a complete, reliable platform for assessing autonomous agents in realistic blockchain environments.
The framework achieves 100% success rates across all benchmark categories:
- ๐ Real Jupiter Integration: Full swap, lending, mint/redeem operations with Jupiter SDK
- ๐ค Advanced Agent Support: Both deterministic (ground truth) and AI agents working perfectly
- ๐ Multi-Step Workflows: Complex DeFi flows with step-by-step orchestration (200-series)
- ๐ Comprehensive Scoring: Granular instruction quality evaluation + on-chain execution metrics
- ๐ฎ Professional Tooling: Interactive TUI cockpit, database persistence, detailed logging
- ๐ฌ Real-World Testing: Mainnet fork validation with actual deployed programs
- โ Scoring System Validation: Complete test suite covering 0%, 50%, 75%, and 100% score scenarios
- ๐ Flow Support: Step-by-step flow execution with proper transaction isolation
The framework operates on surfpool, a high-performance in-memory fork of Solana mainnet, providing:
- ๐ Real-World Logic: Agents interact with actual deployed programs (Jupiter, SPL Token, etc.)
- ๐ Controlled Environment: Precise state management via RPC cheat codes for reproducible testing
- โก High Performance: In-memory execution with fast state manipulation and transaction simulation
- ๐ Hermetic Testing: Every test run starts from identical, controlled initial conditions
- Reproducibility: The primary goal. Every test run is hermetic, guaranteeing that a given benchmark will produce the exact same result every time.
- Service-Oriented Environment: The Solana test validator (
surfpool) is treated as a managed, external service that the environment connects to and configures via RPC. This ensures a clean architectural boundary and prevents dependency conflicts. - Gymnasium-Inspired API: The agent-environment interaction is modeled via a standard Rust
trait(GymEnv) inspired by the Gymnasium API, promoting a clear separation of concerns.
-
reev-lib(Core Library):SolanaEnv: A custom, hermetic evaluation environment that connects to an externalsurfpoolprocess. It handles state setup, transaction execution, and observation generation.- Agent Interface: Defines a simple
Agenttrait and provides anLlmAgentthat can reason about prompts. - Benchmark Structs: Rust types that define the structure of a benchmark YAML file, enabling strongly-typed parsing.
-
reev-runner(CLI Orchestrator):- The command-line tool for loading and running benchmarks.
- Orchestrates the entire evaluation loop, from setting up the environment to calculating metrics and reporting results.
-
reev-agent(LLM Service):- A standalone server that exposes an LLM's reasoning capabilities over an API.
- Can be configured to use different models (local, Gemini, etc.) and includes a deterministic agent for generating ground-truth instructions.
-
Benchmark Suite:
- A suite of evaluation tasks defined in YAML files located in the
benchmarks/directory. - Each test case includes a declarative
initial_state, a natural languageprompt, andground_truthcriteria for success.
- A suite of evaluation tasks defined in YAML files located in the
- Rust Toolchain: Install Rust (latest stable recommended)
- Git: Clone the repository
- Optional LLM: Install LM Studio or have Gemini API key for AI agents
The framework now provides automatic surfpool management - no manual setup required:
# All benchmarks work out of the box
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic
# Jupiter protocols (swap, lending, mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local
# Multi-step flows (swap + lend)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic
# API benchmarks (positions, earnings)
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic
# Scoring validation tests
cargo run -p reev-runner -- benchmarks/003-spl-transfer-fail.yml --agent deterministic # 0% score
cargo run -p reev-runner -- benchmarks/004-partial-score-spl-transfer.yml --agent deterministic # ~50% scoreDeterministic Agent (Ground Truth):
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministicLocal Model Agent:
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent localGemini Agent:
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent gemini-2.5-flash-liteLaunch the interactive cockpit for real-time monitoring:
cargo run -p reev-tuiFeatures:
- ๐ Live benchmark execution with status updates
- ๐ Detailed execution trace analysis
- ๐ท๏ธ Agent selection (deterministic, local, glm-4.6, gemini)
- ๐ Real-time scoring and metrics
Real on-chain operations with Jupiter protocols:
# Jupiter swap
cargo run -p reev-runner -- benchmarks/100-jup-swap-sol-usdc.yml --agent local
# Jupiter lending (mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent localMulti-step DeFi workflows with step-by-step execution:
# Swap then lend (2 steps: swap SOLโUSDC, then deposit USDC)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic
# More flow benchmarks coming soon...Flow Execution Features:
- โ Step-by-Step Processing: Each flow step executes as a separate transaction
- โ Transaction Isolation: Proper error handling per step, no cascading failures
- โ State Management: Account state flows between steps automatically
- โ Agent Consistency: Both deterministic and AI agents handle flows identically
Data retrieval and portfolio management:
# Positions and earnings
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic- โ 100% Success Rate: All benchmarks passing with local model
- โ Real Jupiter Integration: Full protocol stack working
- โ Multi-Step Flows: Complex workflows executing step-by-step successfully
- โ Production Infrastructure: TUI, database, logging all operational
- โ Scoring System Validation: Comprehensive test suite covering full score spectrum
- โ Anti-False-Positive Protection: Differentiates failure modes accurately
- โ Flow Framework: Robust step-by-step execution with proper error handling
The framework implements a sophisticated two-tiered scoring system:
Component Breakdown:
- Instruction Quality (75%): Granular evaluation of generated transactions
- Program ID matching (configurable weight)
- Instruction data validation (configurable weight)
- Account metadata verification (signer/writable flags)
- On-Chain Execution (25%): Binary success/failure on surfpool
- Composite Scoring: Weighted average for final assessment
Flow Scoring:
- Per-Step Evaluation: Each flow step is scored individually
- Combined Results: Step scores aggregated for final flow assessment
- Partial Credit: Successful steps count even if later steps fail
Validated Score Scenarios:
| Score Range | Test Case | Purpose | Status |
|---|---|---|---|
| ~75% | 003-spl-transfer-fail |
Correct instruction, on-chain failure | โ Validated |
| ~78.6% | 004-partial-score-spl-transfer |
Partial credit (correct ID, some errors) | โ Validated |
| ~75% | 100-jup-swap-sol-usdc (pre-fix) |
Good reasoning, execution failure | โ Validated |
| 100% | 001-sol-transfer, 002-spl-transfer |
Perfect execution | โ Validated |
Anti-False-Positive Testing:
- Differentiates between "no attempt" (0%) vs "attempted but failed" (partial credit)
- Validates granular component scoring (program ID vs data vs accounts)
- Ensures weighted scoring prevents gaming the system
# Full test suite (deterministic + AI agents)
cargo test -p reev-runner
# Specific agent testing
cargo test -p reev-runner --test deterministic_agent_test
cargo test -p reev-runner --test llm_agent_test# Protocol examples
cargo run -p reev-agent --example 115-jup-lend-mint-usdc
# Flow examples
cargo run -p reev-agent --example 200-jup-swap-then-lend-deposit# Enable verbose logging
RUST_LOG=debug cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml
# Check surfpool status
curl http://localhost:8899/health