Automated Experiment Orchestration for AI Research
CrucibleHarness is an Elixir library for orchestrating, executing, and analyzing large-scale AI research experiments. It provides the infrastructure to run experiments systematically across multiple conditions, datasets, and configurations while maintaining reproducibility, fault tolerance, and detailed statistical analysis.
Think of it as "pytest + MLflow + Weights & Biases" for Elixir AI research.
- Declarative Experiment Definition - DSL for expressing complex experimental designs
- Parallel Execution - Leverage BEAM's concurrency for efficient multi-condition runs
- Fault Tolerance - Resume experiments after failures without data loss
- Statistical Analysis - Automated significance testing across all condition pairs
- Multi-Format Reporting - Generate Markdown, LaTeX, HTML, and Jupyter notebooks
- Cost Management - Estimate and control API costs before execution
- Reproducibility - Version control for experiments, controlled random seeds, full audit trails
- Lifecycle Hooks (v0.2.0) - Extensible callbacks for setup, teardown, and custom error handling
- Error Recovery (v0.2.0) - Automatic retry with exponential backoff and circuit breaker
- Metric Validation (v0.2.0) - Runtime schema validation with type coercion
- Solver Pipelines (v0.3.1) - Composable execution steps inspired by inspect-ai
- State Threading (v0.3.1) - TaskState carries messages, targets, and limits through solver chains
- LLM Backend Abstraction (v0.3.1) - Swappable Generate backends with tool-call support
defmodule MyExperiment do
  use CrucibleHarness.Experiment

  name "My Research Experiment"
  description "Comparing baseline vs treatment"
  dataset :mmlu_200

  conditions [
    %{name: "baseline", fn: &baseline_condition/1},
    %{name: "treatment", fn: &treatment_condition/1}
  ]

  metrics [:accuracy, :latency_p99, :cost_per_query]
  repeat 3

  config %{
    timeout: 30_000,
    rate_limit: 10
  }

  def baseline_condition(query) do
    # Your implementation
    %{prediction: "answer", accuracy: 0.75, latency: 100, cost: 0.01}
  end

  def treatment_condition(query) do
    # Your implementation
    %{prediction: "answer", accuracy: 0.82, latency: 150, cost: 0.02}
  end
end

# Estimate cost and time first
{:ok, estimates} = CrucibleHarness.estimate(MyExperiment)
IO.puts("Estimated cost: $#{estimates.cost.total_cost}")
IO.puts("Estimated time: #{estimates.time.estimated_duration}ms")
# Run the experiment
{:ok, report} = CrucibleHarness.run(MyExperiment,
  output_dir: "./results",
  formats: [:markdown, :latex, :html]
)

Reports are automatically generated in your specified formats:
- results/exp_12345_report.markdown - Markdown report
- results/exp_12345_report.latex - LaTeX tables and figures
- results/exp_12345_report.html - Interactive HTML report
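To get a feel for what the estimate step has to account for, here is a back-of-the-envelope sketch. It is not CrucibleHarness code: EstimateSketch is a hypothetical module, and it assumes one task per (query, condition, repetition) triple, that :mmlu_200 holds 200 queries, and a flat illustrative cost of $0.015 per task.

```elixir
defmodule EstimateSketch do
  @moduledoc "Hypothetical estimator for illustration; not part of CrucibleHarness."

  # Assumption: every query runs once per condition per repetition.
  def task_count(queries, conditions, repeats), do: queries * conditions * repeats

  # Assumption: a flat average cost per task, in USD, given as a float.
  def total_cost(tasks, cost_per_task), do: Float.round(tasks * cost_per_task, 2)
end

# 200 queries (assumed size of :mmlu_200), 2 conditions, repeat 3
tasks = EstimateSketch.task_count(200, 2, 3)
total = EstimateSketch.total_cost(tasks, 0.015)
IO.puts("#{tasks} tasks, roughly $#{total}")
```

The real estimator reports its own fields (estimates.cost.total_cost and estimates.time.estimated_duration), so treat these numbers only as a sanity check.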
Hooks provide extension points during experiment execution for setup, teardown, logging, and custom error handling:
defmodule MyExperiment do
  use CrucibleHarness.Experiment

  name "Experiment with Hooks"
  dataset :my_dataset
  conditions [%{name: "test", fn: &test_condition/1}]
  metrics [:accuracy, :latency]

  # Called once before experiment starts - can modify config
  before_experiment fn config ->
    Logger.info("Starting experiment: #{config.name}")
    {:ok, Map.put(config, :start_time, DateTime.utc_now())}
  end

  # Called once after experiment completes
  after_experiment fn config, results ->
    duration = DateTime.diff(DateTime.utc_now(), config.start_time, :second)
    Logger.info("Completed in #{duration}s with #{length(results)} results")
    :ok
  end

  # Called before each condition execution
  before_condition fn condition, query ->
    Logger.metadata(condition: condition.name, query_id: query.id)
    :ok
  end

  # Called after each condition execution
  after_condition fn condition, query, result ->
    :telemetry.execute([:experiment, :task, :complete], %{latency: result.latency}, %{})
    :ok
  end

  # Called when a condition fails - return :retry, :skip, or :abort
  on_error fn condition, query, error ->
    case error do
      {:error, :timeout} -> :retry
      {:error, :rate_limited} -> :retry
      {:error, :authentication_failed} -> :abort
      _ -> :skip
    end
  end

  def test_condition(query), do: %{accuracy: 0.85, latency: 100}
end

Hook Signatures:
- before_experiment(config) → {:ok, config} or :ok
- after_experiment(config, results) → :ok
- before_condition(condition, query) → :ok
- after_condition(condition, query, result) → :ok
- on_error(condition, query, error) → :retry | :skip | :abort
All hooks are optional and errors in hooks are handled gracefully (they won't crash your experiment).
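One way to picture "handled gracefully" is a wrapper that turns any raise, throw, or exit inside a hook into a tagged tuple. The sketch below is an assumption about the general technique, not the library's Hooks.Executor; SafeHookSketch and its return shapes are hypothetical.

```elixir
defmodule SafeHookSketch do
  @moduledoc "Hypothetical wrapper for illustration; not the CrucibleHarness executor."

  # Calls the hook with the given argument list and never lets a failure escape.
  def run(hook, args) when is_function(hook) and is_list(args) do
    try do
      {:ok, apply(hook, args)}
    rescue
      # Raised exceptions (RuntimeError, KeyError, ...)
      exception -> {:error, exception}
    catch
      # Thrown values and exits
      kind, reason -> {:error, {kind, reason}}
    end
  end
end

# A hook that succeeds and a hook that crashes; the caller survives both.
IO.inspect(SafeHookSketch.run(fn _condition, _query -> :ok end, [%{name: "test"}, %{id: 1}]))
IO.inspect(SafeHookSketch.run(fn _config -> raise "boom" end, [%{}]))
```

With a wrapper like this, the runner can log the error tuple and carry on with the next task instead of crashing.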
Configure automatic retry with exponential backoff for transient failures:
config %{
  error_handling: %{
    # Retry strategy: :exponential_backoff, :constant, or :linear
    retry_strategy: :exponential_backoff,
    max_retries: 3,
    initial_delay_ms: 1000,
    max_delay_ms: 30_000,
    backoff_factor: 2.0,
    jitter: true, # Add randomness to prevent thundering herd

    # Dead letter queue for permanently failed tasks
    dlq_enabled: true,
    dlq_path: "./failed_tasks.jsonl",

    # Circuit breaker - abort if failure rate exceeds threshold
    max_failure_rate: 0.1, # Abort if >10% tasks fail
    failure_window: 100 # Over last 100 tasks
  }
}

Error Classification:
- Retryable errors: :timeout, :connection_refused, :rate_limited, HTTP 429/502/503/504
- Permanent errors: :invalid_query, :authentication_failed, HTTP 400/401/403/404
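The settings above combine three ideas: classifying an error, computing a capped exponential delay, and tripping a circuit breaker on a sliding window. The following is a minimal sketch of those ideas under stated assumptions, not the library's Errors modules. RetrySketch is hypothetical, HTTP statuses are assumed to arrive as plain integers, outcomes are assumed to be :ok or :error atoms with the newest last, and jitter is left out so the delays stay deterministic.

```elixir
defmodule RetrySketch do
  @moduledoc "Hypothetical illustration of the retry settings; not CrucibleHarness code."

  @retryable [:timeout, :connection_refused, :rate_limited]
  @permanent [:invalid_query, :authentication_failed]

  # Maps an error reason or HTTP status (assumed integer) to a retry class.
  def classify(reason) when reason in @retryable, do: :retryable
  def classify(reason) when reason in @permanent, do: :permanent
  def classify(status) when status in [429, 502, 503, 504], do: :retryable
  def classify(status) when status in [400, 401, 403, 404], do: :permanent
  def classify(_other), do: :unknown

  # Delay before retry number `attempt` (1-based):
  # initial * factor^(attempt - 1), capped at max_delay_ms.
  def delay_ms(attempt, opts \\ []) do
    initial = Keyword.get(opts, :initial_delay_ms, 1000)
    factor = Keyword.get(opts, :backoff_factor, 2.0)
    max_delay = Keyword.get(opts, :max_delay_ms, 30_000)
    min(round(initial * :math.pow(factor, attempt - 1)), max_delay)
  end

  # True when the share of :error outcomes in the last `window` tasks
  # is strictly greater than max_failure_rate.
  def tripped?(outcomes, max_failure_rate \\ 0.1, window \\ 100) do
    recent = Enum.take(outcomes, -window)
    failures = Enum.count(recent, &(&1 == :error))
    recent != [] and failures / length(recent) > max_failure_rate
  end
end

IO.inspect(Enum.map(1..6, &RetrySketch.delay_ms/1))
```

With the defaults shown, the first two delays are 1000 ms and 2000 ms, and the sixth attempt hits the 30_000 ms cap.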
Task results now include retry information:
%{
  result: {:ok, %{accuracy: 0.85}},
  attempts: 2,
  retry_delays: [1000, 2000],
  final_status: :success, # :success | :failed_permanent | :failed_retries_exhausted
  error_history: [%{attempt: 1, error: :timeout, timestamp: ~U[...]}]
}

Define schemas to validate metrics at runtime and catch errors early:
defmodule MyExperiment do
  use CrucibleHarness.Experiment

  name "Validated Experiment"
  dataset :my_dataset
  conditions [%{name: "test", fn: &test_condition/1}]
  metrics [:accuracy, :latency, :cost]

  # Define validation schemas for each metric
  metric_schemas %{
    accuracy: %{type: :float, min: 0.0, max: 1.0, required: true},
    latency: %{type: :number, min: 0, unit: :milliseconds, required: true},
    cost: %{type: :float, min: 0.0, required: false, default: 0.0},
    custom: %{
      type: :map,
      schema: %{
        value: %{type: :number, min: 0},
        confidence: %{type: :float, min: 0.0, max: 1.0}
      }
    }
  }

  config %{
    metric_validation: %{
      enabled: true,
      on_invalid: :log_and_continue, # :log_and_continue | :log_and_retry | :abort
      coerce_types: true # Try to convert "0.85" -> 0.85
    }
  }

  def test_condition(query) do
    %{accuracy: 0.85, latency: 123, custom: %{value: 42, confidence: 0.95}}
  end
end

Schema Helpers:
alias CrucibleHarness.Validation.Schema

# Common schema types
Schema.float(min: 0.0, max: 1.0) # Float with range
Schema.number(min: 0)            # Integer or float
Schema.map(schema: %{...})       # Nested map validation
Schema.percentage()              # 0-100 float
Schema.probability()             # 0-1 float
Schema.positive_number()         # >= 0
Schema.duration_ms()             # Positive number in milliseconds

Build composable LLM execution pipelines using inspect-ai-inspired patterns:
alias CrucibleHarness.{Solver, TaskState}
alias CrucibleHarness.Solver.{Chain, Generate}
# Define a custom solver
defmodule SystemPromptSolver do
  use CrucibleHarness.Solver

  @impl true
  def solve(state, _generate_fn) do
    msg = %{role: "system", content: "You are a helpful assistant."}
    {:ok, TaskState.add_message(state, msg)}
  end
end

# Create a solver chain
chain = Chain.new([
  SystemPromptSolver,
  Generate.new(%{
    model: "gpt-4",
    temperature: 0.7,
    max_tokens: 500,
    stop: [],
    tool_calls: "loop"
  })
])

# Initialize state from a sample
sample = %{id: "sample_1", input: "Explain recursion briefly."}

tool =
  CrucibleHarness.Tool.new(
    name: "adder",
    handler: fn %{"a" => a, "b" => b} -> {:ok, a + b} end
  )

state =
  TaskState.new(sample)
  |> TaskState.set_tools([tool])

# Define your LLM backend
generate_fn = fn state, config ->
  # Call your LLM (Tinkex, OpenAI, etc.)
  MyLLMBackend.generate(state.messages, config)
end

# Execute the chain
{:ok, result} = Chain.solve(chain, state, generate_fn)

# Access results
IO.puts(result.output.content)

Key Concepts:
- Solver - A behaviour for execution steps (solve/2 callback)
- Chain - Composes solvers sequentially; stops on error or state.completed
- TaskState - Carries messages, targets, limits, and inter-solver data via store
- Generate - Behaviour for LLM backends; Solver.Generate is a built-in solver
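The "stops on error or state.completed" rule can be sketched with Enum.reduce_while. This is a simplified stand-in, not the real Chain: ChainSketch is hypothetical, solvers are plain two-argument functions instead of behaviour modules, state is a bare map, and the stub backend takes only the message list.

```elixir
defmodule ChainSketch do
  @moduledoc "Hypothetical sequential solver runner; not CrucibleHarness.Solver.Chain."

  # Threads state through each solver in order. Halts on the first
  # {:error, reason} or as soon as a solver returns a completed state.
  def solve(solvers, state, generate_fn) do
    Enum.reduce_while(solvers, {:ok, state}, fn solver, {:ok, acc} ->
      case solver.(acc, generate_fn) do
        {:ok, %{completed: true} = next} -> {:halt, {:ok, next}}
        {:ok, next} -> {:cont, {:ok, next}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end

add_system = fn state, _generate_fn ->
  msg = %{role: "system", content: "You are a helpful assistant."}
  {:ok, %{state | messages: state.messages ++ [msg]}}
end

generate = fn state, generate_fn ->
  with {:ok, reply} <- generate_fn.(state.messages) do
    msg = %{role: "assistant", content: reply}
    {:ok, %{state | messages: state.messages ++ [msg], completed: true}}
  end
end

# Never runs: the chain halts once `generate` marks the state completed.
unreachable = fn _state, _generate_fn -> {:error, :should_not_run} end

stub_llm = fn _messages -> {:ok, "stubbed reply"} end
initial = %{messages: [], completed: false}

{:ok, final} = ChainSketch.solve([add_system, generate, unreachable], initial, stub_llm)
IO.inspect(final.messages)
```

Because every step returns a tagged tuple, a failing backend call surfaces as the chain's result without any step needing its own error plumbing.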
Implementing a Generate Backend:
defmodule MyBackend do
  @behaviour CrucibleHarness.Generate

  @impl true
  def generate(messages, config) do
    # Call your LLM API
    {:ok, %{
      content: "Response text",
      finish_reason: "stop",
      usage: %{prompt_tokens: 10, completion_tokens: 20, total_tokens: 30},
      tool_calls: []
    }}
  end
end

CrucibleHarness can interpret a Jido.Plan DAG as a solver chain for evaluation pipelines.
This adapter is optional and does not add a hard dependency on jido_action. If you want
to use it, add {:jido_action, "..."} to your application deps.
alias CrucibleHarness.{PlanAdapter, Solver, TaskState}
alias CrucibleHarness.Solver.Chain
plan =
  Jido.Plan.new()
  |> Jido.Plan.add(:prepare, MyApp.Actions.Prepare)
  |> Jido.Plan.add(:generate, MyApp.Actions.Generate, depends_on: :prepare)
  |> Jido.Plan.add(:score, MyApp.Actions.Score, depends_on: :generate)

step_runner = fn state, step, meta, generate_fn ->
  # Use step.instruction + meta to execute your Jido Action
  # and return {:ok, TaskState.t()} or {:error, term()}
  MyPlanRunner.run_step(state, step.instruction, meta, generate_fn)
end

{:ok, chain} = PlanAdapter.to_solver_chain(plan, step_runner: step_runner)

sample = %{id: "sample-1", input: "Explain recursion."}
state = TaskState.new(sample)
{:ok, result_state} = Chain.solve(chain, state, &MyBackend.generate/2)

The adapter forwards LineageIR dimensions (trace_id, work_id, plan_id, step_id)
into step metadata so you can emit consistent telemetry inside your runner.
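Turning a dependency DAG into a sequential chain means ordering steps so that each one runs after everything it depends on. The sketch below shows that idea on a plain map of step names to dependency lists. It is not how PlanAdapter is implemented, and PlanOrderSketch and its input shape are hypothetical.

```elixir
defmodule PlanOrderSketch do
  @moduledoc "Hypothetical DAG linearizer for illustration; not the CrucibleHarness PlanAdapter."

  # steps: %{step_name => [names it depends on]}
  # Returns {:ok, ordered_names}, or {:error, :unresolvable} when a cycle
  # or an unknown dependency prevents any further progress.
  def linearize(steps) when is_map(steps), do: do_linearize(steps, [])

  defp do_linearize(remaining, done) when map_size(remaining) == 0,
    do: {:ok, Enum.reverse(done)}

  defp do_linearize(remaining, done) do
    # A step is ready once all of its dependencies have already been placed.
    ready =
      remaining
      |> Enum.filter(fn {_name, deps} -> Enum.all?(deps, &(&1 in done)) end)
      |> Enum.map(fn {name, _deps} -> name end)
      |> Enum.sort()

    case ready do
      [] -> {:error, :unresolvable}
      names -> do_linearize(Map.drop(remaining, names), Enum.reverse(names) ++ done)
    end
  end
end

plan = %{prepare: [], generate: [:prepare], score: [:generate]}
IO.inspect(PlanOrderSketch.linearize(plan))
```

Sorting the ready steps keeps the order deterministic when several steps become runnable at the same time.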
defmodule EnsembleSizeSweep do
  use CrucibleHarness.Experiment

  name "Ensemble Size Sweep (1-10 models)"
  dataset :mmlu_200

  conditions for n <- 1..10 do
    %{
      name: "ensemble_#{n}",
      fn: &ensemble(&1, models: n)
    }
  end

  metrics [:accuracy, :latency_p99, :cost_per_query]
  repeat 5
end

cost_budget %{
  max_total: 100.00, # $100 maximum
  max_per_condition: 25.00, # $25 per condition max
  currency: :usd
}

statistical_analysis %{
  significance_level: 0.05,
  multiple_testing_correction: :bonferroni,
  confidence_interval: 0.95
}

# Run experiment (will checkpoint automatically)
{:ok, report} = CrucibleHarness.run(MyExperiment)

# If interrupted, resume from last checkpoint
{:ok, report} = CrucibleHarness.resume("exp_12345")

CrucibleHarness
├── Experiment (DSL & Definition)
├── Runner (Execution Engine with GenStage/Flow)
├── Collector (Results Aggregation & Statistical Analysis)
├── Reporter (Multi-Format Output Generation)
├── Hooks (Lifecycle Hook Execution) [v0.2.0]
│ └── Executor (Safe hook execution with error handling)
├── Errors (Error Recovery Framework) [v0.2.0]
│ ├── Classifier (Error type classification)
│ ├── Retry (Exponential backoff logic)
│ └── DLQ (Dead letter queue for failed tasks)
├── Validation (Metric Validation) [v0.2.0]
│ ├── Schema (Schema definition helpers)
│ └── MetricValidator (Runtime validation)
├── Solver (Composable Execution Steps) [v0.3.2]
│ ├── Chain (Sequential solver composition)
│ └── Generate (Built-in LLM generation solver)
├── TaskState (State Threading for Pipelines) [v0.3.2]
├── Generate (LLM Backend Abstraction) [v0.3.2]
└── Utilities (Cost/Time Estimation, Checkpointing)
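The Collector tests all condition pairs, and the statistical_analysis block above sets significance_level: 0.05 with :bonferroni correction. As a worked sketch of what that implies, the Bonferroni rule divides the significance level by the number of comparisons. BonferroniSketch is a hypothetical module, and it assumes every unordered pair of conditions counts as exactly one test.

```elixir
defmodule BonferroniSketch do
  @moduledoc "Hypothetical illustration of Bonferroni correction; not CrucibleHarness code."

  # Number of unordered pairs among n conditions: n * (n - 1) / 2.
  def pair_count(n) when is_integer(n) and n >= 2, do: div(n * (n - 1), 2)

  # Per-comparison threshold: alpha divided by the number of pairwise tests.
  def adjusted_alpha(alpha, n_conditions), do: alpha / pair_count(n_conditions)

  # A pair is significant only if its p-value is below the adjusted threshold.
  def significant?(p_value, alpha, n_conditions),
    do: p_value < adjusted_alpha(alpha, n_conditions)
end

# The ten-condition ensemble sweep yields 45 pairwise comparisons.
IO.inspect(BonferroniSketch.pair_count(10))
IO.inspect(BonferroniSketch.adjusted_alpha(0.05, 10))
```

For the ten-condition sweep, a p-value of 0.01 would pass an uncorrected 0.05 threshold but fail the corrected one, which is why sweeps with many conditions usually need more repetitions to show an effect.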
See the examples/ directory for complete examples:
- simple_comparison.ex - Basic two-condition comparison
- ensemble_comparison.ex - Multi-condition ensemble evaluation
- plan_pipeline.exs - Jido.Plan adapter pipeline example
CrucibleHarness.run runs an experiment and generates reports.
Options:
- :output_dir - Directory for results (default: "./results")
- :formats - Report formats (default: [:markdown])
- :checkpoint_dir - Checkpoint directory (default: "./checkpoints")
- :dry_run - Validate without executing (default: false)

CrucibleHarness.estimate estimates cost and time without running the experiment.

CrucibleHarness.resume resumes a failed or interrupted experiment from its checkpoint.
Required fields:
- name - Experiment name
- dataset - Dataset identifier
- conditions - List of experimental conditions
- metrics - Metrics to collect

Optional fields:
- description - Detailed description
- author - Experiment author
- version - Experiment version
- tags - Tags for organization
- repeat - Number of repetitions (default: 1)
- config - Execution configuration
- cost_budget - Budget constraints
- statistical_analysis - Analysis parameters
- custom_metrics - Custom metric definitions
- metric_schemas - Validation schemas for metrics (v0.2.0)
- before_experiment - Hook called before experiment starts (v0.2.0)
- after_experiment - Hook called after experiment completes (v0.2.0)
- before_condition - Hook called before each condition (v0.2.0)
- after_condition - Hook called after each condition (v0.2.0)
- on_error - Hook for custom error handling (v0.2.0)
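To illustrate what a metric_schemas entry such as %{type: :float, min: 0.0, max: 1.0} asks of a value, here is a minimal range check with optional string coercion. It is a sketch, not the library's MetricValidator. MetricCheckSketch and its error atoms are hypothetical, and it handles numeric values only, ignoring :type, :required, and nested maps.

```elixir
defmodule MetricCheckSketch do
  @moduledoc "Hypothetical numeric metric check; not the CrucibleHarness MetricValidator."

  # Returns {:ok, number} or {:error, reason} for one value and one schema map.
  def validate(value, schema, coerce? \\ true) do
    with {:ok, number} <- to_number(value, coerce?),
         :ok <- check_min(number, schema[:min]),
         :ok <- check_max(number, schema[:max]) do
      {:ok, number}
    end
  end

  defp to_number(value, _coerce?) when is_number(value), do: {:ok, value}

  # Coercion: accept a string only if all of it parses as a number.
  defp to_number(value, true) when is_binary(value) do
    case Float.parse(value) do
      {number, ""} -> {:ok, number}
      _ -> {:error, :not_a_number}
    end
  end

  defp to_number(_value, _coerce?), do: {:error, :not_a_number}

  defp check_min(_number, nil), do: :ok
  defp check_min(number, min) when number >= min, do: :ok
  defp check_min(_number, _min), do: {:error, :below_min}

  defp check_max(_number, nil), do: :ok
  defp check_max(number, max) when number <= max, do: :ok
  defp check_max(_number, _max), do: {:error, :above_max}
end

schema = %{type: :float, min: 0.0, max: 1.0}
IO.inspect(MetricCheckSketch.validate("0.85", schema))
IO.inspect(MetricCheckSketch.validate(1.2, schema))
```

Passing false as the third argument mirrors coerce_types: false, in which case string inputs are rejected rather than parsed.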
Add to your config.exs:
config :research_harness,
  checkpoint_dir: "./checkpoints",
  results_dir: "./results"

Run the test suite with:
mix test

Add crucible_harness to your list of dependencies in mix.exs:
def deps do
  [
    {:crucible_harness, "~> 0.3.3"}
  ]
end

Or install from GitHub:
def deps do
  [
    {:crucible_harness, github: "nshkrdotcom/elixir_ai_research", sparse: "apps/research_harness"}
  ]
end

Version 0.3.3 updates the telemetry dependency and cleans up dependency declarations.
- Updated telemetry from ~> 1.2 to ~> 1.3
- Removed commented-out dependency declarations
All existing code continues to work unchanged.
Version 0.3.2 expands inspect-ai parity for TaskState and tool-calling generate flows.
- CrucibleHarness.Tool - Tool definitions and execution for tool-call flows
- CrucibleHarness.TaskState.Choices - Multiple-choice helper struct
- TaskState enhancements: model/epoch/targets/limits/tool metadata
alias CrucibleHarness.{Solver, TaskState}
alias CrucibleHarness.Solver.{Chain, Generate}
# Create a sample and initial state
sample = %{id: "test_1", input: "What is 2+2?"}
state = TaskState.new(sample)
# Define a generate function (your LLM backend)
generate_fn = fn state, config ->
  {:ok, %{content: "4", finish_reason: "stop", usage: %{}}}
end

# Build a solver chain
chain = Chain.new([
  MySystemPromptSolver,
  Generate.new(%{temperature: 0.7, max_tokens: 100}),
  MyValidationSolver
])

# Execute the chain
{:ok, result_state} = Chain.solve(chain, state, generate_fn)

Existing experiment definitions continue to work unchanged. The new modules are additive.
Version 0.3.0 introduces integration with the new crucible_ir library for shared IR structs. This change is mostly transparent to users.
- CrucibleHarness now depends on crucible_ir v0.1.1 for IR struct definitions
- Updated to work with crucible_framework v0.5.0
- Internal IR module references updated from Crucible.IR.* to CrucibleIR.*

Most users do not need to change any code: the public API remains unchanged.
Only if you were directly using internal IR structs (uncommon), update your aliases:
# Before
alias Crucible.IR.{Experiment, BackendRef, StageDef}
# After
alias CrucibleIR.{Experiment, BackendRef, StageDef}

The new version automatically brings in:
- crucible_ir v0.1.1 - Shared IR structs
- crucible_framework v0.5.0 - Updated framework integration
Documentation can be generated with ExDoc:
mix docs

This is part of the Spectra AI research infrastructure. Contributions welcome!
MIT License - see LICENSE file for details
Built for systematic AI research experimentation with a focus on ensemble methods, hedging strategies, and model comparisons.