Autonomous code generation pipeline using LangGraph orchestration and multi-model LLM approach
A reusable LangGraph workflow that automatically generates, validates, and executes Python data analysis tools from natural language queries. Designed as a composable subgraph for integration into larger agent systems. Built with LangGraph state machine orchestration and powered by specialized LLMs for reasoning and code generation.
What it does:
- Takes a natural language data analysis query
- Extracts structured intent using a reasoning model (DeepSeek-R1)
- Generates formal tool specifications
- Generates Python code using a specialized coding model (Qwen 2.5-Coder)
- Validates code in isolated Docker/subprocess sandbox
- Executes and captures analysis results
- Promotes validated tools to active registry
- Packages all outputs into `projected_*` fields for seamless parent-graph integration
Use as a Child Graph:
This pipeline is designed to be integrated into larger agent systems as a reusable LangGraph subgraph. Use build_graph() to get the compiled graph and integrate it into your parent workflow.
Example Query:
"Run ANOVA across groups, then perform Tukey HSD post-hoc test with p-values and effect sizes"
Result:
- Generated tool: `anova_tukeyhsd_traffic_injuries_<timestamp>.py` - statistical analysis with validated output
- Automatically added to the active tool registry (`tools/active/`)
- LangGraph - StateGraph workflow orchestration and composability
- DeepSeek-R1 70B - Reasoning model for intent extraction and spec generation
- Qwen 2.5-Coder 32B - Code generation and repair
- Ollama - Local LLM inference server
- Pydantic v2.5+ - Data validation
- Python 3.10+ - Core runtime
- Docker - Sandbox isolation (optional subprocess mode available)
```
User Query
    ↓
Intent Extraction (DeepSeek-R1)
    ↓
Spec Generation (DeepSeek-R1)
    ↓
Code Generation (Qwen 2.5-Coder)
    ↓
Validation (syntax + schema + sandbox)
    ├─→ PASS → Executor
    └─→ FAIL → Repair (Qwen 2.5-Coder, max 3 attempts)
                  ↓
               Validation (loops back)
    ↓
Executor (run on actual data)
    ├─→ SUCCESS → Promoter
    └─→ FAIL → END (with error report)
    ↓
Promoter (save to active registry)
    ↓
Projection (package outputs into projected_* fields for parent graph)
    ↓
END
```
The pipeline consists of the following nodes:
- intent_node - Extracts structured intent from natural language using the reasoning model
- spec_generator_node - Creates formal tool specification with I/O schemas
- code_generator_node - Generates Python code implementing the specification
- validator_node - Validates syntax, schema compliance, and sandbox execution
- repair_node - Repairs code based on validation errors (max 3 attempts)
- executor_node - Executes the tool on actual user data
- promoter_node - Promotes successful tool to active registry
- projection_node - Terminal node; packages all child outputs into `projected_*` fields compatible with the parent graph (AnalysisPipelineState)
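The validation branch in the node list above can be sketched as a LangGraph-style conditional-edge router. The sketch below is a hypothetical simplification: the state keys `validation_passed` and `repair_attempts` are illustrative names, not the actual `ToolGeneratorState` fields.

```python
def route_after_validation(state: dict) -> str:
    """Decide the next node after validation (illustrative field names)."""
    if state.get("validation_passed"):
        return "executor_node"    # code is valid: run it on real data
    if state.get("repair_attempts", 0) < 3:
        return "repair_node"      # retry: send validation errors back to the coder model
    return "__end__"              # repair budget exhausted: stop with an error report
```

In a real graph, a function like this would be registered with `add_conditional_edges` on the validator node.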
For detailed architecture and module descriptions, see module_prs/README.md
```
MCP_Tool_Code_Interpreter_Generator/
├── src/                          # Core modules
│   ├── models.py                 # Pydantic models and LangGraph state
│   ├── llm_client.py             # Multi-model LLM client
│   ├── intent_extraction.py      # Intent extraction node
│   ├── intent_validator.py       # Intent validation logic
│   ├── spec_generator.py         # Specification generation node
│   ├── code_generator.py         # Code generation + repair nodes
│   ├── validator.py              # Validation node
│   ├── executor.py               # Execution node
│   ├── promoter.py               # Registry promotion node
│   ├── sandbox.py                # Sandboxed code execution
│   ├── pipeline.py               # LangGraph orchestrator & graph builder
│   └── logger_config.py          # Logging configuration
│
├── tools/                        # Generated tools
│   ├── draft/                    # Initial generated code
│   ├── active/                   # Promoted, production-ready tools
│   └── sandbox/                  # Sandbox workspace for execution
│
├── output/                       # Execution results
│   ├── active/                   # Successful execution outputs
│   └── draft/                    # Failed/debug outputs
│
├── config/                       # Configuration files
│   ├── config.yaml               # Main configuration
│   ├── sandbox_policy.yaml       # Sandbox security policy
│   └── prompts/                  # LLM prompt templates
│       ├── intent_extraction_v2.txt
│       ├── spec_generation.txt
│       ├── code_generation.txt
│       └── code_repair.txt
│
├── registry/                     # Tool registry
│   └── tools.json                # Active tool metadata
│
├── docker/                       # Docker sandbox
│   ├── Dockerfile.sandbox
│   └── docker-compose.sandbox.yml
│
├── integration/                  # Parent-graph integration adapter
│   ├── __init__.py               # Exports build_child_input, apply_child_output
│   └── mapper.py                 # Input mapper + output projector implementation
│
├── tests/                        # Test suite
├── docs/                         # Documentation
└── test.py                       # Interactive pipeline testing
```
- Python 3.10 or higher
- Ollama installed and running
- Docker (optional, for Docker sandbox mode)
```bash
# Pull the reasoning model (for intent extraction & spec generation)
ollama pull deepseek-r1:70b

# Pull the coding model (for code generation & repair)
ollama pull qwen2.5-coder:32b

# Verify models are available
ollama list
```

```bash
# Clone the repository (if applicable)
cd MCP_Tool_Code_Interpreter_Generator

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
```

The default configuration in config/config.yaml should work with Ollama:
```yaml
llm:
  base_url: "http://localhost:11434/v1"  # Ollama default endpoint
  models:
    reasoning: "deepseek-r1:70b"   # Intent + spec generation
    coding: "qwen2.5-coder:32b"    # Code generation + repair
  temperature: 0.0

sandbox:
  mode: "docker"        # or "subprocess" for faster testing
  timeout_seconds: 120
  memory_limit_mb: 512
```

```bash
# Run interactive test with a sample query
python test.py "Calculate average values by group"

# Run with specific query
python test.py "Run ANOVA across groups with Tukey HSD post-hoc test"

# Adjust verbosity
python test.py -v "your query here"
python test.py -d "your query here"
```

The system uses two specialized models for optimal performance:
DeepSeek-R1 (reasoning model):
- Used for: intent extraction, spec generation
- Why: better at understanding complex requirements and planning
- Behavior: may include `<think>` tags in its reasoning process (automatically stripped)

Qwen 2.5-Coder (coding model):
- Used for: code generation, code repair
- Why: specialized for generating clean, efficient Python code
- Behavior: focused output without meta-commentary
LLM Client Internals (src/llm_client.py)
- `model_override` parameter on `QwenLLMClient.__init__` selects the model by alias (`"reasoning"` or `"coding"`)
- Temperature fixed at `0.0` for deterministic structured output
- `<think>` tag stripping: all content between `<think>...</think>` is removed before JSON parsing
- Smart brace boundary detection: finds the outermost `{...}` block in the response to handle wrapped or padded output
- Markdown code fence extraction: strips triple-backtick blocks before JSON parsing
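The `<think>` stripping and outermost-brace detection described above can be sketched in a few lines. This is an illustrative simplification, not the actual `llm_client.py` API; the helper name `extract_json` is assumed.

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip <think> blocks, then parse the outermost {...} span."""
    # Remove reasoning-model <think>...</think> content before parsing
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Locate the outermost brace pair to tolerate padded or wrapped output
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM response")
    return json.loads(cleaned[start:end + 1])
```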
Prompt Design (config/prompts/)
Strict JSON enforcement instructions are sent with every reasoning-model call:

```
CRITICAL INSTRUCTIONS:
- Return ONLY valid JSON conforming to the schema below
- DO NOT include any explanatory text, thinking process, or commentary
- DO NOT use <think> tags or similar meta-text
- DO NOT add markdown code fences around the JSON
- Output must be pure JSON starting with { and ending with }
```
An operation selection guide is embedded in the intent extraction prompt:

```
OPERATION SELECTION GUIDE:
- "top N X by Y" or "most common X" -> use "groupby_aggregate"
- "filter by X" -> use "filter"
- "summary statistics" -> use "describe_summary"
- "ANOVA / statistical test" -> use "statistical_test"
```
For development and testing, use the interactive test script:
```bash
# Test with a specific query
python test.py "your analysis query"

# Adjust verbosity
python test.py -d "query here"
```

The child graph (`ToolGeneratorState`) is designed to be called as a black-box node from a parent graph (`AnalysisPipelineState`). Because the parent uses `extra='forbid'`, the child schema was adapted so that no parent schema changes are required.
| Concern | How it is resolved |
|---|---|
| Parent `extra='forbid'` | Child outputs are projected into existing parent channels only |
| `messages` type mismatch | Child `messages` migrated to `List[BaseMessage]` + `add_messages`, matching the parent exactly |
| Child-internal fields (`tool_spec`, `generated_code`, etc.) | Never written to the parent; they stay inside the child state |
| All child results | Packaged by `projection_node` (terminal child node) into six `projected_*` fields |
After `child_graph.invoke()` completes, `projection_node` has pre-packaged all results into these fields on the returned child state:

| Child field | Parent channel | Type |
|---|---|---|
| `projected_tool_transcript` | `tool_transcript` | `List[Dict[str, Any]]` |
| `projected_artifact_log` | `artifact_log` | `List[str]` |
| `projected_capability_gap` | `capability_gap` | `Optional[Dict[str, Any]]` |
| `projected_errors` | `errors` | `List[str]` |
| `projected_warnings` | `warnings` | `List[str]` |
| `projected_final_artifacts` | `final_artifacts` | `Dict[str, Any]` |
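The merge semantics of these projected fields can be sketched with plain dicts. This is a simplified illustration of what the output projector does, not the actual `integration/mapper.py` implementation; the function name `merge_child_output` is assumed.

```python
def merge_child_output(child_result: dict, parent_state: dict) -> None:
    """Project child fields into parent channels (simplified illustration)."""
    def extend_deduped(parent_key: str, child_key: str) -> None:
        target = parent_state.setdefault(parent_key, [])
        for item in child_result.get(child_key, []):
            if item not in target:  # dedupe while preserving order
                target.append(item)

    extend_deduped("tool_transcript", "projected_tool_transcript")
    extend_deduped("artifact_log", "projected_artifact_log")
    # capability_gap is replaced wholesale; errors/warnings are appended
    parent_state["capability_gap"] = child_result.get("projected_capability_gap")
    parent_state.setdefault("errors", []).extend(child_result.get("projected_errors", []))
    parent_state.setdefault("warnings", []).extend(child_result.get("projected_warnings", []))
    parent_state.setdefault("final_artifacts", {}).update(
        child_result.get("projected_final_artifacts", {})
    )
```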
The `integration/` package at the project root provides two functions that implement the full integration contract. The parent-graph owner only needs these two calls.

```
integration/
├── __init__.py   # exports build_child_input, apply_child_output
└── mapper.py     # full implementation with docstrings
```
```python
import sys
sys.path.insert(0, "/path/to/MCP_Tool_Code_Interpreter_Generator")

from integration import build_child_input, apply_child_output
from src.pipeline import build_graph

child_graph = build_graph()

def tool_generator_node(parent_state: AnalysisPipelineState) -> dict:
    """Parent graph node that runs the child tool-generator pipeline."""
    # Build a valid initial ToolGeneratorState from parent fields:
    #   instruction  -> user_query
    #   dataset_path -> data_path
    child_init = build_child_input(parent_state)

    # Run the child graph (projection_node runs last, populates projected_* fields)
    config = {"configurable": {"thread_id": "toolgen-1"}}
    child_result = child_graph.invoke(child_init, config)

    # Write child projected_* fields into parent-safe channels (in-place):
    #   projected_tool_transcript -> tool_transcript (list extend, deduped)
    #   projected_artifact_log    -> artifact_log    (list extend, deduped)
    #   projected_capability_gap  -> capability_gap  (replace)
    #   projected_errors          -> errors          (list extend)
    #   projected_warnings        -> warnings        (list extend)
    #   projected_final_artifacts -> final_artifacts (dict.update)
    apply_child_output(child_result, parent_state)

    # Return any additional parent-level fields you want to update
    return {}
```

Wire the node into the parent graph:

```python
from langgraph.graph import StateGraph, END

parent = StateGraph(AnalysisPipelineState)
parent.add_node("tool_generator_node", tool_generator_node)
parent.add_edge("planner", "tool_generator_node")
parent.add_edge("tool_generator_node", "reviewer")
parent_graph = parent.compile()
```

The same integration, condensed:

```python
from integration import build_child_input, apply_child_output
from src.pipeline import build_graph
from langgraph.graph import StateGraph, END

child_graph = build_graph()

def tool_generator_node(state):
    child_result = child_graph.invoke(
        build_child_input(state),
        {"configurable": {"thread_id": "toolgen-1"}}
    )
    apply_child_output(child_result, state)
    return {}

parent = StateGraph(AnalysisPipelineState)
parent.add_node("tool_generator_node", tool_generator_node)
# ... add remaining nodes and edges
parent_graph = parent.compile()
```

A non-`None` `capability_gap` after the child runs means the gap detector found no existing tool with ≥ 85% overlap and triggered generation of a new one. This is the normal success path, not an error. Interpret it as: "A new tool was generated to fill this capability gap." The promoted tool metadata is available in `final_artifacts["promoted_tool"]`.
If you need to expose this as a standalone MCP server (not recommended for agent integration), you can use the optional server.py and run_server.py files:

```bash
# This is only needed if NOT using as a child graph
python run_server.py
```

Note: for agent integration, server.py is not needed; use build_graph() or run_pipeline() directly.
```
DRAFT → Validation → Execution → PROMOTED
  ↓         ↓           ↓
  └─ (repair loop) ─────→ REJECTED
```
Tools are stored in different directories based on their status:
- DRAFT (`tools/draft/`) - freshly generated code; may have errors
- ACTIVE (`tools/active/`) - validated and executed successfully; production-ready
- Outputs (`output/active/`) - execution results from successful tool runs
- Registry (`registry/tools.json`) - metadata for all active tools
Generated outputs follow the naming pattern `<operation>_<dataset>_<timestamp>_output.json`, for example:

```
output/active/anova_tukeyhsd_traffic_injuries_20260210_171102_output.json
```
Each output includes:
- Original query and parameters
- Generated code
- Execution results
- Validation report
- Timestamps and metadata
```bash
# Run all tests
pytest tests/

# Run specific module tests
pytest tests/test_intent_extraction.py -v
pytest tests/test_validator.py -v

# Run with coverage
pytest --cov=src tests/

# Integration tests
pytest tests/test_integration.py

# Interactive pipeline test
python test.py "your analysis query"
```

```bash
# Quiet - minimal output
python test.py -q "query"

# Normal - standard progress (default)
python test.py "query"

# Verbose - detailed step information
python test.py -v "query"

# Debug - full LLM prompts and responses
python test.py -d "query"
```

All generated code runs in an isolated sandbox with:
- No network access - Prevents data exfiltration
- Restricted file system - Read-only access to data files only
- Resource limits - CPU, memory, and timeout constraints
- Import restrictions - Only allowlisted libraries permitted
- Subprocess restrictions - No shell commands or external processes
1. Docker Mode (Recommended for Production)

```yaml
sandbox:
  mode: "docker"
  timeout_seconds: 120
  memory_limit_mb: 512
```

- Full container isolation
- Complete environment control
- Higher security guarantees
- Slower startup time

2. Subprocess Mode (Development)

```yaml
sandbox:
  mode: "subprocess"
  timeout_seconds: 30
  memory_limit_mb: 512
```

- Faster execution
- Shared host environment
- Lower isolation guarantees
- Good for rapid iteration
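Subprocess mode can be approximated with the standard library alone. The sketch below is an assumed simplification of what `src/sandbox.py` might do, not its actual implementation; it also omits memory limits, which require OS-specific calls such as `resource.setrlimit`.

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 30) -> dict:
    """Run untrusted code in a separate Python process with a timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode (ignores user site, PYTHON* env vars)
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout}s"}
    finally:
        os.unlink(path)  # clean up the temporary script
```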
Configure allowed/blocked imports in config/sandbox_policy.yaml:

```yaml
allowed_libraries:
  - pandas
  - numpy
  - scipy
  - statsmodels
  - matplotlib
  - seaborn

blocked_imports:
  - os
  - subprocess
  - sys
  - requests
  - urllib
```

See docs/SANDBOX_SECURITY.md for comprehensive security documentation.
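A policy like the one above can be enforced statically, before any code runs. The sketch below is an assumed AST-based approach (not necessarily what `src/validator.py` does), with the allow/block lists inlined from the YAML above.

```python
import ast

ALLOWED = {"pandas", "numpy", "scipy", "statsmodels", "matplotlib", "seaborn"}
BLOCKED = {"os", "subprocess", "sys", "requests", "urllib"}

def check_imports(source: str) -> list:
    """Return policy violations found in the source's import statements."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name in BLOCKED:
                violations.append(f"blocked import: {name}")
            elif name not in ALLOWED:
                violations.append(f"not allowlisted: {name}")
    return violations
```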
```yaml
# LLM Configuration
llm:
  base_url: "http://localhost:11434/v1"
  models:
    reasoning: "deepseek-r1:70b"   # Intent extraction & spec generation
    coding: "qwen2.5-coder:32b"    # Code generation & repair
  temperature: 0.0

# Directory paths
paths:
  draft_dir: "./tools/draft"
  staged_dir: "./tools/staged"
  active_dir: "./tools/active"
  registry: "./registry/tools.json"
  sandbox_workspace: "./tools/sandbox"

# Validation settings
validation:
  max_repair_attempts: 5          # Code repair retry limit
  sandbox_timeout_seconds: 120

# Sandbox configuration
sandbox:
  mode: "docker"          # "docker" or "subprocess"
  timeout_seconds: 120    # Execution timeout
  memory_limit_mb: 512    # Memory limit

# Logging
logging:
  level: "INFO"           # DEBUG, INFO, WARNING, ERROR
  file: "./logs/pipeline.log"
```

Prompt templates (config/prompts/):

- `intent_extraction_v2.txt` - Extract structured intent from queries
- `spec_generation.txt` - Generate tool specifications
- `code_generation.txt` - Generate Python implementation code
- `code_repair.txt` - Repair code based on validation errors
Controls security restrictions for code execution:
- Allowed/blocked Python imports
- Resource limits (CPU, memory, timeout)
- Filesystem access restrictions
- Network policies
- Multi-model LLM integration (DeepSeek-R1 + Qwen 2.5-Coder)
- LangGraph pipeline orchestration
- Intent extraction with structured output
- Specification generation with I/O schemas
- Code generation with FastMCP decorators
- Multi-stage validation (syntax, schema, sandbox)
- Automated code repair (up to 3 attempts)
- Sandboxed execution (Docker + subprocess modes)
- Tool registry and promotion system
- Comprehensive logging and debugging
- Statistical analysis support (ANOVA, Tukey HSD, etc.)
- Graph visualization (Mermaid diagrams)
- `projection_node` - terminal node packaging all outputs into `projected_*` fields
- `ToolGeneratorState` schema updated for parent-graph compatibility (`messages`, `errors`, 6 `projected_*` fields)
- Parent-graph integration adapter (`integration/` - `build_child_input`, `apply_child_output`)
- Enhanced error recovery strategies
- Additional statistical operations
- Performance optimizations
- Extended test coverage
For detailed implementation specs, see module_prs/README.md
```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run linter
ruff check src/

# Format code
black src/ tests/

# Type checking
mypy src/
```

```bash
# Clean sandbox temporary files
python scripts/clean_sandbox.py

# Generate graph visualization
python visualize_graph.py
```

The pipeline automatically generates a Mermaid diagram (pipeline_graph.mmd) showing the LangGraph workflow. View it at mermaid.live.
1. Ollama Connection Failed

```bash
# Check if Ollama is running
ollama list

# Check API endpoint
curl http://localhost:11434/v1/models

# Restart Ollama if needed
# (OS-specific restart command)
```

2. Models Not Found

```bash
# Pull required models
ollama pull deepseek-r1:70b
ollama pull qwen2.5-coder:32b

# Verify models are available
ollama list
```

3. Validation Failures
- Review `ValidationReport.errors` in the output
- Check generated code in `tools/draft/`
- Examine validation details in logs
- Review sandbox execution logs
4. Docker Sandbox Issues

```bash
# Check Docker is running
docker ps

# Build sandbox image
cd docker
docker-compose -f docker-compose.sandbox.yml build

# Check sandbox logs
docker logs <container_id>
```

5. Import Errors in Sandbox

- Verify the library is listed in `config/sandbox_policy.yaml`
- Check the library is installed in the sandbox environment
- For Docker mode: rebuild the sandbox image after adding libraries

6. Memory/Timeout Errors
Adjust limits in config/config.yaml:

```yaml
sandbox:
  timeout_seconds: 120    # Increase timeout
  memory_limit_mb: 1024   # Increase memory
```

Enable detailed logging to troubleshoot issues:

```bash
# Set debug verbosity
python test.py -d "query"
```

Or configure in code:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Check logs in logs/pipeline.log for detailed execution traces.
```bash
# Check draft tools
ls -la tools/draft/

# Check active tools
ls -la tools/active/

# Check execution outputs
ls -la output/active/

# View specific output
cat output/active/anova_tukeyhsd_traffic_injuries_<timestamp>_output.json
```

- Quick Start - Get up and running
- Configuration - System configuration
- Testing - Run tests and validate
- Troubleshooting - Common issues and solutions
- Module PRs - Complete implementation specs
- Sandbox Security - Security policies and implementation
- Logging - Logging configuration and usage
- Architecture Docs - Design decisions and diagrams
- LangGraph - Workflow orchestration and graph composition
- Ollama - Local LLM inference
- DeepSeek - Reasoning model
- Qwen - Code generation model
Last Updated: February 25, 2026
For detailed module documentation and implementation specs, see module_prs/README.md.