A powerful, intelligent system for automatically crawling, analyzing, and testing API documentation from various sources. Features JavaScript rendering, API spec detection, and comprehensive example testing.
- Hybrid Crawling: HTML + JavaScript rendering with Playwright
- API Spec Detection: Automatic detection and download of OpenAPI/Swagger specs
- Intelligent Testing: LLM-powered example testing and fixing
- Comprehensive Analysis: Multi-agent workflow for complete API documentation processing
- High Success Rate: 85% success rate across various API documentation sites
- Python 3.8+
- Git
- OpenAI API key (for LLM features)
Clone and setup:
git clone https://github.com/yourusername/api-agent-v3.git
cd api-agent-v3
pip install -r requirements.txt
playwright install
Configure environment:
cp env.example .env
# Edit .env with your OpenAI API key
Quick Demo (Recommended for first-time users):
# Test the Human Cell Atlas API (most comprehensive example)
python cli.py --urls "https://service.azul.data.humancellatlas.org/swagger/index.html" --max-pages 3
# Test the 1000 Genomes Project (good for genomics)
python cli.py --urls "https://www.internationalgenome.org/data" --max-pages 2
# Test the UK Biobank (excellent for health data)
python cli.py --urls "https://biobank.ndph.ox.ac.uk/showcase/search.cgi" --max-pages 2
# Test cellxgene-census (CZI Census Python API)
python cli.py --urls "https://chanzuckerberg.github.io/cellxgene-census/python-api.html" --max-pages 2
# Test cBioPortal (cancer genomics)
python cli.py --urls "https://docs.cbioportal.org/web-api-and-clients/" --max-pages 3
Test your own API documentation:
python cli.py --urls "https://docs.example.com/api" --max-pages 5
Advanced options:
# Multiple URLs
python cli.py --urls "https://api1.example.com" "https://api2.example.com" --max-pages 3
# Custom output directory
python cli.py --urls "https://docs.example.com/api" --output-dir ./my_results --save-results
# Step-by-step mode
python cli.py --urls "https://docs.example.com/api" --step-by-step
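For scripted runs, the same invocations can be assembled from Python. The sketch below is only a convenience wrapper and is not part of the repository: `build_cli_command` and `run_cli` are hypothetical names, and the only flags used are the ones shown in the commands above. It assumes it is run from the repository root, where `cli.py` lives.

```python
import subprocess
import sys
from typing import List, Optional


def build_cli_command(
    urls: List[str],
    max_pages: Optional[int] = None,
    output_dir: Optional[str] = None,
    save_results: bool = False,
    step_by_step: bool = False,
) -> List[str]:
    """Build an argument list for cli.py (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one URL is required")
    # Use the current interpreter so the active virtual environment is honored.
    cmd = [sys.executable, "cli.py", "--urls", *urls]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if save_results:
        cmd.append("--save-results")
    if step_by_step:
        cmd.append("--step-by-step")
    return cmd


def run_cli(cmd: List[str]) -> int:
    """Run the command and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    command = build_cli_command(
        ["https://docs.example.com/api"],
        max_pages=5,
        output_dir="./my_results",
        save_results=True,
    )
    # Print instead of executing so the sketch is safe to run anywhere.
    print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain `&` or `?`.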
- Enhanced Python Crawler (agents/enhanced_python_crawler.py)
  - Hybrid crawling with JavaScript rendering
  - API spec detection and download
  - Smart content extraction
- Content-Focused Summarizer (agents/content_focused_summarizer.py)
  - LLM-powered content analysis
  - YAML specification generation
  - Example extraction and organization
- Enhanced YAML Example Runner (agents/enhanced_yaml_example_runner.py)
  - Intelligent example testing
  - LLM-based code fixing
  - Quality assessment and reporting
- Project Manager Agent (project_manager_agent.py)
  - Workflow orchestration
  - Multi-agent coordination
  - Result aggregation and reporting
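To make "API spec detection" more concrete, here is a minimal sketch of one way a crawler might spot OpenAPI/Swagger spec links in a fetched page. It is not the logic in `agents/enhanced_python_crawler.py`; the filename hints in `SPEC_HINTS` and the function name `find_spec_candidates` are assumptions chosen for illustration, and real sites often expose specs at other paths.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

# Assumed filename hints; actual spec locations vary from site to site.
SPEC_HINTS = (
    "openapi.json", "openapi.yaml", "openapi.yml",
    "swagger.json", "swagger.yaml", "swagger.yml",
)


class _LinkCollector(HTMLParser):
    """Collect every href/src attribute value found in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_spec_candidates(html: str, base_url: str) -> List[str]:
    """Return absolute URLs whose path ends with a known spec filename."""
    collector = _LinkCollector()
    collector.feed(html)
    found: List[str] = []
    for link in collector.links:
        absolute = urljoin(base_url, link)
        path = urlparse(absolute).path.lower()
        # str.endswith accepts a tuple of suffixes.
        if path.endswith(SPEC_HINTS) and absolute not in found:
            found.append(absolute)
    return found


if __name__ == "__main__":
    page = '<a href="/v1/openapi.json">Spec</a> <a href="/about">About</a>'
    print(find_spec_candidates(page, "https://docs.example.com/api/"))
```

A filename match is only a candidate: a fuller implementation would download the file and confirm it parses as an OpenAPI or Swagger document before treating it as a spec.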
URL Input → Crawling → Content Analysis → YAML Generation → Example Testing → Results
    ↓           ↓             ↓                  ↓                  ↓            ↓
JavaScript   API Specs   LLM Analysis       Structured         LLM Fixes     Reports
Rendering    Detection   & Extraction       YAML Files         & Testing     & Logs
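The flow above can be read as a chain of stages that pass a shared state object along. The sketch below illustrates that shape only: every stage body is a placeholder, and `PipelineState`, `run_pipeline`, and the stage function names are hypothetical rather than taken from `project_manager_agent.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class PipelineState:
    """Data handed from one stage to the next (hypothetical container)."""
    url: str
    artifacts: Dict[str, Any] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], PipelineState]


def crawl(state: PipelineState) -> PipelineState:
    # Placeholder: a real crawler would fetch and render pages here.
    state.artifacts["pages"] = [f"placeholder page for {state.url}"]
    return state


def analyze(state: PipelineState) -> PipelineState:
    # Placeholder for LLM analysis: only counts the crawled pages.
    state.artifacts["summary"] = {"page_count": len(state.artifacts["pages"])}
    return state


def generate_yaml(state: PipelineState) -> PipelineState:
    # Placeholder: a real stage would emit structured YAML files.
    count = state.artifacts["summary"]["page_count"]
    state.artifacts["yaml"] = f"pages: {count}\n"
    return state


def run_examples(state: PipelineState) -> PipelineState:
    # Placeholder: a real stage would execute and fix extracted examples.
    state.artifacts["report"] = {"passed": 0, "failed": 0}
    return state


STAGES: List[Stage] = [crawl, analyze, generate_yaml, run_examples]


def run_pipeline(url: str, stages: List[Stage] = STAGES) -> PipelineState:
    """Run each stage in order and record which ones completed."""
    state = PipelineState(url=url)
    for stage in stages:
        state = stage(state)
        state.completed.append(stage.__name__)
    return state


if __name__ == "__main__":
    final = run_pipeline("https://docs.example.com/api")
    print(final.completed)
```

Keeping the stages in a plain list makes it straightforward to pause between them, which is one plausible way a step-by-step mode could be built.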
Create a .env file based on env.example:
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
LLM_MODEL=gpt-4o-mini
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
# Crawler Configuration
CRAWLER_MAX_PAGES=20
CRAWLER_DELAY=0.1
CRAWLER_TIMEOUT=10
CRAWLER_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
# JavaScript Rendering Configuration
USE_JS_RENDERING=true
JS_RENDERING_TIMEOUT=60
# Content Processing Configuration
MAX_CONTENT_PAGES=50
MAX_CONTENT_CHARS=50000
MIN_CODE_LENGTH=20
MIN_INLINE_CODE_LENGTH=10
# Script Testing Configuration
SCRIPT_TIMEOUT=120
SCRIPT_TEST_TIMEOUT=30
# Output Configuration
OUTPUT_DIR=data
LOGS_DIR=logs

| Site Type | Success Rate | Content Quality | Examples Generated |
|---|---|---|---|
| Traditional HTML | 95% | High | 25-50 |
| JavaScript-Heavy | 80% | High | 20-40 |
| API Spec Sites | 85% | Very High | 30-60 |
| Overall Average | 85% | High | 25-50 |
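The `.env` settings listed before this table are plain strings, so code that consumes them has to convert types and supply fallbacks. The sketch below shows one way to do that for a subset of the keys. It is an assumption about how such values could be read, not the project's own loader: the `Settings` class and `load_settings` function are hypothetical, and the defaults simply mirror the sample values from the listing.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Typed view over a subset of the documented environment variables."""
    llm_model: str
    llm_max_tokens: int
    llm_temperature: float
    crawler_max_pages: int
    crawler_delay: float
    use_js_rendering: bool
    js_rendering_timeout: int


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default).

    Defaults mirror the sample .env values; the project's real defaults
    may differ.
    """
    env = os.environ if env is None else env
    return Settings(
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        llm_max_tokens=int(env.get("LLM_MAX_TOKENS", "4000")),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.1")),
        crawler_max_pages=int(env.get("CRAWLER_MAX_PAGES", "20")),
        crawler_delay=float(env.get("CRAWLER_DELAY", "0.1")),
        use_js_rendering=_as_bool(env.get("USE_JS_RENDERING", "true")),
        js_rendering_timeout=int(env.get("JS_RENDERING_TIMEOUT", "60")),
    )


if __name__ == "__main__":
    # An exported variable overrides the default, as with
    # `export JS_RENDERING_TIMEOUT=120` in a shell.
    print(load_settings({"JS_RENDERING_TIMEOUT": "120"}))
```

This sketch reads only from the process environment or a supplied mapping; it does not parse the `.env` file itself, which would need a separate loading step.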
- Human Cell Atlas API: https://service.azul.data.humancellatlas.org/swagger/index.html
  - 61,145 chars extracted, 49 examples generated
  - Perfect for: Single-cell genomics, biomedical data
- 1000 Genomes Project: https://www.internationalgenome.org/data
  - 8,472 chars extracted, 49 examples generated
  - Perfect for: Genetic variation data, population genomics
- UK Biobank: https://biobank.ndph.ox.ac.uk/showcase/search.cgi
  - 61,602 chars extracted, 10 examples generated
  - Perfect for: Large-scale health data, epidemiological studies
- UniProt API: https://www.uniprot.org/help/api_queries
  - 861 chars extracted, 40 examples generated
  - Perfect for: Protein sequence data, bioinformatics
- cellxgene-census: https://chanzuckerberg.github.io/cellxgene-census/python-api.html
  - 2,952 chars extracted, 8 examples generated
  - Perfect for: Single-cell genomics, CZI Census data, AnnData integration
- cBioPortal: https://docs.cbioportal.org/web-api-and-clients/
  - Perfect for: Cancer genomics, mutation data, clinical data
- TileDB Cloud Academy: https://cloud.tiledb.com/academy/api-reference/
  - 1,721 chars extracted, 13 examples generated
  - Perfect for: Cloud databases, array computing
api_agent_v3/
├── data/
│   ├── crawled_content/      # Raw crawled data
│   ├── api_specs/            # Generated YAML specifications
│   ├── test_results/         # Example testing results
│   └── retrieved_datasets/   # Downloaded datasets
├── logs/                     # Workflow logs
├── agents/                   # Core agent modules
├── prompts/                  # LLM prompts
└── requirements.txt          # Python dependencies
api_agent_v3/
├── agents/                              # Core agent modules
│   ├── enhanced_python_crawler.py       # Web crawling with JS support
│   ├── content_focused_summarizer.py    # LLM content analysis
│   ├── enhanced_yaml_example_runner.py  # Example testing
│   └── unified_example_runner.py        # Core testing engine
├── cli.py                               # Command-line interface
├── project_manager_agent.py             # Workflow orchestration
├── requirements.txt                     # Dependencies
├── env.example                          # Environment template
└── README.md                            # This file
# Clone and setup
git clone https://github.com/yourusername/api-agent-v3.git
cd api-agent-v3
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
playwright install
# Setup environment
cp env.example .env
# Edit .env with your API keys
JavaScript rendering fails:
# Increase timeout
export JS_RENDERING_TIMEOUT=120
python cli.py --urls "https://example.com"
OpenAI API errors:
# Check API key
echo $OPENAI_API_KEY
# Or set in .env file
Playwright issues:
# Reinstall Playwright
playwright install
Debug mode:
# Enable verbose logging
export LOG_LEVEL=DEBUG
python cli.py --urls "https://example.com" --step-by-step
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and test thoroughly
- Commit your changes: git commit -m 'Add feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: chaishoujie@gmail.com
Made with ❤️ for the API documentation community