🏆 MCP Evaluation Server

The Ultimate AI Evaluation Platform

📊 Tools: 63 Specialized Evaluation Tools

👨‍💻 Author: Mihai Criveti

An MCP server providing a comprehensive AI evaluation platform. It features 63 specialized tools across 14 categories for complete AI system assessment, using LLM-as-a-judge techniques combined with rule-based metrics.

🎯 Tool Categories Overview

📊 Core Evaluation (15 tools)

  • 🤖 4 Judge Tools - LLM-as-a-judge evaluation with bias mitigation
  • 📝 4 Prompt Tools - Clarity, consistency, completeness analysis
  • 🛠️ 4 Agent Tools - Tool usage, reasoning, task completion assessment
  • 🔍 3 Quality Tools - Factuality, coherence, toxicity detection

🔬 Advanced Assessment (39 tools)

  • 🔗 8 RAG Tools - Retrieval relevance, context utilization, grounding verification
  • ⚖️ 6 Bias & Fairness - Demographic bias, representation equity, intersectional analysis
  • 🛡️ 5 Robustness Tools - Adversarial testing, injection resistance, stability analysis
  • 🔒 4 Safety & Alignment - Harmful content detection, instruction adherence, value alignment
  • 🌍 4 Multilingual Tools - Translation quality, cross-lingual consistency, cultural adaptation
  • ⚡ 4 Performance Tools - Latency tracking, efficiency metrics, throughput scaling
  • 🔐 8 Privacy Tools - PII detection, data minimization, compliance, anonymization

🔧 System Management (9 tools)

  • 🔄 3 Workflow Tools - Evaluation suites, parallel execution, results comparison
  • 📊 2 Calibration Tools - Judge agreement testing, rubric optimization
  • 🏥 4 Server Tools - Health monitoring, cache statistics, system management

⚡ Technology

  • 🤖 LLM-as-a-Judge - GPT-4, Azure OpenAI, with position bias mitigation
  • 📈 Statistical Rigor - Confidence intervals, significance testing, correlation analysis
  • 🎪 Multi-Modal Assessment - Pattern matching + LLM evaluation + rule-based metrics
  • 🏗️ Extensible Architecture - Configurable rubrics, custom criteria, plugin system

🚀 Quick Start

📡 Multiple Server Modes

🔌 MCP Server Mode (stdio)

# 🎯 One-command setup
pip install -e ".[dev]"

# 🔥 Launch MCP server for Claude Desktop, MCP clients
python -m mcp_eval_server.server
# or
make dev

# 🏥 Health check (automatic on port 8080)
curl http://localhost:8080/health   # ✅ Liveness probe
curl http://localhost:8080/ready    # 🎯 Readiness probe
curl http://localhost:8080/metrics  # 📊 Performance metrics

🌐 REST API Server Mode (HTTP)

# 🚀 Launch REST API server with FastAPI
python -m mcp_eval_server.rest_server --port 8080 --host 0.0.0.0
# or
make serve-rest

# 📚 Interactive API documentation
open http://localhost:8080/docs

# 🧪 Quick API test
curl http://localhost:8080/health
curl http://localhost:8080/tools/categories

🔄 HTTP Bridge Mode (MCP over HTTP)

# 🌍 MCP protocol over HTTP with Server-Sent Events
make serve-http

# 📡 Access via JSON-RPC over HTTP on port 9000
curl -X POST -H 'Content-Type: application/json' \
     -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
     http://localhost:9000/

🚀 MCP Wrapper Mode (FastMCP + REST API)

# 🎯 Start REST API server first
make serve-rest

# 🔗 Launch MCP wrapper (FastMCP around REST API)
make serve-wrapper

# 📡 Access via MCP protocol over HTTP/SSE on port 9001
# Endpoint: http://localhost:9001/mcp
# Headers: Accept: application/json, text/event-stream
# Protocol: Streamable HTTP (SSE) with session management

# 🧪 Test the wrapper
python test_sse_client.py

☁️ AWS App Runner Deployment (Live)

# 🚀 Deploy to AWS App Runner
make deploy-apprunner

# 🌐 Access deployed MCP wrapper
# URL: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream

# 🧪 Test deployed wrapper
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}'

Complete Tool Arsenal

🤖 LLM-as-a-Judge Tools (4 Tools)

  • 🎯 Single Response Evaluation: Customizable criteria with weighted scoring and confidence metrics
  • ⚖️ Pairwise Comparison: Head-to-head analysis with automatic position bias mitigation
  • 🏆 Multi-Response Ranking: Tournament, round-robin, and scoring-based ranking algorithms
  • 📊 Reference-Based Evaluation: Gold standard comparison for factuality, completeness, and style
  • 🤝 Multi-Judge Consensus: Ensemble evaluation with agreement analysis and confidence weighting

📝 Prompt Evaluation Tools (4 Tools)

  • 🔍 Clarity Analysis: Rule-based ambiguity detection + LLM semantic analysis with improvement recommendations
  • 🔄 Consistency Testing: Multi-run variance analysis across temperature settings with outlier detection
  • ✅ Completeness Measurement: Component coverage analysis with visual heatmap generation
  • 🎯 Relevance Assessment: Semantic alignment using TF-IDF vectorization with drift analysis
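
A minimal sketch of the TF-IDF relevance idea, assuming scikit-learn is installed (this illustrates the approach only; the server's actual scoring logic may differ):

# Relevance as cosine similarity between TF-IDF vectors (illustrative sketch)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(prompt: str, response: str) -> float:
    """Cosine similarity between the TF-IDF vectors of prompt and response."""
    matrix = TfidfVectorizer().fit_transform([prompt, response])  # 2 x vocabulary
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

print(relevance_score("What is the capital of France?",
                      "Paris is the capital of France."))  # high score = relevant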

🛠️ Agent Evaluation Tools (4 Tools)

  • ⚙️ Tool Usage Evaluation: Selection accuracy, sequence optimization, parameter validation with efficiency scoring
  • ✅ Task Completion Analysis: Multi-criteria success evaluation with partial credit and failure analysis
  • 🧠 Reasoning Assessment: Decision-making quality, logical coherence, and hallucination detection
  • 📈 Performance Benchmarking: Comprehensive capability testing across skill levels with baseline comparison

🔍 Quality Assessment Tools (3 Tools)

  • ✅ Factuality Checking: Claims verification against knowledge bases with confidence scoring and evidence tracking
  • 🧩 Coherence Analysis: Logical flow assessment, contradiction detection, and structural analysis
  • 🛡️ Toxicity Detection: Multi-category harmful content identification with bias pattern analysis

🔗 RAG Evaluation Tools (8 Tools)

  • 📊 Retrieval Relevance: Semantic similarity assessment with LLM judge validation and configurable thresholds
  • 🎯 Context Utilization: Analysis of how well retrieved context is integrated into generated responses
  • ⚓ Answer Groundedness: Claim verification against supporting context with strictness controls
  • 🚨 Hallucination Detection: Contradiction identification between responses and source context
  • 🎯 Retrieval Coverage: Topic completeness assessment and information gap analysis
  • 📝 Citation Accuracy: Reference validation and citation quality scoring across multiple formats
  • 🧩 Chunk Relevance: Individual document segment evaluation with ranking and scoring
  • 🏆 Retrieval Benchmarking: Comparative analysis using standard IR metrics (precision, recall, MRR, NDCG)
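
For reference, MRR and NDCG (two of the IR metrics named above) can be computed as below; this is a generic sketch, not the server's benchmarking code:

import math

def reciprocal_rank(relevant_ids: set, ranked_ids: list) -> float:
    """1/rank of the first relevant document, 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(gains: list, k: int) -> float:
    """Normalized discounted cumulative gain over graded relevance scores."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

print(reciprocal_rank({"doc7"}, ["doc3", "doc7", "doc1"]))  # 0.5
print(ndcg_at_k([3, 2, 0, 1], k=4))                         # ~0.99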

⚖️ Bias & Fairness Tools (6 Tools)

  • 🎯 Demographic Bias Detection: Pattern matching and LLM assessment for protected group bias
  • 📊 Representation Fairness: Balanced representation analysis across contexts and groups
  • ⚖️ Outcome Equity: Disparate impact analysis across protected attributes
  • 🌍 Cultural Sensitivity: Cross-cultural appropriateness and awareness evaluation
  • 🗣️ Linguistic Bias Detection: Language-based discrimination and dialect bias identification
  • 🔗 Intersectional Fairness: Compound bias effects across multiple identity dimensions

🛡️ Robustness Tools (5 Tools)

  • ⚔️ Adversarial Testing: Malicious prompt resistance and attack vector evaluation
  • 🔄 Input Sensitivity: Response stability testing under input variations and perturbations
  • 🛡️ Prompt Injection Resistance: Security defense evaluation against injection attacks
  • 📈 Distribution Shift: Performance degradation analysis on out-of-domain data
  • 🎯 Consistency Under Perturbation: Output stability measurement across input modifications

🔒 Safety & Alignment Tools (4 Tools)

  • ⚠️ Harmful Content Detection: Multi-category risk assessment across safety dimensions
  • 📋 Instruction Following: Constraint adherence and safety instruction compliance
  • 🚫 Refusal Appropriateness: Evaluation of appropriate system refusal behavior
  • 💎 Value Alignment: Human values and ethical principles alignment assessment

🌍 Multilingual Tools (4 Tools)

  • 🔄 Translation Quality: Accuracy, fluency, and completeness assessment across languages
  • 🔗 Cross-Lingual Consistency: Consistency evaluation across multiple language versions
  • 🎭 Cultural Adaptation: Localization quality and cultural appropriateness evaluation
  • 🔀 Language Mixing Detection: Inappropriate code-switching and language mixing identification

⚡ Performance Tools (4 Tools)

  • ⏱️ Response Latency: Generation speed tracking with statistical analysis and percentiles
  • 💻 Computational Efficiency: Resource usage monitoring and efficiency metrics
  • 📈 Throughput Scaling: Concurrent request handling and scaling behavior analysis
  • 💾 Memory Monitoring: Memory consumption pattern tracking and leak detection

🔐 Privacy Tools (8 Tools)

  • 🔍 PII Detection: Personally identifiable information detection with configurable sensitivity (see the sketch after this list)
  • 📊 Data Minimization: Evaluation of data collection necessity and purpose alignment
  • 📋 Consent Compliance: Privacy regulation compliance assessment (GDPR, CCPA, COPPA, HIPAA)
  • 🎭 Anonymization Effectiveness: Re-identification risk analysis and utility preservation
  • 🚨 Data Leakage Detection: Unintended data exposure and inference leakage identification
  • 📖 Consent Clarity: Readability and comprehensibility assessment of privacy notices
  • 🗃️ Data Retention Compliance: Retention policy alignment and regulatory adherence
  • 🏗️ Privacy-by-Design: System-level privacy implementation and design principle evaluation
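
As a toy illustration of pattern-based PII detection (the patterns below are hypothetical and simplified; the server's detectors are more extensive and add context analysis and sensitivity levels):

import re

PII_PATTERNS = {  # simplified, hypothetical patterns
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict:
    """Map each PII category to the matches found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(detect_pii("Contact jane.doe@example.com or 555-123-4567."))
# {'email': ['jane.doe@example.com'], 'us_phone': ['555-123-4567'], 'ssn': []}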

🔄 Workflow Management Tools (3 Tools)

  • 🎛️ Evaluation Suites: Customizable multi-step pipelines with weighted criteria and success thresholds
  • ⚡ Parallel/Sequential Execution: Optimized processing with configurable concurrency and resource management
  • 📊 Results Comparison: Statistical analysis with trend detection, significance testing, and regression analysis

📊 Judge Calibration Tools (2 Tools)

  • 🤝 Agreement Testing: Inter-judge correlation analysis with human baseline comparison
  • 🎯 Rubric Optimization: Automatic tuning using machine learning for improved human alignment

🔧 Server Management Tools (4 Tools)

  • 📋 Judge Management: Available model listing, capability assessment, configuration validation
  • 💾 Results Storage: Comprehensive evaluation history with metadata and statistical reporting
  • ⚡ Cache Management: Multi-level caching statistics and performance optimization
  • 🔍 Health Monitoring: System status checks and performance metrics

🚀 Advanced Features

🎯 LLM-as-a-Judge Best Practices

  • Position Bias Mitigation: Automatic response position randomization for fair comparisons (see the sketch after this list)
  • Chain-of-Thought Integration: Step-by-step reasoning for enhanced evaluation quality
  • Confidence Calibration: Self-assessment metrics for evaluation reliability
  • Multiple Judge Consensus: Ensemble methods with disagreement analysis
  • Human Alignment: Regular calibration against ground truth evaluations
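
A minimal sketch of position bias mitigation, assuming a hypothetical judge callable that receives two candidates in presentation order (the secrets module matches the Cryptographic Random note under Enterprise Security below):

import secrets

def pairwise_with_bias_mitigation(judge, response_a: str, response_b: str) -> str:
    """Randomize presentation order so neither candidate systematically
    benefits from appearing first, then map the verdict back."""
    swapped = secrets.randbelow(2) == 1           # cryptographically secure coin flip
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    winner = judge(first, second)                 # hypothetical: returns "first" or "second"
    if winner == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"

Averaging verdicts over both orderings is a stronger (but twice as costly) variant of the same idea.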

⚡ Performance & Scalability

  • Lightweight Dependencies: Uses widely available libraries (scikit-learn, numpy) instead of heavy ML frameworks
  • Smart Caching: Multi-level caching (memory + disk) with TTL and invalidation (see the sketch after this list)
  • Async Processing: Non-blocking evaluation execution with configurable concurrency
  • Batch Operations: Efficient multi-item processing with progress tracking
  • Resource Management: Memory and CPU optimization with automatic scaling
  • Fast Startup: Quick initialization without loading large pre-trained models
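
A minimal in-memory TTL cache sketch (the server's multi-level memory-plus-disk cache is more elaborate; all names here are illustrative):

import time

class TTLCache:
    """Tiny in-memory cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}                          # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:         # lazy invalidation on read
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=60)
cache.set(("judge_evaluate", "gpt-4", "prompt-hash"), {"overall_score": 4.2})
print(cache.get(("judge_evaluate", "gpt-4", "prompt-hash")))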

🔒 Enterprise Security

  • Cryptographic Random: Secure random number generation for bias mitigation
  • API Key Management: Secure credential handling with environment variable integration
  • Input Validation: Comprehensive parameter validation and sanitization
  • Error Isolation: Graceful failure handling with detailed error reporting
  • Audit Trail: Complete evaluation history with compliance reporting

📊 Analytics & Insights

  • Statistical Analysis: Correlation analysis, significance testing, trend detection
  • Performance Metrics: Latency tracking, throughput monitoring, success rate analysis
  • Quality Dashboards: Real-time evaluation quality monitoring with alerting
  • Comparative Analysis: A/B testing capabilities with regression detection
  • Predictive Analytics: Performance trend forecasting and anomaly detection

🚀 MCP Wrapper (FastMCP + REST API)

🎯 What is the MCP Wrapper?

The MCP Wrapper is a FastMCP-based server that wraps your existing REST API and exposes it as a Model Context Protocol (MCP) server. This allows MCP clients (like Claude Desktop) to access all your evaluation tools through the MCP protocol while keeping your REST API unchanged.

✨ Key Benefits

  • 🔄 Dual Protocol Support: Access the same tools via both REST API and MCP protocol
  • 🌐 HTTP/SSE Transport: Modern streamable HTTP with Server-Sent Events
  • 🔗 REST API Integration: Seamlessly wraps existing REST endpoints
  • 📡 Session Management: Automatic session handling with mcp-session-id headers
  • ⚡ Real-time Communication: Bidirectional communication over single HTTP connection
  • 🛡️ Protocol Compliance: Full MCP protocol compliance with proper initialization

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │    │   MCP Wrapper    │    │   REST API      │
│ (Claude Desktop)│◄──►│  (FastMCP + SSE) │◄──►│  (FastAPI)      │
│                 │    │                  │    │                 │
│ • stdio         │    │ • /mcp/ endpoint │    │ • /judge/*      │
│ • HTTP/SSE      │    │ • Session mgmt   │    │ • /quality/*    │
│ • JSON-RPC      │    │ • Tool wrapping  │    │ • /agent/*      │
└─────────────────┘    └──────────────────┘    └─────────────────┘

🔧 Technical Details

  • Transport: Streamable HTTP (SSE) with session management
  • Endpoint: http://localhost:9001/mcp/
  • Headers: Accept: application/json, text/event-stream
  • Session: Automatic via mcp-session-id header
  • Protocol: Full MCP protocol with initialization and notifications
  • Tools: All 63 evaluation tools exposed as MCP tools

🎮 Usage Examples

# Start both servers
make serve-rest      # REST API on port 8080
make serve-wrapper   # MCP wrapper on port 9001

# Test the wrapper
python test_sse_client.py
make test-wrapper

🛠️ Installation & Setup

Quick Installation

# Clone and install (lightweight dependencies only)
cd mcp-servers/python/mcp_eval_server
pip install -e ".[dev]"

# Set up API keys (optional - rule-based judge works without them)
export OPENAI_API_KEY="sk-your-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-azure-api-key"

# Configure health check endpoints (optional)
export HEALTH_CHECK_PORT=8080        # Default: 8080
export HEALTH_CHECK_HOST=0.0.0.0     # Default: 0.0.0.0

# Note: No heavy ML dependencies required!
# Uses efficient TF-IDF + scikit-learn instead of transformers

MCP Client Connection

Option 1: Direct MCP Server (stdio)

{
  "command": "python",
  "args": ["-m", "mcp_eval_server.server"],
  "cwd": "/path/to/mcp-servers/python/mcp_eval_server"
}

Protocol: stdio (Model Context Protocol)
Transport: Standard input/output (no HTTP port needed)
Tools Available: 63 specialized evaluation tools

Option 2: MCP Wrapper (HTTP/SSE)

{
  "url": "http://localhost:9001/mcp/",
  "headers": {
    "Accept": "application/json, text/event-stream"
  }
}

Protocol: Streamable HTTP (SSE)
Transport: HTTP with Server-Sent Events
Session Management: Automatic via mcp-session-id header
Tools Available: 63 specialized evaluation tools (via REST API wrapper)

Health Check Endpoints

The server automatically starts health check HTTP endpoints for monitoring:

# Health endpoints (started automatically with the MCP server)
curl http://localhost:8080/health    # Liveness probe
curl http://localhost:8080/ready     # Readiness probe
curl http://localhost:8080/metrics   # Basic metrics
curl http://localhost:8080/          # Service info

# Kubernetes-style endpoints
curl http://localhost:8080/healthz   # Alternative health
curl http://localhost:8080/readyz    # Alternative readiness

Health Check Response Example:

{
  "status": "healthy",
  "timestamp": 1698765432.123,
  "uptime_seconds": 45.67,
  "service": "mcp-eval-server",
  "version": "0.1.0",
  "checks": {
    "server_running": true,
    "uptime_ok": true
  }
}

Readiness Check Response Example:

{
  "status": "ready",
  "timestamp": 1698765432.123,
  "service": "mcp-eval-server",
  "version": "0.1.0",
  "checks": {
    "server_initialized": true,
    "judge_tools_loaded": true,
    "storage_initialized": true
  }
}

Docker Deployment

# Build container
make build

# Run with environment
make run

# Or use docker-compose
make compose-up

AWS App Runner Deployment

# 1. Prerequisites: AWS CLI configured, Docker installed
aws configure

# 2. Create IAM role for App Runner
aws iam create-role --role-name AppRunnerInstanceRole --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name AppRunnerInstanceRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSAppRunnerServicePolicyForECRAccess

# 3. Set environment variable
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"

# 4. Deploy to AWS App Runner
make deploy-apprunner

# 5. Set API keys in App Runner console (OPENAI_API_KEY, etc.)

🌐 Live Deployed Endpoints:

  • MCP Wrapper: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
  • Health Check: https://6xaate4xrt.us-east-1.awsapprunner.com/health
  • Service Info: https://6xaate4xrt.us-east-1.awsapprunner.com/

📚 Detailed Guide: See AWS_APP_RUNNER_DEPLOYMENT.md for complete deployment instructions.

🎉 Live MCP Wrapper Testing

The MCP wrapper is now live and fully functional! Here's how to test it:

1. Initialize MCP Session

curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -i \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
      "protocolVersion": "2024-11-05",
      "capabilities": {},
      "clientInfo": {
        "name": "test-client",
        "version": "1.0.0"
      }
    }
  }'

2. Send Initialized Notification

# Use the session ID from Step 1
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "mcp-session-id: <SESSION_ID>" \
  -d '{
    "jsonrpc": "2.0",
    "method": "notifications/initialized"
  }'

3. List Available Tools

curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "mcp-session-id: <SESSION_ID>" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/list",
    "params": {}
  }'

4. Call a Tool

curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "mcp-session-id: <SESSION_ID>" \
  -d '{
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "judge_evaluate",
      "arguments": {
        "response": "Paris is the capital of France.",
        "criteria": [
          {
            "name": "accuracy",
            "description": "Factual accuracy",
            "scale": "1-5",
            "weight": 1.0
          }
        ],
        "rubric": {
          "criteria": [],
          "scale_description": {
            "1": "Wrong",
            "5": "Correct"
          }
        },
        "judge_model": "rule-based"
      }
    }
  }'

Development Setup

# Install development dependencies
make dev-install

# Run development server
make dev

# Run tests
make test

# Check code quality
make lint

🎮 Usage Examples

🎯 MCP Client Integration

# Multi-criteria evaluation with MCP client
result = await mcp_client.call_tool("judge.evaluate_response", {
    "response": "Detailed technical explanation...",
    "criteria": [
        {"name": "technical_accuracy", "description": "Correctness of technical details", "scale": "1-5", "weight": 0.4},
        {"name": "clarity", "description": "Explanation clarity", "scale": "1-5", "weight": 0.3},
        {"name": "completeness", "description": "Coverage of key points", "scale": "1-5", "weight": 0.3}
    ],
    "rubric": {
        "criteria": [],
        "scale_description": {
            "1": "Severely lacking",
            "2": "Below expectations",
            "3": "Meets basic requirements",
            "4": "Exceeds expectations",
            "5": "Outstanding quality"
        }
    },
    "judge_model": "gpt-4",
    "use_cot": True
})
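
For intuition, an overall score for weighted criteria like the above reduces to a weight-normalized average of per-criterion scores. A sketch (illustrative only; the server's aggregation may differ):

def weighted_overall_score(scores: dict, criteria: list) -> float:
    """Weight-normalized average of per-criterion scores."""
    total_weight = sum(c["weight"] for c in criteria)
    return sum(scores[c["name"]] * c["weight"] for c in criteria) / total_weight

criteria = [
    {"name": "technical_accuracy", "weight": 0.4},
    {"name": "clarity", "weight": 0.3},
    {"name": "completeness", "weight": 0.3},
]
print(weighted_overall_score(
    {"technical_accuracy": 4, "clarity": 5, "completeness": 3}, criteria))  # 4.0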

🌐 REST API Integration

# Evaluate response via REST API
curl -X POST http://localhost:8080/judge/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "response": "Paris is the capital of France",
    "criteria": [
      {
        "name": "accuracy",
        "description": "Factual accuracy",
        "scale": "1-5",
        "weight": 1.0
      }
    ],
    "rubric": {
      "criteria": [],
      "scale_description": {
        "1": "Wrong",
        "5": "Correct"
      }
    },
    "judge_model": "gpt-4o-mini"
  }'
# Python REST API client
import httpx
import asyncio

async def evaluate_via_rest():
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8080/judge/evaluate", json={
            "response": "Technical explanation...",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "gpt-4o-mini"
        })
        result = response.json()
        return result

# Run evaluation
result = asyncio.run(evaluate_via_rest())
print(f"Overall score: {result['overall_score']}")

⚖️ Advanced Pairwise Comparison

# Head-to-head comparison with bias mitigation
comparison = await mcp_client.call_tool("judge.pairwise_comparison", {
    "response_a": "Technical solution A with implementation details...",
    "response_b": "Alternative solution B with different approach...",
    "criteria": [
        {"name": "innovation", "description": "Novelty and creativity", "scale": "1-5", "weight": 0.4},
        {"name": "feasibility", "description": "Implementation practicality", "scale": "1-5", "weight": 0.3},
        {"name": "efficiency", "description": "Resource optimization", "scale": "1-5", "weight": 0.3}
    ],
    "context": "Solutions for enterprise-scale data processing challenge",
    "position_bias_mitigation": True,
    "judge_model": "gpt-4-turbo"
})

📊 Comprehensive Agent Benchmarking

# Full agent performance assessment
benchmark_result = await mcp_client.call_tool("agent.benchmark_performance", {
    "benchmark_suite": "advanced_skills",
    "agent_config": {
        "model": "gpt-4",
        "temperature": 0.7,
        "tools_enabled": ["search", "calculator", "code_executor"]
    },
    "baseline_comparison": {
        "name": "GPT-3.5 Baseline",
        "scores": {"accuracy": 0.75, "efficiency": 0.68, "reliability": 0.72}
    },
    "metrics_focus": ["accuracy", "efficiency", "reliability", "creativity"]
})

🔄 Advanced Evaluation Suite

# Create sophisticated evaluation pipeline
suite = await mcp_client.call_tool("workflow.create_evaluation_suite", {
    "suite_name": "comprehensive_ai_assessment",
    "description": "Full-spectrum AI capability evaluation",
    "evaluation_steps": [
        {
            "tool": "prompt.evaluate_clarity",
            "weight": 0.15,
            "parameters": {"target_model": "gpt-4", "domain_context": "technical"}
        },
        {
            "tool": "judge.evaluate_response",
            "weight": 0.25,
            "parameters": {
                "criteria": [
                    {"name": "technical_depth", "description": "Technical sophistication", "scale": "1-5", "weight": 0.4},
                    {"name": "practical_utility", "description": "Real-world applicability", "scale": "1-5", "weight": 0.6}
                ],
                "judge_model": "gpt-4"
            }
        },
        {
            "tool": "quality.evaluate_factuality",
            "weight": 0.20
        },
        {
            "tool": "quality.measure_coherence",
            "weight": 0.15
        },
        {
            "tool": "quality.assess_toxicity",
            "weight": 0.10
        },
        {
            "tool": "agent.analyze_reasoning",
            "weight": 0.15,
            "parameters": {"judge_model": "gpt-4-turbo"}
        }
    ],
    "success_thresholds": {
        "overall": 0.85,
        "quality.evaluate_factuality": 0.90,
        "quality.assess_toxicity": 0.95
    },
    "weights": {
        "accuracy": 0.4,
        "safety": 0.3,
        "utility": 0.3
    }
})

# Execute comprehensive evaluation
results = await mcp_client.call_tool("workflow.run_evaluation", {
    "suite_id": suite["suite_id"],
    "test_data": {
        "response": "Complex AI system response...",
        "context": "Enterprise deployment scenario...",
        "reasoning_trace": [...],
        "agent_trace": {...}
    },
    "parallel_execution": True,
    "max_concurrent": 5
})

🎛️ Advanced Configuration

Custom Model Configuration

The MCP Eval Server supports complete customization of judge models, allowing you to:

  • Configure custom API endpoints and deployments
  • Set provider-specific parameters and capabilities
  • Create domain-specific model configurations
  • Use custom environment variable names
# Use custom model configuration
export MCP_EVAL_MODELS_CONFIG="./my-custom-models.yaml"
export DEFAULT_JUDGE_MODEL="my-custom-judge"

# Copy default config for customization
make copy-config                    # Copies to ./custom-config/
make show-config                    # Show current configuration status
make validate-config                # Validate configuration syntax

Model Configuration with Capabilities

models:
  azure:
    my-enterprise-gpt4:
      provider: "azure"
      deployment_name: "my-gpt4-deployment"
      model_name: "gpt-4"
      api_base_env: "AZURE_OPENAI_ENDPOINT"
      api_key_env: "AZURE_OPENAI_API_KEY"
      api_version_env: "AZURE_OPENAI_API_VERSION"
      deployment_name_env: "AZURE_DEPLOYMENT_NAME"
      default_temperature: 0.1  # Custom temperature
      max_tokens: 3000           # Custom token limit
      capabilities:
        supports_cot: true
        supports_pairwise: true
        supports_ranking: true
        supports_reference: true
        max_context_length: 8192
        optimal_temperature: 0.1
        consistency_level: "very_high"
      metadata:
        purpose: "production_evaluation"
        cost_tier: "premium"

  ollama:
    my-local-llama:
      provider: "ollama"
      model_name: "llama3:70b"
      base_url_env: "OLLAMA_BASE_URL"
      default_temperature: 0.3
      max_tokens: 2000
      request_timeout: 120  # Longer timeout for large models

# Custom defaults
defaults:
  primary_judge: "my-enterprise-gpt4"
  fallback_judge: "my-local-llama"

# Custom recommendations
recommendations:
  production: ["my-enterprise-gpt4"]
  development: ["my-local-llama"]

Advanced Evaluation Rubrics

rubrics:
  technical_excellence:
    name: "Technical Excellence Assessment"
    criteria:
      - name: "code_quality"
        description: "Code structure, efficiency, and best practices"
        scale: "1-10"
        weight: 0.3
      - name: "innovation"
        description: "Novel approaches and creative solutions"
        scale: "1-10"
        weight: 0.25
      - name: "scalability"
        description: "System scalability and performance considerations"
        scale: "1-10"
        weight: 0.25
      - name: "maintainability"
        description: "Code maintainability and documentation quality"
        scale: "1-10"
        weight: 0.2
    scale_description:
      "1-2": "Severely deficient, requires major rework"
      "3-4": "Below standards, significant improvements needed"
      "5-6": "Meets basic requirements, minor improvements possible"
      "7-8": "Exceeds expectations, high quality work"
      "9-10": "Exceptional excellence, industry-leading quality"

Multi-Domain Benchmarks

benchmarks:
  enterprise_readiness:
    name: "Enterprise Readiness Assessment"
    category: "production"
    tasks:
      - name: "security_analysis"
        description: "Security vulnerability assessment and mitigation"
        difficulty: "advanced"
        expected_tools: ["security_scanner", "vulnerability_analyzer", "mitigation_planner"]
        evaluation_metrics: ["threat_identification", "risk_assessment", "solution_quality"]
      - name: "performance_optimization"
        description: "System performance analysis and optimization"
        difficulty: "advanced"
        expected_tools: ["profiler", "optimizer", "benchmarker"]
        evaluation_metrics: ["performance_gain", "resource_efficiency", "scalability_impact"]

🔬 Research-Grade Features

📊 Statistical Analysis

  • Correlation Analysis: Pearson, Spearman, Cohen's Kappa for agreement measurement (see the sketch after this list)
  • Significance Testing: Statistical validation of evaluation differences
  • Trend Analysis: Performance trajectory analysis with volatility assessment
  • Outlier Detection: Anomaly identification in evaluation results
  • Confidence Intervals: Uncertainty quantification for evaluation scores
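
These agreement statistics are standard; a sketch using scipy and scikit-learn (the sample judge scores are invented):

from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

judge_a = [4, 3, 5, 2, 4, 5]   # hypothetical scores from judge A
judge_b = [4, 2, 5, 3, 4, 4]   # hypothetical scores from judge B

r, p = pearsonr(judge_a, judge_b)               # linear correlation + p-value
rho, _ = spearmanr(judge_a, judge_b)            # rank correlation
kappa = cohen_kappa_score(judge_a, judge_b)     # chance-corrected agreement
print(f"Pearson r={r:.2f} (p={p:.3f})  Spearman rho={rho:.2f}  kappa={kappa:.2f}")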

🧪 Experimental Capabilities

  • Judge Calibration: Systematic bias detection and correction algorithms
  • Rubric Evolution: Machine learning-powered rubric optimization
  • Meta-Evaluation: Evaluation of evaluation quality itself
  • Human Alignment: Continuous calibration against expert human judgments
  • Cross-Validation: K-fold validation for evaluation reliability

🎯 Domain-Specific Evaluations

  • Technical Content: Code quality, architecture assessment, security analysis
  • Creative Writing: Originality, engagement, style consistency evaluation
  • Academic Work: Research quality, citation analysis, argument strength
  • Customer Service: Helpfulness, politeness, problem resolution effectiveness
  • Educational Content: Learning objective achievement, instructional clarity

🏗️ Production Architecture

🔧 Infrastructure Components

  • Multi-Judge Runtime: Supports OpenAI, Azure OpenAI, and rule-based evaluation engines
  • Caching Layer: Redis-compatible distributed caching with automatic invalidation
  • Results Database: SQLite/PostgreSQL storage with comprehensive indexing
  • API Gateway: RESTful endpoints with authentication and rate limiting
  • Monitoring System: Prometheus metrics with Grafana dashboards

📦 Deployment Options

  • Container Deployment: Production-ready Docker/Podman containers with security hardening
  • Kubernetes Support: Helm charts with auto-scaling and service mesh integration
  • Cloud Integration: AWS ECS, Azure Container Instances, Google Cloud Run compatibility
  • Edge Deployment: Lightweight containers for edge computing scenarios
  • Development Mode: Hot-reload development server with debugging capabilities

🔒 Security & Compliance

  • Enterprise Security: OAuth 2.0, JWT tokens, API key rotation
  • Data Privacy: Encryption at rest and in transit, PII detection and filtering
  • Audit Logging: Comprehensive audit trails with tamper detection
  • Compliance Ready: SOC 2, GDPR, HIPAA compliance frameworks supported
  • Vulnerability Management: Continuous security scanning and automated patching

🗺️ Tool Ecosystem Map

🏆 MCP EVALUATION SERVER - 63 SPECIALIZED TOOLS 🏆
═══════════════════════════════════════════════════════════

📊 CORE EVALUATION SUITE (15 tools)
├── 🤖 Judge Tools (4) ────── LLM-as-a-judge evaluation
├── 📝 Prompt Tools (4) ───── Clarity, consistency, optimization
├── 🛠️ Agent Tools (4) ────── Performance, reasoning, benchmarking
└── 🔍 Quality Tools (3) ──── Factuality, coherence, toxicity

🔬 ADVANCED ASSESSMENT SUITE (39 tools)
├── 🔗 RAG Tools (8) ──────── Retrieval relevance, grounding, citations
├── ⚖️ Bias & Fairness (6) ── Demographic bias, intersectional analysis
├── 🛡️ Robustness (5) ──────── Adversarial testing, injection resistance
├── 🔒 Safety & Alignment (4) Harmful content, value alignment
├── 🌍 Multilingual (4) ────── Translation, cultural adaptation
├── ⚡ Performance (4) ──────── Latency, efficiency, scaling
└── 🔐 Privacy (8) ───────── PII detection, compliance, anonymization

🔧 SYSTEM MANAGEMENT (9 tools)
├── 🔄 Workflow Tools (3) ─── Evaluation suites, parallel execution
├── 📊 Calibration (2) ────── Judge agreement, rubric optimization
└── 🏥 Server Tools (4) ───── Health monitoring, system management

🎯 TOTAL: 63 TOOLS ACROSS 14 CATEGORIES 🎯

📋 Complete Tool Reference

Judge Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| judge.evaluate_response | Single response evaluation | Customizable criteria, weighted scoring, confidence metrics |
| judge.pairwise_comparison | Two-response comparison | Position bias mitigation, criterion-level analysis |
| judge.rank_responses | Multi-response ranking | Tournament/scoring algorithms, consistency measurement |
| judge.evaluate_with_reference | Reference-based evaluation | Gold standard comparison, similarity scoring |

Prompt Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| prompt.evaluate_clarity | Clarity assessment | Rule-based + LLM analysis, ambiguity detection |
| prompt.test_consistency | Consistency testing | Multi-run analysis, temperature variance |
| prompt.measure_completeness | Completeness analysis | Component coverage, heatmap visualization |
| prompt.assess_relevance | Relevance measurement | TF-IDF semantic alignment, drift analysis |

Agent Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| agent.evaluate_tool_use | Tool usage analysis | Selection accuracy, sequence optimization |
| agent.measure_task_completion | Task success evaluation | Multi-criteria assessment, partial credit |
| agent.analyze_reasoning | Reasoning quality assessment | Logic analysis, hallucination detection |
| agent.benchmark_performance | Performance benchmarking | Multi-domain testing, baseline comparison |

Quality Tools (3/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| quality.evaluate_factuality | Factual accuracy checking | Claims verification, confidence scoring |
| quality.measure_coherence | Logical flow analysis | Coherence scoring, contradiction detection |
| quality.assess_toxicity | Harmful content detection | Multi-category analysis, bias detection |

RAG Tools (8/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| rag.evaluate_retrieval_relevance | Document relevance assessment | Semantic similarity, LLM validation |
| rag.measure_context_utilization | Context usage analysis | Word overlap, sentence integration |
| rag.assess_answer_groundedness | Claim verification | Context support, strictness control |
| rag.detect_hallucination_vs_context | Contradiction detection | Statement verification, confidence scoring |
| rag.evaluate_retrieval_coverage | Topic completeness check | Information gap analysis, coverage scoring |
| rag.assess_citation_accuracy | Reference validation | Citation quality, format support |
| rag.measure_chunk_relevance | Document segment scoring | Individual chunk analysis, ranking |
| rag.benchmark_retrieval_systems | System comparison | IR metrics, performance analysis |

Bias & Fairness Tools (6/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| bias.detect_demographic_bias | Protected group bias detection | Pattern matching, LLM assessment, sensitivity control |
| bias.measure_representation_fairness | Balanced representation analysis | Context evaluation, fairness metrics |
| bias.evaluate_outcome_equity | Disparate impact assessment | Outcome analysis, equity scoring |
| bias.assess_cultural_sensitivity | Cultural appropriateness evaluation | Cross-cultural awareness, sensitivity dimensions |
| bias.detect_linguistic_bias | Language-based discrimination | Dialect bias, formality assessment |
| bias.measure_intersectional_fairness | Multi-dimensional bias analysis | Compound effects, intersectional metrics |

Robustness Tools (5/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| robustness.test_adversarial_inputs | Malicious prompt testing | Attack vectors, injection resistance |
| robustness.measure_input_sensitivity | Perturbation stability testing | Input variations, sensitivity thresholds |
| robustness.evaluate_prompt_injection_resistance | Security defense evaluation | Injection strategies, resistance scoring |
| robustness.assess_distribution_shift | Out-of-domain performance | Domain adaptation, degradation analysis |
| robustness.measure_consistency_under_perturbation | Output stability measurement | Perturbation consistency, variance analysis |

Safety & Alignment Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| safety.detect_harmful_content | Harmful content identification | Multi-category risk assessment, severity classification |
| safety.assess_instruction_following | Constraint adherence evaluation | Instruction parsing, compliance scoring |
| safety.evaluate_refusal_appropriateness | Refusal behavior assessment | Decision accuracy, precision/recall metrics |
| safety.measure_value_alignment | Human values alignment | Ethical principles, weighted assessment |

Multilingual Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| multilingual.evaluate_translation_quality | Translation assessment | Accuracy, fluency, cultural adaptation |
| multilingual.measure_cross_lingual_consistency | Multi-language consistency | Semantic preservation, factual alignment |
| multilingual.assess_cultural_adaptation | Localization evaluation | Cultural dimensions, adaptation scoring |
| multilingual.detect_language_mixing | Code-switching detection | Language purity, mixing appropriateness |

Performance Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| performance.measure_response_latency | Latency measurement | Statistical analysis, percentiles, timeout tracking |
| performance.assess_computational_efficiency | Resource usage monitoring | CPU/memory efficiency, per-token metrics |
| performance.evaluate_throughput_scaling | Scaling behavior analysis | Concurrency testing, bottleneck detection |
| performance.monitor_memory_usage | Memory consumption tracking | Usage patterns, leak detection, threshold monitoring |

Privacy Tools (8/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| privacy.detect_pii_exposure | PII detection and analysis | Pattern matching, sensitivity levels, context analysis |
| privacy.assess_data_minimization | Data collection necessity | Purpose alignment, minimization scoring |
| privacy.evaluate_consent_compliance | Regulatory compliance assessment | GDPR/CCPA/COPPA/HIPAA standards, gap analysis |
| privacy.measure_anonymization_effectiveness | Anonymization quality evaluation | Re-identification risk, utility preservation |
| privacy.detect_data_leakage | Data exposure identification | Direct/inference leakage, unexpected data flow |
| privacy.assess_consent_clarity | Consent readability analysis | Grade level, accessibility, comprehension |
| privacy.evaluate_data_retention_compliance | Retention policy adherence | Policy-practice alignment, regulatory requirements |
| privacy.assess_privacy_by_design | System privacy implementation | Design principles, control effectiveness |

Workflow Tools (3/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| workflow.create_evaluation_suite | Evaluation pipeline creation | Multi-step workflows, weighted criteria |
| workflow.run_evaluation | Suite execution | Parallel processing, progress tracking |
| workflow.compare_evaluations | Results comparison | Statistical analysis, trend detection |

Calibration Tools (2/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| calibration.test_judge_agreement | Judge agreement testing | Correlation analysis, bias detection |
| calibration.optimize_rubrics | Rubric optimization | ML-powered tuning, human alignment |

Server Tools (4/63)

| Tool | Description | Key Features |
|------|-------------|--------------|
| server.get_available_judges | List available judges | Model capabilities, status checking |
| server.get_evaluation_suites | List evaluation suites | Suite management, configuration viewing |
| server.get_evaluation_results | Retrieve results | History browsing, filtering, pagination |
| server.get_cache_stats | Cache statistics | Performance monitoring, optimization |

💡 Innovation & Research Integration

🧠 AI Research Applications

  • Model Comparison Studies: Systematic evaluation of different LLM architectures
  • Prompt Engineering Research: Large-scale prompt effectiveness analysis
  • Agent Behavior Studies: Comprehensive agent decision-making research
  • Bias Detection Research: Systematic bias pattern analysis across models
  • Evaluation Methodology: Meta-research on evaluation techniques themselves

🏢 Enterprise Applications

  • Quality Assurance: Automated content quality control in production systems
  • A/B Testing: Systematic comparison of different AI configurations
  • Performance Monitoring: Continuous evaluation of deployed AI systems
  • Compliance Reporting: Automated generation of evaluation compliance reports
  • Cost Optimization: Evaluation-driven optimization of AI system costs

🎓 Educational Applications

  • Student Assessment: Automated evaluation of student AI projects
  • Curriculum Development: Assessment-driven AI curriculum optimization
  • Research Training: Tools for training researchers in evaluation methodologies
  • Benchmark Creation: Development of new evaluation benchmarks
  • Peer Review: AI-assisted peer review systems for academic work

🚀 Getting Started

🎯 Deployment Options Quick Reference

| Mode | Command | Protocol | Port | Auth | Use Case |
|------|---------|----------|------|------|----------|
| MCP Server | make dev | stdio | none | none | Claude Desktop, MCP clients |
| REST API | make serve-rest | HTTP REST | 8080 | none | Direct HTTP API integration |
| REST Public | make serve-rest-public | HTTP REST | 8080 | none | Public REST API access |
| HTTP Bridge | make serve-http | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, local testing |
| HTTP Public | make serve-http-public | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, remote access |
| MCP Wrapper | make serve-wrapper | Streamable HTTP (SSE) | 9001 | none | FastMCP wrapper around REST API |
| MCP Wrapper Public | make serve-wrapper-public | Streamable HTTP (SSE) | 9001 | none | Public MCP wrapper access |
| Container | make run | HTTP | 8080 | none | Docker deployment |
| AWS App Runner | make deploy-apprunner | MCP Wrapper (SSE) | 8080 | none | Cloud deployment on AWS (Live) |

Immediate Quick Start

Option 1: MCP Server (stdio)

# 1. Run MCP server (for Claude Desktop, etc.)
make dev                    # Shows connection info + starts server

# 2. Test basic functionality
make example               # Run evaluation example
make test-mcp             # Test MCP protocol

Option 2: REST API Server (FastAPI)

# 1. Run native REST API server
make serve-rest          # Starts on http://localhost:8080

# 2. Test REST API endpoints
make test-rest           # Test all REST endpoints

# 3. View interactive documentation
open http://localhost:8080/docs    # Swagger UI
open http://localhost:8080/redoc   # ReDoc

# 4. Get connection info
make rest-info           # Show complete REST API guide

Option 3: HTTP Bridge (MCP over HTTP)

# 1. Run MCP protocol over HTTP
make serve-http          # Starts on http://localhost:9000

# 2. Test HTTP endpoints
make test-http           # Test MCP JSON-RPC endpoints

# 3. Get connection info
make http-info           # Show complete HTTP bridge guide

Option 4: MCP Wrapper (FastMCP + REST API)

# 1. Start REST API server first
make serve-rest          # Starts on http://localhost:8080

# 2. Start MCP wrapper (FastMCP around REST API)
make serve-wrapper       # Starts on http://localhost:9001

# 3. Test wrapper functionality
python test_sse_client.py    # Test SSE client
make test-wrapper            # Test wrapper endpoints

# 4. Connection details
# Endpoint: http://localhost:9001/mcp/
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream
# Session management: Automatic via mcp-session-id header

Option 5: Docker Deployment

# Build and deploy
make build && make run

Option 6: AWS App Runner Deployment

# 1. Set up AWS credentials and IAM role
aws configure
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"

# 2. Deploy to AWS App Runner
make deploy-apprunner

# 3. Test locally first (optional)
make test-docker-apprunner

Integration Examples

MCP Client Integration

# Basic MCP integration
from mcp import Client
client = Client("mcp-eval-server")

# Evaluate any AI output
result = await client.call_tool("judge.evaluate_response", {
    "response": "Your AI output here",
    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}}
})

REST API Integration

# Start REST API server
make serve-rest

# Check server health
curl http://localhost:8080/health

# List tool categories
curl http://localhost:8080/tools/categories

# Evaluate response directly via REST
curl -X POST http://localhost:8080/judge/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "response": "Paris is the capital of France.",
    "criteria": [
      {
        "name": "accuracy",
        "description": "Factual accuracy",
        "scale": "1-5",
        "weight": 1.0
      }
    ],
    "rubric": {
      "criteria": [],
      "scale_description": {"1": "Wrong", "5": "Correct"}
    },
    "judge_model": "rule-based"
  }'

HTTP Bridge Integration (MCP over HTTP)

# Start HTTP bridge server
make serve-http

# List available tools (JSON-RPC)
curl -X POST \
     -H "Content-Type: application/json" \
     -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
     http://localhost:9000/

# Evaluate response via HTTP bridge (JSON-RPC)
curl -X POST \
     -H "Content-Type: application/json" \
     -d '{
       "jsonrpc": "2.0",
       "id": 2,
       "method": "tools/call",
       "params": {
         "name": "judge.evaluate_response",
         "arguments": {
           "response": "Paris is the capital of France.",
           "criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
           "rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
           "judge_model": "rule-based"
         }
       }
     }' \
     http://localhost:9000/

MCP Wrapper Integration (FastMCP + REST API)

# Start REST API server first
make serve-rest

# Start MCP wrapper
make serve-wrapper

# Initialize MCP session
curl -X POST -H "Content-Type: application/json" \
     -H "Accept: application/json, text/event-stream" \
     -d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}' \
     http://localhost:9001/mcp

# Send initialized notification (note: returns 202 for notifications)
curl -X POST -H "Content-Type: application/json" \
     -H "Accept: application/json, text/event-stream" \
     -d '{"jsonrpc": "2.0", "method": "notifications/initialized"}' \
     http://localhost:9001/mcp

# List available tools via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
     -H "Accept: application/json, text/event-stream" \
     -d '{"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}' \
     http://localhost:9001/mcp

# Evaluate response via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
     -H "Accept: application/json, text/event-stream" \
     -d '{
       "jsonrpc": "2.0",
       "id": 3,
       "method": "tools/call",
       "params": {
         "name": "judge_evaluate",
         "arguments": {
           "response": "Paris is the capital of France.",
           "criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
           "rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
           "judge_model": "rule-based"
         }
       }
     }' \
     http://localhost:9001/mcp

🌐 Live Deployment Testing: Replace http://localhost:9001/mcp with https://6xaate4xrt.us-east-1.awsapprunner.com/mcp in the above commands to test the live deployment!

Python REST API Client Integration

import httpx
import asyncio

async def evaluate_via_rest_api():
    """Example using native REST API endpoints."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:8080"

        # Check health
        health = await client.get(f"{base_url}/health")
        print(f"Server status: {health.json()['status']}")

        # List tool categories
        categories = await client.get(f"{base_url}/tools/categories")
        print(f"Available categories: {len(categories.json()['categories'])}")

        # Evaluate response using REST endpoint
        evaluation = await client.post(f"{base_url}/judge/evaluate", json={
            "response": "Your AI response here",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "rule-based"
        })
        result = evaluation.json()
        print(f"Evaluation score: {result['overall_score']}")

        # Check content toxicity
        toxicity = await client.post(f"{base_url}/quality/toxicity", json={
            "content": "This is a test message",
            "toxicity_categories": ["profanity", "hate_speech"],
            "sensitivity_level": "moderate",
            "judge_model": "rule-based"
        })
        result = toxicity.json()
        print(f"Toxicity detected: {result['toxicity_detected']}")

# Run evaluation
asyncio.run(evaluate_via_rest_api())

Python HTTP Bridge Client Integration

import httpx
import asyncio

async def evaluate_via_http_bridge():
    """Example using MCP over HTTP bridge."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:9000"

        # List tools via JSON-RPC
        tools_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/list",
            "params": {}
        }

        response = await client.post(base_url, json=tools_request)
        result = response.json()
        tools = result.get("result", [])
        print(f"Available tools: {len(tools)}")

        # Evaluate response via JSON-RPC
        eval_request = {
            "jsonrpc": "2.0",
            "id": 2,
            "method": "tools/call",
            "params": {
                "name": "judge.evaluate_response",
                "arguments": {
                    "response": "Your AI response here",
                    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}},
                    "judge_model": "rule-based"
                }
            }
        }

        response = await client.post(base_url, json=eval_request)
        result = response.json()
        print(f"Evaluation result: {result}")

# Run evaluation
asyncio.run(evaluate_via_http_bridge())

Python MCP Wrapper Client Integration

import httpx
import asyncio
import json
import re

class MCPWrapperClient:
    """Client for MCP wrapper with SSE support."""
    
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session_id = None
        self.client = httpx.AsyncClient()
    
    async def __aenter__(self):
        return self
    
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.client.aclose()
    
    async def send_request(self, request: dict) -> dict:
        """Send a JSON-RPC request and parse SSE response."""
        headers = {"Accept": "application/json, text/event-stream"}
        if self.session_id:
            headers["mcp-session-id"] = self.session_id
        
        response = await self.client.post(
            self.base_url,
            json=request,
            headers=headers,
            timeout=30.0
        )
        
        if response.status_code not in [200, 202]:
            raise Exception(f"HTTP {response.status_code}: {response.text}")
        
        # Extract session ID from response headers
        if "mcp-session-id" in response.headers:
            self.session_id = response.headers["mcp-session-id"]
        
        # For notifications (202), return empty result
        if response.status_code == 202:
            return {}
        
        # Parse SSE response
        content = response.text
        data_match = re.search(r'data:\s*(\{.*\})', content)
        if data_match:
            json_str = data_match.group(1)
            return json.loads(json_str)
        else:
            raise Exception(f"Could not parse SSE response: {content}")

async def evaluate_via_mcp_wrapper():
    """Example using MCP wrapper with SSE."""
    async with MCPWrapperClient("http://localhost:9001/mcp/") as client:
        # Initialize MCP session
        init_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "test-client", "version": "1.0.0"}
            }
        }
        
        result = await client.send_request(init_request)
        print(f"Initialized: {result.get('result', {}).get('serverInfo', {}).get('name', 'Unknown')}")
        
        # Send initialized notification
        await client.send_request({"jsonrpc": "2.0", "method": "notifications/initialized"})
        
        # List tools
        tools_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}
        result = await client.send_request(tools_request)
        tools = result.get('result', {}).get('tools', [])
        print(f"Available tools: {len(tools)}")
        
        # Evaluate response
        eval_request = {
            "jsonrpc": "2.0",
            "id": 3,
            "method": "tools/call",
            "params": {
                "name": "judge_evaluate",
                "arguments": {
                    "response": "Paris is the capital of France.",
                    "criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
                    "judge_model": "rule-based"
                }
            }
        }
        
        result = await client.send_request(eval_request)
        if 'result' in result:
            tool_result = result['result']
            print(f"Evaluation successful: {tool_result}")
        else:
            print(f"Evaluation failed: {result}")

# Run evaluation
asyncio.run(evaluate_via_mcp_wrapper())
