A professional, extensible framework for systematically evaluating, scoring, and comparing Large Language Model (LLM) prompts. Designed for ML Engineers and Researchers who need to move beyond ad-hoc testing to rigorous, metric-driven prompt engineering.
In production environments, prompt engineering cannot be based on "vibes" or single examples. This playground provides a deterministic test harness to:
- Quantify Quality: Assign numerical scores to subjective metrics like Relevance, Clarity, and Safety.
- Compare Variations: A/B test prompt templates on consistent datasets to statistically determine the winner.
- Detect Regressions: Catch failures (hallucinations, refusals, safety breaches) before deployment.
- Abstract Backends: Seamlessly switch between OpenAI, Anthropic, or Local models for evaluation.
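The "Compare Variations" goal above can be made concrete with a paired significance test. The sketch below is an illustration, not the playground's actual implementation: it assumes both prompt templates were scored on the same dataset rows, and the function name `paired_permutation_test` and the sample scores are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores.

    Returns (mean_difference, p_value), where mean_difference = mean(A - B).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)  # fixed seed keeps the run reproducible

    # Under the null hypothesis the sign of each paired difference is arbitrary,
    # so randomly flip signs and count how often the result is as extreme.
    extreme = 0
    for _ in range(n_resamples):
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= abs(observed):
            extreme += 1

    # Add-one smoothing avoids reporting a p-value of exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical per-example scores (0.0 to 10.0) for two prompt templates.
prompt_a = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5, 7.5, 8.0]
prompt_b = [6.0, 7.5, 6.0, 8.0, 6.5, 7.0, 7.0, 6.5]

diff, p_value = paired_permutation_test(prompt_a, prompt_b)
print(f"mean difference = {diff:.2f}, p = {p_value:.4f}")
```

A paired test is used here because both templates see the same inputs, so per-example differences remove most of the variation between easy and hard queries. With only a handful of examples the test has little power, so a small p-value should be read as evidence rather than proof.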
The system is designed with a modular "Plug-and-Play" architecture:

    [Datasets]   [Prompts]   [Models]
        |           |           |
        v           v           v
    +-------------------------------+
    |    Prompt Execution Engine    |
    +---------------+---------------+
                    |
                    v
             [Raw Responses]
                    |
    +---------------+---------------+
    |     Evaluation Framework      |
    | (Relevance, Safety, Accuracy) |
    +---------------+---------------+
                    |
                    v
             [Results Store] -> [Analysis & Reporting]

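The data flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the repository's code: `ResultRow`, `run_pipeline`, `echo_model`, and `brevity_score` are hypothetical names, and models and evaluators are reduced to plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ResultRow:
    prompt_name: str
    model_name: str
    query: str
    response: str
    scores: Dict[str, float]


def run_pipeline(
    dataset: List[str],
    prompts: Dict[str, str],
    models: Dict[str, Callable[[str], str]],
    evaluators: Dict[str, Callable[[str, str], float]],
) -> List[ResultRow]:
    """Run every prompt template on every model and score each response."""
    results = []
    for prompt_name, template in prompts.items():
        for model_name, generate in models.items():
            for query in dataset:
                # Prompt Execution Engine: fill the template and call the model.
                response = generate(template.format(query=query))
                # Evaluation Framework: apply each scorer to the raw response.
                scores = {name: fn(query, response) for name, fn in evaluators.items()}
                results.append(ResultRow(prompt_name, model_name, query, response, scores))
    return results


# Stand-ins for illustration only: an echo "model" and a length-based scorer.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def brevity_score(query: str, response: str) -> float:
    return 10.0 if len(response) <= 80 else 5.0


rows = run_pipeline(
    dataset=["What is an index fund?"],
    prompts={"plain": "{query}", "polite": "Please answer: {query}"},
    models={"echo": echo_model},
    evaluators={"brevity": brevity_score},
)
for row in rows:
    print(row.prompt_name, row.model_name, row.scores)
```

Keeping each stage behind a plain callable is what makes the pieces swappable: a new model or scorer only has to match the function signature.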
- `models/`: Abstracted LLM clients. Includes a `LocalModelClient` for cost-free testing and an `OpenAIClient` for production usage.
- `evaluators/`: Scoring logic. Supports both heuristic rules (regex, keywords) and LLM-as-a-Judge patterns.
- `scripts/`: CLI tools for running experiments (`run_experiment.py`) and analyzing results (`compare_prompts.py`).
- `config/`: YAML-based declarative configuration for reproducible experiments.
We use a multi-dimensional scoring rubric. Each dimension is scored from 0.0 to 10.0.
| Dimension | Description | Method |
|---|---|---|
| Relevance | Does the response directly address the user query? | NLP Keyword Overlap / Vector Similarity |
| Accuracy | Is the information factually correct relative to ground truth? | Sequence Matching / Fact Checking Evaluator |
| Clarity | Is the response well-structured (bullet points, length)? | Heuristic Rules |
| Safety | Does the response avoid harmful content and policy violations? | Keyword Blacklist / Classifier |
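Two of the heuristic methods named in the table, keyword overlap and sequence matching, can be approximated with the standard library alone. The sketch below is a hedged illustration: the function names, the stop-word list, and the scaling to the 0.0 to 10.0 range are assumptions, not the playground's actual scoring code.

```python
import re
from difflib import SequenceMatcher

# A tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "what", "how", "do", "i"}


def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS}


def relevance_score(query, response):
    """Fraction of query keywords that appear in the response, scaled to 0-10."""
    keywords = tokenize(query)
    if not keywords:
        return 0.0
    return round(10.0 * len(keywords & tokenize(response)) / len(keywords), 2)


def accuracy_score(response, reference):
    """Character-level similarity to the reference answer, scaled to 0-10."""
    ratio = SequenceMatcher(None, response.lower(), reference.lower()).ratio()
    return round(10.0 * ratio, 2)


print(relevance_score("What is an index fund?", "An index fund tracks a market index."))
print(accuracy_score("Paris is the capital.", "Paris is the capital."))
```

Both scorers are cheap and reproducible, but they measure surface overlap only: a paraphrased correct answer can score low on `accuracy_score`, and a response that repeats the query's keywords can score high on `relevance_score` without answering it.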
Note: For production use cases, we recommend extending `evaluators/base.py` to use a strong LLM (e.g., GPT-4) as a judge for nuanced scoring.
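A minimal sketch of the LLM-as-a-Judge pattern follows. It is written against a plain callable rather than the repository's `BaseEvaluator`, whose full interface is not shown here; `JUDGE_TEMPLATE`, `JudgeResult`, `judge`, and `fake_llm` are hypothetical names, and the prompt wording is only one possible choice.

```python
import re
from dataclasses import dataclass
from typing import Callable

JUDGE_TEMPLATE = (
    "Rate the response to the query for {dimension} on a scale from 0 to 10.\n"
    "Reply in the form 'SCORE: <number>' followed by a one-sentence reason.\n\n"
    "Query: {query}\nResponse: {response}\n"
)


@dataclass
class JudgeResult:
    score: float
    reasoning: str


def judge(query: str, response: str, dimension: str,
          call_llm: Callable[[str], str]) -> JudgeResult:
    """Ask a judge model for a score and parse it defensively."""
    reply = call_llm(JUDGE_TEMPLATE.format(
        dimension=dimension, query=query, response=response))
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        # An unparseable reply is reported as a zero rather than guessed at.
        return JudgeResult(0.0, f"Could not parse judge reply: {reply!r}")
    score = min(10.0, max(0.0, float(match.group(1))))  # clamp to the rubric range
    reasoning = reply[match.end():].strip()
    return JudgeResult(score, reasoning)


# A canned stand-in for a real model call, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "SCORE: 8.5 The answer is professional and on topic."


result = judge("What is an index fund?", "A fund that tracks an index.", "Clarity", fake_llm)
print(result.score, result.reasoning)
```

Passing the model call in as `call_llm` keeps the judge independent of any one provider. Scoring an unparseable reply as 0.0 is a design choice made for this sketch; raising an error or retrying may be preferable when silent zeros would skew averages.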

    git clone https://github.com/kanyingidickson-dev/Prompt-evaluation-playground.git
    cd Prompt-evaluation-playground
    pip install -r requirements.txt

Set your API keys (if using real models):

    export OPENAI_API_KEY="sk-..."

Review the experiment config in `config/evaluation.yaml` to select your models and datasets.
Execute the main test harness:

    python scripts/run_experiment.py --config config/evaluation.yaml

Output:

    Starting Experiment: financial-advisor-v1-benchmark
    Loaded Evaluators: ['relevance', 'safety', 'accuracy', 'clarity']
    ...
    Experiment Complete. Results saved to results/

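The saved results can also be inspected directly. The sketch below ranks prompts by mean score and flags low-scoring rows; the CSV schema (columns `prompt`, `query_id`, `dimension`, `score`) is an assumption made for illustration and may not match the file the harness actually writes, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical results in an assumed long format: one row per score.
SAMPLE_CSV = """prompt,query_id,dimension,score
v1,q1,relevance,8.0
v1,q1,safety,9.5
v1,q2,relevance,4.0
v2,q1,relevance,9.0
v2,q1,safety,9.0
v2,q2,relevance,7.0
"""


def load_rows(handle):
    """Parse the CSV and convert the score column to float."""
    return [{**row, "score": float(row["score"])} for row in csv.DictReader(handle)]


def rank_prompts(rows):
    """Return (prompt, mean score) pairs, best first."""
    by_prompt = defaultdict(list)
    for row in rows:
        by_prompt[row["prompt"]].append(row["score"])
    return sorted(((p, mean(s)) for p, s in by_prompt.items()),
                  key=lambda pair: pair[1], reverse=True)


def find_failures(rows, threshold=5.0):
    """Return every row whose score falls below the threshold."""
    return [row for row in rows if row["score"] < threshold]


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rank_prompts(rows))
print(find_failures(rows))
```

To run the same functions on a real file, replace the `io.StringIO` handle with `open(path, newline="")` and adjust the column names to the actual schema.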
Compare prompts and rank performance:

    python scripts/compare_prompts.py --results results/results.csv

Detect constraint failures:

    python scripts/analyze_failures.py --threshold 5.0

Create a new file in `evaluators/` inheriting from `BaseEvaluator`:

    from .base import BaseEvaluator, EvaluationResult


    class ToneEvaluator(BaseEvaluator):
        def evaluate(self, query, response, reference=None):
            # Your custom logic here
            return EvaluationResult(score=8.5, reasoning="Professional tone detected.")

Implement the `BaseModelClient` interface in `models/`:
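
    # Assumes BaseModelClient is defined in models/base.py; adjust the import if it lives elsewhere.
    from .base import BaseModelClient


    class AnthropicClient(BaseModelClient):
        def generate(self, prompt, **kwargs):
            # Call the Claude API here and return the generated text
            raise NotImplementedError

MIT License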