
Add LLM prompting benchmark framework for controlled experiments#17

Draft
Copilot wants to merge 2 commits into main from copilot/setup-benchmarking-framework

Conversation

Contributor

Copilot AI commented Feb 18, 2026

Implements a standardized framework for evaluating LLM performance on data science tasks through controlled experiments that isolate individual variables (prompt strategy, model version, task type).

Structure

  • benchmarks/prompting/README.md: Methodology for controlled experiments (change one variable at a time), evaluation dimensions (reasoning depth, hallucination rate, numerical reliability, verbosity, tool-use), reproducibility checklist

  • benchmarks/prompting/results-template.md: Result matrix template with columns for model version, task type, prompt strategy, success rate, hallucination notes. Includes sections for statistical analysis, cross-model comparison, and raw data preservation

  • benchmarks/prompting/run_benchmark.py: Python runner with placeholder LLM client for API integration. Includes sample test cases (statistical reasoning, ML algorithm selection, data cleaning, code generation), CLI interface, and automated JSON/text logging
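The sample test cases in `run_benchmark.py` could be represented as a simple list of dicts, one per prompt. This is an illustrative sketch only; the field names (`id`, `task_type`, `prompt_strategy`, `prompt`) are assumptions, not necessarily the script's actual schema:

```python
# Hypothetical shape for the sample test cases; field names are illustrative.
TEST_CASES = [
    {
        "id": "statistical-reasoning-01",
        "task_type": "statistical reasoning",
        "prompt_strategy": "zero-shot",
        "prompt": (
            "A sample of 50 measurements has mean 12.4 and standard "
            "deviation 3.1. Compute the 95% confidence interval for the mean."
        ),
    },
    {
        "id": "ml-selection-01",
        "task_type": "ML algorithm selection",
        "prompt_strategy": "few-shot",
        "prompt": (
            "Given a tabular dataset with 200 rows and 40 features, "
            "which model family would you try first, and why?"
        ),
    },
]


def case_ids(cases):
    """Return the experiment IDs in order, for quick sanity checks."""
    return [c["id"] for c in cases]
```

Keeping each case as plain data makes it easy to add new task types without touching the runner logic.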

Usage

python run_benchmark.py \
  --model gpt-4-turbo-preview \
  --temperature 0.7 \
  --max-tokens 1000 \
  --experiment-id "statistical-reasoning-fewshot" \
  --output-dir results/

The script logs all prompts, responses, timing, and metadata for manual evaluation. Contributors should integrate their own API clients into the LLMClient.query() method.
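A minimal sketch of how the placeholder client and logging might fit together. The `run_case` helper and its parameters are hypothetical (not taken from the actual script); only `LLMClient.query()` is named in the PR description, and the real API call is left as a stub for contributors to fill in:

```python
import json
import time
from pathlib import Path


class LLMClient:
    """Placeholder client; contributors swap in a real API call in query()."""

    def __init__(self, model: str, temperature: float = 0.7, max_tokens: int = 1000):
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def query(self, prompt: str) -> str:
        # Replace this stub with a real API call (OpenAI, Anthropic, etc.).
        return f"[placeholder response from {self.model}]"


def run_case(client: LLMClient, prompt: str, experiment_id: str, output_dir: str) -> dict:
    """Query the client, then log prompt, response, and timing as JSON."""
    start = time.time()
    response = client.query(prompt)
    record = {
        "experiment_id": experiment_id,
        "model": client.model,
        "temperature": client.temperature,
        "prompt": prompt,
        "response": response,
        "elapsed_seconds": round(time.time() - start, 3),
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{experiment_id}.json").write_text(json.dumps(record, indent=2))
    return record
```

Logging one JSON file per experiment ID keeps raw data preserved for the manual-review step described in the methodology.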

Original prompt

This section details the original issue you should resolve

<issue_title>Comparative Prompting Benchmarks Across Leading LLMs</issue_title>
<issue_description>This issue aims to build a structured comparison of how different models respond to identical data science tasks. Contributors should design controlled prompt experiments and document variations in reasoning depth, hallucination rate, numerical reliability, verbosity, and tool-use performance. The goal is to create a reproducible evaluation layer, not anecdotal commentary.</issue_description>

<agent_instructions>For Issue #12, I need to set up a standardized benchmarking framework.

Create a folder benchmarks/prompting/.
Create a Markdown file benchmarks/prompting/README.md that defines the methodology. It must explain how to run a "controlled prompt experiment" (changing only one variable at a time).
Create a result matrix template in benchmarks/prompting/results-template.md with columns for: Model Version, Task Type, Prompt Strategy (e.g., Zero-shot vs Few-shot), Success Rate (%), and Notes on Hallucinations.
Generate a Python script benchmarks/prompting/run_benchmark.py that uses a simple list of prompts to query an LLM API (placeholder function is fine) and logs the output for manual review.</agent_instructions>
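The result matrix described in the instructions above might look like this (an illustrative sketch, not the committed template; the sample row is invented for demonstration):

```markdown
| Model Version       | Task Type             | Prompt Strategy | Success Rate (%) | Notes on Hallucinations             |
|---------------------|-----------------------|-----------------|------------------|-------------------------------------|
| gpt-4-turbo-preview | Statistical reasoning | Zero-shot       | 80               | Fabricated a citation in 1/5 runs   |
```

One row per (model, task, strategy) combination keeps the single-variable-change methodology visible in the data itself.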

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title [WIP] Add structured benchmarking framework for LLM comparisons Add LLM prompting benchmark framework for controlled experiments Feb 18, 2026
Copilot AI requested a review from natnew February 18, 2026 11:21

