
Add LLM prompting benchmark framework for controlled experiments#17

Draft
Copilot wants to merge 2 commits into main from copilot/setup-benchmarking-framework

Conversation

Contributor

Copilot AI commented Feb 18, 2026

Implements a standardized framework for evaluating LLM performance on data science tasks through controlled experiments that isolate individual variables (prompt strategy, model version, task type).

Structure

  • benchmarks/prompting/README.md: Methodology for controlled experiments (change one variable at a time), evaluation dimensions (reasoning depth, hallucination rate, numerical reliability, verbosity, tool-use), reproducibility checklist

  • benchmarks/prompting/results-template.md: Result matrix template with columns for model version, task type, prompt strategy, success rate, hallucination notes. Includes sections for statistical analysis, cross-model comparison, and raw data preservation

  • benchmarks/prompting/run_benchmark.py: Python runner with placeholder LLM client for API integration. Includes sample test cases (statistical reasoning, ML algorithm selection, data cleaning, code generation), CLI interface, and automated JSON/text logging
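The sample test cases in `run_benchmark.py` could be represented as a simple list of dicts, one per prompt. This is an illustrative sketch only; the field names (`id`, `task_type`, `prompt_strategy`, `prompt`) are assumptions, not necessarily the script's actual schema:

```python
# Hypothetical shape for the sample test cases; field names are illustrative.
TEST_CASES = [
    {
        "id": "statistical-reasoning-01",
        "task_type": "statistical reasoning",
        "prompt_strategy": "zero-shot",
        "prompt": (
            "A sample of 50 measurements has mean 12.4 and standard "
            "deviation 3.1. Compute the 95% confidence interval for the mean."
        ),
    },
    {
        "id": "ml-selection-01",
        "task_type": "ML algorithm selection",
        "prompt_strategy": "few-shot",
        "prompt": (
            "Given a tabular dataset with 200 rows and 40 features, "
            "which model family would you try first, and why?"
        ),
    },
]


def case_ids(cases):
    """Return the experiment IDs in order, for quick sanity checks."""
    return [c["id"] for c in cases]
```

Keeping each case as plain data makes it easy to add new task types without touching the runner logic.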

Usage

python run_benchmark.py \
  --model gpt-4-turbo-preview \
  --temperature 0.7 \
  --max-tokens 1000 \
  --experiment-id "statistical-reasoning-fewshot" \
  --output-dir results/

The script logs all prompts, responses, timing, and metadata for manual evaluation. Contributors should integrate their own API clients into the LLMClient.query() method.
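A minimal sketch of how the placeholder client and logging might fit together. The `run_case` helper and its parameters are hypothetical (not taken from the actual script); only `LLMClient.query()` is named in the PR description, and the real API call is left as a stub for contributors to fill in:

```python
import json
import time
from pathlib import Path


class LLMClient:
    """Placeholder client; contributors swap in a real API call in query()."""

    def __init__(self, model: str, temperature: float = 0.7, max_tokens: int = 1000):
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def query(self, prompt: str) -> str:
        # Replace this stub with a real API call (OpenAI, Anthropic, etc.).
        return f"[placeholder response from {self.model}]"


def run_case(client: LLMClient, prompt: str, experiment_id: str, output_dir: str) -> dict:
    """Query the client, then log prompt, response, and timing as JSON."""
    start = time.time()
    response = client.query(prompt)
    record = {
        "experiment_id": experiment_id,
        "model": client.model,
        "temperature": client.temperature,
        "prompt": prompt,
        "response": response,
        "elapsed_seconds": round(time.time() - start, 3),
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{experiment_id}.json").write_text(json.dumps(record, indent=2))
    return record
```

Logging one JSON file per experiment ID keeps raw data preserved for the manual-review step described in the methodology.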

Original prompt

This section details the original issue you should resolve

<issue_title>Comparative Prompting Benchmarks Across Leading LLMs</issue_title>
<issue_description>This issue aims to build a structured comparison of how different models respond to identical data science tasks. Contributors should design controlled prompt experiments and document variations in reasoning depth, hallucination rate, numerical reliability, verbosity, and tool-use performance. The goal is to create a reproducible evaluation layer, not anecdotal commentary.</issue_description>

<agent_instructions>For Issue #12, I need to set up a standardized benchmarking framework.

Create a folder benchmarks/prompting/.
Create a Markdown file benchmarks/prompting/README.md that defines the methodology. It must explain how to run a "controlled prompt experiment" (changing only one variable at a time).
Create a result matrix template in benchmarks/prompting/results-template.md with columns for: Model Version, Task Type, Prompt Strategy (e.g., Zero-shot vs Few-shot), Success Rate (%), and Notes on Hallucinations.
Generate a Python script benchmarks/prompting/run_benchmark.py that uses a simple list of prompts to query an LLM API (placeholder function is fine) and logs the output for manual review.</agent_instructions>
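The result matrix described in the instructions above might look like this (an illustrative sketch, not the committed template; the sample row is invented for demonstration):

```markdown
| Model Version       | Task Type             | Prompt Strategy | Success Rate (%) | Notes on Hallucinations             |
|---------------------|-----------------------|-----------------|------------------|-------------------------------------|
| gpt-4-turbo-preview | Statistical reasoning | Zero-shot       | 80               | Fabricated a citation in 1/5 runs   |
```

One row per (model, task, strategy) combination keeps the single-variable-change methodology visible in the data itself.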

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title [WIP] Add structured benchmarking framework for LLM comparisons Add LLM prompting benchmark framework for controlled experiments Feb 18, 2026
Copilot AI requested a review from natnew February 18, 2026 11:21

