
Roverlucas/genai-reproducibility-protocol


Same Prompt, Different Answer

Hidden Non-Determinism in LLM APIs Undermines Scientific Reproducibility

Lucas Rover, Hugo Valadares Siqueira, Eduardo Tadeu Bacalhau, Anibal Tavares de Azevedo & Yara de Souza Tadano

UTFPR (Universidade Tecnológica Federal do Paraná)

Status: Submitted to Nature Communications (March 2026) | Major Revision (May 2026) | Revision package finalised 2026-05-13 — awaiting coauthor sign-off and MTS upload


Overview

This repository contains the reference implementation, experimental data, analysis scripts, and manuscript for a study demonstrating that API-served large language models fail to reproduce their own outputs under documented "deterministic" settings. We provide a lightweight provenance protocol grounded in W3C PROV that makes this invisible variation visible, auditable, and attributable.
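
To illustrate how a provenance record can make output variation auditable, here is a minimal sketch in Python. The record layout, the field names (`activity`, `used`, `generated`) and the helper `make_run_record` are assumptions made for this example. They borrow the W3C PROV vocabulary loosely and are not the schema used in `src/protocol/`.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_text(text: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_run_record(prompt: str, output: str, model: str, params: dict) -> dict:
    """Build a PROV-flavoured record: an output entity generated by an
    activity (the model call) that used a prompt entity. Hypothetical schema."""
    # Canonical JSON so that logically equal parameter sets hash identically.
    params_json = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return {
        "activity": {
            "type": "llm_call",
            "model": model,
            "params_sha256": sha256_text(params_json),
            "ended_at": datetime.now(timezone.utc).isoformat(),
        },
        "used": {"prompt_sha256": sha256_text(prompt)},
        "generated": {"output_sha256": sha256_text(output)},
    }


params = {"temperature": 0, "seed": 42}
run_a = make_run_record("Summarise: ...", "Answer A", "example-model", params)
run_b = make_run_record("Summarise: ...", "Answer B", "example-model", params)

# Same prompt and settings, different output: the divergence shows up as a
# hash mismatch on the generated entity only.
same_inputs = (
    run_a["used"] == run_b["used"]
    and run_a["activity"]["params_sha256"] == run_b["activity"]["params_sha256"]
)
same_output = run_a["generated"] == run_b["generated"]
print(same_inputs, same_output)  # True False
```

Hashing the inputs and the output separately lets a reader attribute a mismatch to the deployment stack and rule out a change in prompt or parameters.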

Headline finding (original + revision): Across 7,004 controlled experiments on nine deployment stacks — tuples of (weights, provider, infrastructure, API) — and six task families (extraction, summarisation, multi-turn refinement, RAG, code generation, math reasoning), API-served stacks reproduce their own outputs as little as 1% of the time under temperature-zero greedy decoding with fixed seeds, while local-stack averages reach 93–98%.
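
As a rough illustration of what a reproduction rate like this measures, the sketch below computes an exact-match rate over repeated runs of one prompt. It assumes a pairwise definition (the share of run pairs with identical text). This README does not spell out the exact formula, so treat the function as one plausible reading, not the definition used in `src/metrics/`.

```python
from itertools import combinations


def exact_match_rate(outputs: list[str]) -> float:
    """Fraction of unordered run pairs whose outputs are identical strings."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(a == b for a, b in pairs) / len(pairs)


# Five repetitions of the same prompt; one run differs.
runs = ["42", "42", "42", "41", "42"]
print(exact_match_rate(runs))  # 6 matching pairs out of 10 -> 0.6
```

Under this pairwise reading, a single deviating run out of five already lowers the rate to 0.6, which is why small amounts of drift show up clearly.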

Companion paper (OSF preregistration 2026-05-12): Rover, L. & de Souza Tadano, Y. Reproducibility of Pollution–Health Evidence Synthesis using LLM-Assisted Screening and Extraction. OSF (2026). https://doi.org/10.17605/OSF.IO/VR934. Quantifies the empirical impact of the phenomenon documented in this paper on a 500-abstract evidence base in environmental health (23 study-level effect estimates appear/disappear depending on which run is used). Available for parallel editorial consideration at Nature Communications.

Repository Structure

├── article/                          # Manuscript and submission materials
│   ├── ncomms_main.tex               # Revised manuscript (Nature Communications, 28p)
│   ├── ncomms_main.pdf
│   ├── supplementary_nature_mi.tex   # Supplementary Information (S1–S13, 23p)
│   ├── supplementary_nature_mi.pdf
│   ├── CODE_SOFTWARE_CHECKLIST.md    # Nature Code/Software submission checklist
│   ├── ML_CHECKLIST_FILLED.md        # Machine Learning checklist
│   ├── REPORTING_SUMMARY_FILLED.md   # Reporting Summary
│   ├── references.bib                # Bibliography (56 entries, 0 orphan)
│   ├── sn-jnl.cls + sn-nature.bst    # Springer Nature template
│   └── figures/                      # 6 publication figures (PDF, 600 DPI)
├── response_letter/                  # Major revision response materials
│   ├── 01_point_by_point_response.tex   # 22 verbatim revquote blocks (R1.1–R1.15, R3.1–R3.6)
│   ├── 01_point_by_point_response.pdf   # 15 pages
│   ├── 03_revised_cover_letter.tex   # Cover letter for resubmission (May, 2026)
│   └── 03_revised_cover_letter.pdf   # 3 pages
├── submission_revision_v1/           # Complete resubmission package
│   ├── ncomms_main_tracked.tex       # latexdiff (track changes vs T5-snapshot)
│   ├── ncomms_main_tracked.pdf       # 30 pages with diff highlights
│   └── READY_FOR_REVIEW/             # 10 final documents (5 PDFs + 5 md notes)
├── overleaf_upload/                  # Submission ZIPs (rebuilt 2026-05-12)
│   ├── manuscript.zip                # Overleaf-ready (.tex + figures + bib + cls)
│   ├── response_letter.zip
│   ├── overleaf_complete.zip         # Both above combined
│   └── submission_mts.zip            # Final PDFs for Editorial Manager upload
├── src/                              # Reference implementation
│   ├── protocol/                     # Core protocol (logger, hasher, run/prompt cards, PROV)
│   ├── models/                       # Model runners (llama, gpt4, claude, gemini)
│   ├── metrics/                      # EMR, NED, ROUGE-L, BERTScore, validation, overhead
│   ├── tasks/                        # Revision-batch task modules (NEW)
│   │   ├── humaneval_loader.py       # HumanEval code generation
│   │   ├── gsm8k_loader.py           # GSM8K math reasoning
│   │   ├── pubmed_loader.py          # 10 PubMed PM2.5 abstracts
│   │   ├── pass_at_1.py              # Sandboxed code execution
│   │   ├── gsm8k_extractor.py        # Math final-answer regex
│   │   ├── llm_judge.py              # Claude Opus 4.7 LLM-as-judge
│   │   └── pm25_case_loader.py       # T3 case sampling
│   └── cost_estimator.py             # Per-call cost estimation + budget guard (NEW)
├── data/
│   └── inputs/
│       ├── abstracts.json            # 30 AI/ML abstracts (original)
│       └── revision/                 # Revision-batch inputs (NEW)
│           ├── humaneval.jsonl       # 164 HumanEval problems (30 sampled)
│           ├── gsm8k_test.jsonl      # 1,319 GSM8K problems (30 sampled)
│           ├── pubmed_pm25_t14.json  # 10 PubMed PM2.5 abstracts
│           └── t3_judge_cases.json   # 10 T3 LLM-judge sample cases
├── outputs/
│   ├── runs/                         # Original 4,104 PROV records
│   ├── run_cards/, prov/, prompt_cards/
│   └── revision/                     # NEW: 2,900 revision runs + T3 judge
│       ├── runs/                     # 2,900 JSONs (HumanEval, GSM8K, PubMed, multi-turn)
│       ├── t3_judge/                 # 10 Claude Opus verdict records
│       └── checkpoint.json           # Budget + resume state
├── analysis/                         # Analysis scripts and results
│   ├── regenerate_figures_nature_mi.py
│   ├── bootstrap_cis.json
│   ├── bertscore_per_field.py        # NEW: per-field BERTScore (R1.5)
│   ├── bertscore_per_field_results.json
│   ├── tables/
│   │   ├── table_per_field_metrics.tex
│   │   └── table_t1_t4_t14.tex       # NEW: revision EMR table
│   ├── figures/per_field_radar.pdf
│   └── revision/                     # NEW: revision-batch analyses
├── tests/                            # 102 tests (51 original + 51 revision)
├── run_experiments.py                # Original experiment runner
├── run_revision_experiments.py       # Unified revision pipeline
├── run_revision_full.sh              # Resumable orchestrator (~$25 budget guard)
├── run_t3_validation.py              # T3 LLM-judge runner (Claude Opus)
├── run_t3_extended.py                # T3 extended LLM-judge runner (gpt-4o)
├── analyze_revision_results.py       # Post-hoc revision analysis
├── REVISION_PLAN.md                  # Revision strategy (archival, 2026-05-08)
├── STATUS.md                         # Current state checkpoint
├── requirements.txt
└── LICENSE                           # MIT (code) + CC-BY 4.0 (data via Figshare)

Key Results

Original analysis (4,104 experiments, March 2026 submission)

| Model | Deployment | Extraction EMR | Summarisation EMR |
|---|---|---|---|
| Gemma 2 9B | Local | 1.000 [1.00, 1.00] | 1.000 [1.00, 1.00] |
| LLaMA 3 8B | Local | 0.987 [0.96, 1.00] | 0.947 [0.89, 0.99] |
| Mistral 7B | Local | 0.960 [0.88, 1.00] | 0.840 [0.72, 0.96] |
| DeepSeek Chat | API | 0.800 | 0.760 |
| GPT-4 (gpt-4-0613) | API | 0.443 [0.32, 0.57] | 0.230 [0.16, 0.30] |
| Claude Sonnet 4.5 | API | 0.190 [0.05, 0.40] | 0.020 [0.00, 0.05] |
| Gemini 2.5 Pro | API | Multi-turn: 0.010 [0.00, 0.03] | RAG: 0.070 [0.02, 0.13] |
| Perplexity Sonar | API | 0.100 | 0.010 |

Comparisons survive Holm-Bonferroni correction: 51 of 68 tests remain significant. Cliff's delta ranges from 0.784 to 0.896.
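
For readers unfamiliar with the two statistics named here, the following sketch shows a textbook Cliff's delta and a textbook Holm-Bonferroni step-down procedure. The input values are made up for illustration, and the analysis scripts in `analysis/` may compute these quantities differently.

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    """P(x > y) - P(x < y) over all cross pairs; ranges from -1 to 1."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))


def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Step-down Holm procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


local_emr = [1.0, 1.0, 0.9, 1.0]  # made-up per-abstract values
api_emr = [0.2, 0.0, 0.4, 0.1]    # made-up per-abstract values
print(cliffs_delta(local_emr, api_emr))            # 1.0: every local value is higher
print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))  # [True, False, False, False]
```

In the second call only the smallest p-value clears its threshold (0.05 / 4), so the procedure stops at the next one (0.03 > 0.05 / 3) and rejects nothing further.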

Revision additions (2,900 new experiments, May 2026)

Added in response to the editor's request for experiments beyond summarisation and to comments from Reviewers 1 and 3:

HumanEval (code generation, 30 problems × 5 reps × 8 stacks)

| Stack | EMR | 95% CI |
|---|---|---|
| Locals (Gemma 2 9B, LLaMA 3 8B, Mistral 7B) + Together AI + Gemini 2.5 Pro | 0.92–1.000 | |
| deepseek-chat | 0.837 | [0.72, 0.93] |
| gpt-4o (gpt-4o-2024-11-20) | 0.837 | [0.75, 0.92] |
| Claude Sonnet 4.5 | 0.393 | [0.27, 0.52] |
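
Functional correctness on code-generation tasks is usually checked by executing the completion against unit tests. The sketch below shows one way to do that with a child interpreter and a timeout. The names `strip_code_fence` and `passes` are hypothetical and do not describe the interface of `src/tasks/pass_at_1.py`. A plain subprocess contains crashes and hangs but is not a security sandbox.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Three backticks, built programmatically so this example can sit in a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:python)?\s*\n(.*?)" + TICKS, re.DOTALL)


def strip_code_fence(completion: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE.search(completion)
    return match.group(1) if match else completion


def passes(completion: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Run candidate plus tests in a child interpreter; True only on exit code 0."""
    program = strip_code_fence(completion) + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # infinite loops and slow code count as failures
    return result.returncode == 0


good = TICKS + "python\ndef add(a, b):\n    return a + b\n" + TICKS
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5"
print(passes(good, tests), passes(bad, tests))  # True False
```

Two completions can both pass the tests while differing textually, so a pass rate and an exact-match rate answer different questions about the same runs.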

GSM8K (math reasoning, 30 problems × 5 reps × 8 stacks)

| Stack | EMR | 95% CI |
|---|---|---|
| Locals + Together AI + Gemini | 0.84–1.000 | |
| deepseek-chat | 0.370 | [0.26, 0.49] |
| gpt-4o | 0.267 | [0.17, 0.37] |
| Claude Sonnet 4.5 | 0.063 | [0.03, 0.10] |

Multi-turn refinement extension to gpt-4o + deepseek-chat (R3.3)

| Stack | EMR | 95% CI |
|---|---|---|
| deepseek-chat | 0.350 | [0.13, 0.60] |
| gpt-4o | 0.090 | [0.02, 0.16] |

This confirms that the near-zero EMR pattern previously reported for Claude (0.040) and Gemini (0.010) extends across major cloud-served interactive pipelines.

PubMed PM2.5/respiratory health (10 abstracts, out-of-AI/ML probe)

| Stack | EMR | 95% CI |
|---|---|---|
| Locals + Together AI | 0.96–1.000 | |
| deepseek-chat | 0.660 | [0.46, 0.85] |
| gemini-2.5-pro | 0.490 | [0.26, 0.72] |
| gpt-4o | 0.420 | [0.27, 0.58] |
| Claude Sonnet 4.5 | 0.010 | [0.00, 0.03] |

T3 LLM-as-judge triangulation (R3.6)

10 PM2.5 disagreement cases judged blind by Claude Opus 4.7 with three pre-registered criteria (direction, magnitude ±20%, CI overlap):

  • 5 truly_contradictory
  • 3 semantically_equivalent
  • 2 ambiguous

Half of the divergences are substantive contradictions even though BERTScore F1 > 0.97 on the same outputs.
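
The three criteria can be made concrete with a small numeric check. In the study the verdicts come from an LLM judge, so the sketch below is only an illustration of what "direction, magnitude ±20%, CI overlap" could mean for two extracted effect estimates. The `Effect` type, the choice of the larger magnitude as the reference for the 20% tolerance, and the sample values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    estimate: float  # signed effect, e.g. a log relative risk
    ci_low: float
    ci_high: float


def compare(a: Effect, b: Effect, tolerance: float = 0.20) -> dict[str, bool]:
    """Check two extracted effects against the three criteria."""
    same_direction = (a.estimate > 0) == (b.estimate > 0)
    # Relative difference, measured against the larger magnitude (an assumption).
    scale = max(abs(a.estimate), abs(b.estimate))
    within_magnitude = scale == 0 or abs(a.estimate - b.estimate) / scale <= tolerance
    # Two closed intervals overlap when each starts before the other ends.
    ci_overlap = a.ci_low <= b.ci_high and b.ci_low <= a.ci_high
    return {
        "direction": same_direction,
        "magnitude": within_magnitude,
        "ci_overlap": ci_overlap,
    }


run_1 = Effect(estimate=0.12, ci_low=0.05, ci_high=0.19)    # made-up values
run_2 = Effect(estimate=-0.03, ci_low=-0.10, ci_high=0.04)  # made-up values
print(compare(run_1, run_2))
# {'direction': False, 'magnitude': False, 'ci_overlap': False}
```

A pair that fails all three checks, as here, would be a natural candidate for a "truly contradictory" verdict, even when the surrounding sentences are nearly identical.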

Per-field reproducibility analysis (R1.5)

BERTScore F1 saturates across all five extracted fields (Δ = 0.001, paired Cohen's d = -0.10), while EMR exposes a paired Cohen's d = +1.41 between conclusion-relevant fields (mean EMR = 0.455) and metadata fields (mean EMR = 0.684). BERTScore alone is structurally unable to detect substantive payload divergence.
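
A paired effect size of this kind can be sketched as the mean of the paired differences divided by their standard deviation. This is one common variant of paired Cohen's d (often written d_z). The README does not say which variant `analysis/bertscore_per_field.py` uses, and the input values below are made up.

```python
from statistics import mean, stdev


def paired_cohens_d(xs: list[float], ys: list[float]) -> float:
    """Mean of the paired differences divided by their sample standard deviation."""
    if len(xs) != len(ys):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / stdev(diffs)


# Made-up per-stack EMR values for two groups of fields.
metadata_emr = [0.70, 0.65, 0.72, 0.60]
conclusion_emr = [0.45, 0.50, 0.40, 0.48]
print(round(paired_cohens_d(metadata_emr, conclusion_emr), 2))
```

Because the statistic scales the mean difference by the spread of the differences, a metric whose values barely move between field groups yields a d near zero, while a metric that separates the groups consistently yields a large d.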

Revision total: 7,004 experiments across 9 deployment stacks and 6 task families

Reproducing the Experiments

Prerequisites

  • Python 3.10+ (developed and tested with Python 3.14.3)
  • Ollama v0.15+ (for local models)
  • API keys (optional): OpenAI, Anthropic, Google Gemini, DeepSeek, Together AI

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
ollama pull llama3:8b && ollama pull mistral:7b && ollama pull gemma2:9b

Running Original Experiments

# Local models (no API keys needed)
python run_experiments.py

# Expanded experiments (30 abstracts, all conditions)
python run_expanded_experiments.py

# API models (require respective API keys)
export OPENAI_API_KEY="..." && python run_experiments.py --gpt4-only
export ANTHROPIC_API_KEY="..." && python run_claude_multiturn.py
export GEMINI_API_KEY="..." && python run_gemini_multiturn.py

Running Revision Experiments (May 2026)

# Dry-run (no API calls, cost estimate only)
python run_revision_experiments.py --task all --stack all --condition C1 --dry-run

# Single task on a single stack
python run_revision_experiments.py \
  --task humaneval --stack gpt-4o --condition C1 \
  --n-problems 30 --n-reps 5 --execute

# Full revision batch (orchestrator with checkpoint + resume)
bash run_revision_full.sh

# T3 LLM-as-judge triangulation (PM2.5)
python run_t3_validation.py --sample           # Phase A: cache 10 cases
python run_t3_validation.py --judge --execute  # Phase B: ~$0.28 USD
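
The budget guard and checkpoint-and-resume behaviour mentioned above can be pictured with a short sketch. The class `SpendTracker`, its file layout and the run identifier are hypothetical. They are not the `BudgetGuard` in `src/cost_estimator.py` or the format of `outputs/revision/checkpoint.json`.

```python
import json
import tempfile
from pathlib import Path


class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spending over the cap."""


class SpendTracker:
    """Hypothetical helper: tracks estimated spend and persists it to disk."""

    def __init__(self, cap_usd: float, checkpoint: Path):
        self.cap_usd = cap_usd
        self.checkpoint = checkpoint
        # Resume from a previous run if a checkpoint file already exists.
        if checkpoint.exists():
            state = json.loads(checkpoint.read_text(encoding="utf-8"))
        else:
            state = {"spent_usd": 0.0, "done": []}
        self.spent_usd = state["spent_usd"]
        self.done = set(state["done"])

    def charge(self, run_id: str, estimated_cost_usd: float) -> None:
        """Reserve budget for one call, refusing if the cap would be exceeded."""
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(f"{run_id} would exceed ${self.cap_usd:.2f}")
        self.spent_usd += estimated_cost_usd
        self.done.add(run_id)
        self.checkpoint.write_text(
            json.dumps({"spent_usd": self.spent_usd, "done": sorted(self.done)}),
            encoding="utf-8",
        )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    tracker = SpendTracker(cap_usd=25.0, checkpoint=path)
    tracker.charge("humaneval-gpt-4o-000", 0.02)
    # A fresh tracker pointed at the same file picks up where the first stopped.
    resumed = SpendTracker(cap_usd=25.0, checkpoint=path)
    already_done = "humaneval-gpt-4o-000" in resumed.done
print(already_done)  # True
```

Checking the cap before each call, and writing the checkpoint after each charge, means an interrupted batch can restart without repeating paid calls or overshooting the budget.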

Analysis and Figures

# Original figures (Nature MI / NComms format)
python analysis/regenerate_figures_nature_mi.py

# Revision analysis (per-stack EMR + tables)
python analyze_revision_results.py

# Per-field BERTScore (R1.5 — revision)
python analysis/bertscore_per_field.py

Compiling the Manuscript

cd article
pdflatex ncomms_main.tex
pdflatex ncomms_main.tex                       # second pass for cross-refs
pdflatex supplementary_nature_mi.tex
pdflatex supplementary_nature_mi.tex

# Track-changes (latexdiff)
cd ..
latexdiff submission_revision_v1/ncomms_main_post_T5.tex article/ncomms_main.tex \
  > submission_revision_v1/ncomms_main_tracked.tex
cd submission_revision_v1
pdflatex ncomms_main_tracked.tex
pdflatex ncomms_main_tracked.tex

Tests

python -m pytest tests/ -v   # 102 tests (51 original + 51 revision-batch)

Test breakdown:

  • tests/test_core.py — 51 original protocol/metric/PROV tests
  • tests/test_cost_estimator.py — 16 revision tests (pricing, alias resolution, BudgetGuard)
  • tests/test_humaneval_loader.py — 8 tests (load, stratified sample, determinism)
  • tests/test_gsm8k_loader.py — 17 tests (loader + extractor + grade_runs)
  • tests/test_pass_at_1.py — 10 tests (correct, wrong, timeout, code-fence handling)

Editorial Submission Package

The full Major Revision package is at submission_revision_v1/READY_FOR_REVIEW/:

| # | Document | Pages |
|---|---|---|
| 00 | 00_README.md (instructions for coauthors) | |
| 01 | 01_revised_manuscript_clean.pdf | 27 |
| 02 | 02_revised_manuscript_tracked.pdf (latexdiff) | 27 |
| 03 | 03_supplementary.pdf (incl. §S11 + §S12 added in revision) | 18 |
| 04 | 04_point_by_point_response.pdf (verbatim quotes for 15 R1 + 6 R3) | 15 |
| 05 | 05_cover_letter.pdf | 2 |
| 06 | 06_changes_log.md (granular by reviewer point) | |
| 07 | 07_ml_checklist.md (updated with revision additions) | |
| 08 | 08_reporting_summary.md (T13 deployment-mode clarification) | |
| 09 | 09_code_software_checklist.md (new for revision) | |

Companion Paper

The 500-abstract PM2.5/respiratory-health analysis underlying the applied-impact finding is reported in detail in our companion paper:

Rover, L. & Tadano, Y. When the same question gets different answers: quantifying LLM non-determinism in evidence synthesis. Research Synthesis Methods (2026, submitted).

This NatComms manuscript does NOT reproduce the companion paper's detailed pairwise reproducibility, silver-standard validation, or meta-analytic propagation tables; it cites them and reports only the aggregate finding (23 study-level effect estimates appear/disappear depending on which run is used) plus an independent in-paper LLM-judge triangulation on 10 new cases distinct from the 23 effects.

Citation

If you use this protocol or dataset, please cite:

Rover, L., Siqueira, H. V., Bacalhau, E. T., de Azevedo, A. T. & Tadano, Y. S. Same Prompt, Different Answer: Hidden Non-Determinism in LLM APIs Undermines Scientific Reproducibility. Nature Communications (2026, major revision).

License

  • Code: MIT License
  • Data and manuscript: CC-BY 4.0

About

Reference implementation, full provenance dataset (7,004 experiments), and revision-ready submission package for the Nature Communications study 'Same Prompt, Different Answer: Hidden Non-Determinism in LLM APIs Undermines Scientific Reproducibility' (Rover et al., 2026, under Major Revision).
