Examples

Real model output, verbatim from benchmark runs. Each example shows the same task answered by the same model twice, once with no skill (## Without Ponytail) and once with ponytail (## With Ponytail), so you can compare them side by side. Model: Claude Haiku 4.5, temperature 1. Source: benchmarks/output.json.

These are not hand-written. To reproduce them yourself, run: npx promptfoo@latest eval -c benchmarks/promptfooconfig.yaml. The method, results for all three models, and median-of-10 numbers are in ../benchmarks/.