Real model output, verbatim from benchmark runs: the same task answered by the same model
with no skill (`## Without Ponytail`) and with ponytail (`## With Ponytail`), so you can
compare the two side by side. Model: Claude Haiku 4.5, temperature 1; source: `benchmarks/output.json`.

These outputs are not hand-written. To reproduce them yourself, run
`npx promptfoo@latest eval -c benchmarks/promptfooconfig.yaml`. For the method, results for
all three models, and median-of-10 numbers, see `../benchmarks/`.

| Example | Without (LOC) | With (LOC) |
|---|---|---|
| Email Validation | 75 | 3 |
| Debounce | 116 | 10 |
| CSV Sum | 20 | 3 |
| Countdown Timer | 267 | 9 |
| Rate Limiting | 128 | 10 |
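
To read the table as reduction factors, the arithmetic can be sketched as below. This is a minimal sketch in Python: the LOC figures are copied from the table, while the names `LOC` and `reduction` are hypothetical and not part of the benchmark harness.

```python
# LOC figures copied from the table above: (without, with).
LOC = {
    "Email Validation": (75, 3),
    "Debounce": (116, 10),
    "CSV Sum": (20, 3),
    "Countdown Timer": (267, 9),
    "Rate Limiting": (128, 10),
}


def reduction(without: int, with_: int) -> float:
    """Hypothetical helper: how many times shorter the 'with' output is."""
    return without / with_


# Per-example ratio of lines without the skill to lines with it.
for name, (without, with_) in LOC.items():
    print(f"{name}: {reduction(without, with_):.1f}x fewer lines")

# Totals across the five examples in the table.
total_without = sum(without for without, _ in LOC.values())
total_with = sum(with_ for _, with_ in LOC.values())
print(f"Total: {total_without} -> {total_with} LOC")
```

These ratios describe only the five runs shown in the table; the median-of-10 numbers are in `../benchmarks/`.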