LLM Benchmarking #3486

bradleyshep · 2025-10-25T02:20:39Z

Notes

The CI check on this PR doesn't seem to be working as intended. I'm new to GitHub Actions/CI, so guidance is welcome.

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

CLI: llm with subcommands run, routes list, diff, ci-check.
Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
Providers/clients: route layer (provider:model) with HTTP clients for each LLM vendor; API keys and base URLs are read from the environment.
Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
Build & guards: build script for compile-time setup.
Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).
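To illustrate the route layer described above, here is a minimal sketch of how a `provider:model` string could be parsed and then narrowed by the `--providers` / `--models` filters. The names `Route`, `parse_route`, and `filter_routes` are hypothetical and chosen for illustration; they are not taken from the PR's code, and the actual implementation may differ.

```rust
/// A route pairs a provider with one of its models, written `provider:model`.
/// (Hypothetical type for illustration; not the PR's actual definition.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Route {
    pub provider: String,
    pub model: String,
}

/// Parse `provider:model`. The split happens at the first `:`, so a model
/// name may itself contain colons. Both halves must be non-empty.
pub fn parse_route(s: &str) -> Result<Route, String> {
    let (provider, model) = s
        .split_once(':')
        .ok_or_else(|| format!("route `{s}` is missing `:`"))?;
    let (provider, model) = (provider.trim(), model.trim());
    if provider.is_empty() || model.is_empty() {
        return Err(format!("route `{s}` has an empty provider or model"));
    }
    Ok(Route {
        provider: provider.to_string(),
        model: model.to_string(),
    })
}

/// Keep only routes matching the optional filters.
/// Assumption: an empty filter list means "no restriction".
pub fn filter_routes(routes: &[Route], providers: &[&str], models: &[&str]) -> Vec<Route> {
    routes
        .iter()
        .filter(|r| providers.is_empty() || providers.contains(&r.provider.as_str()))
        .filter(|r| models.is_empty() || models.contains(&r.model.as_str()))
        .cloned()
        .collect()
}
```

For example, `parse_route("openai:gpt-5")` yields a route with provider `openai` and model `gpt-5`, while a string without a colon is rejected with an error message.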

How it works

Pick what to run
- Choose tasks (--tasks 0,7,12), or a language (--lang rust|csharp), or categories (--categories basics,schema).
- Optionally limit vendors/models (--providers …, --models …).
Resolve routes
- Read env (API keys + base URLs) and build the active set (e.g., openai:gpt-5).
Build context
- Start Spacetime
- Publish golden answer modules
- Prepare prompts and send them to the LLM
- Attempt to publish LLM module
Execute calls
- Run the selected tasks within each test against selected models and languages.
Score outputs
- Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
- Record the score and any short failure reason.
Update results file
- Write/update the single results JSON with task/route outcomes, timings, and summaries.
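The scoring step above can be sketched roughly as follows. This is only an illustration of what a deterministic scorer with a short failure reason might look like; `Score`, `score_equality`, and `score_count` are hypothetical names, and the whitespace normalization is an assumption rather than something the PR confirms.

```rust
/// Outcome of one deterministic check.
/// (Hypothetical type for illustration; not the PR's actual schema.)
#[derive(Debug, PartialEq)]
pub struct Score {
    pub passed: bool,
    /// Short failure reason, present only when the check failed.
    pub reason: Option<String>,
}

/// Collapse runs of whitespace so formatting differences are ignored.
/// Assumption: the real scorers may normalize differently or not at all.
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Equality scorer: passes when the normalized outputs are identical.
pub fn score_equality(expected: &str, actual: &str) -> Score {
    let (e, a) = (normalize(expected), normalize(actual));
    if e == a {
        Score { passed: true, reason: None }
    } else {
        Score {
            passed: false,
            reason: Some(format!("expected `{e}`, got `{a}`")),
        }
    }
}

/// Count scorer: passes when the number of returned rows matches.
pub fn score_count(expected: usize, actual: usize) -> Score {
    if expected == actual {
        Score { passed: true, reason: None }
    } else {
        Score {
            passed: false,
            reason: Some(format!("expected {expected} rows, got {actual}")),
        }
    }
}
```

Because both functions depend only on their inputs, the same pair of outputs always produces the same score and the same failure reason, which is what makes the results file comparable across runs.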

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

External model APIs may rate-limit/timeout; concurrency tunable via LLM_BENCH_CONCURRENCY / LLM_BENCH_ROUTE_CONCURRENCY.
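As a rough sketch of how such a concurrency knob could be read, the snippet below parses an environment variable into a positive integer and falls back to a default otherwise. The function names and the fallback behavior are assumptions for illustration; only the variable names LLM_BENCH_CONCURRENCY and LLM_BENCH_ROUTE_CONCURRENCY come from this PR.

```rust
use std::env;

/// Interpret a raw setting as a positive integer, falling back to `default`
/// when it is absent, unparsable, or zero.
/// (Hypothetical helper; the fallback policy is an assumption.)
pub fn parse_concurrency(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

/// Read the setting from the environment variable `name`.
pub fn concurrency_from_env(name: &str, default: usize) -> usize {
    parse_concurrency(env::var(name).ok().as_deref(), default)
}
```

A caller might use `concurrency_from_env("LLM_BENCH_CONCURRENCY", 4)`, where the default of 4 is purely illustrative. Lowering the value is the obvious first step when a vendor starts rate-limiting.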

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust and C#). I also tested the CI check locally using act.

Please verify

.github/workflows/ci.yml

+    name: Verify docs/rustdoc_json hashes
+    if: ${{ github.event_name == 'pull_request' }}
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+
+      - name: Run hash check (both langs)
+        working-directory: public/crates/xtask-llm-benchmark
+        run: cargo llm ci-check


The best way to fix the problem is to restrict the permissions of the GITHUB_TOKEN explicitly for the llm_ci_check job, by adding a permissions: block just below its job name in .github/workflows/ci.yml. As per the CodeQL suggestion and established best practices, the minimal required permission is contents: read, which grants the job read-only access to code in the repository. This allows the job to perform checkout and CI actions without granting unnecessary write permissions. No changes are required to existing imports, steps, or functionality.

Implement this by editing .github/workflows/ci.yml to add:

permissions:
  contents: read

directly beneath the name: field for the llm_ci_check job (line 413).

bradleyshep added 8 commits October 24, 2025 15:00

init files

7110164

Update llm-benchmark-details.json

ef61f28

llm benchmarks (moved from private)

a200254

remove dotenvy

58a5a08

ignore registry

af45a20

summary updates; command

961dd1c

Merge branch 'LLM-benchmarks' into bradley/llm-benchmark

8e6624f

develop updates

b59650f

bradleyshep requested a review from cloutiertyler October 25, 2025 02:20

github-advanced-security bot found potential problems Oct 25, 2025

View reviewed changes

bradleyshep added 4 commits October 24, 2025 22:28

DEVELOP + registry ignored

38596ba

change generated registry to use relative paths + include in git

7d69779

attempt fix to pass

3aa051b

DEVELOP updates; clippy fixes?

e443251

bfops added the release-any To be landed in any release window label Oct 27, 2025

3 participants

@@ -411,6 +411,8 @@
   llm_ci_check:
     name: Verify docs/rustdoc_json hashes
+    permissions:
+      contents: read
     if: ${{ github.event_name == 'pull_request' }}
     runs-on: ubuntu-latest
     steps:
