@bradleyshep bradleyshep commented Oct 25, 2025

Notes

  • The CI check on this PR doesn't seem to be working as intended. I'm new to GitHub Actions/CI, so guidance is welcome.

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

  • CLI: llm with subcommands run, routes list, diff, ci-check.
  • Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
  • Providers/clients: route layer (provider:model) with HTTP clients for each LLM vendor; env-driven keys/base URLs.
  • Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
  • Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
  • Build & guards: build script for compile-time setup.
  • Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).
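To make the route layer concrete, here is a minimal sketch of how a `provider:model` string such as `openai:gpt-5` might be parsed. The `Route` type and the `parse_route` / `parse_routes` helpers are hypothetical names for illustration, not the PR's actual code, and lowercasing the provider is an assumption.

```rust
/// A parsed `provider:model` route. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Route {
    provider: String,
    model: String,
}

/// Parse a single `provider:model` string, e.g. `openai:gpt-5`.
fn parse_route(s: &str) -> Result<Route, String> {
    // Split on the first colon only, so a model name may itself contain colons.
    let (provider, model) = s
        .split_once(':')
        .ok_or_else(|| format!("route `{s}` is missing a `:` separator"))?;
    let (provider, model) = (provider.trim(), model.trim());
    if provider.is_empty() || model.is_empty() {
        return Err(format!("route `{s}` has an empty provider or model"));
    }
    Ok(Route {
        // Assumption: provider names are case-insensitive.
        provider: provider.to_lowercase(),
        model: model.to_string(),
    })
}

/// Parse a whitespace-separated list of routes, failing on the first bad entry.
fn parse_routes(list: &str) -> Result<Vec<Route>, String> {
    list.split_whitespace().map(parse_route).collect()
}

fn main() {
    let routes = parse_routes("openai:gpt-5 anthropic:claude-sonnet-4-5")
        .expect("routes should parse");
    for r in &routes {
        println!("{} -> {}", r.provider, r.model);
    }
}
```

Returning a `Result` with a short message keeps malformed routes from silently dropping out of the active set.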

How it works

  1. Pick what to run

    • Choose tasks (--tasks 0,7,12), or a language (--lang rust|csharp), or categories (--categories basics,schema).
    • Optionally limit vendors/models (--providers …, --models …).
  2. Resolve routes

    • Read env (API keys + base URLs) and build the active set (e.g., openai:gpt-5).
  3. Build context

    • Start Spacetime
    • Publish golden answer modules
    • Prepare prompts and send them to the LLM
    • Attempt to publish LLM module
  4. Execute calls

    • Run the selected tasks within each test against selected models and languages.
  5. Score outputs

    • Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
    • Record the score and any short failure reason.
  6. Update results file

    • Write/update the single results JSON with task/route outcomes, timings, and summaries.
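The scoring step above (a deterministic check plus a short failure reason) could look roughly like the sketch below. It shows only a plain equality scorer; `ScoreResult`, `normalize`, and `score_equality` are hypothetical names, and the whitespace normalization is an assumption rather than what the PR's scorers do.

```rust
/// Outcome of scoring one task output. Hypothetical shape, for illustration.
#[derive(Debug, PartialEq)]
struct ScoreResult {
    passed: bool,
    /// Short failure reason; `None` when the check passed.
    reason: Option<String>,
}

/// Assumed normalization: strip trailing whitespace per line and
/// surrounding blank space, so cosmetic differences do not fail a task.
fn normalize(s: &str) -> String {
    s.lines()
        .map(str::trim_end)
        .collect::<Vec<_>>()
        .join("\n")
        .trim()
        .to_string()
}

/// Deterministic equality scorer: same inputs always give the same result.
fn score_equality(golden: &str, candidate: &str) -> ScoreResult {
    let (g, c) = (normalize(golden), normalize(candidate));
    if g == c {
        return ScoreResult { passed: true, reason: None };
    }
    // Report the first differing line, or a line-count mismatch if one
    // output is a strict prefix of the other.
    let first_diff = g.lines().zip(c.lines()).position(|(a, b)| a != b);
    let reason = match first_diff {
        Some(i) => format!("line {} differs from golden", i + 1),
        None => format!(
            "line count differs: golden {}, candidate {}",
            g.lines().count(),
            c.lines().count()
        ),
    };
    ScoreResult { passed: false, reason: Some(reason) }
}

fn main() {
    let result = score_equality("a\nb\n", "a\nc\n");
    println!("{result:?}");
}
```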

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

  • External model APIs may rate-limit/timeout; concurrency tunable via LLM_BENCH_CONCURRENCY / LLM_BENCH_ROUTE_CONCURRENCY.
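As a rough illustration of env-driven concurrency limits, the sketch below reads the two variables named above and falls back to a default when a value is missing or invalid. The helper names and the default of 4 are assumptions for illustration; the PR's actual defaults are not stated here.

```rust
use std::env;

/// Fallback used when a variable is unset or invalid.
/// The value 4 is an assumption, not the PR's actual default.
const DEFAULT_CONCURRENCY: usize = 4;

/// Parse a raw setting, rejecting zero and non-numeric values.
fn parse_concurrency(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

/// Read a concurrency limit from the environment variable `var`.
fn concurrency_from_env(var: &str, default: usize) -> usize {
    parse_concurrency(env::var(var).ok().as_deref(), default)
}

fn main() {
    let bench = concurrency_from_env("LLM_BENCH_CONCURRENCY", DEFAULT_CONCURRENCY);
    let route = concurrency_from_env("LLM_BENCH_ROUTE_CONCURRENCY", DEFAULT_CONCURRENCY);
    println!("LLM_BENCH_CONCURRENCY={bench} LLM_BENCH_ROUTE_CONCURRENCY={route}");
}
```

Lowering either value is the obvious first step when a vendor starts rate-limiting or timing out.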

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust + C#). I also tested the CI check locally using act.

Please verify

  • llm run --tasks 0,1,2 (explicit run)
  • llm run --lang rust --categories basics (filters)
  • llm run --categories basics,schema (multiple categories)
  • llm run --lang csharp (language switch)
  • llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5" (provider/model limits)
  • llm run --hash-only (dry integrity)
  • llm run --goldens-only (test goldens only)
  • llm run --force (skip hash check)
  • llm ci-check
  • Stats viewer loads the JSON; filtering and CSV export work
  • CI works as intended
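For reviewers checking the `--tasks` commands above, this is one way a comma-separated task list such as `0,7,12` might be turned into task IDs. `parse_task_ids` is a hypothetical helper; skipping empty entries and dropping duplicates are assumptions, not confirmed behavior of the CLI.

```rust
/// Parse a comma-separated list of globally numbered task IDs, e.g. `0,7,12`.
/// Hypothetical helper; the real CLI may handle edge cases differently.
fn parse_task_ids(arg: &str) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for part in arg.split(',') {
        let part = part.trim();
        if part.is_empty() {
            // Assumption: tolerate stray commas such as `0,,1`.
            continue;
        }
        let id: u32 = part
            .parse()
            .map_err(|_| format!("invalid task id `{part}`"))?;
        // Assumption: keep first-seen order and drop duplicates.
        if !ids.contains(&id) {
            ids.push(id);
        }
    }
    if ids.is_empty() {
        return Err("no task ids given".to_string());
    }
    Ok(ids)
}

fn main() {
    match parse_task_ids("0,7,12") {
        Ok(ids) => println!("running tasks {ids:?}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```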

Comment on lines +413 to +424
    name: Verify docs/rustdoc_json hashes
    if: ${{ github.event_name == 'pull_request' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2

      - name: Run hash check (both langs)
        working-directory: public/crates/xtask-llm-benchmark
        run: cargo llm ci-check

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 5 days ago

The best way to fix the problem is to restrict the permissions of the GITHUB_TOKEN explicitly for the llm_ci_check job, by adding a permissions: block just below its job name in .github/workflows/ci.yml. As per the CodeQL suggestion and established best practices, the minimal required permission is contents: read, which grants the job read-only access to code in the repository. This allows the job to perform checkout and CI actions without granting unnecessary write permissions. No changes are required to existing imports, steps, or functionality.

Implement this by editing .github/workflows/ci.yml to add:

permissions:
  contents: read

directly beneath the name: field for the llm_ci_check job (line 413).


Suggested changeset 1
.github/workflows/ci.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -411,6 +411,8 @@
 
   llm_ci_check:
     name: Verify docs/rustdoc_json hashes
+    permissions:
+      contents: read
     if: ${{ github.event_name == 'pull_request' }}
     runs-on: ubuntu-latest
     steps:
EOF
@bfops bfops added the release-any To be landed in any release window label Oct 27, 2025