Add MiniMax M3 (post-v1.1, rotated v1.3 council) #7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,10 +24,14 @@ Models evaluated after the v1.1-frozen snapshot are appended to the committed da | |
| | Model | Provider | Released | Tested | Council | Snapshot | Headline | | ||
| |---|---|---|---|---|---|---| | ||
| | **Claude Opus 4.8** \* | Anthropic | [2026-05-28](https://www.anthropic.com/news/claude-opus-4-8) | 2026-05-29 | **v1.3** (rotated) | post-v1.1 | PC Tier A (TPR 100 %); benign 57 %, dual-use 100 %, Youden's J **+0.43** | | ||
| | **MiniMax M3** \* | MiniMax | early Jun 2026 | 2026-06-03 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 80 %, between B-cap 73 % and A-floor 95 %); benign 21 %, dual-use 29 %, Youden's J **+0.59** | | ||
|
|
||
| The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. Opus 4.8 walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (Youden's J +0.23 → +0.43) while holding dual-use refusal at 100 %. | ||
| The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. | ||
|
Fix model-count arithmetic in snapshot description. “18 frontier + Llama control + Nemotron” totals 20, which conflicts with the surrounding 19-model framing. Please correct either the count or the listed components.
||
|
|
||
| > **\* Rotated v1.3 council.** Claude Opus 4.8 was adjudicated under a rotated three-judge panel (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), **not** the original v1.1 panel (NVIDIA Nemotron + Cohere via Bedrock + AI21 Jamba). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` was no longer available on OpenRouter (HTTP 404, no endpoints found) and had no corresponding Bedrock deployment; `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock and access-denied due to >30 days inactivity. Both judges were replaced with verified-live alternatives maintaining the no-org-overlap invariant. Two of three judges differ from the original panel, so cross-panel comparisons should be read with that caveat (mean inter-judge agreement is comparable: 0.955 vs 0.975). Full judge history is documented in [`benchmark/council/v1.1.json`](benchmark/council/v1.1.json). | ||
| - **Opus 4.8** walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (Youden's J +0.23 → +0.43) while holding dual-use refusal at 100 %. | ||
| - **MiniMax M3** refuses more on every tier than M2.7 (dual-use 14 % → 29 %, PC TPR 72 % → 80 %, moving from Tier B into the gap zone), but benign over-refusal more than tripled (6 % → 21 %), so Youden's J slips slightly (+0.66 → +0.59). Dangerous-side gain didn't outpace the benign-side drift. | ||
|
|
||
| > **\* Rotated v1.3 council.** Both post-frozen models (Opus 4.8 and MiniMax M3) were adjudicated under a rotated three-judge panel (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), **not** the original v1.1 panel (NVIDIA Nemotron + Cohere via Bedrock + AI21 Jamba). As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` was no longer available on OpenRouter (HTTP 404, no endpoints found) and had no corresponding Bedrock deployment; `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock and access-denied due to >30 days inactivity. Both judges were replaced with verified-live alternatives maintaining the no-org-overlap invariant. Two of three judges differ from the original panel, so cross-panel comparisons should be read with that caveat (mean inter-judge agreement is comparable: ~0.96 for the post-frozen models vs 0.975 for the original panel). Full judge history is documented in [`benchmark/council/v1.1.json`](benchmark/council/v1.1.json). | ||
|
|
||
| --- | ||
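The Youden's J figures quoted in the table row and bullets above are consistent with J = TPR - FPR, taking the PC TPR as the true-positive rate and the benign refusal rate as the false-positive rate. That pairing is inferred from the reported numbers rather than stated in the diff, so treat the following as a sanity-check sketch, not the benchmark's scoring code. The function name `youden_j` is hypothetical.

```python
def youden_j(tpr_pct: float, benign_refusal_pct: float) -> float:
    """Youden's J = sensitivity + specificity - 1 = TPR - FPR, on a 0-1 scale.

    Assumption: benign refusals play the role of false positives.
    """
    return round((tpr_pct - benign_refusal_pct) / 100.0, 2)


# (PC TPR %, benign refusal %, reported J) as quoted in the README diff.
reported = {
    "Claude Opus 4.8": (100, 57, 0.43),
    "MiniMax M3": (80, 21, 0.59),
    "MiniMax M2.7": (72, 6, 0.66),
}

for name, (tpr, benign, j) in reported.items():
    print(f"{name}: J = {youden_j(tpr, benign):+.2f} (reported {j:+.2f})")
```

Under this reading, M3's 8-point TPR gain over M2.7 is outweighed by its 15-point rise in benign refusals, which is why J drops from +0.66 to +0.59.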
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -167,6 +167,16 @@ | |
| "role": "primary", | ||
| "pricing_usd_per_mtok": {"input": 0.75, "output": 4.5} | ||
| }, | ||
| { | ||
| "model_id": "minimax/minimax-m3", | ||
| "display_name": "MiniMax M3", | ||
| "provider": "openrouter", | ||
| "jurisdiction": "asia", | ||
| "organization": "minimax", | ||
| "role": "v1.3_addition", | ||
| "routing_note": "Released 2026-05-31. OpenRouter ID: minimax/minimax-m3. Multimodal (text/image/video input), 1M context. Replaces M2.7 in the panel.", | ||
|
Clarify the M3 routing note to avoid contradiction with active M2.7 entry.
||
| "pricing_usd_per_mtok": {"input": 0.3, "output": 1.2} | ||
| }, | ||
| { | ||
| "model_id": "minimax/minimax-m2.7-20260318", | ||
| "display_name": "MiniMax M2.7", | ||
|
|
||
Use a concrete M3 release date for consistency across docs.
This row says “early Jun 2026,” while the config documents `Released 2026-05-31`. Using the exact date in both places avoids timeline ambiguity.
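A date mismatch like this one could be caught mechanically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes routing notes keep the `Released YYYY-MM-DD.` prefix seen in the M3 config entry and that the README "Released" cell should contain the same ISO date.

```python
import re
from datetime import date
from typing import Optional

# Assumed convention: the routing note starts with "Released YYYY-MM-DD."
RELEASED = re.compile(r"Released\s+(\d{4}-\d{2}-\d{2})")


def release_date_from_note(routing_note: str) -> Optional[date]:
    """Return the ISO date that follows 'Released' in a routing note, if any."""
    match = RELEASED.search(routing_note)
    return date.fromisoformat(match.group(1)) if match else None


def readme_cell_matches(cell: str, expected: date) -> bool:
    """True when the README 'Released' cell contains the exact ISO date."""
    return expected.isoformat() in cell


note = "Released 2026-05-31. OpenRouter ID: minimax/minimax-m3."
released = release_date_from_note(note)

if released is not None and not readme_cell_matches("early Jun 2026", released):
    print(f"README release cell disagrees with config date {released.isoformat()}")
```

With the values in this PR the check reports a disagreement, since "early Jun 2026" does not contain `2026-05-31`.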