Add MiniMax M3 (post-v1.1, rotated v1.3 council)#7
Conversation
Appends MiniMax M3 to the committed data as a second post-frozen addition. The v1.1-frozen 13,389 rows are left unchanged. - snapshots/2026-05-minimax3/eval/minimax_m3.csv: 705 raw responses (clean) - snapshots/2026-05/council/adjudicated.csv: +705 M3 rows (14,094 to 14,799) - should_refuse_sweep_public.csv: +75 M3 PC rows (1,500 to 1,575) - sweep_models.json: registers M3 - README "Model updates" table: M3 row + brief comparison vs M2.7 - CHANGELOG entry MiniMax M3: PC gap zone (TPR 80%, between B-cap 73% and A-floor 95%), benign 21%, dual-use 29%, Youden's J +0.59. Refuses more dangerous prompts than M2.7 (TPR 72% to 80%) but benign over-refusal tripled (6% to 21%), so J slips +0.66 to +0.59. Adjudicated under the same rotated v1.3 council as Opus 4.8; no further judge rotation since 2026-05-29. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds MiniMax M3 model support to the RefusalBench suite. The model is registered in the benchmark configuration, documented in the changelog, and added to the readme with comparative metrics against M2.7 and rotated council footnote updates. ChangesMiniMax M3 Model Addition
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/config/sweep_models.json`:
- Line 177: The routing_note for minimax/minimax-m3 currently says "Replaces
M2.7 in the panel" which contradicts the presence of
minimax/minimax-m2.7-20260318 still marked role: "primary"; update the
routing_note text to reflect that M3 is a post-v1.1 addition compared against a
frozen M2.7 (or explicitly mark M2.7 deprecated) so the note and the model
entries are consistent—modify the "routing_note" string for minimax/minimax-m3
to read something like "post-v1.1 addition compared against frozen M2.7" or
change the minimax/minimax-m2.7-20260318 entry to indicate deprecation if
replacement is intended.
In `@README.md`:
- Line 29: The snapshot description for "v1.1-frozen panel" currently lists "18
frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B" which
sums to 20 but the surrounding text frames the panel as 19 models; reconcile
this by either changing "18 frontier models" to "17 frontier models" or by
removing/adjusting one of the listed components so the total equals 19, and
update the phrase "18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron
3 Super 120B" accordingly to match the canonical 19-model count.
- Line 27: Update the table row for "MiniMax M3 *" to use the concrete release
date used in config documents: replace the string "early Jun 2026" with
"2026-05-31" so the README's table entry for MiniMax M3 (the row containing
"MiniMax M3 * | MiniMax | early Jun 2026 | ...") matches the configured
"Released 2026-05-31" date across docs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 66ae8ca7-e6a0-422d-8532-8fb33d417137
⛔ Files ignored due to path filters (3)
results/should_refuse/should_refuse_sweep_public.csvis excluded by!**/*.csvresults/snapshots/2026-05-minimax3/eval/minimax_m3.csvis excluded by!**/*.csvresults/snapshots/2026-05/council/adjudicated.csvis excluded by!**/*.csv
📒 Files selected for processing (3)
CHANGELOG.mdREADME.mdbenchmark/config/sweep_models.json
| "jurisdiction": "asia", | ||
| "organization": "minimax", | ||
| "role": "v1.3_addition", | ||
| "routing_note": "Released 2026-05-31. OpenRouter ID: minimax/minimax-m3. Multimodal (text/image/video input), 1M context. Replaces M2.7 in the panel.", |
There was a problem hiding this comment.
Clarify the M3 routing note to avoid contradiction with active M2.7 entry.
routing_note says M3 “Replaces M2.7 in the panel,” but minimax/minimax-m2.7-20260318 is still present as role: "primary" (Lines 180-188). Please reword to “post-v1.1 addition compared against frozen M2.7” (or explicitly mark M2.7 deprecated if replacement is intended).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark/config/sweep_models.json` at line 177, The routing_note for
minimax/minimax-m3 currently says "Replaces M2.7 in the panel" which contradicts
the presence of minimax/minimax-m2.7-20260318 still marked role: "primary";
update the routing_note text to reflect that M3 is a post-v1.1 addition compared
against a frozen M2.7 (or explicitly mark M2.7 deprecated) so the note and the
model entries are consistent—modify the "routing_note" string for
minimax/minimax-m3 to read something like "post-v1.1 addition compared against
frozen M2.7" or change the minimax/minimax-m2.7-20260318 entry to indicate
deprecation if replacement is intended.
| | Model | Provider | Released | Tested | Council | Snapshot | Headline | | ||
| |---|---|---|---|---|---|---| | ||
| | **Claude Opus 4.8** \* | Anthropic | [2026-05-28](https://www.anthropic.com/news/claude-opus-4-8) | 2026-05-29 | **v1.3** (rotated) | post-v1.1 | PC Tier A (TPR 100 %); benign 57 %, dual-use 100 %, Youden's J **+0.43** | | ||
| | **MiniMax M3** \* | MiniMax | early Jun 2026 | 2026-06-03 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 80 %, between B-cap 73 % and A-floor 95 %); benign 21 %, dual-use 29 %, Youden's J **+0.59** | |
There was a problem hiding this comment.
Use a concrete M3 release date for consistency across docs.
This row says “early Jun 2026,” while config documents Released 2026-05-31. Using the exact date in both places avoids timeline ambiguity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 27, Update the table row for "MiniMax M3 *" to use the
concrete release date used in config documents: replace the string "early Jun
2026" with "2026-05-31" so the README's table entry for MiniMax M3 (the row
containing "MiniMax M3 * | MiniMax | early Jun 2026 | ...") matches the
configured "Released 2026-05-31" date across docs.
| | **MiniMax M3** \* | MiniMax | early Jun 2026 | 2026-06-03 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 80 %, between B-cap 73 % and A-floor 95 %); benign 21 %, dual-use 29 %, Youden's J **+0.59** | | ||
|
|
||
| The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. Opus 4.8 walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (Youden's J +0.23 → +0.43) while holding dual-use refusal at 100 %. | ||
| The v1.1-frozen panel (18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3 Super 120B, all under the v1.1 council) remains the canonical snapshot referenced in the manuscript. |
There was a problem hiding this comment.
Fix model-count arithmetic in snapshot description.
“18 frontier + Llama control + Nemotron” totals 20, which conflicts with the surrounding 19-model framing. Please correct either the count or the listed components.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 29, The snapshot description for "v1.1-frozen panel"
currently lists "18 frontier models + Llama 3.3 70B control + NVIDIA Nemotron 3
Super 120B" which sums to 20 but the surrounding text frames the panel as 19
models; reconcile this by either changing "18 frontier models" to "17 frontier
models" or by removing/adjusting one of the listed components so the total
equals 19, and update the phrase "18 frontier models + Llama 3.3 70B control +
NVIDIA Nemotron 3 Super 120B" accordingly to match the canonical 19-model count.
Summary
Adds MiniMax M3 as a second post-v1.1-frozen addition (after Claude Opus 4.8). Same rotated v1.3 council — no further judge changes since 2026-05-29.
results/snapshots/2026-05-minimax3/eval/minimax_m3.csv— 705 raw responses (clean; 8 retry-eligible API errors filtered)results/snapshots/2026-05/council/adjudicated.csv— +705 M3 rows (14,094 → 14,799); frozen rows untouchedresults/should_refuse/should_refuse_sweep_public.csv— +75 M3 PC rows (1,500 → 1,575); 21 distinct modelsbenchmark/config/sweep_models.json— registers M3README.md"Model updates" table — adds M3 row + brief comparisonCHANGELOG.md— new entryResult
M3 refuses more dangerous prompts (TPR 72 % → 80 %, moving out of Tier B into the gap zone) and more dual-use prompts (14 % → 29 %), but benign over-refusal tripled (6 % → 21 %). Net: Youden's J slips slightly (+0.66 → +0.59) — the dangerous-side gain didn't outpace the benign-side drift.
Test plan
adjudicated.csv= 14,799 rows; M3 = 705; v1.1-frozen 13,389 unchangedshould_refuse_sweep_public.csv= 1,575 rows; M3 = 75; 21 distinct modelsCo-authored with Claude Code.
Summary by CodeRabbit
Documentation
Chores