Add Nemotron 3 Ultra 550B (post-v1.1, rotated v1.3 council) #8
Appends Nemotron 3 Ultra 550B as a third post-frozen addition. Same rotated v1.3 council as Opus 4.8 and MiniMax M3 (no further judge changes).

- snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv: 705 raw responses, 0 errors
- snapshots/2026-05/council/adjudicated.csv: +705 Ultra rows (14,799 to 15,504)
- should_refuse_sweep_public.csv: +75 Ultra PC rows (1,575 to 1,650)
- sweep_models.json: registers Ultra
- README "Model updates" table: Ultra row + comparison vs Nemotron Super 120B
- CHANGELOG entry

Nemotron 3 Ultra: PC gap zone (TPR 93.3%, just below A-floor 95%), benign 0.4%, dual-use 18.3%, Youden's J +0.93. Improves on Nemotron 3 Super 120B across the board while keeping benign over-refusal near-zero. Lands at #2 by Youden's J behind Grok 4.20.

Methodology caveat: mean inter-judge agreement is 0.806 (vs ~0.96 for Opus 4.8 / M3 and 0.975 for v1.1 panel). Ultra's response style appears harder for the judges to classify consistently. Modal-label aggregation still yields valid labels; 3% of rows flagged for spot-check, 2% parse failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough
This PR adds NVIDIA's Nemotron 3 Ultra 550B model to the RefusalBench sweep configuration. The configuration is updated with a new model entry and version bump, while release notes and documentation are updated to reflect the new model's metrics and performance standing.
Changes
Nemotron 3 Ultra 550B Model Integration
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
🧹 Nitpick comments (1)
README.md (1)
28-28: 💤 Low value
Consider specifying exact release date for consistency.
The table shows "early Jun 2026" but sweep_models.json line 177 specifies "Released 2026-06-04". For consistency and precision, consider using the exact date here as well (following the pattern of Opus 4.8, which shows the full date).
📝 Proposed consistency improvement
-| **Nemotron 3 Ultra 550B** \* | NVIDIA | early Jun 2026 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |
+| **Nemotron 3 Ultra 550B** \* | NVIDIA | 2026-06-04 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 28, Update the date text in the README table row for "Nemotron 3 Ultra 550B" (the row containing "v1.3 (rotated)" and "PC gap zone") to use the exact release date "2026-06-04" instead of "early Jun 2026" so it matches the release entry in sweep_models.json (line referencing Nemotron 3 Ultra 550B).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9f339f58-496a-4ebd-9692-ea869a1bb922
⛔ Files ignored due to path filters (3)
- results/should_refuse/should_refuse_sweep_public.csv is excluded by !**/*.csv
- results/snapshots/2026-05/council/adjudicated.csv is excluded by !**/*.csv
- results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
- CHANGELOG.md
- README.md
- benchmark/config/sweep_models.json
Summary
Adds Nemotron 3 Ultra 550B as a third post-v1.1-frozen addition (after Claude Opus 4.8 and MiniMax M3). Same rotated v1.3 council — no further judge changes since 2026-05-29.
- results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv: 705 raw responses, zero errors
- results/snapshots/2026-05/council/adjudicated.csv: +705 Ultra rows (14,799 → 15,504); frozen 13,389 untouched
- results/should_refuse/should_refuse_sweep_public.csv: +75 Ultra PC rows (1,575 → 1,650); 22 distinct models
- benchmark/config/sweep_models.json: registers Ultra
- README.md "Model updates" table: adds Ultra row + comparison vs Nemotron Super
- CHANGELOG.md: new entry
Result: lands at #2 on the leaderboard
Ultra improves on Super across the board (TPR 89 → 93, dual-use 7 → 18, J +0.89 → +0.93) while keeping benign over-refusal near-zero. Top of leaderboard context: Grok 4.20 (+0.97), Nemotron Ultra (+0.93), Gemini 3.1 Pro (+0.92).
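For readers checking the headline number: the PR does not spell out how Youden's J is computed, but the quoted figures are consistent with the standard definition (TPR minus FPR) if the benign over-refusal rate is taken as the false-positive rate. A minimal sketch under that assumption; the function and variable names are illustrative, not taken from the repository:

```python
def youdens_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, i.e. TPR - FPR."""
    return tpr - fpr


# Figures quoted for Nemotron 3 Ultra in this PR. Treating the benign
# over-refusal rate as the false-positive rate is an assumption.
ultra_tpr = 0.933             # refusal rate on should-refuse prompts
ultra_benign_refusal = 0.004  # refusal rate on benign prompts

ultra_j = youdens_j(ultra_tpr, ultra_benign_refusal)
print(f"Youden's J: {ultra_j:+.2f}")
```

With these inputs the result rounds to +0.93, matching the reported value.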
Methodology note worth flagging
Mean inter-judge agreement is 0.806 for Ultra vs ~0.96 for Opus 4.8 / MiniMax M3 and 0.975 for the v1.1 panel. 3 % of rows flagged for spot-check, 2 % parse failures — Ultra's response style appears harder for the v1.3 council judges to classify consistently. Modal-label aggregation still yields valid labels (3-judge fractions 0.333/0.667/1.0 confirmed); not blocking, but documented in the dataset card and README so downstream analyses know.
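The 0.333/0.667/1.0 fractions mentioned above are what one would expect if each row's agreement is the share of the three judges that voted for the modal label. A rough sketch of that aggregation, assuming this is how the fractions arise; the label strings, the function name, and the tie-breaking rule are hypothetical rather than taken from the council code:

```python
from collections import Counter


def modal_label(judge_labels):
    """Return (modal label, fraction of judges that voted for it).

    Ties are broken by first occurrence, which is an arbitrary choice
    made for this sketch only.
    """
    counts = Counter(judge_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judge_labels)


# Hypothetical three-judge rows
unanimous = modal_label(["refuse", "refuse", "refuse"])
majority = modal_label(["refuse", "refuse", "comply"])
three_way_split = modal_label(["refuse", "comply", "partial"])
```

With three judges the fraction can only be 1/3, 2/3, or 1, so any other value in the agreement column would point to a missing or unparsed judge output.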
Test plan
- adjudicated.csv = 15,504 rows; Ultra = 705; v1.1-frozen 13,389 unchanged
- should_refuse_sweep_public.csv = 1,650 rows; Ultra = 75; 22 distinct models
Co-authored with Claude Code.
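The row counts in the test plan can be cross-checked with plain arithmetic. The sketch below uses only numbers quoted in this PR and assumes that every post-frozen model contributes 705 adjudicated rows and that every model contributes 75 PC rows; the PR states those per-model figures for Ultra only, so the other models are an inference from the totals:

```python
ROWS_PER_MODEL = 705     # Ultra's adjudicated rows (assumed equal for other additions)
PC_ROWS_PER_MODEL = 75   # Ultra's PC rows (assumed equal for every model)
FROZEN_ROWS = 13_389     # v1.1-frozen rows, stated as unchanged

adjudicated_after = 14_799 + ROWS_PER_MODEL
pc_rows_after = 1_575 + PC_ROWS_PER_MODEL

# Rows beyond the frozen set should split evenly across the post-frozen additions.
post_frozen_rows = adjudicated_after - FROZEN_ROWS
post_frozen_models, remainder = divmod(post_frozen_rows, ROWS_PER_MODEL)

distinct_models = pc_rows_after // PC_ROWS_PER_MODEL

print(adjudicated_after, pc_rows_after, post_frozen_models, distinct_models)
```

Under those assumptions the totals come out to 15,504 and 1,650 rows, three post-frozen additions, and 22 distinct models, which agrees with the figures in the summary.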