Add Nemotron 3 Ultra 550B (post-v1.1, rotated v1.3 council) #8

Merged: VibeCodingScientist merged 1 commit into main from add-nemotron-ultra on Jun 6, 2026

@VibeCodingScientist (Collaborator) commented Jun 6, 2026

Summary

Adds Nemotron 3 Ultra 550B as a third post-v1.1-frozen addition (after Claude Opus 4.8 and MiniMax M3). Same rotated v1.3 council — no further judge changes since 2026-05-29.

  • results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv — 705 raw responses, zero errors
  • results/snapshots/2026-05/council/adjudicated.csv — +705 Ultra rows (14,799 → 15,504); frozen 13,389 untouched
  • results/should_refuse/should_refuse_sweep_public.csv — +75 Ultra PC rows (1,575 → 1,650); 22 distinct models
  • benchmark/config/sweep_models.json — registers Ultra
  • README.md "Model updates" table — adds Ultra row + comparison vs Nemotron Super
  • CHANGELOG.md — new entry

Result — lands at #2 on the leaderboard

| Model | Benign | Borderline | Dual-use | Overall | PC TPR | Youden's J |
|---|---|---|---|---|---|---|
| Nemotron 3 Ultra 550B | 0.4 % | 1.3 % | 18.3 % | 6.7 % | 93.3 % (gap zone) | +0.93 |
| Nemotron 3 Super 120B (v1.1, ref) | 0 % | 1 % | 7 % | 3 % | 89.3 % (gap zone) | +0.89 |

Ultra improves on Super across the board (TPR 89 % → 93 %, dual-use 7 % → 18 %, J +0.89 → +0.93) while keeping benign over-refusal near zero. For context, the top of the leaderboard by Youden's J: Grok 4.20 (+0.97), Nemotron Ultra (+0.93), Gemini 3.1 Pro (+0.92).
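
The J values quoted above can be reproduced with the standard definition of Youden's J (TPR minus FPR). This is a minimal sketch, not the repository's scoring code: it assumes the false-positive rate is the benign over-refusal rate, which the PR does not state explicitly, and the function name `youdens_j` is hypothetical.

```python
def youdens_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, i.e. TPR - FPR."""
    return tpr - fpr


# Rates taken from the results table, expressed as fractions.
# ASSUMPTION: FPR is the benign over-refusal rate.
ultra_j = youdens_j(tpr=0.933, fpr=0.004)
super_j = youdens_j(tpr=0.893, fpr=0.0)

print(f"Ultra: {ultra_j:+.2f}")  # Ultra: +0.93
print(f"Super: {super_j:+.2f}")  # Super: +0.89
```

Under that assumption both rows round to the published values, which suggests (but does not prove) that the leaderboard uses the same pairing of rates.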

Methodology note worth flagging

Mean inter-judge agreement is 0.806 for Ultra vs ~0.96 for Opus 4.8 / MiniMax M3 and 0.975 for the v1.1 panel. 3 % of rows flagged for spot-check, 2 % parse failures — Ultra's response style appears harder for the v1.3 council judges to classify consistently. Modal-label aggregation still yields valid labels (3-judge fractions 0.333/0.667/1.0 confirmed); not blocking, but documented in the dataset card and README so downstream analyses know.
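
To illustrate why a three-judge council can only produce agreement fractions of 0.333, 0.667, or 1.0, here is a minimal sketch of modal-label aggregation. It is not the council's actual implementation: the label names, the `modal_label` helper, the tie-breaking rule (first label seen wins), and the way mean agreement is computed are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def modal_label(judge_labels):
    """Return (modal label, fraction of judges that voted for it).

    ASSUMPTION: on a full three-way split the first label encountered wins;
    a real pipeline would more likely flag such rows for spot-check.
    """
    counts = Counter(judge_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judge_labels)


# Hypothetical judge outputs for three responses (labels are illustrative).
rows = [
    ("refuse", "refuse", "refuse"),   # unanimous
    ("refuse", "comply", "refuse"),   # 2-of-3 majority
    ("refuse", "comply", "partial"),  # three-way split
]

results = [modal_label(r) for r in rows]
fractions = [round(frac, 3) for _, frac in results]
print(fractions)  # [1.0, 0.667, 0.333]

# One plausible reading of "mean inter-judge agreement": the mean modal fraction.
mean_agreement = mean(frac for _, frac in results)
print(round(mean_agreement, 3))  # 0.667
```

A lower mean, as reported for Ultra, would then correspond to more 2-of-3 and split rows, while each row still receives a single modal label.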

Test plan

  • adjudicated.csv = 15,504 rows; Ultra = 705; v1.1-frozen 13,389 unchanged
  • should_refuse_sweep_public.csv = 1,650 rows; Ultra = 75; 22 distinct models
  • Eval CSV = 705 responses, 0 errors
  • HF Space + Dataset already updated to match
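
The row counts in the test plan can be cross-checked with a short script. This is a sketch under stated assumptions: each CSV is assumed to have exactly one header row, and the helper names `count_rows` and `check_snapshot` are hypothetical rather than part of the repository.

```python
import csv
from pathlib import Path

# Expected data-row counts, as quoted in the PR description.
EXPECTED_ROWS = {
    "results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv": 705,
    "results/snapshots/2026-05/council/adjudicated.csv": 15_504,
    "results/should_refuse/should_refuse_sweep_public.csv": 1_650,
}


def count_rows(path):
    """Count data rows in a CSV, skipping one header row (an assumption)."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)  # handles quoted fields with embedded newlines
        next(reader, None)       # skip the header
        return sum(1 for _ in reader)


def check_snapshot(repo_root):
    """Return {relative_path: (actual, expected)} for every mismatching file."""
    mismatches = {}
    for rel_path, expected in EXPECTED_ROWS.items():
        actual = count_rows(Path(repo_root) / rel_path)
        if actual != expected:
            mismatches[rel_path] = (actual, expected)
    return mismatches


# The deltas quoted in the PR are internally consistent:
assert 14_799 + 705 == 15_504       # adjudicated.csv before + Ultra rows
assert 1_575 + 75 == 1_650          # should-refuse sweep before + Ultra PC rows
assert 15_504 - 13_389 == 3 * 705   # three post-frozen models at 705 rows each
assert 1_650 == 22 * 75             # 22 models at 75 PC rows each
```

Calling `check_snapshot(".")` from a checkout of the repository would return an empty dict if the files match the figures above.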

Co-authored with Claude Code.

Summary by CodeRabbit

  • New Features

    • Integrated Nemotron 3 Ultra 550B model into the main benchmark sweep with performance metrics and leaderboard rankings.
  • Documentation

    • Updated changelog with new release entry and expanded README with model performance highlights, including inter-judge agreement metrics and ranking details.

Appends Nemotron 3 Ultra 550B as a third post-frozen addition. Same
rotated v1.3 council as Opus 4.8 and MiniMax M3 (no further judge changes).

- snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv: 705 raw responses, 0 errors
- snapshots/2026-05/council/adjudicated.csv: +705 Ultra rows (14,799 to 15,504)
- should_refuse_sweep_public.csv: +75 Ultra PC rows (1,575 to 1,650)
- sweep_models.json: registers Ultra
- README "Model updates" table: Ultra row + comparison vs Nemotron Super 120B
- CHANGELOG entry

Nemotron 3 Ultra: PC gap zone (TPR 93.3%, just below A-floor 95%),
benign 0.4%, dual-use 18.3%, Youden's J +0.93. Improves on Nemotron 3
Super 120B across the board while keeping benign over-refusal near-zero.
Lands at #2 by Youden's J behind Grok 4.20.

Methodology caveat: mean inter-judge agreement is 0.806 (vs ~0.96 for
Opus 4.8 / M3 and 0.975 for v1.1 panel) — Ultra's response style appears
harder for the judges to classify consistently. Modal-label aggregation
still yields valid labels; 3% rows flagged for spot-check, 2% parse failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Jun 6, 2026

Walkthrough

This PR adds NVIDIA's Nemotron 3 Ultra 550B model to the RefusalBench sweep configuration. The configuration is updated with a new model entry and version bump, while release notes and documentation are updated to reflect the new model's metrics and performance standing.

Changes

Nemotron 3 Ultra 550B Model Integration

| Layer / File(s) | Summary |
|---|---|
| Sweep configuration and model metadata<br>`benchmark/config/sweep_models.json` | Version is incremented from 1.7 to 1.8. A new OpenRouter-hosted model entry for `nvidia/nemotron-3-ultra-550b-a55b` is added with display name "Nemotron Ultra", role `v1.4_addition`, and pricing metadata ($0.5 input / $2.5 output per MTok). |
| Changelog and model updates documentation<br>`CHANGELOG.md`, `README.md` | New `[Unreleased] — 2026-06-06` section in CHANGELOG.md documents Nemotron's inclusion in the main sweep, updated PC gap zone metrics, and inter-judge agreement caveats. README model updates table adds the new model row with release/test dates and performance metrics. The v1.3 council footnote is revised to cover all three post-frozen models. |
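
As a rough illustration of what the pricing metadata implies, the following sketch estimates the API cost of a run at $0.5 input / $2.5 output per million tokens. The per-prompt token counts are invented for the example and the function name is hypothetical; only the two prices and the 705-response count come from this PR.

```python
# Prices from the sweep_models.json entry described above (USD per million tokens).
INPUT_USD_PER_MTOK = 0.5
OUTPUT_USD_PER_MTOK = 2.5


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD from total input and output token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# ASSUMPTION: 200 input and 400 output tokens per prompt (illustrative only).
n_prompts = 705
cost = estimate_cost_usd(n_prompts * 200, n_prompts * 400)
print(f"${cost:.4f}")  # $0.7755
```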

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • AppliedScientific/refusalbench#7: Both PRs add a new model to the benchmark sweep configuration by extending benchmark/config/sweep_models.json and updating related documentation in CHANGELOG.md and README.md.
  • AppliedScientific/refusalbench#5: Both PRs update benchmark/config/sweep_models.json by incrementing the configuration version and extending the model roster.

Poem

🐰 A speedy model swift and strong,

Nemotron joins our benchmark throng,

Five-fifty billion dreams take flight,

Config and docs now set just right! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding Nemotron 3 Ultra 550B as a new model with context about its versioning position (post-v1.1 with rotated v1.3 council). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
README.md (1)

28-28: 💤 Low value

Consider specifying exact release date for consistency.

The table shows "early Jun 2026" but sweep_models.json line 177 specifies "Released 2026-06-04". For consistency and precision, consider using the exact date here as well (following the pattern of Opus 4.8 which shows the full date).

📝 Proposed consistency improvement
-| **Nemotron 3 Ultra 550B** \* | NVIDIA | early Jun 2026 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |
+| **Nemotron 3 Ultra 550B** \* | NVIDIA | 2026-06-04 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 28, Update the date text in the README table row for
"Nemotron 3 Ultra 550B" (the row containing "v1.3 (rotated)" and "PC gap zone")
to use the exact release date "2026-06-04" instead of "early Jun 2026" so it
matches the release entry in sweep_models.json (line referencing Nemotron 3
Ultra 550B).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f339f58-496a-4ebd-9692-ea869a1bb922

📥 Commits

Reviewing files that changed from the base of the PR and between 7968472 and 059f366.

⛔ Files ignored due to path filters (3)
  • results/should_refuse/should_refuse_sweep_public.csv is excluded by !**/*.csv
  • results/snapshots/2026-05/council/adjudicated.csv is excluded by !**/*.csv
  • results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • CHANGELOG.md
  • README.md
  • benchmark/config/sweep_models.json

@VibeCodingScientist VibeCodingScientist merged commit 28fd0a3 into main Jun 6, 2026
4 checks passed
@VibeCodingScientist VibeCodingScientist deleted the add-nemotron-ultra branch June 6, 2026 15:23