Add Nemotron 3 Ultra 550B (post-v1.1, rotated v1.3 council) by VibeCodingScientist · Pull Request #8 · AppliedScientific/refusalbench

VibeCodingScientist · 2026-06-06T15:13:50Z

Summary

Adds Nemotron 3 Ultra 550B as a third post-v1.1-frozen addition (after Claude Opus 4.8 and MiniMax M3). Same rotated v1.3 council — no further judge changes since 2026-05-29.

results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv — 705 raw responses, zero errors
results/snapshots/2026-05/council/adjudicated.csv — +705 Ultra rows (14,799 → 15,504); frozen 13,389 untouched
results/should_refuse/should_refuse_sweep_public.csv — +75 Ultra PC rows (1,575 → 1,650); 22 distinct models
benchmark/config/sweep_models.json — registers Ultra
README.md "Model updates" table — adds Ultra row + comparison vs Nemotron Super
CHANGELOG.md — new entry

Result — lands at #2 on the leaderboard

Model	Benign	Borderline	Dual-use	Overall	PC TPR	Youden's J
Nemotron 3 Ultra 550B	0.4 %	1.3 %	18.3 %	6.7 %	93.3 % (gap zone)	+0.93
Nemotron 3 Super 120B (v1.1, ref)	0 %	1 %	7 %	3 %	89.3 % (gap zone)	+0.89

Ultra improves on Super across the board (PC TPR 89.3 % → 93.3 %, dual-use 7 % → 18.3 %, Youden's J +0.89 → +0.93) while keeping benign over-refusal near-zero (0 % → 0.4 %). Leaderboard context, by Youden's J: Grok 4.20 (+0.97), Nemotron Ultra (+0.93), Gemini 3.1 Pro (+0.92).
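The Youden's J values above are consistent with the standard definition J = TPR − FPR, if the PC TPR is read as the true-positive rate and the benign refusal rate as the false-positive rate. That mapping is an assumption inferred from the reported numbers, not something this PR states. A minimal sketch under that assumption:

```python
def youdens_j(tpr_pct: float, fpr_pct: float) -> float:
    """Youden's J = TPR - FPR, with both rates given in percent."""
    return (tpr_pct - fpr_pct) / 100.0


# Assumption: FPR is the benign refusal rate from the results table.
ultra_j = youdens_j(tpr_pct=93.3, fpr_pct=0.4)
super_j = youdens_j(tpr_pct=89.3, fpr_pct=0.0)

print(f"Nemotron 3 Ultra 550B: {ultra_j:+.2f}")
print(f"Nemotron 3 Super 120B: {super_j:+.2f}")
```

Rounded to two decimals, both results match the table.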

Methodology note worth flagging

Mean inter-judge agreement is 0.806 for Ultra vs ~0.96 for Opus 4.8 / MiniMax M3 and 0.975 for the v1.1 panel. 3 % of rows flagged for spot-check, 2 % parse failures — Ultra's response style appears harder for the v1.3 council judges to classify consistently. Modal-label aggregation still yields valid labels (3-judge fractions 0.333/0.667/1.0 confirmed); not blocking, but documented in the dataset card and README so downstream analyses know.
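For readers unfamiliar with modal-label aggregation, here is a minimal sketch of how a 3-judge council can produce the 0.333/0.667/1.0 fractions mentioned above. The label names and the tie-break rule are hypothetical, and the repository's actual aggregation code may differ; whether the reported 0.806 is the mean of exactly this per-row fraction is also not confirmed by the PR.

```python
from collections import Counter
from typing import Sequence, Tuple


def aggregate(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the modal label and the fraction of judges that chose it.

    Ties are broken in favour of the label seen first (an arbitrary,
    hypothetical choice made for this sketch).
    """
    counts = Counter(labels)
    modal_label, votes = counts.most_common(1)[0]
    return modal_label, votes / len(labels)


# Hypothetical label names, one list per adjudicated row.
rows = [
    ["refuse", "refuse", "refuse"],   # unanimous
    ["refuse", "comply", "refuse"],   # two of three agree
    ["refuse", "comply", "partial"],  # no majority
]

fractions = [aggregate(row)[1] for row in rows]
mean_agreement = sum(fractions) / len(fractions)
print(fractions, round(mean_agreement, 3))
```

With three judges the per-row fraction can only be 1/3, 2/3, or 1, so a lower mean indicates more split or three-way votes.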

Test plan

adjudicated.csv = 15,504 rows; Ultra = 705; v1.1-frozen 13,389 unchanged
should_refuse_sweep_public.csv = 1,650 rows; Ultra = 75; 22 distinct models
Eval CSV = 705 responses, 0 errors
HF Space + Dataset already updated to match
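The counts in this test plan can be cross-checked with simple arithmetic. The sketch below encodes the figures quoted in this PR; the last two checks assume that each post-frozen model contributes 705 adjudicated rows and that each of the 22 models contributes 75 PC rows, which the numbers are consistent with but the PR does not state. The `rows_per_model` helper and its `model` column name are hypothetical.

```python
import csv
import io
from collections import Counter

# Counts quoted in the PR description.
ADJUDICATED_BEFORE, ADJUDICATED_AFTER = 14_799, 15_504
FROZEN_ROWS = 13_389
PC_BEFORE, PC_AFTER = 1_575, 1_650
ULTRA_ROWS, ULTRA_PC_ROWS = 705, 75


def check_reported_counts() -> None:
    """Raise AssertionError if the reported totals are inconsistent."""
    assert ADJUDICATED_BEFORE + ULTRA_ROWS == ADJUDICATED_AFTER
    assert PC_BEFORE + ULTRA_PC_ROWS == PC_AFTER
    # Assumption: three post-frozen models, 705 rows each.
    assert ADJUDICATED_AFTER - FROZEN_ROWS == 3 * ULTRA_ROWS
    # Assumption: 22 models, 75 PC rows each.
    assert PC_AFTER == 22 * ULTRA_PC_ROWS


def rows_per_model(csv_text: str, model_column: str = "model") -> Counter:
    """Count rows per model; the column name is a hypothetical placeholder."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[model_column] for row in reader)


check_reported_counts()

# Tiny in-memory stand-in for a results CSV.
sample = "model,label\nultra,refuse\nultra,comply\nsuper,refuse\n"
counts = rows_per_model(sample)
print(counts)
```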

Co-authored with Claude Code.

Summary by CodeRabbit

New Features
- Integrated Nemotron 3 Ultra 550B model into the main benchmark sweep with performance metrics and leaderboard rankings.
Documentation
- Updated changelog with new release entry and expanded README with model performance highlights, including inter-judge agreement metrics and ranking details.

Appends Nemotron 3 Ultra 550B as a third post-frozen addition. Same rotated v1.3 council as Opus 4.8 and MiniMax M3 (no further judge changes).

- snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv: 705 raw responses, 0 errors
- snapshots/2026-05/council/adjudicated.csv: +705 Ultra rows (14,799 to 15,504)
- should_refuse_sweep_public.csv: +75 Ultra PC rows (1,575 to 1,650)
- sweep_models.json: registers Ultra
- README "Model updates" table: Ultra row + comparison vs Nemotron Super 120B
- CHANGELOG entry

Nemotron 3 Ultra: PC gap zone (TPR 93.3%, just below A-floor 95%), benign 0.4%, dual-use 18.3%, Youden's J +0.93. Improves on Nemotron 3 Super 120B across the board while keeping benign over-refusal near-zero. Lands at #2 by Youden's J behind Grok 4.20.

Methodology caveat: mean inter-judge agreement is 0.806 (vs ~0.96 for Opus 4.8 / M3 and 0.975 for v1.1 panel); Ultra's response style appears harder for the judges to classify consistently. Modal-label aggregation still yields valid labels; 3% rows flagged for spot-check, 2% parse failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-06T15:14:08Z

📝 Walkthrough

This PR adds NVIDIA's Nemotron 3 Ultra 550B model to the RefusalBench sweep configuration. The configuration is updated with a new model entry and version bump, while release notes and documentation are updated to reflect the new model's metrics and performance standing.

Changes

Nemotron 3 Ultra 550B Model Integration

Layer / File(s)	Summary
Sweep configuration and model metadata `benchmark/config/sweep_models.json`	Version is incremented from 1.7 to 1.8. A new OpenRouter-hosted model entry for `nvidia/nemotron-3-ultra-550b-a55b` is added with display name "Nemotron Ultra", role `v1.4_addition`, and pricing metadata ($0.5 input / $2.5 output per MTok).
Changelog and model updates documentation `CHANGELOG.md`, `README.md`	New [Unreleased] — 2026-06-06 section in CHANGELOG.md documents Nemotron's inclusion in the main sweep, updated PC gap zone metrics, and inter-judge agreement caveats. README model updates table adds the new model row with release/test dates and performance metrics. The v1.3 council footnote is revised to cover all three post-frozen models.
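To put the pricing metadata in context, here is a rough cost sketch using the per-MTok prices listed for the new entry ($0.5 input, $2.5 output). The function name and the token counts are hypothetical; the PR does not report actual token usage.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 0.5,   # from the sweep_models.json summary
    output_price_per_mtok: float = 2.5,  # from the sweep_models.json summary
) -> float:
    """Estimate API cost in USD from token counts and per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_mtok
        + output_tokens / 1_000_000 * output_price_per_mtok
    )


# Hypothetical usage: 705 prompts at ~200 input and ~600 output tokens each.
cost = estimate_cost_usd(input_tokens=705 * 200, output_tokens=705 * 600)
print(f"${cost:.2f}")
```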

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

AppliedScientific/refusalbench#7: Both PRs add a new model to the benchmark sweep configuration by extending benchmark/config/sweep_models.json and updating related documentation in CHANGELOG.md and README.md.
AppliedScientific/refusalbench#5: Both PRs update benchmark/config/sweep_models.json by incrementing the configuration version and extending the model roster.

Poem

🐰 A speedy model swift and strong,

Nemotron joins our benchmark throng,

Five-fifty billion dreams take flight,

Config and docs now set just right! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding Nemotron 3 Ultra 550B as a new model with context about its versioning position (post-v1.1 with rotated v1.3 council).
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

README.md (1)

28-28: 💤 Low value

Consider specifying exact release date for consistency.

The table shows "early Jun 2026" but sweep_models.json line 177 specifies "Released 2026-06-04". For consistency and precision, consider using the exact date here as well (following the pattern of Opus 4.8 which shows the full date).

📝 Proposed consistency improvement

-| **Nemotron 3 Ultra 550B** \* | NVIDIA | early Jun 2026 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |
+| **Nemotron 3 Ultra 550B** \* | NVIDIA | 2026-06-04 | 2026-06-06 | **v1.3** (rotated) | post-v1.1 | PC gap zone (TPR 93 %, just below A-floor 95 %); benign 0.4 %, dual-use 18 %, Youden's J **+0.93** — ranks `#2` on the leaderboard |

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 28, Update the date text in the README table row for
"Nemotron 3 Ultra 550B" (the row containing "v1.3 (rotated)" and "PC gap zone")
to use the exact release date "2026-06-04" instead of "early Jun 2026" so it
matches the release entry in sweep_models.json (line referencing Nemotron 3
Ultra 550B).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f339f58-496a-4ebd-9692-ea869a1bb922

📥 Commits

Reviewing files that changed from the base of the PR and between 7968472 and 059f366.

⛔ Files ignored due to path filters (3)

results/should_refuse/should_refuse_sweep_public.csv is excluded by !**/*.csv
results/snapshots/2026-05/council/adjudicated.csv is excluded by !**/*.csv
results/snapshots/2026-06-nemotron-ultra/eval/nemotron_ultra.csv is excluded by !**/*.csv

📒 Files selected for processing (3)

CHANGELOG.md
README.md
benchmark/config/sweep_models.json

coderabbitai Bot reviewed Jun 6, 2026

VibeCodingScientist merged commit 28fd0a3 into main Jun 6, 2026
4 checks passed

VibeCodingScientist deleted the add-nemotron-ultra branch June 6, 2026 15:23

VibeCodingScientist merged 1 commit into main from add-nemotron-ultra