[codex] Add Nemotron-CLIMB data-curation recipe by lbliii · Pull Request #2138 · NVIDIA-NeMo/Curator

lbliii · 2026-06-29T21:39:18Z

Summary

- Add an operator-focused Nemotron-CLIMB recipe covering prerequisites, compute and storage planning, external model assets, quick validation, the full 64 → 32 → 16 → 1 search, and safe restart behavior.
- Add a detailed stage reference for scripts 1–8 with exact commands, inputs, outputs, key controls, artifact contracts, completion checks, and Megatron tokenizer-output merging.
- Add the recipe to Fern navigation and link it from the text-curation and text-tutorial landing pages.

Why

The complete workflow currently lives only in the tutorial README and scripts, which makes it difficult to discover from the documentation site and leaves important production concerns—particularly expensive stages, intermediate artifacts, and partial-run recovery—implicit.

Validation

- `cd fern && npm run check`: 0 errors; 103 existing removed-reference redirect warnings
- `bash -n` for `6_train.sh`, `7_evaluate.sh`, and `e2e.sh`
- Parsed all seven tutorial Python files with `ast.parse`
- Validated every documented Python command flag against the current argparse definitions (including shared Ray flags)
- Directly exercised `--help` for `4_tokenize.py` and `5_mixture.py`; the remaining entry points require optional CUDA, FastText, or LightGBM packages not present in the local validation environment
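
For readers who want to repeat the `ast.parse` step, the following is a minimal sketch of a syntax-only check. The function name `find_syntax_errors` and the demo file names are hypothetical and are not taken from the PR; only `ast.parse` itself is named in the validation notes.

```python
import ast
import tempfile
from pathlib import Path


def find_syntax_errors(paths):
    """Return {path: error message} for every file that ast.parse rejects."""
    failures = {}
    for path in paths:
        source = Path(path).read_text(encoding="utf-8")
        try:
            # Parsing only checks syntax; nothing is imported or executed,
            # so optional packages (CUDA, FastText, LightGBM) are not needed.
            ast.parse(source, filename=str(path))
        except SyntaxError as exc:
            failures[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return failures


if __name__ == "__main__":
    # Demo on throwaway files (hypothetical names, not the tutorial scripts).
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.py"
        bad = Path(tmp) / "bad.py"
        good.write_text("import argparse\nprint('ok')\n", encoding="utf-8")
        bad.write_text("def broken(:\n    pass\n", encoding="utf-8")
        print(find_syntax_errors([good, bad]))
```

Because the files are parsed rather than imported, this kind of check can run in an environment that lacks the optional dependencies, which is consistent with the limitation noted for the `--help` checks.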

Closes #2119

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-29T21:39:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T14:59:24Z

Greptile Summary

This PR adds a full operator-focused documentation suite for the Nemotron-CLIMB data-curation recipe, making a workflow that previously lived only in the tutorial README discoverable from the docs site.

- Navigation and landing pages: Promotes the flat Tutorials page to a nested section in main.yml and adds linking cards in both curate-text/index.mdx and tutorials/index.mdx.
- Recipe overview (nemotron-climb/index.mdx): Covers prerequisites, storage planning, configuration variables, quick-validation vs. full-scale run profiles (64→32→16→1 search), and safe restart procedure with clear warnings about the round/directory pairing contract.
- Stage reference (nemotron-climb/stages.mdx): Documents all 8 stages with exact CLI commands, input/output field contracts, key flags, and completion checks; notably clarifies that `--embedding-dim 3072` is an intentional memory-safety overestimate relative to Stella's actual 1024-dimensional output.
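
The round/directory pairing mentioned above can be pictured with a small sketch. The directory names `mixtures_N`, `megatron_exp_N`, and `lm_eval_results_N` come from the flowchart on this page, and `/data/climb` comes from the quoted diff. The mapping of round numbers to 64, 32, and 16 mixtures is an assumption inferred from the quoted `--output-path /data/climb/mixtures_3` with `--num-mixtures 16`; the helper `round_plan` is hypothetical and not part of the recipe.

```python
from pathlib import PurePosixPath


def round_plan(root, mixtures_per_round=(64, 32, 16)):
    """Pair each search round with the directories that share its number.

    Assumption: round N reads and writes mixtures_N, megatron_exp_N, and
    lm_eval_results_N, and the counts shrink 64 -> 32 -> 16 across rounds.
    """
    root = PurePosixPath(root)
    plan = []
    for round_idx, count in enumerate(mixtures_per_round, start=1):
        plan.append(
            {
                "round": round_idx,
                "num_mixtures": count,
                "mixtures": str(root / f"mixtures_{round_idx}"),
                "train": str(root / f"megatron_exp_{round_idx}"),
                "eval": str(root / f"lm_eval_results_{round_idx}"),
            }
        )
    return plan


if __name__ == "__main__":
    for entry in round_plan("/data/climb"):
        print(entry)
```

Keeping the round number as the single source for all three directory names is one way to avoid mixing artifacts from different rounds when a run is restarted.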

Confidence Score: 5/5

Documentation-only change; no executable code paths are modified.

All five changed files are MDX documentation and YAML navigation config. The technical details in both new pages were validated against the actual argparse definitions and shell scripts per the PR description, and the content is internally consistent — stage inputs and outputs chain correctly, directory names match across examples, and the embedding-dim discrepancy is now clearly explained. No issues were found that would mislead readers or break the site build.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main.yml | Converts the flat Tutorials page entry into a nested section and adds Nemotron-CLIMB as a sub-section with Recipe and Stage Reference pages; navigation slugs look correct. |
| fern/versions/main/pages/curate-text/index.mdx | Adds a Tutorials card linking to /curate-text/tutorials with a description and tags; straightforward addition consistent with surrounding cards. |
| fern/versions/main/pages/curate-text/tutorials/index.mdx | Adds an End-to-End Recipes section with a card linking to the new Nemotron-CLIMB tutorial; no issues found. |
| fern/versions/main/pages/curate-text/tutorials/nemotron-climb/index.mdx | New recipe overview page covering prerequisites, storage planning, configuration, run profiles, and restart guidance; technically accurate with appropriate warnings about the compute scale. |
| fern/versions/main/pages/curate-text/tutorials/nemotron-climb/stages.mdx | New stage reference covering all 8 pipeline stages with commands, input/output contracts, key flags, and restart guidance; the embedding-dim clarification now correctly explains the 3072 vs 1024 discrepancy. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Source corpus\n(JSONL / Parquet)"] --> S1["Stage 1: Embed\n1_embed.py"]
    S1 -->|"computed_embeddings/"| S2["Stage 2: Cluster\n2_cluster.py"]
    S2 -->|"clusters/ + centroids/"| S3["Stage 3: Prune & Merge\n3_prune.py"]
    S3 -->|"pruned_clusters/"| S4["Stage 4: Tokenize\n4_tokenize.py"]
    S4 -->|"domains/ (.bin/.idx)"| S5["Stage 5: Sample Mixtures\n5_mixture.py"]
    S5 -->|"mixtures_N/ (n1..nK.sh)"| S6["Stage 6: Train Proxy\n6_train.sh x K"]
    S6 -->|"megatron_exp_N/"| S7["Stage 7: Evaluate\n7_evaluate.sh"]
    S7 -->|"lm_eval_results_N/"| S8["Stage 8: Fit Predictor\n8_predict.py"]
    S8 -->|"next round mixtures"| S6
    S8 -->|"optimal_mixture/n1.sh"| FULL["Full-scale Megatron-LM training"]


Reviews (4): Last reviewed commit: "docs: clarify CLIMB embedding flags"

greptile-apps · 2026-07-02T14:59:28Z

+    --output-path /data/climb/mixtures_3 \
+    --metric valid_avg \
+    --num-mixtures 16
+```


Incorrect embedding dimension for stella_en_400M_v5

The page documents --embedding-dim 3072 as "the conservative estimate for Stella embeddings" and tells readers to "Adjust it for the model's embedding width." However, NovaSearch/stella_en_400M_v5 produces 1024-dimensional embeddings by default. A reader following the advice to set this flag to the model's embedding width would arrive at 1024, not 3072.

If 3072 is deliberately set higher than the actual dimension to create more, smaller Parquet groups for cuDF memory safety, that intent should be stated explicitly. As written, the passage implies 3072 is the actual embedding width, which is misleading.
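
To make the memory-safety argument concrete, here is a purely illustrative calculation. It assumes float32 values and a hypothetical per-group memory budget; the function `rows_per_group` is not Curator's partitioning logic, which this page does not show. It only illustrates why overstating the dimension yields smaller groups.

```python
def rows_per_group(embedding_dim, budget_bytes, bytes_per_value=4):
    """Rows that fit in a budget if each row holds one embedding vector.

    Hypothetical sizing rule for illustration only (float32 by default).
    """
    bytes_per_row = embedding_dim * bytes_per_value
    return budget_bytes // bytes_per_row


if __name__ == "__main__":
    budget = 1 << 30  # 1 GiB, an arbitrary example budget
    print(rows_per_group(3072, budget))  # 87381
    print(rows_per_group(1024, budget))  # 262144
```

Under this assumed rule, sizing with 3072 instead of the actual 1024 dimensions produces groups roughly one third the size, which leaves headroom at the cost of more groups.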

greptile-apps · 2026-07-02T14:59:29Z

+
+**Key configuration**
+
+- A cluster survives only when its average is at least every corresponding `--pruning-threshold`.


Grammatically ambiguous — "is at least every threshold" reads as a comparison against an implicit zero rather than against each threshold value. The suggested phrasing makes the per-threshold comparison explicit.

Suggested change

- A cluster survives only when its average is at least every corresponding `--pruning-threshold`.

- A cluster survives only when its average score meets or exceeds every corresponding `--pruning-threshold`.
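
The rule in the suggested wording can be sketched as follows. This assumes a cluster has one average score per scored dimension, matched by position to the `--pruning-threshold` values; the function name `cluster_survives` and the sample numbers are hypothetical and are not taken from `3_prune.py`.

```python
def cluster_survives(average_scores, thresholds):
    """Return True only if every average meets or exceeds its threshold."""
    if len(average_scores) != len(thresholds):
        raise ValueError("expected one threshold per average score")
    return all(avg >= thr for avg, thr in zip(average_scores, thresholds))


if __name__ == "__main__":
    thresholds = [3.0, 2.0]  # illustrative values only
    print(cluster_survives([3.0, 2.5], thresholds))  # True: both pass
    print(cluster_survives([2.9, 4.0], thresholds))  # False: first fails
```

A single average below its threshold is enough to prune the cluster, even when the other averages pass by a wide margin.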

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-07-02T14:59:30Z

+<Card title="Tutorials" href="/curate-text/tutorials">
+Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization
+recipes
+hands-on
+pretraining
+</Card>


The card description ends without a period, unlike every other card description in this file. Inconsistent punctuation is a minor style issue, but fixing it keeps the page uniform.

Suggested change

<Card title="Tutorials" href="/curate-text/tutorials">

Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization

recipes

hands-on

pretraining

</Card>

<Card title="Tutorials" href="/curate-text/tutorials">

Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization.

recipes

hands-on

pretraining

</Card>



docs: add Nemotron-CLIMB recipe

9e06e95

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Jun 30, 2026

lbliii mentioned this pull request Jun 30, 2026

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii marked this pull request as ready for review July 2, 2026 14:53

lbliii requested a review from a team as a code owner July 2, 2026 14:53

lbliii requested review from meatybobby and removed request for a team July 2, 2026 14:53

greptile-apps Bot reviewed Jul 2, 2026


lbliii added 3 commits July 2, 2026 11:01

docs: clarify CLIMB embedding sizing

0b74003

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'main' into codex/docs-nemotron-climb-recipe

f39253f

docs: clarify CLIMB embedding flags

d627672

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii wants to merge 4 commits into NVIDIA-NeMo:main from lbliii:codex/docs-nemotron-climb-recipe
