[codex] Add Nemotron-CLIMB data-curation recipe #2138

Open
lbliii wants to merge 4 commits into NVIDIA-NeMo:main from lbliii:codex/docs-nemotron-climb-recipe

Conversation

lbliii commented Jun 29, 2026

Contributor

Summary

  • add an operator-focused Nemotron-CLIMB recipe covering prerequisites, compute and storage planning, external model assets, quick validation, the full 64 → 32 → 16 → 1 search, and safe restart behavior
  • add a detailed stage reference for scripts 1–8 with exact commands, inputs, outputs, key controls, artifact contracts, completion checks, and Megatron tokenizer-output merging
  • add the recipe to Fern navigation and link it from the text-curation and text-tutorial landing pages

Why

The complete workflow currently lives only in the tutorial README and scripts, which makes it difficult to discover from the documentation site and leaves important production concerns—particularly expensive stages, intermediate artifacts, and partial-run recovery—implicit.

Validation

  • cd fern && npm run check — 0 errors; 103 existing removed-reference redirect warnings
  • bash -n for 6_train.sh, 7_evaluate.sh, and e2e.sh
  • parsed all seven tutorial Python files with ast.parse
  • validated every documented Python command flag against the current argparse definitions (including shared Ray flags)
  • directly exercised --help for 4_tokenize.py and 5_mixture.py; the remaining entry points require optional CUDA, FastText, or LightGBM packages not present in the local validation environment
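The `ast.parse` check listed above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used for this PR: the function name `find_syntax_errors` and the idea of scanning a whole directory are assumptions made for illustration.

```python
import ast
from pathlib import Path


def find_syntax_errors(directory: str) -> dict[str, str]:
    """Parse every *.py file in `directory`; return {path: error message}.

    Hypothetical helper: parsing only checks syntax, it never imports the
    module, so optional packages such as CUDA or FastText are not needed.
    """
    errors = {}
    for path in sorted(Path(directory).glob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

An empty result means every file parsed. Because nothing is imported or executed, this catches syntax errors only, which is why the `--help` runs are listed as a separate check.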

Closes #2119

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lbliii self-assigned this Jun 30, 2026
lbliii marked this pull request as ready for review July 2, 2026 14:53
lbliii requested a review from a team as a code owner July 2, 2026 14:53
lbliii requested review from meatybobby and removed request for a team July 2, 2026 14:53

greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This PR adds a full operator-focused documentation suite for the Nemotron-CLIMB data-curation recipe, making a workflow that previously lived only in the tutorial README discoverable from the docs site.

  • Navigation and landing pages: Promotes the flat Tutorials page to a nested section in main.yml and adds linking cards in both curate-text/index.mdx and tutorials/index.mdx.
  • Recipe overview (nemotron-climb/index.mdx): Covers prerequisites, storage planning, configuration variables, quick-validation vs. full-scale run profiles (64→32→16→1 search), and safe restart procedure with clear warnings about the round/directory pairing contract.
  • Stage reference (nemotron-climb/stages.mdx): Documents all 8 stages with exact CLI commands, input/output field contracts, key flags, and completion checks; notably clarifies that --embedding-dim 3072 is an intentional memory-safety overestimate relative to Stella's actual 1024-dimensional output.

Confidence Score: 5/5

Documentation-only change; no executable code paths are modified.

All five changed files are MDX documentation and YAML navigation config. The technical details in both new pages were validated against the actual argparse definitions and shell scripts per the PR description, and the content is internally consistent — stage inputs and outputs chain correctly, directory names match across examples, and the embedding-dim discrepancy is now clearly explained. No issues were found that would mislead readers or break the site build.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main.yml | Converts the flat Tutorials page entry into a nested section and adds Nemotron-CLIMB as a sub-section with Recipe and Stage Reference pages; navigation slugs look correct. |
| fern/versions/main/pages/curate-text/index.mdx | Adds a Tutorials card linking to /curate-text/tutorials with a description and tags; straightforward addition consistent with surrounding cards. |
| fern/versions/main/pages/curate-text/tutorials/index.mdx | Adds an End-to-End Recipes section with a card linking to the new Nemotron-CLIMB tutorial; no issues found. |
| fern/versions/main/pages/curate-text/tutorials/nemotron-climb/index.mdx | New recipe overview page covering prerequisites, storage planning, configuration, run profiles, and restart guidance; technically accurate with appropriate warnings about the compute scale. |
| fern/versions/main/pages/curate-text/tutorials/nemotron-climb/stages.mdx | New stage reference covering all 8 pipeline stages with commands, input/output contracts, key flags, and restart guidance; the embedding-dim clarification now correctly explains the 3072 vs 1024 discrepancy. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Source corpus\n(JSONL / Parquet)"] --> S1["Stage 1: Embed\n1_embed.py"]
    S1 -->|"computed_embeddings/"| S2["Stage 2: Cluster\n2_cluster.py"]
    S2 -->|"clusters/ + centroids/"| S3["Stage 3: Prune & Merge\n3_prune.py"]
    S3 -->|"pruned_clusters/"| S4["Stage 4: Tokenize\n4_tokenize.py"]
    S4 -->|"domains/ (.bin/.idx)"| S5["Stage 5: Sample Mixtures\n5_mixture.py"]
    S5 -->|"mixtures_N/ (n1..nK.sh)"| S6["Stage 6: Train Proxy\n6_train.sh x K"]
    S6 -->|"megatron_exp_N/"| S7["Stage 7: Evaluate\n7_evaluate.sh"]
    S7 -->|"lm_eval_results_N/"| S8["Stage 8: Fit Predictor\n8_predict.py"]
    S8 -->|"next round mixtures"| S6
    S8 -->|"optimal_mixture/n1.sh"| FULL["Full-scale Megatron-LM training"]
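
The loop in the flowchart (Stage 8 feeding new mixtures back into Stage 6) can be laid out as a per-round plan. The sketch below is one reading of the diagram, not code from the tutorial: the function `round_plan` is hypothetical, the directory names follow the edge labels, and the `/data/climb` root is taken from the command excerpt quoted in this PR. It assumes the last count in the schedule is the final selected mixture, so a 64 → 32 → 16 → 1 schedule has three training rounds.

```python
from pathlib import PurePosixPath


def round_plan(mixture_counts, root="/data/climb"):
    """Map a search schedule such as [64, 32, 16, 1] to per-round directories.

    Hypothetical helper for illustration only. Round N trains and evaluates
    the mixtures in mixtures_N, then the predictor writes the next round.
    """
    base = PurePosixPath(root)
    num_rounds = len(mixture_counts) - 1  # the final count is not trained as a round
    plan = []
    for idx in range(1, num_rounds + 1):
        is_last = idx == num_rounds
        next_dir = "optimal_mixture" if is_last else f"mixtures_{idx + 1}"
        plan.append({
            "round": idx,
            "num_mixtures": mixture_counts[idx - 1],
            "mixtures_dir": str(base / f"mixtures_{idx}"),
            "train_dir": str(base / f"megatron_exp_{idx}"),
            "eval_dir": str(base / f"lm_eval_results_{idx}"),
            "predictor_output": str(base / next_dir),
            "predictor_num_mixtures": mixture_counts[idx],
        })
    return plan


for entry in round_plan([64, 32, 16, 1]):
    print(entry["round"], entry["num_mixtures"], "->", entry["predictor_output"])
```

Keeping the round index and the three directory suffixes in one record makes the round/directory pairing explicit, which matters when a partial run is restarted.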

Reviews (4): Last reviewed commit: "docs: clarify CLIMB embedding flags"

--output-path /data/climb/mixtures_3 \
--metric valid_avg \
--num-mixtures 16
```

Contributor

P1 Incorrect embedding dimension for stella_en_400M_v5

The page documents --embedding-dim 3072 as "the conservative estimate for Stella embeddings" and tells readers to "Adjust it for the model's embedding width." However, NovaSearch/stella_en_400M_v5 produces 1024-dimensional embeddings by default. A reader following the advice to set this flag to the model's embedding width would arrive at 1024, not 3072.

If 3072 is deliberately set higher than the actual dimension to create more, smaller Parquet groups for cuDF memory safety, that intent should be stated explicitly. As written, the passage implies 3072 is the actual embedding width, which is misleading.
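
To make the hypothesis in this comment concrete, here is a rough arithmetic sketch of how an overestimated width would shrink each group. It is an illustration only: the helper names, the float32 (4 bytes per value) assumption, and the 1 GiB budget are all assumptions, and this is not the formula the tutorial scripts use.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32


def rows_per_group(embedding_dim: int, budget_bytes: int) -> int:
    """Rows that fit in one group if each row is budgeted at embedding_dim floats."""
    return budget_bytes // (embedding_dim * BYTES_PER_FLOAT32)


def num_groups(total_rows: int, embedding_dim: int, budget_bytes: int) -> int:
    """Number of groups needed to hold total_rows under the same budget."""
    return math.ceil(total_rows / rows_per_group(embedding_dim, budget_bytes))


budget = 2**30  # hypothetical 1 GiB per-group budget
print(rows_per_group(1024, budget), rows_per_group(3072, budget))
print(num_groups(1_000_000, 1024, budget), num_groups(1_000_000, 3072, budget))
```

Under these assumptions, budgeting each row at 3072 values instead of 1024 cuts the rows per group to roughly a third, so the same corpus is split into about three times as many groups.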


**Key configuration**

- A cluster survives only when its average is at least every corresponding `--pruning-threshold`.

Contributor

P2 Grammatically ambiguous — "is at least every threshold" reads as a comparison against an implicit zero rather than against each threshold value. The suggested phrasing makes the per-threshold comparison explicit.

Suggested change
- A cluster survives only when its average is at least every corresponding `--pruning-threshold`.
- A cluster survives only when its average score meets or exceeds every corresponding `--pruning-threshold`.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
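
The per-threshold comparison described in the suggested wording can be written as a single check. This is a minimal sketch that assumes each cluster carries one average score per threshold, in the same order; the function name `cluster_survives` is hypothetical and is not taken from `3_prune.py`.

```python
def cluster_survives(average_scores, pruning_thresholds):
    """Return True only if every average meets or exceeds its own threshold.

    Hypothetical helper: average_scores[i] is compared with
    pruning_thresholds[i], so the two sequences must have equal length.
    """
    if len(average_scores) != len(pruning_thresholds):
        raise ValueError("expected one threshold per average score")
    return all(avg >= thr for avg, thr in zip(average_scores, pruning_thresholds))


print(cluster_survives([3.0, 2.5], [3.0, 2.0]))
print(cluster_survives([3.0, 1.9], [3.0, 2.0]))
```

The first cluster passes because both averages reach their thresholds; the second is pruned because one average falls short, even though the other passes.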

Comment on lines +70 to +75
<Card title="Tutorials" href="/curate-text/tutorials">
Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization
recipes
hands-on
pretraining
</Card>

Contributor

P2 The card description ends without a period, unlike every other card description in this file. Inconsistent punctuation is a minor style issue, but fixing it keeps the page uniform.

Suggested change
<Card title="Tutorials" href="/curate-text/tutorials">
Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization
recipes
hands-on
pretraining
</Card>
<Card title="Tutorials" href="/curate-text/tutorials">
Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization.
recipes
hands-on
pretraining
</Card>

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

lbliii added 3 commits July 2, 2026 11:01
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Development

Successfully merging this pull request may close these issues.

[Docs] Add a Nemotron-CLIMB data-curation recipe

1 participant