[codex] Add Nemotron-CLIMB data-curation recipe #2138
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Greptile Summary
This PR adds a full operator-focused documentation suite for the Nemotron-CLIMB data-curation recipe, making a workflow that previously lived only in the tutorial README discoverable from the docs site.
Confidence Score: 5/5
Documentation-only change; no executable code paths are modified. All five changed files are MDX documentation and YAML navigation config. The technical details in both new pages were validated against the actual argparse definitions and shell scripts per the PR description, and the content is internally consistent: stage inputs and outputs chain correctly, directory names match across examples, and the embedding-dim discrepancy is now clearly explained. No issues were found that would mislead readers or break the site build. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Source corpus\n(JSONL / Parquet)"] --> S1["Stage 1: Embed\n1_embed.py"]
S1 -->|"computed_embeddings/"| S2["Stage 2: Cluster\n2_cluster.py"]
S2 -->|"clusters/ + centroids/"| S3["Stage 3: Prune & Merge\n3_prune.py"]
S3 -->|"pruned_clusters/"| S4["Stage 4: Tokenize\n4_tokenize.py"]
S4 -->|"domains/ (.bin/.idx)"| S5["Stage 5: Sample Mixtures\n5_mixture.py"]
S5 -->|"mixtures_N/ (n1..nK.sh)"| S6["Stage 6: Train Proxy\n6_train.sh x K"]
S6 -->|"megatron_exp_N/"| S7["Stage 7: Evaluate\n7_evaluate.sh"]
S7 -->|"lm_eval_results_N/"| S8["Stage 8: Fit Predictor\n8_predict.py"]
S8 -->|"next round mixtures"| S6
S8 -->|"optimal_mixture/n1.sh"| FULL["Full-scale Megatron-LM training"]
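The hand-offs in the flowchart can also be written down as data, which makes it easy to check that each stage consumes what the previous one produces and to find where a partial run should resume. The following is a minimal sketch, not part of the recipe: the artifact labels are shortened from the flowchart edges, and `broken_links` and `first_stage_to_rerun` are hypothetical helper names. The loop from Stage 8 back to Stage 6 is left out for simplicity.

```python
# (script, input artifact, output artifact), shortened from the flowchart edges.
STAGES = [
    ("1_embed.py", "source corpus", "computed_embeddings/"),
    ("2_cluster.py", "computed_embeddings/", "clusters/ + centroids/"),
    ("3_prune.py", "clusters/ + centroids/", "pruned_clusters/"),
    ("4_tokenize.py", "pruned_clusters/", "domains/"),
    ("5_mixture.py", "domains/", "mixtures_N/"),
    ("6_train.sh", "mixtures_N/", "megatron_exp_N/"),
    ("7_evaluate.sh", "megatron_exp_N/", "lm_eval_results_N/"),
    ("8_predict.py", "lm_eval_results_N/", "optimal_mixture/n1.sh"),
]


def broken_links(stages):
    """Return (producer, consumer) pairs whose artifacts do not line up."""
    return [
        (prev[0], cur[0])
        for prev, cur in zip(stages, stages[1:])
        if prev[2] != cur[1]
    ]


def first_stage_to_rerun(stages, existing_outputs):
    """Return the first script whose output is missing, or None if all exist."""
    for script, _input, output in stages:
        if output not in existing_outputs:
            return script
    return None
```

For example, if only `computed_embeddings/` and `clusters/ + centroids/` exist on disk, `first_stage_to_rerun` points at `3_prune.py`.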
Reviews (4): Last reviewed commit: "docs: clarify CLIMB embedding flags"
| --output-path /data/climb/mixtures_3 \ | ||
| --metric valid_avg \ | ||
| --num-mixtures 16 | ||
| ``` |
Incorrect embedding dimension for stella_en_400M_v5
The page documents --embedding-dim 3072 as "the conservative estimate for Stella embeddings" and tells readers to "Adjust it for the model's embedding width." However, NovaSearch/stella_en_400M_v5 produces 1024-dimensional embeddings by default. A reader following the advice to set this flag to the model's embedding width would arrive at 1024, not 3072.
If 3072 is deliberately set higher than the actual dimension to create more, smaller Parquet groups for cuDF memory safety, that intent should be stated explicitly. As written, the passage implies 3072 is the actual embedding width, which is misleading.
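To make the reviewer's hypothesis concrete, here is a rough sketch of how an overestimated dimension would lead to more, smaller groups. It is an illustration under stated assumptions, not the recipe's actual partitioning logic: it assumes float32 values (4 bytes each), a hypothetical per-group memory budget, and the hypothetical helper names `rows_per_group` and `num_groups`.

```python
def rows_per_group(embedding_dim, budget_bytes, bytes_per_value=4):
    """Rows that fit in one group if each row holds `embedding_dim` values."""
    return max(1, budget_bytes // (embedding_dim * bytes_per_value))


def num_groups(total_rows, embedding_dim, budget_bytes):
    """Number of groups needed to hold `total_rows` rows (ceiling division)."""
    per_group = rows_per_group(embedding_dim, budget_bytes)
    return -(-total_rows // per_group)


BUDGET = 2**30        # hypothetical 1 GiB budget per group
ROWS = 1_000_000      # hypothetical corpus size

groups_actual = num_groups(ROWS, 1024, BUDGET)        # real Stella width
groups_conservative = num_groups(ROWS, 3072, BUDGET)  # documented flag value
```

Under these assumptions the 3072 estimate yields about three times as many groups as 1024, each roughly a third of the size, which is the memory-safety effect the comment asks the docs to state explicitly.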
**Key configuration**
- A cluster survives only when its average is at least every corresponding `--pruning-threshold`.
Grammatically ambiguous — "is at least every threshold" reads as a comparison against an implicit zero rather than against each threshold value. The suggested phrasing makes the per-threshold comparison explicit.
Suggested change:
- A cluster survives only when its average score meets or exceeds every corresponding `--pruning-threshold`.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
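The per-threshold reading of the rule can be sketched in a few lines. This is a hedged illustration only: it assumes each cluster has one average score per quality dimension, paired positionally with one `--pruning-threshold` value, and `cluster_survives` is a hypothetical name rather than a function from `3_prune.py`.

```python
def cluster_survives(average_scores, thresholds):
    """True only if every average meets or exceeds its paired threshold."""
    if len(average_scores) != len(thresholds):
        raise ValueError("expected one threshold per average score")
    return all(score >= limit for score, limit in zip(average_scores, thresholds))


# Every score clears its threshold (equality counts), so the cluster is kept.
kept = cluster_survives([0.6, 0.8], [0.5, 0.8])
# The second score falls short of its threshold, so the cluster is pruned.
pruned = not cluster_survives([0.6, 0.79], [0.5, 0.8])
```

A single failing dimension is enough to prune the cluster, which is the point the suggested wording makes explicit.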
<Card title="Tutorials" href="/curate-text/tutorials">
Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization
recipes
hands-on
pretraining
</Card>
The card description ends without a period, unlike every other card description in this file. Inconsistent punctuation is a minor style issue, but fixing it keeps the page uniform.
Suggested change:
<Card title="Tutorials" href="/curate-text/tutorials">
Run end-to-end text curation recipes, including Nemotron-CLIMB data-mixture optimization.
recipes
hands-on
pretraining
</Card>
Summary
Why
The complete workflow currently lives only in the tutorial README and scripts, which makes it difficult to discover from the documentation site and leaves important production concerns—particularly expensive stages, intermediate artifacts, and partial-run recovery—implicit.
Validation
- `cd fern && npm run check`: 0 errors; 103 existing removed-reference redirect warnings
- `bash -n` for `6_train.sh`, `7_evaluate.sh`, and `e2e.sh`
- `ast.parse` for argparse definitions (including shared Ray flags)
- `--help` for `4_tokenize.py` and `5_mixture.py`; the remaining entry points require optional CUDA, FastText, or LightGBM packages not present in the local validation environment

Closes #2119
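One way to run the `ast.parse` check on argparse definitions without importing a script (and so without needing CUDA, FastText, or LightGBM installed) is to walk the syntax tree and collect the flag strings passed to `add_argument`. The sketch below is an assumption about how such a check might look, not the script used for this PR; `argparse_flags` is a hypothetical helper, and `EXAMPLE` is an invented source snippet that reuses flag names from the documented command, with made-up types and defaults.

```python
import ast


def argparse_flags(source):
    """Collect string literals passed positionally to any `.add_argument(...)` call."""
    flags = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "add_argument"
        ):
            flags.extend(
                arg.value
                for arg in node.args
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str)
            )
    return flags


# Hypothetical script source; only parsed, never executed.
EXAMPLE = '''
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--output-path", type=str)
parser.add_argument("--metric", default="valid_avg")
parser.add_argument("--num-mixtures", type=int, default=16)
'''

found = sorted(argparse_flags(EXAMPLE))
```

The collected flags can then be compared against the flags named in the docs pages to catch drift between the two.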