docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) #544

dhruvnathawani wants to merge 4 commits into main
Conversation
Docs preview: https://7c3c9cf7.dd-docs-preview.pages.dev
Greptile Summary

This PR adds three Nemotron Nano SDG recipe pipelines (structured data generation, prompt sensitivity, InfiniByte) along with their documentation pages, recipe cards, and navigation entries in mkdocs.yml.
| Filename | Overview |
|---|---|
| docs/assets/recipes/model_usability/prompt_sensitivity.py | Seed-driven preamble generation pipeline; multiple regex patterns (fmt_00, fmt_05, fmt_08, fmt_09) contain known correctness bugs flagged in prior review threads that remain unaddressed. |
| docs/assets/recipes/code_generation/infinibyte.py | 5-stage cross-source problem augmentation pipeline; HF streaming, random cross-join, Pydantic structured outputs, and solution generation all look logically correct. |
| docs/assets/recipes/model_usability/structured_data.py | Multi-format schema generation pipeline with SubcategorySamplerParams, 5 LLM stages, and best-of-3 output; logic is sound. |
| docs/recipes/cards.md | Adds three new recipe cards for structured data, prompt sensitivity, and InfiniByte; links and icons look correct. |
| mkdocs.yml | Adds nav entries for the new Model Usability category and InfiniByte under Code Generation; structure is correct. |
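For orientation on the InfiniByte cross-join stage described above, here is a minimal standalone sketch. The dataset IDs, splits, and buffer/pair counts are illustrative assumptions, not the recipe's actual configuration:

```python
import random
from itertools import islice

from datasets import load_dataset

# Stream both sources so neither dataset is fully downloaded up front.
# Dataset IDs and splits below are illustrative placeholders.
code_stream = load_dataset("nvidia/OpenCodeReasoning", split="train", streaming=True)
math_stream = load_dataset("nvidia/OpenMathReasoning", split="train", streaming=True)

# Buffer a slice of each stream, then draw random cross-source pairs.
code_buf = list(islice(code_stream, 1_000))
math_buf = list(islice(math_stream, 1_000))

pairs = [(random.choice(code_buf), random.choice(math_buf)) for _ in range(500)]
```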
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph InfiniByte["InfiniByte Pipeline"]
A1[HF Dataset A\nOpenCodeReasoning] --> C1[Cross-Join\nRandom Sampling]
A2[HF Dataset B\nOpenMathReasoning] --> C1
C1 --> S1[Stage 1: Seed + Sampler\ncombination_type]
S1 --> S2[Stage 2: Candidate Generation\nLLMStructured → NewProblemList]
S2 --> S3[Stage 3: Best Selection\nLLMStructured → NewProblemWithReasoning]
S3 --> S3b[ExpressionColumn\nnew_problem]
S3b --> S4[Stage 4: Evaluation\nLLMStructured → NewProblemEvals]
S4 --> S5[Stage 5: Solution\nLLMText]
end
subgraph PromptSensitivity["Prompt Sensitivity Pipeline"]
P0[Seed: 10 formats x 30 preambles] --> P1[Stage 1: 7 Diversity Samplers]
P1 --> P2[Stage 2: Preamble Generation\nLLMText]
P2 --> P3[Stage 3: Format Instruction\nLLMText]
P3 --> P4[Stage 4: User Prompt Composition\nLLMText]
P4 --> P5A[Judge: Format Compliance\n0-2]
P4 --> P5B[Judge: Regex Alignment\n0-1]
P4 --> P5C[Judge: Order Coherence\n0-1]
P4 --> P5D[Judge: Preamble Quality\n0-3]
end
subgraph StructuredData["Structured Data Pipeline"]
D1[Stage 1: Sampling\nformat, topic, schema controls] --> D2[Stage 2: Schema Generation\nLLMText]
D2 --> D3[Stage 3: User Prompt\nLLMText]
D3 --> D4[Stage 4: Conversation Pairs\nLLMText]
D4 --> D5A[structured_output_0]
D4 --> D5B[structured_output_1]
D4 --> D5C[structured_output_2]
end
```
Reviews (3) · Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."
```python
    },
    {
        "format_key": "fmt_05",
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",
```
**Mismatched closing delimiter in `fmt_05` regex**

The regex opens with `\[` (escaped left square bracket) but closes with `\)` (escaped right parenthesis), so it matches `[Answer: X)` instead of `[Answer: X]`. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent: the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.
| "output_regex": r"\[Answer:\s*([A-Za-z])\)", | |
| "output_regex": r"\[Answer:\s*([A-Za-z])\]", |
```python
FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\boxed{([.*?])}",
```
**Incorrect regex for LaTeX `\boxed{}` format**

`r"\boxed{([.*?])}"` has two problems: `\b` in a raw string is the regex word-boundary anchor (not a literal backslash + `b`), and `[.*?]` is a character class matching only the three characters `.`, `*`, or `?`. The regex never matches the intended `\boxed{<answer>}` LaTeX output. The same issue appears in `fmt_09` (line 123). The correct pattern to match a literal `\boxed{…}` is:
| "output_regex": r"\boxed{([.*?])}", | |
| "output_regex": r"\\boxed\{(.*?)\}", |
```python
    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
```
**`[.*?]` character class captures only `.`, `*`, or `?`**

`([.*?])` is a capture group containing a character class that matches exactly one of the three literal characters `.`, `*`, `?`. It won't capture any real answer content inside `<final_answer>…</final_answer>`. The intended lazy-match wildcard should be outside the brackets:
| "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>", | |
| "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>", |
Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely documentation and recipe assets: 8 files changed (3 Python recipe scripts, 3 Markdown doc pages, the recipe cards page, and mkdocs.yml).

Findings

High: Regex patterns in `docs/assets/recipes/model_usability/prompt_sensitivity.py` (`fmt_00`, `fmt_05`, `fmt_08`, `fmt_09`) contain the correctness bugs detailed in the review threads above and remain unaddressed.
| # "data-designer", | ||
| # "datasets", | ||
| # "pandas", |
Should these recipes have version range pinning?
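If pinning is wanted, one option is PEP 723 inline script metadata with version ranges; the ranges below are illustrative guesses, not tested constraints:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer>=0.1,<1.0",  # illustrative ranges, not verified
#     "datasets>=2.19,<4.0",
#     "pandas>=2.0,<3.0",
# ]
# ///
```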
📋 Summary

Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.

🔄 Changes

Added the following:

- Structured data recipe using `SubcategorySamplerParams` for conditional topic sampling.
- Prompt sensitivity recipe with four LLM judges (format compliance, regex alignment, order coherence, preamble quality).
🔧 Changed
🧪 Testing
✅ Checklist