docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) #544

dhruvnathawani wants to merge 4 commits into main
Conversation
Docs preview: https://7c3c9cf7.dd-docs-preview.pages.dev
Greptile Summary

This PR adds three Nemotron Nano SDG recipe pipelines (structured data generation, prompt sensitivity, InfiniByte) along with their documentation pages, recipe cards, and navigation entries in mkdocs.yml.
| Filename | Overview |
|---|---|
| docs/assets/recipes/model_usability/prompt_sensitivity.py | Seed-driven preamble generation pipeline; multiple regex patterns (fmt_00, fmt_05, fmt_08, fmt_09) contain known correctness bugs flagged in prior review threads that remain unaddressed. |
| docs/assets/recipes/code_generation/infinibyte.py | 5-stage cross-source problem augmentation pipeline; HF streaming, random cross-join, Pydantic structured outputs, and solution generation all look logically correct. |
| docs/assets/recipes/model_usability/structured_data.py | Multi-format schema generation pipeline with SubcategorySamplerParams, 5 LLM stages, and best-of-3 output; logic is sound. |
| docs/recipes/cards.md | Adds three new recipe cards for structured data, prompt sensitivity, and InfiniByte; links and icons look correct. |
| mkdocs.yml | Adds nav entries for the new Model Usability category and InfiniByte under Code Generation; structure is correct. |
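For orientation on the InfiniByte cross-join stage described above, here is a minimal standalone sketch. The dataset IDs, splits, and buffer/pair counts are illustrative assumptions, not the recipe's actual configuration:

```python
import random
from itertools import islice

from datasets import load_dataset

# Stream both sources so neither dataset is fully downloaded up front.
# Dataset IDs and splits below are illustrative placeholders.
code_stream = load_dataset("nvidia/OpenCodeReasoning", split="train", streaming=True)
math_stream = load_dataset("nvidia/OpenMathReasoning", split="train", streaming=True)

# Buffer a slice of each stream, then draw random cross-source pairs.
code_buf = list(islice(code_stream, 1_000))
math_buf = list(islice(math_stream, 1_000))

pairs = [(random.choice(code_buf), random.choice(math_buf)) for _ in range(500)]
```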
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph InfiniByte["InfiniByte Pipeline"]
A1[HF Dataset A\nOpenCodeReasoning] --> C1[Cross-Join\nRandom Sampling]
A2[HF Dataset B\nOpenMathReasoning] --> C1
C1 --> S1[Stage 1: Seed + Sampler\ncombination_type]
S1 --> S2[Stage 2: Candidate Generation\nLLMStructured → NewProblemList]
S2 --> S3[Stage 3: Best Selection\nLLMStructured → NewProblemWithReasoning]
S3 --> S3b[ExpressionColumn\nnew_problem]
S3b --> S4[Stage 4: Evaluation\nLLMStructured → NewProblemEvals]
S4 --> S5[Stage 5: Solution\nLLMText]
end
subgraph PromptSensitivity["Prompt Sensitivity Pipeline"]
P0[Seed: 10 formats x 30 preambles] --> P1[Stage 1: 7 Diversity Samplers]
P1 --> P2[Stage 2: Preamble Generation\nLLMText]
P2 --> P3[Stage 3: Format Instruction\nLLMText]
P3 --> P4[Stage 4: User Prompt Composition\nLLMText]
P4 --> P5A[Judge: Format Compliance\n0-2]
P4 --> P5B[Judge: Regex Alignment\n0-1]
P4 --> P5C[Judge: Order Coherence\n0-1]
P4 --> P5D[Judge: Preamble Quality\n0-3]
end
subgraph StructuredData["Structured Data Pipeline"]
D1[Stage 1: Sampling\nformat, topic, schema controls] --> D2[Stage 2: Schema Generation\nLLMText]
D2 --> D3[Stage 3: User Prompt\nLLMText]
D3 --> D4[Stage 4: Conversation Pairs\nLLMText]
D4 --> D5A[structured_output_0]
D4 --> D5B[structured_output_1]
D4 --> D5C[structured_output_2]
end
```
Reviews (3) · Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."
```python
    },
    {
        "format_key": "fmt_05",
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",
```
**Mismatched closing delimiter in `fmt_05` regex**

The regex opens with `\[` (escaped left square bracket) but closes with `\)` (escaped right parenthesis), so it matches `[Answer: X)` instead of `[Answer: X]`. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent: the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.
| "output_regex": r"\[Answer:\s*([A-Za-z])\)", | |
| "output_regex": r"\[Answer:\s*([A-Za-z])\]", |
```python
FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\boxed{([.*?])}",
```
**Incorrect regex for LaTeX `\boxed{}` format**

`r"\boxed{([.*?])}"` has two problems: `\b` in a raw string is the regex word-boundary anchor (not a literal backslash + `b`), and `[.*?]` is a character class matching only the three characters `.`, `*`, or `?`. The regex never matches the intended `\boxed{<answer>}` LaTeX output. The same issue appears in `fmt_09` (line 123). The correct pattern to match a literal `\boxed{…}` is:
| "output_regex": r"\boxed{([.*?])}", | |
| "output_regex": r"\\boxed\{(.*?)\}", |
```python
    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
```
**`[.*?]` character class captures only `.`, `*`, or `?`**

`([.*?])` is a capture group containing a character class that matches exactly one of the three literal characters `.`, `*`, `?`. It won't capture any real answer content inside `<final_answer>…</final_answer>`. The intended lazy-match wildcard should be outside the brackets:
| "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>", | |
| "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>", |
Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely documentation and recipe assets: 8 files changed (3 Python recipe scripts, 3 Markdown doc pages, the recipe cards page, and mkdocs.yml).

Findings

High: Regex patterns in `docs/assets/recipes/model_usability/prompt_sensitivity.py` (`fmt_00`, `fmt_05`, `fmt_08`, `fmt_09`) contain the correctness bugs detailed in the review threads above and remain unaddressed.
| # "data-designer", | ||
| # "datasets", | ||
| # "pandas", |
Should these recipes have version range pinning?
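If pinning is wanted, one option is PEP 723 inline script metadata with version ranges; the ranges below are illustrative guesses, not tested constraints:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer>=0.1,<1.0",  # illustrative ranges, not verified
#     "datasets>=2.19,<4.0",
#     "pandas>=2.0,<3.0",
# ]
# ///
```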
📋 Summary

Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.

🔄 Changes

Added the following:

- Structured data recipe using `SubcategorySamplerParams` for conditional topic sampling.
- Prompt sensitivity recipe with four LLM judges (format compliance, regex alignment, order coherence, preamble quality).
🔧 Changed
🧪 Testing
✅ Checklist