[codex] docs: add Nemotron OCR pipeline guide by lbliii · Pull Request #2137 · NVIDIA-NeMo/Curator

lbliii · 2026-06-29T21:26:09Z

Summary

add a Nemotron OCR concept page covering architecture, installation, input sources, schemas, stage defaults, verifier behavior, and failure/restart semantics
add a runnable tutorial for OCR-only and optional NVIDIA Inference API scoring/QA paths
document the six user-visible QA interaction shapes, four internal QA tags, and all 11 dense-answer formats
provide CLI and Python examples, complete argument reference, result schemas, batching behavior, secrets handling, and shard lifecycle guidance
add navigation and cross-links from synthetic data, image curation, image tutorials, and LLM client documentation

Current API boundary

Issue #2121 names several APIs that were not present in the final merged #1899 implementation. These docs use the actual current names and behavior:

OCRNemotronV2Stage, not NemotronOCRV2Stage
HFDatasetImageReaderStage and JsonlSampleWriterStage, not HFDatasetImageReader / ResultWriterStage
NVInferenceClient, not NVInferenceModel
OCRDenseItem, not OCRDenseWord
conversation and dense-QA helper functions called by OCRScoringQAStage; there are no standalone OCRConversationalizeStage or OCRDenseQAStage classes
no OCR-specific TarImageReader, ParquetReader, or SkipProcessedStage

The pages call this out directly and document custom ImageSampleTask[OCRData] input as the extension point instead of publishing non-runnable configuration.
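As a reading aid, the renames listed above can be written down as a small lookup table. This is only a sketch built from the names in this description: `RENAMED`, `NOT_IMPLEMENTED`, and `resolve_name` are hypothetical helpers for illustration and are not part of NeMo Curator.

```python
# Names used in issue #2121, mapped to the names this PR description says
# exist in the merged implementation. Illustration only.
RENAMED = {
    "NemotronOCRV2Stage": "OCRNemotronV2Stage",
    "HFDatasetImageReader": "HFDatasetImageReaderStage",
    "ResultWriterStage": "JsonlSampleWriterStage",
    "NVInferenceModel": "NVInferenceClient",
    "OCRDenseWord": "OCRDenseItem",
}

# Names the description says have no standalone or OCR-specific counterpart.
NOT_IMPLEMENTED = {
    "OCRConversationalizeStage",
    "OCRDenseQAStage",
    "TarImageReader",
    "ParquetReader",
    "SkipProcessedStage",
}


def resolve_name(name: str) -> str:
    """Return the current name for an issue-era name, or raise if none exists."""
    if name in NOT_IMPLEMENTED:
        raise LookupError(f"{name} is not available in the merged implementation")
    # Names that were never renamed pass through unchanged.
    return RENAMED.get(name, name)
```

A table like this can help when scanning older notes or issue text for names that need updating before they are copied into a pipeline script.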

Validation

fern check — 0 errors (103 existing warnings)
fern docs broken-links — no errors in changed pages; 22 pre-existing API-reference errors elsewhere
documented scoring CLI arguments build the expected four-stage pipeline
focused OCR, task, conversation, model-stage, and NVIDIA client tests — 120 passed, 1 live GPU model-load test deselected
python -m py_compile for the tutorial and all documented source modules
git diff --cached --check

The repository's Linux-only import guard was bypassed for local CPU unit validation on macOS. The live NemotronOCR-v2 model load requires the documented container and GPU and was not run locally.

Closes #2121

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-29T21:26:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T14:59:30Z

Greptile Summary

This PR adds a comprehensive Nemotron OCR documentation set to the NeMo Curator docs: a concept page covering architecture, schemas, verifier behavior, QA shapes, and failure semantics, plus a runnable tutorial covering OCR-only and scoring+QA paths. Navigation entries and cross-links from image-curation, synthetic-data, and LLM-client pages are also added.

Concept page (nemotron-ocr/index.mdx) documents the four-stage pipeline architecture, all API names (correcting earlier development names), OCRData/OCRConversationData schemas, bounding-box coordinate system caveat, verifier defaults, six QA interaction shapes, eleven dense-answer formats, and the conversation wrapper schema.
Tutorial page (nemotron-ocr/tutorial.mdx) provides step-by-step CLI and Python examples, a full CLI reference table, JsonlSampleWriterStage default-vs-CLI discrepancy note, and rerun/restart guidance with the merge_output_shards call placed in a finally block.
Navigation and cards (main.yml, four existing index pages) wire the new section into the sidebar and add discovery cards consistent with the existing pattern.

Confidence Score: 5/5

Documentation-only change with no runtime code modifications; safe to merge.

All changes are Fern MDX documentation and navigation YAML. The concept page and tutorial are internally consistent with each other and with the described implementation. All three concerns raised in prior review threads have been addressed: merge_output_shards now appears inside the finally block, the JsonlSampleWriterStage default-vs-CLI discrepancy is explicitly called out, and the conversation schema correctly documents the nested wrapper. The {{ container_version }} template placeholder follows the same pattern used across at least seven pre-existing docs pages. No code is executed and no APIs are changed.

No files require special attention.

Important Files Changed

Filename	Overview
fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx	New concept page; internally consistent schema, coordinate, and conversation-wrapper documentation; previous review thread concerns about the nested conversation shape have been addressed.
fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx	New tutorial; merge_output_shards is now inside the finally block, valid_only discrepancy is explicitly documented, and CLI/Python examples are consistent with the concept page.
fern/versions/main.yml	Adds Nemotron OCR section between Multilingual Q&A and Nemotron-CC with correct slug and paths; no issues found.
fern/versions/main/pages/curate-images/index.mdx	Adds cross-link card following the existing tag pattern; no issues.
fern/versions/main/pages/curate-images/tutorials/index.mdx	Adds tutorial discovery card consistent with neighboring cards; no issues.
fern/versions/main/pages/curate-text/synthetic/index.mdx	Adds Nemotron OCR card to the synthetic data landing page; consistent with existing cards.
fern/versions/main/pages/curate-text/synthetic/llm-client.mdx	Adds a next-steps link to the Nemotron OCR page; the description accurately reflects NVInferenceClient's streaming use for multimodal tasks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart LR
    A["HFDatasetImageReaderStage\n(Driver · 1 CPU)"] --> B["OCRNemotronV2Stage\n(Worker · 8 CPU + 1 GPU · batch 32)"]
    B --> C{"--run-scoring-qa?"}
    C -->|No| D["JsonlSampleWriterStage\n(Worker · 2 CPU)"]
    C -->|Yes| E["OCRScoringQAStage\n(Worker · 1 CPU + NVIDIA API · batch 16)"]
    E --> D
    D --> F["merge_output_shards\n(Driver · shared FS)"]
    F --> G["Final JSONL"]
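The branch in the flowchart can be sketched as a function that returns the stage names in order. This is an assumption-laden illustration: `planned_stages` is a hypothetical helper, it returns plain strings rather than real stage objects, and it only mirrors the ordering shown in the diagram.

```python
def planned_stages(run_scoring_qa: bool) -> list[str]:
    """Return stage names in execution order, following the flowchart above."""
    stages = ["HFDatasetImageReaderStage", "OCRNemotronV2Stage"]
    if run_scoring_qa:
        # The scoring/QA stage sits between OCR and the writer when enabled.
        stages.append("OCRScoringQAStage")
    stages.append("JsonlSampleWriterStage")
    return stages


ocr_only = planned_stages(run_scoring_qa=False)
with_scoring = planned_stages(run_scoring_qa=True)
```

With scoring enabled the list has four entries, which matches the "four-stage pipeline" mentioned in the PR validation notes; `merge_output_shards` is a driver-side step after the pipeline and is not counted as a stage here.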


Reviews (6): Last reviewed commit: "docs: distinguish OCR conversation types"

greptile-apps · 2026-07-02T14:59:34Z

+from nemo_curator.stages.synthetic.omni.io import HFDatasetImageReaderStage, JsonlSampleWriterStage
+from nemo_curator.stages.synthetic.omni.ocr_nemotron_v2 import OCRNemotronV2Stage
+from nemo_curator.stages.synthetic.omni.ocr_scoring_qa import OCRScoringQAStage
+from nemo_curator.tasks.ocr import OCRData
+
+pipeline = Pipeline(name="ocr-nemotron-custom")
+pipeline.add_stage(HFDatasetImageReaderStage(
+    dataset_name="lmms-lab/textvqa",
+    image_dir="/workspace/ocr-data/images",
+    split="validation",
+    limit=25,


Example doesn't demonstrate the recovery pattern it recommends

The code block places merge_output_shards after the try/finally, so it is only reached when pipeline.run() succeeds — identical to the CLI behaviour. The adjacent prose advises "call merge_output_shards in your own outer finally after workers stop", but no example of that pattern is shown. A reader who follows the code verbatim (e.g. after an OOM or network error during pipeline.run) will lose their partially-written shards. Consider either showing the finally variant or removing the recommendation sentence to avoid the contradiction.
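The `finally` variant the comment asks for might look roughly like the sketch below. Everything here is a stand-in: `merge_shards_sketch` imitates a shard merge by concatenating `*.jsonl` files, and `failing_run` imitates a `pipeline.run()` call that writes one shard and then fails. Neither reflects the real `merge_output_shards` signature, which is not shown in this thread.

```python
import json
import tempfile
from pathlib import Path


def merge_shards_sketch(shard_dir: Path, output_path: Path) -> int:
    """Hypothetical stand-in for a shard merge: concatenate shard files in name order."""
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for shard in sorted(shard_dir.glob("*.jsonl")):
            for line in shard.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line + "\n")
                    count += 1
    return count


def failing_run(shard_dir: Path) -> None:
    """Hypothetical stand-in for pipeline.run(): writes one shard, then fails."""
    shard = shard_dir / "shard-000.jsonl"
    shard.write_text(json.dumps({"image_id": "000000"}) + "\n", encoding="utf-8")
    raise RuntimeError("simulated worker failure")


def run_and_always_merge(run, shard_dir: Path, output_path: Path) -> None:
    try:
        run(shard_dir)
    finally:
        # Runs on success and on failure, so shards written before an
        # error are still merged. The original exception still propagates.
        merge_shards_sketch(shard_dir, output_path)
```

The design point is that the merge sits in `finally` rather than after the `try` block, so a failure inside the run does not skip it, and the caller still sees the original error.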


greptile-apps · 2026-07-02T14:59:35Z

+    max_text_errors=0,
+    dense_dump_prob=0.05,
+    batch_size=16,
+    priority_mode=False,
+))
+pipeline.add_stage(JsonlSampleWriterStage(
+    output_path="/workspace/ocr-data/results/custom.jsonl",
+    valid_only=False,
+    image_parent="/workspace/ocr-data/images",
+))
+```
+
+## 5. Read scored output
+
+```json
+{
+  "image_path": "000000.jpg",
+  "image_id": "000000",
+  "ocr_is_word_level": true,


JsonlSampleWriterStage Python default differs silently from the CLI default

JsonlSampleWriterStage.__init__ defaults valid_only=True, whereas the CLI (--valid-only) defaults to False. The Python example sets valid_only=False explicitly so this example is correct, but the tutorial never calls out the difference. A user who constructs the stage without this kwarg (e.g. copying the import lines and only passing output_path) will silently drop all invalid records, which is the opposite of what the CLI does by default and what "Rerun safely" step 2 advises.
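The effect of the differing defaults can be illustrated with a toy stand-in. `WriterSketch` is hypothetical and is not `JsonlSampleWriterStage`; only the `valid_only=True` default is taken from the comment above, and the `is_valid` record field is an assumed name used purely for the illustration.

```python
from dataclasses import dataclass


@dataclass
class WriterSketch:
    """Hypothetical stand-in for a writer stage; not the real class."""

    output_path: str
    valid_only: bool = True  # mirrors the Python-side default reported above

    def select(self, records: list[dict]) -> list[dict]:
        """Return the records this writer would keep."""
        if self.valid_only:
            # "is_valid" is an assumed field name for this sketch.
            return [r for r in records if r.get("is_valid", False)]
        return list(records)


records = [
    {"image_id": "000000", "is_valid": True},
    {"image_id": "000001", "is_valid": False},
]

implicit = WriterSketch(output_path="out.jsonl")  # default applies
explicit = WriterSketch(output_path="out.jsonl", valid_only=False)

kept_implicit = implicit.select(records)
kept_explicit = explicit.select(records)
```

Under these assumptions the implicitly configured writer keeps only the valid record, while the explicit `valid_only=False` writer keeps both, which is the silent difference the comment describes.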

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps · 2026-07-02T15:06:10Z

+## Conversation schema
+
+```json
+{
+  "conversation": [
+    {"sender": "user", "fragments": [{"t": "image", "value": "000042.jpg"}, "What text is in the bounding box [120, 90, 440, 160]?"]},
+    {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
+    {"sender": "user", "fragments": ["Point at the text 'TOTAL'. Answer with [x, y]."]},
+    {"sender": "assistant", "fragments": ["[732, 518]"]}
+  ]
+}
+```


Conversation schema contradicts the tutorial's actual output shape

The schema example shows "conversation" as a direct array inside the top-level object, but in tutorial step 5 (tutorial.mdx, lines 214–220) the full JSONL record stores conversation as a nested object: "conversation": {"conversation": [...]}. A reader following this section would write record["conversation"][0] to get the first turn, when the correct access is record["conversation"]["conversation"][0]. Consider clarifying that the object shown here is the value of the conversation field in the JSONL record, not the record itself, or add a note cross-referencing the step-5 full-record shape.
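A short sketch of the access pattern described above, assuming the full record nests the turn list as `record["conversation"]["conversation"]`. The sample record is trimmed from the schema example, the `image_id` value is made up for illustration, and `turns` and `text_fragments` are hypothetical helper names.

```python
import json

# One JSONL line shaped like the full record described in the comment:
# the "conversation" field is an object that itself holds a "conversation" list.
line = json.dumps({
    "image_path": "000042.jpg",
    "image_id": "000042",  # assumed value for this sketch
    "conversation": {
        "conversation": [
            {
                "sender": "user",
                "fragments": [
                    {"t": "image", "value": "000042.jpg"},
                    "What text is in the bounding box [120, 90, 440, 160]?",
                ],
            },
            {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
        ]
    },
})


def turns(record: dict) -> list[dict]:
    """Return the list of turns from the nested conversation wrapper."""
    return record["conversation"]["conversation"]


def text_fragments(turn: dict) -> list[str]:
    """Keep only plain-string fragments, skipping image references."""
    return [f for f in turn["fragments"] if isinstance(f, str)]


record = json.loads(line)
first_turn = turns(record)[0]
answer = text_fragments(turns(record)[1])
```

Indexing `record["conversation"][0]` on this shape would raise a `KeyError`, because the outer value is a dictionary rather than a list.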

Signed-off-by: Lawrence Lane <llane@nvidia.com>

docs: add Nemotron OCR pipeline guide

1b523f2

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Jun 29, 2026

lbliii mentioned this pull request Jun 30, 2026

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii marked this pull request as ready for review July 2, 2026 14:53

lbliii requested a review from a team as a code owner July 2, 2026 14:53

lbliii requested review from abhinavg4 and removed request for a team July 2, 2026 14:53

greptile-apps Bot reviewed Jul 2, 2026


docs: clarify OCR recovery and writer defaults

b560da6

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps Bot reviewed Jul 2, 2026


lbliii added 3 commits July 2, 2026 11:07

docs: clarify OCR conversation schema

2497d6d

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'main' into codex/docs-nemotron-ocr-pipeline

f1c79cc

docs: punctuate OCR navigation card

abe92e4

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii force-pushed the codex/docs-nemotron-ocr-pipeline branch from 093a623 to abe92e4 Compare July 2, 2026 15:22

docs: distinguish OCR conversation types

249be5e

Signed-off-by: Lawrence Lane <llane@nvidia.com>

[codex] docs: add Nemotron OCR pipeline guide #2137
lbliii wants to merge 6 commits into NVIDIA-NeMo:main from lbliii:codex/docs-nemotron-ocr-pipeline
