[codex] docs: add Nemotron OCR pipeline guide#2137

Open
lbliii wants to merge 6 commits into NVIDIA-NeMo:main from lbliii:codex/docs-nemotron-ocr-pipeline

Conversation

@lbliii lbliii commented Jun 29, 2026

Contributor

Summary

  • add a Nemotron OCR concept page covering architecture, installation, input sources, schemas, stage defaults, verifier behavior, and failure/restart semantics
  • add a runnable tutorial for OCR-only and optional NVIDIA Inference API scoring/QA paths
  • document the six user-visible QA interaction shapes, four internal QA tags, and all 11 dense-answer formats
  • provide CLI and Python examples, complete argument reference, result schemas, batching behavior, secrets handling, and shard lifecycle guidance
  • add navigation and cross-links from synthetic data, image curation, image tutorials, and LLM client documentation

Current API boundary

Issue #2121 names several APIs that were not present in the final merged #1899 implementation. These docs use the actual current names and behavior:

  • OCRNemotronV2Stage, not NemotronOCRV2Stage
  • HFDatasetImageReaderStage and JsonlSampleWriterStage, not HFDatasetImageReader / ResultWriterStage
  • NVInferenceClient, not NVInferenceModel
  • OCRDenseItem, not OCRDenseWord
  • conversation and dense-QA helper functions called by OCRScoringQAStage; there are no standalone OCRConversationalizeStage or OCRDenseQAStage classes
  • no OCR-specific TarImageReader, ParquetReader, or SkipProcessedStage

The pages call this out directly and document custom ImageSampleTask[OCRData] input as the extension point instead of publishing non-runnable configuration.
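
The renames listed above can be captured as a lookup table, which is useful for checking a draft page for names that only exist in issue #2121. This is a minimal sketch: the mapping is taken from the list above (pairing the reader and writer names in the order they are listed), while `find_stale_names` is a hypothetical helper and not part of NeMo Curator.

```python
import re

# Names used in issue #2121 mapped to the names in the merged implementation.
# The reader/writer pairing assumes the order given in the PR description.
STALE_TO_CURRENT = {
    "NemotronOCRV2Stage": "OCRNemotronV2Stage",
    "HFDatasetImageReader": "HFDatasetImageReaderStage",
    "ResultWriterStage": "JsonlSampleWriterStage",
    "NVInferenceModel": "NVInferenceClient",
    "OCRDenseWord": "OCRDenseItem",
}


def find_stale_names(text: str) -> dict[str, str]:
    """Return {stale_name: current_name} for every stale name found in text."""
    found = {}
    for stale, current in STALE_TO_CURRENT.items():
        # Word boundaries stop HFDatasetImageReader from matching
        # the valid name HFDatasetImageReaderStage.
        if re.search(rf"\b{re.escape(stale)}\b", text):
            found[stale] = current
    return found
```

Running the helper over a docs page before publishing would flag any leftover development names without rejecting the current ones.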

Validation

  • fern check — 0 errors (103 existing warnings)
  • fern docs broken-links — no errors in changed pages; 22 pre-existing API-reference errors elsewhere
  • documented scoring CLI arguments build the expected four-stage pipeline
  • focused OCR, task, conversation, model-stage, and NVIDIA client tests — 120 passed, 1 live GPU model-load test deselected
  • python -m py_compile for the tutorial and all documented source modules
  • git diff --cached --check

The repository's Linux-only import guard was bypassed for local CPU unit validation on macOS. The live NemotronOCR-v2 model load requires the documented container and GPU and was not run locally.

Closes #2121

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Jun 29, 2026
@lbliii lbliii marked this pull request as ready for review July 2, 2026 14:53
@lbliii lbliii requested a review from a team as a code owner July 2, 2026 14:53
@lbliii lbliii requested review from abhinavg4 and removed request for a team July 2, 2026 14:53

greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This PR adds a comprehensive Nemotron OCR documentation set to the NeMo Curator docs: a concept page covering architecture, schemas, verifier behavior, QA shapes, and failure semantics, plus a runnable tutorial covering OCR-only and scoring+QA paths. Navigation entries and cross-links from image-curation, synthetic-data, and LLM-client pages are also added.

  • Concept page (nemotron-ocr/index.mdx) documents the four-stage pipeline architecture, all API names (correcting earlier development names), OCRData/OCRConversationData schemas, bounding-box coordinate system caveat, verifier defaults, six QA interaction shapes, eleven dense-answer formats, and the conversation wrapper schema.
  • Tutorial page (nemotron-ocr/tutorial.mdx) provides step-by-step CLI and Python examples, a full CLI reference table, JsonlSampleWriterStage default-vs-CLI discrepancy note, and rerun/restart guidance with the merge_output_shards call placed in a finally block.
  • Navigation and cards (main.yml, four existing index pages) wire the new section into the sidebar and add discovery cards consistent with the existing pattern.

Confidence Score: 5/5

Documentation-only change with no runtime code modifications; safe to merge.

All changes are Fern MDX documentation and navigation YAML. The concept page and tutorial are internally consistent with each other and with the described implementation. All three concerns raised in prior review threads have been addressed: merge_output_shards now appears inside the finally block, the JsonlSampleWriterStage default-vs-CLI discrepancy is explicitly called out, and the conversation schema correctly documents the nested wrapper. The {{ container_version }} template placeholder follows the same pattern used across at least seven pre-existing docs pages. No code is executed and no APIs are changed.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx | New concept page; internally consistent schema, coordinate, and conversation-wrapper documentation; previous review thread concerns about the nested conversation shape have been addressed. |
| fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx | New tutorial; merge_output_shards is now inside the finally block, valid_only discrepancy is explicitly documented, and CLI/Python examples are consistent with the concept page. |
| fern/versions/main.yml | Adds Nemotron OCR section between Multilingual Q&A and Nemotron-CC with correct slug and paths; no issues found. |
| fern/versions/main/pages/curate-images/index.mdx | Adds cross-link card following the existing tag pattern; no issues. |
| fern/versions/main/pages/curate-images/tutorials/index.mdx | Adds tutorial discovery card consistent with neighboring cards; no issues. |
| fern/versions/main/pages/curate-text/synthetic/index.mdx | Adds Nemotron OCR card to the synthetic data landing page; consistent with existing cards. |
| fern/versions/main/pages/curate-text/synthetic/llm-client.mdx | Adds a next-steps link to the Nemotron OCR page; the description accurately reflects NVInferenceClient's streaming use for multimodal tasks. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart LR
    A["HFDatasetImageReaderStage\n(Driver · 1 CPU)"] --> B["OCRNemotronV2Stage\n(Worker · 8 CPU + 1 GPU · batch 32)"]
    B --> C{"--run-scoring-qa?"}
    C -->|No| D["JsonlSampleWriterStage\n(Worker · 2 CPU)"]
    C -->|Yes| E["OCRScoringQAStage\n(Worker · 1 CPU + NVIDIA API · batch 16)"]
    E --> D
    D --> F["merge_output_shards\n(Driver · shared FS)"]
    F --> G["Final JSONL"]
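
The branch in the flowchart above can be sketched as a small function that returns the stage order for a given `--run-scoring-qa` setting. This is illustrative only: `build_stage_names` is a hypothetical helper that returns class names as strings, and it does not construct real NeMo Curator stages.

```python
def build_stage_names(run_scoring_qa: bool) -> list[str]:
    """Return the stage order shown in the flowchart for the given flag value."""
    stages = ["HFDatasetImageReaderStage", "OCRNemotronV2Stage"]
    if run_scoring_qa:
        # The scoring/QA stage sits between OCR and the writer when enabled.
        stages.append("OCRScoringQAStage")
    stages.append("JsonlSampleWriterStage")
    return stages
```

With scoring enabled this yields the four-stage pipeline mentioned in the PR's validation notes; without it, three stages. The shard merge runs afterwards on the driver and is not a pipeline stage in this sketch.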

Reviews (6): Last reviewed commit: "docs: distinguish OCR conversation types"

Comment on lines +168 to +178
from nemo_curator.stages.synthetic.omni.io import HFDatasetImageReaderStage, JsonlSampleWriterStage
from nemo_curator.stages.synthetic.omni.ocr_nemotron_v2 import OCRNemotronV2Stage
from nemo_curator.stages.synthetic.omni.ocr_scoring_qa import OCRScoringQAStage
from nemo_curator.tasks.ocr import OCRData

pipeline = Pipeline(name="ocr-nemotron-custom")
pipeline.add_stage(HFDatasetImageReaderStage(
    dataset_name="lmms-lab/textvqa",
    image_dir="/workspace/ocr-data/images",
    split="validation",
    limit=25,

P2 Example doesn't demonstrate the recovery pattern it recommends

The code block places merge_output_shards after the try/finally, so it is only reached when pipeline.run() succeeds — identical to the CLI behaviour. The adjacent prose advises "call merge_output_shards in your own outer finally after workers stop", but no example of that pattern is shown. A reader who follows the code verbatim (e.g. after an OOM or network error during pipeline.run) will lose their partially-written shards. Consider either showing the finally variant or removing the recommendation sentence to avoid the contradiction.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
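
The `finally` variant the comment asks for can be sketched without the real library. Everything here is a stand-in and an assumption: `merge_shards` only imitates what `merge_output_shards` is described as doing (its real signature is not shown in this thread), and `failing_run` simulates a `pipeline.run()` that writes one shard and then crashes.

```python
import json
import tempfile
from pathlib import Path


def merge_shards(shard_dir: Path, output_path: Path) -> None:
    """Hypothetical stand-in for merge_output_shards: concatenate shard files."""
    with output_path.open("w", encoding="utf-8") as out:
        for shard in sorted(shard_dir.glob("*.jsonl")):
            for line in shard.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line + "\n")


def failing_run(shard_dir: Path) -> None:
    """Simulate a run that writes one shard, then fails (e.g. OOM)."""
    record = json.dumps({"image_id": "000000"})
    (shard_dir / "shard-0000.jsonl").write_text(record + "\n", encoding="utf-8")
    raise RuntimeError("simulated worker failure")


def run_with_recovery(work_dir: Path, run) -> None:
    shard_dir = work_dir / "shards"
    shard_dir.mkdir(parents=True, exist_ok=True)
    try:
        run(shard_dir)
    finally:
        # Executes on success and on failure, so partial shards are kept.
        merge_shards(shard_dir, work_dir / "merged.jsonl")


def demo() -> list[str]:
    """Run the failing pipeline and return whatever was merged."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        try:
            run_with_recovery(work_dir, failing_run)
        except RuntimeError:
            pass  # the original error still propagates to the caller
        return (work_dir / "merged.jsonl").read_text(encoding="utf-8").splitlines()
```

The exception is not swallowed by `run_with_recovery`; the `finally` block only guarantees that the merge happens before the error reaches the caller.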

Comment on lines +186 to +204
    max_text_errors=0,
    dense_dump_prob=0.05,
    batch_size=16,
    priority_mode=False,
))
pipeline.add_stage(JsonlSampleWriterStage(
    output_path="/workspace/ocr-data/results/custom.jsonl",
    valid_only=False,
    image_parent="/workspace/ocr-data/images",
))
```

## 5. Read scored output

```json
{
  "image_path": "000000.jpg",
  "image_id": "000000",
  "ocr_is_word_level": true,

P2 JsonlSampleWriterStage Python default differs silently from the CLI default

JsonlSampleWriterStage.__init__ defaults valid_only=True, whereas the CLI (--valid-only) defaults to False. The Python example sets valid_only=False explicitly so this example is correct, but the tutorial never calls out the difference. A user who constructs the stage without this kwarg (e.g. copying the import lines and only passing output_path) will silently drop all invalid records, which is the opposite of what the CLI does by default and what "Rerun safely" step 2 advises.
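
The effect of the differing defaults can be illustrated with a small stand-in class. This is a sketch under stated assumptions: `WriterSketch` is hypothetical and models only the `valid_only` flag, and the `is_valid` record field is invented for the example; neither reflects the real `JsonlSampleWriterStage` internals.

```python
from dataclasses import dataclass


@dataclass
class WriterSketch:
    """Hypothetical stand-in for JsonlSampleWriterStage (valid_only only)."""

    output_path: str
    valid_only: bool = True  # mirrors the Python default described above

    def select(self, records: list[dict]) -> list[dict]:
        """Return the records that would be written."""
        if self.valid_only:
            # "is_valid" is an assumed field name used only for illustration.
            return [r for r in records if r.get("is_valid")]
        return list(records)


records = [
    {"image_id": "000000", "is_valid": True},
    {"image_id": "000001", "is_valid": False},
]

# Omitting the kwarg silently drops the invalid record.
default_written = WriterSketch("out.jsonl").select(records)

# Passing valid_only=False matches the CLI default and keeps both records.
cli_like_written = WriterSketch("out.jsonl", valid_only=False).select(records)
```

Passing `valid_only` explicitly in every Python example avoids depending on which default a reader happens to get.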

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment on lines +156 to +167
## Conversation schema

```json
{
  "conversation": [
    {"sender": "user", "fragments": [{"t": "image", "value": "000042.jpg"}, "What text is in the bounding box [120, 90, 440, 160]?"]},
    {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
    {"sender": "user", "fragments": ["Point at the text 'TOTAL'. Answer with [x, y]."]},
    {"sender": "assistant", "fragments": ["[732, 518]"]}
  ]
}
```

P1 Conversation schema contradicts the tutorial's actual output shape

The schema example shows "conversation" as a direct array inside the top-level object, but in tutorial step 5 (tutorial.mdx, lines 214–220) the full JSONL record stores conversation as a nested object: "conversation": {"conversation": [...]}. A reader following this section would write record["conversation"][0] to get the first turn, when the correct access is record["conversation"]["conversation"][0]. Consider clarifying that the object shown here is the value of the conversation field in the JSONL record, not the record itself, or add a note cross-referencing the step-5 full-record shape.
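
The access pattern described in this comment can be checked with a short sketch. The record below is assembled from the turns in the schema excerpt and the nested shape the comment describes; the `image_path` value is illustrative, and a real JSONL record carries more fields.

```python
import json

# One JSONL line with the nested wrapper: record["conversation"] is an
# object whose own "conversation" key holds the list of turns.
line = json.dumps({
    "image_path": "000042.jpg",
    "conversation": {
        "conversation": [
            {"sender": "user", "fragments": [
                {"t": "image", "value": "000042.jpg"},
                "What text is in the bounding box [120, 90, 440, 160]?",
            ]},
            {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
        ]
    },
})

record = json.loads(line)

# Correct access: go through the wrapper object first.
turns = record["conversation"]["conversation"]
first_turn = turns[0]

# record["conversation"][0] would raise KeyError here, because the outer
# value is a dict, not a list.
```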

lbliii added 3 commits July 2, 2026 11:07
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii force-pushed the codex/docs-nemotron-ocr-pipeline branch from 093a623 to abe92e4 Compare July 2, 2026 15:22
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Development

Successfully merging this pull request may close these issues.

[Docs] Document the Nemotron OCR synthetic-data-generation pipeline