[codex] docs: add Nemotron OCR pipeline guide #2137
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Greptile Summary
This PR adds a comprehensive Nemotron OCR documentation set to the NeMo Curator docs: a concept page covering architecture, schemas, verifier behavior, QA shapes, and failure semantics, plus a runnable tutorial covering OCR-only and scoring+QA paths. Navigation entries and cross-links from image-curation, synthetic-data, and LLM-client pages are also added.
Confidence Score: 5/5
Documentation-only change with no runtime code modifications; safe to merge. All changes are Fern MDX documentation and navigation YAML. The concept page and tutorial are internally consistent with each other and with the described implementation. All three concerns raised in prior review threads have been addressed: merge_output_shards now appears inside the finally block, the JsonlSampleWriterStage default-vs-CLI discrepancy is explicitly called out, and the conversation schema correctly documents the nested wrapper. The {{ container_version }} template placeholder follows the same pattern used across at least seven pre-existing docs pages. No code is executed and no APIs are changed. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart LR
A["HFDatasetImageReaderStage\n(Driver · 1 CPU)"] --> B["OCRNemotronV2Stage\n(Worker · 8 CPU + 1 GPU · batch 32)"]
B --> C{"--run-scoring-qa?"}
C -->|No| D["JsonlSampleWriterStage\n(Worker · 2 CPU)"]
C -->|Yes| E["OCRScoringQAStage\n(Worker · 1 CPU + NVIDIA API · batch 16)"]
E --> D
D --> F["merge_output_shards\n(Driver · shared FS)"]
F --> G["Final JSONL"]
Reviews (6): Last reviewed commit: "docs: distinguish OCR conversation types"
from nemo_curator.stages.synthetic.omni.io import HFDatasetImageReaderStage, JsonlSampleWriterStage
from nemo_curator.stages.synthetic.omni.ocr_nemotron_v2 import OCRNemotronV2Stage
from nemo_curator.stages.synthetic.omni.ocr_scoring_qa import OCRScoringQAStage
from nemo_curator.tasks.ocr import OCRData

pipeline = Pipeline(name="ocr-nemotron-custom")
pipeline.add_stage(HFDatasetImageReaderStage(
    dataset_name="lmms-lab/textvqa",
    image_dir="/workspace/ocr-data/images",
    split="validation",
    limit=25,
Example doesn't demonstrate the recovery pattern it recommends
The code block places merge_output_shards after the try/finally, so it is only reached when pipeline.run() succeeds — identical to the CLI behaviour. The adjacent prose advises "call merge_output_shards in your own outer finally after workers stop", but no example of that pattern is shown. A reader who follows the code verbatim (e.g. after an OOM or network error during pipeline.run) will lose their partially-written shards. Consider either showing the finally variant or removing the recommendation sentence to avoid the contradiction.
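The control flow this comment asks for can be sketched with stand-ins. `run_pipeline` and `merge_shards` below are hypothetical placeholders, not the NeMo Curator `pipeline.run()` or `merge_output_shards` APIs, whose signatures are not shown in this thread. The shard file names are also assumptions. The sketch only illustrates that merging inside `finally` keeps shards written before a failure.

```python
import json
from pathlib import Path


def run_pipeline(shard_dir: Path, fail_after: int) -> None:
    """Hypothetical stand-in for pipeline.run(): writes shards, then fails."""
    for i in range(3):
        if i == fail_after:
            raise RuntimeError("simulated worker failure")
        shard = shard_dir / f"shard-{i:05d}.jsonl"
        shard.write_text(json.dumps({"image_id": f"{i:06d}"}) + "\n")


def merge_shards(shard_dir: Path, output_path: Path) -> int:
    """Hypothetical stand-in for merge_output_shards: concatenates shards."""
    lines = []
    for shard in sorted(shard_dir.glob("shard-*.jsonl")):
        lines.extend(shard.read_text().splitlines())
    output_path.write_text("".join(line + "\n" for line in lines))
    return len(lines)


def run_and_merge(workdir: Path, fail_after: int) -> Path:
    shard_dir = workdir / "shards"
    shard_dir.mkdir(parents=True, exist_ok=True)
    output_path = workdir / "custom.jsonl"
    try:
        run_pipeline(shard_dir, fail_after)
    finally:
        # Runs on success and on failure, so partial shards are still merged.
        merge_shards(shard_dir, output_path)
    return output_path
```

The exception still propagates to the caller after the merge, so a failed run is not reported as a success; only the partially written output is preserved.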
    max_text_errors=0,
    dense_dump_prob=0.05,
    batch_size=16,
    priority_mode=False,
))
pipeline.add_stage(JsonlSampleWriterStage(
    output_path="/workspace/ocr-data/results/custom.jsonl",
    valid_only=False,
    image_parent="/workspace/ocr-data/images",
))
```

## 5. Read scored output

```json
{
  "image_path": "000000.jpg",
  "image_id": "000000",
  "ocr_is_word_level": true,
JsonlSampleWriterStage Python default differs silently from the CLI default
JsonlSampleWriterStage.__init__ defaults valid_only=True, whereas the CLI (--valid-only) defaults to False. The Python example sets valid_only=False explicitly so this example is correct, but the tutorial never calls out the difference. A user who constructs the stage without this kwarg (e.g. copying the import lines and only passing output_path) will silently drop all invalid records, which is the opposite of what the CLI does by default and what "Rerun safely" step 2 advises.
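The effect of the differing defaults can be illustrated with a stand-in class. `WriterStageStub` below is hypothetical and only mirrors the `valid_only=True` constructor default described in this comment. The `is_valid` field and the `records_to_write` helper are assumptions made for illustration and are not taken from the real JsonlSampleWriterStage.

```python
from dataclasses import dataclass


@dataclass
class WriterStageStub:
    """Hypothetical stand-in mirroring the default described in the comment."""

    output_path: str
    valid_only: bool = True  # Python-side default, per the review comment

    def records_to_write(self, records: list[dict]) -> list[dict]:
        # Assumed filter: drop records flagged invalid when valid_only is set.
        if self.valid_only:
            return [r for r in records if r.get("is_valid", False)]
        return list(records)


records = [
    {"image_id": "000000", "is_valid": True},
    {"image_id": "000001", "is_valid": False},
]

# Omitting the kwarg silently filters; passing it explicitly keeps everything.
implicit = WriterStageStub(output_path="custom.jsonl")
explicit = WriterStageStub(output_path="custom.jsonl", valid_only=False)

kept_implicit = implicit.records_to_write(records)
kept_explicit = explicit.records_to_write(records)
```

Passing `valid_only` explicitly in every Python example, as the tutorial snippet already does, avoids depending on whichever default applies.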
## Conversation schema

```json
{
  "conversation": [
    {"sender": "user", "fragments": [{"t": "image", "value": "000042.jpg"}, "What text is in the bounding box [120, 90, 440, 160]?"]},
    {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
    {"sender": "user", "fragments": ["Point at the text 'TOTAL'. Answer with [x, y]."]},
    {"sender": "assistant", "fragments": ["[732, 518]"]}
  ]
}
```
Conversation schema contradicts the tutorial's actual output shape
The schema example shows "conversation" as a direct array inside the top-level object, but in tutorial step 5 (tutorial.mdx, lines 214–220) the full JSONL record stores conversation as a nested object: "conversation": {"conversation": [...]}. A reader following this section would write record["conversation"][0] to get the first turn, when the correct access is record["conversation"]["conversation"][0]. Consider clarifying that the object shown here is the value of the conversation field in the JSONL record, not the record itself, or add a note cross-referencing the step-5 full-record shape.
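A short sketch of the nested access described above, using only the standard library. The record is built inline with illustrative values and trimmed to the fields needed here; a real step-5 record carries more fields.

```python
import json

# One JSONL line shaped like the nested record described in the comment.
line = json.dumps({
    "image_path": "000042.jpg",
    "conversation": {
        "conversation": [
            {
                "sender": "user",
                "fragments": [
                    {"t": "image", "value": "000042.jpg"},
                    "What text is in the bounding box [120, 90, 440, 160]?",
                ],
            },
            {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
        ]
    },
})

record = json.loads(line)

# The outer key holds a wrapper object; the inner key holds the list of turns.
turns = record["conversation"]["conversation"]
first_turn = turns[0]

# Fragments mix dicts (image references) and plain strings (text).
text_fragments = [f for f in first_turn["fragments"] if isinstance(f, str)]
```

Indexing `record["conversation"][0]` on this shape raises `KeyError`, because the outer value is a dict rather than a list.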
Force-pushed from 093a623 to abe92e4
Summary
Current API boundary
Issue #2121 names several APIs that were not present in the final merged #1899 implementation. These docs use the actual current names and behavior:
- OCRNemotronV2Stage, not NemotronOCRV2Stage
- HFDatasetImageReaderStage and JsonlSampleWriterStage, not HFDatasetImageReader / ResultWriterStage
- NVInferenceClient, not NVInferenceModel
- OCRDenseItem, not OCRDenseWord
- OCRScoringQAStage; there are no standalone OCRConversationalizeStage or OCRDenseQAStage classes
- TarImageReader, ParquetReader, or SkipProcessedStage
The pages call this out directly and document custom ImageSampleTask[OCRData] input as the extension point instead of publishing non-runnable configuration.

Validation
- fern check: 0 errors (103 existing warnings)
- fern docs broken-links: no errors in changed pages; 22 pre-existing API-reference errors elsewhere
- python -m py_compile for the tutorial and all documented source modules
- git diff --cached --check
The repository's Linux-only import guard was bypassed for local CPU unit validation on macOS. The live NemotronOCR-v2 model load requires the documented container and GPU and was not run locally.
Closes #2121