diff --git a/fern/versions/main.yml b/fern/versions/main.yml
index 1c94ec0bac..494e1a4e72 100644
--- a/fern/versions/main.yml
+++ b/fern/versions/main.yml
@@ -309,6 +309,15 @@ navigation:
- page: Multilingual Q&A
path: ./main/pages/curate-text/synthetic/multilingual-qa.mdx
slug: multilingual-qa
+ - section: Nemotron OCR
+ slug: nemotron-ocr
+ contents:
+ - page: Overview
+ path: ./main/pages/curate-text/synthetic/nemotron-ocr/index.mdx
+ slug: ""
+ - page: Tutorial
+ path: ./main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx
+ slug: tutorial
- section: Nemotron-CC
slug: nemotron-cc
contents:
diff --git a/fern/versions/main/pages/curate-images/index.mdx b/fern/versions/main/pages/curate-images/index.mdx
index b6d29f6406..d643803dcf 100644
--- a/fern/versions/main/pages/curate-images/index.mdx
+++ b/fern/versions/main/pages/curate-images/index.mdx
@@ -117,6 +117,13 @@ deduplication semantic clustering
+
+
+Generate word-level OCR labels and verifier-scored visual question-answering conversations.
+ocr multimodal synthetic-data
+
+
+
### Pipeline Management
diff --git a/fern/versions/main/pages/curate-images/tutorials/index.mdx b/fern/versions/main/pages/curate-images/tutorials/index.mdx
index 98834a0fee..d50f0e433e 100644
--- a/fern/versions/main/pages/curate-images/tutorials/index.mdx
+++ b/fern/versions/main/pages/curate-images/tutorials/index.mdx
@@ -28,6 +28,13 @@ embeddings
clustering
+
+Extract word boxes from image datasets and generate multimodal bounding-box QA conversations.
+nemotron-ocr
+multimodal
+synthetic-data
+
+
## Resources
diff --git a/fern/versions/main/pages/curate-text/synthetic/index.mdx b/fern/versions/main/pages/curate-text/synthetic/index.mdx
index e6fe404ab0..da2621a893 100644
--- a/fern/versions/main/pages/curate-text/synthetic/index.mdx
+++ b/fern/versions/main/pages/curate-text/synthetic/index.mdx
@@ -125,6 +125,13 @@ quickstart
tutorial
+
+Turn images into word-level OCR annotations and optionally generate verifier-scored multimodal QA conversations.
+multimodal
+ocr
+visual-qa
+
+
Declarative data generation with structured columns and NDD-backed Nemotron-CC stages
ndd
diff --git a/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx b/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx
index db2d5082a0..d872e21146 100644
--- a/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx
+++ b/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx
@@ -221,4 +221,5 @@ client = AsyncOpenAIClient(
## Next Steps
- [Multilingual Q&A](/curate-text/synthetic/multilingual-qa): Generate multilingual Q&A pairs
+- [Nemotron OCR](/curate-text/synthetic/nemotron-ocr): Use the specialized streaming `NVInferenceClient` for multimodal OCR verification and QA generation
- [Nemotron-CC](/curate-text/synthetic/nemotron-cc): Advanced text transformation pipelines
diff --git a/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx
new file mode 100644
index 0000000000..368ee135ac
--- /dev/null
+++ b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx
@@ -0,0 +1,193 @@
+---
+description: "Understand the Nemotron OCR synthetic-data pipeline, its image and OCR schemas, verifier scoring, QA conversation generation, and output lifecycle"
+categories: ["synthetic-data"]
+tags: ["nemotron-ocr", "multimodal", "ocr", "visual-question-answering", "synthetic-data"]
+personas: ["mle-focused", "data-scientist-focused"]
+difficulty: "advanced"
+content_type: "concept"
+modality: "image-only"
+---
+
+# Nemotron OCR Synthetic Data
+
+The Nemotron OCR pipeline converts images into word-level OCR annotations and, optionally, scored visual question-answering conversations. It runs NemotronOCR-v2 locally on a GPU, can ask Nemotron-Nano-Omni through the NVIDIA Inference API to verify every bounding box, and writes one JSONL record per image.
+
+```mermaid
+flowchart LR
+ A["HF Hub, saved dataset, or image folder"] --> B["Extract and cache JPEG images"]
+ B --> C["NemotronOCR-v2 word OCR"]
+ C --> D{"Run scoring QA?"}
+ D -->|"No"| E["OCR JSONL"]
+ D -->|"Yes"| F["NVIDIA API bbox verification"]
+ F --> G["Filter boxes and generate QA"]
+ G --> H["Conversation JSONL"]
+```
+
+
+The merged API names are `OCRNemotronV2Stage`, `OCRScoringQAStage`, `HFDatasetImageReaderStage`, `JsonlSampleWriterStage`, `NVInferenceClient`, `OCRDenseItem`, `OCRData`, and `OCRConversationData`. Earlier development names such as `NemotronOCRV2Stage`, `OCRConversationalizeStage`, `OCRDenseQAStage`, `NVInferenceModel`, `OCRDenseWord`, `TarImageReader`, `SkipProcessedStage`, and `ResultWriterStage` are not shipped APIs. Conversation and dense-QA behavior is implemented by helper functions called inside `OCRScoringQAStage`.
+
+
+## Runtime architecture
+
+| Component | Runs where | Resources and defaults | Responsibility |
+| --- | --- | --- | --- |
+| `HFDatasetImageReaderStage` | Driver/fan-out stage | 1 CPU | Loads a Hugging Face dataset, deduplicates image IDs, caches RGB JPEGs, and creates `ImageSampleTask[OCRData]` tasks. |
+| `OCRNemotronV2Stage` | Curator worker | 8 CPUs, 1 GPU, stage batch size 32 | Loads NemotronOCR-v2 and adds normalized word boxes and text. Each image error is isolated. |
+| `OCRScoringQAStage` | Curator worker plus NVIDIA API | 1 CPU, stage batch size 16 | Sends the image and OCR boxes to a remote verifier, applies quality thresholds, and creates a deterministic conversation. |
+| `JsonlSampleWriterStage` | Curator worker | 2 CPUs | Appends records to worker-specific JSONL shards. |
+| `merge_output_shards` | Driver | CPU and shared filesystem | Concatenates shards into the requested JSONL and deletes shards after a successful merge. |
+
+The included runner starts a Ray cluster and uses `XennaExecutor`; it does not expose a Ray Data backend option. All workers must see the extracted image directory and output path through the same filesystem.
+
+## Installation and model access
+
+Use the NeMo Curator container for this pipeline. In addition to CUDA and the normal SDG/image dependencies, the repository Dockerfile builds the `nemotron-ocr` C++/CUDA extension from the `nvidia/nemotron-ocr-v2` repository. Installing `sdg_cuda12` or `image_cuda12` alone does not install that extension.
+
+```bash
+docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}
+
+docker run --gpus all -it --rm \
+ -v "$PWD:/workspace" \
+ -w /workspace \
+ nvcr.io/nvidia/nemo-curator:{{ container_version }}
+```
+
+Inside the container:
+
+```bash
+source /opt/venv/env.sh
+python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('Nemotron OCR ready')"
+```
+
+The default OCR stage downloads `nvidia/nemotron-ocr-v2` from Hugging Face and uses its `v2_multilingual` subdirectory. Mount a persistent Hugging Face cache, or pass a pre-downloaded `v2_multilingual` directory with `--nemotron-model-dir`.
+
+The repository Dockerfile compiles for CUDA architectures 8.0, 8.6, 8.9, and 9.0. GPU memory requirements depend on image size, model version, and worker concurrency; the stage does not enforce a VRAM minimum.
+
+The optional verifier reads `NVINFERENCE_API_KEY` from each worker's environment during setup. Do not place the key in source, CLI arguments, or serialized stage configuration. Obtain a key from [NVIDIA Build](https://build.nvidia.com/settings/api-keys) and ensure Ray workers inherit it.
+
+## Input sources
+
+`HFDatasetImageReaderStage` supports three source forms:
+
+| Source | Detection and loading behavior |
+| --- | --- |
+| Hugging Face Hub dataset | A non-local dataset name is passed to `load_dataset`, with optional config and split. A limit becomes a split slice such as `train[:100]`. |
+| Dataset saved to disk | An existing directory containing `dataset_info.json` is loaded with `load_from_disk`. If it yields a dataset dictionary, the requested split must exist. |
+| Local image folder | Any other existing directory is loaded through the Hugging Face `imagefolder` builder. |
+
+The image column may contain a PIL image, raw bytes, a Hugging Face Image dictionary with `bytes` or `path`, or a valid file-path string. Each image is converted to RGB and saved in the image directory as a JPEG file named after its image ID.
+
+If `id_column` is unset, zero-padded row indexes are used. If set, only the first row for each ID is processed. IDs must be safe as local filenames. Hub and image-folder limits apply before ID deduplication, so the final number of unique tasks can be lower than the limit.
+
+
+The current OCR pipeline has no tar reader, parquet reader, or skip-processed stage. To use those sources, implement an upstream stage that emits `ImageSampleTask[OCRData]`. Built-in resume behavior is limited to reusing already-extracted JPEG files.
+
+
+## OCR data model
+
+`OCRData` extends `ImageTaskData` and carries the reader, OCR, and verifier fields. When `OCRScoringQAStage` runs, it converts the payload to `OCRConversationData`, an `OCRData` subclass that adds the `conversation` field while retaining the base fields:
+
+| Field | Producer | Meaning |
+| --- | --- | --- |
+| `image_path`, `image_id` | Reader | Cached JPEG and its dataset ID. |
+| `is_valid`, `error` | All stages | Internal validity and last error. The writer removes `is_valid` from JSONL. |
+| `ocr_is_word_level` | OCR/verifier | Defaults true; changes to false when the verifier identifies line boxes. |
+| `ocr_dense` | OCR | `OCRDenseItem` values on a normalized 0–1000 grid. |
+| `ocr_scoring_*` | Scoring QA | Prompt, model, raw response, mode, and missing-region audit data. |
+| `conversation` | `OCRConversationData` from Scoring QA | Alternating user/assistant messages, with the image on the first user turn. This field is not part of the base `OCRData` dataclass. |
+
+Each `OCRDenseItem` contains `bbox_2d`, `text_content`, optional `quad`, `valid`, `bbox_match`, and `text_errors`. Nemotron output is scaled to integer 0–1000 coordinates, and inverted vertical bounds are sorted.
+
+
+Stored OCR boxes use `[x0, y0, x1, y1]`. The verifier prompt and `ocr_scoring_missing` use `[y0, x0, y1, x1]` to match its protocol. Reorder coordinates before combining these fields.
+
+
+## Bounding-box verification
+
+`OCRScoringQAStage` makes one remote verifier request per image:
+
+| Setting | Default | Behavior |
+| --- | --- | --- |
+| `model_id` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | NVIDIA Inference API model. |
+| `temperature` | `1.0` | Verifier sampling temperature. |
+| `max_tokens` | `16384` | Allows reasoning plus final JSON; only streamed final content is retained. |
+| `min_bbox_match` | `5` | Inclusive minimum score for a valid box. |
+| `max_text_errors` | `0` | Inclusive maximum error count. |
+| `fail_on_missing_text` | `false` | Optionally invalidate the image when text regions are missing. |
+| `dense_dump_prob` | `0.05` | Dense single-turn probability when no text is missing. |
+| `batch_size` | `16` | Image tasks passed to the model stage. |
+| `priority_mode` | `false` | Adds the NVIDIA priority header when enabled through Python. |
+
+The verifier returns `ocr_mode`, indexed box scores, and `missing_text`. Missing or nonnumeric scores invalidate a box. If every original box is invalid, the image is invalid. Empty or unparseable responses also invalidate the image.
+
+`NVInferenceClient` uses the NVIDIA integration endpoint, reads `NVINFERENCE_API_KEY`, streams final-answer content, allows 10 concurrent requests per worker, uses a 120-second timeout, and retries rate-limit and connection failures three times with exponential backoff and jitter.
+
+## QA interaction shapes
+
+The source defines five QA type constants. Four of them tag and balance the multi-turn families; the fifth, `dense_dump`, names the separate single-turn all-words path. Routing by single-occurrence versus repeated text within the four tagged families produces six user-visible shapes:
+
+| Shape | Prompt gives | Answer gives |
+| --- | --- | --- |
+| Box to text | One box | Text inside it |
+| Point to text | One box center | Text at that point |
+| Text to one box | Text occurring once | One box |
+| Text to many boxes | Repeated text | Every matching box |
+| Text to one point | Text occurring once | One center point |
+| Text to many points | Repeated text | Every matching center point |
+
+Generation is seeded from the framework task ID. At most 100 multi-turn QA pairs are balanced across the four tagged families. When five or more OCR items are invalid, text-to-location questions are disabled.
+
+When no text is missing, `dense_dump_prob` can select a single turn listing every valid word. It combines one of 33 instruction phrasings with one of 11 answer formats:
+
+| # | Dense answer format |
+| --- | --- |
+| 1 | Plain JSON list with `bbox_2d` and `text_content` |
+| 2 | Fenced JSON code block |
+| 3 | JSON with an explicit key-schema instruction |
+| 4 | JSON with an explicit per-item object example |
+| 5 | JSON-only response with no surrounding text |
+| 6 | JSON with explicit `x_min, y_min, x_max, y_max` semantics |
+| 7 | One `text [bbox]` item per line |
+| 8 | One `[bbox]: text` item per line |
+| 9 | One `text (x0, y0, x1, y1)` item per line |
+| 10 | Markdown table |
+| 11 | Tab-separated text and coordinates |
+
+## Conversation schema
+
+Within each output JSONL record, the top-level `conversation` field contains a wrapper object whose inner `conversation` field is the ordered turn list:
+
+```json
+{
+ "conversation": {
+ "conversation": [
+ {"sender": "user", "fragments": [{"t": "image", "value": "000042.jpg"}, "What text is in the bounding box [120, 90, 440, 160]?"]},
+ {"sender": "assistant", "fragments": ["PROJECT STATUS"]},
+ {"sender": "user", "fragments": ["Point at the text 'TOTAL'. Answer with [x, y]."]},
+ {"sender": "assistant", "fragments": ["[732, 518]"]}
+ ]
+ }
+}
+```
+
+The image value is the cached filename, not its bytes. Use `--image-parent` to make `image_path` portable and distribute the cached images with the JSONL.
+
+## Failure and restart model
+
+- Cached JPEGs are reused by filename without verifying their content.
+- OCR catches exceptions per image; prompt/response errors are also isolated per image. A generation or response-count failure can invalidate a whole scoring batch.
+- Remote failures become empty responses after retries and invalidate the affected image.
+- Worker JSONL records are flushed immediately. `valid_only=True` skips invalid tasks.
+- The CLI defaults to `valid_only=False`, retaining invalid records and their `error` field.
+- The writer removes `is_valid` and all `None` fields but retains empty lists, strings, and `false`.
+- There is no processed-ID checkpoint. Reruns repeat OCR and API calls even when JPEGs are cached.
+- Stale `*_worker*.jsonl` files are not cleaned before a run. Remove them or use a fresh output basename after interruption.
+
+Continue with the [runnable tutorial](/curate-text/synthetic/nemotron-ocr/tutorial).
+
+## Related topics
+
+- [Synthetic data generation](/curate-text/synthetic)
+- [LLM client configuration](/curate-text/synthetic/llm-client)
+- [Image curation](/curate-images)
+- [Execution backends](/reference/infra/execution-backends)
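
A note on the coordinate-order warning in `nemotron-ocr/index.mdx`: the stored `bbox_2d` order `[x0, y0, x1, y1]` and the verifier order `[y0, x0, y1, x1]` differ only by a swap within each corner, so a small helper can bring both into one convention before the fields are combined. The sketch below is illustrative and not part of the shipped API. The helper names `yxyx_to_xyxy` and `to_pixels` are hypothetical, and the sketch assumes that a verifier-order box is a plain list of four numbers on the same 0–1000 grid as `ocr_dense`, which the page does not state explicitly.

```python
from typing import List, Sequence


def yxyx_to_xyxy(box: Sequence[float]) -> List[float]:
    """Reorder a verifier-style [y0, x0, y1, x1] box into stored [x0, y0, x1, y1] order."""
    y0, x0, y1, x1 = box
    return [x0, y0, x1, y1]


def to_pixels(box_xyxy: Sequence[float], width: int, height: int) -> List[int]:
    """Scale an [x0, y0, x1, y1] box from the 0-1000 grid to pixel coordinates.

    Rounding to the nearest integer is a choice made for this sketch.
    """
    x0, y0, x1, y1 = box_xyxy
    return [
        round(x0 * width / 1000),
        round(y0 * height / 1000),
        round(x1 * width / 1000),
        round(y1 * height / 1000),
    ]


# Example: a verifier-order box that covers the same region as the stored
# box [108, 72, 492, 146] used in the tutorial's sample record.
stored_order = yxyx_to_xyxy([72, 108, 146, 492])
pixel_box = to_pixels(stored_order, width=2000, height=1000)
```

Keeping the reorder and the scaling as separate steps makes it harder to apply the swap twice by accident when both `ocr_dense` and `ocr_scoring_missing` boxes pass through the same code path.
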
diff --git a/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx
new file mode 100644
index 0000000000..bff74d0dbb
--- /dev/null
+++ b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx
@@ -0,0 +1,251 @@
+---
+description: "Run NemotronOCR-v2 over Hugging Face image datasets and optionally generate bbox-scored multimodal QA conversations with NVIDIA Inference"
+categories: ["tutorials"]
+tags: ["nemotron-ocr", "ocr", "multimodal", "synthetic-data", "nvidia-inference-api"]
+personas: ["mle-focused", "data-scientist-focused"]
+difficulty: "advanced"
+content_type: "tutorial"
+modality: "image-only"
+---
+
+# Run the Nemotron OCR Pipeline
+
+Run `tutorials/synthetic/omni/ocr_pipeline.py` over a Hugging Face dataset. Start with local OCR, inspect the boxes, and then enable remote verification and QA conversation generation.
+
+## 1. Start the GPU container
+
+From the repository root:
+
+```bash
+docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}
+
+docker run --gpus all -it --rm \
+ -v "$PWD:/workspace" \
+ -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+ -w /workspace \
+ nvcr.io/nvidia/nemo-curator:{{ container_version }}
+```
+
+Inside the container:
+
+```bash
+source /opt/venv/env.sh
+nvidia-smi
+python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('Nemotron OCR ready')"
+```
+
+For private or gated data, authenticate through the environment or Hugging Face CLI. Do not commit tokens.
+
+## 2. Run a small OCR-only sample
+
+```bash
+python -m tutorials.synthetic.omni.ocr_pipeline \
+ --hf-dataset lmms-lab/textvqa \
+ --hf-image-dir /workspace/ocr-data/images \
+ --hf-split validation \
+ --hf-image-column image \
+ --hf-limit 25 \
+ --output-path /workspace/ocr-data/results/ocr.jsonl \
+ --image-parent /workspace/ocr-data/images
+```
+
+The runner starts Ray, uses Xenna, writes worker-specific `ocr_worker*.jsonl` shards, and merges them into the requested file after a successful run.
+
+
+`--output-path` is a base file path, despite the current help text describing it as a directory. Include a `.jsonl` filename. Without a suffix, the writer creates `_worker*.jsonl` and merges them to `.jsonl` beside the supplied path.
+
+
+Inspect one record:
+
+```bash
+head -n 1 /workspace/ocr-data/results/ocr.jsonl
+```
+
+```json
+{
+ "image_path": "000000.jpg",
+ "image_id": "000000",
+ "ocr_is_word_level": true,
+ "ocr_dense": [
+ {"bbox_2d": [108, 72, 492, 146], "text_content": "EMERGENCY EXIT", "valid": true}
+ ]
+}
+```
+
+## 3. Enable scoring and QA
+
+Create an NVIDIA API key at [NVIDIA Build](https://build.nvidia.com/settings/api-keys), then export it without putting the value in shell history:
+
+```bash
+read -rs NVINFERENCE_API_KEY
+export NVINFERENCE_API_KEY
+```
+
+Each image causes one remote verifier request, so validate a small sample first:
+
+```bash
+python -m tutorials.synthetic.omni.ocr_pipeline \
+ --hf-dataset lmms-lab/textvqa \
+ --hf-image-dir /workspace/ocr-data/images \
+ --hf-split validation \
+ --hf-image-column image \
+ --hf-limit 25 \
+ --output-path /workspace/ocr-data/results/ocr-qa.jsonl \
+ --image-parent /workspace/ocr-data/images \
+ --run-scoring-qa \
+ --scoring-qa-min-bbox-match 5 \
+ --scoring-qa-max-text-errors 0 \
+ --scoring-qa-dense-dump-prob 0.05
+```
+
+Add `--scoring-qa-fail-on-missing-text` to invalidate images with uncovered text. Otherwise, missing regions are recorded, dense dump is disabled for that image, and QA uses only the boxes that passed scoring. Add `--valid-only` to exclude invalid images; without it, failed records remain with an `error` string.
+
+## CLI reference
+
+| Argument | Default | Description |
+| --- | --- | --- |
+| `--hf-dataset` | Required | Hub dataset ID or local dataset/image-folder path. |
+| `--hf-image-dir` | Required | RGB JPEG cache. Existing filenames are reused. |
+| `--hf-split` | `train` | Dataset split. |
+| `--hf-config` | None | Optional Hub configuration/subset. |
+| `--hf-image-column` | `image` | Column containing the image. |
+| `--hf-id-column` | Row index | Filename/deduplication column; use safe unique values. |
+| `--hf-limit` | None | Records selected before duplicate IDs are removed. |
+| `--output-path` | Required | Base JSONL file used for shards and final merge. |
+| `--image-parent` | None | Makes descendant image paths relative in output. |
+| `--valid-only` | False | Skip invalid tasks rather than retaining errors. |
+| `--nemotron-model-dir` | HF download | Local `v2_multilingual` model directory. |
+| `--run-scoring-qa` | False | Enable remote scoring and conversation generation. |
+| `--scoring-qa-model-id` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | Verifier model. |
+| `--scoring-qa-min-bbox-match` | `5` | Inclusive minimum box-fit score. |
+| `--scoring-qa-max-text-errors` | `0` | Inclusive maximum text errors. |
+| `--scoring-qa-fail-on-missing-text` | False | Invalidate images with missing regions. |
+| `--scoring-qa-dense-dump-prob` | `0.05` | Dense-turn probability for complete OCR. |
+
+OCR `merge_level`, verifier generation/batch settings, priority mode, and API client concurrency are Python-only settings.
+
+## 4. Use the Python API
+
+```python
+from pathlib import Path
+
+from nemo_curator.backends.xenna import XennaExecutor
+from nemo_curator.core.client import RayClient
+from nemo_curator.stages.synthetic.omni.io import merge_output_shards
+from tutorials.synthetic.omni.ocr_pipeline import create_hf_ocr_pipeline
+
+output = Path("/workspace/ocr-data/results/ocr-qa.jsonl")
+pipeline = create_hf_ocr_pipeline(
+ dataset_name="lmms-lab/textvqa",
+ image_dir=Path("/workspace/ocr-data/images"),
+ output_path=output,
+ hf_split="validation",
+ hf_limit=25,
+ image_parent=Path("/workspace/ocr-data/images"),
+ valid_only=False,
+ run_scoring_qa=True,
+ scoring_qa_min_bbox_match=5,
+ scoring_qa_max_text_errors=0,
+ scoring_qa_dense_dump_prob=0.05,
+)
+
+client = RayClient()
+client.start()
+try:
+ pipeline.run(XennaExecutor())
+finally:
+ client.stop()
+ merged_output = merge_output_shards(output)
+
+print(merged_output)
+```
+
+The CLI merges shards only after a successful pipeline run. This example uses a `finally` block so that workers stop before any available shards are merged, including after a pipeline failure.
+
+Configure stages individually when you need Python-only options:
+
+```python
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.synthetic.omni.io import HFDatasetImageReaderStage, JsonlSampleWriterStage
+from nemo_curator.stages.synthetic.omni.ocr_nemotron_v2 import OCRNemotronV2Stage
+from nemo_curator.stages.synthetic.omni.ocr_scoring_qa import OCRScoringQAStage
+from nemo_curator.tasks.ocr import OCRData
+
+pipeline = Pipeline(name="ocr-nemotron-custom")
+pipeline.add_stage(HFDatasetImageReaderStage(
+ dataset_name="lmms-lab/textvqa",
+ image_dir="/workspace/ocr-data/images",
+ split="validation",
+ limit=25,
+ task_type=OCRData,
+))
+pipeline.add_stage(OCRNemotronV2Stage(merge_level="word"))
+pipeline.add_stage(OCRScoringQAStage(
+ temperature=1.0,
+ max_tokens=16384,
+ min_bbox_match=5,
+ max_text_errors=0,
+ dense_dump_prob=0.05,
+ batch_size=16,
+ priority_mode=False,
+))
+pipeline.add_stage(JsonlSampleWriterStage(
+ output_path="/workspace/ocr-data/results/custom.jsonl",
+ valid_only=False,
+ image_parent="/workspace/ocr-data/images",
+))
+```
+
+`JsonlSampleWriterStage` defaults `valid_only=True` when constructed directly, while the tutorial CLI defaults `--valid-only` to false and passes that value explicitly. Set `valid_only=False` to retain invalid records for inspection or a later rerun.
+
+## 5. Read scored output
+
+```json
+{
+ "image_path": "000000.jpg",
+ "image_id": "000000",
+ "ocr_is_word_level": true,
+ "ocr_dense": [
+ {"bbox_2d": [108, 72, 492, 146], "text_content": "EMERGENCY EXIT", "valid": true, "bbox_match": 10, "text_errors": 0}
+ ],
+ "ocr_scoring_model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
+ "ocr_scoring_mode": "word",
+ "ocr_scoring_missing": [],
+ "conversation": {
+ "conversation": [
+ {"sender": "user", "fragments": [{"t": "image", "value": "000000.jpg"}, "What text is in the bounding box [108, 72, 492, 146]?"]},
+ {"sender": "assistant", "fragments": ["EMERGENCY EXIT"]}
+ ]
+ }
+}
+```
+
+Output also retains `ocr_scoring_prompt` and `ocr_scoring_response_raw`. These can be large with the 16,384-token response budget; remove them downstream if training does not require provenance.
+
+## Batching and failures
+
+- OCR has `batch_size=32`; image errors are caught individually.
+- Scoring QA has `batch_size=16`; each worker permits 10 concurrent requests.
+- Invalid inputs skip later model stages but reach the writer unless `--valid-only` is set.
+- An image with no OCR boxes skips scoring without being invalidated.
+- Empty/unparseable verifier output, no boxes passing thresholds, and optionally missing text invalidate the image.
+- Conversation choices are deterministic for a given framework task ID. Do not set `task_id` yourself.
+- Weights download once per node and load once per OCR worker.
+
+## Rerun safely
+
+Image extraction is idempotent by filename, but inference is not resumable:
+
+1. Use a new output basename or remove stale `_worker*.jsonl` shards.
+2. Preserve invalid rows initially so `image_id` and `error` can drive a targeted retry dataset.
+3. Do not treat cached JPEGs as completed OCR; reruns process every selected unique ID.
+4. After interruption, inspect worker shards before merging or discarding them.
+5. After success, verify record count, errors, bbox thresholds, missing-text rate, and conversation mix.
+
+## Related topics
+
+- [Nemotron OCR concepts and schemas](/curate-text/synthetic/nemotron-ocr)
+- [Synthetic data generation](/curate-text/synthetic)
+- [LLM client configuration](/curate-text/synthetic/llm-client)
+- [Image curation tutorials](/curate-images/tutorials)
+- [Install NeMo Curator](/get-started/installation)
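
A note on step 5 of "Rerun safely" in `nemotron-ocr/tutorial.mdx`: the post-run checks (record count, errors, missing-text rate, conversation presence) can be gathered in one pass over the merged JSONL. The sketch below is a minimal illustration using only the field names shown in the sample records (`error`, `ocr_dense`, `valid`, `ocr_scoring_missing`, `conversation`). The function name `summarize_records` and the sample `error` string are hypothetical, not taken from the pipeline.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def summarize_records(lines: Iterable[str]) -> Dict[str, int]:
    """Count basic quality signals over JSONL lines, one record per line."""
    stats: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        stats["records"] += 1
        # The writer drops None fields, so "error" is absent on clean records.
        if record.get("error"):
            stats["with_error"] += 1
        # An empty list means the verifier reported no missing regions.
        if record.get("ocr_scoring_missing"):
            stats["with_missing_text"] += 1
        if "conversation" in record:
            stats["with_conversation"] += 1
        boxes = record.get("ocr_dense", [])
        stats["boxes"] += len(boxes)
        stats["valid_boxes"] += sum(1 for box in boxes if box.get("valid"))
    return dict(stats)


# Two illustrative records; the error text is made up for this example.
sample_lines = [
    '{"image_id": "000000", "ocr_dense": [{"bbox_2d": [108, 72, 492, 146], '
    '"text_content": "EMERGENCY EXIT", "valid": true}], '
    '"ocr_scoring_missing": [], "conversation": {"conversation": []}}',
    '{"image_id": "000001", "ocr_dense": [], "error": "hypothetical failure"}',
]
summary = summarize_records(sample_lines)
```

To run it on real output, pass an open file handle, for example `summarize_records(open("/workspace/ocr-data/results/ocr-qa.jsonl"))`. Counters that never fire, such as `with_missing_text` in the sample, are simply absent from the returned dictionary, so read them with `.get(key, 0)`.
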