diff --git a/fern/versions/main.yml b/fern/versions/main.yml index 1c94ec0bac..494e1a4e72 100644 --- a/fern/versions/main.yml +++ b/fern/versions/main.yml @@ -309,6 +309,15 @@ navigation: - page: Multilingual Q&A path: ./main/pages/curate-text/synthetic/multilingual-qa.mdx slug: multilingual-qa + - section: Nemotron OCR + slug: nemotron-ocr + contents: + - page: Overview + path: ./main/pages/curate-text/synthetic/nemotron-ocr/index.mdx + slug: "" + - page: Tutorial + path: ./main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx + slug: tutorial - section: Nemotron-CC slug: nemotron-cc contents: diff --git a/fern/versions/main/pages/curate-images/index.mdx b/fern/versions/main/pages/curate-images/index.mdx index b6d29f6406..d643803dcf 100644 --- a/fern/versions/main/pages/curate-images/index.mdx +++ b/fern/versions/main/pages/curate-images/index.mdx @@ -117,6 +117,13 @@ deduplication semantic clustering + + +Generate word-level OCR labels and verifier-scored visual question-answering conversations. +ocr multimodal synthetic-data + + + ### Pipeline Management diff --git a/fern/versions/main/pages/curate-images/tutorials/index.mdx b/fern/versions/main/pages/curate-images/tutorials/index.mdx index 98834a0fee..d50f0e433e 100644 --- a/fern/versions/main/pages/curate-images/tutorials/index.mdx +++ b/fern/versions/main/pages/curate-images/tutorials/index.mdx @@ -28,6 +28,13 @@ embeddings clustering + +Extract word boxes from image datasets and generate multimodal bounding-box QA conversations. 
+nemotron-ocr +multimodal +synthetic-data + + ## Resources diff --git a/fern/versions/main/pages/curate-text/synthetic/index.mdx b/fern/versions/main/pages/curate-text/synthetic/index.mdx index e6fe404ab0..da2621a893 100644 --- a/fern/versions/main/pages/curate-text/synthetic/index.mdx +++ b/fern/versions/main/pages/curate-text/synthetic/index.mdx @@ -125,6 +125,13 @@ quickstart tutorial + +Turn images into word-level OCR annotations and optionally generate verifier-scored multimodal QA conversations. +multimodal +ocr +visual-qa + + Declarative data generation with structured columns and NDD-backed Nemotron-CC stages ndd diff --git a/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx b/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx index db2d5082a0..d872e21146 100644 --- a/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx +++ b/fern/versions/main/pages/curate-text/synthetic/llm-client.mdx @@ -221,4 +221,5 @@ client = AsyncOpenAIClient( ## Next Steps - [Multilingual Q&A](/curate-text/synthetic/multilingual-qa): Generate multilingual Q&A pairs +- [Nemotron OCR](/curate-text/synthetic/nemotron-ocr): Use the specialized streaming `NVInferenceClient` for multimodal OCR verification and QA generation - [Nemotron-CC](/curate-text/synthetic/nemotron-cc): Advanced text transformation pipelines diff --git a/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx new file mode 100644 index 0000000000..368ee135ac --- /dev/null +++ b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/index.mdx @@ -0,0 +1,193 @@ +--- +description: "Understand the Nemotron OCR synthetic-data pipeline, its image and OCR schemas, verifier scoring, QA conversation generation, and output lifecycle" +categories: ["synthetic-data"] +tags: ["nemotron-ocr", "multimodal", "ocr", "visual-question-answering", "synthetic-data"] +personas: ["mle-focused", 
"data-scientist-focused"] +difficulty: "advanced" +content_type: "concept" +modality: "image-only" +--- + +# Nemotron OCR Synthetic Data + +The Nemotron OCR pipeline converts images into word-level OCR annotations and, optionally, scored visual question-answering conversations. It runs NemotronOCR-v2 locally on a GPU, can ask Nemotron-Nano-Omni through the NVIDIA Inference API to verify every bounding box, and writes one JSONL record per image. + +```mermaid +flowchart LR + A["HF Hub, saved dataset, or image folder"] --> B["Extract and cache JPEG images"] + B --> C["NemotronOCR-v2 word OCR"] + C --> D{"Run scoring QA?"} + D -->|"No"| E["OCR JSONL"] + D -->|"Yes"| F["NVIDIA API bbox verification"] + F --> G["Filter boxes and generate QA"] + G --> H["Conversation JSONL"] +``` + + +The merged API names are `OCRNemotronV2Stage`, `OCRScoringQAStage`, `HFDatasetImageReaderStage`, `JsonlSampleWriterStage`, `NVInferenceClient`, `OCRDenseItem`, `OCRData`, and `OCRConversationData`. Earlier development names such as `NemotronOCRV2Stage`, `OCRConversationalizeStage`, `OCRDenseQAStage`, `NVInferenceModel`, `OCRDenseWord`, `TarImageReader`, `SkipProcessedStage`, and `ResultWriterStage` are not shipped APIs. Conversation and dense-QA behavior is implemented by helper functions called inside `OCRScoringQAStage`. + + +## Runtime architecture + +| Component | Runs where | Resources and defaults | Responsibility | +| --- | --- | --- | --- | +| `HFDatasetImageReaderStage` | Driver/fan-out stage | 1 CPU | Loads a Hugging Face dataset, deduplicates image IDs, caches RGB JPEGs, and creates `ImageSampleTask[OCRData]` tasks. | +| `OCRNemotronV2Stage` | Curator worker | 8 CPUs, 1 GPU, stage batch size 32 | Loads NemotronOCR-v2 and adds normalized word boxes and text. Each image error is isolated. 
| +| `OCRScoringQAStage` | Curator worker plus NVIDIA API | 1 CPU, stage batch size 16 | Sends the image and OCR boxes to a remote verifier, applies quality thresholds, and creates a deterministic conversation. | +| `JsonlSampleWriterStage` | Curator worker | 2 CPUs | Appends records to worker-specific JSONL shards. | +| `merge_output_shards` | Driver | CPU and shared filesystem | Concatenates shards into the requested JSONL and deletes shards after a successful merge. | + +The included runner starts a Ray cluster and uses `XennaExecutor`; it does not expose a Ray Data backend option. All workers must see the extracted image directory and output path through the same filesystem. + +## Installation and model access + +Use the NeMo Curator container for this pipeline. In addition to CUDA and the normal SDG/image dependencies, the repository Dockerfile builds the `nemotron-ocr` C++/CUDA extension from the `nvidia/nemotron-ocr-v2` repository. Installing `sdg_cuda12` or `image_cuda12` alone does not install that extension. + +```bash +docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }} + +docker run --gpus all -it --rm \ + -v "$PWD:/workspace" \ + -w /workspace \ + nvcr.io/nvidia/nemo-curator:{{ container_version }} +``` + +Inside the container: + +```bash +source /opt/venv/env.sh +python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('Nemotron OCR ready')" +``` + +The default OCR stage downloads `nvidia/nemotron-ocr-v2` from Hugging Face and uses its `v2_multilingual` subdirectory. Mount a persistent Hugging Face cache, or pass a pre-downloaded `v2_multilingual` directory with `--nemotron-model-dir`. + +The repository Dockerfile compiles for CUDA architectures 8.0, 8.6, 8.9, and 9.0. GPU memory requirements depend on image size, model version, and worker concurrency; the stage does not enforce a VRAM minimum. + +The optional verifier reads `NVINFERENCE_API_KEY` from each worker's environment during setup. 
Do not place the key in source, CLI arguments, or serialized stage configuration. Obtain a key from [NVIDIA Build](https://build.nvidia.com/settings/api-keys) and ensure Ray workers inherit it. + +## Input sources + +`HFDatasetImageReaderStage` supports three source forms: + +| Source | Detection and loading behavior | +| --- | --- | +| Hugging Face Hub dataset | A non-local dataset name is passed to `load_dataset`, with optional config and split. A limit becomes a split slice such as `train[:100]`. | +| Dataset saved to disk | An existing directory containing `dataset_info.json` is loaded with `load_from_disk`. If it yields a dataset dictionary, the requested split must exist. | +| Local image folder | Any other existing directory is loaded through the Hugging Face `imagefolder` builder. | + +The image column may contain a PIL image, raw bytes, a Hugging Face Image dictionary with `bytes` or `path`, or a valid file-path string. Images are converted to RGB and saved as `/.jpg`. + +If `id_column` is unset, zero-padded row indexes are used. If set, only the first row for each ID is processed. IDs must be safe as local filenames. Hub and image-folder limits apply before ID deduplication, so the final number of unique tasks can be lower than the limit. + + +The current OCR pipeline has no tar reader, parquet reader, or skip-processed stage. To use those sources, implement an upstream stage that emits `ImageSampleTask[OCRData]`. Built-in resume behavior is limited to reusing already-extracted JPEG files. + + +## OCR data model + +`OCRData` extends `ImageTaskData` and carries the reader, OCR, and verifier fields. When `OCRScoringQAStage` runs, it converts the payload to `OCRConversationData`, an `OCRData` subclass that adds the `conversation` field while retaining the base fields: + +| Field | Producer | Meaning | +| --- | --- | --- | +| `image_path`, `image_id` | Reader | Cached JPEG and its dataset ID. 
| +| `is_valid`, `error` | All stages | Internal validity and last error. The writer removes `is_valid` from JSONL. | +| `ocr_is_word_level` | OCR/verifier | Defaults true; changes to false when the verifier identifies line boxes. | +| `ocr_dense` | OCR | `OCRDenseItem` values on a normalized 0–1000 grid. | +| `ocr_scoring_*` | Scoring QA | Prompt, model, raw response, mode, and missing-region audit data. | +| `conversation` | `OCRConversationData` from Scoring QA | Alternating user/assistant messages, with the image on the first user turn. This field is not part of the base `OCRData` dataclass. | + +Each `OCRDenseItem` contains `bbox_2d`, `text_content`, optional `quad`, `valid`, `bbox_match`, and `text_errors`. Nemotron output is scaled to integer 0–1000 coordinates, and inverted vertical bounds are sorted. + + +Stored OCR boxes use `[x0, y0, x1, y1]`. The verifier prompt and `ocr_scoring_missing` use `[y0, x0, y1, x1]` to match its protocol. Reorder coordinates before combining these fields. + + +## Bounding-box verification + +`OCRScoringQAStage` makes one remote verifier request per image: + +| Setting | Default | Behavior | +| --- | --- | --- | +| `model_id` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | NVIDIA Inference API model. | +| `temperature` | `1.0` | Verifier sampling temperature. | +| `max_tokens` | `16384` | Allows reasoning plus final JSON; only streamed final content is retained. | +| `min_bbox_match` | `5` | Inclusive minimum score for a valid box. | +| `max_text_errors` | `0` | Inclusive maximum error count. | +| `fail_on_missing_text` | `false` | Optionally invalidate the image when text regions are missing. | +| `dense_dump_prob` | `0.05` | Dense single-turn probability when no text is missing. | +| `batch_size` | `16` | Image tasks passed to the model stage. | +| `priority_mode` | `false` | Adds the NVIDIA priority header when enabled through Python. | + +The verifier returns `ocr_mode`, indexed box scores, and `missing_text`. 
Missing or nonnumeric scores invalidate a box. If every original box is invalid, the image is invalid. Empty or unparseable responses also invalidate the image. + +`NVInferenceClient` uses the NVIDIA integration endpoint, reads `NVINFERENCE_API_KEY`, streams final-answer content, allows 10 concurrent requests per worker, uses a 120-second timeout, and retries rate-limit and connection failures three times with exponential backoff and jitter. + +## QA interaction shapes + +The source defines five QA type constants. Four tag and balance the multi-turn families; the fifth, `dense_dump`, names the separate single-turn all-words path. Single-occurrence and repeated-text routing within the four tagged families produces six user-visible shapes: + +| Shape | Prompt gives | Answer gives | +| --- | --- | --- | +| Box to text | One box | Text inside it | +| Point to text | One box center | Text at that point | +| Text to one box | Text occurring once | One box | +| Text to many boxes | Repeated text | Every matching box | +| Text to one point | Text occurring once | One center point | +| Text to many points | Repeated text | Every matching center point | + +Generation is seeded from the framework task ID. At most 100 multi-turn QA pairs are balanced across the four tagged families. When five or more OCR items are invalid, text-to-location questions are disabled. + +When no text is missing, `dense_dump_prob` can select a single turn listing every valid word. 
It combines one of 33 instruction phrasings with one of 11 answer formats: + +| # | Dense answer format | +| --- | --- | +| 1 | Plain JSON list with `bbox_2d` and `text_content` | +| 2 | Fenced JSON code block | +| 3 | JSON with an explicit key-schema instruction | +| 4 | JSON with an explicit per-item object example | +| 5 | JSON-only response with no surrounding text | +| 6 | JSON with explicit `x_min, y_min, x_max, y_max` semantics | +| 7 | One `text [bbox]` item per line | +| 8 | One `[bbox]: text` item per line | +| 9 | One `text (x0, y0, x1, y1)` item per line | +| 10 | Markdown table | +| 11 | Tab-separated text and coordinates | + +## Conversation schema + +Within each output JSONL record, the top-level `conversation` field contains a wrapper object whose inner `conversation` field is the ordered turn list: + +```json +{ + "conversation": { + "conversation": [ + {"sender": "user", "fragments": [{"t": "image", "value": "000042.jpg"}, "What text is in the bounding box [120, 90, 440, 160]?"]}, + {"sender": "assistant", "fragments": ["PROJECT STATUS"]}, + {"sender": "user", "fragments": ["Point at the text 'TOTAL'. Answer with [x, y]."]}, + {"sender": "assistant", "fragments": ["[732, 518]"]} + ] + } +} +``` + +The image value is the cached filename, not its bytes. Use `--image-parent` to make `image_path` portable and distribute the cached images with the JSONL. + +## Failure and restart model + +- Cached JPEGs are reused by filename without verifying their content. +- OCR catches exceptions per image; prompt/response errors are also isolated per image. A generation or response-count failure can invalidate a whole scoring batch. +- Remote failures become empty responses after retries and invalidate the affected image. +- Worker JSONL records are flushed immediately. `valid_only=True` skips invalid tasks. +- The CLI defaults to `valid_only=False`, retaining invalid records and their `error` field. 
+- The writer removes `is_valid` and all `None` fields but retains empty lists, strings, and `false`. +- There is no processed-ID checkpoint. Reruns repeat OCR and API calls even when JPEGs are cached. +- Stale `*_worker*.jsonl` files are not cleaned before a run. Remove them or use a fresh output basename after interruption. + +Continue with the [runnable tutorial](/curate-text/synthetic/nemotron-ocr/tutorial). + +## Related topics + +- [Synthetic data generation](/curate-text/synthetic) +- [LLM client configuration](/curate-text/synthetic/llm-client) +- [Image curation](/curate-images) +- [Execution backends](/reference/infra/execution-backends) diff --git a/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx new file mode 100644 index 0000000000..bff74d0dbb --- /dev/null +++ b/fern/versions/main/pages/curate-text/synthetic/nemotron-ocr/tutorial.mdx @@ -0,0 +1,251 @@ +--- +description: "Run NemotronOCR-v2 over Hugging Face image datasets and optionally generate bbox-scored multimodal QA conversations with NVIDIA Inference" +categories: ["tutorials"] +tags: ["nemotron-ocr", "ocr", "multimodal", "synthetic-data", "nvidia-inference-api"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "advanced" +content_type: "tutorial" +modality: "image-only" +--- + +# Run the Nemotron OCR Pipeline + +Run `tutorials/synthetic/omni/ocr_pipeline.py` over a Hugging Face dataset. Start with local OCR, inspect the boxes, and then enable remote verification and QA conversation generation. + +## 1. 
Start the GPU container + +From the repository root: + +```bash +docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }} + +docker run --gpus all -it --rm \ + -v "$PWD:/workspace" \ + -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \ + -w /workspace \ + nvcr.io/nvidia/nemo-curator:{{ container_version }} +``` + +Inside the container: + +```bash +source /opt/venv/env.sh +nvidia-smi +python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('Nemotron OCR ready')" +``` + +For private or gated data, authenticate through the environment or Hugging Face CLI. Do not commit tokens. + +## 2. Run a small OCR-only sample + +```bash +python -m tutorials.synthetic.omni.ocr_pipeline \ + --hf-dataset lmms-lab/textvqa \ + --hf-image-dir /workspace/ocr-data/images \ + --hf-split validation \ + --hf-image-column image \ + --hf-limit 25 \ + --output-path /workspace/ocr-data/results/ocr.jsonl \ + --image-parent /workspace/ocr-data/images +``` + +The runner starts Ray, uses Xenna, writes `ocr_worker.jsonl` shards, and merges them to the requested file after success. + + +`--output-path` is a base file path, despite the current help text describing it as a directory. Include a `.jsonl` filename. Without a suffix, the writer creates `_worker*.jsonl` and merges them to `.jsonl` beside the supplied path. + + +Inspect one record: + +```bash +head -n 1 /workspace/ocr-data/results/ocr.jsonl +``` + +```json +{ + "image_path": "000000.jpg", + "image_id": "000000", + "ocr_is_word_level": true, + "ocr_dense": [ + {"bbox_2d": [108, 72, 492, 146], "text_content": "EMERGENCY EXIT", "valid": true} + ] +} +``` + +## 3. 
Enable scoring and QA + +Create an NVIDIA API key at [NVIDIA Build](https://build.nvidia.com/settings/api-keys), then export it without putting the value in shell history: + +```bash +read -s NVINFERENCE_API_KEY +export NVINFERENCE_API_KEY +``` + +Each image causes one remote verifier request, so validate a small sample first: + +```bash +python -m tutorials.synthetic.omni.ocr_pipeline \ + --hf-dataset lmms-lab/textvqa \ + --hf-image-dir /workspace/ocr-data/images \ + --hf-split validation \ + --hf-image-column image \ + --hf-limit 25 \ + --output-path /workspace/ocr-data/results/ocr-qa.jsonl \ + --image-parent /workspace/ocr-data/images \ + --run-scoring-qa \ + --scoring-qa-min-bbox-match 5 \ + --scoring-qa-max-text-errors 0 \ + --scoring-qa-dense-dump-prob 0.05 +``` + +Add `--scoring-qa-fail-on-missing-text` to invalidate images with uncovered text. Otherwise missing regions are recorded, dense dump is disabled for that image, and QA uses boxes that passed scoring. Add `--valid-only` to exclude invalid images; without it, failed records remain with an `error` string. + +## CLI reference + +| Argument | Default | Description | +| --- | --- | --- | +| `--hf-dataset` | Required | Hub dataset ID or local dataset/image-folder path. | +| `--hf-image-dir` | Required | RGB JPEG cache. Existing filenames are reused. | +| `--hf-split` | `train` | Dataset split. | +| `--hf-config` | None | Optional Hub configuration/subset. | +| `--hf-image-column` | `image` | Column containing the image. | +| `--hf-id-column` | Row index | Filename/deduplication column; use safe unique values. | +| `--hf-limit` | None | Records selected before duplicate IDs are removed. | +| `--output-path` | Required | Base JSONL file used for shards and final merge. | +| `--image-parent` | None | Makes descendant image paths relative in output. | +| `--valid-only` | False | Skip invalid tasks rather than retaining errors. 
| +| `--nemotron-model-dir` | HF download | Local `v2_multilingual` model directory. | +| `--run-scoring-qa` | False | Enable remote scoring and conversation generation. | +| `--scoring-qa-model-id` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | Verifier model. | +| `--scoring-qa-min-bbox-match` | `5` | Inclusive minimum box-fit score. | +| `--scoring-qa-max-text-errors` | `0` | Inclusive maximum text errors. | +| `--scoring-qa-fail-on-missing-text` | False | Invalidate images with missing regions. | +| `--scoring-qa-dense-dump-prob` | `0.05` | Dense-turn probability for complete OCR. | + +OCR `merge_level`, verifier generation/batch settings, priority mode, and API client concurrency are Python-only settings. + +## 4. Use the Python API + +```python +from pathlib import Path + +from nemo_curator.backends.xenna import XennaExecutor +from nemo_curator.core.client import RayClient +from nemo_curator.stages.synthetic.omni.io import merge_output_shards +from tutorials.synthetic.omni.ocr_pipeline import create_hf_ocr_pipeline + +output = Path("/workspace/ocr-data/results/ocr-qa.jsonl") +pipeline = create_hf_ocr_pipeline( + dataset_name="lmms-lab/textvqa", + image_dir=Path("/workspace/ocr-data/images"), + output_path=output, + hf_split="validation", + hf_limit=25, + image_parent=Path("/workspace/ocr-data/images"), + valid_only=False, + run_scoring_qa=True, + scoring_qa_min_bbox_match=5, + scoring_qa_max_text_errors=0, + scoring_qa_dense_dump_prob=0.05, +) + +client = RayClient() +client.start() +try: + pipeline.run(XennaExecutor()) +finally: + client.stop() + merged_output = merge_output_shards(output) + +print(merged_output) +``` + +The CLI merges shards only after a successful pipeline run. This example uses an outer `finally` so workers stop before any available shards are merged, including after a pipeline failure. 
+ +Configure stages individually when you need Python-only options: + +```python +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.synthetic.omni.io import HFDatasetImageReaderStage, JsonlSampleWriterStage +from nemo_curator.stages.synthetic.omni.ocr_nemotron_v2 import OCRNemotronV2Stage +from nemo_curator.stages.synthetic.omni.ocr_scoring_qa import OCRScoringQAStage +from nemo_curator.tasks.ocr import OCRData + +pipeline = Pipeline(name="ocr-nemotron-custom") +pipeline.add_stage(HFDatasetImageReaderStage( + dataset_name="lmms-lab/textvqa", + image_dir="/workspace/ocr-data/images", + split="validation", + limit=25, + task_type=OCRData, +)) +pipeline.add_stage(OCRNemotronV2Stage(merge_level="word")) +pipeline.add_stage(OCRScoringQAStage( + temperature=1.0, + max_tokens=16384, + min_bbox_match=5, + max_text_errors=0, + dense_dump_prob=0.05, + batch_size=16, + priority_mode=False, +)) +pipeline.add_stage(JsonlSampleWriterStage( + output_path="/workspace/ocr-data/results/custom.jsonl", + valid_only=False, + image_parent="/workspace/ocr-data/images", +)) +``` + +`JsonlSampleWriterStage` defaults `valid_only=True` when constructed directly, while the tutorial CLI defaults `--valid-only` to false and passes that value explicitly. Set `valid_only=False` to retain invalid records for inspection or a later rerun. + +## 5. 
Read scored output + +```json +{ + "image_path": "000000.jpg", + "image_id": "000000", + "ocr_is_word_level": true, + "ocr_dense": [ + {"bbox_2d": [108, 72, 492, 146], "text_content": "EMERGENCY EXIT", "valid": true, "bbox_match": 10, "text_errors": 0} + ], + "ocr_scoring_model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning", + "ocr_scoring_mode": "word", + "ocr_scoring_missing": [], + "conversation": { + "conversation": [ + {"sender": "user", "fragments": [{"t": "image", "value": "000000.jpg"}, "What text is in the bounding box [108, 72, 492, 146]?"]}, + {"sender": "assistant", "fragments": ["EMERGENCY EXIT"]} + ] + } +} +``` + +Output also retains `ocr_scoring_prompt` and `ocr_scoring_response_raw`. These can be large with the 16,384-token response budget; remove them downstream if training does not require provenance. + +## Batching and failures + +- OCR has `batch_size=32`; image errors are caught individually. +- Scoring QA has `batch_size=16`; each worker permits 10 concurrent requests. +- Invalid inputs skip later model stages but reach the writer unless `--valid-only` is set. +- No OCR boxes skips scoring without invalidating the record. +- Empty/unparseable verifier output, no boxes passing thresholds, and optionally missing text invalidate the image. +- Conversation choices are deterministic for framework task identity. Do not set `task_id` yourself. +- Weights download once per node and load once per OCR worker. + +## Rerun safely + +Image extraction is idempotent by filename, but inference is not resumable: + +1. Use a new output basename or remove stale `_worker*.jsonl` shards. +2. Preserve invalid rows initially so `image_id` and `error` can drive a targeted retry dataset. +3. Do not treat cached JPEGs as completed OCR; reruns process every selected unique ID. +4. After interruption, inspect worker shards before merging or discarding them. +5. After success, verify record count, errors, bbox thresholds, missing-text rate, and conversation mix. 
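The post-run checks in the list above can be sketched with the standard library alone. This is a minimal illustration, not part of the pipeline: `summarize_ocr_jsonl` is a hypothetical helper name, and it assumes the merged file follows the record shape shown in this tutorial (`image_id`, `error`, `ocr_dense` items with `bbox_match` and `text_errors`, `ocr_scoring_missing`, and `conversation`). It treats any non-empty `error` value as a failed record and re-applies the default thresholds only as a sanity check.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_ocr_jsonl(path, min_bbox_match=5, max_text_errors=0):
    """Count records, errors, and passing boxes; collect image IDs to retry.

    Hypothetical helper: field names are taken from the sample records in
    this tutorial, and the thresholds mirror the documented CLI defaults.
    """
    stats = Counter()
    retry_ids = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            stats["records"] += 1
            # Assumption: a non-empty `error` marks a failed record.
            if record.get("error"):
                stats["errors"] += 1
                retry_ids.append(record.get("image_id"))
            if record.get("ocr_scoring_missing"):
                stats["missing_text"] += 1
            if record.get("conversation"):
                stats["with_conversation"] += 1
            for item in record.get("ocr_dense") or []:
                stats["boxes"] += 1
                score = item.get("bbox_match")
                errors = item.get("text_errors")
                # Boxes without numeric scores (OCR-only runs) never pass.
                if (
                    isinstance(score, (int, float))
                    and isinstance(errors, (int, float))
                    and score >= min_bbox_match
                    and errors <= max_text_errors
                ):
                    stats["boxes_passing"] += 1
    return stats, retry_ids
```

Calling `summarize_ocr_jsonl("/workspace/ocr-data/results/ocr-qa.jsonl")` returns a `Counter` and a list of IDs. The list could seed a targeted retry dataset, and missing counters read as zero because the result is a `Counter`.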
+ +## Related topics + +- [Nemotron OCR concepts and schemas](/curate-text/synthetic/nemotron-ocr) +- [Synthetic data generation](/curate-text/synthetic) +- [LLM client configuration](/curate-text/synthetic/llm-client) +- [Image curation tutorials](/curate-images/tutorials) +- [Install NeMo Curator](/get-started/installation)