[codex] docs: expand Nemotron-Parse tuning guidance #2131

Open
lbliii wants to merge 2 commits into NVIDIA-NeMo:main from lbliii:codex/docs-nemotron-parse-tuning

Conversation

@lbliii lbliii commented Jun 29, 2026

Contributor

What changed

  • document stage-level engine_kwargs tuning with gpu_memory_utilization and max_num_batched_tokens
  • clarify which controls belong to the vLLM initializer versus vllm.LLM
  • explain PDFPartitioningStage fanout behavior with Ray Data and its relationship to pdfs_per_task
  • add a complete vLLM/HF metric reference and TaskPerfUtils aggregation example
  • show end-to-end pages/s and output-tokens/s calculations using pipeline wall time
  • distinguish startup port-collision retries from inference-engine resets
  • expand benchmark guidance without depending on internal nightly infrastructure
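
The wall-time throughput calculation named in the list above can be sketched as follows. This is a minimal illustration, not code from the guide: `throughput_from_wall_time` and its arguments are hypothetical names, and the totals are invented. The idea is that per-task durations overlap when tasks run in parallel, so the denominator should be the pipeline's end-to-end wall-clock time rather than a sum of per-task times.

```python
def throughput_from_wall_time(total_pages, total_output_tokens, wall_seconds):
    """Return (pages/s, output tokens/s) computed from end-to-end wall time.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return total_pages / wall_seconds, total_output_tokens / wall_seconds


# Invented totals: 1,200 pages and 900,000 output tokens in 60 s of wall time.
pages_per_s, tokens_per_s = throughput_from_wall_time(1200, 900_000, 60.0)
print(f"{pages_per_s:.1f} pages/s, {tokens_per_s:.0f} output tokens/s")
```

In practice the totals would come from summed inference metrics and the wall time from a timer wrapped around the whole pipeline run.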

Why

PR #2054 added vLLM engine passthrough, additive inference metrics, Ray Data fanout, and broader engine-startup retry handling. The published Nemotron-Parse guide described none of those behaviors, leaving users without a supported tuning or observability path.

User impact

Users can tune vLLM memory and batching, understand backend-specific metrics, measure throughput correctly under parallel execution, and diagnose retry exhaustion without masking non-retryable GPU or configuration errors.
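
To illustrate what "without masking non-retryable errors" means, here is a rough sketch of a bounded retry loop. It is not the implementation from PR #2054: `RetryableStartupError`, `call_with_retries`, the jitter range, and the reading of the limit as "retries after the first attempt" are all assumptions made for the example.

```python
import random
import time


class RetryableStartupError(RuntimeError):
    """Hypothetical stand-in for a transient failure such as a port collision."""


def call_with_retries(fn, max_retries=3, jitter_range=(0.0, 0.1), sleep=time.sleep):
    """Call fn, retrying only RetryableStartupError up to max_retries times.

    Any other exception (for example a configuration error) propagates
    immediately, so the retry loop cannot hide it.
    """
    retries = 0
    while True:
        try:
            return fn()
        except RetryableStartupError:
            retries += 1
            if retries > max_retries:
                raise  # retry budget exhausted: surface the last error
            # Short random delay so competing workers do not retry in lockstep.
            sleep(random.uniform(*jitter_range))
```

Catching only the narrow exception type is the design choice that keeps GPU or configuration failures visible on the first attempt.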

Validation

  • npm run check from fern/: 0 errors
  • fern docs broken-links: no errors in the changed page; 22 existing errors remain in older API-reference pages
  • git diff --check
  • targeted source tests could not import on this macOS host because NeMo Curator intentionally supports Linux only; the documented behaviors are covered by existing unit tests in test_stages.py and test_vllm_utils.py

Closes #2128
Parent tracking issue: #2118

Signed-off-by: Lawrence Lane <llane@nvidia.com>
copy-pr-bot Bot commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Jun 29, 2026
@lbliii lbliii marked this pull request as ready for review July 2, 2026 14:53
@lbliii lbliii requested a review from a team as a code owner July 2, 2026 14:53
@lbliii lbliii requested review from suiyoubi and removed request for a team July 2, 2026 14:53
greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This documentation-only PR expands the Nemotron-Parse PDF guide to cover behaviors added in PR #2054: stage-level engine_kwargs vLLM tuning, PDFPartitioningStage Ray Data fanout semantics, additive inference metrics with TaskPerfUtils aggregation, and a detailed vLLM retry-path breakdown.

  • Adds a "Tune the vLLM Engine" section with a complete pipeline composition example, a tuning control table, and a <Note> explaining engine_kwargs precedence over max_num_seqs/enforce_eager.
  • Adds an "Inspect Inference Metrics" section with a working code sample and a full metric reference table covering both the vLLM and HF backends.
  • Adds a "vLLM Retry Behavior" section distinguishing startup port-collision retries from inference-engine resets, and updates the retry description and default model path in the existing reference table.
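
The precedence rule described in the first bullet can be pictured with a small merge sketch. This is an assumption about the shape of the behavior, not the stage's code: `resolve_engine_kwargs` is a hypothetical helper and the numeric values are illustrative. Only the parameter names (`max_num_seqs`, `enforce_eager`, `gpu_memory_utilization`, `max_num_batched_tokens`) come from the PR text.

```python
def resolve_engine_kwargs(max_num_seqs, enforce_eager, engine_kwargs=None):
    """Merge stage-level settings with engine_kwargs; engine_kwargs wins on conflict.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    resolved = {"max_num_seqs": max_num_seqs, "enforce_eager": enforce_eager}
    resolved.update(engine_kwargs or {})
    return resolved


resolved = resolve_engine_kwargs(
    max_num_seqs=64,
    enforce_eager=False,
    engine_kwargs={
        "gpu_memory_utilization": 0.85,   # illustrative value
        "max_num_batched_tokens": 8192,   # illustrative value
        "max_num_seqs": 32,               # overrides the stage-level 64
    },
)
print(resolved["max_num_seqs"])
```

Here the stage-level `max_num_seqs` is replaced by the value passed through `engine_kwargs`, while `enforce_eager` keeps its stage-level setting because no override was supplied.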

Confidence Score: 5/5

Documentation-only change with no runtime impact; all added descriptions, metric names, default values, and code examples were verified against the implementation.

Every factual claim in the new sections was cross-checked against the source: metric key construction in TaskPerfUtils.aggregate_task_metrics, StagePerfStats.items() custom prefix, default values in create_vllm_llm and NemotronParseInferenceStage, the 9000-token SamplingParams limit, retry counts and jitter range, the IS_FANOUT_STAGE marker on PDFPartitioningStage, and the InterleavedParquetWriterStage import path. No discrepancies were found.
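
For readers unfamiliar with the aggregated key shape, the sketch below mimics it in plain Python. It does not call `TaskPerfUtils`; `aggregate_custom_metrics`, the metric name `num_pages`, and the use of population standard deviation are assumptions. The key pattern follows the `task_nemotron_parse_inference_custom.<metric>_sum/mean/std` form shown in the sequence diagram.

```python
import statistics
from collections import defaultdict


def aggregate_custom_metrics(per_task_metrics, stage_name, prefix="task"):
    """Aggregate per-task metrics into '<prefix>_<stage>_custom.<metric>_<agg>' keys.

    Hypothetical stand-in for illustration; the real helper may differ.
    """
    values = defaultdict(list)
    for metrics in per_task_metrics:
        for name, value in metrics.items():
            values[name].append(float(value))

    aggregated = {}
    for name, vals in values.items():
        base = f"{prefix}_{stage_name}_custom.{name}"
        aggregated[f"{base}_sum"] = sum(vals)
        aggregated[f"{base}_mean"] = statistics.fmean(vals)
        # Population std is an assumption made for this sketch.
        aggregated[f"{base}_std"] = statistics.pstdev(vals)
    return aggregated


# Invented per-task values for a made-up metric name.
summary = aggregate_custom_metrics(
    [{"num_pages": 10}, {"num_pages": 30}],
    stage_name="nemotron_parse_inference",
)
for key in sorted(summary):
    print(key, summary[key])
```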

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/load-data/nemotron-parse-pdf.mdx | Documentation expansion adding three new sections (vLLM tuning, Ray Data fanout, inference metrics) and updating retry/model-path descriptions. All metric names, key structures, default values, and API shapes verified against the implementation. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User
    participant Pipeline
    participant Partition as PDFPartitioningStage
    participant Preprocess as PDFPreprocessStage
    participant Inference as NemotronParseInferenceStage
    participant Postprocess as NemotronParsePostprocessStage
    participant TPU as TaskPerfUtils

    User->>Pipeline: run(executor)
    Pipeline->>Partition: process(EmptyTask)
    Note over Partition: 1 worker, IS_FANOUT_STAGE=True
    Partition-->>Pipeline: list of FileGroupTask (one per pdfs_per_task group)
    Pipeline->>Preprocess: process(FileGroupTask) in parallel blocks
    Preprocess-->>Pipeline: InterleavedBatch with rendered page images
    Pipeline->>Inference: process(InterleavedBatch) on GPU
    Note over Inference: create_vllm_llm retries port collisions (max_port_retries=3)
    Note over Inference: vLLM generate retried up to 3x with engine reset on failure
    Inference-->>Pipeline: InterleavedBatch with raw model text + custom metrics
    Pipeline->>Postprocess: process(InterleavedBatch)
    Postprocess-->>Pipeline: InterleavedBatch with interleaved rows
    Pipeline-->>User: list of Task with _stage_perf
    User->>TPU: "aggregate_task_metrics(results, prefix="task")"
    TPU-->>User: "task_nemotron_parse_inference_custom.<metric>_sum/mean/std"
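
The fanout step in the diagram ("one per pdfs_per_task group") amounts to chunking the input list. A minimal sketch, assuming simple in-order grouping: `partition_pdfs` and the file names are hypothetical, and the real `PDFPartitioningStage` may group files differently.

```python
def partition_pdfs(pdf_paths, pdfs_per_task):
    """Split pdf_paths into consecutive groups of at most pdfs_per_task items.

    Hypothetical helper for illustration; each group stands in for one task.
    """
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    return [
        pdf_paths[i : i + pdfs_per_task]
        for i in range(0, len(pdf_paths), pdfs_per_task)
    ]


groups = partition_pdfs(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], pdfs_per_task=2)
print(len(groups), "tasks:", groups)
```

A smaller `pdfs_per_task` yields more, smaller tasks for downstream stages to process in parallel; a larger value yields fewer, larger ones.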

Reviews (2): Last reviewed commit: "Merge branch 'main' into codex/docs-nemo..."

Development

Successfully merging this pull request may close these issues.

[Docs] Document Nemotron-Parse inference metrics, engine tuning, and Ray fanout
