[codex] docs: complete translation configuration reference by lbliii · Pull Request #2132 · NVIDIA-NeMo/Curator

lbliii · 2026-06-29T20:17:47Z

What changed

add a complete, source-verified TranslationStage parameter reference covering inputs, segmentation, inference, FAITH evaluation, output, and resume controls
document validation rules and interactions among output modes, score merging, message reconstruction, dry runs, and health checks
explain skip/restore semantics, including how pre-translated rows interact with FAITH filtering and output formatting
add built-in Google, AWS, and NMT backend configuration details
add a non-LLM translation example that uses a separate LLM for FAITH scoring
document custom translation and FAITH prompt schemas, placeholders, escaping, and dry-run behavior

Why

The translation controls introduced in #2038 were not fully represented in the user guide. Users had to inspect implementation details to discover defaults, constraints, backend requirements, and the behavior of resumed translation runs.

Impact

Users can configure and validate experimental translation pipelines directly from the guide, including mixed-provider translation and quality evaluation workflows. This is a documentation-only change; runtime behavior is unchanged.

Validation

cd fern && npm run check — passes with 0 errors
cd fern && fern docs broken-links — changed page has no broken links; the command reports 22 pre-existing errors in older API-reference pages
git diff --check — passes

Closes #2126

Parent workstream: #2118

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-29T20:17:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T14:57:03Z

Greptile Summary

This PR adds a complete parameter reference for TranslationStage to the translation user guide, filling a gap left by the feature's initial introduction in #2038. The documentation is source-verified against the implementation and accurately covers all constructor parameters, validation rules, backend configuration, custom prompt schemas, and resume semantics.

TranslationStage reference tables — all defaults, types, and constraints were confirmed against pipeline.py, format_translation_output.py, and the backend files; no discrepancies found.
Skip/restore semantics — the new prose correctly explains that skipped rows bypass FAITH filtering and are restored after threshold filtering, with appropriate defaults for missing columns.
Non-LLM + FAITH example — the Google/OpenAI split-provider snippet is well-structured and the concurrency sizing note is accurate.

Confidence Score: 5/5

Documentation-only change with no runtime impact; all new content was verified against the source implementation.

Every new parameter default, validation rule, and behavioral claim was cross-checked against pipeline.py, format_translation_output.py, the three backend files, faith.py, and utils/metadata.py. No inaccuracies were found. The prose correctly describes the raw-mode + reconstruct_messages failure mode, the skip/restore ordering relative to FAITH filtering, and the backend-specific constructor requirements.

No files require special attention.

Important Files Changed

Filename	Overview
fern/versions/main/pages/curate-text/process-data/language-management/translation.mdx	Adds ~170 lines of source-verified documentation: parameter reference tables, validation interaction notes, built-in backend config, skip/restore semantics, non-LLM FAITH example, and custom prompt schema. All claims verified against pipeline.py, format_translation_output.py, faith.py, and the three backend files.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input DocumentBatch] --> B{skip_translated?}
    B -- Yes --> C[SkipExistingTranslationsStage\nremoves pre-translated rows]
    B -- No --> D[SegmentationStage\ncoarse or fine]
    C --> D
    D --> E[SegmentTranslationStage\nllm / google / aws / nmt]
    E --> F{enable_faith_eval?}
    F -- Yes --> G[FaithEvalFilter\nscores segments, filter_enabled=False]
    F -- No --> H[ReassemblyStage\ncollapse to doc-level rows]
    G --> H
    H --> I{enable_faith_eval AND filter_enabled?}
    I -- Yes --> J[FaithThresholdFilterStage\ndrop rows below faith_avg threshold]
    I -- No --> K{skip_translated?}
    J --> K
    K -- Yes --> L[RestoreSkippedRowsStage\nfill defaults, sort to original order]
    K -- No --> M{needs_formatting?}
    L --> M
    M -- output_mode != replaced OR reconstruct_messages --> N[FormatTranslationOutputStage\nbuild translation_metadata / translated_messages]
    M -- replaced and no reconstruct --> O[Output DocumentBatch]
    N --> P{merge_scores AND enable_faith_eval?}
    P -- Yes --> Q[MergeFaithScoresStage\nmerge faith_scores into translation_metadata]
    P -- No --> O
    Q --> O

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[Input DocumentBatch] --> B{skip_translated?}
    B -- Yes --> C[SkipExistingTranslationsStage\nremoves pre-translated rows]
    B -- No --> D[SegmentationStage\ncoarse or fine]
    C --> D
    D --> E[SegmentTranslationStage\nllm / google / aws / nmt]
    E --> F{enable_faith_eval?}
    F -- Yes --> G[FaithEvalFilter\nscores segments, filter_enabled=False]
    F -- No --> H[ReassemblyStage\ncollapse to doc-level rows]
    G --> H
    H --> I{enable_faith_eval AND filter_enabled?}
    I -- Yes --> J[FaithThresholdFilterStage\ndrop rows below faith_avg threshold]
    I -- No --> K{skip_translated?}
    J --> K
    K -- Yes --> L[RestoreSkippedRowsStage\nfill defaults, sort to original order]
    K -- No --> M{needs_formatting?}
    L --> M
    M -- output_mode != replaced OR reconstruct_messages --> N[FormatTranslationOutputStage\nbuild translation_metadata / translated_messages]
    M -- replaced and no reconstruct --> O[Output DocumentBatch]
    N --> P{merge_scores AND enable_faith_eval?}
    P -- Yes --> Q[MergeFaithScoresStage\nmerge faith_scores into translation_metadata]
    P -- No --> O
    Q --> O

_{Reviews (4): Last reviewed commit: "docs: clarify FAITH client concurrency" | Re-trigger Greptile}

greptile-apps · 2026-07-02T14:57:06Z

+- `backend_type="llm"` requires both `client` and a non-empty `model_name`, even in dry-run mode.
+- FAITH always uses an LLM client. With Google, AWS, NMT, or a custom translation backend, pass a separate `client` and set `faith_model_name`.
+- `merge_scores=True` with `output_mode="replaced"` is rejected. With FAITH disabled, score merging is skipped with a warning.
+- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction; use `"both"` when you need metadata and `translated_messages`.


The guidance correctly says not to combine output_mode="raw" with reconstruct_messages, but doesn't explain the concrete consequence. In FormatTranslationOutputStage.process, output_field is dropped from the DataFrame before _build_translated_messages is called, so every translated_messages entry ends up as an empty string. Describing that failure mode here will save users from a confusing silent-empty result.

Suggested change

- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction; use `"both"` when you need metadata and `translated_messages`.

- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction: `output_field` is dropped before `_build_translated_messages` runs, so every `translated_messages` entry will be an empty string. Use `output_mode="both"` when you need metadata and `translated_messages`.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

docs: complete translation configuration reference

03f44a0

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Jun 29, 2026

lbliii mentioned this pull request Jun 30, 2026

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii marked this pull request as ready for review July 2, 2026 14:53

lbliii requested a review from a team as a code owner July 2, 2026 14:53

lbliii requested review from ayushdg and removed request for a team July 2, 2026 14:53

greptile-apps Bot reviewed Jul 2, 2026

View reviewed changes

lbliii added 3 commits July 2, 2026 10:58

docs: clarify raw translation output

3075982

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'main' into codex/docs-translation-reference

e0c7a49

docs: clarify FAITH client concurrency

2115104

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[codex] docs: complete translation configuration reference#2132

[codex] docs: complete translation configuration reference#2132
lbliii wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
lbliii:codex/docs-translation-reference

lbliii commented Jun 29, 2026

Uh oh!

copy-pr-bot Bot commented Jun 29, 2026

Uh oh!

greptile-apps Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction; use `"both"` when you need metadata and `translated_messages`.
	- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction: `output_field` is dropped before `_build_translated_messages` runs, so every `translated_messages` entry will be an empty string. Use `output_mode="both"` when you need metadata and `translated_messages`.

Uh oh!

Conversation

lbliii commented Jun 29, 2026

What changed

Why

Impact

Validation

Uh oh!

copy-pr-bot Bot commented Jun 29, 2026

Uh oh!

greptile-apps Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jul 2, 2026 •

edited

Loading