[codex] docs: complete translation configuration reference#2132

Open
lbliii wants to merge 4 commits into NVIDIA-NeMo:main from lbliii:codex/docs-translation-reference

@lbliii lbliii commented Jun 29, 2026

Contributor

What changed

  • add a complete, source-verified TranslationStage parameter reference covering inputs, segmentation, inference, FAITH evaluation, output, and resume controls
  • document validation rules and interactions among output modes, score merging, message reconstruction, dry runs, and health checks
  • explain skip/restore semantics, including how pre-translated rows interact with FAITH filtering and output formatting
  • add built-in Google, AWS, and NMT backend configuration details
  • add a non-LLM translation example that uses a separate LLM for FAITH scoring
  • document custom translation and FAITH prompt schemas, placeholders, escaping, and dry-run behavior

Why

The translation controls introduced in #2038 were not fully represented in the user guide. Users had to inspect implementation details to discover defaults, constraints, backend requirements, and the behavior of resumed translation runs.

Impact

Users can configure and validate experimental translation pipelines directly from the guide, including mixed-provider translation and quality evaluation workflows. This is a documentation-only change; runtime behavior is unchanged.

Validation

  • cd fern && npm run check — passes with 0 errors
  • cd fern && fern docs broken-links — changed page has no broken links; the command reports 22 pre-existing errors in older API-reference pages
  • git diff --check — passes

Closes #2126

Parent workstream: #2118

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Jun 29, 2026
@lbliii lbliii marked this pull request as ready for review July 2, 2026 14:53
@lbliii lbliii requested a review from a team as a code owner July 2, 2026 14:53
@lbliii lbliii requested review from ayushdg and removed request for a team July 2, 2026 14:53

greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This PR adds a complete parameter reference for TranslationStage to the translation user guide, filling a gap left by the feature's initial introduction in #2038. The documentation is source-verified against the implementation and accurately covers all constructor parameters, validation rules, backend configuration, custom prompt schemas, and resume semantics.

  • TranslationStage reference tables — all defaults, types, and constraints were confirmed against pipeline.py, format_translation_output.py, and the backend files; no discrepancies found.
  • Skip/restore semantics — the new prose correctly explains that skipped rows bypass FAITH filtering and are restored after threshold filtering, with appropriate defaults for missing columns.
  • Non-LLM + FAITH example — the Google/OpenAI split-provider snippet is well-structured and the concurrency sizing note is accurate.

Confidence Score: 5/5

Documentation-only change with no runtime impact; all new content was verified against the source implementation.

Every new parameter default, validation rule, and behavioral claim was cross-checked against pipeline.py, format_translation_output.py, the three backend files, faith.py, and utils/metadata.py. No inaccuracies were found. The prose correctly describes the raw-mode + reconstruct_messages failure mode, the skip/restore ordering relative to FAITH filtering, and the backend-specific constructor requirements.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/process-data/language-management/translation.mdx | Adds ~170 lines of source-verified documentation: parameter reference tables, validation interaction notes, built-in backend config, skip/restore semantics, non-LLM FAITH example, and custom prompt schema. All claims verified against pipeline.py, format_translation_output.py, faith.py, and the three backend files. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input DocumentBatch] --> B{skip_translated?}
    B -- Yes --> C[SkipExistingTranslationsStage\nremoves pre-translated rows]
    B -- No --> D[SegmentationStage\ncoarse or fine]
    C --> D
    D --> E[SegmentTranslationStage\nllm / google / aws / nmt]
    E --> F{enable_faith_eval?}
    F -- Yes --> G[FaithEvalFilter\nscores segments, filter_enabled=False]
    F -- No --> H[ReassemblyStage\ncollapse to doc-level rows]
    G --> H
    H --> I{enable_faith_eval AND filter_enabled?}
    I -- Yes --> J[FaithThresholdFilterStage\ndrop rows below faith_avg threshold]
    I -- No --> K{skip_translated?}
    J --> K
    K -- Yes --> L[RestoreSkippedRowsStage\nfill defaults, sort to original order]
    K -- No --> M{needs_formatting?}
    L --> M
    M -- output_mode != replaced OR reconstruct_messages --> N[FormatTranslationOutputStage\nbuild translation_metadata / translated_messages]
    M -- replaced and no reconstruct --> O[Output DocumentBatch]
    N --> P{merge_scores AND enable_faith_eval?}
    P -- Yes --> Q[MergeFaithScoresStage\nmerge faith_scores into translation_metadata]
    P -- No --> O
    Q --> O
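
The branch order in the flowchart can also be read as a small decision function. The sketch below is illustrative only: `planned_stages` is a hypothetical helper, not part of the library, and the default flag values are assumptions chosen for the example. The stage and flag names are taken from the flowchart labels.

```python
def planned_stages(
    skip_translated=False,
    enable_faith_eval=False,
    filter_enabled=False,
    output_mode="replaced",
    reconstruct_messages=False,
    merge_scores=False,
):
    """Return stage names in the order drawn in the flowchart (hypothetical helper)."""
    stages = []
    if skip_translated:
        # Pre-translated rows are set aside before segmentation.
        stages.append("SkipExistingTranslationsStage")
    stages += ["SegmentationStage", "SegmentTranslationStage"]
    if enable_faith_eval:
        # Scoring happens on segments, before reassembly.
        stages.append("FaithEvalFilter")
    stages.append("ReassemblyStage")
    if enable_faith_eval and filter_enabled:
        stages.append("FaithThresholdFilterStage")
    if skip_translated:
        # Restoration comes after threshold filtering, so skipped rows bypass it.
        stages.append("RestoreSkippedRowsStage")
    needs_formatting = output_mode != "replaced" or reconstruct_messages
    if needs_formatting:
        stages.append("FormatTranslationOutputStage")
        # In the flowchart, score merging is only reachable after formatting.
        if merge_scores and enable_faith_eval:
            stages.append("MergeFaithScoresStage")
    return stages


if __name__ == "__main__":
    for name in planned_stages(skip_translated=True, enable_faith_eval=True,
                               filter_enabled=True, output_mode="both",
                               merge_scores=True):
        print(name)
```

Reading the result top to bottom shows why skipped rows are never dropped by the FAITH threshold: `RestoreSkippedRowsStage` always follows `FaithThresholdFilterStage` when both are present.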

Reviews (4): Last reviewed commit: "docs: clarify FAITH client concurrency"

- `backend_type="llm"` requires both `client` and a non-empty `model_name`, even in dry-run mode.
- FAITH always uses an LLM client. With Google, AWS, NMT, or a custom translation backend, pass a separate `client` and set `faith_model_name`.
- `merge_scores=True` with `output_mode="replaced"` is rejected. With FAITH disabled, score merging is skipped with a warning.
- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction; use `"both"` when you need metadata and `translated_messages`.
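
As a rough illustration of the four rules above, the checks could be mirrored in a pre-flight helper like the one below. This is a sketch under stated assumptions, not the library's validation code: `check_translation_config` is a hypothetical function, the order of the checks and the use of `warnings.warn` are guesses, and raising on raw mode plus `reconstruct_messages` is this sketch's own choice (the guide only says not to combine them).

```python
import warnings


def check_translation_config(
    backend_type,
    client=None,
    model_name="",
    enable_faith_eval=False,
    faith_model_name="",
    output_mode="replaced",
    merge_scores=False,
    reconstruct_messages=False,
):
    """Hypothetical pre-flight check mirroring the documented rules."""
    # Rule 1: the LLM backend needs a client and a model name, dry run or not.
    if backend_type == "llm" and (client is None or not model_name):
        raise ValueError('backend_type="llm" needs client and a non-empty model_name')
    # Rule 2: FAITH always scores with an LLM, so non-LLM backends need a
    # separate client plus faith_model_name.
    if enable_faith_eval and backend_type != "llm":
        if client is None or not faith_model_name:
            raise ValueError("FAITH needs an LLM client and faith_model_name")
    # Rule 3: merged scores have nowhere to go in "replaced" mode.
    if merge_scores and output_mode == "replaced":
        raise ValueError('merge_scores=True is rejected with output_mode="replaced"')
    if merge_scores and not enable_faith_eval:
        warnings.warn("FAITH is disabled; score merging is skipped")
    # Rule 4 (sketch-only behavior): fail loudly instead of producing empty messages.
    if output_mode == "raw" and reconstruct_messages:
        raise ValueError('use output_mode="both" with reconstruct_messages, not "raw"')


if __name__ == "__main__":
    # object() stands in for a real LLM client; the model names are placeholders.
    check_translation_config("google", client=object(), enable_faith_eval=True,
                             faith_model_name="placeholder-faith-model",
                             output_mode="both", merge_scores=True)
    print("configuration accepted")
```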

P2 The guidance correctly says not to combine output_mode="raw" with reconstruct_messages, but doesn't explain the concrete consequence. In FormatTranslationOutputStage.process, output_field is dropped from the DataFrame before _build_translated_messages is called, so every translated_messages entry ends up as an empty string. Describing that failure mode here will save users from a confusing silent-empty result.

Suggested change

Before:
- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction; use `"both"` when you need metadata and `translated_messages`.

After:
- `output_mode="raw"` removes `output_field` after building `translation_metadata`. Do not combine raw mode with message reconstruction: `output_field` is dropped before `_build_translated_messages` runs, so every `translated_messages` entry will be an empty string. Use `output_mode="both"` when you need metadata and `translated_messages`.
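
The ordering problem described in this comment can be reproduced in a few lines of plain Python. The sketch below is a simplified stand-in, not the code in `FormatTranslationOutputStage`: `format_rows`, the dict-based rows, and the shapes of `translation_metadata` and `translated_messages` are all assumptions made for illustration. It only shows why dropping the column first leaves nothing for message reconstruction to read.

```python
def format_rows(rows, output_field, output_mode):
    """Hypothetical stand-in for the formatting step described in the review comment."""
    formatted = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        # Metadata is built while the translated column still exists.
        row["translation_metadata"] = {"translation": row[output_field]}
        if output_mode == "raw":
            # Raw mode drops the translated column before messages are rebuilt.
            del row[output_field]
        # Reconstruction reads output_field afterwards; a missing column yields "".
        row["translated_messages"] = [row.get(output_field, "")]
        formatted.append(row)
    return formatted


if __name__ == "__main__":
    rows = [{"text": "hello", "translated_text": "hola"}]
    print(format_rows(rows, "translated_text", "raw")[0]["translated_messages"])
    print(format_rows(rows, "translated_text", "both")[0]["translated_messages"])
```

With `"raw"` the message content comes back empty even though `translation_metadata` still holds the translation, which matches the silent-empty result the comment warns about; with `"both"` the column survives and the message is populated.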

lbliii added 3 commits July 2, 2026 10:58
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Development

Successfully merging this pull request may close these issues.

[Docs] Complete the experimental translation configuration reference

1 participant