[codex] docs: add audio tagging pipeline guide by lbliii · Pull Request #2135 · NVIDIA-NeMo/Curator

lbliii · 2026-06-29T20:47:20Z

Summary

add a source-verified Fern tutorial for the YAML-driven TTS and ASR audio tagging pipelines
document stage ordering, manifest field lifecycle, defaults, dependencies, GPU/batching constraints, output metrics, and failure behavior
cover PrepareModuleSegmentsStage, bandwidth, TorchSQUIM, second-pass WER/CER, inverse text normalization, and Chinese conversion
add runnable commands for the bundled sample manifest and configuration overrides for both paths
add the page to audio navigation and cross-link it from the audio overview, processing, quality-assessment, and text-integration pages

PnC and normalization implementation boundary

Issue #2123 was written from the description of #1863, but the final merged source does not contain PNCwithvLLMInferenceStage, CleanLLMOutputStage, VLLMInference, or an Arabic diacritic-removal stage. This guide does not invent configuration for those unavailable APIs. It explicitly documents that:

punctuation must currently come from the selected ASR model or a custom post-alignment stage
ComputeWERStage.compute_pnc_wer evaluates punctuation-sensitive agreement but does not generate PnC
the shipped pipeline has no vLLM dependency or lifecycle setting
Arabic diacritic removal requires a custom stage in the current release
CER fallback, LLM retry, and text-selection policies are not implemented by the supplied pipeline

This keeps the published workflow runnable against the implementation on main while still addressing the operational questions in the issue.
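
To make the distinction between evaluating and generating PnC concrete, here is a minimal sketch of how a punctuation-sensitive WER differs from a punctuation-insensitive one. It is not the `ComputeWERStage` implementation: the helpers `edit_distance`, `wer`, and `strip_pnc` are hypothetical, and the choice of which transcript serves as the reference is an assumption made for illustration.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a ratio (not a percentage)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference has no words")
    return edit_distance(ref, hyp) / len(ref)


def strip_pnc(text):
    """Lowercase and drop ASCII punctuation (hypothetical normalization)."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


# Assumed roles: the punctuated transcript is the reference.
reference = "Hello, world. How are you?"
hypothesis = "hello world how are you"

pnc_wer = wer(reference, hypothesis)                          # sensitive to PnC
plain_wer = wer(strip_pnc(reference), strip_pnc(hypothesis))  # ignores PnC
```

In this sketch the two transcripts agree on every word but disagree on punctuation and capitalization, so only the punctuation-sensitive score is non-zero. Neither function adds punctuation to a transcript, which is the boundary the list above describes.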

Validation

fern check — 0 errors (103 existing warnings)
fern docs broken-links — no errors in the changed pages; reports 22 pre-existing API-reference errors elsewhere
python3 -m py_compile tutorials/audio/tagging/main.py
git diff --cached --check

Closes #2123

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-29T20:47:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-02T14:59:47Z

Greptile Summary

This PR adds a new Fern tutorial page (audio-tagging.mdx) documenting the YAML-driven TTS and ASR audio tagging pipelines, along with navigation registration and cross-links from four existing pages. All implementation-facing claims were verified against the source code.

Stage documentation (numbering 0–12, key names, failure behaviors) was checked against the actual stage implementations: sisdr_squim key name, BandwidthEstimationStage using audio_filepath, fatal vs. recoverable failure paths for 2nd-pass ASR, and ComputeWERStage skip conditions all match.
Text normalization output keys (text_ITN, text_simplified) and the note that ChineseConversionStage always writes to _simplified regardless of convert_type are accurate per the source.
word_rate and char_rate are now present in both the JSON example and the output table, addressing the previously flagged gap.

Confidence Score: 5/5

Documentation-only change; no executable logic is modified.

All stage-level claims—key names, failure paths, field lifecycle, and text-normalization output keys—were verified against the implementation. No factual inaccuracies were found. The navigation entry, cross-links, and tutorial content are consistent with each other and with the live codebase.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-audio/tutorials/audio-tagging.mdx | New 308-line tutorial for the audio tagging pipeline; implementation-verified: stage numbering, key names (sisdr_squim), failure behaviors, and text-normalization output keys all match the source code. |
| fern/versions/main.yml | Adds Audio Tagging Pipeline page to the Tutorials nav section with correct path and slug; no structural issues. |
| fern/versions/main/pages/curate-audio/tutorials/index.mdx | Adds Audio Tagging Tutorial card to the tutorials index with consistent description and correct href. |
| fern/versions/main/pages/curate-audio/index.mdx | Adds cross-link card to the audio overview page; description and href are correct. |
| fern/versions/main/pages/curate-audio/process-data/index.mdx | Adds cross-link card from the processing overview to the new tutorial; correct href and tags. |
| fern/versions/main/pages/curate-audio/process-data/quality-assessment/index.mdx | Adds a Related topics link to the new tutorial from the quality-assessment page; link text accurately describes SQUIM and WER content. |
| fern/versions/main/pages/curate-audio/process-data/text-integration/index.mdx | Adds a Related topics link from text-integration page to the new tutorial; description accurately mentions ITN and Chinese conversion. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JSONL manifest\n(audio_filepath, audio_item_id)"] --> B["1. ResampleAudioStage\n→ resampled_audio_filepath, duration"]
    B --> C["2. PyAnnoteDiarizationStage\n→ segments, overlap_segments, RTTM"]
    C --> D["3. SplitLongAudioStage\n→ split_filepaths, split_metadata"]
    D --> E["4. NeMoASRAlignerStage (1st pass)\n→ text, alignment (per split)"]
    E --> F["5. JoinSplitAudioMetadataStage\n→ rejoined text/timestamps"]
    F --> G["6. MergeAlignmentDiarizationStage\n→ segments[].text, segments[].words"]
    G --> H_opt["Optional: InverseTextNormalization\nor ChineseConversionStage"]
    H_opt --> H["7. BandwidthEstimationStage\n→ segments[].metrics.bandwidth"]
    H --> I["8. TorchSquimQualityMetricsStage\n→ pesq_squim, stoi_squim, sisdr_squim"]
    I --> J["9. PrepareModuleSegmentsStage\n→ training-length segments"]
    J --> K_tts["TTS: 10. ManifestWriterStage"]
    J --> L["ASR: 10. NeMoASRAlignerStage (2nd pass)\n→ segments[].text_2"]
    L --> M["ASR: 11. ComputeWERStage\n→ wer, cer, start_cer, end_cer, word_rate, char_rate"]
    M --> N["ASR: 12. ManifestWriterStage"]
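
The ordering in the flowchart can also be written down as plain data, which makes the shared prefix and the TTS/ASR split easier to see. This is only a sketch built from the stage names in the diagram: `SHARED_STAGES` and `stage_order` are hypothetical names, the optional normalization stages are left out, and nothing here reflects how the YAML configuration is actually loaded.

```python
# Stages 1-9 are common to both paths, per the flowchart.
SHARED_STAGES = [
    "ResampleAudioStage",
    "PyAnnoteDiarizationStage",
    "SplitLongAudioStage",
    "NeMoASRAlignerStage",  # first pass
    "JoinSplitAudioMetadataStage",
    "MergeAlignmentDiarizationStage",
    "BandwidthEstimationStage",
    "TorchSquimQualityMetricsStage",
    "PrepareModuleSegmentsStage",
]


def stage_order(path):
    """Return the ordered stage names for the 'tts' or 'asr' path."""
    if path == "tts":
        tail = ["ManifestWriterStage"]
    elif path == "asr":
        # Second ASR pass and WER/CER scoring run before the writer.
        tail = ["NeMoASRAlignerStage", "ComputeWERStage", "ManifestWriterStage"]
    else:
        raise ValueError(f"unknown path: {path!r}")
    return SHARED_STAGES + tail


tts_stages = stage_order("tts")
asr_stages = stage_order("asr")
```

Enumerating either list from 1 reproduces the numbering in the diagram: the TTS path ends at stage 10 and the ASR path at stage 12.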

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/docs-audi..."

greptile-apps · 2026-07-02T14:59:52Z

+| Additional split signals | Pause longer than `max_pause`; bandwidth change | Randomized duration boundary |
+| Reproducibility | Deterministic for an input entry | Randomized boundary is deterministically seeded from audio path or ID |
+
+`terminal_punct_marks` defaults to the value in the YAML, `. ! ? 。 ？ ！ 。` without spaces. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.


terminal_punct_marks is displayed with spaces between each character in the inline code (. ! ? 。？！。), then immediately qualified with "without spaces" — these two parts contradict each other and will confuse readers trying to copy the value. The actual YAML string is .!?。？！。; use that verbatim so the clarification is unnecessary.

Suggested change

`terminal_punct_marks` defaults to the value in the YAML, `. ! ? 。？！。` without spaces. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.

`terminal_punct_marks` defaults to the value in the YAML, `.!?。？！。`. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
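
A short sketch of why the spacing matters to a reader who copies the value. It assumes the stage treats each character of `terminal_punct_marks` as one terminal mark; that per-character behavior is an assumption, and `ends_with_terminal_mark` and `punctuation_boundaries` are hypothetical helpers, not functions from the pipeline.

```python
COMPACT_MARKS = ".!?。？！。"        # the string quoted above as the actual YAML value
SPACED_MARKS = ". ! ? 。 ？ ！ 。"  # the same marks with spaces between them


def ends_with_terminal_mark(text, marks):
    """True if the last character of text is one of the marks."""
    return bool(text) and text[-1] in marks


def punctuation_boundaries(words, marks):
    """Indices of words that end a sentence under the given marks."""
    return [i for i, w in enumerate(words) if ends_with_terminal_mark(w, marks)]


words = ["It", "works.", "Really?", "好的。", "next"]
boundaries = punctuation_boundaries(words, COMPACT_MARKS)

# The spaced variant adds exactly one extra "mark": the space character.
extra_marks = set(SPACED_MARKS) - set(COMPACT_MARKS)
```

Under that assumption, copying the spaced form would make any token ending in a space count as a sentence boundary, which is why quoting the YAML string verbatim is the safer documentation choice.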

greptile-apps · 2026-07-02T14:59:53Z

+| `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
+| `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |


The output table below the JSON example omits word_rate and char_rate, which are visible in the example object. A reader who sees these fields in a real manifest will have no documentation to explain them. Both are produced by ComputeWERStage (words and characters per second, respectively, derived from the hypothesis text and segment duration).

Suggested change

| `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
| `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |

| `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
| `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |
| `metrics.word_rate` | Words per second for the segment, computed from the hypothesis text and segment duration. |
| `metrics.char_rate` | Characters per second for the segment, computed from the hypothesis text and segment duration. |
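
As a rough illustration of the two added rows, the rates could be derived along these lines. This is a sketch, not the `ComputeWERStage` code: `speaking_rates` is a hypothetical helper, and both whitespace tokenization and excluding spaces from the character count are assumptions.

```python
def speaking_rates(hypothesis_text, start, end):
    """Words and characters per second for one segment."""
    duration = end - start
    if duration <= 0:
        raise ValueError("segment duration must be positive")
    words = hypothesis_text.split()
    # Assumption: whitespace does not count toward the character rate.
    n_chars = sum(len(w) for w in words)
    return {
        "word_rate": len(words) / duration,
        "char_rate": n_chars / duration,
    }


# Five words and 19 non-space characters over a 2.5-second segment.
rates = speaking_rates("hello world how are you", start=1.0, end=3.5)
```

Like the WER and CER values in the same table, these are plain ratios per second rather than percentages, so thresholds applied to them downstream should be written in the same units.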

docs: add audio tagging pipeline guide

a1a6adc

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Jun 29, 2026

lbliii mentioned this pull request Jun 30, 2026

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii marked this pull request as ready for review July 2, 2026 14:53

lbliii requested a review from a team as a code owner July 2, 2026 14:53

lbliii requested review from weijiac0619 and removed request for a team July 2, 2026 14:53

greptile-apps Bot reviewed Jul 2, 2026

lbliii added 2 commits July 2, 2026 11:02

docs: clarify audio tagging output fields

379c5c4

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'main' into codex/docs-audio-tagging-guide

7c6a234

This was referenced Jul 2, 2026

[Docs] Integrate 26.06 documentation PRs and stage the version train #2160

Open

[Docs] Repair audio API examples and add workflow selection guidance #2162

Open

lbliii wants to merge 3 commits into NVIDIA-NeMo:main from lbliii:codex/docs-audio-tagging-guide
