[codex] docs: add audio tagging pipeline guide#2135

Open
lbliii wants to merge 3 commits into NVIDIA-NeMo:main from lbliii:codex/docs-audio-tagging-guide

Conversation

@lbliii lbliii commented Jun 29, 2026

Contributor

Summary

  • add a source-verified Fern tutorial for the YAML-driven TTS and ASR audio tagging pipelines
  • document stage ordering, manifest field lifecycle, defaults, dependencies, GPU/batching constraints, output metrics, and failure behavior
  • cover PrepareModuleSegmentsStage, bandwidth, TorchSQUIM, second-pass WER/CER, inverse text normalization, and Chinese conversion
  • add runnable commands for the bundled sample manifest and configuration overrides for both paths
  • add the page to audio navigation and cross-link it from the audio overview, processing, quality-assessment, and text-integration pages

PnC and normalization implementation boundary

Issue #2123 was written from the description of #1863, but the final merged source does not contain PNCwithvLLMInferenceStage, CleanLLMOutputStage, VLLMInference, or an Arabic diacritic-removal stage. This guide does not invent configuration for those unavailable APIs. It explicitly documents that:

  • punctuation must currently come from the selected ASR model or a custom post-alignment stage
  • ComputeWERStage.compute_pnc_wer evaluates punctuation-sensitive agreement but does not generate PnC
  • the shipped pipeline has no vLLM dependency or lifecycle setting
  • Arabic diacritic removal requires a custom stage in the current release
  • CER fallback, LLM retry, and text-selection policies are not implemented by the supplied pipeline

This keeps the published workflow runnable against the implementation on main while still addressing the operational questions in the issue.
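
To make the custom-stage boundary concrete, here is a minimal sketch of the text transformation that an Arabic diacritic-removal stage would need to apply. The helper name `strip_arabic_diacritics` and the chosen code point range are assumptions for illustration only. Nothing here is an API from this repository, and wrapping the function in a pipeline stage is left to the custom stage.

```python
import re

# Hypothetical helper, not part of the shipped pipeline.
# Matches the Arabic harakat block U+064B..U+0652 (fathatan through sukun)
# plus the superscript alef U+0670. Extend the class if your data needs more.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_arabic_diacritics(text: str) -> str:
    """Return `text` with the Arabic diacritic marks listed above removed."""
    return _ARABIC_DIACRITICS.sub("", text)


# Example on a segment-like dict; the "text" key mirrors segments[].text.
segment = {"text": "\u0643\u064E\u062A\u064E\u0628\u064E"}  # kaf, ta, ba with fatha marks
segment["text"] = strip_arabic_diacritics(segment["text"])
```

Base letters and non-Arabic text pass through unchanged, so the helper can be applied to mixed-language transcripts.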

Validation

  • fern check — 0 errors (103 existing warnings)
  • fern docs broken-links — no errors in the changed pages; reports 22 pre-existing API-reference errors elsewhere
  • python3 -m py_compile tutorials/audio/tagging/main.py
  • git diff --cached --check

Closes #2123

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jun 29, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Jun 29, 2026
@lbliii lbliii marked this pull request as ready for review July 2, 2026 14:53
@lbliii lbliii requested a review from a team as a code owner July 2, 2026 14:53
@lbliii lbliii requested review from weijiac0619 and removed request for a team July 2, 2026 14:53

greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This PR adds a new Fern tutorial page (audio-tagging.mdx) documenting the YAML-driven TTS and ASR audio tagging pipelines, along with navigation registration and cross-links from four existing pages. All implementation-facing claims were verified against the source code.

  • Stage documentation (numbering 0–12, key names, failure behaviors) was checked against the actual stage implementations: sisdr_squim key name, BandwidthEstimationStage using audio_filepath, fatal vs. recoverable failure paths for 2nd-pass ASR, and ComputeWERStage skip conditions all match.
  • Text normalization output keys (text_ITN, text_simplified) and the note that ChineseConversionStage always writes to _simplified regardless of convert_type are accurate per the source.
  • word_rate and char_rate are now present in both the JSON example and the output table, addressing the previously flagged gap.

Confidence Score: 5/5

Documentation-only change; no executable logic is modified.

All stage-level claims—key names, failure paths, field lifecycle, and text-normalization output keys—were verified against the implementation. No factual inaccuracies were found. The navigation entry, cross-links, and tutorial content are consistent with each other and with the live codebase.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-audio/tutorials/audio-tagging.mdx | New 308-line tutorial for the audio tagging pipeline; implementation-verified: stage numbering, key names (sisdr_squim), failure behaviors, and text-normalization output keys all match the source code. |
| fern/versions/main.yml | Adds Audio Tagging Pipeline page to the Tutorials nav section with correct path and slug; no structural issues. |
| fern/versions/main/pages/curate-audio/tutorials/index.mdx | Adds Audio Tagging Tutorial card to the tutorials index with consistent description and correct href. |
| fern/versions/main/pages/curate-audio/index.mdx | Adds cross-link card to the audio overview page; description and href are correct. |
| fern/versions/main/pages/curate-audio/process-data/index.mdx | Adds cross-link card from the processing overview to the new tutorial; correct href and tags. |
| fern/versions/main/pages/curate-audio/process-data/quality-assessment/index.mdx | Adds a Related topics link to the new tutorial from the quality-assessment page; link text accurately describes SQUIM and WER content. |
| fern/versions/main/pages/curate-audio/process-data/text-integration/index.mdx | Adds a Related topics link from text-integration page to the new tutorial; description accurately mentions ITN and Chinese conversion. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JSONL manifest\n(audio_filepath, audio_item_id)"] --> B["1. ResampleAudioStage\n→ resampled_audio_filepath, duration"]
    B --> C["2. PyAnnoteDiarizationStage\n→ segments, overlap_segments, RTTM"]
    C --> D["3. SplitLongAudioStage\n→ split_filepaths, split_metadata"]
    D --> E["4. NeMoASRAlignerStage (1st pass)\n→ text, alignment (per split)"]
    E --> F["5. JoinSplitAudioMetadataStage\n→ rejoined text/timestamps"]
    F --> G["6. MergeAlignmentDiarizationStage\n→ segments[].text, segments[].words"]
    G --> H_opt["Optional: InverseTextNormalization\nor ChineseConversionStage"]
    H_opt --> H["7. BandwidthEstimationStage\n→ segments[].metrics.bandwidth"]
    H --> I["8. TorchSquimQualityMetricsStage\n→ pesq_squim, stoi_squim, sisdr_squim"]
    I --> J["9. PrepareModuleSegmentsStage\n→ training-length segments"]
    J --> K_tts["TTS: 10. ManifestWriterStage"]
    J --> L["ASR: 10. NeMoASRAlignerStage (2nd pass)\n→ segments[].text_2"]
    L --> M["ASR: 11. ComputeWERStage\n→ wer, cer, start_cer, end_cer, word_rate, char_rate"]
    M --> N["ASR: 12. ManifestWriterStage"]
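
The branch at stage 9 can also be read as plain data. The sketch below lists the stage names from the flowchart and returns the numbered order for each path. The `stage_order` helper and the list layout are illustrative assumptions, not the pipeline's configuration format, and the optional normalization stages are left out.

```python
# Stage names come from the flowchart; everything else is a hypothetical
# illustration and not how the YAML-driven pipeline is actually configured.
SHARED_STAGES = [
    "ResampleAudioStage",
    "PyAnnoteDiarizationStage",
    "SplitLongAudioStage",
    "NeMoASRAlignerStage",  # 1st pass
    "JoinSplitAudioMetadataStage",
    "MergeAlignmentDiarizationStage",
    "BandwidthEstimationStage",
    "TorchSquimQualityMetricsStage",
    "PrepareModuleSegmentsStage",
]
ASR_ONLY_STAGES = [
    "NeMoASRAlignerStage",  # 2nd pass
    "ComputeWERStage",
]


def stage_order(path: str) -> list[str]:
    """Return the ordered stage names for the 'tts' or 'asr' path."""
    if path not in ("tts", "asr"):
        raise ValueError(f"unknown path: {path!r}")
    stages = list(SHARED_STAGES)
    if path == "asr":
        stages += ASR_ONLY_STAGES
    stages.append("ManifestWriterStage")  # both paths end by writing the manifest
    return stages


for number, name in enumerate(stage_order("asr"), start=1):
    print(number, name)
```

Under these assumptions the TTS path ends at stage 10 and the ASR path at stage 12, which matches the numbering in the diagram.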

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/docs-audi..."

| Additional split signals | Pause longer than `max_pause`; bandwidth change | Randomized duration boundary |
| Reproducibility | Deterministic for an input entry | Randomized boundary is deterministically seeded from audio path or ID |

`terminal_punct_marks` defaults to the value in the YAML, `. ! ? 。 ? ! 。` without spaces. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.


P2: `terminal_punct_marks` is shown with spaces between the characters in the inline code (`. ! ? 。 ? ! 。`) and then immediately qualified with "without spaces". The two parts contradict each other and will confuse readers who try to copy the value. The actual YAML string is `.!?。?!。`; use it verbatim so the qualification is unnecessary.

Suggested change
- `terminal_punct_marks` defaults to the value in the YAML, `. ! ? 。 ? ! 。` without spaces. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.
+ `terminal_punct_marks` defaults to the value in the YAML, `.!?。?!。`. If `punctuation_split_only: true`, the stage returns no prepared segments when it cannot find a punctuation boundary. With the supplied `false`, duration and TTS pause/bandwidth heuristics remain available.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
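
The `punctuation_split_only` behavior discussed in this comment can be sketched on a plain string. This is a simplified illustration under stated assumptions: the function name `split_on_terminal_punct` is hypothetical, the real stage works on aligned segments rather than raw text, and the whole-text fallback stands in for the duration and pause/bandwidth heuristics.

```python
# Mark string copied from the review comment above; duplicates are harmless
# because the marks are only used for membership tests.
TERMINAL_PUNCT_MARKS = ".!?。?!。"


def split_on_terminal_punct(text, marks=TERMINAL_PUNCT_MARKS, punctuation_split_only=False):
    """Split `text` after each terminal mark (hypothetical, string-only sketch)."""
    mark_set = set(marks)
    pieces = []
    start = 0
    for index, char in enumerate(text):
        if char in mark_set:
            pieces.append(text[start:index + 1].strip())
            start = index + 1
    tail = text[start:].strip()
    if pieces:
        # At least one punctuation boundary exists; keep any trailing text.
        if tail:
            pieces.append(tail)
        return pieces
    if punctuation_split_only:
        # No boundary found and no other heuristic allowed: nothing is returned.
        return []
    # Stand-in for the other heuristics: keep the text as one piece.
    return [tail] if tail else []
```

With `punctuation_split_only=True`, unpunctuated text yields an empty list, which mirrors the "returns no prepared segments" wording in the suggested line.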

Comment on lines +283 to +284
| `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
| `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |


P2: The output table below the JSON example omits `word_rate` and `char_rate`, which appear in the example object. A reader who sees these fields in a real manifest has no documentation explaining them. Both are produced by `ComputeWERStage` (words and characters per second, derived from the hypothesis text and segment duration).

Suggested change
- | `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
- | `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |
+ | `metrics.wer.wer`, `metrics.cer.cer` | Second-pass disagreement ratios, not percentages. Each object also records token and edit rates. |
+ | `metrics.start_cer.cer`, `metrics.end_cer.cer` | CER at the configured beginning and ending character windows. |
+ | `metrics.word_rate` | Words per second for the segment, computed from the hypothesis text and segment duration. |
+ | `metrics.char_rate` | Characters per second for the segment, computed from the hypothesis text and segment duration. |
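
For readers unfamiliar with these metrics, the sketch below shows one way the ratios and rates in the table could be computed. It is an illustration under stated assumptions, not the `ComputeWERStage` implementation: the function names are hypothetical, the error rate is plain Levenshtein distance divided by reference length, and `char_rate` counts non-space characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance over reference length, as a ratio (0.25, not 25)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def speaking_rates(hyp_text, duration_s):
    """Words and characters per second for a segment (assumed definitions)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    words = hyp_text.split()
    chars = hyp_text.replace(" ", "")  # assumption: spaces are not counted
    return {"word_rate": len(words) / duration_s, "char_rate": len(chars) / duration_s}


# Hypothetical first-pass and second-pass transcripts for one segment.
wer = error_rate("a b c d".split(), "a x c d".split())  # word level
cer = error_rate("abcd", "abxd")                        # character level
rates = speaking_rates("hello world", 2.0)
```

Which transcript serves as the reference and which as the hypothesis is an assumption here; check the stage's source before relying on the direction of the ratio.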


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Docs] Add an audio tagging, quality-metrics, normalization, and LLM PnC guide

1 participant