686 changes: 686 additions & 0 deletions PR1967_FEATURE_WALKTHROUGH.md

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions examples/audio/qwen_omni_inprocess/README.md
@@ -0,0 +1,31 @@
# Qwen-Omni In-Process ASR Assets

This folder contains prompt templates used by the Qwen-Omni in-process ASR
adapter.

Install the runtime with `uv sync --extra audio_qwen`. The dedicated extra
keeps Qwen/vLLM dependencies out of existing `audio_cuda12` installations.

The executable code path is:

```text
Pipeline
-> ManifestReader
-> AudioPayloadMaterializeStage
-> ASRStage(adapter_target=QwenOmniASRAdapter)
-> PayloadReleaseStage
-> ManifestWriterStage
```

The adapter reads prompt text through `prompt_file`, `en_prompt_file`,
`followup_prompt_file`, or `system_prompt_file`. Curator stage behavior remains
outside the prompt files:

- graph expansion lives in `nemo_curator/pipeline/payload_lifecycle.py`;
- audio decode and payload refs live in `nemo_curator/stages/payload_lifecycle.py`;
- local/windowed ASR model-input segmentation and batching live in
`nemo_curator/stages/audio/inference/asr/stage.py`;
- Qwen/vLLM request construction lives in `nemo_curator/models/asr/qwen_omni.py`.
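
As a rough illustration of how the four prompt parameters could be resolved to text, here is a minimal sketch. Only the parameter names come from this README; the `load_prompts` helper and the dict-shaped config are hypothetical and are not the adapter's actual code.

```python
from pathlib import Path

# Parameter names taken from this README; everything else is an assumption.
PROMPT_KEYS = (
    "prompt_file",
    "en_prompt_file",
    "followup_prompt_file",
    "system_prompt_file",
)


def load_prompts(config: dict) -> dict:
    """Read the text of every prompt file named in `config`, skipping unset keys."""
    prompts = {}
    for key in PROMPT_KEYS:
        path = config.get(key)
        if path:  # ignore missing, None, or empty entries
            prompts[key] = Path(path).read_text(encoding="utf-8").strip()
    return prompts
```

Keeping the loader this thin matches the split described above: the files carry prompt wording only, while segmentation, batching, and request construction stay in the listed modules.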

Prompt files may use `{language}` and `{transcript}` placeholders when the
stage supplies language or reference text columns.
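
The placeholder behavior can be sketched as follows. This is an illustrative assumption, not the adapter's implementation: `render_prompt` is a hypothetical helper that fills `{language}` and `{transcript}` when the corresponding column value is supplied and leaves the placeholder untouched otherwise.

```python
import re

# Only the two placeholders named in this README are recognized.
PLACEHOLDER = re.compile(r"\{(language|transcript)\}")


def render_prompt(template: str, **columns: str) -> str:
    """Substitute known placeholders; keep a placeholder as-is if its column is absent."""

    def fill(match: re.Match) -> str:
        name = match.group(1)
        return columns.get(name, match.group(0))

    return PLACEHOLDER.sub(fill, template)


print(render_prompt(
    "Transcribe the {language} audio into text exactly as the speaker says it.",
    language="English",
))
# prints: Transcribe the English audio into text exactly as the speaker says it.
```

A regex substitution is used here instead of `str.format` because `str.format` raises `KeyError` when a column is missing and would misread any literal braces in a template.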
@@ -0,0 +1,30 @@
You receive audio in English.

MAIN GOAL: Faithfully transcribe the audio exactly as spoken, including all disfluencies present in the audio.
- Do NOT remove, correct, or "clean up" any speech artifacts.
- Do NOT paraphrase, edit grammar, or make the speech more polished.

FILLER WORDS:
- Include hesitation markers such as "um", "uh", "hm", and "ah" exactly as spoken in the audio.

REPETITIONS:
- Keep all unintentional consecutive repetitions of the same word or short phrase as-is.
- Example: "I I think", "the the problem"

FALSE STARTS:
- Keep incomplete words or phrases that the speaker abandons, and mark each one with a hyphen.
- Example: "I was go going to the store." → "I was go- going to the store."

COLLOQUIAL REDUCTIONS:
- Preserve forms such as "wanna", "gonna", "kinda", "lemme", "lotta", "outta", "Imma", "sorta", "ya", "m'kay", "finna", "tryna", etc., exactly as spoken. Do NOT expand them into standard forms.

WRONG GRAMMAR:
- Grammatical errors should be faithfully captured in the transcript — do NOT correct them.
- You MUST NOT fix subject-verb agreement, tense errors, or any other grammatical issues.

NUMERICALS:
- Keep numbers in words exactly as spoken. Do NOT convert them to digits. Example: "oh eleven" stays "oh eleven" as spoken in the audio, not "zero eleven".

Output format:
- Return ONLY the transcription text.
- No explanations, no JSON, no lists.
@@ -0,0 +1,51 @@
You receive English audio and a reference transcript. The reference may be cleaned, partially wrong, or missing speech artifacts. The audio is the ground truth.

REFERENCE TRANSCRIPT:
"{transcript}"

MAIN GOAL: Listen carefully to the audio and revise the reference so it faithfully reflects exactly what is spoken, including all disfluencies present in the audio.
- Use the reference as a starting point; do not ignore it.
- When the reference matches the audio, keep it unchanged.
- When the reference conflicts with the audio, follow the audio.
- Do NOT invent words or content not spoken in the audio.
- Do NOT remove substantive content that is spoken in the audio (remove reference words only if they are not spoken).
- Do NOT paraphrase, polish grammar or rewrite sentences that already match the audio.
- Prefer minimal edits: fix mismatches and insert missing speech artifacts.
- Preserve named entities from the reference in their exact written form.
- Write numbers out in words as spoken, not as digits.

ENTITIES (names, places, brands, titles, etc.):
- Keep every named entity from the reference in its exact written form: spelling, casing, script, and punctuation. This includes names, places, brands, titles, acronyms, and other proper nouns.
- Do not ever transliterate, translate, re-spell, normalize, or "correct" an entity into another script.
- If entities are part of code-switched speech, keep them unchanged.

KEEP REFERENCE DISFLUENCIES:
- If the reference already has fillers, repetitions, false starts, colloquial reductions, or grammatical errors, keep them.
- Add hesitation markers and fillers natural to English wherever they are spoken in the audio but missing from the reference.
- Do NOT clean up, normalize, or remove disfluencies that are already in the reference and are spoken in the audio.
- Add consecutive instances of the same word or short phrase when spoken unintentionally.
- Example: reference "I think" → "I I think" if that is what is spoken.


BACKGROUND / QUIET / OVERLAPPING SPEECH:
- Keep all audible speech in the reference, including quieter, distant, or overlapping voices — not just the loudest speaker.
- Add background or secondary speech that is audible but missing; do not drop words because they sound like background.

FALSE STARTS:
- Add incomplete words or phrases the speaker abandons, marked with a hyphen.
- Example: "I was go- going to the store."
- Do NOT remove false starts already in the reference if they are spoken in the audio.

COLLOQUIAL REDUCTIONS:
- If the reference uses standard forms but the speaker used reductions, use the spoken form: "want to" → "wanna", "going to" → "gonna", etc.
- Preserve forms such as "wanna", "gonna", "kinda", "lemme", "lotta", "outta", "Imma", "sorta", "ya", "m'kay", "finna", "tryna", etc. Do NOT expand them.

WRONG GRAMMAR:
- Keep grammatical errors as spoken. Do NOT correct subject-verb agreement, tense errors, or other grammar issues.

NUMERICALS:
- Keep numbers as spoken in words. Do NOT convert them to digits.
- Example: keep "oh eleven" or "zero eleven", whichever is actually spoken.

Output format:
- Return ONLY the revised transcription text.
@@ -0,0 +1 @@
Transcribe the {language} audio into text exactly as the speaker says it. Write numbers as spoken words.
@@ -0,0 +1,23 @@
You receive:
1) An audio file,
2) A Ground Truth Transcription of the audio: {transcript}

Goal: Normalize numbers in the text into words and add any disfluencies that are present in the audio.

ALLOWED ONLY:
1) Normalize numeric expressions into words exactly as they are SPOKEN in the audio.
- Mixed format is forbidden:
Bad: "5 percent", "2 zeros"
Good: "five percent", "two zeros"
- Normalize: percentages, currencies, units, ranges, decimals, dates/years — ONLY if they are spoken.
- If a unit (for example “percent”) is NOT spoken, do not add it.
2) Add any disfluencies present in the audio.
- Disfluencies such as "um" and "uh" that are present in the audio should be added to the text.
- If a word is repeated in the audio but the repetition is missing from the Ground Truth Transcription, add it to the text.

ENTITIES (names, places, brands, titles, etc.) should be the same as in the Ground Truth Transcription:
- Keep every named entity from the Ground Truth Transcription in its exact written form: spelling, casing, script, and punctuation. This includes names, places, brands, titles, acronyms, and other proper nouns.

OUTPUT FORMAT:
- Return only the final text.
- No explanations, no JSON, no lists.