
feat: add OpenAI-compatible Audio Speech API endpoint#41

Open
pinghe wants to merge 3 commits into OpenMOSS:main from pinghe:main

Conversation


@pinghe pinghe commented Apr 24, 2026

Summary

  • Add OpenAI-compatible POST /v1/audio/speech endpoint for TTS
  • Fix torchaudio SoX backend segfault
  • Fix voice preset mappings referencing non-existent audio files
  • Enable GPU inference (was hardcoded to CPU)
  • Add timestamps to uvicorn access logs

Changes

New file: openai_audio_api.py

  • OpenAI /v1/audio/speech request/response models (SpeechRequest, make_error_response)
  • Voice mapping: OpenAI voice names to MOSS-TTS-Nano presets (alloy to Junhao, echo to Xiaoyu, etc.)
  • Audio format helpers: WAV header construction, PCM 16-bit encoding, MP3 encoding via lameenc
  • Streaming generators: iter_pcm_audio, generate_wav_stream, generate_mp3_stream, generate_pcm_stream
  • Supports wav, mp3, and pcm response formats
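The WAV header construction mentioned above likely has to deal with the fact that a streaming response has no known total length. A minimal sketch of such a helper (the actual code in `openai_audio_api.py` may differ; field defaults here match the PR's 48kHz/stereo/16-bit test output):

```python
import struct

def make_wav_header(sample_rate: int = 48_000,
                    channels: int = 2,
                    bits_per_sample: int = 16,
                    data_size: int = 0xFFFFFFFF - 44) -> bytes:
    """Build a 44-byte RIFF/PCM WAV header.

    For streaming, the total length is unknown up front, so the size
    fields default to (near) the uint32 maximum; most players accept
    this and simply read until EOF.
    """
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    return b"".join([
        b"RIFF",
        struct.pack("<I", 36 + data_size),  # overall chunk size
        b"WAVEfmt ",
        struct.pack("<IHHIIHH",
                    16,              # fmt sub-chunk size
                    1,               # audio format 1 = PCM
                    channels,
                    sample_rate,
                    byte_rate,
                    block_align,
                    bits_per_sample),
        b"data",
        struct.pack("<I", data_size),
    ])
```

The header is emitted once at the start of `generate_wav_stream`-style generators, followed by raw PCM chunks.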

app.py

  • Add OpenAI-compatible endpoint using background thread + queue streaming model
    • Avoids holding _cpu_execution_lock inside ASGI streaming iterator (prevents deadlock on client disconnect)
    • _put() has 30s deadline to prevent threads from blocking indefinitely when client disconnects
    • Explicit events_gen.close() to release the lock promptly
    • Wrap lameenc return values in bytes() to fix Starlette bytearray type error
  • Add _patch_torchaudio_backend(): monkey-patch torchaudio to default to soundfile backend, working around SoX segfault
  • Change --device default from cpu to auto, add cuda option, use resolve_device() for auto GPU detection
  • Customize uvicorn log config to add timestamps to access logs
  • Add request/complete logging with elapsed time and audio chunk count
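The background-thread-plus-queue pattern described above can be sketched as follows. `_put` and its 30s deadline are named in the PR; everything else is illustrative, not the PR's actual code:

```python
import queue
import threading

_SENTINEL = object()  # marks end-of-stream on the queue

def stream_with_worker(synthesize, put_deadline: float = 30.0):
    """Yield audio chunks produced on a background thread.

    The worker thread (not the ASGI streaming iterator) holds any heavy
    lock, and _put() gives up after `put_deadline` seconds so a
    disconnected client cannot block the thread forever.
    """
    q: queue.Queue = queue.Queue(maxsize=8)

    def _put(item) -> bool:
        try:
            q.put(item, timeout=put_deadline)
            return True
        except queue.Full:
            return False  # consumer is gone; let the worker exit

    def _worker():
        try:
            for chunk in synthesize():
                if not _put(chunk):
                    return
        finally:
            _put(_SENTINEL)

    threading.Thread(target=_worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield bytes(item)  # Starlette requires bytes, not bytearray
```

The final `bytes(item)` conversion mirrors the PR's fix for lameenc returning `bytearray` values that Starlette rejects.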

app_onnx.py, infer.py

  • Add _patch_torchaudio_backend() to fix SoX segfault
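The monkey-patch idea behind `_patch_torchaudio_backend()` can be illustrated generically: wrap the module's `load()` so every call defaults to the soundfile backend, sidestepping the SoX code path that segfaults. This is a sketch, not the PR's implementation (on torchaudio >= 2.1, `load`/`save` accept a `backend` keyword):

```python
import functools
import types

def patch_default_backend(audio_mod, backend: str = "soundfile") -> None:
    """Monkey-patch `audio_mod.load` so it defaults to `backend`.

    Explicit backend arguments passed by callers still win, because we
    only use setdefault() on the kwargs.
    """
    _orig_load = audio_mod.load

    @functools.wraps(_orig_load)
    def _load(*args, **kwargs):
        kwargs.setdefault("backend", backend)
        return _orig_load(*args, **kwargs)

    audio_mod.load = _load

# Demo on a stand-in module: after patching, load() defaults to soundfile.
fake = types.SimpleNamespace(load=lambda path, **kw: kw.get("backend", "sox_io"))
patch_default_backend(fake)
```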

moss_tts_nano_runtime.py

  • Remove 8 voice presets whose audio files don't exist in the repository (Zhiming, Weiguo, Trump, Nathan, Sakura, Aoi, Hina, Mei), keeping only the 8 with actual files

pyproject.toml, requirements.txt

  • Add lameenc>=1.7.0 dependency (MP3 encoding)
  • Add openai_audio_api module declaration

Test plan

  • WAV format returns valid audio (RIFF PCM 16-bit stereo 48kHz)
  • MP3 format returns valid audio (MPEG layer III, 128kbps, 48kHz), plays correctly in mpv
  • PCM format returns valid raw audio data
  • Consecutive requests don't deadlock (lock released correctly)
  • Service recovers after client disconnect (_put deadline mechanism)
  • Invalid voice/params return OpenAI-format error JSON (HTTP 400)
  • GPU auto-detection works when CUDA is available

@guoqiangui

missing openai_audio_api.py

pinghe added 2 commits April 26, 2026 19:03
- Added nova → Lingyu, ballad → 男播音 (male announcer), yangmi → 杨幂 (Yang Mi), etc.

- Time format: 6:00 → 6点 ("6 o'clock"; processed before colon replacement)
- Colon replacement: : → , (was 。, which caused repetition and content loss)
- Consecutive punctuation cleanup: ,。 or 。, collapsed to 。
- Temperature range: keep N°C and M°C as separate conversions (merging caused repetition)
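The normalization rules above could be sketched as a few ordered regex passes (function name and exact patterns are illustrative; the temperature rule is kept as separate per-value conversions elsewhere, so it is omitted here):

```python
import re

def normalize_colons(text: str) -> str:
    """Apply the colon-related text-normalization passes in order.

    1. "6:00"-style times become "6点" before the colon pass, so they
       are not mangled into commas.
    2. Remaining colons become "，" (a full stop here caused the model
       to repeat or drop content).
    3. Consecutive "，。" / "。，" pairs collapse to "。".
    """
    text = re.sub(r"(\d{1,2}):00", r"\1点", text)   # 6:00 -> 6点
    text = re.sub(r"[:：]", "，", text)              # colon -> comma
    text = re.sub(r"，。|。，", "。", text)           # collapse pairs
    return text
```

Ordering matters: running the colon pass first would turn "6:00" into "6，00" and lose the time reading.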

- wav / mp3 / pcm / opus output formats
- Opus encoded via ffmpeg subprocess with atempo filter for speed adjustment
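The ffmpeg invocation for Opus output might be assembled roughly like this (argv sketch only; the PR's actual flags may differ; note ffmpeg's atempo filter accepts 0.5-2.0 per instance on older builds, so extreme speeds would need chained instances):

```python
def build_opus_command(speed: float = 1.0,
                       sample_rate: int = 48_000,
                       channels: int = 2) -> list[str]:
    """argv for piping raw s16le PCM on stdin through ffmpeg,
    producing Opus on stdout, with atempo handling speed adjustment."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-f", "s16le", "-ar", str(sample_rate), "-ac", str(channels),
        "-i", "pipe:0",
        "-filter:a", f"atempo={speed}",
        "-f", "opus", "pipe:1",
    ]
```

The command would be run with `subprocess.Popen(..., stdin=PIPE, stdout=PIPE)`, feeding PCM chunks in and streaming encoded Opus out.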

  |              | Before              | After              |
  | ------------ | ------------------- | ------------------ |
  | Architecture | 7 chunks sequential | 6 workers parallel |
  | Latency      | ~105s               | ~22-25s            |
  | Speedup      | -                   | ~4.5x              |

- Pre-create 6 independent ONNX InferenceSession groups (~3MB each) at startup, each with its own thread pool
- Parallel chunk execution via concurrent.futures.ThreadPoolExecutor
- Each worker acquires an exclusive pool slot, temporarily swapping sessions and rng for thread safety
- Results reassembled in original chunk order
- Added detailed per-step timing logs (ONNX timing)
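The worker-pool pattern above can be sketched with a `Queue` of free session groups plus `ThreadPoolExecutor` (names are illustrative; the real code also swaps `rng` state and logs timings):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_chunks_parallel(chunks, synthesize_one, session_pool):
    """Run text chunks across a fixed pool of ONNX session groups.

    Each task takes an exclusive session group from the free queue,
    synthesizes its chunk, and returns the group. Results come back in
    the original chunk order because executor.map preserves input order.
    """
    free: queue.Queue = queue.Queue()
    for sessions in session_pool:
        free.put(sessions)

    def _one(chunk):
        sessions = free.get()   # exclusive slot: blocks while all busy
        try:
            return synthesize_one(chunk, sessions)
        finally:
            free.put(sessions)  # hand the slot back for the next chunk

    with ThreadPoolExecutor(max_workers=len(session_pool)) as ex:
        return list(ex.map(_one, chunks))
```

With 7 chunks and 6 session groups, six chunks run concurrently and the seventh starts as soon as any slot frees up, which matches the ~4.5x speedup reported above.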

- Fixed default_cpu_threads: it used os.cpu_count() and created a new runtime with _parallel_workers=1; it now uses the configured thread_count
- Fixed the synthesize_stream signature for compatibility with PyTorch-specific parameters

- Added _create_sessions_with_threads() to create session groups with a specified thread count (used by the parallel pool)
- Refactored _create_sessions() to delegate to the new method

- resolve_prompt_audio_codes fallback: when a voice is not in the builtin manifest, automatically load the corresponding wav from assets/audio/ and encode on the fly
- Also checks the PyTorch preset map (_DEFAULT_VOICE_FILES) for wav filename resolution
- Supports voices like 杨幂 (zh_11.wav), 男播音 (zh_10.wav) not present in the ONNX manifest
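The fallback order described above might look like the following sketch (`_DEFAULT_VOICE_FILES` is named in the PR; the mapping contents and function shape here are illustrative):

```python
from pathlib import Path

# Illustrative subset of the PyTorch preset map named in the PR.
_DEFAULT_VOICE_FILES = {"杨幂": "zh_11.wav", "男播音": "zh_10.wav"}

def resolve_prompt_wav(voice: str, manifest: dict,
                       assets_dir: str = "assets/audio"):
    """Resolve prompt audio for a voice, in fallback order:

    1. builtin ONNX manifest entry, if present (pre-encoded codes)
    2. otherwise a wav filename from the PyTorch preset map, loaded
       from assets/audio/ and encoded on the fly
    Returns manifest codes, a Path to a wav file, or None.
    """
    if voice in manifest:
        return manifest[voice]
    filename = _DEFAULT_VOICE_FILES.get(voice)
    if filename:
        return Path(assets_dir) / filename
    return None
```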

- Added 男播音 (zh_10.wav) and 杨幂 (zh_11.wav) presets

- Added chunk splitting logs to the OpenAI endpoint
- Performance tuning: max_new_frames=200, voice_clone_max_memory_per_sample_gb=0.6, tts_max_batch_size=7

  | Config                         | Latency |
  | ------------------------------ | ------- |
  | PyTorch + CUDA (before tuning) | ~38s    |
  | PyTorch + CUDA (after tuning)  | ~22s    |
  | ONNX + CPU (sequential)        | ~105s   |
  | ONNX + CPU (6-worker parallel) | ~22-25s |
