This file is for any AI coding agent working in this repo (Claude, Codex, Cursor, Aider, etc.).
speech-studio — Speech Studio, a Soniqo project. Open-source desktop app for content creators.
MVP scope. Voice cloning + adjusting a cloned voice over a video timeline + emotional markers (style / prosody tags on the synthesized speech). The first end-to-end story:
- Drop a short reference clip → clone the speaker.
- Drop a video → extract / line up the existing dialogue.
- Rewrite or re-record lines in the cloned voice, with inline emotion markers (e.g. `<whisper>`, `<excited>`, `<sad>`).
- Preview against the video; export muxed output.
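One way to picture the inline emotion markers is as tags in a script line that get split into styled segments before synthesis. The sketch below is illustrative only: the marker semantics (a `<name>` tag applies to everything after it until the next marker), the `Segment` type, and the `parseMarkers` helper are assumptions, not the project's actual format.

```typescript
// Hypothetical segment shape: text plus the emotion in effect (null = neutral).
interface Segment {
  emotion: string | null;
  text: string;
}

// Split a script line on <marker> tags. Assumes a marker applies to all text
// that follows it, until the next marker or the end of the line.
function parseMarkers(line: string): Segment[] {
  const segments: Segment[] = [];
  const tag = /<([a-z][a-z0-9_-]*)>/g;
  let emotion: string | null = null;
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = tag.exec(line)) !== null) {
    // Text between the previous marker and this one keeps the previous emotion.
    const text = line.slice(cursor, match.index).trim();
    if (text.length > 0) segments.push({ emotion, text });
    emotion = match[1] ?? null;
    cursor = match.index + match[0].length;
  }
  const tail = line.slice(cursor).trim();
  if (tail.length > 0) segments.push({ emotion, text: tail });
  return segments;
}

// "Hello there. <whisper> Keep it down." yields a neutral segment followed by
// a segment tagged "whisper".
const exampleSegments = parseMarkers("Hello there. <whisper> Keep it down.");
```

Keeping the parse step separate from synthesis would let the script editor highlight markers without loading a model.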
Status: v0 scaffold in place. Tauri shell compiles, the Swift sidecar responds to ping over stdin/stdout, the Rust side round-trips a JSON request, and the React frontend can invoke the round-trip. Real Qwen3-TTS wiring is the next step.
Tauri (Rust shell + web frontend) wrapping the Soniqo speech engines.
- Rust process: Tauri app, owns the window, menu, file pickers, IPC, model lifecycle, file I/O. Talks to a voice-cloning TTS backend through a sidecar chosen at compile time per OS:
  - macOS (Apple Silicon): `speech-swift` (Swift / MLX) via the `swift-sidecar/` binary, cloning + cloned-voice TTS with the `VoxCPM2` MLX model.
  - Windows / Linux (x86_64): `speech-core` (C++) via the `core-sidecar/` binary, cloning + cloned-voice TTS with the `VoxCPM2` LiteRT model through the C ABI in `include/speech_core/voxcpm2_c.h`.
  - v1+: `speech-core`'s broader C ABI (`speech_core_c.h`) for STT (Parakeet), VAD (Silero), noise suppression (DeepFilterNet3), audio utilities.
- Web frontend: React + Vite, rendered into the OS WebView (WKWebView on macOS, WebView2 on Windows, WebKitGTK on Linux). Owns the video timeline, voice-clone manager, script editor with emotion markers, and waveform views. Talks to Rust via Tauri `invoke()` commands and events.
- Bridge mechanism: a stateful sidecar binary bundled with the app. Tauri spawns it; Rust talks to it over stdin/stdout using an NDJSON protocol (one JSON object per line each way). The sidecar loads the model once and keeps it resident across calls, so per-line synthesis after warmup is fast. The same protocol (`ping` / `init_model` / `synthesize_voxcpm2`) is implemented by both sidecars, so `SidecarManager` (`src-tauri/src/lib.rs`) only varies the binary path per OS. Code: `swift-sidecar/` (Swift package, macOS) and `core-sidecar/` (CMake C++, Windows/Linux).
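The NDJSON framing described above can be sketched independently of any process plumbing. This is an illustration, not the real wire format: only the command names `ping`, `init_model`, and `synthesize_voxcpm2` come from this document; the `id` and `cmd` field names, the `SidecarRequest` shape, and the helpers `encodeRequest` and `NdjsonDecoder` are assumptions.

```typescript
// Command names come from the protocol above; every field name is hypothetical.
type Command = "ping" | "init_model" | "synthesize_voxcpm2";

interface SidecarRequest {
  id: number;
  cmd: Command;
  [extra: string]: unknown;
}

// One JSON object per line: JSON.stringify (without an indent argument) never
// emits a raw newline, so appending "\n" is enough to frame a message.
function encodeRequest(req: SidecarRequest): string {
  return JSON.stringify(req) + "\n";
}

// Incremental decoder: stdout arrives in arbitrary chunks, so buffer until a
// full line is available and keep any trailing partial line for the next chunk.
class NdjsonDecoder {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is either "" (chunk ended on a newline) or a partial line.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as unknown);
  }
}
```

Buffering partial lines matters because a single read from the sidecar's stdout may end in the middle of a JSON object, especially for large synthesis responses.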
Target platforms.
- v0: macOS (Apple Silicon), `VoxCPM2` via MLX (`swift-sidecar`). The original headline cloning + emotional-marker path.
- v0: Windows / Linux (x86_64), `VoxCPM2` via speech-core's LiteRT backend (`core-sidecar`). Same clone + emotion-marker story without MLX. ASR-graded retry isn't wired here yet (see Notes); the first successful take is accepted.
Why Tauri (vs Electron): smaller binaries, native shell, easier C++ FFI from Rust, desktop-first distribution. Matches the "deploy-anywhere" positioning.
No Chromium, no Node in the shipped app. WKWebView is part of macOS; the only JS that ships is our built bundle. Node lives on dev machines as a build-time toolchain (like Cargo) — never in the .app.
- speech-core: C++ engine. v0 dependency on Windows/Linux, where the `core-sidecar` links its `speech_core_models_litert` static lib + the `libLiteRt` runtime and drives `VoxCPM2` via `include/speech_core/voxcpm2_c.h`. Also the v1+ source of truth for VAD / STT / non-cloned TTS / enhancement. Build it with `-DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=...` first (see its `AGENTS.md` for the C ABI and CMake targets).
- speech-swift: speech models runtime for Apple Silicon (MLX / CoreML). v0 voice-cloning + cloned-voice TTS backend on macOS, via `Qwen3-TTS` (`Sources/Qwen3TTS/`, ICL clone API in `Qwen3TTS+ICL.swift`).
- speech-models: model artifacts on Hugging Face (`aufklarer/`). Studio bundles or downloads from here on first use.
Common: Rust 1.95+ via rustup, Node 20+ and pnpm 11+.
macOS (Apple Silicon) — Swift sidecar:
- macOS 15+, Xcode 26+ (check with `xcode-select -p`)
Windows / Linux (x86_64) — C++ sidecar linking speech-core:
- A C++17 toolchain + CMake 3.16+ (MSVC + Visual Studio on Windows; gcc/clang on Linux)
- A built `speech-core` checkout with LiteRT and the model downloader: `-DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=... -DSPEECH_CORE_WITH_HF_DOWNLOAD=ON`. The download feature needs libcurl (`find_package(CURL)`): system libcurl on Linux; on Windows, vcpkg (`vcpkg install curl:x64-windows-static-md`, then pass the vcpkg toolchain + triplet at configure).
- The `VoxCPM2-LiteRT` model bundle (~4.6 GB) is downloaded on first run by the sidecar (`sc_voxcpm2_create_from_pretrained`), so no manual fetch is needed. To use a pre-downloaded bundle instead, set `SONIQO_VOXCPM2_BUNDLE_DIR`.
pnpm install   # frontend + Tauri CLI deps

macOS sidecar:
cd swift-sidecar && swift build   # binary at .build/debug/soniqo-tts-sidecar

Windows / Linux sidecar (from the speech-studio root). The sidecar links speech-core's prebuilt libs and libcurl; on Windows pass the vcpkg toolchain so `find_package(CURL)` resolves:
cmake -B core-sidecar/build core-sidecar \
-DSPEECH_CORE_DIR=../speech-core \
-DSPEECH_CORE_BUILD_DIR=../speech-core/build # build dir with HF download on
# Windows only — add:
# -DCMAKE_TOOLCHAIN_FILE=<vcpkg>/scripts/buildsystems/vcpkg.cmake \
# -DVCPKG_TARGET_TRIPLET=x64-windows-static-md
cmake --build core-sidecar/build --config Release
# → core-sidecar/build[/Release]/speech-core-tts-sidecar(.exe), with libLiteRt colocated

pnpm tauri dev   # opens the app, hot-reloads the frontend

On Windows/Linux the sidecar downloads the `VoxCPM2-LiteRT` bundle from Hugging Face on first run (cached under the OS cache dir / `SPEECH_CORE_CACHE_DIR`). To skip that and use a local bundle, export `SONIQO_VOXCPM2_BUNDLE_DIR=/path/to/bundle`.
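The choice between a local bundle and the first-run download can be sketched as a small decision function. The environment variable names and the default model id are taken from this file; the function name `resolveBundleSource`, the `BundleSource` type, and treating an empty variable as unset are assumptions for illustration, not the sidecar's actual code.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical result type: either load from disk or download by model id.
type BundleSource =
  | { kind: "local"; dir: string }
  | { kind: "download"; modelId: string };

const DEFAULT_MODEL_ID = "soniqo/VoxCPM2-LiteRT";

function resolveBundleSource(env: Env): BundleSource {
  // A local bundle directory takes precedence over any download.
  const dir = env["SONIQO_VOXCPM2_BUNDLE_DIR"];
  if (dir !== undefined && dir.length > 0) {
    return { kind: "local", dir };
  }
  // Otherwise download, optionally with an overridden model id.
  const modelId = env["SONIQO_VOXCPM2_MODEL_ID"];
  return {
    kind: "download",
    modelId: modelId !== undefined && modelId.length > 0 ? modelId : DEFAULT_MODEL_ID,
  };
}
```

Passing the environment in as a plain object keeps the rule testable without touching the real process environment.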
macOS:
cd swift-sidecar && swift build -c release
cd .. && pnpm tauri build   # .app + .dmg under src-tauri/target/release/bundle/

Windows / Linux: build the C++ sidecar (above), then stage it for Tauri's `externalBin` (the bundler appends the target triple) before `pnpm tauri build`:
mkdir -p src-tauri/binaries
# Windows example (x86_64-pc-windows-msvc):
cp core-sidecar/build/Release/speech-core-tts-sidecar.exe \
src-tauri/binaries/speech-core-tts-sidecar-x86_64-pc-windows-msvc.exe
cp core-sidecar/build/Release/libLiteRt.dll src-tauri/binaries/libLiteRt.dll
pnpm tauri build   # add --no-bundle to skip msi/nsis installers

- `pnpm-workspace.yaml` whitelists `esbuild` for pnpm 11's `allowBuilds` check. Don't drop it: without it `pnpm exec` fails before any script runs.
- The Rust side keeps one sidecar process alive across calls (see `SidecarManager` in `src-tauri/src/lib.rs`) so the model stays warm. Spawned lazily on first IPC. `sidecar_path()` picks the per-OS binary; dev spawns from the sidecar's build dir, release from next to the app binary.
- Per-OS Tauri bundle settings live in `tauri.{macos,windows,linux}.conf.json` (merged over `tauri.conf.json`): `externalBin` selects the sidecar, `resources` ships its runtime (`mlx.metallib` on macOS, `libLiteRt.dll` / `.so` on Windows/Linux). `tauri-build` verifies `externalBin` on every cargo build, so stage `src-tauri/binaries/<name>-<triple>` first (it's gitignored) or even `cargo test --lib` fails.
- macOS sidecar (`swift-sidecar`): the build doesn't emit `mlx.metallib` next to the binary on its own. Copy it once from the speech-swift build that does (`~/repos/speech-swift/.build/arm64-apple-macosx/debug/mlx.metallib` → `swift-sidecar/.build/arm64-apple-macosx/debug/mlx.metallib`) or you'll get `MLX error: Failed to load the default metallib`. The Rust `colocate_metallib` helper (macOS-only) handles the bundled `.app` layout.
- Windows/Linux sidecar (`core-sidecar`): `cfg`'d out on macOS. On `init_model` it loads the bundle from `SONIQO_VOXCPM2_BUNDLE_DIR` if set, else calls `sc_voxcpm2_create_from_pretrained` to download + cache it (`SONIQO_VOXCPM2_MODEL_ID`, default `soniqo/VoxCPM2-LiteRT`; cache via `SONIQO_MODEL_CACHE_DIR` / `SPEECH_CORE_CACHE_DIR`). The CMake colocates `libLiteRt` next to the binary; libcurl is linked statically (no extra DLL). `cfgValue` from the synth ladder has no LiteRT knob and is ignored (the ladder still varies `seed`).
- TTS model: macOS MLX path defaults to `aufklarer/VoxCPM2-MLX-int8`. Windows/Linux use the `VoxCPM2-LiteRT` bundle, downloaded on first run (resumable; see speech-core's `SPEECH_CORE_WITH_HF_DOWNLOAD`) or supplied via `SONIQO_VOXCPM2_BUNDLE_DIR`.
- Generated clip audio is cached under `dirs::cache_dir()/audio.soniqo.studio/clips/` (`~/Library/Caches/...` on macOS, `%LOCALAPPDATA%\...` on Windows, `$XDG_CACHE_HOME` or `~/.cache` on Linux). The Rust and sidecar sides compute this independently and must stay in sync.
- Follow-ups: (1) ASR-graded retry is macOS-only (`GRADING_AVAILABLE` in `lib.rs`); Windows/Linux accept the first successful take, and wiring Parakeet-via-sidecar grading is TODO. (2) A Windows/Linux CI lane to build + publish installers on tag (`build.yml` only covers macOS today). First-run model download is now handled by speech-core, so installers don't embed the bundle.
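Because the clip cache path is computed on both the Rust and sidecar sides, it can help to pin the rule down in one place. The sketch below mirrors the per-OS locations listed in the notes above; the function name `clipCacheDir`, the `HostInfo` type, the Windows fallback when `LOCALAPPDATA` is unset, and passing the environment and home directory in explicitly are all assumptions. The real Rust side relies on `dirs::cache_dir()`.

```typescript
type Platform = "macos" | "windows" | "linux";

// Hypothetical bundle of the inputs the path rule depends on.
interface HostInfo {
  platform: Platform;
  home: string;
  env: Record<string, string | undefined>;
}

const APP_DIR = "audio.soniqo.studio";

function clipCacheDir(host: HostInfo): string {
  switch (host.platform) {
    case "macos":
      return `${host.home}/Library/Caches/${APP_DIR}/clips`;
    case "windows": {
      // Assumed fallback if LOCALAPPDATA is missing; not taken from the source.
      const base = host.env["LOCALAPPDATA"] ?? `${host.home}\\AppData\\Local`;
      return `${base}\\${APP_DIR}\\clips`;
    }
    case "linux": {
      // XDG_CACHE_HOME wins when set and non-empty, otherwise ~/.cache.
      const xdg = host.env["XDG_CACHE_HOME"];
      const base = xdg !== undefined && xdg.length > 0 ? xdg : `${host.home}/.cache`;
      return `${base}/${APP_DIR}/clips`;
    }
  }
}
```

A table of expected paths like the one this function encodes could double as a shared test fixture, so a change on one side fails the other side's tests.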
.
├── AGENTS.md project conventions (this file)
├── CLAUDE.md symlink → AGENTS.md
├── package.json React + Vite + @tauri-apps deps
├── pnpm-workspace.yaml allowBuilds: esbuild (pnpm 11)
├── pnpm-lock.yaml
├── vite.config.ts
├── tsconfig.json
├── index.html Vite entry
├── src/ React frontend
│ ├── App.tsx v0 sanity-check UI (ping_sidecar)
│ ├── main.tsx
│ └── …
├── public/ static assets served by Vite
├── src-tauri/ Rust Tauri shell
│ ├── Cargo.toml
│ ├── tauri.conf.json base config (productName, window)
│ ├── tauri.macos.conf.json macOS sidecar + metallib + .app/.dmg
│ ├── tauri.windows.conf.json Windows sidecar + libLiteRt + msi/nsis
│ ├── tauri.linux.conf.json Linux sidecar + libLiteRt + deb/appimage
│ ├── binaries/ staged sidecars for externalBin (gitignored)
│ ├── src/lib.rs Tauri commands + SidecarManager
│ ├── src/main.rs
│ ├── capabilities/
│ └── icons/
├── swift-sidecar/ macOS sidecar (Swift/MLX, VoxCPM2)
│ ├── Package.swift macOS 15+, Swift 6.0
│ └── Sources/soniqo-tts-sidecar/
│ └── main.swift NDJSON request loop
└── core-sidecar/ Windows/Linux sidecar (C++/LiteRT, VoxCPM2)
├── CMakeLists.txt links speech-core's speech_core_models_litert
└── src/main.cpp NDJSON loop over voxcpm2_c.h
- Do not mention Claude, Codex, Cursor, Anthropic, OpenAI, or any AI assistant in commit messages, PR titles, PR descriptions, or code comments.
- Do not add `Co-Authored-By: <AI> ...` trailers or `Generated with …` footers.
- Write as if authored by a human contributor: focus on the why of the change.
- Never push directly to `main`. Branch → PR → merge.
- Branch naming: `feat/description`, `fix/description`, `chore/description`, `docs/description`.
- PR description: summary, what changed, test plan. No marketing fluff.
- Don't commit unless explicitly asked. Likewise for `git push`.
- Never amend commits or force-push unless the user explicitly asks.
- Always ask for confirmation before externally-visible actions — pushes, PRs, comments, external service calls. Local commits and local builds are fine without asking.
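The branch naming rule in the list above could be checked mechanically, for example from a local hook. This is a minimal sketch, assuming descriptions are lowercase alphanumeric words separated by hyphens; the document only fixes the four prefixes, and `isValidBranchName` is a hypothetical helper.

```typescript
// Prefixes come from the conventions above; the description format is assumed.
const BRANCH_PATTERN = /^(feat|fix|chore|docs)\/[a-z0-9]+(-[a-z0-9]+)*$/;

function isValidBranchName(name: string): boolean {
  return BRANCH_PATTERN.test(name);
}
```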
When a Studio feature needs a change in speech-core (new TTS knob, new model, new C-ABI symbol), one PR per repo, merged in order:
- speech-core lands first — adds the API.
- speech-studio bumps the `speech-core` pin and uses the new API.
Don't bundle a single PR that straddles repos; each repo has its own review and release cadence.