| 1 | +--- |
| 2 | +title: "Run Vosk Transcription With Sapat in Daytona" |
| 3 | +description: "Build a reproducible Daytona workspace for offline Vosk speech-to-text with Sapat." |
| 4 | +date: 2026-05-20 |
| 5 | +author: "Aldo Giovanni" |
| 6 | +tags: ["daytona", "sapat", "transcription", "vosk", "python"] |
| 7 | +--- |
| 8 | + |
| 9 | +# Run Vosk Transcription With Sapat in Daytona |
| 10 | + |
| 11 | +## Introduction |
| 12 | + |
| 13 | +Hosted transcription APIs are convenient, but they are not always the right |
| 14 | +default. Product demos, customer calls, internal design reviews, and incident |
| 15 | +recordings often contain material that should not leave the team environment |
| 16 | +until someone has reviewed it. A local speech-to-text model gives engineers a |
| 17 | +practical middle path: generate a draft transcript fast, keep the data path |
| 18 | +visible, and decide later whether a hosted model is worth using for correction |
| 19 | +or enrichment. |
| 20 | + |
| 21 | +This guide shows how to run Sapat with an offline Vosk provider inside a |
| 22 | +Daytona workspace. Sapat already handles the repetitive parts of a transcription |
| 23 | +workflow: converting recordings with `ffmpeg`, choosing a provider through |
| 24 | +`--api`, and writing a sidecar `.txt` file next to each source recording. The |
| 25 | +Vosk provider adds a local option for teams that want predictable cost, |
| 26 | +repeatable setup, and no remote audio upload during the first transcription |
| 27 | +pass. |
| 28 | + |
| 29 | + |
| 30 | + |
| 31 | +## TL;DR |
| 32 | + |
| 33 | +- Use Daytona to create a clean workspace for Sapat and the Vosk model files. |
| 34 | +- Install Sapat with the optional `vosk` extra so the offline provider is |
| 35 | + available without changing the hosted API paths. |
| 36 | +- Set `VOSK_MODEL_PATH` to an unpacked model directory and run |
| 37 | + `sapat recording.mp4 --api vosk`. |
| 38 | +- Keep the generated transcript, command log, and review notes together so the |
| 39 | + output can be audited before sharing. |
| 40 | + |
| 41 | +## When an Offline Provider Makes Sense |
| 42 | + |
| 43 | +[Offline transcription](../definitions/20260520_definition_offline_transcription.md) |
| 44 | +is useful when the first requirement is control. A local Vosk model can run |
| 45 | +without sending the audio file to a hosted endpoint, which makes it a good fit |
| 46 | +for early review of sensitive material. It also gives teams a low-cost smoke |
| 47 | +test before they spend hosted API credits on higher-quality transcription or |
| 48 | +post-processing. |
| 49 | + |
| 50 | +There are tradeoffs. Vosk models are fast and practical, but the output may need |
| 51 | +more human review than a larger hosted model. Speaker labels, punctuation, and |
| 52 | +domain-specific vocabulary can also require cleanup. That is acceptable for |
| 53 | +many engineering workflows because the first output is not a final publication. |
| 54 | +It is a searchable draft that helps the team find timestamps, summarize |
| 55 | +decisions, and decide what to process next. |
| 56 | + |
| 57 | +Use Vosk when you need: |
| 58 | + |
| 59 | +- A local first pass for private recordings. |
| 60 | +- A repeatable workflow that does not depend on API availability. |
| 61 | +- A cheap batch run over many demo or support recordings. |
| 62 | +- A transcript draft that will be reviewed before external sharing. |
| 63 | + |
| 64 | +Use a hosted provider when you need: |
| 65 | + |
| 66 | +- Better punctuation and formatting out of the box. |
| 67 | +- Built-in diarization, summaries, or multilingual model quality. |
| 68 | +- Centralized provider logs for a production workflow. |
| 69 | +- A managed service agreement for enterprise transcription. |
| 70 | + |
| 71 | +## Prepare the Daytona Workspace |
| 72 | + |
| 73 | +Start by creating a workspace from the Sapat repository. This keeps the code, |
| 74 | +model configuration, and generated transcript artifacts in one reproducible |
| 75 | +environment. |
| 76 | + |
| 77 | +```bash |
| 78 | +daytona create https://github.com/nibzard/sapat --code |
| 79 | +``` |
| 80 | + |
| 81 | +Open the workspace terminal and confirm the baseline tools are available: |
| 82 | + |
| 83 | +```bash |
| 84 | +python --version |
| 85 | +ffmpeg -version |
| 86 | +``` |
| 87 | + |
| 88 | +Sapat uses `ffmpeg` to convert source recordings into an intermediate MP3 file. |
| 89 | +The Vosk provider then converts that file into mono WAV audio at the configured |
| 90 | +sample rate before passing it to Vosk. Installing `ffmpeg` in the workspace keeps |
| 91 | +the conversion consistent across contributors. |
| 92 | + |
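To make the second conversion step concrete, here is a minimal sketch of an `ffmpeg` argument list that produces mono 16-bit PCM WAV at a chosen sample rate. The function name `build_wav_command` and the file paths are hypothetical, and the exact flags Sapat's Vosk provider passes may differ; only standard `ffmpeg` options are used.

```python
import shlex


def build_wav_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Return an ffmpeg argument list that writes mono 16-bit PCM WAV.

    Hypothetical helper for illustration; not taken from the Sapat source.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file (for example the intermediate MP3)
        "-ac", "1",               # downmix to a single mono channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # encode as 16-bit little-endian PCM
        dst,
    ]


if __name__ == "__main__":
    cmd = build_wav_command("recordings/demo-call.mp3", "recordings/demo-call.wav")
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

Building the command as a list keeps it safe to hand to `subprocess.run` without shell quoting issues.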
| 93 | +Install Sapat in editable mode with the Vosk optional dependency: |
| 94 | + |
| 95 | +```bash |
| 96 | +python -m pip install --upgrade pip |
| 97 | +python -m pip install -e ".[vosk]" |
| 98 | +``` |
| 99 | + |
| 100 | +If you are testing against the companion implementation branch, fetch the branch |
| 101 | +from the Sapat pull request before installing: |
| 102 | + |
| 103 | +```bash |
| 104 | +git fetch origin pull/39/head:vosk-provider |
| 105 | +git switch vosk-provider |
| 106 | +python -m pip install -e ".[vosk]" |
| 107 | +``` |
| 108 | + |
| 109 | +## Download and Configure a Vosk Model |
| 110 | + |
| 111 | +Vosk model files are separate from the Python package. Download one model from |
| 112 | +the Vosk model catalog and unpack it into the workspace. For a small English |
| 113 | +smoke test, the compact English model is usually enough. For production review, |
| 114 | +choose a language and model size that matches the recordings. |
| 115 | + |
| 116 | +```bash |
| 117 | +mkdir -p models |
| 118 | +curl -L -o models/vosk-small-en.zip \ |
| 119 | + https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip |
| 120 | +unzip models/vosk-small-en.zip -d models |
| 121 | +``` |
| 122 | + |
| 123 | +Create a local `.env` file: |
| 124 | + |
| 125 | +```bash |
| 126 | +cat > .env <<'EOF' |
| 127 | +VOSK_MODEL_PATH=models/vosk-model-small-en-us-0.15 |
| 128 | +VOSK_SAMPLE_RATE=16000 |
| 129 | +VOSK_CHUNK_SIZE=4000 |
| 130 | +EOF |
| 131 | +``` |
| 132 | + |
| 133 | +The values are intentionally simple: |
| 134 | + |
| 135 | +| Variable | Purpose | Recommended Start | |
| 136 | +| --- | --- | --- | |
| 137 | +| `VOSK_MODEL_PATH` | Path to the unpacked Vosk model directory | Required | |
| 138 | +| `VOSK_SAMPLE_RATE` | WAV sample rate used before recognition | `16000` | |
| 139 | +| `VOSK_CHUNK_SIZE` | Frames read per recognition loop | `4000` | |
| 140 | + |
| 141 | +Do not commit `.env` if it contains local paths that only work on your machine. |
| 142 | +The repeatable part belongs in the guide or project README. The local value |
| 143 | +belongs in the workspace. |
| 144 | + |
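The three variables above can be validated before a long batch starts. The following is a sketch under stated assumptions: `VoskConfig` and `load_vosk_config` are hypothetical names, the defaults mirror the recommended starting values in the table, and the variables are assumed to be present in the process environment already (how Sapat itself loads `.env` is not shown here).

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class VoskConfig:
    model_path: Path
    sample_rate: int
    chunk_size: int


def load_vosk_config(env: Mapping[str, str] = os.environ) -> VoskConfig:
    """Read and validate the Vosk settings from an environment mapping."""
    raw_path = env.get("VOSK_MODEL_PATH", "").strip()
    if not raw_path:
        raise ValueError("VOSK_MODEL_PATH is not set")
    model_path = Path(raw_path)
    # The path must be the unpacked directory, not the downloaded .zip file.
    if not model_path.is_dir():
        raise ValueError(f"VOSK_MODEL_PATH is not a directory: {model_path}")
    return VoskConfig(
        model_path=model_path,
        sample_rate=int(env.get("VOSK_SAMPLE_RATE", "16000")),
        chunk_size=int(env.get("VOSK_CHUNK_SIZE", "4000")),
    )
```

Failing early with a clear message is cheaper than discovering a bad model path after the audio conversion has already run.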
| 145 | +## Run the First Transcription |
| 146 | + |
| 147 | +Put a short test recording in the workspace. Start with a one- or two-minute |
| 148 | +clip so you can validate the flow before running a full batch. |
| 149 | + |
| 150 | +```bash |
| 151 | +mkdir -p recordings transcripts |
| 152 | +cp ~/Downloads/demo-call.mp4 recordings/demo-call.mp4 |
| 153 | +``` |
| 154 | + |
| 155 | +Run Sapat with Vosk: |
| 156 | + |
| 157 | +```bash |
| 158 | +sapat recordings/demo-call.mp4 --api vosk --quality M --language en |
| 159 | +``` |
| 160 | + |
| 161 | +Sapat will: |
| 162 | + |
| 163 | +1. Convert `recordings/demo-call.mp4` to `recordings/demo-call.mp3`. |
| 164 | +2. Convert the intermediate MP3 to Vosk-friendly WAV audio. |
| 165 | +3. Run the local Vosk recognizer. |
| 166 | +4. Write `recordings/demo-call.txt`. |
| 167 | +5. Remove temporary audio files. |
| 168 | + |
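Step 3 in the list above, the local recognizer, typically follows the chunked loop shown in this sketch. It uses the public `vosk` Python API (`Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, `FinalResult`), but the function names `read_chunks` and `transcribe` are hypothetical and this is not necessarily how Sapat's provider is written. It assumes the WAV file is already mono 16-bit PCM at `sample_rate`.

```python
import json
import wave
from typing import Iterator


def read_chunks(wav_path: str, chunk_size: int = 4000) -> Iterator[bytes]:
    """Yield raw PCM data from a WAV file, chunk_size frames at a time."""
    with wave.open(wav_path, "rb") as wf:
        while True:
            data = wf.readframes(chunk_size)
            if not data:
                break
            yield data


def transcribe(wav_path: str, model_path: str, sample_rate: int = 16000,
               chunk_size: int = 4000) -> str:
    """Run a local Vosk model over a WAV file and return the joined text."""
    # Imported lazily so read_chunks works without the optional vosk extra.
    from vosk import KaldiRecognizer, Model

    recognizer = KaldiRecognizer(Model(model_path), sample_rate)
    parts = []
    for data in read_chunks(wav_path, chunk_size):
        # AcceptWaveform returns True once an utterance has been finalized.
        if recognizer.AcceptWaveform(data):
            parts.append(json.loads(recognizer.Result()).get("text", ""))
    # Flush whatever audio is still buffered in the recognizer.
    parts.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(p for p in parts if p)
```

A call such as `transcribe("recordings/demo-call.wav", "models/vosk-model-small-en-us-0.15")` would return the draft text for a file prepared as described earlier.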
| 169 | +Open the transcript and do a quick read: |
| 170 | + |
| 171 | +```bash |
| 172 | +sed -n '1,80p' recordings/demo-call.txt |
| 173 | +``` |
| 174 | + |
| 175 | +For batch work, point Sapat at a directory: |
| 176 | + |
| 177 | +```bash |
| 178 | +sapat recordings --api vosk --quality M --language en |
| 179 | +``` |
| 180 | + |
| 181 | +The current Sapat directory mode processes only `.mp4` files. If your source |
| 182 | +recordings are `.mov`, `.m4a`, or `.wav`, convert them to `.mp4` first, or |
| 183 | +process each file directly. |
| 184 | + |
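Before a batch run, it can help to list which recordings directory mode would pick up and which it would skip. This is a small sketch: `split_recordings` is a hypothetical helper, the skipped extensions are the ones named above, and matching extensions case-insensitively is an assumption rather than documented Sapat behavior.

```python
from pathlib import Path

# Extensions named in this guide as not handled by directory mode.
SKIPPED_EXTENSIONS = {".mov", ".m4a", ".wav"}


def split_recordings(directory: str) -> tuple[list[str], list[str]]:
    """Return (mp4 file names, skipped media file names) for a directory."""
    picked, skipped = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".mp4":
            picked.append(path.name)
        elif suffix in SKIPPED_EXTENSIONS:
            skipped.append(path.name)
        # Other files (transcripts, review packets) are ignored.
    return picked, skipped


if __name__ == "__main__":
    recordings = Path("recordings")
    if recordings.is_dir():
        picked, skipped = split_recordings(str(recordings))
        print("will process:", picked)
        print("needs conversion:", skipped)
```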
| 185 | +## Add a Review Packet |
| 186 | + |
| 187 | +A transcript is more useful when it travels with context. Create a small review |
| 188 | +packet next to every important recording so future contributors know how the |
| 189 | +text was produced. |
| 190 | + |
| 191 | +```bash |
| 192 | +cat > recordings/demo-call.review.md <<'EOF' |
| 193 | +# Demo Call Review |
| 194 | + |
| 195 | +## Command |
| 196 | + |
| 197 | +`sapat recordings/demo-call.mp4 --api vosk --quality M --language en` |
| 198 | + |
| 199 | +## Environment |
| 200 | + |
| 201 | +- Provider: Vosk local model |
| 202 | +- Model path: models/vosk-model-small-en-us-0.15 |
| 203 | +- Sample rate: 16000 |
| 204 | +- Workspace: Daytona |
| 205 | + |
| 206 | +## Review Checklist |
| 207 | + |
| 208 | +- [ ] Names and product terms checked |
| 209 | +- [ ] Action items extracted |
| 210 | +- [ ] Sensitive content marked before sharing |
| 211 | +- [ ] Low-confidence sections tagged with timestamps |
| 212 | +EOF |
| 213 | +``` |
| 214 | + |
| 215 | +This packet is intentionally plain Markdown. It can be committed to an internal |
| 216 | +repo, attached to an issue, or handed to a reviewer without requiring a separate |
| 217 | +database. |
| 218 | + |
| 219 | +## Validate the Output |
| 220 | + |
| 221 | +Do not treat an offline transcript as final text. Treat it as a draft artifact. |
| 222 | +Use a short validation pass before anyone relies on it: |
| 223 | + |
| 224 | +| Check | What to Look For | Action | |
| 225 | +| --- | --- | --- | |
| 226 | +| Coverage | The transcript is not empty and roughly matches the recording length | Re-run with a larger model if too much is missing | |
| 227 | +| Names | Product names, people, repos, and acronyms are spelled correctly | Add a manual glossary note | |
| 228 | +| Decisions | Clear decisions and action items are captured | Extract into the review packet | |
| 229 | +| Privacy | Sensitive phrases are marked before sharing | Redact or keep internal | |
| 230 | +| Reproducibility | The command and model path are recorded | Update the review packet | |
| 231 | + |
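The coverage check in the table can be approximated with a words-per-minute calculation. This is only a rough heuristic: the helper names are hypothetical, and the `min_wpm` threshold of 60 is an illustrative assumption that should be tuned to your recordings, since quiet or screen-share-heavy demos naturally contain fewer words.

```python
def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Return the average number of transcript words per minute of audio."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(transcript.split()) / (duration_seconds / 60.0)


def looks_complete(transcript: str, duration_seconds: float,
                   min_wpm: float = 60.0) -> bool:
    """Flag transcripts that are empty or far shorter than the recording."""
    # min_wpm is an assumed threshold for illustration, not a Vosk setting.
    return words_per_minute(transcript, duration_seconds) >= min_wpm
```

A transcript that fails this check is a candidate for a re-run with a larger model or a closer look at the source audio.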
| 232 | +For a more formal workflow, keep a tiny golden clip in the workspace and re-run |
| 233 | +it whenever you update the model or Sapat branch: |
| 234 | + |
| 235 | +```bash |
| 236 | +sapat recordings/golden-demo.mp4 --api vosk --quality M --language en |
| 237 | +diff -u expected/golden-demo.txt recordings/golden-demo.txt || true |
| 238 | +``` |
| 239 | + |
| 240 | +The goal is not to make every word identical forever. The goal is to catch |
| 241 | +unexpected drops in quality when a model, conversion setting, or provider path |
| 242 | +changes. |
| 243 | + |
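Because exact equality is not the goal, a similarity score can be more useful than a raw `diff` for the golden clip. The sketch below uses the standard library's `difflib.SequenceMatcher` on word lists; `word_similarity`, `golden_clip_ok`, and the `0.9` threshold are hypothetical choices for illustration and are not part of Sapat.

```python
import difflib


def word_similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity ratio between two transcripts, word by word."""
    a = expected.lower().split()
    b = actual.lower().split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def golden_clip_ok(expected: str, actual: str, min_ratio: float = 0.9) -> bool:
    """Return True when the new transcript stays close to the golden one."""
    # min_ratio is an assumed threshold; pick one that fits your recordings.
    return word_similarity(expected, actual) >= min_ratio
```

Comparing lowercase words instead of characters keeps the score stable when only capitalization or spacing changes between runs.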
| 244 | +## Troubleshooting |
| 245 | + |
| 246 | +**Problem: `VOSK_MODEL_PATH` is missing or invalid.** |
| 247 | + |
| 248 | +Check that the path points to the unpacked model directory, not the downloaded |
| 249 | +`.zip` file. |
| 250 | + |
| 251 | +```bash |
| 252 | +ls "$VOSK_MODEL_PATH" |
| 253 | +``` |
| 254 | + |
| 255 | +You should see model files and subdirectories such as `am`, `conf`, or `graph`, |
| 256 | +depending on the model. |
| 257 | + |
| 258 | +**Problem: Vosk is not installed.** |
| 259 | + |
| 260 | +Install Sapat with the optional extra: |
| 261 | + |
| 262 | +```bash |
| 263 | +python -m pip install -e ".[vosk]" |
| 264 | +``` |
| 265 | + |
| 266 | +If you are using a locked internal environment, install `vosk` directly in the |
| 267 | +workspace image and keep that dependency in your workspace documentation. |
| 268 | + |
| 269 | +**Problem: Output is empty or very poor.** |
| 270 | + |
| 271 | +Confirm the model language matches the language spoken in the recording. Then |
| 272 | +try a larger model, check the source audio quality, and use `--quality H` for |
| 273 | +the conversion step. Noisy meeting audio may need preprocessing before any |
| 274 | +speech-to-text provider can produce reliable text. |
| 275 | + |
| 276 | +**Problem: Hosted correction is still needed.** |
| 277 | + |
| 278 | +That is normal. Use Vosk for the private first pass, then send only reviewed |
| 279 | +snippets or redacted text to a hosted LLM for cleanup. Do not upload raw |
| 280 | +recordings if your privacy requirement was the reason for using Vosk. |
| 281 | + |
| 282 | +## Where This Fits in a Team Workflow |
| 283 | + |
| 284 | +The best use of this workflow is not "perfect transcript in one command." It is |
| 285 | +"safe first draft in one reproducible workspace." That distinction matters. |
| 286 | + |
| 287 | +A product team can drop demo recordings into Daytona, run Vosk locally, and |
| 288 | +extract customer quotes or bug reproduction steps. A support team can turn a |
| 289 | +call into a searchable note before deciding whether it needs higher-quality |
| 290 | +processing. An engineering manager can review sprint demos without sending raw |
| 291 | +internal recordings to a hosted API. |
| 292 | + |
| 293 | +The model, command, output, and review packet all stay together. That makes the |
| 294 | +workflow easy to audit and easy to repeat. |
| 295 | + |
| 296 | +## Conclusion |
| 297 | + |
| 298 | +Sapat plus Vosk gives AI engineers a practical offline transcription path. The |
| 299 | +workflow is not a replacement for every hosted transcription service, but it is |
| 300 | +a strong default for private first-pass processing, cost-controlled batches, and |
| 301 | +repeatable engineering review. |
| 302 | + |
| 303 | +Use Daytona to keep the environment clean, use Sapat to make the command |
| 304 | +consistent, and use the review packet to make the transcript trustworthy enough |
| 305 | +for the next step. |
| 306 | + |
| 307 | +## References |
| 308 | + |
| 309 | +- [Sapat repository](https://github.com/nibzard/sapat) |
| 310 | +- [Vosk models](https://alphacephei.com/vosk/models) |
| 311 | +- [Daytona documentation](https://www.daytona.io/docs) |