Commit 1b56094

Add Vosk Sapat Daytona guide
1 parent df97ac5 commit 1b56094

4 files changed

Lines changed: 386 additions & 0 deletions

authors/aldo_giovanni.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
Author: Aldo Giovanni Title: Software Engineer Description: Aldo Giovanni is a
2+
software engineer and operator focused on practical AI workflows, developer
3+
tooling, and production-minded automation. He writes implementation guides that
4+
favor reproducible setup, clear validation, and systems that teams can run
5+
without hiding complexity. Author Image: ![agionni](https://github.com/agionni.png)
6+
Author LinkedIn: Author Twitter:
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
---
2+
title: 'Offline Transcription'
3+
description: 'Speech-to-text processing that runs on local models instead of a hosted API.'
4+
date: 2026-05-20
5+
author: 'Aldo Giovanni'
6+
---
7+
8+
# Offline Transcription
9+
10+
## Definition
11+
12+
Offline transcription is the process of converting speech into text with a
13+
model that runs locally on the same machine or workspace where the audio file is
14+
processed. The audio does not need to be uploaded to a hosted speech-to-text
15+
API.
16+
17+
## Context and Usage
18+
19+
Engineering teams use offline transcription when recordings contain sensitive
20+
customer calls, internal demos, unreleased product details, or regulated data.
21+
It is also useful when a workflow must keep working without internet access or
22+
when a team wants predictable cost for large batches of recordings.
23+
24+
Offline transcription still needs model files, CPU or GPU resources, and a
25+
review step. The tradeoff is direct control over the execution environment and
26+
data path.
Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
1+
---
2+
title: "Run Vosk Transcription With Sapat in Daytona"
3+
description: "Build a reproducible Daytona workspace for offline Vosk speech-to-text with Sapat."
4+
date: 2026-05-20
5+
author: "Aldo Giovanni"
6+
tags: ["daytona", "sapat", "transcription", "vosk", "python"]
7+
---
8+
9+
# Run Vosk Transcription With Sapat in Daytona
10+
11+
## Introduction
12+
13+
Hosted transcription APIs are convenient, but they are not always the right
14+
default. Product demos, customer calls, internal design reviews, and incident
15+
recordings often contain material that should not leave the team environment
16+
until someone has reviewed it. A local speech-to-text model gives engineers a
17+
practical middle path: generate a draft transcript fast, keep the data path
18+
visible, and decide later whether a hosted model is worth using for correction
19+
or enrichment.
20+
21+
This guide shows how to run Sapat with an offline Vosk provider inside a
22+
Daytona workspace. Sapat already handles the repetitive parts of a transcription
23+
workflow: converting recordings with `ffmpeg`, choosing a provider through
24+
`--api`, and writing a sidecar `.txt` file next to each source recording. The
25+
Vosk provider adds a local option for teams that want predictable cost,
26+
repeatable setup, and no remote audio upload during the first transcription
27+
pass.
28+
29+
![Vosk Sapat Daytona workflow](assets/20260520_vosk_sapat_daytona_workflow.svg)
30+
31+
## TL;DR
32+
33+
- Use Daytona to create a clean workspace for Sapat and the Vosk model files.
34+
- Install Sapat with the optional `vosk` extra so the offline provider is
35+
available without changing the hosted API paths.
36+
- Set `VOSK_MODEL_PATH` to an unpacked model directory and run
37+
`sapat recording.mp4 --api vosk`.
38+
- Keep the generated transcript, command log, and review notes together so the
39+
output can be audited before sharing.
40+
41+
## When an Offline Provider Makes Sense
42+
43+
[Offline transcription](../definitions/20260520_definition_offline_transcription.md)
44+
is useful when the first requirement is control. A local Vosk model can run
45+
without sending the audio file to a hosted endpoint, which makes it a good fit
46+
for early review of sensitive material. It also gives teams a low-cost smoke
47+
test before they spend hosted API credits on higher-quality transcription or
48+
post-processing.
49+
50+
There are tradeoffs. Vosk models are fast and practical, but the output may need
51+
more human review than a larger hosted model. Speaker labels, punctuation, and
52+
domain-specific vocabulary can also require cleanup. That is acceptable for
53+
many engineering workflows because the first output is not a final publication.
54+
It is a searchable draft that helps the team find timestamps, summarize
55+
decisions, and decide what to process next.
56+
57+
Use Vosk when you need:
58+
59+
- A local first pass for private recordings.
60+
- A repeatable workflow that does not depend on API availability.
61+
- A cheap batch run over many demo or support recordings.
62+
- A transcript draft that will be reviewed before external sharing.
63+
64+
Use a hosted provider when you need:
65+
66+
- Better punctuation and formatting out of the box.
67+
- Built-in diarization, summaries, or multilingual model quality.
68+
- Centralized provider logs for a production workflow.
69+
- A managed service agreement for enterprise transcription.
70+
71+
## Prepare the Daytona Workspace
72+
73+
Start by creating a workspace from the Sapat repository. This keeps the code,
74+
model configuration, and generated transcript artifacts in one reproducible
75+
environment.
76+
77+
```bash
78+
daytona create https://github.com/nibzard/sapat --code
79+
```
80+
81+
Open the workspace terminal and confirm the baseline tools are available:
82+
83+
```bash
84+
python --version
85+
ffmpeg -version
86+
```
87+
88+
Sapat uses `ffmpeg` to convert source recordings into an intermediate MP3 file.
89+
The Vosk provider then converts that file into mono WAV audio at the configured
90+
sample rate before passing it to Vosk. Keeping `ffmpeg` in the workspace makes
91+
the conversion deterministic across contributors.
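The second conversion step can be sketched as an `ffmpeg` argument list. This is a minimal illustration, not Sapat's actual command: the helper name `build_wav_command` and the choice of 16-bit PCM output are assumptions, while the `-ac`, `-ar`, and `-c:a` options are standard `ffmpeg` flags.

```python
from pathlib import Path


def build_wav_command(source: Path, target: Path, sample_rate: int = 16000) -> list[str]:
    """Return an ffmpeg argument list that writes mono 16-bit PCM WAV audio.

    Hypothetical helper: Sapat's real provider may build its command differently.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return [
        "ffmpeg",
        "-y",                      # overwrite the target if it already exists
        "-i", str(source),         # input file (the intermediate MP3)
        "-ac", "1",                # one audio channel (mono)
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM samples
        str(target),
    ]


command = build_wav_command(Path("recordings/demo-call.mp3"), Path("recordings/demo-call.wav"))
print(" ".join(command))
```

Building the command as a list rather than a shell string keeps file names with spaces intact when the list is passed to `subprocess.run`.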
92+
93+
Install Sapat in editable mode with the Vosk optional dependency:
94+
95+
```bash
96+
python -m pip install --upgrade pip
97+
python -m pip install -e ".[vosk]"
98+
```
99+
100+
If you are testing against the companion implementation branch, fetch the branch
101+
from the Sapat pull request before installing:
102+
103+
```bash
104+
git fetch origin pull/39/head:vosk-provider
105+
git switch vosk-provider
106+
python -m pip install -e ".[vosk]"
107+
```
108+
109+
## Download and Configure a Vosk Model
110+
111+
Vosk model files are separate from the Python package. Download one model from
112+
the Vosk model catalog and unpack it into the workspace. For a small English
113+
smoke test, the compact English model is usually enough. For production review,
114+
choose a language and model size that matches the recordings.
115+
116+
```bash
117+
mkdir -p models
118+
curl -L -o models/vosk-small-en.zip \
119+
https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
120+
unzip models/vosk-small-en.zip -d models
121+
```
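If the workspace image lacks `unzip`, the same unpack step can be done with the Python standard library. This is a sketch that assumes the archive comes from a trusted source and unpacks to a single top-level directory, as the model used above appears to; `unpack_model` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def unpack_model(archive: Path, destination: Path) -> Path:
    """Extract a model archive and return its single top-level directory."""
    with zipfile.ZipFile(archive) as bundle:
        # Collect the first path component of every entry in the archive.
        top_level = {name.split("/", 1)[0] for name in bundle.namelist() if name.strip("/")}
        if len(top_level) != 1:
            raise ValueError(f"expected one top-level directory, found {sorted(top_level)}")
        bundle.extractall(destination)
    return destination / top_level.pop()
```

The returned path is the value to use for `VOSK_MODEL_PATH`, for example `unpack_model(Path("models/vosk-small-en.zip"), Path("models"))`.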
122+
123+
Create a local `.env` file:
124+
125+
```bash
126+
cat > .env <<'EOF'
127+
VOSK_MODEL_PATH=models/vosk-model-small-en-us-0.15
128+
VOSK_SAMPLE_RATE=16000
129+
VOSK_CHUNK_SIZE=4000
130+
EOF
131+
```
132+
133+
The values are intentionally simple:
134+
135+
| Variable | Purpose | Recommended Start |
136+
| --- | --- | --- |
137+
| `VOSK_MODEL_PATH` | Path to the unpacked Vosk model directory | Required |
138+
| `VOSK_SAMPLE_RATE` | WAV sample rate used before recognition | `16000` |
139+
| `VOSK_CHUNK_SIZE` | Frames read per recognition loop | `4000` |
140+
141+
Do not commit `.env` if it contains local paths that only work on your machine.
142+
The repeatable part belongs in the guide or project README. The local value
143+
belongs in the workspace.
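To make the table concrete, here is one way a provider could read these three variables with the recommended defaults. It is a sketch only: `VoskSettings` and `load_vosk_settings` are hypothetical names, and Sapat's own configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoskSettings:
    model_path: str
    sample_rate: int = 16000
    chunk_size: int = 4000


def load_vosk_settings(env: Mapping[str, str] = os.environ) -> VoskSettings:
    """Read the three variables, failing early when the model path is absent."""
    model_path = env.get("VOSK_MODEL_PATH", "").strip()
    if not model_path:
        raise ValueError("VOSK_MODEL_PATH is required")
    return VoskSettings(
        model_path=model_path,
        sample_rate=int(env.get("VOSK_SAMPLE_RATE", "16000")),
        chunk_size=int(env.get("VOSK_CHUNK_SIZE", "4000")),
    )


settings = load_vosk_settings({"VOSK_MODEL_PATH": "models/vosk-model-small-en-us-0.15"})
print(settings)
```

Accepting any mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests without touching the real environment.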
144+
145+
## Run the First Transcription
146+
147+
Put a short test recording in the workspace. Start with a one- or two-minute
148+
clip so you can validate the flow before running a full batch.
149+
150+
```bash
151+
mkdir -p recordings transcripts
152+
cp ~/Downloads/demo-call.mp4 recordings/demo-call.mp4
153+
```
154+
155+
Run Sapat with Vosk:
156+
157+
```bash
158+
sapat recordings/demo-call.mp4 --api vosk --quality M --language en
159+
```
160+
161+
Sapat will:
162+
163+
1. Convert `recordings/demo-call.mp4` to `recordings/demo-call.mp3`.
164+
2. Convert the intermediate MP3 to Vosk-friendly WAV audio.
165+
3. Run the local Vosk recognizer.
166+
4. Write `recordings/demo-call.txt`.
167+
5. Remove temporary audio files.
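Step 3 above can be sketched as a chunked read loop. This is an illustration under stated assumptions rather than Sapat's implementation: `transcribe_wav` is a hypothetical helper, and the recognizer is passed in so the loop can run without the `vosk` package. The method names `AcceptWaveform`, `Result`, and `FinalResult` follow the Vosk Python recognizer.

```python
import json
import wave


def transcribe_wav(wav_path: str, recognizer, chunk_size: int = 4000) -> str:
    """Feed a mono 16-bit PCM WAV file to a Vosk-style recognizer in chunks."""
    pieces = []
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM WAV audio")
        while True:
            data = wav.readframes(chunk_size)  # chunk_size frames per loop
            if not data:
                break
            # AcceptWaveform returns True once an utterance is complete.
            if recognizer.AcceptWaveform(data):
                pieces.append(json.loads(recognizer.Result()).get("text", ""))
        # Flush whatever is still buffered at the end of the file.
        pieces.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(piece for piece in pieces if piece)
```

With `vosk` installed, the recognizer would typically be created as `vosk.KaldiRecognizer(vosk.Model(model_path), sample_rate)`, where `sample_rate` matches the WAV file.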
168+
169+
Open the transcript and do a quick read:
170+
171+
```bash
172+
sed -n '1,80p' recordings/demo-call.txt
173+
```
174+
175+
For batch work, point Sapat at a directory:
176+
177+
```bash
178+
sapat recordings --api vosk --quality M --language en
179+
```
180+
181+
The current Sapat directory mode processes `.mp4` files. If your source
182+
recordings are `.mov`, `.m4a`, or `.wav`, convert them into a `.mp4`
183+
test fixture first, or process each file directly.
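Before a batch run, it can help to see which files directory mode will pick up and which need separate handling. The sketch below is a hypothetical pre-flight check, not part of Sapat. Treating extensions case-insensitively is an assumption, so confirm how Sapat matches file names before relying on it.

```python
from pathlib import Path

# Formats named in this guide that directory mode does not process.
OTHER_FORMATS = {".mov", ".m4a", ".wav"}


def split_recordings(directory: Path) -> tuple[list[Path], list[Path]]:
    """Separate .mp4 files from other recordings that need individual runs."""
    batch, individual = [], []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # assumption: case-insensitive match
        if suffix == ".mp4":
            batch.append(path)
        elif suffix in OTHER_FORMATS:
            individual.append(path)
    return batch, individual
```

Files in the second list can be passed to `sapat` one at a time or converted first.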
184+
185+
## Add a Review Packet
186+
187+
A transcript is more useful when it travels with context. Create a small review
188+
packet next to every important recording so future contributors know how the
189+
text was produced.
190+
191+
```bash
192+
cat > recordings/demo-call.review.md <<'EOF'
193+
# Demo Call Review
194+
195+
## Command
196+
197+
`sapat recordings/demo-call.mp4 --api vosk --quality M --language en`
198+
199+
## Environment
200+
201+
- Provider: Vosk local model
202+
- Model path: models/vosk-model-small-en-us-0.15
203+
- Sample rate: 16000
204+
- Workspace: Daytona
205+
206+
## Review Checklist
207+
208+
- [ ] Names and product terms checked
209+
- [ ] Action items extracted
210+
- [ ] Sensitive content marked before sharing
211+
- [ ] Low-confidence sections tagged with timestamps
212+
EOF
213+
```
214+
215+
This packet is intentionally plain Markdown. It can be committed to an internal
216+
repo, attached to an issue, or handed to a reviewer without requiring a separate
217+
database.
218+
219+
## Validate the Output
220+
221+
Do not treat an offline transcript as final text. Treat it as a draft artifact.
222+
Use a short validation pass before anyone relies on it:
223+
224+
| Check | What to Look For | Action |
225+
| --- | --- | --- |
226+
| Coverage | The transcript is not empty and roughly matches the recording length | Re-run with a larger model if too much is missing |
227+
| Names | Product names, people, repos, and acronyms are spelled correctly | Add a manual glossary note |
228+
| Decisions | Clear decisions and action items are captured | Extract into the review packet |
229+
| Privacy | Sensitive phrases are marked before sharing | Redact or keep internal |
230+
| Reproducibility | The command and model path are recorded | Update the review packet |
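
The coverage check in the first row can be approximated with a words-per-minute estimate. This is a rough sketch: the function names are hypothetical, and the default floor of 40 words per minute is an arbitrary assumption that should be tuned against your own recordings.

```python
def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Estimate speech density from whitespace-separated words."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(transcript.split()) / (duration_seconds / 60.0)


def coverage_looks_plausible(
    transcript: str, duration_seconds: float, min_wpm: float = 40.0
) -> bool:
    """Flag transcripts that are empty or far too short for the recording."""
    return words_per_minute(transcript, duration_seconds) >= min_wpm
```

A failing check does not prove the transcript is wrong, since a recording may contain long silences. It only tells the reviewer where to look first.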
231+
232+
For a more formal workflow, keep a tiny golden clip in the workspace and re-run
233+
it whenever you update the model or Sapat branch:
234+
235+
```bash
236+
sapat recordings/golden-demo.mp4 --api vosk --quality M --language en
237+
diff -u expected/golden-demo.txt recordings/golden-demo.txt || true
238+
```
239+
240+
The goal is not to make every word identical forever. The goal is to catch
241+
unexpected drops in quality when a model, conversion setting, or provider path
242+
changes.
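Because exact equality is not the goal, a similarity threshold can be more practical than a raw `diff`. The sketch below uses the standard library's `difflib.SequenceMatcher` over word sequences. The function names and the 0.9 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def word_similarity(expected: str, actual: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio over lowercase word sequences."""
    expected_words = expected.lower().split()
    actual_words = actual.lower().split()
    return SequenceMatcher(None, expected_words, actual_words, autojunk=False).ratio()


def golden_check(expected: str, actual: str, threshold: float = 0.9) -> bool:
    """Pass when the new transcript stays close to the stored golden text."""
    return word_similarity(expected, actual) >= threshold
```

Comparing words rather than characters keeps the score stable when only spacing or line wrapping changes between runs.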
243+
244+
## Troubleshooting
245+
246+
**Problem: `VOSK_MODEL_PATH` is missing or invalid.**
247+
248+
Check that the path points to the unpacked model directory, not the downloaded
249+
`.zip` file. If the variable is set only in `.env`, export it in the current shell first.
250+
251+
```bash
252+
ls "$VOSK_MODEL_PATH"
253+
```
254+
255+
You should see model files and subdirectories such as `am`, `conf`, or `graph`,
256+
depending on the model.
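The same check can be scripted so it runs before every batch. This sketch assumes, as noted above, that an unpacked model contains at least one of `am`, `conf`, or `graph`. `describe_model_path` is a hypothetical helper and the returned labels are illustrative.

```python
from pathlib import Path

EXPECTED_SUBDIRS = ("am", "conf", "graph")


def describe_model_path(model_path: str) -> str:
    """Classify a candidate VOSK_MODEL_PATH value."""
    path = Path(model_path)
    if not path.exists():
        return "missing"
    if path.is_file():
        return "file"  # for example, the downloaded .zip instead of the directory
    if any((path / name).is_dir() for name in EXPECTED_SUBDIRS):
        return "ok"
    return "unrecognized"
```

Anything other than `"ok"` is worth fixing before starting a long run, because the recognizer cannot load a model from that path.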
257+
258+
**Problem: Vosk is not installed.**
259+
260+
Install Sapat with the optional extra:
261+
262+
```bash
263+
python -m pip install -e ".[vosk]"
264+
```
265+
266+
If you are using a locked internal environment, install `vosk` directly in the
267+
workspace image and keep that dependency in your workspace documentation.
268+
269+
**Problem: Output is empty or very poor.**
270+
271+
Confirm the language model matches the recording. Then try a larger model,
272+
check the source audio quality, and use `--quality H` for the conversion step.
273+
Noisy meeting audio may need preprocessing before any speech-to-text provider
274+
can produce reliable text.
275+
276+
**Problem: Hosted correction is still needed.**
277+
278+
That is normal. Use Vosk for the private first pass, then send only reviewed
279+
snippets or redacted text to a hosted LLM for cleanup. Do not upload raw
280+
recordings if your privacy requirement was the reason for using Vosk.
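A simple term-based redaction pass is one way to prepare text before it leaves the workspace. The sketch below is a starting point under clear assumptions: `redact` is a hypothetical helper, the term list is maintained by the reviewer, and exact term matching will not catch misspellings that the recognizer introduces.

```python
import re


def redact(text: str, terms: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace whole-word, case-insensitive matches of each term."""
    cleaned = [term.strip() for term in terms if term.strip()]
    if not cleaned:
        return text
    # Longest terms first so "Acme Corp" wins over "Acme".
    ordered = sorted(cleaned, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    # A function replacement avoids backslash handling in the placeholder.
    return pattern.sub(lambda _match: placeholder, text)
```

A human reviewer should still read the redacted text, since names the model misheard will slip past a fixed term list.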
281+
282+
## Where This Fits in a Team Workflow
283+
284+
The best use of this workflow is not "perfect transcript in one command." It is
285+
"safe first draft in one reproducible workspace." That distinction matters.
286+
287+
A product team can drop demo recordings into Daytona, run Vosk locally, and
288+
extract customer quotes or bug reproduction steps. A support team can turn a
289+
call into a searchable note before deciding whether it needs higher-quality
290+
processing. An engineering manager can review sprint demos without sending raw
291+
internal recordings to a hosted API.
292+
293+
The model, command, output, and review packet all stay together. That makes the
294+
workflow easy to audit and easy to repeat.
295+
296+
## Conclusion
297+
298+
Sapat plus Vosk gives AI engineers a practical offline transcription path. The
299+
workflow is not a replacement for every hosted transcription service, but it is
300+
a strong default for private first-pass processing, cost-controlled batches, and
301+
repeatable engineering review.
302+
303+
Use Daytona to keep the environment clean, use Sapat to make the command
304+
consistent, and use the review packet to make the transcript trustworthy enough
305+
for the next step.
306+
307+
## References
308+
309+
- [Sapat repository](https://github.com/nibzard/sapat)
310+
- [Vosk models](https://alphacephei.com/vosk/models)
311+
- [Daytona documentation](https://www.daytona.io/docs)
Lines changed: 43 additions & 0 deletions