daytonaio · jamilahmadzai · May 23, 2026
diff --git a/articles/20260524_run_whispercpp_transcription_with_sapat_in_daytona.md b/articles/20260524_run_whispercpp_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,313 @@
+---
+title: 'Run whisper.cpp Transcription With Sapat'
+description:
+  'Build a private, offline transcription workflow by running Sapat and whisper.cpp inside a reproducible Daytona workspace.'
+date: 2026-05-24
+author: 'Jamil Ahmadzai'
+tags: ['daytona', 'sapat', 'transcription', 'whispercpp']
+---
+
+# Run whisper.cpp Transcription With Sapat
+
+## Introduction
+
+Cloud transcription APIs are convenient, but they are not always the right
+default. Product demos, customer calls, security reviews, and internal incident
+recordings can contain details that should stay inside a controlled development
+environment. That is where [offline speech-to-text](../definitions/20260524_definition_offline_speech_to_text.md)
+is useful.
+
+This guide shows how to run a local `whisper.cpp` transcription workflow with
+Sapat inside a Daytona workspace. Sapat provides a repeatable command-line
+interface. `whisper.cpp` handles local inference with a GGML Whisper model.
+Daytona gives the workflow a clean workspace boundary so another engineer can
+recreate the same setup instead of guessing what was installed on your laptop.
+
+The companion Sapat implementation for this guide is available in
+[nibzard/sapat#45](https://github.com/nibzard/sapat/pull/45). It adds
+`--api whispercpp`, validates the local binary and model path, converts input
+media to 16 kHz mono WAV for the local CLI, and writes the transcript next to
+the source file.
+
+## TL;DR
+
+- Use Daytona to run the Sapat workspace in a disposable, reproducible
+  environment.
+- Build `whisper.cpp` locally and download a GGML model such as `base.en`.
+- Configure `WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH` in `.env`.
+- Run `sapat demo.mp4 --api whispercpp --language en --quality M`.
+- Keep recordings, model files, prompts, and transcripts inside the workspace
+  unless your team explicitly approves sharing them.
+
+![Offline Sapat transcription workflow with whisper.cpp in Daytona](assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg)
+
+## Why Use whisper.cpp in a Daytona Workspace?
+
+Sapat already supports hosted APIs such as OpenAI, Groq, and Azure OpenAI.
+Those are good options when you want managed infrastructure and do not mind
+sending audio to a third-party provider. A local `whisper.cpp` path solves a
+different problem: private and repeatable transcription without a cloud API key.
+
+That matters for AI engineers because transcription is often the first step in
+a larger workflow. A support call becomes a bug report. A sales demo becomes
+release-note evidence. An incident review becomes a timeline. If that first
+step is hard to reproduce, every downstream artifact becomes harder to trust.
+
+Daytona helps by turning the setup into a workspace recipe. The `ffmpeg`
+version, Sapat branch, `whisper.cpp` binary, model path, and transcript command
+can all live in the same workspace. When a teammate needs to review the output,
+they can open the same environment and inspect the exact command path.
+
+Here is the division of responsibility:
+
+| Layer | Responsibility |
+| --- | --- |
+| Daytona | Creates the reproducible workspace where the workflow runs. |
+| Sapat | Provides one CLI for file and directory transcription. |
+| ffmpeg | Converts source media into the audio format needed by the provider. |
+| whisper.cpp | Runs local speech recognition against a GGML model. |
+| Transcript review | Checks the output before it is used in summaries or tickets. |
+
+## Prepare the Workspace
+
+Start from the Sapat repository. While the companion provider pull request is
+under review, use the provider branch directly. After it is merged, you can use
+the upstream `main` branch instead.
+
+```bash
+daytona create https://github.com/nibzard/sapat --code
+```
+
+Inside the workspace, fetch the provider branch:
+
+```bash
+git remote add jamil https://github.com/jamilahmadzai/sapat.git
+git fetch jamil codex/whispercpp-provider
+git checkout -b whispercpp-provider jamil/codex/whispercpp-provider
+```
+
+Install Sapat in editable mode:
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+python -m pip install -e .
+```
+
+Sapat still uses `ffmpeg` for media conversion, so confirm it is available:
+
+```bash
+ffmpeg -version
+```
+
+If the command is missing, install it in the workspace. On Debian or Ubuntu
+base images, this is usually:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ffmpeg cmake build-essential
+```
+
+The `cmake` and compiler packages are needed for the next step, where you build
+`whisper.cpp`, so install them even if `ffmpeg` is already present.
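+
+Before building, a quick preflight can confirm the toolchain is in place. The
+sketch below is one way to do that. `missing_tools` is a hypothetical helper
+name, not part of Sapat or Daytona, and `cc` and `git` are assumed to be the
+compiler and version-control commands on your base image.
+
+```bash
+# Print each requested tool that is not on PATH; print nothing when all exist.
+missing_tools() {
+  local tool
+  for tool in "$@"; do
+    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
+  done
+}
+
+# Usage: missing_tools ffmpeg cmake cc git
+```
+
+An empty result means every listed tool is available. Any name it prints still
+needs to be installed before you continue.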
+
+## Build whisper.cpp and Download a Model
+
+The official `whisper.cpp` quick start builds a `whisper-cli` binary and uses
+GGML-formatted Whisper models. That is exactly what the Sapat provider expects.
+
+Clone and build `whisper.cpp` beside the Sapat checkout:
+
+```bash
+cd ..
+git clone https://github.com/ggml-org/whisper.cpp.git
+cd whisper.cpp
+cmake -B build
+cmake --build build -j --config Release
+```
+
+Download a small English model for your first run:
+
+```bash
+sh ./models/download-ggml-model.sh base.en
+```
+
+You can use larger models after the workflow is proven. Larger models can
+improve transcript quality, but they also need more CPU, memory, and time. For
+most smoke tests, `base.en` is enough to verify the local path.
+
+Return to Sapat and configure the local provider:
+
+```bash
+cd ../sapat
+cp .env.example .env
+```
+
+Edit `.env`:
+
+```bash
+WHISPERCPP_BINARY=/workspaces/whisper.cpp/build/bin/whisper-cli
+WHISPERCPP_MODEL_PATH=/workspaces/whisper.cpp/models/ggml-base.en.bin
+WHISPERCPP_THREADS=4
+WHISPERCPP_EXTRA_ARGS=
+```
+
+Adjust the paths to match your Daytona workspace. The two required values are
+`WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH`. `WHISPERCPP_THREADS` is
+optional, but setting it makes the command easier to reproduce across machines.
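+
+The two required values can be checked by hand before the first run. This is a
+minimal sketch, not Sapat's own validation logic: `check_whispercpp_env` is a
+hypothetical helper, and it assumes `.env` contains only plain `KEY=value`
+lines so the shell can source it.
+
+```bash
+# Return 0 when both required settings are usable, 1 otherwise.
+check_whispercpp_env() {
+  local status=0
+  # Accepts a full path or a command name that resolves on PATH.
+  if ! command -v "${WHISPERCPP_BINARY:-}" >/dev/null 2>&1; then
+    echo "WHISPERCPP_BINARY is not executable: ${WHISPERCPP_BINARY:-<unset>}" >&2
+    status=1
+  fi
+  if [ ! -f "${WHISPERCPP_MODEL_PATH:-}" ]; then
+    echo "WHISPERCPP_MODEL_PATH is not a file: ${WHISPERCPP_MODEL_PATH:-<unset>}" >&2
+    status=1
+  fi
+  return "$status"
+}
+
+# Usage: load .env into the current shell, then run the check.
+# set -a; . ./.env; set +a
+# check_whispercpp_env && echo "whisper.cpp settings look usable"
+```
+
+Running the check in a fresh workspace catches a wrong path before Sapat starts
+converting media.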
+
+## Run a First Transcription
+
+Place a short `.mp4` file in the Sapat workspace. A short first file lets you
+test the full chain before spending minutes on a long recording.
+
+```bash
+mkdir -p samples
+cp /path/to/local/demo.mp4 samples/demo.mp4
+```
+
+Run Sapat with the local provider:
+
+```bash
+sapat samples/demo.mp4 --api whispercpp --language en --quality M
+```
+
+The provider performs three steps:
+
+1. Converts `samples/demo.mp4` to a temporary 16 kHz mono WAV file.
+2. Runs `whisper-cli` with the configured GGML model.
+3. Writes `samples/demo.txt` and removes the temporary WAV file.
+
+Open the transcript:
+
+```bash
+sed -n '1,120p' samples/demo.txt
+```
+
+For a directory of videos, point Sapat at the directory:
+
+```bash
+sapat samples --api whispercpp --language en --quality M
+```
+
+Sapat processes `.mp4` files in that directory and writes one `.txt` file per
+video. This is useful for meeting folders, product demos, or training clips that
+need the same model and language settings.
+
+## Keep the Workflow Reviewable
+
+The transcript is not the final artifact. Treat it as evidence that needs a
+small review loop before it feeds a summary, ticket, or retrieval system.
+
+Use this checklist:
+
+- Confirm the transcript was created from the expected source file.
+- Save the exact `sapat` command in a `README.md` or run log.
+- Record the model file name, such as `ggml-base.en.bin`.
+- Review product names, customer names, acronyms, and code terms manually.
+- Mark low-confidence sections with timestamps from the source recording so a
+  human can listen again.
+- Keep private audio and transcripts inside the workspace unless your policy
+  allows moving them elsewhere.
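+
+The command and model items in the checklist can be captured automatically.
+The wrapper below is a sketch under stated assumptions: `run_logged` and the
+`RUN_LOG` variable are hypothetical names, and the log location
+`transcripts/run-log.tsv` is only a suggested default.
+
+```bash
+# Append one tab-separated line (UTC time, model file name, command), then run
+# the command. Arguments are joined with spaces, so original quoting is lost.
+run_logged() {
+  local log_file="${RUN_LOG:-transcripts/run-log.tsv}"
+  local model_name
+  model_name=$(basename "${WHISPERCPP_MODEL_PATH:-unknown}")
+  mkdir -p "$(dirname "$log_file")"
+  printf '%s\t%s\t%s\n' \
+    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$model_name" "$*" >> "$log_file"
+  "$@"
+}
+
+# Usage:
+# run_logged sapat samples/demo.mp4 --api whispercpp --language en --quality M
+```
+
+Because the log stores the model file name next to each command, a reviewer can
+tell which model produced a given transcript.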
+
+If the recording uses domain-specific vocabulary, pass a prompt:
+
+```bash
+sapat samples/demo.mp4 \
+  --api whispercpp \
+  --language en \
+  --prompt "Daytona, Sapat, whisper.cpp, dev container, workspace"
+```
+
+The provider forwards the prompt to the local CLI. This is helpful for product
+names, project codenames, and technical terms that are easy to miss in speech.
+
+Do not use `--correct` with this local provider. Sapat's correction step
+currently runs through the hosted chat APIs of the cloud providers, so it would
+send transcript text outside the workspace. For a private workflow, keep
+correction as a manual review step or run a separate local LLM review after you
+have approved the transcript.
+
+## Operational Notes for Teams
+
+Treat the model file as part of the workflow contract. Two engineers can run
+the same `sapat` command and still get different output if one uses
+`ggml-base.en.bin` and the other uses a larger multilingual model. Write the
+model name into your run log, and keep a small representative clip for smoke
+testing changes to the workspace.
+
+Also decide where generated transcripts should live. For short experiments,
+placing `demo.txt` next to `demo.mp4` is convenient. For team workflows, a
+dedicated `transcripts/` folder with a simple naming convention is easier to
+review:
+
+```bash
+mkdir -p transcripts
+cp samples/demo.txt transcripts/20260524_demo_whispercpp_base_en.txt
+```
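+
+The naming convention can be generated instead of typed by hand. This is a
+sketch that assumes the `<date>_<clip>_whispercpp_<model>.txt` pattern from the
+example above. `transcript_name` is a hypothetical helper, and the optional
+third argument overrides today's date.
+
+```bash
+# Build <date>_<clip>_whispercpp_<model>.txt from a source file and model file.
+transcript_name() {
+  local source_file="$1" model_file="$2" run_date="${3:-$(date +%Y%m%d)}"
+  local clip model
+  clip=$(basename "$source_file")
+  clip="${clip%.*}"                              # demo.mp4 -> demo
+  model=$(basename "$model_file" .bin)           # ggml-base.en.bin -> ggml-base.en
+  model="${model#ggml-}"                         # ggml-base.en -> base.en
+  model=$(printf '%s' "$model" | tr '.-' '__')   # base.en -> base_en
+  printf '%s_%s_whispercpp_%s.txt\n' "$run_date" "$clip" "$model"
+}
+
+# Usage:
+# cp samples/demo.txt \
+#   "transcripts/$(transcript_name samples/demo.mp4 "$WHISPERCPP_MODEL_PATH")"
+```
+
+Deriving the model part from `WHISPERCPP_MODEL_PATH` keeps the file name in
+step with the model that was actually configured.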
+
+Finally, keep provider choice explicit. If a task can use a cloud provider,
+document why. If a task should stay local, document that as well. A short note
+in the pull request, incident packet, or research log prevents accidental
+switching between private local transcription and hosted transcription later.
+
+## Troubleshooting
+
+**`whisper.cpp binary not found`**
+
+Set `WHISPERCPP_BINARY` to the full path of the compiled binary:
+
+```bash
+export WHISPERCPP_BINARY="$PWD/../whisper.cpp/build/bin/whisper-cli"
+```
+
+If you installed `whisper-cli` globally, make sure it is on `PATH`:
+
+```bash
+which whisper-cli
+```
+
+**`WHISPERCPP_MODEL_PATH must point to a local ggml model file`**
+
+Download a model and point the environment variable at the `.bin` file:
+
+```bash
+cd ../whisper.cpp
+sh ./models/download-ggml-model.sh base.en
+export WHISPERCPP_MODEL_PATH="$PWD/models/ggml-base.en.bin"
+```
+
+**The transcript is empty**
+
+Start with a shorter, clearer sample. Confirm that `ffmpeg` can read the file:
+
+```bash
+ffmpeg -i samples/demo.mp4 -f null -
+```
+
+Then run `whisper-cli` directly against a WAV file to isolate whether the issue
+is in the model, the binary, or the Sapat wrapper.
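+
+One way to run that isolation test is sketched below. It assumes
+`WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH` are exported in the current
+shell. The `run` and `transcribe_direct` helpers, the `DRY_RUN` switch, and the
+`.debug.wav` suffix are hypothetical names chosen for this example. The
+`ffmpeg` flags produce the 16 kHz mono WAV format described earlier.
+
+```bash
+# Run a command, or only print it when DRY_RUN is set.
+run() {
+  if [ -n "${DRY_RUN:-}" ]; then
+    printf '%s\n' "$*"
+  else
+    "$@"
+  fi
+}
+
+# Convert one media file to 16 kHz mono WAV, then transcribe it directly.
+transcribe_direct() {
+  local input="$1"
+  local wav="${input%.*}.debug.wav"
+  run ffmpeg -y -i "$input" -ar 16000 -ac 1 -c:a pcm_s16le "$wav" &&
+    run "$WHISPERCPP_BINARY" -m "$WHISPERCPP_MODEL_PATH" -f "$wav" -l en
+}
+
+# Usage:
+# DRY_RUN=1 transcribe_direct samples/demo.mp4   # print the two commands
+# transcribe_direct samples/demo.mp4             # run them
+```
+
+If the direct run prints text but Sapat still writes an empty file, the problem
+is more likely in the wrapper or its configuration than in the model or binary.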
+
+**Transcription is too slow**
+
+Use a smaller model for drafts, increase `WHISPERCPP_THREADS`, or split long
+recordings into shorter clips before running Sapat. Keep the same model for
+comparisons so you do not mix performance results with model-quality changes.
+
+## Conclusion
+
+Sapat plus `whisper.cpp` gives AI engineers a practical local transcription
+path. Daytona makes that path repeatable. The result is a workflow where source
+recordings, model configuration, transcript commands, and generated text stay
+inside one workspace.
+
+This is not a replacement for every hosted transcription API. It is a strong
+option when privacy, reproducibility, and local control matter more than a
+managed provider. Start with a short sample, document the command, review the
+output, and then scale the same workflow to a folder of recordings.
+
+## References
+
+- [Sapat repository](https://github.com/nibzard/sapat)
+- [Sapat whisper.cpp provider pull request](https://github.com/nibzard/sapat/pull/45)
+- [whisper.cpp repository and quick start](https://github.com/ggml-org/whisper.cpp)
+- [Daytona repository](https://github.com/daytonaio/daytona)
diff --git a/...assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg b/...assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg
diff --git a/authors/jamil_ahmadzai.md b/authors/jamil_ahmadzai.md
@@ -0,0 +1,8 @@
+Author: Jamil Ahmadzai
+Title: Software Engineer
+Description: Jamil Ahmadzai builds practical developer tooling and integration guides for AI workflows, with a focus on reproducible environments, automation, and shipping examples that engineers can run and adapt.
+Company Name: Independent
+Company Description: Independent software engineering and technical writing.
+Author Image: <https://github.com/jamilahmadzai.png?size=512>
+Company Logo Dark:
+Company Logo White:
diff --git a/definitions/20260524_definition_offline_speech_to_text.md b/definitions/20260524_definition_offline_speech_to_text.md
@@ -0,0 +1,28 @@
+---
+title: 'Offline Speech-to-Text'
+description: 'Speech recognition that transcribes audio locally without sending recordings to a cloud API.'
+date: 2026-05-24
+author: 'Jamil Ahmadzai'
+---
+
+# Offline Speech-to-Text
+
+## Definition
+
+Offline speech-to-text is the process of converting audio into written text on
+the same machine or workspace where the audio file is stored. Instead of sending
+the file to a hosted transcription API, the workflow runs a local speech
+recognition model and writes the transcript back to local storage.
+
+## Context and Usage
+
+Offline speech-to-text is useful when recordings contain sensitive customer
+calls, internal meetings, product demos, or incident reviews that should not
+leave the development environment. It is also useful for repeatable benchmarks,
+because every engineer can run the same binary, model file, prompt, and input
+clip without depending on cloud quota or a provider outage.
+
+In a Daytona workspace, offline speech-to-text can be combined with a pinned
+toolchain and environment variables. The workspace keeps the model path,
+transcription command, source audio, and generated transcript close together,
+which makes the workflow easier to review and reproduce.