@@ -0,0 +1,313 @@
---
title: 'Run whisper.cpp Transcription With Sapat'
description: 'Build a private, offline transcription workflow by running Sapat and whisper.cpp inside a reproducible Daytona workspace.'
date: 2026-05-24
author: 'Jamil Ahmadzai'
tags: ['daytona', 'sapat', 'transcription', 'whispercpp']
---

# Run whisper.cpp Transcription With Sapat

## Introduction

Cloud transcription APIs are convenient, but they are not always the right
default. Product demos, customer calls, security reviews, and internal incident
recordings can contain details that should stay inside a controlled development
environment. That is where [offline speech-to-text](../definitions/20260524_definition_offline_speech_to_text.md)
is useful.

This guide shows how to run a local `whisper.cpp` transcription workflow with
Sapat inside a Daytona workspace. Sapat provides a repeatable command-line
interface. `whisper.cpp` handles local inference with a GGML Whisper model.
Daytona gives the workflow a clean workspace boundary so another engineer can
recreate the same setup instead of guessing what was installed on your laptop.

The companion Sapat implementation for this guide is available in
[nibzard/sapat#45](https://github.com/nibzard/sapat/pull/45). It adds
`--api whispercpp`, validates the local binary and model path, converts input
media to 16 kHz mono WAV for the local CLI, and writes the transcript next to
the source file.

## TL;DR

- Use Daytona to run the Sapat workspace in a disposable, reproducible
environment.
- Build `whisper.cpp` locally and download a GGML model such as `base.en`.
- Configure `WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH` in `.env`.
- Run `sapat samples/demo.mp4 --api whispercpp --language en --quality M`.
- Keep recordings, model files, prompts, and transcripts inside the workspace
unless your team explicitly approves sharing them.

![Offline Sapat transcription workflow with whisper.cpp in Daytona](assets/20260524_run_whispercpp_transcription_with_sapat_in_daytona_workflow.svg)

## Why Use whisper.cpp in a Daytona Workspace?

Sapat already supports hosted APIs such as OpenAI, Groq, and Azure OpenAI.
Those are good options when you want managed infrastructure and do not mind
sending audio to a third-party provider. A local `whisper.cpp` path solves a
different problem: private and repeatable transcription without a cloud API key.

That matters for AI engineers because transcription is often the first step in
a larger workflow. A support call becomes a bug report. A sales demo becomes
release-note evidence. An incident review becomes a timeline. If that first
step is hard to reproduce, every downstream artifact becomes harder to trust.

Daytona helps by turning the setup into a workspace recipe. The `ffmpeg`
version, Sapat branch, `whisper.cpp` binary, model path, and transcript command
can all live in the same workspace. When a teammate needs to review the output,
they can open the same environment and inspect the exact command path.

Here is the division of responsibility:

| Layer | Responsibility |
| --- | --- |
| Daytona | Creates the reproducible workspace where the workflow runs. |
| Sapat | Provides one CLI for file and directory transcription. |
| ffmpeg | Converts source media into the audio format needed by the provider. |
| whisper.cpp | Runs local speech recognition against a GGML model. |
| Transcript review | Checks the output before it is used in summaries or tickets. |

## Prepare the Workspace

Start from the Sapat repository. While the companion provider pull request is
under review, use the provider branch directly. After it is merged, you can use
the upstream `main` branch instead.

```bash
daytona create https://github.com/nibzard/sapat --code
```

Inside the workspace, fetch the provider branch:

```bash
git remote add jamil https://github.com/jamilahmadzai/sapat.git
git fetch jamil codex/whispercpp-provider
git checkout -b whispercpp-provider jamil/codex/whispercpp-provider
```

Install Sapat in editable mode:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .
```

Sapat still uses `ffmpeg` for media conversion, so confirm it is available:

```bash
ffmpeg -version
```

If the command is missing, install it in the workspace. On Debian or Ubuntu
base images, this is usually:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg cmake build-essential
```

The `cmake` and compiler packages are needed for the next step, where you build
`whisper.cpp`.
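
Before moving on, a quick preflight check can confirm that the tools this guide relies on are on `PATH`. The sketch below is only one way to script that check; `check_tools` is a hypothetical helper name, not part of Sapat or Daytona.

```bash
# Hypothetical helper: report which of the named commands are missing from PATH.
# Returns 0 when every tool is found and 1 when at least one is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For this guide, a call such as `check_tools git python3 ffmpeg cmake` would cover the commands used so far.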

## Build whisper.cpp and Download a Model

The official `whisper.cpp` quick start builds a `whisper-cli` binary and uses
GGML-formatted Whisper models. That is exactly what the Sapat provider expects.

Clone and build `whisper.cpp` beside the Sapat checkout:

```bash
cd ..
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
```

Download a small English model for your first run:

```bash
sh ./models/download-ggml-model.sh base.en
```

You can use larger models after the workflow is proven. Larger models can
improve transcript quality, but they also need more CPU, memory, and time. For
most smoke tests, `base.en` is enough to verify the local path.

Return to Sapat and configure the local provider:

```bash
cd ../sapat
cp .env.example .env
```

Edit `.env`:

```bash
WHISPERCPP_BINARY=/workspaces/whisper.cpp/build/bin/whisper-cli
WHISPERCPP_MODEL_PATH=/workspaces/whisper.cpp/models/ggml-base.en.bin
WHISPERCPP_THREADS=4
WHISPERCPP_EXTRA_ARGS=
```

Adjust the paths to match your Daytona workspace. The two required values are
`WHISPERCPP_BINARY` and `WHISPERCPP_MODEL_PATH`. `WHISPERCPP_THREADS` is
optional, but setting it makes the command easier to reproduce across machines.
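
A small check can catch a wrong path before the first transcription run. The sketch below assumes `.env` contains plain `KEY=value` lines that a shell can source, which may not hold for every `.env` file. `check_whispercpp_env` is a hypothetical helper and does not reproduce the validation the Sapat provider performs.

```bash
# Hypothetical helper: source a .env file and check the two required values.
check_whispercpp_env() {
  local env_file="${1:-.env}"
  # "." looks up bare file names on PATH, so anchor them to the current directory.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  [ -f "$env_file" ] || { echo "missing env file: $env_file"; return 1; }
  (
    # Subshell: the sourced values do not leak into the calling shell.
    . "$env_file"
    status=0
    # command -v accepts a full path to an executable or a command name on PATH.
    if ! command -v "${WHISPERCPP_BINARY:-}" >/dev/null 2>&1; then
      echo "WHISPERCPP_BINARY is not runnable: ${WHISPERCPP_BINARY:-<unset>}"
      status=1
    fi
    if [ ! -f "${WHISPERCPP_MODEL_PATH:-}" ]; then
      echo "WHISPERCPP_MODEL_PATH is not a file: ${WHISPERCPP_MODEL_PATH:-<unset>}"
      status=1
    fi
    [ "$status" -eq 0 ] && echo "whisper.cpp settings look usable"
    exit "$status"
  )
}
```

Running `check_whispercpp_env .env` from the Sapat checkout prints one line per problem and returns a non-zero status if either required value is unusable.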

## Run a First Transcription

Place a short `.mp4` file in the Sapat workspace. Keep the first file short.
You want to test the full chain before spending minutes on a long recording.

```bash
mkdir -p samples
cp /path/to/local/demo.mp4 samples/demo.mp4
```

Run Sapat with the local provider:

```bash
sapat samples/demo.mp4 --api whispercpp --language en --quality M
```

The provider performs three steps:

1. Converts `samples/demo.mp4` to a temporary 16 kHz mono WAV file.
2. Runs `whisper-cli` with the configured GGML model.
3. Writes `samples/demo.txt` and removes the temporary WAV file.
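
The three steps above can be approximated by hand, which helps when you want to see what each tool contributes. This is a sketch, not Sapat's source code: the `ffmpeg` options and the `whisper-cli` flags (`-m`, `-t`, `-l`, `-f`, `-otxt`, `-of`) come from the upstream tools, but whether the provider passes exactly these arguments is an assumption. `transcribe_by_hand` is a hypothetical name.

```bash
# Hypothetical helper: convert, transcribe, and clean up, mirroring the steps above.
transcribe_by_hand() {
  local src="$1" lang="${2:-en}"
  local out_base="${src%.*}"        # samples/demo.mp4 -> samples/demo
  local tmp_dir status
  [ -n "${WHISPERCPP_MODEL_PATH:-}" ] || { echo "set WHISPERCPP_MODEL_PATH first"; return 1; }
  tmp_dir="$(mktemp -d)" || return 1

  # Step 1: convert the source media to 16 kHz, mono, 16-bit PCM WAV.
  ffmpeg -nostdin -loglevel error -y -i "$src" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$tmp_dir/audio.wav" ||
    { rm -rf "$tmp_dir"; return 1; }

  # Step 2: run whisper-cli; -otxt together with -of writes "$out_base.txt".
  "${WHISPERCPP_BINARY:-whisper-cli}" \
    -m "$WHISPERCPP_MODEL_PATH" \
    -t "${WHISPERCPP_THREADS:-4}" \
    -l "$lang" \
    -f "$tmp_dir/audio.wav" \
    -otxt -of "$out_base"
  status=$?

  # Step 3: remove the temporary WAV whether or not transcription succeeded.
  rm -rf "$tmp_dir"
  return "$status"
}
```

With the variables from `.env` exported, `transcribe_by_hand samples/demo.mp4` should leave `samples/demo.txt` beside the video.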

Open the transcript:

```bash
sed -n '1,120p' samples/demo.txt
```

For a directory of videos, point Sapat at the directory:

```bash
sapat samples --api whispercpp --language en --quality M
```

Sapat processes `.mp4` files in that directory and writes one `.txt` file per
video. This is useful for meeting folders, product demos, or training clips that
need the same model and language settings.
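
After a directory run, it helps to confirm that every video actually produced a transcript. The sketch below assumes the naming described above, where `clip.mp4` yields `clip.txt` in the same folder. `list_missing_transcripts` is a hypothetical helper.

```bash
# Hypothetical helper: print every .mp4 whose .txt transcript is missing or empty.
list_missing_transcripts() {
  local dir="${1:-.}" video
  for video in "$dir"/*.mp4; do
    [ -e "$video" ] || continue                  # the glob matched no files
    [ -s "${video%.mp4}.txt" ] || echo "$video"  # -s: file exists and is not empty
  done
}
```

An empty result from `list_missing_transcripts samples` means each `.mp4` in `samples` has a non-empty transcript next to it.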

## Keep the Workflow Reviewable

The transcript is not the final artifact. Treat it as evidence that needs a
small review loop before it feeds a summary, ticket, or retrieval system.

Use this checklist:

- Confirm the transcript was created from the expected source file.
- Save the exact `sapat` command in a `README.md` or run log.
- Record the model file name, such as `ggml-base.en.bin`.
- Review product names, customer names, acronyms, and code terms manually.
- Mark low-confidence sections with timestamps from the source recording when
a human needs to listen again.
- Keep private audio and transcripts inside the workspace unless your policy
allows moving them elsewhere.
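
The first three checklist items can be captured in a run log with a few lines of shell. The sketch below is one possible layout, not a Sapat feature: `log_transcription_run` and the log fields are hypothetical, and `sha256sum` is assumed to be available, as it usually is on Debian or Ubuntu images.

```bash
# Hypothetical helper: append one entry with the source file, the model file name,
# a checksum of the model, and the exact command that was run.
log_transcription_run() {
  local log_file="$1" source_file="$2" model_path="$3"
  shift 3   # the remaining arguments are the command line to record
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "source: $source_file"
    echo "model: $(basename "$model_path")"
    echo "model_sha256: $(sha256sum "$model_path" | cut -d' ' -f1)"
    echo "command: $*"
    echo
  } >> "$log_file"
}
```

A call might look like `log_transcription_run run.log samples/demo.mp4 "$WHISPERCPP_MODEL_PATH" sapat samples/demo.mp4 --api whispercpp --language en --quality M`, where `run.log` is an example file name.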

If you need prompt-specific vocabulary, pass a prompt:

```bash
sapat samples/demo.mp4 \
  --api whispercpp \
  --language en \
  --prompt "Daytona, Sapat, whisper.cpp, dev container, workspace"
```

The provider forwards the prompt to the local CLI. This is helpful for product
names, project codenames, and technical terms that are easy to miss in speech.

Do not use `--correct` with this local provider. Sapat correction is currently
implemented through hosted chat APIs on the cloud providers. For a private
workflow, keep correction as a manual review step or run a separate local LLM
review after you have approved the transcript.

## Operational Notes for Teams

Treat the model file as part of the workflow contract. Two engineers can run
the same `sapat` command and still get different output if one uses
`ggml-base.en.bin` and the other uses a larger multilingual model. Write the
model name into your run log, and keep a small representative clip for smoke
testing changes to the workspace.
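
A smoke test for the representative clip can be as small as a `diff` against a transcript you have already reviewed. The sketch below assumes you keep such a reference file yourself; `smoke_compare` is a hypothetical helper, and `diff -b` is used so that changes in the amount of whitespace alone do not fail the check.

```bash
# Hypothetical helper: compare a fresh transcript with a reviewed reference.
smoke_compare() {
  local reference="$1" candidate="$2"
  local changes
  # diff -b ignores changes in the amount of whitespace.
  if changes="$(diff -b "$reference" "$candidate")"; then
    echo "smoke test passed: $candidate matches $reference"
  else
    echo "smoke test failed: $candidate differs from $reference"
    printf '%s\n' "$changes"
    return 1
  fi
}
```

After rebuilding `whisper.cpp` or swapping the model, rerun Sapat on the clip and call `smoke_compare` with the reference transcript and the new one. A failure does not always mean a regression, since a different model is expected to change wording.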

Also decide where generated transcripts should live. For short experiments,
placing `demo.txt` next to `demo.mp4` is convenient. For team workflows, a
dedicated `transcripts/` folder with a simple naming convention is easier to
review:

```bash
mkdir -p transcripts
cp samples/demo.txt transcripts/20260524_demo_whispercpp_base_en.txt
```
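
If you copy transcripts often, the file name can be derived instead of typed. The sketch below follows the pattern of the example above (date, clip name, provider, model) and assumes model files are named like `ggml-base.en.bin`. `transcript_name` is a hypothetical helper.

```bash
# Hypothetical helper: build a transcript file name from a date, a source file,
# and a model path.
transcript_name() {
  local run_date="$1" source_file="$2" model_path="$3"
  local stem model
  stem="$(basename "${source_file%.*}")"                  # samples/demo.mp4 -> demo
  model="$(basename "$model_path" .bin)"                  # ggml-base.en.bin -> ggml-base.en
  model="${model#ggml-}"                                  # ggml-base.en -> base.en
  model="$(printf '%s' "$model" | sed 's/[.-]/_/g')"      # base.en -> base_en
  printf '%s_%s_whispercpp_%s.txt\n' "$run_date" "$stem" "$model"
}
```

For example, `transcript_name 20260524 samples/demo.mp4 "$WHISPERCPP_MODEL_PATH"` prints `20260524_demo_whispercpp_base_en.txt` when the model path ends in `ggml-base.en.bin`.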

Finally, keep provider choice explicit. If a task can use a cloud provider,
document why. If a task should stay local, document that as well. A short note
in the pull request, incident packet, or research log prevents accidental
switching between private local transcription and hosted transcription later.

## Troubleshooting

**`whisper.cpp binary not found`**

Set `WHISPERCPP_BINARY` to the full path of the compiled binary:

```bash
export WHISPERCPP_BINARY="$PWD/../whisper.cpp/build/bin/whisper-cli"
```

If you installed `whisper-cli` globally, make sure it is on `PATH`:

```bash
which whisper-cli
```

**`WHISPERCPP_MODEL_PATH must point to a local ggml model file`**

Download a model and point the environment variable at the `.bin` file:

```bash
cd ../whisper.cpp
sh ./models/download-ggml-model.sh base.en
export WHISPERCPP_MODEL_PATH="$PWD/models/ggml-base.en.bin"
# Return to the Sapat checkout so relative paths such as samples/demo.mp4 resolve.
cd ../sapat
```

**The transcript is empty**

Start with a shorter, clearer sample. Confirm that `ffmpeg` can read the file:

```bash
ffmpeg -i samples/demo.mp4 -f null -
```

Then run `whisper-cli` directly against a WAV file to isolate whether the issue
is in the model, the binary, or the Sapat wrapper.

**Transcription is too slow**

Use a smaller model for drafts, increase `WHISPERCPP_THREADS`, or split long
recordings into shorter clips before running Sapat. Keep the same model for
comparisons so you do not mix performance results with model-quality changes.

## Conclusion

Sapat plus `whisper.cpp` gives AI engineers a practical local transcription
path. Daytona makes that path repeatable. The result is a workflow where source
recordings, model configuration, transcript commands, and generated text stay
inside one workspace.

This is not a replacement for every hosted transcription API. It is a strong
option when privacy, reproducibility, and local control matter more than a
managed provider. Start with a short sample, document the command, review the
output, and then scale the same workflow to a folder of recordings.

## References

- [Sapat repository](https://github.com/nibzard/sapat)
- [Sapat whisper.cpp provider pull request](https://github.com/nibzard/sapat/pull/45)
- [whisper.cpp repository and quick start](https://github.com/ggml-org/whisper.cpp)
- [Daytona repository](https://github.com/daytonaio/daytona)
8 changes: 8 additions & 0 deletions authors/jamil_ahmadzai.md
@@ -0,0 +1,8 @@
Author: Jamil Ahmadzai
Title: Software Engineer
Description: Jamil Ahmadzai builds practical developer tooling and integration guides for AI workflows, with a focus on reproducible environments, automation, and shipping examples that engineers can run and adapt.
Company Name: Independent
Company Description: Independent software engineering and technical writing.
Author Image: <https://github.com/jamilahmadzai.png?size=512>
Company Logo Dark:
Company Logo White:
28 changes: 28 additions & 0 deletions definitions/20260524_definition_offline_speech_to_text.md
@@ -0,0 +1,28 @@
---
title: 'Offline Speech-to-Text'
description: 'Speech recognition that transcribes audio locally without sending recordings to a cloud API.'
date: 2026-05-24
author: 'Jamil Ahmadzai'
---

# Offline Speech-to-Text

## Definition

Offline speech-to-text is the process of converting audio into written text on
the same machine or workspace where the audio file is stored. Instead of sending
the file to a hosted transcription API, the workflow runs a local speech
recognition model and writes the transcript back to local storage.

## Context and Usage

Offline speech-to-text is useful when recordings contain sensitive customer
calls, internal meetings, product demos, or incident reviews that should not
leave the development environment. It is also useful for repeatable benchmarks,
because every engineer can run the same binary, model file, prompt, and input
clip without depending on cloud quota or a provider outage.

In a Daytona workspace, offline speech-to-text can be combined with a pinned
toolchain and environment variables. The workspace keeps the model path,
transcription command, source audio, and generated transcript close together,
which makes the workflow easier to review and reproduce.