daytonaio · JJ-Lin · May 23, 2026
diff --git a/articles/20260523_run_localai_transcription_with_sapat_in_daytona.md b/articles/20260523_run_localai_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,326 @@
+---
+title: 'Run LocalAI Transcription With Sapat'
+description: 'Build a reproducible Daytona workspace for private Sapat video transcription backed by a self-hosted LocalAI endpoint.'
+date: 2026-05-23
+author: 'JJ Lin'
+tags: ['Daytona', 'Sapat', 'Speech-to-Text', 'LocalAI']
+---
+
+# Run LocalAI Transcription With Sapat
+
+## Introduction
+
+AI teams often collect short product demos, debugging recordings, user research
+clips, and internal walkthroughs long before those recordings become useful
+written artifacts. A transcript makes the material searchable, easier to
+summarize, and safer to hand to downstream agents. The challenge is making the
+transcription step something another engineer can reproduce without guessing
+which machine had `ffmpeg`, which shell had credentials, or which
+speech-to-text provider was used.
+
+[Sapat](https://github.com/nkkko/sapat) is a compact Python CLI for this job. It
+accepts an `.mp4` file or a folder of `.mp4` files, converts each video to MP3
+with `ffmpeg`, sends the audio to a selected transcription provider, and writes
+a `.txt` transcript next to the source file. This guide adds a LocalAI path to
+that workflow and runs it inside a [Daytona](https://www.daytona.io/docs/)
+workspace so setup, configuration, and validation are explicit. The result is a
+[self-hosted speech-to-text](/definitions/20260523_definition_self_hosted_speech_to_text.md)
+workflow that stays reproducible without depending on one developer's laptop.
+
+The companion implementation is available in
+[nibzard/sapat#43](https://github.com/nibzard/sapat/pull/43). It adds
+`--api localai`, documents the `LOCALAI_*` environment variables, and includes
+mocked tests for CLI routing and request construction.
+
+![LocalAI transcription workflow in Daytona](assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg)
+
+## TL;DR
+
+- Use Daytona to open a clean Sapat workspace instead of relying on a hand-tuned
+  local machine.
+- Run LocalAI where the workspace can reach it and expose its OpenAI-compatible
+  transcription endpoint.
+- Configure `LOCALAI_BASE_URL`, `LOCALAI_MODEL`, and, if the server requires
+  it, `LOCALAI_API_KEY` outside source control.
+- Run `sapat <file>.mp4 --api localai` to convert the video, send the MP3 to
+  LocalAI, and save a transcript.
+- Review the transcript before sharing it with another model, teammate, or
+  public issue.
+
+## What The LocalAI Provider Adds
+
+LocalAI is a self-hosted AI runtime that offers OpenAI-compatible APIs for
+several model types, including audio-to-text. Its
+[audio-to-text documentation](https://localai.io/features/audio-to-text/)
+describes a `POST /v1/audio/transcriptions` endpoint that accepts multipart
+form data with an audio `file` and a `model` value. That shape is a natural fit
+for Sapat because Sapat already prepares an MP3 file before calling a provider.
+
+The LocalAI Sapat provider keeps the flow intentionally simple:
+
+```text
+video.mp4 -> ffmpeg MP3 conversion -> LocalAI transcription request -> video.txt
+```
+
+It reads the following environment variables:
+
+Variable | Purpose | Default
+--- | --- | ---
+`LOCALAI_BASE_URL` | Base URL for the LocalAI server, such as `http://localhost:8080` | Required unless `LOCALAI_API_ENDPOINT` is set
+`LOCALAI_API_ENDPOINT` | Full transcription endpoint override | Built from `LOCALAI_BASE_URL`
+`LOCALAI_MODEL` | Audio-to-text model name sent in the request | `whisper-1`
+`LOCALAI_API_KEY` | Optional bearer token if the LocalAI server requires auth | Not set
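+
+As a rough sketch of how those variables could combine, the Python below
+resolves the endpoint, model, and token from an environment mapping. The
+function name `resolve_localai_config` and the trailing-slash handling are
+illustrative assumptions, not code taken from the Sapat provider; the error text
+mirrors the message listed under Troubleshooting.
+
+```python
+import os
+
+TRANSCRIPTION_PATH = "/v1/audio/transcriptions"
+
+
+def resolve_localai_config(env=None):
+    """Return (endpoint, model, api_key) from LOCALAI_* variables."""
+    env = os.environ if env is None else env
+    endpoint = env.get("LOCALAI_API_ENDPOINT", "").strip()
+    base_url = env.get("LOCALAI_BASE_URL", "").strip()
+    if not endpoint:
+        if not base_url:
+            raise ValueError("LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set")
+        # Assumption: tolerate a trailing slash on the base URL.
+        endpoint = base_url.rstrip("/") + TRANSCRIPTION_PATH
+    # Empty values are treated the same as unset values in this sketch.
+    model = env.get("LOCALAI_MODEL", "").strip() or "whisper-1"
+    api_key = env.get("LOCALAI_API_KEY", "").strip() or None
+    return endpoint, model, api_key
+```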
+
+The provider does not require a cloud transcription account. That makes it
+useful when a team wants to keep internal recordings inside a self-hosted
+boundary, test transcription quality against local models, or run the same
+workflow in a controlled development environment.
+
+## Prerequisites
+
+Before you start, make sure you have:
+
+- A Daytona account and CLI that can create a workspace.
+- A LocalAI server with an audio-to-text model installed.
+- A short `.mp4` recording with spoken audio.
+- `ffmpeg` available in the Sapat workspace.
+- Enough disk space for Sapat to create a temporary MP3 next to the video.
+
+The guide uses the companion Sapat branch until the provider is merged
+upstream. After merge, create the workspace from `nkkko/sapat` directly and skip
+the branch checkout step.
+
+## Start Or Reach A LocalAI Server
+
+Run LocalAI wherever the Daytona workspace can reach it. For a quick local test,
+that may be a LocalAI process on the same machine or inside the same development
+network. For a team setup, it may be an internal server with access control in
+front of it.
+
+After the server is running and a Whisper-compatible model is installed, verify
+the transcription endpoint with a small audio file:
+
+```bash
+curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@sample.wav" \
+  -F model="whisper-1"
+```
+
+If the server requires authentication, include the bearer token:
+
+```bash
+curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
+  -H "Authorization: Bearer $LOCALAI_API_KEY" \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@sample.wav" \
+  -F model="whisper-1"
+```
+
+Run this endpoint check before running Sapat. It separates LocalAI setup
+problems from Sapat workflow problems, so a later failure points at one layer
+instead of two.
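+
+The same request can be described in Python without sending anything. This is a
+minimal sketch: `build_request_parts` is a hypothetical helper, and it assumes
+the bearer header should be omitted whenever `LOCALAI_API_KEY` is empty,
+matching the two `curl` variants above.
+
+```python
+def build_request_parts(model, api_key=None):
+    """Return (headers, form_fields) for a multipart transcription request."""
+    headers = {}
+    if api_key:
+        # Only send Authorization when the server actually needs a token.
+        headers["Authorization"] = f"Bearer {api_key}"
+    # The audio itself is attached separately as the multipart `file` part.
+    form_fields = {"model": model}
+    return headers, form_fields
+```
+
+Any HTTP client that builds multipart bodies can take these values alongside
+the audio `file` part.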
+
+## Create The Daytona Workspace
+
+Create a Daytona workspace from the fork that contains the LocalAI provider:
+
+```bash
+daytona create https://github.com/JJ-Lin/sapat --name sapat-localai
+```
+
+Open a terminal in the workspace, then check out the provider branch:
+
+```bash
+git fetch origin feature/localai-transcription-provider
+git checkout feature/localai-transcription-provider
+```
+
+Install Sapat in editable mode:
+
+```bash
+python -m pip install -e .
+```
+
+Sapat uses `ffmpeg` to convert videos to MP3 before transcription. Confirm that
+it is available:
+
+```bash
+ffmpeg -version
+```
+
+If `ffmpeg` is missing, install it through the package manager available in your
+Daytona environment or bake it into the workspace image. The important part is
+to make this dependency visible in the workspace rather than leaving it as an
+undocumented local-machine assumption.
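+
+A preflight check along these lines can make the `ffmpeg` dependency explicit
+in a setup script. `missing_tools` is a hypothetical helper name; it only asks
+whether each tool is on `PATH` and does not verify versions or codecs.
+
+```python
+import shutil
+
+
+def missing_tools(tools=("ffmpeg",)):
+    """Return the names of required command-line tools not found on PATH."""
+    return [tool for tool in tools if shutil.which(tool) is None]
+
+
+if __name__ == "__main__":
+    missing = missing_tools()
+    if missing:
+        raise SystemExit("Missing required tools: " + ", ".join(missing))
+    print("All required tools are available.")
+```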
+
+## Configure LocalAI Without Committing Secrets
+
+Create a `.env` file in the workspace root:
+
+```bash
+LOCALAI_BASE_URL=http://localhost:8080
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=
+```
+
+If the LocalAI server is not reachable through a simple base URL, set the full
+endpoint instead:
+
+```bash
+LOCALAI_API_ENDPOINT=https://localai.example.com/v1/audio/transcriptions
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=replace_if_required
+```
+
+Do not commit `.env`. A safe `.env.example` can show the required variable names
+without exposing a server URL or token:
+
+```bash
+LOCALAI_BASE_URL=
+LOCALAI_API_ENDPOINT=
+LOCALAI_MODEL=whisper-1
+LOCALAI_API_KEY=
+```
+
+This separation is where Daytona helps: the workspace setup stays reproducible,
+while secrets and internal endpoints remain outside the repository.
+
+## Run A First Transcription
+
+Copy a short test video into the workspace. Start with a clip under a few
+minutes so you can test the full loop quickly.
+
+```bash
+sapat demo.mp4 --api localai --quality M --language en
+```
+
+Sapat will:
+
+1. Convert `demo.mp4` to `demo.mp3`.
+2. Send the generated MP3 to LocalAI.
+3. Save the returned transcript as `demo.txt`.
+4. Remove the temporary MP3 file after processing.
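+
+The file naming in the steps above can be sketched with `pathlib`.
+`derived_paths` is an illustrative name rather than a Sapat function, and the
+sketch only covers how the output names follow from the input name.
+
+```python
+from pathlib import Path
+
+
+def derived_paths(video):
+    """Return (temporary_mp3, transcript_txt) paths next to the source video."""
+    video = Path(video)
+    # with_suffix swaps only the final extension and keeps the directory.
+    return video.with_suffix(".mp3"), video.with_suffix(".txt")
+```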
+
+For clearer audio at the cost of a larger temporary file, use the high-quality
+conversion option:
+
+```bash
+sapat demo.mp4 --api localai --quality H --language en
+```
+
+For product names, acronyms, or internal terms, pass a prompt:
+
+```bash
+sapat demo.mp4 \
+  --api localai \
+  --language en \
+  --prompt "Product names: Daytona, Sapat, LocalAI"
+```
+
+The prompt gives the transcription model vocabulary hints. It is especially
+useful for developer tools, repository names, customer names, and abbreviations
+that are easy to misspell.
+
+## Process A Folder Of Recordings
+
+Sapat can process every `.mp4` file in a directory. This is useful for a batch
+of product demos, design review clips, or field recordings from the same
+project.
+
+```bash
+mkdir recordings
+```
+
+Copy the videos into that folder, then run:
+
+```bash
+sapat recordings --api localai --quality M --language en --prompt "Product names: Daytona, Sapat, LocalAI"
+```
+
+Sapat writes one `.txt` transcript for each `.mp4` file, named after the video.
+Give the videos descriptive names before you run the batch: a transcript named
+`workspace_setup_walkthrough.txt` is much easier to reuse than
+`screen-recording-7.txt`.
+
+For larger folders, start with two or three representative recordings. Review
+the outputs, adjust the model or prompt, then process the rest. That short
+feedback loop saves time when the first model choice is not strong enough for
+your audio.
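+
+One way to plan a staged batch is to list which videos still lack a
+transcript. The helper below is a hypothetical sketch, not part of Sapat; it
+assumes files match the `*.mp4` pattern and that transcripts are written next
+to each video, as described above.
+
+```python
+from pathlib import Path
+
+
+def pending_videos(folder):
+    """Return .mp4 files in `folder` with no .txt transcript yet, sorted by name."""
+    folder = Path(folder)
+    return sorted(
+        video
+        for video in folder.glob("*.mp4")
+        if not video.with_suffix(".txt").exists()
+    )
+```
+
+Slicing the result, for example `pending_videos("recordings")[:3]`, gives a
+small sample to review before processing the rest.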
+
+## Validate The Transcript
+
+Do not hand a raw transcript straight to another agent. Review it first:
+
+Check | What To Look For
+--- | ---
+Completeness | The transcript covers the full recording, not only the first segment.
+Names | Product, speaker, company, and repository names match the prompt vocabulary.
+Numbers | Dates, version numbers, ports, amounts, and IDs are accurate.
+Private data | Secrets, customer names, or sensitive details are removed before sharing.
+Next-step readiness | The text is clear enough for summarization, issue filing, or documentation.
+
+Open the transcript in the terminal:
+
+```bash
+sed -n '1,160p' demo.txt
+```
+
+If the transcript stops early, check LocalAI logs, model limits, and temporary
+file size. If names are wrong, rerun with a more specific prompt. If the audio
+is noisy, try a cleaner source recording before tuning the provider.
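+
+Part of the Names check can be automated as a first pass. The sketch below
+uses a hypothetical `missing_terms` helper that reports prompt vocabulary never
+appearing in the transcript. A missing term may simply mean nobody said it, so
+treat the result as a hint for manual review rather than a verdict.
+
+```python
+def missing_terms(transcript, terms):
+    """Return the terms that do not occur in the transcript (case-insensitive)."""
+    lowered = transcript.lower()
+    return [term for term in terms if term.lower() not in lowered]
+```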
+
+## Compare Local And Cloud Providers
+
+One reason to use Sapat is that it gives the same CLI shape to multiple
+providers. After the LocalAI run works, compare it with another configured
+provider on the same sample:
+
+```bash
+sapat demo.mp4 --api localai --quality M --language en
+mv demo.txt demo.localai.txt
+
+sapat demo.mp4 --api openai --quality M --language en
+mv demo.txt demo.openai.txt
+
+diff -u demo.localai.txt demo.openai.txt
+```
+
+The goal is not to declare a universal winner from one clip. The goal is to
+measure the tradeoffs that matter for your team: privacy boundary, latency,
+cost, model availability, punctuation, code terms, and behavior on noisy audio.
+Daytona keeps that comparison repeatable because both runs happen in the same
+workspace with the same input file and command options.
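+
+Alongside `diff -u`, a single similarity number can help track how far two
+transcripts diverge across runs. This is a minimal sketch using the standard
+library's `difflib`; `word_similarity` is a hypothetical helper, and the ratio
+says nothing about which transcript is more accurate.
+
+```python
+import difflib
+
+
+def word_similarity(text_a, text_b):
+    """Return a 0.0-1.0 similarity ratio over lowercased, whitespace-split words."""
+    words_a = text_a.lower().split()
+    words_b = text_b.lower().split()
+    # autojunk=False keeps frequent words from being ignored in long transcripts.
+    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
+    return matcher.ratio()
+```
+
+To compare the files from the commands above, read `demo.localai.txt` and
+`demo.openai.txt` as text and pass both strings to the function.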
+
+## Troubleshooting
+
+Problem | Fix
+--- | ---
+`LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set` | Add one of those values to `.env`, then open a new shell or rerun the command.
+Connection refused | Confirm LocalAI is running and reachable from the Daytona workspace, not only from your laptop.
+401 or 403 response | Set `LOCALAI_API_KEY` if the server requires a bearer token.
+Unsupported audio file format | Let Sapat process the original `.mp4`; the provider receives the generated MP3.
+Empty or poor transcript | Check the LocalAI model, language, prompt, and source audio quality.
+`ffmpeg` not found | Install `ffmpeg` in the workspace image or through the workspace package manager.
+
+When debugging, keep the failing input small. A ten-second clip with known
+speech is enough to verify the endpoint, request shape, and transcript writing
+path.
+
+## Conclusion
+
+Sapat plus LocalAI gives AI engineers a private, reproducible transcription
+workflow: Daytona supplies the clean workspace, Sapat supplies the video-to-text
+CLI, and LocalAI supplies a self-hosted audio-to-text endpoint. The result is a
+simple loop that can turn recordings into transcripts without committing
+credentials or depending on a hidden local setup.
+
+Once the first transcript is reliable, use the same workspace for provider
+comparisons, batch processing, and downstream summarization. Keep the input
+videos, prompts, model names, and validation notes together so another engineer
+can reproduce the result later.
+
+## References
+
+- [LocalAI Audio to Text](https://localai.io/features/audio-to-text/)
+- [Daytona Documentation](https://www.daytona.io/docs/)
+- [Daytona CLI Reference](https://www.daytona.io/docs/en/tools/cli/)
+- [Sapat LocalAI provider PR](https://github.com/nibzard/sapat/pull/43)
diff --git a/articles/assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg b/articles/assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg
diff --git a/authors/jj_lin.md b/authors/jj_lin.md
@@ -0,0 +1,10 @@
+Author: JJ Lin
+Title: AI Engineering Contributor
+Description: JJ Lin works on practical AI engineering workflows, developer tooling, and reproducible automation. He focuses on turning small command-line tools into reliable, testable workflows that can run in clean development environments.
+Author Image: <https://github.com/JJ-Lin.png>
+Author LinkedIn:
+Author Twitter:
+Company Name: Independent
+Company Description: Independent software and AI engineering work.
+Company Logo Dark:
+Company Logo White: