326 changes: 326 additions & 0 deletions articles/20260523_run_localai_transcription_with_sapat_in_daytona.md
@@ -0,0 +1,326 @@
---
title: 'Run LocalAI Transcription With Sapat'
description: 'Build a reproducible Daytona workspace for private Sapat video transcription backed by a self-hosted LocalAI endpoint.'
date: 2026-05-23
author: 'JJ Lin'
tags: ['Daytona', 'Sapat', 'Speech-to-Text', 'LocalAI']
---

# Run LocalAI Transcription With Sapat

## Introduction

AI teams often collect short product demos, debugging recordings, user research
clips, and internal walkthroughs long before those recordings become useful
written artifacts. A transcript makes the material searchable, easier to
summarize, and easier to review and redact before it reaches downstream agents.
The challenge is turning that
workflow into something another engineer can reproduce without guessing which
machine had `ffmpeg`, which shell had credentials, or which speech-to-text
provider was used.

[Sapat](https://github.com/nkkko/sapat) is a compact Python CLI for this job. It
accepts an `.mp4` file or a folder of `.mp4` files, converts each video to MP3
with `ffmpeg`, sends the audio to a selected transcription provider, and writes
a `.txt` transcript next to the source file. This guide adds a LocalAI path to
that workflow and runs it inside a [Daytona](https://www.daytona.io/docs/)
workspace so setup, configuration, and validation are explicit. The result is a
[self-hosted speech-to-text](/definitions/20260523_definition_self_hosted_speech_to_text.md)
workflow that stays reproducible without depending on one developer's laptop.

The companion implementation is available in
[nibzard/sapat#43](https://github.com/nibzard/sapat/pull/43). It adds
`--api localai`, documents the `LOCALAI_*` environment variables, and includes
mocked tests for CLI routing and request construction.

![LocalAI transcription workflow in Daytona](assets/20260523_run_localai_transcription_with_sapat_in_daytona.svg)

## TL;DR

- Use Daytona to open a clean Sapat workspace instead of relying on a hand-tuned
local machine.
- Run LocalAI where the workspace can reach it and expose its OpenAI-compatible
transcription endpoint.
- Configure `LOCALAI_BASE_URL`, `LOCALAI_MODEL`, and optional
`LOCALAI_API_KEY` outside source control.
- Run `sapat <file>.mp4 --api localai` to convert the video, send the MP3 to
LocalAI, and save a transcript.
- Review the transcript before sharing it with another model, teammate, or
public issue.

## What The LocalAI Provider Adds

LocalAI is a self-hosted AI runtime that offers OpenAI-compatible APIs for
several model types, including audio-to-text. Its
[audio-to-text documentation](https://localai.io/features/audio-to-text/)
describes a `POST /v1/audio/transcriptions` endpoint that accepts multipart
form data with an audio `file` and a `model` value. That shape is a natural fit
for Sapat because Sapat already prepares an MP3 file before calling a provider.

The LocalAI Sapat provider keeps the flow intentionally simple:

```text
video.mp4 -> ffmpeg MP3 conversion -> LocalAI transcription request -> video.txt
```

It reads the following environment variables:

Variable | Purpose | Default
--- | --- | ---
`LOCALAI_BASE_URL` | Base URL for the LocalAI server, such as `http://localhost:8080` | Required unless `LOCALAI_API_ENDPOINT` is set
`LOCALAI_API_ENDPOINT` | Full transcription endpoint override | Built from `LOCALAI_BASE_URL`
`LOCALAI_MODEL` | Audio-to-text model name sent in the request | `whisper-1`
`LOCALAI_API_KEY` | Optional bearer token if the LocalAI server requires auth | Not set

The provider does not require a cloud transcription account. That makes it
useful when a team wants to keep internal recordings inside a self-hosted
boundary, test transcription quality against local models, or run the same
workflow in a controlled development environment.
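
The precedence described in the table can be sketched in a few lines of Python. This is a minimal illustration, not the provider's actual code: the function name `resolve_localai_config` and the returned dictionary shape are assumptions made for this example, while the variable names, the `whisper-1` default, and the error message come from this guide.

```python
import os
from typing import Mapping, Optional

# Path documented by LocalAI for its OpenAI-compatible transcription API.
TRANSCRIPTION_PATH = "/v1/audio/transcriptions"


def resolve_localai_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve endpoint, model, and optional key from LOCALAI_* variables.

    Hypothetical helper: the real Sapat provider may structure this differently.
    """
    env = os.environ if env is None else env
    endpoint = env.get("LOCALAI_API_ENDPOINT", "").strip()
    base_url = env.get("LOCALAI_BASE_URL", "").strip()
    if not endpoint:
        if not base_url:
            raise ValueError("LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set")
        # Build the full endpoint from the base URL, avoiding a double slash.
        endpoint = base_url.rstrip("/") + TRANSCRIPTION_PATH
    return {
        "endpoint": endpoint,
        "model": env.get("LOCALAI_MODEL", "").strip() or "whisper-1",
        "api_key": env.get("LOCALAI_API_KEY", "").strip() or None,
    }
```

Passing a plain dictionary instead of `os.environ` keeps the resolution logic easy to test without touching the real environment.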

## Prerequisites

Before you start, make sure you have:

- A Daytona account and a CLI that can create a workspace.
- A LocalAI server with an audio-to-text model installed.
- A short `.mp4` recording with spoken audio.
- `ffmpeg` available in the Sapat workspace.
- Enough disk space for Sapat to create a temporary MP3 next to the video.

The guide uses the companion Sapat branch until the provider is merged
upstream. After merge, create the workspace from `nkkko/sapat` directly and skip
the branch checkout step.

## Start Or Reach A LocalAI Server

Run LocalAI wherever the Daytona workspace can reach it. For a quick local test,
that may be a LocalAI process on the same machine or inside the same development
network. For a team setup, it may be an internal server with access control in
front of it.

After the server is running and a Whisper-compatible model is installed, export
`LOCALAI_BASE_URL` in your shell (and `LOCALAI_API_KEY` if the server needs one),
then verify the transcription endpoint with a small audio file:

```bash
curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -F file="@sample.wav" \
  -F model="whisper-1"
```

If the server requires authentication, include the bearer token:

```bash
curl "$LOCALAI_BASE_URL/v1/audio/transcriptions" \
  -H "Authorization: Bearer $LOCALAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@sample.wav" \
  -F model="whisper-1"
```

Do this endpoint check before running Sapat. It separates LocalAI setup problems
from Sapat workflow problems and makes troubleshooting much faster.
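
If `curl` is not available in the workspace, the same check can be approximated with the Python standard library. The sketch below is an illustration under stated assumptions: `build_multipart` and `check_endpoint` are hypothetical names, the response is assumed to be JSON, and the file name is assumed to contain no double quotes.

```python
import json
import os
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def build_multipart(fields: Dict[str, str], filename: str, file_bytes: bytes) -> Tuple[bytes, str]:
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def check_endpoint(base_url: str, audio_path: str, model: str = "whisper-1",
                   api_key: Optional[str] = None) -> dict:
    """POST a small audio file and return the decoded JSON response."""
    with open(audio_path, "rb") as handle:
        body, content_type = build_multipart(
            {"model": model}, os.path.basename(audio_path), handle.read()
        )
    headers = {"Content-Type": content_type}
    if api_key:
        # Only sent when the server requires a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body, headers=headers, method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

A call such as `check_endpoint(os.environ["LOCALAI_BASE_URL"], "sample.wav")` would mirror the first `curl` command. An HTTP error raised here points at the LocalAI setup rather than at Sapat.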

## Create The Daytona Workspace

Create a Daytona workspace from the fork that contains the LocalAI provider:

```bash
daytona create https://github.com/JJ-Lin/sapat --name sapat-localai
```

Open a terminal in the workspace, then check out the provider branch:

```bash
git fetch origin feature/localai-transcription-provider
git checkout feature/localai-transcription-provider
```

Install Sapat in editable mode:

```bash
python -m pip install -e .
```

Sapat uses `ffmpeg` to convert videos to MP3 before transcription. Confirm that
it is available:

```bash
ffmpeg -version
```

If `ffmpeg` is missing, install it through the package manager available in your
Daytona environment or bake it into the workspace image. The important part is
to make this dependency visible in the workspace rather than leaving it as an
undocumented local-machine assumption.
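
One way to make that dependency visible is a small preflight check that runs before any transcription. The following is a sketch only: `missing_tools` is a hypothetical helper, and the list of tools to check is an assumption you would adapt to your workspace.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(tools: Iterable[str],
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be tested without real binaries.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(["ffmpeg", "sapat"])` in a workspace startup script and failing fast on a non-empty result turns a confusing mid-run error into an explicit setup message.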

## Configure LocalAI Without Committing Secrets

Create a `.env` file in the workspace root:

```bash
LOCALAI_BASE_URL=http://localhost:8080
LOCALAI_MODEL=whisper-1
LOCALAI_API_KEY=
```

If the LocalAI server is not reachable through a simple base URL, set the full
endpoint instead:

```bash
LOCALAI_API_ENDPOINT=https://localai.example.com/v1/audio/transcriptions
LOCALAI_MODEL=whisper-1
LOCALAI_API_KEY=replace_if_required
```

Do not commit `.env`. A safe `.env.example` can show the required variable names
without exposing a server URL or token:

```bash
LOCALAI_BASE_URL=
LOCALAI_API_ENDPOINT=
LOCALAI_MODEL=whisper-1
LOCALAI_API_KEY=
```

This is where Daytona helps. The workspace setup stays reproducible, while
secrets and internal endpoints remain outside the repository.
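
When sharing setup notes or debugging output, it helps to confirm which settings are present without printing the token. The sketch below assumes a simple `KEY=VALUE` file format, and both function names are hypothetical. Sapat itself may load `.env` through a different mechanism.

```python
from typing import Dict

LOCALAI_KEYS = (
    "LOCALAI_BASE_URL",
    "LOCALAI_API_ENDPOINT",
    "LOCALAI_MODEL",
    "LOCALAI_API_KEY",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def describe_settings(values: Dict[str, str]) -> Dict[str, str]:
    """Summarize LOCALAI_* settings without revealing the API key."""
    summary = {}
    for key in LOCALAI_KEYS:
        value = values.get(key, "")
        if not value:
            summary[key] = "(not set)"
        elif key == "LOCALAI_API_KEY":
            summary[key] = "(set, hidden)"
        else:
            summary[key] = value
    return summary
```

The summary still shows the server URL, so treat it as internal output if the endpoint itself is sensitive.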

## Run A First Transcription

Copy a short test video into the workspace. Start with a clip of a few minutes
at most so you can test the full loop quickly.

```bash
sapat demo.mp4 --api localai --quality M --language en
```

Sapat will:

1. Convert `demo.mp4` to `demo.mp3`.
2. Send the generated MP3 to LocalAI.
3. Save the returned transcript as `demo.txt`.
4. Remove the temporary MP3 file after processing.
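
The file naming in the steps above can be sketched as follows. This is an illustration, not Sapat's implementation: `plan_transcription` is a hypothetical helper, the `ffmpeg` arguments are one plausible way to extract MP3 audio, and the bitrates attached to `M` and `H` are assumptions chosen for the example.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed bitrates for illustration; Sapat's real quality mapping may differ.
ASSUMED_BITRATES = {"M": "64k", "H": "128k"}


def plan_transcription(video: str, quality: str = "M") -> Tuple[List[str], Path, Path]:
    """Return the ffmpeg command, temporary MP3 path, and transcript path."""
    source = Path(video)
    mp3_path = source.with_suffix(".mp3")  # temporary file beside the video
    txt_path = source.with_suffix(".txt")  # final transcript beside the video
    command = [
        "ffmpeg", "-y", "-i", str(source),
        "-vn",                      # drop the video stream
        "-codec:a", "libmp3lame",   # encode audio as MP3
        "-b:a", ASSUMED_BITRATES[quality],
        str(mp3_path),
    ]
    return command, mp3_path, txt_path
```

Because both output paths are derived from the video name, a file that is already called `demo.mp3` or `demo.txt` in the same folder is worth moving aside before the first run.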

For clearer audio at the cost of a larger temporary file, use the high-quality
conversion option:

```bash
sapat demo.mp4 --api localai --quality H --language en
```

For product names, acronyms, or internal terms, pass a prompt:

```bash
sapat demo.mp4 \
  --api localai \
  --language en \
  --prompt "Product names: Daytona, Sapat, LocalAI"
```

The prompt gives the transcription model vocabulary hints. It is especially
useful for developer tools, repository names, customer names, and abbreviations
that are easy to misspell.
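
If the vocabulary list lives in a project file, the prompt can be generated rather than typed by hand. The sketch below is hedged on two points: both helper names are hypothetical, and the `language` and `prompt` form fields follow the OpenAI transcription API shape, which a given LocalAI backend may or may not honor.

```python
from typing import Dict, Iterable, Optional


def vocabulary_prompt(terms: Iterable[str], label: str = "Product names") -> str:
    """Join de-duplicated, trimmed terms into a short vocabulary hint."""
    cleaned = (term.strip() for term in terms)
    unique = list(dict.fromkeys(term for term in cleaned if term))
    return f"{label}: {', '.join(unique)}"


def build_form_fields(model: str, language: Optional[str] = None,
                      prompt: Optional[str] = None) -> Dict[str, str]:
    """Build the text fields of a transcription request, omitting unset ones."""
    fields = {"model": model}
    if language:
        fields["language"] = language
    if prompt:
        fields["prompt"] = prompt
    return fields
```

Keeping the prompt short and limited to names that actually occur in the recording tends to be easier to reason about than a long glossary.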

## Process A Folder Of Recordings

Sapat can process every `.mp4` file in a directory. This is useful for a batch
of product demos, design review clips, or field recordings from the same
project.

```bash
mkdir recordings
```

Copy the videos into that folder, then run:

```bash
sapat recordings --api localai --quality M --language en --prompt "Product names: Daytona, Sapat, LocalAI"
```

Sapat writes one `.txt` transcript for each `.mp4` file, named after the video.
Give the videos descriptive names before you run the batch: a transcript named
`workspace_setup_walkthrough.txt` is much easier to reuse than
`screen-recording-7.txt`.

For larger folders, start with two or three representative recordings. Review
the outputs, adjust the model or prompt, then process the rest. That short
feedback loop saves time when the first model choice is not strong enough for
your audio.

## Validate The Transcript

Do not hand a raw transcript straight to another agent. Review it first:

Check | What To Look For
--- | ---
Completeness | The transcript covers the full recording, not only the first segment.
Names | Product, speaker, company, and repository names match the prompt vocabulary.
Numbers | Dates, version numbers, ports, amounts, and IDs are accurate.
Private data | Secrets, customer names, or sensitive details are removed before sharing.
Next-step readiness | The text is clear enough for summarization, issue filing, or documentation.

Open the transcript in the terminal:

```bash
sed -n '1,160p' demo.txt
```

If the transcript stops early, check LocalAI logs, model limits, and temporary
file size. If names are wrong, rerun with a more specific prompt. If the audio
is noisy, try a cleaner source recording before tuning the provider.
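
Parts of the review table can be pre-screened automatically before a human reads the transcript. The sketch below is illustrative only: the helper names are hypothetical, and the two secret patterns are examples that will not catch every credential format.

```python
import re
from typing import Iterable, List

# Illustrative patterns only; extend them for the secrets your team handles.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{16,}"),
]


def missing_terms(text: str, terms: Iterable[str]) -> List[str]:
    """Return expected terms that never appear in the transcript."""
    lowered = text.lower()
    return [term for term in terms if term.lower() not in lowered]


def find_secret_like(text: str) -> List[str]:
    """Return substrings that look like credentials and need manual review."""
    return [
        match.group(0)
        for pattern in SECRET_PATTERNS
        for match in pattern.finditer(text)
    ]
```

A missing term often means the model spelled a product name differently, which is a cue to refine the prompt. A clean result from `find_secret_like` does not replace the manual privacy review.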

## Compare Local And Cloud Providers

One reason to use Sapat is that it gives the same CLI shape to multiple
providers. After the LocalAI run works, compare it with another configured
provider on the same sample:

```bash
sapat demo.mp4 --api localai --quality M --language en
mv demo.txt demo.localai.txt

sapat demo.mp4 --api openai --quality M --language en
mv demo.txt demo.openai.txt

diff -u demo.localai.txt demo.openai.txt
```

The goal is not to declare a universal winner from one clip. The goal is to
measure the tradeoffs that matter for your team: privacy boundary, latency,
cost, model availability, punctuation, code terms, and behavior on noisy audio.
Daytona keeps that comparison repeatable because both runs happen in the same
workspace with the same input file and command options.
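
A line-based `diff` can be noisy when two providers wrap or punctuate text differently. As a complement, the sketch below compares transcripts word by word with Python's `difflib`. The function names are hypothetical, and the similarity ratio is a rough signal rather than a formal word error rate.

```python
import difflib
import re
from typing import List, Tuple


def words(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_similarity(first: str, second: str) -> float:
    """Return a 0..1 similarity ratio over word sequences."""
    matcher = difflib.SequenceMatcher(None, words(first), words(second), autojunk=False)
    return matcher.ratio()


def differing_phrases(first: str, second: str) -> List[Tuple[str, str]]:
    """Return (first, second) phrase pairs where the transcripts disagree."""
    a, b = words(first), words(second)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Reading `demo.localai.txt` and `demo.openai.txt` and passing their contents to `differing_phrases` surfaces the disagreements, which are usually names, numbers, and code terms.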

## Troubleshooting

Problem | Fix
--- | ---
`LOCALAI_BASE_URL or LOCALAI_API_ENDPOINT must be set` | Add one of those values to `.env`, then open a new shell or rerun the command.
Connection refused | Confirm LocalAI is running and reachable from the Daytona workspace, not only from your laptop.
401 or 403 response | Set `LOCALAI_API_KEY` if the server requires a bearer token.
Unsupported audio file format | Let Sapat process the original `.mp4`; the provider receives the generated MP3.
Empty or poor transcript | Check the LocalAI model, language, prompt, and source audio quality.
`ffmpeg` not found | Install `ffmpeg` in the workspace image or through the workspace package manager.

When debugging, keep the failing input small. A ten-second clip with known
speech is enough to verify the endpoint, request shape, and transcript writing
path.
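
A short debugging clip can be cut from an existing recording with `ffmpeg`. The sketch below only builds the command. `trim_command` and the `_clip` naming are assumptions for this example, and stream copy (`-c copy`) cuts on keyframes, so the clip length is approximate.

```python
from pathlib import Path
from typing import List


def trim_command(video: str, seconds: int = 10) -> List[str]:
    """Build an ffmpeg command that copies the first `seconds` of a video."""
    source = Path(video)
    clip = source.with_name(f"{source.stem}_clip{source.suffix}")
    return [
        "ffmpeg", "-y", "-i", str(source),
        "-t", str(seconds),  # limit the output duration
        "-c", "copy",        # copy streams without re-encoding
        str(clip),
    ]
```

Passing the result to `subprocess.run(..., check=True)` produces `demo_clip.mp4` next to `demo.mp4`, which can then be fed to `sapat demo_clip.mp4 --api localai`.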

## Conclusion

Sapat plus LocalAI gives AI engineers a private, reproducible transcription
workflow: Daytona supplies the clean workspace, Sapat supplies the video-to-text
CLI, and LocalAI supplies a self-hosted audio-to-text endpoint. The result is a
simple loop that can turn recordings into transcripts without committing
credentials or depending on a hidden local setup.

Once the first transcript is reliable, use the same workspace for provider
comparisons, batch processing, and downstream summarization. Keep the input
videos, prompts, model names, and validation notes together so another engineer
can reproduce the result later.

## References

- [LocalAI Audio to Text](https://localai.io/features/audio-to-text/)
- [Daytona Documentation](https://www.daytona.io/docs/)
- [Daytona CLI Reference](https://www.daytona.io/docs/en/tools/cli/)
- [Sapat LocalAI provider PR](https://github.com/nibzard/sapat/pull/43)
10 changes: 10 additions & 0 deletions authors/jj_lin.md
@@ -0,0 +1,10 @@
Author: JJ Lin
Title: AI Engineering Contributor
Description: JJ Lin works on practical AI engineering workflows, developer tooling, and reproducible automation. He focuses on turning small command-line tools into reliable, testable workflows that can run in clean development environments.
Author Image: <https://github.com/JJ-Lin.png>
Author LinkedIn:
Author Twitter:
Company Name: Independent
Company Description: Independent software and AI engineering work.
Company Logo Dark:
Company Logo White: