7 changes: 7 additions & 0 deletions authors/markus_reimer.md
@@ -0,0 +1,7 @@
Author: Markus Reimer
Title: Software Engineer
Description: Markus Reimer is a software engineer and open-source contributor focused on pragmatic AI-assisted development, developer workflows, and maintainable automation for engineering teams.
Author Image: <https://avatars.githubusercontent.com/u/22987960?v=4>
Author Twitter: <https://twitter.com/markusreimer>
Company Name: Agilenge AB
Company Description: Agilenge AB builds pragmatic software and automation for engineering and business teams.
20 changes: 20 additions & 0 deletions definitions/20260520_definition_model_api_transcription.md
@@ -0,0 +1,20 @@
---
title: "Model API Transcription"
description: "A speech-to-text workflow that sends prepared audio to a hosted model API and stores the returned transcript."
date: 2026-05-20
author: "Markus Reimer"
---

# Model API Transcription

## Definition

Model API transcription is the process of converting speech to text by sending an audio file to a hosted AI model endpoint and receiving a transcript response.

## Context and Usage

Engineering teams use model API transcription when they want speech-to-text capabilities without operating their own inference infrastructure. A local tool or backend prepares audio, uploads it to a provider, waits for inference, and saves the returned text.

This pattern is useful for demo recordings, product interviews, meeting notes, support calls, accessibility drafts, and transcript archives. Teams still need to manage credentials carefully, check provider limits, and verify transcript quality.

They also need to decide which recordings are appropriate for third-party processing.
249 changes: 249 additions & 0 deletions guides/20260520_fal_ai_transcription_with_sapat_daytona.md
@@ -0,0 +1,249 @@
---
title: "fal.ai Transcription with Sapat"
description: "Run Sapat with fal.ai Whisper in a Daytona workspace to create a reproducible speech-to-text workflow."
date: 2026-05-20
author: "Markus Reimer"
tags: ["daytona", "sapat", "fal-ai", "transcription", "whisper"]
---

# fal.ai Transcription with Sapat

## Introduction

AI transcription often starts as a single command on one developer's machine. That works for a quick demo, but it breaks down when the workflow needs to be repeated, reviewed, or handed to another engineer.

Local Python versions differ, `ffmpeg` may be missing, and API credentials can end up in shell history or temporary scripts.

This guide shows how to run Sapat, a small Python video transcription tool, inside a Daytona workspace with a fal.ai Whisper provider. Daytona gives the workflow a clean development environment, while Sapat handles audio conversion, provider selection, and transcript file creation.

The fal.ai provider used in this guide is implemented in the companion Sapat pull request: [nibzard/sapat#30](https://github.com/nibzard/sapat/pull/30). While that PR is under review, use the contributor branch shown below. After it is merged, use the upstream Sapat repository directly.

![fal.ai-backed Sapat workflow in Daytona](assets/20260520_fal_sapat_daytona_workflow.svg)

## TL;DR

- Create a Daytona workspace so the transcription workflow is reproducible.
- Install Sapat and `ffmpeg` inside the workspace.
- Configure `FAL_KEY` as an environment variable instead of committing secrets.
- Run `sapat <recording> --api fal` to transcribe audio through fal.ai Whisper.
- Review a short smoke clip before processing a batch of recordings.

## Materials Checklist

| Item | Why you need it |
| --- | --- |
| Daytona installed and configured | Creates a clean workspace for Sapat |
| Python 3.9 or newer in the workspace | Runs Sapat and the fal.ai client |
| `ffmpeg` | Converts video inputs to MP3 before transcription |
| fal.ai API key | Authenticates the hosted Whisper model call |
| One short `.mp4`, `.mp3`, `.wav`, or `.m4a` sample | Verifies the workflow before a larger run |

## Step 1: Create a Daytona Workspace

Start with a clean workspace instead of depending on local machine state. If the companion Sapat PR is still under review, create the workspace from the contributor fork and switch to the fal.ai provider branch:

```bash
daytona create https://github.com/Dowser/sapat --code
```

Inside the workspace terminal, check out the provider branch:

```bash
git fetch origin codex/fal-transcription-provider
git checkout codex/fal-transcription-provider
```

After [nibzard/sapat#30](https://github.com/nibzard/sapat/pull/30) is merged, use the upstream project instead:

```bash
daytona create https://github.com/nibzard/sapat --code
```

This keeps the checkout, dependency metadata, and CLI surface consistent for everyone who needs to reproduce the transcription workflow.

## Step 2: Install Runtime Dependencies

Sapat converts video files to MP3 before passing audio to a transcription provider. Install `ffmpeg` in the workspace:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

Install Sapat from the checked-out project:

```bash
python -m pip install -e .
```

The fal.ai provider adds the `fal-client` Python package to Sapat's project dependencies. The client handles authentication through `FAL_KEY`, uploads local audio to fal.ai storage, and calls the hosted model with `fal_client.subscribe`.

Confirm that the CLI exposes the fal.ai provider:

```bash
sapat --help
```

The `--api` option should include `fal`:

```text
--api [openai|groq|azure|fal]
```
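If you want to script this check, for example in a workspace setup task, a minimal Python sketch can parse the help text. It assumes the help output lists providers in the bracketed form shown above. The function names `api_choices` and `sapat_supports_fal` are hypothetical and not part of Sapat.

```python
import re
import subprocess


def api_choices(help_text: str) -> list[str]:
    """Extract provider names from a line like '--api [openai|groq|azure|fal]'."""
    match = re.search(r"--api\s+\[([^\]]+)\]", help_text)
    return match.group(1).split("|") if match else []


def sapat_supports_fal() -> bool:
    """Run `sapat --help` and report whether 'fal' is an accepted --api value.

    Assumes the `sapat` command is on PATH in the current environment.
    """
    result = subprocess.run(
        ["sapat", "--help"], capture_output=True, text=True, check=False
    )
    return "fal" in api_choices(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the check easy to verify against saved help output.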

## Step 3: Configure fal.ai Credentials

Create an API key in the fal.ai dashboard and expose it to the Daytona workspace as an [environment variable](../definitions/20241126_definition_environment_variables.md):

```bash
export FAL_KEY="your-fal-api-key"
```

You can also configure the provider with optional variables:

```bash
export FAL_MODEL_ID="fal-ai/whisper"
export FAL_TASK="transcribe"
export FAL_CHUNK_LEVEL="segment"
export FAL_BATCH_SIZE="64"
```

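Keep real keys out of source code, committed `.env` files, PR descriptions, issue comments, and terminal logs. A literal `export FAL_KEY=...` line can also land in shell history, so on a shared machine prefer `read -rs FAL_KEY && export FAL_KEY` or your workspace's secret settings. The fal.ai client reads `FAL_KEY` automatically, so there is no reason to hard-code it in the Sapat project.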
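To illustrate how a provider might read these variables, here is a hedged sketch. The `FalSettings` class and `load_fal_settings` function are hypothetical names, and the defaults simply mirror the example values above; Sapat's actual defaults may differ. The error message and the 1 to 64 batch-size range are taken from the troubleshooting section of this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FalSettings:
    model_id: str
    task: str
    chunk_level: str
    batch_size: int


def load_fal_settings(env: Optional[Mapping[str, str]] = None) -> FalSettings:
    """Read fal.ai settings from the environment without echoing the key."""
    env = os.environ if env is None else env
    if not env.get("FAL_KEY"):
        raise RuntimeError("FAL_KEY is required for fal.ai transcription.")
    batch_size = int(env.get("FAL_BATCH_SIZE", "64"))  # assumed default
    if not 1 <= batch_size <= 64:
        raise ValueError("FAL_BATCH_SIZE must be between 1 and 64.")
    return FalSettings(
        model_id=env.get("FAL_MODEL_ID", "fal-ai/whisper"),  # assumed default
        task=env.get("FAL_TASK", "transcribe"),  # assumed default
        chunk_level=env.get("FAL_CHUNK_LEVEL", "segment"),  # assumed default
        batch_size=batch_size,
    )
```

The key itself is only checked for presence and never stored on the settings object, so it cannot leak through a debug print of the configuration.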

## Step 4: Prepare a Smoke Clip

Before processing a meeting archive or a directory of product demos, use a short recording with content you can verify by ear. A thirty-second clip is enough to test the full path:

- The workspace can run Sapat.
- `ffmpeg` can read and convert the file.
- Sapat routes the request to the fal.ai provider.
- fal.ai returns transcript text.
- Sapat writes the `.txt` file next to the input.

Create a small recordings directory:

```bash
mkdir -p recordings
cp ~/Downloads/demo-call.mp4 recordings/demo-call.mp4
```

Sapat writes the transcript beside the input file:

```text
recordings/
  demo-call.mp4
  demo-call.txt
```

If your source file is already audio, Sapat can still process it after conversion. The fal.ai Whisper API supports common formats such as MP3, MP4, MPEG, MPGA, M4A, WAV, and WebM.
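A small preflight check can catch the most common setup problems before any audio leaves the workspace. The sketch below is an assumption-laden helper, not part of Sapat: `preflight` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix set is derived from the formats listed above.

```python
import shutil
from pathlib import Path

# File suffixes derived from the formats listed in this guide.
SUPPORTED_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def preflight(path, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    clip = Path(path)
    if which("ffmpeg") is None:
        problems.append("ffmpeg is not on PATH")
    if not clip.is_file():
        problems.append(f"{clip} does not exist or is not a file")
    elif clip.stat().st_size == 0:
        problems.append(f"{clip} is empty")
    if clip.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {clip.suffix or '(none)'}")
    return problems
```

Calling `preflight("recordings/demo-call.mp4")` and printing the result gives a quick readiness report. The `which` parameter exists only so the ffmpeg lookup can be replaced in tests.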

## Step 5: Run Sapat with fal.ai Whisper

Run the smoke clip through the fal.ai provider:

```bash
sapat recordings/demo-call.mp4 --api fal --quality M --language en --prompt "Daytona, Sapat, fal.ai"
```

Under the hood, the provider does three things:

1. Validates that `FAL_KEY` is present and that the local audio file is a supported format.
2. Uploads the prepared audio file through `fal_client.upload_file`.
3. Calls `fal_client.subscribe("fal-ai/whisper", arguments={...})` with `audio_url`, `task`, `chunk_level`, `batch_size`, and optional prompt/language settings.

The fal.ai Whisper model returns a JSON response with a `text` field and optional chunk metadata. Sapat's existing base writer stores the `text` value in `recordings/demo-call.txt`.
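The upload, subscribe, and write steps can be sketched roughly as follows. This is an illustration, not the code from the companion PR: `transcribe_with_fal`, `write_transcript`, and the injectable `client` parameter are hypothetical. It uses only the `upload_file` and `subscribe` calls named above, and it needs the `fal-client` package and `FAL_KEY` only when no client is passed in.

```python
from pathlib import Path
from typing import Any, Optional


def transcribe_with_fal(
    audio_path,
    client: Optional[Any] = None,
    *,
    model_id: str = "fal-ai/whisper",
    task: str = "transcribe",
    chunk_level: str = "segment",
    batch_size: int = 64,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> str:
    """Upload a prepared audio file and return the transcript text."""
    if client is None:
        # Imported lazily so the sketch can be exercised with a stand-in client.
        import fal_client as client

    audio_url = client.upload_file(str(audio_path))
    arguments = {
        "audio_url": audio_url,
        "task": task,
        "chunk_level": chunk_level,
        "batch_size": batch_size,
    }
    # Optional settings are sent only when provided.
    if language:
        arguments["language"] = language
    if prompt:
        arguments["prompt"] = prompt

    result = client.subscribe(model_id, arguments=arguments)
    return result["text"]


def write_transcript(source_path, text: str) -> Path:
    """Write the transcript next to the input, replacing the suffix with .txt."""
    out_path = Path(source_path).with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Passing the client in as a parameter is a design choice for this sketch: it lets you verify the argument handling with a fake object before spending provider quota.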

Review the first transcript:

```bash
sed -n '1,80p' recordings/demo-call.txt
```

Check the details that usually decide whether a transcript is usable:

- Product names and acronyms are recognizable.
- The first and last spoken sections are present.
- The transcript is not empty or replaced by an error phrase.
- The output language matches the recording.
- Domain-specific terms from the prompt are spelled consistently throughout the transcript.
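Some of these checks can be automated as a first filter before a human reads the transcript. The following sketch assumes simple case-insensitive substring matching is good enough for a smoke test; `review_transcript` is a hypothetical helper, and it does not replace listening to the clip.

```python
def review_transcript(text, expected_terms=()):
    """Return a list of findings; an empty list means no obvious problems."""
    findings = []
    stripped = text.strip()
    if not stripped:
        # An empty transcript makes the remaining checks meaningless.
        return ["transcript is empty"]
    lowered = stripped.lower()
    for term in expected_terms:
        if term.lower() not in lowered:
            findings.append(f"expected term not found: {term}")
    return findings
```

For the smoke clip in this guide, the expected terms could be the same names passed in `--prompt`, such as `["Daytona", "Sapat"]`.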

## Step 6: Process a Small Batch

After the smoke test passes, process a directory of `.mp4` recordings:

```bash
sapat recordings --api fal --quality M --language en
```

Start with a small batch before sending a large set of recordings. Provider quotas, source audio quality, and model behavior are easier to debug with three files than with fifty.

A simple review loop helps you catch obvious failures quickly:

```bash
for transcript in recordings/*.txt; do
  echo "===== $transcript ====="
  sed -n '1,40p' "$transcript"
done
```

For production use, keep a manifest next to the outputs:

```markdown
# demo-call

- Source file: recordings/demo-call.mp4
- Provider: fal.ai
- Model: fal-ai/whisper
- Task: transcribe
- Chunk level: segment
- Reviewed by: <name>
- Known issues: "AcmeDB" appears once as "Acme DB"
```

That small note turns a raw transcript into an engineering artifact. A teammate can see which [model API transcription](../definitions/20260520_definition_model_api_transcription.md) workflow created it and what still needs review.
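If you generate manifests for more than a few files, a small helper keeps the format consistent. This sketch reproduces the fields from the example manifest above; `render_manifest` is a hypothetical function, and its defaults mirror the values used in this guide rather than anything Sapat writes itself.

```python
from pathlib import Path


def render_manifest(
    source,
    *,
    provider="fal.ai",
    model="fal-ai/whisper",
    task="transcribe",
    chunk_level="segment",
    reviewed_by="<name>",
    known_issues="none",
):
    """Render a Markdown manifest for one transcribed recording."""
    source = Path(source)
    lines = [
        f"# {source.stem}",
        "",
        f"- Source file: {source.as_posix()}",
        f"- Provider: {provider}",
        f"- Model: {model}",
        f"- Task: {task}",
        f"- Chunk level: {chunk_level}",
        f"- Reviewed by: {reviewed_by}",
        f"- Known issues: {known_issues}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result to a file such as `recordings/demo-call.manifest.md` is one option; the file name is a suggestion, not a Sapat convention.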

## Step 7: Decide Where Correction Belongs

The fal.ai provider in the companion Sapat PR focuses on transcription only. It does not add a second chat-based correction path.

That separation is intentional. Transcription and correction have different failure modes. The first pass should answer, "Did we capture the spoken words?" The second pass should answer, "Did we normalize names, acronyms, punctuation, and formatting correctly?"

If your team needs correction, run it as a separate review step with an approved model and a clear prompt. Keep the original transcript so reviewers can compare the corrected version against the provider output.
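Keeping the original also makes the comparison easy to automate. As a minimal sketch using Python's standard `difflib` module, the hypothetical `correction_diff` helper below shows exactly which lines a correction pass changed:

```python
import difflib


def correction_diff(original: str, corrected: str) -> list[str]:
    """Return unified-diff lines between provider output and corrected text."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            corrected.splitlines(),
            fromfile="provider",
            tofile="corrected",
            lineterm="",
        )
    )
```

An empty list means the correction pass changed nothing, which is itself worth recording in the review notes.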

## Common Issues and Troubleshooting

**Problem:** `FAL_KEY is required for fal.ai transcription.`

**Solution:** Export `FAL_KEY` in the workspace shell or set it through your team's secret-management workflow. Do not paste the key into PR comments or issue threads.

**Problem:** `ffmpeg` is missing.

**Solution:** Install it with `sudo apt-get install -y ffmpeg`. For repeated use, add `ffmpeg` to the workspace image or dev container configuration.

**Problem:** The transcript file is empty.

**Solution:** Confirm that the source recording has an audio stream. Run `ffprobe recordings/demo-call.mp4` and try a shorter sample before processing a full batch.
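To make the audio-stream check scriptable, the sketch below wraps `ffprobe` and asks it to list only audio streams. `has_audio_stream` is a hypothetical helper; it assumes `ffprobe` is installed alongside `ffmpeg` and raises `FileNotFoundError` if the binary is missing.

```python
import subprocess


def has_audio_stream(path, run=subprocess.run) -> bool:
    """Return True if ffprobe reports at least one audio stream in the file."""
    result = run(
        [
            "ffprobe",
            "-v", "error",                # suppress banner and info logs
            "-select_streams", "a",       # consider audio streams only
            "-show_entries", "stream=codec_type",
            "-of", "csv=p=0",             # print bare values, one per line
            str(path),
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and "audio" in result.stdout
```

The `run` parameter is there so the command handling can be checked without a real media file.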

**Problem:** fal.ai rejects the request.

**Solution:** Check that `FAL_KEY` is valid, that `FAL_MODEL_ID`, if you set it, is `fal-ai/whisper`, and that `FAL_BATCH_SIZE` is between 1 and 64.

**Problem:** Domain terms are spelled inconsistently.

**Solution:** Add a short prompt with product names, team names, and acronyms. Then run a separate human or model-assisted correction pass before downstream use.

## Conclusion

Running Sapat with fal.ai Whisper inside Daytona gives engineering teams a repeatable speech-to-text workflow instead of a one-off local command. Daytona keeps the environment clean, Sapat handles file conversion and provider routing, and fal.ai hosts the Whisper model behind a simple model API.

The practical habit is to keep the workflow auditable: isolate credentials, start with a short smoke clip, inspect outputs before a batch run, and package transcripts with provider metadata.

That makes the transcript useful beyond the first command that generated it.

## References

- [Sapat repository](https://github.com/nibzard/sapat)
- [Companion Sapat fal.ai provider PR](https://github.com/nibzard/sapat/pull/30)
- [fal.ai Whisper API reference](https://fal.ai/docs/model-api-reference/audio-api/whisper)
- [fal.ai client setup](https://fal.ai/docs/model-apis/client)
- [fal.ai authentication guide](https://fal.ai/docs/model-apis/authentication)
- [Daytona documentation](https://www.daytona.io/docs/)
50 changes: 50 additions & 0 deletions guides/assets/20260520_fal_sapat_daytona_workflow.svg