10 changes: 10 additions & 0 deletions authors/dysfu.md
@@ -0,0 +1,10 @@
Author: DYSfu
Title: Independent Developer
Description: DYSfu writes practical developer guides focused on reproducible environments, small automation tools, and source-checked implementation notes. The work favors concise setup steps, clear validation commands, and maintainable workflows over broad claims.
Author Image: <https://github.com/DYSfu.png>
Author LinkedIn:
Author Twitter:
Company Name: Independent
Company Description: Independent software delivery and technical writing.
Company Logo Dark:
Company Logo White:
@@ -0,0 +1,20 @@
---
title: 'OpenAI-Compatible Speech-to-Text'
description: 'A speech-to-text API that follows the request and response shape of OpenAI transcription endpoints.'
date: 2026-05-21
author: 'DYSfu'
---

# OpenAI-Compatible Speech-to-Text

## Definition

OpenAI-compatible speech-to-text is an API design in which a provider accepts transcription requests that mirror the request and response shape of OpenAI's audio transcription endpoint, `POST https://api.openai.com/v1/audio/transcriptions`.

These services usually accept an audio file uploaded as `multipart/form-data`, a bearer token in the `Authorization` header, a `response_format` field, and optional fields such as `language` and `prompt`.
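To make that request shape concrete, here is a minimal standard-library sketch that builds (but does not send) such a request. The function name `build_transcription_request` and the `audio/mpeg` content type are assumptions for illustration. A given provider may require extra fields (OpenAI's own endpoint, for example, also requires a `model` field), so check its documentation before relying on this shape.

```python
import uuid
import urllib.request


def build_transcription_request(
    endpoint: str,
    api_key: str,
    audio_bytes: bytes,
    filename: str,
    fields: dict[str, str],
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    # Plain form fields such as response_format, language, or prompt.
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The audio itself, sent under the conventional "file" field name.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/mpeg\r\n\r\n".encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        endpoint,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual POST. Keeping construction separate from sending makes the request easy to inspect in tests.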

## Context and Usage

OpenAI-compatible speech-to-text APIs are useful when a tool needs provider choice without rewriting the entire transcription workflow. A command-line tool can keep the same file conversion, environment-variable handling, and transcript writing steps while routing the final API call to a different provider.
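A sketch of that routing idea, assuming a hypothetical provider registry: the workflow stays the same, and only the endpoint URL and the environment variable holding the key change per provider. The registry contents and the `resolve_provider` helper are illustrative and not taken from any specific tool.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ProviderConfig:
    endpoint: str
    api_key_var: str  # name of the environment variable holding the key


# Hypothetical registry: only the endpoint and key variable differ per provider.
PROVIDERS: dict[str, ProviderConfig] = {
    "openai": ProviderConfig(
        "https://api.openai.com/v1/audio/transcriptions", "OPENAI_API_KEY"
    ),
    "lemonfox": ProviderConfig(
        "https://api.lemonfox.ai/v1/audio/transcriptions", "LEMONFOX_API_KEY"
    ),
}


def resolve_provider(
    name: str, env: Mapping[str, str] = os.environ
) -> tuple[str, str]:
    """Return (endpoint, api_key) for a provider, or raise a clear error."""
    try:
        config = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
    api_key = env.get(config.api_key_var)
    if not api_key:
        raise RuntimeError(f"{config.api_key_var} is required for {name} transcription.")
    return config.endpoint, api_key
```

Everything upstream of this lookup (file conversion, transcript writing) can stay provider-agnostic.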

In a Daytona workspace, this pattern helps teams compare transcription providers in a disposable environment. Engineers can keep secrets in `.env`, run the same test clip through multiple providers, and decide which endpoint fits their cost, latency, privacy, and accuracy requirements.
212 changes: 212 additions & 0 deletions guides/20260521_lemonfox_sapat_transcription_daytona.md
@@ -0,0 +1,212 @@
---
title: "Run Lemonfox transcription with Sapat in Daytona"
description: "Build a reproducible Daytona workflow for Sapat video transcription using the Lemonfox Speech-to-Text API."
date: 2026-05-21
author: "DYSfu"
tags: ["sapat", "lemonfox", "transcription", "daytona"]
---

# Run Lemonfox transcription with Sapat in Daytona

## Introduction

Sapat is a small command-line tool for turning video files into transcripts. It converts video to MP3 with FFmpeg, sends the audio to a speech-to-text provider, and writes a `.txt` file beside the original recording.

The project already supports OpenAI, Groq, and Azure OpenAI. This guide adds a fourth path: Lemonfox Speech-to-Text.

The useful part is not just adding another vendor name to the list. Lemonfox exposes an OpenAI-compatible transcription endpoint, supports uploaded audio files, accepts a `prompt` for domain vocabulary, and can return JSON, plain text, SRT, VTT, or verbose JSON.

That makes it a good fit for AI engineers who want a clean fallback provider inside a reproducible [Daytona workspace](<../definitions/20240819_definition_daytona workspace.md>).

In this guide, you will set up Sapat in Daytona, configure Lemonfox credentials without committing secrets, run a test transcription, and verify that the new provider path behaves like the rest of the Sapat CLI. The companion implementation is available in [nibzard/sapat#23](https://github.com/nibzard/sapat/pull/23).

## TL;DR

- Sapat now has a `--api lemonfox` provider backed by Lemonfox's OpenAI-compatible `/v1/audio/transcriptions` endpoint.
- Daytona gives you a disposable workspace where FFmpeg, Python dependencies, and provider credentials can be tested without polluting a local machine.
- The workflow uses environment variables for `LEMONFOX_API_KEY`, `LEMONFOX_API_ENDPOINT`, and `LEMONFOX_RESPONSE_FORMAT`.
- Validation can be done without real recordings or secrets by running mocked unit tests and `sapat --help`.
- Use Lemonfox prompts for product names, acronyms, and domain terms that the model might otherwise miss.

## Why Lemonfox fits Sapat

Sapat's provider model is intentionally simple. The CLI receives a video path, converts it to MP3, calls a provider class, and writes the provider result to disk. A provider only needs to accept a local audio file and return either a string or a JSON object with a `text` field.
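That output contract fits in a small normalization step. The sketch below is hypothetical (the helper name `transcript_text` is not from Sapat's source); it accepts the two result shapes described above and rejects anything else.

```python
from typing import Any


def transcript_text(result: Any) -> str:
    """Normalize a provider result into the text that gets written to disk.

    Accepts either a plain string or a JSON-like dict with a "text" field.
    """
    if isinstance(result, str):
        return result
    if isinstance(result, dict) and isinstance(result.get("text"), str):
        return result["text"]
    raise TypeError(f"unsupported provider result: {type(result).__name__}")
```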

Lemonfox fits that shape because its Speech-to-Text API accepts a file upload and returns the transcript in a familiar response format.

The public API documentation lists `POST https://api.lemonfox.ai/v1/audio/transcriptions` as the transcription endpoint and describes the `file`, `response_format`, `prompt`, and `language` parameters. It also documents a 100 MB limit for direct file uploads and larger limits for URL-based audio.

That means the Sapat integration can stay small:

- read Lemonfox settings from `.env`;
- upload the MP3 file Sapat already creates;
- forward `language` and `prompt` when the caller provides them;
- keep the output contract the same as the existing OpenAI/Groq providers.
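The "forward only when provided" rule from the list above can be sketched as a helper that builds the non-file form fields. The name `lemonfox_form_fields` is hypothetical, and the field names follow the parameters listed in this section (`response_format`, `language`, `prompt`); this is not the code in the companion PR.

```python
from typing import Optional


def lemonfox_form_fields(
    response_format: str = "json",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict[str, str]:
    """Build the non-file form fields for a Lemonfox transcription upload.

    Optional fields are included only when the caller provides a non-empty
    value, so an omitted --language or --prompt never sends an empty field.
    """
    fields = {"response_format": response_format}
    if language:
        fields["language"] = language
    if prompt:
        fields["prompt"] = prompt
    return fields
```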

![Lemonfox Sapat workflow](assets/20260521_lemonfox_sapat_transcription_daytona_img1.svg)

## Prerequisites

You need:

- A Daytona account and the Daytona CLI installed.
- Python 3.10 or newer in the workspace.
- FFmpeg available in the workspace image.
- A Lemonfox API key.
- A short `.mp4` test clip that does not contain private or regulated data.

Do not paste real API keys into Git commits, PR descriptions, screenshots, or terminal recordings. Keep them in `.env` or in your Daytona workspace environment only.

## Create the Daytona workspace

Start from the Sapat repository. The companion PR can be checked out directly, or you can use the fork branch while it is under review.

```bash
daytona create https://github.com/DYSfu/sapat --code
```

Inside the workspace, switch to the Lemonfox branch:

```bash
git fetch origin add-lemonfox-transcription
git switch add-lemonfox-transcription
```

Install the package in editable mode:

```bash
python -m venv .venv
. .venv/bin/activate
python -m pip install -e .
```

Confirm that the CLI sees the new provider:

```bash
PYTHONPATH=src python -m sapat.script --help
```

The `--api` option should include `lemonfox` alongside `openai`, `groq`, and `azure`.
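To see how such a provider switch behaves, here is a hypothetical `argparse` sketch of an `--api` option with these four choices. Sapat's actual CLI may be built differently, and the `openai` default is an assumption. In this sketch, `--help` lists the choices as `{openai,groq,azure,lemonfox}`, and an unknown provider makes argparse exit with a usage error.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argparse version of a provider switch like Sapat's --api."""
    parser = argparse.ArgumentParser(prog="sapat")
    parser.add_argument("video", help="path to the video file to transcribe")
    parser.add_argument(
        "--api",
        choices=["openai", "groq", "azure", "lemonfox"],
        default="openai",  # assumption: the real default may differ
        help="speech-to-text provider",
    )
    parser.add_argument("--language", default=None, help="optional language hint")
    parser.add_argument("--prompt", default=None, help="optional vocabulary prompt")
    return parser
```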

## Configure Lemonfox credentials

Create a `.env` file in the Sapat project root:

```bash
cat > .env <<'EOF'
LEMONFOX_API_KEY=replace_with_your_key
LEMONFOX_API_ENDPOINT=https://api.lemonfox.ai/v1/audio/transcriptions
LEMONFOX_RESPONSE_FORMAT=json
EOF
```

The endpoint value above uses Lemonfox's global API host. Lemonfox also documents an EU processing host, `https://eu-api.lemonfox.ai/v1/audio/transcriptions`, for teams that need EU-based processing. Use the endpoint that matches your data-handling requirements.

Keep `LEMONFOX_RESPONSE_FORMAT=json` for the normal Sapat flow, because Sapat reads the `text` field from JSON provider output when it writes the transcript. Lemonfox also supports `text`, `srt`, `vtt`, and `verbose_json`, but those formats change what the provider returns.
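Before the first real upload, a quick sanity check on these settings can catch a missing key, a mistyped host, or an unexpected response format. This sketch is hypothetical: `check_lemonfox_env` is not part of Sapat, the default endpoint mirrors the `.env` example above, and the host and format lists come from this guide, so re-check them against Lemonfox's documentation.

```python
from typing import Mapping
from urllib.parse import urlparse

# Hosts and formats as listed in this guide; re-check against Lemonfox's docs.
DOCUMENTED_HOSTS = {"api.lemonfox.ai", "eu-api.lemonfox.ai"}
SUPPORTED_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}


def check_lemonfox_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems with the Lemonfox settings (empty if OK)."""
    problems = []
    if not env.get("LEMONFOX_API_KEY"):
        problems.append("LEMONFOX_API_KEY is missing or empty")
    endpoint = env.get(
        "LEMONFOX_API_ENDPOINT", "https://api.lemonfox.ai/v1/audio/transcriptions"
    )
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or parsed.hostname not in DOCUMENTED_HOSTS:
        problems.append(f"unexpected endpoint host: {endpoint}")
    fmt = env.get("LEMONFOX_RESPONSE_FORMAT", "json")
    if fmt not in SUPPORTED_FORMATS:
        problems.append(f"unsupported response format: {fmt}")
    elif fmt != "json":
        problems.append(
            f"response format {fmt!r} is not the 'json' the normal Sapat flow expects"
        )
    return problems
```

For example, call it with `dict(os.environ)` after your `.env` has been loaded, and stop if the returned list is not empty. Returning a list instead of raising shows every problem in one pass.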

## Run a first transcription

Use a short recording first. A one-minute clip is enough to validate credentials, FFmpeg, file upload, and transcript output.

```bash
sapat demo_recording.mp4 \
  --api lemonfox \
  --quality M \
  --language english \
  --prompt "Daytona, Sapat, Lemonfox, FFmpeg"
```

Sapat will:

1. Convert `demo_recording.mp4` to `demo_recording.mp3`.
2. Upload the MP3 file to Lemonfox.
3. Save the transcript as `demo_recording.txt`.
4. Remove the temporary MP3 file after the transcript is written.
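The four steps above can be sketched as a small pipeline. This is a hypothetical outline, not Sapat's implementation: the conversion and upload steps are passed in as functions, and the FFmpeg arguments shown (`-y` to overwrite, `-i` for the input, `-vn` to drop video) are a plain conversion that does not model Sapat's `--quality` setting.

```python
from pathlib import Path
from typing import Callable


def ffmpeg_command(video: Path, mp3: Path) -> list[str]:
    """FFmpeg arguments that drop the video stream and write an MP3."""
    return ["ffmpeg", "-y", "-i", str(video), "-vn", str(mp3)]


def transcribe_video(
    video: Path,
    convert: Callable[[Path, Path], None],
    transcribe: Callable[[Path], str],
) -> Path:
    """Convert, transcribe, write <name>.txt beside the video, then delete the MP3."""
    mp3 = video.with_suffix(".mp3")
    transcript = video.with_suffix(".txt")
    convert(video, mp3)
    try:
        text = transcribe(mp3)
        transcript.write_text(text, encoding="utf-8")
    finally:
        # Remove the temporary MP3 whether or not the upload succeeded.
        mp3.unlink(missing_ok=True)
    return transcript
```

In real use, `convert` could be `lambda v, m: subprocess.run(ffmpeg_command(v, m), check=True)`. One design choice differs from the step list: the sketch deletes the MP3 in a `finally` block, so a failed upload does not leave the temporary file behind either.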

Open the transcript and check the vocabulary you included in the prompt:

```bash
sed -n '1,120p' demo_recording.txt
```

If the transcript spells the project names, acronyms, and product terms correctly, the prompt is doing its job. If not, make the prompt more concrete; a short comma-separated list works well for product names and acronyms.
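To check the prompt vocabulary without reading the whole transcript, a hypothetical helper like the one below lists the comma-separated prompt terms that never appear in the output. It uses case-insensitive substring matching, so it catches missing or misspelled terms but not wrong capitalization.

```python
def missing_prompt_terms(transcript: str, prompt: str) -> list[str]:
    """Return comma-separated prompt terms that do not appear in the transcript.

    Matching is a case-insensitive substring check, so treat it as a quick
    smoke test rather than a full accuracy measure.
    """
    lowered = transcript.lower()
    terms = [term.strip() for term in prompt.split(",") if term.strip()]
    return [term for term in terms if term.lower() not in lowered]
```

For example: `missing_prompt_terms(Path("demo_recording.txt").read_text(), "Daytona, Sapat, Lemonfox, FFmpeg")`.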

## Validate without sending audio

Before relying on the workflow for real recordings, run the mocked test suite. These tests do not call Lemonfox or require an API key.

```bash
PYTHONPATH=src python -m unittest discover -s tests -v
```

The Lemonfox test checks that Sapat posts a file upload to the configured endpoint with:

- an `Authorization: Bearer ...` header;
- `response_format=json`;
- optional `language`;
- optional `prompt`;
- no hardcoded secret values.
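The same idea can be illustrated with `unittest.mock`: patch the network call, run the provider function, and inspect the request it would have sent. This is a standalone sketch with a hypothetical `post_audio` function and a placeholder key; it is not the contents of Sapat's `tests/test_lemonfox_transcription.py`.

```python
import unittest
import urllib.request
from unittest import mock


def post_audio(endpoint: str, api_key: str, body: bytes, content_type: str) -> bytes:
    """Hypothetical provider call: POST an already-encoded multipart body."""
    request = urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


class PostAudioTest(unittest.TestCase):
    @mock.patch("urllib.request.urlopen")
    def test_sends_bearer_token_without_network(self, fake_urlopen):
        # urlopen is used as a context manager, so configure __enter__'s result.
        fake_urlopen.return_value.__enter__.return_value.read.return_value = (
            b'{"text": "ok"}'
        )
        result = post_audio(
            "https://api.lemonfox.ai/v1/audio/transcriptions",
            "dummy-key",  # placeholder, never a real secret
            b"placeholder-multipart-body",
            "multipart/form-data; boundary=b",
        )
        sent = fake_urlopen.call_args.args[0]
        self.assertEqual(sent.get_header("Authorization"), "Bearer dummy-key")
        self.assertEqual(sent.get_method(), "POST")
        self.assertEqual(result, b'{"text": "ok"}')
```

Saved as a `test_*.py` file under `tests/`, it would be picked up by the `unittest discover` command above.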

Also run a compile check:

```bash
PYTHONPATH=src python -m py_compile \
src/sapat/script.py \
src/sapat/transcription/lemonfox.py \
tests/test_lemonfox_transcription.py \
tests/test_script.py
```

These checks are useful in Daytona because workspaces are disposable: after you recreate one, the same commands confirm the setup again. You can keep them in a project README, a release checklist, or a team runbook.

## Build a provider fallback habit

AI transcription workflows are rarely one-provider systems for long. Providers differ in pricing, upload limits, region support, diarization, latency, and how well they handle product vocabulary. Sapat's `--api` flag gives you a simple provider switch without changing the rest of the workflow.

A practical fallback routine looks like this:

| Scenario | Sapat option | What to check |
| --- | --- | --- |
| General English product demo | `--api lemonfox --language english` | Product names and command output |
| Domain-specific lecture | `--api lemonfox --prompt "terms..."` | Acronyms, names, and punctuation |
| Provider comparison | run the same clip with two APIs | Transcript accuracy and turnaround time |
| Subtitle export needed | set `LEMONFOX_RESPONSE_FORMAT` to `srt` or `vtt` in a separate workflow | Whether downstream tools expect SRT or VTT |

For normal Sapat text output, keep JSON as the response format. If you need SRT or VTT, treat that as a separate workflow and adjust the Sapat writing logic so it does not try to read a JSON `text` field.
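Sapat takes one `--api` value per run, so falling back means rerunning with another provider. If you later automate that routine, a wrapper could try providers in order. The sketch below is a hypothetical extension, not a Sapat feature; each provider is modeled as a function that takes an audio path and returns text.

```python
from pathlib import Path
from typing import Callable, Sequence

Transcriber = Callable[[Path], str]


def transcribe_with_fallback(
    audio: Path, providers: Sequence[tuple[str, Transcriber]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, text) from the first success."""
    errors: list[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # a real tool would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the text keeps a record of which route produced each transcript, which helps when comparing providers later.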

## Troubleshooting

**Problem:** `LEMONFOX_API_KEY is required for Lemonfox transcription.`

**Solution:** Check that `.env` exists in the project root and that you launched the command from that directory. The provider loads `.env` before reading `LEMONFOX_API_KEY`.

**Problem:** The provider returns an authentication error.

**Solution:** Regenerate the API key in Lemonfox and update `.env`. Do not paste the key into GitHub comments or PR descriptions.

**Problem:** The transcript misses product names.

**Solution:** Use `--prompt` with a small vocabulary list. Lemonfox documents prompt usage for terms that the model might otherwise miss, such as acronyms or product names.

**Problem:** The upload fails for a long recording.

**Solution:** Start with a shorter clip to confirm the workflow. Lemonfox documents a 100 MB limit for direct file uploads, which applies to the MP3 that Sapat uploads rather than to the original video. For longer recordings, split the audio first or use a URL-based workflow.
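A pre-flight check can catch an oversized upload before it reaches the API. This sketch assumes the 100 MB limit means 100,000,000 bytes (the stricter reading) and uses FFmpeg's segment muxer to cut audio into fixed-length chunks without re-encoding. The helper names are hypothetical, and each chunk would still need its own transcription request.

```python
from pathlib import Path

# Assumption: treat "100 MB" as 100,000,000 bytes, the stricter reading.
DIRECT_UPLOAD_LIMIT_BYTES = 100 * 1000 * 1000


def fits_direct_upload(audio: Path, limit: int = DIRECT_UPLOAD_LIMIT_BYTES) -> bool:
    """True if the file that would be uploaded is within the direct-upload limit."""
    return audio.stat().st_size <= limit


def ffmpeg_split_command(audio: Path, segment_seconds: int = 600) -> list[str]:
    """FFmpeg segment-muxer arguments that copy audio into fixed-length chunks."""
    pattern = audio.with_name(f"{audio.stem}_%03d{audio.suffix}")
    return [
        "ffmpeg", "-i", str(audio),
        "-f", "segment", "-segment_time", str(segment_seconds),
        "-c", "copy", str(pattern),
    ]
```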

**Problem:** `--correct` fails with Lemonfox.

**Solution:** Do not use `--correct` with `--api lemonfox`. Lemonfox Speech-to-Text handles transcription, but the Sapat correction path expects a chat model implementation. Run correction as a separate step if your team needs it.

## Conclusion

You now have a reproducible Sapat transcription workflow in Daytona that can use Lemonfox as a provider. The change is intentionally small: one provider class, one CLI route, environment-driven configuration, and mocked tests that prove the request shape without sending real audio.

That smallness is the point. A transcription workflow should be easy to rerun, easy to inspect, and easy to switch when provider requirements change.

Daytona gives you the workspace boundary, Sapat gives you the provider abstraction, and Lemonfox gives you another Speech-to-Text route for cost-aware or fallback transcription jobs.

## References

- [Lemonfox Speech-to-Text API documentation](https://www.lemonfox.ai/apis/speech-to-text)
- [Sapat repository](https://github.com/nkkko/sapat)
- [Companion Lemonfox provider PR](https://github.com/nibzard/sapat/pull/23)
- [Daytona documentation](https://www.daytona.io/docs/)