diff --git a/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md b/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md
new file mode 100644
index 00000000..4ba723fe
--- /dev/null
+++ b/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md
@@ -0,0 +1,439 @@
+---
+title: "Building an AI-Powered Video Transcription Tool with Sapat and Daytona"
+description: "A comprehensive guide to building and running a multi-provider AI transcription tool using Sapat (OpenAI Whisper, Groq, and Azure OpenAI) inside a reproducible Daytona workspace."
+date: 2026-05-20
+author: "Kim Jin"
+tags: ["ai", "transcription", "whisper", "openai", "groq", "azure", "daytona", "python", "devcontainer"]
+---
+
+# Building an AI-Powered Video Transcription Tool with Sapat and Daytona
+
+Transcribing audio and video content manually is time-consuming and error-prone. Modern AI-powered speech recognition APIs, including OpenAI Whisper, Groq's low-latency inference, and Azure OpenAI, have made automated transcription fast, accurate, and affordable. Setting up a consistent, reproducible development environment for these tools can still be a challenge.
+
+This guide walks you through building and running [Sapat](https://github.com/nibzard/sapat), a Python-based multi-provider video transcription tool, inside a [Daytona](https://www.daytona.io/) workspace. By the end, you will have a fully configured, portable development environment that can transcribe video files with a single command, using any of three major AI providers.
+
+## TL;DR
+
+- What Sapat is and how its multi-provider architecture works
+- Setting up a Daytona workspace with a pre-configured devcontainer
+- Configuring API credentials for OpenAI, Groq, and Azure OpenAI
+- Transcribing video files and entire directories with Sapat
+- Comparing the three providers: speed, cost, and accuracy
+- Extending Sapat with additional transcription providers
+
+## Prerequisites
+
+To follow this guide, you will need:
+
+- A basic understanding of [Python](../definitions/20240820_defintion_python.md) and command-line tools
+- A [Daytona](https://www.daytona.io/docs/installation/installation/) installation (latest version)
+- [Docker](https://www.docker.com/) installed and running
+- An API key for at least one of the following services:
+  - [OpenAI](https://platform.openai.com/) (for the Whisper API)
+  - [Groq Cloud](https://console.groq.com/) (for Whisper Large v3 Turbo; free tier available)
+  - [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) (for Azure-hosted Whisper)
+- A video file to transcribe (`.mp4`, `.mov`, `.avi`, or similar)
+
+> **Note:** Groq offers a free tier with fast inference, making it the easiest provider to get started with at no cost.
+
+## Understanding Sapat's Architecture
+
+Before diving into setup, it helps to understand how Sapat is structured. The tool is built around a provider abstraction: a `TranscriptionBase` class defines the interface, and each provider (OpenAI, Groq, Azure) implements it independently.
+
+```
+src/sapat/
+├── script.py          # CLI entry point (Click-based)
+└── transcription/
+    ├── base.py        # Abstract base class
+    ├── openai.py      # OpenAI Whisper implementation
+    ├── groq.py        # Groq Cloud implementation
+    └── azure.py       # Azure OpenAI implementation
+```
+
+The `script.py` CLI accepts a video file or directory, converts it to MP3 using `ffmpeg`, sends it to the selected provider's API, and saves the transcription as a `.txt` file alongside the source video. Temporary MP3 files are cleaned up automatically.
+
+This architecture makes it straightforward to add new providers, which is covered at the end of this guide.
+
+## Setting Up the Project Template
+
+### Step 1: Clone the Sapat Repository
+
+Start by cloning the Sapat repository to your local machine:
+
+```bash
+git clone https://github.com/nibzard/sapat.git
+cd sapat
+```
+
+### Step 2: Review the Existing devcontainer Configuration
+
+Sapat already includes a `.devcontainer/devcontainer.json` file. Open it to see what it configures:
+
+```json
+{
+  "name": "Video Transcription Tool",
+  "image": "mcr.microsoft.com/devcontainers/python:3.12",
+  "customizations": {
+    "vscode": {
+      "settings": {
+        "python.defaultInterpreterPath": "/usr/local/bin/python",
+        "python.linting.enabled": true,
+        "python.linting.pylintEnabled": true
+      },
+      "extensions": [
+        "ms-python.python",
+        "ms-python.vscode-pylance",
+        "njpwerner.autodocstring"
+      ]
+    }
+  },
+  "postCreateCommand": {
+    "ffmpeg": "sudo apt install ffmpeg -y",
+    "requirements": "pip install -r requirements.txt"
+  }
+}
+```
+
+This configuration:
+- Uses a Python 3.12 devcontainer image
+- Installs `ffmpeg` automatically (required for video-to-audio conversion)
+- Installs all Python dependencies from `requirements.txt`
+- Configures VS Code with Python linting and autocomplete
+
+### Step 3: Configure Your API Credentials
+
+Create a `.env` file in the project root by copying the provided example:
+
+```bash
+cp .env.example .env
+```
+
+Now edit `.env`
and fill in your credentials. You only need to configure the provider(s) you plan to use:
+
+```bash
+# ── OpenAI ──────────────────────────────────────────────────────────────────
+OPENAI_API_KEY=sk-your-openai-api-key-here
+OPENAI_MODEL=whisper-1
+OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
+OPENAI_MODEL_NAME_CHAT=gpt-4o
+
+# ── Groq Cloud ───────────────────────────────────────────────────────────────
+GROQCLOUD_API_KEY=gsk_your-groq-api-key-here
+GROQCLOUD_MODEL=whisper-large-v3-turbo
+GROQCLOUD_API_ENDPOINT=https://api.groq.com/openai/v1/audio/transcriptions
+GROQCLOUD_MODEL_NAME_CHAT=llama3-8b-8192
+
+# ── Azure OpenAI ─────────────────────────────────────────────────────────────
+AZURE_OPENAI_API_KEY=your-azure-api-key-here
+AZURE_OPENAI_ENDPOINT=https://YOUR-DEPLOYMENT.openai.azure.com
+AZURE_OPENAI_DEPLOYMENT_NAME_WHISPER=whisper
+AZURE_OPENAI_API_VERSION_WHISPER=2024-06-01
+AZURE_OPENAI_DEPLOYMENT_NAME_CHAT=gpt-4o
+AZURE_OPENAI_API_VERSION_CHAT=2023-03-15-preview
+```
+
+> **Important:** Never commit your `.env` file to version control. The `.gitignore` in the Sapat repository already excludes it, but double-check before pushing any changes.
+
+## Creating a Daytona Workspace
+
+With the project template ready, you can now open it as a Daytona workspace. Daytona will automatically detect the `.devcontainer/devcontainer.json` configuration and provision a fully isolated, reproducible environment.
+
+### Step 4: Create the Workspace
+
+Run the following command to create a new Daytona workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nibzard/sapat
+```
+
+Daytona will:
+1. Clone the repository into a new workspace
+2. Build the devcontainer image (Python 3.12 + ffmpeg + dependencies)
+3. 
Open the workspace in your configured IDE (VS Code by default)
+
+> **Tip:** If you have already cloned the repository locally, you can also create a workspace from a local path: `daytona create /path/to/sapat`
+
+### Step 5: Verify the Environment
+
+Once the workspace is open, verify that all dependencies are installed correctly by opening the integrated terminal and running:
+
+```bash
+# Verify ffmpeg is available
+ffmpeg -version | head -1
+
+# Verify Python packages
+pip show openai groq python-dotenv click
+```
+
+You should see version information for both `ffmpeg` and the Python packages. If anything is missing, the `postCreateCommand` in `devcontainer.json` can be re-run manually:
+
+```bash
+sudo apt install ffmpeg -y && pip install -r requirements.txt
+```
+
+Because `.env` is excluded from version control, a workspace created from the GitHub URL does not contain the file you created in Step 3. Recreate it inside the workspace (`cp .env.example .env`, then add your keys) before running any transcription.
+
+### Step 6: Install Sapat as a Package
+
+Install Sapat in development mode so the `sapat` CLI command is available:
+
+```bash
+pip install -e .
+```
+
+Verify the installation:
+
+```bash
+sapat --help
+```
+
+You should see output similar to:
+
+```
+Usage: sapat [OPTIONS] INPUT_PATH
+
+  Transcribe video files using AI transcription services.
+
+Options:
+  --provider [openai|groq|azure]  Transcription provider to use.
+  --temperature FLOAT             Temperature for transcription.
+  --response-format TEXT          Response format (json or text).
+  --help                          Show this message and exit.
+```
+
+## Transcribing Your First Video
+
+### Step 7: Transcribe a Single File
+
+Place a video file in your workspace (e.g., `demo.mp4`) and run:
+
+```bash
+# Using Groq (fastest, free tier available)
+sapat demo.mp4 --provider groq
+
+# Using OpenAI Whisper
+sapat demo.mp4 --provider openai
+
+# Using Azure OpenAI
+sapat demo.mp4 --provider azure
+```
+
+Sapat will:
+1. Convert `demo.mp4` to a temporary `demo.mp3` using `ffmpeg`
+2. Send the audio to the selected provider's API
+3. Save the transcription as `demo.txt` in the same directory
+4. 
Delete the temporary `demo.mp3`
+
+### Step 8: Transcribe an Entire Directory
+
+If you have a folder of video files to process in bulk, pass the directory path instead:
+
+```bash
+sapat ./videos/ --provider groq
+```
+
+Sapat will process every video file in the directory sequentially, creating a corresponding `.txt` transcription file for each one.
+
+### Step 9: Customize Temperature and Response Format
+
+The `--temperature` and `--response-format` options control how the transcription is produced and returned:
+
+```bash
+# Get verbose JSON output with timestamps
+sapat demo.mp4 --provider openai --response-format verbose_json
+
+# Lower temperature for more deterministic output
+sapat demo.mp4 --provider groq --temperature 0.0
+```
+
+The `verbose_json` format (documented by OpenAI, Azure OpenAI, and Groq) returns the transcript together with segment-level timestamps, which is useful for generating subtitles or syncing transcriptions with video playback. In the OpenAI API, word-level timestamps additionally require the `timestamp_granularities` parameter.
+
+## Comparing the Three Providers
+
+Each provider has different strengths depending on your use case:
+
+| Feature | OpenAI Whisper | Groq Cloud | Azure OpenAI |
+|---|---|---|---|
+| **Model** | `whisper-1` | `whisper-large-v3-turbo` | `whisper` (Azure-hosted) |
+| **Speed** | Moderate | Very fast (LPU inference) | Moderate |
+| **Free Tier** | No | Yes (rate-limited) | No |
+| **Cost** | $0.006/min | Free tier + paid | Pay-as-you-go |
+| **Max File Size** | 25 MB | 25 MB | 25 MB |
+| **Best For** | General use | Development, prototyping | Enterprise, compliance |
+| **Verbose JSON** | Yes | Yes | Yes |
+
+> **Recommendation for development:** Use Groq during development and testing, since it is the fastest of the three and has a free tier. Switch to OpenAI or Azure for production workloads that require enterprise SLAs or compliance controls.
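
The cost and file-size rows of the comparison table can be turned into a quick back-of-the-envelope check before uploading anything. The sketch below is illustrative only: the function names are hypothetical and not part of Sapat, the price and 25 MB limit are taken from the table above, billing is assumed to scale linearly with audio duration, and the 128 kbps constant bitrate is an assumption (the bitrate Sapat's `ffmpeg` conversion actually produces may differ).

```python
import math

# Values taken from the comparison table above.
OPENAI_PRICE_PER_MINUTE_USD = 0.006
MAX_UPLOAD_MB = 25


def estimate_openai_cost(duration_seconds: float) -> float:
    """Estimated cost in USD, assuming billing is linear in audio duration."""
    return round(duration_seconds / 60 * OPENAI_PRICE_PER_MINUTE_USD, 4)


def estimate_mp3_size_mb(duration_seconds: float, bitrate_kbps: int = 128) -> float:
    """Approximate MP3 size in MB (1 MB = 1024 * 1024 bytes) at a constant bitrate.

    The 128 kbps default is an assumption, not a value read from Sapat.
    """
    size_bytes = duration_seconds * bitrate_kbps * 1000 / 8
    return size_bytes / (1024 * 1024)


def chunks_needed(duration_seconds: float, bitrate_kbps: int = 128) -> int:
    """Number of pieces the audio must be split into to stay under the upload limit."""
    size_mb = estimate_mp3_size_mb(duration_seconds, bitrate_kbps)
    return max(1, math.ceil(size_mb / MAX_UPLOAD_MB))


if __name__ == "__main__":
    one_hour = 60 * 60
    print(f"cost:   ${estimate_openai_cost(one_hour):.2f}")
    print(f"size:   {estimate_mp3_size_mb(one_hour):.1f} MB")
    print(f"chunks: {chunks_needed(one_hour)}")
```

Under these assumptions, a one-hour recording costs about $0.36 with OpenAI and produces roughly 55 MB of audio, so it would need to be split into three parts to fit under the 25 MB limit. A lower bitrate reduces the number of parts at the expense of audio quality.
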
+
+## Understanding the Code: How Sapat Works Internally
+
+To extend Sapat with new providers, it helps to understand the base class:
+
+```python
+# src/sapat/transcription/base.py (simplified)
+from abc import ABC, abstractmethod
+
+class TranscriptionBase(ABC):
+    max_file_size_mb: int = 25
+
+    def _validate_audio_file(self, audio_file: str):
+        """Checks file exists and is under the size limit."""
+        ...
+
+    @abstractmethod
+    def transcribe_audio(self, audio_file: str, **kwargs) -> dict | str:
+        """Transcribes an audio file and returns the result."""
+        ...
+```
+
+Each provider subclass implements `transcribe_audio` by calling its respective API. The Groq implementation, for example, uses the official `groq` Python SDK:
+
+```python
+# src/sapat/transcription/groq.py (simplified)
+from groq import Groq
+
+class GroqCloudTranscription(TranscriptionBase):
+    def transcribe_audio(self, audio_file: str, **kwargs):
+        self._validate_audio_file(audio_file)
+        client = Groq(api_key=self.api_key)
+        with open(audio_file, "rb") as f:
+            response = client.audio.transcriptions.create(
+                file=f,
+                model=self.model,
+                temperature=self.temperature,
+                response_format=self.response_format,
+            )
+        return response
+```
+
+## Extending Sapat: Adding a New Provider
+
+The bounty for this issue specifically asks contributors to add support for additional APIs.
Here is how to add a new provider, using [AssemblyAI](https://www.assemblyai.com/) as an example:
+
+### Step 10: Create the Provider Class
+
+Create a new file `src/sapat/transcription/assemblyai.py`:
+
+```python
+import os
+import time
+
+import requests
+from dotenv import load_dotenv
+
+from .base import TranscriptionBase
+
+load_dotenv(".env")
+
+
+class AssemblyAITranscription(TranscriptionBase):
+    """AssemblyAI implementation for transcription."""
+
+    def __init__(self, temperature: float = 0.0, response_format: str = "json"):
+        self.api_key = os.getenv("ASSEMBLYAI_API_KEY")
+        self.endpoint = "https://api.assemblyai.com/v2"
+        # Stored for CLI compatibility; this class does not send them to the API.
+        self.temperature = temperature
+        self.response_format = response_format
+        self.max_file_size_mb = 100  # AssemblyAI supports larger files
+
+    def transcribe_audio(self, audio_file: str, **kwargs) -> dict:
+        self._validate_audio_file(audio_file)
+        headers = {"authorization": self.api_key}
+
+        # Step 1: Upload the audio file
+        with open(audio_file, "rb") as f:
+            upload_response = requests.post(
+                f"{self.endpoint}/upload",
+                headers=headers,
+                data=f,
+                timeout=300,
+            )
+        upload_response.raise_for_status()  # Fail early on HTTP errors
+        upload_url = upload_response.json()["upload_url"]
+
+        # Step 2: Request transcription
+        transcript_response = requests.post(
+            f"{self.endpoint}/transcript",
+            headers=headers,
+            json={"audio_url": upload_url},
+            timeout=30,
+        )
+        transcript_response.raise_for_status()
+        transcript_id = transcript_response.json()["id"]
+
+        # Step 3: Poll every 2 seconds until the job completes or fails
+        while True:
+            poll_response = requests.get(
+                f"{self.endpoint}/transcript/{transcript_id}",
+                headers=headers,
+                timeout=30,
+            )
+            poll_response.raise_for_status()
+            result = poll_response.json()
+            if result["status"] == "completed":
+                return result
+            if result["status"] == "error":
+                raise RuntimeError(f"AssemblyAI transcription failed: {result['error']}")
+            time.sleep(2)
+```
+
+### Step 11: Register the Provider in the CLI
+
+Update `src/sapat/script.py` to include the new provider in the `--provider` option:
+
+```python
+# Add to the provider choices
+@click.option(
+    "--provider",
+    type=click.Choice(["openai", "groq", "azure", "assemblyai"]),
+    default="openai",
+)
+```
+
+And add the import and instantiation logic:
+
+```python
+elif provider == "assemblyai":
+    from sapat.transcription.assemblyai import AssemblyAITranscription
+    transcriber = AssemblyAITranscription(temperature=temperature, response_format=response_format)
+```
+
+### Step 12: Add the Environment Variable
+
+Add the new API key to `.env.example`:
+
+```bash
+# AssemblyAI
+ASSEMBLYAI_API_KEY=your-assemblyai-api-key-here
+```
+
+The provider class above calls the REST API through `requests` rather than the AssemblyAI SDK, so make sure `requests` is listed in `requirements.txt`:
+
+```
+requests
+```
+
+## Troubleshooting Common Issues
+
+**`ffmpeg: command not found`**
+The `postCreateCommand` in `devcontainer.json` installs ffmpeg automatically. If it is missing, run `sudo apt update && sudo apt install ffmpeg -y` manually inside the workspace terminal.
+
+**`File size exceeds maximum limit`**
+All three providers have a 25 MB audio limit. For longer videos, split them first using ffmpeg:
+```bash
+ffmpeg -i long_video.mp4 -t 600 -c copy part1.mp4  # First 10 minutes
+```
+
+**`API key not found` errors**
+Ensure your `.env` file is in the project root (the same directory as `requirements.txt`), not inside `src/`. Sapat uses `load_dotenv(".env")`, which looks for the file relative to the current working directory, so run `sapat` from the project root.
+
+**Groq rate limit errors**
+Groq's free tier has rate limits. If you hit them, add a short delay between files when processing directories, or upgrade to a paid Groq plan.
+
+## Conclusion
+
+In this guide, you have:
+
+- Set up a reproducible Daytona workspace for the Sapat transcription tool
+- Configured API credentials for OpenAI Whisper, Groq Cloud, and Azure OpenAI
+- Transcribed single video files and entire directories using the `sapat` CLI
+- Compared the three providers across speed, cost, and features
+- Extended Sapat with a new provider (AssemblyAI) following the existing architecture
+
+Daytona's devcontainer integration makes it easy to share this environment with teammates or reproduce it on any machine, with no manual dependency installation required.
The combination of Sapat's clean provider abstraction and Daytona's reproducible workspaces creates a solid foundation for building production-grade transcription pipelines.
+
+## References
+
+- [Sapat GitHub Repository](https://github.com/nibzard/sapat)
+- [Daytona Documentation](https://www.daytona.io/docs)
+- [OpenAI Whisper API Reference](https://platform.openai.com/docs/api-reference/audio)
+- [Groq Cloud API Documentation](https://console.groq.com/docs/speech-text)
+- [Azure OpenAI Whisper Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/whisper-quickstart)
+- [AssemblyAI API Documentation](https://www.assemblyai.com/docs)