190 changes: 190 additions & 0 deletions articles/20260525_run_aws_transcribe_with_sapat_in_daytona.md
@@ -0,0 +1,190 @@
---
title: 'Run AWS Transcribe With Sapat in Daytona'
description:
'Build a Daytona workspace for AWS-backed Sapat transcription with S3 input,
tests, and cleanup guardrails.'
date: 2026-05-25
author: 'Kleenex Ultrasoft'
tags: ['Daytona', 'Sapat', 'AWS', 'Amazon Transcribe', 'Transcription']
---

# Run AWS Transcribe With Sapat in Daytona

Sapat is a small command-line transcription tool that turns audio or video files into text. It already supports API-backed transcription providers, and the companion implementation in
[Sapat pull request #46](https://github.com/nibzard/sapat/pull/46) adds an [Amazon Transcribe](/definitions/20260525_definition_amazon_transcribe.md) provider for teams that already run workloads on
[AWS](/definitions/20240904_definition_aws.md).

This guide shows how to test that provider from a [Daytona workspace](/definitions/20240819_definition_daytona%20workspace.md). The workflow keeps credentials in environment variables, uploads converted
media to S3, starts an Amazon Transcribe batch job, polls for the result, and returns a plain text transcript without changing the local machine setup.

## TL;DR

- Use Daytona to create a reproducible Sapat workspace for AWS transcription work.
- Configure S3 and Amazon Transcribe with environment variables, not hardcoded credentials.
- Run Sapat with `--api aws` to upload media, wait for the batch job, and write transcript text.
- Validate the provider with mocked unit tests before spending money on live AWS jobs.

## Why Use AWS Transcribe From a Daytona Workspace

Amazon Transcribe is useful when audio processing already belongs near other AWS services. The batch API reads media from S3, creates a transcription job, and returns a JSON transcript when the job completes.
That fits long-form files better than a local-only workflow because the heavy recognition step runs on AWS infrastructure.

Daytona adds a clean workspace boundary around that provider work. Instead of asking every contributor to install Python packages, `ffmpeg`, AWS tooling, and local project dependencies by hand, the repository can be opened in a workspace and tested with the same commands.

The result is a practical development path:

| Layer | Role in the workflow |
| --- | --- |
| Daytona | Provides a repeatable development workspace for Sapat changes. |
| Sapat | Converts the source file, routes to the selected transcription provider, and writes text output. |
| S3 | Stores the media file that Amazon Transcribe reads. |
| Amazon Transcribe | Runs the batch speech-to-text job and returns transcript JSON. |

## Architecture

The AWS provider uses S3 as the handoff point between Sapat and Amazon Transcribe. Sapat uploads the media file, starts the job, waits until the job is complete, downloads the transcript JSON, and extracts the first transcript string.

![AWS Transcribe workflow for Sapat in Daytona](/assets/20260525_run_aws_transcribe_with_sapat_in_daytona_img1.png)

There are two supported output modes. If `AWS_TRANSCRIBE_OUTPUT_BUCKET` is set, Amazon Transcribe writes the transcript JSON into that bucket and Sapat reads it back with S3. If no output bucket is configured, Sapat uses the service-managed `TranscriptFileUri` returned by the completed job.
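
The two output modes can be sketched as a small request builder. This is a minimal illustration, not the companion provider's code: the function name `build_start_job_request` and the `<prefix><job name>.json` key layout are assumptions, while the keyword names match the parameters of boto3's `start_transcription_job`.

```python
from typing import Any, Dict, Optional


def build_start_job_request(
    job_name: str,
    media_uri: str,
    language_code: str,
    output_bucket: Optional[str] = None,
    output_prefix: str = "",
) -> Dict[str, Any]:
    """Build keyword arguments for a start_transcription_job call.

    Hypothetical helper for illustration; the real provider may differ.
    """
    request: Dict[str, Any] = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},  # e.g. s3://bucket/key
        "LanguageCode": language_code,
    }
    if output_bucket:
        # Customer-managed mode: Transcribe writes the JSON into this bucket.
        request["OutputBucketName"] = output_bucket
        # Assumed key layout: <prefix><job name>.json
        request["OutputKey"] = f"{output_prefix}{job_name}.json"
    # Service-managed mode: no output fields are sent, and the completed job
    # exposes the transcript through Transcript.TranscriptFileUri.
    return request
```

Keeping the request as plain data makes both modes easy to assert on in unit tests before any AWS client is involved.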

## Prerequisites

Before running a live transcription, prepare these pieces:

- A Daytona installation that can create a workspace from a GitHub repository.
- An AWS account with an S3 bucket in the region you want to use.
- AWS credentials available through the normal AWS SDK chain, such as environment variables, a shared credentials file, or an instance role.
- Permission to call `transcribe:StartTranscriptionJob` and `transcribe:GetTranscriptionJob`.
- Permission to upload, read, and delete objects in the S3 bucket used by Sapat.
- A media file in a format accepted by the provider, such as `mp3`, `mp4`, `wav`, `flac`, `ogg`, `webm`, or `amr`.

Use the smallest realistic audio sample for the first live run. Unit tests can validate the integration shape without creating AWS jobs, so there is no need to burn cloud spend while checking basic routing and error handling.
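
A quick local check of the file extension can catch an unsupported sample before anything is uploaded. The sketch below assumes only the formats named in the prerequisites; the helper name `media_format` is hypothetical, and the provider's real list may differ.

```python
from pathlib import Path

# Formats named in this guide; the provider's actual list may differ.
ACCEPTED_FORMATS = {"mp3", "mp4", "wav", "flac", "ogg", "webm", "amr"}


def media_format(path: str) -> str:
    """Return the lowercase extension, or raise if it is not in the list."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in ACCEPTED_FORMATS:
        raise ValueError(f"Unsupported media format: {ext or '(none)'}")
    return ext
```

For example, `media_format("demo.MP4")` returns `"mp4"`, while a `.txt` file raises `ValueError` without touching AWS.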

## Prepare the Workspace

Create a Daytona workspace from the Sapat repository:

```bash
daytona create https://github.com/nibzard/sapat --code
```

Until the AWS provider is merged upstream, fetch the companion branch from the implementation fork:

```bash
git remote add kleenex https://github.com/Kleenex-ultrasoft/sapat.git
git fetch kleenex feat/aws-transcribe-provider
git checkout feat/aws-transcribe-provider
```

Then create a virtual environment and install Sapat in editable mode:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
```

Confirm that the CLI exposes the AWS provider:

```bash
sapat --help
```

The provider list should include `aws` next to the existing transcription backends.

## Configure AWS

Create a local `.env` file in the Sapat workspace. Do not commit this file.

```bash
AWS_TRANSCRIBE_REGION=us-east-1
AWS_TRANSCRIBE_S3_BUCKET=your-sapat-input-bucket
AWS_TRANSCRIBE_S3_PREFIX=sapat/
AWS_TRANSCRIBE_LANGUAGE_CODE=en-US

# Optional. Use this if you want transcript JSON written to your own bucket.
AWS_TRANSCRIBE_OUTPUT_BUCKET=your-sapat-output-bucket
AWS_TRANSCRIBE_OUTPUT_PREFIX=sapat-transcripts/

# Optional tuning.
AWS_TRANSCRIBE_POLL_INTERVAL=5
AWS_TRANSCRIBE_TIMEOUT_SECONDS=900
AWS_TRANSCRIBE_DELETE_MEDIA=true
```
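
To make the role of each variable concrete, here is a minimal sketch of how a provider might read them. The variable names and the `must be set` message come from this guide; the `AwsTranscribeConfig` class, the `load_config` function, and every default value other than media deletion are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AwsTranscribeConfig:
    region: str
    bucket: str
    prefix: str
    language_code: str
    output_bucket: Optional[str]
    output_prefix: str
    poll_interval: float
    timeout_seconds: float
    delete_media: bool


def load_config(env: Mapping[str, str] = os.environ) -> AwsTranscribeConfig:
    """Read AWS_TRANSCRIBE_* settings; defaults here are illustrative only."""
    bucket = env.get("AWS_TRANSCRIBE_S3_BUCKET", "").strip()
    if not bucket:
        raise ValueError("AWS_TRANSCRIBE_S3_BUCKET must be set")
    delete_flag = env.get("AWS_TRANSCRIBE_DELETE_MEDIA", "true").strip().lower()
    return AwsTranscribeConfig(
        region=env.get("AWS_TRANSCRIBE_REGION", "us-east-1"),
        bucket=bucket,
        prefix=env.get("AWS_TRANSCRIBE_S3_PREFIX", ""),
        language_code=env.get("AWS_TRANSCRIBE_LANGUAGE_CODE", "en-US"),
        # An empty or missing value selects the service-managed output mode.
        output_bucket=env.get("AWS_TRANSCRIBE_OUTPUT_BUCKET") or None,
        output_prefix=env.get("AWS_TRANSCRIBE_OUTPUT_PREFIX", ""),
        poll_interval=float(env.get("AWS_TRANSCRIBE_POLL_INTERVAL", "5")),
        timeout_seconds=float(env.get("AWS_TRANSCRIBE_TIMEOUT_SECONDS", "900")),
        delete_media=delete_flag in {"1", "true", "yes"},
    )
```

Accepting a mapping instead of reading `os.environ` directly lets tests pass a plain dictionary.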

The AWS SDK also needs credentials. For a local workstation or a Daytona workspace, the simplest path is usually to export them as environment variables. `AWS_SESSION_TOKEN` is only needed for temporary credentials:

```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
```

For production-like workspaces, prefer short-lived credentials or an IAM role over long-lived access keys. The Sapat provider only needs a narrow permission set for this workflow:

| Service | Example actions |
| --- | --- |
| Amazon Transcribe | `transcribe:StartTranscriptionJob`, `transcribe:GetTranscriptionJob` |
| S3 input bucket | `s3:PutObject`, `s3:GetObject`, `s3:DeleteObject` |
| S3 output bucket | `s3:GetObject` if `AWS_TRANSCRIBE_OUTPUT_BUCKET` is used |
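
The table can be turned into an IAM policy document. The sketch below generates one from the actions listed above; the function name `build_policy`, the statement IDs, and the choice to scope the input bucket to a prefix are assumptions, and your account may need additional statements.

```python
import json
from typing import Any, Dict, Optional


def build_policy(
    input_bucket: str,
    prefix: str,
    output_bucket: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an example IAM policy from the actions in the table above."""
    statements = [
        {
            "Sid": "TranscribeJobs",
            "Effect": "Allow",
            "Action": [
                "transcribe:StartTranscriptionJob",
                "transcribe:GetTranscriptionJob",
            ],
            "Resource": "*",
        },
        {
            "Sid": "InputMedia",
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            # Scoped to the Sapat prefix rather than the whole bucket.
            "Resource": f"arn:aws:s3:::{input_bucket}/{prefix}*",
        },
    ]
    if output_bucket:
        statements.append(
            {
                "Sid": "OutputTranscripts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{output_bucket}/*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_policy("your-sapat-input-bucket", "sapat/"), indent=2))
```

Review the generated JSON before attaching it to a role. It mirrors the table rather than a verified least-privilege policy.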

## Run a Smoke Test

Run Sapat with the AWS provider and an explicit language:

```bash
sapat demo.mp4 --api aws --quality H --language en-US
```

With the companion provider, Sapat converts the input with the existing audio pipeline, uploads the resulting media to the configured S3 bucket, starts the Amazon Transcribe job, polls until completion, reads the transcript JSON, and writes the transcript text to the local output file.
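
The polling step is the part most worth understanding before a live run. The sketch below is an illustration under stated assumptions, not the provider's code: `wait_for_job` and `TranscriptionFailed` are hypothetical names, and `get_job` stands in for a call such as boto3's `get_transcription_job`, whose response carries `TranscriptionJob.TranscriptionJobStatus` and, on failure, `FailureReason`.

```python
import time
from typing import Any, Callable, Dict


class TranscriptionFailed(RuntimeError):
    """Raised when the job reports FAILED (hypothetical exception type)."""


def wait_for_job(
    get_job: Callable[[], Dict[str, Any]],
    poll_interval: float = 5.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the job is COMPLETED, FAILED, or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        job = get_job()["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise TranscriptionFailed(job.get("FailureReason", "unknown failure"))
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds}s")
        # QUEUED or IN_PROGRESS: wait before asking again.
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting, which is useful for the timeout and failure cases.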

If the live job fails, read the Amazon Transcribe failure reason first. Most first-run problems are region mismatches, missing S3 permissions, unsupported media, or a bucket policy that prevents Amazon Transcribe from reading the uploaded object.

## Validate Without Cloud Spend

The companion implementation includes mocked tests for the AWS provider. These tests check the CLI routing, required environment variables, S3 upload and cleanup behavior, service-managed transcript URLs, configured output bucket reads, failed jobs, and timeout handling.

Run them before a live AWS smoke test:

```bash
python -m unittest tests.test_aws -v
python -m compileall src tests
sapat --help
```

Because the tests mock the AWS clients, these commands do not need a real AWS account. They are a quick way to verify that the provider branch still works after dependency or CLI changes.
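
For readers new to this style of testing, the following sketch shows the general shape of a mocked test. It is not taken from the companion test suite: `start_job` is a hypothetical helper, and the client is a `MagicMock`, so no request leaves the machine.

```python
import unittest
from unittest.mock import MagicMock


def start_job(transcribe_client, job_name, media_uri, language_code):
    """Hypothetical helper under test; not Sapat's actual function."""
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        LanguageCode=language_code,
    )


class StartJobTest(unittest.TestCase):
    def test_starts_job_with_s3_uri(self):
        # The mock records calls instead of contacting AWS.
        client = MagicMock()
        start_job(client, "job-1", "s3://bucket/sapat/demo.mp3", "en-US")
        client.start_transcription_job.assert_called_once_with(
            TranscriptionJobName="job-1",
            Media={"MediaFileUri": "s3://bucket/sapat/demo.mp3"},
            LanguageCode="en-US",
        )
```

Saved as a test module, a case like this runs through `python -m unittest` in the same way as the commands above.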

## Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `AWS_TRANSCRIBE_S3_BUCKET must be set` | The input bucket variable is missing. | Add `AWS_TRANSCRIBE_S3_BUCKET` to `.env`. |
| `AccessDenied` from S3 | Credentials cannot write, read, or delete the target object. | Check bucket policy and IAM permissions for the exact prefix. |
| `AccessDeniedException` from Transcribe | Credentials cannot start or inspect jobs. | Add the required Amazon Transcribe actions to the IAM policy. |
| Job reaches `FAILED` | Transcribe rejected the media file or could not access it. | Check the failure reason, media format, region, and bucket access. |
| Timeout while polling | The job did not complete inside the configured timeout. | Increase `AWS_TRANSCRIBE_TIMEOUT_SECONDS` or test with a shorter media file. |
| No transcript text | The transcript JSON has no `results.transcripts` entry. | Inspect the raw job result and retry with a supported language code. |
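
The last row refers to the shape of the transcript JSON. A minimal sketch of the extraction step follows, assuming the standard Amazon Transcribe output layout in which `results.transcripts` is a list of objects with a `transcript` string; the function name `extract_transcript_text` is hypothetical.

```python
from typing import Any, Dict


def extract_transcript_text(payload: Dict[str, Any]) -> str:
    """Return the first transcript string from a Transcribe result payload."""
    transcripts = payload.get("results", {}).get("transcripts", [])
    if not transcripts:
        # Matches the "No transcript text" symptom in the table above.
        raise ValueError("Transcript JSON has no results.transcripts entry")
    return transcripts[0].get("transcript", "")
```

Running this against a saved copy of the raw job result is a quick way to tell an empty transcript apart from a malformed payload.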

## Security and Cleanup Notes

Keep `.env`, AWS keys, and transcript output outside commits. Transcripts can contain private user data, so treat generated `.txt` and JSON files as sensitive unless the source audio is public.

By default, the provider deletes the uploaded media object after the job finishes or fails. Set `AWS_TRANSCRIBE_DELETE_MEDIA=false` only when you need to debug the uploaded object, and delete the object manually after the investigation.
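
Deleting the object whether the job finishes or fails is naturally expressed with `try`/`finally`. The sketch below assumes an S3 client that exposes boto3's `delete_object(Bucket=..., Key=...)`; the wrapper name `run_with_cleanup` is hypothetical and does not describe the provider's internals.

```python
from typing import Any, Callable


def run_with_cleanup(
    s3_client: Any,
    bucket: str,
    key: str,
    job: Callable[[], Any],
    delete_media: bool = True,
) -> Any:
    """Run the transcription step, then remove the uploaded media object."""
    try:
        return job()
    finally:
        # Runs on success and on any exception raised by job().
        if delete_media:
            s3_client.delete_object(Bucket=bucket, Key=key)
```

With `delete_media=False` the object is left in place, which corresponds to the debugging case described above.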

Use separate S3 prefixes for experiments, shared demos, and production samples. Prefix separation keeps cleanup simple and makes it easier to apply tighter bucket policies later.
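
One way to keep prefixes consistent is to build every object key through a single helper. This is a sketch under assumptions: `build_object_key` is a hypothetical name, and prepending a random token to avoid collisions is an illustrative choice rather than documented provider behavior.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(prefix: str, filename: str, unique: str = "") -> str:
    """Join a prefix and file name into an S3 key with a unique token."""
    unique = unique or uuid.uuid4().hex
    clean_prefix = prefix.strip("/")
    # Keep only the file name so local directories never leak into the key.
    name = PurePosixPath(filename).name
    key = f"{unique}-{name}"
    return f"{clean_prefix}/{key}" if clean_prefix else key
```

With separate prefixes such as `sapat/experiments/` and `sapat/demos/`, a lifecycle rule or bucket policy can later target one group of objects without touching the others.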

## Conclusion

The AWS provider gives Sapat a useful backend for teams that want managed batch transcription without leaving the AWS boundary. Daytona makes that provider easier to develop and review because contributors can
reproduce the same Python setup, run mocked tests, and then perform a small live smoke test with controlled AWS credentials.

Use the companion [Sapat pull request #46](https://github.com/nibzard/sapat/pull/46) until the provider lands upstream, and keep the first live job small so the setup can be verified before larger media files are processed.

## References

- [Amazon Transcribe StartTranscriptionJob API](https://docs.aws.amazon.com/transcribe/latest/APIReference/API_StartTranscriptionJob.html)
- [Amazon Transcribe GetTranscriptionJob API](https://docs.aws.amazon.com/transcribe/latest/APIReference/API_GetTranscriptionJob.html)
- [AWS service authorization reference for Amazon Transcribe](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazontranscribe.html)
- [AWS service authorization reference for Amazon S3](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazons3.html)
- [Sapat AWS provider pull request](https://github.com/nibzard/sapat/pull/46)
3 changes: 3 additions & 0 deletions authors/kleenex_ultrasoft.md
@@ -0,0 +1,3 @@
Author: Kleenex Ultrasoft Title: Open Source Contributor Description: Kleenex Ultrasoft is an open source contributor focused on small, reviewable developer-tooling changes, API integrations, and practical
technical writing for reproducible workflows. Author Image: <https://github.com/Kleenex-ultrasoft.png> Author LinkedIn: Author Twitter: Company Name: Independent Company Description: Independent open source
contributor Company Logo Dark: Company Logo White:
18 changes: 18 additions & 0 deletions definitions/20260525_definition_amazon_transcribe.md
@@ -0,0 +1,18 @@
---
title: 'Amazon Transcribe'
description: 'An AWS speech-to-text service for creating transcripts from audio and video files.'
date: 2026-05-25
author: 'Kleenex Ultrasoft'
---

# Amazon Transcribe

## Definition

Amazon Transcribe is an AWS automatic speech recognition service that converts speech in audio or video files into text. Developers can use it for batch transcription jobs, streaming transcription, subtitles,
call analytics, and other voice processing workflows.

## Context and Usage

In a batch transcription workflow, an application places a media file in Amazon S3, starts a transcription job, waits for the job to complete, and reads the transcript output. This pattern is useful when teams
need managed speech-to-text processing without running local recognition models or custom transcription infrastructure.