This project demonstrates parallel audio transcription using OpenAI's Whisper model on Kubernetes/OpenShift, orchestrated by Kueue for workload management. It downloads audio files from the Rev.com earnings22 speech dataset and transcribes them using either CPU or GPU.
For experienced users, here's the TL;DR version:
```bash
# 1. Build and push containers
podman build -t quay.io/<user>/alpine:latest -f containers/init/Dockerfile containers/init/ && podman push quay.io/<user>/alpine:latest
podman build -t quay.io/<user>/whisper:latest -f containers/whisper/Dockerfile containers/whisper/ && podman push quay.io/<user>/whisper:latest

# 2. Create namespace and apply Kueue configs
oc create namespace sai
oc apply -f kueue/manifests/resource-flavor.yml
oc apply -f kueue/manifests/cluster-queue.yml
oc apply -f kueue/manifests/local-queue.yml

# 3. (Optional) Configure GitHub token
cp kueue/manifests/github-token-secret.yml.template kueue/manifests/github-token-secret.yml
# Edit github-token-secret.yml with your token, then:
oc apply -f kueue/manifests/github-token-secret.yml

# 4. Deploy ConfigMap and Job
oc apply -f kueue/manifests/download-script-configmap.yml
oc apply -f kueue/manifests/whisper-gpu.yml

# 5. Monitor
oc get workloads -n sai
oc get pods -n sai
```

See the detailed instructions below for more information.
```
.
├── containers/
│   ├── init/
│   │   └── Dockerfile                       # Alpine image with curl and jq for downloading files
│   └── whisper/
│       └── Dockerfile                       # Whisper transcription image with Python and ffmpeg
└── kueue/
    └── manifests/
        ├── resource-flavor.yml              # Defines CPU/GPU resource types
        ├── cluster-queue.yml                # Cluster-wide resource quota definitions
        ├── local-queue.yml                  # Namespace-scoped queue
        ├── download-script-configmap.yml    # Shell script for downloading audio files
        ├── whisper-gpu.yml                  # Main job manifest (indexed job)
        └── github-token-secret.yml.template # Template for GitHub API token
```
- OpenShift 4.19.4 cluster (or compatible Kubernetes cluster)
- Podman (for building containers)
- kubectl/oc CLI tools
- Kueue installed from kubernetes-sigs/kueue
For GPU-accelerated transcription, install the following operators on OpenShift:
- Node Feature Discovery (NFD) Operator - Detects hardware features and labels nodes
- NVIDIA GPU Operator - Manages NVIDIA GPU resources
- Instantiate the ClusterPolicy CR after installing the GPU operator
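The ClusterPolicy CR is usually instantiated from the OpenShift console, which pre-fills a full default spec. For reference, a minimal hand-written CR looks roughly like the following; the name `gpu-cluster-policy` is the console's conventional default, and your operator version may require additional spec fields:

```yaml
# Minimal ClusterPolicy sketch. The GPU Operator applies its defaults;
# depending on the operator version, more spec fields may be required.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec: {}
```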
Follow these steps in order to set up the audio transcription pipeline:
Build the init container (Alpine with curl and jq for downloading audio files):
```bash
podman build -t quay.io/<your-username>/alpine:latest -f containers/init/Dockerfile containers/init/
podman push quay.io/<your-username>/alpine:latest
```

Build the Whisper container (Python with ffmpeg and openai-whisper for transcription):

```bash
podman build -t quay.io/<your-username>/whisper:latest -f containers/whisper/Dockerfile containers/whisper/
podman push quay.io/<your-username>/whisper:latest
```

Create a dedicated namespace for your transcription workloads:

```bash
oc create namespace sai
```

Apply the Kueue configuration manifests in this order:
Step 3.1: Resource Flavor - Defines available resource types (CPU/GPU flavors):
```bash
oc apply -f kueue/manifests/resource-flavor.yml
```

Step 3.2: Cluster Queue - Defines resource quotas (6 CPUs, 8Gi memory, 2 GPUs):

```bash
oc apply -f kueue/manifests/cluster-queue.yml
```

Step 3.3: Local Queue - Namespace-scoped queue linked to the cluster queue:

```bash
oc apply -f kueue/manifests/local-queue.yml
```
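For orientation, the three objects fit together roughly like the sketch below. The object names (`default-flavor`, `cluster-queue`, `local-queue`) are illustrative assumptions, not necessarily those used in the repo's manifests; the quotas mirror the 6 CPU / 8Gi / 2 GPU figures above.

```yaml
# Illustrative sketch of how the three Kueue objects relate.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 6
      - name: "memory"
        nominalQuota: 8Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue
  namespace: sai                 # the namespace created in Step 2
spec:
  clusterQueue: cluster-queue    # links this queue to the ClusterQueue above
```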
The download script accesses GitHub's API. Without authentication, you are limited to 60 requests per hour. To avoid rate limiting:

Step 4.1: Generate a GitHub personal access token with repo scope at https://github.com/settings/tokens
Step 4.2: Create a secret file from the template:
```bash
# Copy the template
cp kueue/manifests/github-token-secret.yml.template kueue/manifests/github-token-secret.yml

# Edit the file and replace YOUR_GITHUB_TOKEN_HERE with your actual token
# IMPORTANT: Never commit github-token-secret.yml to version control!

# Apply the secret
oc apply -f kueue/manifests/github-token-secret.yml
```

Security Note: The file kueue/manifests/github-token-secret.yml is git-ignored to prevent accidentally committing your token.
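The filled-in file is an ordinary Opaque Secret. A plausible shape is sketched below; the secret name and the key name (`token`) are assumptions, so check the template for the exact ones:

```yaml
# Plausible shape of the filled-in secret; metadata.name and the
# "token" key are assumptions based on the template.
apiVersion: v1
kind: Secret
metadata:
  name: github-token
  namespace: sai
type: Opaque
stringData:
  token: YOUR_GITHUB_TOKEN_HERE   # stringData avoids manual base64-encoding
```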
The ConfigMap contains a shell script that:
- Fetches the list of MP3 files from the earnings22/media directory via the GitHub API
- Uses the `JOB_COMPLETION_INDEX` environment variable to select which file to download
- Handles pagination for large directories (100 files per page)
- Implements retry logic with exponential backoff for rate limiting
- Downloads the selected audio file to the `/data` directory using Git LFS media URLs
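To make the mechanics concrete, here is a heavily simplified sketch of such a ConfigMap. The real script in download-script-configmap.yml adds the pagination loop, backoff, and error handling; the ConfigMap name, repository URL, and variable names here are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: download-script
  namespace: sai
data:
  download-audio.sh: |
    #!/bin/sh
    # Simplified: list up to 100 MP3 files via the GitHub contents API.
    # (The real script paginates and retries with exponential backoff.)
    FILES=$(curl -s -H "Authorization: Bearer ${GITHUB_TOKEN}" \
      "https://api.github.com/repos/revdotcom/speech-datasets/contents/earnings22/media?per_page=100" \
      | jq -r '.[] | select(.name | endswith(".mp3")) | .download_url')
    # JOB_COMPLETION_INDEX is 0-based; sed line addresses are 1-based.
    URL=$(echo "$FILES" | sed -n "$((JOB_COMPLETION_INDEX + 1))p")
    # download_url resolves Git LFS pointers to the actual media content.
    curl -sSL -o "/data/$(basename "$URL")" "$URL"
```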
Apply the ConfigMap:
```bash
oc apply -f kueue/manifests/download-script-configmap.yml
```

The job manifest (kueue/manifests/whisper-gpu.yml) uses an Indexed Job pattern for parallel processing (see the sketch after the volume table below):
- Parallelism: 2 pods run concurrently
- Completions: 6 total tasks (6 different audio files to process)
- Completion Mode: Indexed - each pod gets a unique `JOB_COMPLETION_INDEX` (0-5)
Init Container (download-audio):
- Uses the Alpine image with curl and jq
- Mounts the download script from the ConfigMap at `/scripts`
- Executes `download-audio.sh`, which uses `JOB_COMPLETION_INDEX` to download a specific MP3 file
- Saves the audio file to the shared `/data` volume
- Uses the GitHub token from the secret for API authentication (avoids rate limits)
Main Container (whisper-transcriber):
- Uses the Whisper image with Python, ffmpeg, and openai-whisper
- Reads the audio file from the shared `/data` volume
- Runs Whisper transcription with the `tiny.en` model
- Outputs the transcription to `/tmp`
| Volume Name | Type | Purpose | Mount Points |
|---|---|---|---|
| `audio-data` | `emptyDir` | Shares downloaded audio between init and main containers | Init: `/data`, Main: `/data` |
| `model-cache-volume` | `emptyDir` | Caches Whisper model files to avoid re-downloading | Main: `/tmp/whisper_models` |
| `download-script` | `configMap` | Provides the download script to the init container | Init: `/scripts` (executable) |
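Putting the pieces together, the job manifest has roughly the shape below. This is a condensed sketch, not the exact file: the queue label value, secret reference, and whisper invocation are assumptions consistent with the earlier sketches.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: whisper-transcription-cpu
  namespace: sai
  labels:
    kueue.x-k8s.io/queue-name: local-queue   # hands the Job to Kueue for admission
spec:
  completions: 6            # six audio files in total
  parallelism: 2            # two pods at a time
  completionMode: Indexed   # injects JOB_COMPLETION_INDEX into each pod
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: download-audio
        image: quay.io/<your-username>/alpine:latest
        command: ["/scripts/download-audio.sh"]
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
        volumeMounts:
        - { name: audio-data, mountPath: /data }
        - { name: download-script, mountPath: /scripts }
      containers:
      - name: whisper-transcriber
        image: quay.io/<your-username>/whisper:latest
        command: ["sh", "-c",
          "whisper /data/*.mp3 --model tiny.en --model_dir /tmp/whisper_models --output_dir /tmp"]
        volumeMounts:
        - { name: audio-data, mountPath: /data }
        - { name: model-cache-volume, mountPath: /tmp/whisper_models }
      volumes:
      - name: audio-data
        emptyDir: {}
      - name: model-cache-volume
        emptyDir: {}
      - name: download-script
        configMap:
          name: download-script
          defaultMode: 0755   # make the script executable
```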
Before deploying, update the job manifest (kueue/manifests/whisper-gpu.yml) if needed:
- Change the namespace (default: `sai`)
- Update the image references to match your container registry
- Adjust the parallelism and completions values if desired
Deploy the job:
```bash
oc apply -f kueue/manifests/whisper-gpu.yml
```

Check Kueue workload status:

```bash
oc get workloads -n sai
```

View job status:

```bash
oc get jobs -n sai
oc describe job whisper-transcription-cpu -n sai
```

View running pods:

```bash
oc get pods -n sai
```

Check logs for a specific pod:
```bash
# View init container logs (download process)
oc logs <pod-name> -n sai -c download-audio

# View main container logs (transcription process)
oc logs <pod-name> -n sai -c whisper-transcriber
```

Each pod in the job receives a unique `JOB_COMPLETION_INDEX` environment variable:
- Pod 1: `JOB_COMPLETION_INDEX=0` → downloads the file at index 0
- Pod 2: `JOB_COMPLETION_INDEX=1` → downloads the file at index 1
- ...and so on
This enables parallel processing of different files without coordination between pods. With parallelism=2, two files are processed simultaneously until all 6 completions are done.
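Under the hood, the Job controller records the index as a standard annotation on each pod and populates `JOB_COMPLETION_INDEX` from it via the downward API, so the assignment is also visible on the pod itself (illustrative fragment):

```yaml
# Fragment of a pod created by the Indexed Job: the controller sets this
# annotation, and JOB_COMPLETION_INDEX is derived from it.
metadata:
  annotations:
    batch.kubernetes.io/job-completion-index: "0"
```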
The provided job uses CPU resources. To enable GPU transcription:
- Ensure the GPU operators are installed and nodes are labeled
- Uncomment the GPU limits in `whisper-gpu.yml`:

  ```yaml
  limits:
    nvidia.com/gpu: 1
  ```

- Modify the Whisper command to use GPU acceleration (requires a CUDA-compatible setup); a possible invocation is sketched after this list
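For the last point, the openai-whisper CLI accepts a `--device` flag, so the main container's command could plausibly be switched to something like the following (a sketch, not the repo's exact invocation):

```yaml
# Hypothetical GPU variant of the main container's command.
command: ["sh", "-c",
  "whisper /data/*.mp3 --model tiny.en --device cuda --model_dir /tmp/whisper_models --output_dir /tmp"]
```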
```
┌─────────────────────────────────────────────────────────────┐
│                     Kueue ClusterQueue                       │
│                  (manages resource quotas)                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                   Kueue LocalQueue (sai)                     │
│                  (namespace-scoped queue)                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│           Indexed Job (6 completions, 2 parallel)            │
├─────────────────────────────────────────────────────────────┤
│  Pod [0]                            Pod [1]                  │
│  ┌────────────────┐                 ┌────────────────┐       │
│  │ Init: Download │   JOB_INDEX=0   │ Init: Download │ ...   │
│  │ script+GitHub  │───────────▶file0│ script+GitHub  │       │
│  └───────┬────────┘                 └───────┬────────┘       │
│          │         /data (emptyDir)         │                │
│          ▼                                  ▼                │
│  ┌────────────────┐                 ┌────────────────┐       │
│  │ Main: Whisper  │                 │ Main: Whisper  │       │
│  │ transcription  │                 │ transcription  │       │
│  └────────────────┘                 └────────────────┘       │
└─────────────────────────────────────────────────────────────┘
```
Audio files are sourced from the Rev.com Speech Datasets - earnings22 collection, which contains earnings call recordings in MP3 format. The download script automatically filters for .mp3 files and handles GitHub's pagination to support large directories.
| Issue | Solution |
|---|---|
| GitHub rate limit errors | Ensure the GitHub token secret is properly configured (Step 4) |
| Pod failures during download | Check init container logs: `oc logs <pod-name> -n sai -c download-audio` |
| GPU not detected | Verify the NFD and GPU operators are running (`oc get pods -n nvidia-gpu-operator`) and ensure nodes have GPU labels (`oc get nodes --show-labels \| grep nvidia`) |
| Job stuck in queue | Check Kueue workload admission status: `oc describe workload <name> -n sai` |
| Image pull errors | Verify container images are pushed to your registry and image references in manifests are correct |
| Transcription fails | Check main container logs: `oc logs <pod-name> -n sai -c whisper-transcriber` |