Standalone, low-latency speech transcription for Apple Silicon.
dictate.sh uses MLX for fast, local ASR with VAD-based turn detection, plus optional
LLM intent analysis. It ships as a single Python script with inline dependencies
so you can run it with uv and start talking.
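
The "inline dependencies" mentioned above usually means a PEP 723 metadata header that `uv run` reads before starting the script. The following is a minimal sketch of what such a header can look like, not the actual header of `stt.py`. The Python bound comes from the requirements in this README; `mlx` and `rich` are named in this README; `sounddevice` and `webrtcvad` are assumptions added only for illustration.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "mlx",          # named in this README
#     "rich",         # named in this README
#     "sounddevice",  # assumption: hypothetical audio-capture dependency
#     "webrtcvad",    # assumption: hypothetical VAD dependency
# ]
# ///
import sys

# uv resolves the dependencies listed above into an isolated environment
# before any of the code below runs.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"running under Python {python_version}")
```

With a header like this, `uv run <script>.py` needs no separate virtual environment or `pip install` step.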
- Low-latency, rolling-window ASR (MLX)
- Voice activity detection (VAD) for turn boundaries
- Optional intent analysis with a local LLM
- Live terminal UI (status, transcript, stats) via Rich
- Works offline after models are downloaded

Requirements:

- macOS on Apple Silicon (MLX)
- Python >= 3.10
- `uv` installed
- Microphone permission granted to your terminal

Run with defaults:

    uv run stt.py

With intent analysis:

    uv run stt.py --analyze

Choose a different ASR model:

    uv run stt.py --model mlx-community/Qwen3-ASR-1.7B-8bit

List audio input devices:

    uv run stt.py --list-devices

Use a specific input device:

    uv run stt.py --device 3

Options:

- `--model`: ASR model (default: `mlx-community/Qwen3-ASR-0.6B-8bit`)
- `--language`: Transcription language (default: `English`)
- `--transcribe-interval`: Seconds between updates (default: `0.5`)
- `--vad-frame-ms`: VAD frame size (10/20/30, default: `30`)
- `--vad-mode`: VAD aggressiveness 0-3 (default: `2`)
- `--vad-silence-ms`: Silence to finalize a turn (default: `500`)
- `--min-words`: Minimum words to finalize a turn (default: `3`)
- `--analyze`: Enable LLM intent analysis
- `--llm-model`: LLM model to use for analysis (default: `mlx-community/Qwen3-0.6B-4bit`)
- `--no-ui`: Disable the Rich live UI
- `--list-devices`: List audio input devices
- `--device`: Audio input device index
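
To show how `--vad-frame-ms`, `--vad-silence-ms`, and `--min-words` might interact, here is a minimal sketch of a turn-finalization rule. It is an illustration under assumptions, not the logic in `stt.py`: the `TurnDetector` class and its `update` method are hypothetical names, and the per-frame speech flag is assumed to come from whatever VAD the script uses.

```python
from dataclasses import dataclass


@dataclass
class TurnDetector:
    """Hypothetical helper: decides when a spoken turn is finished."""

    frame_ms: int = 30     # mirrors the --vad-frame-ms default
    silence_ms: int = 500  # mirrors the --vad-silence-ms default
    min_words: int = 3     # mirrors the --min-words default
    _silent_ms: int = 0    # trailing silence accumulated so far

    def update(self, is_speech: bool, transcript: str) -> bool:
        """Feed one VAD frame decision; return True when the turn should end."""
        if is_speech:
            # Any speech frame resets the trailing-silence counter.
            self._silent_ms = 0
            return False
        self._silent_ms += self.frame_ms
        enough_silence = self._silent_ms >= self.silence_ms
        enough_words = len(transcript.split()) >= self.min_words
        if enough_silence and enough_words:
            self._silent_ms = 0
            return True
        return False


# Ten speech frames followed by seventeen silent frames (17 * 30 ms = 510 ms).
detector = TurnDetector()
frames = [True] * 10 + [False] * 17
results = [detector.update(flag, "turn on the lights") for flag in frames]
print(results.index(True))  # index of the frame that finalizes the turn
```

Under this sketch, raising `silence_ms` makes the detector wait longer before closing a turn, and a transcript shorter than `min_words` never finalizes on silence alone.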

ASR (MLX Qwen3-ASR):

- `mlx-community/Qwen3-ASR-0.6B-4bit`: fastest, lowest quality
- `mlx-community/Qwen3-ASR-0.6B-8bit`: good balance (default)
- `mlx-community/Qwen3-ASR-0.6B-bf16`: higher quality, more RAM
- `mlx-community/Qwen3-ASR-1.7B-8bit`: higher quality, slower

LLM (for `--analyze`):

- `mlx-community/Qwen3-0.6B-4bit`: fastest, lowest RAM (default)
- `mlx-community/Qwen3-1.7B-4bit`: better quality, slower
- `mlx-community/Mistral-7B-Instruct-v0.2-4bit`: heavier
- `mlx-community/Llama-3.1-8B-Instruct-4bit`: heavier

- The Rich live UI renders on `stderr` to keep `stdout` clean for scripting.
- If `stdout` is not a TTY (e.g., when piping to another tool), `stt.py` automatically suppresses the UI elements and prints raw transcript lines to `stdout`.
- Use `--no-ui` to force-disable the visual interface even in a TTY.
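
The output rules above can be sketched with the standard library alone. This is an illustrative approximation, not the code in `stt.py`: `ui_enabled` and `emit` are hypothetical names, and the bracketed `stderr` line merely stands in for the Rich live UI.

```python
import io
import sys
from typing import TextIO


def ui_enabled(no_ui: bool, stdout_is_tty: bool) -> bool:
    """Show the live UI only on an interactive stdout and without --no-ui."""
    return stdout_is_tty and not no_ui


def emit(line: str, no_ui: bool = False,
         out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> None:
    """Route one finalized transcript line to the UI stream or to raw stdout."""
    if ui_enabled(no_ui, out.isatty()):
        # Stand-in for the Rich UI, which the README says renders on stderr.
        print(f"[transcript] {line}", file=err)
    else:
        # Piped or --no-ui: print the raw transcript line for other tools.
        print(line, file=out)


# Simulate a pipe: io.StringIO reports isatty() == False.
piped_out, piped_err = io.StringIO(), io.StringIO()
emit("this is important", out=piped_out, err=piped_err)
print(repr(piped_out.getvalue()))
```

Keeping the decision in one small function makes the piped case easy to check without a real terminal.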

    # Pipe raw transcripts into another tool
    uv run stt.py | grep "important"

- Too many short turns: increase `--vad-silence-ms` or lower `--vad-mode`.
- No audio: check mic permissions or try `--list-devices` + `--device`.
- Laggy output: reduce `--transcribe-interval`.

- Set `LOG_LEVEL=DEBUG` for verbose logs.
- Hugging Face HTTP request logs are suppressed by default.
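
One way a script could implement the logging behavior above is sketched below. Everything here is an assumption for illustration: the `configure_logging` helper is hypothetical, the `INFO` fallback is a guess at the default level, and the logger names (`httpx`, `urllib3`, `huggingface_hub`) are the ones commonly responsible for download request logs, not names confirmed for `stt.py`.

```python
import logging
import os
from typing import Mapping

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(env: Mapping[str, str] = os.environ) -> int:
    """Read LOG_LEVEL from the environment and quiet noisy HTTP loggers."""
    # Assumption: unknown or missing values fall back to INFO.
    level = LEVELS.get(env.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
    logging.basicConfig(level=level, force=True)
    # Assumption: these loggers emit the per-request lines during model downloads.
    for name in ("httpx", "urllib3", "huggingface_hub"):
        logging.getLogger(name).setLevel(logging.WARNING)
    return level


debug_level = configure_logging({"LOG_LEVEL": "debug"})
default_level = configure_logging({})
```

Usage from the shell would then be `LOG_LEVEL=DEBUG uv run stt.py`, as described above.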
MIT. See LICENSE.