diff --git a/docs/audio-capture-and-volume.md b/docs/audio-capture-and-volume.md
new file mode 100644
index 0000000..eba8530
--- /dev/null
+++ b/docs/audio-capture-and-volume.md
@@ -0,0 +1,271 @@
+# Audio Capture and Volume Control
+
+## Overview
+
+This document explains how Brainwave handles microphone input, audio processing, and volume control for real-time transcription with OpenAI's Realtime API.
+
+## Audio Flow Architecture
+
+```
+Browser Microphone
+    ↓ (48kHz, Float32, mono)
+Web Audio API ScriptProcessor
+    ↓ (Convert to PCM16)
+WebSocket to Backend
+    ↓
+FastAPI Server (realtime_server.py)
+    ↓ (Resample 48kHz → 24kHz)
+Audio Quality Detection
+    ↓
+OpenAI Realtime API
+    ↓ (gpt-4o-transcribe transcription)
+Transcription Results
+```
+
+## Browser-Side Audio Capture
+
+### Microphone Constraints
+
+**File**: [`static/main.js:577-585`](static/main.js#L577-L585)
+
+```javascript
+stream = await navigator.mediaDevices.getUserMedia({
+    audio: {
+        channelCount: 1,          // Mono audio
+        echoCancellation: true,   // Remove echo
+        noiseSuppression: true,   // Reduce background noise
+        autoGainControl: false    // CRITICAL: Disabled - see below
+    }
+});
+```
+
+### Why Auto Gain Control (AGC) is Disabled
+
+**IMPORTANT**: Browser AGC is **disabled** (`autoGainControl: false`) for the following reasons:
+
+1. **Preserves System Settings**: Users configure their microphone volume at the OS level (System Settings → Sound → Input). Browser AGC would override these settings.
+
+2. **Prevents Unexpected Volume Reduction**: During debugging, we discovered that `autoGainControl: true` caused audio amplitude to drop from expected values (>1000) to extremely low values (5-57), causing transcription failures.
+
+3. **User Control**: Professional users with proper microphone setups expect their hardware/system settings to be respected, not overridden by browser heuristics.
+
+4. **Predictable Behavior**: With AGC disabled, audio levels are consistent and predictable based on system configuration.
+
+**Trade-off**: Users must manually configure their microphone volume at the system level. The application provides warnings when volume is too low (see Audio Quality Detection below).
+
+### Audio Format Conversion
+
+**File**: [`static/main.js:189-192`](static/main.js#L189-L192)
+
+```javascript
+// Convert float32 (-1.0 to 1.0) to PCM16 (-32768 to 32767)
+for (let i = 0; i < inputData.length; i++) {
+    pcmData[i] = Math.max(-32768, Math.min(32767, Math.floor(inputData[i] * 32767)));
+}
+```
+
+Browser audio is captured as Float32 values in the range [-1.0, 1.0], which are converted to PCM16 format (16-bit signed integers) before sending to the server.
+
+## Server-Side Audio Processing
+
+### Audio Resampling
+
+**File**: [`realtime_server.py:179-211`](realtime_server.py#L179-L211)
+
+OpenAI's Realtime API requires audio at **24kHz sample rate**, but browsers typically capture at **48kHz**. We downsample using `scipy.signal.resample_poly`:
+
+```python
+def process_audio_chunk(self, audio_data):
+    # Convert binary PCM16 to numpy array
+    pcm_data = np.frombuffer(audio_data, dtype=np.int16)
+
+    # Audio quality check runs here and sets quality_status / max_amplitude
+    # (shown separately under "Audio Quality Detection" below)
+
+    # Convert to float32 for precision during resampling
+    float_data = pcm_data.astype(np.float32) / 32768.0
+
+    # Resample: 48kHz → 24kHz (reduces data by half)
+    resampled_data = scipy.signal.resample_poly(
+        float_data,
+        self.target_sample_rate,  # 24000
+        self.source_sample_rate   # 48000
+    )
+
+    # Convert back to PCM16
+    resampled_int16 = (resampled_data * 32768.0).clip(-32768, 32767).astype(np.int16)
+
+    return resampled_int16.tobytes(), quality_status, max_amplitude
+```
+
+**Why resample_poly?**: This method preserves audio quality better than simple decimation and prevents aliasing artifacts.
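+
+To see the resampling step in isolation, here is a minimal, self-contained sketch (illustrative only, not application code; it assumes only `numpy` and `scipy` and uses the same 48kHz → 24kHz rates as above):
+
+```python
+import numpy as np
+import scipy.signal
+
+source_rate, target_rate = 48000, 24000
+
+# One second of a 440 Hz test tone at the browser capture rate
+t = np.arange(source_rate) / source_rate
+tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
+
+# Same argument order as process_audio_chunk: up=24000, down=48000
+resampled = scipy.signal.resample_poly(tone, target_rate, source_rate)
+
+print(len(tone), len(resampled))  # 48000 24000 -> exactly half the samples
+```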
+
+## Audio Quality Detection
+
+### Quality Metrics
+
+**File**: [`realtime_server.py:182-197`](realtime_server.py#L182-L197)
+
+The system analyzes each audio chunk to detect quality issues:
+
+```python
+# Check audio quality
+max_amplitude = np.max(np.abs(pcm_data))
+rms = np.sqrt(np.mean(pcm_data.astype(np.float32) ** 2))
+
+# Determine audio quality status
+if max_amplitude == 0:
+    quality_status = "silent"
+    logger.warning("⚠️ Silent audio detected! All samples are zero.")
+elif max_amplitude < 100:
+    quality_status = "too_quiet"
+    logger.warning(f"⚠️ Very quiet audio detected! Max amplitude: {max_amplitude} (expected > 1000)")
+else:
+    quality_status = "ok"
+    logger.debug(f"Audio quality: max_amp={max_amplitude}, rms={rms:.1f}")
+```
+
+### Quality Thresholds
+
+| Status | Max Amplitude | Description |
+|--------|---------------|-------------|
+| `silent` | 0 | No audio signal detected (all zeros) |
+| `too_quiet` | 1-99 | Audio too quiet for reliable transcription |
+| `ok` | ≥100 | Acceptable audio level (typically >1000 for normal speech) |
+
+**Expected Range**: Normal speech at comfortable microphone distance typically produces amplitudes of **1,000 to 15,000** (PCM16 range).
+
+### Developer-Friendly Audio Monitoring
+
+**Browser Console Logging** - **File**: [`static/main.js:199-202`](static/main.js#L199-L202)
+
+Audio levels are logged to the browser console every ~2 seconds for debugging:
+
+```javascript
+// Log audio levels every ~2 seconds (at 48kHz with 4096 buffer = ~12 chunks/sec)
+if (Math.random() < 0.08) {
+    console.log(`🎤 Audio levels - Float: ${maxFloat.toFixed(3)} (0.0-1.0), PCM16: ${maxPCM} (expected >1000 for speech)`);
+}
+```
+
+**Server-Side Logging** - **File**: [`realtime_server.py:182-197`](realtime_server.py#L182-L197)
+
+The server logs audio quality warnings when issues are detected:
+- `logger.warning()` for silent or too-quiet audio
+- `logger.debug()` for normal audio with amplitude details
+
+**Why console logging instead of popups?**: Since the primary users are experienced developers, console logging provides debugging information without interrupting the user experience. Users can monitor audio levels in the browser DevTools console.
+
+## OpenAI Integration
+
+### Transcription Model
+
+**File**: [`openai_realtime_client.py:56-66`](openai_realtime_client.py#L56-L66)
+
+We use **gpt-4o-transcribe** for transcription:
+
+```python
+if session_mode == "transcription":
+    session_config_payload["input_audio_transcription"] = {
+        "model": "gpt-4o-transcribe"
+    }
+```
+
+**Note**: This model is part of OpenAI's Realtime API and is optimized for real-time transcription. Alternative models like `whisper-1` can be used if needed.
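+
+For illustration only, here is a minimal sketch of how the model choice could be made configurable (the `BRAINWAVE_TRANSCRIPTION_MODEL` environment variable is a hypothetical name, not an existing setting; `session_mode` and `session_config_payload` follow the excerpt above):
+
+```python
+import os
+
+# Hypothetical override; defaults to the model shown above
+TRANSCRIPTION_MODEL = os.environ.get("BRAINWAVE_TRANSCRIPTION_MODEL", "gpt-4o-transcribe")
+
+def apply_transcription_model(session_mode: str, session_config_payload: dict) -> dict:
+    """Attach the chosen transcription model to the session payload (sketch only)."""
+    if session_mode == "transcription":
+        session_config_payload["input_audio_transcription"] = {
+            "model": TRANSCRIPTION_MODEL  # e.g. "whisper-1" when overridden
+        }
+    return session_config_payload
+```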
+ +### Error Handling + +**File**: [`realtime_server.py:486-499`](realtime_server.py#L486-L499) + +The system handles transcription failures from OpenAI: + +```python +async def handle_transcription_failed(data): + """Handle transcription failure events from OpenAI""" + logger.error(f"⚠️ TRANSCRIPTION FAILED: {json.dumps(data, indent=2)}") + + error_msg = data.get("error", {}) + item_id = data.get("item_id", "unknown") + + # Send error to frontend + await websocket.send_text(json.dumps({ + "type": "error", + "content": f"Transcription failed: {error_msg.get('message', 'Unknown error')}", + "details": error_msg + }, ensure_ascii=False)) +``` + +This handler captures detailed error information from OpenAI's `conversation.item.input_audio_transcription.failed` event, which would otherwise be silently lost. + +## Troubleshooting + +### Problem: No transcription appearing + +**Possible Causes**: +1. **Low microphone volume** → Increase system microphone volume +2. **Rate limiting** → Wait 10-15 minutes between tests during development +3. **Microphone permissions** → Check browser permissions +4. **Silent audio** → Verify microphone is working in system settings + +### Problem: Low audio levels (shown in browser console) + +**Check browser console** for messages like: `🎤 Audio levels - Float: 0.003 (0.0-1.0), PCM16: 57 (expected >1000 for speech)` + +**Solutions**: +1. **Increase system volume**: macOS → System Settings → Sound → Input → adjust Input volume slider +2. **Select different microphone**: Choose a microphone with higher sensitivity +3. **Move closer to mic**: Ensure proper distance from microphone +4. **Test with Voice Memo**: macOS Voice Memo app can verify microphone is working properly +5. **Check browser permissions**: Ensure the correct microphone is selected in browser settings + +### Problem: 429 Rate Limit Error + +**Cause**: Too many connection attempts in a short time period. + +**Solution**: +- Wait 10-15 minutes for rate limits to reset +- OpenAI uses sliding window rate limits (not fixed intervals) +- Each connection attempt counts against your limit, even if it fails +- Test sparingly during development (2-3 minute gaps between attempts) + +## Technical Specifications + +| Parameter | Value | Notes | +|-----------|-------|-------| +| Sample Rate (Browser) | 48,000 Hz | Standard browser capture rate | +| Sample Rate (OpenAI) | 24,000 Hz | Required by OpenAI Realtime API | +| Format | PCM16 (16-bit signed) | Mono channel | +| Encoding | Base64 | For WebSocket transmission | +| Buffer Size | 4,096 samples | ScriptProcessor buffer | +| Auto Gain Control | **Disabled** | User controls volume at OS level | +| Echo Cancellation | Enabled | Removes audio output echo | +| Noise Suppression | Enabled | Reduces background noise | + +## Best Practices + +### For Users + +1. **Configure microphone at system level** before using the app +2. **Test microphone** with native apps (e.g., Voice Memo on macOS) +3. **Position microphone** properly for clear speech capture +4. **Avoid rapid testing** during development to prevent rate limits + +### For Developers + +1. **Monitor audio quality** using the built-in detection system +2. **Handle transcription failures** explicitly (don't rely on silent failures) +3. **Use debug logging** sparingly in production (`logger.debug()` for verbose output) +4. **Respect rate limits** when testing (use sliding window approach) +5. 
**Keep AGC disabled** unless users specifically request it + +## Related Files + +- [`static/main.js`](../static/main.js) - Browser audio capture and WebSocket client +- [`realtime_server.py`](../realtime_server.py) - Server-side audio processing and quality detection +- [`openai_realtime_client.py`](../openai_realtime_client.py) - OpenAI Realtime API integration +- [`audio_processor.py`](../audio_processor.py) - Audio resampling utilities (AudioProcessor class is in realtime_server.py) + +## Version History + +- **2026-01-05**: Initial documentation + - Disabled browser Auto Gain Control + - Implemented audio quality detection with console logging (developer-friendly) + - Using gpt-4o-transcribe transcription model (can be switched to whisper-1 if needed) + - Added transcription failure error handling diff --git a/openai_realtime_client.py b/openai_realtime_client.py index ce1e68a..4a97ca2 100644 --- a/openai_realtime_client.py +++ b/openai_realtime_client.py @@ -122,9 +122,11 @@ async def default_handler(self, data: dict): async def send_audio(self, audio_data: bytes): if self._is_ws_open(): + encoded = base64.b64encode(audio_data).decode('utf-8') + logger.debug(f"🎤 Encoding {len(audio_data)} bytes → {len(encoded)} base64 chars for OpenAI") await self.ws.send(json.dumps({ "type": "input_audio_buffer.append", - "audio": base64.b64encode(audio_data).decode('utf-8') + "audio": encoded })) else: logger.error("WebSocket is not open. Cannot send audio.") diff --git a/realtime_server.py b/realtime_server.py index f8f3a21..c66725f 100644 --- a/realtime_server.py +++ b/realtime_server.py @@ -179,20 +179,37 @@ def __init__(self, target_sample_rate=24000): def process_audio_chunk(self, audio_data): # Convert binary audio data to Int16 array pcm_data = np.frombuffer(audio_data, dtype=np.int16) - + + # Check audio quality + max_amplitude = np.max(np.abs(pcm_data)) + rms = np.sqrt(np.mean(pcm_data.astype(np.float32) ** 2)) + + # Determine audio quality status + if max_amplitude == 0: + quality_status = "silent" + logger.warning("⚠️ Silent audio detected! All samples are zero.") + elif max_amplitude < 100: + quality_status = "too_quiet" + logger.warning(f"⚠️ Very quiet audio detected! 
Max amplitude: {max_amplitude} (expected > 1000)") + else: + quality_status = "ok" + logger.debug(f"Audio quality: max_amp={max_amplitude}, rms={rms:.1f}") + # Convert to float32 for better precision during resampling float_data = pcm_data.astype(np.float32) / 32768.0 - + # Resample from 48kHz to 24kHz resampled_data = scipy.signal.resample_poly( - float_data, - self.target_sample_rate, + float_data, + self.target_sample_rate, self.source_sample_rate ) - + # Convert back to int16 while preserving amplitude resampled_int16 = (resampled_data * 32768.0).clip(-32768, 32767).astype(np.int16) - return resampled_int16.tobytes() + + # Return both audio and quality info + return resampled_int16.tobytes(), quality_status, max_amplitude def save_audio_buffer(self, audio_buffer, filename): with wave.open(filename, 'wb') as wf: @@ -328,6 +345,7 @@ async def initialize_realtime_client(provider: str = None, model: str = None, vo client.register_handler("input_audio_buffer.committed", lambda data: handle_generic_event("input_audio_buffer.committed", data)) client.register_handler("conversation.item.added", lambda data: handle_generic_event("conversation.item.added", data)) client.register_handler("conversation.item.input_audio_transcription.completed", lambda data: handle_generic_event("conversation.item.input_audio_transcription.completed", data)) + client.register_handler("conversation.item.input_audio_transcription.failed", lambda data: handle_transcription_failed(data)) client.register_handler("response.output_audio_transcript.done", lambda data: handle_generic_event("response.output_audio_transcript.done", data)) client.register_handler("response.output_audio.delta", lambda data: handle_generic_event("response.output_audio.delta", data)) client.register_handler("response.output_audio.done", lambda data: handle_generic_event("response.output_audio.done", data)) @@ -464,6 +482,22 @@ async def handle_response_done(data): except Exception as e: logger.error(f"Error closing OpenAI client: {str(e)}", exc_info=True) + async def handle_transcription_failed(data): + """Handle transcription failure events from OpenAI""" + logger.error(f"⚠️ TRANSCRIPTION FAILED: {json.dumps(data, indent=2)}") + + # Extract error details + error_msg = data.get("error", {}) + item_id = data.get("item_id", "unknown") + + # Send error to frontend + if websocket.client_state == WebSocketState.CONNECTED: + await websocket.send_text(json.dumps({ + "type": "error", + "content": f"Transcription failed: {error_msg.get('message', 'Unknown error')}", + "details": error_msg + }, ensure_ascii=False)) + async def handle_generic_event(event_type, data): logger.info(f"Handled {event_type} with data: {json.dumps(data, ensure_ascii=False)}") @@ -497,7 +531,10 @@ async def receive_messages(): break if "bytes" in data: - processed_audio = audio_processor.process_audio_chunk(data["bytes"]) + logger.debug(f"📤 Received {len(data['bytes'])} bytes of audio from frontend") + processed_audio, quality_status, max_amp = audio_processor.process_audio_chunk(data["bytes"]) + logger.debug(f"🔊 Processed audio: {len(processed_audio)} bytes (after 48kHz→24kHz resampling)") + if not openai_ready.is_set(): logger.debug("OpenAI not ready, buffering audio chunk") pending_audio_chunks.append(processed_audio) @@ -507,7 +544,7 @@ async def receive_messages(): "type": "status", "status": "connected" }, ensure_ascii=False)) - logger.debug(f"Sent audio chunk, size: {len(processed_audio)} bytes") + logger.debug(f"📡 Sent {len(processed_audio)} bytes to OpenAI") else: 
logger.warning("Received audio but client is not initialized") diff --git a/static/main.js b/static/main.js index f355c27..cb59752 100644 --- a/static/main.js +++ b/static/main.js @@ -185,9 +185,20 @@ function createAudioProcessor() { const inputData = e.inputBuffer.getChannelData(0); const pcmData = new Int16Array(inputData.length); - + + // Check audio levels and log periodically (for debugging) + let maxFloat = 0; + let maxPCM = 0; for (let i = 0; i < inputData.length; i++) { - pcmData[i] = Math.max(-32768, Math.min(32767, Math.floor(inputData[i] * 32767))); + maxFloat = Math.max(maxFloat, Math.abs(inputData[i])); + const pcmValue = Math.max(-32768, Math.min(32767, Math.floor(inputData[i] * 32767))); + pcmData[i] = pcmValue; + maxPCM = Math.max(maxPCM, Math.abs(pcmValue)); + } + + // Log audio levels every ~2 seconds (at 48kHz with 4096 buffer = ~12 chunks/sec) + if (Math.random() < 0.08) { + console.log(`🎤 Audio levels - Float: ${maxFloat.toFixed(3)} (0.0-1.0), PCM16: ${maxPCM} (expected >1000 for speech)`); } const combinedBuffer = new Int16Array(audioBuffer.length + pcmData.length); @@ -569,13 +580,13 @@ async function startRecording() { cleanupAudioResources(); try { - stream = await navigator.mediaDevices.getUserMedia({ + stream = await navigator.mediaDevices.getUserMedia({ audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, - autoGainControl: true - } + autoGainControl: false // Disable browser AGC - use system mic settings + } }); streamInitialized = true; } catch (err) {