feat: v0.5.0 — voice input with Whisper ASR

ryanjames85 · ryanjames85 · commit 9a6a327682b5 · 2026-05-22T00:59:14.000+01:00
- Local on-device transcription via OpenAI Whisper (Transformers.js, inline blob worker)
- Six models: tiny.en, tiny, base.en, base, small.en, small (~40–244 MB, downloaded on demand)
- Push-to-talk: click to record, click to stop and transcribe
- Full-utterance transcription: the whole recording (capped at 28 s) is sent as one chunk for best Whisper accuracy
- 0.3s silence pad prepended to prevent Whisper dropping the first word
- PCM chunks now accumulate correctly before transcription (previously each incoming chunk overwrote the buffer)
- Model pre-warming on startup when mic is enabled and a model is downloaded
- Two-step model UX: select to highlight, set as default to activate; tick = downloaded, dot = active
- Mic sensitivity slider in settings: adjustable energy gate, an RMS threshold from 0.001 to 0.100 (grom.voiceSensitivity, default 0.010)
- Persist downloaded model list to localStorage across sessions
- Hide/show mic toggle, privacy badge, ffmpeg lifecycle management
- Update CHANGELOG, README, website, and all docs — remove all Moonshine references
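
For readers skimming the bullets above, here is a minimal sketch of how chunk accumulation, the energy gate, the 28 s cap, and the 0.3 s silence pad could fit together. It is not Grom's actual code: the function names (`appendPcm`, `rms`, `prepareUtterance`, `sliderToGate`, `gateToSlider`) are hypothetical, and several details are assumptions, namely that samples are held as `Float32Array`, that the gate is measured before padding, and that the cap keeps the first 28 s of audio. The 16 kHz sample rate and the slider mapping are taken from `media/main.js`.

```javascript
// Illustrative sketch only; names are hypothetical, not Grom's real helpers.
const SAMPLE_RATE = 16000;        // 16 kHz PCM, as stated in media/main.js
const SILENCE_PAD_SECONDS = 0.3;  // pad length from the commit message
const MAX_SECONDS = 28;           // cap from the CHANGELOG entry

// Append a new chunk to the running buffer instead of replacing it.
function appendPcm(all, chunk) {
  if (!all) return Float32Array.from(chunk);
  const out = new Float32Array(all.length + chunk.length);
  out.set(all, 0);
  out.set(chunk, all.length);
  return out;
}

// Root-mean-square energy of the samples (0 for an empty buffer).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

// Slider position 1..100 <-> gate 0.001..0.100, mirroring the diff's mapping.
function sliderToGate(v) { return Math.round(Number(v)) / 1000; }
function gateToSlider(g) { return Math.round(g * 1000); }

// Returns null when the recording is quieter than the gate; otherwise returns
// the capped audio with leading silence prepended.
function prepareUtterance(all, gate) {
  if (!all || rms(all) < gate) return null;
  // Assumption: the cap keeps the first 28 s and excludes the pad.
  const capped = all.subarray(0, MAX_SECONDS * SAMPLE_RATE);
  const pad = Math.round(SILENCE_PAD_SECONDS * SAMPLE_RATE); // 4800 samples
  const out = new Float32Array(pad + capped.length); // zero-filled, so the pad is silent
  out.set(capped, pad);
  return out;
}
```

With the default gate of 0.010, a buffer of all zeros is rejected, while any recording whose RMS reaches the gate comes back 4800 samples longer than its capped length. Raising the gate toward 0.020–0.030 rejects more low-level background noise, at the cost of possibly dropping quiet speech.
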
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,6 +13,7 @@ All notable changes to Grom are documented here.
 - **Active model indicator** — the current default model is clearly marked in the picker. Selecting a model highlights it; a "Set as default" button promotes it. Downloaded models show a tick; the active model shows a filled dot.
 - **Model pre-warming** — Whisper loads silently in the background when Grom starts (if the mic is enabled and a model is downloaded), so the first utterance transcribes without delay.
 - **Full-utterance transcription** — the entire recording is sent to Whisper as one chunk (capped at 28 s), giving the model full context for accurate transcription. A 0.3 s silence pad is prepended to prevent Whisper from dropping the first word.
+- **Mic sensitivity slider** — Settings → Voice Input exposes the energy gate (RMS threshold) as a slider. Raise it if phantom transcriptions appear from background noise; lower it for quiet microphones. Persisted to VS Code settings as `grom.voiceSensitivity`.
 - **ffmpeg lifecycle management** — Settings → Voice Input lets you remove the downloaded ffmpeg binary for a full cleanup. Re-downloading works seamlessly afterwards.
 - **Hide/show mic toggle** — hide the mic button from the toolbar via Settings → Voice Input; restore it anytime from the same panel. An info badge explains how to get it back if you hide it accidentally.
 - **Privacy badge** — the Voice Input settings section carries a circled-i badge explaining that audio is transcribed entirely on your device and never leaves your machine — part of Grom's accessibility and privacy ethos.
diff --git a/README.md b/README.md
@@ -301,6 +301,7 @@ Grom runs in VS Code and any VS Code-compatible editor:
 | `grom.customLogo` | Override the chat logo — URL, `data:` URI, or emoji | *(blank)* |
 | `grom.voiceInput` | Enable the mic button in the toolbar | `false` |
 | `grom.voiceModel` | Whisper model: `tiny.en`, `tiny`, `base.en`, `base`, `small.en`, `small` | `tiny.en` |
+| `grom.voiceSensitivity` | Mic energy gate (RMS threshold). Raise if phantom transcriptions appear; lower for quiet mics | `0.010` |
 | `grom.ffmpegPath` | Path to a custom ffmpeg binary (skips the built-in download) | *(blank)* |
 
 ### Per-Language Model Routing
diff --git a/docs/index.html b/docs/index.html
@@ -177,7 +177,7 @@ <h3>Memory</h3>
       <div class="feature-card">
         <div class="icon">🎙️</div>
         <h3>Voice Input</h3>
-        <p>Speak your prompts. Audio is transcribed on-device using a local Moonshine ONNX model — nothing is sent to a server. Choose Tiny (~75 MB) or Base (~300 MB). Chunked streaming so text appears as you speak.</p>
+        <p>Speak your prompts. Audio is transcribed on-device using Whisper — nothing is sent to a server. Six models from Tiny EN (~40 MB) to Small (~244 MB). Push-to-talk with model pre-warming for instant first use.</p>
       </div>
     </div>
   </section>
@@ -203,19 +203,19 @@ <h3>Download ffmpeg (once)</h3>
       <div class="step">
         <div class="step-num">3</div>
         <div class="step-body">
-          <h3>Choose your ASR model</h3>
-          <p>Pick <strong>Tiny</strong> (~75 MB, fast) or <strong>Base</strong> (~300 MB, more accurate) in Settings → Voice Input. Models download on demand and are cached locally. Switch at any time.</p>
+          <h3>Choose your Whisper model</h3>
+          <p>Pick from six models in Settings → Voice Input — from <strong>Tiny EN</strong> (~40 MB, fast) up to <strong>Small</strong> (~244 MB, best accuracy). English-only <code>.en</code> variants are more accurate for English speakers. Models download on demand and are cached locally. Download multiple and switch at any time.</p>
         </div>
       </div>
       <div class="step">
         <div class="step-num">4</div>
         <div class="step-body">
           <h3>Record</h3>
-          <p>Click the mic button or press <code>Ctrl+Shift+M</code> to start. Text appears as you speak. Click again or press the shortcut to stop — the transcript is appended to your prompt.</p>
+          <p>Click the mic button or press <code>Ctrl+Shift+M</code> to start recording. Click again to stop — the transcript is appended to your prompt. The model pre-warms on startup so the first utterance transcribes without delay.</p>
         </div>
       </div>
     </div>
-    <p style="margin-top:20px;font-size:13px;">Audio is transcribed entirely on your device using <a href="https://github.com/huggingface/transformers.js">Transformers.js</a> and the <a href="https://github.com/usefulsensors/moonshine">Moonshine</a> ONNX model. Nothing leaves your machine. Voice input is optional and designed for those who want or need it as an accessibility tool.</p>
+    <p style="margin-top:20px;font-size:13px;">Audio is transcribed entirely on your device using <a href="https://github.com/huggingface/transformers.js">Transformers.js</a> and <a href="https://openai.com/research/whisper">OpenAI Whisper</a>. Nothing leaves your machine. Voice input is optional and designed for those who want or need it as an accessibility tool.</p>
     <p style="font-size:13px;"><strong>Platform support:</strong> Windows (DirectShow), macOS (avfoundation), Linux (PulseAudio / PipeWire / ALSA).</p>
   </section>
 
diff --git a/media/main.js b/media/main.js
@@ -815,7 +815,7 @@ function handleFileUpload(input) {
 
 // ── Voice input ──────────────────────────────────────────────────────────────
 // Audio capture: ffmpeg subprocess in extension host → raw 16kHz PCM → sent here as base64.
-// Inference: Moonshine ONNX via @huggingface/transformers CDN, runs in this webview.
+// Inference: Whisper via @huggingface/transformers CDN, runs in an inline blob Worker.
 
 let _voiceRecordingTimer = null;
 function _setVoiceState(state) {
@@ -859,6 +859,7 @@ let _vpWarming = false; // true while silently pre-loading on startup — suppre
 let _vpModelId = 'tiny.en';   // active/default model used for transcription
 let _vpSelectedId = 'tiny.en'; // currently highlighted in the picker UI
 let _vpLoadedModel = null;     // which model id is loaded in the active worker
+let _vpEnergyGate = 0.010;    // RMS threshold — audio below this is treated as silence
 
 function _vpGetDownloaded() {
   try { return JSON.parse(localStorage.getItem('grom_downloaded_models') || '[]'); } catch { return []; }
@@ -1129,8 +1130,8 @@ async function _vpTranscribe(isFinal) {
   }
   const rms = _vpRms(_vpAllPcm);
   console.log('[grom voice] RMS:', rms.toFixed(6));
-  if (rms < 0.010) {
-    console.log('[grom voice] energy gate: skipping (RMS ' + rms.toFixed(6) + ')');
+  if (rms < _vpEnergyGate) {
+    console.log('[grom voice] energy gate: skipping (RMS ' + rms.toFixed(6) + ' < ' + _vpEnergyGate + ')');
     if (isFinal) { _setVoiceState('idle'); _setVoiceDownload(null); }
     return;
   }
@@ -1241,6 +1242,30 @@ window.toggleMicVisibility = function() {
   vscode.postMessage({ type: _voiceInputEnabled ? 'disableVoiceInput' : 'enableVoiceInput' });
 };
 
+// Slider: 1–100 maps to 0.001–0.100 linearly
+function _vpSliderToGate(v) { return Math.round(Number(v)) / 1000; }
+function _vpGateToSlider(g) { return Math.round(g * 1000); }
+
+function _vpSetSensitivity(gate, save) {
+  _vpEnergyGate = gate;
+  const slider = document.getElementById('voice-sensitivity-slider');
+  const val = document.getElementById('voice-sensitivity-val');
+  if (slider) slider.value = _vpGateToSlider(gate);
+  if (val) val.textContent = gate.toFixed(3);
+  if (save) vscode.postMessage({ type: 'setVoiceSensitivity', value: gate });
+}
+
+window.onVoiceSensitivityInput = function(v) {
+  const gate = _vpSliderToGate(v);
+  _vpEnergyGate = gate;
+  const val = document.getElementById('voice-sensitivity-val');
+  if (val) val.textContent = gate.toFixed(3);
+};
+
+window.onVoiceSensitivityChange = function(v) {
+  _vpSetSensitivity(_vpSliderToGate(v), true);
+};
+
 let _voiceToggleLock = false;
 window.toggleVoiceInput = function() {
   console.log('[grom voice] toggleVoiceInput called, lock:', _voiceToggleLock, 'state:', _voiceState);
@@ -1463,7 +1488,11 @@ window.addEventListener('message', e => {
     } break;
     case 'voiceState': _setVoiceState(m.state); break;
     case 'voiceDownload': _setVoiceDownload(m.text); break;
-    case 'voiceModelConfig': _vpModelId = m.model || 'tiny.en'; _vpSelectedId = _vpModelId; _vpUpdateModelUI(); break;
+    case 'voiceModelConfig': {
+      _vpModelId = m.model || 'tiny.en'; _vpSelectedId = _vpModelId; _vpUpdateModelUI();
+      if (typeof m.sensitivity === 'number') _vpSetSensitivity(m.sensitivity, false);
+      break;
+    }
     case 'voiceAudioStart': {
       _vpAllPcm = null; _vpBusy = false; _vpPendingFinal = false;
       const _existingPrompt = document.getElementById('prompt');
diff --git a/media/styles.css b/media/styles.css
@@ -116,6 +116,10 @@ mark.search-highlight { background: rgba(232, 168, 56, 0.4); color: inherit; bor
 .voice-mic-on  { border-color: #4ec9b0 !important; color: #4ec9b0 !important; opacity: 1 !important; }
 .voice-mic-off { opacity: 0.45 !important; }
 .voice-action-result { font-size: 11px; color: var(--vscode-descriptionForeground); margin-top: 6px; min-height: 16px; }
+.voice-sensitivity-row { display: flex; align-items: center; gap: 8px; margin-top: 8px; }
+.voice-sensitivity-label { font-size: 11px; color: var(--vscode-descriptionForeground); white-space: nowrap; }
+#voice-sensitivity-slider { flex: 1; accent-color: var(--vscode-button-background); cursor: pointer; }
+#voice-sensitivity-val { font-size: 11px; color: var(--vscode-descriptionForeground); min-width: 36px; text-align: right; }
 #input-container { padding: 15px; flex-shrink: 0; }
 .input-box { background: var(--item-bg); border: 1px solid var(--border); border-radius: 12px; padding: 12px; display: flex; flex-direction: column; box-shadow: 0 4px 12px rgba(0,0,0,0.1); }
 textarea { width: 100%; background: transparent; color: inherit; border: none; outline: none; resize: none; min-height: 20px; max-height: 200px; font-family: inherit; font-size: 13px; }
diff --git a/media/voice-worker.js b/media/voice-worker.js
diff --git a/media/webview.html b/media/webview.html
@@ -179,6 +179,11 @@
           <button id="remove-ffmpeg-btn" class="voice-remove-btn" onclick="window.removeFfmpeg()" style="display:none">Remove ffmpeg</button>
           <button id="mic-toggle-btn" class="voice-remove-btn voice-mic-on" onclick="window.toggleMicVisibility()" title="Toggle the mic button in the toolbar">● Mic on</button>
         </div>
+        <div class="settings-row voice-sensitivity-row">
+          <label for="voice-sensitivity-slider" class="voice-sensitivity-label">Mic sensitivity</label>
+          <input type="range" id="voice-sensitivity-slider" min="1" max="100" step="1" value="10" oninput="window.onVoiceSensitivityInput(this.value)" onchange="window.onVoiceSensitivityChange(this.value)" title="Energy gate — lower is more sensitive">
+          <span id="voice-sensitivity-val">0.010</span>
+        </div>
         <div id="clear-voice-result" class="voice-action-result"></div>
       </div>
       <div class="settings-section">
diff --git a/package.json b/package.json
@@ -434,6 +434,13 @@
           "default": "tiny.en",
           "description": "Which Whisper model to use for local voice transcription. English-only (.en) models are more accurate for English speakers. The model is downloaded once and cached in your browser session."
         },
+        "grom.voiceSensitivity": {
+          "type": "number",
+          "default": 0.010,
+          "minimum": 0.001,
+          "maximum": 0.100,
+          "description": "Mic energy gate for voice input (RMS threshold). Lower = more sensitive (picks up quiet speech and background noise). Higher = less sensitive (ignores noise but may miss quiet speech). Default 0.010 works for most setups; raise to 0.020–0.030 if phantom transcriptions appear."
+        },
         "grom.ffmpegPath": {
           "type": "string",
           "default": "",
diff --git a/src/provider.ts b/src/provider.ts
@@ -183,7 +183,8 @@ export class LocalChatViewProvider implements vscode.WebviewViewProvider {
           this._loadAllSessions();
           this._updateActiveContext();
           const voiceModel = vscode.workspace.getConfiguration('grom').get<string>('voiceModel', 'tiny.en');
-          webviewView.webview.postMessage({ type: 'voiceModelConfig', model: voiceModel });
+          const voiceSensitivity = vscode.workspace.getConfiguration('grom').get<number>('voiceSensitivity', 0.010);
+          webviewView.webview.postMessage({ type: 'voiceModelConfig', model: voiceModel, sensitivity: voiceSensitivity });
           webviewView.webview.postMessage({ type: 'voiceFfmpegStatus', present: !!findFfmpeg(this._context.globalStorageUri.fsPath) });
           const isDev = this._context.extensionMode === vscode.ExtensionMode.Development;
           if (isDev || !this._context.globalState.get('grom.welcomed')) {
@@ -529,6 +530,9 @@ export class LocalChatViewProvider implements vscode.WebviewViewProvider {
         case 'setVoiceModel':
           void vscode.workspace.getConfiguration('grom').update('voiceModel', data.model, vscode.ConfigurationTarget.Global);
           break;
+        case 'setVoiceSensitivity':
+          void vscode.workspace.getConfiguration('grom').update('voiceSensitivity', data.value, vscode.ConfigurationTarget.Global);
+          break;
         case 'removeFfmpeg': {
           const ffmpegDir = path.join(this._context.globalStorageUri.fsPath, 'ffmpeg');
           try {