
Commit 9a6a327

feat: v0.5.0 — voice input with Whisper ASR
- Local on-device transcription via OpenAI Whisper (Transformers.js, inline blob worker)
- Six models: tiny.en, tiny, base.en, base, small.en, small (~40-244 MB, download on demand)
- Push-to-talk: click to record, click to stop and transcribe
- Full-utterance transcription — whole recording sent as one chunk for best Whisper accuracy
- 0.3s silence pad prepended to prevent Whisper dropping the first word
- PCM chunks now accumulate correctly before transcription (was overwriting on each chunk)
- Model pre-warming on startup when mic is enabled and a model is downloaded
- Two-step model UX: select to highlight, set as default to activate; tick = downloaded, dot = active
- Mic sensitivity slider in settings — adjustable energy gate (grom.voiceSensitivity, default 0.010)
- Persist downloaded model list to localStorage across sessions
- Hide/show mic toggle, privacy badge, ffmpeg lifecycle management
- Update CHANGELOG, README, website, and all docs — remove all Moonshine references
1 parent fa59f12 commit 9a6a327
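
The commit message mentions three audio-preparation steps: accumulating PCM chunks instead of overwriting them, capping the utterance (28 s per the CHANGELOG entry), and prepending a 0.3 s silence pad. The sketch below shows one way these could fit together at 16 kHz. The function names, the use of `Float32Array`, and the choice to trim the end of an over-long recording are assumptions for illustration, not Grom's actual code.

```javascript
// Illustrative sketch only: names and structure are assumptions, not Grom's code.
const SAMPLE_RATE = 16000;        // 16 kHz PCM, as stated in the diff comments
const SILENCE_PAD_SECONDS = 0.3;  // pad length from the commit message
const MAX_SECONDS = 28;           // cap from the CHANGELOG entry

// Append a new chunk to everything recorded so far (rather than replacing it).
function appendPcm(all, chunk) {
  if (!all) return Float32Array.from(chunk);
  const out = new Float32Array(all.length + chunk.length);
  out.set(all, 0);
  out.set(chunk, all.length);
  return out;
}

// Cap the utterance, then prepend zero-valued samples as the silence pad.
// Assumption: samples beyond the cap are dropped from the end.
function prepareUtterance(all) {
  const maxSamples = MAX_SECONDS * SAMPLE_RATE;
  const capped = all.length > maxSamples ? all.subarray(0, maxSamples) : all;
  const padSamples = Math.round(SILENCE_PAD_SECONDS * SAMPLE_RATE);
  const out = new Float32Array(padSamples + capped.length); // zero-initialised
  out.set(capped, padSamples);
  return out;
}
```

Calling `appendPcm` once per incoming chunk and `prepareUtterance` once when recording stops yields a single buffer suitable for one transcription call.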

9 files changed

Lines changed: 61 additions & 82 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ All notable changes to Grom are documented here.
 - **Active model indicator** — the current default model is clearly marked in the picker. Selecting a model highlights it; a "Set as default" button promotes it. Downloaded models show a tick; the active model shows a filled dot.
 - **Model pre-warming** — Whisper loads silently in the background when Grom starts (if the mic is enabled and a model is downloaded), so the first utterance transcribes without delay.
 - **Full-utterance transcription** — the entire recording is sent to Whisper as one chunk (capped at 28 s), giving the model full context for accurate transcription. A 0.3 s silence pad is prepended to prevent Whisper from dropping the first word.
+- **Mic sensitivity slider** — Settings → Voice Input exposes the energy gate (RMS threshold) as a slider. Raise it if phantom transcriptions appear from background noise; lower it for quiet microphones. Persisted to VS Code settings as `grom.voiceSensitivity`.
 - **ffmpeg lifecycle management** — Settings → Voice Input lets you remove the downloaded ffmpeg binary for a full cleanup. Re-downloading works seamlessly afterwards.
 - **Hide/show mic toggle** — hide the mic button from the toolbar via Settings → Voice Input; restore it anytime from the same panel. An info badge explains how to get it back if you hide it accidentally.
 - **Privacy badge** — the Voice Input settings section carries a circled-i badge explaining that audio is transcribed entirely on your device and never leaves your machine — part of Grom's accessibility and privacy ethos.
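
The energy gate described in the CHANGELOG is an RMS threshold: recordings whose root-mean-square level falls below it are treated as silence and skipped. A minimal sketch of that check follows, assuming float PCM samples in the range -1 to 1. The helper names `rms` and `passesEnergyGate` are hypothetical, and the diff does not show the body of Grom's own `_vpRms`.

```javascript
// Illustrative sketch only: helper names are hypothetical.

// Root-mean-square level of a block of float PCM samples (range -1..1).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// True when the recording is loud enough to be worth transcribing.
// The default matches the documented grom.voiceSensitivity default of 0.010.
function passesEnergyGate(samples, gate = 0.010) {
  return rms(samples) >= gate;
}
```

Raising the gate rejects more low-level audio (fewer phantom transcriptions); lowering it lets quieter speech through.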

README.md

Lines changed: 1 addition & 0 deletions
@@ -301,6 +301,7 @@ Grom runs in VS Code and any VS Code-compatible editor:
 | `grom.customLogo` | Override the chat logo — URL, `data:` URI, or emoji | *(blank)* |
 | `grom.voiceInput` | Enable the mic button in the toolbar | `false` |
 | `grom.voiceModel` | Whisper model: `tiny.en`, `tiny`, `base.en`, `base`, `small.en`, `small` | `tiny.en` |
+| `grom.voiceSensitivity` | Mic energy gate (RMS threshold). Raise if phantom transcriptions appear; lower for quiet mics | `0.010` |
 | `grom.ffmpegPath` | Path to a custom ffmpeg binary (skips the built-in download) | *(blank)* |
 
 ### Per-Language Model Routing

docs/index.html

Lines changed: 5 additions & 5 deletions
@@ -177,7 +177,7 @@ <h3>Memory</h3>
     <div class="feature-card">
       <div class="icon">🎙️</div>
       <h3>Voice Input</h3>
-      <p>Speak your prompts. Audio is transcribed on-device using a local Moonshine ONNX model — nothing is sent to a server. Choose Tiny (~75 MB) or Base (~300 MB). Chunked streaming so text appears as you speak.</p>
+      <p>Speak your prompts. Audio is transcribed on-device using Whisper — nothing is sent to a server. Six models from Tiny EN (~40 MB) to Small (~244 MB). Push-to-talk with model pre-warming for instant first use.</p>
     </div>
   </div>
 </section>
@@ -203,19 +203,19 @@ <h3>Download ffmpeg (once)</h3>
     <div class="step">
       <div class="step-num">3</div>
       <div class="step-body">
-        <h3>Choose your ASR model</h3>
-        <p>Pick <strong>Tiny</strong> (~75 MB, fast) or <strong>Base</strong> (~300 MB, more accurate) in Settings → Voice Input. Models download on demand and are cached locally. Switch at any time.</p>
+        <h3>Choose your Whisper model</h3>
+        <p>Pick from six models in Settings → Voice Input — from <strong>Tiny EN</strong> (~40 MB, fast) up to <strong>Small</strong> (~244 MB, best accuracy). English-only <code>.en</code> variants are more accurate for English speakers. Models download on demand and are cached locally. Download multiple and switch at any time.</p>
       </div>
     </div>
     <div class="step">
       <div class="step-num">4</div>
       <div class="step-body">
         <h3>Record</h3>
-        <p>Click the mic button or press <code>Ctrl+Shift+M</code> to start. Text appears as you speak. Click again or press the shortcut to stop — the transcript is appended to your prompt.</p>
+        <p>Click the mic button or press <code>Ctrl+Shift+M</code> to start recording. Click again to stop — the transcript is appended to your prompt. The model pre-warms on startup so the first utterance transcribes without delay.</p>
       </div>
     </div>
   </div>
-  <p style="margin-top:20px;font-size:13px;">Audio is transcribed entirely on your device using <a href="https://github.com/huggingface/transformers.js">Transformers.js</a> and the <a href="https://github.com/usefulsensors/moonshine">Moonshine</a> ONNX model. Nothing leaves your machine. Voice input is optional and designed for those who want or need it as an accessibility tool.</p>
+  <p style="margin-top:20px;font-size:13px;">Audio is transcribed entirely on your device using <a href="https://github.com/huggingface/transformers.js">Transformers.js</a> and <a href="https://openai.com/research/whisper">OpenAI Whisper</a>. Nothing leaves your machine. Voice input is optional and designed for those who want or need it as an accessibility tool.</p>
   <p style="font-size:13px;"><strong>Platform support:</strong> Windows (DirectShow), macOS (avfoundation), Linux (PulseAudio / PipeWire / ALSA).</p>
 </section>
 
221221

media/main.js

Lines changed: 33 additions & 4 deletions
@@ -815,7 +815,7 @@ function handleFileUpload(input) {
 
 // ── Voice input ──────────────────────────────────────────────────────────────
 // Audio capture: ffmpeg subprocess in extension host → raw 16kHz PCM → sent here as base64.
-// Inference: Moonshine ONNX via @huggingface/transformers CDN, runs in this webview.
+// Inference: Whisper via @huggingface/transformers CDN, runs in an inline blob Worker.
 
 let _voiceRecordingTimer = null;
 function _setVoiceState(state) {
@@ -859,6 +859,7 @@ let _vpWarming = false; // true while silently pre-loading on startup — suppre
 let _vpModelId = 'tiny.en'; // active/default model used for transcription
 let _vpSelectedId = 'tiny.en'; // currently highlighted in the picker UI
 let _vpLoadedModel = null; // which model id is loaded in the active worker
+let _vpEnergyGate = 0.010; // RMS threshold — audio below this is treated as silence
 
 function _vpGetDownloaded() {
   try { return JSON.parse(localStorage.getItem('grom_downloaded_models') || '[]'); } catch { return []; }
@@ -1129,8 +1130,8 @@ async function _vpTranscribe(isFinal) {
 }
 const rms = _vpRms(_vpAllPcm);
 console.log('[grom voice] RMS:', rms.toFixed(6));
-if (rms < 0.010) {
-  console.log('[grom voice] energy gate: skipping (RMS ' + rms.toFixed(6) + ')');
+if (rms < _vpEnergyGate) {
+  console.log('[grom voice] energy gate: skipping (RMS ' + rms.toFixed(6) + ' < ' + _vpEnergyGate + ')');
   if (isFinal) { _setVoiceState('idle'); _setVoiceDownload(null); }
   return;
 }
@@ -1241,6 +1242,30 @@ window.toggleMicVisibility = function() {
   vscode.postMessage({ type: _voiceInputEnabled ? 'disableVoiceInput' : 'enableVoiceInput' });
 };
 
+// Slider: 1–100 maps to 0.001–0.100 linearly
+function _vpSliderToGate(v) { return Math.round(Number(v)) / 1000; }
+function _vpGateToSlider(g) { return Math.round(g * 1000); }
+
+function _vpSetSensitivity(gate, save) {
+  _vpEnergyGate = gate;
+  const slider = document.getElementById('voice-sensitivity-slider');
+  const val = document.getElementById('voice-sensitivity-val');
+  if (slider) slider.value = _vpGateToSlider(gate);
+  if (val) val.textContent = gate.toFixed(3);
+  if (save) vscode.postMessage({ type: 'setVoiceSensitivity', value: gate });
+}
+
+window.onVoiceSensitivityInput = function(v) {
+  const gate = _vpSliderToGate(v);
+  _vpEnergyGate = gate;
+  const val = document.getElementById('voice-sensitivity-val');
+  if (val) val.textContent = gate.toFixed(3);
+};
+
+window.onVoiceSensitivityChange = function(v) {
+  _vpSetSensitivity(_vpSliderToGate(v), true);
+};
+
 let _voiceToggleLock = false;
 window.toggleVoiceInput = function() {
   console.log('[grom voice] toggleVoiceInput called, lock:', _voiceToggleLock, 'state:', _voiceState);
@@ -1463,7 +1488,11 @@ window.addEventListener('message', e => {
     } break;
     case 'voiceState': _setVoiceState(m.state); break;
     case 'voiceDownload': _setVoiceDownload(m.text); break;
-    case 'voiceModelConfig': _vpModelId = m.model || 'tiny.en'; _vpSelectedId = _vpModelId; _vpUpdateModelUI(); break;
+    case 'voiceModelConfig': {
+      _vpModelId = m.model || 'tiny.en'; _vpSelectedId = _vpModelId; _vpUpdateModelUI();
+      if (typeof m.sensitivity === 'number') _vpSetSensitivity(m.sensitivity, false);
+      break;
+    }
     case 'voiceAudioStart': {
       _vpAllPcm = null; _vpBusy = false; _vpPendingFinal = false;
       const _existingPrompt = document.getElementById('prompt');
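
The slider in this diff maps integer positions 1 to 100 onto gate values 0.001 to 0.100. The sketch below restates the two mapping functions from `media/main.js` under shorter, hypothetical names and checks that every slider position survives a round trip through the gate value, which is what keeps the slider from drifting when a stored value is written back to it. The `roundTripsCleanly` and `formatGate` helpers are added here for illustration only.

```javascript
// Illustrative sketch: same arithmetic as _vpSliderToGate / _vpGateToSlider,
// under shorter hypothetical names.
const sliderToGate = (v) => Math.round(Number(v)) / 1000;
const gateToSlider = (g) => Math.round(g * 1000);

// Hypothetical helper: does every slider step map back to itself?
function roundTripsCleanly() {
  for (let v = 1; v <= 100; v++) {
    if (gateToSlider(sliderToGate(v)) !== v) return false;
  }
  return true;
}

// Hypothetical helper: the label text shown next to the slider.
const formatGate = (gate) => gate.toFixed(3);

// Range inputs report their value as a string, hence Number(v) in sliderToGate.
console.log(formatGate(sliderToGate('10'))); // prints "0.010"
console.log(roundTripsCleanly());
```

`Math.round` in `gateToSlider` absorbs the small floating-point error that multiplying by 1000 can introduce. A gate value set by hand between two steps snaps to the nearest slider position, while the label still shows the stored value to three decimals.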

media/styles.css

Lines changed: 4 additions & 0 deletions
@@ -116,6 +116,10 @@ mark.search-highlight { background: rgba(232, 168, 56, 0.4); color: inherit; bor
 .voice-mic-on { border-color: #4ec9b0 !important; color: #4ec9b0 !important; opacity: 1 !important; }
 .voice-mic-off { opacity: 0.45 !important; }
 .voice-action-result { font-size: 11px; color: var(--vscode-descriptionForeground); margin-top: 6px; min-height: 16px; }
+.voice-sensitivity-row { display: flex; align-items: center; gap: 8px; margin-top: 8px; }
+.voice-sensitivity-label { font-size: 11px; color: var(--vscode-descriptionForeground); white-space: nowrap; }
+#voice-sensitivity-slider { flex: 1; accent-color: var(--vscode-button-background); cursor: pointer; }
+#voice-sensitivity-val { font-size: 11px; color: var(--vscode-descriptionForeground); min-width: 36px; text-align: right; }
 #input-container { padding: 15px; flex-shrink: 0; }
 .input-box { background: var(--item-bg); border: 1px solid var(--border); border-radius: 12px; padding: 12px; display: flex; flex-direction: column; box-shadow: 0 4px 12px rgba(0,0,0,0.1); }
 textarea { width: 100%; background: transparent; color: inherit; border: none; outline: none; resize: none; min-height: 20px; max-height: 200px; font-family: inherit; font-size: 13px; }

media/voice-worker.js

Lines changed: 0 additions & 72 deletions
This file was deleted.

media/webview.html

Lines changed: 5 additions & 0 deletions
@@ -179,6 +179,11 @@
       <button id="remove-ffmpeg-btn" class="voice-remove-btn" onclick="window.removeFfmpeg()" style="display:none">Remove ffmpeg</button>
       <button id="mic-toggle-btn" class="voice-remove-btn voice-mic-on" onclick="window.toggleMicVisibility()" title="Toggle the mic button in the toolbar">● Mic on</button>
     </div>
+    <div class="settings-row voice-sensitivity-row">
+      <label for="voice-sensitivity-slider" class="voice-sensitivity-label">Mic sensitivity</label>
+      <input type="range" id="voice-sensitivity-slider" min="1" max="100" step="1" value="10" oninput="window.onVoiceSensitivityInput(this.value)" onchange="window.onVoiceSensitivityChange(this.value)" title="Energy gate — lower is more sensitive">
+      <span id="voice-sensitivity-val">0.010</span>
+    </div>
     <div id="clear-voice-result" class="voice-action-result"></div>
   </div>
   <div class="settings-section">

package.json

Lines changed: 7 additions & 0 deletions
@@ -434,6 +434,13 @@
   "default": "tiny.en",
   "description": "Which Whisper model to use for local voice transcription. English-only (.en) models are more accurate for English speakers. The model is downloaded once and cached in your browser session."
 },
+"grom.voiceSensitivity": {
+  "type": "number",
+  "default": 0.010,
+  "minimum": 0.001,
+  "maximum": 0.100,
+  "description": "Mic energy gate for voice input (RMS threshold). Lower = more sensitive (picks up quiet speech and background noise). Higher = less sensitive (ignores noise but may miss quiet speech). Default 0.010 works for most setups; raise to 0.020–0.030 if phantom transcriptions appear."
+},
 "grom.ffmpegPath": {
   "type": "string",
   "default": "",
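
The schema above declares a range of 0.001 to 0.100 for `grom.voiceSensitivity`, but nothing in this diff clamps the value before it is used. Assuming an out-of-range or non-numeric value could still reach the code (for example from a hand-edited settings file), a defensive helper might look like the sketch below. `clampSensitivity` and its constants are hypothetical and not part of this commit.

```javascript
// Illustrative sketch only: clampSensitivity is a hypothetical helper.
// The bounds and default mirror the package.json schema in this commit.
const GATE_MIN = 0.001;
const GATE_MAX = 0.100;
const GATE_DEFAULT = 0.010;

// Fall back to the default for non-numeric input; otherwise clamp to the range.
function clampSensitivity(value) {
  if (typeof value !== 'number' || !Number.isFinite(value)) return GATE_DEFAULT;
  return Math.min(GATE_MAX, Math.max(GATE_MIN, value));
}
```

Applying such a helper once, where the setting is read, would keep the slider and the energy gate within the documented range.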

src/provider.ts

Lines changed: 5 additions & 1 deletion
@@ -183,7 +183,8 @@ export class LocalChatViewProvider implements vscode.WebviewViewProvider {
 this._loadAllSessions();
 this._updateActiveContext();
 const voiceModel = vscode.workspace.getConfiguration('grom').get<string>('voiceModel', 'tiny.en');
-webviewView.webview.postMessage({ type: 'voiceModelConfig', model: voiceModel });
+const voiceSensitivity = vscode.workspace.getConfiguration('grom').get<number>('voiceSensitivity', 0.010);
+webviewView.webview.postMessage({ type: 'voiceModelConfig', model: voiceModel, sensitivity: voiceSensitivity });
 webviewView.webview.postMessage({ type: 'voiceFfmpegStatus', present: !!findFfmpeg(this._context.globalStorageUri.fsPath) });
 const isDev = this._context.extensionMode === vscode.ExtensionMode.Development;
 if (isDev || !this._context.globalState.get('grom.welcomed')) {
@@ -529,6 +530,9 @@ export class LocalChatViewProvider implements vscode.WebviewViewProvider {
 case 'setVoiceModel':
   void vscode.workspace.getConfiguration('grom').update('voiceModel', data.model, vscode.ConfigurationTarget.Global);
   break;
+case 'setVoiceSensitivity':
+  void vscode.workspace.getConfiguration('grom').update('voiceSensitivity', data.value, vscode.ConfigurationTarget.Global);
+  break;
 case 'removeFfmpeg': {
   const ffmpegDir = path.join(this._context.globalStorageUri.fsPath, 'ffmpeg');
   try {
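
Taken together, the `media/main.js` and `src/provider.ts` changes form a round trip: the webview posts `setVoiceSensitivity` when the slider is released, the extension host persists it, and on the next startup the host pushes the stored value back inside `voiceModelConfig`. The sketch below simulates that flow in plain JavaScript. `fakeConfig`, `hostOnMessage`, and `webviewOnMessage` are stand-ins invented for illustration. It does not use the real VS Code API.

```javascript
// Illustrative simulation only: no real VS Code API is used here.

// Stand-in for the persisted configuration store (hypothetical).
const fakeConfig = new Map([['voiceSensitivity', 0.010]]);

// Webview-side state, playing the role of _vpEnergyGate in media/main.js.
let energyGate = 0.010;

// Extension-host side: persist a value sent by the webview.
function hostOnMessage(data) {
  switch (data.type) {
    case 'setVoiceSensitivity':
      fakeConfig.set('voiceSensitivity', data.value);
      break;
  }
}

// Webview side: apply the config pushed by the host on startup.
// As in the diff, a missing or non-numeric sensitivity is ignored.
function webviewOnMessage(m) {
  switch (m.type) {
    case 'voiceModelConfig':
      if (typeof m.sensitivity === 'number') energyGate = m.sensitivity;
      break;
  }
}

// 1. The user releases the slider at position 25; the webview posts the gate.
hostOnMessage({ type: 'setVoiceSensitivity', value: 25 / 1000 });

// 2. Next startup: the host reads the stored value and pushes it back.
webviewOnMessage({
  type: 'voiceModelConfig',
  model: 'tiny.en',
  sensitivity: fakeConfig.get('voiceSensitivity'),
});
```

The `typeof m.sensitivity === 'number'` guard means a `voiceModelConfig` message that omits `sensitivity` leaves the current gate untouched.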
