shipped · runs in a tray

Vox

Python 3.13 + faster-whisper + Win32 SendInput · Right-Ctrl hotkey

Built for:
One person. Me, at the keyboard, all day, every day, when typing is slower than speaking.
Not built for:
Anyone who needs an account, a cloud upload, or a multi-language translation surface. Vox transcribes English at the cursor, full stop.

Hold right-Ctrl, speak, release. Whatever you said appears at the cursor — in any application, in any text field, with sub-second transcription on a stock laptop CPU. No account, no cloud, no listening when you don’t want it to.

§ I

The problem

Cloud dictation tools are accurate, fast, and listening. The cost of accuracy and speed is that the audio leaves the machine; the cost of leaving the machine is that someone else gets a copy of every sentence I say at my desk. That’s a bad trade for daily use.

Local dictation has been good enough for a year and a half. Vox is a thin shim around faster-whisper with one trick: a single hotkey that turns on the mic, captures audio while held, transcribes on release, and types the result at the cursor via Win32 SendInput. No window, no menu, no decision to make.

§ II

Decisions

  1. kept

    Push-to-talk over voice activity detection. PTT is unambiguous; VAD picks up the radio, the dog, the air conditioning, my exhale at 11 PM. The wrong thing transcribed is worse than no transcription.

  2. kept

    A lockfile (voicetype.active) at the project root that exists only while the mic is hot. Other voice tools on the same machine respect the lockfile and mute themselves; without it, two listeners would fight for the mic and both would lose.

  3. cut

    A floating UI window. The status feedback is a tray icon — grey idle, blue recording, yellow processing, green flash on success. A real window would invite configuration, and configuration is the enemy of a tool I want to forget I’m using.
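The lockfile protocol in decision 2 is small enough to sketch in full. The helper names here are illustrative, not Vox's actual module; the only grounded detail is the filename at the project root:

```python
from pathlib import Path

# Illustrative location; Vox keeps voicetype.active at its project root.
LOCKFILE = Path("voicetype.active")

def claim_mic() -> None:
    """Vox side: mark the mic hot for the duration of a press."""
    LOCKFILE.touch()

def release_mic() -> None:
    """Vox side: safe to call twice; missing_ok makes it idempotent."""
    LOCKFILE.unlink(missing_ok=True)

def mic_is_claimed() -> bool:
    """Peer-tool side: check before opening an input stream."""
    return LOCKFILE.exists()
```

A peer tool calling `mic_is_claimed()` before each capture window is the whole contract — no IPC, no daemon, just a file that exists while the key is down.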

§ III

System

Stack — current pins.
Layer        Implementation         Purpose
Hotkey       Global Win32 hook      Right-Ctrl press / release events
Capture      PyAudio                16 kHz mono WAV ring buffer while held
Transcribe   faster-whisper         Local Whisper inference, CPU or CUDA
Inject       Win32 SendInput        Types into the focused text field
Status       Tray icon (Tkinter)    Four colors, no chrome, no decisions
Coord        Lockfile               voicetype.active · respected by peer tools
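The capture layer's ring buffer keeps memory bounded no matter how long the key is held. A minimal sketch, with the chunk size and retention cap assumed rather than taken from Vox — only the 16 kHz mono rate is from the table above:

```python
from collections import deque

SAMPLE_RATE = 16_000      # 16 kHz mono, as in the capture layer
CHUNK_FRAMES = 1_024      # frames per capture callback (assumed)
MAX_SECONDS = 30          # retention cap (assumed)

class RingBuffer:
    """Holds the most recent MAX_SECONDS of audio; older chunks fall off."""

    def __init__(self):
        max_chunks = SAMPLE_RATE * MAX_SECONDS // CHUNK_FRAMES
        self._chunks = deque(maxlen=max_chunks)

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)    # deque silently drops the oldest when full

    def drain(self) -> bytes:
        """Stop-of-recording path: join everything and reset."""
        audio = b"".join(self._chunks)
        self._chunks.clear()
        return audio
```

The `maxlen` deque does all the work: a press held past the cap simply keeps the last thirty seconds, so a stuck key can never eat the machine's memory.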
vox/hotkey_loop.py · python · push-to-talk loop
# One global hotkey: right-Ctrl. Press starts recording and writes
# the lockfile so peer voice tools mute themselves. Release stops,
# transcribes locally, types into whatever has focus.
from pathlib import Path

from pynput.keyboard import Key   # pynput's listener wraps the Win32 hook

LOCKFILE = Path("voicetype.active")
# recorder, whisper, set_tray, SendInput are module-level helpers
# defined elsewhere in vox/.

def on_press(key):
    if key != Key.ctrl_r or recorder.active:
        return
    LOCKFILE.touch()                         # peer tools see this
    set_tray("recording")
    recorder.start()                         # 16 kHz mono ring buf

def on_release(key):
    if key != Key.ctrl_r or not recorder.active:
        return
    audio = recorder.stop()
    LOCKFILE.unlink(missing_ok=True)
    set_tray("processing")
    text = whisper.transcribe(audio, language="en").strip()
    if text:
        SendInput(text)                      # type at the cursor
        set_tray("done")
    else:
        set_tray("idle")
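The `SendInput(text)` call above hides the interesting part: Win32 types arbitrary text by sending `KEYEVENTF_UNICODE` key events, one down/up pair per UTF-16 code unit. A sketch of just the event construction, with the actual `user32.SendInput` call omitted so the logic stays portable; the constants are the real Win32 flag values:

```python
KEYEVENTF_UNICODE = 0x0004   # Win32: wScan carries a UTF-16 code unit
KEYEVENTF_KEYUP   = 0x0002

def unicode_key_events(text: str) -> list[tuple[int, int]]:
    """Build (wScan, dwFlags) pairs for SendInput: one key-down and one
    key-up per UTF-16 code unit, so surrogate pairs survive too."""
    raw = text.encode("utf-16-le")
    units = [int.from_bytes(raw[i:i + 2], "little")
             for i in range(0, len(raw), 2)]
    events = []
    for unit in units:
        events.append((unit, KEYEVENTF_UNICODE))
        events.append((unit, KEYEVENTF_UNICODE | KEYEVENTF_KEYUP))
    return events
```

Because the scan code carries the code unit directly, this path is keyboard-layout independent — the focused field receives the exact characters the model produced.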
vox.session.log · ndjson · 0.74 s release-to-text
{"t":"14:02:18.402","event":"press","key":"right_ctrl","lockfile":"created"}
{"t":"14:02:18.418","event":"capture_start","sr":16000,"channels":1}
{"t":"14:02:21.071","event":"release","key":"right_ctrl","duration_ms":2653}
{"t":"14:02:21.084","event":"transcribe_start","model":"distil-whisper-large-v3","device":"cpu"}
{"t":"14:02:21.812","event":"transcribe_done","ms":728,"chars":62}
{"t":"14:02:21.815","event":"send_input","method":"win32_unicode","chars":62}
{"t":"14:02:21.821","event":"done","total_ms":1419,"text":"the cost of accuracy is that the audio leaves the machine"}
FIGURE. Press → release → typed at cursor in 1.4 seconds end-to-end. CPU only; no model warm-up keeping the laptop awake.
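Because the session log is ndjson, post-hoc analysis is one `json.loads` per line. A sketch of collapsing one dictation into the numbers that matter, keyed on the event names the log above actually uses:

```python
import json

def summarize(log_lines):
    """Reduce one dictation's ndjson events to its key latency figures."""
    by_event = {}
    for line in log_lines:
        event = json.loads(line)
        by_event[event["event"]] = event
    return {
        "held_ms": by_event["release"]["duration_ms"],
        "transcribe_ms": by_event["transcribe_done"]["ms"],
        "total_ms": by_event["done"]["total_ms"],
        "chars": by_event["send_input"]["chars"],
    }
```

Run against the session shown above, this yields 2653 ms held, 728 ms transcribing, 1419 ms total, 62 characters typed.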
[Image: Vox tray-icon four states tiled — idle, recording at 2.3s, transcribing on whisper-large-v3, done in 1.4s end-to-end — and a Windows taskbar fragment showing the icon embedded among system tray icons.]
FIGURE 1. The four states of the tray icon. No window, no menu, no decision to make — push, speak, release, and the typed sentence shows up at the cursor.
[Image: Vox waveform during recording — line-only audio spectrum across eight seconds, live state header with a sage pulse dot, interim transcription in body sans below the waveform, mono telemetry strip naming the model and throughput.]
FIGURE 2. The recording window. The waveform is decoration; the interim transcription underneath is the real read-out. The most recent word fades on its own — a small visual signal that the model has not closed on it yet.
[Image: Vox transcript log — fourteen past dictations as rows with timestamp, duration, the first two lines of transcribed text, and a small destination-app indicator showing where the text was dropped (VS Code, Discord, Obsidian, Brave).]
FIGURE 3. The transcript log. Every dictation is searchable; the destination tag tells the user where each one landed. Today’s footer reads forty-two dictations, six hundred and twelve characters across four destinations — a small index of attention.

Acknowledgments

The signal chain reads like this: PyAudio captures, faster-whisper transcribes against the OpenAI Whisper weights, Win32 SendInput types into whatever has focus, and a Tkinter tray icon supplies the four-color status read-out. Push, speak, release, type — five hops, none of them mine, all of them small enough to read in an afternoon.

← Index