Show HN: Localvoxtral – Local real-time dictation on macOS with streaming STT


1 points by T0mSIlver 132 days ago | 2 comments

I built a native macOS menu bar app for real-time dictation that can run fully on-device.

Most dictation tools, even local ones, use Whisper or similar offline models: you record, then wait for the transcript. Localvoxtral uses Mistral's Voxtral Realtime, one of the first open-source speech models with a natively streaming architecture. Words appear as you speak, not after you stop. It feels closer to someone typing along as you talk.

Press a shortcut, speak, and text gets typed directly into whatever app you're in. No cloud, no subscription, no data leaving your machine.

Two backend options:

- voxmlx on Apple Silicon: I forked voxmlx to add a WebSocket server and memory optimizations. Runs a 4-bit quantized model on an M1 Pro. Audio and inference stay fully on-device.
- vLLM on NVIDIA GPU: tested on an RTX 3090, noticeably faster.

The app is native Swift (~97%), lives in the menu bar, and stays out of your way. Configurable shortcut, mic selection, auto-paste.

GitHub: https://github.com/T0mSIlver/localvoxtral

Pre-built DMG available in Releases.

T0mSIlver 132 days ago

Some technical context and where this is headed.

Why streaming matters for dictation. Whisper and most open-source STT models use bidirectional attention, meaning they need the full audio clip before they can transcribe anything. You get your text after you stop talking, usually with a noticeable delay. Voxtral Realtime takes a different approach: it has a causal audio encoder that processes audio left-to-right as it arrives. At 480ms delay it matches offline models on accuracy (FLEURS benchmark), but you see text appearing while you're still mid-sentence. For dictation this changes a lot. You can catch mistakes in real time, and the feedback loop feels natural instead of disconnected.
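
To make the bidirectional vs. causal difference concrete, here is a toy sketch in Python. It is not Voxtral's or Whisper's actual code: `attention_mask` and `frames_needed` are hypothetical helpers, and the real encoder's windowing and the 480ms delay setting are not modeled. It only shows how much audio has to exist before a given frame can be encoded under each kind of mask.

```python
def attention_mask(num_frames: int, causal: bool) -> list[list[bool]]:
    """mask[i][j] is True when frame i is allowed to attend to frame j."""
    return [
        [(j <= i) if causal else True for j in range(num_frames)]
        for i in range(num_frames)
    ]


def frames_needed(mask: list[list[bool]], i: int) -> int:
    """Number of frames that must have arrived before frame i can be encoded."""
    return max(j for j, allowed in enumerate(mask[i]) if allowed) + 1


clip = 5  # a toy clip of five audio frames
bidirectional = attention_mask(clip, causal=False)
causal = attention_mask(clip, causal=True)

# Bidirectional: even the first frame depends on the last one, so nothing
# can be encoded until the whole clip has been recorded.
print(frames_needed(bidirectional, 0))
# Causal: the first frame depends only on itself, so encoding can start
# as soon as that frame arrives.
print(frames_needed(causal, 0))
```

With the bidirectional mask, every frame needs the whole clip. With the causal mask, frame i only needs frames 0 through i, which is what lets text show up mid-sentence.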

The app connects to backends via the OpenAI Realtime API WebSocket protocol. It captures audio from your mic, streams it over the WebSocket, and receives partial transcripts that get inserted into your active text field live. Any OpenAI Realtime-compatible server works.
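
A rough sketch of the message flow, in Python rather than the app's Swift and with no real socket involved. The `input_audio_buffer.append` event with base64-encoded audio follows the OpenAI Realtime API. The PCM16 format and the shape of the transcript events (a type ending in `.delta` carrying a `delta` string) are assumptions here, and a given backend may name them differently. `audio_append_event` and `TranscriptAccumulator` are made-up names for illustration, not code from localvoxtral.

```python
import base64
import json


def audio_append_event(pcm16_chunk: bytes) -> str:
    """Wrap one chunk of raw mic audio as a JSON text frame for the server."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })


class TranscriptAccumulator:
    """Folds partial-transcript events into the text seen so far."""

    def __init__(self) -> None:
        self.text = ""

    def handle(self, raw_message: str) -> str:
        event = json.loads(raw_message)
        # Assumption: partial text arrives in a "delta" field on events
        # whose type ends in ".delta"; all other events are ignored.
        if event.get("type", "").endswith(".delta"):
            self.text += event.get("delta", "")
        return self.text


# Outgoing: one JSON frame per captured audio chunk.
frame = audio_append_event(b"\x00\x01\x02\x03")

# Incoming: each delta extends the text that gets typed into the active field.
acc = TranscriptAccumulator()
acc.handle(json.dumps({"type": "transcription.delta", "delta": "Hello"}))
acc.handle(json.dumps({"type": "transcription.delta", "delta": " world"}))
print(acc.text)
```

In a real client the frames would go out over the WebSocket as they are captured, and each incoming message would be passed to `handle` so that only the newly arrived text gets inserted.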

The voxmlx fork. The original voxmlx by Awni Hannun does local Voxtral inference on Apple Silicon via MLX, but it was CLI-only. I added a WebSocket server that speaks the OpenAI Realtime protocol so localvoxtral (or any compatible client) can connect to it. I also added memory management to avoid OOM on longer sessions. Fork is here: https://github.com/T0mSIlver/voxmlx. I'd like to get the server piece upstreamed eventually.

Latency. On M1 Pro with a 4-bit quantized model, first words appear within roughly 200 to 400ms. On RTX 3090 via vLLM it's faster. Both feel responsive enough for natural dictation.

What's next. Right now you have to start the server yourself before using the app. I want to add app-managed local serving (start/stop/model download) so it's truly one-click. If anyone has experience bundling Python/MLX processes into macOS apps cleanly, I'd love to hear your approach.

Happy to answer questions.

Leftium 127 days ago

> If anyone has experience bundling Python/MLX processes into macOS apps cleanly, I'd love to hear your approach.

This is an example python app wrapped in a (macOS) native shell using Electrobun: https://github.com/blackboardsh/audio-tts

Can you report how well Voxtral Realtime compares to the other currently supported streaming models? https://rift-transcription.vercel.app/local-setup

- Subjectively I've found Web Speech API feels the best (accuracy/latency), followed by moonshine medium

OpenAI Realtime WS API is on the roadmap, so I might be able to compare via RIFT in the future...