Cloud-Powered Dictation: Fast STT for Old Linux Hardware with Copilot & Claude


Context

I work daily with GitHub Copilot and Claude Code, and like many developers, I often think faster than I type. Speech‑to‑Text (STT) is an obvious productivity multiplier — but there’s a catch:

Unlike traditional local STT solutions that struggle on weak hardware, this approach leverages cloud processing for near-instant results.

  • My main Linux laptop is old (Dell Vostro 14‑3468)

  • CPU is weak, no GPU acceleration

  • I want speed and accuracy, not fancy UI

  • I want it to work everywhere, not inside a specific app

This post documents a minimal, fast, and reliable STT setup that works perfectly on old hardware and integrates seamlessly into a modern coding workflow.

Design Goals

Before touching any tools, I defined strict constraints:

  • ⚡ Low latency (sub‑second for short dictations)

  • 🧠 High accuracy (English & Spanish)

  • 🖥️ Works on old CPUs

  • ⌨️ Keyboard‑driven (no mouse, no UI)

  • 🌍 Global (works in Copilot, Claude Code, editors, terminals)

  • 🧩 Wayland‑compatible (no X11 hacks)

Offline STT was nice to have, but not required.

Why Local Whisper GUIs Didn’t Work

I tested several local Whisper-based tools (GUI and CLI wrappers). They worked, but:

  • ❌ Slow on CPU‑only machines

  • ❌ Heavy UIs

  • ❌ Poor integration with keyboard workflows

On old hardware, local inference is the bottleneck, not disk or RAM.

So instead of fighting physics, I changed the architecture.

Architecture: Thin Client + API STT

The winning approach:

Mic → WAV → OpenAI STT API → Clipboard → Paste Anywhere

Key idea:

Let the laptop do only audio capture. Let the cloud do inference.

This gives:

  • Near‑instant transcription for short prompts

  • Consistent accuracy

  • Minimal local resource usage

Core Tools

  • Ubuntu Linux (Wayland)

  • sox (rec, play) for audio capture & beeps

  • OpenAI STT API (gpt-4o-mini-transcribe)

  • wl-copy for clipboard integration

  • Keyboard shortcuts (system‑level)

No window focus tricks. No UI automation. No hacks.

The Dictation Script

This Python script does exactly one thing:

  • Record audio until I stop it

  • Transcribe it

  • Put the text in the clipboard

  • Signal readiness with a sound

import subprocess
import tempfile
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def beep(freq, dur=0.12):
subprocess.run(
["play", "-nq", "synth", str(dur), "sine", str(freq)],
check=False,
)
with tempfile.NamedTemporaryFile(suffix=".wav") as f:
beep(1200) # start recording
try:
subprocess.run(
["rec", "-q", f.name, "rate", "16000", "channels", "1"],
check=True,
)
except KeyboardInterrupt:
pass
beep(600) # stop recording
with open(f.name, "rb") as audio:
r = client.audio.transcriptions.create(
model="gpt-4o-mini-transcribe",
file=audio,
)
text = r.text.strip()
subprocess.run(
["wl-copy"],
input=text,
text=True,
check=True,
)
beep(1500) # clipboard ready

Push‑to‑Talk With Two Shortcuts

To make this frictionless, I use two global keyboard shortcuts:

▶ Start Dictation

#!/usr/bin/env bash
source ~/tools/stt-dictation/.venv/bin/activate
python ~/tools/stt-dictation/dictate.py

Shortcut example:

  • Super + D

⏹ Stop Dictation

#!/usr/bin/env bash
pkill -INT rec

Shortcut example:

  • Super + S

This cleanly stops recording without killing Python, allowing transcription to finish.

Audio Feedback (Why It Matters)

I intentionally avoided notifications or UI.

Instead, sound cues communicate state instantly:

  • 🔊 High beep → recording started

  • 🔊 Low beep → recording stopped

  • 🔔 Single high beep → text ready to paste

This avoids a classic failure mode:

Pasting before transcription finishes.

With the final beep, I know exactly when Ctrl+V is safe.

Sound cues work particularly well for coding workflows because they don’t require visual attention or interrupt your focus on the screen.

Performance & Cost

For short prompts (

  • ⏱️ End‑to‑end latency: ~1 second

  • 💲 Cost per dictation: ~$0.0005 (at $0.006 per minute for gpt-4o-mini-transcribe, a 5-second dictation costs ~$0.0005)

Even with heavy daily usage, monthly cost is negligible.

At this scale, latency matters far more than cost.

Why This Works Well for Copilot & Claude Code

  • Global clipboard → works everywhere

  • No dependency on editor plugins

  • Perfect for:

Long prompts

  • Refactoring instructions

  • Natural language problem descriptions

It feels like talking to the IDE, not typing into it.

Lessons Learned

  • Old hardware is fine if you keep it thin

  • Cloud inference beats local CPU every time

  • Keyboard‑first workflows matter

  • Audio feedback > visual feedback

  • Simple pipelines beat complex tools

Final Thoughts

This setup turned an aging laptop into a high‑quality dictation workstation for modern AI‑assisted development.

No heavyweight apps. No vendor lock‑in. No UI friction.

Just:

Think → Speak → Paste → Code

Key Takeaways:

  • Cloud processing beats local CPU for speed on old hardware

  • Keyboard-driven workflows integrate seamlessly with coding

  • Audio feedback provides instant state awareness

  • Minimalist design ensures reliability and low maintenance

If you’re working on Linux with limited hardware and heavy AI usage, this approach is worth copying.