Tailscale - iAmDev

I dictated the first draft of this post on a walk. I didn’t type a word of it. I described what I wanted out loud, an AI running on my server wrote it, read its replies back into my earbuds, and we went back and forth — me on a footpath, it in a terminal — until the thing was done. Then it saved itself as a draft on this blog.

Six months ago that would have read like science fiction to me. Today it’s just how I work. This post is two things at once: the story of how my relationship with AI went from typing commands at a desk to talking to it from anywhere, and a complete, replicable build guide for the system that made the second part possible. It’s a handful of small Python files and two shell hooks. You can build your own by the end of this.

The friction was no longer the AI’s intelligence. The friction was me: the keyboard, the chair, the desk. The best ideas never show up while you’re sitting at the desk.

Part one: how I used to do this

When AI coding tools first arrived, the keyboard was the interface and that felt like the whole point. Autocomplete finished your line. Then chat windows showed up: you typed a question, pasted some code, read an answer, copied it back, repeat. Clever, but it was a lot of shuttling text back and forth by hand.

Then came agents — Claude Code in particular — that live in your terminal, read your files, run commands, and actually do the work instead of just describing it. That was a genuine leap. But notice what didn’t change: I was still hunched over a keyboard at a desk, typing everything.

The bottleneck had quietly moved. It wasn’t the model’s reasoning — it was the input device, and the input device was me. The best ideas arrive on a walk, in the kitchen, halfway through making coffee. By the time I’d sat down to type them out, half of them had evaporated. So I asked a different question: what if I could just talk to it, the way you’d talk to a colleague on the phone?

What I built

A two-way voice bridge to Claude Code. I speak into a VoIP app on my phone; my words are transcribed and typed into a live Claude Code session running on my server; every reply Claude writes gets read straight back to my phone in a natural voice. The same channel quietly carries text chat too, so when I’m somewhere I can glance at a screen, the whole conversation is right there as well.

Here is the entire data flow:

  PHONE (Mumla / Mumble app)
        │  push-to-talk audio  ▲  synthesized voice + text
        ▼                      │
  MURMUR  (self-hosted Mumble server, port 64738, over Tailscale)
        │                      ▲
        ▼                      │
  bot.py  (the bridge)
     IN : audio ─► push-to-talk split ─► Whisper (speech→text) ─► tmux send-keys ─┐
                                                                                    ▼
                                                                        CLAUDE CODE (in tmux)
                                                                                    │
     OUT: tail the session .jsonl ◄── every assistant reply it writes ◄────────────┘
              │
              ├─► Kokoro neural TTS ─► audio ─► phone
              └─► Markdown→HTML ─────► text  ─► phone

The trick that makes the whole thing simple is on the output side. The bridge never calls an API to find out what Claude said. It just tails the transcript file Claude Code already writes to disk, and speaks each new line. No integration, no glue API — the AI writes to a log, and something else reads the log aloud. Once you see it that way, the rest is plumbing.

The moving parts:

Mumble / Murmur — the audio + text transport between phone and server
faster-whisper — speech-to-text, running on the CPU
Kokoro-82M — a small neural text-to-speech model for the voice coming back
tmux — holds the live Claude Code session so we can type into it programmatically
Two Claude Code hooks — one teaches Claude it’s “on a call,” one keeps the audio from breaking
A handful of small Python files — pymumble glue that wires it all together

What you’ll need to follow along: a Linux box (a cheap VPS or a home server) running Claude Code, Python 3.12, a phone, and about an afternoon. Let’s build it.

The fast path: one script (Ubuntu)

Prefer not to do all this by hand? I’ve packaged the whole build — every script and config you’ll see below, plus the Mumble server, the Python environment, the voice models and a systemd service — into a single self-contained installer. It’s written for Ubuntu (it uses apt and systemd); on Debian it works as-is, but on any other Linux you’ll need to adjust the package-install step yourself. Download it, read it first (always read a script before running it as root), then run:

# 1. on a machine where you're already logged into Claude, mint a token:
claude setup-token                       # prints sk-ant-oat01-...

# 2. on your Ubuntu server, as root:
curl -fsSL https://iamdev.net/wp-content/uploads/2026/06/voice-claude-setup.sh -o setup.sh
sudo CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..." bash setup.sh

That stands the whole stack up and enables a voice-claude service so it comes back on boot. Afterwards, edit config.yaml to set a real Mumble password and point claude.cwd at your project. Everything below is what that script assembles, step by step — still worth reading even if you take the shortcut, because the sharp edges are where the hours go. (This will live on GitHub eventually; for now the link above is the canonical copy.)

Step 1 — Run Claude Code headless, in tmux

Everything hangs off one idea: keep a real, interactive Claude Code session alive in a place we can reach programmatically. tmux is perfect for that — it’s a terminal that keeps running after you disconnect, and crucially you can send keystrokes into it from the outside. That’s how we’ll “type” the transcribed speech.

# start a persistent Claude session in a tmux window called "claude"
UUID=$(cat /proc/sys/kernel/random/uuid)
tmux new-session -d -s voice -n claude -c /home/you/project \
  "claude --session-id $UUID"

# you can attach and watch any time:
tmux attach -t voice        # Ctrl-b d to detach again

Pinning the session id with --session-id matters: it tells us exactly which transcript file to tail later, and it lets you jump into the very same conversation from a normal terminal with claude --resume <uuid> whenever you want.

The headless auth gotcha

On a headless box there’s no browser to log into, so the interactive login won’t complete. The fix is a long-lived token. On a machine where you’re already logged in, run claude setup-token to mint one (good for about a year), then make it available to the headless session — it has to be in the environment before you launch the Step 1 command:

# on your logged-in machine:
claude setup-token            # prints sk-ant-oat01-...

# on the server, put it where the session will see it:
export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-...."   # in ~/.bashrc

One subtlety that cost me time: systemd and other non-login shells don’t source ~/.bashrc, so a service-launched Claude would start up “not logged in.” My launcher reads the token out of .bashrc on every start and re-exports it, so rotating the token is just editing one line. Keep .bashrc as the single source of truth and you avoid a whole class of confusing “why is it logged out” mornings.

Step 2 — Mumble as the transport

I needed a pipe between my phone and the server that carried both live audio and text, had solid mobile apps, was low-latency, and that I could fully self-host. That’s Mumble — an open-source voice chat system built for gamers, where every millisecond counts. The server is called Murmur; the Android app is Mumla, the iOS one is just Mumble.

sudo apt install mumble-server
sudo systemctl enable --now mumble-server

# key settings in /etc/mumble/mumble-server.ini:
#   port=64738
#   serverpassword=choose-a-strong-one

On your phone, add the server (host = your server, port = 64738, password = the one you set), accept the self-signed certificate on first connect, and switch the app to push-to-talk. Hold the button, say your sentence, release — that release is what tells the bridge your turn is over.

One sharp edge worth saving you: editing mumble-server.ini with sed -i or some editors strips its group ownership, and Murmur then crash-loops with a cryptic “ini could not be opened.” It must stay root:mumble-server mode 640. After any edit: chown root:mumble-server, chmod 640, restart.

Step 3 — The bridge connects (and a Python 3.12 landmine)

First, the project itself: a handful of small Python files — bot.py wires it all together, alongside stt.py (ears), tts.py (voice), claude_io.py (typing in, and reading the transcript out) and a config.yaml for the knobs. Set up a virtualenv and install what they need:

python3.12 -m venv venv
. venv/bin/activate
pip install pymumble faster-whisper webrtcvad soxr numpy pyyaml "setuptools<81"
pip install kokoro-onnx onnxruntime        # neural TTS for Step 7 (onnxruntime runs the model)
sudo apt install espeak-ng jq tmux               # espeak-ng = Kokoro's phonemizer; jq runs the hooks

The bridge is a Python process that joins the Mumble server as if it were just another user, using the pymumble library. The first thing that bit me: pymumble still calls ssl.wrap_socket(), which was removed in Python 3.12. Rather than downgrade Python, I dropped in a tiny compatibility shim that recreates it with a modern SSL context (Mumble’s trust model is certificate pinning, not CA chains, so not verifying the self-signed cert is fine here):

# compat_ssl.py  — import this BEFORE pymumble connects
import ssl

if not hasattr(ssl, "wrap_socket"):
    def wrap_socket(sock, keyfile=None, certfile=None, server_side=False,
                    ssl_version=None, ca_certs=None,
                    do_handshake_on_connect=True,
                    suppress_ragged_eofs=True, ciphers=None, **_):
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        if certfile:
            ctx.load_cert_chain(certfile, keyfile)
        return ctx.wrap_socket(
            sock, server_side=server_side,
            do_handshake_on_connect=do_handshake_on_connect,
            suppress_ragged_eofs=suppress_ragged_eofs)
    ssl.wrap_socket = wrap_socket

(A second, similar trap: webrtcvad needs setuptools<81 because it still imports pkg_resources. Pin it in your venv and move on.)

With the shim in place, connecting is a few lines — and from here the bridge is really just two pipes: one carrying your voice in, one carrying Claude’s replies out.

Step 4 — Ears: turning speech into text

Mumble hands the bridge a stream of audio packets while you hold the talk button. Because I’m using push-to-talk, segmenting your turn is delightfully simple: when the packets stop arriving for ~700 ms, your sentence is done. I buffer the audio, resample it from Mumble’s 48 kHz down to the 16 kHz Whisper wants, and transcribe it with faster-whisper — the base.en model, running int8 on the CPU. No GPU required.

# config knobs that matter for latency vs. accuracy
stt:
  model: base.en       # tiny.en = fastest, small.en = most accurate
  device: cpu
  compute_type: int8
vad:
  mode: ptt            # phone holds the button; we split on the packet gap
  silence_ms: 700      # this much quiet = end of your turn
  min_speech_ms: 350   # ignore little blips and clicks

base.en is the sweet spot: good enough to nail “refactor the auth middleware” while staying fast on a modest VPS. It’s also the biggest memory user in the whole system (~327 MB resident), more than the text-to-speech model, which surprised me.
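The packet-gap rule behind `silence_ms` and `min_speech_ms` can be sketched in a few lines. This is an illustration under assumptions, not the bridge's actual code: the class name `PttSegmenter` is made up, timestamps are passed in explicitly so the logic can be exercised without a clock, and resampling and Whisper are left out.

```python
class PttSegmenter:
    """Group audio packets into utterances by the gap between packets."""

    def __init__(self, silence_ms=700, min_speech_ms=350):
        self.silence_ms = silence_ms
        self.min_speech_ms = min_speech_ms
        self._chunks = []        # buffered PCM bytes for the current turn
        self._speech_ms = 0      # how much audio is buffered so far
        self._last_ms = None     # arrival time of the latest packet

    def feed(self, pcm: bytes, duration_ms: int, now_ms: int) -> None:
        """Call for every incoming audio packet."""
        self._chunks.append(pcm)
        self._speech_ms += duration_ms
        self._last_ms = now_ms

    def poll(self, now_ms: int):
        """Call periodically; returns a finished utterance (bytes) or None."""
        if self._last_ms is None or now_ms - self._last_ms < self.silence_ms:
            return None          # nothing buffered, or the button is still held
        audio, length = b"".join(self._chunks), self._speech_ms
        self._chunks, self._speech_ms, self._last_ms = [], 0, None
        # Drop blips and clicks that are shorter than min_speech_ms.
        return audio if length >= self.min_speech_ms else None
```

In a real bridge, `feed` would presumably be driven by the Mumble sound callback and `poll` by a timer using something like `time.monotonic()`; whatever `poll` returns is what gets resampled and transcribed.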

The first time I held the button, said “list the files in this folder,” and watched the words appear in the terminal and the Enter key press itself — I laughed out loud. The machine was typing for me.

Step 5 — Hands: typing into Claude

Now we have text; we need it inside the Claude session. This is where the tmux choice pays off. tmux send-keys injects keystrokes into a target pane from the outside. The key flag is -l, which sends the text literally so tmux doesn’t try to interpret words like “Enter” as actual key names. Paste the text, pause a beat so the TUI settles, then send a real Enter to submit:

import subprocess
import time

def send_text(self, target, text):
    # -l = literal, so the words aren't parsed as key names
    subprocess.run(["tmux", "send-keys", "-t", target, "-l", text])
    time.sleep(0.18)                 # let the TUI settle before submit
    subprocess.run(["tmux", "send-keys", "-t", target, "Enter"])

That’s the entire input path. Speech became text became keystrokes became a prompt. Claude Code doesn’t know or care that a human didn’t type it.

Step 6 — Eyes: reading Claude’s replies

This is my favourite part, because it’s the part where I expected to need an API and didn’t. Claude Code writes every turn of every conversation to a JSONL transcript on disk. The path is derived from the working directory — every non-alphanumeric character becomes a dash — plus the session id:

~/.claude/projects/<cwd-with-dashes>/<session-id>.jsonl

# e.g.  cwd /home/you/project  +  session id abc123...
~/.claude/projects/-home-you-project/abc123....jsonl

So the “read Claude’s replies” problem becomes the very old, very solved problem of tail -f on a file. The bridge watches that file, and every time a new assistant message lands, it pulls out the text blocks and fires them off to be spoken. It skips Claude’s internal thinking and any sub-agent side-chains, dedupes by message id, and starts reading from the end of the file so it never replays old history:

import json

def _handle(self, raw_line):
    o = json.loads(raw_line)
    if o.get("type") != "assistant" or o.get("isSidechain"):
        return                              # skip non-assistant lines / sub-agent side-chains
    if o.get("uuid") in self._seen:
        return                              # already spoke this one
    self._seen.add(o.get("uuid"))
    for block in o["message"]["content"]:
        if block.get("type") == "text":     # thinking and tool-use blocks fall through
            self.on_text(block["text"])     # → speak it

No webhook, no streaming API, no integration to keep in sync with. The agent writes to a log; we read the log aloud. That’s the whole secret.
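For the "tail the file" half, here is a minimal polling sketch. The helper names `read_new_lines` and `follow` are made up for illustration, and the real bridge may watch the file differently. The one detail that matters is holding back a half-written last line until its newline arrives, so a JSON object is never parsed mid-write.

```python
import io
import time
from typing import Iterator

def read_new_lines(f, carry: str = "") -> tuple[list[str], str]:
    """Read whatever was appended since the last call.

    Returns (complete_lines, leftover). The leftover is a partial line with
    no newline yet; pass it back in as `carry` on the next call.
    """
    data = carry + f.read()
    *lines, leftover = data.split("\n")
    return [ln for ln in lines if ln.strip()], leftover

def follow(path: str, poll_s: float = 0.25) -> Iterator[str]:
    """Yield each new complete line appended to `path`, like tail -f -n 0."""
    with open(path, "r", encoding="utf-8") as f:
        f.seek(0, io.SEEK_END)       # start at the end: never replay history
        carry = ""
        while True:
            lines, carry = read_new_lines(f, carry)
            yield from lines
            if not lines:
                time.sleep(poll_s)   # nothing new yet; check again shortly
```

Each yielded line would then go to a handler like `_handle` above. This sketch follows a single file and does not notice when the active session changes; that is what the pointer file is for.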

(You might wonder how the bridge knows which session id is live. A tiny pointer file — written by the launcher in Step 12 and kept current by a hook in Step 11 — always names the active session. For now, assume it knows.)

Step 7 — Mouth: turning replies into a voice

For the voice coming back, I landed on Kokoro-82M, a small neural text-to-speech model that sounds startlingly natural for its size and runs happily on CPU via ONNX. I picked a US male voice (am_michael) after auditioning a handful.

# one-time: fetch the Kokoro model files into ./voices/
mkdir -p voices
REL=https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0
#   kokoro-v1.0.onnx   (~310 MB)
#   voices-v1.0.bin    (~27 MB, 54 voices)
curl -fsSL "$REL/kokoro-v1.0.onnx" -o voices/kokoro-v1.0.onnx
curl -fsSL "$REL/voices-v1.0.bin"  -o voices/voices-v1.0.bin

My first version used a different engine (Piper) spawned as a fresh command-line process for every reply — which meant reloading the model every single time and waiting ~5 seconds before any audio, even for a five-word answer. The fix was the obvious one in hindsight: load the model once when the bridge starts, keep it warm in memory, and then stream the speech out sentence by sentence as Claude’s reply comes in. Now it’s about 1.7 seconds to the first spoken word, and because synthesis outruns playback, the rest is gapless.

tts:
  engine: kokoro
  model: voices/kokoro-v1.0.onnx
  voices_bin: voices/voices-v1.0.bin
  voice: am_michael      # 54 voices ship in the model file
  speed: 1.0
  lang: en-us
  speak_code_blocks: false
  normalize_tech: true
  max_chars: 4000        # safety ceiling; cut at a sentence boundary, never mid-word

That last knob matters more than it looks. Spoken replies that get hard-truncated mid-word sound broken; cutting at the nearest sentence boundary instead keeps the experience clean on Claude’s occasional long answers.
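One way the sentence-boundary cut could work, as a sketch: the function name `clip_at_sentence` and the simple punctuation regex are assumptions, and real sentence splitting (abbreviations, decimals) is messier than this.

```python
import re

# A sentence ends at . ! or ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")

def clip_at_sentence(text: str, max_chars: int = 4000) -> str:
    """Trim text to at most max_chars, ending on a full sentence if possible."""
    if len(text) <= max_chars:
        return text
    cut = 0
    for m in _SENTENCE_END.finditer(text):
        if m.end() > max_chars:
            break
        cut = m.end()                # last sentence end that still fits
    if cut:
        return text[:cut]
    # No sentence end in range: fall back to the last word boundary.
    space = text.rfind(" ", 0, max_chars + 1)
    return text[:space] if space > 0 else text[:max_chars]
```

The fallback to a word boundary keeps the "never mid-word" promise even when a reply is one enormous sentence.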

Step 8 — Making it sound human, not like a terminal

Here’s a problem you only discover once you’re listening: Claude is a coding agent, so its replies are full of file paths, snake_case identifiers, --flags, version numbers and acronyms. Fed raw to a speech engine, /etc/mumble/server.ini comes out as a slurred mess, and RTF gets read as the word “rtf.” So before any text hits the synthesizer, it goes through a normalizer that rewrites terminal-speak into something a human would actually say:

/etc/mumble/server.ini → “etc, mumble, server dot I N I”
snake_case and kebab-case → spaced-out words
--dry-run → “flag dry run”
1.51.0 → “one point fifty-one point zero”
RTF → spelled “R T F” (from a curated acronym list, so real words like “TODO” are left alone)
long hashes and UUIDs → “a long hash” / “id ending…” instead of reading 40 characters aloud

It’s deliberately conservative and pattern-gated, so ordinary prose passes through untouched. This one unglamorous file is the difference between something that sounds like a colleague and something that sounds like a 1980s text reader.
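To make "pattern-gated" concrete, here is a deliberately tiny sketch covering four of the rules above: paths, long flags, snake/kebab identifiers, and curated acronyms. The regexes, the `ACRONYMS` set, and the function names are all illustrative assumptions; version numbers and hashes are left out because they need number-to-words logic.

```python
import re

ACRONYMS = {"RTF", "TTS", "VPS"}     # hypothetical curated list

def _say_path(m: re.Match) -> str:
    """Turn /etc/mumble/server.ini into 'etc, mumble, server dot I N I'."""
    spoken = []
    for part in (p for p in m.group(0).split("/") if p):
        stem, dot, ext = part.rpartition(".")
        if dot and stem:
            spoken.append(f"{stem} dot {' '.join(ext.upper())}")
        else:
            spoken.append(part)
    return ", ".join(spoken)

def normalize(text: str) -> str:
    # 1. absolute paths (not URLs, not 'and/or')
    text = re.sub(r"(?<![\w./:])/(?:[\w.-]+/)*[\w.-]*\w", _say_path, text)
    # 2. long flags: --dry-run -> flag dry run
    text = re.sub(r"(?<!\S)--([a-z][a-z0-9-]*)",
                  lambda m: "flag " + m.group(1).replace("-", " "), text)
    # 3. snake_case / kebab-case identifiers -> spaced-out words
    text = re.sub(r"\b[a-z][a-z0-9]*(?:[_-][a-z][a-z0-9]*)+\b",
                  lambda m: re.sub(r"[_-]", " ", m.group(0)), text)
    # 4. acronyms on the curated list -> spelled out letter by letter
    text = re.sub(r"\b[A-Z]{2,}\b",
                  lambda m: " ".join(m.group(0)) if m.group(0) in ACRONYMS
                  else m.group(0), text)
    return text
```

Order matters: flags run before the kebab-case rule, otherwise `--dry-run` would lose its inner hyphen first and no longer look like a flag.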

Step 9 — Teaching Claude it’s on a phone call

By default Claude formats for a screen: tables, bullet lists, fenced code blocks, headings. All of that is unlistenable. I needed Claude to know, on these turns, that it’s speaking, not writing — and to answer in short, conversational sentences.

Claude Code has a hooks system for exactly this kind of thing. A UserPromptSubmit hook runs before each of my prompts reaches the model and can inject extra instructions. Mine checks whether this session is the one the bridge is currently driving, and if so, prepends a short “you’re on voice” note:

# voice_mode_hook.sh — a UserPromptSubmit hook
#   only fires for the session the bridge is driving; silent otherwise
if [ "$sid" = "$active_sid" ]; then
  cat <<EOF
[voice mode] You are replying over a text-to-speech bridge: your words are
spoken aloud on a phone. Keep replies SHORT and conversational — 1-3 sentences.
Do NOT use tables, code blocks, markdown, or long lists (they are unreadable
aloud). Give the essential answer plainly; offer to elaborate if asked.
EOF
fi
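The excerpt above uses `$sid` and `$active_sid` without showing where they come from. As a sketch of one plausible answer, here is the same check in Python rather than shell: Claude Code passes hook input as JSON on stdin, including a `session_id` field, and the bridge's pointer file (the `active_session.json` from Step 11) names the session it is driving. The function name, the shortened note text, and the pointer file location are assumptions.

```python
import json
from pathlib import Path

VOICE_NOTE = (
    "[voice mode] You are replying over a text-to-speech bridge. "
    "Keep replies SHORT and conversational."
)

def voice_context(hook_input: dict, pointer_file: Path) -> str:
    """Return the note to inject, or "" when this is not the bridge's session."""
    try:
        active_sid = json.loads(pointer_file.read_text()).get("session_id")
    except (OSError, ValueError):
        return ""                    # no readable pointer yet: stay silent
    if active_sid and hook_input.get("session_id") == active_sid:
        return VOICE_NOTE
    return ""
```

Used as an actual hook script, this would be called with `json.load(sys.stdin)` as the input and its result printed to stdout, where a UserPromptSubmit hook's output is added to the prompt context. The shell version presumably gets the same two values with `jq`.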

Both this hook and the session hook in the next step do nothing until you register them once in ~/.claude/settings.json — the step that’s easiest to forget:

{
  "hooks": {
    "UserPromptSubmit": [
      { "hooks": [ { "type": "command",
                     "command": "/path/to/voice_mode_hook.sh" } ] }
    ],
    "SessionStart": [
      { "hooks": [ { "type": "command",
                     "command": "/path/to/session_start_hook.sh" } ] }
    ]
  }
}

That voice-mode note is genuinely what turns a clunky demo into something you’d actually use. The exact same instruction is, in fact, what shaped the spoken half of the conversation that produced this very post. And because it’s scoped to the bridge’s session id, every other Claude session on the box (normal terminal work, sub-agents) is completely unaffected.

Step 10 — The text channel, for free

Mumble carries text chat over the same connection as the audio, so with a bit more wiring the bridge speaks and types. Inbound chat messages take the exact same path into Claude as transcribed speech; outbound, each reply is mirrored into the chat pane with Markdown rendered to Mumble’s little HTML subset (fenced code → <pre>, **bold** → <b>, and so on).
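A rough sketch of that Markdown-to-HTML step, handling only the two conversions named above plus line breaks. The function names are hypothetical, the `<br/>` handling is an assumption about what renders well in the chat pane, and the fence marker is built with string multiplication only so the example does not contain a literal fence.

```python
import html
import re

FENCE = "`" * 3                       # the Markdown code-fence marker
_FENCED = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL)

def _inline(text: str) -> str:
    """Escape HTML, then convert **bold** and newlines."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)
    return text.replace("\n", "<br/>")

def md_to_mumble_html(md: str) -> str:
    """Convert a small Markdown subset to HTML for a Mumble text message."""
    out, pos = [], 0
    for m in _FENCED.finditer(md):
        out.append(_inline(md[pos:m.start()]))          # prose before the block
        code = html.escape(m.group(1).rstrip("\n"), quote=False)
        out.append(f"<pre>{code}</pre>")                # fenced code -> <pre>
        pos = m.end()
    out.append(_inline(md[pos:]))                       # trailing prose
    return "".join(out)
```

Code blocks are split out before any inline rules run, so asterisks inside code are never mistaken for bold.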

The nice payoff: when you open the app, the bridge backfills the whole conversation so far into the chat, so you’re never staring at a blank screen wondering what you missed while your phone was in your pocket. Voice for when you’re walking; text for when you can look. One conversation, two senses.

Step 11 — Voice commands and self-healing sessions

Two final touches make it feel finished rather than fiddly.

Voice commands. Some things you want to do to the session itself, not say to Claude — like clearing the context to start fresh. So the bridge intercepts a few whole-utterance phrases at the single point all input funnels through, and routes them to a slash-command instead of sending them as a prompt. Say “clear the context” on its own and it types /clear into the TUI. Matching is on the full, normalized utterance, so “can you clear the context for me?” in the middle of a sentence does not trigger it.
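Whole-utterance matching is simple enough to sketch. The phrase table, `_norm`, and `route` are illustrative names rather than the bridge's real ones. Normalizing first matters because speech-to-text output typically arrives capitalized and punctuated.

```python
import re

# Hypothetical phrase table: normalized utterance -> slash command.
COMMANDS = {
    "clear the context": "/clear",
    "clear context": "/clear",
}

def _norm(utterance: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    words = re.sub(r"[^a-z0-9\s]", "", utterance.lower()).split()
    return " ".join(words)

def route(utterance: str) -> tuple[str, str]:
    """Return ("command", "/clear") or ("prompt", original_text)."""
    cmd = COMMANDS.get(_norm(utterance))
    return ("command", cmd) if cmd else ("prompt", utterance)
```

Because the lookup is an exact dictionary match on the whole normalized utterance, a phrase buried inside a longer sentence falls through as an ordinary prompt.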

Self-healing audio. This one was a subtle bug. The bridge sends input by tmux pane name, but reads output by session id. When you /clear, Claude starts a brand new session with a new id — so input kept working, but replies were being read from the old, now-dead transcript. Silent audio, easy to mistake for a network problem. The fix is a SessionStart hook that rewrites the bridge’s pointer file every time a session starts — but only when it’s running in the bridge’s specific tmux pane, so ordinary sessions and sub-agents can never hijack your phone audio:

# session_start_hook.sh — keep the bridge pointed at the live session
where=$(tmux display-message -p -t "$TMUX_PANE" '#S:#W')
[ "$where" = "voice:claude" ] || exit 0     # only the bridge's own pane

jq -n --arg s "$sid" --arg t "$tpath" \
  '{session_id:$s, transcript_path:$t}' > active_session.json.tmp
mv -f active_session.json.tmp active_session.json   # atomic

Now saying “clear the context” wipes the conversation, the new session repoints the bridge automatically, and the audio just keeps working. It heals itself. One detail to get right: the voice:claude string in this hook must exactly match the tmux session:window the launcher creates in Step 12 — if they differ, the pointer never updates and replies go silent.

Step 12 — One command to bring it all up

All of this hides behind a single launcher script. It makes sure the Mumble server is running, resumes the previous conversation (or starts a fresh one), points the bridge at it, and spins up a persistent tmux session with Claude in one window and the bridge — wrapped in an auto-restart loop — in another.

./voice-comms.sh

#   voice comms is UP  (tmux session: voice)
#   ---------------------------------------------------------------
#   Watch / type : tmux attach -t voice
#   On your phone (push-to-talk):
#     Address  : 100.x.y.z      (your Tailscale IP)
#     Port     : 64738
#   ---------------------------------------------------------------

Drop in a small systemd unit and it comes up on boot, so the moment the server is on, my phone can reach an AI that can see my code. Re-run the script any time to start a clean conversation.

Part two: what it’s actually like

So that’s the build. Here’s the part the architecture diagram doesn’t capture: how different it feels.

I take walks now and the walks are productive in a way they never used to be. “What’s left on the malware cleanup?” — and it tells me, out loud, while I’m looking at trees. “Draft a blog post about the voice system” — and we work through the outline together, me talking, it reading drafts back, neither of us anywhere near a keyboard. The desk used to be the only place the work could happen. Now the work happens wherever I am, and the desk is just one option.

The shift is hard to overstate. Typing at an AI, even a brilliant one, still feels like operating a tool — there’s a machine and there’s you, and you’re working the controls. Talking to it, and having it talk back while it actually does the work, crosses some line into feeling like collaboration. The interface stops being something you operate and starts being someone you’re working with. The keyboard, it turns out, was load-bearing for that distinction.

It’s not magic, and I won’t pretend it is. You can’t both talk at once — one writer at a time. If your phone’s mic is open while a reply plays, the synthesized voice gets transcribed back in as if you said it, so push-to-talk discipline matters. And spoken answers are, by design, shallower than written ones — for anything where I need to see a diff or a table, I still go to the screen. But for thinking out loud, for kicking off work, for staying in the loop while I’m away from the desk, it’s become the way I reach for the AI first.

The bag of sharp edges

If you build this, here are the things that cost me hours, collected in one place so they cost you minutes:

Python 3.12 + pymumble: ssl.wrap_socket is gone; add the shim, and import it before pymumble.
webrtcvad: needs setuptools<81 (it still imports pkg_resources).
The Mumble ini file: editing it can strip its group ownership and crash-loop the server. Keep it root:mumble-server, mode 640.
Headless auth: use a long-lived CLAUDE_CODE_OAUTH_TOKEN; non-login shells don’t read .bashrc, so feed the token in explicitly.
The stale pointer after /clear: input by pane, output by session id — a new session silently kills audio until you repoint it. The SessionStart hook cures it.
The feedback loop: open mic during playback re-transcribes the AI’s own voice. Push-to-talk, or mute during replies.
Load TTS once: spawning the model per reply adds seconds of latency to every single answer.

A note on security

This puts a microphone-driven line into an agent that can read your files and run commands, so don’t expose it to the open internet. I run the whole thing over Tailscale, a private mesh network — the Mumble port isn’t open to the world at all; my phone reaches the server over the tailnet as if they were on the same LAN. Set a real Mumble password, keep your auth token out of any public config, and think carefully about what the underlying account is allowed to do. Treat the voice line with the same respect you’d give an SSH key.

Where this goes

The pieces here aren’t exotic — speech-to-text, a small TTS model, a terminal multiplexer, a file being tailed. What makes it feel like the future isn’t any one component; it’s that together they let the interface disappear. The best tools get out of the way, and a tool you can talk to from a footpath is very far out of the way.

I think this is the direction coding is heading: less time operating an editor, more time in conversation with something that can operate the editor for you. You don’t need my exact stack to get there — swap Mumble for whatever transport you like, Kokoro for whatever voice you prefer. The architecture is the durable part: pin a session, type into it, tail its transcript, speak the result, and teach the model it’s on a call. Five ideas. Go build your own, and then go for a walk.

This post was dictated to the very system it describes, and saved as a draft by the same. — Code Real Stuff.


From Typing to Talking: Building a Two-Way Voice Bridge to Claude Code