
  • From Typing to Talking: Building a Two-Way Voice Bridge to Claude Code

    I dictated the first draft of this post on a walk. I didn’t type a word of it. I described what I wanted out loud, an AI running on my server wrote it, read its replies back into my earbuds, and we went back and forth — me on a footpath, it in a terminal — until the thing was done. Then it saved itself as a draft on this blog.

    Six months ago that would have read like science fiction to me. Today it’s just how I work. This post is two things at once: the story of how my relationship with AI went from typing commands at a desk to talking to it from anywhere, and a complete, replicable build guide for the system that made the second part possible. It’s a handful of small Python files and two shell hooks. You can build your own by the end of this.

    The friction was never the AI’s intelligence anymore. The friction was me — the keyboard, the chair, the desk. The best ideas never show up while you’re sitting at the desk.

    Part one: how I used to do this

    When AI coding tools first arrived, the keyboard was the interface and that felt like the whole point. Autocomplete finished your line. Then chat windows showed up: you typed a question, pasted some code, read an answer, copied it back, repeat. Clever, but it was a lot of shuttling text back and forth by hand.

    Then came agents — Claude Code in particular — that live in your terminal, read your files, run commands, and actually do the work instead of just describing it. That was a genuine leap. But notice what didn’t change: I was still hunched over a keyboard at a desk, typing everything.

    The bottleneck had quietly moved. It wasn’t the model’s reasoning — it was the input device, and the input device was me. The best ideas arrive on a walk, in the kitchen, halfway through making coffee. By the time I’d sat down to type them out, half of them had evaporated. So I asked a different question: what if I could just talk to it, the way you’d talk to a colleague on the phone?

    What I built

    A two-way voice bridge to Claude Code. I speak into a VoIP app on my phone; my words are transcribed and typed into a live Claude Code session running on my server; every reply Claude writes gets read straight back to my phone in a natural voice. The same channel quietly carries text chat too, so when I’m somewhere I can glance at a screen, the whole conversation is right there as well.

    Here is the entire data flow:

      PHONE (Mumla / Mumble app)
            │  push-to-talk audio  ▲  synthesized voice + text
            ▼                      │
      MURMUR  (self-hosted Mumble server, port 64738, over Tailscale)
            │                      ▲
            ▼                      │
      bot.py  (the bridge)
         IN : audio ─► push-to-talk split ─► Whisper (speech→text) ─► tmux send-keys ─┐
                                                                                        ▼
                                                                            CLAUDE CODE (in tmux)
                                                                                        │
         OUT: tail the session .jsonl ◄── every assistant reply it writes ◄────────────┘
                  │
                  ├─► Kokoro neural TTS ─► audio ─► phone
                  └─► Markdown→HTML ─────► text  ─► phone

    The trick that makes the whole thing simple is on the output side. The bridge never calls an API to find out what Claude said. It just tails the transcript file Claude Code already writes to disk, and speaks each new line. No integration, no glue API — the AI writes to a log, and something else reads the log aloud. Once you see it that way, the rest is plumbing.

    The moving parts:

    • Mumble / Murmur — the audio + text transport between phone and server
    • faster-whisper — speech-to-text, running on the CPU
    • Kokoro-82M — a small neural text-to-speech model for the voice coming back
    • tmux — holds the live Claude Code session so we can type into it programmatically
    • Two Claude Code hooks — one teaches Claude it’s “on a call,” one keeps the audio from breaking
    • A handful of small Python files: pymumble glue that wires it all together

    What you’ll need to follow along: a Linux box (a cheap VPS or a home server) running Claude Code, Python 3.12, a phone, and about an afternoon. Let’s build it.


    The fast path: one script (Ubuntu)

    Prefer not to do all this by hand? I’ve packaged the whole build — every script and config you’ll see below, plus the Mumble server, the Python environment, the voice models and a systemd service — into a single self-contained installer. It’s written for Ubuntu (it uses apt and systemd); on Debian it works as-is, but on any other Linux you’ll need to adjust the package-install step yourself. Download it, read it first (always read a script before running it as root), then run:

    # 1. on a machine where you're already logged into Claude, mint a token:
    claude setup-token                       # prints sk-ant-oat01-...
    
    # 2. on your Ubuntu server, as root:
    curl -fsSL https://iamdev.net/wp-content/uploads/2026/06/voice-claude-setup.sh -o setup.sh
    sudo CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..." bash setup.sh

    That stands the whole stack up and enables a voice-claude service so it comes back on boot. Afterwards, edit config.yaml to set a real Mumble password and point claude.cwd at your project. Everything below is what that script assembles, step by step — still worth reading even if you take the shortcut, because the sharp edges are where the hours go. (This will live on GitHub eventually; for now the link above is the canonical copy.)

    Step 1 — Run Claude Code headless, in tmux

    Everything hangs off one idea: keep a real, interactive Claude Code session alive in a place we can reach programmatically. tmux is perfect for that — it’s a terminal that keeps running after you disconnect, and crucially you can send keystrokes into it from the outside. That’s how we’ll “type” the transcribed speech.

    # start a persistent Claude session in a tmux window called "claude"
    UUID=$(cat /proc/sys/kernel/random/uuid)
    tmux new-session -d -s voice -n claude -c /home/you/project \
      "claude --session-id $UUID"
    
    # you can attach and watch any time:
    tmux attach -t voice        # Ctrl-b d to detach again

    Pinning the session id with --session-id matters: it tells us exactly which transcript file to tail later, and it lets you jump into the very same conversation from a normal terminal with claude --resume <uuid> whenever you want.

    The headless auth gotcha

    On a headless box there’s no browser to log into, so the interactive login won’t complete. The fix is a long-lived token. On a machine where you’re already logged in, run claude setup-token to mint one (good for about a year), then make it available to the headless session — it has to be in the environment before you launch the Step 1 command:

    # on your logged-in machine:
    claude setup-token            # prints sk-ant-oat01-...
    
    # on the server, put it where the session will see it:
    export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-...."   # in ~/.bashrc

    One subtlety that cost me time: systemd and other non-login shells don’t source ~/.bashrc, so a service-launched Claude would start up “not logged in.” My launcher reads the token out of .bashrc on every start and re-exports it, so rotating the token is just editing one line. Keep .bashrc as the single source of truth and you avoid a whole class of confusing “why is it logged out” mornings.
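The "read the token out of .bashrc on every start" idea can be sketched in a few lines. This is a minimal Python illustration, not the author's launcher (which is a shell script); the function names `read_token` and `launch_env` are hypothetical, and it assumes the token sits on a single `export CLAUDE_CODE_OAUTH_TOKEN=...` line.

```python
import re

# Matches lines like: export CLAUDE_CODE_OAUTH_TOKEN="sk-ant-oat01-..."
# Commented-out exports are ignored because the line must start with "export".
_TOKEN_LINE = re.compile(
    r"""^\s*export\s+CLAUDE_CODE_OAUTH_TOKEN=(["']?)(?P<token>[^"'\s#]+)\1"""
)


def read_token(bashrc_text):
    """Return the last exported token found in the .bashrc text, or None."""
    token = None
    for line in bashrc_text.splitlines():
        m = _TOKEN_LINE.match(line)
        if m:
            token = m.group("token")  # later exports win, as they would in a shell
    return token


def launch_env(base_env, bashrc_text):
    """Copy base_env and add the token so a child process (tmux, claude) inherits it."""
    env = dict(base_env)
    token = read_token(bashrc_text)
    if token:
        env["CLAUDE_CODE_OAUTH_TOKEN"] = token
    return env
```

A launcher built this way would pass `launch_env(os.environ, open(bashrc_path).read())` as the `env=` argument when it starts tmux, which keeps .bashrc as the only place the token is written down.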

    Step 2 — Mumble as the transport

    I needed a pipe between my phone and the server that carried both live audio and text, had solid mobile apps, was low-latency, and that I could fully self-host. That’s Mumble — an open-source voice chat system built for gamers, where every millisecond counts. The server is called Murmur; the Android app is Mumla, the iOS one is just Mumble.

    sudo apt install mumble-server
    sudo systemctl enable --now mumble-server
    
    # key settings in /etc/mumble/mumble-server.ini:
    #   port=64738
    #   serverpassword=choose-a-strong-one

    On your phone, add the server (host = your server, port = 64738, password = the one you set), accept the self-signed certificate on first connect, and switch the app to push-to-talk. Hold the button, say your sentence, release — that release is what tells the bridge your turn is over.

    One sharp edge worth saving you: editing mumble-server.ini with sed -i or some editors strips its group ownership, and Murmur then crash-loops with a cryptic “ini could not be opened.” It must stay root:mumble-server mode 640. After any edit: chown root:mumble-server, chmod 640, restart.

    Step 3 — The bridge connects (and a Python 3.12 landmine)

    First, the project itself: a handful of small Python files — bot.py wires it all together, alongside stt.py (ears), tts.py (voice), claude_io.py (typing in, and reading the transcript out) and a config.yaml for the knobs. Set up a virtualenv and install what they need:

    python3.12 -m venv venv
    . venv/bin/activate
    pip install pymumble faster-whisper webrtcvad soxr numpy pyyaml "setuptools<81"
    pip install kokoro-onnx onnxruntime        # neural TTS for Step 7 (onnxruntime runs the model)
    sudo apt install espeak-ng jq tmux               # espeak-ng = Kokoro's phonemizer; jq runs the hooks

    The bridge is a Python process that joins the Mumble server as if it were just another user, using the pymumble library. The first thing that bit me: pymumble still calls ssl.wrap_socket(), which was removed in Python 3.12. Rather than downgrade Python, I dropped in a tiny compatibility shim that recreates it with a modern SSL context (Mumble’s trust model is certificate pinning, not CA chains, so not verifying the self-signed cert is fine here):

    # compat_ssl.py  — import this BEFORE pymumble connects
    import ssl
    
    if not hasattr(ssl, "wrap_socket"):
        def wrap_socket(sock, keyfile=None, certfile=None, server_side=False,
                        ssl_version=None, ca_certs=None,
                        do_handshake_on_connect=True,
                        suppress_ragged_eofs=True, ciphers=None, **_):
            ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            if certfile:
                ctx.load_cert_chain(certfile, keyfile)
            return ctx.wrap_socket(
                sock, server_side=server_side,
                do_handshake_on_connect=do_handshake_on_connect,
                suppress_ragged_eofs=suppress_ragged_eofs)
        ssl.wrap_socket = wrap_socket

    (A second, similar trap: webrtcvad needs setuptools<81 because it still imports pkg_resources. Pin it in your venv and move on.)

    With the shim in place, connecting is a few lines — and from here the bridge is really just two pipes: one carrying your voice in, one carrying Claude’s replies out.
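For reference, here is roughly what those few lines can look like. It is a sketch under assumptions rather than the post's actual bot.py: it assumes the `pymumble` package exposes the `pymumble_py3` module, that the shim is saved as `compat_ssl.py` next to it, and `on_audio` is a hypothetical callback name.

```python
def connect(host, user, password, on_audio, port=64738):
    """Join a Murmur server as a bot user and start receiving audio.

    on_audio(sender, pcm_bytes) is a hypothetical callback supplied by the caller.
    Imports are done lazily so the shim is always loaded before pymumble.
    """
    import compat_ssl  # noqa: F401  (the shim above; must load before connecting)
    import pymumble_py3
    from pymumble_py3.constants import PYMUMBLE_CLBK_SOUNDRECEIVED

    mumble = pymumble_py3.Mumble(host, user, port=port, password=password,
                                 reconnect=True)
    # Each received chunk carries raw 48 kHz 16-bit PCM in its .pcm attribute.
    mumble.callbacks.set_callback(
        PYMUMBLE_CLBK_SOUNDRECEIVED,
        lambda sender, chunk: on_audio(sender, chunk.pcm),
    )
    mumble.set_receive_sound(True)   # audio callbacks are off by default
    mumble.start()                   # runs the connection in a background thread
    mumble.is_ready()                # blocks until the handshake completes
    return mumble
```

The bridge would call this once at startup, for example `connect("100.x.y.z", "claude-bot", password, handle_audio)`, with the bot name and handler being whatever you choose.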

    Step 4 — Ears: turning speech into text

    Mumble hands the bridge a stream of audio packets while you hold the talk button. Because I’m using push-to-talk, segmenting your turn is delightfully simple: when the packets stop arriving for ~700 ms, your sentence is done. I buffer the audio, resample it from Mumble’s 48 kHz down to the 16 kHz Whisper wants, and transcribe it with faster-whisper — the base.en model, running int8 on the CPU. No GPU required.

    # config knobs that matter for latency vs. accuracy
    stt:
      model: base.en       # tiny.en = fastest, small.en = most accurate
      device: cpu
      compute_type: int8
    vad:
      mode: ptt            # phone holds the button; we split on the packet gap
      silence_ms: 700      # this much quiet = end of your turn
      min_speech_ms: 350   # ignore little blips and clicks

    base.en is the sweet spot: good enough to nail “refactor the auth middleware” while staying fast on a modest VPS. It’s also the biggest memory user in the whole system (~327 MB resident), more than the text-to-speech model, which surprised me.
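The packet-gap split described in this step can be sketched as a small buffer with two thresholds. This is an illustration only: the class name `PushToTalkSegmenter` and its methods are hypothetical, and it assumes 48 kHz, 16-bit mono PCM arriving in chunks with the caller supplying timestamps in milliseconds.

```python
class PushToTalkSegmenter:
    """Collect PCM chunks into turns, split on the gap between packets."""

    def __init__(self, silence_ms=700, min_speech_ms=350, sample_rate=48000):
        self.silence_ms = silence_ms
        self.min_speech_ms = min_speech_ms
        self.bytes_per_ms = sample_rate * 2 // 1000   # 16-bit mono PCM
        self._chunks = []
        self._last_packet_ms = None

    def feed(self, pcm, now_ms):
        """Call for every audio packet received while the button is held."""
        self._chunks.append(pcm)
        self._last_packet_ms = now_ms

    def poll(self, now_ms):
        """Call periodically; returns a finished turn's PCM bytes, or None."""
        if self._last_packet_ms is None:
            return None                     # nothing buffered
        if now_ms - self._last_packet_ms < self.silence_ms:
            return None                     # still inside the turn
        audio = b"".join(self._chunks)
        self._chunks, self._last_packet_ms = [], None
        if len(audio) < self.min_speech_ms * self.bytes_per_ms:
            return None                     # too short: a blip or click
        return audio
```

Whatever `poll` returns would then be resampled to 16 kHz and handed to the transcriber. The defaults mirror the `silence_ms` and `min_speech_ms` values in the config above.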

    The first time I held the button, said “list the files in this folder,” and watched the words appear in the terminal and the Enter key press itself — I laughed out loud. The machine was typing for me.

    Step 5 — Hands: typing into Claude

    Now we have text; we need it inside the Claude session. This is where the tmux choice pays off. tmux send-keys injects keystrokes into a target pane from the outside. The key flag is -l, which sends the text literally so tmux doesn’t try to interpret words like “Enter” as actual key names. Paste the text, pause a beat so the TUI settles, then send a real Enter to submit:

    def send_text(self, target, text):
        # -l = literal, so the words aren't parsed as key names
        subprocess.run(["tmux", "send-keys", "-t", target, "-l", text])
        time.sleep(0.18)                 # let the TUI settle before submit
        subprocess.run(["tmux", "send-keys", "-t", target, "Enter"])

    That’s the entire input path. Speech became text became keystrokes became a prompt. Claude Code doesn’t know or care that a human didn’t type it.

    Step 6 — Eyes: reading Claude’s replies

    This is my favourite part, because it’s the part where I expected to need an API and didn’t. Claude Code writes every turn of every conversation to a JSONL transcript on disk. The path is derived from the working directory — every non-alphanumeric character becomes a dash — plus the session id:

    ~/.claude/projects/<cwd-with-dashes>/<session-id>.jsonl
    
    # e.g.  cwd /home/you/project  +  session id abc123...
    ~/.claude/projects/-home-you-project/abc123....jsonl
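Expressed as code, that path rule is a one-line substitution. This is a minimal sketch of the rule as stated here; the function name `transcript_path` and the `home` parameter are hypothetical, added so the example can run anywhere.

```python
import re
from pathlib import Path


def transcript_path(cwd, session_id, home=None):
    """Build the transcript path from the working directory and session id."""
    slug = re.sub(r"[^A-Za-z0-9]", "-", cwd)   # every non-alphanumeric char -> "-"
    root = Path(home) if home else Path.home()
    return root / ".claude" / "projects" / slug / f"{session_id}.jsonl"
```

For example, `transcript_path("/home/you/project", "abc123", home="/home/you")` gives `/home/you/.claude/projects/-home-you-project/abc123.jsonl`.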

    So the “read Claude’s replies” problem becomes the very old, very solved problem of tail -f on a file. The bridge watches that file, and every time a new assistant message lands, it pulls out the text blocks and fires them off to be spoken. It skips Claude’s internal thinking and any sub-agent side-chains, dedupes by message id, and starts reading from the end of the file so it never replays old history:

    def _handle(self, raw_line):
        o = json.loads(raw_line)
        if o.get("type") != "assistant" or o.get("isSidechain"):
            return                              # skip non-assistant lines and sub-agent side-chains
        if o.get("uuid") in self._seen:
            return                              # already spoke this one
        self._seen.add(o.get("uuid"))
        for block in o["message"]["content"]:
            if block.get("type") == "text":
                self.on_text(block["text"])     # → speak it

    No webhook, no streaming API, no integration to keep in sync with. The agent writes to a log; we read the log aloud. That’s the whole secret.
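The "start at the end and read what gets appended" part can be sketched like this. It is an illustration, not the post's claude_io.py: the class name `TranscriptTail` is hypothetical, and it assumes the caller polls it in a loop with a short sleep.

```python
import os


class TranscriptTail:
    """Read only the lines appended to a file after this object was created."""

    def __init__(self, path, from_end=True):
        self._f = open(path, "r", encoding="utf-8")
        if from_end:
            self._f.seek(0, os.SEEK_END)    # skip history already in the file
        self._partial = ""

    def read_new_lines(self):
        """Return any complete lines appended since the last call."""
        lines = []
        while True:
            chunk = self._f.readline()
            if not chunk:
                break                        # caught up with the writer
            self._partial += chunk
            if self._partial.endswith("\n"): # only hand over finished lines
                lines.append(self._partial.rstrip("\n"))
                self._partial = ""
        return lines

    def close(self):
        self._f.close()
```

Holding back a half-written line matters because a JSON object that is still being flushed would otherwise fail to parse. Each returned line can go straight into a handler like `_handle` above.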

    (You might wonder how the bridge knows which session id is live. A tiny pointer file — written by the launcher in Step 12 and kept current by a hook in Step 11 — always names the active session. For now, assume it knows.)

    Step 7 — Mouth: turning replies into a voice

    For the voice coming back, I landed on Kokoro-82M, a small neural text-to-speech model that sounds startlingly natural for its size and runs happily on CPU via ONNX. I picked a US male voice (am_michael) after auditioning a handful.

    # one-time: fetch the Kokoro model files into ./voices/
    mkdir -p voices
    REL=https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0
    #   kokoro-v1.0.onnx   (~310 MB)
    #   voices-v1.0.bin    (~27 MB, 54 voices)
    curl -fsSL "$REL/kokoro-v1.0.onnx" -o voices/kokoro-v1.0.onnx
    curl -fsSL "$REL/voices-v1.0.bin"  -o voices/voices-v1.0.bin

    My first version used a different engine (Piper) spawned as a fresh command-line process for every reply — which meant reloading the model every single time and waiting ~5 seconds before any audio, even for a five-word answer. The fix was the obvious one in hindsight: load the model once when the bridge starts, keep it warm in memory, and then stream the speech out sentence by sentence as Claude’s reply comes in. Now it’s about 1.7 seconds to the first spoken word, and because synthesis outruns playback, the rest is gapless.

    tts:
      engine: kokoro
      model: voices/kokoro-v1.0.onnx
      voices_bin: voices/voices-v1.0.bin
      voice: am_michael      # 54 voices ship in the model file
      speed: 1.0
      lang: en-us
      speak_code_blocks: false
      normalize_tech: true
      max_chars: 4000        # safety ceiling; cut at a sentence boundary, never mid-word

    That last knob matters more than it looks. Spoken replies that get hard-truncated mid-word sound broken; cutting at the nearest sentence boundary instead keeps the experience clean on Claude’s occasional long answers.
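One way to implement that ceiling is sketched below. The function name `clip_for_speech` is hypothetical and the sentence detection is deliberately naive (a period, question mark or exclamation mark followed by whitespace or the end of the text), so treat it as a starting point rather than the post's exact logic.

```python
import re

# A sentence end is ., ! or ? followed by whitespace or the end of the text,
# so the dots inside "1.51.0" do not count.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def clip_for_speech(text, max_chars=4000):
    """Trim text to max_chars, ending on the last full sentence that fits."""
    if len(text) <= max_chars:
        return text
    cut = 0
    for m in _SENTENCE_END.finditer(text):
        if m.end() > max_chars:
            break
        cut = m.end()
    if cut:
        return text[:cut]
    # No sentence end in range: fall back to the last whole word.
    parts = text[:max_chars].rsplit(None, 1)
    return parts[0] if parts else ""
```

The fallback drops the final word in the window even if it happened to be complete, which is a small loss in exchange for never cutting mid-word.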

    Step 8 — Making it sound human, not like a terminal

    Here’s a problem you only discover once you’re listening: Claude is a coding agent, so its replies are full of file paths, snake_case identifiers, --flags, version numbers and acronyms. Fed raw to a speech engine, /etc/mumble/server.ini comes out as a slurred mess, and RTF gets read as the word “rtf.” So before any text hits the synthesizer, it goes through a normalizer that rewrites terminal-speak into something a human would actually say:

    • /etc/mumble/server.ini → “etc, mumble, server dot I N I”
    • snake_case and kebab-case → spaced-out words
    • --dry-run → “flag dry run”
    • 1.51.0 → “one point fifty-one point zero”
    • RTF → spelled “R T F” (from a curated acronym list, so real words like “TODO” are left alone)
    • long hashes and UUIDs → “a long hash” / “id ending…” instead of reading 40 characters aloud

    It’s deliberately conservative and pattern-gated, so ordinary prose passes through untouched. This one unglamorous file is the difference between something that sounds like a colleague and something that sounds like a 1980s text reader.
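A few of the rules above can be sketched with plain regular expressions. This is not the author's normalizer: the function name, the regexes and the tiny acronym set are all illustrative assumptions, and the version-number and hash rules are left out because they need a number-to-words step.

```python
import re

ACRONYMS = {"RTF", "TTS", "API", "CPU"}   # illustrative list, not the author's


def _say_path(m):
    """/etc/mumble/server.ini -> 'etc, mumble, server dot I N I'."""
    parts = [p for p in m.group(0).split("/") if p]
    stem, dot, ext = parts[-1].rpartition(".")
    if dot and stem:
        parts[-1] = f"{stem} dot {' '.join(ext.upper())}"
    return ", ".join(parts)


def normalize_for_speech(text):
    # Absolute paths (must start after whitespace, so "and/or" is left alone).
    text = re.sub(r"(?<!\S)/(?:[\w.-]+/)*[\w.-]*\w", _say_path, text)
    # Long flags: --dry-run -> "flag dry run".
    text = re.sub(r"(?<!\S)--([a-z][a-z0-9-]*)",
                  lambda m: "flag " + m.group(1).replace("-", " "), text)
    # snake_case and kebab-case identifiers -> spaced-out words.
    text = re.sub(r"\b[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)+\b",
                  lambda m: re.sub(r"[_-]", " ", m.group(0)), text)
    # Known acronyms are spelled out; other all-caps words pass through.
    text = re.sub(r"\b[A-Z]{2,6}\b",
                  lambda m: " ".join(m.group(0)) if m.group(0) in ACRONYMS
                  else m.group(0), text)
    return text
```

Each rule only fires on a recognisable pattern, which is what keeps ordinary sentences unchanged.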

    Step 9 — Teaching Claude it’s on a phone call

    By default Claude formats for a screen: tables, bullet lists, fenced code blocks, headings. All of that is unlistenable. I needed Claude to know, on these turns, that it’s speaking, not writing — and to answer in short, conversational sentences.

    Claude Code has a hooks system for exactly this kind of thing. A UserPromptSubmit hook runs before each of my prompts reaches the model and can inject extra instructions. Mine checks whether this session is the one the bridge is currently driving, and if so, prepends a short “you’re on voice” note:

    # voice_mode_hook.sh — a UserPromptSubmit hook
    #   only fires for the session the bridge is driving; silent otherwise
    if [ "$sid" = "$active_sid" ]; then
      cat <<EOF
    [voice mode] You are replying over a text-to-speech bridge: your words are
    spoken aloud on a phone. Keep replies SHORT and conversational — 1-3 sentences.
    Do NOT use tables, code blocks, markdown, or long lists (they are unreadable
    aloud). Give the essential answer plainly; offer to elaborate if asked.
    EOF
    fi

    Both this hook and the session hook in the next step do nothing until you register them once in ~/.claude/settings.json — the step that’s easiest to forget:

    {
      "hooks": {
        "UserPromptSubmit": [
          { "hooks": [ { "type": "command",
                         "command": "/path/to/voice_mode_hook.sh" } ] }
        ],
        "SessionStart": [
          { "hooks": [ { "type": "command",
                         "command": "/path/to/session_start_hook.sh" } ] }
        ]
      }
    }

    This is genuinely the line that turns a clunky demo into something you’d actually use. The exact same instruction is, in fact, what shaped the spoken half of the conversation that produced this very post. And because it’s scoped to the bridge’s session id, every other Claude session on the box — normal terminal work, sub-agents — is completely unaffected.
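The shell excerpt above leaves out where the two session ids come from. Here is a Python sketch of the same decision, under two assumptions: that the hook receives a JSON object on stdin containing a `session_id` field, and that the bridge's pointer file is the `active_session.json` described in Step 11. The names `voice_note_for` and `run_hook` are hypothetical, and the post's real hook is a shell script using jq.

```python
import json

VOICE_NOTE = (
    "[voice mode] You are replying over a text-to-speech bridge: your words are "
    "spoken aloud on a phone. Keep replies SHORT and conversational."
)


def voice_note_for(hook_input, pointer):
    """Return the note to inject, or '' when this is not the bridge's session."""
    sid = hook_input.get("session_id")
    if sid and sid == pointer.get("session_id"):
        return VOICE_NOTE
    return ""


def run_hook(stdin, stdout, pointer_path="active_session.json"):
    """Read the hook payload, compare against the pointer file, print the note."""
    hook_input = json.load(stdin)
    try:
        with open(pointer_path, encoding="utf-8") as f:
            pointer = json.load(f)
    except (OSError, ValueError):
        pointer = {}                 # no bridge running: stay silent
    stdout.write(voice_note_for(hook_input, pointer))
```

A script registered as the hook command would end with `run_hook(sys.stdin, sys.stdout)`. Printing nothing for every other session is what keeps normal terminal work unaffected.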

    Step 10 — The text channel, for free

    Mumble carries text chat over the same connection as the audio, so with a bit more wiring the bridge speaks and types. Inbound chat messages take the exact same path into Claude as transcribed speech; outbound, each reply is mirrored into the chat pane with Markdown rendered to Mumble’s little HTML subset (fenced code → <pre>, **bold** → <b>, and so on).

    The nice payoff: when you open the app, the bridge backfills the whole conversation so far into the chat, so you’re never staring at a blank screen wondering what you missed while your phone was in your pocket. Voice for when you’re walking; text for when you can look. One conversation, two senses.
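The Markdown-to-HTML step can be sketched with two substitutions. This is a minimal illustration covering only the two conversions named above plus line breaks; the function names are hypothetical, and turning newlines into `<br/>` is my assumption about what the chat pane needs.

```python
import html
import re

_TICKS = "`" * 3   # built this way to avoid a literal fence inside the example
_FENCE = re.compile(_TICKS + r"[^\n]*\n(.*?)" + _TICKS, re.DOTALL)


def _inline(text):
    """Escape HTML, then convert **bold** and newlines outside code blocks."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)
    return text.replace("\n", "<br/>")


def markdown_to_mumble_html(md):
    """Convert fenced code to <pre> and bold to <b>; escape everything else."""
    out, pos = [], 0
    for m in _FENCE.finditer(md):
        out.append(_inline(md[pos:m.start()]))
        out.append("<pre>" + html.escape(m.group(1).rstrip("\n")) + "</pre>")
        pos = m.end()
    out.append(_inline(md[pos:]))
    return "".join(out)
```

Escaping first matters: Claude's replies often contain angle brackets, and unescaped ones would be swallowed as tags by the client.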

    Step 11 — Voice commands and self-healing sessions

    Two final touches make it feel finished rather than fiddly.

    Voice commands. Some things you want to do to the session itself, not say to Claude — like clearing the context to start fresh. So the bridge intercepts a few whole-utterance phrases at the single point all input funnels through, and routes them to a slash-command instead of sending them as a prompt. Say “clear the context” on its own and it types /clear into the TUI. Matching is on the full, normalized utterance, so “can you clear the context for me?” in the middle of a sentence does not trigger it.
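Whole-utterance matching is simple to sketch: normalize the input, then look it up in a table. The table contents beyond "clear the context" and the names `VOICE_COMMANDS` and `route` are hypothetical.

```python
import re

VOICE_COMMANDS = {
    "clear the context": "/clear",
}


def _normalize(utterance):
    """Lowercase, strip punctuation, collapse whitespace."""
    words = re.sub(r"[^a-z0-9\s]", "", utterance.lower()).split()
    return " ".join(words)


def route(utterance):
    """Return ("command", slash_command) or ("prompt", original_text)."""
    command = VOICE_COMMANDS.get(_normalize(utterance))
    if command:
        return ("command", command)
    return ("prompt", utterance)
```

Because the lookup is on the entire normalized utterance, `route("Clear the context.")` becomes a command while `route("can you clear the context for me?")` goes to Claude as an ordinary prompt.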

    Self-healing audio. This one was a subtle bug. The bridge sends input by tmux pane name, but reads output by session id. When you /clear, Claude starts a brand new session with a new id — so input kept working, but replies were being read from the old, now-dead transcript. Silent audio, easy to mistake for a network problem. The fix is a SessionStart hook that rewrites the bridge’s pointer file every time a session starts — but only when it’s running in the bridge’s specific tmux pane, so ordinary sessions and sub-agents can never hijack your phone audio:

    # session_start_hook.sh — keep the bridge pointed at the live session
    where=$(tmux display-message -p -t "$TMUX_PANE" '#S:#W')
    [ "$where" = "voice:claude" ] || exit 0     # only the bridge's own pane
    
    jq -n --arg s "$sid" --arg t "$tpath" \
      '{session_id:$s, transcript_path:$t}' > active_session.json.tmp
    mv -f active_session.json.tmp active_session.json   # atomic

    Now saying “clear the context” wipes the conversation, the new session repoints the bridge automatically, and the audio just keeps working. It heals itself. One detail to get right: the voice:claude string in this hook must exactly match the tmux session:window the launcher creates in Step 12 — if they differ, the pointer never updates and replies go silent.
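The write-to-temp-then-rename trick in that hook has a direct Python equivalent, shown here alongside the read the bridge would do. This is a sketch only: the function names are hypothetical, and it assumes the temp file and the pointer live in the same directory so the rename is atomic.

```python
import json
import os
import tempfile


def write_pointer(path, session_id, transcript_path):
    """Write via a temp file plus rename, so readers never see half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"session_id": session_id,
                   "transcript_path": transcript_path}, f)
    os.replace(tmp, path)   # atomic on POSIX when both are on one filesystem


def read_pointer(path):
    """Return the pointer dict, or None if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return None
```

On the bridge side, comparing the `session_id` from `read_pointer` with the one currently being tailed is enough to notice a `/clear` and reopen the new transcript.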

    Step 12 — One command to bring it all up

    All of this hides behind a single launcher script. It makes sure the Mumble server is running, resumes the previous conversation (or starts a fresh one), points the bridge at it, and spins up a persistent tmux session with Claude in one window and the bridge — wrapped in an auto-restart loop — in another.

    ./voice-comms.sh
    
    #   voice comms is UP  (tmux session: voice)
    #   ---------------------------------------------------------------
    #   Watch / type : tmux attach -t voice
    #   On your phone (push-to-talk):
    #     Address  : 100.x.y.z      (your Tailscale IP)
    #     Port     : 64738
    #   ---------------------------------------------------------------

    Drop in a small systemd unit and it comes up on boot, so the moment the server is on, my phone can reach an AI that can see my code. Re-run the script any time to start a clean conversation.


    Part two: what it’s actually like

    So that’s the build. Here’s the part the architecture diagram doesn’t capture: how different it feels.

    I take walks now and the walks are productive in a way they never used to be. “What’s left on the malware cleanup?” — and it tells me, out loud, while I’m looking at trees. “Draft a blog post about the voice system” — and we work through the outline together, me talking, it reading drafts back, neither of us anywhere near a keyboard. The desk used to be the only place the work could happen. Now the work happens wherever I am, and the desk is just one option.

    The shift is hard to overstate. Typing at an AI, even a brilliant one, still feels like operating a tool — there’s a machine and there’s you, and you’re working the controls. Talking to it, and having it talk back while it actually does the work, crosses some line into feeling like collaboration. The interface stops being something you operate and starts being someone you’re working with. The keyboard, it turns out, was load-bearing for that distinction.

    It’s not magic, and I won’t pretend it is. You can’t both talk at once — one writer at a time. If your phone’s mic is open while a reply plays, the synthesized voice gets transcribed back in as if you said it, so push-to-talk discipline matters. And spoken answers are, by design, shallower than written ones — for anything where I need to see a diff or a table, I still go to the screen. But for thinking out loud, for kicking off work, for staying in the loop while I’m away from the desk, it’s become the way I reach for the AI first.

    The bag of sharp edges

    If you build this, here are the things that cost me hours, collected in one place so they cost you minutes:

    • Python 3.12 + pymumble: ssl.wrap_socket is gone; add the shim, and import it before pymumble.
    • webrtcvad: needs setuptools<81 (it still imports pkg_resources).
    • The Mumble ini file: editing it can strip its group ownership and crash-loop the server. Keep it root:mumble-server, mode 640.
    • Headless auth: use a long-lived CLAUDE_CODE_OAUTH_TOKEN; non-login shells don’t read .bashrc, so feed the token in explicitly.
    • The stale pointer after /clear: input by pane, output by session id — a new session silently kills audio until you repoint it. The SessionStart hook cures it.
    • The feedback loop: open mic during playback re-transcribes the AI’s own voice. Push-to-talk, or mute during replies.
    • Load TTS once: spawning the model per reply adds seconds of latency to every single answer.

    A note on security

    This puts a microphone-driven line into an agent that can read your files and run commands, so don’t expose it to the open internet. I run the whole thing over Tailscale, a private mesh network — the Mumble port isn’t open to the world at all; my phone reaches the server over the tailnet as if they were on the same LAN. Set a real Mumble password, keep your auth token out of any public config, and think carefully about what the underlying account is allowed to do. Treat the voice line with the same respect you’d give an SSH key.

    Where this goes

    The pieces here aren’t exotic — speech-to-text, a small TTS model, a terminal multiplexer, a file being tailed. What makes it feel like the future isn’t any one component; it’s that together they let the interface disappear. The best tools get out of the way, and a tool you can talk to from a footpath is very far out of the way.

    I think this is the direction coding is heading: less time operating an editor, more time in conversation with something that can operate the editor for you. You don’t need my exact stack to get there — swap Mumble for whatever transport you like, Kokoro for whatever voice you prefer. The architecture is the durable part: pin a session, type into it, tail its transcript, speak the result, and teach the model it’s on a call. Five ideas. Go build your own, and then go for a walk.

    This post was dictated to the very system it describes, and saved as a draft by the same. — Code Real Stuff.