TL;DR

  • Goal: Call a real phone number and have a proper back-and-forth with my Mac Studio agent while walking the dog.
  • Hardware: Mac Studio (M2 Ultra, 128 GB) running a local model via Ollama or MLX.
  • Voice pipeline: Twilio SIP in, LiveKit Agents orchestrating STT / LLM / TTS, Whisper for transcription, Piper or ElevenLabs for speech.
  • Brain: A local 30B-class model for chat plus tool calls, with Claude API as a fallback for the harder reasoning.
  • Reach: Tailscale between the Mac and a tiny VPS so I never punch a hole in my home router.
  • Outcome: I can ring a UK landline number, ask “what’s failing on the CI pipeline?” and get a spoken answer in ~2 seconds.

Why bother phoning your own agent?

Typing is great at a desk. Away from one, it's hopeless. I wanted the simplest possible interface to the box sitting under my desk at home - dial a number, talk, hang up. No app, no login, no VPN dance on my phone.

This isn’t a gimmick. A voice line to a home agent unlocks three things you can’t really get from ChatGPT on your phone:

  1. It knows your actual work - it’s pointed at your repos, your notes, your local services.
  2. It can do real things - trigger a build, re-run a flaky test, stage a commit, check a dashboard.
  3. It’s private by default - the recording never leaves your kit unless you ask it to.

The rest of this post is the exact stack I landed on after a weekend of messing about. I’ve tried to be honest about the bits that are fiddly.

The shape of the system

Before we go deep, here’s the flow of a single call:

  1. I dial a Twilio number from my mobile.
  2. Twilio routes the call over SIP to a small cloud worker.
  3. The worker hands the audio to LiveKit Agents running as the voice orchestrator.
  4. LiveKit streams the audio over Tailscale to the Mac Studio.
  5. On the Mac - Whisper transcribes, the local LLM reasons and calls tools, Piper speaks the reply.
  6. Audio streams back out the same way.

Budget for the round trip, end to end, is roughly 1.5 to 2.5 seconds per turn. Rough split: 300 ms Twilio / SIP ingress, 150 ms for Whisper partials to finalise, 400 ms to first token from the LLM, 200 ms for Piper to start speaking, and the rest is network and jitter. The human ear is unforgiving beyond about 800 ms of silence - under 2 seconds a turn still feels like a conversation, over 3 it feels like a walkie-talkie.
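Tallying those per-stage figures shows how little slack is left for the network (these are my rough measurements, not guarantees):

```python
# Rough per-turn latency budget, in milliseconds. The stage numbers are the
# rough measurements from the text; yours will differ.
stages = {
    "twilio_sip_ingress": 300,
    "whisper_partials_finalising": 150,
    "llm_first_token": 400,
    "piper_time_to_first_audio": 200,
}

fixed = sum(stages.values())          # 1050 ms of pipeline latency
budget_low, budget_high = 1500, 2500  # observed end-to-end range per turn

# Whatever remains of the budget is network transit and jitter.
slack_low = budget_low - fixed    # 450 ms
slack_high = budget_high - fixed  # 1450 ms
```

In other words, the pipeline itself eats about a second before a single packet crosses the internet - which is why the first-token latency of the LLM matters so much.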

Most of the cleverness is already in LiveKit’s agent framework. You are not writing a real-time audio pipeline from scratch, which is the whole reason this is doable in a weekend.

The Mac Studio side

I’m running the setup on an M2 Ultra 128 GB - more than most people need, but I already had it from the local LLM experiments I wrote up previously. An M2 Max 64 GB would be completely fine if you stay on a 13B-class model.

The key services on the Mac are all run under launchd so they come back after reboots:

  • Ollama serving a 30B-class model on localhost:11434. I’m using a Qwen3 MoE variant for most calls and falling back to Claude Sonnet via the Anthropic API for anything that needs proper reasoning. On an M2 Ultra you can comfortably sit at 4-bit quantisation and still get first-token latency under 400 ms, which is the real bottleneck for voice.
  • whisper.cpp with the Metal backend, running a streaming server that takes audio chunks and returns partial transcripts. large-v3-turbo is my default - fast enough on Apple Silicon and very forgiving with UK accents.
  • Piper TTS for voice out. The en_GB-alan-medium voice is the best local option I’ve tried. If I’m feeling flush I swap in ElevenLabs streaming TTS for a much nicer voice - but then the audio does leave the house.
  • A thin Python worker that holds the tool-call plumbing - GitHub status, a read-only SSH into my dev box, a couple of bash shortcuts for “check the build” and “what’s Claude saying about PR 42”.

The Python worker is the only code I actually had to write. Everything else is off-the-shelf.
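Each of those services is a small launchd job. A sketch of the Ollama one - the label, binary path and log location are placeholders, so adjust for your install:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<!-- ~/Library/LaunchAgents/com.example.ollama.plist (placeholder label) -->
<dict>
    <key>Label</key>
    <string>com.example.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err.log</string>
</dict>
</plist>
```

Load it once with `launchctl load ~/Library/LaunchAgents/com.example.ollama.plist` and the service survives reboots; `KeepAlive` restarts it if it crashes mid-call.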

The voice orchestrator

I spent an evening comparing LiveKit Agents, Pipecat and the OpenAI Realtime API. For a local-first setup, LiveKit wins easily:

  • Pluggable STT, LLM and TTS, each of which can point at a local endpoint. I plug Whisper in for STT, Ollama in for LLM, and Piper in for TTS - no cloud dependency required.
  • Proper voice-activity detection and turn-taking out of the box. This is the part that makes it feel like a phone call rather than a radio handset.
  • SIP ingress via LiveKit SIP so a real phone call lands cleanly inside your agent session.

Pipecat is lovely but its SIP story is thinner. The OpenAI Realtime API is superb but it’s a cloud model - which defeats the point for me.

The phone number

For the actual PSTN side you need a SIP trunk. I’m using Twilio Programmable Voice because the SIP configuration is easy and I already have an account. A UK local number is about £1 a month plus pennies per minute.

The TwiML config is literally a single verb that dials the SIP URI exposed by LiveKit. Twilio handles the PSTN to SIP bridge. LiveKit handles the SIP to WebRTC bridge. Your agent just sees an audio track.
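That single verb looks roughly like this - the SIP domain is a placeholder, and yours comes from your LiveKit SIP trunk configuration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <!-- Dial straight into the SIP URI exposed by LiveKit SIP.
       The domain below is a placeholder. -->
  <Dial>
    <Sip>sip:agent@your-livekit-sip.example.com</Sip>
  </Dial>
</Response>
```

Point your Twilio number's voice webhook at a TwiML Bin (or any URL) serving that document and the PSTN side is done.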

If you already use Telnyx or Vonage the same pattern works - anything that can terminate a SIP URI is fine. I wouldn’t use a consumer VoIP provider; the jitter is noticeably worse.

Stopping strangers from talking to your agent

This is the part I wish every tutorial addressed and almost none do. The moment your Twilio number is live, anyone on Earth can dial it and start a conversation with a thing that has shell access to your dev box. Caller ID spoofing is trivial. A naive setup is an open door.

The layers I use, in order:

  1. Allowlist on caller ID. Twilio hands the From header to LiveKit. I reject anything that isn’t my personal mobile. This stops 99% of nonsense but is not real security - caller ID can be spoofed.
  2. A spoken PIN. On answer, the agent asks for a short passphrase before it will do anything. Transcribed by Whisper, compared to a hash. If it fails twice the call drops.
  3. Read-only by default. The agent boots into a tool set that can only look at things. Destructive tools (push, merge, delete, run shortcut) are gated behind a second verbal confirmation each time.
  4. Rate limits and budget caps. The SIP worker drops the call if Whisper or the LLM hit unusual usage in a short window - a runaway loop shouldn’t cost me £50.
  5. An audit log. Every call writes a transcript and tool-call trace to a file on the Mac. If something weird happened I can go back and see it.

None of this is exotic, but all of it is mandatory. Treat the phone number as a public API endpoint for your home, because that is exactly what it is.
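Layers 1 and 2 fit in a few lines of the Python worker. A sketch - the number, salt and passphrase here are all made up:

```python
import hashlib
import hmac

# Layer 1: caller-ID allowlist. Spoofable, so it only filters out noise.
ALLOWED_CALLERS = {"+447700900123"}  # placeholder number

# Layer 2: spoken passphrase, stored only as a salted hash.
PIN_SALT = b"some-random-salt"  # placeholder; generate your own
PIN_HASH = hashlib.sha256(PIN_SALT + b"purple otter").hexdigest()

def caller_allowed(from_number: str) -> bool:
    """Check the SIP From header against the allowlist."""
    return from_number in ALLOWED_CALLERS

def pin_matches(spoken: str) -> bool:
    """Compare a Whisper transcript of the spoken PIN against the stored hash."""
    # Transcripts vary in case and spacing, so normalise before hashing.
    normalised = " ".join(spoken.lower().split())
    candidate = hashlib.sha256(PIN_SALT + normalised.encode()).hexdigest()
    # Constant-time compare, out of habit more than necessity here.
    return hmac.compare_digest(candidate, PIN_HASH)
```

The normalisation step matters in practice: Whisper will happily transcribe the same phrase as "Purple otter." on one call and "purple  otter" on the next.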

Connecting the Mac to the world without opening ports

This is where most home setups go wrong. You do not want a public IP on your Mac Studio, and you do not want to mess with port forwarding on your home router.

I use Tailscale with two nodes:

  • The Mac Studio at home.
  • A tiny £4/month VPS (hosted at Hetzner) that runs the LiveKit server and the SIP worker.

The VPS has a public address and handles the call from Twilio. LiveKit then reaches the Mac agent process over the Tailscale interface, not the open internet. Latency between the VPS and the Mac is about 15 ms on my connection - which is fine for conversational voice.

If you’d rather not run a VPS at all, LiveKit Cloud works the same way and the free tier covers my usage. I just prefer owning the box that holds the SIP credentials.

The agent’s tools

This is the bit that separates a chat toy from something actually useful. The agent has a small, boring toolbox:

  • github_status(repo) - hits the GitHub API and summarises open PRs, failing checks, new issues.
  • check_ci(pipeline) - reads the latest run from GitHub Actions and returns the current step.
  • run_shortcut(name) - runs a macOS Shortcut on the Mac (brilliant for “start the dev server” or “open my daily note”).
  • search_notes(query) - greps my Obsidian vault.
  • ask_claude(prompt) - escalates a hard reasoning task to Claude Sonnet via the API, with the local context stuffed in.

That last tool is the quiet star of the show. A local 30B model is great at conversation and orchestration - not always great at nuanced code reasoning. Letting it delegate to a smarter model when it needs to is the difference between “neat demo” and “thing I actually use”.

All the tools are defined with function calling and exposed to the local model via Ollama’s tool calling support.
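The specs are plain JSON-schema function definitions in the OpenAI-style format Ollama accepts. A trimmed sketch of one tool plus the dispatch step - the name mirrors the list above, but the body here is a stub:

```python
# One tool spec in the function-calling format Ollama's chat API accepts,
# plus a tiny dispatcher. The tool body is stubbed for illustration.
CHECK_CI_SPEC = {
    "type": "function",
    "function": {
        "name": "check_ci",
        "description": "Return the current step of the latest GitHub Actions run.",
        "parameters": {
            "type": "object",
            "properties": {
                "pipeline": {"type": "string", "description": "Workflow name"},
            },
            "required": ["pipeline"],
        },
    },
}

def check_ci(pipeline: str) -> str:
    # The real version hits the GitHub Actions API; stubbed here.
    return f"{pipeline}: running 'integration tests' (step 3 of 5)"

TOOLS = {"check_ci": check_ci}

def dispatch(tool_call: dict) -> str:
    """Map a model-emitted tool call onto a local function."""
    fn = TOOLS[tool_call["function"]["name"]]
    return fn(**tool_call["function"]["arguments"])
```

You hand `[CHECK_CI_SPEC]` to the model as its tool list, and when a response comes back with tool calls, run each through `dispatch` and feed the result back as a `tool`-role message before the model speaks.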

Making it feel like a phone call, not a radio

Getting the plumbing working is maybe 60% of the job. Making it feel conversational is the other 40%.

Things that matter:

  • Barge-in. The user should be able to interrupt the agent mid-sentence. LiveKit’s VAD handles this; make sure you enable it.
  • Short responses by default. I prompt the model to reply in under three sentences unless I ask for detail. Nothing kills a voice interface faster than a six-paragraph monologue.
  • Confirmations for actions. Anything destructive - pushing code, merging a PR, deleting anything - has to prompt back “are you sure?” and wait for a clear yes.
  • Filler audio. When a tool call takes more than about 800 ms, the agent says something like “one sec, checking” to bridge the silence. Without this, people hang up thinking the line has dropped.

These are prompt-engineering problems, not infrastructure problems. The LiveKit agent’s system prompt is where the personality lives.
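The filler-audio trick, for instance, is a small asyncio pattern: race the tool call against an 800 ms timer. A sketch - `speak` here is a stand-in for whatever TTS hook your agent exposes:

```python
import asyncio

FILLER_AFTER_S = 0.8  # speak filler if the tool is slower than this

async def call_tool_with_filler(tool_coro, speak):
    """Run a tool call; if it takes longer than FILLER_AFTER_S,
    say a short filler line so the caller knows the line is alive."""
    task = asyncio.ensure_future(tool_coro)
    try:
        # shield() keeps the tool running even if the timeout fires.
        return await asyncio.wait_for(asyncio.shield(task), FILLER_AFTER_S)
    except asyncio.TimeoutError:
        await speak("one sec, checking")
        return await task
```

Fast tools return silently; slow ones get a verbal bridge and still deliver their result when they finish.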

Costs

Rounded monthly numbers for my setup:

  • Twilio number plus talk: ~£3-5
  • VPS: ~£4
  • Electricity for the Mac Studio: negligible on top of what it was already using (it idles most of the day)
  • Local models: free
  • ElevenLabs (when I use it): ~£4 for the starter tier

So about £10 a month, all in. Cheaper than another subscription, and it’s mine.

What I’d skip if I were starting again

  • Don’t start with Piper. Piper is genuinely good, but ElevenLabs is so much better that you’ll lose a weekend chasing local TTS before you give up and pay. Prototype with ElevenLabs, then decide if local is worth it.
  • Don’t build your own SIP bridge. Just use Twilio plus LiveKit SIP. The RTP and codec nightmares are not a good use of your time.
  • Don’t expose the agent to the raw internet. Tailscale or WireGuard is a half-hour job and saves you a very bad afternoon.
  • Don’t let the agent do destructive things without confirmation. You’re going to misspeak eventually.

Is this the future?

Probably - but only in the specific sense that most homes will end up with a small agent they can talk to, not in the sense that most people will build this stack themselves. The hard parts get commoditised and somebody ships an appliance.

What surprised me was how quickly it stopped feeling like a chatbot. The agent has access to my repos, my notes, my shortcuts. It understands the patterns I use without me having to explain them each time. On a long walk it’s genuinely closer to a colleague than a tool.

The interesting part of this project isn’t the phone line. It’s the realisation that a competent local model plus five or six tool calls is already enough to replace most of the reasons I open a laptop on the weekend.

If you want the bigger picture view on where local agents fit, I’ve written separately about the local vs cloud AI trade-off in 2026 and what actually belongs in my AI dev stack. This phone setup slots neatly into both.
