Voice

How to Phone Your Home AI Agent Running on a Mac Studio

TL;DR

Goal: Call a real phone number and have a proper back-and-forth with my Mac Studio agent while walking the dog.
Hardware: Mac Studio (M2 Ultra, 128 GB) running a local model via Ollama or MLX.
Voice pipeline: Twilio SIP in, LiveKit Agents orchestrating STT / LLM / TTS, Whisper for transcription, Piper or ElevenLabs for speech.
Brain: A local 30B-class model for chat plus tool calls, with Claude API as a fallback for the harder reasoning.
Reach: Tailscale between the Mac and a tiny VPS so I never punch a hole in my home router.
Outcome: I can ring a UK landline number, ask “what’s failing on the CI pipeline?” and get a spoken answer in ~2 seconds.

Why bother phoning your own agent?

Typing is great at a desk. Away from the desk, it’s hopeless. I wanted the simplest possible interface to the box sat under my desk at home - dial a number, talk, hang up. No app, no login, no VPN dance on my phone. ...
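A minimal sketch of the "Brain" split from the TL;DR above: try the local model first and hand off to a cloud fallback for harder requests. The excerpt does not say how the hand-off is decided, so `needs_fallback` is a hypothetical check, and `local_model` and `cloud_fallback` are placeholder callables standing in for the real Ollama/MLX and Claude API clients.

```python
from typing import Callable

# Hypothetical type alias: any function that turns a prompt into a reply.
Model = Callable[[str], str]


def answer(
    prompt: str,
    local_model: Model,
    cloud_fallback: Model,
    needs_fallback: Callable[[str, str], bool],
) -> str:
    """Answer with the local model, escalating to the cloud model when needed.

    All four arguments are illustrative assumptions, not the article's code.
    """
    try:
        reply = local_model(prompt)
    except Exception:
        # Local model unavailable or crashed: go straight to the fallback.
        return cloud_fallback(prompt)

    # Assumed escalation rule: a caller-supplied check on prompt and reply.
    if needs_fallback(prompt, reply):
        return cloud_fallback(prompt)
    return reply


if __name__ == "__main__":
    # Stub models so the sketch runs without any network or GPU.
    local = lambda p: "" if "prove" in p else f"local: {p}"
    cloud = lambda p: f"cloud: {p}"
    empty_reply = lambda p, r: not r.strip()

    print(answer("what's failing on the CI pipeline?", local, cloud, empty_reply))
    print(answer("prove this invariant holds", local, cloud, empty_reply))
```

Passing the models in as plain callables keeps the routing logic testable on its own, whatever clients end up behind it.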

Grok's New Voice APIs: Speech Recognition and Synthesis at Enterprise Scale

TL;DR

xAI launched standalone Speech-to-Text (STT) and Text-to-Speech (TTS) APIs built on the same stack powering Grok Voice, Tesla in-vehicle assistants, and Starlink customer support.
Grok’s STT is among the cheapest at $0.10/hour (batch) and $0.20/hour (streaming), with features like speaker diarization, word-level timestamps, and Inverse Text Normalization.
The TTS offering ships with five expressive voices, inline expression control tags ([laugh], [sigh], whisper), and covers 20 languages - priced at $4.20 per million characters.
xAI’s pitch is vendor consolidation: replacing three separate contracts (transcription, LLM, synthesis) with one stack on one billing account.
The best fit is teams already building on Grok for reasoning - for lowest-latency TTS, ElevenLabs Flash v2.5 at ~75ms is still unmatched.

xAI has released two standalone voice APIs - Speech-to-Text (STT) and Text-to-Speech (TTS) - built on the same stack powering Grok Voice, Tesla in-vehicle assistants, and Starlink customer support. The move puts xAI in direct competition with ElevenLabs, Deepgram, and AssemblyAI, three companies that have owned the enterprise voice API market for years. ...
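To put the quoted prices in context, here is a rough cost estimate using only the three rates from the TL;DR above. The function name and the example workload (1,000 hours of audio, 5 million synthesised characters) are illustrative assumptions, and the sketch ignores any tiers, minimums, or discounts the real billing might apply.

```python
# Rates as quoted in the TL;DR; everything else is an illustrative assumption.
STT_BATCH_USD_PER_HOUR = 0.10
STT_STREAMING_USD_PER_HOUR = 0.20
TTS_USD_PER_MILLION_CHARS = 4.20


def estimate_voice_cost(audio_hours: float, tts_characters: int,
                        streaming: bool = False) -> float:
    """Return an estimated USD cost for a given STT + TTS workload."""
    stt_rate = STT_STREAMING_USD_PER_HOUR if streaming else STT_BATCH_USD_PER_HOUR
    stt_cost = audio_hours * stt_rate
    tts_cost = tts_characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return round(stt_cost + tts_cost, 2)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 hours transcribed, 5M characters spoken.
    print(estimate_voice_cost(1_000, 5_000_000))                  # batch STT
    print(estimate_voice_cost(1_000, 5_000_000, streaming=True))  # streaming STT
```

For that workload the transcription side costs $100 in batch mode or $200 when streaming, and the synthesis side adds $21 either way.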

MacWhisper vs Wispr Flow vs Superwhisper: The 2026 Dictation Stack Compared

TL;DR

MacWhisper is a file transcription tool (audio in, text out) that runs entirely on-device - the right pick for journalists, researchers, and anyone transcribing recordings.
Wispr Flow is the easiest system-wide dictation option, with AI-powered prose cleanup and cross-platform sync, but it sends audio to the cloud with no on-device option.
Superwhisper matches Wispr Flow’s system-wide dictation but processes audio locally, with bring-your-own-key LLM cleanup and deep customisation for power users.
The core decision is simple: if your audio can leave your machine, use Wispr Flow; if it must stay local, use Superwhisper; if you just need transcription, use MacWhisper.
The real product differentiation is no longer the underlying Whisper model - it is hotkey ergonomics, auto-edit prompts, and workflow integration.

Voice input on the Mac used to mean fighting with the built-in Dictation feature or paying Nuance a small fortune. In 2026, the landscape looks completely different. A handful of indie and venture-backed apps have turned Whisper-class models into genuinely fast, accurate tools that sit quietly in your menu bar until you hold a hotkey. ...

OpenAI Voice Engine

TL;DR

OpenAI Voice Engine is a text-to-speech model that can clone a realistic voice from just a 15-second audio sample.
It produces emotive, natural-sounding speech despite using a small model and minimal training data.
Access has remained in limited preview since its 2024 announcement due to responsible AI concerns around voice cloning and impersonation.
Approved testers must obtain clear consent from voice providers and inform listeners that voices are AI-generated.
As of 2026, the technology is restricted to approved partners and researchers rather than general availability.

About

Voice cloning used to require hours of studio recordings and bespoke model training. OpenAI’s Voice Engine changes the equation: a 15-second audio sample is enough to produce a realistic, emotive voice clone. The capability is striking, which is exactly why OpenAI has kept it locked down since the initial preview. ...