Grok's New Voice APIs: Speech Recognition and Synthesis at Enterprise Scale

TL;DR

xAI launched standalone Speech-to-Text (STT) and Text-to-Speech (TTS) APIs built on the same stack powering Grok Voice, Tesla in-vehicle assistants, and Starlink customer support
Grok’s STT is among the cheapest at $0.10/hour (batch) and $0.20/hour (streaming), with features like speaker diarization, word-level timestamps, and Inverse Text Normalization
The TTS offering ships with five expressive voices, inline expression control tags ([laugh], [sigh], whisper), and covers 20 languages - priced at $4.20 per million characters
xAI’s pitch is vendor consolidation: replacing three separate contracts (transcription, LLM, synthesis) with one stack on one billing account
The best fit is teams already building on Grok for reasoning - for lowest-latency TTS, ElevenLabs Flash v2.5 at ~75ms is still unmatched

xAI has released two standalone voice APIs - Speech-to-Text (STT) and Text-to-Speech (TTS) - built on the same stack powering Grok Voice, Tesla in-vehicle assistants, and Starlink customer support. The move puts xAI in direct competition with ElevenLabs, Deepgram, and AssemblyAI, three companies that have owned the enterprise voice API market for years.

The interesting question isn’t whether Grok’s voice tech is good. It clearly is - Tesla wouldn’t ship it otherwise. The question is whether xAI’s bundle (voice + reasoning + frontier models under one roof) is worth switching for.

What Shipped

Speech-to-Text covers 25 languages with both batch and streaming modes. The feature list reads like a production checklist: word-level timestamps, speaker diarization, multichannel support, and Inverse Text Normalization for numbers, dates, and currencies. Streaming uses WebSocket for low-latency transcription.

Text-to-Speech ships with five voices (Ara, Eve, Leo, Rex, Sal) across 20 languages. The differentiator is expression control. Inline tags like [laugh], [sigh], and <whisper> let developers shape delivery at character level - useful for customer-facing products where a flat robotic voice breaks the experience.

Neither API is a novelty. Both target the kind of production workloads where uptime and accuracy matter more than demos.

Pricing and Subscription

These are developer APIs, billed separately from any consumer subscription. A SuperGrok plan at $30/month does not include voice API usage - that’s metered on its own.

STT: $0.10/hour batch, $0.20/hour streaming
TTS: $4.20 per 1 million characters

How It Compares

Pricing for speech APIs is notoriously hard to compare because vendors mix per-minute, per-second, per-character, and per-session billing. Here’s a normalized view:

Speech-to-Text (per hour of audio):

Provider	Batch	Streaming
AssemblyAI	$0.15 (with ~65% session overhead in practice)	-
Grok	$0.10	$0.20
Deepgram Nova-3	$0.258	$0.462
ElevenLabs Scribe v2	Competitive entry pricing	Sub-150ms latency realtime

Text-to-Speech (per 1M characters):

Provider	Price
Deepgram	~$3.00
Grok	$4.20
ElevenLabs Pro	~$0.20/1K chars ($99/month for 500K)

Grok is the cheapest batch STT here. Deepgram undercuts Grok on TTS by roughly 30%. ElevenLabs still owns the premium voice quality tier and the lowest-latency realtime market. AssemblyAI’s headline rate looks best but their session-duration billing adds meaningful overhead on short calls.

The takeaway: Grok isn’t the cheapest across the board, but it’s consistently in the top three. For teams already using xAI for reasoning, the consolidation may matter more than saving cents per hour.

What Actually Matters in Production

The headline specs don’t decide which voice API wins your stack. The boring features do.

Speaker diarization - essential for call center automation, meeting transcription, and multi-party conversational agents.
Inverse Text Normalization - the difference between “I’ll pay you four thousand two hundred dollars” rendering as $4,200 or 4200 dollars in a financial app.
Expression tags in TTS - shape pacing and tone without recording custom voices.
WebSocket streaming - anything under 300ms round-trip feels conversational; above that feels like a phone tree.

Miss any of these and your voice product feels broken, regardless of how good the core transcription accuracy is.

The Strategic Picture

xAI isn’t competing on voice quality alone. They’re competing on integration. A developer building a voice agent today typically stitches together three vendors: one for transcription, one for the LLM, one for synthesis. Each contract, each SLA, each latency budget compounds.

Grok’s pitch is that you can run the whole loop on xAI infrastructure. Same billing, same region, same support channel. For small teams, that operational simplicity can outweigh a 20% price gap on any individual component.

Deepgram made a similar bet with their Voice Agent API bundle. OpenAI has done it with their Realtime API. xAI is now in that conversation.

Who This Is For

Evaluate Grok’s voice APIs if you’re:

Building on Grok for reasoning already and want to consolidate vendors
Launching a new voice product and want batch STT under $0.15/hour
Experimenting with expressive TTS for consumer applications
Running customer service automation that needs diarization and ITN out of the box

Skip it if you need the lowest possible TTS latency (ElevenLabs Flash v2.5 at ~75ms is still unmatched), or if your workflow is deeply integrated with Deepgram’s observability tooling.

Bottom Line

Voice is becoming a commodity layer in the AI stack. Three years ago, shipping a voice agent meant specialized teams and bespoke models. Today it’s an API call and a billing line item. xAI joining the market is less a surprise and more an inevitability - the surprise would have been staying out.

For builders, the choice gets easier, not harder. More providers means better pricing, faster feature parity, and lower switching costs. The question stops being “which voice vendor can we afford” and becomes “which one fits the rest of our stack.”

Resources:

TL;DR#

What Shipped#

Pricing and Subscription#

How It Compares#

What Actually Matters in Production#

The Strategic Picture#

Who This Is For#

Bottom Line#

Related Reading#