GPT-5.5 Is Here: Real Step Forward or Quiet Iteration?

TL;DR

GPT-5.5 (“Spud”) is the first fully retrained base model since GPT-4.5, with architecture and pretraining reworked from scratch with agentic objectives in mind
It takes the top spot on Terminal-Bench 2.0 (82.7%) and GDPval (84.9%), narrowly beating Anthropic’s Claude Mythos Preview on agentic coding benchmarks
A 1M-token context window is new for OpenAI, enabling whole-codebase reasoning and long multi-step agent runs without context collapse
Pricing is competitive ($5/$30 per million input/output tokens) but the strategic story is about OpenAI building an integrated super app - chat, code, browser agent - all driven by one model
The gains are incremental, not a leap - but the full retraining signals OpenAI is betting the next two years on autonomous agentic work, not chat

OpenAI released GPT-5.5 on April 23, 2026, weeks after GPT-5.4 and only months after GPT-5. The cadence is starting to feel relentless. Codenamed “Spud” internally, GPT-5.5 is the first fully retrained base model since GPT-4.5 - architecture, pretraining corpus, and agent-oriented objectives all reworked from scratch.

The question worth asking is whether any of this is actually significant, or if we’ve reached the part of the curve where every new release looks like a small step.

What Actually Changed

The headline numbers are real. GPT-5.5 hits 82.7% on Terminal-Bench 2.0, 84.9% on GDPval (a benchmark that tests knowledge work across 44 occupations), and 78.7% on OSWorld-Verified for autonomous computer use. On Tau2-bench Telecom - complex customer service workflows - it reaches 98.0% without any prompt tuning.

The Terminal-Bench score is the one most people are talking about. It narrowly beats Anthropic’s Claude Mythos Preview at 82.0% and clears Claude Opus 4.7 at 69.4%. After months of Anthropic running ahead on agentic coding, OpenAI has the top spot back, even if only just.

Beyond benchmarks, the API now offers a 1M-token context window - a first for OpenAI, finally matching what Google has shipped for over a year. Multi-step workflows that previously needed careful chunking can now run end-to-end in a single context.

The Pricing Story

GPT-5.5 lists at $5 per million input tokens and $30 per million output tokens. The pro variant is $30 input, $180 output. That’s the same input price as Claude Opus 4.7 but more expensive on output ($30 vs $25).

OpenAI claims state-of-the-art intelligence at “half the cost of competitive frontier coding models” on Artificial Analysis’s Coding Index. That math depends heavily on how you weight input vs output and which competitor you’re benchmarking against - but for input-heavy agentic workflows that ingest large codebases, the economics are competitive.

The pricing pattern continues a trend worth watching. Despite frontier capability gains, neither OpenAI nor Anthropic is jacking up prices. The $20 subscription era I wrote about earlier is genuinely under pressure for power users, but the per-token cost is holding steady. The competition is keeping it that way.

The Agentic Pivot

The thing GPT-5.5 actually optimizes for is agentic work. Not chat. Not reasoning puzzles. Long-horizon tasks where the model has to plan, use tools, observe results, and keep going.

This matters because it confirms where the industry has decided to compete. Both Anthropic and OpenAI are now treating “can it complete a multi-hour engineering task without supervision” as the metric that matters. Chat is a solved product. Agents are not.

The 73.1% score on OpenAI’s internal Expert-SWE long-horizon benchmark and the SWE-Bench Pro performance suggest this isn’t marketing fluff. GPT-5.5 can take real GitHub issues and resolve them end-to-end more reliably than any previous OpenAI model.

This aligns with the agent-first architecture shift - engineers are becoming curators of systems that AI agents build and modify, rather than authors of every line. GPT-5.5 is built for that workflow.

The “Super App” Subtext

OpenAI’s framing keeps coming back to the same idea: ChatGPT, Codex, and the new browser agent should feel like one product. GPT-5.5 is the underlying model that makes this possible, because a single model now needs to write code, browse the web, operate a computer, draft documents, and stay coherent across hours of work.

This is a different product strategy than Anthropic’s. Anthropic ships models and lets developers build the agents. OpenAI is building the agent itself and shipping the model as the engine. Both are valid bets, but they imply different things about who captures the value.

If OpenAI’s super app vision works, the developer ecosystem matters less because most users never touch the API. If it doesn’t, the API and developer tooling become the moat. GPT-5.5 hedges - it’s strong as a raw model and strong as the brain of an agentic product.

Is This Actually Important?

Yes, but probably not in the way the announcement implies.

The benchmark wins are incremental. A few percentage points over Claude Mythos on Terminal-Bench is not a category-defining lead. Mythos isn’t even publicly available for normal use. By the time most developers can A/B test these models seriously, Anthropic will have shipped something new.

What matters more is the cumulative direction. Each release in this cycle - GPT-5.4, Opus 4.7, Mythos, now GPT-5.5 - adds another notch to autonomous capability. None is a leap. All of them together represent a genuine shift in what AI tools can do without human supervision.

The 1M context window is the kind of thing that quietly enables new use cases. Whole-codebase reasoning, long document workflows, multi-hour agent runs without context collapse. These weren’t impossible before, but they were fragile. They’re more robust now.

The fully retrained base model is the part that should make people pause. OpenAI doesn’t retrain base models often - it’s expensive and risky. The fact that they did, and that the gains are concentrated in agentic tasks rather than chat or reasoning, signals where they think the next two years go.

The Verdict

GPT-5.5 is not a wow release. It’s not the moment that changes how anyone thinks about AI. But it’s also not just a number bump. The retraining, the context window, and the agentic focus add up to something that matters for anyone building production AI workflows.

For developers using Claude Code or similar tools, the practical question is whether to switch. Probably not yet - the pricing is similar, the capability lead is narrow, and switching costs are real. But the gap between OpenAI and Anthropic just narrowed again, which is good for everyone using either one.

For OpenAI, this is a competitive response that puts them back at the top of the agentic coding charts. Whether they hold that position depends on what Anthropic ships next, which based on recent patterns won’t take long.

The model is rolling out now to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. API access via Responses and Chat Completions follows shortly.

For the durable take on where reasoning models are heading, see Reasoning Models in 2026.

TL;DR#

What Actually Changed#

The Pricing Story#

The Agentic Pivot#

The “Super App” Subtext#

Is This Actually Important?#

The Verdict#

Related Reading#