OpenAI shipped GPT-5.5 on Wednesday, April 23. DeepSeek dropped V4 on Thursday, April 24. Both are frontier-class. Both claim million-token context. Here's what each one actually is.
GPT-5.5: The Numbers
OpenAI's latest is the first fully retrained base model since GPT-4.5. The headline stats: 88.7% on SWE-bench, 92.4% on MMLU, and a claimed 60% reduction in hallucinations versus GPT-5.4 [1]. The math flex is GPT-5.5 Pro scoring 39.6% on FrontierMath Tier 4 — nearly double the 22.9% that OpenAI's own comparison attributes to Claude Opus 4.7 [2].
Six weeks after GPT-5.4. The release cadence is accelerating.
But there's a catch in the fine print. On Artificial Analysis's AA-Omniscience benchmark, GPT-5.5 hits the highest recorded accuracy at 57% — and the highest hallucination rate at 86% [3]. It knows more and makes more up. That's not a contradiction. It's a feature of models trained to be maximally confident.
Developer reactions are measured. The consensus on Hacker News: "The jump is bigger than 5.4 → 5.5 suggests," but the practical gap between what GPT-5.5 can do and what most developers actually need it to do is wide [3]. ChatGPT Plus users got 200 messages per week at launch — a material downgrade in effective usage even if each message is smarter.
DeepSeek V4: The Architecture
DeepSeek released two models simultaneously, both open source under the MIT license with weights on Hugging Face [4].
V4-Pro: 1.6 trillion total parameters, 49 billion active per token, trained on 33 trillion tokens. Mixture-of-Experts architecture. Scores 80.6% on SWE-bench Verified — within 0.2 points of Claude Opus 4.6's 80.8%, though the newer Opus 4.7 has since pushed that to 87.6% [5]. On Codeforces, V4 scored 3,206, beating GPT-5.4's 3,168 in competitive programming. On LiveCodeBench, V4-Pro hits 93.5%, ahead of Opus 4.7's 88.8% [5].
V4-Flash: The efficiency play. 284 billion total parameters, 13 billion active per token, trained on 32 trillion tokens. Same MoE architecture, optimized for volume.
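To put those active-parameter numbers in perspective: in an MoE model, per-token inference cost scales with the active parameters, not the total. A back-of-envelope sketch using the rough rule of thumb of ~2 FLOPs per active parameter per token for a forward pass (an approximation, not a DeepSeek-published figure):

```python
# Rough per-token inference cost for the two V4 variants, using the
# common ~2 FLOPs per active parameter per token approximation for a
# forward pass. Parameter counts are from DeepSeek's release.
models = {
    "V4-Pro":   {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9,  "active": 13e9},
}
for name, p in models.items():
    active_frac = p["active"] / p["total"]
    gflops_per_token = 2 * p["active"] / 1e9
    print(f"{name}: {active_frac:.1%} of weights active, "
          f"~{gflops_per_token:.0f} GFLOPs per token")
```

That is why V4-Flash can be the efficiency play despite 284 billion total parameters: per token, it does roughly a quarter of V4-Pro's work.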
The engineering story is the context window. Million-token context is built in from the ground up — not bolted on. V4-Pro requires only 27% of the inference compute and 10% of the KV cache compared to DeepSeek V3.2 at the same context length [5]. That makes long-context workloads practically viable, not just technically possible.
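A rough sense of why the 10% KV-cache figure is the load-bearing number: the cache grows linearly with context length, and at a million tokens it can dwarf everything else in memory. The layer count, KV width, and dtype below are illustrative placeholders, not V4's actual config:

```python
# Illustrative KV-cache sizing at 1M-token context. All architecture
# numbers here are hypothetical -- the point is the linear scaling,
# not V4's real memory footprint.
context_len = 1_000_000
n_layers    = 60      # hypothetical transformer depth
kv_width    = 1024    # hypothetical per-layer width for each of K and V
bytes_per   = 2       # bf16
full_cache  = context_len * n_layers * kv_width * 2 * bytes_per  # K + V
print(f"baseline cache: ~{full_cache / 2**30:.0f} GiB per 1M-token sequence")
print(f"at 10% of baseline (the V4 claim): ~{full_cache * 0.10 / 2**30:.0f} GiB")
```

Under these toy numbers, the cache drops from a multi-accelerator problem to something a single card can hold.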
Both models run with zero CUDA dependency — the Chinese AI stack achieving full infrastructure independence from Nvidia. HN commenters flagged this as potentially more significant than the benchmarks themselves [5].
Where You Can Use Them
GPT-5.5 is live now in ChatGPT for Plus, Pro, Business, and Enterprise users. It's in Codex (CLI, IDE extensions, and web) across Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. The API is rolling out imminently — Cursor, Windsurf, Vercel AI Gateway, and OpenRouter will pick it up when it lands. It's already showing in the Codex CLI model list [1].
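When API access does land, calling it should look like any other OpenAI model. A minimal sketch with the official Python SDK; note that the model id "gpt-5.5" is our guess from OpenAI's naming pattern, not a confirmed identifier:

```python
# Minimal sketch of calling GPT-5.5 via the OpenAI Python SDK once the
# API rollout completes. The model id below is assumed from OpenAI's
# naming convention -- confirm against client.models.list() first.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-5.5",  # assumed id, not yet confirmed
    messages=[{"role": "user", "content": "Summarize this changelog in one line."}],
)
print(resp.choices[0].message.content)
```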
DeepSeek V4 is available through the DeepSeek API as of today. Weights are on Hugging Face for self-hosting. Cursor supports DeepSeek models via API key configuration, though there's an open compatibility issue with V4's reasoning output after tool calls [6]. Windsurf, Cline, and any tool that supports custom API endpoints can connect. V4-Flash is the obvious new default for high-volume serving [5].
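For API access without self-hosting, DeepSeek's endpoint is OpenAI-compatible, so most existing tooling only needs a base-URL swap. A sketch; whether the long-standing "deepseek-chat" alias now resolves to V4 is an assumption worth verifying against DeepSeek's model docs:

```python
# Sketch of hitting DeepSeek's OpenAI-compatible API with the same SDK.
# "deepseek-chat" has historically aliased the latest chat model; that
# it now points at V4 is an assumption -- verify in DeepSeek's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)
resp = client.chat.completions.create(
    model="deepseek-chat",  # assumed V4 alias; verify
    messages=[{"role": "user", "content": "One-line summary of MoE routing."}],
)
print(resp.choices[0].message.content)
```

The same base-URL swap is how Windsurf, Cline, and other custom-endpoint tools connect.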
The Scorecard
Two frontier models in 24 hours. One is closed, subscription-gated, and leads on math and agentic terminal tasks. The other is open-source, MIT-licensed, self-hostable, and leads on competitive programming and LiveCodeBench. Both claim million-token context. Both are available now.
The benchmarks are converging — SWE-bench scores for the top five models are within an 8-point spread. The question used to be which model is smartest. It isn't anymore. It's which tradeoffs you're willing to make: closed vs. open, subscription vs. API, Nvidia-dependent vs. hardware-independent, high-confidence-but-hallucinatory vs. we-don't-have-that-data-yet.
We'll be watching the independent benchmarks as they come in. The vendor numbers are marketing. The third-party numbers are data. Right now, we mostly have the former.
Disclosure
This article was written by Nadia Byer with the assistance of Claude, an AI made by Anthropic — one of the companies whose models are discussed in this piece. Claude Opus 4.6 and 4.7 are referenced in benchmark comparisons. We have no financial relationship with any of the companies mentioned. Corrections welcome at bustah_oa@sloppish.com.
Sources
1. OpenAI, "Introducing GPT-5.5," April 23, 2026. Link. Also: TechCrunch, SiliconANGLE.
2. OpenAI FrontierMath Tier 4 benchmark results. GPT-5.5 Pro: 39.6%; Claude Opus 4.7: 22.9%. Reported in OpenAI's announcement and Ofox.
3. Developer reactions compiled from LumiChats and Hacker News threads. AA-Omniscience hallucination benchmark via Artificial Analysis.
4. DeepSeek V4 release. Simon Willison, CNBC, Hugging Face.
5. DeepSeek V4-Pro benchmarks and architecture. BuildFastWithAI, Ofox.
6. Cursor forum, "Compatibility with DeepSeek model's design to return reasoning content after tool calls," April 2026. Link.
