The State of Open-Weight Models in 2026: Llama, Qwen, Mistral, DeepSeek

The open-weight model conversation in 2023 was about whether the open ecosystem could keep up with the frontier labs at all. The conversation in 2024 was about how big the gap was. The conversation in 2026 has changed shape: on most benchmarks that matter to most production workloads, the open-weight ecosystem has either closed or substantially narrowed the gap, and the strategic question is no longer “can we use open models” but “which open model fits this workload best.”

TL;DR

The 2026 open-weight frontier is dominated by five families: Meta Llama 4 (Scout and Maverick), Alibaba Qwen 3.5, DeepSeek V4 (Pro and Flash), Google Gemma 4, and Mistral Medium 3.5 - all five of which shipped frontier-class models within roughly the last quarter.
DeepSeek V4 Pro leads on raw capability with public scores of 80.6 on SWE-Bench Verified and 90.1 on GPQA Diamond, with a 1M token context. Mistral Medium 3.5 is the strongest EU-jurisdiction coding model. Qwen 3.5 is the best price-to-performance for general inference. Llama 4 Scout has the largest context at 10M tokens. Gemma 4 is the strongest small-model on-device option.
Mixture-of-Experts is now the default architecture at the frontier of the open ecosystem. DeepSeek V4-Pro is 1.6T total / 49B active. Llama 4 Maverick is 400B / 17B. Qwen 3.5 is 397B / 17B. Mistral Large 3 is 675B / 41B. The dense-model era at the frontier is essentially over.
Context length has stopped being a meaningful gate. Llama 4 Scout ships at 10M, DeepSeek V4 at 1M, Gemma 4 medium at 256K. The interesting question is what models can actually do with the context, not whether they have it.
The licence landscape has consolidated. Apache 2.0 has won as the default permissive licence (Gemma 4, Qwen 3.5, Mistral Large 3). DeepSeek V4 ships MIT. Llama 4 retains the Meta custom licence with the 700M monthly-active-users clause that has been a real obstacle for some deployments.
The gap with the frontier proprietary models is now smaller than the cost difference. For most production workloads in 2026, an open-weight model running on commodity infrastructure is the rational default unless you have a specific reason to be on the proprietary frontier.

The five families that matter

The 2026 open-weight ecosystem has consolidated around five families with credible frontier capability. Smaller open projects continue to ship excellent work in specific niches, but the five names below define the shape of what is possible at the frontier.

Meta Llama 4

Llama 4 is the most-deployed open-weight family in the world by a wide margin. Meta released two variants: Scout (109B parameters total in MoE) and Maverick (400B total / 17B active). Llama 4 Scout’s standout feature is the 10 million token context window, which is the largest in the open ecosystem and one of the largest in any production model. Maverick is the higher-capability sibling and competes credibly with the proprietary frontier on most coding and reasoning benchmarks.

The catch with Llama 4 is the licence. Meta retained its custom licence with the 700 million monthly active users clause, which means the largest hyperscalers and a handful of consumer applications cannot use it without separate commercial terms. For everyone else, the licence is permissive enough to be production-friendly, and Meta has continued to invest seriously in the surrounding ecosystem - the Llama Stack, the inference partnerships, and the Meta-Groq partnership that serves the official Llama API on Groq’s LPU infrastructure.

Alibaba Qwen 3.5

Qwen 3.5 is the most aggressive of the Chinese open-weight families on price-to-performance. The flagship 397B / 17B MoE model is competitive with the frontier on a wide range of benchmarks while shipping under Apache 2.0. The smaller Qwen 3.5 variants - notably the 32B dense model and the smaller MoE variants - are the default open-weight choice for cost-sensitive production deployments and for the long tail of fine-tuning use cases.

The Qwen ecosystem is also unusually deep on multimodal and on agentic tooling. Qwen-VL handles vision, Qwen-Audio handles speech, and the Qwen Agent framework has been adopted widely across the Chinese AI developer community. The result is that Qwen is the open-weight family most likely to win where the workload is multimodal or agentic and the cost is a real constraint.

DeepSeek V4

DeepSeek shipped V4 in late 2025 in two configurations: V4 Pro (1.6T total / 49B active) and V4 Flash (smaller, cheaper, faster). V4 Pro leads the open-weight ecosystem on most public benchmarks - 80.6 on SWE-Bench Verified, 90.1 on GPQA Diamond, 1M context - and is competitive with the proprietary frontier on most reasoning and coding tasks.

The strategic significance of DeepSeek goes beyond the benchmarks. The lab’s combination of frontier capability, MIT licensing, and aggressive API pricing has made it the single biggest pressure on the proprietary frontier’s pricing power. The R1 reasoning release in early 2025 reset the price floor for reasoning models, and the V4 release in late 2025 reset the price floor for general-purpose frontier inference. The pricing pressure has been the single most consequential effect of the open-weight ecosystem on the broader market in 2025-2026.

Google Gemma 4

Gemma 4 is the most thoughtfully designed open-weight family for on-device and edge deployment. The lineup ranges from a 2B variant suitable for mobile devices through a 270B flagship, with the 270B comparable to the frontier on quality benchmarks and the smaller variants market-leading on size-per-capability. The 256K context on the medium tier is sufficient for the vast majority of production use cases.

Gemma 4 ships under Apache 2.0 with a use policy that is more permissive than Llama 4 and that has made it the default open-weight choice for consumer applications and for embedded deployments. The Google branding helps with enterprise adoption, and the integration with the broader Google AI surface - Vertex AI, the Gemma Cookbook, the published fine-tuning recipes - has made it the easiest open-weight family to deploy seriously.

Mistral Medium 3.5

Mistral has continued to ship through 2025 and into 2026, and Medium 3.5 (released late April 2026) is the EU-jurisdiction frontier choice. The model scores 77.6% on SWE-Bench Verified, ships under Apache 2.0, and is fully covered by EU data protection regimes - which has become a real procurement consideration for enterprises operating under GDPR, the AI Act, and the various national-level extensions that have appeared since 2025.

The Mistral lineup includes Mistral Large 3 (675B / 41B MoE) at the upper end and a range of smaller variants - including the Codestral coding-specialised models - across the rest of the size spectrum. The EU-jurisdiction story is the primary commercial differentiator and it has been more consequential than the raw benchmark numbers in determining where Mistral has won enterprise deals.

The architectural story: MoE is the default now

The single biggest architectural shift in the open-weight ecosystem between 2024 and 2026 has been the universal adoption of Mixture-of-Experts at the frontier of the size distribution. Every frontier-class open model released in the last six months has been an MoE. The dense-model era at the top end of the open ecosystem is over.

The reason for the shift is the economics. An MoE model with N total parameters and K active parameters per token costs approximately K to serve while having the capability headroom of a dense model with somewhere between N and K parameters. The activation efficiency means you can ship a 1T+ parameter model that costs roughly as much to serve as a 50B dense model, with reasoning capability that benefits from the larger total parameter count.

The practical implication is that the open-weight ecosystem has bifurcated. The frontier-capability models are MoEs that need substantial GPU memory to load (typical numbers are 8-16 H100s or A100s for the larger variants) but that have inference costs roughly proportional to a small dense model. The cost-sensitive models, particularly Gemma 4 and the smaller Qwen and Mistral variants, are dense and fit on much smaller hardware. The middle of the size distribution - dense models in the 70-100B range - has substantially emptied out, because MoEs are more efficient at the top end and small dense models are more efficient at the bottom.

For people deploying these models, the implication is that the hardware story has changed. The default GPU configuration for serious open-weight inference is now larger than it was, and the smaller deployments have moved more aggressively toward dense models in the 7-30B range. The middle ground that dominated the deployment patterns of 2023-2024 has narrowed.

What “closed the gap” actually means

The frontier proprietary models - GPT-5.5, Gemini 3, Claude Opus 4.7 - are still ahead of the open-weight ecosystem on several important dimensions. The gap is real and the framing that says open models have caught up is overstated. The honest framing is more specific: the gap has closed on a specific list of benchmarks and capabilities, and remains real on others.

Where open has caught up. Single-shot coding tasks. Mathematical and scientific reasoning at the benchmark level. Routine text generation, summarisation, and classification. Multilingual capability outside of English. Long-context retrieval and synthesis on moderately complex inputs. The price-to-performance on these workloads now favours open models by a wide margin and the production deployments have moved accordingly.

Where the frontier proprietary models still lead. Long-horizon agentic workflows. Tool use reliability at scale. Novel-context generalisation - tasks that look different from anything in the training data. Subjective tasks where style and judgement matter more than correctness. Reliability on the long tail of edge cases. These are the workloads where the closed labs continue to invest in the surrounding infrastructure - tool integration, evaluation, deployment tooling - in ways that the open ecosystem has not matched.

The strategic implication is that the rational deployment pattern in 2026 is hybrid. Use open-weight models for the bulk of routine inference. Reserve the proprietary frontier for the small subset of workloads that genuinely benefit from it. This is the pattern most sophisticated enterprises have converged on and it is the one that makes the most economic sense given the price differential.

The geopolitical dimension

The open-weight conversation in 2026 has become a geopolitical conversation in ways it was not in 2024.

DeepSeek and Qwen are Chinese labs producing the most price-competitive open-weight frontier models in the world. The strategic implications are large. The US export control regime on AI chips was designed in part to slow Chinese frontier model development, and the empirical evidence is that it has not done so - DeepSeek V4 and Qwen 3.5 are credible frontier models that ship at significantly lower prices than the US proprietary equivalents. The policy response is still being worked out and it is one of the major topics in US AI policy circles for 2026-2027.

The European response has been different. The EU AI Act and the surrounding regulatory regime have made EU-jurisdiction models more attractive for enterprises operating under European regulation, and Mistral has been the primary beneficiary. The framing of European AI sovereignty has driven significant policy and procurement decisions, and the trajectory is toward a regulatory environment that explicitly favours European labs over both US and Chinese alternatives for certain workloads.

The implication for builders is that the choice of model is no longer purely a technical decision. Procurement teams in regulated industries are increasingly factoring in the jurisdiction of the model provider, the licence terms, the export-control implications, and the alignment with relevant regulatory regimes. The open-weight ecosystem makes the jurisdiction question more complicated rather than simpler, because the choice is no longer just between US providers but between US, Chinese, European, and a growing tail of smaller alternatives.

The controversial parts

Four claims in the open-weight conversation deserve more pushback than they typically get.

The first is the claim that open models have caught up with the frontier. The empirical case for this on benchmarks is real but the production-reliability gap is larger than the benchmarks suggest. The frontier proprietary models have advantages in long-tail edge cases, in tool integration, and in the surrounding evaluation infrastructure that do not show up in the public scores. The closure of the gap is real but it is not complete.

The second is the claim that DeepSeek’s success proves the US export-control regime has failed. The framing is appealing but the counterfactual is harder than it sounds. The capability gap between DeepSeek V4 and the US frontier is real even if the public benchmarks understate it, and the question of how fast DeepSeek would have shipped without the constraints is genuinely uncertain. The honest version is that the export controls have not stopped Chinese frontier model development but they have raised its cost and slowed its pace by some difficult-to-quantify amount.

The third is the claim that open weights are equivalent to open source. The licensing landscape has consolidated around Apache 2.0 and MIT for the most permissive options, but the training data, training code, and evaluation infrastructure behind these models are mostly not open. The framing of these models as “open source” is technically inaccurate and it has consequences for reproducibility, for safety research, and for the broader ecosystem’s ability to actually understand what these models are doing.

The fourth is the claim that the open-weight ecosystem will continue to compress the price floor faster than the proprietary frontier can keep up. The trajectory is favourable to this claim through mid-2026, but the energy and infrastructure constraints discussed in the AI energy crisis post bind everyone, not just the frontier labs. The compute economics for the open ecosystem are downstream of the same constraints as the closed ecosystem, and the price floor is likely to stabilise at a level that reflects those constraints rather than continuing to fall indefinitely.

Where this is heading

The most likely shape of 2027 is that the open-weight ecosystem becomes the default for the bulk of production AI inference, with the proprietary frontier reserved for the long tail of workloads that genuinely benefit from the additional capability. The deployment patterns are already moving in this direction and the economics will continue to push that way.

The other prediction worth making is that the architectural diversity in the open ecosystem will keep growing. The MoE-at-the-frontier, dense-at-the-edge pattern is the current shape but it is not the final shape. State-space models, hybrid architectures, and various forms of specialised compute - including the reasoning models discussed in the adjacent post - are going to produce a more heterogeneous picture by the end of 2026 than the current snapshot implies.

For people building with these models today, the practical guidance is the boring version of the exciting story. Use open weights for the workloads where they fit. Keep the proprietary frontier in the toolkit for the workloads where it wins. Pay attention to the licence terms because they actually matter in production. Watch the architectural trends because the deployment footprint changes when the architecture changes. Treat the price-floor compression as a feature you can build a business around, not as a trend that justifies postponing the build.

DeepSeek - the lab whose 2025 releases reset the price floor for the open ecosystem.
The Rise of Small Language Models: Why Size Isn’t Everything - the small-model end of the same story.
Reasoning Models in 2026: o3, R2, and the Compute-at-Inference Shift - the capability story the open-weight ecosystem now competes on alongside raw inference.
Running AI Models Locally with Ollama: From Setup to OpenClaw - the practical local-deployment angle for open-weight models.
When to Fine-Tune vs When to RAG: Choosing Your AI Architecture - the customisation decisions that open weights make possible.

TL;DR#

The five families that matter#

Meta Llama 4#

Alibaba Qwen 3.5#

DeepSeek V4#

Google Gemma 4#

Mistral Medium 3.5#

The architectural story: MoE is the default now#

What “closed the gap” actually means#

The geopolitical dimension#

The controversial parts#

Where this is heading#

Related Reading#