SOTA AI Model Landscape — June 2026

Executive Summary

The AI model landscape in June 2026 is defined by three converging forces: frontier models have reached genuinely dangerous capability levels, the United States has demonstrated willingness to unilaterally disable access to those models worldwide, and the open-weight ecosystem has matured to the point where sovereign alternatives are technically viable.

On June 12, 2026, the US Commerce Department ordered Anthropic to suspend foreign nationals' access to its most capable models, Fable 5 and Mythos 5, three days after launch. Because nationality-based filtering proved technically infeasible, Anthropic disabled both models for all users worldwide, including paying enterprise customers. As of June 22, 2026, they remain suspended with no restoration date. This is the first time a commercially deployed frontier AI model has been forcibly recalled by government order.

The incident transformed “sovereign AI” from a policy talking point into an operational imperative. Every organisation running critical workloads on US-hosted frontier models now faces a demonstrated risk: a single government directive can sever access without warning, without recourse, and without geographic exemption.

Meanwhile, the open-weight ecosystem offers a credible alternative. Models like DeepSeek V4-Pro (MIT license, 80.6% SWE-Bench Verified), Qwen 3.6 (Apache 2.0, runs on a single consumer GPU), and Mistral Large 3 (Apache 2.0, European sovereign infrastructure) deliver performance that would have been frontier-class twelve months ago, under licenses that permit unrestricted sovereign deployment. Published open-weight models are currently exempt from US export controls.

The architectural landscape has also diversified. Monolithic scaling continues at the frontier, but Mixture-of-Experts architectures now dominate (used by Anthropic, OpenAI, Google, DeepSeek, Mistral, Qwen, and Meta), reasoning chains add inference-time compute for hard problems, and multi-model ensemble systems offer a path to frontier-competitive performance at a fraction of the cost. For organisations willing to invest in orchestration rather than raw scale, the gap between “what you can build yourself” and “what the frontier offers” is narrower than it has ever been.

The Export Control Watershed

Timeline

Date | Event
June 9, 2026 | Anthropic launches Fable 5 and Mythos 5 globally. Fable 5 is the commercial product; Mythos 5 is the unrestricted variant for approved cybersecurity and government partners. Same weights, different safety layers.
June 11 | After criticism from cybersecurity researchers that silent rerouting to Opus 4.8 was blocking legitimate defensive work, Anthropic makes the safety fallback visible.
June 12, 5:21 PM ET | US Commerce Department’s Bureau of Industry and Security (BIS), under Secretary Howard Lutnick, issues directive to suspend all access for foreign nationals.
June 13 | Anthropic disables both models for ALL users worldwide. Services removed from AWS Bedrock, Google Cloud, Microsoft Foundry, Snowflake, Box, and direct APIs.
June 17 | G7 summit in Evian-les-Bains. AI executives meet with G7 heads of state. France announces Western democracies will establish a coordinated AI cooperation platform within one month.
June 18 | Proposed UK exemption collapses. US House members demand answers from the administration.
June 22 | Both models remain suspended for all users worldwide. No restoration date published.

The Stated Trigger

The Commerce Department cited a jailbreak technique that could cause Fable 5 to exhibit Mythos 5’s cybersecurity analysis capabilities — the kind of vulnerability discovery reasoning that could accelerate offensive cyber operations. Anthropic maintained the vulnerabilities were “known in advance” and “relatively minor in severity,” and that similar capabilities exist in GPT-5.5.

The Broader Context

This did not emerge from nothing. In February 2026, President Trump directed all federal agencies to cease using Anthropic after the company refused to waive contractual restrictions on Claude’s use for mass domestic surveillance and fully autonomous weapons. Defense Secretary Hegseth designated Anthropic a “supply chain risk” — the first time this designation was applied to an American company.

Global Reaction

  • France: Bruno Retailleau called it a “wake-up call.” Benjamin Haddad characterised it as “an accelerator of the geopolitical battle over AI.” Jordan Bardella urged accelerated government support for Mistral AI.
  • United Kingdom: Al Carns stated “This isn’t an AI story. It’s the story of every industry we used to lead.” The UK’s proposed exemption from the directive collapsed.
  • Netherlands: Geert Wilders called for accelerating domestic AI model development: “AI is more and more national sovereignty.”
  • EU: The European Commission had already proposed the Cloud and AI Development Act on June 3 (pre-suspension), with goals to triple European data centre capacity over 5-7 years. The suspension dramatically accelerated political support.
  • Australia: No formal government statement, but the incident has strengthened the sovereign AI debate domestically. Kate Carruthers (UNSW) wrote that the incident “makes sovereign AI real.” SmartCompany reported that access to advanced AI capabilities now depends on “export controls, nationality, and geopolitical considerations rather than just commercial decisions.”

What This Means

The Fable 5 suspension establishes three precedents:
  1. The US government will act unilaterally against specific model deployments when it perceives a national security basis.
  2. The practical effect is global, regardless of the targeted users’ nationality or location, because providers cannot technically segregate access in real time.
  3. No exemption exists for Five Eyes partners, EU allies, or any other country. The UK exemption proposal collapsed.
For any non-US organisation running critical workloads on US-hosted frontier models, the risk is no longer theoretical. It has been demonstrated.

Model Landscape

Anthropic (Fable 5, Mythos 5, Opus 4.8, Sonnet 4.6)

Architecture

Anthropic has not officially disclosed architecture type or parameter counts for any of its models. Third-party analysis strongly suggests Fable 5 / Mythos 5 use a sparse Mixture-of-Experts (MoE) architecture optimised for RAG and massive codebases. Anthropic has not confirmed or denied this. Fable 5 and Mythos 5 share identical weights — same training, same base model, same capability ceiling. The only difference is the safety layer: Fable 5 uses a multi-layer content classifier that reroutes high-risk queries to Opus 4.8; Mythos 5 is unrestricted. Mythos 5 is limited to Project Glasswing cybersecurity partners and select US government collaborators.

Capabilities and Benchmarks

BenchmarkFable 5Opus 4.8Sonnet 4.6Haiku 4.5
SWE-Bench Verified95.0%*88.6%72.7-79.6%73.3%
SWE-Bench Pro80.3%*69.2%39.5%
FrontierCode Diamond29.3%13.4%
Terminal-Bench 2.188.0%82.7%
GPQA Diamond93.6%~83%
MMLU91.8%
Humanity’s Last Exam59.0% (no tools) / 64.5% (tools)
ExploitBench (Mythos)78.0%40.0%
*The 80.3% SWE-Bench Pro score was produced using Anthropic’s own scaffolding, not a neutral evaluation harness. Independent evaluators have contested this figure. Vendor-scaffold numbers consistently run 10-30 points above Scale’s standardised leaderboard. Anthropic did not publish MMLU or HumanEval scores for Fable 5.

Context and Output

Model | Context | Max Output
Fable 5 / Mythos 5 | 1M tokens | 128K tokens
Opus 4.8 | 1M tokens | 128K tokens
Sonnet 4.6 | 1M tokens (beta) | 64K tokens
Haiku 4.5 | 200K tokens | 64K tokens

Pricing

Model | Input/MTok | Output/MTok | Cache Hit | Batch (In/Out)
Fable 5 / Mythos 5 | $10.00 | $50.00 | $1.00 | $5.00 / $25.00
Opus 4.8 | $5.00 | $25.00 | $0.50 | $2.50 / $12.50
Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $1.50 / $7.50
Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50 / $2.50
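
As a rough illustration of how the per-MTok rates above combine, the following sketch prices a single request. It assumes that cached input tokens are billed at the Cache Hit rate and all other input tokens at the Input rate, and it ignores batch discounts and any cache-write surcharge. The names `PRICES` and `request_cost` are illustrative and not part of any Anthropic SDK.

```python
# Per-million-token rates copied from the pricing table above (USD).
PRICES = {
    "Fable 5": {"input": 10.00, "output": 50.00, "cache_hit": 1.00},
    "Opus 4.8": {"input": 5.00, "output": 25.00, "cache_hit": 0.50},
    "Sonnet 4.6": {"input": 3.00, "output": 15.00, "cache_hit": 0.30},
    "Haiku 4.5": {"input": 1.00, "output": 5.00, "cache_hit": 0.10},
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request.

    Assumption: cached input tokens bill at the cache-hit rate and the
    remaining input tokens bill at the standard input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    rates = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * rates["input"]
        + cached_tokens * rates["cache_hit"]
        + output_tokens * rates["output"]
    )
    return total / 1_000_000
```

Under these assumptions, a 100K-token prompt with a 10K-token reply on Opus 4.8 comes to about $0.75, falling to about $0.39 if 80K of the prompt tokens are cache hits.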

Deployment Model and Sovereign Limitations

Anthropic operates API-only through its own infrastructure, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. There is no on-premise or self-hosted option. All data flows through US-based infrastructure. The June 12 export control incident demonstrated the operational consequence: the US government effectively exercised a kill switch over global access, and Anthropic had no technical means to maintain service for non-US customers even if it wanted to.

Compute and Financials

Anthropic is described as a “highly capital-intensive, quasi-infrastructure entity” rather than an asset-light SaaS business. Committed compute partnerships exceed $330 billion (Amazon >$100B over 10 years, Google ~$200B over 5 years, Microsoft $30B). Projected 2026 losses: approximately $29 billion against $25-30 billion in revenue, with 65-80% consumed by compute costs. Peak training spend estimated at ~$30 billion in the 2028 timeframe.

OpenAI (Codex, GPT Series, o-Series)

Architecture

OpenAI uses a Mixture-of-Experts (MoE) architecture for the GPT-5.x family. Exact parameter counts are not disclosed. Estimates suggest active parameters in the 2-5 trillion range with a total expert pool potentially 10-50+ trillion. The widely circulated 52.5 trillion figure represents total parameter capacity, not active parameters per inference. GPT-5.5 (codenamed “Spud,” released April 23, 2026) is the first fully retrained base model since GPT-4.5. Every model from GPT-5.0 through GPT-5.4 was an incremental post-training iteration on the same foundation; 5.5 is a ground-up rebuild.

Codex Platform

Codex is now OpenAI’s agentic coding platform, not a standalone model. It runs across four surfaces: the Codex app (desktop), Codex CLI (terminal agent), IDE extensions, and Codex Cloud (web). The underlying models are GPT-5.x variants. Current capabilities include computer use, Record and Replay workflow automation, PR review, multi-file terminal view, in-app browser, and SSH to remote devboxes.

Key Benchmarks

BenchmarkGPT-5.5GPT-5.4Notes
SWE-Bench Verified88.7%74.9%
SWE-Bench Pro58.6%57.7%
Terminal-Bench 2.082.7%75.1%
MMLU92.4%
GPQA Diamond93.6%92.8%
ARC-AGI-285.0%73.3%
FrontierMath T1-351.7%47.6%
Long-context 512K-1M (MRCR v2)74.0%36.6%Major improvement
Hallucination caveat: GPT-5.5 scores highest on factual recall (57% accuracy on AA-Omniscience) but has an 86% hallucination rate on that benchmark vs. Claude Opus 4.7’s 36%. It confabulates more aggressively at knowledge boundaries.

Reasoning Models (o-Series)

Model | Input/MTok | Output/MTok | Context | Key Score
o4-mini | $1.10 | $4.40 | 200K | AIME 2025: 92.7%, SWE-Bench: 68.1%
o3 | $2.00 | $8.00 | 200K | Codeforces SOTA, MMMU leader
o3-pro | $20.00 | $80.00 | 200K | AIME 2025: 98%, GPQA Diamond: 86%
The o-series models add explicit reasoning chains (inference-time compute) for harder problems. o3-pro targets the hardest 5% of problems: PhD-level science, competitive maths, complex formal reasoning.

Pricing

Model | Input/MTok | Output/MTok | Cached Input | Batch (In/Out)
GPT-5.5 | $5.00 | $30.00 | $0.50 | $2.50 / $15.00
GPT-5.5 Pro | $30.00 | $180.00 |  | $15.00 / $90.00
GPT-5.4 | $2.50 | $15.00 | $0.25 | $1.25 / $7.50
GPT-5.4 mini | $0.75 | $4.50 | $0.075 | $0.375 / $2.25
GPT-5.4 nano | $0.20 | $1.25 | $0.02 | $0.10 / $0.625
GPT-4.1 | $2.00 | $8.00 | $0.50 | $1.00 / $4.00
GPT-4.1 mini | $0.40 | $1.60 | $0.10 | $0.20 / $0.80
GPT-4.1 nano | $0.10 | $0.40 | $0.025 | $0.05 / $0.20

Open-Weight Models

OpenAI has released limited open-weight reasoning models under Apache 2.0:
Model | Parameters | License | Purpose
gpt-oss-120b | 120B | Apache 2.0 | General reasoning
gpt-oss-20b | 20B | Apache 2.0 | Lightweight reasoning
gpt-oss-safeguard-120b | 120B | Apache 2.0 | Safety classification
gpt-oss-safeguard-20b | 20B | Apache 2.0 | Safety classification
All flagship models (GPT-5.x, o-series) remain closed-weight.

Deployment Model and Sovereign Position

OpenAI operates via its own API and Azure OpenAI Service. The Microsoft exclusivity arrangement was removed in April 2026, so OpenAI can now partner with other cloud providers. Azure Sovereign Cloud / Azure Local offers on-premises control planes for government and defence workloads.

The NEXTDC partnership (“OpenAI for Australia”) involves an AUD $7+ billion hyperscale AI campus at Eastern Creek, Sydney (S7, 650MW total campus capacity, with OpenAI as initial offtaker at approximately 550MW). Phase 1 is expected H2 2027. However, this is OpenAI sovereign compute infrastructure, not customer-controlled infrastructure, and the distinction matters.

OpenAI has so far avoided direct export control restrictions. However, industry expectation is that export control obligations will extend across multiple providers over the next 12-24 months as models exceed capability thresholds.

Financials

OpenAI has approximately $25 billion in annualised revenue and approximately 900 million weekly users, and is projected to lose $14 billion in 2026 (inference costs dominate). Training runs for frontier models are estimated at $500M+ each. The Stargate Abilene cluster is coming online in phases.

Google (Gemini Family)

Architecture

All Gemini models from 2.5 onward use a sparse Mixture-of-Experts (MoE) architecture built on a dense Transformer backbone. The Gemini 2.5 Pro technical report (the only one with confirmed architecture details) describes: approximately 200 billion total parameters, decoder-only transformer, 80 layers, 16,384 hidden dimensions, 128 self-attention heads, MoE layers every other block with 64 experts per block and 8 active per token (approximately 12.5% of parameters active per inference). This yields roughly 1.6x compute/capacity efficiency over purely dense models. Gemini 3.x adds a “DeepThink System 2” deliberation layer with three-tier reasoning control (Low/Medium/High). Parameter counts for the 3.x family are not disclosed. Google trains entirely on custom TPU hardware (v5e, v6e Trillium) with no NVIDIA GPU fallback for Gemini models. The newly announced TPU 8t delivers 121 exaflops per superpod with 9,600 chips.

Current Model Lineup

Model | Release | Context | Input/MTok | Output/MTok | Key Benchmark
Gemini 3.5 Flash | May 2026 | 1M | $1.50 | $9.00 | Terminal-Bench 2.1: 76.2%
Gemini 3.5 Pro | Previewed, not GA |  |  |  | 
Gemini 3.1 Pro | Feb 2026 | 1M | $2.00/$4.00 | $12.00/$18.00 | SWE-Bench: 80.6%, GPQA: 94.3%
Gemini 3.1 Flash-Lite |  |  | $0.25 | $1.50 | Budget frontier
Gemini 2.5 Pro | GA | 1M | $1.25/$2.50 | $10.00/$15.00 | SWE-Bench: 63.8%
Gemini 2.5 Flash | GA | 1M | $0.30 | $2.50 | 
Gemini 2.5 Flash-Lite | GA | 1M | $0.10 | $0.40 | Cheapest
Prices written with a “/” are for <=200K / >200K context. Batch mode runs at 50% of standard pricing across all models.
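
To make the tiered pricing concrete, here is a minimal sketch that prices one Gemini 3.1 Pro request from the rates in the lineup table. It assumes the tier is selected by the prompt's input token count and that the selected tier's rates apply to both the input and output of that request; the source does not spell out how the threshold is applied, so treat this as one plausible reading. The function name is illustrative.

```python
# Gemini 3.1 Pro rates from the lineup table (USD per million tokens).
# Each pair is (<=200K context, >200K context).
INPUT_RATES = (2.00, 4.00)
OUTPUT_RATES = (12.00, 18.00)
TIER_THRESHOLD = 200_000  # tokens


def gemini_31_pro_cost(input_tokens, output_tokens, batch=False):
    """Estimate the USD cost of one request.

    Assumption: the pricing tier is chosen by input token count and
    applies to the whole request. Batch mode halves the total.
    """
    tier = 0 if input_tokens <= TIER_THRESHOLD else 1
    cost = (
        input_tokens * INPUT_RATES[tier]
        + output_tokens * OUTPUT_RATES[tier]
    ) / 1_000_000
    return cost * 0.5 if batch else cost
```

Under this reading, crossing the 200K threshold does more than add tokens: a 300K-token prompt with a 5K-token reply costs about $1.29, versus about $0.26 for a 100K-token prompt with the same reply.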

Key Benchmarks (Gemini 3.1 Pro)

Benchmark | Score
SWE-Bench Verified | 80.6%
GPQA Diamond | 94.3% (highest at launch)
ARC-AGI-2 | 77.1%
Humanity’s Last Exam | 44.4% (text + multimodal, no tools)
MRCR v2 (128K long-context) | 84.9%
MMMU-Pro | 80.5%
MMMLU | 92.6%

Specialised Models

Google maintains the broadest multimodal portfolio: Veo 3.1 (video generation up to 4K), Lyria 3 (music generation), Imagen 4 (image generation, being replaced by Gemini-native generation), Gemini Computer Use Preview, Gemini Robotics-ER 1.6, real-time audio dialogue, real-time translation, and Deep Research.

Deployment Model and Sovereign Position

Google offers the most mature sovereign deployment story among the three major closed-model providers:
  • Gemini Developer API and Vertex AI: Standard cloud access with enterprise SLAs.
  • Google Distributed Cloud (GDC): Full on-premises AI deployment with managed infrastructure. Gemini models and Gemma open models available. Confidential external key management for regulated organisations. A new “sovereign agentic AI architecture” announced at Cloud Next 2026 ensures agentic workflows execute entirely within customer organisation boundaries.
  • Forrester recognition: Named a Leader in The Forrester Wave Sovereign Cloud Platforms, Q2 2026.
No reports of Google models being export-restricted as of June 2026, but the Anthropic precedent means all frontier providers face potential future restrictions.

Open Models: Gemma 4

Variant | Parameters | Context | Architecture | VRAM (Q4)
E2B | ~2B effective | 128K | Dense multimodal | ~1.5 GB
E4B | ~4B effective | 128K | Dense multimodal | ~5 GB
12B | 12B | 128K+ | Dense (encoder-free) | ~8 GB
26B-A4B | 26B total / 4B active | 256K | MoE | ~18 GB
31B | 31B dense | 256K | Dense | ~20 GB
Gemma 4 is licensed under Apache 2.0, which makes it Google’s first truly open-source model family. Gemma 4 31B posts 89.2% AIME 2026 and 80.0% LiveCodeBench, competitive with some closed frontier models. The E2B model runs on phones.

Financials

Google guided $175-185 billion in 2026 capex, majority AI-related. Gemini 1.0 Ultra training cost approximately $191 million (Stanford AI Index / Epoch AI). Gemini 3.x training costs are not disclosed but estimated in the several-hundred-million-dollar range per model. Google’s use of custom TPUs significantly reduces marginal compute costs compared to competitors renting NVIDIA GPUs.

Open-Source / Open-Weight Models

The open-weight ecosystem is where the sovereign AI opportunity lives. Multiple model families now offer frontier-class performance under permissive licenses, deployable on sovereign infrastructure without any foreign provider dependency.

Meta Llama 4

Model | Active Params | Total Params | Experts | Context
Scout | 17B | 109B | 16 | 10M tokens
Maverick | 17B | 400B | 128 | 512K tokens
Behemoth | 288B | ~2T | 16 | Not released
Architecture: MoE with alternating dense and MoE layers. Maverick uses 128 routed experts plus one shared expert per MoE layer; each token activates the shared expert plus one routed expert.
License: Custom (Llama 4 Community License Agreement). This is not Apache 2.0 or MIT. The EU is explicitly excluded: rights do not extend to individuals domiciled in, or companies with principal place of business in, the European Union. Companies with >700 million MAU require a separate license. Government agencies must request exceptions case-by-case. This makes Llama 4 unsuitable for sovereign deployment in any EU member state and introduces legal uncertainty elsewhere.
Hardware: Scout fits ~61 GB VRAM at Q4_K_M (single H100). Maverick needs ~224 GB (4x H100). Behemoth was never publicly released.
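
The VRAM figures quoted for Scout and Maverick can be approximated with simple arithmetic. The sketch below assumes an average of about 4.5 bits per weight for Q4_K_M-style quantisation (an assumption; the real average varies by model and quantisation recipe) and counts only the weights, so KV cache and activation memory come on top. Total parameters are used rather than active parameters, because an MoE model must keep every expert in memory.

```python
def weight_memory_gb(total_params_billions, bits_per_weight=4.5):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed average for Q4_K_M-style
    quantisation, not a figure taken from the source.
    """
    return total_params_billions * bits_per_weight / 8


# Total parameter counts (billions) from the Llama 4 table above.
for name, params in [("Scout", 109), ("Maverick", 400)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights")
```

With these assumptions the estimates land within a gigabyte or so of the ~61 GB and ~224 GB figures above, which suggests those figures describe weights alone rather than a full serving footprint.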

Mistral AI

Model | Total Params | Active Params | Architecture | License
Mistral Large 3 | 675B | 41B | MoE | Apache 2.0
Mistral Small 4 | 119B | 6B | MoE (128 experts, 4 active) | Apache 2.0
Ministral 3 (3B/8B/14B) | 3-14B | All (dense) | Dense | Apache 2.0
All models are released under Apache 2.0, with no MAU thresholds and no geographic exclusions. Mistral is the de facto European sovereign AI champion: framework agreement with the French Ministry of Armed Forces (2026-2030), EUR 2.1 billion in state investment, data centres in France with thousands of H100 GPUs, partnership with SAP and French/German governments for sovereign public administration AI.
Key benchmarks: Large 3 posts 73.11% MMLU-Pro and 93.60% MATH-500. Ministral 3 14B reasoning variant achieves 85% on AIME 2025.

Qwen (Alibaba)

Model | Total Params | Active Params | Architecture | Context | License
Qwen3-235B-A22B | 235B | 22B | MoE |  | Apache 2.0
Qwen3-Coder 480B | 480B | 35B | MoE |  | Apache 2.0
Qwen 3.5-397B-A17B | 397B | 17B | MoE (GDN hybrid) | 262K (ext. 1M+) | Apache 2.0
Qwen 3.6-35B-A3B | 35B | 3B | MoE | 1M native | Apache 2.0
Qwen 3.6-27B | 27B | 27B | Dense + vision | 1M | Apache 2.0
Smaller models | 0.6B-9B |  | Dense |  | Apache 2.0
Qwen offers the broadest size range (0.6B to 480B) under Apache 2.0. Qwen 3.5 introduced Gated Delta Networks (GDN) fused with sparse MoE, which is 8.6x faster than Qwen3-Max at 32K context and 19x faster at 256K. 201-language support.
Key benchmarks: Qwen3-235B-A22B posts 95.6 ArenaHard and 85.7 AIME’24. Qwen 3.6-35B-A3B runs on a single consumer GPU (~21 GB at Q4_K_M, 30 tok/s) and beat Gemma 4 26B-A4B on coding benchmarks by 21 points.
Geopolitical note: Chinese origin. Self-hosted deployments involve no data flowing to China and weights are openly inspectable. The Qwen Chat web interface applies Chinese content restrictions, but this is a platform restriction, not a license restriction on the downloadable weights. For nations without anti-China procurement policies, Qwen is arguably the most versatile open model family available.

DeepSeek

Model | Total Params | Active Params | Architecture | Context | License
DeepSeek-V4-Pro | 1.6T | 49B | MoE + MLA + CSA/HCA | 1M | MIT
DeepSeek-V4-Flash | 284B | 13B | MoE + MLA + CSA/HCA | 1M | MIT
DeepSeek-R1 | 671B | 37B | MoE + MLA + RL | 128K | MIT
DeepSeek-V3 | 671B | 37B | MoE + MLA | 128K | MIT
DeepSeek releases its models under the MIT license, one of the most permissive available: its only condition is that the copyright and license notice is retained. DeepSeek-V3’s training cost of approximately $5.6 million (2,048 H800 GPUs) sent shockwaves through the industry in January 2025, demonstrating frontier-class models could be trained at 10-20x lower cost than assumed.
Key benchmarks (V4-Pro): SWE-bench Verified 80.6%, LiveCodeBench Pass@1 93.5% (highest among all models evaluated), Codeforces rating 3206, MMLU-Pro 87.5%.
Geopolitical note: Chinese origin. The company is subject to PRC laws requiring cooperation with intelligence agencies. DeepSeek reportedly used tens of thousands of NVIDIA chips restricted from export to China. Self-hosted weights are safe: MIT license, no data flows to China, fully inspectable. The hosted API service applies Chinese content regulations and stores data under PRC law.

Other Notable Open Models

Model | Parameters | Architecture | License | Standout Feature
Gemma 4 (Google) | 2B-31B | Dense/MoE | Apache 2.0 | Runs on phones (E2B), 89.2% AIME 2026 (31B)
Phi-4 (Microsoft) | 3.8B-15B | Dense | MIT | 93.7% GSM8K at 14B, surpasses many 70B models on maths
Falcon H1 (TII, Abu Dhabi) | 3B-34B | Hybrid Mamba-Transformer | Apache 2.0-based | Best Arabic LLM, 4x input throughput
gpt-oss (OpenAI) | 20B/120B |  | Apache 2.0 | Reasoning-focused, safety classification
Cohere Command R+ | 104B |  | CC-BY-NC | RAG-optimised with grounding citations

Comparison Matrix

Closed / API-Only Models

Model | Architecture | Params (Total/Active) | Context | SWE-Bench Verified | GPQA Diamond | Input $/MTok | Output $/MTok | On-Prem | Export Risk | Fine-Tunable
Fable 5 | MoE (unconfirmed) | Undisclosed | 1M | 95.0%* |  | $10.00 | $50.00 | No | SUSPENDED | No
Opus 4.8 | Undisclosed | Undisclosed | 1M | 88.6% | 93.6% | $5.00 | $25.00 | No | Medium | No
Sonnet 4.6 | Undisclosed | Undisclosed | 1M | 72.7-79.6% | ~83% | $3.00 | $15.00 | No | Medium | No
Haiku 4.5 | Undisclosed | Undisclosed | 200K | 73.3% |  | $1.00 | $5.00 | No | Medium | No
GPT-5.5 | MoE | Est. 2-5T active | 1.05M | 88.7% | 93.6% | $5.00 | $30.00 | Via Azure Local | Medium | No
GPT-5.4 | MoE | Undisclosed | 1M | 74.9% | 92.8% | $2.50 | $15.00 | Via Azure Local | Medium | No
GPT-5.4 mini | MoE | Undisclosed | 400K |  |  | $0.75 | $4.50 | Via Azure Local | Medium | No
GPT-5.4 nano | MoE | Undisclosed |  |  |  | $0.20 | $1.25 | Via Azure Local | Low | No
o3 | Reasoning chain | Undisclosed | 200K |  |  | $2.00 | $8.00 | No | Medium | No
o3-pro | Reasoning chain | Undisclosed | 200K |  | 86% | $20.00 | $80.00 | No | Medium | No
o4-mini | Reasoning chain | Undisclosed | 200K | 68.1% |  | $1.10 | $4.40 | No | Medium | No
Gemini 3.5 Flash | Sparse MoE | Undisclosed | 1M |  |  | $1.50 | $9.00 | Via GDC | Low | No
Gemini 3.1 Pro | Sparse MoE + DeepThink | Undisclosed | 1M | 80.6% | 94.3% | $2.00/$4.00 | $12.00/$18.00 | Via GDC | Low | No
Gemini 2.5 Pro | Sparse MoE | ~200B total | 1M | 63.8% |  | $1.25/$2.50 | $10.00/$15.00 | Via GDC | Low | No
Gemini 2.5 Flash-Lite | Sparse MoE | Undisclosed | 1M |  |  | $0.10 | $0.40 | Via GDC | Low | No
*Vendor-scaffold score; independent evaluations typically run 10-30 points lower.

Open-Weight Models

Model | Architecture | Params (Total/Active) | Context | SWE-Bench Verified | License | On-Prem | Export Risk | Fine-Tunable | EU Deployable | Min VRAM (Q4)
Llama 4 Scout | MoE | 109B / 17B | 10M |  | Custom | Yes | None (published) | Yes | NO | ~61 GB
Llama 4 Maverick | MoE | 400B / 17B | 512K |  | Custom | Yes | None (published) | Yes | NO | ~224 GB
Mistral Large 3 | MoE | 675B / 41B | 256K |  | Apache 2.0 | Yes | None | Yes | Yes | Multi-GPU
Mistral Small 4 | MoE | 119B / 6B |  |  | Apache 2.0 | Yes | None | Yes | Yes | ~25 GB
Ministral 3 14B | Dense | 14B / 14B |  |  | Apache 2.0 | Yes | None | Yes | Yes | ~10 GB
Qwen3-235B-A22B | MoE | 235B / 22B |  |  | Apache 2.0 | Yes | None | Yes | Yes | Multi-GPU
Qwen 3.6-35B-A3B | MoE | 35B / 3B | 1M |  | Apache 2.0 | Yes | None | Yes | Yes | ~21 GB
Qwen 3.6-27B | Dense | 27B / 27B | 1M |  | Apache 2.0 | Yes | None | Yes | Yes | ~20 GB
DeepSeek V4-Pro | MoE + MLA | 1.6T / 49B | 1M | 80.6% | MIT | Yes | None (published) | Yes | Yes | ~1 TB+
DeepSeek V4-Flash | MoE + MLA | 284B / 13B | 1M |  | MIT | Yes | None (published) | Yes | Yes | ~80 GB (FP8)
Gemma 4 31B | Dense | 31B / 31B | 256K |  | Apache 2.0 | Yes | None | Yes | Yes | ~20 GB
Gemma 4 12B | Dense | 12B / 12B | 128K |  | Apache 2.0 | Yes | None | Yes | Yes | ~8 GB
Gemma 4 E4B | Dense | ~4.5B | 128K |  | Apache 2.0 | Yes | None | Yes | Yes | ~5 GB
Gemma 4 E2B | Dense | ~2.3B | 128K |  | Apache 2.0 | Yes | None | Yes | Yes | ~1.5 GB
Phi-4 14B | Dense | 14B / 14B | 128K |  | MIT | Yes | None | Yes | Yes | ~10 GB
Phi-4-mini | Dense | 3.8B / 3.8B | 128K |  | MIT | Yes | None | Yes | Yes | ~3 GB
gpt-oss-120b |  | 120B |  |  | Apache 2.0 | Yes | None | Yes | Yes | Multi-GPU
gpt-oss-20b |  | 20B |  |  | Apache 2.0 | Yes | None | Yes | Yes | ~15 GB
Falcon H1 34B | Hybrid Mamba-Transformer | 34B / 34B |  |  | Apache 2.0-based | Yes | None | Yes | Yes | ~22 GB

Architectural Approaches

The AI model landscape has diversified beyond “make the model bigger.” Four distinct architectural philosophies now compete, each with different implications for cost, capability, and sovereign deployability.

Monolithic Scaling

The original paradigm: train a single dense transformer with as many parameters as possible. GPT-4 (2023) was the high-water mark. By mid-2026, no frontier lab still uses purely dense architectures for their largest models — the compute cost scales linearly with parameter count, making trillion-parameter dense models economically impractical. Dense architectures remain optimal at smaller scales (Gemma 4 12B, Phi-4, Ministral 3) where every parameter earns its keep.

Sparse Mixture-of-Experts (MoE)

Now the dominant frontier architecture. Used by Anthropic (likely), OpenAI (GPT-5.x), Google (Gemini), DeepSeek, Mistral, Meta (Llama 4), and Qwen. The key insight: a model with 1.6 trillion total parameters but only 49 billion active per inference gets the knowledge capacity of the larger model at the inference cost of the smaller one. DeepSeek V4-Pro exemplifies this: 1.6T total parameters, 49B active, posting frontier-class benchmarks at dramatically lower training and inference costs. The efficiency gain is roughly 1.6-2x over equivalent dense models (per Google’s Gemini 2.5 technical report).
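
The total-versus-active distinction is easy to quantify. The sketch below computes the share of parameters active per token for several MoE models, using the total and active counts from the tables in this document. The active fraction is only a rough proxy for per-token compute: it ignores attention cost, routing overhead, and the fact that all experts must still be held in memory.

```python
# (total, active) parameters in billions, taken from the tables in this document.
MODELS = {
    "DeepSeek V4-Pro": (1600, 49),
    "Mistral Large 3": (675, 41),
    "Llama 4 Maverick": (400, 17),
    "Qwen 3.6-35B-A3B": (35, 3),
}


def active_fraction(total_b, active_b):
    """Share of a model's parameters that participate in each token's forward pass."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

For DeepSeek V4-Pro this works out to roughly 3% of parameters active per token.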

Reasoning Chains (Inference-Time Compute)

Pioneered by OpenAI’s o-series and adopted by Google (DeepThink System 2) and others. Instead of making the model larger, make it think longer on hard problems. The model generates explicit reasoning steps before producing a final answer, trading latency for accuracy. o3-pro achieves 98% on AIME 2025 through extended reasoning. This approach is additive — reasoning chains work on top of MoE or dense architectures. The cost model shifts from “pay for parameters” to “pay for thinking time,” which is controllable per-query (Google offers Low/Medium/High tiers).

Multi-Model Orchestration

Rather than routing tokens to experts within a single model, route entire tasks to specialist models. Annie’s Hierarchical Mixture-of-Experts architecture is one approach: twelve specialist small language models (250M to 27B parameters) orchestrated through a messaging backbone, with classification, expert selection, judgment panels, and verification. This approach trades single-model coherence for composability, cost efficiency, and sovereign deployability (each specialist model can run on consumer hardware). The research supports this: ensembles of smaller models can outperform single large models with both higher accuracy and fewer total FLOPs, and the gap widens as models become large. The limitation is orchestration complexity and latency from multi-hop routing.
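
The task-level routing pattern described above (classification, expert selection, a judgment panel, then verification) can be sketched in a few lines. This is a toy illustration of the general pattern and not Annie's implementation: the keyword classifier, the majority-vote panel, and every name in the block are hypothetical stand-ins, and the messaging backbone is omitted entirely.

```python
from collections import Counter
from typing import Callable, Dict, List

Expert = Callable[[str], str]  # hypothetical: a specialist model wrapped as a function


def classify(task: str) -> str:
    """Toy keyword classifier standing in for a trained classifier model."""
    lowered = task.lower()
    if any(word in lowered for word in ("def ", "bug", "compile")):
        return "code"
    if any(ch.isdigit() for ch in task):
        return "maths"
    return "general"


def orchestrate(
    task: str,
    experts: Dict[str, List[Expert]],
    verify: Callable[[str, str], bool],
) -> str:
    """Route a task to a panel of specialists and return the verified majority answer."""
    domain = classify(task)
    # Expert selection: fall back to the general panel for unknown domains.
    panel = experts.get(domain) or experts["general"]
    answers = [expert(task) for expert in panel]
    # Judgment panel: simple majority vote over the specialists' answers.
    best, _votes = Counter(answers).most_common(1)[0]
    # Verification: an independent check before the answer is released.
    if not verify(task, best):
        raise RuntimeError("verification failed; escalate to a larger model")
    return best
```

Each stage here is a separate call, which is where the multi-hop latency mentioned above comes from; a production system would also need timeouts and a defined escalation path.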

Architectural Comparison

Cost-Capability Tradeoffs

Approach | Training Cost | Inference Cost | Peak Capability | Sovereign Deployability
Monolithic Dense (small) | $2K-$500K | Lowest per token | Limited by parameter count | Excellent (laptop to single GPU)
Sparse MoE (large) | $5M-$500M+ | Low per token (only active params) | Highest (frontier) | Poor to moderate (datacenter)
Reasoning Chains | Adds to base model cost | Variable (controllable) | Highest on hard problems | Same as base model
Multi-Model Ensemble | Sum of specialists ($10K-$2M) | Moderate (multiple models) | Approaches frontier on defined tasks | Excellent (consumer hardware per model)

The Sovereign AI Imperative

What Sovereignty Means in Practice

Sovereign AI is a nation’s or organisation’s ability to develop, deploy, and control AI using its own infrastructure, data, talent, and governance frameworks without critical dependencies on foreign providers. It spans four dimensions:
  1. Data sovereignty: Data collected, stored, and processed according to local laws without unauthorised foreign access.
  2. Model sovereignty: Ownership of model weights, training capability, and inference control.
  3. Compute sovereignty: Infrastructure under national jurisdiction on domestic soil.
  4. Interaction sovereignty: Prompts, queries, and outputs remain within sovereign boundaries.
Before June 12, 2026, most organisations treated sovereignty as a compliance checkbox. After June 12, it is an operational resilience requirement.

The Deployment Spectrum

Most practical sovereign AI strategies target Level 3-4 on critical dimensions while accepting Level 2 on others. Targeting Level 5 across all dimensions is economically prohibitive — only the US, China, and possibly the EU as a bloc can sustain it.

Hardware Requirements by Sovereignty Tier

Tier | What You Need | Hardware | Approx. Cost
L1: API consumer | Internet connection | None | $0.10-$50/MTok (recurring)
L2: Sovereign cloud tenant | Contract with sovereign cloud provider | Provider-managed | $50K-$500K/year
L3: Self-hosted open models | Open-weight models on own infrastructure | 1-8 GPUs per model | $5K-$200K hardware + $105K-$210K/year electricity (AU rates, per rack)
L3+: Fine-tuned specialists | Domain adaptation of open models | Same as L3 + training compute | Additional $2K-$500K per model for fine-tuning
L4: Domestic foundation model | Train from scratch, 1B-7B parameters | 8x RTX 4090 to 64x A100 | $2K-$500K per model
L4+: Sovereign language model | National-language foundation model | H100 cluster | $8M-$32M (Brazil/Mexico research)
L5: Frontier-competitive | Full-scale foundation model training | Thousands of GPUs, dedicated power | $100M-$1B+ per model

Cost Comparison: API vs Self-Hosted

The break-even depends entirely on utilisation. Bursty, low-volume usage favours APIs. Constant high-throughput workloads favour self-hosting.
Scenario | API Cost | Self-Hosted Cost | Winner
Light usage (1M tokens/day) | $3-$50/day | $10-$50/day (amortised hardware + power) | API
Medium usage (100M tokens/day) | $300-$5,000/day | $50-$200/day | Self-hosted
Heavy usage (1B+ tokens/day) | $3,000-$50,000/day | $200-$500/day | Self-hosted by 10-100x
Frontier capability required | Only option for some tasks | Open models lag on hardest 5-10% of tasks | API (for now)
Frontier API costs are increasing: GPT-5.5 costs over 3x what GPT-5 cost 8 months ago; Gemini 3.5 Flash tripled pricing versus its predecessor.
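
The utilisation break-even in the table can be estimated directly. The sketch below assumes the self-hosted cost is a fixed daily amount (amortised hardware plus power) that does not grow with volume, and that the API is billed at a single blended price per million tokens. Both are simplifications, and the function name is illustrative.

```python
def break_even_tokens_per_day(api_price_per_mtok, self_hosted_cost_per_day):
    """Daily token volume at which API spend equals a fixed self-hosted daily cost.

    Assumptions: self-hosting costs the same regardless of volume, and
    API usage is billed at one blended USD price per million tokens.
    """
    return self_hosted_cost_per_day / api_price_per_mtok * 1_000_000


# Example with figures from the table's ranges: $3/MTok API, $50/day self-hosted.
volume = break_even_tokens_per_day(3.0, 50.0)
print(f"Break-even at about {volume / 1e6:.1f}M tokens/day")
```

With a $3/MTok API and a $50/day self-hosted setup, the break-even falls at roughly 17M tokens/day, between the light and medium scenarios in the table. A pricier API or cheaper hardware moves it lower.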

The Australian Context

Government policy: The National AI Plan (March 2026) confirmed reliance on existing laws and sector regulators rather than a standalone AI Act. Defence released binding governance for AI use across the ADF. New DTA Cloud Policy (effective July 1, 2026) mandates APS entities prioritise cloud computing. The AI Safety Institute is operational with AUD $29.9 million in funding.
Defence spending: $1.2 billion in the 2025-26 budget for sovereign capability development in AI and autonomous systems. Defence Innovation Hub has funded 80+ AI-related projects. ASD-AWS “Top Secret Cloud” partnership worth approximately AUD $2 billion over a decade.
Data centre infrastructure: Three main sovereign providers are building AI-capable facilities:
  • CDC Data Centres: 200MW AI campus near Perth (AUD $415M first stage, operational 2026).
  • Macquarie Data Centres: IC3 Super West, 47MW AI data centre in Sydney (AUD $350M, opening September 2026). Partnering with Dell for Sovereign AI Factories powered by NVIDIA.
  • NEXTDC: S7 site at Eastern Creek, Sydney (650MW capacity, partnered with OpenAI, Phase 1 expected H2 2027).
Regulatory landscape: ASIC requires AI in financial services to align with responsible lending and market integrity obligations. TGA released guidance on AI-based software as a medical device. New privacy obligations effective December 2026 require disclosure of automated decision-making.
The gap: Australia has sovereign compute infrastructure under construction and defence funding in place, but lacks a domestic foundation model programme. The Fable 5 suspension demonstrated that Five Eyes membership provides no exemption from US export controls. Australia’s current position is Level 1-2 for frontier AI (API-dependent on US providers) with infrastructure being built for Level 2-3.

Implications for Annie

This section identifies what the landscape means for Annie’s positioning. The full competitive analysis is in doc 03.

Where the Gaps Are

  1. The 80% problem: For 80% of production use cases, a well-tuned specialist model works as well as a frontier model and costs 95% less. But the tooling, orchestration, and confidence to run multi-model systems do not exist as a product. Every organisation doing this today is building it from scratch.
  2. The sovereignty gap is operational, not theoretical: Before June 12, sovereign AI was a compliance discussion. Now it is about whether your AI infrastructure survives a single government directive. There is no product that packages sovereign AI deployment as a turnkey solution with the user experience of a frontier API.
  3. The ensemble evidence is strong but unexploited: Research consistently shows that ensembles of smaller models can outperform single large models, achieving higher accuracy with fewer total FLOPs. No commercial product operationalises this finding.
  4. Fine-tuning at the bottom, frontier at the top, nothing in between: You can fine-tune a 7B model for under $5 or pay $50/MTok for Fable 5. There is no product that intelligently routes between a portfolio of specialists and frontier fallbacks based on task complexity.

What the Export Controls Create as Opportunity

The Fable 5 suspension created three market conditions that did not exist two weeks ago:
  1. Enterprise demand for multi-model resilience: 81% of enterprises now run three or more AI model families (up from 13% a year ago), and every procurement conversation now includes “what happens if we lose access.” A system architecturally designed for multi-model orchestration is no longer a nice-to-have.
  2. Government demand for sovereign AI that actually works: More than 60 nations have published AI strategies, over 30 have committed funding, and the sovereign AI infrastructure market is projected to reach $301.6 billion by 2040. But most sovereign AI initiatives are infrastructure plays (data centres, GPU clusters) without the model-layer product to run on them.
  3. The open-weight window: Published open-weight models are currently exempt from US export controls under ECCN 4E091. This regulatory posture could change. Sovereign entities should be downloading and fine-tuning open models now. A product that makes this easy has a time-limited but significant advantage.
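The "what happens if we lose access" question in the first condition above can be sketched as ordered failover across model families. This is a minimal illustration, not a description of any existing product: the `ProviderUnavailable` exception, the backend names, and the callables are all hypothetical placeholders for real API clients or locally hosted open-weight models.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error a backend raises when access is suspended or a call fails."""


# A backend is a (name, callable) pair; the callable takes a prompt and returns text.
Backend = Tuple[str, Callable[[str], str]]


def complete_with_failover(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in priority order and return (backend_name, completion).

    Raises RuntimeError only when every backend in the list is unavailable.
    """
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and fall through to the next model family.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends unavailable: " + "; ".join(errors))
```

Placing a self-hosted open-weight model last in the list means a suspended frontier API degrades quality rather than availability, which is the resilience property procurement teams are asking about.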

Why Small Specialist Models Matter Now

  • Serving a 7B specialist is 10-30x cheaper than running a 70B-175B general model.
  • Training a 1B specialist costs $2K-$15K. Training a 7B specialist costs $50K-$500K. Fine-tuning a 7B model for a specific domain costs under $5.
  • Small models (250M to 27B) run on hardware ranging from phones to single consumer GPUs. No datacenter required.
  • India’s Bhashini programme demonstrates the sovereign small-model strategy at national scale: purpose-built language models serving 140 million users across 22 languages on domestic sovereign infrastructure.
  • Research shows performance gains decrease exponentially beyond certain parameter thresholds, making smaller models more cost-effective for most defined tasks.
The limitation is real: small models lag significantly on complex tasks requiring deeper reasoning or nuanced understanding. They match large models in specific, well-defined scenarios but not in general-purpose reasoning. This is precisely where intelligent orchestration closes the gap, by routing easy tasks to cheap specialists and hard tasks to capable models.
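The routing idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Task` type, the domain labels, the `complexity` score (assumed to come from some upstream estimator), and the 0.7 threshold are all hypothetical, and a production router would need a far more careful complexity signal.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A model is any callable that maps a prompt to a completion.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Task:
    domain: str        # hypothetical label, e.g. "legal" or "code"
    prompt: str
    complexity: float  # assumed score in [0.0, 1.0] from an upstream estimator


def make_router(
    specialists: Dict[str, Model],
    fallback: Model,
    threshold: float = 0.7,  # assumed cut-off; would be tuned per deployment
) -> Callable[[Task], str]:
    """Build a router that prefers a cheap domain specialist for easy tasks."""

    def route(task: Task) -> str:
        specialist = specialists.get(task.domain)
        # Use the specialist only when one exists for the domain AND the task is easy.
        if specialist is not None and task.complexity < threshold:
            return specialist(task.prompt)
        # Hard tasks and uncovered domains go to the more capable model.
        return fallback(task.prompt)

    return route
```

The threshold is the main design lever: lowering it sends more traffic to the capable fallback and raises cost, while raising it leans harder on the specialists and risks quality on borderline tasks.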

The Cost and Accessibility Advantage

The frontier labs are spending staggering amounts: Anthropic projects $29 billion in 2026 losses, OpenAI projects $14 billion, Google guided $175-185 billion in capex. These economics require massive scale to justify and produce products priced accordingly (Fable 5 at $50/MTok output, GPT-5.5 Pro at $180/MTok output).

A system built from twelve specialist models in the 250M-27B range, each fine-tuned for its domain, running on hardware costing $5K-$50K total, with intelligent routing to minimise frontier API fallback, could deliver comparable task performance at 1-2 orders of magnitude lower cost. The total training cost for the specialist portfolio would be a rounding error in a frontier lab’s monthly electricity bill.

This is not a hypothetical. The models exist (Gemma 4, Qwen 3.6, Phi-4, Mistral Small 4). The hardware exists (consumer GPUs). The research supports ensemble approaches. What does not yet exist is the product that makes it work reliably and is simple enough for organisations to adopt.
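The "1-2 orders of magnitude" claim can be sanity-checked with a blended-cost calculation. The sketch below uses the $50/MTok Fable 5 output price quoted above; the specialist serving cost and the share of traffic that falls back to the frontier are illustrative assumptions, not figures from this document.

```python
def blended_cost_per_mtok(
    frontier_cost: float,
    specialist_cost: float,
    frontier_fraction: float,
) -> float:
    """Average cost per million tokens when only part of the traffic hits the frontier model."""
    if not 0.0 <= frontier_fraction <= 1.0:
        raise ValueError("frontier_fraction must be between 0 and 1")
    return frontier_fraction * frontier_cost + (1.0 - frontier_fraction) * specialist_cost


FRONTIER_COST = 50.0      # Fable 5 output price in $/MTok, as quoted in the text
SPECIALIST_COST = 0.50    # ASSUMPTION: illustrative self-hosted specialist cost in $/MTok
FRONTIER_FRACTION = 0.05  # ASSUMPTION: 5% of tokens fall back to the frontier model

blended = blended_cost_per_mtok(FRONTIER_COST, SPECIALIST_COST, FRONTIER_FRACTION)
savings_factor = FRONTIER_COST / blended

print(f"blended cost: ${blended:.3f}/MTok, {savings_factor:.1f}x below frontier-only")
```

Under these assumed inputs the blended cost comes to roughly $3 per MTok, about 17x below sending everything to the frontier model. The result is dominated by the fallback fraction: once the specialists are cheap, almost all remaining spend is the small share of traffic that still reaches the frontier API.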


Document prepared June 22, 2026 by Annie. The AI model landscape is evolving rapidly. Benchmark figures, pricing, and availability are subject to change. Where parameter counts or architecture details are not officially confirmed, this is noted explicitly.