<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI IQ]]></title><description><![CDATA[A weekly brief on which AI models are actually worth using, based on benchmark performance, cost, speed, and capability tradeoffs.]]></description><link>https://newsletter.aiiq.org</link><image><url>https://substackcdn.com/image/fetch/$s_!V3Dd!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4946728e-afc8-4906-9668-cf7fbc255898_640x640.png</url><title>AI IQ</title><link>https://newsletter.aiiq.org</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 17:48:07 GMT</lastBuildDate><atom:link href="https://newsletter.aiiq.org/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ryan Shea]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiiqbrief@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiiqbrief@substack.com]]></itunes:email><itunes:name><![CDATA[Ryan Shea]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ryan Shea]]></itunes:author><googleplay:owner><![CDATA[aiiqbrief@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiiqbrief@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ryan Shea]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Best AI Models of April 2026]]></title><description><![CDATA[GPT-5.5 vs Opus 4.7 &#8212; and the month the Tier 2 field exploded]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-april-2026-deb</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-april-2026-deb</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Mon, 04 May 2026 20:58:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!32L4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32L4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32L4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!32L4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!32L4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!32L4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!32L4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" width="1456" height="971" alt=""></picture></div></a></figure></div><p>April 2026 was the most important month for AI model releases so far this year.</p><p>OpenAI released GPT-5.5. Anthropic released Claude Opus 4.7. Meta came back into the race with Muse Spark. xAI shipped Grok 4.3. And the Chinese frontier labs delivered a full wave of serious models: GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro.</p><p>The simple story is that GPT-5.5 and Opus 4.7 are now the two new co-champions of the frontier.</p><p>The more interesting story is that April was not one race. It was three races happening at once.</p><p>The first was the <strong>raw intelligence race</strong>, where GPT-5.5 moved into first place on AI IQ&#8217;s overall estimated IQ ranking.</p><p>The second was the <strong>agentic reliability race</strong>, where Opus 4.7 made a strong case that the most useful AI model is not always the one with the highest abstract score, but the one you most trust to carry work through messy, long-running workflows.</p><p>The third was the <strong>cost-performance race</strong>, where Chinese labs continued to compress the gap between &#8220;frontier&#8221; and &#8220;cheap enough to use everywhere.&#8221;</p><p>The frontier did not commoditize in April. GPT-5.5, Opus 4.7, and Gemini 3.1 Pro are still meaningfully ahead of the rest.</p><p>But Tier 2 did commoditize.</p><p>That is the real story.</p><h3>April was the month agentic AI became the default benchmark</h3><p>A year ago, model launches were still mostly judged by chat quality, coding snippets, MMLU-style knowledge tests, and a handful of math benchmarks.</p><p>That is no longer enough.</p><p>The models released in April were overwhelmingly marketed around agents: coding agents, research agents, computer-use agents, tool-use agents, long-horizon agents, and multi-agent orchestration.</p><p>OpenAI described GPT-5.5 as a model for &#8220;real work,&#8221; emphasizing coding, online research, data analysis, document and spreadsheet creation, software operation, and multi-tool task completion. OpenAI also reported GPT-5.5 at 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, 51.7% on FrontierMath Tier 1&#8211;3, and 35.4% on FrontierMath Tier 4.</p><p>Anthropic&#8217;s Opus 4.7 launch told a similar story from a different angle: complex coding work, long-running tasks, instruction-following, vision improvements, and stronger self-verification. Anthropic made Opus 4.7 generally available on April 16, 2026, at the same API price as Opus 4.6: $5 per million input tokens and $25 per million output tokens.</p><p>The Chinese labs leaned even harder into long-horizon autonomy.
Z.ai described GLM-5.1 as a model designed for long-horizon tasks that can work independently for up to eight hours in a single run. Kimi K2.6 is positioned as an open-source multimodal agentic model for long-horizon coding, autonomous execution, coding-driven design, and swarm-based orchestration. DeepSeek-V4-Pro ships with 1M context and a 1.6T-parameter / 49B-active MoE architecture. Xiaomi&#8217;s MiMo-V2.5-Pro is also a 1M-context, open-sourced MoE model, built for complex software engineering and long-horizon tasks.</p><p>This is why the AI IQ framework matters. The old way to compare models was to ask, &#8220;Which model got the highest score on the benchmark?&#8221;</p><p>The better question now is: <strong>which model is best for the kind of cognition you actually need?</strong></p><p>AI IQ breaks that into four dimensions: Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. The composite IQ is the mean of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That compression is important. A model should not become &#8220;the smartest model&#8221; just because it crushes a saturated or contaminated benchmark. The frontier should be judged by hard, still-discriminating tests.</p><h3>The new Tier 1: GPT-5.5, Opus 4.7, and Gemini 3.1 Pro</h3><p>AI IQ&#8217;s updated ranking now has three Tier 1 models:</p><ol><li><p><strong>GPT-5.5</strong></p></li><li><p><strong>Claude Opus 4.7</strong></p></li><li><p><strong>Gemini 3.1 Pro</strong></p></li></ol><p>Google did not release a major new model in April, but Gemini 3.1 Pro remains in the top cluster. That is important. The April narrative is not &#8220;OpenAI and Anthropic left everyone behind.&#8221; It is more precise than that: OpenAI and Anthropic refreshed into the frontier, while Google&#8217;s previous frontier model still held its ground.</p><p>The top-level read:</p><p><strong>GPT-5.5 is the overall intelligence champion.</strong> It sits at the top of the AI IQ ranking and leads the April class on raw composite capability.</p><p><strong>Opus 4.7 is the emotional intelligence and agentic trust champion.</strong> It does not beat GPT-5.5 on overall IQ, but it leads on EQ and remains one of the most compelling models for long-running professional work.</p><p><strong>Gemini 3.1 Pro is the holdover giant.</strong> It did not need an April launch to remain relevant. On the AI IQ charts, it continues to sit in the frontier cluster and is especially strong in programmatic reasoning.</p><p>The frontier is now less like a single leaderboard and more like a three-model draft board. For any serious workflow, the right answer is usually not &#8220;use the highest-ranked model.&#8221; It is:</p><p>Use GPT-5.5 when you want the strongest general reasoner.</p><p>Use Opus 4.7 when you want the best blend of intelligence, EQ, instruction-following, and long-horizon reliability.</p><p>Use Gemini 3.1 Pro when you care about Google-stack integration, coding strength, multimodal workflows, or cost-performance tradeoffs inside that ecosystem.</p><h3>GPT-5.5: the new raw intelligence leader</h3><p>GPT-5.5 is the most important release of April.</p><p>On AI IQ, it takes the top overall spot. It also leads the April releases in Abstract Reasoning, Mathematical Reasoning, and Academic Reasoning. That is the core of its story: GPT-5.5 is not merely a coding model, a chat model, or a tool-use model. 
It is the broadest new general-purpose intelligence system in the April wave.</p><p>OpenAI&#8217;s own launch framing matches the AI IQ read. GPT-5.5 is presented as a model that can plan, use tools, check its work, navigate ambiguity, and keep going across multi-part tasks. OpenAI also emphasizes that GPT-5.5 matches GPT-5.4 per-token latency while performing at a higher level and using fewer tokens on Codex tasks.</p><p>But AI IQ adds an important correction: GPT-5.5 is not the obvious cost-performance winner.</p><p>The model is expensive. OpenAI lists future API pricing for <code>gpt-5.5</code> at $5 per million input tokens and $30 per million output tokens, with GPT-5.5 Pro priced far higher at $30 input and $180 output.</p><p>So the practical recommendation is not &#8220;use GPT-5.5 for everything.&#8221;</p><p>It is: use GPT-5.5 when the marginal cost of being wrong is high.</p><p>That means difficult research, complex analysis, high-stakes coding tasks, scientific reasoning, mathematical reasoning, architecture decisions, and work where the model&#8217;s ability to notice hidden structure matters more than token price.</p><p>GPT-5.5 is the model you reach for when you are buying judgment.</p><h3>Opus 4.7: the model people may actually trust more</h3><p>If GPT-5.5 is the IQ winner, Opus 4.7 is the &#8220;would I hand this work to it?&#8221; winner.</p><p>On AI IQ&#8217;s EQ ranking, Anthropic continues to dominate. Opus 4.7 sits at the top of the EQ chart, ahead of GPT-5.5 and the rest of the April field. That matters more than people think.</p><p>EQ is not just &#8220;being nice.&#8221; In professional AI use, it often shows up as calibration, tone, judgment, refusal discipline, conversational tact, knowing when the user is confused, knowing when to push back, and knowing when a task needs clarification versus execution.</p><p>AI IQ estimates EQ from EQ-Bench 3 and Arena Elo, then maps those signals to the same normalized style of scale used for IQ. Anthropic models receive a 200-point EQ-Bench adjustment because EQ-Bench is Claude-judged, which makes Opus 4.7&#8217;s top placement more notable rather than less.</p><p>Anthropic&#8217;s launch post reads like a long list of enterprise users saying the same thing in different words: Opus 4.7 is better at the parts of work that happen after the first answer. It catches mistakes, follows instructions more literally, handles long-running workflows, uses tools more reliably, and produces stronger professional artifacts.</p><p>That is the real distinction between GPT-5.5 and Opus 4.7.</p><p>GPT-5.5 looks like the smartest model.</p><p>Opus 4.7 often looks like the better coworker.</p><p>For teams building agents, that distinction is not cosmetic. The best agent is not always the one with the highest peak score. It is the one that does not quietly drift, loop, overcomplicate, ignore constraints, or hand back a plausible but unverified answer after 40 minutes of tool calls.</p><p>Opus 4.7&#8217;s strongest case is not that it beats GPT-5.5 everywhere. 
It does not.</p><p>Its strongest case is that the frontier is now close enough that trust, EQ, and workflow stability become first-order selection criteria.</p><h3>The surprise: Tier 2 is now crowded with genuinely useful models</h3><p>The April Tier 2 list is where the market changed most.</p><p>AI IQ&#8217;s updated Tier 2 includes:</p><ol><li><p><strong>Grok 4.3</strong></p></li><li><p><strong>Kimi K2.6</strong></p></li><li><p><strong>DeepSeek-V4-Pro</strong></p></li><li><p><strong>Muse Spark</strong></p></li><li><p><strong>Qwen3.6</strong></p></li><li><p><strong>MiMo-V2.5-Pro</strong></p></li><li><p><strong>GLM-5.1</strong></p></li></ol><p>That is a lot of serious models in one month.</p><p>None of these cleanly displaces GPT-5.5, Opus 4.7, or Gemini 3.1 Pro as the best overall model. But that is the wrong bar. The point is that Tier 2 is now good enough to matter for production routing.</p><p>A year ago, &#8220;use the best model&#8221; was often a reasonable default.</p><p>In May 2026, that is lazy.</p><p>If a task is cheap, repetitive, narrow, or tolerant of a small quality drop, you should probably not be sending it to the most expensive frontier model. The cost-performance charts on AI IQ make this especially clear because they do not just plot sticker price. They use <strong>effective cost</strong>: token cost multiplied by token usage efficiency. AI IQ anchors token cost to a 2M input / 1M output workload, then adjusts by how many tokens the model burns on the Artificial Analysis evaluation suite.</p><p>That framing changes the conversation. Some models look cheap but waste tokens. Others look expensive but are efficient enough that the real gap is smaller. And some models are simply cheap enough that they should be in every serious evaluation harness.</p><p>The best April examples are Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro.</p><p>Kimi K2.6 is the cleanest open-source story: native multimodal, agentic, long-horizon coding, coding-driven design, and swarm orchestration.</p><p>DeepSeek-V4-Pro is the scale-and-efficiency story: 1.6T total parameters, 49B active, 1M context, and a claim of performance rivaling top closed models.</p><p>MiMo-V2.5-Pro is the &#8220;local/open agentic frontier&#8221; story: 1.02T total parameters, 42B active, 1M context, open-sourced weights, and strong long-horizon coding demonstrations.</p><p>The frontier labs still own the top.</p><p>But the rest of the market now has options.</p><h3>Dimension-by-dimension winners</h3><p>The cleanest way to understand April is to stop asking &#8220;which model is best?&#8221; and ask &#8220;best at what?&#8221;</p><h4>Best overall IQ: GPT-5.5</h4><p>GPT-5.5 is the overall AI IQ leader. It is the model to beat on broad benchmark-derived intelligence.</p><p>Its advantage is not confined to one dimension. It is especially strong across abstract, mathematical, and academic reasoning, which makes it the most credible default when the task is hard to classify.</p><p>For professionals, this matters because many high-value tasks are mixed-domain. A research memo might require reading dense technical material, checking a mathematical argument, writing code to test an assumption, and then producing a clean executive summary. 
The more mixed the task, the more valuable broad IQ becomes.</p><h4>Best EQ: Opus 4.7</h4><p>Opus 4.7 leads the EQ ranking.</p><p>This makes it the best candidate for emotionally sensitive communication, high-context collaboration, writing with judgment, ambiguous stakeholder-facing work, coaching, editing, and situations where the model needs to push back without becoming annoying.</p><p>EQ also matters in agents. A low-EQ model can be technically correct but operationally painful: too verbose, too agreeable, too brittle, too literal in the wrong places, or too eager to satisfy a bad instruction.</p><p>Opus 4.7&#8217;s advantage here is one reason it will likely remain the default for many &#8220;AI coworker&#8221; workflows even when GPT-5.5 leads on raw IQ.</p><h4>Best abstract reasoning: GPT-5.5</h4><p>Abstract reasoning is the closest AI IQ dimension to raw fluid intelligence: the ability to solve novel problems without relying heavily on memorized knowledge. AI IQ uses ARC-AGI-2 and ARC-AGI-1 for this dimension, with ARC-AGI-2 treated as the harder, more frontier-discriminating test.</p><p>GPT-5.5 leads the April class here.</p><p>That is important because abstract reasoning is one of the hardest capabilities to fake. A model can memorize facts. It can overfit public code tasks. It can get better at common math formats. But novel abstraction is much harder to brute-force through training contamination.</p><p>If you want the model most likely to notice the hidden pattern in a new problem, GPT-5.5 is the current pick.</p><h4>Best mathematical reasoning: GPT-5.5, with GPT-5.3-Codex still looming</h4><p>Among April&#8217;s general-purpose model releases, GPT-5.5 is the math leader.</p><p>AI IQ&#8217;s math dimension uses FrontierMath Tier 4 and AIME, with AIME compressed because of contamination and saturation concerns. That is the right call. A perfect or near-perfect AIME score no longer tells us as much as it once did.</p><p>The interesting wrinkle is that GPT-5.3-Codex remains extremely strong on mathematical reasoning in the AI IQ charts. That suggests OpenAI&#8217;s coding-specialist line is not just a coding specialist. It may also be a very strong formal reasoning system.</p><p>For most users, GPT-5.5 is the best general math choice. But for code-adjacent math, theorem work, formalization, and technical problem-solving inside a development workflow, GPT-5.3-Codex still deserves attention.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro overall; GPT-5.5 among April releases</h4><p>The programmatic reasoning chart is one of the most interesting on AI IQ because it does not simply reward SWE-Bench.</p><p>AI IQ&#8217;s programmatic dimension combines Terminal-Bench 2.0, SWE-Bench Verified, and SciCode, with SWE-Bench compressed because of leakage and gameability concerns.</p><p>That makes the ranking more useful than a simple &#8220;who wins SWE-Bench?&#8221; scoreboard.</p><p>Gemini 3.1 Pro remains one of the strongest models overall on programmatic reasoning. Among the April releases, GPT-5.5 is the strongest broad programmatic reasoner, with Opus 4.7 also highly competitive.</p><p>Kimi K2.6 is the one to watch here. It does not win the chart, but its open-source, agentic-coding positioning makes it strategically important. The question is not whether Kimi K2.6 beats GPT-5.5 on every coding benchmark. It does not. 
The question is whether it is good enough, cheap enough, and controllable enough to become the default model for large volumes of coding-agent work.</p><p>That answer may be yes for many teams.</p><h4>Best academic reasoning: GPT-5.5</h4><p>GPT-5.5 also leads the April field on academic reasoning.</p><p>AI IQ&#8217;s academic dimension includes Humanity&#8217;s Last Exam, CritPt, and GPQA Diamond, with GPQA compressed due to contamination concerns.</p><p>This is where GPT-5.5&#8217;s breadth matters most. It is not just answering common questions better. It is performing well across expert-level, hard-to-game, high-breadth benchmarks.</p><p>DeepSeek-V4-Pro and Muse Spark are the Tier 2 standouts here. DeepSeek-V4-Pro&#8217;s knowledge and reasoning profile makes it one of the strongest Chinese models on academic-style tasks, while Muse Spark gives Meta a surprisingly credible return to high-end reasoning.</p><p>Meta&#8217;s launch emphasized Muse Spark as a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration, and reported strong results from its Contemplating mode on hard tasks like Humanity&#8217;s Last Exam and FrontierScience Research.</p><p>Muse Spark is not Tier 1 yet. But it is the first Meta model in a while that looks like it belongs in the serious frontier conversation.</p><h3>The cost-performance story: the smartest model is not always the right model</h3><p>AI IQ&#8217;s cost charts may be more practically important than the IQ chart.</p><p>The reason is simple: most real-world AI usage is not one heroic prompt. It is thousands or millions of calls across support workflows, coding agents, research loops, RAG systems, data extraction jobs, document workflows, and internal automations.</p><p>At that scale, the question changes from:</p><p>&#8220;Which model is best?&#8221;</p><p>to:</p><p>&#8220;Where does extra intelligence stop paying for itself?&#8221;</p><p>GPT-5.5 and Opus 4.7 justify their cost when the task is difficult, ambiguous, or high-value. But for many workflows, the April Tier 2 models are now strong enough to route into production.</p><p>A sensible model stack in May 2026 looks something like this:</p><p>Use GPT-5.5 for the hardest reasoning, research, math, architecture, and high-stakes synthesis.</p><p>Use Opus 4.7 for long-horizon agents, collaborative writing, sensitive communication, and workflows where reliability and tone matter.</p><p>Use Gemini 3.1 Pro where its programmatic strength, Google ecosystem, or multimodal tooling gives it an edge.</p><p>Use Kimi K2.6, DeepSeek-V4-Pro, MiMo-V2.5-Pro, Qwen3.6, GLM-5.1, or Grok 4.3 for cheaper routing, open-weight experimentation, local or sovereign deployment, and high-volume agentic workloads.</p><p>The best teams will not pick one model.</p><p>They will build routers.</p><h3>The biggest surprise: China&#8217;s labs are not just copying the frontier &#8212; they are specializing around deployment</h3><p>The April Chinese model wave was not random.</p><p>GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro all push toward the same broad theme: long context, agentic workflows, coding, tool use, and lower-cost deployment.</p><p>This is not just a benchmark race. 
It is a distribution strategy.</p><p>OpenAI and Anthropic are selling premium intelligence.</p><p>Chinese labs are increasingly selling frontier-adjacent capability that can be deployed more broadly, customized more deeply, and routed more aggressively.</p><p>The gap at the very top remains real. GPT-5.5 and Opus 4.7 are still better models overall.</p><p>But the gap below the frontier is shrinking fast.</p><p>That matters because most economic value from AI will not come from asking the single hardest question. It will come from running useful intelligence everywhere: every repo, every spreadsheet, every customer thread, every document workflow, every internal system, every agent harness.</p><p>In that world, the winner is not always the model with the highest IQ.</p><p>It is the model with the best intelligence per dollar, per second, per workflow, per failure mode.</p><h3>What to watch next</h3><p>The first thing to watch is whether GPT-5.5&#8217;s lead holds once more third-party data arrives. OpenAI&#8217;s own numbers are impressive, but AI IQ&#8217;s methodology is designed to normalize across benchmarks and penalize overreliance on gameable tests. That distinction will matter more as labs optimize for public leaderboards.</p><p>The second thing to watch is whether Opus 4.7 becomes the default model for serious agentic coding despite GPT-5.5&#8217;s overall IQ lead. If developers and enterprises keep reporting that Opus is easier to trust over long runs, Anthropic may continue to win workflows even when OpenAI wins charts.</p><p>The third thing to watch is Gemini. Google was quiet in April, but Gemini 3.1 Pro remains Tier 1. A Gemini 3.2 or Gemini 4 release would immediately reset the frontier.</p><p>The fourth thing to watch is open-weight agentic models. Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro are not just &#8220;cheap alternatives.&#8221; They are the beginning of a world where frontier-adjacent agents can be run, tuned, routed, hosted, and governed outside the closed-model APIs.</p><p>The fifth thing to watch is cost. AI IQ&#8217;s effective-cost framing is going to become more important every month. Sticker price is not enough. A model that uses fewer tokens, finishes faster, retries less, and fails less often can be cheaper in practice even when its posted token price is higher.</p><h3>Bottom line</h3><p>April 2026 gave us a new model hierarchy.</p><p><strong>GPT-5.5 is the best overall model.</strong></p><p><strong>Opus 4.7 is the best EQ and agentic trust model.</strong></p><p><strong>Gemini 3.1 Pro remains a Tier 1 holdover.</strong></p><p><strong>Kimi K2.6, DeepSeek-V4-Pro, Muse Spark, Grok 4.3, Qwen3.6, MiMo-V2.5-Pro, and GLM-5.1 make Tier 2 much more competitive than it was a month ago.</strong></p><p>The frontier is still a premium market.</p><p>But the middle of the market just got much smarter.</p><p>That is what professionals should take away from April. Not that one lab won. Not that one leaderboard settled the race. 
Not that open models caught the frontier.</p><p>The real lesson is that model choice is now a portfolio decision.</p><p>Use the smartest model when intelligence is the bottleneck.</p><p>Use the most emotionally intelligent model when trust and collaboration are the bottleneck.</p><p>Use the cheapest good-enough model when scale is the bottleneck.</p><p>And revisit the decision every month.</p><p>Because in AI, April 2026 was not a normal month.</p><p>It was a warning shot.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of March 2026]]></title><description><![CDATA[GPT-5.4 moved the frontier, while MiniMax and Xiaomi made Tier 2 more interesting]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-model-releases-of-march</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-model-releases-of-march</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Tue, 07 Apr 2026 01:50:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Nr-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png" width="1456" height="819" alt=""></figure></div><p>March was a slower month for AI model releases than February.</p><p>There was no new Claude. No new Gemini. No new Grok.
Most of the major US and Chinese labs were between release cycles.</p><p>But slow does not mean boring.</p><p>OpenAI released GPT-5.4 on March 5, and it immediately became one of the best models in the world. It did not completely erase Gemini 3.1 Pro or Opus 4.6, but it pushed the frontier forward in a very specific direction: real professional work. Spreadsheets, documents, presentations, computer use, coding agents, tool use, long-context work, and fewer factual errors.</p><p>Then, on March 18, two Chinese labs released strong Tier 2 models: MiniMax-M2.7 and Xiaomi MiMo-V2-Pro. Neither model took the overall crown, but both made the cost-performance and agentic-deployment tier more competitive.</p><p>So March was not a broad release wave.</p><p>It was a month with one main event and two important follow-ups.</p><p>The main event was GPT-5.4.</p><p>The follow-ups were MiniMax-M2.7 and MiMo-V2-Pro showing that the layer below the frontier is still getting better.</p><h3>March 2026 model releases</h3><p>March had three releases that matter for AI IQ&#8217;s model rankings, plus one smaller OpenAI release that matters for routing.</p><p><strong>March 5: OpenAI released GPT-5.4</strong><br>GPT-5.4 was the clear headline release of the month. OpenAI positioned it as its strongest mainline reasoning model, with improvements across professional work, coding, computer use, tool use, academic reasoning, factuality, and long-context tasks. OpenAI also said GPT-5.4 was the first mainline reasoning model to incorporate the frontier coding capabilities of GPT-5.3-Codex.</p><p><strong>March 17: OpenAI released GPT-5.4 mini and GPT-5.4 nano</strong><br>These were not Tier 1 models, but they matter for production systems. GPT-5.4 mini and nano were designed for faster, cheaper, high-volume workloads, with OpenAI specifically emphasizing coding subagents, classification, data extraction, ranking, multimodal use, and low-latency tool workflows.</p><p><strong>March 18: MiniMax released MiniMax-M2.7</strong><br>MiniMax-M2.7 was the most interesting non-OpenAI release of March. MiniMax described it as its first model to participate deeply in its own development cycle, with strong agent-harness capabilities, real-world software engineering results, office-work performance, and native Agent Teams. It scored 56.22% on SWE-Pro, 55.6% on VIBE-Pro, and 57.0% on Terminal Bench 2 according to MiniMax&#8217;s launch post.</p><p><strong>March 18: Xiaomi released MiMo-V2-Pro</strong><br>MiMo-V2-Pro was Xiaomi&#8217;s flagship agent model, built for real-world agentic workloads. Xiaomi described it as a trillion-parameter model with 42B active parameters, support for up to 1M-token context, and strong agent benchmark performance, including #3 globally on PinchBench and ClawEval in Xiaomi&#8217;s published comparisons.</p><p>That is a smaller calendar than February, but the model-market impact was still real. GPT-5.4 became a Tier 1 model immediately, and MiniMax-M2.7 and MiMo-V2-Pro both earned Tier 2 positions.</p><h3>The new March ranking</h3><p>By the end of March, AI IQ&#8217;s top tiers looked like this.</p><h4>Tier 1</h4><p><strong>Gemini 3.1 Pro</strong><br>Still the overall leader in AI IQ&#8217;s March ranking. Google did not release a new model in March, but Gemini 3.1 Pro remained the model to beat, especially on broad reasoning and programmatic capability.</p><p><strong>GPT-5.4</strong><br>The new March release and the main event of the month. 
GPT-5.4 moved OpenAI back into the top cluster and became one of the strongest all-around models, especially for professional work, computer use, coding, tool use, and factuality.</p><p><strong>Claude Opus 4.6</strong><br>Still Tier 1 despite no March release. Opus 4.6 remained one of the best models for long-running professional workflows, high-context collaboration, coding agents, and EQ-heavy work.</p><h4>Tier 2</h4><p><strong>Grok 4.20</strong><br>Still xAI&#8217;s strongest model in the March window, but below the top three.</p><p><strong>Kimi K2.5</strong><br>Still highly relevant as an open-weight, multimodal, agentic model from January.</p><p><strong>GLM-5</strong><br>Still one of the stronger Chinese frontier-adjacent models, especially for engineering-agent workflows.</p><p><strong>DeepSeek-V3.2</strong><br>Still an important open-weight baseline.</p><p><strong>MiniMax-M2.7</strong><br>New in March, and the strongest MiniMax model yet. It is especially interesting for coding agents, production debugging, office work, and multi-agent workflows.</p><p><strong>Qwen3.5-397B</strong><br>Still a strong open-weight MoE model from February.</p><p><strong>MiMo-V2-Pro</strong><br>New in March, and one of the more interesting Chinese agent models because of its 1M context, 42B-active architecture, API pricing, and OpenClaw-style agent positioning.</p><p>The short version: March did not reshuffle the entire leaderboard. It inserted GPT-5.4 into Tier 1 and made Tier 2 more competitive.</p><h3>GPT-5.4: the professional-work model</h3><p>GPT-5.4 is the March release that matters most.</p><p>OpenAI&#8217;s launch post was not framed around chat quality or one narrow benchmark. It was framed around professional work: spreadsheets, presentations, documents, legal analysis, financial modeling, coding, computer use, and tool-heavy agentic workflows.</p><p>That is a meaningful shift. The highest-value AI workloads are increasingly not &#8220;answer this question.&#8221; They are &#8220;do this work.&#8221;</p><p>OpenAI reported that GPT-5.4 scored 87.3% on an internal investment-banking-style spreadsheet modeling benchmark, compared with 68.4% for GPT-5.2. It also said human raters preferred GPT-5.4 presentations over GPT-5.2 presentations 68.0% of the time.</p><p>The factuality improvement is also important. OpenAI said GPT-5.4 was its most factual model yet, with individual claims 33% less likely to be false and full responses 18% less likely to contain any errors relative to GPT-5.2, measured on de-identified prompts where users had flagged factual errors.</p><p>That is the core of GPT-5.4&#8217;s case. It is not just smarter in an abstract sense. It is better at the kinds of work people actually pay AI systems to do.</p><h3>GPT-5.4&#8217;s biggest leap may be computer use</h3><p>The most striking GPT-5.4 result is not a math score or a coding score.</p><p>It is computer use.</p><p>OpenAI described GPT-5.4 as its first general-purpose model with native computer-use capabilities. On OSWorld-Verified, GPT-5.4 scored 75.0%, compared with 47.3% for GPT-5.2, and above the human performance baseline of 72.4%.</p><p>That matters because computer use is one of the bridges between &#8220;model that answers questions&#8221; and &#8220;model that can operate software.&#8221;</p><p>A model that can reason through screenshots, issue mouse and keyboard actions, use browser environments, and operate tools reliably is much closer to being useful in the messy middle of knowledge work. Not just writing code. 
Not just summarizing. Actually moving through software systems.</p><p>GPT-5.4 also scored 67.3% on WebArena-Verified and 92.8% on Online-Mind2Web using screenshot-based observations alone, according to OpenAI.</p><p>This is why GPT-5.4 feels different from a normal incremental model release. It is not just another reasoning bump. It is a stronger base for software-operating agents.</p><h3>GPT-5.4 vs Gemini 3.1 Pro vs Opus 4.6</h3><p>The March Tier 1 is not cleanly ordered by one simple criterion.</p><p><strong>Gemini 3.1 Pro</strong> still had the strongest overall AI IQ position by the end of March. It remained the best broad benchmark model in the March window, especially on abstract and programmatic reasoning.</p><p><strong>GPT-5.4</strong> was the strongest new release and the biggest March mover. It made OpenAI much more competitive with Gemini 3.1 Pro and Opus 4.6, especially on professional work, computer use, coding, tool use, factuality, and documents.</p><p><strong>Opus 4.6</strong> remained the model with the strongest claim for long-running, high-context professional workflows and EQ-heavy work.</p><p>The practical distinction is:</p><p>Use <strong>Gemini 3.1 Pro</strong> when you want the strongest broad benchmark profile.</p><p>Use <strong>GPT-5.4</strong> when you care about professional work, computer use, factuality, coding, and tool-heavy agents.</p><p>Use <strong>Opus 4.6</strong> when the task is long, messy, collaborative, and sensitive to workflow quality.</p><p>That is the uncomfortable reality of March: there was no single model that made the other two irrelevant.</p><p>GPT-5.4 was excellent. But March still ended with a three-model Tier 1.</p><h3>GPT-5.4 mini and nano: the routing angle</h3><p>GPT-5.4 mini and nano were not the headline models, but they may matter a lot in production.</p><p>OpenAI released them on March 17 as faster, cheaper models designed for high-volume workloads. GPT-5.4 mini supports text and image inputs, tool use, function calling, web search, file search, computer use, and skills, with a 400K context window. It costs $0.75 per million input tokens and $4.50 per million output tokens. GPT-5.4 nano costs $0.20 per million input tokens and $1.25 per million output tokens.</p><p>That is not just a pricing detail. It is a product architecture detail.</p><p>The best agent systems increasingly do not use one model for everything. They use a large model for planning, final judgment, and hard reasoning, then smaller models for subagents, codebase search, extraction, ranking, classification, and fast supporting tasks.</p><p>OpenAI explicitly framed GPT-5.4 mini this way: GPT-5.4 can handle planning and final judgment, while GPT-5.4 mini subagents handle narrower subtasks in parallel.</p><p>That pattern is going to matter more every month. The March story is not just GPT-5.4 got smarter. It is that OpenAI&#8217;s model stack got easier to route.</p><h3>MiniMax-M2.7: the best non-OpenAI March release</h3><p>MiniMax-M2.7 was the most interesting March release outside OpenAI.</p><p>The headline is not that M2.7 beat the Tier 1 models. It did not.</p><p>The headline is that MiniMax is building toward a different model of progress: models that participate in their own improvement, build agent harnesses, use memory and complex skills, and operate across messy organizational workflows.</p><p>MiniMax described M2.7 as its first model to deeply participate in its own evolution. 
During development, MiniMax says it used the model to update memory, build complex skills, improve reinforcement-learning experiment harnesses, and iterate on its learning process based on experiment results.</p><p>That could sound like marketing if the model were weak. But M2.7&#8217;s reported results are strong enough to pay attention to. MiniMax reported 56.22% on SWE-Pro, 55.6% on VIBE-Pro, and 57.0% on Terminal Bench 2. It also reported a GDPval-AA Elo of 1495, which MiniMax says was the highest among open-source models in its comparison.</p><p>The most interesting parts are practical. MiniMax says M2.7 can handle production debugging, correlate monitoring metrics with deployment timelines, connect to databases to verify root causes, identify missing migrations, and propose non-blocking index creation before submitting a merge request.</p><p>That is not normal &#8220;can write Python&#8221; coding. That is closer to software engineering operations.</p><p>M2.7 belongs in Tier 2, not Tier 1. But for teams building coding agents, office agents, or internal automation systems, it became a model worth testing.</p><h3>MiMo-V2-Pro: Xiaomi enters the serious agent conversation</h3><p>Xiaomi&#8217;s MiMo-V2-Pro was the other important March Tier 2 release.</p><p>The model is interesting for three reasons: scale, context, and agent positioning.</p><p>Xiaomi says MiMo-V2-Pro has more than 1T total parameters with 42B active, supports up to 1M-token context, and includes a lightweight Multi-Token Prediction layer for fast generation.</p><p>It is also built very explicitly for agents. Xiaomi describes MiMo-V2-Pro as a foundation model for real-world agentic workloads and says it is designed to orchestrate complex workflows, drive production engineering tasks, and serve as the &#8220;brain&#8221; of agent systems.</p><p>The benchmark positioning is aggressive. Xiaomi reported MiMo-V2-Pro at #3 globally on PinchBench and #3 globally on ClawEval in its published comparisons, approaching Opus 4.6 on ClawEval and sitting just below Claude Opus 4.6 and MiMo-V2-Omni on PinchBench.</p><p>The pricing is also notable. Xiaomi lists MiMo-V2-Pro at $1 per million input tokens and $3 per million output tokens up to 256K context, and $2 input / $6 output from 256K to 1M context.</p><p>That makes MiMo-V2-Pro easy to understand: it is not the best model overall, but it is a serious agent model with long context and much lower pricing than the top closed models.</p><p>That is exactly the kind of model that can win routing decisions.</p><h3>Dimension-by-dimension read</h3><p>AI IQ evaluates models across Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. 
The composite IQ is the average of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That framework is useful for March because the models had very different shapes.</p><h4>Best overall IQ: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro remained the overall AI IQ leader in the March window.</p><p>GPT-5.4 narrowed the gap and moved OpenAI back into the top cluster, but Gemini 3.1 Pro still had the strongest overall profile by the end of March.</p><p>This is why March was not simply &#8220;GPT-5.4 wins.&#8221; GPT-5.4 was the best new release, but Gemini 3.1 Pro was still the model sitting at the top of the March ranking.</p><h4>Best March release: GPT-5.4</h4><p>Among models released in March, GPT-5.4 was the clear winner.</p><p>It was the only March release that earned a Tier 1 spot. It improved meaningfully over GPT-5.2, absorbed the Codex gains from GPT-5.3-Codex into a mainline model, and became one of the best models in the world across professional work, computer use, coding, tool use, and academic reasoning.</p><p>GPT-5.4 was not a minor checkpoint. It was the March model that changed the top tier.</p><h4>Best EQ: Opus 4.6</h4><p>Opus 4.6 remained the strongest EQ model in the March window.</p><p>This is one of the reasons Opus 4.6 stayed Tier 1 even after GPT-5.4 launched. AI IQ&#8217;s EQ ranking is designed to capture emotional intelligence, conversational judgment, and human preference signals. Those qualities matter more as models move from answering questions to working with people.</p><p>For long-running, high-context, user-facing work, Opus 4.6 still had a strong claim.</p><h4>Best abstract reasoning: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro remained the abstract reasoning leader.</p><p>GPT-5.4 improved OpenAI&#8217;s position substantially, especially compared with GPT-5.2, but Gemini 3.1 Pro was still the strongest model on AI IQ&#8217;s abstract reasoning dimension in the March window.</p><p>That matters because abstract reasoning is one of the harder dimensions to fake. It is less about memorized facts and more about solving novel patterns.</p><h4>Best mathematical reasoning: GPT-5.4, with GPT-5.3-Codex still relevant</h4><p>GPT-5.4 had the strongest March-release math profile.</p><p>OpenAI reported GPT-5.4 at 47.6% on FrontierMath Tier 1&#8211;3 and 27.1% on FrontierMath Tier 4, with GPT-5.4 Pro scoring 50.0% and 38.0% respectively.</p><p>GPT-5.3-Codex remained relevant for technical reasoning, especially where math overlaps with coding, tools, and verification. But GPT-5.4 mattered because it brought those Codex-style strengths into the mainline GPT family.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro overall; GPT-5.4 as the March mover</h4><p>Gemini 3.1 Pro still had the strongest overall programmatic reasoning profile in the March window.</p><p>But GPT-5.4 was the major new programmatic-reasoning release. OpenAI reported GPT-5.4 at 57.7% on SWE-Bench Pro and 75.1% on Terminal-Bench 2.0. GPT-5.3-Codex still led GPT-5.4 on Terminal-Bench 2.0 in OpenAI&#8217;s table, but GPT-5.4 was a much stronger all-around mainline model than GPT-5.2.</p><p>MiniMax-M2.7 deserves mention here too. 
It did not take the top programmatic slot, but its SWE-Pro, VIBE-Pro, Terminal Bench 2, and production-debugging profile made it one of the strongest Tier 2 coding-agent releases of the month.</p><h4>Best academic reasoning: Gemini 3.1 Pro and GPT-5.4</h4><p>Academic reasoning was close at the top.</p><p>Gemini 3.1 Pro remained extremely strong, but GPT-5.4 gave OpenAI a major new academic-reasoning profile. OpenAI reported GPT-5.4 at 39.8% on Humanity&#8217;s Last Exam without tools, 52.1% with tools, 92.8% on GPQA Diamond, and 33.0% on Frontier Science Research.</p><p>That puts GPT-5.4 squarely in the top-tier academic-reasoning conversation.</p><h3>Cost-performance: March was really about routing</h3><p>The most practical lesson from March is that routing kept getting more important.</p><p>GPT-5.4 is expensive but good. OpenAI lists GPT-5.4 API pricing at $2.50 per million input tokens and $15 per million output tokens, compared with GPT-5.2 at $1.75 input and $14 output. GPT-5.4 Pro is far more expensive at $30 input and $180 output.</p><p>That pricing is reasonable if GPT-5.4 is doing high-value work. But it makes no sense to use the most expensive model for every subtask in a large agent loop.</p><p>That is where March got interesting.</p><p>GPT-5.4 mini and nano gave OpenAI cheaper subagent options. MiniMax-M2.7 gave builders a strong Tier 2 coding and office-work model. MiMo-V2-Pro gave builders a 1M-context agent model with much lower listed pricing than Opus-class models.</p><p>The better stack after March looked something like this:</p><p>Use <strong>Gemini 3.1 Pro</strong> for the strongest broad reasoning and benchmark profile.</p><p>Use <strong>GPT-5.4</strong> for professional work, computer use, coding, documents, spreadsheets, presentations, factuality-sensitive tasks, and tool-heavy agents.</p><p>Use <strong>Opus 4.6</strong> for long-running workflows where trust, EQ, and high-context collaboration matter.</p><p>Use <strong>GPT-5.4 mini or nano</strong> for cheaper OpenAI subagents, extraction, ranking, classification, lightweight coding tasks, and fast supporting work.</p><p>Use <strong>MiniMax-M2.7</strong> for coding-agent experiments, office-work agents, production-debugging workflows, and multi-agent harnesses.</p><p>Use <strong>MiMo-V2-Pro</strong> for lower-cost 1M-context agent workflows, especially where OpenClaw-style agent behavior matters.</p><p>That is the March pattern. The top model matters, but the stack matters more.</p><h3>What changed from February to March</h3><p>February was a broad frontier-reset month.</p><p>Gemini 3.1 Pro and Opus 4.6 pushed Tier 1 forward. GPT-5.3-Codex made OpenAI harder to compare because it was clearly powerful, but specialized. Grok 4.20 brought xAI back into Tier 2. GLM-5, MiniMax-M2.5, and Qwen3.5-397B made the Chinese cost-performance layer stronger.</p><p>March was narrower.</p><p>There was one major frontier release: GPT-5.4.</p><p>That release mattered because it answered the main question left open by February: what happens when OpenAI folds GPT-5.3-Codex-style coding capability back into a general-purpose GPT model?</p><p>The answer was GPT-5.4.</p><p>And it was strong enough to make the Tier 1 conversation genuinely three-way again.</p><h3>What to watch next</h3><p>The first thing to watch is whether GPT-5.4 becomes the default model for professional agents. OpenAI&#8217;s benchmark story is strongest around documents, spreadsheets, presentations, computer use, and tool workflows. 
The real signal will be whether users feel that improvement in day-to-day work.</p><p>The second thing to watch is Gemini. Google skipped March after releasing Gemini 3.1 Pro in February. If Google ships another major update, the top of the ranking could move again quickly.</p><p>The third thing to watch is Anthropic. Opus 4.6 remained Tier 1 in March, especially on EQ and long-running workflows, but GPT-5.4 narrowed the professional-work gap. Anthropic&#8217;s next release will need to push hard on reliability, coding agents, and long-context work.</p><p>The fourth thing to watch is MiniMax-M2.7 adoption. The model&#8217;s self-evolution framing is interesting, but adoption will depend on whether developers find it reliable inside real harnesses.</p><p>The fifth thing to watch is MiMo-V2-Pro&#8217;s agent performance in the wild. Xiaomi&#8217;s pricing and 1M-context support are compelling. The question is whether it holds up outside curated agent benchmarks.</p><h3>Bottom line</h3><p>March 2026 was quieter than February, but it still mattered.</p><p><strong>Gemini 3.1 Pro remained the overall leader in AI IQ&#8217;s March ranking.</strong></p><p><strong>GPT-5.4 was the best new model of the month and immediately joined Tier 1.</strong></p><p><strong>Opus 4.6 stayed Tier 1 because of its EQ, long-running workflow quality, and professional-agent profile.</strong></p><p><strong>MiniMax-M2.7 became one of the most interesting Tier 2 models for coding agents, office work, and agent-harness development.</strong></p><p><strong>MiMo-V2-Pro gave Xiaomi a serious Tier 2 agent model with 1M context, 42B active parameters, and aggressive pricing.</strong></p><p>The practical takeaway is simple: March made the top tier more competitive, but it also made routing more important.</p><p>Use the best model when the task is hard.</p><p>Use the cheaper model when the task is frequent.</p><p>Use the model with the right shape when the workflow is specific.</p><p>GPT-5.4 was the headline.</p><p>But March&#8217;s bigger lesson is that serious AI systems are becoming model stacks, not model choices.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of February 2026]]></title><description><![CDATA[Gemini 3.1 Pro and Opus 4.6 reset Tier 1, while GPT-5.3-Codex made OpenAI harder to compare]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-february-2026</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-february-2026</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Tue, 03 Mar 2026 03:38:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Tl-V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png" width="1456" height="819" alt=""></figure></div><p>February was a blockbuster month for AI model releases.</p><p>January was mostly about open-weight models becoming more useful underneath the frontier. February was different. The top of the leaderboard moved.</p><p>Anthropic released Opus 4.6. OpenAI released GPT-5.3-Codex. Google released Gemini 3.1 Pro. xAI pushed Grok 4.20 into public beta. And the Chinese labs shipped another strong wave: GLM-5, MiniMax-M2.5, and Qwen3.5-397B.</p><p>The simple story is that Gemini 3.1 Pro and Opus 4.6 became the new all-around state-of-the-art models.</p><p>The more interesting story is that February did not give us one clean winner. It gave us three different frontier claims.</p><p>Google had the best broad benchmark story with Gemini 3.1 Pro.</p><p>Anthropic had the best long-horizon professional workflow story with Opus 4.6.</p><p>OpenAI had the most awkward model to rank: GPT-5.3-Codex, a model that looks like a clear step up from GPT-5.2 in coding and agentic computer work, but is not quite the same thing as a general-purpose GPT-5.3.</p><p>That made February more interesting than a normal leaderboard update. The frontier moved, but it also got messier.</p><h3>February 2026 model releases</h3><p>February had two clear Tier 1 general-model launches, one major OpenAI specialist launch, one important xAI beta, and a strong Chinese model wave.</p><p><strong>February 5: Anthropic released Claude Opus 4.6</strong><br>Opus 4.6 improved on Opus 4.5 in coding, long-running agentic tasks, large-codebase reliability, debugging, code review, and long-context work. Anthropic also introduced a 1M token context window in beta for an Opus-class model and emphasized everyday professional tasks like financial analysis, research, documents, spreadsheets, and presentations.</p><p><strong>February 5: OpenAI released GPT-5.3-Codex</strong><br>GPT-5.3-Codex was OpenAI&#8217;s biggest February move, but not a normal flagship release. OpenAI described it as its most capable agentic coding model to date, combining the Codex and GPT-5 training stacks, advancing both GPT-5.2-Codex coding performance and GPT-5.2 professional reasoning, while running about 25% faster.</p><p><strong>February 12: Z.ai released GLM-5</strong><br>GLM-5 was designed for complex system engineering and long-range agent tasks. Z.ai framed it as a shift &#8220;from coding to engineering,&#8221; with stronger deep reasoning in backend architecture, complex algorithms, and stubborn bug fixing, plus DeepSeek Sparse Attention for token efficiency.</p><p><strong>February 12: MiniMax released MiniMax-M2.5</strong><br>MiniMax-M2.5 pushed heavily on coding, agentic tool use, search, office work, and economically valuable tasks.
MiniMax reported 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp with context management, while also claiming the model could run continuously for $1/hour at 100 tokens per second.</p><p><strong>Mid-February: xAI pushed Grok 4.20 into public beta</strong><br>Grok 4.20 is harder to source cleanly than the OpenAI, Anthropic, and Google releases. It appeared in public beta in February, while xAI&#8217;s developer release notes list Grok 4.20 and Grok 4.20 Multi-agent as live on March 10. So I would treat it as a February capability signal but not as clean a product release as Opus 4.6 or Gemini 3.1 Pro.</p><p><strong>February 16: Alibaba released Qwen3.5</strong><br>Alibaba unveiled Qwen3.5 for the &#8220;agentic AI era,&#8221; with Reuters reporting claims of lower cost, better large-workload processing, and visual agentic capabilities across mobile and desktop apps. The open-weight Qwen3.5-397B-A17B model uses 397B total parameters with 17B activated and supports 262K native context, extensible to roughly 1M tokens.</p><p><strong>February 19: Google released Gemini 3.1 Pro</strong><br>Gemini 3.1 Pro was the biggest all-around release of the month. Google framed it as upgraded core intelligence for complex tasks, rolling out across the Gemini API, Vertex AI, Gemini app, NotebookLM, Google AI Studio, Antigravity, Gemini CLI, and Android Studio. Google also reported a verified 77.1% score on ARC-AGI-2, more than double Gemini 3 Pro&#8217;s reasoning performance.</p><p>That is a lot for one month.</p><p>The simple release-count story understates it. February was not just busy. It changed the top tier.</p><h3>The new Tier 1</h3><p>By the end of February, AI IQ&#8217;s Tier 1 looked like this:</p><ol><li><p><strong>Gemini 3.1 Pro</strong></p></li><li><p><strong>Claude Opus 4.6</strong></p></li><li><p><strong>GPT-5.3-Codex</strong></p></li></ol><p>That ranking needs a caveat.</p><p>Gemini 3.1 Pro and Opus 4.6 are clean all-around models. They are easy to compare against GPT-5.2, Gemini 3 Pro, and Opus 4.5.</p><p>GPT-5.3-Codex is different. It is clearly stronger than GPT-5.2 in important ways, especially in coding-agent work. But OpenAI did not release a general-purpose GPT-5.3 in February, and GPT-5.3-Codex did not get the same kind of broad benchmark coverage that GPT-5.2 had. So it belongs in Tier 1, but with an asterisk: it is a frontier model, but not a normal flagship model.</p><p>That distinction matters because model selection is no longer a one-column leaderboard problem.</p><p>For broad intelligence, Gemini 3.1 Pro had the strongest February claim.</p><p>For long-running professional work, Opus 4.6 had the strongest workflow claim.</p><p>For agentic coding and computer work, GPT-5.3-Codex looked like OpenAI&#8217;s most important model.</p><p>The best model depended more than usual on what kind of work you meant.</p><h3>Gemini 3.1 Pro: the broad benchmark leader</h3><p>Gemini 3.1 Pro was the cleanest all-around winner of February.</p><p>Google did not frame it as a minor Gemini 3 patch. It framed it as the upgraded core intelligence behind recent Deep Think progress, designed for tasks where &#8220;a simple answer isn&#8217;t enough.&#8221; The launch emphasized complex reasoning, data synthesis, visual explanations, code-based animation, API integration, and agentic development workflows.</p><p>The standout number was ARC-AGI-2. Google reported Gemini 3.1 Pro at 77.1% verified on ARC-AGI-2, more than double Gemini 3 Pro. 
That matters because ARC-AGI-2 is one of the better tests for novel abstract reasoning &#8212; the kind of problem-solving that is harder to explain away with benchmark contamination or memorized examples.</p><p>On AI IQ, that shows up clearly. Gemini 3.1 Pro sits at or near the top overall, and it has one of the strongest profiles across abstract, programmatic, and academic reasoning.</p><p>It also benefits from Google&#8217;s distribution. Gemini 3.1 Pro was not just a model-card release. It rolled into the Gemini app, NotebookLM, Vertex AI, Gemini Enterprise, Google AI Studio, Antigravity, Gemini CLI, and Android Studio. That matters because the model is not only competing in the lab; it is being pushed into the places where people actually work.</p><p>The weakness is the usual Gemini caveat: benchmarks and product feel have not always lined up perfectly. Google&#8217;s best models can look incredible on hard evals but still feel uneven in day-to-day chat or agent workflows. Gemini 3.1 Pro narrows that gap, but it does not erase the question.</p><p>Still, by the end of February, Gemini 3.1 Pro had the strongest claim to &#8220;best overall model.&#8221;</p><h3>Opus 4.6: the professional workflow model</h3><p>Opus 4.6&#8217;s case is different from Gemini&#8217;s.</p><p>Gemini 3.1 Pro had the cleaner broad benchmark story. Opus 4.6 had the better &#8220;give it a real project and let it work&#8221; story.</p><p>Anthropic emphasized coding, planning, long-running agentic tasks, large-codebase reliability, debugging, code review, financial analysis, research, and document/spreadsheet/presentation work. The model also got a 1M token context window in beta, which is especially relevant for enterprise workflows where the model has to reason over large document sets, repositories, or multi-file projects.</p><p>Anthropic also reported strong results on GDPval-AA, BrowseComp, long-context retrieval, cybersecurity, life sciences, and agentic coding. One particularly useful detail: Opus 4.6 scored 76% on an 8-needle 1M-context MRCR v2 task, compared with 18.5% for Sonnet 4.5. That is the kind of long-context retrieval result that matters in real work, not just benchmark theater.</p><p>On AI IQ, Opus 4.6 belongs in Tier 1 because it is strong across the board. But its practical advantage may be even more important than its composite score.</p><p>Opus 4.6 is the model I would most want to test for long-running tasks where the failure modes are subtle: drifting from the goal, missing buried constraints, over-editing code, ignoring project structure, or producing something polished but not quite right.</p><p>That is not the same as saying it beats Gemini 3.1 Pro on every dimension. It does not.</p><p>It is saying that &#8220;best benchmark model&#8221; and &#8220;most trustworthy model for messy professional work&#8221; are no longer obviously the same thing.</p><h3>GPT-5.3-Codex: the hardest model to rank</h3><p>GPT-5.3-Codex is the most interesting February release because it created a ranking problem.</p><p>On one hand, it looks like a major step forward. OpenAI says GPT-5.3-Codex combines the Codex and GPT-5 training stacks, advances both GPT-5.2-Codex and GPT-5.2, runs about 25% faster, and sets new highs on SWE-Bench Pro and Terminal-Bench. OpenAI also describes it as moving Codex beyond writing code toward doing end-to-end work on a computer.</p><p>On the other hand, it is not GPT-5.3.</p><p>That sounds like a pedantic distinction, but it matters. 
GPT-5.3-Codex is clearly a frontier agentic coding model. It may also be a very strong general professional-work model. But because OpenAI did not release a normal general-purpose GPT-5.3, we have less clean evidence for where the underlying GPT line stood in February.</p><p>That makes GPT-5.3-Codex feel like a partial preview of OpenAI&#8217;s next frontier rather than the next clean GPT flagship.</p><p>For developers, this may not matter. If your work is coding, tool use, repo search, debugging, terminal work, and computer operation, GPT-5.3-Codex is one of the first models to test.</p><p>For model rankings, it matters a lot. A coding-specialized frontier model can beat previous GPT releases on many important tasks without telling us exactly how OpenAI&#8217;s general-purpose model compares to Gemini 3.1 Pro or Opus 4.6.</p><p>That is why GPT-5.3-Codex belongs in Tier 1, but with more uncertainty than the other two February leaders.</p><h3>Grok 4.20: xAI is still behind, but gaining ground</h3><p>Grok 4.20 was not the cleanest February release, and it was not Tier 1 on AI IQ.</p><p>But it mattered.</p><p>xAI&#8217;s previous frontier position had been drifting. Grok 4 was useful, but it did not really put xAI in the same all-around conversation as OpenAI, Anthropic, and Google. Grok 4.20 changed that somewhat.</p><p>The model appeared in public beta in February, with coverage emphasizing a major step up in capability and daily bug-fix iteration, while xAI&#8217;s own developer notes later listed Grok 4.20 and Grok 4.20 Multi-agent as live on March 10. So this is not as clean as saying &#8220;xAI officially released a new flagship on February X.&#8221; It is better to say: February is when Grok 4.20 became visible as the next serious Grok checkpoint.</p><p>On AI IQ, Grok 4.20 sits in Tier 2. That feels right.</p><p>It is not yet Gemini 3.1 Pro, Opus 4.6, or GPT-5.3-Codex. But the jump from Grok 4 to Grok 4.20 suggests xAI is gaining ground rather than falling farther behind.</p><p>That is the important part. The fourth US lab is still in fourth place, but it is not irrelevant.</p><h3>The Chinese wave: GLM-5, MiniMax-M2.5, and Qwen3.5</h3><p>February also had a strong Chinese model wave.</p><p>None of GLM-5, MiniMax-M2.5, or Qwen3.5-397B became Tier 1 on AI IQ. But all three pushed the same broader pattern: Chinese labs are not merely trying to top a single benchmark. They are building models around agentic deployment, coding, cost, long context, and broad real-world utility.</p><p><strong>GLM-5</strong> was the strongest pure &#8220;engineering agent&#8221; story. Z.ai explicitly framed it as a move from coding to engineering, with an emphasis on backend architecture, complex algorithms, long-range agent tasks, and stubborn bug fixing. It also integrated DeepSeek Sparse Attention for better token efficiency while preserving long-context quality.</p><p><strong>MiniMax-M2.5</strong> was the cost-performance story. MiniMax claimed strong coding, tool-use, search, and office-work results, plus much lower cost for continuous agentic operation. The key line is not the SWE-Bench score. It is the $1/hour claim at 100 tokens per second. If that holds up in real workflows, it changes what kinds of agentic applications are economically reasonable.</p><p><strong>Qwen3.5-397B</strong> was the architecture and deployment story. 
The open-weight version uses a 397B-total / 17B-active MoE design, combines Gated Delta Networks with sparse MoE, supports a large native context window, and is built as a native vision-language model. Reuters also reported Alibaba&#8217;s claim that Qwen3.5 was 60% cheaper and eight times better at processing large workloads than its predecessor.</p><p>The US labs still owned the top of the leaderboard in February.</p><p>But the Chinese labs continued to make Tier 2 much more usable.</p><p>That matters because most real AI usage will not be one-off frontier prompts. It will be millions of calls across coding agents, office agents, search agents, customer-support workflows, document systems, and internal automation. In that world, a model does not have to be the smartest model in the world to be economically important.</p><h3>Updated model rankings</h3><p>AI IQ&#8217;s February ranking can be summarized like this.</p><h4>Tier 1</h4><p><strong>Gemini 3.1 Pro</strong><br>The strongest overall model by the end of February. Best broad benchmark story, especially on abstract reasoning and all-around capability.</p><p><strong>Claude Opus 4.6</strong><br>The strongest long-running professional workflow model. Excellent for agentic work, coding, large context, research, and document/spreadsheet-heavy tasks.</p><p><strong>GPT-5.3-Codex</strong><br>A Tier 1 coding-agent model with unusually strong professional-work capabilities, but harder to compare because OpenAI did not release a general GPT-5.3.</p><h4>Tier 2</h4><p><strong>Grok 4.20</strong><br>A major step up for xAI, but still below the top three US labs overall.</p><p><strong>Kimi K2.5</strong><br>Still highly relevant from January as one of the strongest open-weight agentic models.</p><p><strong>GLM-5</strong><br>The strongest February Chinese release on engineering-agent positioning.</p><p><strong>DeepSeek-V3.2</strong><br>Still an important open-weight baseline even without being the month&#8217;s new headline.</p><p><strong>MiniMax-M2.5</strong><br>A cost-performance standout for coding, search, office work, and agentic tasks.</p><p><strong>Qwen3.5-397B</strong><br>An efficient open-weight native vision-language MoE model with a strong deployment story.</p><p>This is a more complicated ranking than January&#8217;s.</p><p>January&#8217;s story was: GPT-5.2 still leads, but open-weight models are becoming useful.</p><p>February&#8217;s story was: the top tier changed, but model choice became less obvious.</p><h3>Dimension-by-dimension read</h3><p>AI IQ evaluates models across Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. It also compresses easier or more gameable benchmarks so saturated tests cannot dominate the composite score. That is especially important in February because the new releases have very different shapes.</p><h4>Best overall IQ: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro had the strongest overall February profile.</p><p>Its advantage came from breadth. It was not just a coding model or a math model. It was strong across abstract reasoning, programmatic reasoning, academic reasoning, multimodal problem-solving, and tool-heavy workflows.</p><p>The most important public signal was Google&#8217;s ARC-AGI-2 score. A 77.1% verified result on a hard abstract reasoning benchmark is not a normal incremental upgrade. 
It is the kind of result that changes the top of the ranking.</p><h4>Best EQ: Opus 4.6</h4><p>Opus 4.6 had the strongest EQ and professional-collaboration profile.</p><p>EQ in AI IQ is not just friendliness. It is a proxy for conversational judgment, calibration, tact, user alignment, and the ability to work well in high-context settings. That matters more as models move from answering questions to doing work with people.</p><p>Opus 4.6&#8217;s strength is that it does not just score well; it feels designed for handoff. Give it a complex project, a large context, and a vague but real professional goal, and it is one of the models most likely to stay useful without constant correction.</p><h4>Best abstract reasoning: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro was the February abstract reasoning leader.</p><p>That is the clearest part of its case. Google&#8217;s ARC-AGI-2 result was a major jump from Gemini 3 Pro, and AI IQ&#8217;s abstract reasoning dimension puts real weight on ARC-style tasks because they test novel pattern-solving more directly than knowledge-heavy benchmarks.</p><h4>Best mathematical reasoning: GPT-5.3-Codex</h4><p>GPT-5.3-Codex had the strongest February claim on math-adjacent reasoning, especially where math overlaps with formal reasoning, code, and tool-based problem-solving.</p><p>This is one of the places where the model&#8217;s specialization matters. A Codex model that can reason through repositories, run terminal workflows, and verify intermediate results is not just a coding autocomplete system. It starts to become a practical reasoning engine for technical work.</p><p>The caveat is coverage. GPT-5.3-Codex did not get the same clean all-domain benchmark treatment as a normal GPT flagship. So I would call it the strongest technical-reasoning model of February, but not use it alone to infer the full state of OpenAI&#8217;s general GPT line.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro overall; GPT-5.3-Codex for coding agents</h4><p>This is the most nuanced category.</p><p>On AI IQ&#8217;s broader programmatic dimension, Gemini 3.1 Pro had the strongest all-around position. But GPT-5.3-Codex was the most important coding-agent release of the month.</p><p>That distinction matters. Programmatic reasoning is broader than patching code. AI IQ includes Terminal-Bench, SWE-Bench, and SciCode, with compression applied to more gameable benchmarks. So a model can be the best &#8220;coding agent&#8221; in a product sense while another model has the strongest overall programmatic IQ profile.</p><p>For coding-agent products, GPT-5.3-Codex should be near the top of the eval list. For broad technical reasoning across coding, science, terminal work, and structured problem-solving, Gemini 3.1 Pro still has the best February case.</p><h4>Best academic reasoning: Gemini 3.1 Pro and Opus 4.6</h4><p>Academic reasoning was close between Gemini 3.1 Pro and Opus 4.6.</p><p>Gemini 3.1 Pro had the stronger broad-reasoning profile. 
Opus 4.6 had the stronger professional-knowledge-work story, especially on long-context retrieval, GDPval-AA, BrowseComp, finance, research, and expert workflows.</p><p>The practical difference is this: Gemini 3.1 Pro looks like the better academic benchmark model; Opus 4.6 looks like the model you would trust with a long, messy professional research task.</p><p>Both belong in Tier 1.</p><h3>Cost-performance: the gap below the frontier keeps narrowing</h3><p>February also made cost-performance more important.</p><p>The top three models &#8212; Gemini 3.1 Pro, Opus 4.6, and GPT-5.3-Codex &#8212; are the models to test when quality matters most. But most production AI systems should not route every call to the most expensive model.</p><p>That is where MiniMax-M2.5, GLM-5, Qwen3.5-397B, Kimi K2.5, and DeepSeek-V3.2 matter.</p><p>MiniMax-M2.5 is the clearest example. A model that performs well on coding, search, tool use, and office work while costing roughly $1/hour to run continuously at 100 tokens per second is not just a cheaper benchmark competitor. It changes the economics of background agents, always-on coding assistants, and multi-step office automation.</p><p>Qwen3.5-397B is another example. It does not beat Gemini 3.1 Pro overall, but a 397B-total / 17B-active native vision-language MoE model with long context and open weights is exactly the kind of model teams will want to experiment with for lower-cost multimodal agents.</p><p>GLM-5 sits in the same category. Its positioning around system engineering, long-range agent tasks, and token efficiency makes it relevant for teams trying to route complex coding and engineering work without paying top-tier closed-model prices every time.</p><p>The best model stack after February probably looked something like this:</p><p>Use <strong>Gemini 3.1 Pro</strong> for the strongest broad reasoning and abstract problem-solving.</p><p>Use <strong>Opus 4.6</strong> for long-running professional workflows, research, documents, spreadsheets, and coding tasks where trust and context retention matter.</p><p>Use <strong>GPT-5.3-Codex</strong> for serious coding-agent workflows, terminal tasks, repo work, and computer-use-heavy execution.</p><p>Use <strong>MiniMax-M2.5</strong>, <strong>GLM-5</strong>, <strong>Qwen3.5-397B</strong>, <strong>Kimi K2.5</strong>, or <strong>DeepSeek-V3.2</strong> when cost, openness, latency, or deployment control matter more than squeezing out the last few points of frontier capability.</p><p>Use <strong>Grok 4.20</strong> if you are specifically evaluating xAI&#8217;s ecosystem or want to see how quickly xAI is improving.</p><p>The better setup is not one model. It is routing.</p><h3>What changed from January to February</h3><p>January was mostly about the floor rising.</p><p>February was about the ceiling moving.</p><p>In January, GPT-5.2 and Gemini 3 Pro still sat at the top, while Kimi K2.5, GLM-4.7, MiniMax-M2.1, and GLM-4.7-Flash made the lower tiers more useful.</p><p>In February, the top changed. Gemini 3.1 Pro and Opus 4.6 displaced the old frontier. GPT-5.3-Codex gave OpenAI a stronger model, but in a specialist package. Grok 4.20 made xAI relevant again in the Tier 2 conversation. And the Chinese labs kept filling in the cost-performance layer.</p><p>That is why February mattered.</p><p>It was not just a month with more releases. It changed the shape of the model market.</p><h3>What to watch next</h3><p>The first thing to watch is OpenAI. 
GPT-5.3-Codex is clearly important, but it leaves an obvious question: where is the general-purpose GPT-5.3? If OpenAI folds Codex-level coding into a broader GPT model, the top of the leaderboard could move again quickly.</p><p>The second thing to watch is whether Gemini 3.1 Pro&#8217;s benchmark strength translates into everyday agentic reliability. The ARC-AGI-2 result is excellent. The practical question is whether developers and professionals feel the same jump in real workflows.</p><p>The third thing to watch is Opus 4.6 adoption in enterprise agents. Anthropic&#8217;s model is extremely well-positioned for long-context, high-trust, professional work. If businesses increasingly evaluate models on end-to-end task completion instead of chat quality, Opus 4.6 may end up being more important than its raw ranking suggests.</p><p>The fourth thing to watch is Grok 4.20. xAI is still behind the top three US labs, but the improvement rate matters. If Grok 4.20&#8217;s gains carry into a cleaner API release and stronger benchmark coverage, xAI becomes harder to dismiss.</p><p>The fifth thing to watch is the Chinese cost-performance tier. GLM-5, MiniMax-M2.5, and Qwen3.5-397B are not Tier 1 overall, but they are exactly the kinds of models that can win production routing decisions.</p><h3>Bottom line</h3><p>February 2026 was one of the most important AI model months so far.</p><p><strong>Gemini 3.1 Pro became the strongest overall model in AI IQ&#8217;s February ranking.</strong></p><p><strong>Opus 4.6 became the strongest long-running professional workflow model.</strong></p><p><strong>GPT-5.3-Codex became OpenAI&#8217;s most important agentic coding model, but it was harder to compare because OpenAI did not release a general GPT-5.3.</strong></p><p><strong>Grok 4.20 brought xAI back into the conversation, even though it still sat below the top frontier labs.</strong></p><p><strong>GLM-5, MiniMax-M2.5, and Qwen3.5-397B made Tier 2 more serious, especially for coding, agents, cost, and deployment control.</strong></p><p>The practical takeaway is that model selection got harder.</p><p>In January, you could mostly say GPT-5.2 was still the default, while open-weight models were becoming useful.</p><p>After February, that was no longer enough.</p><p>Use Gemini 3.1 Pro when you want the strongest broad reasoning.</p><p>Use Opus 4.6 when the work is long, messy, and professional.</p><p>Use GPT-5.3-Codex when the task is coding-agent work.</p><p>Use the best Tier 2 models when scale, openness, or cost matters more than the last few points of capability.</p><p>February did not settle the race.</p><p>It made the routing problem real.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of January 2026]]></title><description><![CDATA[GPT-5.2 stayed on top, but Kimi K2.5 made January interesting]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-january-2026</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-january-2026</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Mon, 02 Feb 2026 21:36:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!12XE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!12XE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png" width="1456" height="971" alt=""></figure></div><p>January 2026 was not a clean frontier-reset month.</p><p>OpenAI did not release GPT-5.3. Anthropic did not release Opus 4.6. Google did not release Gemini 3.1. The very top of the leaderboard mostly stayed where December left it: GPT-5.2 and Gemini 3 Pro out in front, with Opus 4.5 still one of the most important models for coding agents and professional work.</p><p>That made January easy to underrate.</p><p>The month&#8217;s real action was one layer down. Kimi K2.5 was the major new model release. Z.ai shipped GLM-4.7-Flash. And several late-December models &#8212; especially GLM-4.7 and MiniMax-M2.1 &#8212; spent January getting benchmarked, integrated, and compared against the closed frontier.</p><p>So January was not about a new best model.</p><p>It was about the rest of the stack getting more useful.</p><h3>January 2026 model releases</h3><p>The corrected release calendar is important here.</p><p><strong>GLM-4.7 and MiniMax-M2.1 were not January releases.</strong> GLM-4.7 was released on December 22, 2025, according to Z.ai&#8217;s release notes, and MiniMax-M2.1 was released on December 23, 2025, according to MiniMax&#8217;s own launch post. They mattered a lot in January, but they should be treated as late-December models that shaped the January rankings, not as January launches.</p><p>The true January model-release story was narrower:</p><p><strong>January 19: Z.ai released GLM-4.7-Flash</strong><br>GLM-4.7-Flash was a lighter, lower-latency version of GLM-4.7, designed as the free-tier model. Z.ai positioned it around coding, reasoning, writing, translation, roleplay, high-throughput use cases, and real-time interaction.</p><p><strong>January 27: Moonshot released Kimi K2.5</strong><br>Kimi K2.5 was the month&#8217;s most important new model. Moonshot described it as a native multimodal, open-source agentic model trained with continued pretraining over roughly 15T mixed visual and text tokens. It also introduced Agent Swarm, where the model can coordinate up to 100 sub-agents and run parallel workflows across up to 1,500 tool calls.</p><p>That is a quieter month than April would later become. But it was still useful.
January clarified which models were actually worth testing below the top frontier tier.</p><h3>The top tier barely moved</h3><p>The best model available by the end of January was still <strong>GPT-5.2</strong>.</p><p>OpenAI had released GPT-5.2 in December as its most advanced model for professional work and long-running agents, with strong reported results across software engineering, math, abstract reasoning, science, long-context work, spreadsheets, presentations, coding, tool use, and multi-step projects.</p><p><strong>Gemini 3 Pro</strong> was still right behind it. Google released Gemini 3 in November as its most intelligent model, with emphasis on reasoning, multimodality, coding, and agentic workflows.</p><p><strong>Claude Opus 4.5</strong> remained highly relevant, especially for coding agents, tool use, computer use, spreadsheets, and long-running professional work. Anthropic released it in November and described it as its best model for coding, agents, and computer use, with a lower Opus API price of $5 per million input tokens and $25 per million output tokens.</p><p>So the January headline is not &#8220;new model beats GPT.&#8221;</p><p>It is more specific: <strong>the top stayed mostly closed, but the next layer got much more interesting.</strong></p><h3>The best January release: Kimi K2.5</h3><p>Kimi K2.5 was the clear standout January release.</p><p>It did not take the overall AI IQ crown from GPT-5.2. It did not make Gemini 3 Pro irrelevant. It did not erase Opus 4.5&#8217;s agentic-workflow strengths.</p><p>But it did make the open-weight category harder to ignore.</p><p>The model&#8217;s pitch was unusually product-shaped. Kimi K2.5 was not only a text reasoning model with better benchmark results. It was native multimodal. It could reason over images and video. It could code from visual inputs. It had instant, thinking, agent, and agent-swarm modes. And the Agent Swarm system made a serious attempt at scaling agentic work horizontally instead of just asking one model to think longer.</p><p>That last point is the interesting one.</p><p>Most agentic AI still runs as one long loop: plan, call tool, observe, revise, call another tool, and repeat. Kimi K2.5&#8217;s Agent Swarm design instead asks whether some tasks should be decomposed into many parallel subtasks. Moonshot reports that the system can create up to 100 sub-agents and reduce execution time by up to 4.5x compared with a single-agent setup.</p><p>That does not automatically make Kimi K2.5 the best model. It does make it one of the more strategically interesting January releases.</p><p>For high-stakes synthesis, GPT-5.2 still made more sense. For top-tier programmatic reasoning, Gemini 3 Pro remained extremely strong. For long-running coding-agent work, Opus 4.5 still had a strong case.</p><p>But for teams that care about open weights, multimodal agents, cost control, visual coding, and parallelized workflows, Kimi K2.5 became a model worth testing.</p><h3>GLM-4.7-Flash: not a frontier reset, but useful</h3><p>GLM-4.7-Flash was not the most important model of January. Kimi K2.5 was.</p><p>But GLM-4.7-Flash still mattered because it pointed at a different part of the market: high-frequency, lower-latency usage.</p><p>Z.ai described GLM-4.7-Flash as a lightweight and efficient model designed as the free-tier version of GLM-4.7, with strong performance across coding, reasoning, and generative tasks. 
The release was explicitly framed around low latency, high throughput, writing, translation, roleplay, and real-time use cases.</p><p>That is not glamorous, but it is important.</p><p>A lot of real AI usage does not need the best model. It needs a model that is good enough, fast enough, and cheap enough to run constantly. Customer support drafts, internal assistants, low-stakes code edits, document cleanup, translation, lightweight research, routing, summarization, and agent pre-processing do not all need GPT-5.2.</p><p>GLM-4.7-Flash was not competing to be the smartest model in the world. It was competing to be useful at scale.</p><p>That is a different race, and January had more of that than a simple leaderboard would suggest.</p><h3>The late-December models that shaped January</h3><p>The reason January felt busier than its release calendar is that two important late-December releases were still being absorbed by the ecosystem.</p><p><strong>GLM-4.7</strong> was released on December 22, 2025. Z.ai described it as a foundation model with improvements in coding, reasoning, agentic capabilities, long-context understanding, and end-to-end task execution across real-world development workflows.</p><p><strong>MiniMax-M2.1</strong> was released on December 23, 2025. MiniMax emphasized real-world complex tasks, multi-language programming, office workflows, lower token consumption, faster response speed, and better generalization across coding-agent frameworks like Claude Code, Droid, Cline, Kilo Code, Roo Code, and BlackBox.</p><p>These models should not be listed as January launches. But they absolutely belong in a January model analysis, because January was when teams had time to test them against the frontier.</p><p>GLM-4.7 was the stronger general open-weight reasoning and coding story.</p><p>MiniMax-M2.1 was the practical coding-agent story: less about one heroic score, more about multi-language software work, scaffold compatibility, token efficiency, and whether a model behaves well inside actual coding tools.</p><p>That distinction matters. Real coding agents do not live inside one benchmark. They live inside harnesses, repos, terminals, IDEs, issue trackers, and messy project context.</p><h3>Updated January model rankings</h3><p>Using the AI IQ January window, the model hierarchy looked roughly like this.</p><h4>Tier 1 / top frontier</h4><p><strong>GPT-5.2</strong><br>The best overall model available by the end of January. Strong across broad reasoning, math, academic work, software engineering, tool use, and professional tasks.</p><p><strong>Gemini 3 Pro</strong><br>Still one of the strongest models overall, and especially strong on programmatic reasoning and multimodal workflows.</p><h4>High-end professional / frontier-adjacent</h4><p><strong>Claude Opus 4.5</strong><br>Not the overall AI IQ leader, but still one of the most important models for coding agents, computer use, long-running tasks, spreadsheets, and professional workflows.</p><p><strong>Kimi K2.5</strong><br>The best true January release. 
Not top overall, but the most interesting new open-weight agentic model of the month.</p><p><strong>GLM-4.7</strong><br>A late-December release that remained highly relevant in January for open-weight coding, reasoning, and agentic development workflows.</p><p><strong>MiniMax-M2.1</strong><br>Also a late-December release, but important in January for coding-agent workflows, multi-language programming, and practical deployment.</p><p><strong>GLM-4.7-Flash</strong><br>The January speed-and-throughput release. Less important for the top leaderboard, more important for cheap, frequent, lower-latency usage.</p><p>That hierarchy is less clean than a single leaderboard, but it is more useful.</p><p>January was not about replacing GPT-5.2. It was about giving teams more credible models below it.</p><h4>Dimension-by-dimension read</h4><p>AI IQ evaluates models across four cognitive dimensions: Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. The composite IQ is the average of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That framework is especially helpful for January because the new and recently released models had different shapes.</p><h4>Best overall IQ: GPT-5.2</h4><p>GPT-5.2 was still the best overall model in the January window.</p><p>Its advantage was breadth. It was not only a coding model, math model, or chat model. It remained the safest default for mixed professional tasks: reading dense material, reasoning through tradeoffs, writing code, checking math, working with long context, and producing polished outputs.</p><p>For general high-value work, GPT-5.2 was the default pick.</p><h4>Best January release overall: Kimi K2.5</h4><p>Among true January releases, Kimi K2.5 had the strongest overall story.</p><p>Its benchmark profile was good, but the more interesting thing was the model shape: multimodal input, coding, visual debugging, agentic execution, office productivity, and parallel sub-agent orchestration.</p><p>It was not the best model in the world. But it was the January release most likely to change what serious teams tested.</p><h4>Best abstract reasoning: GPT-5.2</h4><p>GPT-5.2 remained the January-window leader on abstract reasoning.</p><p>That matters because abstract reasoning is one of the harder capabilities to explain away through benchmark familiarity. AI IQ&#8217;s abstract reasoning dimension uses ARC-AGI-2 and ARC-AGI-1, with ARC-AGI-2 treated as the harder, more frontier-discriminating benchmark.</p><p>Kimi K2.5 and GLM-4.7 were useful, but they did not close the gap at the very top.</p><h4>Best mathematical reasoning: GPT-5.2</h4><p>GPT-5.2 was also the math leader in the January window.</p><p>AI IQ&#8217;s math dimension uses FrontierMath Tier 4 and AIME, with AIME compressed because it is easier to saturate and more exposed to contamination. That is a better signal than simply asking which model got the highest AIME score.</p><p>Kimi K2.5 was strong for an open-weight January release, but GPT-5.2 still had the broader mathematical reasoning profile.</p><h4>Best programmatic reasoning: Gemini 3 Pro</h4><p>Gemini 3 Pro had the strongest case on programmatic reasoning among models available by the end of January.</p><p>This is one place where &#8220;best overall&#8221; and &#8220;best for a specific task type&#8221; diverge. 
GPT-5.2 led overall, but Gemini 3 Pro remained extremely competitive on coding-heavy and programmatic tasks.</p><p>AI IQ&#8217;s programmatic dimension is also more useful than a simple SWE-Bench leaderboard because it combines Terminal-Bench 2.0, SWE-Bench Verified, and SciCode, while compressing SWE-Bench due to leakage and gameability concerns.</p><p>Among the January-related models, Kimi K2.5 was the most important new programmatic entrant. GLM-4.7 and MiniMax-M2.1 were also worth testing, especially for teams that cared about open-weight deployment and coding-agent workflows.</p><h4>Best academic reasoning: Gemini 3 Pro, with GPT-5.2 close</h4><p>Academic reasoning was one of the tighter parts of the January top tier.</p><p>Gemini 3 Pro had the strongest case on academic reasoning in the January window, with GPT-5.2 close behind. Both were meaningfully ahead of the true January releases on broad expert-knowledge tasks.</p><p>AI IQ&#8217;s academic reasoning dimension includes Humanity&#8217;s Last Exam, CritPt, and GPQA Diamond, with GPQA compressed because public graduate-level science benchmarks are easier to contaminate than newer expert-screened tests.</p><p>Kimi K2.5 was still impressive, especially with tools. But the best academic models were still the late-2025 closed frontier models.</p><h4>Best EQ: GPT-5.2</h4><p>In AI IQ&#8217;s January-window view, GPT-5.2 had the strongest EQ profile.</p><p>That may surprise people who associate Claude with the best conversational feel. But AI IQ&#8217;s EQ score is not just vibes. It combines EQ-Bench 3 Elo and Arena Elo, maps them onto an EQ-like scale, and applies a 200-point EQ-Bench penalty to Anthropic models to correct for Claude-judged family bias.</p><p>That does not mean GPT-5.2 will feel better than Opus 4.5 in every workflow. It does mean GPT-5.2 looked extremely strong on the measured EQ signals AI IQ tracks.</p><h3>The cost-performance story</h3><p>January made the cost-performance conversation more serious.</p><p>GPT-5.2 and Gemini 3 Pro were better models overall. But they were not automatically the right models for every call in every workflow.</p><p>AI IQ&#8217;s effective-cost metric helps here because it does not stop at sticker price. It starts with the cost of a 2M input / 1M output workload, then adjusts by token usage efficiency so token-hungry models are penalized and token-efficient models get credit.</p><p>That changes the model-selection problem.</p><p>For high-stakes reasoning, GPT-5.2 was worth paying for.</p><p>For programmatic reasoning and multimodal coding workflows, Gemini 3 Pro was hard to ignore.</p><p>For long-running coding agents and professional workflows, Opus 4.5 remained an important option.</p><p>For open-weight agentic work, Kimi K2.5 became the model to test.</p><p>For coding-agent experimentation and practical engineering workflows, GLM-4.7 and MiniMax-M2.1 were relevant even though they were December releases.</p><p>For lower-latency, high-frequency tasks, GLM-4.7-Flash pointed toward the cheaper end of the routing stack.</p><p>The better setup was not one model. It was routing.</p><h3>The bigger January theme: open-weight models became more practical</h3><p>The main January pattern was not &#8220;China caught OpenAI.&#8221;</p><p>That would be too strong.</p><p>The top was still mostly closed. 
GPT-5.2 and Gemini 3 Pro were ahead overall, and Opus 4.5 remained one of the most useful professional-agent models.</p><p>The better read is that open-weight and frontier-adjacent models became more practical.</p><p>That is different from being the best.</p><p>A practical infrastructure model needs to be cheap enough, fast enough, available enough, customizable enough, and capable enough. It does not need to win every benchmark. It needs to handle large volumes of useful work without forcing teams to send every intermediate step to the most expensive frontier API.</p><p>Kimi K2.5, GLM-4.7, MiniMax-M2.1, and GLM-4.7-Flash all pointed in that direction.</p><p>Kimi pushed toward open multimodal agents and parallel sub-agent execution.</p><p>GLM pushed toward open-weight coding, reasoning, and faster deployment tiers.</p><p>MiniMax pushed toward practical coding-agent generalization across programming languages, tools, and scaffolds.</p><p>That was the January story.</p><p>The frontier did not move much. The layer underneath it got more usable.</p><h3>What to watch next</h3><p>The first thing to watch is whether Kimi K2.5 gets real adoption inside coding-agent and office-agent products. A model can look good in a launch post and still fail to become part of actual workflows. The real signal will be whether developers and teams route meaningful work through it.</p><p>The second thing to watch is whether Agent Swarm becomes a serious pattern or stays mostly a demo. Parallel agents are compelling, but they introduce coordination overhead, verification problems, and new failure modes. If the pattern works, it could become one of the more important agent architectures of 2026.</p><p>The third thing to watch is GLM-4.7-Flash-style routing. The market needs cheap, fast, good-enough models just as much as it needs frontier reasoning models.</p><p>The fourth thing to watch is MiniMax&#8217;s practical coding-agent direction. M2.1 was not a January release, but its emphasis on multi-language work, scaffold generalization, and office workflows was exactly where model evaluation needs to go.</p><p>The fifth thing to watch is the next true frontier release. January did not bring one. The next OpenAI, Anthropic, or Google release could quickly change the top of the AI IQ ranking.</p><h3>Bottom line</h3><p>January 2026 did not give us a new overall champion.</p><p><strong>GPT-5.2 remained the best overall model in AI IQ&#8217;s January-window ranking.</strong></p><p><strong>Gemini 3 Pro remained one of the strongest models overall and had the best case on programmatic reasoning.</strong></p><p><strong>Opus 4.5 stayed important for coding agents, computer use, and long-running professional work.</strong></p><p><strong>Kimi K2.5 was the best true January release.</strong></p><p><strong>GLM-4.7-Flash was the month&#8217;s practical speed-and-throughput release.</strong></p><p><strong>GLM-4.7 and MiniMax-M2.1 were late-December releases, not January releases, but both shaped the January conversation around open-weight coding and agentic workflows.</strong></p><p>The practical takeaway is simple: January made routing more important.</p><p>For the hardest work, pay for the frontier. For coding-heavy, latency-sensitive, open-weight, or high-volume agent workflows, the January and late-December models deserved real evaluation.</p><p>The best model did not change.</p><p>The set of models worth using did.</p>]]></content:encoded></item></channel></rss>