The Best AI Models of April 2026

GPT-5.5 vs Opus 4.7 — and the month the Tier 2 field exploded

May 04, 2026

April 2026 was the most important month for AI model releases so far this year.

OpenAI released GPT-5.5. Anthropic released Claude Opus 4.7. Meta came back into the race with Muse Spark. xAI shipped Grok 4.3. And the Chinese frontier labs delivered a full wave of serious models: GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro.

The simple story is that GPT-5.5 and Opus 4.7 are now two of the most important models at the frontier.

The more interesting story is that April was not one race. It was four races happening at once.

The first was the raw intelligence race, where GPT-5.5 moved into first place on AI IQ’s overall estimated IQ ranking.

The second was the EQ and writing-quality race, where Opus 4.7 stood out most clearly.

The third was the instruction-following race, where Grok 4.3 leads the field.

The fourth was the cost-performance race, where Tier 2 became crowded enough to matter for real routing decisions.

That is the real story of April: model quality is no longer one-dimensional.

GPT-5.5 leads on raw intelligence, Opus 4.7 leads on EQ, Grok 4.3 leads on instruction-following, and Gemini 3.1 Pro remains highly competitive on programmatic reasoning.

And the Tier 2 field is now good enough that routing matters more than ever.

April was the month model selection became multi-dimensional

A year ago, model launches were still mostly judged by chat quality, coding snippets, MMLU-style knowledge tests, and a handful of math benchmarks.

That is no longer enough.

The models released in April are all competing to become work systems: coding models, research models, tool-use models, document models, spreadsheet models, long-context models, and models that can sit inside multi-step workflows.

But we should not evaluate them by repeating launch claims.

The point of AI IQ is to normalize performance across hard benchmarks, compress saturated tests, and separate model capability into dimensions that actually matter.

The old way to compare models was to ask:

“Which model got the highest score?”

The better question now is:

“Which model is best for the kind of cognition you actually need?”

AI IQ breaks raw intelligence into four dimensions: Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. The composite IQ is the mean of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.

That compression is important. A model should not become “the smartest model” just because it crushes a saturated or contaminated benchmark. The frontier should be judged by hard, still-discriminating tests.
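
To make the idea concrete, here is a minimal sketch of a composite built as the mean of four dimension scores, with a saturated benchmark compressed before it is averaged in. The compression function, its `strength` parameter, and every score below are assumptions for illustration only; AI IQ's actual formula is not reproduced here.

```python
from statistics import mean

def compress(score: float, ceiling: float = 100.0, strength: float = 0.5) -> float:
    """Hypothetical compression: pull a saturated benchmark's score toward the
    midpoint of its range so it moves the dimension less. Not AI IQ's formula."""
    midpoint = ceiling / 2
    return midpoint + (score - midpoint) * strength

def dimension_score(hard: list[float], saturated: list[float]) -> float:
    """Average hard benchmarks as-is and saturated ones after compression."""
    return mean(hard + [compress(s) for s in saturated])

def composite(dimensions: dict[str, float]) -> float:
    """Composite is the plain mean of the dimension scores."""
    return mean(list(dimensions.values()))

# Made-up numbers: one hard math benchmark at 40, one saturated one at 98.
math_score = dimension_score(hard=[40.0], saturated=[98.0])
print(math_score)  # 57.0, instead of 69.0 for an uncompressed mean

print(composite({
    "abstract": 50.0,
    "mathematical": math_score,
    "programmatic": 60.0,
    "academic": 53.0,
}))  # 55.0
```

The design point is that a near-perfect score on the saturated test adds far less to the dimension than it would under a plain average.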

But April made something else clear too: IQ is not enough.

For real workflows, we also care about EQ, instruction-following, cost, latency, tool use, refusal behavior, and recovery from failure.

Those are not one thing.

They are separate axes.

The new Tier 1: GPT-5.5, Opus 4.7, and Gemini 3.1 Pro

AI IQ’s updated ranking now has three Tier 1 models:

GPT-5.5

Claude Opus 4.7

Gemini 3.1 Pro

Google did not release a major new model in April, but Gemini 3.1 Pro remains in the top cluster. That is important. The April narrative is not “OpenAI and Anthropic left everyone behind.” It is more precise than that: OpenAI and Anthropic refreshed into the frontier, while Google’s previous frontier model still held its ground.

The top-level read:

GPT-5.5 is the overall intelligence champion. It sits at the top of the AI IQ ranking and leads the April class on raw composite capability.

Opus 4.7 is the EQ and writing-quality standout. It does not beat GPT-5.5 on overall IQ, but it leads AI IQ’s current EQ ranking and remains one of the most compelling models for high-context collaboration, editing, and professional writing.

Gemini 3.1 Pro is the holdover giant. It did not need an April launch to remain relevant. On the AI IQ charts, it continues to sit in the frontier cluster and is especially strong in programmatic reasoning.

Grok 4.3 is not Tier 1 overall, but it deserves attention for a different reason: it leads on instruction-following. That matters because many real workflows fail not because the model lacks general intelligence, but because it misses a constraint.

The frontier is now less like a single leaderboard and more like a model draft board. For any serious workflow, the right answer is usually not “use the highest-ranked model.” It is:

Use GPT-5.5 when you want the strongest general reasoner.

Use Opus 4.7 when you want strong EQ, writing quality, editing, and collaborative feel.

Use Gemini 3.1 Pro when you care about programmatic reasoning or multimodal workflows.

Use Grok 4.3 when instruction-following is the bottleneck.

And use Tier 2 models when the frontier premium is not worth paying.

GPT-5.5: the new raw intelligence leader

GPT-5.5 is the most important release of April.

On AI IQ, it takes the top overall spot. It also leads the April releases in Abstract Reasoning, Mathematical Reasoning, and Academic Reasoning. That is the core of its story: GPT-5.5 is not merely a coding model, a chat model, or a tool-use model. It is the broadest new general-purpose intelligence system in the April wave.

The key point is not that GPT-5.5 wins every possible workflow.

It does not.

The point is that GPT-5.5 is now the model to beat when raw intelligence is the bottleneck.

That means difficult research, complex analysis, high-stakes coding tasks, scientific reasoning, mathematical reasoning, architecture decisions, and work where the model’s ability to notice hidden structure matters more than token price.

GPT-5.5 is expensive. OpenAI lists future API pricing for gpt-5.5 at $5 per million input tokens and $30 per million output tokens, with GPT-5.5 Pro priced far higher at $30 per million input tokens and $180 per million output tokens.

So the practical recommendation is not “use GPT-5.5 for everything.”

It is: use GPT-5.5 when the marginal cost of being wrong is high.

GPT-5.5 is the model you reach for when you are buying judgment.

Opus 4.7: the EQ and writing-quality standout

If GPT-5.5 is the clearest raw intelligence winner of April, Opus 4.7 is the harder model to summarize.

On AI IQ’s EQ ranking, Anthropic continues to perform extremely well. Opus 4.7 sits at the top of the EQ chart, ahead of GPT-5.5 and the rest of the April field.

That matters.

EQ is not just “being nice.” In professional AI use, it shows up as tone, judgment, calibration, tact, editing, stakeholder communication, and knowing when to push back.

This is where Claude models have often felt different from the rest of the frontier. Opus 4.7 continues that pattern. It is one of the strongest candidates for writing, editing, high-context collaboration, and work where the model’s interaction style matters.

But EQ is not the same thing as every other form of usefulness.

It is not the same thing as raw intelligence. GPT-5.5 leads there.

It is not the same thing as programmatic reasoning. Gemini 3.1 Pro remains highly competitive there.

And it is not the same thing as instruction-following. Grok 4.3 leads there.

That does not make Opus 4.7 weak overall.

It makes the category sharper.

Opus 4.7’s strongest case is not that it is the universal “best model for work.” Its strongest case is that, among frontier models, it appears especially strong on EQ, writing quality, editing, and collaborative feel.

For teams building real workflows, that distinction matters. A good AI system may need raw intelligence, EQ, instruction-following, coding ability, tool use, low cost, and reliability over long runs. Those are not the same capability.

Different models can win different parts of the bundle.

So the lesson is not “GPT-5.5 is smarter, Opus 4.7 is more trustworthy.”

The lesson is that frontier model quality is now multi-dimensional.

Grok 4.3: the instruction-following standout

Grok 4.3 deserves its own mention because it leads on instruction-following.

That is different from leading on raw IQ. It is also different from leading on EQ.

Instruction-following matters because many practical AI tasks are constraint-heavy. The user does not just want a good answer. They want the answer in a specific format, obeying specific requirements, avoiding specific mistakes, using specific sources, or satisfying specific operational constraints.

On IFBench, Grok 4.3 leads the April field at 81.3%. Grok 4.200309 v2 is essentially tied at 81.2%. MiMo-V2.5-Pro, DeepSeek-V4-Flash, Nova 2.0 Pro Preview, Gemini 3.1 Pro Preview, Qwen3.6 Max, DeepSeek-V4-Pro, GLM-5.1, Kimi K2.6, GPT-5.5, Muse Spark, and MiniMax-M2.7 cluster between 75.7% and 79.9%. Claude Opus 4.7 scores 58.6%, and Claude Sonnet 4.6 scores 56.6%.

That is not the whole story of model quality.

But it is a useful warning against collapsing everything into one ranking.

A model can be excellent at writing, editing, and collaboration while still being weaker at exact constraint-following. Another model can be less compelling overall but better at obeying precise instructions. For agents, automation, and production workflows, that distinction matters.

Grok 4.3’s IFBench result gives it a clearer role in the model stack: instruction-heavy workflows where exact constraint-following matters.

The surprise: Tier 2 is now crowded with genuinely useful models

The April Tier 2 list is where the market changed most.

AI IQ’s updated Tier 2 includes:

Grok 4.3
Kimi K2.6
DeepSeek-V4-Pro
Muse Spark
Qwen3.6
MiMo-V2.5-Pro
GLM-5.1

That is a lot of serious models in one month.

None of these cleanly displaces GPT-5.5, Opus 4.7, or Gemini 3.1 Pro as the best overall model. But that is the wrong bar. The point is that Tier 2 is now good enough to matter for production routing.

A year ago, “use the best model” was often a reasonable default.

In May 2026, that is lazy.

If a task is cheap, repetitive, narrow, or tolerant of a small quality drop, you should probably not be sending it to the most expensive frontier model. The cost-performance charts on AI IQ make this especially clear because they do not just plot sticker price. They use effective cost: token cost multiplied by token usage efficiency. AI IQ anchors token cost to a 2M input / 1M output workload, then adjusts by how many tokens the model burns on the Artificial Analysis evaluation suite.

That framing changes the conversation. Some models look cheap but waste tokens. Others look expensive but are efficient enough that the real gap is smaller. And some models are simply cheap enough that they should be in every serious evaluation harness.
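
A minimal sketch of that effective-cost idea, assuming the adjustment is a simple ratio of tokens used to a baseline. The function names and the ratio form are assumptions, not AI IQ's published implementation. The GPT-5.5 prices are the ones quoted earlier in this post; the cheap model and its token counts are hypothetical.

```python
INPUT_TOKENS = 2_000_000   # reference workload from the text: 2M input tokens
OUTPUT_TOKENS = 1_000_000  # and 1M output tokens

def workload_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Sticker cost in dollars of the 2M-in / 1M-out reference workload."""
    return ((INPUT_TOKENS / 1e6) * input_price_per_m
            + (OUTPUT_TOKENS / 1e6) * output_price_per_m)

def effective_cost(input_price_per_m: float, output_price_per_m: float,
                   tokens_used: float, baseline_tokens: float) -> float:
    """Scale sticker cost by how many tokens the model burns on an evaluation
    suite relative to a baseline. The ratio form is an assumption."""
    return workload_cost(input_price_per_m, output_price_per_m) * (
        tokens_used / baseline_tokens)

# GPT-5.5 at $5 in / $30 out per million tokens: 2 * 5 + 1 * 30
print(workload_cost(5, 30))  # 40.0

# Hypothetical cheap model at $1 in / $4 out that burns 3x the baseline tokens.
print(workload_cost(1.0, 4.0))                         # 6.0 sticker
print(effective_cost(1.0, 4.0, 3_000_000, 1_000_000))  # 18.0 effective
```

In this made-up case the cheap model is still cheaper, but the real gap is closer to 2x than the 7x the sticker prices suggest.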

This is where the Chinese model wave matters: these models’ scores and prices put real pressure on the gap between “best model” and “good enough model.”

Kimi K2.6, DeepSeek-V4-Pro, Qwen3.6, GLM-5.1, and MiMo-V2.5-Pro are best understood as routing pressure. They may not win the overall chart, but they are close enough on enough dimensions that many teams will ask the only production question that matters:

When is the frontier premium actually worth paying?

Grok 4.3 is different. Its standout instruction-following result gives it a clearer role: constraint-heavy workflows where exact formatting, requirement satisfaction, and instruction adherence matter.

The frontier labs still own the top.

But the rest of the market now has options.

Dimension-by-dimension winners

The cleanest way to understand April is to stop asking “which model is best?” and ask “best at what?”

Best overall IQ: GPT-5.5

GPT-5.5 is the overall AI IQ leader. It is the model to beat on broad benchmark-derived intelligence.

Its advantage is not confined to one dimension. It is especially strong across abstract, mathematical, and academic reasoning, which makes it the most credible default when the task is hard to classify.

For professionals, this matters because many high-value tasks are mixed-domain. A research memo might require reading dense technical material, checking a mathematical argument, writing code to test an assumption, and then producing a clean executive summary. The more mixed the task, the more valuable broad IQ becomes.

Best EQ: Opus 4.7

Opus 4.7 leads AI IQ’s current EQ ranking.

That is meaningful, but it should be treated as directional. AI IQ’s EQ estimate blends EQ-Bench 3 with Arena Elo, and EQ-Bench 3 uses Claude as the judge. AI IQ applies a 200-point penalty to Claude models on EQ-Bench 3, but we do not yet know whether that fully corrects for model-family bias.

The stronger version of this analysis would run EQ-Bench 3 with multiple independent judges and compare how much the rankings change.
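
One way to measure "how much the rankings change" is a rank correlation between the orderings that two judges produce. The sketch below uses Spearman's rho and assumes no tied scores. The judge names, model names, and scores are all hypothetical placeholders, not real EQ-Bench 3 results.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models from 1 (highest score) downward; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation between two judges scoring the same models."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores from two different judge models.
judge_1 = {"model_a": 1500, "model_b": 1450, "model_c": 1400, "model_d": 1350}
judge_2 = {"model_a": 1440, "model_b": 1480, "model_c": 1390, "model_d": 1300}

# The judges swap the top two models and agree on the rest.
print(round(spearman(judge_1, judge_2), 3))  # 0.8
```

A value near 1 across several independent judges would suggest the ranking does not depend much on who is judging. A low value would suggest judge choice is doing a lot of the work.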

For now, the narrower claim is enough: Opus 4.7 has the strongest EQ signal in AI IQ’s current framework, and that matches many user reports that Claude models are unusually strong at tone, editing, and high-context collaboration.

Best instruction-following: Grok 4.3

Grok 4.3 leads on instruction-following.

That matters because instruction-following is not the same thing as EQ, raw IQ, or coding ability. In many workflows, the model has to satisfy a set of exact constraints. It is not enough to produce something smart or polished.

On IFBench, Grok 4.3 leads the April field at 81.3%. GPT-5.5 scores 75.9%, and Claude Opus 4.7 scores 58.6%.

This is one of the clearest examples of why the frontier should not be collapsed into a single “best model” ranking.

Best abstract reasoning: GPT-5.5

Abstract reasoning is the closest AI IQ dimension to raw fluid intelligence: the ability to solve novel problems without relying heavily on memorized knowledge. AI IQ uses ARC-AGI-2 and ARC-AGI-1 for this dimension, with ARC-AGI-2 treated as the harder, more frontier-discriminating test.

GPT-5.5 leads the April class here.

That is important because abstract reasoning is one of the hardest capabilities to fake. A model can memorize facts. It can overfit public code tasks. It can get better at common math formats. But novel abstraction is much harder to brute-force through training contamination.

If you want the model most likely to notice the hidden pattern in a new problem, GPT-5.5 is the current pick.

Best mathematical reasoning: GPT-5.5

Among April’s general-purpose model releases, GPT-5.5 is the math leader.

AI IQ’s math dimension uses FrontierMath Tier 4 and AIME, with AIME compressed because of contamination and saturation concerns. That is the right call. A perfect or near-perfect AIME score no longer tells us as much as it once did.

The interesting wrinkle is that GPT-5.3-Codex remains extremely strong on mathematical reasoning in the AI IQ charts. That suggests OpenAI’s coding-specialist line is not just a coding specialist. It may also be a very strong formal reasoning system.

For most users, GPT-5.5 is the best general math choice. But for code-adjacent math, theorem work, formalization, and technical problem-solving inside a development workflow, GPT-5.3-Codex still deserves attention.

Best programmatic reasoning: Gemini 3.1 Pro and GPT-5.5

The programmatic reasoning chart is one of the most interesting on AI IQ because it does not simply reward SWE-Bench.

AI IQ’s programmatic dimension combines Terminal-Bench 2.0, SWE-Bench Verified, and SciCode, with SWE-Bench compressed because of leakage and gameability concerns.

That makes the ranking more useful than a simple “who wins SWE-Bench?” scoreboard.

Gemini 3.1 Pro remains one of the strongest models overall on programmatic reasoning. Among the April releases, GPT-5.5 is the strongest broad programmatic reasoner, with Opus 4.7 also highly competitive.

Kimi K2.6 is the one to watch here. It does not win the chart, but its open-source, agentic-coding positioning makes it strategically important. The question is not whether Kimi K2.6 beats GPT-5.5 on every coding benchmark. It does not. The question is whether it is good enough, cheap enough, and controllable enough to become the default model for large volumes of coding-agent work.

That answer may be yes for many teams.

Best academic reasoning: GPT-5.5

GPT-5.5 also leads the April field on academic reasoning.

AI IQ’s academic dimension includes Humanity’s Last Exam, CritPt, and GPQA Diamond, with GPQA compressed due to contamination concerns.

This is where GPT-5.5’s breadth matters most. It is not just answering common questions better. It is performing well across expert-level, hard-to-game, high-breadth benchmarks.

DeepSeek-V4-Pro and Muse Spark are the Tier 2 standouts here. DeepSeek-V4-Pro’s knowledge and reasoning profile makes it one of the strongest Chinese models on academic-style tasks, while Muse Spark gives Meta a surprisingly credible return to high-end reasoning.

Muse Spark is not Tier 1 yet. But it is the first Meta model in a while that looks like it belongs in the serious frontier conversation.

The cost-performance story: the smartest model is not always the right model

AI IQ’s cost charts may be more practically important than the IQ chart.

The reason is simple: most real-world AI usage is not one heroic prompt. It is thousands or millions of calls across support workflows, coding agents, research loops, RAG systems, data extraction jobs, document workflows, and internal automations.

At that scale, the question changes from:

“Which model is best?”

to:

“Where does extra intelligence stop paying for itself?”

GPT-5.5 and Opus 4.7 justify their cost when the task is difficult, ambiguous, or high-value. But for many workflows, the April Tier 2 models are now strong enough to route into production.

A sensible model stack in May 2026 looks something like this:

Use GPT-5.5 for the hardest reasoning, research, math, architecture, and high-stakes synthesis.

Use Opus 4.7 for collaborative writing, sensitive communication, editing, and workflows where tone and interaction quality matter.

Use Grok 4.3 for instruction-heavy workflows where exact constraint-following matters.

Use Gemini 3.1 Pro where its programmatic strength or multimodal tooling gives it an edge.

Use Kimi K2.6, DeepSeek-V4-Pro, MiMo-V2.5-Pro, Qwen3.6, GLM-5.1, or Grok 4.3 for cheaper routing, open-weight experimentation, local or sovereign deployment, and high-volume workloads.

For high-stakes reasoning agents, GPT-5.5’s raw intelligence may matter most. For instruction-heavy agents, Grok 4.3 deserves attention. For writing-heavy workflows, Opus 4.7 is still the right choice. For many production workflows, the answer will be a routed stack rather than a single model.

The best teams will not pick one model.

They will build routers.
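
As a rough sketch, the stack above can be written as a lookup from the task's bottleneck to a model. The `Bottleneck` categories, the `high_stakes` escalation rule, and the choice of Kimi K2.6 as the default scale model are assumptions for illustration. The strings are display names from this post, not API model identifiers.

```python
from enum import Enum

class Bottleneck(Enum):
    REASONING = "reasoning"
    WRITING = "writing"
    INSTRUCTIONS = "instructions"
    PROGRAMMATIC = "programmatic"
    SCALE = "scale"

# Display names only; swap in real model identifiers for your provider.
ROUTES = {
    Bottleneck.REASONING: "GPT-5.5",
    Bottleneck.WRITING: "Opus 4.7",
    Bottleneck.INSTRUCTIONS: "Grok 4.3",
    Bottleneck.PROGRAMMATIC: "Gemini 3.1 Pro",
    Bottleneck.SCALE: "Kimi K2.6",  # any Tier 2 model could sit here
}

def route(bottleneck: Bottleneck, high_stakes: bool = False) -> str:
    """Pick a model for a task. Assumed rule: a high-volume task that turns
    out to be high-stakes is escalated to the strongest general reasoner."""
    if high_stakes and bottleneck is Bottleneck.SCALE:
        return ROUTES[Bottleneck.REASONING]
    return ROUTES[bottleneck]

print(route(Bottleneck.WRITING))                  # Opus 4.7
print(route(Bottleneck.SCALE))                    # Kimi K2.6
print(route(Bottleneck.SCALE, high_stakes=True))  # GPT-5.5
```

A real router would classify the task first and would be tuned against your own evaluations. The table is the part worth revisiting every month.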

The biggest surprise: Tier 2 is no longer filler

The April model wave changed the middle of the market.

Kimi K2.6, DeepSeek-V4-Pro, Qwen3.6, GLM-5.1, MiMo-V2.5-Pro, Grok 4.3, and Muse Spark do not erase the frontier. GPT-5.5, Opus 4.7, and Gemini 3.1 Pro are still the top cluster.

But the gap below them is shrinking.

That matters because most economic value from AI will not come from asking the single hardest question. It will come from running useful intelligence everywhere: every repo, every spreadsheet, every customer thread, every document workflow, every internal system, every agent harness.

In that world, the winner is not always the model with the highest IQ.

It is the model with the best intelligence per dollar, per second, per workflow, per failure mode.

That is why Tier 2 matters now.

It is not because every model in Tier 2 is secretly frontier.

It is because the production question has changed.

The question is no longer:

“Which model is best?”

It is:

“Which model is good enough for this task, at this price, with this failure profile?”

That is the question that turns model selection from a leaderboard exercise into an engineering problem.
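
Framed as code, that question becomes a filter followed by a minimum. The sketch below picks the cheapest candidate that clears a quality bar, a cost ceiling, and a failure-rate ceiling. The `Candidate` fields, the thresholds, and all numbers are hypothetical; in practice they would come from your own evaluation harness.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # score on your own eval set, 0 to 1
    cost_per_task: float  # dollars per task
    failure_rate: float   # fraction of runs that fail outright

def pick(candidates: list[Candidate], min_quality: float,
         max_cost: float, max_failure_rate: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets every threshold, or None."""
    ok = [c for c in candidates
          if c.quality >= min_quality
          and c.cost_per_task <= max_cost
          and c.failure_rate <= max_failure_rate]
    return min(ok, key=lambda c: c.cost_per_task, default=None)

# Hypothetical measurements, not real benchmark results.
pool = [
    Candidate("frontier", quality=0.95, cost_per_task=0.40, failure_rate=0.01),
    Candidate("tier2_a", quality=0.88, cost_per_task=0.05, failure_rate=0.03),
    Candidate("tier2_b", quality=0.80, cost_per_task=0.02, failure_rate=0.08),
]

print(pick(pool, min_quality=0.85, max_cost=0.50, max_failure_rate=0.05).name)  # tier2_a
print(pick(pool, min_quality=0.90, max_cost=0.50, max_failure_rate=0.05).name)  # frontier
```

Raising the quality bar from 0.85 to 0.90 is what moves this task from a Tier 2 model to the frontier model, which is exactly the decision the thresholds are meant to make explicit.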

What to watch next

The first thing to watch is whether GPT-5.5’s lead holds once more third-party data arrives. OpenAI’s own numbers are impressive, but AI IQ’s methodology is designed to normalize across benchmarks and penalize over-reliance on gameable tests. That distinction will matter more as labs optimize for public leaderboards.

The second thing to watch is which model becomes the default for serious agentic coding. GPT-5.5, Opus 4.7, Grok 4.3, and Gemini 3.1 Pro all have different strengths. The answer should come from long-running coding-agent evaluations, not launch posts or vibes.

The third thing to watch is Gemini. Google was quiet in April, but Gemini 3.1 Pro remains Tier 1. A Gemini 3.2 or Gemini 4 release would immediately reset the frontier.

The fourth thing to watch is open-weight and frontier-adjacent models. Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro are not just “cheap alternatives.” They are part of a world where capable models can be routed, tuned, hosted, and governed outside the most expensive closed-model APIs.

The fifth thing to watch is cost. AI IQ’s effective-cost framing is going to become more important every month. Sticker price is not enough. A model that uses fewer tokens, finishes faster, retries less, and fails less often can be cheaper in practice even when its posted token price is higher.

The sixth thing to watch is measurement quality itself. EQ-Bench 3 is useful, but Claude-judged EQ scores should not be treated as final truth. IFBench is useful too, because it helps separate instruction-following from general chat preference. The broader lesson is that we need more benchmarks that separate emotional intelligence, instruction-following, agent reliability, coding ability, tool use, and raw reasoning instead of blending them into one “best model” story.
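
The cost point above, that a model which fails and retries less can be cheaper in practice, can be made concrete with a small expected-value calculation. The sketch assumes each attempt succeeds independently with a fixed probability and that failures are retried until one succeeds. The prices and success rates are hypothetical.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful result when failures are retried.
    With independent attempts, the expected number of attempts is 1 / p."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical: a pricier model that rarely fails vs a cheaper one that often does.
pricey = cost_per_success(cost_per_attempt=0.10, success_rate=0.95)
cheap = cost_per_success(cost_per_attempt=0.06, success_rate=0.50)

print(round(pricey, 4))  # 0.1053
print(round(cheap, 4))   # 0.12
print(pricey < cheap)    # True
```

Under these assumptions the model with the higher posted price costs less per finished task, before counting the latency of the extra retries.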

Bottom line

April 2026 gave us a new model hierarchy.

GPT-5.5 is the best overall model on composite IQ.

Opus 4.7 is the strongest EQ and writing-quality model.

Gemini 3.1 Pro remains a Tier 1 holdover.

Grok 4.3 deserves attention as the leader on instruction-following.

Kimi K2.6, DeepSeek-V4-Pro, Muse Spark, Qwen3.6, MiMo-V2.5-Pro, and GLM-5.1 make Tier 2 much more competitive than it was a month ago.

The frontier is still a premium market.

But the middle of the market just got much smarter.

That is what professionals should take away from April. Not that one lab won. Not that one leaderboard settled the race. Not that open models caught the frontier.

The real lesson is that model choice is now a portfolio decision.

Use the smartest model when intelligence is the bottleneck.

Use the best EQ model when tone, nuance, and collaboration are the bottleneck.

Use the best instruction-following model when exact constraints are the bottleneck.

Use the cheapest good-enough model when scale is the bottleneck.

And revisit the decision every month.

See you in the next update.
