<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI IQ]]></title><description><![CDATA[A weekly brief on which AI models are actually worth using, based on benchmark performance, cost, speed, and capability tradeoffs.]]></description><link>https://newsletter.aiiq.org</link><image><url>https://substackcdn.com/image/fetch/$s_!V3Dd!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4946728e-afc8-4906-9668-cf7fbc255898_640x640.png</url><title>AI IQ</title><link>https://newsletter.aiiq.org</link></image><generator>Substack</generator><lastBuildDate>Tue, 23 Jun 2026 03:41:08 GMT</lastBuildDate><atom:link href="https://newsletter.aiiq.org/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ryan Shea]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiiqbrief@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiiqbrief@substack.com]]></itunes:email><itunes:name><![CDATA[Ryan Shea]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ryan Shea]]></itunes:author><googleplay:owner><![CDATA[aiiqbrief@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiiqbrief@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ryan Shea]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Best AI Models of May 2026]]></title><description><![CDATA[Claude Opus 4.8 has joined GPT-5.5 at the top &#8212; and the smartest move now is learning when to use each.]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-may-2026</link><guid 
isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-may-2026</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Tue, 02 Jun 2026 21:05:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3gm5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3gm5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3gm5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 424w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 848w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 1272w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3gm5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png" width="2020" height="899" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:2020,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5c192f-6e16-4044-9177-25361ee8729a_2048x1258.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3gm5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 424w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 848w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 1272w, https://substackcdn.com/image/fetch/$s_!3gm5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07bb93fa-9544-4fd9-9871-2c586a1b0b9a_2020x899.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>May was the month Claude Opus 4.8 joined GPT-5.5 at the top of the AI model landscape.</p><p>GPT-5.5 still has the strongest claim to the top composite spot on AI IQ. 
But Opus 4.8 now wins enough important categories, especially writing, design, frontend engineering, desktop navigation, and hallucination resistance, that treating GPT-5.5 as the sole leader misses what changed in May.</p><p>The frontier has become more textured than a single ranking can show. Opus 4.8 and GPT-5.5 are both #1-caliber models, but they are #1 in different ways.</p><p>GPT-5.5 looks stronger at math and science, terminal use, backend engineering, and instruction following.</p><p>Claude Opus 4.8 looks stronger at writing, design, desktop navigation, frontend engineering, and resistance to hallucination.</p><p>At the broadest level, I would call reasoning and coding basically tied. But &#8220;reasoning&#8221; and &#8220;coding&#8221; are now too broad to be especially useful. The important action is happening underneath those labels.</p><p>The practical lesson is this:</p><p><strong>Using only one of these models all the time is leaving intelligence on the table.</strong></p><p>The best AI users and teams are increasingly going to act less like fans of one model family and more like portfolio managers of intelligence. They will use GPT-5.5 when GPT-5.5 is the right tool. 
They will use Opus 4.8 when Opus 4.8 is the right tool. And when the task does not require maximum frontier intelligence, they will pull in models like Gemini 3.1 Pro, Grok 4.3, Kimi K2.6, and DeepSeek v4 Flash to get better intelligence per dollar.</p><p>That is the state of AI models in May 2026.</p><p>The frontier is not a single winner anymore.</p><p>It is a map.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GBrr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GBrr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GBrr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png" width="1456" height="1327" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1327,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GBrr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!GBrr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F086e6728-397f-4116-a373-a7a0d1674713_1800x1640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Claude Opus 4.8 and GPT-5.5 split wins across coding, reasoning, computer use, and reliability.</figcaption></figure></div><h2>The benchmark-card era is not enough anymore</h2><p>Every major model launch now comes with a benchmark card.</p><p>The lab announces the model. The card highlights the scores where the new model looks strong. The obvious comparison points are included. The less flattering or less convenient comparisons are often missing.</p><p>That does not mean the cards are useless. They are often full of real information.</p><p>But they are incomplete by design.</p><p>They are launch materials, not complete maps of the model landscape.</p><p>That matters more now because the frontier models are no longer separated by huge, obvious gaps. 
When one model dominates across the board, a simple benchmark card can be good enough. But when models are trading wins across coding, reasoning, agentic work, instruction following, reliability, cost, and style, the old &#8220;which one has the biggest number?&#8221; approach starts to break down.</p><p>May made that very clear.</p><p>Claude Opus 4.8 launched and immediately became one of the most important models in the world. GPT-5.5 already had the strongest claim to the overall top spot on AI IQ. Gemini 3.1 Pro remained extremely competitive, especially when cost is part of the conversation. Grok 4.3 produced one of the most surprising benchmark wins of the whole comparison. May also brought Qwen3.7-Max, another notable release in a month where the frontier kept getting more crowded and more interesting.</p><p>But the central story is Opus 4.8 joining GPT-5.5 at the top.</p><p>And the deeper story is that &#8220;best model&#8221; is no longer a one-dimensional question.</p><h2>Opus 4.8 vs GPT-5.5: co-leaders, not clones</h2><p>The easiest mistake is to ask whether Opus 4.8 is better than GPT-5.5.</p><p>That question is too blunt.</p><p>Opus 4.8 is better at some things. GPT-5.5 is better at others. Both are extraordinary. Both are state-of-the-art. Both are models you can use for serious work. And both will leave performance on the table if you use them blindly for every task.</p><p>On the benchmark set I pulled together, Opus 4.8 has major wins on benchmarks such as SWE-bench Pro, OSWorld-Verified, MCP Atlas, and AA-Omniscience. Those results line up with a broader impression: Opus 4.8 is extremely good when you want judgment, taste, carefulness, writing quality, frontend work, and GUI-style desktop navigation.</p><p>GPT-5.5 has major wins on benchmarks such as SWE-rebench, GPQA Diamond, SciCode, CritPt, Terminal-Bench, BrowseComp, and IFBench. 
That lines up with its own profile: GPT-5.5 is extremely good when you want math, science, terminal use, backend engineering, and precise instruction following.</p><p>The wrong conclusion is &#8220;Claude wins&#8221; or &#8220;GPT wins.&#8221;</p><p>The right conclusion is that the model landscape now has two co-leaders with different shapes.</p><p>That is a much more useful picture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ISjW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ISjW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 424w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 848w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ISjW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png" width="1456" 
height="1252" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1252,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ISjW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 424w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 848w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!ISjW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8e0e291-3f37-41b5-89f8-1824661339ed_1600x1376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">A cleaner side-by-side view: GPT-5.5 and Opus 4.8 each win important categories.</figcaption></figure></div><h2>The best models of May 2026, by use case</h2><p>Here is how I would think about model selection right now:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tGxt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tGxt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 424w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 848w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 1272w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tGxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png" width="1313" height="1322" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1322,&quot;width&quot;:1313,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204297,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0613e599-1506-43f9-83f6-90956fa174b5_1524x1322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tGxt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 424w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 848w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 1272w, https://substackcdn.com/image/fetch/$s_!tGxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb652d8eb-525b-43f4-a328-97a9a0129716_1313x1322.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>That table is deliberately practical.</p><p>Most people do not wake up wondering which model has the most elegant aggregate benchmark profile. They wake up wanting to build something, debug something, write something, analyze something, research something, automate something, or get something done.</p><p>For that kind of user, the right question is not:</p><p>&#8220;What is the best model?&#8221;</p><p>The right question is:</p><p>&#8220;What am I trying to do?&#8221;</p><p>Then you pick the model whose strengths match the job.</p><p>For my own work, I would reach for GPT-5.5 first for math, science, backend-heavy engineering, terminal-heavy workflows, and tasks where I want very tight instruction following.</p><p>Meanwhile, I would reach for Claude Opus 4.8 first for writing, design, frontend engineering, desktop navigation, product taste, and work where hallucination resistance and judgment matter more than raw technical aggression.</p><p>And when cost matters, I would start pulling Gemini, Grok, Kimi, and DeepSeek into the rotation.</p><p>That last point matters more than it may seem.</p><h2>The four-model picture is even more interesting</h2><p>Once you add Gemini 3.1 Pro and Grok 4.3, the story gets more complicated in a useful way.</p><p>Claude Opus 4.8 and GPT-5.5 are the central co-leaders. But Gemini and Grok are not just background characters.</p><p>Gemini 3.1 Pro remains one of the most important models in the ecosystem because it is strong enough to compete on serious reasoning and reliability benchmarks while also sitting in a much more attractive cost-performance region. 
It is the model I would pay especially close attention to if I were trying to maximize intelligence per dollar rather than chase the absolute top score.</p><p>Grok 4.3 is also more interesting than a simple composite ranking makes it look. Its IFBench result was one of the most surprising findings in this comparison. Instruction following is not a decorative trait. It is one of the core traits that determines whether a model is useful in real workflows.</p><p>A model that follows instructions well can be easier to route, easier to automate, easier to constrain, and easier to trust inside a repeatable system.</p><p>That is why the four-model view matters. It shows that the landscape is not just a ranking of overall intelligence. It is a set of different capability profiles, and those profiles become much more useful once you start thinking in terms of model rotation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w10m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w10m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 424w, https://substackcdn.com/image/fetch/$s_!w10m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 848w, https://substackcdn.com/image/fetch/$s_!w10m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w10m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w10m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png" width="1456" height="911" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:911,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w10m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 424w, https://substackcdn.com/image/fetch/$s_!w10m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 848w, 
https://substackcdn.com/image/fetch/$s_!w10m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!w10m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc14c1a1-90ce-4b52-a19d-c510e47565b8_2200x1376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Once Gemini 3.1 Pro and Grok 4.3 enter the comparison, the landscape looks less like a ladder and more like a 
portfolio.</figcaption></figure></div><h2>Reasoning is not one thing</h2><p>The word &#8220;reasoning&#8221; gets used as if it describes one clean capability, but the May results show a more useful picture: GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro are all fantastic reasoning models.</p><p>There are still meaningful differences. Gemini 3.1 Pro and GPT-5.5 have the strongest claim on the scientific reasoning slice of this comparison, while Opus 4.8 remains highly competitive and adds a win on Humanity&#8217;s Last Exam. The important point is not that one model owns reasoning. It is that the top tier now gives users a real pool of excellent reasoning options.</p><p>That is why broad model rankings are useful, but not sufficient. When people ask &#8220;which model is best at reasoning?&#8221; they are usually compressing too much into one word. Do they mean scientific reasoning, physics reasoning, long-form exam performance, code-adjacent reasoning, mathematical reasoning, or long-horizon planning?</p><p>The answer can change depending on which kind of reasoning they mean. 
The more frontier models converge overall, the more these subcategories matter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Q_S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Q_S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 424w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 848w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 1272w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Q_S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png" width="1456" height="631" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79802,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Q_S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 424w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 848w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 1272w, https://substackcdn.com/image/fetch/$s_!6Q_S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905ab6af-bfcf-4dd5-b319-af5523a54134_1800x780.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Reasoning benchmarks show why a single &#8220;best reasoning model&#8221; label hides too much useful information.</figcaption></figure></div><h2>Coding is not one thing either</h2><p>The same is true for coding. &#8220;Best coding model&#8221; sounds useful, but it hides too much. Coding includes backend engineering, frontend engineering, terminal use, bug fixing, repo navigation, algorithmic problem solving, refactoring, product implementation, UI taste, and long-running agentic work across messy codebases.</p><p>That is why I see GPT-5.5 and Opus 4.8 as effectively tied in coding overall, even though I would not use them identically. GPT-5.5 is the model I would favor for backend engineering, terminal-heavy workflows, systems-oriented implementation, and cases where the model needs to follow a precise technical plan. 
Claude Opus 4.8 is the model I would favor for frontend engineering, design-sensitive implementation, product polish, and work where writing quality, taste, and user experience need to come together with technical execution.</p><p>This distinction matters because &#8220;working code&#8221; is not always the same as &#8220;good software.&#8221; Some tasks reward raw implementation power. Others reward taste, interface judgment, maintainability, and the ability to produce something a human actually wants to use. The best coding model depends on which kind of coding you are doing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CoNk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CoNk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CoNk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1095004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CoNk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!CoNk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe898c7d-adf9-4f3f-b18a-979e819d4717_1672x941.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">GPT-5.5 and Claude Opus 4.8 are effectively tied as top-tier coding models, but the split matters: Opus 4.8 leads SWE-bench Verified, Gemini 3.1 Pro edges ahead on LiveCodeBench, and GPT-5.5 leads Terminal-Bench Hard.</figcaption></figure></div><h2>Reliability is becoming a first-class capability</h2><p>Reliability is not the boring part of model evaluation. It is one of the places where model differences become most operationally important.</p><p>A model that can solve hard problems but does not follow instructions cleanly is harder to automate. 
A model that is brilliant but overconfident is harder to trust. A model that looks impressive in demos but drifts in long workflows is harder to build around. This is why reliability benchmarks deserve more attention than they usually get.</p><p>Two results stood out in May: Grok 4.3 topping IFBench and Gemini 3.1 Pro topping AA-Omniscience. IFBench matters because instruction following is the foundation of almost every repeatable AI workflow. AA-Omniscience matters because hallucination resistance determines whether a model can stay useful when the answer is uncertain, obscure, or not actually knowable from context.</p><p>These results make the four-model picture more useful. GPT-5.5 and Opus 4.8 remain the co-leaders overall, but Gemini and Grok each show up strongly on traits that matter in production: knowing when not to guess, and doing exactly what the user asked. That is the kind of model intelligence that does not always dominate a composite leaderboard, but can make a huge difference in practice.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i3Ur!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i3Ur!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!i3Ur!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!i3Ur!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!i3Ur!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i3Ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:978516,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.aiiq.org/i/200138010?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i3Ur!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!i3Ur!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!i3Ur!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!i3Ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d65387b-a677-4a90-9666-dd3354039ae9_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Reliability is not one trait. Grok 4.3 leads IFBench for instruction following, while Gemini 3.1 Pro leads AA-Omniscience for hallucination resistance &#8212; two capabilities that matter a lot once models move from demos into real workflows.</figcaption></figure></div><h2>Cost matters because agents change the math</h2><p>For a single high-stakes question, it often makes sense to use the strongest model available. If the task is important enough, pay for maximum intelligence.</p><p>But agentic workflows change the economics. Once engineers and organizations start running many agents for many tasks over long periods of time, token spend becomes an allocation problem. The question is no longer just &#8220;What is the smartest model?&#8221; It becomes &#8220;How much useful intelligence can I buy for this budget?&#8221;</p><p>That can point in two directions. Some teams want to reduce token spend while preserving as much capability as possible. Others are willing to spend the same amount, but want more total work done: more agents, more attempts, more critiques, more parallel exploration. In both cases, intelligence per unit cost becomes part of the routing decision.</p><p>This is where Gemini 3.1 Pro, Grok 4.3, Kimi K2.6, and DeepSeek v4 Flash become especially relevant. Not every task deserves the most expensive frontier model at the highest reasoning level. Sometimes the right move is GPT-5.5 or Opus 4.8 at full strength. Sometimes it is the same model at a lower reasoning level. Sometimes it is a cheaper model that gets you most of the way there for a fraction of the cost.</p><p>That is why the IQ vs. effective cost view on AI IQ matters. It does not just ask which model is smartest. It asks which models sit in the most useful region of the cost-performance landscape.</p><p>The next serious advantage is not just access to the best model. 
It is knowing how to allocate intelligence.</p><h2>What AI IQ is trying to do</h2><p>There is no single correct way to rate AI models. The best we can do is use the strongest available benchmarks, combine them carefully, and keep improving the map as the models and evaluations get better.</p><p>That is what AI IQ is for.</p><p>The IQ score is not meant to be the final word on a model. It is meant to create a better starting point: a common frame for comparing models across time, tracking the frontier, seeing which providers are moving fastest, and putting intelligence and cost on the same map.</p><p>That matters because the model ecosystem is getting too complex for launch cards and vibes. A single model can be excellent overall and still be the wrong choice for a specific workflow. Another model can sit lower on the composite leaderboard and still be the right choice when cost, instruction following, hallucination resistance, speed, or style matters more.</p><p>The ambition for AI IQ is to become more than a leaderboard. It should be a routing map, a buying guide, and a community intelligence layer for the model ecosystem: a way to understand which models are smartest, which are worth paying for, and which ones to use for the job in front of you.</p><p>That is where this gets exciting. The goal is not just to watch the frontier move. It is to help people use it better.</p><h2>My current read on May 2026</h2><p>Here is the compact version:</p><p><strong>GPT-5.5 and Claude Opus 4.8 are the two top-tier defaults.</strong> GPT-5.5 still has the strongest claim to the top composite spot on AI IQ, with excellent breadth and particular strength in math, science, backend engineering, terminal-heavy work, and instruction following. 
Opus 4.8 has earned co-leader status with meaningful advantages in writing, design, frontend engineering, desktop navigation, and hallucination resistance.</p><p><strong>Gemini 3.1 Pro is the cost-effective high-intelligence model to watch.</strong> It is strong enough to compete seriously on hard reasoning and reliability benchmarks, while sitting in a much more attractive cost-performance region than the most expensive frontier models.</p><p><strong>Grok 4.3 is more useful than its broad ranking suggests.</strong> Its IFBench result is a reminder that specific capabilities can matter more than aggregate rank when you are routing models for real workflows.</p><p><strong>Kimi K2.6 and DeepSeek v4 Flash belong in the cost-conscious agent rotation.</strong> When you care about high-volume work, parallel agents, and intelligence per dollar, they become part of the practical landscape.</p><p><strong>Qwen3.7-Max was another notable May release.</strong> The field keeps getting deeper, broader, and more competitive.</p><p>The most important conclusion is that the best AI setup is no longer one model. It is the right rotation.</p><h2>What should you use tomorrow?</h2><p>Start with the task.</p><p>For backend engineering, terminal-heavy work, math, and technical problem-solving, I would start with GPT-5.5. For writing, design, frontend engineering, desktop navigation, and product-sensitive work, I would start with Claude Opus 4.8. When cost matters, I would look closely at Gemini 3.1 Pro. For high-volume agent workflows, I would bring Grok 4.3, Kimi K2.6, and DeepSeek v4 Flash into the rotation.</p><p>And when the task really matters, do not force yourself to pick only one model. Ask GPT-5.5. Ask Opus 4.8. Compare the answers. Let one critique the other. The frontier is now rich enough that using multiple models is often the smarter default.</p><p>That is the practical takeaway from May: the best model choice depends on the work in front of you. 
Use the charts, test the models against your own workflows, and build a rotation that gives you the most intelligence for the task, budget, and level of risk.</p><h2>Help shape where AI IQ goes next</h2><p>I&#8217;m going to keep adding benchmarks to AI IQ where they make the model map better: coding, reasoning, reliability, EQ, computer use, cost efficiency, and agentic performance.</p><p>But the bigger question is not just what AI IQ should measure. It is what AI IQ should help you do.</p><p>What would make AI IQ useful enough to become part of your actual model-selection workflow? Would you want personalized model recommendations, a routing guide by profession, a model picker for specific tasks, a browser extension, a team dashboard, cost-aware model routing, or a way to compare your own outputs across models?</p><p>That is the feedback I would most like to hear.</p><p>AI IQ started as a way to make sense of model intelligence. The opportunity now is to help people and teams use the right intelligence at the right time.</p><p>And that is where I&#8217;d love your help: what should AI IQ become next?</p>]]></content:encoded></item><item><title><![CDATA[Join my new subscriber chat]]></title><description><![CDATA[A private space for us to converse and connect]]></description><link>https://newsletter.aiiq.org/p/join-my-new-subscriber-chat</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/join-my-new-subscriber-chat</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Fri, 22 May 2026 15:15:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KYZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I&#8217;m announcing a brand new addition to my Substack publication: AI IQ subscriber chat.</p><p>This is a conversation space exclusively for subscribers&#8212;kind of like a group chat or live hangout. 
I&#8217;ll post questions and updates that come my way, and you can jump into the discussion.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/aiiqbrief/chat&quot;,&quot;text&quot;:&quot;Join chat&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://open.substack.com/pub/aiiqbrief/chat"><span>Join chat</span></a></p><div><hr></div><h2>How to get started</h2><ol><li><p><strong>Get the Substack app by clicking <a href="https://substack.com/app/app-store-redirect">this link</a> or the button below.</strong> New chat threads won&#8217;t be sent via email, so turn on push notifications so you don&#8217;t miss conversations as they happen. You can also access chat <a href="https://open.substack.com/pub/aiiqbrief/chat">on the web</a>.</p></li></ol><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.com/app/app-store-redirect&quot;,&quot;text&quot;:&quot;Get app&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://substack.com/app/app-store-redirect"><span>Get app</span></a></p><ol start="2"><li><p><strong>Open the app and tap the Chat icon.</strong> It looks like two bubbles in the bottom bar, and you&#8217;ll see a row for my chat inside.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KYZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!KYZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://kylewarrentest.substack.com/i/114198534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KYZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KYZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a>
</figure></div><ol start="3"><li><p><strong>That&#8217;s it!</strong> Jump into my thread to say hi, and if you have any issues, check out <a href="https://support.substack.com/hc/en-us/sections/360007461791-Frequently-Asked-Questions">Substack&#8217;s FAQ</a>.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of April 2026]]></title><description><![CDATA[GPT-5.5 vs Opus 4.7 &#8212; and the month the Tier 2 field exploded]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-april-2026-deb</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-april-2026-deb</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Mon, 04 May 2026 20:58:43 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!32L4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32L4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32L4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!32L4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!32L4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!32L4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!32L4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2317733,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brief.aiiq.org/i/196472146?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!32L4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!32L4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!32L4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!32L4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9932ef9c-2277-4f9c-ae23-2eeef87c85b5_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>April 2026 was the most important month for AI model releases so far this year.</p><p>OpenAI released GPT-5.5. Anthropic released Claude Opus 4.7. Meta came back into the race with Muse Spark. xAI shipped Grok 4.3. And the Chinese frontier labs delivered a full wave of serious models: GLM-5.1, Qwen3.6, Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.aiiq.org/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI IQ! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The simple story is that GPT-5.5 and Opus 4.7 are now two of the most important models at the frontier.</p><p>The more interesting story is that April was not one race. It was four races happening at once.</p><p>The first was the raw intelligence race, where GPT-5.5 moved into first place on AI IQ&#8217;s overall estimated IQ ranking.</p><p>The second was the EQ and writing-quality race, where Opus 4.7 stood out most clearly.</p><p>The third was the instruction-following race, where Grok 4.3 leads the field.</p><p>The fourth was the cost-performance race, where Tier 2 became crowded enough to matter for real routing decisions.</p><p>That is the real story of April: model quality is no longer one-dimensional.</p><p>GPT-5.5 leads on raw intelligence, Opus 4.7 leads on EQ, Grok 4.3 leads on instruction-following, and Gemini 3.1 Pro remains highly competitive on programmatic reasoning.</p><p>And the Tier 2 field is now good enough that routing matters more than ever.</p><h3>April was the month model selection became multi-dimensional</h3><p>A year ago, model launches were still mostly judged by chat quality, coding snippets, MMLU-style knowledge tests, and a handful of math benchmarks.</p><p>That is no longer enough.</p><p>The models released in April are all competing to become work systems: coding models, research models, tool-use models, document models, spreadsheet models, long-context models, and models that can sit inside multi-step workflows.</p><p>But we should not evaluate them by repeating launch claims.</p><p>The point of AI IQ is to normalize performance across 
hard benchmarks, compress saturated tests, and separate model capability into dimensions that actually matter.</p><p>The old way to compare models was to ask:</p><p>&#8220;Which model got the highest score?&#8221;</p><p>The better question now is:</p><p>&#8220;Which model is best for the kind of cognition you actually need?&#8221;</p><p>AI IQ breaks raw intelligence into four dimensions: Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. The composite IQ is the mean of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That compression is important. A model should not become &#8220;the smartest model&#8221; just because it crushes a saturated or contaminated benchmark. The frontier should be judged by hard, still-discriminating tests.</p><p>But April made something else clear too: IQ is not enough.</p><p>For real workflows, we also care about EQ, instruction-following, cost, latency, tool use, refusal behavior, and recovery from failure.</p><p>Those are not one thing.</p><p>They are separate axes.</p><h3>The new Tier 1: GPT-5.5, Opus 4.7, and Gemini 3.1 Pro</h3><p>AI IQ&#8217;s updated ranking now has three Tier 1 models:</p><p>GPT-5.5</p><p>Claude Opus 4.7</p><p>Gemini 3.1 Pro</p><p>Google did not release a major new model in April, but Gemini 3.1 Pro remains in the top cluster. That is important. The April narrative is not &#8220;OpenAI and Anthropic left everyone behind.&#8221; It is more precise than that: OpenAI and Anthropic refreshed into the frontier, while Google&#8217;s previous frontier model still held its ground.</p><p>The top-level read:</p><p>GPT-5.5 is the overall intelligence champion. It sits at the top of the AI IQ ranking and leads the April class on raw composite capability.</p><p>Opus 4.7 is the EQ and writing-quality standout. 
It does not beat GPT-5.5 on overall IQ, but it leads AI IQ&#8217;s current EQ ranking and remains one of the most compelling models for high-context collaboration, editing, and professional writing.</p><p>Gemini 3.1 Pro is the holdover giant. It did not need an April launch to remain relevant. On the AI IQ charts, it continues to sit in the frontier cluster and is especially strong in programmatic reasoning.</p><p>Grok 4.3 is not Tier 1 overall, but it deserves attention for a different reason: it leads on instruction-following. That matters because many real workflows fail not because the model lacks general intelligence, but because it misses a constraint.</p><p>The frontier is now less like a single leaderboard and more like a model draft board. For any serious workflow, the right answer is usually not &#8220;use the highest-ranked model.&#8221; It is:</p><p>Use GPT-5.5 when you want the strongest general reasoner.</p><p>Use Opus 4.7 when you want strong EQ, writing quality, editing, and collaborative feel.</p><p>Use Gemini 3.1 Pro when you care about programmatic reasoning or multimodal workflows.</p><p>Use Grok 4.3 when instruction-following is the bottleneck.</p><p>And use Tier 2 models when the frontier premium is not worth paying.</p><h3>GPT-5.5: the new raw intelligence leader</h3><p>GPT-5.5 is the most important release of April.</p><p>On AI IQ, it takes the top overall spot. It also leads the April releases in Abstract Reasoning, Mathematical Reasoning, and Academic Reasoning. That is the core of its story: GPT-5.5 is not merely a coding model, a chat model, or a tool-use model. 
It is the broadest new general-purpose intelligence system in the April wave.</p><p>The key point is not that GPT-5.5 wins every possible workflow.</p><p>It does not.</p><p>The point is that GPT-5.5 is now the model to beat when raw intelligence is the bottleneck.</p><p>That means difficult research, complex analysis, high-stakes coding tasks, scientific reasoning, mathematical reasoning, architecture decisions, and work where the model&#8217;s ability to notice hidden structure matters more than token price.</p><p>GPT-5.5 is expensive. OpenAI lists future API pricing for gpt-5.5 at $5 per million input tokens and $30 per million output tokens, with GPT-5.5 Pro priced far higher at $30 input and $180 output.</p><p>So the practical recommendation is not &#8220;use GPT-5.5 for everything.&#8221;</p><p>It is: use GPT-5.5 when the marginal cost of being wrong is high.</p><p>GPT-5.5 is the model you reach for when you are buying judgment.</p><h3>Opus 4.7: the EQ and writing-quality standout</h3><p>If GPT-5.5 is the clearest raw intelligence winner of April, Opus 4.7 is the harder model to summarize.</p><p>On AI IQ&#8217;s EQ ranking, Anthropic continues to perform extremely well. Opus 4.7 sits at the top of the EQ chart, ahead of GPT-5.5 and the rest of the April field.</p><p>That matters.</p><p>EQ is not just &#8220;being nice.&#8221; In professional AI use, it shows up as tone, judgment, calibration, tact, editing, stakeholder communication, and knowing when to push back.</p><p>This is where Claude models have often felt different from the rest of the frontier. Opus 4.7 continues that pattern. It is one of the strongest candidates for writing, editing, high-context collaboration, and work where the model&#8217;s interaction style matters.</p><p>But EQ is not the same thing as every other form of usefulness.</p><p>It is not the same thing as raw intelligence. GPT-5.5 leads there.</p><p>It is not the same thing as programmatic reasoning. 
Gemini 3.1 Pro remains highly competitive there.</p><p>And it is not the same thing as instruction-following. Grok 4.3 leads there.</p><p>That does not make Opus 4.7 weak overall.</p><p>It makes the category sharper.</p><p>Opus 4.7&#8217;s strongest case is not that it is the universal &#8220;best model for work.&#8221; Its strongest case is that, among frontier models, it appears especially strong on EQ, writing quality, editing, and collaborative feel.</p><p>For teams building real workflows, that distinction matters. A good AI system may need raw intelligence, EQ, instruction-following, coding ability, tool use, low cost, and reliability over long runs. Those are not the same capability.</p><p>Different models can win different parts of the bundle.</p><p>So the lesson is not &#8220;GPT-5.5 is smarter, Opus 4.7 is more trustworthy.&#8221;</p><p>The lesson is that frontier model quality is now multi-dimensional.</p><h3>Grok 4.3: the instruction-following standout</h3><p>Grok 4.3 deserves its own mention because it leads on instruction-following.</p><p>That is different from leading on raw IQ. It is also different from leading on EQ.</p><p>Instruction-following matters because many practical AI tasks are constraint-heavy. The user does not just want a good answer. They want the answer in a specific format, obeying specific requirements, avoiding specific mistakes, using specific sources, or satisfying specific operational constraints.</p><p>On IFBench, Grok 4.3 leads the April field at 81.3%. Grok 4.20 0309 v2 is essentially tied at 81.2%. MiMo-V2.5-Pro, DeepSeek-V4-Flash, Nova 2.0 Pro Preview, Gemini 3.1 Pro Preview, Qwen3.6 Max, DeepSeek-V4-Pro, GLM-5.1, Kimi K2.6, GPT-5.5, Muse Spark, and MiniMax-M2.7 cluster between 75.7% and 79.9%. 
Claude Opus 4.7 scores 58.6%, and Claude Sonnet 4.6 scores 56.6%.</p><p>That is not the whole story of model quality.</p><p>But it is a useful warning against collapsing everything into one ranking.</p><p>A model can be excellent at writing, editing, and collaboration while still being weaker at exact constraint-following. Another model can be less compelling overall but better at obeying precise instructions. For agents, automation, and production workflows, that distinction matters.</p><p>Grok 4.3&#8217;s IFBench result gives it a clearer role in the model stack: instruction-heavy workflows where exact constraint-following matters.</p><h3>The surprise: Tier 2 is now crowded with genuinely useful models</h3><p>The April Tier 2 list is where the market changed most.</p><p>AI IQ&#8217;s updated Tier 2 includes:</p><ol><li><p>Grok 4.3</p></li><li><p>Kimi K2.6</p></li><li><p>DeepSeek-V4-Pro</p></li><li><p>Muse Spark</p></li><li><p>Qwen3.6</p></li><li><p>MiMo-V2.5-Pro</p></li><li><p>GLM-5.1</p></li></ol><p>That is a lot of serious models in one month.</p><p>None of these cleanly displaces GPT-5.5, Opus 4.7, or Gemini 3.1 Pro as the best overall model. But that is the wrong bar. The point is that Tier 2 is now good enough to matter for production routing.</p><p>A year ago, &#8220;use the best model&#8221; was often a reasonable default.</p><p>In May 2026, that is lazy.</p><p>If a task is cheap, repetitive, narrow, or tolerant of a small quality drop, you should probably not be sending it to the most expensive frontier model. The cost-performance charts on AI IQ make this especially clear because they do not just plot sticker price. They use effective cost: token cost multiplied by token usage efficiency. AI IQ anchors token cost to a 2M input / 1M output workload, then adjusts by how many tokens the model burns on the Artificial Analysis evaluation suite.</p><p>That framing changes the conversation. Some models look cheap but waste tokens. 
Others look expensive but are efficient enough that the real gap is smaller. And some models are simply cheap enough that they should be in every serious evaluation harness.</p><p>This is where the Chinese model wave matters: these models&#8217; scores and prices put real pressure on the gap between &#8220;best model&#8221; and &#8220;good enough model.&#8221;</p><p>Kimi K2.6, DeepSeek-V4-Pro, Qwen3.6, GLM-5.1, and MiMo-V2.5-Pro are best understood as routing pressure. They may not win the overall chart, but they are close enough on enough dimensions that many teams will ask the only production question that matters:</p><p>When is the frontier premium actually worth paying?</p><p>Grok 4.3 is different. Its standout instruction-following result gives it a clearer role: constraint-heavy workflows where exact formatting, requirement satisfaction, and instruction adherence matter.</p><p>The frontier labs still own the top.</p><p>But the rest of the market now has options.</p><h3>Dimension-by-dimension winners</h3><p>The cleanest way to understand April is to stop asking &#8220;which model is best?&#8221; and ask &#8220;best at what?&#8221;</p><h4>Best overall IQ: GPT-5.5</h4><p>GPT-5.5 is the overall AI IQ leader. It is the model to beat on broad benchmark-derived intelligence.</p><p>Its advantage is not confined to one dimension. It is especially strong across abstract, mathematical, and academic reasoning, which makes it the most credible default when the task is hard to classify.</p><p>For professionals, this matters because many high-value tasks are mixed-domain. A research memo might require reading dense technical material, checking a mathematical argument, writing code to test an assumption, and then producing a clean executive summary. The more mixed the task, the more valuable broad IQ becomes.</p><h4>Best EQ: Opus 4.7</h4><p>Opus 4.7 leads AI IQ&#8217;s current EQ ranking.</p><p>That is meaningful, but it should be treated as directional. 
AI IQ&#8217;s EQ estimate blends EQ-Bench 3 with Arena Elo, and EQ-Bench 3 uses Claude as the judge. AI IQ applies a 200-point penalty to Claude models on EQ-Bench 3, but we do not yet know whether that fully corrects for model-family bias.</p><p>The stronger version of this analysis would run EQ-Bench 3 with multiple independent judges and compare how much the rankings change.</p><p>For now, the narrower claim is enough: Opus 4.7 has the strongest EQ signal in AI IQ&#8217;s current framework, and that matches many user reports that Claude models are unusually strong at tone, editing, and high-context collaboration.</p><h4>Best instruction-following: Grok 4.3</h4><p>Grok 4.3 leads on instruction-following.</p><p>That matters because instruction-following is not the same thing as EQ, raw IQ, or coding ability. In many workflows, the model has to satisfy a set of exact constraints. It is not enough to produce something smart or polished.</p><p>On IFBench, Grok 4.3 leads the April field at 81.3%. GPT-5.5 scores 75.9%, and Claude Opus 4.7 scores 58.6%.</p><p>This is one of the clearest examples of why the frontier should not be collapsed into a single &#8220;best model&#8221; ranking.</p><h4>Best abstract reasoning: GPT-5.5</h4><p>Abstract reasoning is the closest AI IQ dimension to raw fluid intelligence: the ability to solve novel problems without relying heavily on memorized knowledge. AI IQ uses ARC-AGI-2 and ARC-AGI-1 for this dimension, with ARC-AGI-2 treated as the harder, more frontier-discriminating test.</p><p>GPT-5.5 leads the April class here.</p><p>That is important because abstract reasoning is one of the hardest capabilities to fake. A model can memorize facts. It can overfit public code tasks. It can get better at common math formats. 
But novel abstraction is much harder to brute-force through training contamination.</p><p>If you want the model most likely to notice the hidden pattern in a new problem, GPT-5.5 is the current pick.</p><h4>Best mathematical reasoning: GPT-5.5</h4><p>Among April&#8217;s general-purpose model releases, GPT-5.5 is the math leader.</p><p>AI IQ&#8217;s math dimension uses FrontierMath Tier 4 and AIME, with AIME compressed because of contamination and saturation concerns. That is the right call. A perfect or near-perfect AIME score no longer tells us as much as it once did.</p><p>The interesting wrinkle is that GPT-5.3-Codex remains extremely strong on mathematical reasoning in the AI IQ charts. That suggests OpenAI&#8217;s coding-specialist line is not just a coding specialist. It may also be a very strong formal reasoning system.</p><p>For most users, GPT-5.5 is the best general math choice. But for code-adjacent math, theorem work, formalization, and technical problem-solving inside a development workflow, GPT-5.3-Codex still deserves attention.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro and GPT-5.5</h4><p>The programmatic reasoning chart is one of the most interesting on AI IQ because it does not simply reward SWE-Bench.</p><p>AI IQ&#8217;s programmatic dimension combines Terminal-Bench 2.0, SWE-Bench Verified, and SciCode, with SWE-Bench compressed because of leakage and gameability concerns.</p><p>That makes the ranking more useful than a simple &#8220;who wins SWE-Bench?&#8221; scoreboard.</p><p>Gemini 3.1 Pro remains one of the strongest models overall on programmatic reasoning. Among the April releases, GPT-5.5 is the strongest broad programmatic reasoner, with Opus 4.7 also highly competitive.</p><p>Kimi K2.6 is the one to watch here. It does not win the chart, but its open-source, agentic-coding positioning makes it strategically important. The question is not whether Kimi K2.6 beats GPT-5.5 on every coding benchmark. It does not. 
The question is whether it is good enough, cheap enough, and controllable enough to become the default model for large volumes of coding-agent work.</p><p>That answer may be yes for many teams.</p><h4>Best academic reasoning: GPT-5.5</h4><p>GPT-5.5 also leads the April field on academic reasoning.</p><p>AI IQ&#8217;s academic dimension includes Humanity&#8217;s Last Exam, CritPt, and GPQA Diamond, with GPQA compressed due to contamination concerns.</p><p>This is where GPT-5.5&#8217;s breadth matters most. It is not just answering common questions better. It is performing well across expert-level, hard-to-game, high-breadth benchmarks.</p><p>DeepSeek-V4-Pro and Muse Spark are the Tier 2 standouts here. DeepSeek-V4-Pro&#8217;s knowledge and reasoning profile makes it one of the strongest Chinese models on academic-style tasks, while Muse Spark gives Meta a surprisingly credible return to high-end reasoning.</p><p>Muse Spark is not Tier 1 yet. But it is the first Meta model in a while that looks like it belongs in the serious frontier conversation.</p><h3>The cost-performance story: the smartest model is not always the right model</h3><p>AI IQ&#8217;s cost charts may be more practically important than the IQ chart.</p><p>The reason is simple: most real-world AI usage is not one heroic prompt. It is thousands or millions of calls across support workflows, coding agents, research loops, RAG systems, data extraction jobs, document workflows, and internal automations.</p><p>At that scale, the question changes from:</p><p>&#8220;Which model is best?&#8221;</p><p>to:</p><p>&#8220;Where does extra intelligence stop paying for itself?&#8221;</p><p>GPT-5.5 and Opus 4.7 justify their cost when the task is difficult, ambiguous, or high-value. 
But for many workflows, the April Tier 2 models are now strong enough to route into production.</p><p>A sensible model stack in May 2026 looks something like this:</p><p>Use GPT-5.5 for the hardest reasoning, research, math, architecture, and high-stakes synthesis.</p><p>Use Opus 4.7 for collaborative writing, sensitive communication, editing, and workflows where tone and interaction quality matter.</p><p>Use Grok 4.3 for instruction-heavy workflows where exact constraint-following matters.</p><p>Use Gemini 3.1 Pro where its programmatic strength or multimodal tooling gives it an edge.</p><p>Use Kimi K2.6, DeepSeek-V4-Pro, MiMo-V2.5-Pro, Qwen3.6, GLM-5.1, or Grok 4.3 for cheaper routing, open-weight experimentation, local or sovereign deployment, and high-volume workloads.</p><p>For high-stakes reasoning agents, GPT-5.5&#8217;s raw intelligence may matter most. For instruction-heavy agents, Grok 4.3 deserves attention. For writing-heavy workflows, Opus 4.7 is still the right choice. For many production workflows, the answer will be a routed stack rather than a single model.</p><p>The best teams will not pick one model.</p><p>They will build routers.</p><h3>The biggest surprise: Tier 2 is no longer filler</h3><p>The April model wave changed the middle of the market.</p><p>Kimi K2.6, DeepSeek-V4-Pro, Qwen3.6, GLM-5.1, MiMo-V2.5-Pro, Grok 4.3, and Muse Spark do not erase the frontier. GPT-5.5, Opus 4.7, and Gemini 3.1 Pro are still the top cluster.</p><p>But the gap below them is shrinking.</p><p>That matters because most economic value from AI will not come from asking the single hardest question. 
It will come from running useful intelligence everywhere: every repo, every spreadsheet, every customer thread, every document workflow, every internal system, every agent harness.</p><p>In that world, the winner is not always the model with the highest IQ.</p><p>It is the model with the best intelligence per dollar, per second, per workflow, per failure mode.</p><p>That is why Tier 2 matters now.</p><p>It is not because every model in Tier 2 is secretly frontier.</p><p>It is because the production question has changed.</p><p>The question is no longer:</p><p>&#8220;Which model is best?&#8221;</p><p>It is:</p><p>&#8220;Which model is good enough for this task, at this price, with this failure profile?&#8221;</p><p>That is the question that turns model selection from a leaderboard exercise into an engineering problem.</p><h3>What to watch next</h3><p>The first thing to watch is whether GPT-5.5&#8217;s lead holds once more third-party data arrives. OpenAI&#8217;s own numbers are impressive, but AI IQ&#8217;s methodology is designed to normalize across benchmarks and penalize over-reliance on gameable tests. That distinction will matter more as labs optimize for public leaderboards.</p><p>The second thing to watch is which model becomes the default for serious agentic coding. GPT-5.5, Opus 4.7, Grok 4.3, and Gemini 3.1 Pro all have different strengths. The answer should come from long-running coding-agent evaluations, not launch posts or vibes.</p><p>The third thing to watch is Gemini. Google was quiet in April, but Gemini 3.1 Pro remains Tier 1. A Gemini 3.2 or Gemini 4 release would immediately reset the frontier.</p><p>The fourth thing to watch is open-weight and frontier-adjacent models. Kimi K2.6, DeepSeek-V4-Pro, and MiMo-V2.5-Pro are not just &#8220;cheap alternatives.&#8221; They are part of a world where capable models can be routed, tuned, hosted, and governed outside the most expensive closed-model APIs.</p><p>The fifth thing to watch is cost. 
AI IQ&#8217;s effective-cost framing is going to become more important every month. Sticker price is not enough. A model that uses fewer tokens, finishes faster, retries less, and fails less often can be cheaper in practice even when its posted token price is higher.</p><p>The sixth thing to watch is measurement quality itself. EQ-Bench 3 is useful, but Claude-judged EQ scores should not be treated as final truth. IFBench is useful too, because it helps separate instruction-following from general chat preference. The broader lesson is that we need more benchmarks that separate emotional intelligence, instruction-following, agent reliability, coding ability, tool use, and raw reasoning instead of blending them into one &#8220;best model&#8221; story.</p><h3>Bottom line</h3><p>April 2026 gave us a new model hierarchy.</p><p>GPT-5.5 is the best overall model.</p><p>Opus 4.7 is the strongest EQ and writing-quality model.</p><p>Gemini 3.1 Pro remains a Tier 1 holdover.</p><p>Grok 4.3 deserves attention as the leader on instruction-following.</p><p>Kimi K2.6, DeepSeek-V4-Pro, Muse Spark, Qwen3.6, MiMo-V2.5-Pro, and GLM-5.1 make Tier 2 much more competitive than it was a month ago.</p><p>The frontier is still a premium market.</p><p>But the middle of the market just got much smarter.</p><p>That is what professionals should take away from April. Not that one lab won. Not that one leaderboard settled the race. 
Not that open models caught the frontier.</p><p>The real lesson is that model choice is now a portfolio decision.</p><p>Use the smartest model when intelligence is the bottleneck.</p><p>Use the best EQ model when tone, nuance, and collaboration are the bottleneck.</p><p>Use the best instruction-following model when exact constraints are the bottleneck.</p><p>Use the cheapest good-enough model when scale is the bottleneck.</p><p>And revisit the decision every month.</p><p>See you in the next update.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of March 2026]]></title><description><![CDATA[GPT-5.4 moved the frontier, while MiniMax and Xiaomi made Tier 2 more interesting]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-model-releases-of-march</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-model-releases-of-march</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Tue, 07 Apr 2026 01:50:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Nr-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nr-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nr-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2217469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brief.aiiq.org/i/196371528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nr-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Nr-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2129c0a-a3f2-4f60-9c01-72145b6250f7_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>March was a slower month for AI model releases than February.</p><p>There was no new Claude. No new Gemini. No new Grok. Most of the major US and Chinese labs were between release cycles.</p><p>But slow does not mean boring.</p><p>OpenAI released GPT-5.4 on March 5, and it immediately became one of the best models in the world. It did not completely erase Gemini 3.1 Pro or Opus 4.6, but it pushed the frontier forward in a very specific direction: real professional work. Spreadsheets, documents, presentations, computer use, coding agents, tool use, long-context work, and fewer factual errors.</p><p>Then, on March 18, two Chinese labs released strong Tier 2 models: MiniMax-M2.7 and Xiaomi MiMo-V2-Pro. 
Neither model took the overall crown, but both made the cost-performance and agentic-deployment tier more competitive.</p><p>So March was not a broad release wave.</p><p>It was a month with one main event and two important follow-ups.</p><p>The main event was GPT-5.4.</p><p>The follow-ups were MiniMax-M2.7 and MiMo-V2-Pro showing that the layer below the frontier is still getting better.</p><h3>March 2026 model releases</h3><p>March had three releases that matter for AI IQ&#8217;s model rankings, plus one smaller OpenAI release that matters for routing.</p><p><strong>March 5: OpenAI released GPT-5.4</strong><br>GPT-5.4 was the clear headline release of the month. OpenAI positioned it as its strongest mainline reasoning model, with improvements across professional work, coding, computer use, tool use, academic reasoning, factuality, and long-context tasks. OpenAI also said GPT-5.4 was the first mainline reasoning model to incorporate the frontier coding capabilities of GPT-5.3-Codex.</p><p><strong>March 17: OpenAI released GPT-5.4 mini and GPT-5.4 nano</strong><br>These were not Tier 1 models, but they matter for production systems. GPT-5.4 mini and nano were designed for faster, cheaper, high-volume workloads, with OpenAI specifically emphasizing coding subagents, classification, data extraction, ranking, multimodal use, and low-latency tool workflows.</p><p><strong>March 18: MiniMax released MiniMax-M2.7</strong><br>MiniMax-M2.7 was the most interesting non-OpenAI release of March. MiniMax described it as its first model to participate deeply in its own development cycle, with strong agent-harness capabilities, real-world software engineering results, office-work performance, and native Agent Teams. 
It scored 56.22% on SWE-Pro, 55.6% on VIBE-Pro, and 57.0% on Terminal Bench 2 according to MiniMax&#8217;s launch post.</p><p><strong>March 18: Xiaomi released MiMo-V2-Pro</strong><br>MiMo-V2-Pro was Xiaomi&#8217;s flagship agent model, built for real-world agentic workloads. Xiaomi described it as a trillion-parameter model with 42B active parameters, support for up to 1M-token context, and strong agent benchmark performance, including #3 globally on PinchBench and ClawEval in Xiaomi&#8217;s published comparisons.</p><p>That is a smaller calendar than February, but the model-market impact was still real. GPT-5.4 became a Tier 1 model immediately, and MiniMax-M2.7 and MiMo-V2-Pro both earned Tier 2 positions.</p><h3>The new March ranking</h3><p>By the end of March, AI IQ&#8217;s top tiers looked like this.</p><h4>Tier 1</h4><p><strong>Gemini 3.1 Pro</strong><br>Still the overall leader in AI IQ&#8217;s March ranking. Google did not release a new model in March, but Gemini 3.1 Pro remained the model to beat, especially on broad reasoning and programmatic capability.</p><p><strong>GPT-5.4</strong><br>The new March release and the main event of the month. GPT-5.4 moved OpenAI back into the top cluster and became one of the strongest all-around models, especially for professional work, computer use, coding, tool use, and factuality.</p><p><strong>Claude Opus 4.6</strong><br>Still Tier 1 despite no March release. 
Opus 4.6 remained one of the best models for long-running professional workflows, high-context collaboration, coding agents, and EQ-heavy work.</p><h4>Tier 2</h4><p><strong>Grok 4.20</strong><br>Still xAI&#8217;s strongest model in the March window, but below the top three.</p><p><strong>Kimi K2.5</strong><br>Still highly relevant as an open-weight, multimodal, agentic model from January.</p><p><strong>GLM-5</strong><br>Still one of the stronger Chinese frontier-adjacent models, especially for engineering-agent workflows.</p><p><strong>DeepSeek-V3.2</strong><br>Still an important open-weight baseline.</p><p><strong>MiniMax-M2.7</strong><br>New in March, and the strongest MiniMax model yet. It is especially interesting for coding agents, production debugging, office work, and multi-agent workflows.</p><p><strong>Qwen3.5-397B</strong><br>Still a strong open-weight MoE model from February.</p><p><strong>MiMo-V2-Pro</strong><br>New in March, and one of the more interesting Chinese agent models because of its 1M context, 42B-active architecture, API pricing, and OpenClaw-style agent positioning.</p><p>The short version: March did not reshuffle the entire leaderboard. It inserted GPT-5.4 into Tier 1 and made Tier 2 more competitive.</p><h3>GPT-5.4: the professional-work model</h3><p>GPT-5.4 is the March release that matters most.</p><p>OpenAI&#8217;s launch post was not framed around chat quality or one narrow benchmark. It was framed around professional work: spreadsheets, presentations, documents, legal analysis, financial modeling, coding, computer use, and tool-heavy agentic workflows.</p><p>That is a meaningful shift. The highest-value AI workloads are increasingly not &#8220;answer this question.&#8221; They are &#8220;do this work.&#8221;</p><p>OpenAI reported that GPT-5.4 scored 87.3% on an internal investment-banking-style spreadsheet modeling benchmark, compared with 68.4% for GPT-5.2. 
It also said human raters preferred GPT-5.4 presentations over GPT-5.2 presentations 68.0% of the time.</p><p>The factuality improvement is also important. OpenAI said GPT-5.4 was its most factual model yet, with individual claims 33% less likely to be false and full responses 18% less likely to contain any errors relative to GPT-5.2, measured on de-identified prompts where users had flagged factual errors.</p><p>That is the core of GPT-5.4&#8217;s case. It is not just smarter in an abstract sense. It is better at the kinds of work people actually pay AI systems to do.</p><h3>GPT-5.4&#8217;s biggest leap may be computer use</h3><p>The most striking GPT-5.4 result is not a math score or a coding score.</p><p>It is computer use.</p><p>OpenAI described GPT-5.4 as its first general-purpose model with native computer-use capabilities. On OSWorld-Verified, GPT-5.4 scored 75.0%, compared with 47.3% for GPT-5.2, and above the human performance baseline of 72.4%.</p><p>That matters because computer use is one of the bridges between &#8220;model that answers questions&#8221; and &#8220;model that can operate software.&#8221;</p><p>A model that can reason through screenshots, issue mouse and keyboard actions, use browser environments, and operate tools reliably is much closer to being useful in the messy middle of knowledge work. Not just writing code. Not just summarizing. Actually moving through software systems.</p><p>GPT-5.4 also scored 67.3% on WebArena-Verified and 92.8% on Online-Mind2Web using screenshot-based observations alone, according to OpenAI.</p><p>This is why GPT-5.4 feels different from a normal incremental model release. It is not just another reasoning bump. It is a stronger base for software-operating agents.</p><h3>GPT-5.4 vs Gemini 3.1 Pro vs Opus 4.6</h3><p>The March Tier 1 is not cleanly ordered by one simple criterion.</p><p><strong>Gemini 3.1 Pro</strong> still had the strongest overall AI IQ position by the end of March. 
It remained the best broad benchmark model in the March window, especially on abstract and programmatic reasoning.</p><p><strong>GPT-5.4</strong> was the strongest new release and the biggest March mover. It made OpenAI much more competitive with Gemini 3.1 Pro and Opus 4.6, especially on professional work, computer use, coding, tool use, factuality, and documents.</p><p><strong>Opus 4.6</strong> remained the model with the strongest claim for long-running, high-context professional workflows and EQ-heavy work.</p><p>The practical distinction is:</p><p>Use <strong>Gemini 3.1 Pro</strong> when you want the strongest broad benchmark profile.</p><p>Use <strong>GPT-5.4</strong> when you care about professional work, computer use, factuality, coding, and tool-heavy agents.</p><p>Use <strong>Opus 4.6</strong> when the task is long, messy, collaborative, and sensitive to workflow quality.</p><p>That is the uncomfortable reality of March: there was no single model that made the other two irrelevant.</p><p>GPT-5.4 was excellent. But March still ended with a three-model Tier 1.</p><h3>GPT-5.4 mini and nano: the routing angle</h3><p>GPT-5.4 mini and nano were not the headline models, but they may matter a lot in production.</p><p>OpenAI released them on March 17 as faster, cheaper models designed for high-volume workloads. GPT-5.4 mini supports text and image inputs, tool use, function calling, web search, file search, computer use, and skills, with a 400K context window. It costs $0.75 per million input tokens and $4.50 per million output tokens. GPT-5.4 nano costs $0.20 per million input tokens and $1.25 per million output tokens.</p><p>That is not just a pricing detail. It is a product architecture detail.</p><p>The best agent systems increasingly do not use one model for everything. 
They use a large model for planning, final judgment, and hard reasoning, then smaller models for subagents, codebase search, extraction, ranking, classification, and fast supporting tasks.</p><p>OpenAI explicitly framed GPT-5.4 mini this way: GPT-5.4 can handle planning and final judgment, while GPT-5.4 mini subagents handle narrower subtasks in parallel.</p><p>That pattern is going to matter more every month. The March story is not just that GPT-5.4 got smarter. It is that OpenAI&#8217;s model stack got easier to route.</p><h3>MiniMax-M2.7: the best non-OpenAI March release</h3><p>MiniMax-M2.7 was the most interesting March release outside OpenAI.</p><p>The headline is not that M2.7 beat the Tier 1 models. It did not.</p><p>The headline is that MiniMax is building toward a different model of progress: models that participate in their own improvement, build agent harnesses, use memory and complex skills, and operate across messy organizational workflows.</p><p>MiniMax described M2.7 as its first model to deeply participate in its own evolution. During development, MiniMax says it used the model to update memory, build complex skills, improve reinforcement-learning experiment harnesses, and iterate on its learning process based on experiment results.</p><p>That could sound like marketing if the model were weak. But M2.7&#8217;s reported results are strong enough to pay attention to. MiniMax reported 56.22% on SWE-Pro, 55.6% on VIBE-Pro, and 57.0% on Terminal Bench 2. It also reported a GDPval-AA Elo of 1495, which MiniMax says was the highest among open-source models in its comparison.</p><p>The most interesting parts are practical. MiniMax says M2.7 can handle production debugging, correlate monitoring metrics with deployment timelines, connect to databases to verify root causes, identify missing migrations, and propose non-blocking index creation before submitting a merge request.</p><p>That is not normal &#8220;can write Python&#8221; coding. 
That is closer to software engineering operations.</p><p>M2.7 belongs in Tier 2, not Tier 1. But for teams building coding agents, office agents, or internal automation systems, it became a model worth testing.</p><h3>MiMo-V2-Pro: Xiaomi enters the serious agent conversation</h3><p>Xiaomi&#8217;s MiMo-V2-Pro was the other important March Tier 2 release.</p><p>The model is interesting for three reasons: scale, context, and agent positioning.</p><p>Xiaomi says MiMo-V2-Pro has more than 1T total parameters with 42B active, supports up to 1M-token context, and includes a lightweight Multi-Token Prediction layer for fast generation.</p><p>It is also built very explicitly for agents. Xiaomi describes MiMo-V2-Pro as a foundation model for real-world agentic workloads and says it is designed to orchestrate complex workflows, drive production engineering tasks, and serve as the &#8220;brain&#8221; of agent systems.</p><p>The benchmark positioning is aggressive. Xiaomi reported MiMo-V2-Pro at #3 globally on PinchBench and #3 globally on ClawEval in its published comparisons, approaching Opus 4.6 on ClawEval and sitting just below Claude Opus 4.6 and MiMo-V2-Omni on PinchBench.</p><p>The pricing is also notable. Xiaomi lists MiMo-V2-Pro at $1 per million input tokens and $3 per million output tokens up to 256K context, and $2 input / $6 output from 256K to 1M context.</p><p>That makes MiMo-V2-Pro easy to understand: it is not the best model overall, but it is a serious agent model with long context and much lower pricing than the top closed models.</p><p>That is exactly the kind of model that can win routing decisions.</p><h3>Dimension-by-dimension read</h3><p>AI IQ evaluates models across Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. 
The composite IQ is the average of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That framework is useful for March because the models had very different shapes.</p><h4>Best overall IQ: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro remained the overall AI IQ leader in the March window.</p><p>GPT-5.4 narrowed the gap and moved OpenAI back into the top cluster, but Gemini 3.1 Pro still had the strongest overall profile by the end of March.</p><p>This is why March was not simply &#8220;GPT-5.4 wins.&#8221; GPT-5.4 was the best new release, but Gemini 3.1 Pro was still the model sitting at the top of the March ranking.</p><h4>Best March release: GPT-5.4</h4><p>Among models released in March, GPT-5.4 was the clear winner.</p><p>It was the only March release that earned a Tier 1 spot. It improved meaningfully over GPT-5.2, absorbed the Codex gains from GPT-5.3-Codex into a mainline model, and became one of the best models in the world across professional work, computer use, coding, tool use, and academic reasoning.</p><p>GPT-5.4 was not a minor checkpoint. It was the March model that changed the top tier.</p><h4>Best EQ: Opus 4.6</h4><p>Opus 4.6 remained the strongest EQ model in the March window.</p><p>This is one of the reasons Opus 4.6 stayed Tier 1 even after GPT-5.4 launched. AI IQ&#8217;s EQ ranking is designed to capture emotional intelligence, conversational judgment, and human preference signals. 
Those qualities matter more as models move from answering questions to working with people.</p><p>For long-running, high-context, user-facing work, Opus 4.6 still had a strong claim.</p><h4>Best abstract reasoning: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro remained the abstract reasoning leader.</p><p>GPT-5.4 improved OpenAI&#8217;s position substantially, especially compared with GPT-5.2, but Gemini 3.1 Pro was still the strongest model on AI IQ&#8217;s abstract reasoning dimension in the March window.</p><p>That matters because abstract reasoning is one of the harder dimensions to fake. It is less about memorized facts and more about solving novel patterns.</p><h4>Best mathematical reasoning: GPT-5.4, with GPT-5.3-Codex still relevant</h4><p>GPT-5.4 had the strongest March-release math profile.</p><p>OpenAI reported GPT-5.4 at 47.6% on FrontierMath Tier 1&#8211;3 and 27.1% on FrontierMath Tier 4, with GPT-5.4 Pro scoring 50.0% and 38.0% respectively.</p><p>GPT-5.3-Codex remained relevant for technical reasoning, especially where math overlaps with coding, tools, and verification. But GPT-5.4 mattered because it brought those Codex-style strengths into the mainline GPT family.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro overall; GPT-5.4 as the March mover</h4><p>Gemini 3.1 Pro still had the strongest overall programmatic reasoning profile in the March window.</p><p>But GPT-5.4 was the major new programmatic-reasoning release. OpenAI reported GPT-5.4 at 57.7% on SWE-Bench Pro and 75.1% on Terminal-Bench 2.0. GPT-5.3-Codex still led GPT-5.4 on Terminal-Bench 2.0 in OpenAI&#8217;s table, but GPT-5.4 was a much stronger all-around mainline model than GPT-5.2.</p><p>MiniMax-M2.7 deserves mention here too. 
It did not take the top programmatic slot, but its SWE-Pro, VIBE-Pro, Terminal Bench 2, and production-debugging profile made it one of the strongest Tier 2 coding-agent releases of the month.</p><h4>Best academic reasoning: Gemini 3.1 Pro and GPT-5.4</h4><p>Academic reasoning was close at the top.</p><p>Gemini 3.1 Pro remained extremely strong, but GPT-5.4 gave OpenAI a major new academic-reasoning profile. OpenAI reported GPT-5.4 at 39.8% on Humanity&#8217;s Last Exam without tools, 52.1% with tools, 92.8% on GPQA Diamond, and 33.0% on Frontier Science Research.</p><p>That puts GPT-5.4 squarely in the top-tier academic-reasoning conversation.</p><h3>Cost-performance: March was really about routing</h3><p>The most practical lesson from March is that routing kept getting more important.</p><p>GPT-5.4 is expensive but good. OpenAI lists GPT-5.4 API pricing at $2.50 per million input tokens and $15 per million output tokens, compared with GPT-5.2 at $1.75 input and $14 output. GPT-5.4 Pro is far more expensive at $30 input and $180 output.</p><p>That pricing is reasonable if GPT-5.4 is doing high-value work. But it makes no sense to use the most expensive model for every subtask in a large agent loop.</p><p>That is where March got interesting.</p><p>GPT-5.4 mini and nano gave OpenAI cheaper subagent options. MiniMax-M2.7 gave builders a strong Tier 2 coding and office-work model. 
MiMo-V2-Pro gave builders a 1M-context agent model with much lower listed pricing than Opus-class models.</p><p>The better stack after March looked something like this:</p><p>Use <strong>Gemini 3.1 Pro</strong> for the strongest broad reasoning and benchmark profile.</p><p>Use <strong>GPT-5.4</strong> for professional work, computer use, coding, documents, spreadsheets, presentations, factuality-sensitive tasks, and tool-heavy agents.</p><p>Use <strong>Opus 4.6</strong> for long-running workflows where trust, EQ, and high-context collaboration matter.</p><p>Use <strong>GPT-5.4 mini or nano</strong> for cheaper OpenAI subagents, extraction, ranking, classification, lightweight coding tasks, and fast supporting work.</p><p>Use <strong>MiniMax-M2.7</strong> for coding-agent experiments, office-work agents, production-debugging workflows, and multi-agent harnesses.</p><p>Use <strong>MiMo-V2-Pro</strong> for lower-cost 1M-context agent workflows, especially where OpenClaw-style agent behavior matters.</p><p>That is the March pattern. The top model matters, but the stack matters more.</p><h3>What changed from February to March</h3><p>February was a broad frontier-reset month.</p><p>Gemini 3.1 Pro and Opus 4.6 pushed Tier 1 forward. GPT-5.3-Codex made OpenAI harder to compare because it was clearly powerful, but specialized. Grok 4.20 brought xAI back into Tier 2. GLM-5, MiniMax-M2.5, and Qwen3.5-397B made the Chinese cost-performance layer stronger.</p><p>March was narrower.</p><p>There was one major frontier release: GPT-5.4.</p><p>That release mattered because it answered the main question left open by February: what happens when OpenAI folds GPT-5.3-Codex-style coding capability back into a general-purpose GPT model?</p><p>The answer was GPT-5.4.</p><p>And it was strong enough to make the Tier 1 conversation genuinely three-way again.</p><h3>What to watch next</h3><p>The first thing to watch is whether GPT-5.4 becomes the default model for professional agents. 
OpenAI&#8217;s benchmark story is strongest around documents, spreadsheets, presentations, computer use, and tool workflows. The real signal will be whether users feel that improvement in day-to-day work.</p><p>The second thing to watch is Gemini. Google skipped March after releasing Gemini 3.1 Pro in February. If Google ships another major update, the top of the ranking could move again quickly.</p><p>The third thing to watch is Anthropic. Opus 4.6 remained Tier 1 in March, especially on EQ and long-running workflows, but GPT-5.4 narrowed the professional-work gap. Anthropic&#8217;s next release will need to push hard on reliability, coding agents, and long-context work.</p><p>The fourth thing to watch is MiniMax-M2.7 adoption. The model&#8217;s self-evolution framing is interesting, but adoption will depend on whether developers find it reliable inside real harnesses.</p><p>The fifth thing to watch is MiMo-V2-Pro&#8217;s agent performance in the wild. Xiaomi&#8217;s pricing and 1M-context support are compelling. 
The question is whether it holds up outside curated agent benchmarks.</p><h3>Bottom line</h3><p>March 2026 was quieter than February, but it still mattered.</p><p><strong>Gemini 3.1 Pro remained the overall leader in AI IQ&#8217;s March ranking.</strong></p><p><strong>GPT-5.4 was the best new model of the month and immediately joined Tier 1.</strong></p><p><strong>Opus 4.6 stayed Tier 1 because of its EQ, long-running workflow quality, and professional-agent profile.</strong></p><p><strong>MiniMax-M2.7 became one of the most interesting Tier 2 models for coding agents, office work, and agent-harness development.</strong></p><p><strong>MiMo-V2-Pro gave Xiaomi a serious Tier 2 agent model with 1M context, 42B active parameters, and aggressive pricing.</strong></p><p>The practical takeaway is simple: March made the top tier more competitive, but it also made routing more important.</p><p>Use the best model when the task is hard.</p><p>Use the cheaper model when the task is frequent.</p><p>Use the model with the right shape when the workflow is specific.</p><p>GPT-5.4 was the headline.</p><p>But March&#8217;s bigger lesson is that serious AI systems are becoming model stacks, not model choices.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of February 2026]]></title><description><![CDATA[Gemini 3.1 Pro and Opus 4.6 reset Tier 1, while GPT-5.3-Codex made OpenAI harder to compare]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-february-2026</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-february-2026</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Tue, 03 Mar 2026 03:38:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Tl-V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tl-V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tl-V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2216569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brief.aiiq.org/i/196375094?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tl-V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Tl-V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3351d61-18fd-48fd-9701-3aae455a6b69_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>February was a blockbuster month for AI model releases.</p><p>January was mostly about open-weight models becoming more useful underneath the frontier. February was different. The top of the leaderboard moved.</p><p>Anthropic released Opus 4.6. OpenAI released GPT-5.3-Codex. Google released Gemini 3.1 Pro. xAI pushed Grok 4.20 into public beta. And the Chinese labs shipped another strong wave: GLM-5, MiniMax-M2.5, and Qwen3.5-397B.</p><p>The simple story is that Gemini 3.1 Pro and Opus 4.6 became the new all-around state-of-the-art models.</p><p>The more interesting story is that February did not give us one clean winner. 
It gave us three different frontier claims.</p><p>Google had the best broad benchmark story with Gemini 3.1 Pro.</p><p>Anthropic had the best long-horizon professional workflow story with Opus 4.6.</p><p>OpenAI had the most awkward model to rank: GPT-5.3-Codex, a model that looks like a clear step up from GPT-5.2 in coding and agentic computer work, but is not quite the same thing as a general-purpose GPT-5.3.</p><p>That made February more interesting than a normal leaderboard update. The frontier moved, but it also got messier.</p><h3>February 2026 model releases</h3><p>February had two clear Tier 1 general-model launches, one major OpenAI specialist launch, one important xAI beta, and a strong Chinese model wave.</p><p><strong>February 5: Anthropic released Claude Opus 4.6</strong><br>Opus 4.6 improved on Opus 4.5 in coding, long-running agentic tasks, large-codebase reliability, debugging, code review, and long-context work. Anthropic also introduced a 1M token context window in beta for an Opus-class model and emphasized everyday professional tasks like financial analysis, research, documents, spreadsheets, and presentations.</p><p><strong>February 5: OpenAI released GPT-5.3-Codex</strong><br>GPT-5.3-Codex was OpenAI&#8217;s biggest February move, but not a normal flagship release. OpenAI described it as its most capable agentic coding model to date, combining the Codex and GPT-5 training stacks, advancing both GPT-5.2-Codex coding performance and GPT-5.2 professional reasoning, while running about 25% faster.</p><p><strong>February 12: Z.ai released GLM-5</strong><br>GLM-5 was designed for complex system engineering and long-range agent tasks. 
Z.ai framed it as a shift &#8220;from coding to engineering,&#8221; with stronger deep reasoning in backend architecture, complex algorithms, and stubborn bug fixing, plus DeepSeek Sparse Attention for token efficiency.</p><p><strong>February 12: MiniMax released MiniMax-M2.5</strong><br>MiniMax-M2.5 pushed heavily on coding, agentic tool use, search, office work, and economically valuable tasks. MiniMax reported 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp with context management, while also claiming the model could run continuously for $1/hour at 100 tokens per second.</p><p><strong>Mid-February: xAI pushed Grok 4.20 into public beta</strong><br>Grok 4.20 is harder to source cleanly than the OpenAI, Anthropic, and Google releases. It appeared in public beta in February, while xAI&#8217;s developer release notes list Grok 4.20 and Grok 4.20 Multi-agent as live on March 10. So I would treat it as a February capability signal, not as a product release as clean as Opus 4.6 or Gemini 3.1 Pro.</p><p><strong>February 16: Alibaba released Qwen3.5</strong><br>Alibaba unveiled Qwen3.5 for the &#8220;agentic AI era,&#8221; with Reuters reporting claims of lower cost, better large-workload processing, and visual agentic capabilities across mobile and desktop apps. The open-weight Qwen3.5-397B-A17B model uses 397B total parameters with 17B activated and supports 262K native context, extensible to roughly 1M tokens.</p><p><strong>February 19: Google released Gemini 3.1 Pro</strong><br>Gemini 3.1 Pro was the biggest all-around release of the month. Google framed it as upgraded core intelligence for complex tasks, rolling out across the Gemini API, Vertex AI, Gemini app, NotebookLM, Google AI Studio, Antigravity, Gemini CLI, and Android Studio. 
Google also reported a verified 77.1% score on ARC-AGI-2, more than double Gemini 3 Pro&#8217;s reasoning performance.</p><p>That is a lot for one month.</p><p>The simple release-count story understates it. February was not just busy. It changed the top tier.</p><h3>The new Tier 1</h3><p>By the end of February, AI IQ&#8217;s Tier 1 looked like this:</p><ol><li><p><strong>Gemini 3.1 Pro</strong></p></li><li><p><strong>Claude Opus 4.6</strong></p></li><li><p><strong>GPT-5.3-Codex</strong></p></li></ol><p>That ranking needs a caveat.</p><p>Gemini 3.1 Pro and Opus 4.6 are clean all-around models. They are easy to compare against GPT-5.2, Gemini 3 Pro, and Opus 4.5.</p><p>GPT-5.3-Codex is different. It is clearly stronger than GPT-5.2 in important ways, especially in coding-agent work. But OpenAI did not release a general-purpose GPT-5.3 in February, and GPT-5.3-Codex did not get the same kind of broad benchmark coverage that GPT-5.2 had. So it belongs in Tier 1, but with an asterisk: it is a frontier model, but not a normal flagship model.</p><p>That distinction matters because model selection is no longer a one-column leaderboard problem.</p><p>For broad intelligence, Gemini 3.1 Pro had the strongest February claim.</p><p>For long-running professional work, Opus 4.6 had the strongest workflow claim.</p><p>For agentic coding and computer work, GPT-5.3-Codex looked like OpenAI&#8217;s most important model.</p><p>The best model depended more than usual on what kind of work you meant.</p><h3>Gemini 3.1 Pro: the broad benchmark leader</h3><p>Gemini 3.1 Pro was the cleanest all-around winner of February.</p><p>Google did not frame it as a minor Gemini 3 patch. 
It framed it as the upgraded core intelligence behind recent Deep Think progress, designed for tasks where &#8220;a simple answer isn&#8217;t enough.&#8221; The launch emphasized complex reasoning, data synthesis, visual explanations, code-based animation, API integration, and agentic development workflows.</p><p>The standout number was ARC-AGI-2. Google reported Gemini 3.1 Pro at 77.1% verified on ARC-AGI-2, more than double Gemini 3 Pro. That matters because ARC-AGI-2 is one of the better tests for novel abstract reasoning &#8212; the kind of problem-solving that is harder to explain away with benchmark contamination or memorized examples.</p><p>On AI IQ, that shows up clearly. Gemini 3.1 Pro sits at or near the top overall, and it has one of the strongest profiles across abstract, programmatic, and academic reasoning.</p><p>It also benefits from Google&#8217;s distribution. Gemini 3.1 Pro was not just a model-card release. It rolled into the Gemini app, NotebookLM, Vertex AI, Gemini Enterprise, Google AI Studio, Antigravity, Gemini CLI, and Android Studio. That matters because the model is not only competing in the lab; it is being pushed into the places where people actually work.</p><p>The weakness is the usual Gemini caveat: benchmarks and product feel have not always lined up perfectly. Google&#8217;s best models can look incredible on hard evals but still feel uneven in day-to-day chat or agent workflows. Gemini 3.1 Pro narrows that gap, but it does not erase the question.</p><p>Still, by the end of February, Gemini 3.1 Pro had the strongest claim to &#8220;best overall model.&#8221;</p><h3>Opus 4.6: the professional workflow model</h3><p>Opus 4.6&#8217;s case is different from Gemini&#8217;s.</p><p>Gemini 3.1 Pro had the cleaner broad benchmark story. 
Opus 4.6 had the better &#8220;give it a real project and let it work&#8221; story.</p><p>Anthropic emphasized coding, planning, long-running agentic tasks, large-codebase reliability, debugging, code review, financial analysis, research, and document/spreadsheet/presentation work. The model also got a 1M token context window in beta, which is especially relevant for enterprise workflows where the model has to reason over large document sets, repositories, or multi-file projects.</p><p>Anthropic also reported strong results on GDPval-AA, BrowseComp, long-context retrieval, cybersecurity, life sciences, and agentic coding. One particularly useful detail: Opus 4.6 scored 76% on an 8-needle 1M-context MRCR v2 task, compared with 18.5% for Sonnet 4.5. That is the kind of long-context retrieval result that matters in real work, not just benchmark theater.</p><p>On AI IQ, Opus 4.6 belongs in Tier 1 because it is strong across the board. But its practical advantage may be even more important than its composite score.</p><p>Opus 4.6 is the model I would most want to test for long-running tasks where the failure modes are subtle: drifting from the goal, missing buried constraints, over-editing code, ignoring project structure, or producing something polished but not quite right.</p><p>That is not the same as saying it beats Gemini 3.1 Pro on every dimension. It does not.</p><p>It is saying that &#8220;best benchmark model&#8221; and &#8220;most trustworthy model for messy professional work&#8221; are no longer obviously the same thing.</p><h3>GPT-5.3-Codex: the hardest model to rank</h3><p>GPT-5.3-Codex is the most interesting February release because it created a ranking problem.</p><p>On one hand, it looks like a major step forward. OpenAI says GPT-5.3-Codex combines the Codex and GPT-5 training stacks, advances both GPT-5.2-Codex and GPT-5.2, runs about 25% faster, and sets new highs on SWE-Bench Pro and Terminal-Bench. 
OpenAI also describes it as moving Codex beyond writing code toward doing end-to-end work on a computer.</p><p>On the other hand, it is not GPT-5.3.</p><p>That sounds like a pedantic distinction, but it matters. GPT-5.3-Codex is clearly a frontier agentic coding model. It may also be a very strong general professional-work model. But because OpenAI did not release a normal general-purpose GPT-5.3, we have less clean evidence for where the underlying GPT line stood in February.</p><p>That makes GPT-5.3-Codex feel like a partial preview of OpenAI&#8217;s next frontier rather than the next clean GPT flagship.</p><p>For developers, this may not matter. If your work is coding, tool use, repo search, debugging, terminal work, and computer operation, GPT-5.3-Codex is one of the first models to test.</p><p>For model rankings, it matters a lot. A coding-specialized frontier model can beat previous GPT releases on many important tasks without telling us exactly how OpenAI&#8217;s general-purpose model compares to Gemini 3.1 Pro or Opus 4.6.</p><p>That is why GPT-5.3-Codex belongs in Tier 1, but with more uncertainty than the other two February leaders.</p><h3>Grok 4.20: xAI is still behind, but gaining ground</h3><p>Grok 4.20 was not the cleanest February release, and it was not Tier 1 on AI IQ.</p><p>But it mattered.</p><p>xAI&#8217;s previous frontier position had been drifting. Grok 4 was useful, but it did not really put xAI in the same all-around conversation as OpenAI, Anthropic, and Google. Grok 4.20 changed that somewhat.</p><p>The model appeared in public beta in February, with coverage emphasizing a major step up in capability and daily bug-fix iteration, while xAI&#8217;s own developer notes later listed Grok 4.20 and Grok 4.20 Multi-agent as live on March 10. 
So this is not as clean as saying &#8220;xAI officially released a new flagship on February X.&#8221; It is better to say: February is when Grok 4.20 became visible as the next serious Grok checkpoint.</p><p>On AI IQ, Grok 4.20 sits in Tier 2. That feels right.</p><p>It is not yet Gemini 3.1 Pro, Opus 4.6, or GPT-5.3-Codex. But the jump from Grok 4 to Grok 4.20 suggests xAI is gaining ground rather than falling farther behind.</p><p>That is the important part. The fourth US lab is still in fourth place, but it is not irrelevant.</p><h3>The Chinese wave: GLM-5, MiniMax-M2.5, and Qwen3.5</h3><p>February also had a strong Chinese model wave.</p><p>None of GLM-5, MiniMax-M2.5, or Qwen3.5-397B became Tier 1 on AI IQ. But all three pushed the same broader pattern: Chinese labs are not merely trying to top a single benchmark. They are building models around agentic deployment, coding, cost, long context, and broad real-world utility.</p><p><strong>GLM-5</strong> was the strongest pure &#8220;engineering agent&#8221; story. Z.ai explicitly framed it as a move from coding to engineering, with an emphasis on backend architecture, complex algorithms, long-range agent tasks, and stubborn bug fixing. It also integrated DeepSeek Sparse Attention for better token efficiency while preserving long-context quality.</p><p><strong>MiniMax-M2.5</strong> was the cost-performance story. MiniMax claimed strong coding, tool-use, search, and office-work results, plus much lower cost for continuous agentic operation. The key line is not the SWE-Bench score. It is the $1/hour claim at 100 tokens per second. If that holds up in real workflows, it changes what kinds of agentic applications are economically reasonable.</p><p><strong>Qwen3.5-397B</strong> was the architecture and deployment story. The open-weight version uses a 397B-total / 17B-active MoE design, combines Gated Delta Networks with sparse MoE, supports a large native context window, and is built as a native vision-language model. 
Reuters also reported Alibaba&#8217;s claim that Qwen3.5 was 60% cheaper and eight times better at processing large workloads than its predecessor.</p><p>The US labs still owned the top of the leaderboard in February.</p><p>But the Chinese labs continued to make Tier 2 much more usable.</p><p>That matters because most real AI usage will not be one-off frontier prompts. It will be millions of calls across coding agents, office agents, search agents, customer-support workflows, document systems, and internal automation. In that world, a model does not have to be the smartest model in the world to be economically important.</p><h3>Updated model rankings</h3><p>AI IQ&#8217;s February ranking can be summarized like this.</p><h4>Tier 1</h4><p><strong>Gemini 3.1 Pro</strong><br>The strongest overall model by the end of February. Best broad benchmark story, especially on abstract reasoning and all-around capability.</p><p><strong>Claude Opus 4.6</strong><br>The strongest long-running professional workflow model. 
Excellent for agentic work, coding, large context, research, and document/spreadsheet-heavy tasks.</p><p><strong>GPT-5.3-Codex</strong><br>A Tier 1 coding-agent model with unusually strong professional-work capabilities, but harder to compare because OpenAI did not release a general GPT-5.3.</p><h4>Tier 2</h4><p><strong>Grok 4.20</strong><br>A major step up for xAI, but still below the top three US labs overall.</p><p><strong>Kimi K2.5</strong><br>Still highly relevant from January as one of the strongest open-weight agentic models.</p><p><strong>GLM-5</strong><br>The strongest February Chinese release on engineering-agent positioning.</p><p><strong>DeepSeek-V3.2</strong><br>Still an important open-weight baseline even without being the month&#8217;s new headline.</p><p><strong>MiniMax-M2.5</strong><br>A cost-performance standout for coding, search, office work, and agentic tasks.</p><p><strong>Qwen3.5-397B</strong><br>An efficient open-weight native vision-language MoE model with a strong deployment story.</p><p>This is a more complicated ranking than January&#8217;s.</p><p>January&#8217;s story was: GPT-5.2 still leads, but open-weight models are becoming useful.</p><p>February&#8217;s story was: the top tier changed, but model choice became less obvious.</p><h3>Dimension-by-dimension read</h3><p>AI IQ evaluates models across Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. It also compresses easier or more gameable benchmarks so saturated tests cannot dominate the composite score. That is especially important in February because the new releases have very different shapes.</p><h4>Best overall IQ: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro had the strongest overall February profile.</p><p>Its advantage came from breadth. It was not just a coding model or a math model. 
It was strong across abstract reasoning, programmatic reasoning, academic reasoning, multimodal problem-solving, and tool-heavy workflows.</p><p>The most important public signal was Google&#8217;s ARC-AGI-2 score. A 77.1% verified result on a hard abstract reasoning benchmark is not a normal incremental upgrade. It is the kind of result that changes the top of the ranking.</p><h4>Best EQ: Opus 4.6</h4><p>Opus 4.6 had the strongest EQ and professional-collaboration profile.</p><p>EQ in AI IQ is not just friendliness. It is a proxy for conversational judgment, calibration, tact, user alignment, and the ability to work well in high-context settings. That matters more as models move from answering questions to doing work with people.</p><p>Opus 4.6&#8217;s strength is that it does not just score well; it feels designed for handoff. Give it a complex project, a large context, and a vague but real professional goal, and it is one of the models most likely to stay useful without constant correction.</p><h4>Best abstract reasoning: Gemini 3.1 Pro</h4><p>Gemini 3.1 Pro was the February abstract reasoning leader.</p><p>That is the clearest part of its case. Google&#8217;s ARC-AGI-2 result was a major jump from Gemini 3 Pro, and AI IQ&#8217;s abstract reasoning dimension puts real weight on ARC-style tasks because they test novel pattern-solving more directly than knowledge-heavy benchmarks.</p><h4>Best mathematical reasoning: GPT-5.3-Codex</h4><p>GPT-5.3-Codex had the strongest February claim on math-adjacent reasoning, especially where math overlaps with formal reasoning, code, and tool-based problem-solving.</p><p>This is one of the places where the model&#8217;s specialization matters. A Codex model that can reason through repositories, run terminal workflows, and verify intermediate results is not just a coding autocomplete system. It starts to become a practical reasoning engine for technical work.</p><p>The caveat is coverage. 
GPT-5.3-Codex did not get the same clean all-domain benchmark treatment as a normal GPT flagship. So I would call it the strongest technical-reasoning model of February, but not use it alone to infer the full state of OpenAI&#8217;s general GPT line.</p><h4>Best programmatic reasoning: Gemini 3.1 Pro overall; GPT-5.3-Codex for coding agents</h4><p>This is the most nuanced category.</p><p>On AI IQ&#8217;s broader programmatic dimension, Gemini 3.1 Pro had the strongest all-around position. But GPT-5.3-Codex was the most important coding-agent release of the month.</p><p>That distinction matters. Programmatic reasoning is broader than patching code. AI IQ includes Terminal-Bench, SWE-Bench, and SciCode, with compression applied to more gameable benchmarks. So a model can be the best &#8220;coding agent&#8221; in a product sense while another model has the strongest overall programmatic IQ profile.</p><p>For coding-agent products, GPT-5.3-Codex should be near the top of the eval list. For broad technical reasoning across coding, science, terminal work, and structured problem-solving, Gemini 3.1 Pro still has the best February case.</p><h4>Best academic reasoning: Gemini 3.1 Pro and Opus 4.6</h4><p>Academic reasoning was close between Gemini 3.1 Pro and Opus 4.6.</p><p>Gemini 3.1 Pro had the stronger broad-reasoning profile. Opus 4.6 had the stronger professional-knowledge-work story, especially on long-context retrieval, GDPval-AA, BrowseComp, finance, research, and expert workflows.</p><p>The practical difference is this: Gemini 3.1 Pro looks like the better academic benchmark model; Opus 4.6 looks like the model you would trust with a long, messy professional research task.</p><p>Both belong in Tier 1.</p><h3>Cost-performance: the gap below the frontier keeps narrowing</h3><p>February also made cost-performance more important.</p><p>The top three models &#8212; Gemini 3.1 Pro, Opus 4.6, and GPT-5.3-Codex &#8212; are the models to test when quality matters most. 
But most production AI systems should not route every call to the most expensive model.</p><p>That is where MiniMax-M2.5, GLM-5, Qwen3.5-397B, Kimi K2.5, and DeepSeek-V3.2 matter.</p><p>MiniMax-M2.5 is the clearest example. A model that performs well on coding, search, tool use, and office work while costing roughly $1/hour to run continuously at 100 tokens per second is not just a cheaper benchmark competitor. It changes the economics of background agents, always-on coding assistants, and multi-step office automation.</p><p>Qwen3.5-397B is another example. It does not beat Gemini 3.1 Pro overall, but a 397B-total / 17B-active native vision-language MoE model with long context and open weights is exactly the kind of model teams will want to experiment with for lower-cost multimodal agents.</p><p>GLM-5 sits in the same category. Its positioning around system engineering, long-range agent tasks, and token efficiency makes it relevant for teams trying to route complex coding and engineering work without paying top-tier closed-model prices every time.</p><p>The best model stack after February probably looked something like this:</p><p>Use <strong>Gemini 3.1 Pro</strong> for the strongest broad reasoning and abstract problem-solving.</p><p>Use <strong>Opus 4.6</strong> for long-running professional workflows, research, documents, spreadsheets, and coding tasks where trust and context retention matter.</p><p>Use <strong>GPT-5.3-Codex</strong> for serious coding-agent workflows, terminal tasks, repo work, and computer-use-heavy execution.</p><p>Use <strong>MiniMax-M2.5</strong>, <strong>GLM-5</strong>, <strong>Qwen3.5-397B</strong>, <strong>Kimi K2.5</strong>, or <strong>DeepSeek-V3.2</strong> when cost, openness, latency, or deployment control matter more than squeezing out the last few points of frontier capability.</p><p>Use <strong>Grok 4.20</strong> if you are specifically evaluating xAI&#8217;s ecosystem or want to see how quickly xAI is improving.</p><p>The better 
setup is not one model. It is routing.</p><h3>What changed from January to February</h3><p>January was mostly about the floor rising.</p><p>February was about the ceiling moving.</p><p>In January, GPT-5.2 and Gemini 3 Pro still sat at the top, while Kimi K2.5, GLM-4.7, MiniMax-M2.1, and GLM-4.7-Flash made the lower tiers more useful.</p><p>In February, the top changed. Gemini 3.1 Pro and Opus 4.6 displaced the old frontier. GPT-5.3-Codex gave OpenAI a stronger model, but in a specialist package. Grok 4.20 made xAI relevant again in the Tier 2 conversation. And the Chinese labs kept filling in the cost-performance layer.</p><p>That is why February mattered.</p><p>It was not just a month with more releases. It changed the shape of the model market.</p><h3>What to watch next</h3><p>The first thing to watch is OpenAI. GPT-5.3-Codex is clearly important, but it leaves an obvious question: where is the general-purpose GPT-5.3? If OpenAI folds Codex-level coding into a broader GPT model, the top of the leaderboard could move again quickly.</p><p>The second thing to watch is whether Gemini 3.1 Pro&#8217;s benchmark strength translates into everyday agentic reliability. The ARC-AGI-2 result is excellent. The practical question is whether developers and professionals feel the same jump in real workflows.</p><p>The third thing to watch is Opus 4.6 adoption in enterprise agents. Anthropic&#8217;s model is extremely well-positioned for long-context, high-trust, professional work. If businesses increasingly evaluate models on end-to-end task completion instead of chat quality, Opus 4.6 may end up being more important than its raw ranking suggests.</p><p>The fourth thing to watch is Grok 4.20. xAI is still behind the top three US labs, but the improvement rate matters. If Grok 4.20&#8217;s gains carry into a cleaner API release and stronger benchmark coverage, xAI becomes harder to dismiss.</p><p>The fifth thing to watch is the Chinese cost-performance tier. 
GLM-5, MiniMax-M2.5, and Qwen3.5-397B are not Tier 1 overall, but they are exactly the kinds of models that can win production routing decisions.</p><h3>Bottom line</h3><p>February 2026 was one of the most important AI model months so far.</p><p><strong>Gemini 3.1 Pro became the strongest overall model in AI IQ&#8217;s February ranking.</strong></p><p><strong>Opus 4.6 became the strongest long-running professional workflow model.</strong></p><p><strong>GPT-5.3-Codex became OpenAI&#8217;s most important agentic coding model, but it was harder to compare because OpenAI did not release a general GPT-5.3.</strong></p><p><strong>Grok 4.20 brought xAI back into the conversation, even though it still sat below the top frontier labs.</strong></p><p><strong>GLM-5, MiniMax-M2.5, and Qwen3.5-397B made Tier 2 more serious, especially for coding, agents, cost, and deployment control.</strong></p><p>The practical takeaway is that model selection got harder.</p><p>In January, you could mostly say GPT-5.2 was still the default, while open-weight models were becoming useful.</p><p>After February, that was no longer enough.</p><p>Use Gemini 3.1 Pro when you want the strongest broad reasoning.</p><p>Use Opus 4.6 when the work is long, messy, and professional.</p><p>Use GPT-5.3-Codex when the task is coding-agent work.</p><p>Use the best Tier 2 models when scale, openness, or cost matters more than the last few points of capability.</p><p>February did not settle the race.</p><p>It made the routing problem real.</p>]]></content:encoded></item><item><title><![CDATA[The Best AI Models of January 2026]]></title><description><![CDATA[GPT-5.2 stayed on top, but Kimi K2.5 made January interesting]]></description><link>https://newsletter.aiiq.org/p/the-best-ai-models-of-january-2026</link><guid isPermaLink="false">https://newsletter.aiiq.org/p/the-best-ai-models-of-january-2026</guid><dc:creator><![CDATA[Ryan Shea]]></dc:creator><pubDate>Mon, 02 Feb 2026 21:36:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!12XE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!12XE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!12XE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!12XE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!12XE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!12XE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!12XE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2223163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brief.aiiq.org/i/196468259?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!12XE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!12XE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!12XE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!12XE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7983cd-a0f5-4b6a-8b2f-e4759bfbd10e_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>January 2026 was not a clean frontier-reset month.</p><p>OpenAI did not release GPT-5.3. Anthropic did not release Opus 4.6. Google did not release Gemini 3.1. The very top of the leaderboard mostly stayed where December left it: GPT-5.2 and Gemini 3 Pro out in front, with Opus 4.5 still one of the most important models for coding agents and professional work.</p><p>That made January easy to underrate.</p><p>The month&#8217;s real action was one layer down. Kimi K2.5 was the major new model release. Z.ai shipped GLM-4.7-Flash. 
And several late-December models &#8212; especially GLM-4.7 and MiniMax-M2.1 &#8212; spent January getting benchmarked, integrated, and compared against the closed frontier.</p><p>So January was not about a new best model.</p><p>It was about the rest of the stack getting more useful.</p><h3>January 2026 model releases</h3><p>The corrected release calendar is important here.</p><p><strong>GLM-4.7 and MiniMax-M2.1 were not January releases.</strong> GLM-4.7 was released on December 22, 2025, according to Z.ai&#8217;s release notes, and MiniMax-M2.1 was released on December 23, 2025, according to MiniMax&#8217;s own launch post. They mattered a lot in January, but they should be treated as late-December models that shaped the January rankings, not as January launches.</p><p>The true January model-release story was narrower:</p><p><strong>January 19: Z.ai released GLM-4.7-Flash</strong><br>GLM-4.7-Flash was a lighter, lower-latency version of GLM-4.7, designed as the free-tier model. Z.ai positioned it around coding, reasoning, writing, translation, roleplay, high-throughput use cases, and real-time interaction.</p><p><strong>January 27: Moonshot released Kimi K2.5</strong><br>Kimi K2.5 was the month&#8217;s most important new model. Moonshot described it as a native multimodal, open-source agentic model trained with continued pretraining over roughly 15T mixed visual and text tokens. It also introduced Agent Swarm, where the model can coordinate up to 100 sub-agents and run parallel workflows across up to 1,500 tool calls.</p><p>That is a quieter month than April would later become. But it was still useful. 
January clarified which models were actually worth testing below the top frontier tier.</p><h3>The top tier barely moved</h3><p>The best model available by the end of January was still <strong>GPT-5.2</strong>.</p><p>OpenAI had released GPT-5.2 in December as its most advanced model for professional work and long-running agents, with strong reported results across software engineering, math, abstract reasoning, science, long-context work, spreadsheets, presentations, coding, tool use, and multi-step projects.</p><p><strong>Gemini 3 Pro</strong> was still right behind it. Google released Gemini 3 in November as its most intelligent model, with emphasis on reasoning, multimodality, coding, and agentic workflows.</p><p><strong>Claude Opus 4.5</strong> remained highly relevant, especially for coding agents, tool use, computer use, spreadsheets, and long-running professional work. Anthropic released it in November and described it as its best model for coding, agents, and computer use, with a lower Opus API price of $5 per million input tokens and $25 per million output tokens.</p><p>So the January headline is not &#8220;new model beats GPT.&#8221;</p><p>It is more specific: <strong>the top stayed mostly closed, but the next layer got much more interesting.</strong></p><h3>The best January release: Kimi K2.5</h3><p>Kimi K2.5 was the clear standout January release.</p><p>It did not take the overall AI IQ crown from GPT-5.2. It did not make Gemini 3 Pro irrelevant. It did not erase Opus 4.5&#8217;s agentic-workflow strengths.</p><p>But it did make the open-weight category harder to ignore.</p><p>The model&#8217;s pitch was unusually product-shaped. Kimi K2.5 was not only a text reasoning model with better benchmark results. It was natively multimodal. It could reason over images and video. It could code from visual inputs. It had instant, thinking, agent, and agent-swarm modes. 
And the Agent Swarm system made a serious attempt at scaling agentic work horizontally instead of just asking one model to think longer.</p><p>That last point is the interesting one.</p><p>Most agentic AI still runs as one long loop: plan, call tool, observe, revise, call another tool, and repeat. Kimi K2.5&#8217;s Agent Swarm design instead asks whether some tasks should be decomposed into many parallel subtasks. Moonshot reports that the system can create up to 100 sub-agents and reduce execution time by up to 4.5x compared with a single-agent setup.</p><p>That does not automatically make Kimi K2.5 the best model. It does make it one of the more strategically interesting January releases.</p><p>For high-stakes synthesis, GPT-5.2 still made more sense. For top-tier programmatic reasoning, Gemini 3 Pro remained extremely strong. For long-running coding-agent work, Opus 4.5 still had a strong case.</p><p>But for teams that care about open weights, multimodal agents, cost control, visual coding, and parallelized workflows, Kimi K2.5 became a model worth testing.</p><h3>GLM-4.7-Flash: not a frontier reset, but useful</h3><p>GLM-4.7-Flash was not the most important model of January. Kimi K2.5 was.</p><p>But GLM-4.7-Flash still mattered because it pointed at a different part of the market: high-frequency, lower-latency usage.</p><p>Z.ai described GLM-4.7-Flash as a lightweight and efficient model designed as the free-tier version of GLM-4.7, with strong performance across coding, reasoning, and generative tasks. The release was explicitly framed around low latency, high throughput, writing, translation, roleplay, and real-time use cases.</p><p>That is not glamorous, but it is important.</p><p>A lot of real AI usage does not need the best model. It needs a model that is good enough, fast enough, and cheap enough to run constantly. 
Customer support drafts, internal assistants, low-stakes code edits, document cleanup, translation, lightweight research, routing, summarization, and agent pre-processing do not all need GPT-5.2.</p><p>GLM-4.7-Flash was not competing to be the smartest model in the world. It was competing to be useful at scale.</p><p>That is a different race, and January had more of that than a simple leaderboard would suggest.</p><h3>The late-December models that shaped January</h3><p>The reason January felt busier than its release calendar is that two important late-December releases were still being absorbed by the ecosystem.</p><p><strong>GLM-4.7</strong> was released on December 22, 2025. Z.ai described it as a foundation model with improvements in coding, reasoning, agentic capabilities, long-context understanding, and end-to-end task execution across real-world development workflows.</p><p><strong>MiniMax-M2.1</strong> was released on December 23, 2025. MiniMax emphasized real-world complex tasks, multi-language programming, office workflows, lower token consumption, faster response speed, and better generalization across coding-agent frameworks like Claude Code, Droid, Cline, Kilo Code, Roo Code, and BlackBox.</p><p>These models should not be listed as January launches. But they absolutely belong in a January model analysis, because January was when teams had time to test them against the frontier.</p><p>GLM-4.7 was the stronger general open-weight reasoning and coding story.</p><p>MiniMax-M2.1 was the practical coding-agent story: less about one heroic score, more about multi-language software work, scaffold compatibility, token efficiency, and whether a model behaves well inside actual coding tools.</p><p>That distinction matters. Real coding agents do not live inside one benchmark. 
They live inside harnesses, repos, terminals, IDEs, issue trackers, and messy project context.</p><h3>Updated January model rankings</h3><p>Using the AI IQ January window, the model hierarchy looked roughly like this.</p><h4>Tier 1 / top frontier</h4><p><strong>GPT-5.2</strong><br>The best overall model available by the end of January. Strong across broad reasoning, math, academic work, software engineering, tool use, and professional tasks.</p><p><strong>Gemini 3 Pro</strong><br>Still one of the strongest models overall, and especially strong on programmatic reasoning and multimodal workflows.</p><h4>High-end professional / frontier-adjacent</h4><p><strong>Claude Opus 4.5</strong><br>Not the overall AI IQ leader, but still one of the most important models for coding agents, computer use, long-running tasks, spreadsheets, and professional workflows.</p><p><strong>Kimi K2.5</strong><br>The best true January release. Not top overall, but the most interesting new open-weight agentic model of the month.</p><p><strong>GLM-4.7</strong><br>A late-December release that remained highly relevant in January for open-weight coding, reasoning, and agentic development workflows.</p><p><strong>MiniMax-M2.1</strong><br>Also a late-December release, but important in January for coding-agent workflows, multi-language programming, and practical deployment.</p><p><strong>GLM-4.7-Flash</strong><br>The January speed-and-throughput release. Less important for the top leaderboard, more important for cheap, frequent, lower-latency usage.</p><p>That hierarchy is less clean than a single leaderboard, but it is more useful.</p><p>January was not about replacing GPT-5.2. It was about giving teams more credible models below it.</p><h4>Dimension-by-dimension read</h4><p>AI IQ evaluates models across four cognitive dimensions: Abstract Reasoning, Mathematical Reasoning, Programmatic Reasoning, and Academic Reasoning. 
The composite IQ is the average of those dimensions, with easier or more gameable benchmarks compressed so they cannot dominate the final score.</p><p>That framework is especially helpful for January because the new and recently released models had different shapes.</p><h4>Best overall IQ: GPT-5.2</h4><p>GPT-5.2 was still the best overall model in the January window.</p><p>Its advantage was breadth. It was not only a coding model, math model, or chat model. It remained the safest default for mixed professional tasks: reading dense material, reasoning through tradeoffs, writing code, checking math, working with long context, and producing polished outputs.</p><p>For general high-value work, GPT-5.2 was the default pick.</p><h4>Best January release overall: Kimi K2.5</h4><p>Among true January releases, Kimi K2.5 had the strongest overall story.</p><p>Its benchmark profile was good, but the more interesting thing was the model shape: multimodal input, coding, visual debugging, agentic execution, office productivity, and parallel sub-agent orchestration.</p><p>It was not the best model in the world. But it was the January release most likely to change what serious teams tested.</p><h4>Best abstract reasoning: GPT-5.2</h4><p>GPT-5.2 remained the January-window leader on abstract reasoning.</p><p>That matters because abstract reasoning is one of the harder capabilities to explain away through benchmark familiarity. AI IQ&#8217;s abstract reasoning dimension uses ARC-AGI-2 and ARC-AGI-1, with ARC-AGI-2 treated as the harder, more frontier-discriminating benchmark.</p><p>Kimi K2.5 and GLM-4.7 were useful, but they did not close the gap at the very top.</p><h4>Best mathematical reasoning: GPT-5.2</h4><p>GPT-5.2 was also the math leader in the January window.</p><p>AI IQ&#8217;s math dimension uses FrontierMath Tier 4 and AIME, with AIME compressed because it is easier to saturate and more exposed to contamination. 
That is a better signal than simply asking which model got the highest AIME score.</p><p>Kimi K2.5 was strong for an open-weight January release, but GPT-5.2 still had the broader mathematical reasoning profile.</p><h4>Best programmatic reasoning: Gemini 3 Pro</h4><p>Gemini 3 Pro had the strongest case on programmatic reasoning among models available by the end of January.</p><p>This is one place where &#8220;best overall&#8221; and &#8220;best for a specific task type&#8221; diverge. GPT-5.2 led overall, but Gemini 3 Pro remained extremely competitive on coding-heavy and programmatic tasks.</p><p>AI IQ&#8217;s programmatic dimension is also more useful than a simple SWE-Bench leaderboard because it combines Terminal-Bench 2.0, SWE-Bench Verified, and SciCode, while compressing SWE-Bench due to leakage and gameability concerns.</p><p>Among the January-related models, Kimi K2.5 was the most important new programmatic entrant. GLM-4.7 and MiniMax-M2.1 were also worth testing, especially for teams that cared about open-weight deployment and coding-agent workflows.</p><h4>Best academic reasoning: Gemini 3 Pro, with GPT-5.2 close</h4><p>Academic reasoning was one of the tighter parts of the January top tier.</p><p>Gemini 3 Pro had the strongest case on academic reasoning in the January window, with GPT-5.2 close behind. Both were meaningfully ahead of the true January releases on broad expert-knowledge tasks.</p><p>AI IQ&#8217;s academic reasoning dimension includes Humanity&#8217;s Last Exam, CritPt, and GPQA Diamond, with GPQA compressed because public graduate-level science benchmarks are easier to contaminate than newer expert-screened tests.</p><p>Kimi K2.5 was still impressive, especially with tools. 
But the best academic models were still the late-2025 closed frontier models.</p><h4>Best EQ: GPT-5.2</h4><p>In AI IQ&#8217;s January-window view, GPT-5.2 had the strongest EQ profile.</p><p>That may surprise people who associate Claude with the best conversational feel. But AI IQ&#8217;s EQ score is not just vibes. It combines EQ-Bench 3 Elo and Arena Elo, maps them onto an EQ-like scale, and applies a 200-point EQ-Bench penalty to Anthropic models to correct for Claude-judged family bias.</p><p>That does not mean GPT-5.2 will feel better than Opus 4.5 in every workflow. It does mean GPT-5.2 looked extremely strong on the measured EQ signals AI IQ tracks.</p><h3>The cost-performance story</h3><p>January made the cost-performance conversation more serious.</p><p>GPT-5.2 and Gemini 3 Pro were the better models overall. But they were not automatically the right models for every call in every workflow.</p><p>AI IQ&#8217;s effective-cost metric helps here because it does not stop at sticker price. It starts with the cost of a workload of 2M input tokens and 1M output tokens, then adjusts for token-usage efficiency so token-hungry models are penalized and token-efficient models get credit.</p><p>That changes the model-selection problem.</p><p>For high-stakes reasoning, GPT-5.2 was worth paying for.</p><p>For programmatic reasoning and multimodal coding workflows, Gemini 3 Pro was hard to ignore.</p><p>For long-running coding agents and professional workflows, Opus 4.5 remained an important option.</p><p>For open-weight agentic work, Kimi K2.5 became the model to test.</p><p>For coding-agent experimentation and practical engineering workflows, GLM-4.7 and MiniMax-M2.1 were relevant even though they were December releases.</p><p>For lower-latency, high-frequency tasks, GLM-4.7-Flash pointed toward the cheaper end of the routing stack.</p><p>The better setup was not one model. 
It was routing.</p><h3>The bigger January theme: open-weight models became more practical</h3><p>The main January pattern was not &#8220;China caught OpenAI.&#8221;</p><p>That would be too strong.</p><p>The top was still mostly closed. GPT-5.2 and Gemini 3 Pro were ahead overall, and Opus 4.5 remained one of the most useful professional-agent models.</p><p>The better read is that open-weight and frontier-adjacent models became more practical.</p><p>That is different from being the best.</p><p>A practical infrastructure model needs to be cheap enough, fast enough, available enough, customizable enough, and capable enough. It does not need to win every benchmark. It needs to handle large volumes of useful work without forcing teams to send every intermediate step to the most expensive frontier API.</p><p>Kimi K2.5, GLM-4.7, MiniMax-M2.1, and GLM-4.7-Flash all pointed in that direction.</p><p>Kimi pushed toward open multimodal agents and parallel sub-agent execution.</p><p>GLM pushed toward open-weight coding, reasoning, and faster deployment tiers.</p><p>MiniMax pushed toward practical coding-agent generalization across programming languages, tools, and scaffolds.</p><p>That was the January story.</p><p>The frontier did not move much. The layer underneath it got more usable.</p><h3>What to watch next</h3><p>The first thing to watch is whether Kimi K2.5 gets real adoption inside coding-agent and office-agent products. A model can look good in a launch post and still fail to become part of actual workflows. The real signal will be whether developers and teams route meaningful work through it.</p><p>The second thing to watch is whether Agent Swarm becomes a serious pattern or stays mostly a demo. Parallel agents are compelling, but they introduce coordination overhead, verification problems, and new failure modes. If the pattern works, it could become one of the more important agent architectures of 2026.</p><p>The third thing to watch is GLM-4.7-Flash-style routing. 
The market needs cheap, fast, good-enough models just as much as it needs frontier reasoning models.</p><p>The fourth thing to watch is MiniMax&#8217;s practical coding-agent direction. M2.1 was not a January release, but its emphasis on multi-language work, scaffold generalization, and office workflows was exactly where model evaluation needs to go.</p><p>The fifth thing to watch is the next true frontier release. January did not bring one. The next OpenAI, Anthropic, or Google release could quickly change the top of the AI IQ ranking.</p><h3>Bottom line</h3><p>January 2026 did not give us a new overall champion.</p><p><strong>GPT-5.2 remained the best overall model in AI IQ&#8217;s January-window ranking.</strong></p><p><strong>Gemini 3 Pro remained one of the strongest models overall and had the best case on programmatic reasoning.</strong></p><p><strong>Opus 4.5 stayed important for coding agents, computer use, and long-running professional work.</strong></p><p><strong>Kimi K2.5 was the best true January release.</strong></p><p><strong>GLM-4.7-Flash was the month&#8217;s practical speed-and-throughput release.</strong></p><p><strong>GLM-4.7 and MiniMax-M2.1 were late-December releases, not January releases, but both shaped the January conversation around open-weight coding and agentic workflows.</strong></p><p>The practical takeaway is simple: January made routing more important.</p><p>For the hardest work, pay for the frontier. For coding-heavy, latency-sensitive, open-weight, or high-volume agent workflows, the January and late-December models deserved real evaluation.</p><p>The best model did not change.</p><p>The set of models worth using did.</p>]]></content:encoded></item></channel></rss>