Q2 2026
Q2 2026 was the quarter capability got cheap and verification became the scarce resource: agents that generate outran the humans and benchmarks that check, governance reached the model layer, the stack went vertical, and the money pulled away from the trust.
The quarter generation outran verification
If Q2 had one through-line, it was the widening gap between making and checking. In May the constraint was named from several directions at once. Armin Ronacher’s 90 days of data from the agent-built project Pi showed agents producing far more than could be reviewed: 2,504 of 3,145 external issues and pull requests were auto-closed, and only 60 of the 714 auto-closed PRs ever merged. Addy Osmani called the effect the orchestration tax, since parallel agents still funnel through one serial reviewer. Marc Brooker gave the idea its sharpest form, arguing that an agent’s ceiling is set by the supply of automatable feedback, which is why systems code is easy for agents and interface work is hard. By June the gap had moved into security. A startup’s agent found 21 zero-day vulnerabilities in FFmpeg for about $1,000, and a widely shared incident report showed a malicious package clearing seven AI security gates that spent $1.7 million in inference between them and were defeated only because every gate ran the same model behind a different prompt. Generation compounded all quarter; verification stayed serial, costly, and human. The gap also vindicated the benchmark skeptics: Agents’ Last Exam reported a 2.6% full-pass rate on its hardest tier, and an audit found 16% of terminal-agent tasks passable without being solved.
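The two mechanisms in this section can be put in rough numbers. What follows is a minimal sketch, not anything taken from the sources cited: the function names, the daily rates, and the 10% per-gate miss rate are all hypothetical, chosen only to show the shape of the argument. Only the Pi counts come from the figures quoted above.

```python
def review_backlog(agents: int, items_per_agent: float,
                   reviewer_capacity: float, days: int) -> float:
    """Items left unreviewed after `days` when parallel agents feed one
    serial reviewer. All rates are per day and purely illustrative."""
    produced_per_day = agents * items_per_agent
    # The reviewer clears at most `reviewer_capacity` items a day;
    # anything produced beyond that accumulates as backlog.
    surplus_per_day = max(0.0, produced_per_day - reviewer_capacity)
    return surplus_per_day * days


def escape_probability(miss_rate: float, gates: int, correlated: bool) -> float:
    """Chance a malicious package clears every gate.

    correlated=True is the simplifying assumption that gates sharing one
    model fail together, so the stack is no stronger than a single gate.
    correlated=False assumes fully independent gates.
    """
    if correlated:
        return miss_rate
    return miss_rate ** gates


# Shares computed from the Pi figures quoted above.
auto_close_share = 2504 / 3145   # about 0.80 of external items auto-closed
merge_share = 60 / 714           # about 0.08 of auto-closed PRs later merged

# Hypothetical scenario: 10 agents, 5 items each per day, and a reviewer
# who can check 20 items a day, over 30 days.
backlog = review_backlog(10, 5, 20, 30)

# Hypothetical 10% miss rate per gate, seven gates.
independent = escape_probability(0.1, 7, correlated=False)
same_model = escape_probability(0.1, 7, correlated=True)
```

Under these assumed numbers, adding agents only grows the backlog once production passes the reviewer's capacity, and seven gates built on one model offer roughly the protection of one gate. Real gates are neither perfectly correlated nor perfectly independent, so both figures are bounds rather than estimates.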
Governance arrived at the model layer
Q2 is when AI regulation stopped being about chips and became about deployed models and the data they touch. The inflection was June 12, when the US government ordered Anthropic to suspend Fable 5 and Mythos 5 for all foreign nationals over a code-auditing jailbreak, forcing a worldwide takedown on verbal evidence with no written basis. It was the first use of export-control authority against a named, deployed model, and by July 1 it had ended not with a reversal but with a jailbreak-severity framework and a standing government channel. The courts moved in parallel. The Supreme Court held that geofence warrants need Fourth Amendment protection and, in a separate ruling striking the FTC’s independence, knocked a leg from under the EU-US data-transfer framework; a Munich court treated Google’s AI Overviews as Google’s own speech, and so as speech it is liable for; China forced Meta to unwind its Manus acquisition; and Dario Amodei argued for FAA-style oversight a day before his own company’s models were recalled. The instruments were blunt and the precedents durable. The quarter established that a state can switch a model off, and that a synthesized answer can carry a publisher’s liability.
The stack went vertical, the floor rose to meet it
Underneath, cost was the organizing pressure, and it pushed in two directions. The largest buyers integrated downward. AWS made flat random-graph datacenter networks its global default, claiming 69% fewer routers and 40% less network power; OpenAI unveiled its first inference chip, built with Broadcom, joining Google’s TPUs, Amazon’s Trainium, and Microsoft’s MAIA; NVIDIA pushed 45°C liquid cooling toward near-zero water use; and Huawei reframed chip progress around the RC time constant to route around the EUV tools it cannot buy. At the same time the open-weight floor rose sharply: by late June an MIT-licensed GLM 5.2 beat Claude Code on a security benchmark at roughly a sixth of the cost, and open agentic coders like Ornith-1.0 shipped at frontier scale. The distance between the best closed model and the best open one, measured in capability, kept shrinking; measured in price, it stayed a chasm in the open model’s favor.
The money and the trust moved apart
The capital swelled as the claims grew harder to check. Anthropic raised a $65 billion Series H at a $965 billion valuation and reported run-rate revenue crossing $47 billion, up from about $14 billion in February; Alphabet moved to raise around $80 billion, which Ben Thompson read as its turn into a capital-intensive buyer of compute. Against that, the quarter’s honesty grew selective. Claude Opus 4.8 posted the lowest incorrect-answer rate of six models, on its own eval, mainly by declining uncertain questions; OpenAI argued that agentic benchmark scores are lower bounds set by the harness, having watched one score swing 59% with a tenfold change in token budget; and nearly every capability figure arrived self-reported. The reader’s task by June was less to track new results than to discount them: to separate what a company measured from what it claimed, and a demo from a deployment.
What aged well, what aged badly
Hindsight sorts the quarter cleanly. Brooker’s thesis, that automatable feedback is the real ceiling, proved the most predictive idea of the period, explaining both the security agents’ success on checkable bugs and the coding agents’ failure on GameCraft-Bench, where the best agent finished 41% because a playable game cannot be unit-tested into coherence. The benchmark skeptics aged well; the multi-agent-advantage enthusiasts did not, as paper after paper, along with Anthropic’s own Dynamic Workflows, found that swarms reach mediocre answers faster rather than better ones. The quarter’s boldest framing, an AI that “autonomously” disproved a 1946 conjecture, aged into an asterisk: real mathematics, contested authorship, and, in Bill Gasarch’s reading, possibly a perfect storm around a clean target rather than a repeatable method. Two questions carry into Q3: whether the export-control precedent hardens into a standard that, applied consistently, would halt frontier deployment, as Anthropic warns; and whether verification, the quarter’s scarce resource, finds a way to scale that does not end at a single tired human at the bottom of the pipe.