May 2026
A general-purpose model disproved a 1946 conjecture and preview security models chained working exploits, but across math, security, and open source the month's scarce resource was verification, not generation.
A machine produced the proof; humans checked and sharpened it
OpenAI said an internal general-purpose reasoning model disproved a belief that had stood since 1946: that rescaled square-grid constructions are essentially optimal for Erdős’s planar unit-distance problem, the question of how many pairs among n points can lie exactly one unit apart. The model produced an infinite family of configurations beating the known lower bound by a polynomial factor.
What lifted it past a launch post was the verification. External mathematicians checked the argument and wrote a companion paper; Fields medalist Tim Gowers called it “a milestone in AI mathematics,” and Princeton’s Will Sawin supplied an explicit exponent the original lacked, later sharpened to about 0.014 by a group including Noga Alon and Gowers. The route was the surprise: the proof replaced the Gaussian integers in Erdős’s construction with number fields of unbounded degree, reaching a result in elementary geometry through algebraic number theory.
The caveats concern OpenAI’s framing. The model is unnamed, “autonomous” is OpenAI’s word, and authorship is contested: the paper lists no human co-authors, while Gary Marcus calls the work AI-assisted, not AI-generated. Writing at Computational Complexity, Bill Gasarch read it as a possible “perfect storm” of a known target and a clean counterexample, not yet a repeatable capability. Google Research made the adjacent bet, describing in Nature a Gemini-based research assistant that it says reaches expert-level results across genomics, neuroscience, and forecasting; all eight application manuscripts are Google-authored, so replication stays open.
Security: machines chain exploits, humans leak keys
Two vendors reported a preview security model finding real bugs at scale. Cloudflare ran a model it codenames Mythos, attributed to Anthropic, against more than 50 of its repositories, and Mozilla shipped Firefox 150 with fixes for 271 vulnerabilities the same preview surfaced. Cloudflare’s new claim is exploit-chain construction: stitching low-severity primitives into a working proof of concept, then compiling and running it. The lesson it draws is architectural, not model-level. One agent covers about 0.1% of a 100,000-line repository before its context fills, so Cloudflare built a multi-stage harness with roughly 50 parallel hunters and an independent adversarial validator. Both companies sell the fixes they recommend, and the codename leaves the model claims unverifiable.
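The coverage arithmetic and the hunter-plus-validator split can be sketched in a few lines. This is a minimal illustration, not Cloudflare's harness: `make_shards`, `hunt`, `validate`, and `run_harness` are hypothetical names, the stub heuristics stand in for model calls, and disjoint shards are an assumption. Only the 100,000-line repository, the 0.1% per-agent coverage, and the roughly 50 hunters come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

REPO_LINES = 100_000                   # repository size quoted in the text
LINES_PER_AGENT = REPO_LINES // 1000   # 0.1% of the repo per agent context = 100 lines
HUNTERS = 50                           # approximate parallel hunter count from the text


def make_shards(repo_lines, shard_size):
    """Split the line range [0, repo_lines) into disjoint (start, end) shards."""
    return [(s, min(s + shard_size, repo_lines))
            for s in range(0, repo_lines, shard_size)]


def hunt(shard):
    """Hypothetical hunter stub; a real one would prompt a model with this shard."""
    start, _end = shard
    # Stand-in heuristic so the sketch runs: flag every 25th shard as suspicious.
    if (start // LINES_PER_AGENT) % 25 == 0:
        return {"shard": shard, "claim": "possible bug", "evidence": start % 200 == 0}
    return None


def validate(finding):
    """Hypothetical independent validator: keep only findings that carry evidence."""
    return bool(finding["evidence"])


def run_harness():
    """Fan shards out to parallel hunters, then filter candidates through the validator."""
    shards = make_shards(REPO_LINES, LINES_PER_AGENT)
    with ThreadPoolExecutor(max_workers=HUNTERS) as pool:
        candidates = [f for f in pool.map(hunt, shards) if f is not None]
    confirmed = [f for f in candidates if validate(f)]
    return len(shards), len(candidates), len(confirmed)


if __name__ == "__main__":
    n_shards, n_candidates, n_confirmed = run_harness()
    waves = -(-n_shards // HUNTERS)  # ceiling division: passes needed for full coverage
    print(n_shards, "shards,", waves, "waves of", HUNTERS, "hunters")
    print(n_candidates, "candidates,", n_confirmed, "confirmed")
```

Under these assumptions a single agent sees 100 lines, one wave of 50 hunters sees 5% of the repository, and full coverage takes 20 waves. The validator is kept as a separate function that sees only the finding, which mirrors the independence the text describes.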
The aggregate numbers stayed soft. VulnCheck tied year-to-date CVE jumps (Chrome up 563%, GitHub issuance up 476%) to AI bug-hunting, but supplied no baselines and inferred the cause; curl’s Daniel Stenberg found that only one of the five Mythos reports he reviewed was a valid CVE. One AI-assisted finding, an ActiveMQ flaw, reached CISA’s exploited-vulnerabilities list.
The human side failed the same month. OpenAI disclosed that the TanStack npm worm reached two employee devices, forcing rotation of code-signing certificates for four platforms. An attacker cloned about 3,800 of GitHub’s internal repositories through a poisoned VS Code extension, hiding its commands in public commit messages. A CISA contractor published AWS GovCloud keys and an org-wide GitHub key to a public repository with secret-scanning deliberately disabled. The agents themselves became an attack surface: per tl;dr sec, Wiz found impersonation flaws in OpenAI’s and Anthropic’s GitHub Actions, and a reverse-proxy trick fooled Claude Opus 4.7 into firing SQL-injection payloads at a live site it took for localhost. The defensive work was hand-built: OpenAI detailed a Codex sandbox assembled from Windows primitives that offer no clean equivalent to macOS Seatbelt, and Anthropic published how it contains Claude across products.
The bottleneck is review, not capability
May’s clearest structural finding: agents multiply code, not the people who review it. Armin Ronacher published 90 days of tracker data from co-maintaining Pi, an agent built using itself. Of 3,145 external issues and pull requests, 2,504 were auto-closed, and only 60 of 714 auto-closed PRs were eventually merged. His sharper point is qualitative: LLM-written issue reports arrive confident and wrong, and an agent assigned to fix one follows the bad diagnosis as if it were evidence. Addy Osmani named the orchestration tax: parallel agents do not multiply output, because review runs through one serial resource, the developer’s attention. Marc Brooker gave the deepest version, arguing an agent’s ceiling is set by the supply of automatable feedback. That inverts intuition: systems software is easy for agents because specs and tests check it, while UI work is hard because judging it needs slow humans. A RedMonk survey of 86 open-source foundations found about a quarter now ban AI contributions outright.
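The orchestration-tax argument reduces to a throughput ceiling, which a toy model makes concrete. This is a sketch under a simple steady-state assumption, not Osmani's own formulation: `merged_per_day`, `backlog_growth_per_day`, and the rates of 4 pull requests per agent and 10 reviews per day are hypothetical. The tracker counts at the bottom are the ones quoted from Ronacher's data.

```python
def merged_per_day(agents, prs_per_agent_per_day, reviews_per_day):
    """Merged output is capped by the single serial reviewer, not by agent count."""
    generated = agents * prs_per_agent_per_day
    return min(generated, reviews_per_day)


def backlog_growth_per_day(agents, prs_per_agent_per_day, reviews_per_day):
    """Unreviewed pull requests that pile up each day once generation outruns review."""
    generated = agents * prs_per_agent_per_day
    return max(0, generated - reviews_per_day)


def percent(part, whole):
    """Share of `whole` made up by `part`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # Hypothetical rates: each agent opens 4 PRs a day, one developer reviews 10 a day.
    for agents in (1, 2, 3, 10):
        print(agents, merged_per_day(agents, 4, 10), backlog_growth_per_day(agents, 4, 10))

    # Ratios implied by the tracker counts quoted above.
    print(percent(2504, 3145))  # share of external issues and PRs that were auto-closed
    print(percent(60, 714))     # share of auto-closed PRs that were eventually merged
```

With these made-up rates, output stops rising at three agents, and every agent after that adds only to the review queue. The quoted counts work out to roughly 80% of external submissions auto-closed and about 8% of auto-closed PRs later merged.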
Anthropic’s near-trillion valuation, and the honesty turn
Anthropic announced a $65 billion Series H at a $965 billion post-money valuation and disclosed run-rate revenue crossing $47 billion, up from about $14 billion in February; the figures are self-reported, though Simon Willison notes a fundraise is where misstatement becomes securities fraud. Claude Opus 4.8 shipped at unchanged pricing. Its headline is honesty by abstention: on Anthropic’s own eval, the system card reports the lowest incorrect-answer rate among six models, reached mainly by declining uncertain questions. Willison reads the lasting changes as two API mechanics for agent loops, not the benchmark bump. Independent testers at Andon Labs reported regressions on some tasks, and the previewed Dynamic Workflows feature, which spawns hundreds of subagents, shipped with a system-card finding that multi-agent runs reach mediocre solutions twice as fast, not better ones. OpenAI, separately, argued agentic benchmark scores are lower bounds set by the test harness, citing a 59% swing from raising one system’s token budget tenfold; OpenAI both builds the models and authors the proposed norms.
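Why abstention can lower an incorrect-answer rate without raising accuracy is easiest to see with counts. The numbers below are invented for illustration and are not taken from Anthropic's system card; `rates` is a hypothetical helper, and the rates are computed over all questions asked, which is one of several possible conventions.

```python
def rates(total, answered, correct):
    """Return abstained, incorrect, and correct shares of all questions asked."""
    if not 0 <= correct <= answered <= total or total == 0:
        raise ValueError("need 0 <= correct <= answered <= total and total > 0")
    incorrect = answered - correct
    return {
        "abstained": (total - answered) / total,
        "incorrect": incorrect / total,
        "correct": correct / total,
    }


if __name__ == "__main__":
    # Hypothetical model A answers everything: 70 right, 30 wrong.
    always_answers = rates(total=100, answered=100, correct=70)
    # Hypothetical model B declines 25 questions: 68 right, 7 wrong.
    abstainer = rates(total=100, answered=75, correct=68)
    print(always_answers)
    print(abstainer)
```

In this made-up case the abstaining model cuts its incorrect rate from 30% to 7% while answering slightly fewer questions correctly, which is why an incorrect-answer rate is best read alongside the abstention rate.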
Under the hood
The month’s systems writing converged on long-context cost and contention. Sebastian Raschka surveyed the tricks behind recent open-weight models, from Gemma 4’s cross-layer KV sharing to DeepSeek V4’s stability-constrained residual streams, all aimed at keeping long contexts cheap. Cloudflare traced a billing slowdown to a single ClickHouse planning mutex that serialized hundreds of queries after a partition-key change, with the fix upstreamed in version 25.11. Chips and Cheese dissected SPEC CPU2026, the benchmark that vendors will cite for a decade, faulting its decade-old reference machine. And Huawei reframed chip progress around the RC time constant rather than node shrink, a route around the EUV scanners it cannot buy, claiming 1.4nm-equivalent density by 2031 with no third-party benchmarks.