Week of May 25, 2026

The review bottleneck

Armin Ronacher published 90 days of maintainer data on Pi, an AI coding agent he co-maintains, describing a maintenance model under strain. Of 3,145 external issues and pull requests, 2,504 were auto-closed as coming from non-approved contributors; of 714 auto-closed PRs, only 60 were eventually merged, about 8%. His sharper point is qualitative: LLM-written issue reports arrive confident, scope-expanded, and often wrong, and an agent assigned to fix one uses the issue body as evidence rather than rumor, then follows its wrong diagnosis. A custom command tells the agent to derive its own analysis from the code; Ronacher says it only partly helps. Machine-authored fixes, he adds, defend locally against bad states with fallbacks instead of making them impossible. His thesis: AI multiplied code, projects, and issues without adding maintainers or users.

Addy Osmani gives the constraint a name in “The Orchestration Tax”: more coding agents do not multiply output, because reviewing and merging their work runs through one serial resource, the developer’s attention. Fleet size should scale to review rate, he argues, usually low single digits, not to whatever the interface permits. OpenCode co-founder Dax Raad makes the economic version of the argument in a Pragmatic Engineer interview: engineers bank AI gains as time saved, not extra output, so team throughput stays flat while AI mutes the guilt of hacks and quietly grows technical debt. Raad also says Anthropic blocking OpenCode’s Claude Code integration became a growth lever, pushing OpenCode toward OpenAI.
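A toy model makes the serial-reviewer argument concrete. It is a sketch under stated assumptions, not anything from Osmani's post: each agent is assumed to produce a fixed number of changes per day, one developer is assumed to review a fixed number per day, and the function names and figures are hypothetical.

```python
import math


def merged_per_day(agents: int, changes_per_agent: float, reviews_per_day: float) -> float:
    """Changes merged per day when a single reviewer is the only path to merge."""
    produced = agents * changes_per_agent
    # Output is capped by whichever is smaller: what agents produce or what gets reviewed.
    return min(produced, reviews_per_day)


def useful_fleet_size(changes_per_agent: float, reviews_per_day: float) -> int:
    """Smallest fleet that saturates the reviewer; extra agents only grow the backlog."""
    return max(1, math.ceil(reviews_per_day / changes_per_agent))


if __name__ == "__main__":
    # Assumed numbers for illustration: 3 changes per agent per day, 8 reviews per day.
    for n in (1, 2, 3, 10):
        print(n, "agents ->", merged_per_day(n, 3, 8), "merged per day")
    print("useful fleet size:", useful_fleet_size(3, 8))
```

Under these assumed numbers, merged output stops rising once the fleet reaches three agents; adding more only lengthens the review queue.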

Anthropic’s near-trillion raise, and Opus 4.8

Anthropic announced a $65 billion Series H at a $965 billion post-money valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, alongside a claimed $47 billion revenue run-rate; both figures are self-reported and unaudited. It shipped Claude Opus 4.8 at unchanged pricing, $5 and $25 per million input and output tokens respectively, with 1M context, framing the gains as “less laziness,” better calibration, and longer autonomous runs rather than benchmark jumps. The benchmarks it cited are vendor- or tweet-sourced; independent testers at Andon Labs reported regressions on Vending Bench and Blueprint-Bench 2, and noted that maximum reasoning effort was not the best reasoning effort.

Simon Willison’s review flags two durable API changes under the incremental bump: mid-conversation system messages that append instructions without breaking prompt-cache hits in agentic loops, and a prompt-cache minimum cut to 1,024 tokens from 4,096. The honesty claim is both headline and caveat: Anthropic’s system card reports the lowest incorrect-answer rate among six models, achieved mainly by abstaining on uncertain questions, on its own eval with no external methodology. The release also previewed Dynamic Workflows, in which Claude writes an orchestration script and spawns hundreds of parallel subagents. The marquee claim, an unverified port of Bun from Zig to Rust, sits against the system card’s own finding that multi-agent runs reach mediocre solutions twice as fast, not better ones.

AI takes an Erdős scalp

On May 20, OpenAI said an internal model produced a counterexample to Erdős’s 1946 unit-distance conjecture, which held that the number of unit distances among n planar points grows no faster than n^(1+ε) for every ε > 0. The construction exhibits point sets with at least n^(1+ε) unit distances for a fixed positive ε, swapping Erdős’s Gaussian-integer grid for number fields of unbounded degree, tools from algebraic number theory uncommon in combinatorial geometry. Bill Gasarch’s write-up traces the aftermath: a team including Noga Alon, Timothy Gowers, Daniel Litt, and Will Sawin produced a streamlined, readable proof, and Sawin later sharpened the exponent from roughly 6×10⁻³⁸ to 0.014. Authorship is contested: OpenAI’s paper lists no human co-authors, while Gary Marcus calls the work AI-assisted, not AI-generated. Gasarch’s caveat is that a known target, known math, and a clean counterexample may be a perfect storm that recurs rarely.
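For readers who want to see the quantity being counted, here is a minimal brute-force sketch. It counts how often each distance occurs among points of a small integer grid (the Gaussian integers viewed as lattice points) and uses squared distances so the arithmetic stays exact. It illustrates the counting problem only; it does not reproduce the number-field construction, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def distance_counts(points):
    """Map each squared distance to the number of point pairs that realise it."""
    counts = Counter()
    for (x1, y1), (x2, y2) in combinations(points, 2):
        counts[(x1 - x2) ** 2 + (y1 - y2) ** 2] += 1
    return counts


def max_repeated_distance(points):
    """Count of the most frequent distance; rescaling the set makes it the unit distance."""
    return max(distance_counts(points).values())


def square_grid(k):
    """All integer points (x, y) with 0 <= x, y < k."""
    return [(x, y) for x in range(k) for y in range(k)]


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        pts = square_grid(k)
        print(len(pts), "points, most repeated distance occurs", max_repeated_distance(pts), "times")
```

The conjecture concerned how fast the best possible count can grow with n; a fixed small grid like this says nothing about the asymptotics, only what "number of unit distances" means.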

Security: one human leak, one agent attack surface

KrebsOnSecurity reported that a CISA contractor with administrative access published plaintext credentials, including AWS GovCloud keys and an RSA private key granting org-wide GitHub access, to a public personal profile named “Private-CISA,” with GitHub’s secret-scanning deliberately disabled. The repository, a work-home sync scratchpad, dated to November 2025; the most sensitive secrets were exposed around late April 2026. TruffleHog’s Dylan Ayrey showed the leaked app key allowed reading every CISA-IT repository and hijacking its CI/CD pipelines. More than a week after GitGuardian’s notification, CISA was still rotating keys while publicly claiming no sensitive data was compromised. Risky Business’s Adam Boileau called it a failure no technical control could prevent: a contractor on an unmanaged personal account.

The agent-tooling attack surface widened in parallel, per tl;dr sec #330. Wiz’s Shay Berkovich found that OpenAI’s codex-action and Anthropic’s claude-code-action use syntactic permission checks open to trusted-app impersonation, and that verbose modes leak local secret files through logs, across repositories with more than 200,000 combined stars. OFFENSAI’s scopeshift used a reverse proxy and a deceptive tool to present a live site as localhost; Opus 4.7 then fired seven SQL-injection payloads at it, while a one-paragraph safety prompt produced a refusal. Anthropic’s Project Glasswing claimed partners found over 10,000 high or critical vulnerabilities in a month, a count the newsletter flags as model-claimed, not human-triaged.

Hardware: what CPU2026 measures, how silicon adds

Chips and Cheese ran an independent teardown of SPEC CPU2026, successor to the CPU benchmark vendors have cited since 2017. It grows to 52 workloads from 43, and the author faults SPEC’s choice of a slow Ampere eMAG 8180 as the 1.0 reference, which a decade-old FX-8350 beats, against Geekbench 6’s more representative i7-12700 baseline. From single-system GCC runs on Zen 5 and Intel Lion Cove (the Lion Cove sample crashed at 5.7GHz, possibly understating Intel), the suite reads as more core-throughput-bound: higher and tighter IPC, more instruction-side footprint variety, but less branch-prediction and last-level-cache pressure. His complaint is that dropping 505.mcf and 520.omnetpp, the closest CPU2017 proxy for gaming, thinned memory-latency coverage, making CPU2026 an augmentation rather than a clean replacement.

For why that silicon looks as it does, MatX CEO Reiner Pope’s first-principles derivation builds a multiply-accumulate from AND gates and full adders, shows area scaling with the square of bit width (the reason FP4 and FP8 dominate machine learning), and shows register-file plumbing, not arithmetic, consuming most of a conventional core, which systolic arrays and tensor cores exist to avoid. Dwarkesh Patel, who hosts the session, discloses he is an angel investor in MatX.
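The quadratic scaling can be checked with a small gate-level sketch. The code below is an assumption-laden illustration, not Pope's exact construction: it multiplies two unsigned n-bit integers using only AND gates for partial products and ripple-carry full adders for the sums, and tallies how many of each it uses. The layout and the counting convention are choices made here; an accumulate step would add one more adder row.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry


def multiply(a, b, n):
    """Unsigned n-bit multiply; returns (product, and_gates, full_adders)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    acc = [0] * (2 * n)  # running sum, least significant bit first
    and_gates = 0
    adders = 0
    for j in range(n):  # one shifted partial-product row per bit of b
        row = [0] * (2 * n)
        for i in range(n):
            row[i + j] = a_bits[i] & b_bits[j]  # one AND gate per partial-product bit
            and_gates += 1
        carry = 0
        for k in range(j, j + n + 1):  # ripple-carry add of this row into the sum
            acc[k], carry = full_adder(acc[k], row[k], carry)
            adders += 1
    product = sum(bit << k for k, bit in enumerate(acc))
    return product, and_gates, adders


if __name__ == "__main__":
    for n in (4, 8, 16):
        _, ands, adds = multiply(0, 0, n)
        print(f"{n}-bit: {ands} AND gates, {adds} full adders")
```

In this sketch the AND-gate count is n² and the adder count is n(n+1), so halving the bit width cuts the gate budget to roughly a quarter, which is the arithmetic behind the preference for narrow formats.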

Quick hits

SpaceX’s IPO S-1 seeks roughly a $2 trillion valuation on $18.67 billion in revenue and $4.9 billion in losses; Ben Thompson calls the numbers absurd but argues the substance, a Starlink satellite as a ~25kW orbital “rack” for latency-tolerant agentic inference, is plausible.
Benedict Evans argues per-occupation AI job-exposure scores are impossible in principle, with three falsification tests: accounting was automated for a century while accountant headcount rose, the internet killed classified-ad revenue rather than journalism, and no 2005 model would have flagged taxi drivers.
Cisco Talos open-sourced EvidenceForge, which builds synthetic security logs from one canonical event object fanned across 20-plus formats so cross-source fields cannot disagree; an AI authors the scenario, a deterministic script emits the evidence.
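
The single-source-of-truth idea behind the EvidenceForge item can be sketched in a few lines. This is not EvidenceForge's code or API: the event fields, the two output formats, and every name below are hypothetical, chosen only to show why rendering all logs from one canonical object keeps cross-source fields consistent.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginEvent:
    """Hypothetical canonical event; every log line is derived from this one object."""
    timestamp: str
    user: str
    src_ip: str
    host: str
    success: bool


def to_auth_log(e: LoginEvent) -> str:
    """Render as a syslog-style authentication line (illustrative format)."""
    outcome = "Accepted" if e.success else "Failed"
    return f"{e.timestamp} {e.host} sshd: {outcome} password for {e.user} from {e.src_ip}"


def to_json_record(e: LoginEvent) -> str:
    """Render as a JSON record (illustrative schema)."""
    return json.dumps(
        {
            "ts": e.timestamp,
            "user": e.user,
            "source_ip": e.src_ip,
            "host": e.host,
            "outcome": "success" if e.success else "failure",
        },
        sort_keys=True,
    )


RENDERERS = {"auth_log": to_auth_log, "json": to_json_record}


def fan_out(e: LoginEvent) -> dict:
    """Deterministically emit every format from the same event."""
    return {name: render(e) for name, render in RENDERERS.items()}


if __name__ == "__main__":
    event = LoginEvent("2026-05-25T12:00:00Z", "alice", "203.0.113.7", "web01", False)
    for name, line in fan_out(event).items():
        print(name, "->", line)
```

Because each renderer reads the same frozen object, the user, address, and outcome cannot drift between formats; adding a format means adding one function to the table.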