Week of May 18, 2026

At the research frontier

OpenAI says an internal general-purpose reasoning model disproved a decades-old belief about Paul Erdős’s 1946 unit-distance problem: among n points in the plane, how many pairs can lie exactly one unit apart. Mathematicians had treated rescaled square-grid constructions as essentially optimal for the known lower bound, which has stood since 1946. The model produced an infinite family of configurations that beats it by a polynomial factor.
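
For orientation, the bounds in play can be written out. This is a sketch in standard notation, not the paper's statement: u(n) for the maximum number of unit-distance pairs among n points, and the positive constants c and δ, are labels assumed here, and the announcement as summarized above gives no value for δ.

```latex
\begin{align*}
  u(n) &\ge n^{\,1 + c/\log\log n}
       && \text{Erd\H{o}s (1946), rescaled square grid} \\
  u(n) &= O\!\left(n^{4/3}\right)
       && \text{Spencer, Szemer\'edi, Trotter (1984)} \\
  u(n) &\ge n^{\,1 + \delta} \quad \text{for infinitely many } n
       && \text{shape of a polynomial-factor improvement}
\end{align*}
```

Because c/log log n tends to zero, the grid bound is n^{1+o(1)}, so any fixed δ > 0 in the third line beats it by a factor that grows polynomially in n.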

The proof reaches outside geometry. It replaces the Gaussian integers in Erdős’s construction with richer algebraic number fields, using infinite class field towers and Golod–Shafarevich theory. External mathematicians checked the argument and wrote a companion paper. Tim Gowers, a Fields medalist, calls it “a milestone in AI mathematics.” Princeton’s Will Sawin then refined the result to give an explicit exponent the original lacked, so a human materially extended the machine’s work.

The claims warrant the usual discount: this is an OpenAI announcement about an unnamed model, and “autonomous” is the company’s word. The corroboration is what lifts it above a launch post: an independent companion paper and named mathematicians on record. Thomas Bloom, quoted endorsing the result’s originality, still offers only a “moderated yes” on how much it advances understanding. What sets it apart from AlphaProof and other scaffolded provers is that the model was general-purpose, not built or prompted for the problem.

Google Research made an adjacent bet. It described Empirical Research Assistance (ERA) in Nature on May 19: a Gemini-based system that searches the literature, writes code, and runs a tree search over thousands of candidate solutions against a stated success metric. Google says it reaches expert-level performance on benchmarks across genomics, public health, satellite imagery, neuroscience, and forecasting; the blog gives no numbers, leaving the peer-reviewed paper to carry the claims. ERA and AlphaEvolve now power Computational Discovery, rolling out to trusted testers through Gemini for Science. All eight application manuscripts Google cites are internally authored, so independent replication is the open question.
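
The search loop described above can be pictured with a minimal best-first sketch. It is not Google's implementation: `tree_search`, `toy_score`, and `toy_expand` are hypothetical names, and the toy "solution" is a pair of numbers standing in for a generated program scored against a stated metric.

```python
import heapq
import random


def tree_search(root, score, expand, budget=200):
    """Best-first search: repeatedly expand the highest-scoring candidate."""
    best, best_score = root, score(root)
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # so that candidates themselves are never compared.
    frontier = [(-best_score, 0, root)]
    counter = 1
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            child_score = score(child)
            if child_score > best_score:
                best, best_score = child, child_score
            heapq.heappush(frontier, (-child_score, counter, child))
            counter += 1
    return best, best_score


# Toy stand-ins (hypothetical): the metric rewards closeness to (3, -2).
def toy_score(candidate):
    x, y = candidate
    return -((x - 3.0) ** 2 + (y + 2.0) ** 2)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def toy_expand(candidate):
    x, y = candidate
    return [(x + rng.uniform(-1, 1), y + rng.uniform(-1, 1)) for _ in range(4)]


best, best_score = tree_search((0.0, 0.0), toy_score, toy_expand)
```

The point of the sketch is the dependency it exposes: the search can only be as good as the metric passed in as `score`.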

Security: machines find bugs, humans leak keys

Cloudflare reported on Project Glasswing, its test of an Anthropic preview frontier security model codenamed “Mythos” against more than 50 of its own repositories. The new capability over prior models, Cloudflare says, is exploit-chain construction: stitching low-severity primitives into a working proof of concept, from use-after-free to arbitrary read/write to ROP, then compiling and running the code and iterating on failures. Other frontier models found the same bugs but stopped short of chaining them.

The durable lesson is architectural. A single coding agent covers roughly 0.1% of a 100,000-line repository before its context fills, so Cloudflare built a multi-stage harness: reconnaissance, about 50 concurrent hunters, an independent adversarial validator running a different model that cannot itself emit findings, cross-repo reachability tracing, and a schema-validated report. Because the model is known only by a codename, none of the claims about it can be independently verified. Two findings cut against the grain. The preview, shipped without its GA safeguards, refused tasks inconsistently: the same request was accepted or declined depending on how it was framed, which Cloudflare cites as evidence that emergent guardrails cannot serve as a safety boundary. And on the race toward sub-two-hour CVE-to-patch times, it argues speed is a trap: letting the model write patches broke dependent code, and the better defense is architecture that limits blast radius. Cloudflare both ran the study and sells the mitigations it recommends.
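
The shape of such a harness can be sketched in a few lines. Everything here is a hypothetical stand-in, not Cloudflare's code: `recon`, `hunt`, `validate`, and `schema_ok` are stubs, the "hunter" is a string match, not a model, and cross-repo reachability tracing is left out. The one property the sketch tries to preserve is that the validator returns only accept or reject and has no way to add findings.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    summary: str
    severity: str


ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def recon(repo):
    # Stage 1 (stub): list the units of work, one per file.
    return sorted(repo)


def hunt(repo, path):
    # Stage 2 (stub): a "hunter" flags any file that calls strcpy.
    if "strcpy(" in repo[path]:
        return [Finding(path, "unbounded strcpy", "high")]
    return []


def validate(repo, finding):
    # Stage 3 (stub): independent check that returns only True or False.
    return finding.file in repo and "strcpy(" in repo[finding.file]


def schema_ok(finding):
    # Stage 4: reject reports that do not match the expected shape.
    return bool(finding.file) and bool(finding.summary) and (
        finding.severity in ALLOWED_SEVERITIES
    )


def run_pipeline(repo, max_hunters=50):
    targets = recon(repo)
    with ThreadPoolExecutor(max_workers=max_hunters) as pool:
        batches = list(pool.map(lambda path: hunt(repo, path), targets))
    candidates = [finding for batch in batches for finding in batch]
    return [f for f in candidates if validate(repo, f) and schema_ok(f)]


toy_repo = {"a.c": "strcpy(dst, src);", "b.c": "memcpy(dst, src, n);"}
report = run_pipeline(toy_repo)
```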

Two breaches the same week made the developer endpoint the story. Sophos X-Ops reports that an attacker compromised a GitHub employee’s machine through a poisoned Nx Console VS Code extension, harvested IDE-resident tokens, and cloned about 3,800 of GitHub’s internal private repositories, later listed for sale above $50,000. GitHub says customer repositories, enterprise accounts, and user data are unaffected, and that it detected the intrusion on May 19, rotated secrets, and isolated endpoints; those are GitHub’s assertions, relayed by Sophos and The Hacker News. The novel detail is the command channel: the recovered backdoor, cat.py, polls GitHub’s own commit-search API hourly for a keyword and runs RSA-signed commands hidden in public commit messages.

The hygiene lesson repeated at CISA. KrebsOnSecurity reported, on a tip from GitGuardian, that a CISA contractor kept a public GitHub repository exposing admin keys to three AWS GovCloud accounts, plaintext password files, and tokens, with push-protection secret scanning reportedly disabled. CISA says it found no indication sensitive data was compromised, an unverified self-assessment; the exposed keys reportedly stayed valid about 48 hours after takedown.
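
As an illustration of what push-time secret scanning catches, here is a minimal sketch of a single-pattern check. It assumes only the documented shape of long-term AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits); `find_aws_key_ids` is a hypothetical helper, and real scanners cover many more credential types and also verify matches.

```python
import re

# Long-term AWS access key IDs: "AKIA" plus 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_aws_key_ids(text):
    """Return every substring that looks like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\nregion = us-gov-west-1\n"
hits = find_aws_key_ids(sample)
```

A check like this is cheap enough to run on every push, which is why a disabled scanner stands out in the account above.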

What makes an agent capable

Marc Brooker, a senior principal engineer at AWS, offered a model for where coding agents win: treat the agent as a feedback loop wrapped around a flawed open-loop component, the LLM, the way control theory turns an analog multiplier into a square-rooter by closing a loop around it. His “feedback loop hypothesis” holds that an agent’s ceiling is set by the availability of accurate, automatable feedback, not by raw model quality. From that he draws an inversion of the usual intuition: SaaS and UI work, often called easy, is hard for agents because judging it needs slow, inconsistent humans, while systems software is comparatively easy because specs, APIs, and safety and liveness properties supply feedback a machine can check. He expects specification and verification tooling to gain value: Rust, Verus, TLA+, P, property-based testing, simulators. It is a reasoned hypothesis backed by observation, not data.
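
The control-theory analogy can be made concrete with a small sketch. It is an illustration of the general idea, not Brooker's example: `closed_loop_sqrt` is a hypothetical name, the only "component" the loop may use is multiplication, and the gain is a conservative choice assumed here to keep the loop stable.

```python
def closed_loop_sqrt(target, steps=500):
    """Approximate sqrt(target) using only a multiplier inside a feedback loop."""
    if target < 0:
        raise ValueError("target must be non-negative")
    # Conservative gain (an assumption of this sketch) so the loop does not
    # overshoot; larger targets need more steps to settle.
    gain = 0.25 / max(1.0, target)
    y = 1.0
    for _ in range(steps):
        error = target - y * y  # feedback: compare the multiplier's output to the goal
        y += gain * error       # nudge the input in the direction that shrinks the error
    return y


root_of_nine = closed_loop_sqrt(9.0)
root_of_two = closed_loop_sqrt(2.0)
```

The loop never computes a square root directly. It only checks whether squaring the current guess lands on the target, which is the sense in which cheap, accurate feedback does the work.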

Eric Jang, formerly VP of AI at 1X and before that at Google DeepMind, reached an adjacent point on Dwarkesh Patel’s podcast while discussing a from-scratch rebuild of AlphaGo. Monte Carlo Tree Search hands the learner a better action at every move, while policy-gradient RL must work out which of more than 100,000 tokens in a trajectory earned the reward. That is the credit-assignment problem that makes LLM RL so sample-inefficient. He notes KataGo cut the compute to train a strong Go bot from scratch about 40-fold in 2020, and that rebuilding a 2017-frontier system is now a few-thousand-dollar solo project.

Microsoft Research pushed the other end of the size curve. Its MagenticLite release runs an on-device agent on two small purpose-built models: MagenticBrain, a 14B orchestrator fine-tuned from Qwen 3 14B and trained inside the same harness it runs in, and Fara1.5, a 4B-to-27B computer-use family that Microsoft says leads small computer-use models on the 300-task Online-Mind2Web benchmark, with the 27B variant above 90%. The bet is that agentic capability comes from tool orchestration, not stored knowledge, so codesigning data, training, and harness lets small models drive capable agents. The benchmark numbers are self-reported with no methodology disclosed.

Under the hood of open weights

Sebastian Raschka surveyed the transformer-block changes behind the latest open-weight releases, all aimed at cutting long-context cost as reasoning and agent workloads keep more tokens live. Gemma 4’s small E2B and E4B variants share KV across layers, with later layers reusing a recent same-type layer’s keys and values, and store extra capacity in cheap per-layer embedding tables so “effective” parameters stay low; Raschka notes Google published no ablation against a plain dense model of the same size. Poolside’s Laguna XS.2 budgets attention per layer, giving cheap sliding-window layers more query heads and expensive global layers fewer. Zyphra’s ZAYA1-8B, trained on AMD GPUs, runs attention in a compressed latent space. DeepSeek V4 adds manifold-constrained hyper-connections, several stability-constrained residual streams at about 6.7% training overhead. The efficiency and quality claims are each lab’s own.
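
To see why sharing keys and values across layers matters at long context, a back-of-the-envelope sketch helps. The configuration below is hypothetical and does not describe Gemma 4 or any other named model; `kv_cache_bytes` is an illustrative helper that counts only the layers that store their own keys and values.

```python
def kv_cache_bytes(layers_storing_kv, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor 2: each storing layer keeps one key and one value per position.
    per_layer = 2 * kv_heads * head_dim * seq_len * bytes_per_value
    return layers_storing_kv * per_layer


# Hypothetical 32-layer model, 8 KV heads of width 128, 131,072-token context,
# 16-bit values.
full = kv_cache_bytes(32, kv_heads=8, head_dim=128, seq_len=131_072)
# Same model if every second layer reuses an earlier layer's keys and values.
shared = kv_cache_bytes(16, kv_heads=8, head_dim=128, seq_len=131_072)

GIB = 2 ** 30
summary = (full // GIB, shared // GIB)  # (16, 8)
```

Under these assumed numbers the cache drops from 16 GiB to 8 GiB per sequence. The cost grows linearly with context length, so the saving grows with it.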

Open weights also revived an old idea. Sean Goedecke argues that activation steering, nudging a model’s internal activations mid-inference, becomes practical now that local models like DeepSeek-V4-Flash are strong enough to be worth steering, after antirez baked it into a stripped-down llama.cpp fork as a first-class feature. Goedecke doubts the ambitious version pays off: a steering vector for “intelligence” or “knowledge of my codebase” would approximate the full weight set, collapsing back into “just train the model.” The sharpest counterpoint came from commenters, including antirez: steering can strip a model’s trained-in refusals in ways prompting cannot, and it degrades capability less than editing weights because it applies only when invoked.
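
The mechanics are simple enough to sketch without a model. The example below assumes one common recipe, in which a steering vector is the difference between mean activations on two contrasting prompt sets and is then added to a hidden state at inference time. The function names are hypothetical, the vectors are toy two-dimensional lists, and nothing here reflects the llama.cpp fork's actual interface.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    positive = mean_vector(positive_acts)
    negative = mean_vector(negative_acts)
    return [p - n for p, n in zip(positive, negative)]


def steer(hidden, vector, strength):
    """Add the scaled steering vector to one hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical) recorded at one layer.
direction = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
nudged = steer([0.0, 0.0], direction, strength=0.5)
untouched = steer([0.5, -0.5], direction, strength=0.0)
```

With `strength=0.0` the hidden state passes through unchanged, which matches the commenters' point that steering applies only when invoked.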

Quick hits

Daniel Lemire and Jaël Champagne Gareau show a 64-bit integer can be turned into eight ASCII digits in about two AVX-512 IFMA instructions, a peer-reviewed 1.4–2× gain over the best existing routines and 2–4× over std::to_chars on a serialization hot path; it needs recent AVX-512 IFMA hardware.
Google launched Gemini 3.5 Flash at I/O 2026 (generally available, 1M-token context). Independent Artificial Analysis confirms a strong speed-and-intelligence position but measures it at 5.5× the cost of Gemini 3 Flash and 75% more than Gemini 3.1 Pro to run their suite, undercutting the cheap-Flash branding.
The same digest reports that Andrej Karpathy has joined Anthropic.
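
For the integer-to-digits item above, a scalar sketch shows the task the SIMD routine speeds up. This is not the Lemire and Champagne Gareau method and uses no AVX-512: it is a plain divide-and-split baseline, `to_eight_digits` is a hypothetical name, and it assumes the input is below 10^8 so that eight digits suffice.

```python
def to_eight_digits(value):
    """Render 0 <= value < 10**8 as exactly eight zero-padded ASCII digits."""
    if not 0 <= value < 10 ** 8:
        raise ValueError("value must fit in eight decimal digits")
    high, low = divmod(value, 10_000)                 # two 4-digit halves
    pairs = (*divmod(high, 100), *divmod(low, 100))   # four 2-digit groups
    out = bytearray()
    for pair in pairs:
        tens, ones = divmod(pair, 10)
        out += bytes((ord("0") + tens, ord("0") + ones))
    return bytes(out)


encoded = to_eight_digits(12345678)
padded = to_eight_digits(42)
```

The halving structure (8 digits, then 4, then 2, then 1) is what makes the problem a natural fit for data-parallel hardware, since the splits at each level are independent of one another.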