Week of May 11, 2026

AI and the vulnerability curve

VulnCheck, a vulnerability-intelligence vendor, argues that year-to-date CVE disclosures have surged across major suppliers and attributes the jump to AI-assisted bug-hunting. Its post lists year-to-date increases of 563% for Chrome, 476% for CVEs issued by GitHub, 181% for VMware, 170% for Apache, and 157% for Mozilla. The figures carry no absolute baselines, and the link to AI is inferred, not measured.

The narrative centers on Anthropic’s April 7 announcement of “Project Glasswing” and a “Claude Mythos” preview, which VulnCheck says claimed thousands of zero-days across major operating systems and browsers, restricted access to a vendor coalition, and came with a $1.5M donation to the Apache Software Foundation. Those product names appear only in the post and are not independently verified. GitHub’s Madison Oliver Ficorilli is quoted saying no single reporter accounts for more than about 3% of volume and no single project more than about 7%, which VulnCheck reads as a systemic shift rather than one tool’s output.

The counterweight comes from curl. Maintainer Daniel Stenberg says only one of five Mythos-“confirmed” vulnerabilities in curl held up as a real CVE; the rest were false positives or non-security bugs. On the other side, an AI-assisted finding became ActiveMQ’s CVE-2026-34197, credited “80% Claude, 20% human,” now on CISA’s Known Exploited Vulnerabilities list and exploited in the wild. The post is a vendor blog that ends in a product pitch; its one actionable claim is narrow: if volumes stay high, prioritize by evidence of exploitation, not by disclosure date.
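As a rough illustration of that last point, the sketch below orders a backlog so that findings with exploitation evidence come first and disclosure date only breaks ties. The Finding type, the known_exploited flag, and the entries are hypothetical placeholders, not data from the post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Finding:
    finding_id: str
    disclosed: date
    known_exploited: bool  # assumed flag, e.g. "appears on an exploited-in-the-wild list"


def triage_order(findings):
    """Exploited findings first; within each group, oldest disclosure first."""
    return sorted(findings, key=lambda f: (not f.known_exploited, f.disclosed))


# Hypothetical backlog: IDs and dates are placeholders, not real CVEs.
backlog = [
    Finding("EXAMPLE-A", date(2026, 5, 10), known_exploited=False),
    Finding("EXAMPLE-B", date(2026, 3, 2), known_exploited=True),
    Finding("EXAMPLE-C", date(2026, 1, 15), known_exploited=False),
    Finding("EXAMPLE-D", date(2026, 4, 20), known_exploited=True),
]

ordered_ids = [f.finding_id for f in triage_order(backlog)]
print(ordered_ids)
```

A date-only sort would put EXAMPLE-C at the top; here the two exploited entries move ahead of it regardless of age.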

The supply-chain risk turned concrete the same week. OpenAI disclosed that on May 11 the widely used TanStack npm library was compromised as part of the “Mini Shai-Hulud” worm, infecting two employee devices that predated its rollout of package controls. The affected internal repositories held code-signing certificates for iOS, macOS, Windows, and Android, so OpenAI rotated all of them; macOS users must update ChatGPT Desktop, the Codex app and CLI, and Atlas by June 26, a deadline pushed from June 12 in coordination with Apple. OpenAI says only limited credential material left its network, with no customer data, intellectual property, or production systems affected and no observed misuse, all self-reported. The control it says would have stopped the worm, an npm minimumReleaseAge delay plus CI/CD credential isolation, existed after an earlier Axios incident but had not reached the two devices.
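The idea behind a minimum-release-age delay can be sketched in a few lines: refuse any package version published more recently than some cooling-off window, on the assumption that a malicious release is often detected and pulled within that window. This is a conceptual sketch only, not how any package manager implements the setting; the function name, the three-day window, and the version data are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def eligible_versions(published, now, min_age):
    """Return versions whose publish time is at least `min_age` before `now`."""
    cutoff = now - min_age
    return [version for version, ts in published.items() if ts <= cutoff]


# Hypothetical publish times for one package.
now = datetime(2026, 5, 12, tzinfo=timezone.utc)
published = {
    "1.0.0": datetime(2026, 4, 1, tzinfo=timezone.utc),
    "1.0.1": datetime(2026, 5, 11, 18, 0, tzinfo=timezone.utc),  # only hours old
}

allowed = eligible_versions(published, now, timedelta(days=3))
print(allowed)
```

The hours-old release is filtered out, so an install resolved at this moment would fall back to the older version.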

Sandboxing autonomous agents

If models now write code and find its bugs at machine speed, the containment is still hand-built. An OpenAI engineer details how the Codex team built filesystem and network isolation for its coding agent on Windows, which ships no equivalent to macOS Seatbelt or Linux seccomp and bubblewrap. The team rejected AppContainer, Windows Sandbox, and Mandatory Integrity Control as wrong-shaped for open-ended development. An unelevated prototype scoped file writes with a synthetic SID and a write-restricted token, but its network suppression was only advisory: poisoned proxy environment variables that any program opening a socket directly could ignore. The redesign runs agent commands as dedicated local users, blocks network traffic for the offline user with a Windows Firewall rule, and works around a privilege wall at CreateProcessAsUserW using a separate runner that mints the restricted token itself. The result spans four binaries, and the safety claims are the author’s, with no red-team results shown.
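Why proxy variables are only advisory is easy to demonstrate. The sketch below points the proxy variables at a dead address, confirms that a proxy-aware client would see that setting, and then opens a raw socket that never consults it. It is an illustration of the general weakness, not a reconstruction of the Codex prototype; the dead proxy address is arbitrary and a loopback listener stands in for the network.

```python
import os
import socket
import threading
import urllib.request

# "Poison" the proxy variables so well-behaved HTTP clients are sent to a dead port.
DEAD_PROXY = "http://127.0.0.1:9"
os.environ["http_proxy"] = DEAD_PROXY
os.environ["https_proxy"] = DEAD_PROXY

# A client that honors the environment sees the poisoned setting...
proxies_seen = urllib.request.getproxies_environment()

# ...but a raw socket never looks at it. The loopback listener plays "the network".
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]


def serve_once():
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"reached")


server = threading.Thread(target=serve_once, daemon=True)
server.start()

with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    reply = client.recv(16)

server.join(timeout=5)
listener.close()
print(proxies_seen.get("https"), reply)
```

The direct connection succeeds even though every proxy variable points nowhere, which is why enforcement has to sit below the process, for example in a firewall rule.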

Systems internals

Cloudflare published the week’s most complete debugging story. After a January 2025 migration changed the partition key of a shared ClickHouse analytics table from (day) to (namespace, day), billing aggregation slowed even though no query read more data. The team found that query duration tracked total part count, which grew from 30,000 toward 160,000 per replica. The diagnosis hinged on a switch from CPU flame graphs, which sample only running threads, to “Real” traces that capture waiting ones: over half of each query’s duration was spent waiting on one mutex during planning, where every thread took an exclusive lock, copied the entire parts vector, then filtered it. Three fixes stabilized latency: a shared lock, a deferred shared copy of the parts list, and a binary search over the namespace-prefixed partition ID. The first two shipped upstream in ClickHouse 25.11 as PR #85535. The authors note the binary search does not generalize to namespace IN (…) filters and that ZooKeeper metadata bloat remains unsolved.
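The third fix lends itself to a small sketch. If the parts list is kept sorted by a partition ID that begins with the namespace, the parts for one namespace form a contiguous run, and two binary searches find its bounds without scanning the whole list. The (namespace, day) tuples below are a stand-in for illustration; ClickHouse's actual partition ID encoding and data structures are not shown in this summary, and the function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical parts list, sorted by (namespace, day). Real partition IDs differ.
parts = sorted([
    ("billing", 20250101), ("billing", 20250102), ("billing", 20250103),
    ("dns", 20250101), ("dns", 20250102),
    ("waf", 20250101),
])


def parts_linear(parts, namespace):
    """Baseline: inspect every part, O(n) per query."""
    return [p for p in parts if p[0] == namespace]


def parts_bisect(parts, namespace):
    """Two binary searches bound the contiguous run for one namespace, O(log n)."""
    # (namespace,) sorts before every (namespace, day) tuple.
    lo = bisect_left(parts, (namespace,))
    # namespace + "\0" is the smallest string greater than namespace,
    # so this finds the start of the next namespace's run.
    hi = bisect_left(parts, (namespace + "\0",))
    return parts[lo:hi]


dns_parts = parts_bisect(parts, "dns")
print(dns_parts)
```

With part counts in the tens or hundreds of thousands per replica, the difference between a full scan and a logarithmic lookup is paid on every query during planning.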

Microsoft Research’s RiSE group published a fast-path walkthrough of mimalloc, its roughly 12,000-line drop-in malloc replacement. Each thread owns pages of about 64 KiB, split into size classes, so most allocations and frees need no synchronization; only cross-thread frees take an atomic compare-and-swap, and three free lists per page make even those rarely contended. The v3 design replaces an older pointer-alignment trick with an on-demand whole-memory map that catches invalid pointers. Microsoft reports mimalloc as the allocator for NoGIL CPython 3.13+, Unreal Engine, and Bing, and cites an 800-thread benchmark where it committed 1.3x its live data against about 4x for a competitor it does not name; the promised technical report is still pending.
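A toy model helps show why sharding free lists per page keeps the fast path lock-free. In the sketch below, the list names (free, local_free, thread_free) and their roles reflect one reading of mimalloc's published design; the summary above says only that there are three lists per page. Blocks are plain integers, the owner is a string tag rather than a real thread, and a lock stands in for the atomic compare-and-swap a real allocator would use.

```python
import threading


class Page:
    """Toy model of one allocator page holding fixed-size blocks (block = integer id)."""

    def __init__(self, owner, n_blocks):
        self.owner = owner                  # tag for the owning thread (assumption)
        self.free = list(range(n_blocks))   # blocks handed out by malloc
        self.local_free = []                # blocks freed by the owning thread
        self.thread_free = []               # blocks freed by other threads
        # Stand-in for the atomic compare-and-swap on the cross-thread list.
        self._thread_free_lock = threading.Lock()

    def malloc(self):
        if not self.free:
            self._collect()
        return self.free.pop() if self.free else None

    def free_block(self, block, caller):
        if caller == self.owner:
            self.local_free.append(block)   # fast path: no synchronization
        else:
            with self._thread_free_lock:    # slow path: cross-thread free
                self.thread_free.append(block)

    def _collect(self):
        # Take the whole cross-thread list in one step, then merge both lists.
        with self._thread_free_lock:
            remote, self.thread_free = self.thread_free, []
        self.free = self.local_free + remote
        self.local_free = []


page = Page(owner="A", n_blocks=2)
a, b = page.malloc(), page.malloc()
exhausted = page.malloc()           # nothing left and nothing freed yet
page.free_block(a, caller="A")      # owner frees locally
page.free_block(b, caller="B")      # another thread frees remotely
reused = {page.malloc(), page.malloc()}
print(exhausted, reused)
```

The owner touches the shared list only when its own free list runs dry, so contention is limited to the moment of collection and to the cross-thread frees themselves.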

Quick hits

IBM released Granite Embedding Multilingual R2, two Apache-2.0 embedding models on a ModernBERT backbone with a 32K-token context window: a 97M-parameter model that IBM says leads open sub-100M retrieval at 60.3 on MTEB Multilingual, against 50.9 for multilingual-e5-small, and a 311M model at 65.2. Benchmarks are self-reported, and competitors lead several axes in IBM’s own table.
CMU researchers presented CHAI, a video-captioning pipeline and CVPR 2026 Highlight arguing the bottleneck for cinematic video understanding is annotation vocabulary, not model scale. Trained humans critique model-drafted captions instead of writing them; the authors say an 8B Qwen3-VL trained on the resulting triples matches GPT-5 and Gemini-3.1-Pro on their own metrics, with no independent benchmarking.