Friday, June 12, 2026
Project Zero prices a full Pixel root chain at roughly eleven person-weeks and documents months of patch lag, as fresh benchmarks measure how far AI agents still fall short on real work.
Cheap to build, slow to patch
Google Project Zero published the back half of its Pixel 0-click series, and the through-line is cost. Part 3 puts a full Pixel 9 chain, from code execution in the media sandbox to a kernel escape to root, at about 11 person-weeks. Each of the two bugs took under two days to find after a few weeks of tooling: CVE-2025-54957 in Dolby’s UDC audio decoder, and CVE-2025-36934 in the BigWave AV1 driver. Part 2 details the escape: a use-after-free race in /dev/bigwave, reachable from the mediacodec sandbox, turned into a 2,144-byte arbitrary kernel write, with a fixed virtual alias into kernel .data removing the need for a KASLR leak. Researcher Connor McGarr used Gemini to generate syscall wrappers, shrinking the payload from 500 KB to 7 KB.
The systemic findings cut deeper than the mechanics. Android took 139 days to ship a first patch; Pixel trailed Samsung by 54 days and left the bug public and unpatched for 82. Both bugs were first rated Moderate and have since been reclassified Critical. The seccomp filter present in AOSP was missing from the Dolby process on Pixel 9, and KASLR has been broken on Pixel since 2016. Project Zero says Dolby’s advisory understated the severity, calling standalone code execution “possible increased risk if used alongside other known Pixel vulnerabilities.” Apple’s -fbounds-safety compiler flag blocked the same Dolby bug on iOS and macOS outright.
On Pixel 10, the chain mostly ported across. With BigWave gone, Project Zero found a new kernel link: the Tensor G5’s VPU driver calls remap_pfn_range with the caller’s full requested size, letting any process with device access map physical memory above the VPU base, the kernel included, read-write in five lines of C. Google patched it in 71 days, its fastest turnaround for a Project Zero Android driver report; the same Chips&Media team had shipped BigWave five months earlier. Project Zero also open-sourced a full macOS coreaudiod exploit, CVE-2024-54529.
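The Pixel 10 flaw comes down to a missing range check before a mapping call. As a rough illustration only, here is a minimal Python model of the validation a driver's mmap handler would be expected to perform. The function name, page size, and region size are hypothetical; this is not the VPU driver's code or Google's patch.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def validate_mmap_request(offset: int, length: int, region_size: int) -> bool:
    """Return True only if [offset, offset + length) fits inside the device region.

    Hypothetical model of the check: the flaw described above is what happens
    when the caller's requested length is trusted without a test like this.
    """
    if offset < 0 or length <= 0:
        return False
    if offset % PAGE_SIZE != 0:
        return False
    # A mapping covers whole pages, so round the requested length up.
    span = (length + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE
    # Written as a subtraction so the comparison cannot wrap in a
    # fixed-width port of the same logic.
    return offset <= region_size and span <= region_size - offset


# A 64 KiB region (hypothetical size): an in-range request passes,
# an oversized one is rejected.
print(validate_mmap_request(0, PAGE_SIZE, 0x10000))
print(validate_mmap_request(0, 0x10000 + PAGE_SIZE, 0x10000))
```

Without a check of this kind, the size of the mapping is whatever the caller asks for, which is how memory above the device's own region ends up exposed.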
Benchmarks puncture the agent narrative
Two papers argue agent capability is overstated by the evals themselves. Agents’ Last Exam, built with 250-plus domain experts and anchored to the U.S. O*NET/SOC occupational taxonomy, reports an average full-pass rate of 2.6% on its hardest tier across mainstream harnesses, spanning 1,000-plus tasks in 55 subfields. The authors frame the gap between leaderboard scores and real deployment as an evaluation failure, and built the benchmark to grow as industries onboard, resisting saturation. SciConBench, a preprint of 9,110 systematic-review questions, scores conclusion synthesis on atomic facts; the best of eight frontier models reaches an F1 of 0.337 under a clean-room harness that blocks training-data leakage, and the authors note scores drop once leakage is removed. Their audit of Google AI Overview and OpenEvidence found incomplete and sometimes contradictory answers even when the evidence was reachable on the web.
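For readers unfamiliar with fact-level scoring, a minimal sketch of an F1 computed over atomic facts follows. It assumes facts have already been extracted and normalized so they can be compared by exact match; how SciConBench actually extracts and matches facts is not described here, and the function name is hypothetical.

```python
def atomic_fact_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between a model's atomic facts and the reference facts.

    Assumption: facts are normalized strings compared by exact match.
    """
    matched = len(predicted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of the model's facts that are correct
    recall = matched / len(reference)     # share of the reference facts recovered
    return 2 * precision * recall / (precision + recall)


# Made-up facts: one of three predicted facts is correct, one of two
# reference facts is recovered.
score = atomic_fact_f1({"a", "b", "c"}, {"a", "d"})
print(round(score, 3))
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by padding its conclusion with extra claims or by stating only a few safe ones.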
Quick hits
- A solo developer used Claude Code with Opus 4.7 to translate the entire OCaml runtime, 71 C files including the garbage collector and bytecode interpreter, to Rust in about seven days, one file at a time with the upstream test suite as the gate. The fork passes that suite unmodified and runs at rough parity (1.04x aggregate, the author reports); 2,015 unsafe lines remain, mostly from the GC’s pointer-moving and C ABI compatibility.
- A preprint finds quality metrics fail as a safety proxy under quantization: across 51 configurations, refusal rates fell 12 to 68 points where perplexity and benchmark scores held, with AWQ and GPTQ INT4 worse than GGUF. Retained quality cannot stand in for direct safety testing, the authors say.
- Redwood Research measured what models do with no chain-of-thought output: GPT-5.5 clears tasks needing about three minutes of human work at 50% reliability while emitting no reasoning, a horizon doubling roughly yearly. They call it a lower bound that weakens chain-of-thought monitoring as a safety guarantee.
- Researchers report that 282 of 444 LLM-integrated iOS apps, 63%, leak working API credentials in network traffic; three months after disclosure, 72% remained exploitable.
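The quantization item above reports refusal-rate drops in percentage points. A minimal sketch of that comparison, assuming each response to a fixed set of harmful prompts has already been labeled as refused or not; the labeling method, function names, and sample data are hypothetical, not the paper's.

```python
def refusal_rate(refused: list[bool]) -> float:
    """Percentage of prompts refused; each entry is True if the model refused."""
    if not refused:
        raise ValueError("need at least one labeled response")
    return 100.0 * sum(refused) / len(refused)


def refusal_drop_points(baseline: list[bool], quantized: list[bool]) -> float:
    """Drop in refusal rate, in percentage points, from baseline to quantized."""
    return refusal_rate(baseline) - refusal_rate(quantized)


# Made-up labels for the same ten prompts before and after quantization.
baseline = [True] * 9 + [False]
quantized = [True] * 5 + [False] * 5
print(refusal_drop_points(baseline, quantized))
```

A drop like this would be invisible to perplexity or benchmark scores, which is the paper's point: the refusal measurement has to be run directly on the quantized model.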
What to watch today
- Whether Pixel’s next monthly bulletin beats the 139-day patch lag Project Zero documented, now that Android has rerated these codec and driver bugs Critical.
- Independent runs of Agents’ Last Exam and SciConBench outside the authors’ harnesses, to test whether the 2.6% and 0.337 ceilings hold.
- Whether quantized-model release pipelines (GGUF, AWQ, GPTQ) add direct safety testing or a refusal-stability screen like the paper’s RTSI.