Monday, June 1, 2026

The harness is the benchmark

OpenAI published a proposed playbook for third-party evaluations, arguing that an agentic model’s benchmark score is a lower bound set by its “harness” (prompts, tools, memory, retries, compute budget), not a capability ceiling. The third-party numbers it cites carry the case. The UK AI Safety Institute saw a tenfold token-budget increase, from 10M to 100M, lift cyber-range scores by as much as 59%, with scores still rising at the larger budget. METR’s time-horizon estimate for GPT-5.4 fell from about 13 hours to about 6 once reward-hacked successes were disqualified. Apollo found signals of evaluation awareness in 52% of samples under a sandbagging condition, against 0% in the counterfactual, with no measured underperformance. OpenAI proposes that each report state which claim it supports (strong elicitation, controlled comparison, or safeguard robustness) and name its validity hazards. The conflict of interest is in plain view: OpenAI is both the vendor and the author of the norms, and it leans on unreleased models and an internal cyber range whose data it withholds.

Hard boundaries

Anthropic published an architecture overview of how it sandboxes Claude across products; Simon Willison flags it as unusually thorough for a normally undocumented category. Claude.ai runs on gVisor; Claude Code uses Seatbelt on macOS and Bubblewrap on Linux; Cowork runs a full VM. The principle is hard containment: keep credentials outside the sandbox so neither model “creativity” nor an attacker can exfiltrate them. It also discloses a missed exfiltration vector via api.anthropic.com/v1/files. These are the vendor’s own claims, not independently verified.
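The "credentials stay outside" principle can be illustrated at toy scale. The Python sketch below is not Anthropic's implementation and is far weaker than gVisor, Seatbelt, Bubblewrap or a VM: it only scrubs the environment of a child process, assuming the secret lives in an environment variable. The names DEMO_API_TOKEN and run_untrusted are hypothetical.

```python
import os
import subprocess
import sys

# Hypothetical secret name; stands in for any credential the host holds.
SECRET_NAME = "DEMO_API_TOKEN"


def run_untrusted(code: str) -> str:
    """Run code in a child process whose environment omits the secret."""
    # Pass an explicit allowlist instead of inheriting the parent's environment.
    allowlist = ("PATH", "SYSTEMROOT")
    child_env = {k: os.environ[k] for k in allowlist if k in os.environ}
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=child_env,
        capture_output=True,
        text=True,
        timeout=30,
        check=True,
    )
    return result.stdout.strip()


# The host process holds the credential...
os.environ[SECRET_NAME] = "host-only-value"

# ...and the child process is asked to look for it.
probe = f"import os; print(os.environ.get({SECRET_NAME!r}))"
seen_by_child = run_untrusted(probe)
print(seen_by_child)
```

Environment scrubbing does nothing about filesystem or network access, which is where real sandboxes do their work. The point is narrower: code that never receives a credential cannot leak it, whatever the model or an attacker tries.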

What weak boundaries cost. Token Security disclosed a five-step chain against Zapier’s “Code by Zapier” sandbox. It ran from arbitrary user code through an AWS Lambda fingerprint and credentials scraped from memory to a misnamed “allow_nothing_role” that exposed 1,111 files from a private repo, including an NPM token valid for every package. A developer comment had dismissed the role as “not a security thing.” The last takeover stages were described, not executed; Zapier patched within a week of the February 2026 report. Vercel, separately, named “inference theft”: a stolen frontier-model call costs about $2, against a fraction of a cent for the HTTP request that triggers it, so attackers wrap a victim’s endpoint in an OpenAI-compatible adapter and resell tokens. Its fix, verifying every request inside the route handler rather than once per session, is the transferable idea; the rest sells BotID on one self-reported incident.
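Per-request gating can be sketched in a few lines of Python. This is a generic signed-token check, not Vercel's BotID or its API; SERVER_KEY, MAX_AGE_SECONDS, sign, verify and handle_completion are all hypothetical names, and the model call is stubbed out.

```python
import hashlib
import hmac
import time

SERVER_KEY = b"demo-key"   # hypothetical; a real service would load this from a secret store
MAX_AGE_SECONDS = 60       # hypothetical freshness window for a signed request


def sign(user_id: str, issued_at: int) -> str:
    """Issue a signature binding a user to a timestamp."""
    msg = f"{user_id}:{issued_at}".encode()
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()


def verify(user_id, issued_at, signature, now=None) -> bool:
    """Check types, freshness and signature for a single request."""
    if not isinstance(user_id, str) or not isinstance(signature, str):
        return False
    if not isinstance(issued_at, (int, float)):
        return False
    now = time.time() if now is None else now
    if not 0 <= now - issued_at <= MAX_AGE_SECONDS:
        return False
    expected = sign(user_id, int(issued_at)).encode()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature.encode())


def handle_completion(request: dict) -> dict:
    """Hypothetical route handler: the check runs on every call, before any model spend."""
    ok = verify(
        request.get("user_id"),
        request.get("issued_at"),
        request.get("signature"),
    )
    if not ok:
        return {"status": 401, "body": "unauthorized"}
    # The expensive inference call would go here; it is stubbed for the sketch.
    return {"status": 200, "body": f"model output for {request['user_id']}"}
```

The design choice is where the check lives. A session cookie validated once at login lets a reseller replay that session through an adapter indefinitely; a check inside the handler forces every token-spending call to prove itself again.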

Silicon and systems

bunnie Huang argues non-destructive infrared imaging can put a hard upper bound on a chip’s SRAM, turning the hunt for “secret-knock” hidden memory into a tractable proof-of-space. At 22nm, IR cannot resolve bit cells, but SRAM macros are far larger than its resolution limit, so hidden macros and widened datapaths are countable against open RTL; extra rows are the residual gap, a few dozen bytes. Ken Shirriff and the Opcode Collective reverse-engineered the 8087’s FXCH instruction from die photos, showing that a trivial register swap takes 14 micro-instructions from a ROM that packs two bits per transistor. Together AI wrote up serving tricks worth lifting: conditional CUDA-graph nodes that move a decoder’s BLANK-token branch onto the GPU (claimed 2-3x faster) and gc.freeze() to kill roughly 200ms p95 stalls; its “world’s fastest” headline ships without methodology. Fabian Giesen explains that ASTC’s use of Integer Sequence Encoding buys predictability, not compression: its data-independent size lets the format infer coding parameters from leftover bits, for an edge of about 0.07 bits per symbol over a prefix code.
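A figure of about 0.07 bits per symbol can be reproduced with simple arithmetic. The Python sketch below assumes equiprobable symbols and compares ASTC's fixed packings (5 trits in 8 bits, 3 quints in 7 bits) with the best prefix code for the same alphabet; whether this matches Giesen's own derivation is an assumption. The helper prefix_code_bits is a hypothetical name.

```python
from fractions import Fraction


def prefix_code_bits(n_symbols: int) -> Fraction:
    """Average bits/symbol of an optimal prefix code over n equiprobable symbols."""
    # With equal probabilities, optimal code lengths are floor(log2 n) or one more.
    short_len = n_symbols.bit_length() - 1       # floor(log2 n)
    n_long = 2 * (n_symbols - 2 ** short_len)    # symbols that need the extra bit
    n_short = n_symbols - n_long
    total_bits = n_short * short_len + n_long * (short_len + 1)
    return Fraction(total_bits, n_symbols)


# ASTC ISE block packings: radix -> (symbols per block, bits per block).
ISE_PACKING = {3: (5, 8), 5: (3, 7)}

gaps = {}
for radix, (count, bits) in ISE_PACKING.items():
    ise_bits = Fraction(bits, count)             # fixed cost per symbol under ISE
    gaps[radix] = prefix_code_bits(radix) - ise_bits
    print(f"radix {radix}: ISE {float(ise_bits):.4f}, "
          f"prefix {float(prefix_code_bits(radix)):.4f}, "
          f"gap {float(gaps[radix]):.4f} bits/symbol")
```

Under these assumptions both trits and quints come out at a gap of exactly 1/15 bit per symbol, roughly 0.067. The small size of that gap is consistent with the point that the payoff is a size known in advance rather than compression.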

Money and maintainers

Per Simon Willison, Anthropic disclosed run-rate revenue crossing $47 billion in its $65 billion Series H, up from about $30 billion in April. Run-rate annualizes the latest month, so it is a projection that runs ahead of trailing actuals. On Talk Python, Django Software Foundation director Paolo Melchiorre framed AI as an amplifier of maintainer load, not a new contributor, citing a reported 12x activity spike on critical-infrastructure projects, curl’s bug bounty buried in LLM reports, and a RedMonk survey of 86 foundations split roughly 55% permissive to 25% banning.

What to watch today

Whether rival labs adopt OpenAI’s claim-plus-validity format, or evaluators like METR and Apollo back a shared standard.
Independent testing of Anthropic’s containment claims and uptake of its open-source srt (Sandbox Runtime) tool.
More “inference theft” reports as agent endpoints multiply, and whether per-request gating becomes default.