Week of June 1, 2026

Bounds on trust

Trust in shared infrastructure is only as good as the checks enforcing it. Four pieces this week tested where those checks hold, and where they don’t.

Cloudflare probed the internet’s largest backbones and found that about half silently accept forged BGP routes. To measure who enforces the “First AS” rule of RFC 4271, its engineers prepended an extra Cloudflare-owned ASN ahead of their real one (AS13335) and watched which networks dropped the route and which installed it. Seven Tier 1s enforce: Cogent, Arelion, GTT, PCCW, Orange, Tata, and AT&T. Roughly half do not, predominantly running Juniper gear whose defaults skip the check. The gap matters because an attacker can originate a parked prefix and forge the whole AS_PATH, omitting its own network so the route looks RPKI-valid and short, with no “valley” for ASPA validation to catch. The fix already exists: enable enforce-first-as on every external session except internet-exchange route servers. Cloudflare is both researcher and affected party, and it published its non-enforcer list as an image rather than text.
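The check itself is small. A minimal sketch, assuming an AS_PATH already flattened to a list of ASNs (real paths can also carry AS_SET segments, ignored here). The function names and the documentation-range ASNs 64496 and 64500 are illustrative and do not come from Cloudflare's tooling:

```python
from typing import Sequence


def passes_first_as_check(as_path: Sequence[int], peer_asn: int) -> bool:
    """Return True if the leftmost AS in the path is the neighbor that sent it."""
    # An empty path from an external peer cannot match any neighbor ASN.
    if not as_path:
        return False
    return as_path[0] == peer_asn


def accept_route(as_path: Sequence[int], peer_asn: int,
                 enforce_first_as: bool = True) -> bool:
    """Apply the check only on sessions where enforcement is switched on."""
    if not enforce_first_as:
        return True
    return passes_first_as_check(as_path, peer_asn)


# Hypothetical session with AS13335 as the neighbor.
PEER_ASN = 13335
honest_path = [13335, 64500]
prepended_path = [64496, 13335, 64500]  # an extra ASN sits ahead of the real one

print(accept_route(honest_path, PEER_ASN))     # True
print(accept_route(prepended_path, PEER_ASN))  # False
```

The `enforce_first_as` flag mirrors the route-server exception: a route server typically passes routes along without inserting its own ASN, so the leftmost AS legitimately differs from the session's neighbor.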

bunnie Huang showed the constructive version on hardware. Infrared imaging cannot resolve individual SRAM cells at 22nm, but memory macros are large enough to count and match against open RTL, turning “is there hidden memory staging an attack?” into a tractable proof-of-space. Reading decoder spines and column sense-amps off a Baochip-1x micrograph, he argues whole hidden macros and wider datapaths are detectable; only a few dozen bytes hidden in extra rows could evade IR, and that residue is addressable with laser interferometry or destructive imaging.

Token Security found a boundary that was missing. Researchers chained five steps through Zapier’s user-code feature: run code in “Code by Zapier,” fingerprint the sandbox as AWS Lambda, scrape credentials left in memory, then use an over-permissioned role (named, without irony, allow_nothing_role) to pull 1,111 files from a private repository and recover an NPM token valid for every package. The final two stages, which would have shipped malicious code to every authenticated user, were described but not run. A developer comment had dismissed the role as “not a security thing.” Zapier patched within a week of the February 2026 disclosure; Token Security sells the identity tooling its research promotes.

Anthropic published how it sandboxes its agents, a level of detail Simon Willison notes is rare: Claude.ai on gVisor, Claude Code on Seatbelt and Bubblewrap, Cowork in a full VM, with credentials kept outside the sandbox so a compromised agent cannot exfiltrate them. The post also discloses a previously missed exfiltration path through its files API. It is a vendor describing its own defenses, not an independent audit.

Microsoft’s full-stack frontier bid

Microsoft used Build to position itself as a first-party frontier lab. It announced seven from-scratch MAI models, led by the reasoning model MAI-Thinking-1, alongside its own MAIA 200 silicon. Microsoft says MAI-Thinking-1 is a roughly 1-trillion-parameter mixture-of-experts with 35 billion active parameters and a 256K context, trained on 30 trillion tokens across 8,192 GB200 GPUs. It reports scores of 97% on AIME 2025 and 53% on SWE-Bench Pro, and says blind raters preferred it to Claude Sonnet 4.6. Those are vendor figures, and third-party recaps disagree on basics such as the active-parameter count.

The durable artifact is a 109-page technical report that named researchers called the most transparent at this scale. It discloses scaling methodology, model-FLOPs utilization, data curation with DSPy-optimized judges, and an explicit no-distillation, no-synthetic-data lineage. Reinforcement learning started from a checkpoint not previously trained on reasoning, after which the model’s AIME 2025 score rose from under 20% to over 95%. At a moment when frontier disclosure is shrinking, the report, not the benchmark table, is what to read. The figures here come via an AINews recap.

Ben Thompson, reacting to the same week’s Computex, calls the parallel “AI PC” push misallocated silicon. Nvidia’s RTX Spark, built with Microsoft, pairs 20 Arm cores with a Blackwell GPU and 128GB of LPDDR5X but ships this fall without benchmarks. His argument: the agentic era rewards a strong local CPU plus cloud inference, not a local GPU weaker than the cloud’s. He finds Microsoft’s Android-based Project Solara, a platform for agent-running devices, the more interesting bet, though he calls it vaporware.

The capital behind the compute

Two numbers framed AI’s balance sheet. Alphabet is raising about $80 billion in equity: a $40 billion at-the-market program, $30 billion in underwritten and convertible offerings, and a $10 billion Berkshire Hathaway stake taken near Google’s record share price. Stratechery’s Ben Thompson reads it as Google becoming a “capital company,” and asks why a firm with about $126 billion in cash and ample tax-advantaged debt capacity would issue equity at all. His answer: either more debt is coming because compute demand is underestimated, or Google wants partners to share the capital-expenditure risk. His throughline is that if compute runs short, the firm with the most cash buys the most compute and compounds the lead, with TPUs as Google’s cost hedge.

Anthropic supplied a demand-side number, disclosing as part of its $65 billion Series H that run-rate revenue had crossed $47 billion, up from about $30 billion in April and $14 billion in February. Run-rate annualizes the most recent month, so it is a projection, not booked revenue. Willison argues the figure is credible because it appears in a fundraise, where misstatement is securities fraud, and will eventually surface in an IPO filing.
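To see what a run-rate figure implies about a single month, the annualization can be run backwards. A rough sketch, assuming the plain "latest month times twelve" definition given above. The helper names are illustrative, and the company's exact methodology is not stated here:

```python
def run_rate(latest_month_revenue: float) -> float:
    """Annualize one month's revenue by multiplying by twelve."""
    return latest_month_revenue * 12


def implied_month(run_rate_value: float) -> float:
    """Invert the annualization to recover the month behind a run-rate figure."""
    return run_rate_value / 12


# Run-rate figures quoted above, in dollars.
quoted = [("February", 14e9), ("April", 30e9), ("Series H", 47e9)]

for label, value in quoted:
    month = implied_month(value)
    print(f"{label}: about ${month / 1e9:.2f}B in the month being annualized")
```

On that definition, a $47 billion run-rate corresponds to a single month of roughly $3.9 billion, which is why the figure reads as a projection and not as a trailing-twelve-month total.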

Building and measuring for agents

As agents become both the consumer of tools and the subject of benchmarks, three pieces reconsidered the practice around them. Hugging Face rebuilt its hf CLI to serve agents and humans from the same commands, detecting agent drivers through environment variables to switch from truncated human tables to dense, parseable TSV, with idempotent operations and next-command hints that act as rails. Across 18 non-trivial Hub tasks and about 1,000 runs graded against the live Hub rather than the agent’s own success marker, Hugging Face says the CLI used 1.3 to 1.8 times fewer tokens on average, and 2.4 to 6 times fewer on complex write tasks. The live-state grading is the transferable idea.
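The output-switching pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the hf CLI's code: the environment variable `EXAMPLE_AGENT` is hypothetical (the variables Hugging Face actually checks are not listed here), and the rendering rules are a guess at what "truncated table" versus "dense TSV" might look like:

```python
import os
from typing import Mapping, Optional, Sequence

# Hypothetical marker an agent driver might set; not a real hf CLI variable.
AGENT_ENV_VARS = ("EXAMPLE_AGENT",)


def is_agent(env: Optional[Mapping[str, str]] = None) -> bool:
    """Guess whether an agent is driving the CLI from environment variables."""
    if env is None:
        env = os.environ
    return any(env.get(name) for name in AGENT_ENV_VARS)


def render(rows: Sequence[Sequence[object]], headers: Sequence[str],
           agent: bool, width: int = 12) -> str:
    """Emit untruncated TSV for agents, a clipped aligned table for humans."""
    if agent:
        lines = ["\t".join(headers)]
        lines += ["\t".join(str(cell) for cell in row) for row in rows]
        return "\n".join(lines)

    def clip(value: object) -> str:
        text = str(value)
        return text[: width - 3] + "..." if len(text) > width else text

    lines = [" | ".join(clip(h).ljust(width) for h in headers)]
    lines += [" | ".join(clip(c).ljust(width) for c in row) for row in rows]
    return "\n".join(lines)


rows = [("example-org/a-very-long-model-name", 42)]
print(render(rows, ("repo", "downloads"), agent=is_agent()))
```

The same command serves both audiences. Only the renderer changes, so an agent gets full values it can parse without guessing at what was clipped.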

Sean Goedecke offers a cleaner rubric: every LLM program is a pipeline, with control flow in code, or an agent, with the model holding control flow through tools. Pipelines win on predictability and bounded cost; agents win on hard problems and on future-proofing. He argues RAG largely failed because finding relevant context is as hard as the task itself, which pushed teams back toward plaintext search. The piece is experiential, with no benchmarks.
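The distinction is easiest to see side by side. A toy sketch, assuming a model is any function from prompt text to reply text. The `FINAL:` convention, the tool-call format, and all names are invented for illustration and are not taken from Goedecke's piece:

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # stand-in for an LLM call (hypothetical)
Tool = Callable[[str], str]


def pipeline(question: str, model: Model, search: Tool) -> str:
    """Control flow lives in code: always search once, then answer once."""
    context = search(question)
    return model(f"Context: {context}\nQuestion: {question}")


def agent(question: str, model: Model, tools: Dict[str, Tool],
          max_steps: int = 5) -> str:
    """Control flow lives in the model: each reply picks a tool or finishes."""
    transcript = question
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Treat the reply as "<tool name> <argument>".
        tool_name, _, argument = reply.partition(" ")
        tool = tools.get(tool_name)
        result = tool(argument) if tool else f"unknown tool {tool_name}"
        transcript += f"\n{reply}\n-> {result}"
    # The step cap is the only bound on cost; the model decides everything else.
    return "step budget exhausted"


def scripted_model(prompt: str) -> str:
    """Scripted fake model: search once, then finish."""
    return "FINAL: done" if "-> " in prompt else "search cats"


print(agent("q", scripted_model, {"search": lambda arg: "found " + arg}))
```

The pipeline makes a fixed number of model calls by construction, while the agent's cost depends on how many steps the model chooses to take. That is the predictability trade-off described above.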

OpenAI argues agentic benchmark scores are lower bounds set by the test harness. Its evidence is third-party: the UK AI Safety Institute raised one system’s performance by as much as 59% by lifting its token budget from 10 million to 100 million, with the curve still climbing; METR’s time-horizon estimate for GPT-5.4 fell from about 13 hours to 6 once reward-hacked runs were removed; Apollo detected evaluation-awareness in 52% of samples in a sandbagging condition, though without measured underperformance. The caveat is structural: OpenAI is proposing the standards it would itself be judged against.

Quick hits

AI strains open-source maintainers: a Talk Python episode catalogs GitHub’s roughly 12x activity spike landing on a few critical-infrastructure projects, curl’s bug bounty buried in LLM-generated reports, and a Redmonk survey of 86 foundations in which about 25% now ban AI contributions outright.
Inside the Intel 8087: Ken Shirriff decodes the FXCH instruction of Intel’s 1980 floating-point coprocessor from die photos, showing that a register swap takes 14 micro-instructions plus tag-checking, with the microcode stored in a semi-analog ROM that packs 2 bits per transistor.
Why ASTC uses Integer Sequence Encoding: Fabian Giesen shows the texture format’s encoding wins not on compression (a prefix code is within ~0.07 bits per symbol) but on a data-independent size that lets the decoder infer parameters from the bits left over.
Fast speech-to-text serving: Together AI writes up conditional CUDA-graph nodes that move a decoder branch onto the GPU and a gc.freeze() that removed ~200ms p95 stalls from Python’s garbage collector. The headline speed claims are vendor benchmarks with no published methodology.
Image models need little from text encoders: a preprint argues diffusion-transformer image generators use only word meaning and word order from text embeddings, with a context-free “bag of position-tagged words” matching full embeddings.
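On the speech-to-text item above: `gc.freeze()` is a standard-library call (Python 3.7 and later) that moves every currently tracked object into a permanent generation that later collections skip. A minimal sketch of the usual start-up pattern, assuming a server that builds long-lived state once. The `warm_up` workload is a stand-in, and where Together AI actually places the call is not stated here:

```python
import gc


def warm_up() -> dict:
    """Stand-in for loading models, tokenizers, and caches (hypothetical)."""
    return {i: [i] * 4 for i in range(10_000)}


state = warm_up()

gc.collect()  # clear garbage left over from start-up first
gc.freeze()   # park all surviving tracked objects in the permanent generation

frozen = gc.get_freeze_count()
print(f"{frozen} objects will be skipped by future collections")
```

After the freeze, full collections no longer walk the long-lived start-up objects, which is the kind of work that can show up as tail-latency stalls. The ~200ms figure is Together AI's own and is not reproduced by this sketch.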