Monday, May 18, 2026

The architecture changes behind the new models

Sebastian Raschka surveyed the architecture changes behind the latest open-weight LLMs, all aimed at one cost: keeping long contexts cheap as reasoning and agent workloads hold more tokens live. The piece is a mechanism reference and skips training, data, and benchmarks.

Google’s Gemma 4 small variants reuse key-value cache across layers: later layers read the KV of the nearest earlier same-type layer instead of computing their own, which Raschka says saves about 2.7 GB (E2B) to 6 GB (E4B) at a 128k-token context in bf16. Their per-layer embeddings stash extra capacity in cheap lookup tables, keeping “effective” parameter counts low; Raschka notes Google published no ablation against a plain dense model, so the gain is the company’s word.

Three other models add distinct tricks. Poolside’s Laguna XS.2 varies the query-head count per layer, spending more heads on cheap sliding-window layers and fewer on expensive global ones. Zyphra’s ZAYA1-8B, trained on AMD GPUs, runs attention inside a compressed latent space, which the company says beats Multi-head Latent Attention at similar compression. DeepSeek V4 contributes manifold-constrained hyper-connections, several parallel residual streams kept stable by projection onto doubly-stochastic matrices, at a reported 6.7% training overhead, plus compressed long-context attention.

RL, search, and steering

Two pieces argue fashionable control techniques are narrower than they look.

Eric Jang rebuilt AlphaGo from scratch with current coding tools on the Dwarkesh Patel podcast, using it to frame why reinforcement learning on LLMs is so sample-inefficient. Monte Carlo Tree Search hands the learner a better move target at every step; policy-gradient RL must work out which of 100,000-plus tokens in a trajectory earned the reward. Jang notes KataGo (David Wu, 2020) cut the compute to train a strong Go bot about 40-fold, and that what once took a DeepMind team and millions is now a few-thousand-dollar solo project.

Sean Goedecke revisited activation steering, nudging a model by editing its activations mid-inference, now that local open-weight models like DeepSeek-V4-Flash are strong enough to be worth steering, after antirez’s DwarfStar 4 fork shipped it as a first-class feature. Goedecke is skeptical: most steering gains are matched by prompting, and a vector for something like “intelligence” would approximate the full weights, collapsing into retraining. A correction from Hacker News commenters, including antirez, is the sharper point: steering can strip a model’s trained-in refusals in ways prompting cannot, and damages capability less than editing weights.

Quick hits

Google DeepMind announced Gemini 3.5, shipping 3.5 Flash now, promising 3.5 Pro in June 2026, and making Flash the default for the Gemini app and AI Mode in Search. The benchmarks (Terminal-Bench 2.1 76.2%, 1656 Elo on GDPval-AA) are vendor-reported, with no methodology or named baselines.
The Pion project explained how RFC 8260’s I-DATA chunk fixes sender-side head-of-line blocking on WebRTC data channels, letting small streams interleave between a large message’s fragments; it ships on by default in Pion 4.2.13 and Firefox, behind a flag in Chrome.
RIPE Labs measured the 44 million IPv4 addresses of legacy space still outside any RIPE NCC contract and found it heterogeneous, not lawless: about 37% of prefixes look dormant, yet out-of-contract space appears in Spamhaus’s hijacked-range list at roughly nine times the under-contract rate (1.06% vs 0.12%).
A Six Degrees of Robotics analysis reads China’s 15th Five-Year Plan as naming embodied AI a national priority, with state-funded “training grounds” for robot data; the author argues physical AI’s binding constraint is data, not chips, which would make US chip-export controls the wrong lever.
Redwood Research argued that lab risk reports ignore “deployment-time spread,” where an aligned model develops and propagates misaligned goals in the field without ever evading evaluations; the position piece credits only Anthropic’s report for addressing the pathway.

What to watch today

Gemini 3.5 Pro, which Google dates to June 2026, and independent evals to test the vendor Terminal-Bench and GDPval numbers Flash launched with.
Whether DeepSeek V4’s manifold-constrained hyper-connections and ZAYA1’s compressed attention draw independent replication of their “beats baseline” claims.
Goedecke’s six-month bet: whether the open-weight community finds steering uses on DeepSeek-V4-Flash beyond a verbosity slider.
Whether any lab beyond Anthropic adds Redwood’s deployment-time-spread pathway to its next risk report.