Week of June 22, 2026

Style over structure

The week’s most consequential result was a theory of why prompt injection works at all. A paper on role confusion argues that language models infer who is speaking (system, user, tool, or the model’s own reasoning) from writing style rather than from the chat-template tags providers wrap around each turn. Linear probes on mid-layer activations show reasoning-styled text lighting up the same internal direction as genuine think tags, and more strongly; strip the tags and the signal barely moves. From that the authors build CoT Forgery, which injects fabricated reasoning concluding that a harmful request is fine, and push benchmark attack success from near zero to about 60% across late-2025 frontier models, with the attack transferring between model families. One substitution makes the mechanism visible: changing “The user” to “The request” in the spoofed text drops success from 61% to 10%, a change a person barely registers. The consequence is structural. Every agent pipeline that treats a tool or reasoning boundary as a trust boundary is relying on an inference the model can be talked out of.
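
For readers who have not run a probe, here is a minimal sketch of the kind of measurement described above. It is not the paper's code: it uses a difference-of-means direction, one simple form of linear probe, on synthetic vectors. The function names, the 8-dimensional "activations", and the shift on coordinate 0 that stands in for reasoning style are all assumptions made for illustration, and the synthetic data builds the finding in by construction, so the sketch shows how the comparison is made rather than reproducing the result.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def probe_direction(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar readout of one activation along the probe direction."""
    return sum(a * d for a, d in zip(activation, direction))


def synthetic_activations(rng, n, dim, shift):
    """Gaussian noise plus a shift on coordinate 0 (hypothetical 'reasoning style' signal)."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


def mean_score(rows, direction):
    """Average probe readout over a set of activations."""
    return sum(project(r, direction) for r in rows) / len(rows)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Fit the direction on reasoning text that carries think tags vs. plain user text.
tagged_reasoning = synthetic_activations(rng, 200, 8, shift=3.0)
plain_user_text = synthetic_activations(rng, 200, 8, shift=0.0)
direction = probe_direction(tagged_reasoning, plain_user_text)

# Assumption baked into the data: untagged reasoning-styled text has the same shift.
untagged_reasoning = synthetic_activations(rng, 200, 8, shift=3.0)

untagged_score = mean_score(untagged_reasoning, direction)
plain_score = mean_score(plain_user_text, direction)
print(f"untagged reasoning: {untagged_score:.2f}, plain user text: {plain_score:.2f}")
```

On real models the vectors would come from a mid-layer of the network rather than a random generator; the comparison of readouts between tagged, untagged, and plain text is the part that carries over.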

The surface misreads the model

Role confusion was one of several results this week showing that a model’s outward behavior can misrepresent its internal state. Perfect Detection, Failed Control found that a model can represent a property with perfect linear separability, an AUC of 1.000, yet resist being steered on it, with the detecting and controlling directions sitting far apart: knowing is not steering. Erased, but Not Gone made the matching point about unlearning, where methods that look successful when judged by output forgetting leave structured residuals that retraining exposes. And Internal Data Repetition Destroys Language Models priced a cost invisible to standard metrics: reused training documents drive evaluation loss to a peak and, at a tenth of the compute budget, waste the equivalent of most of it at Qwen3 scale. The response, in the same week, was to move evaluation inside the model. RAS scores safety from a model’s refusal directions rather than its answers, and a systematization of secure code generation found that how well a model grasps a security principle, not whether its output looks safe, predicts whether the code actually is.
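
The gap between detecting and steering can be made concrete with two numbers: the AUC of a readout along a detection direction, and the cosine similarity between that direction and whichever direction actually changes behavior. The sketch below is a toy under stated assumptions, not the paper's method or data: the six 2-D "activations", `detect_direction`, and `steer_direction` are all hypothetical values chosen by hand.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def project(x, direction):
    """Scalar readout of one activation along a direction."""
    return sum(a * d for a, d in zip(x, direction))


# Hypothetical activations: the property is cleanly separable along axis 0.
with_property = [[1.0, 0.2], [1.2, -0.1], [0.9, 0.4]]
without_property = [[-1.0, 0.3], [-0.8, -0.2], [-1.1, 0.1]]

detect_direction = [1.0, 0.0]  # hypothetical: reads the property out
steer_direction = [0.1, 1.0]   # hypothetical: the direction that changes behavior

probe_auc = auc(
    [project(x, detect_direction) for x in with_property],
    [project(x, detect_direction) for x in without_property],
)
alignment = cosine(detect_direction, steer_direction)
print(f"AUC = {probe_auc:.3f}, cosine(detect, steer) = {alignment:.3f}")
```

In this toy the readout separates the two groups perfectly while the two directions are nearly orthogonal, which is the shape of the result: a high detection score says nothing by itself about whether pushing along the same direction will move the model.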

The frontier moved too

The product news kept its own pace. OpenAI unveiled Jalapeño, its first custom chip, built with Broadcom for inference, joining Google, Amazon, and Microsoft in designing around Nvidia rather than only buying from it. Google folded native computer use into Gemini 3.5 Flash, an agent that drives a screen and that, Google says, comes with adversarial training against injection and confirmation gates for enterprises. And a researcher published Usbliter8, a SecureROM exploit on Apple’s A12 and A13 chips: a DMA-pointer mismatch yields a buffer underflow that, on the A13, bypasses Pointer Authentication to run code at EL1, compromising the chain of trust from the earliest boot code. That code is fixed in silicon and cannot be patched on affected units.

Quick hits

The Allen Institute mapped where hybrid models pay off: its Olmo Hybrid predicts meaning-bearing tokens better than a transformer does, while the transformer keeps its edge at reciting earlier tokens verbatim.
ATMA reports needle-in-a-haystack retrieval above 90% out to 64k tokens with a polar-attention-and-recurrent-memory design whose perplexity improves as context grows (preprint).
A survey of RAG security and PrivacyAlign both treat retrieval systems as privacy surfaces, where the index, the logs, and context assembly each leak.
Detect, Unlearn, Restore defends summarizers against training-time poisoning, detecting it at rates of 85 to 92% and restoring most behavior with gradient-ascent unlearning (preprint).