Friday, June 26, 2026

Knowing is not steering

A new preprint, Perfect Detection, Failed Control, separates two things interpretability work usually conflates: a model can represent a property with perfect linear separability and still resist being steered on it. The authors detect fabricated entities at an AUC of 1.000 from layer five, yet the direction that detects hallucination has a cosine similarity of only 0.12 to 0.20 with the direction that controls it, so the two are close to orthogonal. The gap survives instruction tuning and closes only partly under a 15-degree rotation, which in turn introduces false positives. A second preprint, Erased, but Not Gone, makes the mirror-image point about unlearning: methods judged by output forgetting (low accuracy on the forget set) look successful while leaving structured residuals in the model’s representations that retraining exposes. In both, the output signal misreads the internal state.
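
To make the detect-versus-steer gap concrete, here is a minimal sketch of the comparison involved: the cosine between two direction vectors and the angle that cosine implies. The vectors `detect_dir` and `steer_dir` are hypothetical toy values, not taken from the preprint, and the helper names are ours.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_from_cosine(c):
    """Angle in degrees implied by a cosine, clamped to [-1, 1] for safety."""
    c = max(-1.0, min(1.0, c))
    return math.degrees(math.acos(c))


# Hypothetical toy directions (assumption): one that separates the classes,
# one that would be added to activations to change behavior.
detect_dir = [1.0, 0.0, 0.0]
steer_dir = [0.2, 0.9, 0.4]

toy_cos = cosine(detect_dir, steer_dir)
print(f"toy cosine: {toy_cos:.3f}, angle: {angle_from_cosine(toy_cos):.1f} degrees")

# Angles implied by the cosine range reported in the summary above.
for reported in (0.12, 0.20):
    print(f"cosine {reported:.2f} -> {angle_from_cosine(reported):.1f} degrees")
```

Read this way, a cosine of 0.12 to 0.20 puts the two directions roughly 83 to 78 degrees apart, which is why a probe that classifies perfectly can still be a poor handle for steering.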

What repetition costs

Internal Data Repetition Destroys Language Models, a preprint, puts a price on training-data reuse that survives deduplication. Evaluation loss peaks at an intermediate repeat count that shifts with model size along a power law; when repeated documents take up a tenth of the training compute, the authors estimate the hit is equivalent to losing about 67% of FLOPs at Qwen3 scale. A deliberately misspecified linear model reproduces the same loss peak, which they read as a general memorization-versus-generalization tradeoff rather than a quirk of language models. Alongside it, the Allen Institute mapped where hybrid models pay off: its Olmo Hybrid predicts meaning-bearing tokens (nouns and verbs) better than a transformer, while the transformer keeps its edge on reciting tokens verbatim from earlier input, a job at which attention is costly but effective.

Safety you can read from the inside

Two preprints move safety evaluation from outputs to internals. RAS scores a model by extracting refusal directions from an aligned reference and measuring their presence in the target, mapping representation-level alignment to a 0-to-100 score that separates aligned models from abliterated ones faster than judge-based tests. A systematization of secure code generation finds the harder problem downstream: models that understand secure-coding principles often fail to apply them, and how well a model grasps a principle predicts whether its code is both correct and secure.
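
As an illustration of what "measuring the presence" of a refusal direction could look like, here is a toy sketch. It is not RAS's actual procedure: it assumes a difference-of-means direction (mean activation on harmful prompts minus mean activation on harmless ones), assumes the reference and target share an activation space, and uses a made-up mapping from cosine to a 0-to-100 number. Every function name and every activation value is hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(harmful, harmless):
    """Candidate refusal direction: mean(harmful) minus mean(harmless)."""
    return [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def toy_alignment_score(ref_dir, target_dir):
    """Assumed mapping, not the paper's: clip cosine at 0 and scale to 0..100."""
    return 100.0 * max(0.0, cosine(ref_dir, target_dir))


# Hypothetical 2-d activations for illustration only.
ref_dir = difference_of_means([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
aligned_dir = difference_of_means([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
# A target whose harmful and harmless activations coincide has no refusal signal.
ablated_dir = difference_of_means([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])

print(toy_alignment_score(ref_dir, aligned_dir))  # 100.0
print(toy_alignment_score(ref_dir, ablated_dir))  # 0.0
```

The sketch also shows the limit of any such score: it can only register directions the reference model already carries.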

What to watch today

ATMA reports needle-in-a-haystack retrieval above 90% out to 64k tokens with a polar-attention-plus-recurrent-memory design whose perplexity keeps improving as context grows, a length-invariance claim worth seeing replicated.
PrivacyAlign and a RAG security survey both frame retrieval systems as privacy surfaces, where the index, the query logs, and context assembly each leak.
Whether representation-level safety scores hold up as a deployment gate, or only measure what the reference model already knew to refuse.