Safety & policy

Alignment, interpretability, and AI governance.

arxiv.org | 2026-07-02 | AI models safety | rel 8/10 | score 5.0

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

HeRA offers a novel approach to aligning multimodal representations at the granularity of individual attention heads, potentially improving the accuracy and reliability of multimodal large language models (MLLMs).

Proposes the Head-Wise Representation Alignment (HeRA) method
Focuses on preserving topological structure using the Mutual K-Nearest Neighbor (MKNN) alignment metric
Improves performance on challenging vision-centric tasks across multiple MLLMs and benchmarks

Full summary

The paper introduces Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads in multimodal large language models (MLLMs). HeRA uses the Mutual K-Nearest Neighbor (MKNN) alignment metric to preserve topological structure across modalities. Evaluations show that aligning less aligned heads yields significant performance improvements on vision-centric tasks and reduces visual hallucinations by mitigating over-reliance on linguistic priors.
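
For readers unfamiliar with mutual k-nearest-neighbor alignment, the sketch below shows one common formulation: for each sample, take its k nearest neighbors in each representation space and average the fraction of neighbors the two spaces share. This is an illustration under assumptions, not the paper's implementation. The function names, the Euclidean distance, and the toy "vision" and "text" features are all hypothetical, and HeRA's exact metric and per-head procedure may differ.

```python
import math


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors.

    Uses Euclidean distance and excludes the point itself. Ties are broken
    by index so the result is deterministic.
    """
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average neighborhood overlap between two feature sets (range 0 to 1).

    feats_a[i] and feats_b[i] must describe the same sample in two spaces,
    e.g. one attention head's visual features and its text features
    (an assumed pairing, chosen for illustration).
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("feature sets must contain the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy data (hypothetical): two clusters of two samples each.
vision = [(0, 0), (0, 1), (5, 5), (5, 6)]
text_same_topology = [(0, 0), (0, 2), (10, 10), (10, 12)]   # rescaled, same neighbors
text_scrambled = [(0, 0), (5, 5), (0, 1), (5, 6)]           # neighbors reshuffled

aligned_score = mutual_knn_alignment(vision, text_same_topology, k=1)
scrambled_score = mutual_knn_alignment(vision, text_scrambled, k=1)
```

Because the score depends only on who each sample's neighbors are, it is unchanged by rescaling a space, which is why a neighborhood-based metric is a natural fit for comparing topological structure across modalities with very different feature geometry.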

arxiv.org | 2026-07-02 | AI safety | rel 8/10 | score 4.7

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

HARC offers a novel approach to enhancing safety alignment in large language models by coupling harmfulness and refusal directions, potentially mitigating vulnerabilities that allow jailbreaks.

Details

HARC is a fine-tuning method for LLMs that pairs harmfulness and refusal directions across prompt and response positions
The method achieves a stronger robustness-capability-usability trade-off than any of the six baselines tested
HARC's intervention leaves the rest of the residual stream intact, preserving general capability without over-refusal

The paper introduces HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method for large language models that enhances safety alignment by coupling harmfulness and refusal directions. The study reveals that jailbreaks succeed by suppressing either the refusal or harmfulness direction before token generation. HARC pairs these two directions across both prompt and response positions, showing robust performance without degrading general model capability. Across extensive experiments with five model families and two scales, HARC demonstrates superior trade-offs compared to existing methods.
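
To make the notion of a "direction" in the residual stream concrete, the sketch below uses a widely used technique, the difference of mean activations between two prompt sets, and then measures how strongly a given activation projects onto that direction. It is an assumption that HARC derives its harmfulness and refusal directions this way; the summary does not say, and the coupling objective itself is not sketched here. All names and the two-dimensional toy activations are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def projection(activation, unit_direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * u for a, u in zip(activation, unit_direction))


# Toy residual-stream activations (hypothetical, 2-dimensional).
harmful_prompt_acts = [[2.0, 0.0], [4.0, 0.0]]
harmless_prompt_acts = [[0.0, 0.0], [0.0, 0.0]]

harmfulness_dir = difference_of_means_direction(harmful_prompt_acts, harmless_prompt_acts)

# A low projection on a harmful prompt would correspond to the kind of
# suppression the summary says jailbreaks rely on.
normal_score = projection([3.0, 1.0], harmfulness_dir)
suppressed_score = projection([0.5, 1.0], harmfulness_dir)
```

The same two helpers could be applied to a refusal direction (for example, from refused versus complied responses, again an assumed setup), which would let one track both projections at prompt and response positions.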