HeRA offers a novel approach to aligning multimodal representations at the granularity of individual attention heads, potentially improving the accuracy and reliability of multimodal large language models (MLLMs).
Preserves topological structure across modalities using the Mutual K-Nearest Neighbor (MKNN) alignment metric
Improves performance on challenging vision-centric tasks across multiple MLLMs and benchmarks
Full summary
The paper introduces Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads in multimodal large language models (MLLMs). HeRA uses the Mutual K-Nearest Neighbor (MKNN) alignment metric to preserve topological structure across modalities. Evaluations show that aligning the less-aligned heads yields significant performance improvements on vision-centric tasks and reduces visual hallucinations by mitigating over-reliance on linguistic priors.
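The summary names the MKNN metric without defining it. As a rough illustration, one common formulation scores two representation spaces by the fraction of k-nearest neighbors each sample shares across them. The sketch below assumes that formulation; the use of cosine similarity, the function names, and the toy vectors are illustrative choices, not details taken from the paper.

```python
import math


def knn_indices(feats, k):
    """Return, for each row, the set of indices of its k nearest neighbors.

    Neighbors are ranked by cosine similarity (an assumption); the row itself
    is excluded, and ties are broken by the lower index.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    unit_rows = [normalize(v) for v in feats]
    neighbors = []
    for i, u in enumerate(unit_rows):
        sims = [
            (sum(a * b for a, b in zip(u, w)), j)
            for j, w in enumerate(unit_rows)
            if j != i
        ]
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbors shared between two spaces.

    feats_a[i] and feats_b[i] must describe the same sample (for example,
    one head's visual and textual representations of the same item).
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both spaces must contain the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must be between 1 and n - 1")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    shared = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return shared / (k * len(feats_a))


# Toy data: four samples seen in a 2-D "vision" space and a 3-D "text" space.
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_same_topology = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
text_scrambled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

score_aligned = mutual_knn_alignment(vision, text_same_topology, k=1)
score_scrambled = mutual_knn_alignment(vision, text_scrambled, k=1)
```

The score depends only on neighborhood structure, not on the dimensionality or scale of either space, which is why it suits comparisons across modalities. In a head-wise setting, a score like this could be computed per attention head to find the heads with the weakest cross-modal agreement.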
HARC offers a novel approach to enhancing safety alignment in large language models by coupling harmfulness and refusal directions, potentially mitigating vulnerabilities that allow jailbreaks.
Details
HARC is a fine-tuning method for LLMs that pairs harmfulness and refusal directions across prompt and response positions
The method achieves the strongest robustness-capability-usability trade-off when compared against six baselines
HARC's intervention leaves the rest of the residual stream intact, preserving general capability without over-refusal
The paper introduces HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method for large language models that enhances safety alignment by coupling harmfulness and refusal directions. The study reveals that jailbreaks succeed by suppressing either the refusal or harmfulness direction before token generation. HARC pairs these two directions across both prompt and response positions, showing robust performance without degrading general model capability. Across extensive experiments with five model families and two scales, HARC demonstrates superior trade-offs compared to existing methods.
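To make the notion of a "direction" in the residual stream concrete, here is a small sketch of how such a direction can be estimated and how a change along it leaves everything orthogonal to it untouched. It is not HARC's training procedure: the difference-of-means estimate, the function names, and the toy activations are all assumptions chosen for illustration, since the summary does not say how the directions are obtained.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def difference_of_means_direction(group_a, group_b):
    """Unit vector from the mean of group_b to the mean of group_a.

    Assumption: this is one common way to estimate a concept direction from
    two contrasting sets of activations, not necessarily the one used by HARC.
    """
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    return unit([a - b for a, b in zip(mean_a, mean_b)])


def projection(h, direction):
    """Scalar component of activation h along a unit direction."""
    return sum(a * b for a, b in zip(h, direction))


def shift_along(h, direction, delta):
    """Move h by delta along the direction; orthogonal components are unchanged."""
    return [x + delta * d for x, d in zip(h, direction)]


def orthogonal_remainder(h, direction):
    """Part of h that is orthogonal to the direction."""
    p = projection(h, direction)
    return [x - p * d for x, d in zip(h, direction)]


# Hypothetical 3-D activations for harmful and harmless prompts.
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
harm_dir = difference_of_means_direction(harmful_acts, harmless_acts)

h = [0.5, 2.0, -1.0]
h_shifted = shift_along(h, harm_dir, delta=1.5)
```

Here `projection(h_shifted, harm_dir)` differs from `projection(h, harm_dir)` by exactly `delta`, while `orthogonal_remainder` returns the same vector before and after the shift. That is the sense in which an intervention confined to a direction can leave the rest of the residual stream intact.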