ML & methods — Research

arxiv.org 2026-07-02 Research ml rel 8/10 score 6.0

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

Graph-native reinforcement learning offers a pathway to more interpretable AI systems capable of generating scientifically valid hypotheses through structured reasoning.

Graph-PRefLexOR is a family of models fine-tuned with Group Relative Policy Optimization (GRPO)
Achieves 40-65% improvements over base models on materials science questions
Shows approximately 2-3 times greater semantic diversity than baselines

Full summary

The paper introduces Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) to enhance scientific hypothesis generation. The models organize reasoning into explicit phases: mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. On materials science questions, Graph-PRefLexOR improves on base models by 40-65% and shows approximately 2-3 times greater semantic diversity than baselines, with more traceable reasoning. Test-time graph expansion primarily enhances long-range conceptual recombination within a bounded semantic space.
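As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each sampled completion's reward against the mean and standard deviation of its own group. The reward values and the function name `group_relative_advantages` are hypothetical, and the use of the population standard deviation is an assumption; the paper's reward design and training setup are not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four hypotheses sampled for the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage, those below a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f}")
```

Because the baseline is the group mean, no separate value network is needed to estimate it.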

arxiv.org 2026-07-02 Research ml rel 8/10 score 7.1

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

This research challenges the conventional approach to reinforcement learning adaptation in transformers by demonstrating that training a single layer can achieve similar results to full-parameter training.

Details

Training a single transformer layer can match or exceed the gains of full-parameter RL training
Layer contribution measures quantify how much improvement a single layer provides when trained in isolation
High-contribution layers are consistently found in the middle of the transformer stack across different models and tasks

This study investigates how reinforcement learning gains are distributed across transformer layers during post-training adaptation. It finds that training a single layer can recover most of, or even surpass, the benefits of full-parameter RL training. The research introduces "layer contribution" to measure the improvement obtained when an individual layer is trained in isolation, revealing a consistent pattern: high-contribution layers are concentrated in the middle of the stack, while input and output layers contribute less. The pattern holds across various models, tasks, and reinforcement learning algorithms.
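One way to picture the setup is sketched below. It selects the parameters of a single layer for training and scores that layer by the fraction of the full-parameter gain it recovers. The normalization in `layer_contribution`, the parameter naming scheme, and the example scores are all assumptions made for illustration; the summary does not give the paper's exact definition.

```python
def trainable_parameter_names(names, layer_index, prefix="layers."):
    """Return only the parameter names that belong to one layer.

    Assumption: parameters are named like 'layers.<i>.<submodule>.<param>'.
    """
    target = f"{prefix}{layer_index}."
    return [name for name in names if name.startswith(target)]


def layer_contribution(base_score, single_layer_score, full_score):
    """Hypothetical measure: share of the full-parameter gain recovered."""
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter training produced no gain to compare against")
    return (single_layer_score - base_score) / full_gain


# Toy parameter names for a small stack (hypothetical).
names = [
    "embed.weight",
    "layers.0.attn.weight",
    "layers.1.attn.weight",
    "layers.1.mlp.weight",
    "layers.10.mlp.weight",
    "head.weight",
]
print(trainable_parameter_names(names, 1))

# Made-up scores: base 50.0, single-layer RL 58.0, full-parameter RL 60.0.
print(layer_contribution(50.0, 58.0, 60.0))
```

Under this illustrative definition, a value at or above 1.0 would correspond to a single layer matching or exceeding full-parameter training.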

arxiv.org 2026-07-02 Research ml rel 8/10 score 4.9

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

This research introduces a novel method for improving large language model (LLM) self-assessment and uncertainty expression, which is crucial for enhancing trustworthiness and reliability in AI systems.

Details

Reinforcement Learning with Metacognitive Feedback (RLMF) uses self-judgments to refine completion rankings during preference optimization
Metacognitive data selection identifies high-value training examples using similar self-judgments, outperforming naive active learning methods
The approach achieves state-of-the-art faithful calibration on diverse tasks while preserving accuracy

This paper presents Reinforcement Learning with Metacognitive Feedback (RLMF), a method for improving large language model (LLM) self-assessment and uncertainty expression. The approach uses the model's self-judgments to refine completion rankings during preference optimization and to select high-value training examples (metacognitive data selection). Extensive experiments show that RLMF achieves state-of-the-art faithful calibration on diverse tasks while maintaining accuracy, outperforming standard reinforcement learning by up to 63%. The authors position RLMF as a promising paradigm for enhancing LLM metacognition and alignment.

arxiv.org 2026-07-01 Research ml science rel 8/10 score 5.4

Multi-Block Diffusion Language Models

MBD-LMs offer significant improvements in text generation efficiency and accuracy through Multi-block Teacher Forcing and optimized decoding algorithms.

Details

Proposes Multi-Block Diffusion Language Models (MBD-LMs) to extend Block Diffusion Language Models
Introduces Multi-block Teacher Forcing (MultiTF) for training MBD-LMs, better aligning training with inference states
Employs an optimized decoding algorithm with the Block Buffer mechanism to preserve prefix-cache reuse and maintain static input shapes

The paper introduces Multi-Block Diffusion Language Models (MBD-LMs) as an extension of Block Diffusion Language Models, using Multi-block Teacher Forcing (MultiTF) to better align training with inference states. The method includes an optimized decoding algorithm whose Block Buffer mechanism improves efficiency by preserving prefix-cache reuse and maintaining static input shapes. Empirically, MBD-LLaDA2-Mini increases TPF (tokens per forward pass) from 3.47 to 6.19 and accuracy from 79.95% to 81.03%. When combined with DMax, the model reaches a TPF of 9.34 with near-zero accuracy loss.
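The reported figures can be turned into relative gains with a little arithmetic, as in the sketch below. TPF is taken here to mean tokens generated per forward pass, and the DMax figure is compared against the same 3.47 baseline; both readings are assumptions, since the summary does not spell them out. Variable names are illustrative.

```python
# Figures quoted in the summary above.
base_tpf, mbd_tpf, dmax_tpf = 3.47, 6.19, 9.34
base_acc, mbd_acc = 79.95, 81.03

# Ratio of tokens produced per forward pass (not a wall-clock speedup).
tpf_gain_mbd = mbd_tpf / base_tpf    # about 1.78x
tpf_gain_dmax = dmax_tpf / base_tpf  # about 2.69x, assuming the same baseline

# Accuracy change in percentage points.
acc_gain_points = mbd_acc - base_acc  # about 1.08 points

print(f"MBD TPF gain:      {tpf_gain_mbd:.2f}x")
print(f"MBD+DMax TPF gain: {tpf_gain_dmax:.2f}x")
print(f"Accuracy change:   {acc_gain_points:+.2f} points")
```

A higher TPF means fewer forward passes for the same output length; how that translates into latency depends on the per-pass cost, which the summary does not report.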

arxiv.org 2026-07-01 Research ml science rel 8/10 score 5.6

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

BlockPilot introduces an adaptive policy for speculative decoding that significantly improves inference speed without compromising accuracy, making it a valuable tool for optimizing large language models.

Details

Proposes BlockPilot, which predicts the optimal block size adaptively based on input representation
Achieves a 4.20x speedup on Qwen3-4B under temperature T=1
Reduces the decision space to a low-dimensional, structured one for efficient policy learning

BlockPilot is an instance-adaptive policy that predicts the optimal block size for diffusion-based speculative decoding from the prefilling representation. This approach reduces the problem to a simpler decision space, enabling significant speedups with minimal overhead. Experiments show BlockPilot achieves a 4.20x speedup on Qwen3-4B under temperature T=1 without affecting accuracy.
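To make the idea of an instance-adaptive block-size policy concrete, here is a minimal sketch that scores a small set of candidate block sizes from a feature vector and picks the best one. The candidate set, the linear scorer, and all weights are hypothetical; the summary does not describe BlockPilot's actual policy architecture or how the prefilling representation is pooled.

```python
# Hypothetical discrete action space: the block sizes the policy may choose from.
CANDIDATE_BLOCK_SIZES = [4, 8, 16, 32]


def predict_block_size(features, weights, biases):
    """Pick the block size whose linear score is highest.

    features: pooled prefill representation (here a plain list of floats).
    weights:  one row of coefficients per candidate block size (assumed form).
    biases:   one bias per candidate block size.
    """
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best]


# Toy two-dimensional features and made-up policy parameters.
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0, 0.0]

print(predict_block_size([1.0, 0.0], weights, biases))
print(predict_block_size([0.0, 1.0], weights, biases))
```

Restricting the choice to a handful of block sizes is what keeps the decision space small: the policy makes one cheap classification per input rather than searching during decoding.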

arxiv.org 2026-06-30 Research ml science rel 8/10 score 5.3

LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

This research advances real-time video editing by enabling stable, high-fidelity edits suitable for AR and other interactive applications.

Details

Three-stage distillation pipeline transfers editing capability from a bidirectional foundation model to a unidirectional streaming editor
AR-oriented mask cache reuses region-related computation across frames, reducing redundant processing and accelerating inference
Achieves state-of-the-art visual quality among streaming baselines

The paper introduces LiveEdit, a framework for real-time video editing that addresses stability and latency through a three-stage distillation pipeline. The pipeline transfers editing capability from a bidirectional foundation model to an efficient unidirectional streaming editor, keeping long edits stable without compromising visual fidelity. An AR-oriented mask cache reuses region-related computation across frames, cutting redundant processing and raising inference speed to 12.66 FPS. In evaluation, the framework achieves state-of-the-art visual quality among streaming baselines while remaining suitable for interactive and augmented reality applications.

arxiv.org 2026-06-29 Research ml science rel 8/10 score 4.0

Quantum Generative Diffusion Model for Real-World Time Series

This work introduces the first quantum generative diffusion model for time series, demonstrating significant improvements in efficiency and performance compared to classical models.

Details

QDiffusion-TS is the first quantum generative diffusion model for real-world time series synthesis
Validated on IQM quantum processor with financial time series data from Apple and Amazon
Reduces number of trainable parameters by nearly three orders of magnitude compared to classical models

The paper presents QDiffusion-TS, the first quantum generative diffusion model for real-world time series synthesis. This hybrid quantum transformer replaces the feed-forward components of a denoising transformer with quantum neural networks, reducing the number of trainable parameters by nearly three orders of magnitude. Evaluated on financial data from Apple and Amazon, QDiffusion-TS generates synthetic data that reproduces the real distributions more accurately, with a 44% reduction in Wasserstein distance compared to classical models. It also improves predictive performance by up to 71% in RMSE over baselines trained solely on real data.
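For readers unfamiliar with the two metrics, the sketch below computes a one-dimensional empirical Wasserstein-1 distance between equal-sized samples, an RMSE, and the relative reduction used to express figures such as "44%". The sample values are made up, and the summary does not say which Wasserstein variant or evaluation protocol the paper uses, so this is only an illustration of the quantities involved.

```python
import math


def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance for two equal-sized 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    # In 1D the optimal transport plan pairs the sorted values in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline, new):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - new) / baseline


# Made-up samples standing in for real and synthetic returns.
real = [0.0, 1.0, 2.0]
synthetic = [1.0, 2.0, 3.0]
print(wasserstein_1d(real, synthetic))

# A hypothetical distance falling from 1.00 to 0.56 is a 44% reduction.
print(relative_reduction(1.00, 0.56))
```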