ASPIRE represents a significant advancement in autonomous robotics by enabling robots to learn and refine their own control programs through continuous experience.
ASPIRE operates in an open-ended loop with three main components: a closed-loop robot execution engine, a skill library, and evolutionary search
Achieves up to 77% improvement on LIBERO-Pro manipulation under perturbation compared to prior methods
Employs a code-as-policy paradigm for autonomous failure diagnosis and repair synthesis
Full summary
ASPIRE is an innovative continual learning system designed for robotics that autonomously writes and refines robot control programs in a code-as-policy paradigm. It consists of three components: a closed-loop execution engine, a skill library, and evolutionary search mechanisms. ASPIRE outperforms existing methods by up to 77% on LIBERO-Pro manipulation tasks under perturbation conditions and shows evidence of sim-to-real transfer, significantly reducing the effort required for real-robot programming across different embodiments and APIs.
Understanding how delayed verification affects multi-agent LLM belief stability can help improve system reliability and prevent misinformation spread in AI networks.
Details
Multi-agent LLM systems use verifier and critic agents to suppress hallucinations
False claims propagate during verification delay, leading to instability
Spectral decomposition via the grounded Laplacian yields a closed-form stability threshold
This paper explores how delayed verification destabilizes multi-agent large language model (LLM) belief systems. It models the process as a graph with grounded corrector nodes and finds that excessive or delayed correction can produce oscillations rather than consensus. The study identifies an instability threshold; for a delay of two, that threshold is the inverse golden ratio. It also proposes a supermodular placement objective for optimally allocating limited corrector resources and confirms its predictions through experiments on five open models.
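The golden-ratio threshold can be reproduced with a textbook delayed-feedback recurrence. The sketch below is an illustration under stated assumptions, not the paper's model: it assumes each belief mode follows x[t+1] = x[t] - k * x[t-d], where k (a hypothetical name) lumps the correction gain together with one grounded-Laplacian eigenvalue and d is the verification delay. For this recurrence the classical stability bound is k < 2 sin(pi / (2(2d + 1))), which equals the inverse golden ratio at d = 2.

```python
import math


def stability_threshold(delay: int) -> float:
    """Largest gain k for which x[t+1] = x[t] - k * x[t-delay] still decays.

    Assumed toy model, not taken from the paper.
    """
    return 2.0 * math.sin(math.pi / (2 * (2 * delay + 1)))


def simulate(k: float, delay: int, steps: int = 500) -> float:
    """Iterate the recurrence from a constant history and return the peak
    absolute value over the last 10 steps."""
    history = [1.0] * (delay + 1)  # x[-delay], ..., x[0]
    for _ in range(steps):
        # history[-1] is x[t]; history[-1 - delay] is x[t - delay].
        history.append(history[-1] - k * history[-1 - delay])
    return max(abs(x) for x in history[-10:])


INVERSE_GOLDEN_RATIO = (math.sqrt(5) - 1) / 2

if __name__ == "__main__":
    print(stability_threshold(2), INVERSE_GOLDEN_RATIO)
    print(simulate(0.5, delay=2))  # gain below the threshold: decays
    print(simulate(0.7, delay=2))  # gain above the threshold: oscillates and grows
```

In this toy setting, a gain just below the threshold gives damped oscillations, while a gain just above it gives growing oscillations, which matches the summary's point that overly strong or delayed correction produces oscillation rather than consensus.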
HydraCollab optimizes communication efficiency in multi-robot systems without compromising perception accuracy, making it a critical advancement for real-world distributed autonomous applications.
Details
HydraCollab selectively transmits the most informative sensor features to minimize bandwidth usage
Framework uses spatial confidence maps to dynamically adjust collaboration strategies
Outperforms state-of-the-art Where2comm on V2X-R and V2X-Radar datasets in terms of accuracy and communication cost
HydraCollab is an adaptive collaborative-perception framework designed to enhance situational awareness in multi-robot systems by optimizing communication efficiency and perception accuracy. It selectively transmits sensor data based on informativeness and employs dynamic collaboration strategies using spatial confidence maps. Evaluations on V2X-R, V2X-Radar, and UAV3D-mini datasets demonstrate that HydraCollab achieves superior performance relative to existing methods while significantly reducing bandwidth usage.
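Confidence-guided selective transmission can be sketched as a top-k selection under a bandwidth budget. This is a minimal illustration under stated assumptions, not HydraCollab's implementation: the 2D grid layout, the `select_cells` and `build_message` helpers, and the `budget` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def select_cells(confidence: List[List[float]], budget: int) -> List[Cell]:
    """Return the `budget` grid cells with the highest confidence scores."""
    cells = [
        (row, col)
        for row, values in enumerate(confidence)
        for col, _ in enumerate(values)
    ]
    # Stable sort by descending confidence; ties keep grid order.
    cells.sort(key=lambda rc: -confidence[rc[0]][rc[1]])
    return cells[:budget]


def build_message(
    features: List[List[List[float]]],
    confidence: List[List[float]],
    budget: int,
) -> Dict[Cell, List[float]]:
    """Pack only the selected cells' feature vectors into a sparse message."""
    return {(r, c): features[r][c] for r, c in select_cells(confidence, budget)}


if __name__ == "__main__":
    confidence_map = [[0.1, 0.9], [0.7, 0.2]]
    feature_map = [[[1.0], [2.0]], [[3.0], [4.0]]]
    print(build_message(feature_map, confidence_map, budget=2))
```

Under these assumptions, lowering `budget` reduces the number of transmitted cells directly, so communication cost trades off against how much of the scene a collaborator receives.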
LLM-based agentic systems like QPipe can autonomously generate quantum applications from natural-language requirements, potentially revolutionizing software engineering optimization.
Details
QPipe is a multi-agent architecture that translates NL requirements into traceable quantum-application workflows
Evaluates on 20 NL requirements with real-world benchmarks and test-optimization problems
The paper introduces QPipe, a large language model (LLM)-based multi-agent system designed to autonomously generate quantum applications from natural-language requirements for test optimization tasks. Evaluated on 20 natural-language requirements involving real-world benchmarks and test-optimization problems, QPipe demonstrates high success rates in code compilation and application execution, with average generation costs of 260.1 seconds and 1.89M tokens per requirement. The generated applications outperform an offline genetic algorithm baseline in most cases, highlighting the potential of agentic coordination for quantum software engineering.
Senior SWE-Bench provides a realistic benchmark for evaluating AI agents as senior software engineers, addressing the gap in current benchmarks that often assess agents at junior levels.
Details
Validation agent uses expert-designed recipes to write behavioral tests for submitted solutions
Scores combine runtime correctness with quality metrics based on observed codebase practices
Senior SWE-Bench is an open-source benchmark designed to evaluate AI agents as senior software engineers. It features realistic, underspecified instructions and tasks that reflect natural communication with agents. The validation process includes expert-designed recipes for behavioral tests, and bug tasks are sourced from PRs requiring significant runtime investigation. Scores are determined by combining runtime correctness with quality metrics based on observed codebase practices. Top models fail to complete senior-level tasks correctly over 75% of the time.
AutoTrainess demonstrates a significant leap in autonomous language model training by outperforming CLI-only methods on PostTrainBench.
Details
Achieves an average score of 26.94 with GPT-5.4 (Codex) on PostTrainBench
Raises the DeepSeek-V4-Flash score from 12.13 to 19.58 relative to the CLI-only baseline
Externalizes human experience into explicit workflows, rules, and execution constraints
AutoTrainess is a language model agent designed to autonomously improve other language models by externalizing prior human experience into structured workflows. It outperforms CLI-only methods on PostTrainBench, achieving an average score of 26.94 with GPT-5.4 (Codex) and improving DeepSeek-V4-Flash from 12.13 to 19.58. The framework makes training behavior in autonomous settings more reliable and effective.
TRIAGE addresses a critical limitation in agentic reinforcement learning by refining how credit is assigned to actions, potentially improving the efficiency and effectiveness of AI agents.
Details
TRIAGE introduces role-typed credit assignment for agentic reinforcement learning
Standard GRPO uses final verifier outcome as uniform advantage over all action tokens
TRIAGE classifies segments into decisive progress, useful exploration, no-progress infrastructure, or regression
The paper introduces TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning that addresses limitations of standard GRPO by classifying action segments into specific roles (decisive progress, useful exploration, no-progress infrastructure, or regression) and assigning rewards accordingly. This approach improves success rates in environments like ALFWorld, Search-QA, and WebShop compared to GRPO and other baselines, demonstrating the effectiveness of role-conditioned credit assignment.
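The contrast between uniform and role-typed credit can be sketched in a few lines. The group-normalized advantage below follows the standard GRPO formulation; everything role-specific is an assumption for illustration only, since the summary does not state how TRIAGE converts roles into rewards. The `ROLE_WEIGHTS` values, the `REGRESSION_PENALTY` constant, and the `role_typed_credit` helper are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical multipliers, chosen only for illustration.
ROLE_WEIGHTS: Dict[str, float] = {
    "decisive_progress": 1.0,
    "useful_exploration": 0.5,
    "no_progress_infrastructure": 0.0,
}
REGRESSION_PENALTY = 0.5  # hypothetical; regressions are penalized in any trajectory


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: each rollout's reward normalized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_credit(advantage: float, roles: List[str]) -> List[float]:
    """Standard GRPO: every segment inherits the same trajectory advantage."""
    return [advantage for _ in roles]


def role_typed_credit(advantage: float, roles: List[str]) -> List[float]:
    """Illustrative role-conditioned variant: scale the advantage by segment role."""
    credits = []
    for role in roles:
        if role == "regression":
            credits.append(-abs(advantage) * REGRESSION_PENALTY)
        else:
            credits.append(advantage * ROLE_WEIGHTS[role])
    return credits


if __name__ == "__main__":
    roles = ["no_progress_infrastructure", "useful_exploration",
             "regression", "decisive_progress"]
    advantage = group_advantages([1.0, 0.0])[0]
    print(uniform_credit(advantage, roles))
    print(role_typed_credit(advantage, roles))
```

With uniform credit, the regression segment of a successful rollout is reinforced as strongly as the decisive step; the role-typed variant in this sketch separates the two, which is the limitation of standard GRPO that the summary describes.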