Agents

Agentic systems, tool use, and orchestration.

arxiv.org | 2026-07-02 | AI agents | rel 8/10 | score 6.6

ASPIRE: Agentic Skills Discovery for Robotics

ASPIRE lets robots write and refine their own control programs through continual experience.

ASPIRE operates in an open-ended loop with three main components: robot execution engine, skill library, and evolutionary search
Achieves up to 77% improvement on LIBERO-Pro manipulation under perturbation compared to prior methods
Employs a code-as-policy paradigm for autonomous failure diagnosis and repair synthesis

Full summary

ASPIRE is an innovative continual learning system designed for robotics that autonomously writes and refines robot control programs in a code-as-policy paradigm. It consists of three components: a closed-loop execution engine, a skill library, and evolutionary search mechanisms. ASPIRE outperforms existing methods by up to 77% on LIBERO-Pro manipulation tasks under perturbation conditions and shows evidence of sim-to-real transfer, significantly reducing the effort required for real-robot programming across different embodiments and APIs.

arxiv.org | 2026-07-02 | AI agents | rel 8/10 | score 5.0

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

Understanding how delayed verification affects multi-agent LLM belief stability can help improve system reliability and prevent misinformation spread in AI networks.

Details

Multi-agent LLM systems use verifier and critic agents to suppress hallucinations
False claims propagate during the verification delay, leading to instability
Spectral decomposition of the grounded Laplacian yields a closed-form stability threshold

This paper explores how delayed verification destabilizes multi-agent large language model (LLM) belief systems. It models the process as a graph with grounded corrector nodes and finds that excessive or delayed correction can produce oscillations rather than consensus. The study derives an instability threshold; for a delay of two, that threshold is the inverse golden ratio. It also proposes a supermodular placement objective for allocating limited corrector resources and confirms its predictions through experiments on five open models.
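The inverse-golden-ratio value can be checked on a toy model. The sketch below assumes a scalar recursion x[t+1] = x[t] - k * x[t-2], where k is a hypothetical correction gain and the correction acts on a belief error observed two steps late. This recursion is an illustrative assumption, not necessarily the paper's grounded-Laplacian model. Its characteristic equation z^3 - z^2 + k = 0 has a root on the unit circle at angle pi/5 when k = 2 * sin(pi/10) = (sqrt(5) - 1) / 2, which is the inverse golden ratio.

```python
import math


def simulate(k, delay=2, steps=400):
    """Iterate x[t+1] = x[t] - k * x[t-delay] from a unit initial error.

    Assumed toy model: k is a hypothetical correction gain and the
    corrector only sees the error from `delay` steps ago.
    """
    x = [1.0] * (delay + 1)  # constant initial history
    for _ in range(steps):
        x.append(x[-1] - k * x[-1 - delay])
    return x


def tail_peak(k, delay=2, steps=400, tail=50):
    """Largest absolute error over the last `tail` steps of the simulation."""
    return max(abs(v) for v in simulate(k, delay, steps)[-tail:])


# Boundary gain for delay two, from the unit-circle root at angle pi/5.
threshold = 2 * math.sin(math.pi / 10)
inverse_golden_ratio = (math.sqrt(5) - 1) / 2

if __name__ == "__main__":
    print(f"threshold            = {threshold:.6f}")
    print(f"inverse golden ratio = {inverse_golden_ratio:.6f}")
    print(f"tail peak at k=0.5   = {tail_peak(0.5):.3e}")  # below threshold
    print(f"tail peak at k=0.7   = {tail_peak(0.7):.3e}")  # above threshold
```

In this toy model a gain of 0.5 damps the error, while a gain of 0.7 produces growing oscillations, which matches the qualitative claim that over-correcting on stale information prevents consensus.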

arxiv.org | 2026-07-02 | AI agents | rel 8/10 | score 5.8

HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems

HydraCollab optimizes communication efficiency in multi-robot systems without compromising perception accuracy, making it a critical advancement for real-world distributed autonomous applications.

Details

HydraCollab selectively transmits the most informative sensor features to minimize bandwidth usage
Framework uses spatial confidence maps to dynamically adjust collaboration strategies
Outperforms state-of-the-art Where2comm on V2X-R and V2X-Radar datasets in terms of accuracy and communication cost

HydraCollab is an adaptive collaborative-perception framework designed to enhance situational awareness in multi-robot systems by optimizing communication efficiency and perception accuracy. It selectively transmits sensor data based on informativeness and employs dynamic collaboration strategies using spatial confidence maps. Evaluations on V2X-R, V2X-Radar, and UAV3D-mini datasets demonstrate that HydraCollab achieves superior performance relative to existing methods while significantly reducing bandwidth usage.
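One way to picture confidence-driven selective transmission is a top-k rule over a spatial confidence map. The sketch below is an assumption for illustration only: the summary does not describe HydraCollab's actual selection rule, and the names `select_cells`, `sparse_message`, and the `budget` parameter are hypothetical.

```python
def select_cells(confidence, budget):
    """Return (row, col) indices of the `budget` most confident cells.

    `confidence` is a 2D list of per-cell scores. Ties keep row-major
    order because Python's sort is stable.
    """
    cells = [
        (score, r, c)
        for r, row in enumerate(confidence)
        for c, score in enumerate(row)
    ]
    cells.sort(key=lambda item: -item[0])
    return [(r, c) for _, r, c in cells[:budget]]


def sparse_message(features, confidence, budget):
    """Build the message an agent would send: only the selected cells' features."""
    return {(r, c): features[r][c] for r, c in select_cells(confidence, budget)}


if __name__ == "__main__":
    confidence = [[0.1, 0.9], [0.8, 0.2]]
    features = [["f00", "f01"], ["f10", "f11"]]
    # Hypothetical budget: transmit 2 of the 4 cells.
    print(sparse_message(features, confidence, budget=2))
```

Under this assumed rule, bandwidth scales with `budget` rather than with the full feature map, which is the trade-off between communication cost and accuracy that the evaluation measures.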

arxiv.org | 2026-07-02 | AI agents apps | rel 8/10 | score 4.8

Leveraging LLM-Based Agentic Systems to Generate Quantum Applications for Test Optimization

LLM-based agentic systems like QPipe can autonomously generate quantum applications from natural-language requirements, potentially revolutionizing software engineering optimization.

Details

QPipe is a multi-agent architecture that translates NL requirements into traceable quantum-application workflows
Evaluates on 20 NL requirements with real-world benchmarks and test-optimization problems
Achieves 100% code compilation success rate and 96.7% application execution success rate

The paper introduces QPipe, a large language model (LLM)-based multi-agent system designed to autonomously generate quantum applications from natural-language requirements for test optimization tasks. Evaluated on 20 natural-language requirements covering real-world benchmarks and test-optimization problems, QPipe achieves a 100% code compilation success rate and a 96.7% application execution success rate, with average generation costs of 260.1 seconds and 1.89M tokens per requirement. The generated applications outperform an offline genetic algorithm baseline in most cases, highlighting the potential of agentic coordination for quantum software engineering.

senior-swe-bench.snorkel.ai | 2026-07-02 | AI agents evals | rel 8/10 | score 5.2

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Senior SWE-Bench provides a realistic benchmark for evaluating AI agents as senior software engineers, addressing the gap in current benchmarks that often assess agents at junior levels.

Details

Validation agent uses expert-designed recipes to write behavioral tests for submitted solutions
Bug tasks sourced from PRs needing significant runtime investigation (logs, profiling data)
Scores combine runtime correctness with quality metrics based on observed codebase practices

Senior SWE-Bench is an open-source benchmark designed to evaluate AI agents as senior software engineers. It features realistic, underspecified instructions and tasks that reflect natural communication with agents. The validation process includes expert-designed recipes for behavioral tests, and bug tasks are sourced from PRs requiring significant runtime investigation. Scores are determined by combining runtime correctness with quality metrics based on observed codebase practices. Top models fail to complete senior-level tasks correctly over 75% of the time.

arxiv.org | 2026-07-02 | AI agents | rel 8/10 | score 5.4

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

AutoTrainess outperforms CLI-only baselines on PostTrainBench, a step toward language models that train other language models autonomously.

Details

Achieves an average score of 26.94 with GPT-5.4 (Codex) on PostTrainBench
Improves DeepSeek-V4-Flash from 12.13 to 19.58 relative to the CLI-only baseline
Externalizes human experience into explicit workflows, rules, and execution constraints

AutoTrainess is a language model agent designed to autonomously improve other language models by externalizing prior human experience into structured workflows. It outperforms CLI-only methods on the PostTrainBench, achieving an average score of 26.94 with GPT-5.4 (Codex) and improving DeepSeek-V4-Flash from 12.13 to 19.58. This framework enhances the reliability and effectiveness of training behavior in autonomous settings.

arxiv.org | 2026-07-01 | AI agents | rel 8/10 | score 4.8

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

TRIAGE addresses a critical limitation in agentic reinforcement learning by refining how credit is assigned to actions, potentially improving the efficiency and effectiveness of AI agents.

Details

TRIAGE introduces role-typed credit assignment for agentic reinforcement learning
Standard GRPO uses final verifier outcome as uniform advantage over all action tokens
TRIAGE classifies segments into decisive progress, useful exploration, no-progress infrastructure, or regression

The paper introduces TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning that addresses limitations of standard GRPO by classifying action segments into specific roles (decisive progress, useful exploration, no-progress infrastructure, or regression) and assigning rewards accordingly. This approach improves success rates in environments like ALFWorld, Search-QA, and WebShop compared to GRPO and other baselines, demonstrating the effectiveness of role-conditioned credit assignment.
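The contrast between uniform and role-conditioned credit can be sketched in a few lines. The GRPO part below follows the standard group-normalized advantage that is broadcast to every action token. The role-typed part is an assumption for illustration: the additive `ROLE_OFFSETS` values and the function `role_typed_token_advantages` are hypothetical and are not taken from the paper, which may combine roles and outcomes differently.

```python
from statistics import mean, pstdev

# Hypothetical per-role offsets, chosen only to illustrate the idea.
ROLE_OFFSETS = {
    "decisive_progress": 0.5,
    "useful_exploration": 0.1,
    "no_progress_infrastructure": 0.0,
    "regression": -0.5,
}


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / group std.

    Uses the population standard deviation; implementations differ on this.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_token_advantages(advantage, segments):
    """Standard GRPO: every action token gets the same trajectory advantage."""
    return [advantage for _, n_tokens in segments for _ in range(n_tokens)]


def role_typed_token_advantages(advantage, segments, offsets=ROLE_OFFSETS):
    """Assumed variant: shift the advantage per segment according to its role."""
    return [
        advantage + offsets[role]
        for role, n_tokens in segments
        for _ in range(n_tokens)
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # final verifier outcomes for a group of 4
    adv = grpo_advantages(rewards)[0]
    # One trajectory split into (role, token count) segments.
    segments = [("useful_exploration", 2), ("regression", 1), ("decisive_progress", 2)]
    print(uniform_token_advantages(adv, segments))
    print(role_typed_token_advantages(adv, segments))
```

With uniform credit, the regression segment of a successful trajectory is rewarded as much as the decisive step; a role-conditioned scheme separates them, which is the limitation the summary attributes to standard GRPO.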

DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation | arxiv.org

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation | arxiv.org

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents | arxiv.org

Ornith-1.0: self-improving open-source models for agentic coding | github.com

Micro-Agent: Beat Frontier Models with Collaboration inside Model API | vllm.ai