Evals — AI — Eclecta

senior-swe-bench.snorkel.ai | 2026-07-02 | AI agents evals | rel 8/10 | score 5.2

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Senior SWE-Bench evaluates AI agents on senior-level software engineering work, addressing a gap in current benchmarks, which often assess agents at a junior level.

Validation agent uses expert-designed recipes to write behavioral tests for submitted solutions
Bug tasks sourced from PRs needing significant runtime investigation (logs, profiling data)
Scores combine runtime correctness with quality metrics based on observed codebase practices

Full summary

Senior SWE-Bench is an open-source benchmark designed to evaluate AI agents as senior software engineers. Its tasks use realistic, underspecified instructions that reflect how people naturally communicate with agents. A validation agent follows expert-designed recipes to write behavioral tests for submitted solutions, and bug tasks are sourced from PRs that required significant runtime investigation, such as reading logs and profiling data. Scores combine runtime correctness with quality metrics based on observed codebase practices. Top models fail to complete senior-level tasks correctly over 75% of the time.

arxiv.org | 2026-07-01 | AI agents evals | rel 8/10 | score 5.3

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

The AFTER benchmark provides a standardized way to evaluate and improve procedural memory in LLM agents, crucial for enhancing their performance on recurring tasks.

Details

AFTER includes 382 realistic enterprise tasks spanning six professional roles and 22 procedural skills
A single refinement round can improve aggregate scores by 3.7 to 6.7 points
Skills evolved from diverse multi-model execution traces achieve 73.1% cross-model test accuracy

The AFTER benchmark evaluates procedural memory in LLM agents through a suite of 382 realistic enterprise tasks across six professional roles and 22 procedural skills. It assesses skill transferability across tasks, roles, and model backbones. Experiments show that procedural memory can significantly enhance performance: single refinement rounds improve aggregate scores by 3.7-6.7 points, and diverse multi-model traces yield 73.1% cross-model test accuracy. The study also reveals varying generalizability of skills, with some being broadly applicable while others are role-specific.

arxiv.org | 2026-07-01 | AI agents evals | rel 8/10 | score 5.4

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

QVal offers a cost-effective way to evaluate dense supervision signals in long-horizon LLM agents, enabling researchers to compare different methodologies without the need for extensive training runs.

Details

QVal is a training-free testbed that measures how well a method's score aligns with Q-values of a strong reference policy
Benchmarks 21 dense supervision methods across four diverse environments and seven methodological families
Conducted over 1.2K evaluation experiments using six open-weight model backbones

QVal is a testbed for evaluating dense supervision signals in long-horizon LLM agents without requiring training runs. It measures how well a method's scores for state-action pairs align with the Q-values of a strong reference policy. The study benchmarks 21 methods across four environments and seven methodological families, and finds that simple prompting baselines often outperform more complex recent approaches. The framework is designed to be extensible, so researchers can iterate on dense supervision methods before committing to training runs.
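The kind of alignment check described above can be sketched as follows. This is only an illustration: the summary does not say which alignment metric QVal uses, so Spearman rank correlation is an assumption here, and the function names `average_ranks` and `spearman_alignment` and the sample numbers are hypothetical, not part of QVal.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_alignment(method_scores, reference_q):
    """Spearman correlation between a method's scores and reference Q-values.

    Assumed metric (not confirmed by the source): Pearson correlation
    computed on the ranks of the two lists, one entry per state-action pair.
    """
    if len(method_scores) != len(reference_q) or len(method_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx = average_ranks(method_scores)
    ry = average_ranks(reference_q)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no ranking information
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for three state-action pairs, for illustration only.
same_order = spearman_alignment([0.1, 0.4, 0.9], [1.0, 2.0, 5.0])
reversed_order = spearman_alignment([0.9, 0.4, 0.1], [1.0, 2.0, 5.0])
```

A rank-based measure is used in this sketch because a supervision signal only needs to order state-action pairs the same way the reference policy does; it does not need to match the Q-values in scale.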

arxiv.org | 2026-06-30 | AI evals | rel 8/10 | score 5.1

Trimming the Long-Tail of Visual World Modeling Evaluation

Tailor-Bench reveals significant limitations in current visual world models' ability to generalize beyond common physical interactions, highlighting a critical gap in AI's understanding of the real world.

Details

Introduces Tailor-Bench for evaluating model performance on irregular physical interactions
Three scenario modes: Regular (common tool-task pairs), Unconventional (attribute-compatible substitutes), Impossible (attribute-violating tools)
Two settings under unified protocol: predictive generation and descriptive generation

The paper introduces Tailor-Bench, a benchmark designed to evaluate visual world models on their ability to simulate irregular physical interactions. It includes three scenario modes—Regular, Unconventional, and Impossible—to progressively challenge model reasoning. The benchmark also features two settings: predictive generation for inferring outcomes without guidance and descriptive generation for faithful realization of specified outcomes. Experimental results indicate a significant performance gap in handling uncommon scenarios compared to common ones, suggesting that current models struggle with generalizing beyond typical physical interactions.