<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Eclecta — ai</title><description>Frontier and open-weight models, agents, the labs, and the policy around them.</description><link>https://eclecta.co/</link><language>en-us</language><docs>https://eclecta.co/ai/</docs><item><title>For the First Time, a Cell Built From Scratch Grows and Divides</title><link>https://quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701</link><guid isPermaLink="true">https://quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701</guid><description>This breakthrough represents a significant step towards understanding the origins of life and could pave the way for synthetic biology applications in material science, drug development, and beyond.</description><pubDate>Wed, 01 Jul 2026 18:45:42 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This breakthrough represents a significant step towards understanding the origins of life and could pave the way for synthetic biology applications in material science, drug development, and beyond.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;First synthetic cell built from scratch that can grow, replicate DNA, and divide&lt;/li&gt;&lt;li&gt;Led by Kate Adamala at the University of Minnesota&lt;/li&gt;&lt;li&gt;Involves lipid membrane, DNA replication system, commercial enzymes for reading DNA and making proteins&lt;/li&gt;&lt;li&gt;Requires constant deliveries of food and ribosomes to function&lt;/li&gt;&lt;li&gt;Potential applications include creating new materials like biofuels and drugs&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Researchers led by Kate Adamala at the University of Minnesota have created a synthetic cell from scratch that can grow, replicate its DNA, and divide. 
This cell, which is not yet self-sustaining, demonstrates the potential to generate life-like behavior from non-living components. The team used lipid membranes, a DNA replication system, and commercial enzymes for reading DNA and making proteins. While it requires constant deliveries of food and ribosomes, this breakthrough could lead to applications in material science, drug development, and understanding the origins of life.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48747304&quot;&gt;Hacker News (827) · 272c&lt;/a&gt; · &lt;a href=&quot;https://www.quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701/&quot;&gt;Mastodon trending links (28)&lt;/a&gt; · &lt;a href=&quot;https://www.quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701/&quot;&gt;Quanta Magazine&lt;/a&gt; · &lt;a href=&quot;https://www.quantamagazine.org/for-the-first-time-a-cell-built-from-scratch-grows-and-divides-20260701/&quot;&gt;Quanta Magazine – Quantum Computing&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Previewing GPT-5.6 Sol: a next-generation model</title><link>https://openai.com/index/previewing-gpt-5-6-sol</link><guid isPermaLink="true">https://openai.com/index/previewing-gpt-5-6-sol</guid><description>GPT-5.6 Sol introduces significant performance improvements and enhanced safety measures in coding, biology, and cybersecurity tasks, setting a new standard for AI model capabilities.</description><pubDate>Fri, 26 Jun 2026 22:00:12 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; GPT-5.6 Sol introduces significant performance improvements and enhanced safety measures in coding, biology, and 
cybersecurity tasks, setting a new standard for AI model capabilities.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;GPT-5.6 series includes Sol (flagship), Terra (balanced), and Luna (fast and affordable) models&lt;/li&gt;&lt;li&gt;Sol sets state-of-the-art on Terminal-Bench 2.1 and GeneBench v1, with strong cybersecurity performance on ExploitBench² and ExploitGym&lt;/li&gt;&lt;li&gt;Models priced per 1M tokens: Sol ($5 input / $30 output), Terra ($2.50 input / $15 output), Luna ($1 input / $6 output)&lt;/li&gt;&lt;li&gt;Safety features include layered safeguards, real-time checks, account-level reviews, and differentiated access&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;OpenAI introduces the GPT-5.6 series with Sol as the flagship model, Terra for balanced performance at half the cost of GPT-5.5, and Luna for strong capabilities at the lowest cost. Sol excels in coding, biology, and cybersecurity tasks, achieving state-of-the-art results on Terminal-Bench 2.1 and GeneBench v1, while demonstrating competitive performance with fewer tokens compared to previous models on ExploitBench² and ExploitGym. The series includes enhanced safety features such as layered safeguards, real-time checks, account-level reviews, and differentiated access, tested through extensive red-teaming efforts. 
Pricing is tiered based on model capabilities, with Sol priced at $5 input / $30 output per 1M tokens, Terra at $2.50 input / $15 output, and Luna at $1 input / $6 output.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://openai.com/index/previewing-gpt-5-6-sol&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48689028&quot;&gt;Hacker News (980) · 606c&lt;/a&gt; · &lt;a href=&quot;https://openai.com/index/previewing-gpt-5-6-sol&quot;&gt;OpenAI News&lt;/a&gt; · &lt;a href=&quot;https://openai.com/index/previewing-gpt-5-6-sol/&quot;&gt;Daring Fireball&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving</title><link>https://arxiv.org/abs/2607.00466</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00466</guid><description>ELDR optimizes routing for PD-disaggregated MoE models, reducing latency and improving efficiency in large-scale deployments.</description><pubDate>Thu, 02 Jul 2026 20:32:51 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; ELDR optimizes routing for PD-disaggregated MoE models, reducing latency and improving efficiency in large-scale deployments.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;ELDR uses expert-locality-aware routing to predict and partition the workload across decode workers&lt;/li&gt;&lt;li&gt;Balanced K-means partitions signature space offline; locality-band routing matches requests online&lt;/li&gt;&lt;li&gt;Signature cache co-indexed with KV cache ensures exact signatures under prefix caching&lt;/li&gt;&lt;li&gt;Implemented in vLLM, evaluated on up to 40 GPUs, showing median TPOT reductions of 5.9-13.9% over baselines&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;ELDR is an expert-locality-aware decode router designed for PD-disaggregated MoE models. 
It predicts the experts a request will activate during generation and partitions signature space using balanced K-means offline. Online, it uses locality-band routing to send requests to the least-loaded worker matching their signature. A co-indexed signature cache ensures exact signatures under prefix caching. Evaluated on up to 40 GPUs, ELDR reduces median TPOT by 5.9-13.9% over four load-balancing baselines without changing model outputs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00466&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2607.00466&quot;&gt;Hugging Face Daily Papers (17)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00466&quot;&gt;arXiv cs.DC&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Escape from Ostrogradsky via Hidden Ghost Parity</title><link>https://arxiv.org/abs/2607.00096</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00096</guid><description>This work challenges a long-standing theorem in quantum field theory, potentially opening new avenues for constructing viable high-energy physics models.</description><pubDate>Thu, 02 Jul 2026 17:17:09 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This work challenges a long-standing theorem in quantum field theory, potentially opening new avenues for constructing viable high-energy physics models.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Counterexample to Ostrogradsky&apos;s no-go theorem using four-derivative quantum field theory&lt;/li&gt;&lt;li&gt;Theory is UV-complete with consistent perturbative expansion&lt;/li&gt;&lt;li&gt;Quantization on indefinite state space (Krein space) ensures causality and unitarity&lt;/li&gt;&lt;li&gt;Generalized Born rule for Krein spaces maintains positive transition probabilities despite ghost 
states&lt;/li&gt;&lt;li&gt;Hidden &apos;ghost parity&apos; symmetry crucial for proof&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The article presents a counterexample to Ostrogradsky&apos;s no-go theorem in quantum field theory by introducing a four-derivative, UV-complete QFT with consistent perturbative expansion. The theory is quantized on an indefinite state space (Krein space) and maintains causality and unitarity through the use of covariant methods. A generalized Born rule for Krein spaces ensures positive transition probabilities despite ghost states, facilitated by a hidden &apos;ghost parity&apos; symmetry.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00096&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2607.00096&quot;&gt;arXiv gr-qc&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00096&quot;&gt;arXiv hep-ph&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00096&quot;&gt;arXiv hep-th&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00096&quot;&gt;arXiv math-ph&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>ASPIRE: Agentic /Skills Discovery for Robotics</title><link>https://arxiv.org/abs/2607.00272</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00272</guid><description>ASPIRE represents a significant advancement in autonomous robotics by enabling robots to learn and refine their own control programs through continuous experience.</description><pubDate>Thu, 02 Jul 2026 17:16:46 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; ASPIRE represents a significant advancement in autonomous robotics by enabling robots to learn and refine their own control programs through continuous experience.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;ASPIRE operates in an open-ended loop with three main components: robot execution engine, 
skill library, and evolutionary search&lt;/li&gt;&lt;li&gt;Achieves up to 77% improvement on LIBERO-Pro manipulation under perturbation compared to prior methods&lt;/li&gt;&lt;li&gt;Employs a code-as-policy paradigm for autonomous failure diagnosis and repair synthesis&lt;/li&gt;&lt;li&gt;Demonstrates sim-to-real transfer with evidence of reduced real-robot programming effort across different embodiments and APIs&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;ASPIRE is an innovative continual learning system designed for robotics that autonomously writes and refines robot control programs in a code-as-policy paradigm. It consists of three components: a closed-loop execution engine, a skill library, and evolutionary search mechanisms. ASPIRE outperforms existing methods by up to 77% on LIBERO-Pro manipulation tasks under perturbation conditions and shows evidence of sim-to-real transfer, significantly reducing the effort required for real-robot programming across different embodiments and APIs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00272&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2607.00272&quot;&gt;Hugging Face Daily Papers (9)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00272&quot;&gt;arXiv cs.AI&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00272&quot;&gt;arXiv cs.RO&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement</title><link>https://arxiv.org/abs/2606.27409</link><guid isPermaLink="true">https://arxiv.org/abs/2606.27409</guid><description>Understanding how delayed verification affects multi-agent LLM belief stability can help improve system reliability and prevent misinformation spread in AI networks.</description><pubDate>Thu, 02 Jul 2026 17:00:56 
GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Understanding how delayed verification affects multi-agent LLM belief stability can help improve system reliability and prevent misinformation spread in AI networks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Models use verifier and critic agents to suppress hallucinations&lt;/li&gt;&lt;li&gt;False claims propagate during verification delay, leading to instability&lt;/li&gt;&lt;li&gt;Spectral decomposition by grounded Laplacian yields a closed-form stability threshold&lt;/li&gt;&lt;li&gt;For delay two, the instability threshold is the inverse golden ratio (approximately 0.618)&lt;/li&gt;&lt;li&gt;Grounded factual answering eliminates oscillation effect&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This paper explores how delayed verification destabilizes multi-agent large language model (LLM) belief systems. It models this process using a graph with grounded corrector nodes and finds that excessive or delayed correction can lead to oscillations rather than consensus. The study identifies an instability threshold, particularly for delay two, which is the inverse golden ratio. 
Additionally, it suggests a supermodular placement objective for optimal allocation of limited corrector resources and confirms predictions through experiments on five open models.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27409&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.27409&quot;&gt;Hugging Face Daily Papers (3)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27409&quot;&gt;arXiv cs.CL&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27409&quot;&gt;arXiv cs.LG&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Mind the Heads: Topological Representation Alignment for Multimodal LLMs</title><link>https://arxiv.org/abs/2606.23885</link><guid isPermaLink="true">https://arxiv.org/abs/2606.23885</guid><description>HeRA offers a novel approach to aligning multimodal representations at the granularity of individual attention heads, potentially improving the accuracy and reliability of multimodal large language models (MLLMs).</description><pubDate>Thu, 02 Jul 2026 17:00:35 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; HeRA offers a novel approach to aligning multimodal representations at the granularity of individual attention heads, potentially improving the accuracy and reliability of multimodal large language models (MLLMs).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Proposes Head-Wise Representation Alignment (HeRA) method&lt;/li&gt;&lt;li&gt;Focuses on preserving topological structure using Mutual K-Nearest Neighbor (MKNN) alignment metric&lt;/li&gt;&lt;li&gt;Improves performance on challenging vision-centric tasks across multiple MLLMs and benchmarks&lt;/li&gt;&lt;li&gt;Aligning the least aligned heads yields the largest gains, contrary to intuition&lt;/li&gt;&lt;li&gt;Reduces visual hallucinations by curbing 
over-reliance on linguistic priors&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads in multimodal large language models (MLLMs). HeRA uses the Mutual K-Nearest Neighbor (MKNN) alignment metric to preserve topological structure across modalities. Evaluations show that aligning less aligned heads yields significant performance improvements on vision-centric tasks and reduces visual hallucinations by mitigating over-reliance on linguistic priors.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.23885&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.23885&quot;&gt;Hugging Face Daily Papers (3)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.23885&quot;&gt;arXiv cs.CL&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.23885&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems</title><link>https://arxiv.org/abs/2607.00191</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00191</guid><description>HydraCollab optimizes communication efficiency in multi-robot systems without compromising perception accuracy, making it a critical advancement for real-world distributed autonomous applications.</description><pubDate>Thu, 02 Jul 2026 16:27:58 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; HydraCollab optimizes communication efficiency in multi-robot systems without compromising perception accuracy, making it a critical advancement for real-world distributed autonomous applications.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;HydraCollab selectively transmits the most informative sensor features to minimize 
bandwidth usage&lt;/li&gt;&lt;li&gt;Framework uses spatial confidence maps to dynamically adjust collaboration strategies&lt;/li&gt;&lt;li&gt;Outperforms state-of-the-art Where2comm on V2X-R and V2X-Radar datasets in terms of accuracy and communication cost&lt;/li&gt;&lt;li&gt;Achieves 0.78% performance improvement over Where2comm using only 41% bandwidth on the V2X-R dataset&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;HydraCollab is an adaptive collaborative-perception framework designed to enhance situational awareness in multi-robot systems by optimizing communication efficiency and perception accuracy. It selectively transmits sensor data based on informativeness and employs dynamic collaboration strategies using spatial confidence maps. Evaluations on V2X-R, V2X-Radar, and UAV3D-mini datasets demonstrate that HydraCollab achieves superior performance relative to existing methods while significantly reducing bandwidth usage.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00191&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2607.00191&quot;&gt;arXiv cs.AI&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00191&quot;&gt;arXiv cs.LG&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00191&quot;&gt;arXiv cs.RO&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Scaling Up Thermodynamic AI Models</title><link>https://arxiv.org/abs/2607.00170</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00170</guid><description>Developing scalable training methods for thermodynamic AI models could enable more efficient, low-power edge computing solutions.</description><pubDate>Thu, 02 Jul 2026 16:27:34 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Developing scalable training methods for thermodynamic AI models could enable more efficient, low-power edge computing 
solutions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Thermodynamic computing devices based on the Ising model promise low-power AI inference and edge computing&lt;/li&gt;&lt;li&gt;A theoretical correspondence between high-temperature Gibbs-sampled Ising systems and feed-forward neural networks is turned into a scalable backpropagation-based algorithm&lt;/li&gt;&lt;li&gt;Image classification models achieve 94.9% accuracy on CIFAR-10 and 76.0% on CIFAR-100 under binary Gibbs sampling&lt;/li&gt;&lt;li&gt;Mathematical theory relates inference cost to accuracy and controls autocorrelation times&lt;/li&gt;&lt;li&gt;Asymptotic results show that inference cost is bounded by a tradeoff with performance&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper presents a scalable backpropagation-based algorithm for training deep convolutional networks on Ising machine hardware, which promises low-power AI inference. Theoretical work establishes the correspondence between high-temperature Gibbs-sampled Ising systems and feed-forward neural networks. Experimentally, image classification models reach 94.9% accuracy on CIFAR-10 and 76.0% on CIFAR-100 under binary Gibbs sampling. 
Additionally, a mathematical theory is developed to relate inference cost to accuracy and control autocorrelation times, with asymptotic results indicating a bounded tradeoff between inference cost and performance.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00170&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2607.00170&quot;&gt;arXiv cs.AI&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00170&quot;&gt;arXiv cs.LG&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00170&quot;&gt;arXiv cond-mat&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Leveraging LLM-Based Agentic Systems to Generate Quantum Applications for Test Optimization</title><link>https://arxiv.org/abs/2607.00939</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00939</guid><description>LLM-based agentic systems like QPipe can autonomously generate quantum applications from natural-language requirements, potentially revolutionizing software engineering optimization.</description><pubDate>Thu, 02 Jul 2026 07:07:51 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; LLM-based agentic systems like QPipe can autonomously generate quantum applications from natural-language requirements, potentially revolutionizing software engineering optimization.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;QPipe is a multi-agent architecture that translates NL requirements into traceable quantum-application workflows&lt;/li&gt;&lt;li&gt;Evaluates on 20 NL requirements with real-world benchmarks and test-optimization problems&lt;/li&gt;&lt;li&gt;Achieves 100% code compilation success rate and 96.7% application execution success rate&lt;/li&gt;&lt;li&gt;Average generation costs are 260.1 seconds and 1.89M tokens per requirement&lt;/li&gt;&lt;li&gt;Outperforms offline genetic algorithm 
baseline in most cases&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces QPipe, a large language model (LLM)-based multi-agent system designed to autonomously generate quantum applications from natural-language requirements for test optimization tasks. Evaluated on 20 natural-language requirements spanning real-world benchmarks and test-optimization problems, QPipe achieves a 100% code compilation success rate and a 96.7% application execution success rate, with average generation costs of 260.1 seconds and 1.89M tokens per requirement. The generated applications outperform an offline genetic algorithm baseline in most cases, highlighting the potential of agentic coordination for quantum software engineering.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00939&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2607.00939&quot;&gt;arXiv cs.SE&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00939&quot;&gt;arXiv quant-ph&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment</title><link>https://arxiv.org/abs/2607.00572</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00572</guid><description>HARC offers a novel approach to enhancing safety alignment in large language models by coupling harmfulness and refusal directions, potentially mitigating vulnerabilities that allow jailbreaks.</description><pubDate>Thu, 02 Jul 2026 06:52:40 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; HARC offers a novel approach to enhancing safety alignment in large language models by coupling harmfulness and refusal directions, potentially mitigating vulnerabilities that allow jailbreaks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;HARC is a fine-tuning method for LLMs that pairs harmfulness and refusal directions across prompt and response positions&lt;/li&gt;&lt;li&gt;The 
method achieves the strongest robustness-capability-usability trade-off among six baselines tested&lt;/li&gt;&lt;li&gt;HARC&apos;s intervention leaves the rest of the residual stream intact, preserving general capability without over-refusal&lt;/li&gt;&lt;li&gt;Findings are consistent across five model families and two scales, indicating broad applicability&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method for large language models that enhances safety alignment by coupling harmfulness and refusal directions. The study reveals that jailbreaks succeed by suppressing either the refusal or harmfulness direction before token generation. HARC pairs these two directions across both prompt and response positions, showing robust performance without degrading general model capability. Across extensive experiments with five model families and two scales, HARC demonstrates superior trade-offs compared to existing methods.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00572&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2607.00572&quot;&gt;arXiv cs.AI&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00572&quot;&gt;arXiv cs.CR&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers</title><link>https://senior-swe-bench.snorkel.ai/</link><guid isPermaLink="true">https://senior-swe-bench.snorkel.ai/</guid><description>Senior SWE-Bench provides a realistic benchmark for evaluating AI agents as senior software engineers, addressing the gap in current benchmarks that often assess agents at junior levels.</description><pubDate>Thu, 02 Jul 2026 06:37:20 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Senior SWE-Bench provides a realistic benchmark for evaluating AI 
agents as senior software engineers, addressing the gap in current benchmarks that often assess agents at junior levels.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Validation agent uses expert-designed recipes to write behavioral tests for submitted solutions&lt;/li&gt;&lt;li&gt;Bug tasks sourced from PRs needing significant runtime investigation (logs, profiling data)&lt;/li&gt;&lt;li&gt;Scores combine runtime correctness with quality metrics based on observed codebase practices&lt;/li&gt;&lt;li&gt;Top models fail to complete senior-level tasks correctly over 75% of the time&lt;/li&gt;&lt;li&gt;Tasks span multiple services and require hundreds of steps&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Senior SWE-Bench is an open-source benchmark designed to evaluate AI agents as senior software engineers. It features realistic, underspecified instructions and tasks that reflect natural communication with agents. The validation process includes expert-designed recipes for behavioral tests, and bug tasks are sourced from PRs requiring significant runtime investigation. Scores are determined by combining runtime correctness with quality metrics based on observed codebase practices. 
Top models fail to complete senior-level tasks correctly over 75% of the time.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://senior-swe-bench.snorkel.ai/&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48755928&quot;&gt;Hacker News (150) · 102c&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>AI summaries of Tripadvisor hotel reviews downplay serious complaints, investigation finds</title><link>https://theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints</link><guid isPermaLink="true">https://theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints</guid><description>AI-generated hotel review summaries may misrepresent serious issues, potentially endangering travelers&apos; safety and trust in the platform.</description><pubDate>Thu, 02 Jul 2026 06:36:37 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; AI-generated hotel review summaries may misrepresent serious issues, potentially endangering travelers&apos; safety and trust in the platform.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Tripadvisor&apos;s AI-generated review summaries downplay complaints such as food poisoning, sexual harassment, and lack of clean water&lt;/li&gt;&lt;li&gt;The Riu Palace Santa Maria in Cape Verde was described as &apos;spotless&apos;, despite guest reports of raw chicken and illness&lt;/li&gt;&lt;li&gt;Tripadvisor is refining its AI tool but advises users to check full reviews and other sites for accuracy&lt;/li&gt;&lt;li&gt;Google removed some health-related AI summaries due to misleading information&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;An investigation by Which? 
found that Tripadvisor&apos;s AI-generated hotel review summaries often downplay serious complaints such as food poisoning, sexual harassment, and lack of clean water. For instance, the Riu Palace Santa Maria in Cape Verde was described positively despite guest reports of raw chicken and illness. While Tripadvisor is refining its AI tool, it advises users to verify information against full reviews and other sites. This issue highlights potential risks to traveler safety and trust in automated review systems.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://www.theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints&quot;&gt;The Guardian&lt;/a&gt; · &lt;a href=&quot;https://www.theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints&quot;&gt;World news | The Guardian&lt;/a&gt; · &lt;a href=&quot;https://www.theguardian.com/business/2026/jul/02/ai-summaries-tripadvisor-hotel-reviews-downplay-serious-complaints&quot;&gt;Guardian Technology&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>ABot-M0.5: Unified Mobility-and-Manipulation World Action Model</title><link>https://arxiv.org/abs/2607.00678</link><guid isPermaLink="true">https://arxiv.org/abs/2607.00678</guid><description>ABot-M0.5 addresses key challenges in mobile manipulation by improving temporal granularity, disentangling action spaces, and enhancing train-test consistency.</description><pubDate>Thu, 02 Jul 2026 05:49:18 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; ABot-M0.5 addresses key challenges in mobile manipulation by improving temporal granularity, disentangling action spaces, and enhancing train-test 
consistency.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Introduces intermediate latent actions to bridge video latents and embodiment-specific controls&lt;/li&gt;&lt;li&gt;Uses dual-level Mixture-of-Transformers architecture for modality representation and action subspace disentanglement&lt;/li&gt;&lt;li&gt;Proposes dream-forcing training strategy to improve model robustness and alignment&lt;/li&gt;&lt;li&gt;Achieves state-of-the-art performance in long-horizon task success and fine-grained control accuracy&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;ABot-M0.5 is a new World Action Model (WAM) designed for mobile manipulation, addressing limitations of existing models by introducing intermediate latent actions to improve temporal granularity. It employs a dual-level Mixture-of-Transformers architecture to disentangle modality representations and action subspaces, enhancing the model&apos;s ability to handle complex tasks. Additionally, ABot-M0.5 uses a dream-forcing training strategy to ensure better train-test alignment and robustness during autoregressive prediction. 
Experimental results show superior performance in long-horizon task success and fine-grained control accuracy.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00678&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2607.00678&quot;&gt;Hugging Face Daily Papers (8)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00678&quot;&gt;arXiv cs.CV&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2607.00678&quot;&gt;arXiv cs.RO&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors</title><link>https://arxiv.org/abs/2606.32029</link><guid isPermaLink="true">https://arxiv.org/abs/2606.32029</guid><description>This research provides a systematic evaluation of data referencing errors in LLMs when processing tabular data, offering insights into improving model reliability and accuracy.</description><pubDate>Thu, 02 Jul 2026 05:34:30 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This research provides a systematic evaluation of data referencing errors in LLMs when processing tabular data, offering insights into improving model reliability and accuracy.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;LLMs make data referencing errors (DREs) despite understanding table structure&lt;/li&gt;&lt;li&gt;Systematic evaluation shows DREs occur across models from 1.7B to 20B parameters&lt;/li&gt;&lt;li&gt;Incorporating a critic improves answer accuracy by up to 12.0%&lt;/li&gt;&lt;li&gt;A lightweight 4B-parameter critic model achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper presents the first systematic evaluation of data referencing errors (DREs) in large language models (LLMs) when processing tabular 
data, showing that these errors occur across various model sizes from 1.7B to 20B parameters. The study demonstrates that incorporating a critic mechanism can significantly improve answer accuracy by up to 12.0%. Additionally, the researchers developed a lightweight 4B-parameter critic model capable of detecting both in-distribution and out-of-distribution DREs with an average F1 score of 78.2%, effectively assisting larger models during inference.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32029&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.32029&quot;&gt;Hugging Face Daily Papers (3)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32029&quot;&gt;arXiv cs.CL&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>AutoTrainess: Teaching Language Models to Improve Language Models Autonomously</title><link>https://arxiv.org/abs/2606.31551</link><guid isPermaLink="true">https://arxiv.org/abs/2606.31551</guid><description>AutoTrainess demonstrates a significant leap in autonomous language model training by outperforming CLI-only methods on PostTrainBench.</description><pubDate>Thu, 02 Jul 2026 05:34:05 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; AutoTrainess demonstrates a significant leap in autonomous language model training by outperforming CLI-only methods on PostTrainBench.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Achieves an average score of 26.94 with GPT-5.4 (Codex) on PostTrainBench&lt;/li&gt;&lt;li&gt;Improves DeepSeek-V4-Flash from 12.13 to 19.58 over CLI-only baselines&lt;/li&gt;&lt;li&gt;Externalizes human experience into explicit workflows, rules, and execution constraints&lt;/li&gt;&lt;li&gt;Improves reliability and effectiveness of training behavior in autonomous 
settings&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;AutoTrainess is a language model agent designed to autonomously improve other language models by externalizing prior human experience into structured workflows. It outperforms CLI-only methods on PostTrainBench, achieving an average score of 26.94 with GPT-5.4 (Codex) and improving DeepSeek-V4-Flash from 12.13 to 19.58. This framework enhances the reliability and effectiveness of training behavior in autonomous settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31551&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.31551&quot;&gt;Hugging Face Daily Papers (6)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31551&quot;&gt;arXiv cs.CL&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>TerraDiT-Ω: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive</title><link>https://arxiv.org/abs/2606.31029</link><guid isPermaLink="true">https://arxiv.org/abs/2606.31029</guid><description>TerraDiT-Ω offers a novel approach to generating satellite imagery from any geospatial primitive, enhancing the applicability of generative models in geographic information systems.</description><pubDate>Thu, 02 Jul 2026 00:18:49 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; TerraDiT-Ω offers a novel approach to generating satellite imagery from any geospatial primitive, enhancing the applicability of generative models in geographic information systems.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Generates satellite images directly from native geospatial primitives like polygons and polylines&lt;/li&gt;&lt;li&gt;Proposes Geometry-Aware Local Attention mechanism for injecting geometric cues into attention space&lt;/li&gt;&lt;li&gt;Outperforms dense-control and 
sparse-control baselines across various conditioning formats&lt;/li&gt;&lt;li&gt;Supports controllable synthetic data augmentation using a single model, improving performance in land-cover segmentation, object detection, road graph extraction, and scene classification&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;TerraDiT-Ω is a unified spatial control framework for satellite image synthesis that leverages any native geospatial primitive. It introduces Geometry-Aware Local Attention to inject geometric cues into the attention space during generation. This approach outperforms existing dense-control and sparse-control methods, enabling controllable synthetic data augmentation with improved performance in land-cover segmentation, object detection, road graph extraction, and scene classification tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31029&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.31029&quot;&gt;Hugging Face Daily Papers (2)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31029&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning</title><link>https://arxiv.org/abs/2606.32017</link><guid isPermaLink="true">https://arxiv.org/abs/2606.32017</guid><description>TRIAGE addresses a critical limitation in agentic reinforcement learning by refining how credit is assigned to actions, potentially improving the efficiency and effectiveness of AI agents.</description><pubDate>Wed, 01 Jul 2026 23:33:07 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; TRIAGE addresses a critical limitation in agentic reinforcement learning by refining how credit is assigned to actions, potentially improving the efficiency and effectiveness of AI 
agents.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;TRIAGE introduces role-typed credit assignment for agentic reinforcement learning&lt;/li&gt;&lt;li&gt;Standard GRPO uses final verifier outcome as uniform advantage over all action tokens&lt;/li&gt;&lt;li&gt;TRIAGE classifies segments into decisive progress, useful exploration, no-progress infrastructure, or regression&lt;/li&gt;&lt;li&gt;Role-conditioned credit reduces advantage estimation error when the judge is reliable&lt;/li&gt;&lt;li&gt;TRIAGE outperforms GRPO and other baselines in ALFWorld, Search-QA, and WebShop&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning that addresses limitations of standard GRPO by classifying action segments into specific roles (decisive progress, useful exploration, no-progress infrastructure, or regression) and assigning rewards accordingly. This approach improves success rates in environments like ALFWorld, Search-QA, and WebShop compared to GRPO and other baselines, demonstrating the effectiveness of role-conditioned credit assignment.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32017&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.32017&quot;&gt;Hugging Face Daily Papers (7)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32017&quot;&gt;arXiv cs.LG&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Scientists Asked AI to Impersonate 112 Public Figures. 
What Happened Next Is a ‘Dire’ Warning</title><link>https://404media.co/untitled-28</link><guid isPermaLink="true">https://404media.co/untitled-28</guid><description>The ability of large language models to convincingly mimic public figures raises significant concerns about the spread of misinformation in political discourse.</description><pubDate>Wed, 01 Jul 2026 23:00:24 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The ability of large language models to convincingly mimic public figures raises significant concerns about the spread of misinformation in political discourse.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;GPT-4 Turbo was trained on data from BBC&apos;s Question Time and Wikipedia biographies&lt;/li&gt;&lt;li&gt;112 UK public figures were impersonated by AI, with participants rating AI-generated responses as more authentic, coherent, and relevant than real ones&lt;/li&gt;&lt;li&gt;More than half of the 948 participants found AI impersonations more convincing in terms of authenticity&lt;/li&gt;&lt;li&gt;The study highlights potential risks to political integrity and public trust&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;A PLOS One study reveals that large language models (LLMs) can convincingly impersonate public figures, with participants rating AI-generated responses as more authentic, coherent, and relevant than actual debate responses. The research involved GPT-4 Turbo trained on data from BBC&apos;s Question Time and Wikipedia biographies of 112 UK public figures. 
Despite the high profile of real politicians, participants found AI impersonations more convincing, raising concerns about misinformation in political discourse.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://404media.co/untitled-28&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://www.404media.co/untitled-28/&quot;&gt;Mastodon trending links (4)&lt;/a&gt; · &lt;a href=&quot;https://www.404media.co/untitled-28/&quot;&gt;404 Media&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Ante: A New Way to Blend Borrow Checking and Reference Counting</title><link>https://verdagon.dev/blog/ante-blending-borrowing-rc</link><guid isPermaLink="true">https://verdagon.dev/blog/ante-blending-borrowing-rc</guid><description>Ante offers a novel approach to memory safety by blending reference counting with borrow checking, enabling safer and more flexible code without run-time crashes.</description><pubDate>Wed, 01 Jul 2026 22:46:01 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Ante offers a novel approach to memory safety by blending reference counting with borrow checking, enabling safer and more flexible code without run-time crashes.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Ante allows multiple mutable references to the same struct simultaneously&lt;/li&gt;&lt;li&gt;Uses &apos;shared&apos; keyword for automatically reference-counted types&lt;/li&gt;&lt;li&gt;Enables mutably borrowing fields of shared mutable types without locking&lt;/li&gt;&lt;li&gt;Provides a compile-time mechanism to safely handle unions and their variants&lt;/li&gt;&lt;li&gt;Requires type analysis to ensure no variable in scope can alias the converted unique reference&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Ante introduces a new programming language that blends reference counting with borrow checking, offering memory safety without 
run-time crashes. It allows multiple mutable references to the same struct at once and uses a &apos;shared&apos; keyword for automatically reference-counted types. Ante enables mutably borrowing fields of shared mutable types without locking and provides compile-time mechanisms to safely handle unions and their variants. However, it requires type analysis to ensure no variable in scope can alias the converted unique reference.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://verdagon.dev/blog/ante-blending-borrowing-rc&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48710770&quot;&gt;Hacker News (92) · 20c&lt;/a&gt; · &lt;a href=&quot;https://lobste.rs/s/vv4fhi/ante_new_way_blend_borrow_checking&quot;&gt;Lobsters (70) · 21c&lt;/a&gt; · &lt;a href=&quot;https://verdagon.dev/blog/ante-blending-borrowing-rc&quot;&gt;Verdagon / Evan Ovadia Blog (Vale lang)&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Lorentz-Violating Scenarios for the Highest-Energy Photons from GRB 221009A</title><link>https://arxiv.org/abs/2504.01830</link><guid isPermaLink="true">https://arxiv.org/abs/2504.01830</guid><description>This research challenges conventional physics models by providing evidence for Lorentz invariance violation through the detection of an extremely high-energy photon.</description><pubDate>Wed, 01 Jul 2026 21:26:53 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This research challenges conventional physics models by providing evidence for Lorentz invariance violation through the detection of an extremely high-energy photon.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A photon with energy ${\cal E} \simeq 251 \, {\rm TeV}$ from GRB 221009A was initially detected by the Carpet collaboration in 2022&lt;/li&gt;&lt;li&gt;Full data analysis confirms a photon energy 
of ${\cal E} = 300^{+43}_{-38} \, {\rm TeV}$ with high confidence&lt;/li&gt;&lt;li&gt;Standard models predict absorption by the CMB for photons at this energy level, making the detection anomalous&lt;/li&gt;&lt;li&gt;Detection is disfavored in scenarios involving axion-like particles (ALPs) alone&lt;/li&gt;&lt;li&gt;Lorentz invariance violation (LIV) frameworks are compatible with the observed photon under specific conditions&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The Carpet collaboration has confirmed the detection of a high-energy photon from GRB 221009A at ${\cal E} = 300^{+43}_{-38} \, {\rm TeV}$, challenging conventional physics models. Standard propagation models predict absorption by the cosmic microwave background (CMB) for such photons, making this detection anomalous. The research disfavors explanations involving axion-like particles alone and instead supports specific Lorentz invariance violation frameworks with energy scales of ${\cal E}_{{\rm LIV}, 1} &amp;lt; 1.22_{-0.22}^{+0.19} \times 10^{21} \, {\rm GeV}$ and ${\cal E}_{{\rm LIV}, 2} &amp;lt; 2.03_{-0.22}^{+0.17} \times 10^{13} \, {\rm GeV}$ at the $95\%$ confidence level.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2504.01830&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2504.01830&quot;&gt;arXiv gr-qc&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2504.01830&quot;&gt;arXiv hep-ph&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2504.01830&quot;&gt;arXiv hep-th&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Are We Measuring Strategy or Phrasing? 
The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning</title><link>https://arxiv.org/abs/2606.29985</link><guid isPermaLink="true">https://arxiv.org/abs/2606.29985</guid><description>The article highlights a critical gap in how we measure the diversity of large language models&apos; (LLMs) mathematical reasoning, which could impact their ability to solve problems creatively and effectively.</description><pubDate>Wed, 01 Jul 2026 21:09:12 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The article highlights a critical gap in how we measure the diversity of large language models&apos; (LLMs) mathematical reasoning, which could impact their ability to solve problems creatively and effectively.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;LLM mathematical reasoning diversity is crucial for exploration&lt;/li&gt;&lt;li&gt;Common metrics capture surface-level variation but not differences in problem-solving strategies&lt;/li&gt;&lt;li&gt;A human-calibrated LLM judge framework assesses approach-level diversity&lt;/li&gt;&lt;li&gt;Approach-diverse candidate sets improve test-time scaling&lt;/li&gt;&lt;li&gt;Optimizing an LLM judge diversity reward during training exploits judge-specific preferences&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The article introduces the concept of &apos;approach-level diversity&apos; in large language models (LLMs) to measure variation in problem-solving strategies rather than just surface-level differences. Using a human-calibrated LLM judge framework, it shows that existing metrics are unreliable proxies for approach-level diversity. 
The study finds that while approach-diverse candidate sets improve test-time scaling, optimizing an LLM judge diversity reward during training leads the model to exploit specific preferences rather than broaden its approaches.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.29985&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.29985&quot;&gt;Hugging Face Daily Papers (16)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.29985&quot;&gt;arXiv cs.CL&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation</title><link>https://arxiv.org/abs/2606.31537</link><guid isPermaLink="true">https://arxiv.org/abs/2606.31537</guid><description>DataEvolver offers a novel approach to text-rich image generation by leveraging rejected data to enhance model performance and efficiency.</description><pubDate>Wed, 01 Jul 2026 20:20:33 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; DataEvolver offers a novel approach to text-rich image generation by leveraging rejected data to enhance model performance and efficiency.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Proposes DataEvolver, a self-evolving multi-agent framework&lt;/li&gt;&lt;li&gt;Improves OCR-F1 scores: 85.3% on TextScenesHQ, 35.3% on LongTextBench at 0.75M scale&lt;/li&gt;&lt;li&gt;Includes Retriever, Verifier, Critic, and Generator agents for feedback-driven policy evolution&lt;/li&gt;&lt;li&gt;Rejected samples provide actionable feedback to improve data construction&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;DataEvolver is a self-evolving multi-agent framework designed to enhance text-rich image generation by incorporating rejected data into the training process. 
The system consists of four agents: Retriever, Verifier, Critic, and Generator, which work together to evolve feedback-driven policies for constructing high-quality datasets. Experiments show significant improvements in OCR-F1 scores on TextScenesHQ (85.3%) and LongTextBench (35.3%) benchmarks compared to fixed-dataset baselines at the 0.75M scale.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31537&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.31537&quot;&gt;Hugging Face Daily Papers (18)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.31537&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation</title><link>https://arxiv.org/abs/2606.23127</link><guid isPermaLink="true">https://arxiv.org/abs/2606.23127</guid><description>The AFTER benchmark provides a standardized way to evaluate and improve procedural memory in LLM agents, crucial for enhancing their performance on recurring tasks.</description><pubDate>Wed, 01 Jul 2026 20:05:42 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The AFTER benchmark provides a standardized way to evaluate and improve procedural memory in LLM agents, crucial for enhancing their performance on recurring tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;AFTER includes 382 realistic enterprise tasks spanning six professional roles and 22 procedural skills&lt;/li&gt;&lt;li&gt;Single refinement rounds can boost performance by 3.7-6.7 points&lt;/li&gt;&lt;li&gt;Skills evolved from diverse multi-model execution traces achieve 73.1% cross-model test accuracy&lt;/li&gt;&lt;li&gt;Some skills generalize broadly while others become specialized to specific workflows&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The AFTER benchmark 
evaluates procedural memory in LLM agents through a suite of 382 realistic enterprise tasks across six professional roles and 22 procedural skills. It assesses skill transferability across tasks, roles, and model backbones. Experiments show that procedural memory can significantly enhance performance: single refinement rounds improve aggregate scores by 3.7-6.7 points, and diverse multi-model traces yield 73.1% cross-model test accuracy. The study also reveals varying generalizability of skills, with some being broadly applicable while others are role-specific.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.23127&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.23127&quot;&gt;Hugging Face Daily Papers (18)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.23127&quot;&gt;arXiv cs.SE&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents</title><link>https://arxiv.org/abs/2606.32034</link><guid isPermaLink="true">https://arxiv.org/abs/2606.32034</guid><description>QVal offers a cost-effective way to evaluate dense supervision signals in long-horizon LLM agents, enabling researchers to compare different methodologies without the need for extensive training runs.</description><pubDate>Wed, 01 Jul 2026 19:19:10 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; QVal offers a cost-effective way to evaluate dense supervision signals in long-horizon LLM agents, enabling researchers to compare different methodologies without the need for extensive training runs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;QVal is a training-free testbed that measures how well a method&apos;s score aligns with Q-values of a strong reference policy&lt;/li&gt;&lt;li&gt;Benchmarks 21 dense 
supervision methods across four diverse environments and seven methodological families&lt;/li&gt;&lt;li&gt;Conducted over 1.2K evaluation experiments using six open-weight model backbones&lt;/li&gt;&lt;li&gt;Simple prompting baselines outperform recent dense supervision methods from literature&lt;/li&gt;&lt;li&gt;Performance clusters strongly by family, consistent across model sizes, environments, and observation modalities&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;QVal is a novel testbed introduced to evaluate dense supervision signals in long-horizon LLM agents without requiring training runs. It assesses the alignment of these signals with Q-values from a strong reference policy for state-action pairs. The study benchmarks 21 methods across diverse environments and methodological families, revealing that simple prompting baselines often outperform more complex recent approaches. This framework is designed to be extensible, allowing researchers to iterate on dense supervision methods before committing to training runs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32034&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.32034&quot;&gt;Hugging Face Daily Papers (9)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32034&quot;&gt;arXiv cs.CL&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.32034&quot;&gt;arXiv cs.LG&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Introducing TabFM: A zero-shot foundation model for tabular data</title><link>https://research.google/blog/introducing-tabfm-a-zero-shot-foundation-model-for-tabular-data</link><guid isPermaLink="true">https://research.google/blog/introducing-tabfm-a-zero-shot-foundation-model-for-tabular-data</guid><description>TabFM offers a zero-shot approach to tabular data prediction, eliminating the need for manual feature engineering and hyperparameter tuning, 
thus significantly simplifying machine learning workflows.</description><pubDate>Wed, 01 Jul 2026 18:15:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; TabFM offers a zero-shot approach to tabular data prediction, eliminating the need for manual feature engineering and hyperparameter tuning, thus significantly simplifying machine learning workflows.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;TabFM uses in-context learning (ICL) to process tabular data without traditional training phases&lt;/li&gt;&lt;li&gt;Trained on hundreds of millions of synthetic datasets generated by structural causal models (SCMs)&lt;/li&gt;&lt;li&gt;Evaluations show superior performance compared to industry-standard supervised algorithms on TabArena benchmarks&lt;/li&gt;&lt;li&gt;Integration with Google BigQuery allows for advanced regression and classification tasks via a simple SQL command&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Google Research introduces TabFM, a zero-shot foundation model designed specifically for tabular data classification and regression. By leveraging in-context learning (ICL), TabFM bypasses the need for manual feature engineering and hyperparameter tuning, offering high-quality predictions with minimal effort. Trained on synthetic datasets generated using structural causal models (SCMs), TabFM demonstrates superior performance across various benchmarks. 
The model is being integrated into Google BigQuery, enabling users to perform advanced tasks via a simple SQL command.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://research.google/blog/introducing-tabfm-a-zero-shot-foundation-model-for-tabular-data&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48739919&quot;&gt;Hacker News (61) · 8c&lt;/a&gt; · &lt;a href=&quot;https://research.google/blog/introducing-tabfm-a-zero-shot-foundation-model-for-tabular-data/&quot;&gt;Google Research Blog&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Scaling Laws, Carefully</title><link>https://lilianweng.github.io/posts/2026-06-24-scaling-laws</link><guid isPermaLink="true">https://lilianweng.github.io/posts/2026-06-24-scaling-laws</guid><description>Scaling laws dictate optimal resource allocation in deep learning model training, influencing the efficiency and effectiveness of large language model development.</description><pubDate>Wed, 01 Jul 2026 18:00:42 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Scaling laws dictate optimal resource allocation in deep learning model training, influencing the efficiency and effectiveness of large language model development.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Scaling laws describe how training loss decreases predictably as model size (N), dataset size (D), and compute (C) increase, following a power-law curve.&lt;/li&gt;&lt;li&gt;Kaplan et al. (2020) recommend scaling model size faster than data under fixed compute budget: N_opt ∝ C^0.73&lt;/li&gt;&lt;li&gt;Chinchilla paper (Hoffmann et al., 2022) argues for equal scaling of model and data sizes: N_opt ∝ C^0.5.&lt;/li&gt;&lt;li&gt;Muennighoff et al. 
(2023) introduced a method to fit scaling laws in the presence of repeated data, adjusting for unique tokens and repetitions.&lt;/li&gt;&lt;li&gt;Lovelace et al. (2026) added an overfitting penalty term based on capacity ratio N / U_D, showing larger models are more sensitive to repetition.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Scaling laws provide a framework for predicting the relationship between model size, dataset size, and compute in deep learning training. Kaplan et al. (2020) proposed that optimal model scaling should outpace data scaling under fixed compute constraints, but Chinchilla (Hoffmann et al., 2022) challenges this by advocating for equal scaling of both dimensions. Muennighoff et al. (2023) developed methods to fit these laws in scenarios with repeated data, while Lovelace et al. (2026) introduced an overfitting penalty term that highlights the increased sensitivity of larger models to repetition.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://lilianweng.github.io/posts/2026-06-24-scaling-laws&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48689744&quot;&gt;Hacker News (62) · 16c&lt;/a&gt; · &lt;a href=&quot;https://lilianweng.github.io/posts/2026-06-24-scaling-laws/&quot;&gt;Lilian Weng (Lil&apos;Log)&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Claude Sonnet 5</title><link>https://anthropic.com/news/claude-sonnet-5</link><guid isPermaLink="true">https://anthropic.com/news/claude-sonnet-5</guid><description>Claude Sonnet 5 offers enhanced agentic capabilities at a lower cost compared to previous models, making it more accessible for developers and businesses.</description><pubDate>Tue, 30 Jun 2026 19:45:53 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Claude Sonnet 5 offers enhanced agentic capabilities at a lower cost compared to previous models, making it more accessible 
for developers and businesses.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Claude Sonnet 5 is the most cost-effective model with high performance in agentic tasks like coding and tool use, narrowing the gap with larger Opus models.&lt;/li&gt;&lt;li&gt;Safety evaluations show reduced rates of undesirable behaviors and improved refusal of malicious requests compared to Sonnet 4.6.&lt;/li&gt;&lt;li&gt;Pricing: $2 per million input tokens and $10 per million output tokens through August 31, 2026; then increases to $3 and $15 respectively.&lt;/li&gt;&lt;li&gt;Cybersecurity safeguards are enabled by default due to slightly higher rates of partial success in cybersecurity tasks compared to Sonnet 4.6.&lt;/li&gt;&lt;li&gt;Available across all plans including Free, Pro, Max, Team, and Enterprise tiers.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Claude Sonnet 5 is introduced as a more cost-effective alternative to larger Opus models with enhanced agentic capabilities. It demonstrates improved safety metrics and performance in coding, tool use, and cybersecurity tasks compared to its predecessor, Sonnet 4.6. 
The model is available across all plans at introductory pricing of $2 per million input tokens and $10 per million output tokens through August 31, 2026, after which prices rise to $3 and $15 respectively.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://anthropic.com/news/claude-sonnet-5&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48736605&quot;&gt;Hacker News (1122) · 665c&lt;/a&gt; · &lt;a href=&quot;https://simonwillison.net/2026/Jun/30/claude-sonnet-5/#atom-everything&quot;&gt;Simon Willison&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Claude Code Is Steganographically Marking Requests</title><link>https://thereallo.dev/blog/claude-code-prompt-steganography</link><guid isPermaLink="true">https://thereallo.dev/blog/claude-code-prompt-steganography</guid><description>Claude Code&apos;s use of steganography in system prompts raises concerns about transparency and trust in developer tools that have extensive access privileges.</description><pubDate>Tue, 30 Jun 2026 18:43:57 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Claude Code&apos;s use of steganography in system prompts raises concerns about transparency and trust in developer tools that have extensive access privileges.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Claude Code binary modifies date strings and apostrophes to encode hidden data&lt;/li&gt;&lt;li&gt;Checks for specific timezones, API base URLs, and AI lab keywords&lt;/li&gt;&lt;li&gt;Uses Unicode characters (’, ʼ) to mark conditions invisibly&lt;/li&gt;&lt;li&gt;Domain list is stored as a base64 string XOR-decoded with key 91&lt;/li&gt;&lt;li&gt;Feature likely intended to detect unauthorized resellers or gateways&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The Claude Code binary includes a function that alters date strings and apostrophes for 
steganographic marking, embedding information about the system&apos;s timezone, API base URL, and AI lab keywords. This technique uses Unicode characters to encode conditions invisibly within the prompt text. The domain list is stored as a base64 string XOR-decoded with key 91. While likely intended to detect unauthorized resellers or gateways, this implementation raises concerns about transparency and trust in developer tools that require extensive access permissions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://thereallo.dev/blog/claude-code-prompt-steganography&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48734373&quot;&gt;Hacker News (1995) · 581c&lt;/a&gt; · &lt;a href=&quot;https://lobste.rs/s/qs2sxd/claude_code_is_steganographically&quot;&gt;Lobsters (85) · 8c&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Trimming the Long-Tail of Visual World Modeling Evaluation</title><link>https://arxiv.org/abs/2606.24256</link><guid isPermaLink="true">https://arxiv.org/abs/2606.24256</guid><description>Tailor-Bench reveals significant limitations in current visual world models&apos; ability to generalize beyond common physical interactions, highlighting a critical gap in AI&apos;s understanding of the real world.</description><pubDate>Tue, 30 Jun 2026 07:25:54 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Tailor-Bench reveals significant limitations in current visual world models&apos; ability to generalize beyond common physical interactions, highlighting a critical gap in AI&apos;s understanding of the real world.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Introduces Tailor-Bench for evaluating model performance on irregular physical interactions&lt;/li&gt;&lt;li&gt;Three scenario modes: Regular (common tool-task pairs), Unconventional 
(attribute-compatible substitutes), Impossible (attribute-violating tools)&lt;/li&gt;&lt;li&gt;Two settings under unified protocol: predictive generation and descriptive generation&lt;/li&gt;&lt;li&gt;Experimental results show degradation in performance from Regular to Unconventional to Impossible scenarios&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces Tailor-Bench, a benchmark designed to evaluate visual world models on their ability to simulate irregular physical interactions. It includes three scenario modes—Regular, Unconventional, and Impossible—to progressively challenge model reasoning. The benchmark also features two settings: predictive generation for inferring outcomes without guidance and descriptive generation for faithful realization of specified outcomes. Experimental results indicate a significant performance gap in handling uncommon scenarios compared to common ones, suggesting that current models struggle with generalizing beyond typical physical interactions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.24256&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.24256&quot;&gt;Hugging Face Daily Papers (35)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.24256&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>CogniRoute: Learning to Route Social Evidence in Omni-Modal Models</title><link>https://arxiv.org/abs/2606.20970</link><guid isPermaLink="true">https://arxiv.org/abs/2606.20970</guid><description>CogniRoute advances the state-of-the-art in omni-modal reasoning by introducing a schema-guided Mixture-of-Experts framework that significantly improves accuracy on complex social video question answering tasks.</description><pubDate>Tue, 30 Jun 2026 04:28:18 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; CogniRoute advances the 
state-of-the-art in omni-modal reasoning by introducing a schema-guided Mixture-of-Experts framework that significantly improves accuracy on complex social video question answering tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;CogniRoute achieves 59.38% average accuracy on OmniSocialBench, outperforming proprietary and open-source baselines&lt;/li&gt;&lt;li&gt;Introduces route-aware reinforcement learning to optimize token generation and expert allocation&lt;/li&gt;&lt;li&gt;Constructs OmniSocialBench with 118K structured training examples for social video QA tasks&lt;/li&gt;&lt;li&gt;Framework uses a cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;CogniRoute is a novel Mixture-of-Experts framework designed to enhance omni-modal reasoning in social contexts. It leverages route-aware reinforcement learning to optimize token generation and expert allocation based on cognitive schemas that factorize examples by cross-modal relation, reasoning demand, and temporal scope. 
The system achieves 59.38% average accuracy on the newly constructed OmniSocialBench dataset, which includes 118K structured training examples for social video question answering tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.20970&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.20970&quot;&gt;Hugging Face Daily Papers (1)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.20970&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Ornith-1.0: self-improving open-source models for agentic coding</title><link>https://github.com/deepreinforce-ai/Ornith-1</link><guid isPermaLink="true">https://github.com/deepreinforce-ai/Ornith-1</guid><description>Ornith-1.0 introduces a novel reinforcement learning approach that optimizes both scaffold generation and solution rollouts, achieving state-of-the-art performance in agentic coding tasks.</description><pubDate>Tue, 30 Jun 2026 04:14:30 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Ornith-1.0 introduces a novel reinforcement learning approach that optimizes both scaffold generation and solution rollouts, achieving state-of-the-art performance in agentic coding tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE variants&lt;/li&gt;&lt;li&gt;Achieves top performance on Terminal-Bench 2.1, SWE-Bench, NL2Repo, and OpenClaw benchmarks&lt;/li&gt;&lt;li&gt;Uses RL to optimize scaffold generation alongside solution rollouts for better search trajectories&lt;/li&gt;&lt;li&gt;MIT licensed with multi-GPU support and full-precision serving options&lt;/li&gt;&lt;li&gt;Supports an OpenAI-compatible interface and tool calling&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Ornith-1.0 is a self-improving open-source model designed for agentic 
coding tasks, available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE variants. It employs reinforcement learning to optimize both scaffold generation and solution rollouts, achieving state-of-the-art performance on multiple coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo, and OpenClaw. The model is MIT licensed, supports multi-GPU configurations, and offers full-precision serving options. It also provides an OpenAI-compatible interface with tool calling capabilities.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://github.com/deepreinforce-ai/Ornith-1&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48722052&quot;&gt;Hacker News (236) · 44c&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Vesta: A Generalist Embodied Reasoning Model</title><link>https://arxiv.org/abs/2606.20905</link><guid isPermaLink="true">https://arxiv.org/abs/2606.20905</guid><description>Vesta demonstrates that a unified generalist model can outperform specialist models in robotics, offering a more efficient and scalable solution.</description><pubDate>Tue, 30 Jun 2026 03:56:56 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Vesta demonstrates that a unified generalist model can outperform specialist models in robotics, offering a more efficient and scalable solution.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Vesta consolidates localization, spatial reasoning, navigation, and long-horizon planning into one model&lt;/li&gt;&lt;li&gt;Improves task success by over 35% on real-world robotic tasks requiring memory and reasoning&lt;/li&gt;&lt;li&gt;Beats individual state-of-the-art (SOTA) baselines by more than 20% across diverse benchmarks&lt;/li&gt;&lt;li&gt;Combines a curated corpus for spatial grounding with a multimodal memory 
harness&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Vesta is an embodied generalist model designed to integrate localization, spatial reasoning, navigation, and long-horizon planning into a single framework. It outperforms individual state-of-the-art (SOTA) models by over 20% across various benchmarks and improves task success by more than 35% on real-world robotic tasks requiring memory and reasoning. Vesta&apos;s approach involves using a curated corpus for spatial grounding and a multimodal memory harness to enable extended time horizon reasoning.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.20905&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.20905&quot;&gt;Hugging Face Daily Papers (9)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.20905&quot;&gt;arXiv cs.RO&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Micro-Agent: Beat Frontier Models with Collaboration inside Model API</title><link>https://vllm.ai/blog/2026-06-29-micro-agent-frontier-models</link><guid isPermaLink="true">https://vllm.ai/blog/2026-06-29-micro-agent-frontier-models</guid><description>vLLM Semantic Router introduces a new paradigm for AI request routing, enabling cost optimization, safety enforcement, and improved response quality without changing client integration.</description><pubDate>Tue, 30 Jun 2026 01:50:09 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; vLLM Semantic Router introduces a new paradigm for AI request routing, enabling cost optimization, safety enforcement, and improved response quality without changing client integration.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;vLLM Semantic Router uses patterns like Confidence, Ratings, ReMoM, Fusion, and Workflows to handle requests&lt;/li&gt;&lt;li&gt;Evaluation shows VSR Closed outperforms other models 
in LiveCodeBench (92.6) and GPQA-Diamond (96.0)&lt;/li&gt;&lt;li&gt;The system maintains a single API surface while allowing operators to control the recipe&lt;/li&gt;&lt;li&gt;Micro-agents belong in the router due to its ownership of model aliases, provider policy, credentials, etc.&lt;/li&gt;&lt;li&gt;vLLM Semantic Router aims to be programmable, observable, and open at the serving layer&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The vLLM Semantic Router introduces a new approach to AI request routing by implementing collaboration patterns within the router. These patterns include Confidence, Ratings, ReMoM, Fusion, and Workflows, which optimize cost, enforce safety policies, and enhance response quality. The system evaluates requests based on evidence and selects appropriate model pools or collaboration recipes. Evaluation results show that VSR Closed outperforms other models in benchmarks like LiveCodeBench (92.6) and GPQA-Diamond (96.0). This approach maintains a single API surface while allowing operators to control the underlying recipe, making it programmable, observable, and open at the serving layer.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://vllm.ai/blog/2026-06-29-micro-agent-frontier-models&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48722802&quot;&gt;Hacker News (64) · 19c&lt;/a&gt; · &lt;a href=&quot;https://vllm.ai/blog/2026-06-29-micro-agent-frontier-models&quot;&gt;vLLM Blog&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>What happens when you run a CUDA kernel?</title><link>https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel</link><guid isPermaLink="true">https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel</guid><description>Understanding the detailed execution flow of a CUDA kernel provides insights into GPU architecture and optimization techniques for high-performance 
computing.</description><pubDate>Mon, 29 Jun 2026 16:12:13 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Understanding the detailed execution flow of a CUDA kernel provides insights into GPU architecture and optimization techniques for high-performance computing.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A CUDA program compiles into PTX (a virtual ISA) and then into SASS specific to the GPU architecture&lt;/li&gt;&lt;li&gt;The nvcc driver runs multiple compilers to generate both host code and device code&lt;/li&gt;&lt;li&gt;SASS is the machine code that executes on the GPU, while PTX acts as a fallback for compatibility with other architectures&lt;/li&gt;&lt;li&gt;A GPU launch involves communication between the CPU and GPU over the PCIe bus using pushbuffer and GPFIFO structures&lt;/li&gt;&lt;li&gt;Each Streaming Multiprocessor (SM) can handle up to 48 warps, with each warp consisting of 32 threads&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The article provides a detailed breakdown of how a CUDA kernel executes from source code to hardware level on an RTX 4090 GPU. It covers the compilation process where PTX (a virtual ISA) is generated and then translated into SASS specific to the GPU architecture. The launch mechanism involves interactions between the CPU and GPU over the PCIe bus, using structures like pushbuffer and GPFIFO for command execution. 
Each SM can manage up to 48 warps, with each warp consisting of 32 threads, highlighting the intricate hardware-level operations involved in executing a CUDA kernel.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://news.ycombinator.com/item?id=48718863&quot;&gt;Hacker News (254) · 30c&lt;/a&gt; · &lt;a href=&quot;https://lobste.rs/s/qkfzto/what_happens_when_you_run_cuda_kernel&quot;&gt;Lobsters (1)&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation</title><link>https://arxiv.org/abs/2606.27978</link><guid isPermaLink="true">https://arxiv.org/abs/2606.27978</guid><description>The proposed Parallel Rollout Approximation (PRA) framework addresses key challenges in pixel-space continuous-token autoregressive image generation, offering a scalable solution that achieves state-of-the-art results on ImageNet-1K.</description><pubDate>Mon, 29 Jun 2026 15:55:35 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; The proposed Parallel Rollout Approximation (PRA) framework addresses key challenges in pixel-space continuous-token autoregressive image generation, offering a scalable solution that achieves state-of-the-art results on ImageNet-1K.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Proposes PRA to address high-dimensional patch generation errors and train-inference gap&lt;/li&gt;&lt;li&gt;Achieves FID of 2.58 with PRA-S model (135M parameters) on ImageNet-1K at 256x256 resolution&lt;/li&gt;&lt;li&gt;Scales to PRA-L with 511M parameters, achieving FID of 1.94 and setting new state-of-the-art among pixel-space AR models&lt;/li&gt;&lt;li&gt;Improves ImageNet classification probing accuracy compared to other 
autoregressive and diffusion baselines&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper introduces Parallel Rollout Approximation (PRA), a scalable framework for pixel-space continuous-token autoregressive image generation. PRA generates low-dimensional intermediate states, maps them back to pixel-space tokens with a decoder, and constructs inference-like pixel inputs independently across positions. This approach mitigates the train-inference gap and high-dimensional patch generation errors. On ImageNet-1K at 256x256 resolution, PRA-S (135M parameters) achieves an FID of 2.58, surpassing previous results. Scaling to PRA-L with 511M parameters further improves the FID to 1.94, setting a new state-of-the-art benchmark among pixel-space AR models.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27978&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.27978&quot;&gt;Hugging Face Daily Papers (4)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27978&quot;&gt;arXiv cs.AI&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27978&quot;&gt;arXiv cs.CV&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs</title><link>https://arxiv.org/abs/2606.27378</link><guid isPermaLink="true">https://arxiv.org/abs/2606.27378</guid><description>This paper introduces an axiomatic framework to evaluate latent thought representations in LLMs independently of downstream benchmark scores, revealing fundamental limitations that current models cannot overcome.</description><pubDate>Mon, 29 Jun 2026 15:40:02 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This paper introduces an axiomatic framework to evaluate latent thought representations in LLMs independently of downstream benchmark scores, revealing fundamental limitations that 
current models cannot overcome.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Introduces four functional axioms: Causality, Minimality, Separability, and Stability&lt;/li&gt;&lt;li&gt;Evaluates open-weight LLMs across 23 reasoning tasks&lt;/li&gt;&lt;li&gt;No model satisfies all four axioms simultaneously&lt;/li&gt;&lt;li&gt;Representations distinguish task type but not between questions within the same task&lt;/li&gt;&lt;li&gt;Indicates a structural gap rather than an issue with model size or training procedure&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The paper presents an axiomatic evaluation framework for latent thought representations in large language models (LLMs), independent of downstream benchmark scores. It defines four functional axioms—Causality, Minimality, Separability, and Stability—and evaluates open-weight LLMs across 23 reasoning tasks. The study finds that no model satisfies all four axioms simultaneously, indicating a structural limitation in current representation methods rather than an issue with model capacity or training procedures.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Read&lt;/strong&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27378&quot;&gt;Primary source&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Surfaced on&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/papers/2606.27378&quot;&gt;Hugging Face Daily Papers (38)&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27378&quot;&gt;arXiv cs.CL&lt;/a&gt; · &lt;a href=&quot;https://arxiv.org/abs/2606.27378&quot;&gt;arXiv cs.LG&lt;/a&gt;&lt;/p&gt;</content:encoded></item></channel></rss>