New training method fixes AI agents' struggle with external tools
Original: Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Source: arxiv.org
Who: Authored by Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, and Byung-Kwan Lee — a team spanning NVIDIA, KAIST, and collaborating institutions — and posted as a preprint on arXiv in May 2026.
What's new: The paper identifies a concrete training failure that the authors call the Thinking-Acting Gap and introduces Agent Explorative Policy Optimization (AXPO) to fix it. When a model is trained both to reason on its own and to reach out to external tools, such as a calculator or a web search, it strongly prefers to reason silently. It rarely practices using tools, so it never gets good at them.
How it works: Standard training uses a method that compares a batch of the model's own answers and rewards the better ones. The trouble is that when every answer in a batch involving a tool call is wrong, no answer looks better than any other, so the model learns nothing from that batch. AXPO detects these all-wrong tool-using batches, freezes the reasoning text that led up to the tool call, and then reruns just the tool call and what follows. This gives the model fresh attempts at the specific step it keeps failing. An uncertainty filter picks which frozen reasoning prefixes are worth resampling, so the extra compute goes where it is most needed.
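The mechanism described above can be illustrated with a toy sketch. It is not the authors' implementation. The mean-and-standard-deviation normalization is one common way to compare answers within a batch and is an assumption here, as are all function names (`group_advantages`, `has_learning_signal`, `pick_prefixes`, `resample_after_prefix`) and the scripted continuations that stand in for a real model.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Score each answer relative to its batch (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A batch teaches something only if its answers differ in reward."""
    return len(set(rewards)) > 1


def pick_prefixes(prefixes, uncertainty, k):
    """Hypothetical filter: keep the k prefixes with the highest uncertainty score."""
    return sorted(prefixes, key=uncertainty, reverse=True)[:k]


def resample_after_prefix(prefix, sample_continuation, n_attempts):
    """Keep the reasoning prefix fixed and redraw only the tool call and what follows."""
    return [prefix + sample_continuation(prefix) for _ in range(n_attempts)]


# An all-wrong batch: every advantage is zero, so nothing is learned from it.
all_wrong = [0, 0, 0, 0]
dead_batch = group_advantages(all_wrong)

# A mixed batch: the single correct answer stands out from the rest.
mixed_batch = group_advantages([1, 0, 0, 0])

# Resampling: scripted continuations stand in for fresh model samples.
prefix = "Think: this needs arithmetic."
scripted = iter([" Tool: calculator(2+2)", " Tool: calculator(2*2)"])
attempts = resample_after_prefix(prefix, lambda p: next(scripted), 2)

# Filtering: the uncertainty scores below are made up for illustration.
scores = {"prefix_a": 0.1, "prefix_b": 0.9, "prefix_c": 0.5}
chosen = pick_prefixes(list(scores), scores.get, 2)
```

In this sketch the all-wrong batch produces identical advantages, which is the situation the article describes as "no answer looks better than any other". The resampled attempts share the same frozen prefix and differ only in the tool call.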
The numbers: Across nine benchmarks and three model sizes, adding AXPO on top of the baseline training method improves results by an average of 1.8 percentage points on each of two reported measures at the 8-billion-parameter scale. The 8-billion-parameter AXPO model also beats the 32-billion-parameter baseline on Pass@4 with one quarter of the parameters.
Why it matters: Making a smaller model outperform a model four times its size on tool-use reasoning tasks is a meaningful efficiency win — smaller models are cheaper and faster to run. More broadly, the paper shows that a targeted fix to a specific training blind spot can matter more than simply scaling up model size.