AI agents learn when to ask for help using confidence scores

Who: Posted to arXiv by Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, and Zhuosheng Zhang, researchers from Shanghai Jiao Tong University and Amazon. The paper introduces Mobile-Aptus, a framework that makes phone-controlling AI agents smarter about when to act alone and when to ask for help.

What's new: Most AI agents that operate smartphones err in one of two directions: they either barrel ahead on tasks they cannot complete, or they pester the user with constant check-ins for tasks they could handle alone. Mobile-Aptus is the first framework to address both failure modes at once by teaching an agent to track its own confidence level and use that number to decide, step by step, whether to proceed or pause.
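
The step-by-step decision can be pictured as a simple threshold gate. The sketch below is illustrative only: the `Step` type, the `decide` and `run_episode` functions, and the 0.7 threshold are assumptions made for this example, not details from the paper, which may make the decision differently.

```python
from dataclasses import dataclass

# Hypothetical cutoff chosen for illustration; not a value from the paper.
ASK_THRESHOLD = 0.7


@dataclass
class Step:
    action: str        # e.g. "tap(search_button)"
    confidence: float  # agent's self-reported confidence in [0, 1]


def decide(step: Step, threshold: float = ASK_THRESHOLD) -> str:
    """Return 'proceed' if the agent is confident enough, otherwise 'ask'."""
    if not 0.0 <= step.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "proceed" if step.confidence >= threshold else "ask"


def run_episode(steps: list[Step]) -> list[str]:
    """Apply the gate to every step of a task, in order."""
    return [decide(s) for s in steps]
```

With these assumptions, `run_episode([Step("open_app", 0.95), Step("confirm_payment", 0.4)])` returns `["proceed", "ask"]`: the agent handles the routine step alone and pauses on the one it is unsure about.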

How it works: The system trains the agent in two stages. The first stage teaches the agent to attach a confidence score to every action it takes on screen. The second stage combines additional techniques to sharpen those confidence scores so they reflect reality rather than the model's initial biases.
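
One common way to check whether confidence scores reflect reality is expected calibration error (ECE), which groups predictions into confidence bins and compares the average stated confidence with the observed accuracy in each bin. This brief does not say which measure the paper uses, so the sketch below is an assumption, with a hypothetical function name and bin count.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per action.
    correct: booleans, True where the action turned out to be right.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A score of 0 means confidence matches accuracy in every bin. An agent that reports 0.9 confidence on four actions but gets only one right would score about 0.65, a sign of strong overconfidence.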

The numbers: Tested across four benchmarks, including OS-Kairos, Mobile-Aptus beat all competing systems by an average of 17 percent on task completion. In live phone tests, it completed 26 percent more tasks than the next-best system while asking for human help only 0.64 times per instruction on average. Code is available at github.com/Wuzheng02/Mobile-Aptus.

Why it matters: As AI agents are increasingly trusted to book appointments, fill forms, and navigate apps on our behalf, knowing when to act and when to ask is a basic safety property. A system that is wrong but never asks is dangerous; one that asks constantly is useless. Mobile-Aptus shows a concrete, measurable path to hitting the middle ground, and because the confidence framework is designed to plug into existing models without rebuilding them from scratch, it could apply broadly to other agent settings beyond phones.