New benchmark tests how AI agents search company knowledge bases

Who: Posted on the Sierra blog, authored by Ben Shi, Ola Zytek, Pedram Razavi, and Victor Barres — researchers at Sierra, a company that builds customer-service AI agents for enterprise clients.

What's new: Sierra has released tau-knowledge, a new benchmark for AI agents that must do two hard things at once: dig through a large pile of documents to find relevant policies, and then take correct multi-step actions based on what they find, all while handling a live customer conversation. Most existing tests measure only one of those skills in isolation.

How it works: The benchmark introduces a simulated banking customer-support environment called tau-Banking, built around 698 realistic documents covering products like savings accounts, credit cards, and buy-now-pay-later plans. Each task forces the agent to consult an average of 18.6 documents and execute an average of 9.5 tool calls to complete it; some tasks require up to 33 such calls. The agent also has access to three search methods, one of which is a freeform shell, and chooses its own strategy.

The numbers: When the benchmark launched in March 2026, the best model passed only 25.5% of tasks on a single attempt, a metric called Pass^1. The current leader, GPT-5.5 running at maximum reasoning effort, reaches 37.4% Pass^1. That is a meaningful gain, but the model still fails more than 60% of tasks. Handing the relevant documents directly to the agent, which removes the retrieval challenge entirely, only pushed scores to around 40%, showing that reasoning over the information is just as hard as finding it. By contrast, existing Sierra benchmarks covering airline, retail, and telecom scenarios routinely see scores above 80% Pass^1.
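
To make the Pass^1 figure concrete, here is a minimal Python sketch of how a single-attempt pass rate could be computed from repeated trials. It is an illustration, not Sierra's evaluation code. The function name pass_hat_k and the input layout (a dict mapping task IDs to lists of pass/fail outcomes) are hypothetical, and the generalization to k > 1 assumes the "all k sampled attempts succeed" definition used by the earlier tau-bench work, which the post summarized here does not spell out.

```python
from math import comb


def pass_hat_k(trial_results, k=1):
    """Estimate Pass^k: the chance that k independent attempts at a task all pass.

    trial_results is a hypothetical layout: {task_id: [True, False, ...]},
    one boolean per recorded attempt. For k=1 this is the mean per-task
    success rate, i.e. the single-attempt pass rate.
    """
    if not trial_results:
        raise ValueError("no tasks to score")

    per_task = []
    for task_id, outcomes in trial_results.items():
        n = len(outcomes)   # attempts recorded for this task
        c = sum(outcomes)   # attempts that passed
        if k > n:
            raise ValueError(f"task {task_id!r} has only {n} trials, need {k}")
        # Fraction of size-k subsets of the n attempts in which every attempt passed.
        per_task.append(comb(c, k) / comb(n, k))

    # Average the per-task estimates over all tasks.
    return sum(per_task) / len(per_task)


# Made-up outcomes for three tasks, four attempts each (not real benchmark data).
example = {
    "task_a": [True, True, False, True],
    "task_b": [False, False, False, False],
    "task_c": [True, True, True, True],
}
print(f"Pass^1 = {pass_hat_k(example, k=1):.3f}")
print(f"Pass^2 = {pass_hat_k(example, k=2):.3f}")
```

Under this definition the score can only stay flat or fall as k grows, so Pass^1 is the most forgiving version of the metric. A 37.4% Pass^1 therefore says the leading agent fails most tasks even when judged on one attempt.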

Why it matters: The behavioral patterns Sierra observed point to what actually separates good agents from poor ones: the best models treat searching as an ongoing activity rather than a one-time step at the start of a conversation, issue precise targeted queries rather than scattering many vague ones, and stop acting once the task is done rather than adding unrequested extras. These are exactly the failure modes that would matter in a real deployment, and no existing public benchmark was measuring them. The leaderboard, code, and tasks are open for any model provider to evaluate against.