Technique makes AI agents faster by compacting memory while thinking
Original: Long-horizon LLM agents accumulate conversation histories that blow past the context window. The usual fix is LLM-based summarization, which is lossy AND blocks the agent for tens of seconds while the
Source: x.com
Who: Posted by @sheriyuo (Xiuyu Li, a researcher at Penn State University, PSU), sharing her own paper on parallel context compaction, co-authored with PSU collaborators.
What's new: When an AI agent runs a long task, it accumulates a growing conversation history that eventually overflows its context window. The standard fix is to have another AI summarize the history, but that summarizer runs one step at a time, blocks progress for tens of seconds, produces unpredictably sized summaries, and retains different amounts of information on different runs. This paper introduces a parallel compaction method that eliminates the blocking delay and gives operators reliable, tunable control over what gets kept.
How it works: Instead of pausing the agent to wait for a summary, the new approach schedules summarization to run at the same time as the agent's next reasoning step, so the two overlap like cooking side dishes while the main dish is already in the oven. The method also restructures how the summarizer is instructed, so the operator can directly dial in how much information is retained rather than hoping the summarizer obeys a vague prompt. This fixes the non-determinism problem: the same history compresses to a consistent result across runs.
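The overlap idea can be sketched in a few lines of Python with asyncio. This is only an illustration under stated assumptions, not the paper's implementation: summarize and agent_step are hypothetical stand-ins that simulate model latency with a sleep, and the budget_chars parameter is an assumed, simplified form of the "dial" for how much information is retained. The sketch compares a blocking loop (pause, summarize, continue) with an overlapped loop (start summarizing, run the next step, then swap in the summary).

```python
import asyncio
import time

LATENCY = 0.05  # simulated seconds per model call (assumption for the demo)


async def summarize(summary, snapshot, budget_chars):
    """Hypothetical stand-in for an LLM summarizer with a hard size budget."""
    await asyncio.sleep(LATENCY)
    merged = " ".join(part for part in [summary, *snapshot] if part)
    return merged[-budget_chars:]  # keep at most budget_chars characters


async def agent_step(summary, recent, step):
    """Hypothetical stand-in for one agent reasoning step."""
    await asyncio.sleep(LATENCY)
    return f"obs{step}"


async def run_agent(num_steps, compact_every, budget_chars, overlap):
    summary, recent = "", []
    start = time.perf_counter()
    for step in range(num_steps):
        task = None
        if len(recent) >= compact_every:
            snapshot = list(recent)  # freeze what will be compacted
            if overlap:
                # Start compaction in the background; do not wait for it yet.
                task = asyncio.create_task(
                    summarize(summary, snapshot, budget_chars)
                )
            else:
                # Blocking baseline: the agent stalls until the summary is ready.
                summary = await summarize(summary, snapshot, budget_chars)
                recent = []
        # The agent keeps working on its current (uncompacted) context.
        recent.append(await agent_step(summary, recent, step))
        if task is not None:
            # Join the background compaction and drop the turns it covered.
            summary = await task
            recent = recent[len(snapshot):]
    return summary, recent, time.perf_counter() - start


if __name__ == "__main__":
    for mode in (False, True):
        summary, recent, elapsed = asyncio.run(
            run_agent(6, 2, budget_chars=12, overlap=mode)
        )
        print(f"overlap={mode}: summary={summary!r} recent={recent} "
              f"elapsed={elapsed:.2f}s")
```

In this toy setup both loops end with the same summary and the same recent turns, but the overlapped loop hides the summarizer's latency behind the agent's next step. A real system would also need to decide what the agent sees while compaction is still in flight, which this sketch sidesteps by letting it keep the uncompacted history for one more step.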
Why it matters: Production agents running long tasks (automated research assistants, coding helpers, or customer-service bots handling complex cases) pay this summarization tax constantly. The latency saving is effectively free because the summarization overlaps with work the agent was already doing, and the consistency fix means the agent behaves the same way each time it runs, which matters for debugging and reliability in real deployments.