Most AI agents today suffer from a fundamental problem: amnesia. Deploy an agent to browse the web, solve GitHub issues, or navigate shopping platforms, and it performs every single task as if it has never seen anything like it before. No matter how many times it has faced the same problem, it repeats the same mistakes. Valuable lessons are lost once the work is done.
A team of researchers from Google Cloud AI, the University of Illinois Urbana-Champaign, and Yale University introduced ReasoningBank, a memory framework that doesn’t just record what an agent did – it distills *why* that attempt worked or failed into reusable, generalizable reasoning strategies.
The problem with existing agent memory
To understand why ReasoningBank matters, you need to understand what existing agent memory actually does. Two popular approaches are trajectory memory (used in a system called Synapse) and workflow memory (used in Agent Workflow Memory, or AWM). Trajectory memory stores the raw action log – every click, scroll, and typed query an agent executes. Workflow memory goes one step further and extracts reusable step-by-step procedures, but only from successful trajectories.
Both have serious blind spots. Raw trajectories are noisy and too long to be directly useful for new tasks. Workflow memory only mines successful attempts, meaning that the rich learning signals hidden in each failure – and agents fail a lot – are completely discarded.

How does ReasoningBank work?
ReasoningBank operates as a closed-loop memory process with three stages that runs around each completed task: memory retrieval, memory extraction, and memory consolidation.

Before an agent starts a new task, it queries ReasoningBank using embedding-based similarity search to retrieve the top-k most relevant memory items. Those items are injected directly into the agent’s system prompt as additional context. Importantly, the default is k=1 – a single retrieved memory item per task. Ablation experiments show that retrieving more memories actually hurts performance: the success rate drops from 49.7% at k=1 to 44.4% at k=4. The quality and relevance of the retrieved memory matters more than quantity.
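The paper does not publish reference code for this step, but the retrieval described above can be sketched as plain cosine-similarity ranking over pre-computed embeddings (the function name and array shapes are illustrative assumptions):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 1) -> list[int]:
    """Return indices of the k memory items most similar to the query (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per memory item
    return np.argsort(-sims)[:k].tolist()
```

With the default k=1, only the single best-matching memory item would be injected into the system prompt.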
Once the task is finished, a memory extractor – powered by the same backbone LLM as the agent – analyzes the trajectory and distills it into a structured memory item. Every item has three components: a title (a short strategy name), a description (a one-sentence summary), and content (1-3 sentences of distilled reasoning steps or operational insights). Importantly, the extractor treats successful and unsuccessful trajectories differently: successes contribute validated strategies, while failures provide counterfactual signals and preventive lessons.
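A minimal sketch of the memory-item structure and the success/failure branching described above (the prompt wording and function names are my assumptions, not the paper's exact prompts):

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # short strategy name
    description: str  # one-sentence summary
    content: str      # 1-3 sentences of distilled reasoning or insights

def extraction_prompt(trajectory: str, success: bool) -> str:
    """Build an extractor prompt; the framing differs for successes vs. failures."""
    framing = ("Distill the strategy that made this trajectory succeed."
               if success else
               "Distill the mistake behind this failure and a preventive lesson.")
    return (framing + " Return a title, a one-sentence description, and "
            "1-3 sentences of content.\n\nTrajectory:\n" + trajectory)
```

The key design choice is that failures get their own framing rather than being discarded, which is what separates ReasoningBank from success-only approaches like AWM.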
To decide whether a trajectory was successful – without access to ground-truth labels at test time – the system uses an LLM-as-judge, which outputs a binary “success” or “failure” decision given the user query, trajectory, and final page state. The judge does not need to be perfect; ablation experiments show that ReasoningBank remains robust even when the judge’s accuracy drops to about 70%.
New memory items are appended directly to the ReasoningBank store – maintained as JSON with pre-computed embeddings for fast cosine-similarity search – completing the loop.
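Consolidation as described is simply an append to a JSON store that keeps each item's embedding alongside its text; a minimal sketch (file layout and function name assumed):

```python
import json
import os

def consolidate(store_path: str, item: dict, embedding: list[float]) -> int:
    """Append a memory item (with its embedding) to the JSON store; return store size."""
    store = []
    if os.path.exists(store_path):
        with open(store_path) as f:
            store = json.load(f)
    store.append({**item, "embedding": embedding})
    with open(store_path, "w") as f:
        json.dump(store, f)
    return len(store)
```

Storing embeddings at write time is what makes the later retrieval step a pure in-memory similarity search, with no re-embedding of old items.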
MaTTS: combining memory with test-time scaling
The research team goes a step further and introduces Memory-aware Test-Time Scaling (MaTTS), which combines ReasoningBank with test-time compute scaling – a technique that has already proven powerful in math reasoning and coding tasks.
The insight is simple but important: scaling test-time compute generates multiple trajectories for the same task. Instead of keeping only the best answer and discarding the rest, MaTTS uses the entire set of trajectories as a rich source of contrastive signals for memory extraction.
MaTTS comes in two variants. Parallel scaling generates k independent trajectories for the same query, then uses self-contrast – comparing what went right and what went wrong across trajectories – to extract higher-quality, more reliable memory items. Sequential scaling iteratively refines a single trajectory using self-refinement, capturing intermediate improvements and insights as memory signals.
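The self-contrast step of parallel scaling can be sketched as a single prompt over the k rollouts (the wording is an illustrative assumption, not the paper's prompt):

```python
def self_contrast_prompt(trajectories: list[str]) -> str:
    """Ask the extractor LLM to compare parallel rollouts of the same task."""
    numbered = "\n\n".join(f"Trajectory {i + 1}:\n{t}"
                           for i, t in enumerate(trajectories))
    return ("Compare these rollouts of the same task. Identify what the "
            "successful attempts did that the failed ones did not, and distill "
            "it into a memory item (title, description, content).\n\n" + numbered)
```

Because all k rollouts answer the same query, agreements and disagreements between them act as the contrastive signal that single-trajectory extraction lacks.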
The result is a positive feedback loop: better memory guides the agent toward more promising rollouts, and richer rollouts create even stronger memories. The paper notes that at k=5, parallel scaling (55.1% SR) outperforms sequential scaling (54.5% SR) on WebArena-shopping – the sequential advantage quickly saturates once the model reaches decisive success or failure, while parallel scaling continues to provide diverse rollouts that the agent can contrast and learn from.

Results across three benchmarks
Tested on WebArena (a web navigation benchmark spanning Shopping, Admin, GitLab, and Reddit tasks), Mind2Web (which tests generalization across cross-task, cross-website, and cross-domain settings), and SWE-Bench-Verified (a repository-level software engineering benchmark with 500 verified instances), ReasoningBank consistently outperformed all baselines across all three datasets and all tested backbone models.
On WebArena with Gemini-2.5-Flash, ReasoningBank improves overall success rate by 8.3 percentage points over the memory-free baseline (40.5% → 48.8%), while reducing average interaction steps by 1.4 compared to no-memory and by 1.6 compared to other memory baselines. Efficiency gains are largest on successful trajectories – on the Shopping subset, for example, ReasoningBank cut 2.1 steps from successful task completions (a 26.9% relative reduction). The agent reaches solutions faster because it knows the right path, not just because it abandons unsuccessful attempts quickly.
On Mind2Web, ReasoningBank provides consistent advantages across the cross-task, cross-website, and cross-domain evaluation splits, with the most pronounced improvements in the cross-domain setting – where the highest level of strategy transfer is required, and where competing methods like AWM actually underperform relative to a no-memory baseline.
On SWE-Bench-Verified, results vary significantly by backbone model. With Gemini-2.5-Pro, ReasoningBank achieves a 57.4% resolution rate and a savings of 1.3 steps per task, compared to 54.0% for the no-memory baseline. With Gemini-2.5-Flash, the step savings is more dramatic – 2.8 fewer steps per task (30.3 → 27.5) – while the resolution rate improved from 34.2% to 38.8%.
Adding MaTTS (parallel scaling, k=5) further improves the results. With MaTTS, ReasoningBank reaches an overall 56.3% SR on WebArena with Gemini-2.5-Pro, compared to 46.7% for the no-memory baseline; average steps per task also drop from 8.8 to 7.1.
Emergent strategy evolution
One of the most surprising findings is that ReasoningBank’s memory does not remain static – it evolves. In one documented case study, the agent’s initial memory items for the “user-specific information navigation” strategy resembled simple procedural checklists: “Actively search for and click ‘Next Page’, ‘Page X’ or ‘Load More’ links.” As the agent accumulated experience, the same memory items matured into adaptive self-reflections, then into systematic pre-task checking, and ultimately into creative strategies such as “regularly cross-reference the current page with task requirements; if current data does not align with expectations, reevaluate available options such as search filters and alternative sections.” The research team describes this as emergent behavior resembling the learning dynamics of reinforcement learning – occurring entirely at test time, without any model weight updates.
key takeaways
- Failure is ultimately a learning signal: Unlike existing agent memory systems (Synapse, AWM) that learn only from successful trajectories, ReasoningBank develops generalizable reasoning strategies from both successes and failures – turning mistakes into preventive guardrails for future actions.
- Memory items are structured, not raw: ReasoningBank does not store noisy action logs. It compresses experience into clean three-part memory items (title, description, content) that are human-interpretable and can be injected directly into the agent’s system prompt via embedding-based similarity search.
- Quality trumps quantity in retrieval: Optimal retrieval is k=1, a single memory item per task. Retrieving more memories steadily degrades performance (49.7% SR at k=1 drops to 44.4% at k=4), making the relevance of the retrieved memory more important than the quantity.
- Memory and test-time scaling create a virtuous cycle. MaTTS (Memory-Aware Test-Time Scaling) uses diverse exploration trajectories as contrasting signals to create stronger memories, which in turn guide better exploration – a feedback loop that increases WebArena’s success rate to 56.3% with Gemini-2.5-Pro, up from 46.7% without memory.
Check out the paper, repo, and technical details.