AI Engineer | May 10, 2026 | 16m

Hierarchical Memory: Context Management in Agents — Sally-Ann Delucia

TL;DR

Arize learned the hard way that context, not prompts, is what breaks agents. Sally-Ann Delucia says Alex repeatedly failed because it was analyzing huge trace-and-span logs from Arize’s observability platform and kept running into context limits, creating a “vicious loop” where more debugging data just caused more failure.
Both naive fixes failed: truncation made Alex forget, and summarization was too unreliable. Keeping only the first 100 characters worked for simple cases but made follow-up questions feel like brand-new chats, while LLM summarization had no dependable way to preserve what was actually important.
Their working solution is a hierarchical memory pattern: keep the head and tail, store the middle. Alex now keeps the first 100 characters and the last 100, preserves the system prompt, stores the truncated middle content in memory, and lets the agent retrieve prior tool calls or messages when needed.
Long conversations exposed bugs late, so Arize built long-session evals. Instead of waiting for a user complaint, the team now loads 10 turns and tests the 11th to see whether context handling degrades as chats stretch past 20 turns.
Sub-agents were the real unlock for heavy tasks. Rather than stuffing chat history, search queries, intermediate reasoning, and large data payloads into one main agent, Alex now delegates data-heavy search work to sub-agents, which pass back only their results.
The big unsolved problems are long-term memory and principled context selection. Delucia says Alex still relies on a heuristic like “first 100, last 100” and still hits provider limits on huge prompts, and that the team is actively working on memory that persists across chats instead of only within a single session.

Summary

Building Alex to Debug Alex

Sally-Ann Delucia, Arize’s head of product and a core contributor to Alex, frames the talk around a pain specific to builders: Arize built Alex, an AI harness with 40-plus skills, while using Alex on its own product. That created a brutal recursive problem: the agent was analyzing trace and span data generated by the same kinds of agent workflows that kept overflowing its context window.

Why Context Engineering Replaced Prompt Engineering

She points to Andrej Karpathy’s “+1 for context engineering over prompt engineering” as a marker of the shift the field finally made last year. Her core framing is crisp: context engineering is not about cramming under a token limit; it’s about strategically choosing what the model sees, because the wrong context means bad answers and bad UX.

The Vicious Loop: More Data, More Failure

Alex sat on top of Arize’s observability stack, so even a single trace included user input, prompts, metadata, and interaction history — and then users wanted to analyze patterns across many traces. The result was a loop where bigger spans caused context overflow, Alex failed, retried with even more data, and failed again; the system meant to understand the data was trapped by the data.

First Try: Truncate It and Hope

Their first move was almost comically simple: keep the first 100 characters and drop the rest. It worked just enough to be tempting, then fell apart because Alex couldn’t track follow-up questions — ask about “input B” one turn later, and it no longer knew what “B” referred to.
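To make the failure mode concrete, here is a minimal sketch of head-only truncation. The 100-character limit comes from the talk; the function name, the sample tool output, and its layout are assumptions made up for illustration, not Arize's code.

```python
def truncate_naive(text: str, limit: int = 100) -> str:
    """Keep only the first `limit` characters and drop everything after."""
    return text[:limit]


# Hypothetical tool output: metadata up front, the interesting inputs at the end.
tool_output = "trace_id=abc123 " + "metadata " * 12 + "input A: ok; input B: timeout"

context = truncate_naive(tool_output)

# The follow-up question about "input B" has nothing left to refer to.
print("input B" in tool_output)
print("input B" in context)
```

Whatever sits past the cutoff is gone for good, so any later turn that refers back to it starts from nothing.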

Second Try: Let the Model Summarize

Summarization felt like the obvious LLM-native solution, but Delucia says it was too inconsistent to trust. The problem wasn’t whether the model could compress text; it was that Arize had no control over what got preserved versus what got thrown away, so important details disappeared unpredictably.

What Actually Worked: Smart Truncation Plus Memory

The strategy Alex uses today is more surgical: keep the head, keep the tail, preserve the system prompt, dedupe long tool calls by keeping the latest result, and store the middle in memory for retrieval. Delucia’s clean distinction is the memorable one: context decides what the model sees; memory decides what survives.
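The pieces described above could be sketched roughly as follows. The 100-character head and tail are from the talk; everything else (the `MemoryStore` class, the marker text, the function names, and reading "dedupe" as "keep the latest result per tool-call key") is a hypothetical illustration rather than Alex's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

HEAD = 100  # characters kept from the start (figure from the talk)
TAIL = 100  # characters kept from the end (figure from the talk)


@dataclass
class MemoryStore:
    """Hypothetical in-process store for the truncated middle of messages."""
    _items: dict = field(default_factory=dict)

    def put(self, key: str, text: str) -> None:
        self._items[key] = text

    def get(self, key: str) -> Optional[str]:
        # Stand-in for the retrieval tool the agent would call on demand.
        return self._items.get(key)


def compact(message_id: str, content: str, memory: MemoryStore) -> str:
    """Return head + marker + tail and stash the middle under message_id."""
    if len(content) <= HEAD + TAIL:
        return content  # short enough to pass through untouched
    middle = content[HEAD:-TAIL]
    memory.put(message_id, middle)
    marker = f"[... {len(middle)} chars stored as {message_id} ...]"
    return content[:HEAD] + marker + content[-TAIL:]


def dedupe_tool_results(results: list) -> list:
    """Keep only the latest (key, result) pair for each tool-call key."""
    latest = {key: i for i, (key, _) in enumerate(results)}
    return [item for i, item in enumerate(results) if latest[item[0]] == i]


def build_context(system_prompt: str, messages: list, memory: MemoryStore) -> list:
    """Preserve the system prompt verbatim; compact every other message."""
    return [system_prompt] + [compact(mid, text, memory) for mid, text in messages]
```

The marker left in place of the middle gives the agent a key to ask for, which is what separates this from plain truncation: the context shrinks, but the content survives in memory.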

Long Chats Broke Things Quietly

A big surprise was that users didn’t restart chats — they kept one thread going while moving across the Arize app, which pushed conversations from under 10 turns to 20-plus. That meant failures didn’t show up immediately; they surfaced late, so the team started running long-session evals by loading 10 turns and testing the 11th to make those bugs measurable before a customer reported them.
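A long-session eval of the kind described here might look like the sketch below. The "load 10 turns, test the 11th" shape is from the talk; the `Agent` signature, the substring check, and the stub agent are assumptions chosen to keep the example self-contained.

```python
from typing import Callable

Turn = tuple[str, str]                    # (user message, assistant reply)
Agent = Callable[[list[Turn], str], str]  # (history, new message) -> reply


def long_session_eval(agent: Agent, history: list[Turn], probe: str,
                      must_contain: str, warmup_turns: int = 10) -> bool:
    """Preload `warmup_turns` of history, send the next turn, check the reply."""
    if len(history) < warmup_turns:
        raise ValueError("not enough recorded turns to preload")
    reply = agent(history[:warmup_turns], probe)
    return must_contain.lower() in reply.lower()


def make_stub_agent(window: int) -> Agent:
    """Stand-in agent that can only see the last `window` turns of history."""
    def agent(history: list[Turn], message: str) -> str:
        visible = history[-window:]
        if any("input B" in user or "input B" in reply for user, reply in visible):
            return "B is the checkout span."
        return "I don't know what B refers to."
    return agent


# Turn 1 introduces "input B"; turns 2-10 are filler that pushes it out of view.
history = [("Compare input A and input B", "input B is the checkout span")]
history += [(f"filler question {i}", f"filler answer {i}") for i in range(9)]

print(long_session_eval(make_stub_agent(2), history, "What was wrong with B?", "checkout"))
print(long_session_eval(make_stub_agent(10), history, "What was wrong with B?", "checkout"))
```

The stub with a two-turn window fails the probe while the ten-turn one passes, which is the kind of late-surfacing regression this eval is meant to catch before a customer does.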

Sub-Agents Became the Escape Hatch

The next realization was that not all context belongs in one agent, especially for search over traces with hundreds of spans, multiple queries, and lots of intermediate reasoning. Arize split the architecture so the main agent keeps only light chat context while sub-agents handle heavy data operations, then return results — a move Delucia calls a game changer and one they’ve now expanded broadly.
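The split can be illustrated with a small sketch in which the sub-agent owns the heavy payload and the main agent only ever records a short result. The class names, the span dictionary shape, and the keyword search are all hypothetical; the talk does not describe Alex's interfaces at this level.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """Hypothetical worker that owns the heavy data for one task."""
    scratch: list = field(default_factory=list)

    def search(self, spans: list, keyword: str) -> str:
        for span in spans:
            self.scratch.append(repr(span))  # heavy intermediate data stays local
        hits = [s["name"] for s in spans if keyword in s["output"]]
        return f"{len(hits)} span(s) matched '{keyword}': {', '.join(hits)}"


@dataclass
class MainAgent:
    """Keeps only light chat context: user messages and short tool results."""
    history: list = field(default_factory=list)

    def handle(self, user_message: str, spans: list, keyword: str) -> str:
        self.history.append(f"user: {user_message}")
        result = SubAgent().search(spans, keyword)  # fresh sub-agent per task
        self.history.append(f"tool: {result}")
        return result


# Hypothetical trace: 300 spans with about 1 KB of output each.
spans = [
    {"name": f"span-{i}", "output": ("timeout " if i % 100 == 0 else "ok ") + "x" * 1000}
    for i in range(300)
]
main = MainAgent()
print(main.handle("find the timeouts", spans, "timeout"))
```

The sub-agent's scratch space is discarded once it returns, so the roughly 300 KB of span output never reaches the main agent's history.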

What’s Still Unsolved

She’s candid that huge prompts still hit provider limits, long-term memory across chats doesn’t really exist in Alex yet, and context selection is still a heuristic rather than a principled budget. Even after she read the leaked Claude Code context logic and found a similar truncation/compression approach, her closing point is blunt: agents don’t fail because of prompts; they fail because of context.
