
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Arize learned the hard way that context, not prompts, is what breaks agents — Sally-Ann Delucia says Alex repeatedly failed because it was analyzing huge trace-and-span logs from Arize’s observability platform and kept running into context limits, creating a “vicious loop” where more debugging data just caused more failure.
Naive fixes both failed: truncation made Alex forget, and summarization was too unreliable — keeping only the first 100 characters worked for simple cases but made follow-up questions feel like brand-new chats, while LLM summarization had no dependable way to preserve what was actually important.
Their working solution is a hierarchical memory pattern: keep the head and tail, store the middle — Alex now keeps the first 100 characters, the last 100, preserves the system prompt, stores truncated middle content in memory, and lets the agent retrieve prior tool calls or messages when needed.
Long conversations exposed bugs late, so Arize built long-session evals — instead of waiting for a user complaint, they now load 10 turns and test the 11th to see whether context handling is degrading over time as chats stretch past 20 turns.
Sub-agents were the real unlock for heavy tasks — rather than stuffing chat history, search queries, intermediate reasoning, and large data payloads into one main agent, Alex now delegates data-heavy search work to sub-agents and only passes back results.
The big unsolved problems are long-term memory and principled context selection — Delucia says Alex still relies on a heuristic like “first 100, last 100,” still hits provider limits on huge prompts, and is actively working on memory that persists across chats instead of only within a single session.
Sally-Ann Delucia, Arize’s head of product and a core contributor to Alex, frames the talk around a very builder-specific pain: they built Alex, an AI harness with 40-plus skills, while using Alex on their own product. That created a brutal recursive problem — the agent was analyzing trace and span data generated by the same kinds of agent workflows that kept overflowing its context window.
She points to Andrej Karpathy’s “+1 for context engineering over prompt engineering” as the shift the field finally made last year. Her core framing is crisp: context engineering is not about cramming under a token limit, it’s about strategically choosing what the model sees, because the wrong context means bad answers and bad UX.
Alex sat on top of Arize’s observability stack, so even a single trace included user input, prompts, metadata, and interaction history — and then users wanted to analyze patterns across many traces. The result was a loop where bigger spans caused context overflow, Alex failed, retried with even more data, and failed again; the system meant to understand the data was trapped by the data.
Their first move was almost comically simple: keep the first 100 characters and drop the rest. It worked just enough to be tempting, then fell apart because Alex couldn’t track follow-up questions — ask about “input B” one turn later, and it no longer knew what “B” referred to.
Summarization felt like the obvious LLM-native solution, but Delucia says it was too inconsistent to trust. The problem wasn’t whether the model could compress text; it was that Arize had no control over what got preserved versus what got thrown away, so important details disappeared unpredictably.
The strategy Alex uses today is more surgical: keep the head, keep the tail, preserve the system prompt, dedupe long tool calls by keeping the latest result, and store the middle in memory for retrieval. Delucia’s clean distinction is the memorable one: context decides what the model sees; memory decides what survives.
A big surprise was that users didn’t restart chats — they kept one thread going while moving across the Arize app, which pushed conversations from under 10 turns to 20-plus. That meant failures didn’t show up immediately; they surfaced late, so the team started running long-session evals by loading 10 turns and testing the 11th to make those bugs measurable before a customer reported them.
The next realization was that not all context belongs in one agent, especially for search over traces with hundreds of spans, multiple queries, and lots of intermediate reasoning. Arize split the architecture so the main agent keeps only light chat context while sub-agents handle heavy data operations, then return results — a move Delucia calls a game changer and one they’ve now expanded broadly.
She’s candid that huge prompts still hit provider limits, long-term memory across chats doesn’t really exist in Alex yet, and context selection is still a heuristic rather than a principled budget. Even after reading the leaked Claude Code context logic and seeing a similar truncation/compression approach, her closing point is blunt: agents don’t fail because of prompts; they fail because of context.
Share
Keep Reading
The Weekly Echo. The inbox-shaped summary of what mattered.
New editorials announced here.

Playbook
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.

Playbook
Learn how tasteful prompting helps you move beyond generic AI output by shaping context, style, and judgment from the start.

Playbook
OpenAI shipped /goal for the Codex CLI. It turns a prompt into a persisted, self-continuing contract.