AI Engineer · June 7, 2026 · 16m

LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize

TL;DR

Observability is the audit trail for agents: Dat argues that code no longer tells you what an agent did; telemetry does. That is why Arize is built around OpenTelemetry traces, spans, sessions, and distribution views of agent paths.
A good fix can create two or three new failures: In non-deterministic systems, changing a prompt, model, or orchestration may solve one issue while causing regressions elsewhere, so you need evals tied to actual system behavior.
There are five flavors of signal, not just LLM-as-a-judge: Dat highlights human feedback, golden datasets, deterministic checks like JSON schema validation, model-based judges, and business metrics such as saving money, making money, or saving time.
Evals work at multiple scopes: He distinguishes span evals for one input-output step, multi-span evals across components, trajectory evals for the full path an agent took, and session evals for the whole conversation state.
Arize thinks the entire optimization loop should be automated: Dat says users do not want to live in dashboards, so Arize exposes everything through CLI tools and an AI assistant called Alex that can inspect traces, spot latency or errors, and propose what to evaluate next.
Arize splits its products by audience: Phoenix is the open source, single-container option for engineering teams, while Arize AX is aimed at large enterprises like Uber, Booking, and Reddit.
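
To make the difference between a deterministic check and an eval scope concrete, here is a minimal sketch in plain Python. It does not use Arize's or Phoenix's APIs: the Span class, both function names, and the sample trace are hypothetical, chosen only to illustrate a span-level check (one step's output) next to a trajectory-level check (the order of steps the agent took).

```python
import json
from dataclasses import dataclass


@dataclass
class Span:
    """One step in an agent run (hypothetical shape, not Arize's schema)."""
    name: str
    output: str


def span_eval_json_keys(span: Span, required_keys: set[str]) -> bool:
    """Deterministic span eval: output must be a JSON object with the required keys."""
    try:
        parsed = json.loads(span.output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def trajectory_eval_order(trace: list[Span], expected_steps: list[str]) -> bool:
    """Trajectory eval: expected steps must appear in this order along the path."""
    names = iter(span.name for span in trace)
    # `step in names` advances the iterator, so order is enforced.
    return all(step in names for step in expected_steps)


# Illustrative trace: three spans from a single made-up agent run.
trace = [
    Span("plan", '{"tool": "search", "query": "refund policy"}'),
    Span("search", '{"results": []}'),
    Span("answer", "Sorry, I could not find that."),
]

plan_is_valid = span_eval_json_keys(trace[0], {"tool", "query"})
path_is_valid = trajectory_eval_order(trace, ["plan", "search", "answer"])
```

The span check only sees one output, so it cannot tell whether the agent skipped a step; the trajectory check only sees the path, so it cannot tell whether any single output was well formed. That gap is one reason to run evals at more than one scope.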

The Breakdown

Dat Ngo says the real problem in enterprise AI is not building agents but seeing what they did, deciding what counts as good, and catching the regressions your "fix" quietly introduced elsewhere. He lays out Arize's stack for observability, evals, and experimentation, then makes the bigger claim that the whole loop should eventually run itself.