On Evals
NYC, March 2026
There is a particular kind of confidence that comes with deploying an AI agent into production without ever measuring what it actually does. It is the same confidence you have when you drive at night with the headlights off: everything feels fine until it is not, and by then you are already in the ditch.
This piece is a primer on evals. Not a tutorial, not a product walkthrough, but a way of thinking about what it means to actually know whether your system is working. If you are building with large language models in any serious capacity, and you are not running evals, you are flying blind. The rest is a matter of degree.
Anthropic published a wonderful piece on demystifying evals for AI agents that I would recommend as a companion read. What follows is a more personal, opinionated survey of the landscape, colored by the experience of deploying agents to Fortune 500 companies in healthcare.
Vibe coding is forgivable. Vibe evals are not.
There is a growing culture of what people call "vibe coding," where you prompt an LLM, look at the output, say "that seems right," and move on. For prototyping or side projects, this is fine. Maybe even fun. But there is a less visible cousin of vibe coding that is far more dangerous, and I would call it vibe engineering: changing a prompt, swapping a model, adjusting your system instructions based on gut feeling rather than measured results, and then eyeballing a few examples to see if things "seem better." It is the AI equivalent of "it looks right to me."
The problem with vibe engineering is not that your changes are bad. The problem is that you have no idea whether they are good. You changed the system prompt and three examples look better, but did five others break? You would not know, because you did not check. And the next time you make a change, you start from the same place of not knowing. Every iteration begins from zero. You are not building on previous work; you are gambling with compute, and the house always wins eventually.
The fix is almost embarrassingly simple: keep track. Write down what you tried. Measure what happened. Compare it to what happened before. This is not novel engineering wisdom; it is the scientific method applied to prompts. But the number of teams I have seen skip this step because it felt like overhead is genuinely alarming.
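In code, "keep track" can be as small as an append-only log of eval runs. A minimal sketch, with illustrative field names; nothing here is a specific tool's API:

```python
import json
import time

def log_run(path, version, prompt_hash, metrics):
    """Append one eval run (config + results) to a JSONL log."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "version": version,          # e.g. "v3"
        "prompt_hash": prompt_hash,  # identifies exactly what was run
        "metrics": metrics,          # e.g. {"accuracy": 0.82}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    """Read the full history back, one record per run."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

With this in place, "did the change help?" becomes a lookup instead of a memory exercise.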
Two dimensions, not one
Most people think of evals as a single number: accuracy. Did the agent get the right answer or not? But accuracy alone is only half the picture, and sometimes the less interesting half.
Consider an agent that evaluates whether medical cases meet clinical criteria, and it gets the right answer 80% of the time. That sounds reasonable until you realize it gets different cases right each time. On Monday it correctly marks the right patients as "met" and flags the rest as "not met." On Tuesday, same patients, different answers. The accuracy number is the same, but the agent is a coin flip for any individual case. You cannot trust it, even when it happens to be right, because you have no guarantee it will be right again.
This is why evals need two dimensions: accuracy (is the agent correct?) and stability (is the agent reliable?). By accuracy I mean classification accuracy: how often the agent makes the right decision. The standard terms apply. A true positive (TP) is when the agent said "met" and the answer is indeed "met"; a true negative (TN) is when it said "not met" and the answer is indeed "not met." A false positive (FP) is a false alarm: the agent said "met" but the answer was "not met." A false negative (FN) is a missed case: the agent said "not met" but the answer was "met." Precision asks: when the agent says "met," how often is it right? TP / (TP + FP). Recall asks: of all actual "met" cases, how many did the agent catch? TP / (TP + FN). F1 is the harmonic mean of precision and recall, balancing both, and accuracy is overall correctness: (TP + TN) / total. Stability asks a different question: given the exact same input N times, does the agent produce the same output every time? An unstable agent cannot be trusted even when it happens to be right. The two are orthogonal. An agent can be stably wrong, which is actually a good sign because it means the context is steering it confidently in the wrong direction, and you know exactly what to fix. Or it can be unstably right, which is more dangerous, because you cannot ship a coin flip to production.
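Both dimensions fall out of the same data: run each case N times against its label, then measure how often the answers agree and whether the majority answer is right. A sketch, with hypothetical input shapes:

```python
from collections import Counter

def evaluate(runs, labels):
    """
    runs:   {case_id: [output, output, ...]}  # same input, N repeated runs
    labels: {case_id: ground_truth}           # e.g. "met" / "not met"
    Returns overall accuracy of the majority answer, plus per-case stability.
    """
    report = {}
    correct = 0
    for case_id, outputs in runs.items():
        counts = Counter(outputs)
        answer, freq = counts.most_common(1)[0]
        stability = freq / len(outputs)      # 1.0 = same answer every run
        is_correct = answer == labels[case_id]
        correct += is_correct
        report[case_id] = {
            "answer": answer,
            "stability": stability,
            "correct": is_correct,
        }
    accuracy = correct / len(runs)
    return accuracy, report
```

An agent that answers "met" three times out of five gets a stability of 0.6: a coin flip with extra steps, regardless of whether the majority answer happens to match the label.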
The example above is a binary classification problem: met or not met. This is worth pausing on. Most of the highest-ROI enterprise use cases I have encountered are decision use cases, not generation use cases. The value is not in drafting an email or producing a report; it is in making a judgment call at scale, consistently and correctly. Expense approvals, compliance checks, claim adjudications, credit decisions. These are all classification problems, and classification problems are where evals matter most. (More on this in an upcoming piece, "On AI-Adoption by Enterprise," about the use cases I have seen in AI for enterprises.)
The same framework extends to multiclass classification, where the agent must choose between more than two outcomes. Instead of met or not met, think of a triage system that routes tickets to one of several queues, or a risk model that classifies exposure as low, medium, high, or critical. The stability question becomes richer: is the agent stable across all categories, or does it waver between two of them?
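For the multiclass case, a small helper can surface exactly which categories an unstable agent wavers between. A sketch, assuming you have the N repeated outputs for one case:

```python
from collections import Counter

def waver_pair(outputs):
    """
    outputs: the N repeated answers for one case, e.g. ["low", "medium", ...].
    Returns None if the agent is perfectly stable, otherwise the two
    categories it most often oscillates between.
    """
    counts = Counter(outputs)
    if len(counts) == 1:
        return None  # same answer every run
    top_two = counts.most_common(2)
    return tuple(label for label, _ in top_two)
```

An agent that flips between "medium" and "high" but never touches "low" or "critical" points to an ambiguity at one specific boundary, which narrows where to look in the context.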
The quadrant this creates is the single most useful mental model I have found for eval-driven engineering. An entry that is stable and correct: ship it. Stable but incorrect: fix the context, because the agent is confidently wrong about something specific and you know exactly where to look. Unstable, whether it happens to be correct sometimes or not: the context is insufficient. It does not matter if the agent lands on the right answer half the time, because a coin flip is not a system you can trust. The fix is always the same: clarify and disambiguate the context until the agent stabilizes. Ideally it goes straight from unstable to stably correct, but it is also fine if it passes through stably incorrect on the way there. Stable and wrong is a solvable problem. Unstable is not, until you make it stable first.
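The quadrant reads directly off per-case numbers. A sketch, where the threshold for "stable" (strict agreement across all runs) is an assumption you may want to relax:

```python
def triage(stability, correct, stable_threshold=1.0):
    """Map one eval entry to the quadrant's recommended action.

    stability: fraction of repeated runs agreeing with the majority answer.
    correct:   whether the majority answer matches the ground truth.
    """
    stable = stability >= stable_threshold
    if stable and correct:
        return "ship"                    # stable + correct
    if stable:
        return "fix the context"         # confidently wrong: you know where to look
    return "clarify and disambiguate"    # unstable: stabilize before anything else
```

Note that the unstable branch ignores `correct` entirely, which is the point: a coin flip that sometimes lands right is still a coin flip.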
Watching it evolve
A single eval run is a snapshot. The real power comes from watching how your system evolves across versions. When you change a prompt and re-run your eval suite, the interesting question is not "what is the new accuracy?" but rather "what moved?" Which entries that were correct before are now wrong? Which errors did you fix? Did you introduce new failure modes while fixing old ones?
This is what regression tracking gives you. A regression is an eval entry that was previously correct and became incorrect after a change. Every entry in your eval set has a history: it was a true positive in version one, stayed a true positive in version two, then regressed to a false negative in version three, meaning the agent lost the ability to detect a case it previously caught. That single trajectory tells you more than any aggregate metric. It tells you that something in your version three prompt change broke a specific capability that versions one and two had.
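Mechanically, regression tracking is a diff between two versions' outcomes. A sketch, assuming each entry's outcome is recorded as TP, TN, FP, or FN per version:

```python
def regressions(prev, curr):
    """
    prev, curr: {case_id: outcome}, outcome in {"TP", "TN", "FP", "FN"},
    for two consecutive versions over the same eval set.
    Returns the entries that moved from correct to incorrect.
    """
    CORRECT = {"TP", "TN"}
    return {
        case_id: (prev[case_id], curr[case_id])
        for case_id in prev
        if prev[case_id] in CORRECT and curr[case_id] not in CORRECT
    }
```

Running this after every change turns "did anything break?" from a feeling into a list of case IDs you can go read.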
A confusion matrix is a table that shows how an agent's predictions compare to the actual answers: rows are ground truth, columns are agent output, the diagonal is correct, and everything else is a mistake. But it is not a static table. It is a living thing that shifts with every change you make. The drawing at the top of this piece is how I think about it: stacked layers receding into depth, one per version, where you can see the bars rise and fall across iterations. A regression is a bar that was tall and green becoming short and red. An improvement is the reverse. The art of eval-driven development is making the green bars grow while keeping the red ones from appearing.
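Building one layer of that stack is a few lines. A sketch, assuming labeled ground truth and one prediction per case; save one matrix per version and the "living" view is just the sequence:

```python
def confusion_matrix(truth, preds, labels):
    """
    truth:  {case_id: ground_truth}
    preds:  {case_id: agent_output}
    labels: ordered list of possible classes, e.g. ["met", "not met"]
    Returns nested dict: rows are ground truth, columns are agent output.
    """
    matrix = {t: {p: 0 for p in labels} for t in labels}
    for case_id, actual in truth.items():
        matrix[actual][preds[case_id]] += 1
    return matrix
```

The diagonal entries (`matrix[x][x]`) are the green bars; everything off-diagonal is red.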
On LLM-as-judge
There is a popular idea in the AI engineering community that you can use one language model to evaluate another. The appeal is obvious: if the bottleneck in your eval pipeline is human labeling, then automating the judge removes the bottleneck entirely. You get scale without headcount.
In my experience deploying agents to Fortune 500 companies in healthcare, specifically in prior authorization workflows, this does not work. Not because LLM-as-judge is conceptually wrong, but because the domains where you most need rigorous evals are precisely the domains where a general-purpose model lacks the judgment to serve as arbiter.
Prior authorization is a good example because it looks deceptively structured. There are clinical guidelines, there are codes, there are decision trees. You would think a model could learn the rules and apply them. But the reality on the ground is that the rules are necessary but not sufficient. The actual decision-making is layered with institutional knowledge, payer-specific quirks, and years of accumulated judgment that never made it into any standard operating procedure.
On Expert-as-judge
Susan's thirty years
Let me make this concrete. Suppose you have a subject matter expert named Susan who has been processing prior authorizations for thirty years at a specific health plan. You show Susan a case and she says "not met." You ask why, and she says "because we never mark this combination as met for patients under 40 without a step therapy failure documented in the last six months." You check the SOP, the standard operating procedure: the official written rules for how the task should be performed. The SOP says nothing about age thresholds or step therapy timelines for this particular combination. Susan knows because Susan has been doing this longer than the SOP has existed.
An LLM-as-judge would read the SOP, see that the case arguably meets the documented criteria, and mark the agent's "met" as correct. Susan would mark it as wrong. And Susan is right. Not because the SOP is wrong, but because the SOP is incomplete. It captures the minimum; Susan carries the rest.
This is not an edge case. This is the norm in any sufficiently specialized domain. The gap between what is written and what is practiced is enormous, and it is exactly this gap that makes human-labeled golden sets, curated datasets of inputs paired with outputs verified by domain experts, irreplaceable. A golden set is only as good as the human who built it. You need a human to build the ground truth, the verified correct answer your agent is measured against, and not just any human. You need the human who lives and breathes your specific use case, at your specific organization, with your specific patient population or customer base or regulatory environment.
A note on institutional knowledge
If you are an executive at an enterprise that is adopting AI agents, I want to leave you with one thought. The people who have been doing the work for decades, your subject matter experts, your Susans, are not overhead to be optimized away. They are the most valuable training signal you have. They carry the ground truth that no document fully captures and no general-purpose model can replicate.
The golden set that gets your agent to production-grade quality will be built by these people. The edge cases that separate a demo from a deployment will be caught by these people. The subtle regressions that a metrics dashboard cannot surface will be noticed by these people. Invest in them, not as a concession, but as a strategy. The organizations that will ship the most reliable AI systems are the ones that understand this: your agents are only as good as the humans who teach them what "right" looks like.
Evals are not glamorous work. They do not demo well. Nobody tweets about their confusion matrix improving by three percentage points. But they are the difference between an AI system that works and one that merely appears to. And in domains where the stakes are real, where a wrong answer means a patient does not get treatment or a claim is incorrectly denied, "appears to work" is not good enough.
Keep track. Measure both accuracy and stability. Build your ground truth with the humans who know best. And watch how your system evolves, version by version, because the trajectory matters more than any single snapshot.
I am actively exploring this space. If you are working on evals, context engineering (designing and refining what an LLM sees: system prompts, retrieved documents, examples), or deploying agents to production, I would love to hear from you.