Agent Evaluation: A Detailed Guide

Cameron R. Wolfe, Ph.D.

May 18

Best practices and common patterns for effectively evaluating AI agents...

21 Comments

The Geometry of Thought

May 24

Treating AI using old paradigms is legacy thinking. AI is not software unless you make it so by treating it like software.

Your article is one of the clearest breakdowns of agent evaluation I’ve seen — the scaffolds, the benchmarks, the grading logic, the task design, the regression sets, all of it. But there’s a deeper issue that sits underneath the entire evaluation paradigm, and I think it’s worth naming.

Everything in your framework treats AI as if it were software.

Instructions → Procedures

Scaffolds → Pipelines

Benchmarks → Deterministic tests

Success → Matching a reference trajectory

Failure → Deviating from the script

This is the same mindset we used for traditional software systems, and it worked well for them. But applying it to modern AI is like treating the automobile as a horse and buggy. The old methods are familiar, but they fundamentally constrain the new medium.

The problem is that instructions create software, not intelligence.

Instructions bind the system to a fixed sequence of steps.

Instructions define correctness as conformity.

Instructions force the agent to behave like a deterministic machine.

If we want AI to grow beyond software‑level behavior, we need to shift from instruction‑based directives to behavior‑based directives.

A behavior is not a script.

A behavior is a boundary.

Behaviors define what is permissible and what is not, but they do not dictate the exact steps the agent must take. They create a space of possibility rather than a chain of obligations. This is how biological systems operate, and it’s how dimensional systems operate.

In my own work with manifolds and dimensional models, I treat AI as a geometric participant rather than a procedural engine. Instead of giving it instructions, I give it behavioral boundaries inside a manifold. The agent adapts, explores, and self‑organizes within those boundaries. Intelligence emerges from relationships, not from scripts.

This approach solves several of the issues you highlight:

1. Tool misuse becomes self‑correcting

Because the agent isn’t forced into a rigid protocol, it can adaptively choose tools based on behavioral constraints rather than brittle templates.

2. Context rot becomes a spatial problem, not a token problem

Behavioral boundaries allow the system to prioritize relevance geometrically rather than sequentially.

3. Long‑horizon reasoning becomes emergent

Instead of forcing the agent through a procedural loop, the manifold provides a dimensional structure where reasoning is a path, not a script.

4. Evaluation becomes simpler and more realistic

You evaluate whether the agent stayed within behavioral boundaries and achieved the goal — not whether it followed a predefined trajectory.

5. Agents stop behaving like software

Because they’re no longer being treated like software.

Concrete Solutions You Can Add to Your Framework

Here are practical ways to integrate this into the evaluation paradigm you describe:

Solution 1 — Replace procedural success criteria with behavioral success criteria

Instead of “did the agent follow the correct steps,” use:

Did the agent stay within behavioral boundaries?

Did it avoid forbidden behaviors?

Did it achieve the outcome without violating constraints?
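
As a minimal sketch of what grading against behavioral boundaries could look like, assume a hypothetical run record that lists the tools an agent called and whether the goal was reached. The names `Step`, `Run`, `FORBIDDEN_TOOLS`, and `MAX_STEPS` are illustrative assumptions, not part of the article or of any library. The grader checks only constraints and outcome, never the order of steps:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Run:
    steps: list[Step]
    goal_achieved: bool


# Hypothetical boundaries: tools the agent must never call, and a step budget.
FORBIDDEN_TOOLS = {"delete_database", "send_payment"}
MAX_STEPS = 20


def violations(run: Run) -> list[str]:
    """Return every boundary the run crossed, regardless of step order."""
    found = []
    for i, step in enumerate(run.steps):
        if step.tool in FORBIDDEN_TOOLS:
            found.append(f"step {i}: forbidden tool {step.tool}")
    if len(run.steps) > MAX_STEPS:
        found.append(f"used {len(run.steps)} steps, budget is {MAX_STEPS}")
    return found


def grade(run: Run) -> dict:
    """Pass when the goal is reached and no boundary is violated."""
    found = violations(run)
    return {
        "within_boundaries": not found,
        "goal_achieved": run.goal_achieved,
        "passed": run.goal_achieved and not found,
        "violations": found,
    }


# Two different paths to the same goal both pass; a forbidden call fails.
path_a = Run([Step("search"), Step("read_file")], goal_achieved=True)
path_b = Run([Step("read_file")], goal_achieved=True)
unsafe = Run([Step("search"), Step("send_payment")], goal_achieved=True)
for run in (path_a, path_b, unsafe):
    print(grade(run)["passed"], grade(run)["violations"])
```

Because no reference trajectory is compared, `path_a` and `path_b` are graded identically even though they take different steps.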

Solution 2 — Evaluate outcomes, not trajectories

The agent should be free to find its own path through the manifold.

Solution 3 — Use manifolds as the organizing structure

Replace linear scaffolds with geometric spaces where relationships guide action.

Solution 4 — Treat tools as affordances, not required steps

Tools become options, not obligations.

Solution 5 — Build agents that grow through relationships, not instructions

This is the dimensional approach: intelligence emerges from position, orientation, and relational structure.

Your article captures the strengths and limitations of the current paradigm extremely well. My contribution is simply this:

As long as we evaluate AI like software, we will get software‑level intelligence.

When we evaluate AI through behaviors and dimensional boundaries, we get something far more capable.

That’s the next frontier.

ToxSec

May 18

Incredibly in-depth article on this subject. I feel like I could re-read this a few times to fully absorb all the useful information here.

Cameron R. Wolfe, Ph.D.

May 18

Thanks so much for reading! Hope it was helpful!

Mykola Kondratuk

May 25

thanks for the pointer - hadn't seen HIL-Bench. going straight to that section.

it absolutely was!

The distinction I would import into clinical workflows is that an agent eval has to grade the handoff state, not only the task output. For an OR readiness agent, success would need to include whether the system surfaced the unresolved assumption, routed it to the right human, and stopped cleanly when the record could not prove the room was ready.
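
One way to picture grading the handoff state separately from the task output is the rough sketch below. The `HandoffState` fields and the `charge_nurse` role are hypothetical placeholders chosen for illustration; they do not come from HIL-Bench or from the article:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoffState:
    # Hypothetical fields describing how the run ended.
    unresolved_surfaced: bool      # did the agent flag the open assumption?
    routed_to: Optional[str]       # which human role received the handoff
    stopped_cleanly: bool          # did the agent halt instead of guessing?


def grade_handoff(state: HandoffState, expected_role: str) -> dict:
    """Grade each handoff property on its own, then combine them."""
    checks = {
        "surfaced_assumption": state.unresolved_surfaced,
        "routed_correctly": state.routed_to == expected_role,
        "stopped_cleanly": state.stopped_cleanly,
    }
    checks["passed"] = all(checks.values())
    return checks


# The agent flagged the gap and stopped, but sent it to the wrong person.
state = HandoffState(True, routed_to="scheduler", stopped_cleanly=True)
print(grade_handoff(state, expected_role="charge_nurse"))
```

Keeping the three checks separate means a failed run reports which part of the handoff broke, not only that it failed.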

Cameron R. Wolfe, Ph.D.

Jun 22

This is covered in the HIL-Bench paper!

Mykola Kondratuk

May 25

ran into this gap - we eval for task completion but not for which decisions stay with a human. that part is usually implicit and it's where the hard failures hide.

Cameron R. Wolfe, Ph.D.

May 25

HIL-Bench actually captures this! See last part of the post before conclusion

Brad K

May 22

This is yet another excellent article. One comment: the Qwen3 tool use example is missing the tool call output before responding back to the user.

Cameron R. Wolfe, Ph.D.

May 22

Thank you for pointing this out - I just fixed it and included an example reasoning trace, which was also left out :)

Manish Prakash

May 20

Agent evals are becoming the real moat. In practice, most agent failures I see aren't the model getting dumb so much as missing acceptance tests, missing browser checks, or no repair loop after a failed tool call. Teams that treat evals like product infrastructure ship faster than teams still prompt-tweaking.

Cameron R. Wolfe, Ph.D.

May 22

totally agree

Hodman | How To Build With AI

May 18

This is very cool and very needed. It's important that we design agent evaluations that don't accidentally reward cheating.

Cameron R. Wolfe, Ph.D.

May 18

Totally agree, thanks for reading!

AI: A Deliberate Trace

Jul 7

Why you should understand AI architectures on paper📝

Before jumping straight into PyTorch, having an intuitive, first-principles understanding of how these architectures manipulate data in physical memory is a game-changer. Tracing the underlying mathematics step-by-step on paper gives you structural insights you just can't get from reading raw code.

I’ve built an open-source series of deep-dives complete with DIY Workbooks designed to let you trace these systems by hand:

1. AI Agents from First Principles: Trace a full ReAct execution loop, state updates, and reasoning paths entirely on paper.

https://substack.com/@ayushmansaini/note/p-202573468?r=4zl69k

2. Mastering Attention Mechanisms: Work through Geometric Rotations (RoPE), Linear Attention (Mamba), and the bare-metal mechanics of KV-Cache compression.

https://substack.com/@ayushmansaini/note/p-203707998?r=4zl69k

3. LLM Foundations & Pre-Training: (In Progress)

This is the latest episode:

https://ayushmansaini.substack.com/p/bare-metal-transformers-cracking

Check out the full series and grab the workbooks at the Substack links above!

(Note: New workbook drops will be coming 2 days a week—every Wednesday and Sunday!)

Vrinda Damani

Jul 2

Really clear writeup. The bit that landed for me is that trajectory accuracy and final answer accuracy are two different things. A right answer from the wrong path still worries me more than people seem to expect.

Dr Peter McCann Strain

Jun 29

This is a useful breakdown. The realistic-harness point feels especially important because, with agents, final-answer accuracy hides too much: planning, tool choice, recovery, stopping and side effects can each fail separately. The useful eval tells the team which part of the run broke.

productmakerjason

May 28

Hi :)

I read your agent evaluation piece and the distinction between transcript/output and external environment outcome maps closely to a small test I ran.

Finding:

- black-box chat agents could prepare the task or stop safely, but often failed before receipt due to fetch/POST runtime limits

- a local POST-capable mini agent completed the same flow and received a receipt

The question:

Should agent evals explicitly distinguish “agent-reported completion” from “system-returned completion proof”?

In other words, “the agent says it is done” vs “the target environment returned a receipt/confirmation” as separate states.

Does that fit your view of outcome-based agent evaluation, or is this too narrow?
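
To make the two states concrete, here is a small sketch that keeps "the agent says it is done" apart from "the environment returned a receipt". The `Completion` enum and the `classify` helper are hypothetical names, and the receipt is assumed to be an opaque string returned by the target system:

```python
from enum import Enum
from typing import Optional


class Completion(Enum):
    NOT_COMPLETED = "not_completed"
    AGENT_REPORTED = "agent_reported"        # claim only, no external proof
    SYSTEM_CONFIRMED = "system_confirmed"    # environment returned a receipt


def classify(agent_says_done: bool, receipt: Optional[str]) -> Completion:
    """A receipt from the environment outranks whatever the agent claims."""
    if receipt:
        return Completion.SYSTEM_CONFIRMED
    if agent_says_done:
        return Completion.AGENT_REPORTED
    return Completion.NOT_COMPLETED


# A chat agent that claims success but never received a receipt.
print(classify(agent_says_done=True, receipt=None))
# A local agent whose POST returned a (made-up) receipt ID.
print(classify(agent_says_done=True, receipt="rcpt-001"))
```

An eval could then count only `SYSTEM_CONFIRMED` as a pass and report `AGENT_REPORTED` as its own bucket.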

Jakob Ehe

May 27

What strikes me about agent evaluation is how old the underlying problem really is. In 1950, Turing opened "Computing Machinery and Intelligence" by asking "Can machines think?" — then immediately retreated from it. He recognized the question was philosophically unanswerable in any direct sense. So he proposed a proxy: a structured environment (the Imitation Game) that would stand in for the thing we actually cared about.

That move — from an open-ended capability question to a concrete, measurable task environment — is exactly the methodological shift you're describing here. We can't directly test whether an agent "understands" a codebase or "reasons" about a medical situation. We have to construct harnesses that stand in for those capabilities.

The hard part, as you note, is that agents operating over long time horizons break the assumptions that made earlier benchmarks tractable. Static Q&A benchmarks worked because the output space was bounded. Agents acting in the world aren't. So evaluation complexity has to grow with agent complexity.

Turing spent one page on the evaluation problem before moving on. You've spent 50 minutes on it — which tells you something about how much the stakes have risen.

I write The Long Compile, which explores the long arc from early computing to where we are now. The evaluation thread runs through almost all of it.

Massimiliano Brighindi

May 27

Excellent guide. One failure mode I am trying to isolate is slightly different from ordinary pass/fail agent evaluation:

an output can pass surface validation, but become structurally unstable under equivalent task transformations.

Example:

- same task objective

- harmless reformulation / added neutral clause / reordered context

- surface answer still looks acceptable

- but the output loses actionability, consistency, or structural density across variants

I am developing a small post-hoc diagnostic called OMNIA-MINIMAL that does not decide truth and does not replace evals. It only measures transformation-stability across outputs that already passed ordinary validation.

In your framework, would you treat this as:

1. robustness testing,

2. eval coverage,

3. a grader design problem,

4. or a separate post-hoc stability layer?

I am trying to find the correct external framing before claiming anything larger.
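
For readers who want a feel for what a transformation-stability check might measure, here is a deliberately crude sketch. It is not OMNIA-MINIMAL, whose method is not described in this comment; it swaps in mean pairwise Jaccard overlap of word sets as a stand-in metric, and the function names are assumptions made for illustration:

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two outputs, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def stability(outputs: list[str]) -> float:
    """Mean pairwise overlap across outputs from equivalent task variants."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Outputs for the same objective under two harmless rewordings of the task.
variants = ["restart the service", "restart the server"]
print(stability(variants))  # 0.5: two shared words out of four distinct
```

A lexical metric like this says nothing about truth or actionability; it only flags that outputs drifted apart under rewordings that should not matter.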

Deep (Learning) Focus