MidnightTokens developer portal
Unit Study Document

Agent Evaluation: Trajectory Testing & Benchmarking

8 min read • Visual explainer included

Testing the Unpredictable: Agent Trajectories

Agents do not follow a fixed execution path. Given the same prompt twice, an agent may call different tools, or the same tools in a different order. Standard unit tests that assert an exact match therefore fail even when the run was acceptable. Instead, we perform Trajectory Evaluation: analyzing the sequence of steps, the tool selections, and the final output.
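To make the contrast concrete, here is a minimal sketch of why an exact-match assertion is too strict. It assumes a trajectory can be reduced to a list of tool names; the tool names and both helper functions are hypothetical and only illustrate the idea.

```python
from collections import Counter

def exact_match(actual, reference):
    """Unit-test style check: same tools in exactly the same order."""
    return actual == reference

def same_tools_any_order(actual, reference):
    """Looser check: same tools called the same number of times, order ignored."""
    return Counter(actual) == Counter(reference)

# Two hypothetical runs of the same prompt that differ only in tool order.
run_a = ["search_docs", "fetch_page", "summarize"]
run_b = ["fetch_page", "search_docs", "summarize"]

strict_result = exact_match(run_a, run_b)
loose_result = same_tools_any_order(run_a, run_b)
print(strict_result, loose_result)
```

The strict check rejects the second run although it used the same tools, while the order-insensitive check accepts it. Real evaluations usually need something richer than either, which is where the criteria below come in.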

LLM-as-a-Judge: We evaluate intermediate agent steps with a highly capable evaluator model that compares the agent's actual trajectory against a gold-standard reference set.
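One way this comparison could be wired up is sketched below. It is not tied to any particular model provider: `call_model` is a hypothetical callable that takes a prompt string and returns the evaluator's raw text reply, and the prompt wording and JSON reply format are assumptions made for illustration.

```python
import json

def build_judge_prompt(task, actual_steps, reference_steps):
    """Assemble a grading prompt (wording is an assumption, not a standard)."""
    return (
        "You are grading an agent's trajectory.\n"
        f"Task: {task}\n"
        f"Reference trajectory: {json.dumps(reference_steps)}\n"
        f"Actual trajectory: {json.dumps(actual_steps)}\n"
        'Reply with JSON only: {"score": <0.0 to 1.0>, "reason": "<short text>"}'
    )

def judge_trajectory(task, actual_steps, reference_steps, call_model):
    """Ask the evaluator model for a verdict and clamp the score to [0, 1]."""
    raw_reply = call_model(build_judge_prompt(task, actual_steps, reference_steps))
    verdict = json.loads(raw_reply)
    score = min(1.0, max(0.0, float(verdict["score"])))
    return {"score": score, "reason": str(verdict.get("reason", ""))}

# Stand-in for a real evaluator model so the sketch runs offline.
def fake_judge(prompt):
    return '{"score": 0.8, "reason": "one redundant search"}'

result = judge_trajectory(
    task="Find the release date",
    actual_steps=["search", "search", "answer"],
    reference_steps=["search", "answer"],
    call_model=fake_judge,
)
print(result)
```

Passing the model in as a callable keeps the grading logic testable without network access. A production version would also need to handle replies that are not valid JSON.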

Trajectory Score Metric

Evaluators check three key criteria: correctness of the final output, efficiency (the number of loops run), and accuracy of tool calls (avoiding redundant tool calls).
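The three criteria can be folded into a single number. The sketch below is one possible formulation, not a standard metric: the weights, the `RunSummary` type, the reference loop count, and the rule that a repeated identical call counts as redundant are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    final_output_correct: bool  # did the final answer match expectations?
    loops_run: int              # number of agent loop iterations
    tool_calls: list[str]       # each call rendered as a string, e.g. "search(q)"

def efficiency(loops_run, reference_loops):
    """1.0 when the run used no more loops than the reference, lower otherwise."""
    if loops_run <= 0:
        return 0.0
    return min(1.0, reference_loops / loops_run)

def tool_accuracy(tool_calls):
    """Share of calls that are not exact repeats of an earlier call."""
    if not tool_calls:
        return 1.0
    return len(set(tool_calls)) / len(tool_calls)

def trajectory_score(run, reference_loops, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of correctness, efficiency, and tool accuracy (assumed weights)."""
    w_correct, w_eff, w_tools = weights
    return (
        w_correct * float(run.final_output_correct)
        + w_eff * efficiency(run.loops_run, reference_loops)
        + w_tools * tool_accuracy(run.tool_calls)
    )

run = RunSummary(
    final_output_correct=True,
    loops_run=4,
    tool_calls=["search(q)", "search(q)", "fetch(url)", "answer()"],
)
score = trajectory_score(run, reference_loops=2)
print(score)  # 0.5 + 0.25 * 0.5 + 0.25 * 0.75 = 0.8125
```

Here the run gets full credit for correctness but loses points for taking twice as many loops as the reference and for repeating one search.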

Fast Drill

Active Recalls

Card 1 of 1
Question

What is Trajectory Evaluation?

Answer

Evaluating the entire step-by-step history of thoughts, tool selections, and intermediate observations of an agent run.

Knowledge Check

Quiz Practice

Question 1 of 1
Why do standard unit tests fail to evaluate complex agent loops?
