Moving AI agents from prototypes to production surfaces a challenge that traditional testing is unable to address. Agents are flexible, adaptive, and context-aware by design, but the same qualities that make them powerful also make them difficult to evaluate systematically.
Traditional software testing relies on deterministic outputs: same input, same expected output, every time. AI agents break this assumption. They generate natural language, make context-dependent decisions, and produce varied outputs even from identical inputs. How do you systematically evaluate something that is not deterministic?
In this post, we show how to evaluate AI agents systematically using Strands Evals. We walk through the core concepts, built-in evaluators, multi-turn simulation capabilities, and practical patterns for integrating evaluation into your workflow. Strands Evals provides a structured framework for evaluating AI agents built with the Strands Agents SDK, offering evaluators, simulation tools, and reporting capabilities. Whether you need to verify that your agent uses the right tools, produces helpful responses, or guides users toward their goals, the framework provides infrastructure to measure and track these qualities systematically.
When you ask an agent “What is the weather like in Tokyo?”, many valid responses exist, and no single answer is definitively correct. The agent might report temperature in Celsius or Fahrenheit, include humidity and wind, or only focus on temperature. These variations could be correct and helpful, which is exactly why traditional assertion-based testing falls short. Beyond text generation, agents also take action. A well-designed agent calls tools, retrieves information, and makes decisions throughout a conversation. Evaluating the final response alone misses whether the agent took appropriate steps to reach that response.
Even correct responses can fall short. A response might be factually accurate but unhelpful, or helpful but unfaithful to source materials. No single metric captures these different quality dimensions. Conversations add another layer of complexity because they unfold over time. In multi-turn interactions, earlier responses affect later ones. An agent might handle individual queries well but fail to maintain a coherent context across a conversation. Testing single turns in isolation misses these interaction patterns.
These characteristics demand evaluation that requires judgment rather than keyword comparison. Large language model (LLM)-based evaluation addresses this need. By using language models as evaluators, we can assess qualities like helpfulness, coherence, and faithfulness that resist mechanical checking. Strands Evals embraces this flexibility while still offering rigorous, repeatable quality assessments.
Strands Evals follows a pattern that should feel familiar to anyone who has written unit tests but adapts it for the judgment-based evaluation that AI agents require. The framework introduces three foundational concepts that work together: Cases, Experiments, and Evaluators.
A Case represents a single test scenario. It contains the input that you want to test, perhaps a user’s query like “What is the weather in Paris?”, along with optional expected outputs, expected tool sequences known as trajectories, and metadata. Cases are the atomic unit of evaluation. Each one defines one scenario that you want your agent to handle correctly.
from strands_evals import Case
case = Case(
    name="Weather Query",
    input="What is the weather like in Tokyo?",
    expected_output="Should include temperature and conditions",
    expected_trajectory=["weather_api"]
)
An Experiment bundles multiple Cases together with one or more evaluators. Think of it as a test suite in traditional testing. The Experiment orchestrates the evaluation process. It takes each Case, runs your agent on it, and applies the configured evaluators to score the results.
Evaluators are the judges. They examine what your agent produced (the actual output and trajectory) and compare it against what was expected. Unlike simple assertion checks, evaluators in Strands Evals are primarily LLM-based. They use language models to make nuanced judgments about quality, relevance, helpfulness, and other properties that cannot be reduced to string comparison.
Separating these concerns helps keep the framework flexible. You can define what to test with Cases, how to test it with evaluators, and the framework handles orchestration and reporting through Experiments. Each piece can be configured independently so that you can build evaluation suites that are tailored to your specific needs.
Cases define your scenarios, and evaluators provide judgment. But how does your agent actually connect to this evaluation system? That is where the Task Function comes in.
A Task Function is a callable that you provide to the Experiment. It receives a Case and returns the results of running that case through your system. This interface enables two fundamentally different evaluation patterns.
Online evaluation involves invoking your agent live during the evaluation run. Your Task Function creates an agent, sends it the case input, captures the response and execution trace, and returns them for evaluation. This pattern is recommended during development when you want to test changes immediately, or in continuous integration and delivery (CI/CD) pipelines where you need to verify agent behavior before deployment.
from strands import Agent
def online_task(case):
    agent = Agent(tools=[search_tool, calculator_tool])
    result = agent(case.input)
    return {
        "output": str(result),
        "trajectory": agent.session
    }
Offline evaluation works with historical data. Instead of invoking an agent, your Task Function retrieves previously recorded traces from logs, databases, or observability systems. It parses these traces into the format that evaluators expect and returns them for judgment. This pattern works well when you need to evaluate production traffic, perform historical analysis, or compare agent versions against the same set of real user interactions.
def offline_task(case):
    trace = load_trace_from_database(case.session_id)
    session = session_mapper.map_to_session(trace)
    return {
        "output": extract_final_response(trace),
        "trajectory": session
    }
Whether you are testing a new agent implementation or analyzing months of production data, the same evaluators and reporting infrastructure apply. The Task Function adapts your data source to the evaluation system.
With your Task Function connecting agent output to the evaluation system, you can now decide which aspects of quality to measure. Strands Evals ships with ten built-in evaluators, each designed to assess a different dimension of agent quality.
The most flexible evaluators let you define custom criteria through natural language rubrics.
from strands_evals.evaluators import OutputEvaluator
output_evaluator = OutputEvaluator(
    rubric="Score 1.0 if the response correctly answers the question and is well-structured. "
           "Score 0.5 if partially correct. Score 0.0 if incorrect or irrelevant."
)
Some quality dimensions are common enough that Strands Evals provides pre-built evaluators with carefully designed prompts and scoring scales.
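To make the LLM-as-judge idea concrete, here is a conceptual sketch of what such an evaluator does internally: it embeds a rubric and the agent's output into a judging prompt, then parses a score from the judge model's reply. The prompt wording and the helper names below are illustrative assumptions, not the actual prompts or APIs that Strands Evals uses.

```python
import re

def build_judge_prompt(rubric: str, user_input: str, agent_output: str) -> str:
    """Assemble a judging prompt from a rubric and the interaction under test."""
    return (
        "You are an impartial evaluator.\n"
        f"Rubric: {rubric}\n"
        f"User input: {user_input}\n"
        f"Agent output: {agent_output}\n"
        "Reply with a line of the form 'SCORE: <number between 0 and 1>'."
    )

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return float(match.group(1))

prompt = build_judge_prompt(
    rubric="Score 1.0 if the response includes temperature and conditions.",
    user_input="What is the weather like in Tokyo?",
    agent_output="It is 18°C and partly cloudy in Tokyo.",
)
score = parse_score("Reasoning: covers both dimensions.\nSCORE: 1.0")
```

A pre-built evaluator packages this pattern, refined prompt engineering, and a documented scoring scale so that you do not have to design and validate the rubric yourself.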
When your agent uses tools, you often need to evaluate not only the final outcome, but the quality of individual tool invocations.
The choice depends on what matters most for your application. A customer service agent might prioritize helpfulness and goal success. A research assistant might emphasize faithfulness. Start with a small set of evaluators that cover your core quality dimensions, then add more as you learn how your agent fails.
The previously mentioned evaluators work well for single-turn interactions where you provide an input, get an output, and evaluate it. Multi-turn conversations present a harder challenge. Real users don’t follow scripts. They ask follow-up questions, change direction, and express confusion. How do you test this? Strands Evals includes an ActorSimulator that creates AI-powered simulated users to drive multi-turn conversations with your agent.
ActorSimulator starts with a test case that defines what the user wants to achieve. From this, it generates a realistic user profile using an LLM, including personality traits, expertise level, communication style, and a specific goal. This profile shapes how the simulated user behaves throughout the conversation.
from strands_evals import Case, ActorSimulator
from strands import Agent
case = Case(
    input="I need help setting up a new bank account",
    metadata={"task_description": "Successfully open a checking account"}
)

user_sim = ActorSimulator.from_case_for_user_simulator(
    case=case,
    max_turns=10
)
During the interaction, the simulated user sends messages to your agent, receives responses, and decides what to say next. This loop continues until either the goal is achieved, indicated by emitting a special stop token, or the maximum turn count is reached.
agent = Agent(system_prompt="You are a helpful banking assistant.")
user_message = case.input

while user_sim.has_next():
    agent_response = agent(user_message)
    user_result = user_sim.act(str(agent_response))
    user_message = str(user_result.structured_output.message)
You can then pass the resulting conversation transcript to session-level evaluators like GoalSuccessRateEvaluator to assess whether your agent successfully helped the simulated user achieve their goal. Instead of manually writing multi-turn scripts, you define goals and let the simulator create realistic interaction patterns. It might ask unexpected follow-up questions, express confusion, or take the conversation in directions that you didn’t anticipate, catching edge cases that scripted tests can miss.
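Session-level evaluation needs the transcript of that loop. Here is a minimal sketch of collecting (user, agent) pairs during a simulated conversation, with stub functions standing in for the real agent and ActorSimulator; the stop-token convention shown is an assumption for illustration, not the framework's actual token.

```python
STOP_TOKEN = "<STOP>"  # illustrative stop token, not the framework's actual token

def stub_agent(message: str) -> str:
    """Stand-in for a real agent call."""
    return f"Here is help with: {message}"

def stub_user(agent_response: str, turn: int) -> str:
    """Stand-in for the simulated user; declares success after two turns."""
    return "Thanks, that worked. " + STOP_TOKEN if turn >= 2 else "Can you clarify?"

def run_simulation(first_message: str, max_turns: int = 10) -> list[tuple[str, str]]:
    """Drive a user/agent loop and record (user, agent) pairs for later evaluation."""
    transcript = []
    user_message = first_message
    for turn in range(1, max_turns + 1):
        agent_response = stub_agent(user_message)
        transcript.append((user_message, agent_response))
        user_message = stub_user(agent_response, turn)
        if STOP_TOKEN in user_message:
            break
    return transcript

transcript = run_simulation("I need help setting up a new bank account")
```

The transcript, however it is recorded, is what a session-level evaluator consumes to judge whether the goal was achieved.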
Whether using simulated or real conversations, different evaluators operate at different granularities. Strands Evals uses a TraceExtractor to parse session data into the format that each evaluator needs.
Session-level evaluation looks at the complete conversation from beginning to end. The evaluator receives the full history and tool executions, so it understands the entire context. GoalSuccessRateEvaluator works at this level because determining goal achievement requires understanding the whole interaction.
Trace-level evaluation focuses on individual turns: each user prompt and agent response pair. Evaluators at this level receive the conversation history up to that point and judge the specific response. Helpfulness, Faithfulness, and Harmfulness evaluators work here because these qualities can be assessed turn by turn.
Tool-level evaluation drills down to individual tool invocations. Each tool call is evaluated in context, with access to the available tools, the conversation so far, and the specific arguments passed. Tool Selection and Tool Parameter evaluators operate at this granularity.
You can use the hierarchical design to compose evaluation suites that check quality at multiple levels simultaneously. Within a single evaluation run, you can verify that individual tool calls are sensible, responses are helpful, and overall goals are achieved.
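As a conceptual illustration, the same recorded session can be projected into all three granularities from one event stream. The event format and helper names below are hypothetical; the real TraceExtractor has its own internal representation.

```python
# A recorded session as a flat event stream (hypothetical format).
events = [
    {"type": "user", "text": "What is the weather in Paris?"},
    {"type": "tool_call", "name": "weather_api", "args": {"city": "Paris"}},
    {"type": "assistant", "text": "It is 15°C and cloudy in Paris."},
    {"type": "user", "text": "And in Tokyo?"},
    {"type": "tool_call", "name": "weather_api", "args": {"city": "Tokyo"}},
    {"type": "assistant", "text": "Tokyo is 18°C and clear."},
]

def session_view(events):
    """Session level: the entire event stream, for whole-conversation judgments."""
    return events

def trace_view(events):
    """Trace level: (user prompt, assistant response) pairs, one per turn."""
    prompts = [e["text"] for e in events if e["type"] == "user"]
    replies = [e["text"] for e in events if e["type"] == "assistant"]
    return list(zip(prompts, replies))

def tool_view(events):
    """Tool level: individual tool invocations with their arguments."""
    return [(e["name"], e["args"]) for e in events if e["type"] == "tool_call"]
```

Each view feeds a different class of evaluator, which is what lets one evaluation run check tool calls, per-turn responses, and overall goal success together.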
Across these evaluation levels, evaluators benefit from having reference points for comparison. Strands Evals provides first-class support for ground truth through expected fields on Cases.
The expected_output field specifies what the agent should say. This is useful when there are correct answers or standard response formats. The expected_trajectory field defines the sequence of tools or actions that the agent should take. You might require that a customer service agent check account status before making changes, or that a research agent query multiple sources before synthesizing. Not every Case needs every field; you define expectations based on what matters for your evaluation goals. When expected values are provided, evaluators receive both expected and actual results, enabling comparison-based scoring alongside standalone quality assessment.
Let’s walk through a typical evaluation workflow to see how these concepts come together.
First, you define your test cases, meaning the scenarios you want your agent to handle well. They might come from real user queries, synthetic generation, or edge cases that you have identified.
from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator, TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
cases = [
    Case(
        name="Weather Query",
        input="What is the weather like in Tokyo?",
        expected_output="Should include temperature and conditions",
        expected_trajectory=["weather_api"]
    ),
    Case(
        name="Calculator Usage",
        input="What is 15% of 847?",
        expected_output="127.05",
        expected_trajectory=["calculator"]
    )
]
Next, you configure evaluators with appropriate rubrics or settings.
output_evaluator = OutputEvaluator(
    rubric="Score 1.0 if the response is accurate and directly answers the question. "
           "Score 0.5 if partially correct. Score 0.0 if incorrect or irrelevant."
)

trajectory_evaluator = TrajectoryEvaluator(
    rubric="Verify the agent used appropriate tools for the task."
)
Then, you create an experiment bundling cases and evaluators.
experiment = Experiment(
    cases=cases,
    evaluators=[output_evaluator, trajectory_evaluator]
)
Finally, you run the evaluation with your Task Function and examine the results.
def my_task(case):
    agent = Agent(tools=[weather_tool, calculator_tool])
    result = agent(case.input)
    return {
        "output": str(result),
        "trajectory": tools_use_extractor.extract_agent_tools_used(agent.messages)
    }

reports = experiment.run_evaluations(my_task)
for report in reports:
    report.display()
The EvaluationReport provides overall scores, per-case breakdowns, pass/fail status, and detailed reasoning from each evaluator. You can display results interactively in the console, export to JSON for further analysis, or integrate into CI/CD pipelines. For larger test suites, Strands Evals supports asynchronous evaluation with configurable parallelism:
reports = await experiment.run_evaluations_async(my_task, max_workers=10)
The previous workflow assumes that you have test cases ready. Creating comprehensive test suites by hand is tedious as your agent’s capabilities grow. Strands Evals includes an ExperimentGenerator that uses LLMs to create test cases and evaluation rubrics from high-level descriptions.
from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import OutputEvaluator
generator = ExperimentGenerator(
    input_type=str,
    output_type=str,
    include_expected_output=True
)

experiment = await generator.from_context_async(
    context="A customer service agent for an e-commerce platform",
    task_description="Handle customer inquiries about orders, returns, and products",
    num_cases=20,
    evaluator=OutputEvaluator
)
The generator creates diverse test cases covering different aspects of the specified context, with appropriate difficulty levels. It can also generate evaluation rubrics that are tailored to the task. Generated cases are particularly valuable during early development when you want broad coverage but have not yet identified specific failure patterns. As your evaluation practice matures, supplement generated Cases with hand-crafted scenarios targeting known edge cases.
Evaluation delivers the most value when it is part of your regular development workflow. During development, run evaluations frequently as you make changes. Fast feedback helps you catch regressions early and understand how changes affect different quality dimensions.
In CI/CD pipelines, include evaluation as a quality gate before deployment. Set score thresholds that must be met for a build to pass. This helps prevent quality regressions from reaching production. For production monitoring, use offline evaluation to assess real user interactions periodically. This reveals patterns that development testing might miss: unusual queries, edge cases you did not anticipate, or gradual drift in agent behavior. Track evaluation results over time. Trending metrics help you understand whether quality is improving or degrading.
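The quality-gate idea can be sketched in a few lines. The report shape here is a plain dict of average metric scores, an assumption for illustration rather than the actual EvaluationReport API.

```python
def passes_quality_gate(case_scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Fail the build if any tracked metric falls below its threshold."""
    return all(
        case_scores.get(metric, 0.0) >= minimum
        for metric, minimum in thresholds.items()
    )

# Example: average evaluator scores collected from an evaluation run.
scores = {"output": 0.92, "trajectory": 0.88}
thresholds = {"output": 0.85, "trajectory": 0.80}
ok = passes_quality_gate(scores, thresholds)  # True: both metrics clear their bars
```

In a CI/CD pipeline, a falsy result would cause the job to exit non-zero and block deployment.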
Choose evaluators that match your application's priorities: a customer service agent might emphasize HelpfulnessEvaluator and GoalSuccessRateEvaluator, while a research assistant might weigh FaithfulnessEvaluator more heavily. Avoid the temptation to add every available evaluator, as this increases cost and can dilute focus. Use the ExperimentGenerator for broad coverage, then supplement with hand-crafted Cases targeting known weaknesses.
Building reliable AI agents requires more than intuition and spot checks. It requires systematic evaluation that tracks quality across multiple dimensions over time. Strands Evals helps provide this foundation through a framework designed specifically for the unique challenges of agent evaluation.
Task Functions separate agent invocation from evaluation logic, enabling both online testing during development and offline analysis of production traces. LLM-based evaluators provide the judgment that quality assessment requires. Hierarchical evaluation levels allow assessment at multiple granularities, from individual tool calls to complete conversation sessions. And the user simulator transforms multi-turn testing from a scripting exercise into realistic user behavior simulation.
These capabilities help you build confidence in your AI agents through evidence rather than assumptions. You can measure whether changes improve or degrade quality, catch regressions before they reach production, and demonstrate to stakeholders that your agents meet defined quality standards.
We encourage you to explore Strands Evals for your agent evaluation needs. The samples repository contains practical examples that you can adapt to your own use cases. Start with a few test cases representing your most important user scenarios, add evaluators that match your quality criteria, and run evaluations as part of your development workflow. Over time, expand your test suite to cover more scenarios. Systematic evaluation is the foundation that helps you ship AI agents with confidence.