Organizations building and deploying AI applications, particularly those using large language models (LLMs) with Retrieval Augmented Generation (RAG) systems, face a significant challenge: how to evaluate AI outputs effectively throughout the application lifecycle. As these AI technologies become more sophisticated and widely adopted, maintaining consistent quality and performance becomes increasingly complex.
Traditional AI evaluation approaches have significant limitations. Human evaluation, although thorough, is time-consuming and expensive at scale. Automated metrics are fast and cost-effective, but they typically assess only the correctness of an AI response, without capturing other evaluation dimensions or explaining why an answer is problematic. Furthermore, traditional automated metrics usually require ground truth data, which is difficult to obtain for many AI applications; for those involving open-ended generation or retrieval augmented systems, defining a single "correct" answer is practically impossible. Finally, metrics such as ROUGE and F1 can be fooled by shallow linguistic similarities (word overlap) between the ground truth and the LLM response, even when the actual meanings differ substantially. These challenges make it difficult for organizations to maintain consistent quality standards across their AI applications, particularly for generative AI outputs.
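To see why overlap-based scores can mislead, consider the following minimal sketch (the sentences and whitespace tokenizer are illustrative, not drawn from any benchmark), which computes a token-level F1 in the style used by many QA metrics:

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Compute token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "The drug is safe for children over six years old"
response = "The drug is not safe for children over six years old"

# High word overlap, opposite meaning: F1 is roughly 0.95 despite the contradiction.
print(token_f1(response, reference))

An LLM-based judge, by contrast, can recognize that the single word "not" reverses the meaning and score the response accordingly.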
Amazon Bedrock has recently launched two new capabilities to address these evaluation challenges: LLM-as-a-judge (LLMaaJ) under Amazon Bedrock Evaluations and a brand new RAG evaluation tool for Amazon Bedrock Knowledge Bases. Both features rely on the same LLM-as-a-judge technology under the hood, with slight differences depending on whether a model or a RAG application built with Amazon Bedrock Knowledge Bases is being evaluated. These evaluation features combine the speed of automated methods with human-like nuanced understanding, enabling organizations to evaluate AI outputs at scale.
These capabilities integrate seamlessly into the AI development lifecycle, empowering organizations to improve model and application quality, promote responsible AI practices, and make data-driven decisions about model selection and application deployment.
This post focuses on RAG evaluation with Amazon Bedrock Knowledge Bases, provides a guide to set up the feature, discusses nuances to consider as you evaluate your prompts and responses, and finally discusses best practices. By the end of this post, you will understand how the latest Amazon Bedrock evaluation features can streamline your approach to AI quality assurance, enabling more efficient and confident development of RAG applications.
Before diving into the implementation details, we examine the key features that make RAG evaluation on Amazon Bedrock Knowledge Bases particularly powerful.
These features enable organizations to comprehensively assess AI performance, promote responsible AI development, and make informed decisions about model selection and optimization throughout the AI application lifecycle. Now that we’ve explained the key features, we examine how these capabilities come together in a practical implementation.
The Amazon Bedrock Knowledge Bases RAG evaluation feature provides a comprehensive, end-to-end solution for assessing and optimizing RAG applications. This automated process uses the power of LLMs to evaluate both retrieval and generation quality, offering insights that can significantly improve your AI applications.
The workflow is shown moving from left to right in the following architecture diagram.
RAG system evaluation requires a balanced approach that considers three key aspects: cost, speed, and quality. Although Amazon Bedrock Evaluations primarily focuses on quality metrics, understanding all three components helps create a comprehensive evaluation strategy. The following diagram shows how these components interact and feed into a comprehensive evaluation strategy, and the next sections examine each component in detail.
The efficiency of RAG systems depends on model selection and usage patterns. Costs are primarily driven by data retrieval and by token consumption during retrieval and generation, while speed depends on model size and complexity as well as prompt and context size. For applications requiring high-performance content generation with lower latency and cost, model distillation can be an effective way to create a generator model: it produces smaller, faster models that maintain the quality of larger models for specific use cases.
Amazon Bedrock knowledge base evaluation provides comprehensive insights through quality dimensions that cover both retrieval (such as context relevance and context coverage) and generation (such as correctness, completeness, helpfulness, logical coherence, and faithfulness).
Begin your evaluation process by choosing default configurations in your knowledge base (whether backed by a vector or graph database), such as the default chunking strategy, embedding model, and prompt template; these are just some of the possible options. This approach establishes a baseline, helping you understand your RAG system's current effectiveness across the available evaluation metrics before optimization. Next, create a diverse evaluation dataset containing queries and knowledge sources that accurately reflect your use case. The diversity of this dataset will provide a comprehensive view of your RAG application's performance in production.
Understanding how different components affect these metrics enables informed decisions about retrieval parameters, prompt templates, and knowledge base configurations.
Finally, implement a systematic approach to ongoing evaluation, rerunning evaluation jobs as your data, prompts, and configurations change so you can measure each adjustment against your baseline.
To use the knowledge base evaluation feature, make sure that you have satisfied the following requirements: an Amazon Bedrock knowledge base populated with your data, an IAM service role that Amazon Bedrock can assume with permissions to run evaluation jobs, access to the evaluator and generator models you plan to use, and an Amazon S3 bucket for the input dataset and evaluation output.
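For reference, the service role must trust Amazon Bedrock. A minimal trust policy looks like the following (verify the complete permission requirements for your account in the Amazon Bedrock documentation):

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Service": "bedrock.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}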
To prepare your dataset for a knowledge base evaluation job, you need to follow two important steps:
First, adhere to the required format: each line is a single JSON object with a conversationTurns key, as shown in the examples that follow. Second, save the dataset as a JSON Lines file (with a .jsonl extension).

Special note: On March 20, 2025, the referenceContexts key will change to referenceResponses. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages/chunks retrieved from the knowledge base.

The following example shows the dataset format for a retrieval-only evaluation job:
{
    "conversationTurns": [{
        ## required for Context Coverage metric
        "referenceContexts": [{
            "content": [{
                "text": "This is reference retrieved context"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
The following example shows the dataset format for a retrieve-and-generate evaluation job:

{
    "conversationTurns": [{
        ## optional
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference response used as ground truth"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
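The following sketch shows one way to produce a dataset in the retrieve-and-generate format; the example pair, file name, bucket, and key are placeholders. It writes one JSON object per line and uploads the resulting .jsonl file to Amazon S3:

import json
import boto3

# Illustrative prompt/ground-truth pairs; replace with your own data.
examples = [
    ("What are some risks associated with Amazon's expansion?",
     "Risks include operational, competitive, financial, IP infringement, and foreign exchange risks."),
]

# Write one JSON object per line, matching the retrieve-and-generate dataset format.
with open("input.jsonl", "w") as f:
    for prompt, reference in examples:
        record = {
            "conversationTurns": [{
                "referenceResponses": [{"content": [{"text": reference}]}],
                "prompt": {"content": [{"text": prompt}]},
            }]
        }
        f.write(json.dumps(record) + "\n")

# Upload to the S3 location you will pass as the evaluation job's input dataset.
s3 = boto3.client("s3")
s3.upload_file("input.jsonl", "<YOUR_BUCKET>", "evaluation_data/input.jsonl")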
Amazon Bedrock Evaluations provides you with an option to run an evaluation job through a guided user interface on the console, which walks you through each configuration step.
The following screenshot shows the Configurations screen.
On the Evaluation details tab, examine score distributions through histograms for each evaluation metric, showing average scores and percentage differences. Hover over the histogram bars to check the number of conversations in each score range, helping identify patterns in performance, as shown in the following screenshots.
To use the Python SDK for creating a knowledge base evaluation job, follow these steps. First, set up the required configurations:
import boto3
from datetime import datetime
# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
# Configure your knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"
# Configure retrieval settings
num_results = 10
search_type = "HYBRID"
# Create Bedrock client
bedrock_client = boto3.client('bedrock')
For retrieval-only evaluation, create a job that focuses on assessing the quality of retrieved contexts:
retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
For a complete evaluation of both retrieval and generation, use this configuration:
# Evaluation job names must be unique; use a distinct name if you already
# created the retrieval-only job above.
retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=f"{job_name}-retrieve-generate",
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
To monitor the progress of your evaluation job, use the following code:

# Retrieve the ARN of whichever job you created so you can monitor it
# and take downstream actions.
evaluation_job_arn = retrieval_job['jobArn']
# evaluation_job_arn = retrieve_generate_job['jobArn']
response = bedrock_client.get_evaluation_job(
jobIdentifier=evaluation_job_arn
)
print(f"Job Status: {response['status']}")
After your evaluation jobs are completed, Amazon Bedrock RAG evaluation provides a detailed comparative dashboard across the evaluation dimensions.
The evaluation dashboard includes comprehensive metrics, but we focus on one example, the completeness histogram shown below. This visualization represents how well responses cover all aspects of the questions asked. In our example, the distribution is heavily concentrated at the high end of the scale, with an average score of 0.921. The majority of responses (15) scored above 0.9, while a small number fell in the 0.5-0.8 range. This type of distribution helps you quickly identify whether your RAG system performs consistently or whether specific cases need attention.
Selecting specific score ranges in the histogram reveals detailed conversation analyses. For each conversation, you can examine the input prompt, generated response, number of retrieved chunks, ground truth comparison, and most importantly, the detailed score explanation from the evaluator model.
Consider this example response that scored 0.75 for the question, “What are some risks associated with Amazon’s expansion?” Although the generated response provided a structured analysis of operational, competitive, and financial risks, the evaluator model identified missing elements around IP infringement and foreign exchange risks compared to the ground truth. This detailed explanation helps in understanding not just what’s missing, but why the response received its specific score.
This granular analysis is crucial for systematic improvement of your RAG pipeline. By understanding patterns in lower-performing responses and specific areas where context retrieval or generation needs improvement, you can make targeted optimizations to your system—whether that’s adjusting retrieval parameters, refining prompts, or modifying knowledge base configurations.
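Beyond the console, the evaluation job also writes its raw results to the S3 output location you configured, so you can run your own analyses. The exact folder layout under that prefix is job-specific, but you can list what was produced with a few lines of boto3 (the bucket name and prefix mirror the placeholders used earlier):

import boto3

s3 = boto3.client("s3")

# List everything the evaluation job wrote under the configured output prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<YOUR_BUCKET>", Prefix="evaluation_output/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])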
A few best practices help build a solid foundation for your RAG evaluation strategy: establish a baseline with default configurations, evaluate against a diverse dataset that reflects your production traffic, change one component at a time, and use the evaluator model's score explanations to guide each iteration.
To help you dive deeper into the scientific validation of these practices, we’ll be publishing a technical deep-dive post that explores detailed case studies using public datasets and internal AWS validation studies. This upcoming post will examine how our evaluation framework performs across different scenarios and demonstrate its correlation with human judgments across various evaluation dimensions. Stay tuned as we explore the research and validation that powers Amazon Bedrock Evaluations.
Amazon Bedrock knowledge base RAG evaluation enables organizations to confidently deploy and maintain high-quality RAG applications by providing comprehensive, automated assessment of both retrieval and generation components. By combining the benefits of managed evaluation with the nuanced understanding of human assessment, this feature allows organizations to scale their AI quality assurance efficiently while maintaining high standards. Organizations can make data-driven decisions about their RAG implementations, optimize their knowledge bases, and follow responsible AI practices through seamless integration with Amazon Bedrock Guardrails.
Whether you’re building customer service solutions, technical documentation systems, or enterprise knowledge base RAG, Amazon Bedrock Evaluations provides the tools needed to deliver reliable, accurate, and trustworthy AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets. You can find it on our GitHub repository.
We encourage you to explore these capabilities in the Amazon Bedrock console and discover how systematic evaluation can enhance your RAG applications.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Ayan Ray is a Senior Generative AI Partner Solutions Architect at AWS, where he collaborates with ISV partners to develop integrated Generative AI solutions that combine AWS services with AWS partner products. With over a decade of experience in Artificial Intelligence and Machine Learning, Ayan has previously held technology leadership roles at AI startups before joining AWS. Based in the San Francisco Bay Area, he enjoys playing tennis and gardening in his free time.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Evangelia Spiliopoulou is an Applied Scientist in the AWS Bedrock Evaluation group, where the goal is to develop novel methodologies and tools to assist automatic evaluation of LLMs. Her overall work focuses on Natural Language Processing (NLP) research and developing NLP applications for AWS customers, including LLM Evaluations, RAG, and improving reasoning for LLMs. Prior to Amazon, Evangelia completed her Ph.D. at Language Technologies Institute, Carnegie Mellon University.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.