Organizations building and deploying AI applications, particularly those using large language models (LLMs) with Retrieval Augmented Generation (RAG) systems, face a significant challenge: how to evaluate AI outputs effectively throughout the application lifecycle. As these AI technologies become more sophisticated and widely adopted, maintaining consistent quality and performance becomes increasingly complex.
Traditional AI evaluation approaches have significant limitations. Human evaluation, although thorough, is time-consuming and expensive at scale. Automated metrics are fast and cost-effective, but they typically assess only the correctness of an AI response, without capturing other evaluation dimensions or explaining why an answer is problematic. Furthermore, traditional automated metrics usually require ground truth data, which is difficult to obtain for many AI applications; for those involving open-ended generation or retrieval augmented systems, defining a single "correct" answer is practically impossible. Finally, metrics such as ROUGE and F1 can be fooled by shallow linguistic similarities (word overlap) between the ground truth and the LLM response, even when the actual meanings differ substantially. These challenges make it difficult for organizations to maintain consistent quality standards across their AI applications, particularly for generative AI outputs.
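To see why overlap-based scores can mislead, consider the following minimal sketch (the sentences and whitespace tokenizer are illustrative, not drawn from any benchmark), which computes a token-level F1 in the style used by many QA metrics:

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Compute token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "The drug is safe for children over six years old"
response = "The drug is not safe for children over six years old"

# High word overlap, opposite meaning: F1 is roughly 0.95 despite the contradiction.
print(token_f1(response, reference))

An LLM-based judge, by contrast, can recognize that the single word "not" reverses the meaning and score the response accordingly.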
Amazon Bedrock has recently launched two new capabilities to address these evaluation challenges: LLM-as-a-judge (LLMaaJ) under Amazon Bedrock Evaluations and a brand new RAG evaluation tool for Amazon Bedrock Knowledge Bases. Both features rely on the same LLM-as-a-judge technology under the hood, with slight differences depending on whether a model or a RAG application built with Amazon Bedrock Knowledge Bases is being evaluated. These evaluation features combine the speed of automated methods with human-like nuanced understanding, enabling organizations to evaluate AI outputs at scale.
These capabilities integrate seamlessly into the AI development lifecycle, empowering organizations to improve model and application quality, promote responsible AI practices, and make data-driven decisions about model selection and application deployment.
This post focuses on RAG evaluation with Amazon Bedrock Knowledge Bases, provides a guide to set up the feature, discusses nuances to consider as you evaluate your prompts and responses, and finally discusses best practices. By the end of this post, you will understand how the latest Amazon Bedrock evaluation features can streamline your approach to AI quality assurance, enabling more efficient and confident development of RAG applications.
Before diving into the implementation details, we examine the key features that make RAG evaluation on Amazon Bedrock Knowledge Bases particularly powerful.
These features enable organizations to comprehensively assess AI performance, promote responsible AI development, and make informed decisions about model selection and optimization throughout the AI application lifecycle. Now that we’ve explained the key features, we examine how these capabilities come together in a practical implementation.
The Amazon Bedrock Knowledge Bases RAG evaluation feature provides a comprehensive, end-to-end solution for assessing and optimizing RAG applications. This automated process uses the power of LLMs to evaluate both retrieval and generation quality, offering insights that can significantly improve your AI applications.
The workflow is shown moving from left to right in the following architecture diagram.
RAG system evaluation requires a balanced approach that considers three key aspects: cost, speed, and quality. Although Amazon Bedrock Evaluations primarily focuses on quality metrics, understanding all three components helps create a comprehensive evaluation strategy. The following diagram shows how these components interact and feed into a comprehensive evaluation strategy, and the next sections examine each component in detail.
The efficiency of RAG systems depends on model selection and usage patterns. Costs are primarily driven by data retrieval and by token consumption during retrieval and generation, while speed depends on model size and complexity as well as prompt and context size. For applications requiring high-performance content generation with lower latency and cost, model distillation can be an effective way to create a generator model: it produces smaller, faster models that maintain the quality of larger models for specific use cases.
Amazon Bedrock knowledge base evaluation provides comprehensive insights through quality dimensions that cover both retrieval (such as context relevance and context coverage) and generation (such as correctness, completeness, helpfulness, logical coherence, and faithfulness).
Begin your evaluation process by choosing default configurations in your knowledge base (whether backed by a vector or graph database), such as the default chunking strategy, embedding model, and prompt template; these are just some of the possible options. This approach establishes a baseline, helping you understand your RAG system's current effectiveness across the available evaluation metrics before optimization. Next, create a diverse evaluation dataset containing queries and knowledge sources that accurately reflect your use case. The diversity of this dataset will provide a comprehensive view of your RAG application's performance in production.
Understanding how different components affect these metrics enables informed decisions about retrieval parameters, prompt templates, and knowledge base configurations.
Finally, implement a systematic approach to ongoing evaluation, rerunning evaluation jobs as your data, prompts, and configurations change so you can measure each adjustment against your baseline.
To use the knowledge base evaluation feature, make sure that you have satisfied the following requirements: an Amazon Bedrock knowledge base populated with your data, an IAM service role that Amazon Bedrock can assume with permissions to run evaluation jobs, access to the evaluator and generator models you plan to use, and an Amazon S3 bucket for the input dataset and evaluation output.
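For reference, the service role must trust Amazon Bedrock. A minimal trust policy looks like the following (verify the complete permission requirements for your account in the Amazon Bedrock documentation):

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Service": "bedrock.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}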
To prepare your dataset for a knowledge base evaluation job, you need to follow two important steps:
First, adhere to the required format: each line is a single JSON object with a conversationTurns key, as shown in the examples that follow. Second, save the dataset as a JSON Lines file (with a .jsonl extension).

Special note: On March 20, 2025, the referenceContexts key will change to referenceResponses. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages/chunks retrieved from the knowledge base.

The following example shows the dataset format for a retrieval-only evaluation job:
{
    "conversationTurns": [{
        ## required for Context Coverage metric
        "referenceContexts": [{
            "content": [{
                "text": "This is reference retrieved context"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
The following example shows the dataset format for a retrieve-and-generate evaluation job:

{
    "conversationTurns": [{
        ## optional
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference response used as ground truth"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
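The following sketch shows one way to produce a dataset in the retrieve-and-generate format; the example pair, file name, bucket, and key are placeholders. It writes one JSON object per line and uploads the resulting .jsonl file to Amazon S3:

import json
import boto3

# Illustrative prompt/ground-truth pairs; replace with your own data.
examples = [
    ("What are some risks associated with Amazon's expansion?",
     "Risks include operational, competitive, financial, IP infringement, and foreign exchange risks."),
]

# Write one JSON object per line, matching the retrieve-and-generate dataset format.
with open("input.jsonl", "w") as f:
    for prompt, reference in examples:
        record = {
            "conversationTurns": [{
                "referenceResponses": [{"content": [{"text": reference}]}],
                "prompt": {"content": [{"text": prompt}]},
            }]
        }
        f.write(json.dumps(record) + "\n")

# Upload to the S3 location you will pass as the evaluation job's input dataset.
s3 = boto3.client("s3")
s3.upload_file("input.jsonl", "<YOUR_BUCKET>", "evaluation_data/input.jsonl")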
Amazon Bedrock Evaluations provides you with an option to run an evaluation job through a guided user interface on the console, which walks you through each configuration step.
The following screenshot shows the Configurations screen.
On the Evaluation details tab, examine score distributions through histograms for each evaluation metric, showing average scores and percentage differences. Hover over the histogram bars to check the number of conversations in each score range, helping identify patterns in performance, as shown in the following screenshots.
To use the Python SDK for creating a knowledge base evaluation job, follow these steps. First, set up the required configurations:
import boto3
from datetime import datetime
# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
# Configure your knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"
# Configure retrieval settings
num_results = 10
search_type = "HYBRID"
# Create Bedrock client
bedrock_client = boto3.client('bedrock')
For retrieval-only evaluation, create a job that focuses on assessing the quality of retrieved contexts:
retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
For a complete evaluation of both retrieval and generation, use this configuration:
# Evaluation job names must be unique; use a distinct name if you already
# created the retrieval-only job above.
retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=f"{job_name}-retrieve-generate",
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
To monitor the progress of your evaluation job, use the following code:

# Retrieve the ARN of whichever job you created so you can monitor it
# and take downstream actions.
evaluation_job_arn = retrieval_job['jobArn']
# evaluation_job_arn = retrieve_generate_job['jobArn']
response = bedrock_client.get_evaluation_job(
jobIdentifier=evaluation_job_arn
)
print(f"Job Status: {response['status']}")
After your evaluation jobs are completed, Amazon Bedrock RAG evaluation provides a detailed comparative dashboard across the evaluation dimensions.
The evaluation dashboard includes comprehensive metrics, but we focus on one example, the completeness histogram shown below. This visualization represents how well responses cover all aspects of the questions asked. In our example, the distribution is heavily concentrated at the high end of the scale, with an average score of 0.921. The majority of responses (15) scored above 0.9, while a small number fell in the 0.5-0.8 range. This type of distribution helps you quickly identify whether your RAG system performs consistently or whether specific cases need attention.
Selecting specific score ranges in the histogram reveals detailed conversation analyses. For each conversation, you can examine the input prompt, generated response, number of retrieved chunks, ground truth comparison, and most importantly, the detailed score explanation from the evaluator model.
Consider this example response that scored 0.75 for the question, “What are some risks associated with Amazon’s expansion?” Although the generated response provided a structured analysis of operational, competitive, and financial risks, the evaluator model identified missing elements around IP infringement and foreign exchange risks compared to the ground truth. This detailed explanation helps in understanding not just what’s missing, but why the response received its specific score.
This granular analysis is crucial for systematic improvement of your RAG pipeline. By understanding patterns in lower-performing responses and specific areas where context retrieval or generation needs improvement, you can make targeted optimizations to your system—whether that’s adjusting retrieval parameters, refining prompts, or modifying knowledge base configurations.
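Beyond the console, the evaluation job also writes its raw results to the S3 output location you configured, so you can run your own analyses. The exact folder layout under that prefix is job-specific, but you can list what was produced with a few lines of boto3 (the bucket name and prefix mirror the placeholders used earlier):

import boto3

s3 = boto3.client("s3")

# List everything the evaluation job wrote under the configured output prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<YOUR_BUCKET>", Prefix="evaluation_output/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])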
A few best practices help build a solid foundation for your RAG evaluation strategy: establish a baseline with default configurations, evaluate against a diverse dataset that reflects your production traffic, change one component at a time, and use the evaluator model's score explanations to guide each iteration.
To help you dive deeper into the scientific validation of these practices, we’ll be publishing a technical deep-dive post that explores detailed case studies using public datasets and internal AWS validation studies. This upcoming post will examine how our evaluation framework performs across different scenarios and demonstrate its correlation with human judgments across various evaluation dimensions. Stay tuned as we explore the research and validation that powers Amazon Bedrock Evaluations.
Amazon Bedrock knowledge base RAG evaluation enables organizations to confidently deploy and maintain high-quality RAG applications by providing comprehensive, automated assessment of both retrieval and generation components. By combining the benefits of managed evaluation with the nuanced understanding of human assessment, this feature allows organizations to scale their AI quality assurance efficiently while maintaining high standards. Organizations can make data-driven decisions about their RAG implementations, optimize their knowledge bases, and follow responsible AI practices through seamless integration with Amazon Bedrock Guardrails.
Whether you’re building customer service solutions, technical documentation systems, or enterprise knowledge base RAG, Amazon Bedrock Evaluations provides the tools needed to deliver reliable, accurate, and trustworthy AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets. You can find it on our GitHub repository.
We encourage you to explore these capabilities in the Amazon Bedrock console and discover how systematic evaluation can enhance your RAG applications.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Ayan Ray is a Senior Generative AI Partner Solutions Architect at AWS, where he collaborates with ISV partners to develop integrated Generative AI solutions that combine AWS services with AWS partner products. With over a decade of experience in Artificial Intelligence and Machine Learning, Ayan has previously held technology leadership roles at AI startups before joining AWS. Based in the San Francisco Bay Area, he enjoys playing tennis and gardening in his free time.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Evangelia Spiliopoulou is an Applied Scientist in the AWS Bedrock Evaluation group, where the goal is to develop novel methodologies and tools to assist automatic evaluation of LLMs. Her overall work focuses on Natural Language Processing (NLP) research and developing NLP applications for AWS customers, including LLM Evaluations, RAG, and improving reasoning for LLMs. Prior to Amazon, Evangelia completed her Ph.D. at Language Technologies Institute, Carnegie Mellon University.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.