Picture this: Your team just received 10,000 customer feedback responses. The traditional approach? Weeks of manual analysis. But what if AI could not only analyze this feedback but also validate its own work? Welcome to the world of large language model (LLM) jury systems deployed using Amazon Bedrock.
As more organizations embrace generative AI, particularly LLMs, for various applications, a new challenge has emerged: ensuring that the output from these AI models aligns with human perspectives and is accurate and relevant to the business context. Manual analysis of large datasets can be time consuming, resource intensive, and thus impractical. For example, manually reviewing 2,000 comments can take over 80 hours, depending on comment length, complexity, and the depth of analysis required. LLMs offer a scalable alternative, serving as qualitative text annotators, summarizers, and even judges that evaluate text outputs from other AI systems.
This prompts the question: "How can we deploy such LLM-as-a-judge systems effectively, and then use other LLMs to evaluate their performance?"
In this post, we show how you can deploy multiple generative AI models in Amazon Bedrock: first instructing one LLM to create thematic summaries of text responses (such as answers to open-ended customer survey questions), and then using several other LLMs as a jury to review those summaries and rate how well each summary title aligns with its description. This setup is often referred to as an LLM jury system. Think of the LLM jury as a panel of AI judges, each bringing its own perspective to evaluate content. Instead of relying on a single model's potentially biased view, multiple models work together to provide a more balanced assessment.
Your organization receives thousands of customer feedback responses. Traditional manual analysis can painstakingly take days or weeks, depending on the volume of free-text comments you receive. Alternative natural language processing techniques, though likely faster, also require extensive data cleanup and coding know-how to analyze the data effectively. Pre-trained LLMs offer a promising, relatively low-code solution for quickly generating thematic summaries from text-based data because these models have been shown to scale data analysis and reduce manual review time. However, relying on a single pre-trained LLM for both analysis and evaluation raises concerns about bias, such as model hallucinations (that is, producing inaccurate information) or confirmation bias (that is, favoring expected outcomes). Without cross-validation mechanisms, such as comparing outputs from multiple models or benchmarking against human-reviewed data, the risk of unchecked errors increases. Using multiple pre-trained LLMs addresses this concern by providing more robust and comprehensive analyses, while still allowing for human-in-the-loop oversight, and it enhances reliability over a single-model evaluation. The concept of using LLMs as a jury means deploying multiple generative AI models to independently evaluate or validate each other's outputs.
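The jury idea can be reduced to a simple aggregation: each judge model independently rates the same content, and the panel's verdict is the consensus. A minimal sketch with stubbed judge functions (the names and the 1-3 rating scale are illustrative; real judges would each call a different Bedrock model):

```python
from statistics import median

def jury_verdict(ratings):
    """Aggregate independent judge ratings (1-3) into a panel verdict."""
    return {
        "ratings": ratings,
        "median": median(ratings.values()),
        "unanimous": len(set(ratings.values())) == 1,
    }

# Stubbed judges standing in for separate Bedrock model calls
judges = {"claude": lambda s: 3, "nova": lambda s: 3, "llama": lambda s: 2}
summary = "Shipping delays / Late deliveries during holidays"
verdict = jury_verdict({name: judge(summary) for name, judge in judges.items()})
```

Disagreement among judges (here, a non-unanimous panel) is exactly the signal that flags an item for human review.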
You can use Amazon Bedrock to compare various frontier foundation models (FMs) such as Anthropic's Claude 3 Sonnet, Amazon Nova Pro, and Meta's Llama 3. The unified Amazon Web Services (AWS) environment and standardized API calls simplify deploying multiple models, both for thematic analysis and for judging model outputs. Amazon Bedrock also addresses operational needs through unified security and compliance controls and a consistent deployment environment across all models.
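Because Amazon Bedrock exposes one request shape across model families, the same call can be repeated for each juror. A minimal sketch using the Bedrock Converse API (the model IDs are examples; use the models enabled in your account and Region):

```python
def build_messages(prompt):
    """The Bedrock Converse API uses the same message shape for every model."""
    return [{"role": "user", "content": [{"text": prompt}]}]

# Example model IDs; confirm availability in your account and Region
MODEL_IDS = [
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "amazon.nova-pro-v1:0",
    "meta.llama3-70b-instruct-v1:0",
]

def ask_all_models(prompt, client=None):
    """Send one prompt to every model on the panel and collect replies."""
    import boto3  # deferred so the helpers above are usable without AWS
    client = client or boto3.client("bedrock-runtime")
    replies = {}
    for model_id in MODEL_IDS:
        resp = client.converse(
            modelId=model_id,
            messages=build_messages(prompt),
            inferenceConfig={"maxTokens": 512, "temperature": 0.1},
        )
        replies[model_id] = resp["output"]["message"]["content"][0]["text"]
    return replies
```

With `converse`, the per-model request-body differences that `invoke_model` requires largely disappear, which is what makes swapping jurors in and out inexpensive.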
Our proposed workflow, illustrated in the following diagram, includes these steps:

To complete the steps, you need to have the following prerequisites:
In this section, we walk you through the step-by-step implementation.
Try this out for yourself by downloading the Jupyter notebook from GitHub.
import boto3
import json

# Initialize our connection to AWS services
bedrock_runtime = boto3.client('bedrock-runtime')
s3_client = boto3.client('s3')

# Configure where we'll store our evidence (data)
bucket = 'my-example-name'
raw_input = 'feedback_dummy_data.txt'
output_themes = 'feedback_analyzed.txt'
def analyze_comment(comment):
    prompt = f"""You must respond ONLY with a valid JSON object.
Analyze this customer review: "{comment}"
Respond with this exact JSON structure:
{{
    "main_theme": "theme here",
    "sub_theme": "sub-theme here",
    "rationale": "rationale here"
}}
"""
    # Call pre-trained model through Bedrock
    # (request body fields vary by model family; adjust for your model)
    response = bedrock_runtime.invoke_model(
        modelId='',  # model of choice goes here
        body=json.dumps({
            "prompt": prompt,
            "max_tokens": 1000,
            "temperature": 0.1
        })
    )
    return parse_response(response)
def evaluate_alignment_nova(comment, theme, subtheme, rationale):
    judge_prompt = f"""Rate theme alignment (1-3):
Comment: "{comment}"
Main Theme: {theme}
Sub-theme: {subtheme}
Rationale: {rationale}
"""
    # Complete code in attached notebook
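The attached notebook carries the full judge implementation. A hedged sketch of how the call might complete — the model ID, the Converse API usage, and the `extract_rating` helper are assumptions for illustration, not the notebook's code:

```python
import re

def extract_rating(text):
    """Pull the first 1-3 rating from a judge model's free-text reply."""
    match = re.search(r"[1-3]", text)
    if not match:
        raise ValueError(f"No rating found in: {text!r}")
    return int(match.group())

def judge_alignment(judge_prompt, model_id, client=None):
    """Ask one judge model to rate alignment and return the integer rating."""
    import boto3  # deferred so extract_rating is testable without AWS
    client = client or boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=model_id,  # e.g. an Amazon Nova Pro model ID
        messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
        inferenceConfig={"maxTokens": 100, "temperature": 0.0},
    )
    return extract_rating(resp["output"]["message"]["content"][0]["text"])
```

A temperature of 0.0 keeps the judges as deterministic as possible, which matters when you later compute agreement statistics across repeated runs.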
def calculate_agreement_metrics(ratings_df):
    return {
        'Percentage Agreement': calculate_percentage_agreement(ratings_df),
        'Cohens Kappa': calculate_pairwise_cohens_kappa(ratings_df),
        'Krippendorffs Alpha': calculate_krippendorffs_alpha(ratings_df),
        'Spearmans Rho': calculate_spearmans_rho(ratings_df)
    }
We used the following popular agreement metrics to compare alignment and therefore performance across and among models:
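Krippendorff's alpha and Spearman's rho are typically computed with standard statistics libraries; the first two metrics are simple enough to sketch in pure Python to make the comparison concrete (function names here are illustrative, not the notebook's):

```python
from collections import Counter
from itertools import combinations

def percentage_agreement(a, b):
    """Fraction of items two raters scored identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percentage_agreement(a, b)          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # expected by chance
    return (p_o - p_e) / (1 - p_e)

def pairwise_metrics(ratings):
    """ratings: dict mapping judge name -> list of 1-3 ratings."""
    out = {}
    for (j1, r1), (j2, r2) in combinations(ratings.items(), 2):
        out[(j1, j2)] = {
            "percentage_agreement": percentage_agreement(r1, r2),
            "cohens_kappa": cohens_kappa(r1, r2),
        }
    return out
```

Percentage agreement alone can look inflated when one rating dominates, which is why the chance-corrected kappa (and the other two metrics) are reported alongside it.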
Success! If you followed along, you have now deployed multiple LLMs to judge thematic analysis output from another LLM.
To help manage costs when running this solution, consider the following options:
For sensitive data, consider the following options:
In this post, we demonstrated how you can use Amazon Bedrock to seamlessly deploy multiple LLMs to generate and judge thematic summaries of qualitative data, such as customer feedback. We also showed how to compare human evaluator ratings of text-based summaries from survey response data against ratings from multiple LLMs such as Anthropic's Claude 3 Sonnet, Amazon Nova Pro, and Meta's Llama 3. In recently published research, Amazon scientists found inter-model agreement of up to 91%, compared with human-to-model agreement of up to 79%. Our findings suggest that although LLMs can provide reliable thematic evaluations at scale, human oversight remains important for identifying subtle contextual nuances that LLMs might miss.
The best part? Through Amazon Bedrock model hosting, you can compare the various models on the same preprocessed data, so you can choose the one that works best for your context and needs.
With organizations turning to generative AI for analyzing unstructured data, this post provides insight into the value of using multiple LLMs to validate LLM-generated analyses. The strong performance of LLM-as-a-judge models opens opportunities to scale text data analysis, and Amazon Bedrock can help organizations deploy and orchestrate multiple models within an LLM-as-a-judge framework.
Dr. Sreyoshi Bhaduri is a Senior Research Scientist at Amazon. Currently, she spearheads innovative research in applying generative AI at scale to solve complex supply chain logistics and operations challenges. Her expertise spans applied statistics and natural language processing, with a PhD from Virginia Tech and specialized training in responsible AI from MILA. Sreyoshi is committed to demystifying and democratizing generative AI solutions and bridging the gap between theoretical research and practical applications using AWS technologies.
Dr. Natalie Perez specializes in transformative approaches to customer insights and innovative solutions using generative AI. Previously at AWS, Natalie pioneered large-scale voice of employee research, driving product and programmatic improvements. Natalie is dedicated to revolutionizing how organizations scale, understand, and act on customer needs through the strategic integration of generative AI and human-in-the-loop strategies, driving innovation that puts customers at the heart of product, program, and service development.
John Kitaoka is a Solutions Architect at Amazon Web Services (AWS) and works with government entities, universities, nonprofits, and other public sector organizations to design and scale AI solutions. His work covers a broad range of machine learning (ML) use cases, with a primary interest in inference, responsible AI, and security. In his spare time, he loves woodworking and snowboarding.
Dr. Elizabeth (Liz) Conjar is a Principal Research Scientist at Amazon, where she pioneers at the intersection of HR research, organizational transformation, and AI/ML. Specializing in people analytics, she helps reimagine employees’ work experiences, drive high-velocity organizational change, and develop the next generation of Amazon leaders. Throughout her career, Elizabeth has established herself as a thought leader in translating complex people analytics into actionable strategies. Her work focuses on optimizing employee experiences and accelerating organizational success through data-driven insights and innovative technological solutions.