Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and go through multiple policy documents containing standard operating procedures (SOPs) relevant to the query. This back-and-forth communication process often takes from hours to days, primarily because analysts, especially the new hires, don’t have immediate access to the necessary information. They spend hours consulting SMEs and reviewing extensive policy documents.
To address this challenge, Amazon Finance Automation developed a large language model (LLM)-based question-answer chat assistant on Amazon Bedrock. This solution empowers analysts to rapidly retrieve answers to customer queries, generating prompt responses within the same communication thread. As a result, it drastically reduces the time required to address customer queries.
In this post, we share how Amazon Finance Automation built this generative AI Q&A chat assistant using Amazon Bedrock.
The solution is based on a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock, as shown in the following diagram. When a user submits a query, RAG works by first retrieving relevant documents from a knowledge base, then generating a response with the LLM from the retrieved documents.
The solution consists of the following key components:
The accuracy of the chat assistant is the most critical performance metric to Amazon Finance Operations. After we built the first version of the chat assistant, we measured the bot response accuracy by submitting questions to the chat assistant. The SMEs manually evaluated the RAG responses one by one, and found only 49% of the responses were correct. This was far below the expectation, and the solution needed improvement.
However, manually evaluating the RAG isn’t sustainable—it requires hours of effort from finance operations and engineering teams. Therefore, we adopted the following automated performance evaluation approach:
question – This consists of 100 questions from policy documents where answers reside in a variety of sources, such as policy documents and engineering SOPs, covering complex text formats such as embedded tables and images.expected_answer – These are manually labeled answers by Amazon Finance Operations SMEs.generated_answer – This is the answer generated by the bot.Based on the aforementioned evaluation techniques, we focused on the following areas in the RAG pipeline to improve the overall accuracy.
Upon diagnosing incorrect responses in the RAG pipeline, we identified 14% of the inaccuracy was due to incomplete contexts sent to the LLM. These incomplete contexts were originally generated by the segmentation algorithm based on a fixed chunk size (for example, 512 tokens or 384 words), which doesn’t consider document boundaries such as sections and paragraphs.
To address this problem, we designed a new document segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service, using the following steps:
The following diagram illustrates the document retriever splitting workflow.
When processing the document, we follow specific rules:
With this approach, we can improve RAG accuracy from 49% to 64%.
Prompt engineering is a crucial technique to improve the performance of LLMs. We learned from our project that there is no one-size-fits-all prompt engineering approach; it’s a best practice to design task-specific prompts. We adopted the following approach to enhance the effectiveness of the prompt-to-RAG generator:
We then used meta-prompting using an FM offered by Amazon Bedrock to craft a prompt that caters to the aforementioned requirements.
The following example is a prompt for generating a quick summary and a detailed answer:
You are an AI assistant that helps answer questions based on provided text context. I will give you some passages from a document, followed by a question. Your task is to provide the best possible answer to the question using only the information from the given context. Here is the context:
<context>
{}
</context>
And here is the question:
<question>
{}
</question>
Think carefully about how the context can be used to answer the question.
<thinkingprocess>
- Carefully read the provided context and analyze what information it contains
- Identify the key pieces of information in the context that are relevant to answering the question
- Determine if the context provides enough information to answer the question satisfactorily
- If not, simply state "I don't know, I don't have the complete context needed to answer this
question"
- If so, synthesize the relevant information into a concise summary answer
- Expand the summary into a more detailed answer, utilizing Markdown formatting to make it clear and
readable
</thinkingprocess>
If you don't have enough context to answer the question, provide your response in the following
format:
I don't know, I don't have the complete context needed to answer this question.
If you do have enough context to answer the question, provide your response in the following format:
#### Quick Summary:
Your concise 1-2 sentence summary goes here.
#### Detailed Answer:
Your expanded answer goes here, using Markdown formatting like **bold**, *italics*, and Bullet points to improve readability.
Remember, the ultimate goal is to provide an informative, clear and readable answer to the question
using only the context provided. Let's begin!
The following example is a prompt for generating citations based on the generated answers and retrieved contexts:
You are an AI assistant that specializes in attributing generated answers to specific sections within provided documents. Your task is to determine which sections from the given documents were most likely used to generate the provided answer. If you cannot find exact matches, suggest sections that are closely related to the content of the answer.
Here is the generated answer to analyze:
<generated_answer>
{}
</generated_answer>
And here are the sections from various documents to consider:
<sections>
{}
</sections>
Please carefully read through the generated answer and the provided sections. In the scratchpad space below, brainstorm and reason about which sections are most relevant to the answer:
<scratchpad>
</scratchpad>
After identifying the relevant sections, provide your output in the following format:
**Document Name:** <document name> n
**Document Link:** <document link> n
**Relevant Sections:** n
- <section name 1>
- <section name 2>
- <section name 3>
Do not include any additional explanations or reasoning in your final output. Simply list the document name, link, and relevant section names in the specified format above.
Assistant:
By implementing the prompt engineering approaches, we improved RAG accuracy from 64% to 76%.
After implementing the document segmentation approach, we still saw lower relevance scores for retrieved contexts (55–65%), and the incorrect contexts were in the top ranks for more than 50% of cases. This indicated that there was still room for improvement.
We experimented with multiple embedding models, including first-party and third-party models. For example, the contextual embedding models such as bge-base-en-v1.5 performed better for context retrieval, comparing to other top embedding models such as all-mpnet-base-v2. We found that using the Amazon Titan Embeddings G1 model increased the possibility of retrieved contexts from approximately 55–65% to 75–80%, and 80% of the retrieved contexts have higher ranks than before.
Finally, by adopting the Amazon Titan Text Embeddings G1 model, we improved the overall accuracy from 76% to 86%.
We achieved remarkable progress in developing a generative AI Q&A chat assistant for Amazon Finance Automation by using a RAG pipeline and LLMs on Amazon Bedrock. Through continual evaluation and iterative improvement, we have addressed challenges of hallucinations, document ingestion issues, and context retrieval inaccuracies. Our results have shown a significant improvement in RAG accuracy from 49% to 86%.
You can follow our journey and adopt a similar solution to address challenges in your RAG application and improve overall performance.
Soheb Moin is a Software Development Engineer at Amazon, who led the development of the Generative AI chatbot. He specializes in leveraging generative AI and Big Data analytics to design, develop, and implement secure, scalable, innovative solutions that empowers Finance Operations with better productivity, automation. Outside of work, Soheb enjoys traveling, playing badminton, and engaging in chess tournaments.
Nitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 19 years of experience building business critical, scalable, high-performance software. Nitin leads data services, communication, work management and several Generative AI initiatives within Finance. In his spare time, he enjoys listening to music and read.
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, for providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling and learning strategic challenges of chess.
Mohak Chugh is a Software Development Engineer at Amazon, with over 3 years of experience in developing products leveraging Generative AI and Big Data on AWS. His work encompasses a range of areas, including RAG based GenAI chatbots and high performance data reconciliation. Beyond work, he finds joy in playing the piano and performing with his music band.
Parth Bavishi is a Senior Product Manager at Amazon with over 10 years of experience in building impactful products. He currently leads the development of generative AI capabilities for Amazon’s Finance Automation, driving innovation and efficiency within the organization. A dedicated mentor, Parth enjoys sharing his product management knowledge and finds satisfaction in activities like volleyball and reading.
Manuel Rioux est fièrement propulsé par WordPress