This post is co-written with David Gildea and Tom Nijs from Druva.
Druva enables cyber, data, and operational resilience for thousands of enterprises, and is trusted by 60 of the Fortune 500. Customers use Druva Data Resiliency Cloud to simplify data protection, streamline data governance, and gain data visibility and insights. Independent software vendors (ISVs) like Druva are integrating AI assistants into their user applications to make software more accessible.
Dru, the Druva backup AI copilot, enables real-time interaction and personalized responses, with users engaging in a natural conversation with the software. From finding inconsistencies and errors across the environment to scheduling backup jobs and setting retention policies, users need only ask and Dru responds. Dru can also recommend actions to improve the environment, remedy backup failures, and identify opportunities to enhance security.
In this post, we show how Druva approached natural language querying (NLQ)—asking questions in English and getting tabular data as answers—using Amazon Bedrock, the challenges they faced, sample prompts, and key learnings.
The following screenshot illustrates the Dru conversation interface.
In a single conversation interface, Dru provides the following:
In this section, we discuss the challenges and key learnings of Druva’s journey.
Originally, we adopted an AI agent approach and relied on the foundation model (FM) to make plans and invoke tools using the reasoning and acting (ReAct) method to answer user questions. However, we found the objective too broad and complicated for the AI agent. The AI agent would take more than 60 seconds to plan and respond to a user question. Sometimes it would even get stuck in a thought-loop, and the overall success rate wasn’t satisfactory.
We decided to move to the prompt chaining approach using a directed acyclic graph (DAG). This approach allowed us to break the problem down into multiple steps:
Each step became an independent stream, so our engineers could iteratively develop and evaluate the performance and speed until they worked well in isolation. The workflow also became more controllable by defining proper error paths.
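The idea of chaining independent steps with explicit error paths can be sketched as follows. This is a minimal illustration, not Druva's implementation: it uses a linear chain (the simplest case of a DAG), and `run_chain`, `pick_api`, `call_api`, `transform`, and the `/backup/failures` route are all hypothetical names standing in for the real streams.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step reads the shared state and returns the new keys it produced.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_chain(steps: List[Tuple[str, Step]], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order; on failure, record the failing step and stop."""
    for name, step in steps:
        try:
            state = {**state, **step(state)}
        except Exception as exc:
            # Error path: return what we have plus where and why it failed.
            return {**state, "failed_step": name, "error": str(exc)}
    return state


# Hypothetical stand-ins for the three streams; a real step would call an FM or an API.
def pick_api(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"route": "/backup/failures"}


def call_api(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"rows": [{"server": "a", "status": "failed"}]}


def transform(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"answer": sorted({row["server"] for row in state["rows"]})}


result = run_chain(
    [("pick_api", pick_api), ("call_api", call_api), ("transform", transform)],
    {"question": "Show me my backup failures, grouped by server."},
)
```

Because each step only depends on the shared state, it can be developed and evaluated on its own before being wired into the chain.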
Out of the hundreds of APIs that power Druva products, we needed to match the exact API the application needs to call to answer the user question. For example, “Show me my backup failures for the past 72 hours, grouped by server.” Similar names and synonyms in API routes make this retrieval problem more complex.
Originally, we formulated this task as a retrieval problem. We tried different methods, including k-nearest neighbor (k-NN) search of vector embeddings, BM25 with synonyms, and a hybrid of both across fields including API routes, descriptions, and hypothetical questions. We found that the simplest and most accurate way was to formulate it as a classification task to the FM. We curated a small list of examples in question-API route pairs, which helped improve the accuracy and make the output format more consistent.
Next, we generate the API call with the correct parameters and invoke it. FM hallucination of parameters, particularly those that take a free-form JSON object, is one of the major challenges in the whole workflow. For example, the unsupported key server can appear in the generated parameter:
"filter": {
    "and": [
        {
            "gte": {
                "key": "dt",
                "value": 1704067200
            }
        },
        {
            "eq": {
                "key": "server",
                "value": "xyz"
            }
        }
    ]
}
We tried different prompting techniques, such as few-shot prompting and chain of thought (CoT), but the success rate was still unsatisfactory. To make API call generation and invocation more robust, we separated this task into two steps:
This method provided a successful API invocation, at the expense of getting more data than required for downstream processing.
Next, we took the response from the API call and transformed it to answer the user question. For example, “Create a pandas dataframe and group it by server column.” Similar to stream 2, FM hallucination is again an obstacle. Generated code can contain errors, such as confusing PySpark functions with pandas functions.
After trying many different prompting techniques without success, we looked at the reflection pattern, asking the FM to self-correct code in a loop. This improved the success rate at the expense of more FM invocations, which were slower and more expensive. We found that although smaller models are faster and more cost-effective, at times they had inconsistent results. Anthropic’s Claude 2.1 on Amazon Bedrock gave more accurate results on the second try.
Druva selected Amazon Bedrock for several compelling reasons, with security and latency being the most important. A key factor in this decision was the seamless integration with Druva’s services. Using Amazon Bedrock aligned naturally with Druva’s existing environment on AWS, maintaining a secure and efficient extension of their capabilities.
Additionally, one of our primary challenges in developing Dru involved selecting the optimal FMs for specific tasks. Amazon Bedrock addresses this challenge with its extensive array of available FMs, each offering unique capabilities. This variety enabled Druva to conduct rapid and comprehensive testing of various FMs and their parameters, facilitating the selection of the most suitable one. The process was streamlined because Druva didn’t need to delve into the complexities of running or managing these diverse FMs, thanks to the infrastructure provided by Amazon Bedrock.
Through the experiments, we found that different models performed better in specific tasks. For example, Meta Llama 2 performed better at code generation tasks; Anthropic Claude Instant was good at efficient and cost-effective conversation; whereas Anthropic Claude 2.1 was good at getting the desired responses in retry flows.
These were the latest models from Anthropic and Meta at the time of this writing.
The following diagram shows how the three streams work together as a single workflow to answer user questions with tabular data.
The following are the steps of the workflow:
To maintain user and system security, we make sure in our design that:
For more details, refer to Druva’s Secret Sauce: Meet the Technology Behind Dru’s GenAI Magic.
In this section, we discuss Steps 2a–2e in the solution workflow.
This step uses an FM to perform classification. It takes the user question and a full list of available API routes with meaningful names and descriptions as the input, and responds with the API route that best matches the question. The following is a sample prompt:
Please read the following API routes carefully as I’ll ask you a question about them:
<api_routes>{api_routes}</api_routes>
Which API route can best answer “{question}”?
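As a rough sketch of how this prompt might be assembled and its answer checked, the following code fills the template and then accepts the FM output only if it names a known route. The helper names `build_route_prompt` and `parse_route`, the `route: description` listing format, and the sample routes are assumptions made for illustration; the FM call itself is left out.

```python
from typing import Dict, Optional

PROMPT_TEMPLATE = (
    "Please read the following API routes carefully as I’ll ask you a question about them:\n"
    "<api_routes>{api_routes}</api_routes>\n"
    "Which API route can best answer “{question}”?"
)


def build_route_prompt(routes: Dict[str, str], question: str) -> str:
    """Render the classification prompt from a route -> description mapping."""
    listing = "\n".join(f"{route}: {desc}" for route, desc in routes.items())
    return PROMPT_TEMPLATE.format(api_routes=listing, question=question)


def parse_route(fm_output: str, routes: Dict[str, str]) -> Optional[str]:
    """Return the known route mentioned in the FM output, or None if there is none."""
    matches = [route for route in routes if route in fm_output]
    # Prefer the longest match so that /backups does not shadow /backups/failures.
    return max(matches, key=len) if matches else None


# Hypothetical routes, for illustration only.
routes = {
    "/backups": "List backup jobs",
    "/backups/failures": "List failed backup jobs",
}
prompt = build_route_prompt(routes, "Show me my backup failures for the past 72 hours")
chosen = parse_route("The best route is /backups/failures.", routes)
```

Treating any answer outside the known route list as "no match" gives the workflow a clean error path when the FM responds with a route that does not exist.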
This step uses an FM to generate API parameters. It first looks up the corresponding swagger for the API route (from Step 2a). Next, it passes the swagger and the user question to an FM, which responds with key-value pairs for the request parameters of that API route that can retrieve the relevant data. The following is a sample prompt:
Please read the following swagger carefully as I’ll ask you a question about it:
<swagger>{swagger}</swagger>
Produce a key-value JSON dict of the available request parameters based on “{question}” with reference to the swagger.
In the previous step, even with an attempt to ground responses with swagger, the FM can still hallucinate wrong or nonexistent API parameters. This step uses a programmatic way to verify, format, and invoke the API call to get data. The following is the pseudocode:
for each input parameter (key/value)
    if parameter key not in swagger then
        drop parameter
    else if parameter value data type does not match swagger then
        drop parameter
    else
        URL encode parameter
    end if
end for
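The pseudocode above could look roughly like this in Python. It is a sketch under simplifying assumptions: the swagger is reduced to a flat mapping of parameter name to JSON type name (`allowed`), and `validate_params`, `build_query`, and the parameter names `dt_gte` and `status` are hypothetical. A real implementation would read the parameter definitions from the swagger document itself.

```python
from typing import Any, Dict
from urllib.parse import urlencode

# JSON type names mapped to the Python types that satisfy them.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate_params(params: Dict[str, Any], allowed: Dict[str, str]) -> Dict[str, Any]:
    """Keep only parameters that are declared and whose value has the declared type."""
    clean = {}
    for key, value in params.items():
        expected = TYPE_MAP.get(allowed.get(key))
        if expected is None:
            continue  # key not declared (or declared with a type we do not handle)
        if isinstance(value, bool) and expected is not bool:
            continue  # bool is a subclass of int in Python; do not accept it as a number
        if not isinstance(value, expected):
            continue  # value type does not match the declaration
        clean[key] = value
    return clean


def build_query(params: Dict[str, Any], allowed: Dict[str, str]) -> str:
    """URL-encode the parameters that survive validation."""
    return urlencode(validate_params(params, allowed))


# Hypothetical declaration and FM output; "server" is hallucinated, "status" has the wrong type.
allowed = {"dt_gte": "integer", "status": "string"}
generated = {"dt_gte": 1704067200, "server": "xyz", "status": 5}
query = build_query(generated, allowed)
```

Dropping a bad parameter instead of failing the request is what makes the call succeed at the cost of returning more data than needed, which the later transformation step then narrows down.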
This step uses an FM to generate Python code. It first samples a few records of input data to reduce input tokens. Then it passes the sample data and the user question to an FM and responds with a Python script that transforms data to answer the question. The following is a sample prompt:
Please read the following sample data carefully as I’ll ask you a question about them:
<sample_data>{5_rows_of_data_in_json}</sample_data>
Write a Python script using pandas to transform the data to answer the question “{question}”.
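A small sketch of the sampling and prompt assembly might look like the following. The function name `build_transform_prompt` and the sample records are hypothetical, and the placeholder is renamed to `sample_data` here only so that it works as a Python format field.

```python
import json
from typing import Any, Dict, List

TRANSFORM_TEMPLATE = (
    "Please read the following sample data carefully as I’ll ask you a question about them:\n"
    "<sample_data>{sample_data}</sample_data>\n"
    "Write a Python script using pandas to transform the data to answer the question “{question}”."
)


def build_transform_prompt(
    records: List[Dict[str, Any]], question: str, sample_size: int = 5
) -> str:
    """Send only the first few records so the prompt stays small."""
    sample = json.dumps(records[:sample_size])
    return TRANSFORM_TEMPLATE.format(sample_data=sample, question=question)


# Hypothetical API response with 20 records.
records = [{"server": f"srv-{i}", "status": "failed"} for i in range(20)]
transform_prompt = build_transform_prompt(records, "Group the failures by server")
```

The FM only needs to see the shape of the data (column names and value types) to write the transformation, so a handful of rows is usually enough context for that purpose.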
This step involves a Python script, which imports the generated Python package, runs the transformation, and returns the tabular data as the final response. If an error occurs, it invokes the FM to try to correct the code. If every attempt fails, it returns the input data. The following is the pseudocode:
for maximum number of retries
    run data transformation function
    if success then
        exit loop
    else
        invoke foundation model to correct code
    end if
end for
if success then
    return transformed data
else
    return input data
end if
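The retry loop can be sketched in Python as follows. Several things here are assumptions for illustration: the generated code is expected to define a function named `transform`, the `fix_code` callable stands in for the FM invocation, and the example uses plain Python records instead of pandas to stay dependency-free. Running generated code with `exec` is shown only to keep the sketch short; generated code should run in an isolated, restricted environment.

```python
from typing import Any, Callable, Dict, List

Records = List[Dict[str, Any]]


def run_with_reflection(
    code: str,
    data: Records,
    fix_code: Callable[[str, str], str],
    max_retries: int = 2,
) -> Any:
    """Run generated code; on error, ask the fixer for corrected code and retry.

    Falls back to returning the input data when every attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            namespace: Dict[str, Any] = {}
            exec(code, namespace)  # illustration only; sandbox this in practice
            return namespace["transform"](data)
        except Exception as exc:
            if attempt == max_retries:
                break
            # Hypothetical FM call: pass the code and the error, get new code back.
            code = fix_code(code, str(exc))
    return data


# Stub fixer that stands in for the FM and always returns a working version.
def stub_fixer(code: str, error: str) -> str:
    return "def transform(data):\n    return [row for row in data if row['status'] == 'failed']"


broken = "def transform(data):\n    return undefined_name"
rows = [{"server": "a", "status": "failed"}, {"server": "b", "status": "ok"}]
fixed_result = run_with_reflection(broken, rows, stub_fixer)
```

Returning the untransformed data as a last resort means the user still gets a relevant, if less tailored, answer when the code cannot be repaired.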
Using Amazon Bedrock for the solution foundation led to remarkable achievements in accuracy, as evidenced by the following metrics in our evaluations using an internal dataset:
These results demonstrate the robustness and efficiency of the Amazon Bedrock based solution. With this level of accuracy, Druva can confidently extend the solution to a wider range of APIs across Druva products. This expansion will scale up usage and substantially enrich the experience of Druva customers. By integrating more APIs, Druva will offer a more seamless, responsive, and contextual interaction with Druva products, further enhancing the value delivered to Druva users.
To learn more about Druva’s AI solutions, visit the Dru solution page, where you can see some of these capabilities in action through recorded demos. Visit the AWS Machine Learning blog to see how other customers are using Amazon Bedrock to solve their business problems.
David Gildea is the VP of Product for Generative AI at Druva. With over 20 years of experience in cloud automation and emerging technologies, David has led transformative projects in data management and cloud infrastructure. As the founder and former CEO of CloudRanger, he pioneered innovative solutions to optimize cloud operations, later leading to its acquisition by Druva. Currently, David leads the Labs team in the Office of the CTO, spearheading R&D into generative AI initiatives across the organization, including projects like Dru Copilot, Dru Investigate, and Amazon Q. His expertise spans technical research, commercial planning, and product development, making him a prominent figure in the field of cloud technology and generative AI.
Tom Nijs is an experienced backend and AI engineer at Druva, passionate about both learning and sharing knowledge. With a focus on optimizing systems and using AI, he’s dedicated to helping teams and developers bring innovative solutions to life.
Corvus Lee is a Senior GenAI Labs Solutions Architect at AWS. He is passionate about designing and developing prototypes that use generative AI to solve customer problems. He also keeps up with the latest developments in generative AI and retrieval techniques by applying them to real-world scenarios.
Fahad Ahmed is a Senior Solutions Architect at AWS and assists financial services customers. He has over 17 years of experience building and designing software applications. He recently found a new passion for making AI services accessible to the masses.