PDI Technologies is a global leader in the convenience retail and petroleum wholesale industries. They help businesses around the globe increase efficiency and profitability by securely connecting their data and operations. With 40 years of experience, PDI Technologies assists customers in all aspects of their business, from understanding consumer behavior to simplifying technology ecosystems across the supply chain.
Enterprises face a significant challenge in making their knowledge bases accessible, searchable, and usable by AI systems. Internal teams at PDI Technologies were struggling with information scattered across disparate systems, including websites, Confluence pages, SharePoint sites, and various other data sources. To address this, PDI Technologies built PDI Intelligence Query (PDIQ), an AI assistant that gives employees access to company knowledge through an easy-to-use chat interface. The solution is powered by a custom Retrieval Augmented Generation (RAG) system built on Amazon Web Services (AWS) using serverless technologies. Building PDIQ required addressing the following key challenges:
In this post, we walk through the PDIQ process flow and architecture, focusing on the implementation details and the business outcomes it has helped PDI achieve.
In this section, we explore PDIQ’s comprehensive end-to-end design. We examine the data ingestion pipeline from initial processing through storage to user search capabilities, as well as the zero-trust security framework that protects key user personas throughout their platform interactions. The architecture consists of these elements:
The following diagram illustrates the solution architecture.

Next, we review how PDIQ implements a zero-trust security model with role-based access control for two key personas:
In this section, we review the end-to-end process flow. We break it down into sections to dive deeper into each step and explain the functionality.

Crawlers are configured by administrators to collect data from the variety of sources that PDI relies on. Crawlers hydrate the data into the knowledge base so that this information can be retrieved by end users. PDIQ currently supports the following crawler configurations:

By building separate crawler configurations, PDIQ makes the platform easy to extend with additional crawlers on demand. It also gives administrator users the flexibility to configure the settings for their respective crawlers (such as frequency, depth, or rate limits).
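The per-crawler settings mentioned above (frequency, depth, rate limits) can be sketched as a small configuration object. This is a minimal illustration; the class and field names are assumptions, not PDIQ's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    """Hypothetical per-source crawler settings an administrator can tune."""
    source_type: str             # e.g., "confluence", "sharepoint", "web"
    schedule_cron: str           # crawl frequency as a cron expression
    max_depth: int = 3           # how many links deep to follow
    rate_limit_rps: float = 1.0  # polite requests-per-second cap

    def validate(self) -> None:
        # Guard against settings that would stall or overload a source
        if self.max_depth < 1:
            raise ValueError("max_depth must be at least 1")
        if self.rate_limit_rps <= 0:
            raise ValueError("rate_limit_rps must be positive")

# Example: a nightly Confluence crawl at 2 AM, five levels deep
confluence = CrawlerConfig("confluence", "0 2 * * *", max_depth=5, rate_limit_rps=2.0)
confluence.validate()
```

Keeping each source's settings in its own configuration object is what makes adding a new crawler type a matter of registering a new config rather than changing shared code.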
The following figure shows the PDIQ UI to configure the knowledge base.

The following figure shows the PDI UI to configure your crawler (such as Confluence).

The following figure shows the PDIQ UI to schedule crawlers.

Crawled data is stored in Amazon S3 with appropriate metadata tags. If the source is in HTML format, the task converts the content into markdown (.md) files. For these markdown files, an additional optimization step replaces the images in the document with their Amazon S3 reference locations. Key benefits of this approach include:
The following is a sample markdown file where images are replaced with the S3 file location:

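The image-replacement step described above can be sketched with a small rewrite function. This is an illustration only; the function name, the regex shape, and the assumption that images keep their original file names under the S3 prefix are all hypothetical:

```python
import re

def rewrite_images_to_s3(markdown: str, s3_prefix: str) -> str:
    """Replace local image references in a markdown file with their
    Amazon S3 locations (assumes the image was already uploaded under
    s3_prefix keeping its original file name)."""
    def repl(match: re.Match) -> str:
        alt, filename = match.group(1), match.group(2)
        return f"![{alt}]({s3_prefix}/{filename})"

    # Match markdown images whose target is a bare local path,
    # leaving images that already point at a URL untouched
    return re.sub(r"!\[([^\]]*)\]\((?!https?://)([^)]+)\)", repl, markdown)

doc = "See ![diagram](arch.png) for details."
print(rewrite_images_to_s3(doc, "https://amzn-s3-demo-bucket.s3.amazonaws.com/kb"))
# -> See ![diagram](https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/arch.png) for details.
```

Rewriting references at conversion time means the stored markdown is self-describing: any downstream consumer can resolve the image without knowing where the crawl originally ran.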
This is the most critical step of the process. Its key objective is to generate vector embeddings that can be used for similarity matching and effective retrieval based on user inquiries. The process follows several steps, starting with image captioning, then document chunking, summary generation, and embedding generation.

To caption the images, PDIQ scans the markdown files to locate image tags. For each image, PDIQ generates a caption that explains the content of the image. This caption is injected back into the markdown file, next to the image tag, thereby enriching the document content.

This approach improves contextual searchability. By embedding insights extracted from images directly into the original markdown files, PDIQ makes image content part of the searchable text, enabling richer and more accurate context retrieval during search and analysis.

The approach also saves costs. To avoid unnecessary LLM inference calls for the exact same image, PDIQ stores image metadata (file location and generated caption) in Amazon DynamoDB. This enables efficient reuse of previously generated captions, eliminating repeated caption generation calls to the LLM.
The following is an example of an image caption prompt:
You are a professional image captioning assistant. Your task is to provide clear, factual, and objective descriptions of images. Focus on describing visible elements, objects, and scenes in a neutral and appropriate manner.
The following is a snippet of markdown file that contains the image tag, LLM-generated caption, and the corresponding S3 file location:
![image-20230818-114454: The image displays a security tip notification on a computer screen. The notification is titled "Security tip" and advises the user to use generated passwords to keep their accounts safe. The suggested password, "2m5oFX#g&tLRMhN3," is shown in a green box. Below the suggested password, there is a section labeled "Very Strong," indicating the strength of the password. The password length is set to 16 characters, and it includes lowercase letters, uppercase letters, numbers, and symbols. There is also a "Dismiss" button to close the notification. Below the password section, there is a link to "See password history." The bottom of the image shows navigation icons for "Vault," "Generator," "Alerts," and "Account." The "Generator" icon is highlighted in red.]
(https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/ABC/file/attachments/12133171243_image-20230818-114454.png)
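The caption-reuse pattern described above can be sketched as a content-addressed cache. In this illustration an in-memory dictionary stands in for the DynamoDB table, and `generate_caption` stands in for the LLM invocation; both are assumptions for the sketch:

```python
import hashlib

# In-memory stand-in for the DynamoDB caption table (illustration only)
caption_cache: dict[str, str] = {}

def caption_for(image_bytes: bytes, generate_caption) -> str:
    """Return a caption for an image, calling the captioning model only
    for images not seen before. Keying the cache by a content hash lets
    byte-identical images reuse a single generated caption."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in caption_cache:
        caption_cache[key] = generate_caption(image_bytes)
    return caption_cache[key]

calls = []
fake_llm = lambda img: calls.append(img) or f"caption-{len(calls)}"
first = caption_for(b"png-bytes", fake_llm)
second = caption_for(b"png-bytes", fake_llm)  # cache hit; no new model call
```

Because the same screenshot often appears on many pages of a wiki, deduplicating by content rather than by file location is what turns this from a convenience into a meaningful cost saving.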
Now that the markdown files are enriched with image captions, the next step is to break the original document into chunks that fit within the context window of the embeddings model. PDIQ uses the Amazon Titan Text Embeddings V2 model to generate vectors and stores them in Aurora PostgreSQL-Compatible Serverless. Based on internal accuracy testing and chunking best practices from AWS, PDIQ performs chunking as follows:
Using the document chunking logic from the previous step, the document is converted into vector embeddings. The process includes:
By designing a customized approach that places a summary section atop every chunk, PDIQ ensures that when a particular chunk is matched by similarity search, the LLM has access to a summary of the entire document, not only the chunk that matched. This approach enriches the end user experience, increasing the accuracy approval rate from 60% to 79%.
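The summary-atop-chunk pattern can be sketched in a few lines. This is a simplified illustration: a word count stands in for the embedding model's token limit, and the function name and separator are assumptions:

```python
def build_chunks(document: str, summary: str, max_words: int = 120) -> list[str]:
    """Split a document into word-bounded chunks and prepend the
    document-level summary to each one, so every embedded chunk
    carries whole-document context."""
    words = document.split()
    chunks = []
    for start in range(0, len(words), max_words):
        body = " ".join(words[start:start + max_words])
        chunks.append(f"{summary}\n---\n{body}")
    return chunks

chunks = build_chunks("word " * 300, "Summary: PLC user creation steps.", max_words=120)
# 300 words at 120 words per chunk -> 3 chunks, each led by the summary
```

The trade-off is a modest storage and token overhead per chunk in exchange for retrieval hits that always arrive with enough surrounding context to answer the question.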
The following is an example of a summarization prompt:
You are a specialized document summarization assistant with expertise in business and technical content.
Your task is to create concise, information-rich summaries that:
Preserve all quantifiable data (numbers, percentages, metrics, dates, financial figures)
Highlight key business terminology and domain-specific concepts
Extract important entities (people, organizations, products, locations)
Identify critical relationships between concepts
Maintain factual accuracy without adding interpretations
Focus on extracting information that would be most valuable for:
Answering specific business questions
Supporting data-driven decision making
Enabling precise information retrieval in a RAG system
The summary should be comprehensive yet concise, prioritizing specific facts over general descriptions.
Include any tables, lists, or structured data in a format that preserves their relationships.
Ensure all technical terms, acronyms, and specialized vocabulary are preserved exactly as written.
The following is an example of summary text, available on each chunk:
### Summary: PLC User Creation Process and Password Reset
**Document Overview:**
This document provides instructions for creating new users and resetting passwords
**Key Instructions:**
{Shortened for Blog illustration}
This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.
---
Chunk 1 has a summary at the top followed by details from the source:
{Summary Text from above}
This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.
title: 2. PLC User Creation Process and Password Reset

Chunk 2 has a summary at the top, followed by continuation of details from the source:
{Summary Text from above}
This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.
---
Maintains a menu with options such as

PDIQ scans each document chunk and generates vector embeddings. This data is stored in an Aurora PostgreSQL database with key attributes, including a unique knowledge base ID, the corresponding embeddings attribute, the original text (summary plus chunk plus image caption), and a JSON binary object with metadata fields for extensibility. To keep the knowledge base in sync, PDI implements the following steps:
When a source object is removed from Amazon S3 (an s3:ObjectRemoved:* event), a cleanup job deletes the records corresponding to that key value in the Aurora table.

PDI uses Amazon Nova Pro to retrieve the most relevant documents and generate a response by following these key steps:
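The storage attributes described above suggest a table shape like the following. This is a sketch under stated assumptions: the table and column names are illustrative, not PDIQ's actual schema, and the helper simply formats an embedding as a pgvector literal:

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format an embedding as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{v:g}" for v in embedding) + "]"

# Parameterized INSERT matching the attributes described above;
# kb_chunks and its columns are assumptions for illustration.
INSERT_SQL = """
INSERT INTO kb_chunks (kb_id, embedding, original_text, metadata)
VALUES (%s, %s::vector, %s, %s::jsonb)
"""

row = ("kb-123",
       to_pgvector_literal([0.12, -0.7, 0.05]),
       "summary + chunk + image caption",
       '{"source": "confluence"}')
```

Keeping extensible metadata in a `jsonb` column lets new crawler types add fields without schema migrations, while the `vector` column stays fixed at the embedding model's dimension.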
The following is a snippet of system prompt:
Support assistant specializing in PDI's Logistics (PLC) platform, helping staff research and resolve support cases in Salesforce. You will assist with finding solutions, summarizing case information, and recommending appropriate next steps for resolution.
Professional, clear, technical when needed while maintaining accessible language.
Resolution Process:
Response Format template:
Handle Confidential Information:
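The retrieval step that feeds context to the model can be sketched as a similarity ranking over the stored embeddings. In production this ranking runs inside PostgreSQL via pgvector distance operators; the in-Python version below is an illustration only, with hypothetical data:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 3):
    """Return the k chunks whose embeddings are most similar to the query.
    rows holds (chunk_text, embedding) pairs as stored in Aurora."""
    return sorted(rows, key=lambda r: cosine(query_vec, r[1]), reverse=True)[:k]

# Toy two-dimensional embeddings for illustration
rows = [("password reset steps", [0.9, 0.1]),
        ("menu options", [0.1, 0.9])]
best = top_k([1.0, 0.0], rows, k=1)
```

The top-ranked chunks, each already carrying its document summary and image captions, are what get placed into the system prompt above before the model generates its answer.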
By building this customized RAG solution on AWS, PDI realized the following benefits:
PDIQ provides key business outcomes that include:
PDI continues to enhance the application with several planned improvements in the pipeline, including:
The PDIQ service has transformed how users access and use enterprise knowledge at PDI Technologies. By using AWS serverless services, PDIQ automatically scales with demand, reduces operational overhead, and optimizes costs. The solution's unique approach to document processing, including dynamic token management and the custom image captioning system, represents a significant technical innovation in enterprise RAG systems. The architecture successfully balances performance, cost, and scalability while maintaining security and authentication requirements. As PDI Technologies continues to expand PDIQ's capabilities, they're excited to see how this architecture can adapt to new sources, formats, and use cases.
Samit Kumbhani is an Amazon Web Services (AWS) Senior Solutions Architect in the New York City area with over 18 years of experience. He currently partners with independent software vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.
Jhorlin De Armas is an Architect II at PDI Technologies, where he leads the design of AI-driven platforms on Amazon Web Services (AWS). Since joining PDI in 2024, he has architected a compositional AI service that enables configurable assistants, agents, knowledge bases, and guardrails using Amazon Bedrock, Aurora Serverless, AWS Lambda, and DynamoDB. With over 18 years of experience building enterprise software, Jhorlin specializes in cloud-centered architectures, serverless platforms, and AI/ML solutions.
David Mbonu is a Sr. Solutions Architect at Amazon Web Services (AWS), helping horizontal business application ISV customers build and deploy transformational solutions on AWS. David has over 27 years of experience in enterprise solutions architecture and system engineering across software, FinTech, and public cloud companies. His recent interests include AI/ML, data strategy, observability, resiliency, and security. David and his family reside in Sugar Hill, GA.