Organizations are adopting large language models (LLMs), such as DeepSeek R1, to transform business processes, enhance customer experiences, and drive innovation at unprecedented speed. However, standalone LLMs have key limitations such as hallucinations, outdated knowledge, and no access to proprietary data. Retrieval Augmented Generation (RAG) addresses these gaps by combining semantic search with generative AI, enabling models to retrieve relevant information from enterprise knowledge bases before responding. This approach grounds outputs in accurate, up-to-date context, making applications more reliable, transparent, and capable of using domain expertise without retraining. Yet, as the usage of RAG solutions grows, so do the operational and technical hurdles of scaling these solutions in production. Such hurdles include the costs and infrastructure complexities of the vector databases that enterprises need to store, search, and manage high-dimensional embeddings at scale.
Although RAG unlocks remarkable capabilities, organizations building production-grade applications often encounter major obstacles with existing vector databases, which we discuss later in this post.
With the launch of Amazon Simple Storage Service (Amazon S3) Vectors, the first cloud object storage service with native support for storing and querying vectors, we have a new approach to cost-effectively manage vector data at scale. In this post, we explore how combining S3 Vectors and Amazon SageMaker AI redefines the developer experience for RAG, making it easier than ever to experiment, build, and scale AI-powered applications without the traditional trade-offs.
Enterprise-scale RAG applications involve high data volumes (often multimillion-document knowledge bases, including unstructured data), high query throughput, mission-critical reliability, complex integration, and continuous evaluation and improvement. To realize these enterprise-scale RAG applications, you need more than powerful LLM deployment. These applications also demand rigorous experimentation, governance, and performance tracking. Amazon SageMaker AI, with its native integration with managed MLflow, offers a unified system for deploying, monitoring, and optimizing LLMs at scale.
Amazon SageMaker JumpStart accelerates the deployment of embedding and text generation models, helping teams rapidly prototype RAG solutions and deliver value quickly.
Developing effective RAG systems requires evaluating prompts, chunking strategies, retrieval methods, and model settings. SageMaker managed MLflow supports this in several ways. You can track experiments by logging and comparing chunking, retrieval, and generation configurations. You can use built-in LLM-based generative AI metrics, such as correctness and relevance, to assess output quality. To monitor performance, you can track latency and throughput alongside output quality to verify production readiness. You can improve the reproducibility of experiments by capturing parameters for reliable auditing and replication. And with the governance dashboards, you can visualize results, manage model versions, and control approvals through an intuitive interface.
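As a minimal sketch of how these capabilities are used, assuming you have already created a SageMaker managed MLflow tracking server and installed the sagemaker-mlflow plugin (the tracking server ARN and the parameter values below are placeholders):
import mlflow

# Point MLflow at the SageMaker managed MLflow tracking server (placeholder ARN)
mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")
mlflow.set_experiment("s3-vectors-rag")

# Log one RAG configuration as a run; evaluation metrics are added later with mlflow.evaluate
with mlflow.start_run(run_name="recursive-chunking-baseline"):
    mlflow.log_param("chunking_strategy", "recursive")
    mlflow.log_param("retrieval_top_k", 5)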
As enterprises scale their RAG implementations, they face significant challenges in managing their vector stores. Traditional vector databases have unpredictable and rising costs because they incur charges based on compute, storage, and API usage, which scale with data volume. Teams spend significant time on the operational complexity that results from having to manage separate vector database infrastructure instead of focusing on application development. Capacity planning becomes increasingly complex as vector collections grow and diversify. Connecting vector stores with existing data infrastructure and security frameworks adds further complexity.
Amazon S3 Vectors delivers purpose-built vector storage so you can harness the semantic power of your organization’s unstructured data at scale. Designed for the cost-optimized and durable storage of large vector datasets with sub-second query performance, S3 Vectors is ideal for infrequent query workloads and can help you reduce the overall cost of uploading, storing, and querying vectors by up to 90% compared to alternative solutions. With S3 Vectors, you only pay for what you use without the need for infrastructure provisioning and management. Whether you’re developing semantic search engines, RAG systems, or recommendation services, you can focus on innovation rather than cost constraints and data management complexities.
Amazon S3 Vectors brings the proven economics and simplicity of Amazon S3 to vectors. S3 Vectors is ideal for RAG applications where slightly higher latency than millisecond-level vector databases is acceptable, in exchange for substantial cost savings and simplified operations. You also have elasticity because you can scale vector search applications seamlessly from gigabytes to petabytes, and scale down to zero when these resources are not in use. Moreover, data management becomes more flexible. You can store vectors alongside metadata, minimizing the need for separate databases while enhancing retrieval performance through consolidated data access. S3 Vectors supports up to 40 KB of (filterable and nonfilterable) metadata per vector with schema-less filtering capabilities, using separate vector indexes for streamlined organization.
In this post, we show how to build a cost-effective, enterprise-scale RAG application using Amazon S3 Vectors, SageMaker AI for scalable inference, and SageMaker managed MLflow for experiment tracking and evaluation, making sure the responses meet enterprise standards. We demonstrate this by building a RAG system that answers questions about Amazon financials using annual reports, shareholder letters, and 10-K filings as the knowledge base. Our solution consists of the following components:
The following diagram is the solution architecture.

You can follow along and run the full example code from the accompanying GitHub repository. We use code snippets from that repository throughout the rest of this post to illustrate the RAG solution built with S3 Vectors and the experiment tracking approach.
To implement this solution, you need the following prerequisites:
The following steps walk you through this solution:
We use SageMaker JumpStart to deploy state-of-the-art models in minutes using a few lines of code. A RAG application requires two models: an embedding model to convert text and queries into vectors, and a text generation model to produce the final answers.
Use the following code:
# deploy_jumpstart_model is a helper function from the accompanying repository

# Deploy embedding model with optimized configurations
EMBEDDING_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="huggingface-textembedding-gte-qwen2-7b-instruct",
    instance_type="ml.g5.xlarge",
    endpoint_name_base="s3-vectors-demo-embedding-model",
    model_version="1.0.8"
)

# Deploy text generation model
TEXT_GENERATION_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="deepseek-llm-r1-distill-qwen-7b",
    instance_type="ml.g6.12xlarge",
    endpoint_name_base="s3-vectors-demo-generation-model",
    model_version="2.0.2"
)
In this post, we use GTE-Qwen2-7B-instruct as the embedding model and DeepSeek R1 Distill Qwen 7B as the text generation model because they're highly capable open source models that rank among the top models on embedding and LLM leaderboards. You can choose an embedding and text generation model from the more than 300 models available in SageMaker JumpStart.
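If you want to browse the catalog programmatically, the SageMaker Python SDK provides a helper for listing JumpStart model IDs. The following is a quick sketch; the substring checks are only illustrative, so consult the JumpStart model catalog for authoritative model names:
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all JumpStart model IDs and narrow down candidates by substring
all_models = list_jumpstart_models()
embedding_models = [m for m in all_models if "textembedding" in m]
deepseek_models = [m for m in all_models if "deepseek" in m]
print(f"{len(all_models)} JumpStart models available")
print(embedding_models[:5])
print(deepseek_models[:5])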
Amazon S3 Vectors provides a seamless yet powerful way to store and search vector embeddings. Vectors are stored in vector indexes, which are used for logical grouping; write and read operations are directed to a single vector index. To start using Amazon S3 Vectors, first create a vector bucket:
import boto3

# Create a vector bucket to hold your vector indexes
s3vectors_client = boto3.client("s3vectors")
s3vectors_client.create_vector_bucket(vectorBucketName=vectors_bucket_name)
Next, create one or more vector indexes in the bucket. S3 Vectors supports cosine or euclidean as the distance metric for an index. Define your dimensions and distance metric:
for index in index_names:
    s3vectors_client.create_index(
        vectorBucketName=vectors_bucket_name,
        indexName=index,
        dimension=num_dimensions,
        distanceMetric="cosine",
        dataType="float32",
        metadataConfiguration={"nonFilterableMetadataKeys": ["chunk"]}
    )
You can calculate the embedding vectors for each text chunk in the source documents and store them in your S3 vector bucket using the put_vectors API, which supports up to 500 vectors per call. When putting vectors in the vector index, you can include up to 10 fields in the vector metadata to facilitate the retrieval and generation stages of RAG. In the following example, we add the domain (such as a financial document or shareholder letter) and the year for targeted semantic queries, and the text chunk so we can use its content when generating an answer:
import uuid
import boto3

s3vectors_client = boto3.client("s3vectors")

for chunk in chunks:
    # Assign a unique key to each vector
    key = str(uuid.uuid4())
    # Compute the embedding for the text chunk
    embedding_response = embeddings_model_endpoint.predict({'inputs': chunk["chunk"]})
    # Store the vector with its metadata in the vector index
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket_name,
        indexName=index_name,
        vectors=[
            {
                "key": key,
                "vector": embedding_response[0],
                "metadata": {
                    "domain": domain_name,
                    "year": year,
                    "chunk": chunk["chunk"]
                }
            }
        ]
    )
LangGraph is a framework for building stateful, multi-step applications with LLMs using a graph-based architecture. The first step in the definition of a RAG application with LangGraph is to create Python functions for the retrieval and generation steps.
S3 Vectors supports metadata filtering with comparison operators (such as $eq for exact match or $gt for greater-than comparisons) and combinations through logical operations (such as $and or $or). You can use the query_vectors API in S3 Vectors to run a semantic search on a vector index. In this query, you define the query vector (that is, the vector that represents the query or question), search parameters (for example, topK for the number of similar vectors to retrieve), and, optionally, filter conditions for metadata filtering:
import boto3

s3vectors_client = boto3.client("s3vectors")

query_args = {
    "vectorBucketName": vector_bucket_name,
    "indexName": index_name,
    "vector": query_vector,
    "topK": 5,
    "returnDistance": True,
    "returnMetadata": True
}
# Add an optional metadata filter to narrow the search
if metadata_filter is not None:
    query_args["filter"] = metadata_filter

vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])
Metadata filters can be used to focus the search on a subset of documents. For example, if you're asking about business and industry risks from an annual report, you can call the query_vectors function with metadata_filter set to match only those documents: ({"domain": {"$eq": "Amazon Annual Report"}}). For greater specificity, we could add numerical operators to indicate the years of the annual reports to consult ({"year": {"$gt": 2023}}). The choice of operators and filters depends on the use case and on the logic that ensures only the relevant documents are consulted.
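The following sketch combines both conditions into a single query; the $and structure shown here is an assumption based on the operators described above, so verify the exact filter syntax against the S3 Vectors documentation:
# Restrict retrieval to Amazon Annual Reports published after 2023
# (the $and combination shown here is an assumed filter structure)
metadata_filter = {
    "$and": [
        {"domain": {"$eq": "Amazon Annual Report"}},
        {"year": {"$gt": 2023}},
    ]
}

vectors = s3vectors_client.query_vectors(
    vectorBucketName=vector_bucket_name,
    indexName=index_name,
    vector=query_vector,
    topK=5,
    filter=metadata_filter,
    returnDistance=True,
    returnMetadata=True,
).get("vectors", [])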
You can automate the retrieval and generation steps of a RAG application using a LangGraph StateGraph. To define a LangGraph graph for RAG, we take the following steps:
# Imports for the RAG graph definition
from typing import List, TypedDict
import boto3
from langchain import hub
from langchain_core.documents import Document

# Load RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Define RAG application state structure
class State(TypedDict):
    # Keeping question, context, and answer in the state
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State, index_name: str):
    # Compute query vector
    query_vector = EMBEDDINGS_MODEL_ENDPOINT.predict({'inputs': state["question"]})[0]
    # Retrieve chunks with semantic search
    s3vectors_client = boto3.client("s3vectors")
    query_args = {
        "vectorBucketName": VECTOR_BUCKET_NAME,
        "indexName": index_name,
        "vector": query_vector,
        "topK": 5,
        "returnDistance": True,
        "returnMetadata": True
    }
    vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])
    chunks = [v.get("metadata", {}).get("chunk") for v in vectors]
    # Assemble retrieved chunks into a list of LangChain Documents
    document_chunks = [Document(page_content=chunk) for chunk in chunks]
    return {"context": document_chunks}

def generate(state: State):
    # Join text chunks into a context string
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    # Generate LangChain messages and convert to OpenAI format
    # (langchain_to_openai_messages is a helper from the accompanying repository)
    lc_messages = prompt.invoke({
        "question": state["question"],
        "context": docs_content
    }).to_messages()
    openai_messages = langchain_to_openai_messages(lc_messages)
    # Invoke the text generation model and return the answer
    request = {
        "messages": openai_messages,
        "temperature": 0.2,
        "max_new_tokens": 512,
    }
    response = text_generation_predictor.predict(request)
    return {"answer": response["choices"][0]["message"]["content"]}
response = recursive_graph.invoke({
    "question": "What are the names of the people in Amazon's board of directors?"
})
The response from the LangGraph graph is as follows:
The names of the people in Amazon’s board of directors are Jeffrey P. Bezos, Andrew R. Jassy, Keith B. Alexander, Edith W. Cooper, Jamie S. Gorelick, Daniel P. Huttenlocher, and Judith A. McGrath. There are seven members on the board, including Amazon’s CEO.
To validate that our RAG system performs well, we use MLflow to track experiments and evaluate performance against a ground truth dataset of questions and answers. You can set up an MLflow evaluation and run it using the following script. Notice that we can complement metrics from the evaluation (for example, latency and answer correctness using an LLM as a judge) with parameters from the chunking and embedding stages to provide full visibility into the RAG application:
import mlflow

with mlflow.start_run():
    results = mlflow.evaluate(
        model=recursive_model,
        data=eval_df,
        targets="answer",
        extra_metrics=[
            mlflow.metrics.genai.answer_correctness(
                model="bedrock:/anthropic.claude-3-sonnet-20240229-v1:0",
                parameters={
                    "temperature": 0.2,
                    "max_tokens": 256,
                    "anthropic_version": "bedrock-2023-05-31",
                },
            ),
            mlflow.metrics.latency(),
        ],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
            }
        }
    )
    # Store metrics from the evaluation
    mlflow.log_metrics(results.metrics)
    # Store other parameters
    mlflow.log_param("Chunking strategy", "recursive")
    mlflow.log_param("Chunk size", CHUNK_SIZE)
    mlflow.log_param("Chunk overlap", CHUNK_OVERLAP)
    mlflow.log_param("Number of chunks", NUM_CHUNKS_RECURSIVE)
    mlflow.log_param("Embedding model ID", EMBEDDING_MODEL_ID)
    mlflow.log_param("Embedding model version", EMBEDDING_MODEL_VERSION)
    mlflow.log_param("Text generation model ID", TEXT_GENERATION_MODEL_ID)
    mlflow.log_param("Text generation model version", TEXT_GENERATION_MODEL_VERSION)
    mlflow.log_param("Chunking time seconds", CHUNKING_TIME_RECURSIVE)
This evaluation tracks:
The following figure shows the metrics and parameters from one experiment. Using this view, a machine learning (ML) engineer can compare the performance of different RAG configurations and pick the combination of retrieval and generation parameters that provides the best user experience, for example, the best balance between high answer correctness and low latency. The ML engineer can then identify the models and chunking parameters that produced that performance and implement them in the final application.

By running multiple experiments with different chunking strategies, embedding models, or retrieval configurations, you can identify the optimal setup for your use case.
Amazon S3 Vectors brings scalable, cost-effective vector search to your RAG applications with:
An Amazon S3 vector store is ideal for use cases where ultra-low latency isn't required, such as batch processing, periodic reporting, and agent-based workflows. It offers the durability, scalability, and operational simplicity of Amazon S3. A few industry-specific use cases that can benefit from an S3 vector store are:
When building RAG applications with Amazon S3 Vectors, consider these performance optimization strategies:
To summarize, the key advantages demonstrated in using S3 Vectors with SageMaker AI are:
In this post, we’ve shown how to replace traditional vector database complexity with a streamlined AWS based approach that scales with your data and evolves with your generative AI strategy. The SageMaker managed MLflow integration means that every architectural decision is guided by quantifiable metrics, from answer correctness to latency profiles, turning experimentation into actionable insights. As you implement RAG solutions, use these tools to validate retrieval strategies against domain-specific datasets, benchmark embedding models for accuracy and storage tradeoffs, and enforce governance through version-controlled deployments.
To avoid unnecessary costs, delete resources such as the SageMaker-managed MLflow tracking server, S3 Vectors indexes and buckets, and SageMaker endpoints when your RAG experimentation is complete.
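The following is a sketch of those cleanup steps, assuming the resource names used earlier in this post (the tracking server name is a placeholder):
import boto3

s3vectors_client = boto3.client("s3vectors")
sagemaker_client = boto3.client("sagemaker")

# Delete the vector indexes and the vector bucket
for index in index_names:
    s3vectors_client.delete_index(vectorBucketName=vector_bucket_name, indexName=index)
s3vectors_client.delete_vector_bucket(vectorBucketName=vector_bucket_name)

# Delete the SageMaker endpoints hosting the embedding and text generation models
for endpoint_name in [EMBEDDING_ENDPOINT_NAME, TEXT_GENERATION_ENDPOINT_NAME]:
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

# Delete the SageMaker managed MLflow tracking server (name is a placeholder)
sagemaker_client.delete_mlflow_tracking_server(TrackingServerName="<tracking-server-name>")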
In this post, we’ve demonstrated how to build a complete RAG solution using Amazon S3 Vectors and SageMaker AI. The combination of S3 vector buckets, SageMaker AI LLM models, and SageMaker managed MLflow provides a transformative solution for organizations building enterprise-scale RAG applications. In this approach, we illustrate the use of S3 Vectors as a new approach to effectively manage vector data at scale, without the cost and scalability challenges that come with conventional vector databases.
We encourage you to explore Amazon S3 Vectors documentation and experiment with the SageMaker AI LLM models and SageMaker managed MLflow evaluation templates shown in this post. Now it’s your turn to build enterprise-scale AI solutions with serverless, observable, and relentlessly optimized generative AI strategies.
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.
Biswanath Hore is a Sr. Solutions Architect at Amazon Web Services. He works with customers early in their AWS journey, helping them adopt cloud solutions to address their business needs. He is passionate about Machine Learning and, outside of work, loves spending time with his family.