Enterprise organizations are rapidly moving beyond generative AI experiments to production deployments and complex agentic AI solutions, facing new challenges in scaling, security, governance, and operational efficiency. This blog post series introduces generative AI operations (GenAIOps), the application of DevOps principles to generative AI solutions, and demonstrates how to implement it for applications powered by Amazon Bedrock, a fully managed service that offers a choice of industry leading foundation models (FMs) along with a broad set of capabilities that you need to build generative AI applications.
In this first part of our two-part series, you’ll learn how to evolve your existing DevOps architecture for generative AI workloads and implement GenAIOps practices. We’ll showcase practical implementation strategies for different generative AI adoption levels, focusing on consuming foundation models. For information on foundation model training, model fine-tuning, and model distillation, refer to our separate resources. Part two covers AgentOps and advanced patterns for scaling agentic AI applications in production.
For years, enterprises have successfully embedded DevOps practices into their application lifecycle, streamlining the continuous integration, delivery, and deployment of traditional software solutions. As they progress through the generative AI adoption levels, they quickly discover that traditional DevOps practices aren’t sufficient for managing generative AI workloads at scale. Whereas conventional DevOps emphasizes seamless collaboration between development and operations teams and handles deterministic systems with predictable outputs, the nondeterministic, probabilistic nature of AI outputs requires a shift in how organizations approach lifecycle management of their generative AI–powered solutions. GenAIOps helps you with:
At a high level, the GenAIOps lifecycle is similar to that of DevOps, but there are additional considerations for each lifecycle step when it comes to generative AI applications. The following table describes the DevOps practices per stage and the GenAIOps extensions.
| Stage | DevOps practices | GenAIOps extensions |
|---|---|---|
| Plan |
|
|
| Develop |
|
|
| Build |
|
|
| Test |
|
|
| Release |
|
|
| Deploy |
|
|
| Maintain |
|
|
| Monitor |
|
|
The following graphic illustrates the GenAIOps key activities per stage.
Figure 1: DevOps Stages with GenAIOps key activities
Before exploring the GenAIOps implementation patterns, we now examine how GenAIOps expands roles and processes to address unique challenges associated with generative AI. The following are the key roles and pillars of the generative AI application lifecycle:
The following graphic illustrates these roles.
Figure 2: People and Processes in GenAIOps
GenAIOps implementations can vary depending on how generative AI has permeated a business. The following are the three main stages of generative AI adoption:
As they progress in their adoption journey, enterprises expand their existing DevOps workflows for GenAIOps. The following sections describe implementation patterns for GenAIOps per stage using Amazon Bedrock capabilities. With on-demand access to models, managed infrastructure pre-trained and built-in security features, Amazon Bedrock enables rapid Generative AI deployment while helping to maintain enterprise compliance.
In the exploration stage, organizations often rely on small, cross-functional tiger teams comprising early AI adopters who wear multiple hats. Data scientists might double as prompt engineers, developers handle their own model evaluations, and compliance reviews are conducted through extemporaneous meetings with legal teams. The governance processes remain largely manual and informal, with product owners directly collaborating with technical teams to establish success metrics while platform engineers focus on basic environment setup rather than sophisticated CI/CD automation.
Before you integrate generative AI capabilities into your applications and workflows, we need a baseline DevOps architecture to support your solution. As shown in the following diagram, you have a shared account that manages your CI/CD pipelines and controls deployments across development, pre-production, and production accounts. You also have separate AWS accounts for development, pre-production, data governance, and data producers for isolation between environments, security control, and cost tracking by environment. Every resource in this setup is defined as code, meaning you can version, test, and deploy your entire infrastructure as seamlessly as you deploy application code.
Figure 3: DevOps baseline architecture
Now that you understand the DevOps foundation, we’ll show how you can enhance it with Amazon Bedrock capabilities and start building your GenAIOps foundation in four key steps.
Data serves three critical functions in generative AI: powering RAG systems for enhanced contextual responses, providing ground truth for model evaluation and validation, and enabling both initial training and subsequent fine-tuning of AI models for specific use cases by providing training data. In most cases, access controls are required to help prevent unauthorized access. In RAG, data is used to improve LLM responses and ground them in truth by providing relevant contextual information from data sources. In the standard RAG workflow, you:
When using Amazon Bedrock, you can query a vector database such as Amazon OpenSearch Service or get data from a data store using an API query to augment the user’s query before it’s sent to an FM. If you have live data sources, it’s necessary to implement connectors to enable data synchronization and integration with various data sources to help maintain data integrity and freshness. You also need to configure guardrails so that data that shouldn’t be sent to the model or be part of the output, such as personally identifiable information (PII), is correctly blocked or blanked.
You can also use Amazon Bedrock Knowledge Bases, a fully managed capability that helps you implement the entire RAG workflow without having to build custom integrations to data sources and manage data flows.
Your data provides a source of truth in evaluation. Before application development begins, generative AI developers should establish a comprehensive golden dataset derived from real-world interactions or domain expert input. A robust evaluation dataset should be composed of, depending on your evaluation strategy, prompt or prompt-response pairs that accurately reflect real-world usage scenarios and provide comprehensive coverage of expected production queries. Data engineers make this dataset available in the development environment, applying necessary modifications for sensitive data. Prompt outputs, or prompt outputs together with the expected answers, can then be used by human evaluators or LLM-as-a-judge evaluators (for instance, LLM-as-a-judge on Amazon Bedrock Model Evaluation) to assess the quality of the application responses.
Model providers use extensive datasets to develop foundation AI models, whereas end users use domain-specific data to fine-tune these models for specialized applications and targeted use cases.
In most cases, you need to implement data governance policies so that users only access authorized data throughout the entire system pipeline. You also need to control versions of evaluation datasets, as well as track changes to documents and generated embeddings in RAG knowledge bases for evaluation and auditing purposes.
Overall having a strong data foundation is important for generative AI applications.
Start by integrating FMs and other generative AI capabilities into your application during prototyping in the development environment. In Amazon Bedrock, you can access the models directly using the Amazon Bedrock virtual private cloud (VPC) endpoint powered by AWS PrivateLink and establish a private connection between the VPC in your account and the Amazon Bedrock service account.
You can use Amazon Bedrock Prompt Management to create, test, manage, and optimize prompts for FMs and use Amazon Bedrock Flows for multistep workflows such as document analysis pipelines that require sequential LLM calls. You can also configure and apply guardrails and incorporate safety controls for the FM interactions with Amazon Bedrock Guardrails. In many use cases, you want to give these models contextual information from your company’s data sources using RAG. You can implement a self-managed approach or use Amazon Bedrock Knowledge Bases, a fully managed capability with built-in session context management and source attribution.
The following image shows the key Amazon Bedrock capabilities for model consumers starting with FMs, Knowledge Bases, Agents, and Intelligent Prompt Routing. Following these are Guardrails, Flows, Prompt Engineering, and finally, Prompt Caching.
Figure 4: Amazon Bedrock key components for model consumers
After you’ve integrated the FMs and generative AI components into your application, you need to evaluate their performance. At this point, you create test cases, write test configurations to test different prompts, models, vector stores, and chunking strategies, which they save in their application code or other tool of choice, and calculate evaluation metrics. Amazon Bedrock provides evaluation tools for you to accelerate adoption of generative AI applications. With Amazon Bedrock Evaluations, you can evaluate, compare, and select the best FM for your use case using automatic evaluations (programmatic or with LLM-as-a-Judge) and set up human-in-the-loop evaluation workflows. You can also bring your own (BYO) inference responses and evaluate models, RAG implementations, and fully built applications.
The following diagram summarizes the approach where you’d use an AWS Lambda function to read the inference prompt-response pairs, route them to Amazon Bedrock Evaluations, and store the results in an Amazon Simple Storage Service (Amazon S3) bucket.
Figure 5: Evaluation during development
In case of issues, due to the probabilistic nature of the generative AI components, you need to categorize errors systematically to identify patterns before you act, rather than fixing problems in an isolated way. We recommend the following tests in addition to your standard application testing:
When optimizing an application, builders must keep in mind that optimizing on one dimension can force trade-offs in other dimensions. Quality, cost, and latency are closely related, and optimizing for one can effect the other.
When running tests at scale, you need a way to track your experiments. A way to do that is to use Amazon SageMaker AI with MLflow, a fully managed capability of Amazon SageMaker AI that lets you create, manage, analyze, and compare your machine learning (ML) experiments.
After you’ve identified the optimal model, prompts, inference parameters, and other configurations for your use case, commit these artifacts to your application repository to trigger your CI/CD pipeline. This pipeline should execute your predefined evaluation tests, creating an essential quality gate for your generative AI applications. When tests pass the accuracy, safety, and performance thresholds, your pipeline deploys to the pre-production environment for final validation. If the tests fail, the system adds detailed comments to the pull request with specific test failures. You should also pair your evaluation pipeline with comprehensive experiment tracking to capture every test variable, from model configurations to performance metrics.
With generative AI observability, you can understand, optimize, and evolve your applications throughout their lifecycle and deliver capabilities across the following dimensions:
Your observability implementation needs to monitor:
In Amazon Bedrock, you can enable and capture model invocation logs in Amazon S3, Amazon CloudWatch, or both. In CloudWatch, you can create purpose-built dashboards that visualize key metrics including timestamp, input and output token counts, and invoked model ID. The following screenshot shows an example of Amazon Bedrock invocation metrics in CloudWatch.
Figure 6: Example of Amazon Bedrock invocation metrics in Amazon CloudWatch
Amazon Bedrock is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in Amazon Bedrock. AWS CloudTrail captures the API calls for Amazon Bedrock as events. Amazon Bedrock also publishes guardrail metrics to CloudWatch such as guardrail invocations, latency, and guardrail interventions.
You can also refer to the Bedrock-ICYM (I See You Monitoring) repository for an open source observability solution for Amazon Bedrock or use open source solutions such as Arize AI and Langfuse.
Implementing the four steps outlined in this post can transform your baseline DevOps solution into a robust GenAIOps solution that follows best practices. This architecture maintains isolation between environments while integrating key generative AI components, including Amazon Bedrock capabilities, development-time evaluation tools, automated evaluation pipelines, and comprehensive monitoring. The recommended application lifecycle is as follows:
The following diagram shows the complete architecture with numbers that correspond to these steps.
Figure 7: GenAIOps architecture – Exploration
With these essential GenAIOps components in place, organizations can progress to the next stage.
As organizations enter the production stage, they formalize dedicated roles within newly established generative AI centers of excellence, where specialized prompt engineers, AI evaluation specialists, and GenAIOps platform engineers emerge as distinct functions with clear responsibilities and standardized processes. Training programs become systematic across the organization, compliance checks transition from manual reviews to automated policy enforcement, and cross-functional collaboration shifts from informal coordination to structured workflows with defined handoffs between development, evaluation, and deployment teams.
Organizations in the production stage aim at standardizing code repositories, developing reusable components, and implementing automated evaluation and feedback loops within their established GenAIOps pipeline to help continuously enhance application quality and operational efficiency. In some cases, they can also benefit from a centralized generative AI gateway to help streamline LLM interactions, help with load balancing and error handling, and fall back to other models when required. You can further enhance your GenAIOps foundation in three additional key steps.
In addition to having version control through repository solutions such as GitHub or GitLab to maintain, track, and version your application code for the application, you should strive to create reusable, versioned blueprints for your prompts, model configuration, and guardrails.
A prompt template is a prompt that includes replaceable variables or parameters. You can adjust these to customize and tailor the prompt template to a particular use case. The purpose is to accelerate prompt engineering processes and facilitate quick sharing of prompts across generative AI applications. Examples of prompt templates for Amazon Bedrock can be found at Prompt templates and examples for Amazon Bedrock text models.
You should centrally store, track, and version both prompt templates and application-specific prompts in a prompt catalog. You can use Git, but as you scale, we recommend using dedicated tools instead, for example, Amazon Bedrock Prompt Management. Use prompt management to store your prompts alongside models, inference parameter values, guardrails, and variables when creating prompts. Additionally, you can use automated prompt optimization tools that can help refine prompts such as Prompt optimization in Amazon Bedrock.
Many use cases require the creation of workflows that are end-to-end solutions linking FMs, prompts, and components such as knowledge bases. Store workflows as code in your repository, enabling version control, tracking, and seamless deployment across test and production environments. Amazon Bedrock Flows provides a visual builder interface for creating and deploying generative AI workflows.
Although guardrails might already be part of your prompt and model configuration, it’s also possible to have guardrails defined for your workloads and to store the configuration of these guardrails in a common store or repository in the shared account.
To further scale, store the evaluation pipelines as shared and deployable assets that can be maintained, tracked, and versioned in a repository environment. This allows the integration of automatic testing into the CI/CD pipeline.
For evaluation pipelines that include human evaluation to assess and evaluate the outputs of AI models through human feedback and review, you can use tools to automate this process. For instance, you can use human-based model evaluation in Amazon Bedrock and to automate the creation of the evaluation job. Use this tool to seamlessly orchestrate human evaluation flows, including pre-implemented workflows, UI components, and the management of labeling workforce.
Organizations at this stage can benefit from implementing a centralized multi-provider generative AI gateway on AWS to optimize their LLM operations. Using a generative AI gateway can help customers that want to implement one or more of the following:
An example of a mature generative AI gateway is the Guidance for Multi-Provider Generative AI Gateway on AWS, which provides standardized access through a single API interface that supports multiple AI providers, such as Amazon Bedrock, and streamlines LLM interactions. The solution features comprehensive enterprise controls, including usage tracking, robust access control, cost management capabilities, and security monitoring. It also enables deployment efficiency through automated deployment using HashiCorp Terraform, with support for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS) implementations in a production-ready infrastructure.
The updated diagram reflects these steps. It introduces experiment tracking in the shared account and a generative AI gateway for each one of your Dev, Pre-Prod and Prod environments to provide standardized access to FMs from multiple providers through unified API management and security controls. Changes to the generative AI gateway configuration should trigger the CI/CD pipeline for deployment and evaluation.
Your application repository contains your application specific code and tests. In addition, we have added the following shared modules to the repository:
Figure 8: GenAIOps with reusable components
Many organizations progress from initial LLM applications and RAG implementations to sophisticated agent-based architectures. This evolution marks a fundamental shift from using AI to augment human decisions to deploying autonomous agentic AI solutions.
An AI agent is an autonomous system that combines LLMs with external tools and data sources to perceive its environment, reason and plan, execute complex multistep tasks, and achieve specific objectives with minimal human intervention. It usually consists of the the following components: models, data sources, models, deterministic components, agentic/probabilistic components, and protocols.
Figure 9: Components of Agentic AI solutions
The probabilistic nature of some of these components brings new challenges when managing the end-to-end lifecycle of agentic AI solutions. AgentOps is a set of practices that extend GenAIOps to address these challenges and risks. In the second part of this blog post series, we revise our generative AI solution for AgentOps.
In this post, you’ve learned how to implement GenAIOps practices that align with your organization’s generative AI adoption level, accelerate development through systematic evaluation and reusable assets, and establish robust monitoring for your generative AI solutions. Throughout, we’ve shown how to effectively mitigate risks and maximize business value using the managed capabilities of Amazon Bedrock.
We encourage you to start applying these practices in your projects and share your experiences. Ready to take your GenAIOps to the next level? Stay tuned for part 2, where we’ll dive deeper into AgentOps and explore solution designs for multi-agent systems with Amazon Bedrock AgentCore.
Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. Her experience spans the entire AI lifecycle, from collaborating with organizations training cutting-edge Large Language Models (LLMs) to guiding enterprises in deploying and scaling these models for real-world applications. In her spare time, she explores new worlds through fiction.
Anna Grüebler Clark is a Specialist Solutions Architect at AWS focusing on in Artificial Intelligence. She has more than 16 years experience helping customers develop and deploy machine learning applications. Her passion is taking new technologies and putting them in the hands of everyone, and solving difficult problems leveraging the advantages of using traditional and generative AI in the cloud.
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.
Aris Tsakpinis is a Specialist Solutions Architect for Generative AI focusing on open source models on Amazon Bedrock and the broader generative AI open source ecosystem. Alongside his professional role, he is pursuing a PhD in Machine Learning Engineering at the University of Regensburg, where his research focuses on applied natural language processing in scientific domains.
Manuel Rioux est fièrement propulsé par WordPress