Production generative AI applications encounter a range of errors, and among the most common are requests failing with 429 ThrottlingException and 503 ServiceUnavailableException errors. In a business application, these errors can originate at multiple layers of the architecture.
Most of these errors are retriable, but retries delay responses and degrade the user experience. In interactive AI applications, delayed responses disrupt a conversation's natural flow, reduce user engagement, and ultimately hinder adoption of AI-powered solutions.
One of the most common triggers is many users converging on a single model at the same time in a widely used application. Handling these errors well is the difference between a resilient application and frustrated users.
This post shows you how to implement robust error handling strategies that improve application reliability and user experience when using Amazon Bedrock. We dive deep into strategies for diagnosing and mitigating these errors. Whether your application is brand new or mature, you will find practical guidance here for operating through them.
Your application's IAM identity needs the following permissions:

- `bedrock:InvokeModel` or `bedrock:InvokeModelWithResponseStream` for your specific models
- `cloudwatch:PutMetricData` and `cloudwatch:PutMetricAlarm` for monitoring
- `sns:Publish` if using SNS notifications

Example IAM policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
    }
  ]
}
```
Note: This walkthrough uses AWS services that may incur charges, including Amazon CloudWatch for monitoring and Amazon SNS for notifications. See AWS pricing pages for details.
The following table compares these two error types:
| Aspect | 503 ServiceUnavailable | 429 ThrottlingException |
|---|---|---|
| Primary Cause | Temporary service capacity issues, server failures | Exceeded account quotas (RPM/TPM) |
| Quota Related | Not Quota Related | Directly quota-related |
| Resolution Time | Transient, refreshes faster | Requires waiting for quota refresh |
| Retry Strategy | Immediate retry with exponential backoff | Must sync with 60-second quota cycle |
| User Action | Wait and retry, consider alternatives | Optimize request patterns, increase quotas |
A 429 ThrottlingException means Amazon Bedrock is deliberately rejecting some of your requests to keep overall usage within the quotas you have configured or that are assigned by default. In practice, you will most often see three flavors of throttling: rate-based, token-based, and model-specific.
Error Message:
ThrottlingException: Too many requests, please wait before trying again.
Or:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many requests, please wait before trying again
What this actually indicates
Rate-based throttling is triggered when the total number of Bedrock requests per minute to a given model and Region crosses the RPM quota for your account. The key detail is that this limit is enforced across all callers in the account, not per individual application or microservice.
Imagine a shared queue at a coffee shop: it does not matter which team is standing in line; the barista can only serve a fixed number of drinks per minute. As soon as more people join the queue than the barista can handle, some customers are told to wait or come back later. That “come back later” message is your 429.
Multi-application spike scenario
Suppose you have three production applications (App A, App B, and App C), all calling the same Bedrock model in the same Region, each with an expected peak of about 50 requests per minute (rpm).
Ops has requested a quota of 150 RPM for this model, which seems reasonable since 50 + 50 + 50 = 150 and historical dashboards show that each app stays around its expected peak.
However, in reality your traffic is not perfectly flat. Maybe during a flash sale or a marketing campaign, App A briefly spikes to 60 rpm while B and C stay at 50. The combined total for that minute becomes 160 rpm, which is above your 150 rpm quota, and some requests start failing with ThrottlingException.
You can also get into trouble when the three apps shift upward at the same time over longer periods. Imagine a new pattern where App A's peak grows to 75 rpm while App B and App C still peak at 50 rpm each.
Your new true peak is 175 rpm even though the original quota was sized for 150. In this situation, you will see 429 errors regularly during those peak windows, even if average daily traffic still looks “fine.”
Mitigation strategies
For rate-based throttling, the mitigation has two sides: client behavior and quota management.
On the client side:

- Retry with exponential backoff and jitter so retries spread out instead of synchronizing into a new spike.
- Apply client-side rate limiting or request queueing to smooth bursts before they reach Bedrock.
- Cache or deduplicate repeated prompts where the use case allows it.

On the quota side:

- Track the combined per-minute peak across every application that shares the model and Region, not just per-app averages.
- Request quota increases sized for that combined peak, with headroom for growth and unexpected bursts.
In the previous example, if App A peaks at 75 rpm and B and C peak at 50 rpm, you should plan for at least 175 rpm and realistically target something like 200 rpm to provide room for growth and unexpected bursts.
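The quota-sizing arithmetic above can be expressed as a small helper you can run against your own per-app peaks. The function name and the 15% headroom default are illustrative assumptions, not an AWS recommendation:

```python
import math

def required_rpm_quota(peak_rpm_per_app, headroom=0.15):
    """Size an RPM quota from observed per-app peaks plus growth headroom."""
    combined_peak = sum(peak_rpm_per_app)
    return math.ceil(combined_peak * (1 + headroom))

# Peaks from the example: App A at 75 rpm, Apps B and C at 50 rpm each
print(required_rpm_quota([75, 50, 50], headroom=0))  # 175, the bare minimum
print(required_rpm_quota([75, 50, 50]))              # 202, close to the 200 rpm target
```

The important habit is sizing from combined peaks rather than averages; the headroom factor is a knob you tune from your own traffic history.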
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many tokens, please wait before trying again.
Even if your request count is modest, a single large prompt or a model that produces long outputs can consume thousands of tokens at once. Token-based throttling occurs when the sum of input and output tokens processed per minute exceeds your account’s TPM quota for that model.
For example, an application that sends 10 requests per minute with 15,000 input tokens and 5,000 output tokens each is consuming roughly 200,000 tokens per minute, which may cross TPM thresholds far sooner than an application that sends 200 tiny prompts per minute.
You may notice that your application runs smoothly under normal workloads, but suddenly starts failing when users paste large documents, upload long transcripts, or run bulk summarization jobs. These are symptoms that token throughput, not request frequency, is the bottleneck.
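The token arithmetic from the example above generalizes into a quick sizing check; the function below is ours, for illustration:

```python
def estimated_tpm(requests_per_minute, avg_input_tokens, avg_output_tokens):
    """Estimate tokens-per-minute consumption for a workload."""
    return requests_per_minute * (avg_input_tokens + avg_output_tokens)

# The bulk-summarization workload from the example above
print(estimated_tpm(10, 15_000, 5_000))  # 200000 tokens per minute
# A chatty, small-prompt workload for comparison
print(estimated_tpm(200, 300, 150))      # 90000 tokens per minute
```

Running this against your real averages shows whether RPM or TPM will be your first bottleneck.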
To mitigate token-based throttling:

- Trim prompts and retrieved context to what the model actually needs.
- Cap output length with the request's maximum token setting.
- Chunk large documents and process them incrementally instead of in a single oversized prompt.
- Track token consumption client-side and slow down before Bedrock throttles you.
For consistently high-volume, token-intensive workloads, you should also evaluate requesting higher TPM quotas or using models with larger context windows and better throughput characteristics.
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Model anthropic.claude-haiku-4-5-20251001-v1:0 is currently overloaded. Please try again later.
Model-specific throttling indicates that a particular model endpoint is experiencing heavy demand and is temporarily limiting additional traffic to keep latency and stability under control. In this case, your own quotas might not be the limiting factor; instead, the shared infrastructure for that model is temporarily saturated.
One of the most effective approaches here is to design for graceful degradation rather than treating this as a hard failure.
Once you understand the types of throttling, the next step is to encode that knowledge into reusable client-side components.
Here’s a robust retry implementation that uses exponential backoff with jitter. This pattern is essential for handling throttling gracefully:
```python
import time
import random

from botocore.exceptions import ClientError

def bedrock_request_with_retry(bedrock_client, operation, **kwargs):
    """Retry throttled Bedrock calls with exponential backoff and jitter."""
    max_retries = 5
    base_delay = 1
    max_delay = 60

    for attempt in range(max_retries):
        try:
            if operation == 'invoke_model':
                return bedrock_client.invoke_model(**kwargs)
            elif operation == 'converse':
                return bedrock_client.converse(**kwargs)
            else:
                raise ValueError(f"Unsupported operation: {operation}")
        except ClientError as e:
            # Security: log error codes but not request/response bodies,
            # which may contain sensitive customer data
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                jitter = random.uniform(0, delay * 0.1)
                time.sleep(delay + jitter)
                continue
            raise
```
This pattern avoids hammering the service immediately after a throttling event and helps prevent many instances from retrying at the same exact moment.
For token-based throttling, the following class maintains a sliding window of token usage and gives your caller a simple yes/no answer on whether it is safe to issue another request:
```python
import time
from collections import deque

class TokenAwareRateLimiter:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.token_usage = deque()

    def can_make_request(self, estimated_tokens):
        now = time.time()
        # Drop usage records older than 1 minute
        while self.token_usage and self.token_usage[0][0] < now - 60:
            self.token_usage.popleft()
        current_usage = sum(tokens for _, tokens in self.token_usage)
        return current_usage + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens_used):
        self.token_usage.append((time.time(), tokens_used))
```
In practice, you would estimate tokens before sending the request, call can_make_request, and only proceed when it returns True, then call record_usage after receiving the response.
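That flow can be sketched end to end. The characters-per-four token heuristic below is a rough assumption for illustration; where accuracy matters, use a real tokenizer or the usage data returned in the model response. The limiter class is repeated here so the snippet runs standalone:

```python
import time
from collections import deque

class TokenAwareRateLimiter:  # same class as above, repeated for a standalone snippet
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.token_usage = deque()

    def can_make_request(self, estimated_tokens):
        now = time.time()
        while self.token_usage and self.token_usage[0][0] < now - 60:
            self.token_usage.popleft()
        return sum(t for _, t in self.token_usage) + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens_used):
        self.token_usage.append((time.time(), tokens_used))

def rough_token_estimate(text):
    # Assumption: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

limiter = TokenAwareRateLimiter(tpm_limit=200_000)
prompt = "Summarize the following incident report ..."
estimate = rough_token_estimate(prompt)
if limiter.can_make_request(estimate):
    # response = bedrock_client.converse(...)  # real call elided
    limiter.record_usage(estimate)  # prefer actual usage from the response
```

When the check returns False, queue or delay the request instead of sending it, so you never hand Bedrock traffic it is guaranteed to throttle.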
A 503 ServiceUnavailableException tells you that Amazon Bedrock is temporarily unable to process your request, often due to capacity pressure, networking issues, or exhausted connection pools. Unlike 429, this is not about your quota; it is about the health or availability of the underlying service at that moment.
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the ConverseStream operation (reached max retries: 4): Too many connections, please wait before trying again.
In many real-world scenarios this error is caused not by Bedrock itself, but by how your client is configured:
- The default boto3 HTTP connection pool size is relatively small (10 connections), which can be quickly exhausted by highly concurrent workloads.

To help fix this, share a single Bedrock client instance and increase the connection pool size:
```python
import boto3
from botocore.config import Config

# Security best practice: never hardcode credentials.
# boto3 automatically resolves credentials from:
#   1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
#   2. IAM role (recommended for EC2, Lambda, ECS)
#   3. AWS credentials file (~/.aws/credentials)
#   4. IAM roles for service accounts (recommended for EKS)

# Configure a larger connection pool for parallel execution
config = Config(
    max_pool_connections=50,  # increase from the default of 10
    retries={'max_attempts': 3}
)

bedrock_client = boto3.client('bedrock-runtime', config=config)
```
This configuration allows more parallel requests through a single, well-tuned client instead of hitting client-side limits.
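With the larger pool in place, one shared client can safely serve many threads; keep the worker count at or below `max_pool_connections`. The helper below is a sketch with names of our own choosing:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(invoke_fn, payloads, max_workers=10):
    """Fan requests out over one shared, client-bound callable."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke_fn, payloads))

# Example wiring (elided):
# invoke = lambda body: bedrock_client.invoke_model(modelId="...", body=body)
# results = run_parallel(invoke, request_bodies, max_workers=50)
```

boto3 clients are thread-safe for concurrent calls, so sharing one instance this way avoids both connection-pool exhaustion and the overhead of creating a client per request.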
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the InvokeModel operation: Service temporarily unavailable, please try again.
In this case, the Bedrock service is signaling a transient capacity or infrastructure issue, often affecting on-demand models during demand spikes. Treat the error as a temporary outage: retry with exponential backoff, fail over to alternative models or Regions where possible, and degrade gracefully in the user experience instead of surfacing a hard error.
When you operate mission-critical systems, simple retries are not enough; you also want to avoid making a bad situation worse.
The circuit breaker pattern helps prevent your application from continuously calling a service that is already failing. Instead, it quickly flips into an “open” state after repeated failures, blocking new requests for a cooling-off period.
When Bedrock returns 503 errors due to capacity issues, continuing to hammer the service with requests only makes things worse. The circuit breaker pattern helps:
The following code implements this:
```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
circuit_breaker = CircuitBreaker()

def make_bedrock_request():
    return circuit_breaker.call(bedrock_client.invoke_model, **request_params)
```
Amazon Bedrock cross-Region inference (CRIS) helps add another layer of resilience by giving you a managed way to route traffic across Regions.
For applications without data residency requirements, global CRIS offers enhanced performance, reliability, and cost efficiency.
From an architecture standpoint, CRIS routes requests to a Region with available capacity on your behalf, smoothing out regional demand spikes without custom failover logic in your application.
Bedrock automatically encrypts data in transit using TLS and does not store customer prompts or outputs by default; combine this with CloudTrail logging for compliance posture.
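Adopting CRIS from code is typically just a change of identifier: you pass an inference profile ID (geo-prefixed, for example `us.`) instead of a plain model ID. The wrapper function and the profile ID in the comment are illustrative; confirm the exact profile IDs available in your account:

```python
def invoke_via_inference_profile(client, profile_id, user_text):
    """Call the Converse API with a cross-Region inference profile ID."""
    return client.converse(
        modelId=profile_id,  # e.g. "us.anthropic.claude-3-5-sonnet-20240620-v1:0"
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )

# Usage (elided):
# response = invoke_via_inference_profile(
#     bedrock_client, "us.anthropic.claude-3-5-sonnet-20240620-v1:0", "Hello")
```

Because the request shape is unchanged, you can often trial CRIS by swapping the identifier behind a configuration flag.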
You cannot manage what you cannot see, so robust monitoring is essential when working with quota-driven errors and service availability. Set up comprehensive Amazon CloudWatch monitoring to enable proactive error management and maintain application reliability.
Note: CloudWatch custom metrics, alarms, and dashboards incur charges based on usage. Review CloudWatch pricing for details.
Monitor these CloudWatch metrics in the AWS/Bedrock namespace:

- Invocations and InvocationLatency per model
- InvocationThrottles for 429s and InvocationServerErrors for 5xx errors
- InputTokenCount and OutputTokenCount to track TPM consumption
For better insight, create dashboards that:

- Plot error counts alongside request volume so spikes appear in context.
- Break usage down by model ID and by application.
- Show peak (not just average) RPM and TPM against your current quotas.
Do not wait until users notice failures before you act. Configure CloudWatch alarms with Amazon SNS notifications based on thresholds such as:
For 429 errors:

- Throttled requests exceeding a small percentage of total invocations over a 5-minute window.
- RPM or TPM usage approaching your quota (for example, 80 percent of it).

For 503 errors:

- A sustained rise in server-side errors across consecutive evaluation periods.
- Elevated invocation latency, which often precedes availability errors.
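As a concrete starting point, the sketch below creates a 429 alarm with boto3's `put_metric_alarm`. The `AWS/Bedrock` namespace and `InvocationThrottles` metric reflect Bedrock's published CloudWatch metrics, but verify the names and dimensions in your account; the 10-per-5-minutes threshold is an arbitrary example:

```python
def create_throttle_alarm(cloudwatch, model_id, sns_topic_arn, threshold=10):
    """Alarm when throttled invocations exceed `threshold` over 5 minutes."""
    return cloudwatch.put_metric_alarm(
        AlarmName=f"bedrock-throttles-{model_id}",
        Namespace="AWS/Bedrock",
        MetricName="InvocationThrottles",
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )

# Usage (elided):
# create_throttle_alarm(boto3.client("cloudwatch"),
#     "anthropic.claude-3-haiku-20240307-v1:0",
#     "arn:aws:sns:us-east-1:123456789012:bedrock-alerts")
```

Tune the threshold from your baseline: a handful of throttles during a burst may be acceptable, while any sustained count signals undersized quotas.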
CloudWatch Logs Insights queries help you move from “we see errors” to “we understand patterns.” Examples include:
Find 429 error patterns:
```
fields @timestamp, @message
| filter @message like /ThrottlingException/
| stats count(*) as error_count by bin(5m)
| sort error_count desc
```
Analyze 503 error correlation with request volume:
```
fields @timestamp, @message
| filter @message like /ServiceUnavailableException/
| stats count(*) as error_count by bin(1m)
| sort error_count desc
```
We’ve covered a lot of ground in this post, so let’s bring it all together. Successfully handling Bedrock errors requires:

- Understanding the difference between quota-driven 429s and availability-driven 503s
- Retrying with exponential backoff and jitter
- Client-side rate limiting that accounts for both RPM and TPM
- Circuit breakers and graceful degradation for sustained failures
- Cross-Region inference for an extra layer of resilience
- CloudWatch monitoring and alarms that reflect real peaks, not averages
Handling 429 ThrottlingException and 503 ServiceUnavailableException errors effectively is a crucial part of running production-grade generative AI workloads on Amazon Bedrock. By combining quota-aware design, intelligent retries, client-side resilience patterns, cross-Region strategies, and strong observability, you can keep your applications responsive even under unpredictable load.
As a next step, identify your most critical Bedrock workloads, enable the retry and rate-limiting patterns described here, and build dashboards and alarms that expose your real peaks rather than just averages. Over time, use real traffic data to refine quotas, fallback models, and regional deployments so your AI systems can remain both powerful and dependable as they scale.
For teams looking to accelerate incident resolution, consider enabling AWS DevOps Agent—an AI-powered agent that investigates Bedrock errors by correlating CloudWatch metrics, logs, and alarms just like an experienced DevOps engineer would. It learns your resource relationships, works with your observability tools and runbooks, and can significantly reduce mean time to resolution (MTTR) for 429 and 503 errors by automatically identifying root causes and suggesting remediation steps.
About the Authors
Manuel Rioux