This post was co-authored with Neevash Ramdial, Technical Marketing leader at Stream
Building production-grade voice agents that feel natural and responsive is a complex engineering challenge. You must orchestrate speech-to-speech models, manage low-latency audio streaming, and handle connection lifecycle. You also need to deliver consistent experiences across web, mobile, and desktop applications.
In this post, you learn how to combine Stream’s Vision Agents open-source framework with Amazon Bedrock and Amazon Nova 2 Sonic to build real-time voice agents that can be production-ready in minutes. You’ll learn how the integration works under the hood, walk through code examples, and explore advanced capabilities like function calling, automatic reconnection, and multilingual voice support.
Building voice-enabled AI applications requires orchestrating multiple complex systems that must work together reliably. You face the challenge of managing real-time audio streaming infrastructure while simultaneously integrating speech recognition, language models, and text-to-speech services. Each of these has its own latency characteristics and failure modes. A typical voice interaction involves capturing audio from the user’s microphone, streaming it to a speech-to-text service, processing the transcript through a language model, generating a response, converting that response back to speech, and delivering it to the user. All of this must happen within a window of a few hundred milliseconds to feel natural. Delays in this pipeline can break the conversational flow and frustrate users.Beyond the core AI pipeline, production voice applications must handle the messy realities of real-world deployment: unreliable network connections, browser compatibility issues, session timeouts, and graceful degradation when services become unavailable. You often spend more time building reconnection logic, managing WebRTC connections, and handling edge cases than on the actual AI capabilities. This infrastructure burden means teams either invest months building custom solutions or settle for limited off-the-shelf products that don’t meet their specific needs. Vision Agents abstracts the infrastructure complexity while providing the flexibility to customize the AI experience.
The solution brings together three key components:
Together, these components create a complete stack: Stream handles the real-time media transport and client-side experience, Amazon Nova 2 Sonic provides the AI intelligence, and Vision Agents provides the glue code that ties them together.
The system is designed around a clean separation of concerns: Stream’s infrastructure handles the real-time media transport and client connectivity, while Amazon Nova Sonic runs in the customer’s own AWS account and provides the AI intelligence. This separation helps keep sensitive data and business logic remain within the customer’s control, while Stream’s globally distributed edge network delivers the low-latency media experience users expect.Stream’s edge network acts as the media broker between end-user devices and the Vision Agent worker processes. When a user speaks, audio is captured, encrypted, and transmitted as RTP over UDP to the nearest Stream SFU (Selective Forwarding Unit). The SFU terminates the WebRTC connection, handles NAT traversal and bandwidth estimation, and forwards audio tracks to the Vision Agent worker as if it were another call participant. This means the agent integrates naturally into the call model. The agent is another peer, receiving and sending audio through the same infrastructure used by human participants.
Audio data flows bidirectionally through the system: incoming speech from the user is decoded to raw PCM by the Vision Agent worker, streamed to Amazon Nova Sonic via the Bedrock real-time API, and response audio frames from Nova Sonic are re-encoded, packetized as RTP, and delivered back through the SFU to the client device. End-to-end latency is typically under 500 milliseconds. Voice activity detection (VAD) runs in the worker to detect speech boundaries and barge-in events, while echo cancellation in the browser helps prevent the agent’s own output from re-triggering the VAD loop.

This architecture might seem complex, but Vision Agents abstracts much of this complexity. Let’s see what the actual code looks like:
Before getting started, make sure you have the following:
pip install uv).uv add vision-agents)mkdir voice-agent
cd voice-agentuv inituv add "vision-agents[getstream,aws]"
python-dotenv
The vision-agents[aws] extra installs the Amazon Bedrock plugin along with its dependencies, including boto3, aws-sdk-bedrock-runtime, and Silero VAD for voice activity detection.
Create a “.env” file in your project root to manage your configuration. For AWS credentials, we recommend pointing to your AWS_PROFILE in this file so the application can access your credentials when interacting with AWS resources. We do not recommend storing your AWS access keys directly in this file.
For Stream API credentials, you can use a third-party library like HashiCorp Vault or AWS Secrets Manager, but security considerations are not in the scope of this post.
# Stream API credentials
STREAM_API_KEY=test/geststream/api_key
STREAM_API_SECRET=test/getstream/api_secret
# AWS credentials
AWS_PROFILE=your_aws_profile_name
AWS_REGION=us-east-1
Vision Agents automatically discovers these environment variables at startup, so you don’t need to pass them explicitly to each client.
Create a main.py file with the following code:
import asyncio
from dotenv import load_dotenv
from vision_agents.core import Agent, User, Runner
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import aws, getstream
load_dotenv()
async def create_agent(**kwargs) -> Agent:
agent = Agent(
edge=getstream.Edge(),
agent_user=User(name="Helpful Assistant", id="agent"),
instructions="You are a helpful voice assistant. Be concise and friendly.",
llm=aws.Realtime(
model="amazon.nova-2-sonic-v1:0",
region_name="us-east-1",
voice_id="matthew",
),
)
return agent
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
call = await agent.create_call(call_type, call_id)
async with agent.join(call):
await asyncio.sleep(2)
await agent.llm.simple_response(
text="Greet the user warmly and ask how you can help."
)
await agent.finish() # Run until the call ends
if __name__ == "__main__":
Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Run the agent:
uv run main.py run
In fewer than 30 lines of code, you have a fully functional, real-time voice agent powered by Amazon Nova Sonic, accessible from a Stream client SDK.
Let’s take a closer look at how the aws.Realtime plugin works under the hood.
Amazon Nova 2 Sonic uses an event-driven bidirectional streaming API. Instead of using a request-response pattern, this approach allows nearly continuous audio to flow in both directions simultaneously. The Vision Agents AWS plugin manages this complexity through a structured event sequence:
sessionStart event is sent with inference configuration (temperature, max tokens, top-p).promptStart event configures the audio output format (24kHz PCM), voice selection, and tool definitions.SYSTEM role.audioInput events.audioOutput events with the generated speech.promptEnd and sessionEnd events cleanly close the connection.Each content block follows a three-part pattern: contentStart → content payload → contentEnd. This hierarchical structure allows the model to maintain proper context throughout the interaction.
Here is what the session start event looks like in the plugin:
def _create_session_start_event(self) -> Dict[str, Any]:
return {
"event": {
"sessionStart": {
"inferenceConfiguration": {
"maxTokens": 1024,
"topP": 0.9,
"temperature": 0.7,
}
},
"turnDetectionConfiguration": {
"endpointingSensitivity": "MEDIUM"
},
}
}
One of the key capabilities of Amazon Nova 2 Sonic is native function calling during real-time conversations. This allows your voice agent to perform actions like querying databases, calling APIs, and triggering workflows while maintaining a natural spoken conversation.Use the @llm.register_function decorator to define functions the model can call:
import asyncio
from dotenv import load_dotenv
from typing import Dict, Any
from vision_agents.core import Agent, User, Runner
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import aws, getstream
load_dotenv()
async def create_agent(**kwargs) -> Agent:
agent = Agent(
edge=getstream.Edge(),
agent_user=User(name="Weather Assistant", id="agent"),
instructions="""You are a helpful weather assistant. When users ask
about weather, use the get_weather function to fetch current conditions.
You can also help with simple calculations.""",
llm=aws.Realtime(
model="amazon.nova-2-sonic-v1:0",
region_name="us-east-1",
),
)
@agent.llm.register_function(
name="get_weather",
description="Get the current weather for a given city"
)
async def get_weather(location: str) -> Dict[str, Any]:
# In production, call a real weather API
return {
"city": location,
"temperature": 72,
"condition": "Sunny",
"humidity": "45%"
}
@agent.llm.register_function(
name="calculate",
description="Perform a mathematical calculation"
)
def calculate(operation: str, a: float, b: float) -> dict:
operations = {
"add": lambda x, y: x + y,
"subtract": lambda x, y: x - y,
"multiply": lambda x, y: x * y,
"divide": lambda x, y: x / y if y != 0 else None,
}
result = operations.get(operation, lambda x, y: None)(a, b)
return {"operation": operation, "a": a, "b": b, "result": result}
return agent
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
await agent.create_user()
call = await agent.create_call(call_type, call_id)
async with agent.join(call):
await asyncio.sleep(2)
await agent.llm.simple_response(
text="Greet the user and let them know you can check the weather."
)
await agent.finish()
if __name__ == "__main__":
Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
When the model decides to invoke a function, the following sequence occurs:
toolUse event containing the function name and arguments.toolResult event, wrapped in the standard contentStart → toolResult → contentEnd pattern.You can build complex, multi-step workflows with this approach. For example, a voice agent could look up a customer record, check inventory, and place an order, all within a single natural conversation.
Beyond real-time speech-to-speech, the AWS plugin also provides a standard LLM integration via aws.LLM. This is useful for custom pipeline architectures where you want to pair an Amazon Bedrock model with separate STT and TTS providers:
from vision_agents.core import Agent, User
from vision_agents.plugins import aws, getstream, cartesia, deepgram, smart_turn
agent = Agent(
edge=getstream.Edge(),
agent_user=User(name="Custom Pipeline Agent"),
instructions="Be helpful and concise.",
llm=aws.LLM(
model="anthropic.claude-3-haiku-20240307-v1:0",
region_name="us-east-1"
),
tts=cartesia.TTS(),
stt=deepgram.STT(),
turn_detection=smart_turn.TurnDetection(
buffer_duration=2.0,
confidence_threshold=0.5
),
)
The standard LLM supports streaming responses via converse_stream(), full conversation history management, vision inputs for models like Claude, and multi-round tool calling with up to 3 rounds of function execution per request.
For custom pipeline architectures, the AWS plugin also includes an Amazon Polly TTS integration. This is useful when you’re using a non-realtime LLM (like Claude on Amazon Bedrock or another provider) and need high-quality voice synthesis:
from vision_agents.plugins import aws
tts = aws.TTS(
region_name="us-east-1",
voice_id="Joanna",
engine="neural", # 'standard' or 'neural'
language_code="en-US"
)
Amazon Polly TTS supports both standard and neural engines, SSML input for fine-grained speech control, and multiple languages and voices. The neural engine produces more natural-sounding speech, making it a strong choice when you’re building a custom STT → LLM → TTS pipeline on AWS infrastructure.
To delete the Stream call and terminate running Vision Agent processes:
uv run main.py stop
Important: Amazon Bedrock charges apply for all API calls to Amazon Nova 2 Sonic. You can run the cleanup command to terminate sessions and avoid ongoing charges. Active sessions may continue to incur costs until explicitly terminated.
With the technical foundation in place, it’s worth exploring where these capabilities translate into meaningful real-world impact. The combination of low-latency voice, conversation management, and tool integration opens up a wide range of applications across industries where natural spoken interaction can replace or augment traditional interfaces.
Vision Agents combined with Amazon Nova 2 Sonic is well suited for environments where users cannot reliably interact with a screen, such as driving, field service, logistics, healthcare, or on-site operations.In these contexts, voice becomes the primary interface, not a convenience feature.
Because the agent maintains context throughout the interaction, users can issue follow-up commands or clarifications without repeating information.For example, a delivery driver can ask, “What’s my next stop?” receive spoken directions, say “Mark the last delivery as complete,” and then follow up with “Call dispatch,” all without touching a screen, while the agent updates backend systems in real time.
Vision Agents combined with Amazon Nova 2 Sonic is designed for handling large volumes of inbound support calls where human agents become a bottleneck. This use case is fundamentally about scale: reducing queue times, deflecting repetitive requests, and reserving human agents for cases that require their involvement.
When an issue exceeds predefined confidence thresholds or requires manual intervention, the agent escalates to a human with structured context attached, alleviating the need for callers to repeat themselves.During peak hours, hundreds of customers might call asking about delivery delays. Instead of waiting in a queue, callers are immediately answered by a voice agent that checks order status, explains the delay, offers next steps, and only routes to a live agent if an exception is detected.This turns the phone system from a queue-based cost center into a continuous, first-line resolution layer.
This post walked through how to build real-time voice agents using Stream’s Vision Agents framework and Amazon Bedrock with Amazon Nova 2 Sonic. We covered the architecture, the bidirectional streaming protocol, automatic reconnection handling, function calling, multilingual support, and production deployment.The combination of Stream’s low-latency edge network and Amazon Nova Sonic’s native speech-to-speech capabilities provides a solid foundation for building voice AI applications. The Vision Agents framework abstracts the complex orchestration of connection lifecycle management, audio encoding, VAD-aware reconnection, and tool execution, so you can focus on your agent’s logic and user experience.
If you’re ready to explore further, we encourage you to try extending your agent with custom functions for your specific use case or explore the multilingual capabilities for global applications. The Vision Agents repository at https://github.com/GetStream/Vision-Agents is a good place to start. You’ll find additional examples, plugin documentation, and community discussions. For deeper integration details, the AWS plugin documentation is available at https://visionagents.ai/integrations/aws-bedrock, and the Amazon Nova 2 Sonic documentation in the AWS Nova User Guide provides a comprehensive reference for the bidirectional streaming API. You can sign up for a Stream developer account at https://getstream.io/ and start building for today at no additional cost.
Manuel Rioux est fièrement propulsé par WordPress