Programmatic tool calling (PTC) is a paradigm shift in how large language models (LLMs) interact with external tools. In a traditional tool-calling workflow, each tool invocation requires a full round trip back to the model. The model calls a tool, receives the result, reasons about it, calls the next tool, and so on. For workflows that involve multiple tool calls, this creates compounding latency and token consumption because every intermediate result must pass through the model’s context window.

PTC takes a different approach. Instead of orchestrating tool calls one at a time, the model writes code, typically Python, that invokes multiple tools programmatically within a sandboxed execution environment. The code can include loops, conditionals, filtering, and aggregation logic. The model is only sampled once to produce the code. The execution environment then handles tool invocations, and only the final processed result is returned to the model’s context. This dramatically reduces both latency and token usage for multi-tool workflows. PTC is particularly effective for large data processing, precise numerical calculations, multi-step process orchestration, and privacy-sensitive scenarios where raw data shouldn’t enter the model’s context.

PTC originated as a provider-specific feature, but the underlying pattern (the model generates code, a sandbox executes it, and only the final output returns to the context) is model-agnostic. In this post, we show three ways to implement PTC on Amazon Bedrock: a self-hosted Docker sandbox on Amazon ECS for maximum control, a managed solution using Amazon Bedrock AgentCore Code Interpreter, and an Anthropic SDK-compatible path through a proxy for teams that prefer that developer experience.
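Consider this example: “Which engineering team members exceeded their Q3 travel budget?” With traditional tool calling (assuming no parallel function calling), the model must:
- Call get_team_members to retrieve the engineering team.
- Call get_expenses once for each team member (20 separate calls) to retrieve their Q3 expense records.
- Call get_custom_budget for each member whose approved travel spend exceeds the $5,000 standard budget.
- Filter, sum, and compare the returned records itself, in natural language.
Each of those tool calls requires a full round trip through the model. The model generates a tool call, pauses, receives the result, reasons about it, generates the next tool call, and so on. This creates three compounding problems:
- Latency: every tool call adds another round trip and another model inference pass.
- Token consumption: every intermediate result, including thousands of raw expense records, accumulates in the context window.
- Accuracy: the model must filter and total numeric data in natural language, which is error-prone.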
PTC flips the pattern. The model writes a single Python code block that orchestrates the tool calls, processes the results, and returns only the final output.

Using the same expense audit example, here’s what the model generates when PTC is enabled:
import asyncio
import json
# Step 1: Get team members
team_json = await get_team_members(department="engineering")
team = json.loads(team_json)
# Step 2: Fetch all expense records in parallel
expense_tasks = [
    get_expenses(employee_id=m["id"], quarter="Q3")
    for m in team
]
expenses_results = await asyncio.gather(*expense_tasks)
# Step 3: Filter and check budgets
exceeded = []
for member, exp_json in zip(team, expenses_results):
    expenses = json.loads(exp_json)
    total_travel = sum(
        e["amount"] for e in expenses
        if e["category"] == "travel" and e["status"] == "approved"
    )
    if total_travel > 5000:
        budget_json = await get_custom_budget(user_id=member["id"])
        budget = json.loads(budget_json)
        limit = budget["budget_limit"]
        if total_travel > limit:
            exceeded.append({
                "name": member["name"],
                "spent": total_travel,
                "limit": limit,
                "exceeded_by": total_travel - limit
            })
# Step 4: Only the summary enters the model's context
print(f"{len(exceeded)} members exceeded budget:")
print(json.dumps(exceeded, indent=2))
There are two things to notice here. First, asyncio.gather() issues all 20 expense lookups in parallel rather than sequentially, so the tool calls happen almost simultaneously. Second, the filtering, aggregation, and budget comparison happen in Python, not in natural language. Only the final print() output is returned to the model’s context window; the more than 2,000 raw expense records never reach it. The model is sampled only twice: once to generate the code, and once to interpret the final output. Everything in between (the tool calls, the data processing, the filtering) happens inside the container without additional model inference.
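To try the generated code locally before wiring up a sandbox, you can stub the three tools with async functions over in-memory data. The following is a minimal sketch: the team, expense records, and custom limits are hypothetical test data, not the dataset from our evaluation, and the logic is wrapped in an audit() coroutine so it runs under asyncio.run() instead of relying on the top-level await that a sandbox runner would provide.

```python
import asyncio
import json

# Hypothetical in-memory data standing in for the real tool backends.
TEAM = [{"id": "E1", "name": "Ana"}, {"id": "E2", "name": "Ben"}, {"id": "E3", "name": "Caro"}]
EXPENSES = {
    "E1": [{"amount": 4000, "category": "travel", "status": "approved"},
           {"amount": 2500, "category": "travel", "status": "approved"},
           {"amount": 900, "category": "meals", "status": "approved"}],
    "E2": [{"amount": 6000, "category": "travel", "status": "pending"},
           {"amount": 1200, "category": "travel", "status": "approved"}],
    "E3": [{"amount": 5500, "category": "travel", "status": "approved"}],
}
CUSTOM_LIMITS = {"E1": 6000, "E3": 8000}  # everyone else gets the $5,000 standard


async def get_team_members(department):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return json.dumps(TEAM)


async def get_expenses(employee_id, quarter):
    await asyncio.sleep(0.01)
    return json.dumps(EXPENSES.get(employee_id, []))


async def get_custom_budget(user_id):
    await asyncio.sleep(0.01)
    return json.dumps({"budget_limit": CUSTOM_LIMITS.get(user_id, 5000)})


async def audit():
    """Same logic as the generated code, returning the result instead of printing it."""
    team = json.loads(await get_team_members(department="engineering"))
    # All expense lookups are scheduled at once; gather() preserves input order.
    expense_results = await asyncio.gather(
        *(get_expenses(employee_id=m["id"], quarter="Q3") for m in team)
    )
    exceeded = []
    for member, exp_json in zip(team, expense_results):
        total_travel = sum(e["amount"] for e in json.loads(exp_json)
                           if e["category"] == "travel" and e["status"] == "approved")
        if total_travel > 5000:
            budget = json.loads(await get_custom_budget(user_id=member["id"]))
            limit = budget["budget_limit"]
            if total_travel > limit:
                exceeded.append({"name": member["name"], "spent": total_travel,
                                 "limit": limit, "exceeded_by": total_travel - limit})
    return exceeded


if __name__ == "__main__":
    print(json.dumps(asyncio.run(audit()), indent=2))
```

With this data only Ana is reported: Ben's large travel claim is still pending, and Caro's spend is covered by a custom limit.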
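Managed PTC implementations rely on a provider-managed sandbox environment. There are good reasons to self-host instead: the approach works with any model on Amazon Bedrock that supports tool use, and it gives you full control over the execution environment, including custom Python packages and specific security configurations.
The self-hosted solution has two components: an orchestrator, running as an Amazon ECS task, that calls Amazon Bedrock and executes tool calls on behalf of the sandbox; and a Docker sandbox in which the model-generated code runs.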

The core idea is straightforward: take the tool definitions that normally go in tool_config, inject them into the system prompt instead, and instruct the model to write Python code that orchestrates those tools. The generated code runs in the Docker sandbox. The orchestrator acts as a control plane, intercepting tool calls through IPC, executing them externally, and injecting results back into the sandbox.
The system prompt is the critical piece that makes a model behave as if it supported PTC natively. It describes the execution environment, the available tools, and the rules for generating code. A streamlined version follows:
# Code Execution Environment Description
## Core Function
You can use the `execute_code` tool to run Python code. The code can call
asynchronous tool functions.
{tools_doc}
## Key Rules
### 1. Stateless Environment
- Each `execute_code` call is a fresh environment.
- Variables are not retained between calls.
- All operations must be completed in a single code block.
### 2. Basic Syntax
- Tool calls must use `await`.
- Use `print()` to output results.
- Data processing, filtering, and aggregation are allowed.
## Best Practices
### Correct: One code block completes all tasks
import json
import asyncio
data = await get_orders(days=7)
orders = json.loads(data)
tasks = [get_detail(id=o['id']) for o in orders]
details = await asyncio.gather(*tasks)
for order, detail in zip(orders, details):
    print(f"{order['name']}: {detail}")
### Incorrect: Multiple code blocks
# First execution
data = await get_orders()
# Second execution - NameError: data does not exist
for item in data:
    pass
This prompt guides the model to produce well-structured Python code that follows the same patterns as the native PTC implementation: single code blocks, async tool calls, and print() for output.
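The {tools_doc} placeholder has to be filled from the same tool definitions you would otherwise pass as tools. One way to do that is sketched below, assuming Anthropic-style specs with an input_schema (the format used by TOOLS in the orchestrator); render_tools_doc and build_system_prompt are hypothetical helper names, and the wording of the rendered section is up to you.

```python
# Map JSON Schema types to Python annotations for the prompt text.
JSON_TO_PY = {"string": "str", "integer": "int", "number": "float",
              "boolean": "bool", "array": "list", "object": "dict"}


def render_tools_doc(tools):
    """Render tool specs as async function signatures for the {tools_doc} slot."""
    lines = ["## Available Tools"]
    for tool in tools:
        schema = tool.get("input_schema", {})
        required = set(schema.get("required", []))
        params = []
        for name, spec in schema.get("properties", {}).items():
            annotation = JSON_TO_PY.get(spec.get("type"), "object")
            params.append(f"{name}: {annotation}" if name in required
                          else f"{name}: {annotation} = None")
        # Tools return JSON strings, which the generated code parses with json.loads().
        signature = f"async def {tool['name']}({', '.join(params)}) -> str"
        lines.append(f"- `{signature}`: {tool.get('description', '')}")
    return "\n".join(lines)


def build_system_prompt(template, tools):
    # str.replace rather than str.format: the prompt's own examples contain
    # literal braces such as {detail} that format() would try to substitute.
    return template.replace("{tools_doc}", render_tools_doc(tools))
```

Using str.format here would fail with a KeyError on the f-string example inside the prompt, which is why the sketch substitutes only the one placeholder.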
SandboxExecutor is the central component. It manages the lifecycle of isolated Docker containers, executes model-generated code safely, and handles the IPC protocol for tool calls. The system uses a dual-process architecture: the orchestrator (running in your ECS task) launches a Docker container for each code execution request. Communication happens through standard I/O streams: the container writes tool call requests to stderr, and the orchestrator injects tool results through stdin.
The runner script is dynamically generated by the orchestrator and injected into each Docker container at startup. Among other things, it uses boundary markers (__PTC_TOOL_CALL__, __PTC_END_CALL__, __PTC_OUTPUT__) to separate tool call requests, results, and final output in the text stream.
The runner script supports two execution modes:
To reliably separate different message types in a text stream, the system defines boundary markers:
- __PTC_TOOL_CALL__ / __PTC_END_CALL__: wraps a tool call request (tool name plus arguments as JSON).
- __PTC_OUTPUT__: marks the final output of the code execution.

When the runner script encounters a tool call in the executing code, it serializes the call as JSON, writes it to stderr between the marker boundaries, and blocks on stdin waiting for the result. The orchestrator reads stderr, parses the tool call, executes the tool, and writes the result back to stdin. The runner script unblocks and continues execution.
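The marker exchange described above can be sketched in a few functions. This is an illustration, not the runner script itself: the JSON field names (tool, arguments), the one-message-per-line framing, and the helper names call_tool, parse_tool_call, and extract_output are assumptions. Streams are passed in as parameters so the protocol can be exercised with io.StringIO instead of a real container.

```python
import io
import json
import sys

# Marker names are from the post; the payload fields and line framing are assumptions.
TOOL_CALL_START = "__PTC_TOOL_CALL__"
TOOL_CALL_END = "__PTC_END_CALL__"
OUTPUT_MARKER = "__PTC_OUTPUT__"


def call_tool(name, arguments, err=None, inp=None):
    """Runner side: write a marked tool call to stderr, then block on stdin for the result."""
    err = sys.stderr if err is None else err
    inp = sys.stdin if inp is None else inp
    payload = json.dumps({"tool": name, "arguments": arguments})
    err.write(f"{TOOL_CALL_START}{payload}{TOOL_CALL_END}\n")
    err.flush()
    return inp.readline().rstrip("\n")  # one JSON result per line


def parse_tool_call(line):
    """Orchestrator side: return (name, arguments) for a marked line, else None."""
    line = line.strip()
    if line.startswith(TOOL_CALL_START) and line.endswith(TOOL_CALL_END):
        call = json.loads(line[len(TOOL_CALL_START):-len(TOOL_CALL_END)])
        return call["tool"], call["arguments"]
    return None


def extract_output(text):
    """Orchestrator side: keep only what follows the final-output marker, if present."""
    _, marker, tail = text.partition(OUTPUT_MARKER)
    return tail.strip() if marker else text.strip()


if __name__ == "__main__":
    # Simulate one round trip with in-memory streams instead of a container.
    fake_stderr = io.StringIO()
    fake_stdin = io.StringIO('{"budget_limit": 8000}\n')
    result = call_tool("get_custom_budget", {"user_id": "E001"}, fake_stderr, fake_stdin)
    print(parse_tool_call(fake_stderr.getvalue()), result)
```

Because call_tool blocks while waiting on stdin, async tool stubs built on it would still resolve one at a time under asyncio.gather(); a runner that wants real concurrency would need to tag each request with an ID and read results without blocking the event loop.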
Enabling PTC on Amazon Bedrock requires three elements:
The orchestrator ties together Amazon Bedrock and the Docker sandbox. Here is the core loop:
import boto3
import json
import os
import subprocess
import tempfile
# ── Configuration ──
MODEL_ID = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"
REGION = "us-west-2"
SANDBOX_IMAGE = "ptc-sandbox"
SYSTEM_PROMPT = "..."  # Full system prompt as shown above
TOOLS = [
    {
        "name": "execute_code",
        "description": "Execute Python code in a sandboxed environment.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute."}
            },
            "required": ["code"]
        }
    }
]
# ── Bedrock call ──
def call_bedrock(client, messages):
    """Send the conversation to Bedrock in Anthropic Messages format; return the parsed response."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "system": [{"type": "text", "text": SYSTEM_PROMPT}],
        "tools": TOOLS,
        "messages": messages,
    })
    response = client.invoke_model(
        modelId=MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    return json.loads(response["body"].read())
# ── Sandbox execution ──
def execute_in_sandbox(code):
    """Run code in a hardened Docker container. Returns stdout, or stderr on failure."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write("import json\n" + code)  # make json available to the generated code
        tmp_path = f.name
    try:
        # NamedTemporaryFile creates the file with mode 0600; the non-root "sandbox"
        # user inside the container may otherwise be unable to read the mounted script.
        os.chmod(tmp_path, 0o644)
        result = subprocess.run(
            ["docker", "run", "--rm",
             "--network", "none", "--read-only",
             "--tmpfs", "/tmp:size=64m",
             "--user", "sandbox", "--cap-drop", "ALL",
             "--memory", "256m", "--cpus", "0.5",
             "-v", f"{tmp_path}:/sandbox/user_code.py:ro",
             SANDBOX_IMAGE],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout.strip() if result.returncode == 0 else result.stderr.strip()
    except subprocess.TimeoutExpired:
        return "Error: code execution timed out after 30 seconds."
    finally:
        os.unlink(tmp_path)
# ── PTC orchestration loop ──
client = boto3.client("bedrock-runtime", region_name=REGION)
query = "Which engineering team members exceeded their Q3 travel budget?"
# Step 1: Send user query; the model generates Python code
messages = [{"role": "user", "content": query}]
response = call_bedrock(client, messages)
# Step 2: Extract code from the first tool_use block
code, tool_id = None, None
for block in response["content"]:
    if block["type"] == "tool_use":
        code = block["input"]["code"]
        tool_id = block["id"]
        break
if code is None:
    raise RuntimeError("The model answered without calling execute_code.")
# Step 3: Execute in Docker sandbox
output = execute_in_sandbox(code)
# Step 4: Send sandbox output back as tool_result
messages.append({"role": "assistant", "content": response["content"]})
messages.append({
    "role": "user",
    "content": [{"type": "tool_result", "tool_use_id": tool_id, "content": output}]
})
# Step 5: Model interprets the result and produces final answer
final = call_bedrock(client, messages)
for block in final["content"]:
    if block["type"] == "text":
        print(block["text"])
The orchestrator sends the user query to Amazon Bedrock, extracts the model-generated code from the tool_use response, runs it in the Docker sandbox, and feeds the output back as a tool_result. The model then produces its final human-readable answer, so it is sampled only twice in total.
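The core loop above handles exactly one code execution. If the model needs a second round (for example, after a runtime error in its first attempt), the exchange can be generalized into a loop that continues while stop_reason is "tool_use". The sketch below takes the model call and the sandbox as plain callables, so it works with call_bedrock and execute_in_sandbox or with test doubles; the name run_ptc_loop and the max_turns cap are illustrative choices, not part of the original solution.

```python
def run_ptc_loop(call_model, execute, query, max_turns=5):
    """Alternate model and sandbox until the model stops requesting code execution.

    call_model(messages) -> parsed response dict (content list plus stop_reason)
    execute(code) -> sandbox output string
    """
    messages = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response.get("stop_reason") != "tool_use":
            # Final answer: concatenate the text blocks.
            return "".join(b["text"] for b in response["content"] if b["type"] == "text")
        # Every tool_use block in this turn needs a matching tool_result.
        results = [
            {"type": "tool_result", "tool_use_id": b["id"],
             "content": execute(b["input"]["code"])}
            for b in response["content"] if b["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError(f"No final answer after {max_turns} model turns")
```

With the definitions above, run_ptc_loop(functools.partial(call_bedrock, client), execute_in_sandbox, query) would replace Steps 1 through 5. Each extra turn is one more model sampling, so the max_turns cap bounds both latency and spend.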
The sandbox container runs with strict isolation. Here is an example docker run command that enforces the security layers:
docker run --rm \
  --network none \
  --read-only \
  --tmpfs /tmp:size=64m \
  --user sandbox \
  --cap-drop ALL \
  --memory 256m \
  --cpus 0.5 \
  -v /path/to/code.py:/sandbox/user_code.py:ro \
  ptc-sandbox
These flags provide no network access, a read-only root filesystem (with a small tmpfs for scratch space), a non-root user, no Linux capabilities, and hard memory and CPU limits. Together, these layers are designed to keep model-generated code from escaping the sandbox, persisting data, or consuming excessive resources.
For teams that don’t want to manage Docker containers and ECS infrastructure, Amazon Bedrock AgentCore provides a managed Code Interpreter that implements the same PTC pattern. The model writes code, a managed sandbox executes it, and only the final output returns to the model context. Here is the same architecture, with AgentCore Code Interpreter handling code execution:

The key difference from the self-hosted approach is that tools are pre-loaded into the sandbox session rather than dispatched back to the client through IPC. You start a Code Interpreter session, inject your tool function definitions as Python code, and then let the model generate code that calls those pre-loaded functions directly.
AgentCore uses the bedrock-agentcore boto3 client:
import boto3
import json
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
agentcore = boto3.client("bedrock-agentcore", region_name="us-west-2")
# Start a Code Interpreter session
session = agentcore.start_code_interpreter_session(
    codeInterpreterIdentifier="aws.codeinterpreter.v1",
    name="ptc-tools",
    sessionTimeoutSeconds=900,
)
session_id = session["sessionId"]
# Pre-load tool functions into the sandbox.
# Replace this string with your actual tool function definitions.
tool_functions_code = """
def get_team_members(department):
    # Your implementation here: return a JSON string
    pass
def get_expenses(employee_id, quarter="Q3"):
    # Your implementation here: return a JSON string
    pass
def get_custom_budget(user_id):
    # Your implementation here: return a JSON string
    pass
print("Tools loaded.")
"""
agentcore.invoke_code_interpreter(
    codeInterpreterIdentifier="aws.codeinterpreter.v1",
    sessionId=session_id,
    name="executeCode",
    arguments={"language": "python", "code": tool_functions_code}
)
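Once the tools are loaded, the code the model returns in its tool_use block runs through the same invoke_code_interpreter call, and the session should be stopped when the conversation ends. The sketch below assumes each streamed event carries a result dict with a content list of typed items; verify the event shape and the stop_code_interpreter_session parameters against the boto3 bedrock-agentcore reference before relying on it. Because the stubs above are ordinary synchronous functions, the generated code should call them without await, so the "Tool calls must use await" rule from the self-hosted system prompt would need adjusting for this path.

```python
def collect_text(stream):
    """Concatenate the text items from a Code Interpreter response stream.

    Assumed event shape: {"result": {"content": [{"type": "text", "text": ...}]}}
    """
    chunks = []
    for event in stream:
        for item in event.get("result", {}).get("content", []):
            if item.get("type") == "text":
                chunks.append(item["text"])
    return "".join(chunks)


def run_model_code(agentcore, session_id, code):
    """Run model-generated code in the session that already holds the tool functions."""
    response = agentcore.invoke_code_interpreter(
        codeInterpreterIdentifier="aws.codeinterpreter.v1",
        sessionId=session_id,
        name="executeCode",
        arguments={"language": "python", "code": code},
    )
    return collect_text(response["stream"])


def stop_session(agentcore, session_id):
    """Release the sandbox once the conversation is finished."""
    agentcore.stop_code_interpreter_session(
        codeInterpreterIdentifier="aws.codeinterpreter.v1",
        sessionId=session_id,
    )
```

The string returned by run_model_code is what you send back to the model as the tool_result content, exactly as in the self-hosted orchestrator.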
| Aspect | Self-hosted (Part 1) | AgentCore (Part 2) |
| --- | --- | --- |
| Infrastructure | You manage ECS + Docker | Fully managed |
| Customization | Full control over sandbox | Standard runtime |
| Tool execution | Client-side (IPC) | Inside sandbox |
| Network access | Configurable | Default off, PUBLIC mode available |
The managed approach is recommended for teams that want the token savings and accuracy benefits of PTC without the operational overhead of running Docker containers. The self-hosted approach is better when you need custom Python packages, specific security configurations, or full control over the execution environment.
If your team prefers the Anthropic SDK developer experience and wants to use it with Amazon Bedrock as the backend, you can build a lightweight API translation proxy that sits between the Anthropic SDK and Amazon Bedrock.

The proxy deploys on Amazon ECS and translates Anthropic API calls to Amazon Bedrock InvokeModel calls. It also manages the Docker sandbox lifecycle and the full PTC protocol transparently. To migrate, change base_url to point at the proxy:
import anthropic
# Point the Anthropic SDK at the proxy deployed on ECS.
# The proxy translates these calls to Bedrock InvokeModel under the hood.
client = anthropic.Anthropic(
    api_key="your-proxy-api-key",  # API key configured in the proxy
    base_url="http://your-proxy-url.com"  # Your proxy's ECS endpoint
)
# Define PTC tools in the same format as Anthropic's native PTC API
ptc_tools = [
    {"type": "code_execution_20250825", "name": "code_execution"},
    {
        "name": "get_team_members",
        "description": "Get department team member list",
        "input_schema": {
            "type": "object",
            "properties": {"department": {"type": "string"}},
            "required": ["department"]
        },
        "allowed_callers": ["code_execution_20250825"]
    },
    # Add get_expenses, get_custom_budget similarly
]
response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250929",  # Proxy routes to Bedrock model
    max_tokens=4096,  # required by the Messages API
    betas=["advanced-tool-use-2025-11-20"],
    tools=ptc_tools,
    messages=[{"role": "user", "content": "Which team members exceeded Q3 travel budget?"}]
)
# The proxy handles sandbox execution and tool call interception transparently.
This approach suits teams that prefer the Anthropic SDK interface but want Amazon Bedrock for model inference and the benefits of running within their own AWS account. The proxy handles API translation, sandbox management, and the full PTC protocol.
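To make the proxy's job more concrete, here is a minimal sketch of the request translation step only, assuming the proxy emulates code_execution with the same execute_code tool and system prompt as the self-hosted path. MODEL_MAP, translate_request, and the one-line tool descriptions are hypothetical; a real proxy would also run the sandbox loop, stream responses, and translate Bedrock responses back into the Anthropic format.

```python
import json

# Hypothetical mapping; a real proxy would make this configurable.
MODEL_MAP = {
    "claude-sonnet-4-5-20250929": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
}

# Same client-side tool that the self-hosted orchestrator registers.
EXECUTE_CODE_TOOL = {
    "name": "execute_code",
    "description": "Execute Python code in a sandboxed environment.",
    "input_schema": {
        "type": "object",
        "properties": {"code": {"type": "string", "description": "Python code to execute."}},
        "required": ["code"],
    },
}


def translate_request(req, prompt_template):
    """Map an Anthropic Messages request with PTC tools to (modelId, InvokeModel body)."""
    direct_tools, code_only_tools, ptc_enabled = [], [], False
    for tool in req.get("tools", []):
        if (tool.get("type") or "").startswith("code_execution"):
            ptc_enabled = True  # the server tool is emulated, not forwarded
        elif "allowed_callers" in tool:
            # Callable only from generated code: document it in the prompt instead.
            code_only_tools.append(tool)
        else:
            direct_tools.append(tool)

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": req["max_tokens"],
        "messages": req["messages"],
    }
    if ptc_enabled:
        tools_doc = "\n".join(f"- {t['name']}: {t.get('description', '')}"
                              for t in code_only_tools)
        body["system"] = prompt_template.replace("{tools_doc}", tools_doc)
        direct_tools.append(EXECUTE_CODE_TOOL)
    if direct_tools:
        body["tools"] = direct_tools
    return MODEL_MAP[req["model"]], json.dumps(body)
```

The model name moves into modelId, and the betas list is not forwarded because the proxy provides the feature itself. Merging a caller-supplied system prompt is omitted for brevity.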
To validate the self-hosted PTC solution, we ran the same expense audit task across multiple models available on Amazon Bedrock.
Business setup:
Task prompt: “Which engineering team members exceeded their Q3 travel budget? Standard quarterly travel budget is $5,000. However, some employees have custom budget limits. For anyone who exceeded the $5,000 standard budget, check if they have a custom budget exception.”
Expected correct answer:
| Name | Budget | Actual | Over by |
| --- | --- | --- | --- |
| Alice Chen | $5,000.00 | $9,876.54 | +$4,876.54 |
| Emma Johnson | $5,000.00 | $5,266.02 | +$266.02 |
| Grace Taylor | $5,000.00 | $6,474.46 | +$1,474.46 |
| Model | PTC tokens | Non-PTC tokens | Token reduction | PTC accurate | Non-PTC accurate |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 (adaptive thinking) | 12,739 | 128,043 | 90.1% | Yes | Yes |
| Claude Opus 4.6 (adaptive thinking) | 13,043 | 126,152 | 89.7% | Yes | Yes |
| Qwen3-Coder-480B | 34,159 | 305,114 | 88.8% | Yes | No |
| Qwen3-Next-80B | 28,878 | 233,332 | 87.6% | Yes | No |
| deepseek.v3.2 (thinking) | 19,543 | 245,967 | 92.1% | Yes | No |
| MiniMax M2.1 (thinking) | 11,787 | 101,990 | 88.4% | Yes | No |
| Kimi 2.5 (thinking) | 10,875 | 148,085 | 92.7% | Yes | No |
| GLM 4.7 (thinking) | 11,550 | 115,829 | 90.0% | Yes | No |
Note: Models marked with thinking or adaptive thinking used their respective reasoning modes during code generation.
The key takeaway: PTC as a paradigm isn’t tied to any single model. Through the self-hosted sandbox approach, any model that supports tool use can benefit from code-orchestrated tool calling.
Taking Claude Sonnet 4.6 as an example, the expense audit task showed approximately 90% reduction in token consumption between PTC and non-PTC modes. The reason is straightforward: in non-PTC mode, every intermediate tool result enters the context window. In PTC mode, only the code and the final summary do.
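The reduction column in the results table is the share of tokens that PTC avoided relative to the non-PTC run. The helper below recomputes it for a few rows; only the formula is added here, and the token counts are copied from the table.

```python
# Token totals copied from the results table: (PTC, non-PTC).
RESULTS = {
    "Claude Sonnet 4.6 (adaptive thinking)": (12_739, 128_043),
    "Claude Opus 4.6 (adaptive thinking)": (13_043, 126_152),
    "Qwen3-Coder-480B": (34_159, 305_114),
    "Kimi 2.5 (thinking)": (10_875, 148_085),
}


def token_reduction(ptc_tokens, non_ptc_tokens):
    """Percent of tokens saved by PTC relative to the non-PTC run."""
    return 100.0 * (1.0 - ptc_tokens / non_ptc_tokens)


for model, (ptc, non_ptc) in RESULTS.items():
    print(f"{model}: {token_reduction(ptc, non_ptc):.1f}% fewer tokens")
```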
Cost projection (based on Claude Sonnet pricing of $3/$15 per 1M input/output tokens):
If this task is executed 1,000 times per day in a production environment:
| Metric | Non-PTC mode | PTC mode |
| --- | --- | --- |
| Estimated daily cost | ~$520 | ~$52 |
| Estimated monthly cost | ~$15,600 | ~$1,560 |
| Monthly savings | | ~$14,040 (90%) |
These numbers will vary by task complexity and data volume, but the pattern is consistent: PTC reduces cost roughly in proportion to how much intermediate data it keeps out of the context window.
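The projection depends on how each run's tokens split between input and output, which the results table does not break out. This sketch takes the split as parameters (daily_cost and monthly_cost are hypothetical helpers) and uses the $3/$15 per 1M token prices quoted above with a 30-day month. Pricing every token as input gives a floor of about $384 per day for non-PTC and $38 per day for PTC at 1,000 runs; the higher ~$520 and ~$52 estimates imply that part of each run is billed at the $15 output rate.

```python
def daily_cost(input_tokens, output_tokens, runs_per_day,
               input_price_per_m=3.0, output_price_per_m=15.0):
    """Estimated USD per day; prices are per 1M tokens (defaults: Claude Sonnet)."""
    per_run = (input_tokens * input_price_per_m
               + output_tokens * output_price_per_m) / 1_000_000
    return per_run * runs_per_day


def monthly_cost(daily, days=30):
    """Scale a daily estimate to a month (30 days, matching the table above)."""
    return daily * days


# Lower bound: price every token from the results table as input.
non_ptc_floor = daily_cost(128_043, 0, runs_per_day=1_000)  # about $384
ptc_floor = daily_cost(12_739, 0, runs_per_day=1_000)       # about $38
print(f"Non-PTC floor: ${non_ptc_floor:,.0f}/day, ${monthly_cost(non_ptc_floor):,.0f}/month")
print(f"PTC floor: ${ptc_floor:,.0f}/day, ${monthly_cost(ptc_floor):,.0f}/month")
```

Plug in measured input and output counts for your own workload to replace the floor with a real estimate.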
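Programmatic tool calling represents a shift in how AI agents interact with tools, from conversational, one-at-a-time invocations to code-orchestrated, parallel, filtered execution. The results from our testing confirm the core value proposition: across all eight models tested, PTC reduced token consumption by 87.6% to 92.7%, and every model produced the correct answer with PTC, compared with only the two Claude models without it.
We presented three ways to implement PTC on Amazon Bedrock:
- A self-hosted Docker sandbox on Amazon ECS, for maximum control over the execution environment.
- A managed solution using Amazon Bedrock AgentCore Code Interpreter, with no Docker or ECS infrastructure to operate.
- An Anthropic SDK-compatible proxy on Amazon ECS, for teams that prefer that developer experience.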
All three approaches are model-agnostic, privately deployed within your AWS account, and extensible to new models as they become available on Amazon Bedrock. Amazon Bedrock provides the model inference backend with pay-as-you-go pricing, data sovereignty within your AWS account, and access to a diverse set of models through a single API.