When you analyze documents that span millions of characters, you hit the context window barrier: even the largest context windows fall short. Your model either rejects the input or produces answers based on incomplete information. How do you reason over documents that don’t fit?
In this post, you will learn how to implement Recursive Language Models (RLM) using Amazon Bedrock AgentCore Code Interpreter and the Strands Agents SDK. By the end, you will know how to:
Consider a typical financial analysis task of comparing metrics across two years of annual reports from a single company. Each report runs 300–500 pages. Add analyst reports, SEC filings, and supplementary materials, and the total reaches millions of characters.
When you send these documents directly to a model, either the input exceeds the model’s context window limit and the request fails, or the input fits but the model has difficulty attending to information in the middle of long inputs, often referred to as the “lost in the middle” problem.
Both failure modes exist because context window size is a hard limit that prompt engineering alone can’t solve. You need an approach that decouples document size from the model’s context window.
RLMs, introduced by Zhang et al. in arXiv:2512.24601, reframe the problem. Instead of feeding an entire document into the model’s context window, an RLM treats the input as an external environment that the model interacts with programmatically.

Figure 1. Recursive language models operate as an iterative loop: the root LLM generates code to explore the document environment, delegates semantic analysis to sub-LLMs on selected chunks, and accumulates results in working memory before refining the next step.
The model receives only the query and a description of the available environment. It then writes code to search, slice, and analyze the document iteratively. When the model needs semantic understanding of a specific section, it delegates that analysis to a sub-LLM call, keeping the results in working memory as Python variables rather than consuming context window space.
This creates a recursive structure: the root LLM orchestrates the analysis through code, calling sub-LLMs as needed for semantic tasks, while the full document never enters the model’s context window.
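The loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation from the paper or from this post: `run_rlm_loop`, `fake_sub_llm`, and the scripted `steps` list are hypothetical names, and the scripted steps stand in for code that a real root LLM would generate one step at a time after reading the previous output.

```python
import contextlib
import io


def run_rlm_loop(steps, document, llm_query, max_output_chars=500):
    """Toy RLM loop. `steps` stands in for code a root LLM would generate."""
    # The persistent namespace plays the role of the sandbox's working memory.
    namespace = {"context": document, "llm_query": llm_query}
    transcript = []  # what the root LLM would see: truncated outputs only
    for code in steps:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # variables persist across steps
        transcript.append(buffer.getvalue()[:max_output_chars])
    return namespace, transcript


def fake_sub_llm(prompt):
    # Hypothetical stand-in for a real sub-LLM call; returns a canned string.
    return f"summary of {len(prompt)} chars"


document = "intro " * 1000 + "REVENUE: 42 " + "appendix " * 1000
steps = [
    "pos = context.find('REVENUE')\nprint(pos)",          # search
    "section = context[pos:pos + 11]\nprint(section)",    # slice
    "finding = llm_query(section)\nprint(finding)",       # delegate
]
namespace, transcript = run_rlm_loop(steps, document, fake_sub_llm)
```

The document lives only in `namespace["context"]`. The transcript that would be fed back to the root model contains three short strings, regardless of how large the document is.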
Here, we show how to implement RLM using Amazon Bedrock AgentCore Code Interpreter as the execution environment. Amazon Bedrock AgentCore Code Interpreter provides a sandboxed Python runtime with persistent state across executions. The architecture has three components working together.
A root LLM agent, built with the Strands Agents SDK, receives the user’s query and decides what code to execute. An Amazon Bedrock AgentCore Code Interpreter session runs in PUBLIC network mode, with the full document loaded as a Python variable. An llm_query() function injected into the sandbox calls Amazon Bedrock directly from within the Code Interpreter, so sub-LLM results stay in Python variables and don’t flow back into the root LLM’s context window.

Figure 2. RLM architecture using Amazon Bedrock AgentCore Code Interpreter. The root LLM agent iteratively writes and executes Python code in a sandboxed environment where the full input data is pre-loaded. From within the sandbox, the agent can call sub-LLMs via Amazon Bedrock for semantic analysis of specific sections. Intermediate results remain as Python variables in the sandbox, keeping the root LLM’s context window focused on orchestration.
Amazon Bedrock AgentCore Code Interpreter’s PUBLIC network mode supports this by allowing the sandbox to make outbound API calls to Amazon Bedrock. The persistent session state means variables, intermediate results, and extracted data accumulate across multiple code executions, giving the model working memory that persists throughout the analysis.
Follow these steps to set up and run RLM with Amazon Bedrock AgentCore Code Interpreter.
To follow along with this post, you need:
IAM permissions for bedrock:InvokeModel, bedrock-agentcore:StartCodeInterpreterSession, bedrock-agentcore:InvokeCodeInterpreter, and bedrock-agentcore:StopCodeInterpreterSession.
1: Start a Code Interpreter session and load the document
Create an Amazon Bedrock AgentCore Code Interpreter session and write the document into the sandbox:
import boto3
import json

# Assumes code_interpreter_id (your Code Interpreter resource ID) and
# document (the full text to analyze) are already defined.

# Start a Bedrock AgentCore Code Interpreter session
client = boto3.client('bedrock-agentcore', region_name='us-east-1')
response = client.start_code_interpreter_session(
    codeInterpreterIdentifier=code_interpreter_id,
    name="rlm-session",
    sessionTimeoutSeconds=3600
)
session_id = response["sessionId"]

# Write the document to the sandbox
client.invoke_code_interpreter(
    codeInterpreterIdentifier=code_interpreter_id,
    sessionId=session_id,
    name="writeFiles",
    arguments={"content": [{"path": "_context.txt", "text": document}]}
)
2: Initialize the document and define the llm_query() helper inside the sandbox
Inside the sandbox, load the document and define the llm_query() function that sub-LLM calls will use:
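# Runs inside the Bedrock AgentCore Code Interpreter sandbox
import json
import boto3

# Bedrock runtime client for sub-LLM calls (PUBLIC network mode permits the outbound call)
bedrock_client = boto3.client('bedrock-runtime', region_name='us-east-1')
# sub_model_id must be set to the Bedrock model ID of the sub-LLM before llm_query() is called

with open('_context.txt', 'r') as f:
    context = f.read()

def llm_query(prompt: str) -> str:
    """Query a sub-LLM from within the sandbox."""
    response = bedrock_client.invoke_model(
        modelId=sub_model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": prompt}]
        })
    )
    result = json.loads(response['body'].read())
    return result['content'][0]['text']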
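Once llm_query() exists in the sandbox, a common pattern for the root model is to split a large slice into overlapping chunks and ask the sub-LLM the same question about each one. The following is a minimal sketch of that pattern, not code from the original walkthrough: `chunk_text`, `map_chunks`, and `fake_llm_query` are hypothetical helpers, and the fake function replaces the real llm_query() so the example runs without Amazon Bedrock access.

```python
def chunk_text(text, chunk_size=20_000, overlap=500):
    """Split text into overlapping chunks so facts on a boundary are not cut."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def map_chunks(text, question, llm_query, **chunk_kwargs):
    """Ask the sub-LLM the same question per chunk; keep answers in a list."""
    answers = []
    for index, chunk in enumerate(chunk_text(text, **chunk_kwargs)):
        prompt = f"{question}\n\n--- chunk {index} ---\n{chunk}"
        answers.append(llm_query(prompt))
    return answers


def fake_llm_query(prompt):
    # Hypothetical stand-in for llm_query(): echoes the chunk in upper case.
    return prompt.rsplit("\n", 1)[-1].upper()


answers = map_chunks(
    "abcdefghij", "List revenue figures.", fake_llm_query,
    chunk_size=4, overlap=1,
)
```

Because the answers stay in a Python list inside the sandbox, the root model can filter or summarize them with further code instead of reading every answer into its own context window.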
3: Create the Strands Agent and run your query
Create a Strands Agent with a single execute_python tool that runs code in the session, then submit your question:
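from strands import Agent

# rlm_system_prompt (the RLM instructions) and execute_python (a tool that runs
# code in the session from step 1) are assumed to be defined beforehand.
agent = Agent(
    model="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    system_prompt=rlm_system_prompt,
    tools=[execute_python],
)
answer = agent("What are the key revenue trends across these reports?")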
The agent iteratively writes and executes Python code to explore the document, extract relevant sections, and call llm_query() when it needs semantic analysis of specific chunks.
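The execute_python tool itself is not shown in this post, so the following is one possible sketch rather than the authors' implementation. It assumes the `executeCode` tool name with `language` and `code` arguments, and it assumes that the response carries a `stream` of events whose `result.content` items hold text. Verify both against the current AgentCore documentation. `make_execute_python`, `collect_text`, and `FakeClient` are hypothetical names, and the fake client lets the sketch run without AWS access. To hand the function to a Strands agent, you would typically wrap it with the SDK's `tool` decorator.

```python
def collect_text(response):
    """Concatenate text items from the streamed tool result (assumed shape)."""
    parts = []
    for event in response.get("stream", []):
        result = event.get("result", {})
        for item in result.get("content", []):
            if item.get("type") == "text":
                parts.append(item.get("text", ""))
    return "".join(parts)


def make_execute_python(client, code_interpreter_id, session_id, max_chars=4000):
    """Build a function that runs code in the session and truncates output."""

    def execute_python(code: str) -> str:
        """Run Python code in the sandbox session and return its output."""
        response = client.invoke_code_interpreter(
            codeInterpreterIdentifier=code_interpreter_id,
            sessionId=session_id,
            name="executeCode",
            arguments={"language": "python", "code": code},
        )
        output = collect_text(response)
        if len(output) > max_chars:
            dropped = len(output) - max_chars
            output = output[:max_chars] + f"\n[truncated {dropped} chars]"
        return output

    return execute_python


class FakeClient:
    """Stand-in for the boto3 client so the sketch runs offline."""

    def invoke_code_interpreter(self, **kwargs):
        self.last_call = kwargs
        item = {"type": "text", "text": "x" * 10}
        return {"stream": [{"result": {"content": [item]}}]}


fake = FakeClient()
run = make_execute_python(fake, "my-code-interpreter", "session-123", max_chars=4)
output = run("print('x' * 10)")
```

Truncating the returned text is a design choice worth keeping: it bounds how much any single code execution can add to the root model's context, which is the property the RLM pattern depends on.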
In our evaluation, we compare RLM against two baselines: Base and Long Context. In the Base approach, the full document is sent directly to the model in a single API call with a 200K-token context window. This is the most straightforward strategy, but it fails when documents exceed the model’s context window. In the Long Context approach, we use Claude’s extended 1-million-token context window, which handles larger inputs but still has an upper bound and can suffer from the “lost in the middle” problem.
We evaluated this approach on the Financial Multi-Document QA subset of LongBench v2, a benchmark designed to test LLM performance on tasks requiring reasoning across long contexts. This subset contains 15 multiple-choice questions, each requiring analysis across multiple financial reports with context lengths up to approximately 2 million characters.
We report two metrics: success rate, the percentage of questions that the model can process without exceeding input limits or encountering errors, and accuracy, the percentage of correct answers out of the total questions asked (unanswered questions count as incorrect).
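To make the two definitions concrete, the following sketch computes both metrics from per-question outcomes. The counts are illustrative assumptions, not the raw evaluation data: on a 15-question set, 7 processed questions with 5 correct answers would yield a 46.7 percent success rate and 33.3 percent accuracy. The function name `score` and the record fields are hypothetical.

```python
def score(results):
    """results: list of dicts with 'answered' (bool) and 'correct' (bool)."""
    total = len(results)
    answered = sum(1 for r in results if r["answered"])
    # Unanswered questions count as incorrect, so the denominator is `total`.
    correct = sum(1 for r in results if r["answered"] and r["correct"])
    return {
        "success_rate": round(100 * answered / total, 1),
        "accuracy": round(100 * correct / total, 1),
    }


# Illustrative run: 15 questions, 7 processed, 5 of those answered correctly.
results = (
    [{"answered": True, "correct": True}] * 5
    + [{"answered": True, "correct": False}] * 2
    + [{"answered": False, "correct": False}] * 8
)
metrics = score(results)
```

Because accuracy is computed over all questions, an approach that fails on oversized inputs is penalized twice: once in success rate and again in accuracy.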
We compared the three approaches described earlier: Base, Long Context, and RLM. We evaluated RLM with four Claude models serving as the root LLM, with the sub-LLM configured as either the same model or Claude Haiku 4.5 to balance performance and efficiency. We chose Claude Haiku 4.5 as the sub-LLM because it offers lower latency and cost for localized chunk-level analysis, while the root model retains responsibility for global reasoning and orchestration.
Table 1. LongBench v2 Financial Multi-Document QA (15 questions). Human expert accuracy from the LongBench v2 paper. Base results for Claude Sonnet 4.6 and Opus 4.6 are omitted because these models have a default 1 million token context window, making the Base and Long Context approaches equivalent.
| Model | Approach | Success rate | Accuracy |
| --- | --- | --- | --- |
| Claude Haiku 4.5 | Base | 46.7% | 33.3% |
| Claude Haiku 4.5 + Haiku 4.5 | RLM | 100.0% | 66.7% |
| Claude Sonnet 4.5 | Base | 46.7% | 26.7% |
| Claude Sonnet 4.5 | Long Context | 93.3% | 66.7% |
| Claude Sonnet 4.5 + Haiku 4.5 | RLM | 100.0% | 66.7% |
| Claude Sonnet 4.6 | Long Context | 93.3% | 60.0% |
| Claude Sonnet 4.6 + Haiku 4.5 | RLM | 100.0% | 73.3% |
| Claude Opus 4.6 | Long Context | 93.3% | 66.7% |
| Claude Opus 4.6 + Haiku 4.5 | RLM | 100.0% | 80.0% |
| Human Expert | – | – | 40% |
The results reveal three key findings:
The Financial QA evaluation focuses on long-form document reasoning. We next examine generalization to a different domain: code repository understanding, which requires navigating large codebases, resolving function dependencies, and tracing logic across files. This setting is particularly well suited to programmatic exploration through code execution.
To test this, we evaluated on the Code Repository Understanding subset of LongBench v2, which contains 50 multiple-choice questions. Each question provides an entire code repository as context (ranging from around 100K to over 16M characters) and asks about implementation details, API behavior, or architectural decisions that require navigating and understanding the codebase.
The architecture is the same as for Financial QA: the full repository is loaded into the Code Interpreter sandbox as a single context variable. The model writes Python code to search for relevant files, extract function definitions, trace call chains, and use llm_query() to analyze specific code sections.
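As a rough illustration of that kind of exploration, the following sketch indexes Python function definitions in a repository that has been flattened into one string, then finds the lines that call a given function. It rests on assumptions: the `# file:` markers in the toy repository are invented for the example (the benchmark's actual layout may differ), the regular expressions only cover simple Python `def` statements, and `index_functions`, `snippet`, and `callers_of` are hypothetical helper names.

```python
import re

# Matches simple Python function definitions at the start of a line.
DEF_PATTERN = re.compile(r"^[ \t]*def\s+(\w+)\s*\(", re.MULTILINE)


def index_functions(context):
    """Map each function name to the character offsets where it is defined."""
    index = {}
    for match in DEF_PATTERN.finditer(context):
        index.setdefault(match.group(1), []).append(match.start())
    return index


def snippet(context, offset, after=400):
    """Return a bounded window of source starting at `offset`."""
    return context[offset:offset + after]


def callers_of(context, name):
    """Return 1-based line numbers that call `name`, excluding definitions."""
    call = re.compile(r"\b" + re.escape(name) + r"\s*\(")
    hits = []
    for line_no, line in enumerate(context.splitlines(), start=1):
        if call.search(line) and not DEF_PATTERN.match(line):
            hits.append(line_no)
    return hits


toy_repo = (
    "# file: pricing.py\n"
    "def apply_discount(price, rate):\n"
    "    return price * (1 - rate)\n"
    "\n"
    "# file: checkout.py\n"
    "def total(prices):\n"
    "    return sum(apply_discount(p, 0.1) for p in prices)\n"
)
index = index_functions(toy_repo)
call_sites = callers_of(toy_repo, "apply_discount")
```

In an RLM run, the root model would print only the index keys or a short snippet, then pass the relevant window to llm_query() when it needs an explanation of what the code does.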
We evaluated all 50 questions using four Claude models with the same approaches. Based on the Financial QA finding that sub-LLM choice has limited impact for stronger models, we fix the sub-LLM to Claude Haiku 4.5 across RLM runs.
Table 2. LongBench v2 Code Repository Understanding (50 questions).
| Model | Approach | Success rate | Accuracy |
| --- | --- | --- | --- |
| Claude Haiku 4.5 | Base | 30.0% | 20.0% |
| Claude Haiku 4.5 + Haiku 4.5 | RLM | 100.0% | 64.0% |
| Claude Sonnet 4.5 | Base | 30.0% | 20.0% |
| Claude Sonnet 4.5 | Long Context | 60.0% | 46.0% |
| Claude Sonnet 4.5 + Haiku 4.5 | RLM | 100.0% | 76.0% |
| Claude Sonnet 4.6 | Long Context | 60.0% | 42.0% |
| Claude Sonnet 4.6 + Haiku 4.5 | RLM | 100.0% | 66.0% |
| Claude Opus 4.6 | Long Context | 60.0% | 44.0% |
| Claude Opus 4.6 + Haiku 4.5 | RLM | 100.0% | 74.0% |
The results mirror the Financial QA findings: RLM achieves a 100 percent success rate across all models, compared to 30–60 percent for Base and Long Context. Accuracy also improves substantially under RLM, with every model reaching between 64 and 76 percent, up from 20–46 percent under Base and Long Context.
To illustrate how RLM operates in practice, the following is a representative sequence from one of the evaluation questions. The model is asked to compare financial metrics across two annual reports totaling approximately 1.5 million characters.
First, the model searches the context for structural markers to understand the document layout:
import re
matches = re.findall(r'Table of Contents|ANNUAL REPORT', context)
Next, it slices into specific sections to find revenue tables:
revenue_section = context[450000:500000]
print(revenue_section)
For semantic analysis, it delegates to the sub-LLM:
analysis = llm_query(f"Compare these revenue figures: {chunk}")
Finally, it aggregates findings across multiple sections and arrives at a final answer.
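The search, slice, delegate, and aggregate steps can also be combined into one routine that locates sections by keyword instead of by fixed offsets. The following is a sketch under assumptions, not the code the model produced during evaluation: `keyword_windows`, `analyze`, and `fake_llm_query` are hypothetical names, and the fake function stands in for llm_query() so that the example runs offline.

```python
import re


def keyword_windows(context, pattern, radius=2000):
    """Return merged (start, end) windows around each regex match."""
    windows = []
    for match in re.finditer(pattern, context, flags=re.IGNORECASE):
        start = max(0, match.start() - radius)
        end = min(len(context), match.end() + radius)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of duplicating.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def analyze(context, pattern, question, llm_query, radius=2000):
    """Send each window to the sub-LLM and collect findings with their spans."""
    findings = []
    for start, end in keyword_windows(context, pattern, radius):
        answer = llm_query(f"{question}\n\n{context[start:end]}")
        findings.append({"span": (start, end), "answer": answer})
    return findings


def fake_llm_query(prompt):
    # Hypothetical stand-in: counts keyword mentions instead of calling a model.
    return str(prompt.lower().count("revenue"))


toy_context = (
    "x" * 100 + "Total revenue 2023" + "y" * 100
    + "total revenue 2024" + "z" * 1000 + "Revenue by segment"
)
windows = keyword_windows(toy_context, r"revenue", radius=60)
findings = analyze(toy_context, r"revenue", "Extract figures.", fake_llm_query, radius=60)
```

Merging overlapping windows avoids sending the same passage to the sub-LLM twice, and storing each span alongside its answer lets the root model revisit a section later without searching again.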
When adopting RLM for your document analysis workloads, keep the following practical tradeoffs in mind.
To avoid ongoing charges, stop the Amazon Bedrock AgentCore Code Interpreter session when the analysis is complete:
client.stop_code_interpreter_session(
codeInterpreterIdentifier=code_interpreter_id,
sessionId=session_id
)
If you created a dedicated Code Interpreter resource for this walkthrough and no longer need it, you can delete it through the Amazon Bedrock AgentCore console or the AWS CLI.
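One way to make sure that the session is stopped even when the analysis raises an error is to wrap the start and stop calls in a context manager. This is a sketch under assumptions: it reuses the two API calls shown earlier in this post, while `code_interpreter_session` and `FakeClient` are hypothetical names, and the fake client only records calls so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def code_interpreter_session(client, code_interpreter_id,
                             name="rlm-session", timeout_seconds=3600):
    """Start a session, yield its ID, and always stop it afterwards."""
    response = client.start_code_interpreter_session(
        codeInterpreterIdentifier=code_interpreter_id,
        name=name,
        sessionTimeoutSeconds=timeout_seconds,
    )
    session_id = response["sessionId"]
    try:
        yield session_id
    finally:
        # Runs on success and on failure, so the session is not left running.
        client.stop_code_interpreter_session(
            codeInterpreterIdentifier=code_interpreter_id,
            sessionId=session_id,
        )


class FakeClient:
    """Stand-in for the boto3 client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def start_code_interpreter_session(self, **kwargs):
        self.calls.append("start")
        return {"sessionId": "session-123"}

    def stop_code_interpreter_session(self, **kwargs):
        self.calls.append("stop")


fake = FakeClient()
try:
    with code_interpreter_session(fake, "my-code-interpreter") as session_id:
        raise RuntimeError("analysis failed")
except RuntimeError:
    pass  # the session was still stopped by the context manager
```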
Recursive language models offer a practical path to processing documents that exceed model context windows. By combining Amazon Bedrock AgentCore Code Interpreter with the Strands Agents SDK, you can implement RLM to reason over arbitrarily long input data through iterative code execution and sub-LLM calls.
Across our evaluations, the strongest RLM configurations outperformed both baselines: Claude Opus 4.6 with RLM achieves 80.0 percent accuracy on LongBench v2 Financial QA (compared to 66.7 percent for Long Context with a 1-million-token context window and 40 percent for human experts), and Claude Sonnet 4.5 with RLM achieves 76.0 percent on LongBench v2 Code Repository QA (compared to 20.0 percent for Base prompting with a 200K-token context window and 46.0 percent for Long Context).
Tasks that require reasoning over long contexts or large reference libraries can benefit from this pattern, whether it’s financial analysis, code repository understanding, healthcare and life sciences research, legal review, or compliance auditing. If you try this approach on your own document analysis workloads, we want to hear what you build. Share your experience in the comments.
To get started with the approach described in this post, explore the following resources: