The rapid advancement of artificial intelligence (AI) has created unprecedented demand for specialized models capable of complex reasoning tasks, particularly in competitive programming where models must generate functional code through algorithmic reasoning rather than pattern memorization. Reinforcement learning (RL) enables models to learn through trial and error by receiving rewards based on actual code execution, making it particularly well-suited for developing genuine problem-solving capabilities in algorithmic domains.
However, implementing distributed RL training for code generation presents significant infrastructure challenges, such as orchestrating multiple heterogeneous components, coordinating parallel code compilation across nodes, and maintaining fault tolerance for long-running processes. Ray is a framework for distributed workloads that addresses these challenges through a unified system that handles the entire AI pipeline, a GPU-first architecture, and seamless integration with tools like Hugging Face Transformers and PyTorch.
Workloads can be run with Ray framework on SageMaker training jobs by using the Ray on Amazon SageMaker Training jobs solution, which combines Ray’s distributed computing framework with SageMaker’s fully managed infrastructure. This solution automatically handles Ray cluster initialization, multi-node coordination, and distributed resource management, enabling developers to focus on model development while benefiting from SageMaker’s enterprise-grade features.
In this post, we demonstrate how to train CodeFu-7B, a specialized 7-billion parameter model for competitive programming, using Group Relative Policy Optimization (GRPO) with veRL, a flexible and efficient training library for large language models (LLMs) that enables straightforward extension of diverse RL algorithms and seamless integration with existing LLM infrastructure, within a distributed Ray cluster managed by SageMaker training jobs. We walk through the complete implementation, covering data preparation, distributed training setup, and comprehensive observability, showcasing how this unified approach delivers both computational scale and developer experience for sophisticated RL training workloads.
CodeFu-7B-v0.1 is a 7B parameter language model specifically trained for solving Competitive Programming (CP) problems. Built upon the DeepSeek-R1-Distill-Qwen-7B base model, CodeFu demonstrates how reinforcement learning can develop capabilities in algorithmic reasoning and efficient C++ code generation beyond traditional supervised fine-tuning approaches.
The model is trained using problem statements from the DeepMind CodeContest dataset without access to ground-truth solutions during training, forcing it to learn through trial and error based on code execution feedback. This approach enables the development of genuine problem-solving capabilities rather than pattern memorization.
CodeFu is publicly available on HuggingFace and released under the MIT license, making it accessible for researchers and practitioners interested in code generation and algorithmic reasoning. The model’s training methodology demonstrates the potential for applying reinforcement learning techniques to complex reasoning tasks beyond competitive programming.
Ray on Amazon SageMaker Training jobs is a solution that enables distributed data processing and model training using Ray within SageMaker’s managed training environment. The solution provides key capabilities including universal launcher architecture for automatic Ray cluster setup, multi-node cluster management with intelligent coordination, heterogeneous cluster support for mixed instance types, and integrated observability through Ray Dashboard, Prometheus, Grafana, and Amazon CloudWatch integration.
The solution seamlessly integrates with the SageMaker Python SDK using the modern ModelTrainer API. This publicly available solution on GitHub enables developers to use Ray’s distributed computing capabilities while benefiting from SageMaker’s managed infrastructure, making it ideal for complex workloads like reinforcement learning training that require sophisticated distributed coordination and resource management.
The workflow for training CodeFu 7B with veRL and Ray on SageMaker training jobs, as illustrated in the accompanying diagram, consists of the following steps:

This streamlined architecture delivers a fully managed reinforcement learning training experience, enabling developers to focus on model development while SageMaker and Ray handle the complex distributed infrastructure orchestration—within a pay-as-you-go pricing model that bills only for actual compute time.
The following prerequisites must be complete before the notebook can be run:

- A SageMaker AI quota for ml.p4de.24xlarge instances (with 8 x NVIDIA A100 GPUs). You can scale to more p4de.24xlarge instances depending on time-to-train and cost-to-train trade-offs for your use case; P5 instances (with 8 x NVIDIA H100 GPUs) are also supported. On the Service Quotas console, request the following SageMaker AI quota: ml.p4de.24xlarge for training job usage: 2.
- An IAM role with the AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonSSMFullAccess managed policies to give required access to SageMaker AI to run the examples, and the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Note: These permissions grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.
The code example can be found at this GitHub repository.
The data preparation pipeline transforms the raw DeepMind CodeContest dataset into a format suitable for reinforcement learning training. We apply systematic filters to identify suitable problems, removing those with Codeforces ratings below 800 and implementing quality validation checks for missing test cases, malformed descriptions, and invalid constraints.
We categorize problems into three difficulty tiers: Easy (800-1000 rating), Hard (1100-2200 rating), and Expert (2300-3500 rating). This post uses only the Easy dataset for training. Each problem is formatted with two components: a user prompt containing the problem statement, and a reward_model specification with test cases, time limits, and memory constraints. Crucially, the ground_truth field contains no solution code, only test cases, forcing the model to learn through reward signals rather than memorizing solutions.
{
"data_source": "code_contests",
"prompt": [
{
"role": "user",
"content": "Write a C++ solution for this problem: ..."
}
],
"ability": "coding-cp",
"reward_model": {
"style": "rule",
"ground_truth": {
"name": "problem 1",
"public_tests": {
"input": ["test input 1", "test input 2"],
"output": ["expected output 1", "expected output 2"]
},
"private_tests": {
"input": ["private input 1", "private input 2"],
"output": ["private output 1", "private output 2"]
},
"time_limit": 2.0,
"memory_limit_bytes": 268435456,
"cf_rating": 1200
}
}
}
For this post, we provide a pre-processed subset of the Easy difficulty dataset in the code sample to streamline the training example, accessible from the GitHub repository.
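The rating-based filtering and tiering described above can be sketched as follows. This is a simplified illustration; the exact handling of boundary ratings and the quality checks are assumptions, not the repository's actual code:

```python
def difficulty_tier(cf_rating: int):
    """Map a Codeforces rating to the difficulty tiers used in this post.

    Hypothetical sketch: problems rated below 800 are filtered out, and
    boundary handling between tiers is an assumption.
    """
    if cf_rating < 800:
        return None  # removed during the filtering step
    if cf_rating <= 1000:
        return "easy"
    if cf_rating <= 2200:
        return "hard"
    return "expert"
```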
The training process uses Ray to orchestrate the distributed execution and synchronization of vLLM rollout, reward evaluation (code compilation and execution), FSDP model parallelism, and Ulysses sequence parallelism. We set the degree of sequence parallelism to 4 for long-form reasoning and code generation.
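In veRL, the sequence parallel degree is set through the trainer configuration; a fragment like the following would pin it to 4. The key name ulysses_sequence_parallel_size is taken from veRL's FSDP actor configuration and should be verified against the veRL version you use:

```yaml
actor_rollout_ref:
  actor:
    ulysses_sequence_parallel_size: 4  # split long sequences across 4 GPUs
```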
The veRL framework implements a sophisticated multi-component architecture through its main_ppo.py orchestrator, which coordinates three primary distributed worker types: ActorRolloutRefWorker for policy inference and rollouts, CriticWorker for value function estimation, and RewardModelWorker for scoring generated solutions.
The GRPO algorithm enhances traditional proximal policy optimization (PPO) by computing advantages using group-relative baselines, which helps stabilize training by reducing variance in policy gradient estimates.
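The group-relative baseline can be illustrated with a small sketch: for each group of rollouts sampled from the same prompt, advantages are the rewards normalized by that group's own mean and standard deviation, so no learned critic is needed. This is a common GRPO formulation; veRL's exact implementation may differ in details:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Compute group-relative advantages for one prompt's rollout group.

    Each rollout's advantage is its reward minus the group mean, scaled
    by the group standard deviation.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```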
We extended the TinyZero code repository by using Ray to manage and distribute reward function calculation. This enables parallel C++ code compilation and evaluation across the same cluster to address the compute-intensive and latency-bound nature of code execution. The entire pipeline is executed as a SageMaker training job running on ml.p4de.24xlarge instances. The training pipeline consists of the following steps as shown in the following architecture:

The training process orchestration involves several key components implemented across multiple modules. The core veRL training loop is implemented in main_ppo.py, which initializes Ray workers and manages the distributed training process:
@ray.remote
def main_task(config):
    # Initialize tokenizer and download model
    local_path = copy_local_path_from_hdfs(config.actor_rollout_ref.model.path)
    tokenizer = hf_tokenizer(local_path)

    # Define distributed worker roles
    role_worker_mapping = {
        Role.ActorRollout: ray.remote(ActorRolloutRefWorker),
        Role.Critic: ray.remote(CriticWorker),
        Role.RefPolicy: ray.remote(ActorRolloutRefWorker),
    }

    # Initialize reward manager for code execution
    reward_fn = RewardManager(tokenizer=tokenizer, num_examine=0)

    # Create and start trainer
    trainer = RayPPOTrainer(
        config=config,
        tokenizer=tokenizer,
        role_worker_mapping=role_worker_mapping,
        resource_pool_manager=resource_pool_manager,
        reward_fn=reward_fn,
    )
    trainer.init_workers()
    trainer.fit()
The reward evaluation system implements parallel code execution through Ray remote functions, handling C++ compilation and test case execution:
@ray.remote
def process_reward_item(idx, valid_response_length, sequences_str, data_source, reward_model_data):
    # Extract the test cases for this problem
    ground_truth = json.loads(reward_model_data)["ground_truth"]

    # Select appropriate scoring function based on data source
    if data_source == "code_contests":
        compute_score = code_contests.compute_score

    # Compile and execute code against test cases and calculate pass ratio
    score = compute_score(solution_str=sequences_str, ground_truth=ground_truth)
    return idx, score, valid_response_length, sequences_str, data_source
The parallel test case execution system optimizes evaluation efficiency by sampling test cases and using process pools:
def run_test_cases_parallel(
        bin_file: str,
        test_inputs: List[str],
        test_outputs: List[str],
        prob_name: str,
        execution_timeout: float,
        max_test_cases: int = 100,
        max_workers: int = 100) -> Tuple[int, int]:
    # Sample test cases if too many are available
    if len(test_inputs) > max_test_cases:
        random_indices = np.random.choice(len(test_inputs), size=max_test_cases, replace=False)
        test_inputs = [test_inputs[i] for i in random_indices]
        test_outputs = [test_outputs[i] for i in random_indices]

    # Execute test cases in parallel using ProcessPoolExecutor
    args_list = [
        (bin_file, test_input, expected_output, execution_timeout)
        for test_input, expected_output in zip(test_inputs, test_outputs)
    ]
    with ProcessPoolExecutor(max_workers=min(max_workers, len(test_inputs))) as executor:
        results = list(executor.map(_process_test_case, args_list))
    total_matches = sum(results)
    return total_matches, len(test_inputs)
This implementation enables efficient distributed training by separating concerns: the main_ppo.py orchestrator manages Ray worker coordination, while the reward system provides scalable code evaluation through parallel compilation and execution across the SageMaker cluster.
Below is the pseudocode for the reward calculation used in this post to train a competitive programming model. The reward function is the most important part of reinforcement learning because it defines what the model is encouraged to achieve and what it should avoid.

This implementation uses a hierarchical penalty system that first checks for fundamental code execution issues, assigning a severe penalty for non-executable code (-1) and a moderate penalty for compilation failures (-0.5). Extracted code solutions are executed with strict time limit enforcement: code exceeding the problem's specified time limit receives zero reward, reflecting realistic competitive programming conditions. For a successfully executed C++ solution, the reward is a linear function of the fraction of private test cases passed, encouraging the model to solve as many private test cases as possible while avoiding overfitting to publicly visible tests. This design prioritizes code correctness and execution validity, with private test performance serving as the sole signal for learning optimal coding solutions.
def compute_reward(code_output, ground_truth):
    # Handle execution failures (same for both stages)
    if not is_executable(code_output):
        return -1
    if compilation_failed(code_output):
        return -0.5
    if exceeds_time_limit(code_output):
        return 0

    # Primary reward signal: correctness on hidden test cases
    # Run code against private test cases
    passed_private, total_private = run_private_tests(code_output, ground_truth, max_test_cases=1000)
    return passed_private / total_private
Refer to scripts/verl/utils/reward_score/code_contests.py for the complete Python code. Executing generated code in production environments requires appropriate sandboxing. In this controlled demonstration setting, we execute the code as a quick example to evaluate its correctness to assign rewards.
To train CodeFu-7B using veRL and Ray on SageMaker training jobs, we use the ModelTrainer class from the SageMaker Python SDK. Start by setting up the distributed training workload with the following steps:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
sts = boto3.client("sts")

instance_type = "ml.p4de.24xlarge"
instance_count = 2

account_id = sts.get_caller_identity()["Account"]
region = sagemaker_session.boto_session.region_name

repo_name = "codefu-pytorch"
tag = "latest"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"
The training uses a custom Docker container that includes veRL, Ray, and the necessary dependencies for distributed RL training. Refer to the GitHub repository for the complete container definition and build instructions.
The ModelTrainer class provides flexible execution options through its SourceCode configuration, allowing users to customize their training workflows with different frameworks and launchers. Specify either an entry_script for direct Python script execution or use the command parameter for custom execution commands, enabling integration with specialized frameworks such as Ray, Hugging Face Accelerate, or custom distributed training solutions.
...
args = [
    "--entrypoint", "train.py",
    "--config", "/opt/ml/input/data/config/args.yaml",
]

# Define the script to be run with the Ray launcher
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    command=f"python launcher.py {' '.join(args)}",
)
# Define the compute configuration
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=1800,
)

job_name = "train-codefu-verl-ray"
output_path = f"s3://{bucket_name}/{job_name}"

model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=3600 * 24 * 5),
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoint",
        local_path="/opt/ml/checkpoints"
    ),
    environment={
        "RAY_PROMETHEUS_HOST": "<PROMETHEUS_HOST>",
        "RAY_GRAFANA_HOST": "<GRAFANA_HOST>",
        "RAY_PROMETHEUS_NAME": "prometheus",
        "BASE_MODEL": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "RUN_NAME": "sagemaker-training-run",
        ...
    },
    role=get_execution_role(),
).with_remote_debug_config(RemoteDebugConfig(enable_remote_debug=True))
The launcher.py script serves as the universal entry point that detects the SageMaker environment (single-node or multi-node, homogeneous or heterogeneous cluster), initializes the Ray cluster with proper head/worker node coordination, and executes your custom training script. Key launcher.py functionalities are:
- Executing the --entrypoint script (train.py) within the Ray cluster context.
- Configuring RAY_PROMETHEUS_HOST and RAY_GRAFANA_HOST for comprehensive cluster monitoring. For additional information, refer to Ray on SageMaker training jobs – Observability with Prometheus and Grafana.

For the complete implementation of the Ray cluster setup with SageMaker training jobs, refer to launcher.py.
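The environment detection step can be sketched as follows. SM_HOSTS and SM_CURRENT_HOST are environment variables exposed by the SageMaker training toolkit; the head-node convention (first host in sorted order) is an assumption about launcher.py's behavior, not its actual code:

```python
import json
import os

def detect_ray_topology():
    """Determine this container's role in the Ray cluster from the
    SageMaker-provided environment variables."""
    hosts = sorted(json.loads(os.environ.get("SM_HOSTS", '["algo-1"]')))
    current = os.environ.get("SM_CURRENT_HOST", "algo-1")
    head = hosts[0]  # assumed convention: first host becomes the Ray head
    return {
        "head_node": head,
        "current_host": current,
        "is_head": current == head,
        "total_hosts": len(hosts),
    }
```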
The train.py script serves as the actual training orchestrator. For the complete implementation of the entry point script, refer to train.py.
Define the training, validation, and configuration input channels as InputData objects from the S3 bucket paths:
train_input = InputData(
    channel_name="train",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=train_dataset_s3_path,
        s3_data_distribution_type="FullyReplicated",
    ),
)

config_input = InputData(
    channel_name="config",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=train_config_s3_path,
        s3_data_distribution_type="FullyReplicated",
    ),
)
model_trainer.train(
    input_data_config=[train_input, val_input, config_input],
    wait=False
)
The job can be monitored directly from the notebook output or through the SageMaker console, which shows the job status and corresponding CloudWatch logs.

SageMaker training jobs console

SageMaker training jobs system metrics
The launcher.py script orchestrates the Ray cluster initialization through the following automated steps, which can be monitored in real-time through CloudWatch logs:
__main__ - INFO - Entrypoint argument provided: train.py
__main__ - INFO - Set source_dir=, entry_script=train.py
...
__main__ - INFO - Found SageMaker environment with hosts: ...
__main__ - INFO - Current host: algo-1
__main__ - INFO - Configured Prometheus host: <PROMETHEUS_HOST>
__main__ - INFO - Configured Grafana host: <GRAFANA_HOST>
__main__ - INFO - Ray runtime environment contains 137 total environment variables
__main__ - INFO - Ray runtime environment: ...
__main__ - INFO - Homogeneous cluster configuration: 2 total hosts
__main__ - INFO - All hosts: ['algo-1', 'algo-2']
__main__ - INFO - Found multiple hosts, initializing Ray as a multi-node cluster
__main__ - INFO - Head node: algo-1, Current host: algo-1
__main__ - INFO - CPUs for the head node: 192
__main__ - INFO - GPUs for the head node: 8
INFO worker.py:1723 -- Connecting to existing Ray cluster at address: ...
INFO worker.py:1908 -- Connected to Ray cluster. View the dashboard at
__main__ - INFO - All nodes connected to the Ray cluster!
Script path: /opt/ml/input/data/code/train.py
...
__main__ - INFO - Loading and executing Python script using importlib...
After the job completes, the trained model weights and checkpoints will be available in the specified S3 output path, ready for deployment or further evaluation.
The CodeFu training pipeline integrates seamlessly with Managed MLflow on Amazon SageMaker AI, as well as with third-party solutions, for comprehensive experiment tracking and visualization of reinforcement learning metrics.
The following image shows the metrics that are particularly useful to monitor during CodeFu training.

The metrics plot shows a promising GRPO/PPO learning progression for the competitive programming model. The reward signals demonstrate clear improvement: critic/reward/mean rises from -0.8 to 0.6 and critic/reward/min recovers from initial failures (-1.0) to moderate performance (-0.5), while critic/reward/max maintains perfect scores (1.0) throughout training, indicating the model can achieve optimal solutions.

The actor metrics reveal healthy training dynamics: actor/ppo_kl remains low (~0.0002) after an initial spike, confirming stable policy updates, while actor/pg_clipfrac stays in a reasonable range (~0.002-0.004), suggesting appropriately sized learning steps.

The increasing actor/kl_loss trend indicates growing divergence from the reference model, as expected during RL fine-tuning. Most importantly, val/test_score/code_contests shows consistent improvement from -0.6 to ~0.5, and the train-validation comparison reveals good generalization, with both curves tracking closely, indicating the model is learning to solve coding problems effectively without overfitting.
The table below explains key GRPO training metrics and why monitoring each one matters for diagnosing training health and performance:
| Metric | Description | Purpose |
| --- | --- | --- |
| critic/reward/min | Minimum reward achieved on the training set | Detect catastrophic failures: Extremely negative rewards indicate the model is producing poor outputs that need attention |
| critic/reward/mean | Average reward across the training set | Primary progress indicator: Shows overall model performance improvement; should generally trend upward during successful training |
| critic/reward/max | Maximum reward achieved on the training set | Track best-case performance: Shows the model’s peak capability; helps identify if the model can achieve excellent results even if average is low |
| actor/ppo_kl | KL divergence between current and previous policy iteration | Training stability monitoring: High values indicate rapid policy changes that may destabilize training; should stay moderate |
| actor/pg_clipfrac | Fraction of policy updates hitting the clipping boundary | Update aggressiveness gauge: Moderate values indicate healthy learning; too high suggests overly aggressive updates that may destabilize training, too low (e.g. zero) suggests inefficient learning. This is valid only during off-policy PPO updates. |
| actor/kl_loss | KL divergence between current policy and fixed reference model | Reference drift prevention: Helps prevent the model from deviating too far from original behavior; important for maintaining coding capabilities |
| val/test_score/code_contests | Reward/performance on held-out validation set | Generalization check: Most important metric for real performance; detects overfitting and measures true model improvement |
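For intuition on the actor/ppo_kl row, a common way such a metric is estimated is the mean difference of log-probabilities between the old and new policies on the sampled tokens. This is a standard estimator shown for illustration; veRL's exact formula may differ:

```python
import numpy as np

def approx_ppo_kl(logprobs_old, logprobs_new):
    """Simple estimator of KL(old || new) over sampled tokens:
    E[log p_old - log p_new]. Values near zero mean small policy updates."""
    old = np.asarray(logprobs_old, dtype=np.float64)
    new = np.asarray(logprobs_new, dtype=np.float64)
    return float(np.mean(old - new))
```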
To access the Ray Dashboard and enable Grafana visualization during training, establish port forwarding using AWS Systems Manager (SSM). To learn more about setting up Systems Manager, refer to AWS Systems Manager Quick Setup.

First, identify the head node (algo-1 in this example) from the CloudWatch logs:

__main__ - INFO - Found multiple hosts, initializing Ray as a multi-node cluster
__main__ - INFO - Head node: algo-1, Current host: algo-2

Then start port forwarding sessions to the head node for the Ray Dashboard (port 8265) and Grafana (port 8080):

aws ssm start-session --target sagemaker-training-job:train-codefu-verl-ray-20250821185206_algo-1 \
    --region us-east-1 \
    --document-name AWS-StartPortForwardingSession \
    --parameters '{"portNumber":["8265"],"localPortNumber":["8265"]}'

aws ssm start-session --target sagemaker-training-job:train-codefu-verl-ray-20250821185206_algo-1 \
    --region us-east-1 \
    --document-name AWS-StartPortForwardingSession \
    --parameters '{"portNumber":["8080"],"localPortNumber":["<YOUR_LOCAL_PORT>"]}'
Once port forwarding is established, the Ray Dashboard can be accessed at localhost:8265 in your browser, providing detailed insights into:

The integrated Grafana dashboards provide comprehensive visualization of the training metrics, system performance, and cluster health in real-time:

This observability setup is crucial for debugging distributed RL training issues, optimizing resource allocation, and making sure the training process progresses efficiently across the multi-node SageMaker cluster.
To clean up your resources and avoid ongoing charges, follow these steps:
This post demonstrates how to train specialized reasoning models for competitive programming using the Ray on Amazon SageMaker Training jobs solution combined with veRL’s reinforcement learning framework.
The Ray on SageMaker training jobs solution simplifies the complexity of orchestrating distributed RL workloads by automatically handling Ray cluster initialization, multi-node coordination, and resource management across heterogeneous compute environments. This integration enables organizations to use Ray’s advanced distributed computing capabilities—including support for complex multi-component architectures, dynamic resource allocation, and fault-tolerant execution—while benefiting from SageMaker’s fully managed infrastructure, enterprise-grade security, and pay-as-you-go pricing model.
The detailed metrics analysis demonstrated how to monitor training health through reward progression, policy stability indicators, and generalization performance, enabling practitioners to identify optimal training configurations and troubleshoot distributed training issues effectively.
To begin implementing distributed RL training with Ray on SageMaker, visit the Ray on Amazon SageMaker Training jobs GitHub repository for the foundational solution framework. The complete CodeFu-7B training implementation, including veRL integration and configuration examples, is available at this GitHub repository.