This post is co-written with Abdullahi Olaoye, Akshit Arora, and Eliuth Triana Isaza at NVIDIA.
As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.
In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.
The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.
At its core, NeMo Framework provides model builders with:
By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that integrates smoothly into each developer’s workflow, providing capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.
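As a minimal illustration of the NeMo-Run programming model (a hello-world-style sketch, separate from the training script used later in this post), a plain Python function can be wrapped as a configurable task and executed with a local executor; swapping in a Slurm executor moves the same task to a cluster:

# Minimal NeMo-Run sketch (assumes the nemo_run package is installed)
import nemo_run as run

def add(a: int, b: int) -> int:
    return a + b

# run.Partial captures the function and its arguments as a configurable task
task = run.Partial(add, a=1, b=2)
# Execute locally; a SlurmExecutor could be substituted to run on a cluster
run.run(task, executor=run.LocalExecutor())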
The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:
In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.
The key steps to deploying this solution include:

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.
First, you deploy a SageMaker HyperPod cluster before running the job. But to deploy the cluster, you need to create some prerequisite resources.
Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.
The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.
Use the following steps to deploy the prerequisite resources.
It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack for the prerequisite infrastructure components.

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example, where the NeMo Framework launches training jobs efficiently.
After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.
The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.
$ aws --version
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
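# Note: without the -r flag, jq keeps the surrounding JSON quotes on the value,
# which keeps the generated provisioning_parameters.json below valid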
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
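# (Optional) sanity-check that the generated file parses as valid JSON
$ jq . provisioning_parameters.json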
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
$ aws sagemaker list-clusters --output table
The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster
After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.
$ sinfo
$ squeue
# First, SSH into the cluster head node as the ubuntu user
$ ssh ml-cluster
# SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)
# Exit back to the head node
$ exit
# Exit again to release the salloc allocation above
$ exit
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm
Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.
The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.
To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, the OFI plugin, an updated NCCL, and the NCCL tests) to the NeMo Framework container from the NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.
Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key by following the instructions from NVIDIA. Use the following command to log in to the NGC container registry (nvcr.io). When prompted, use $oauthtoken as the login username and your NGC API key as the password.
$ docker login nvcr.io
You can use the following commands to build the Docker image from the Dockerfile and then create a SquashFS file from it.
$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12
Before continuing, create a Python 3.10 virtual environment and run the venv.sh setup script:
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
$ mkdir -p /fsx/ubuntu/temp/megatron
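# Download the GPT-2 BPE vocabulary and merges files used by the tokenizer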
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges
Run the training script to start the LLaMA pretraining job. The training script, run.py, defines the configuration for a LLaMA 180M-parameter model, defines a Slurm executor, sets up the experiment, and launches it.
The following function defines the model configuration.
def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )
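In run.py, this small configuration is attached to a pretraining recipe. The following is a sketch of how that attachment typically looks in NeMo 2.0 (modeled on the NeMo quickstart; the checkpoint directory, name, and node counts are assumptions, not verbatim excerpts from the script):

# Build a Llama 3 8B pretraining recipe, then swap in the small 180M model
# configuration defined above (paths and names are assumed for illustration)
pretrain_recipe = llm.llama3_8b.pretrain_recipe(
    dir="/fsx/ubuntu/nemo2-sm-hyperpod/checkpoints",  # assumed checkpoint dir
    name="nemo2-sm-hyperpod",
    num_nodes=2,
    num_gpus_per_node=8,
)
pretrain_recipe.model = run.Config(llm.LlamaModel, config=small_llama_cfg())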
The following function defines the Slurm executor.
def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:
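The function body isn’t shown above; the following is a minimal sketch of what it might construct, based on the NeMo-Run SlurmExecutor API (the per-node task count and default mounts are assumptions):

    # Sketch of the executor construction (assumed values; the actual script may differ)
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        nodes=nodes,
        ntasks_per_node=8,  # assumed: one task per GPU
        tunnel=run.SSHTunnel(user=user, host=host, job_dir=remote_job_dir),
        time=time,
        mem="0",
        exclusive=True,
        packager=run.GitArchivePackager(),
    )
    executor.container_image = container_image
    executor.container_mounts = custom_mounts or ["/fsx:/fsx"]
    executor.env_vars = custom_env_vars or {}
    executor.retries = retries
    return executor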
The following code defines and runs the experiment.
with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)
Use the following command to run the training job.
$ python run.py --nodes 2 --max_steps 1000
The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.
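A minimal, hypothetical sketch of how run.py might expose these flags (the actual script may parse them differently):

# Hypothetical CLI wiring for run.py
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nodes", type=int, default=1, help="number of cluster nodes")
parser.add_argument("--max_steps", type=int, default=1000, help="maximum training iterations")
args = parser.parse_args()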
The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .
$ tensorboard --logdir .
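If you prefer to inspect metrics programmatically rather than in the TensorBoard UI, you can read the downloaded event file with TensorBoard’s event accumulator (a sketch; the tag name comes from the screenshot that follows):

# Read scalar metrics from the downloaded event file
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator(".")  # directory containing events.out.tfevents.*
acc.Reload()
print(acc.Tags()["scalars"])  # lists available scalar tags
for event in acc.Scalars("reduced_train_loss"):
    print(event.step, event.value)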
The following is a TensorBoard screenshot for a training job. The reduced_train_loss metric shows a decreasing loss curve over the training steps.


Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following screenshot, the two nodes returned to an idle state.

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.
$ aws sagemaker delete-cluster --cluster-name ml-cluster
$ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
$ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod
Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.
Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.
Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.
Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.
Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.