This post is co-written with Abdullahi Olaoye, Akshit Arora, and Eliuth Triana Isaza at NVIDIA.
As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.
In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.
The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.
At its core, NeMo Framework provides model builders with:
By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that integrates smoothly into each developer’s workflow, providing capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.
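As a minimal illustration of the NeMo-Run programming model (a hello-world-style sketch, separate from the training script used later in this post), a plain Python function can be wrapped as a configurable task and executed with a local executor; swapping in a Slurm executor moves the same task to a cluster:

# Minimal NeMo-Run sketch (assumes the nemo_run package is installed)
import nemo_run as run

def add(a: int, b: int) -> int:
    return a + b

# run.Partial captures the function and its arguments as a configurable task
task = run.Partial(add, a=1, b=2)
# Execute locally; a SlurmExecutor could be substituted to run on a cluster
run.run(task, executor=run.LocalExecutor())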
The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:
In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.
The key steps to deploying this solution include:

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.
First, you deploy a SageMaker HyperPod cluster before running the job. But to deploy the cluster, you need to create some prerequisite resources.
Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.
The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.
Use the following steps to deploy the prerequisite resources.
It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack for the prerequisite infrastructure components.

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example, where the NeMo Framework launches training jobs efficiently.
After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.
The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.
$ aws --version
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
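# Note: without the -r flag, jq keeps the surrounding JSON quotes on the value,
# which keeps the generated provisioning_parameters.json below valid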
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
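# (Optional) sanity-check that the generated file parses as valid JSON
$ jq . provisioning_parameters.json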
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
$ aws sagemaker list-clusters --output table
The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster
After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.
$ sinfo
$ squeue
# First, SSH into the cluster head node as the ubuntu user
$ ssh ml-cluster
# SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)
# Exit back to the head node
$ exit
# Exit again to release the salloc allocation above
$ exit
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm
Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.
The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.
To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, the OFI plugin, an updated NCCL, and the NCCL tests) to the NeMo Framework container from the NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.
Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key by following the instructions from NVIDIA. Use the following command to log in to the NGC container registry (nvcr.io). When prompted, use $oauthtoken as the login username and your NGC API key as the password.
$ docker login nvcr.io
You can use the following commands to build the Docker image from the Dockerfile and then create a SquashFS file from it.
$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12
Before continuing, create a Python 3.10 virtual environment and run the venv.sh setup script:
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
$ mkdir -p /fsx/ubuntu/temp/megatron
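# Download the GPT-2 BPE vocabulary and merges files used by the tokenizer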
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges
Run the training script to start the LLaMA pretraining job. The training script, run.py, defines the configuration for a LLaMA 180M-parameter model, defines a Slurm executor, sets up the experiment, and launches it.
The following function defines the model configuration.
def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )
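In run.py, this small configuration is attached to a pretraining recipe. The following is a sketch of how that attachment typically looks in NeMo 2.0 (modeled on the NeMo quickstart; the checkpoint directory, name, and node counts are assumptions, not verbatim excerpts from the script):

# Build a Llama 3 8B pretraining recipe, then swap in the small 180M model
# configuration defined above (paths and names are assumed for illustration)
pretrain_recipe = llm.llama3_8b.pretrain_recipe(
    dir="/fsx/ubuntu/nemo2-sm-hyperpod/checkpoints",  # assumed checkpoint dir
    name="nemo2-sm-hyperpod",
    num_nodes=2,
    num_gpus_per_node=8,
)
pretrain_recipe.model = run.Config(llm.LlamaModel, config=small_llama_cfg())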
The following function defines the Slurm executor.
def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:
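The function body isn’t shown above; the following is a minimal sketch of what it might construct, based on the NeMo-Run SlurmExecutor API (the per-node task count and default mounts are assumptions):

    # Sketch of the executor construction (assumed values; the actual script may differ)
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        nodes=nodes,
        ntasks_per_node=8,  # assumed: one task per GPU
        tunnel=run.SSHTunnel(user=user, host=host, job_dir=remote_job_dir),
        time=time,
        mem="0",
        exclusive=True,
        packager=run.GitArchivePackager(),
    )
    executor.container_image = container_image
    executor.container_mounts = custom_mounts or ["/fsx:/fsx"]
    executor.env_vars = custom_env_vars or {}
    executor.retries = retries
    return executor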
The following code defines and runs the experiment.
with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)
Use the following command to run the training job.
$ python run.py --nodes 2 --max_steps 1000
The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.
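A minimal, hypothetical sketch of how run.py might expose these flags (the actual script may parse them differently):

# Hypothetical CLI wiring for run.py
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nodes", type=int, default=1, help="number of cluster nodes")
parser.add_argument("--max_steps", type=int, default=1000, help="maximum training iterations")
args = parser.parse_args()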
The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .
$ tensorboard --logdir .
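If you prefer to inspect metrics programmatically rather than in the TensorBoard UI, you can read the downloaded event file with TensorBoard’s event accumulator (a sketch; the tag name comes from the screenshot that follows):

# Read scalar metrics from the downloaded event file
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator(".")  # directory containing events.out.tfevents.*
acc.Reload()
print(acc.Tags()["scalars"])  # lists available scalar tags
for event in acc.Scalars("reduced_train_loss"):
    print(event.step, event.value)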
The following is a TensorBoard screenshot for a training job. The reduced_train_loss metric shows a decreasing loss curve over the training steps.


Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following screenshot, the two nodes returned to an idle state.

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.
$ aws sagemaker delete-cluster --cluster-name ml-cluster
$ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
$ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod
Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.
Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.
Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.
Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.
Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.
Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.