Scaling machine learning (ML) workflows from initial prototypes to large-scale production deployment can be a daunting task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod offers a streamlined solution to this challenge. As teams progress from proof of concept to production-ready models, they often struggle with efficiently managing growing infrastructure and storage needs. This integration addresses these hurdles by providing data scientists and ML engineers with a comprehensive environment that supports the entire ML lifecycle, from development to deployment at scale.
In this post, we walk you through the process of scaling your ML workloads using SageMaker Studio and SageMaker HyperPod.
Implementing the solution consists of the following high-level steps: creating a SageMaker domain and user profile, setting up a SageMaker Studio space with the cluster's FSx for Lustre file system mounted, and connecting to the SageMaker HyperPod cluster to run your training workloads.
This integrated approach not only streamlines the transition from prototype to large-scale training but also enhances overall productivity by maintaining a familiar development experience even as you scale up to production-level workloads.
Complete the following prerequisite steps:
Add a tag to the SageMaker HyperPod cluster with the key hyperpod-cluster-filesystem and, as its value, the ID of the FSx for Lustre file system associated with the cluster. SageMaker Studio needs this tag to mount the FSx for Lustre file system onto JupyterLab and Code Editor spaces. Use the following code snippet to add the tag to an existing SageMaker HyperPod cluster:
aws sagemaker add-tags --resource-arn <cluster_ARN> \
--tags Key=hyperpod-cluster-filesystem,Value=<fsx_id>
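To confirm the tag was applied, you can inspect the cluster's tags (for example, with `aws sagemaker list-tags --resource-arn <cluster_ARN>`). The following is a minimal sketch of checking the returned tag list for the required key; the sample tag list is illustrative, not output from a real cluster:

```python
# Sketch: verify that a HyperPod cluster's tag list contains the
# hyperpod-cluster-filesystem key that Studio needs to mount FSx for Lustre.
# The sample list below stands in for the Tags field of a list-tags response.

def find_filesystem_tag(tags):
    """Return the FSx file system ID from the tag list, or None if absent."""
    for tag in tags:
        if tag.get("Key") == "hyperpod-cluster-filesystem":
            return tag.get("Value")
    return None

sample_tags = [
    {"Key": "team", "Value": "ml-platform"},
    {"Key": "hyperpod-cluster-filesystem", "Value": "fs-0123456789abcdef0"},
]

print(find_filesystem_tag(sample_tags))  # fs-0123456789abcdef0
```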
In the following sections, we outline the steps to create an Amazon SageMaker domain, create a user, set up a SageMaker Studio space, and connect to the SageMaker HyperPod cluster. By the end of these steps, you should be able to connect to a SageMaker HyperPod Slurm cluster and run a sample training workload. To follow the setup instructions, you need to have admin privileges. Complete the following steps:
Create an IAM execution role and allow the sagemaker.amazonaws.com service to assume this role. Attach the following policy to the role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:StartSession",
"ssm:TerminateSession"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateCluster",
"sagemaker:ListClusters"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"sagemaker:DescribeCluster",
"sagemaker:DescribeClusterNode",
"sagemaker:ListClusterNodes",
"sagemaker:UpdateCluster",
"sagemaker:UpdateClusterSoftware"
],
"Resource": "arn:aws:sagemaker:region:account-id:cluster/*"
}
]
}
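A malformed policy document (for example, a missing comma between statements) is rejected only when you try to create or update the role, so it can save a round trip to parse the document locally first. The following is a minimal sketch of such a check; the abbreviated policy text mirrors the document above:

```python
import json

# Sketch: sanity-check an IAM policy document before attaching it to the
# execution role. json.loads raises ValueError on malformed JSON, catching
# issues such as a missing comma between statements.
policy = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["ssm:StartSession", "ssm:TerminateSession"],
     "Resource": "*"},
    {"Effect": "Allow",
     "Action": ["sagemaker:CreateCluster", "sagemaker:ListClusters"],
     "Resource": "*"}
  ]
}
"""

doc = json.loads(policy)
assert doc["Version"] == "2012-10-17"
assert all(s["Effect"] == "Allow" for s in doc["Statement"])
print(f"{len(doc['Statement'])} statements parsed OK")
```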
Add a tag to the IAM role with Tag Key = “SSMSessionRunAs” and Tag Value = “<posix user>”. The POSIX user is the user that is set up on the Slurm head node; Systems Manager uses this user, instead of the default ssm-user account, when it starts a session on the managed node. To enable Run As in Session Manager, complete the following steps:
In the Session Manager preferences, enable Run As support for Linux managed nodes. Session Manager then starts sessions as the operating system user specified in the SSMSessionRunAs tag that you created earlier.
Create the SageMaker domain. The VPC_ID and SUBNET_ID are the same as the SageMaker HyperPod cluster’s VPC and subnet, and the EXECUTION_ROLE_ARN is the role you created earlier:
export DOMAIN_NAME=<domain name>
export VPC_ID=vpc_id-for_hp_cluster
export SUBNET_ID=private_subnet_id
export EXECUTION_ROLE_ARN=execution_role_arn
export FILE_SYSTEM_ID=<fsx_id>
export FILE_SYSTEM_PATH=<file_system_path>
export USER_UID=10000
export USER_GID=1001
export REGION=us-east-2
cat > user_settings.json << EOL
{
"ExecutionRole": "$EXECUTION_ROLE_ARN",
"CustomPosixUserConfig":
{
"Uid": $USER_UID,
"Gid": $USER_GID
},
"CustomFileSystemConfigs":
[
{
"FSxLustreFileSystemConfig":
{
"FileSystemId": "$FILE_SYSTEM_ID",
"FileSystemPath": "$FILE_SYSTEM_PATH"
}
}
]
}
EOL
aws sagemaker create-domain \
--domain-name $DOMAIN_NAME \
--vpc-id $VPC_ID \
--subnet-ids $SUBNET_ID \
--auth-mode IAM \
--default-user-settings file://user_settings.json \
--region $REGION
The UID and GID in the preceding configuration default to 10000 and 1001; override them to match the user created in Slurm, because this UID/GID is used to grant permissions on the FSx for Lustre file system. Also, setting this at the domain level gives every user the same UID. To give each user a separate UID, set CustomPosixUserConfig when creating the user profile instead.
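To give each Slurm user a distinct UID, you can generate a per-profile settings document and pass it when creating each user profile. The following is a minimal sketch; the user-to-UID mapping is illustrative, and the field names mirror the domain-level JSON above:

```python
import json

# Sketch: build per-user settings so each Studio profile gets its own POSIX
# UID matching the corresponding Slurm user. The mapping below is
# illustrative; replace it with the users actually created on the head node.
slurm_users = {"alice": 10001, "bob": 10002}
GID = 1001  # shared group, matching the domain-level default above

def user_settings(uid, gid=GID):
    return {"CustomPosixUserConfig": {"Uid": uid, "Gid": gid}}

for name, uid in slurm_users.items():
    with open(f"user_settings_{name}.json", "w") as f:
        json.dump(user_settings(uid), f, indent=2)
    # Each file can then be passed to create-user-profile, for example:
    #   aws sagemaker create-user-profile ... \
    #     --user-settings file://user_settings_<name>.json
```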
Attach the security group SecurityGroupIdForInboundNfs, created as part of domain creation, to all ENIs of the FSx for Lustre volume. The security group is named inbound-nfs-<domain-id> and can be found on the domain’s Network tab. To attach it manually, call fsx:DescribeFileSystems to find the ENIs of the FSx for Lustre volume, and then attach the domain’s SecurityGroupIdForInboundNfs to each of them. Alternatively, you can use the following script to automatically find and attach the security group to the ENIs associated with the FSx for Lustre volume. Replace the REGION, DOMAIN_ID, and FSX_ID attributes accordingly.
#!/bin/bash
export REGION=us-east-2
export DOMAIN_ID=d-xxxxx
export FSX_ID=fs-xxx
export EFS_ID=$(aws sagemaker describe-domain --domain-id $DOMAIN_ID --region $REGION --query 'HomeEfsFileSystemId' --output text)
export MOUNT_TARGET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID --region $REGION --query 'MountTargets[0].MountTargetId' --output text)
export EFS_SG=$(aws efs describe-mount-target-security-groups --mount-target-id $MOUNT_TARGET_ID --query 'SecurityGroups[0]' --output text)
echo "security group associated with the Domain $EFS_SG"
echo "Adding security group to FSxL file system ENI's"
# Get the network interface IDs associated with the FSx file system
NETWORK_INTERFACE_IDS=$(aws fsx describe-file-systems --file-system-ids $FSX_ID --query "FileSystems[0].NetworkInterfaceIds" --output text)
# Iterate through each network interface and attach the security group
for ENI_ID in $NETWORK_INTERFACE_IDS; do
aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --groups $EFS_SG
echo "Attached security group $EFS_SG to network interface $ENI_ID"
done
Without this step, application creation will fail with an error.
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export REGION=us-east-2
aws sagemaker create-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name $USER_PROFILE_NAME \
--region $REGION
Create a space using the FSx for Lustre file system with the following code:
export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export FILE_SYSTEM_ID=fs-xxx
export REGION=us-east-2
aws sagemaker create-space --domain-id $DOMAIN_ID \
--space-name $SPACE_NAME \
--space-settings "AppType=JupyterLab,CustomFileSystems=[{FSxLustreFileSystem={FileSystemId=$FILE_SYSTEM_ID}}]" \
--ownership-settings OwnerUserProfileName=$USER_PROFILE_NAME --space-sharing-settings SharingType=Private \
--region $REGION
Create an application using the space with the following code:
export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export APP_NAME=test-app
export INSTANCE_TYPE=ml.t3.medium
export REGION=us-east-2
export IMAGE_ARN=arn:aws:sagemaker:us-east-2:081975978581:image/sagemaker-distribution-cpu
aws sagemaker create-app --space-name $SPACE_NAME \
--resource-spec "{\"InstanceType\":\"$INSTANCE_TYPE\",\"SageMakerImageArn\":\"$IMAGE_ARN\"}" \
--domain-id $DOMAIN_ID --app-type JupyterLab --app-name $APP_NAME --region $REGION
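Shell quoting around the inline --resource-spec JSON is easy to get wrong (single quotes, for example, prevent variable expansion). One way to avoid that class of error is to build the JSON programmatically; the following is a small sketch of that approach, with illustrative values:

```python
import json

# Sketch: build the --resource-spec argument for create-app from variables,
# sidestepping shell-quoting pitfalls. Values are illustrative.
instance_type = "ml.t3.medium"
image_arn = (
    "arn:aws:sagemaker:us-east-2:081975978581:image/sagemaker-distribution-cpu"
)

resource_spec = json.dumps(
    {"InstanceType": instance_type, "SageMakerImageArn": image_arn}
)
print(resource_spec)
# The resulting string can then be passed to the CLI, for example:
#   aws sagemaker create-app ... --resource-spec "$RESOURCE_SPEC"
```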
You should now have everything ready to access the SageMaker HyperPod cluster using SageMaker Studio. Complete the following steps:
Here you can view the SageMaker HyperPod clusters available in the account.

You can also preview the cluster by choosing the arrow icon.


You can also go to the Settings and Details tabs to find more information about the cluster.


You can also launch either JupyterLab or Code Editor, which mounts the cluster FSx for Lustre volume for development and debugging.
The Cluster Filesystem column identifies which space has the cluster file system mounted.

This should launch JupyterLab with the FSx for Lustre volume mounted. By default, you should see the getting started notebook in your home folder, which has step-by-step instructions to run a Meta Llama 2 training job with PyTorch FSDP on the Slurm cluster. This example notebook demonstrates how you can use SageMaker Studio notebooks to transition from prototyping your training script to scaling up your workloads across multiple instances in the cluster environment. Additionally, you should see the FSx for Lustre file system you mounted to your JupyterLab space under /home/sagemaker-user/custom-file-systems/fsx_lustre.

You can go to SageMaker Studio and choose the cluster to view a list of tasks currently in the Slurm queue.

You can choose a task to get additional task details such as the scheduling and job state, resource usage details, and job submission and limits.

You can also perform actions such as release, requeue, suspend, and hold on these Slurm tasks using the UI.
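Besides the Studio UI, you can inspect the queue from a JupyterLab terminal connected to the head node, for example by parsing `squeue` output. The following is a minimal sketch assuming a fixed four-column output format; the sample text is illustrative, not output from a real cluster:

```python
# Sketch: parse squeue output into dicts so running jobs can be inspected
# programmatically. The sample below stands in for the output of something
# like `squeue --noheader -o "%i %j %T %M"` on the Slurm head node.
sample_output = """\
42 llama2-fsdp RUNNING 1:23:45
43 llama2-fsdp PENDING 0:00
"""

def parse_squeue(text):
    """Parse whitespace-separated job lines into a list of dicts."""
    jobs = []
    for line in text.strip().splitlines():
        job_id, name, state, elapsed = line.split()
        jobs.append({"id": job_id, "name": name,
                     "state": state, "elapsed": elapsed})
    return jobs

jobs = parse_squeue(sample_output)
print([j["state"] for j in jobs])  # ['RUNNING', 'PENDING']
```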

Complete the following steps to clean up your resources:
aws --region <REGION> sagemaker delete-space \
--domain-id <DomainId> \
--space-name <SpaceName>
aws --region <REGION> sagemaker delete-user-profile \
--domain-id <DomainId> \
--user-profile-name <UserProfileName>
To retain the home EFS volume when you delete the domain, set the retention policy to HomeEfsFileSystem=Retain. Otherwise, delete the domain and its EFS volume as follows:
aws --region <REGION> sagemaker delete-domain \
--domain-id <DomainId> \
--retention-policy HomeEfsFileSystem=Delete
In this post, we explored an approach to streamline your ML workflows using SageMaker Studio. We demonstrated how you can seamlessly transition from prototyping your training script within SageMaker Studio to scaling up your workload across multiple instances in a cluster environment. We also explained how to mount the cluster FSx for Lustre volume to your SageMaker Studio spaces to get a consistent reproducible environment.
This approach not only streamlines your development process but also allows you to initiate long-running jobs on the clusters and conveniently monitor their progress directly from SageMaker Studio.
We encourage you to try this out and share your feedback in the comments section.
Special thanks to Durga Sury (Sr. ML SA), Monidipa Chakraborty (Sr. SDE), and Sumedha Swamy (Sr. Manager PMT) for their support to the launch of this post.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Pooja Karadgi is a Senior Technical Product Manager at Amazon Web Services. At AWS, she is a part of the Amazon SageMaker Studio team and helps build products that cater to the needs of administrators and data scientists. She began her career as a software engineer before making the transition to product management. Outside of work, she enjoys crafting travel planners in spreadsheets, in true MBA fashion. Given the time she invests in creating these planners, it’s clear that she has a deep love for traveling, alongside a strong passion for hiking.