Fine-tuning of large language models (LLMs) has emerged as a crucial technique for organizations seeking to adapt powerful foundation models (FMs) to their specific needs. Rather than training models from scratch—a process that can cost millions of dollars and require extensive computational resources—companies can customize existing models with domain-specific data at a fraction of the cost. This approach has become particularly valuable as organizations across healthcare, finance, and technology sectors look to use AI for specialized tasks while maintaining cost-efficiency. However, implementing a production-grade fine-tuning solution presents several significant challenges. Organizations must navigate complex infrastructure setup requirements, enforce robust security measures, optimize performance, and establish reliable model hosting solutions.
In this post, we present a complete solution for fine-tuning and deploying the Llama-3.2-11B-Vision-Instruct model for web automation tasks. We demonstrate how to build a secure, scalable, and efficient infrastructure using AWS Deep Learning Containers (DLCs) on Amazon Elastic Kubernetes Service (Amazon EKS). By using AWS DLCs, you can gain access to well-tested environments that come with enhanced security features and pre-installed software packages, significantly simplifying the optimization of your fine-tuning process. This approach not only accelerates development, but also provides robust security and performance in production environments.
In this section, we explore the key components of our architecture for fine-tuning a Meta Llama model and using it for web task automation. We explore the benefits of different components and how they interact with each other, and how we can use them to build a production-grade fine-tuning pipeline.
At the core of our solution are AWS DLCs, which provide optimized environments for machine learning (ML) workloads. These containers come preconfigured with essential dependencies, including NVIDIA drivers, CUDA toolkit, and Elastic Fabric Adapter (EFA) support, along with preinstalled frameworks like PyTorch for model training and hosting. AWS DLCs tackle the complex challenge of packaging various software components to work harmoniously with training scripts, so you can use optimized hardware capabilities out of the box. Additionally, AWS DLCs implement unique patching algorithms and processes that continuously monitor, identify, and address security vulnerabilities, making sure the containers remain secure and up-to-date. Their pre-validated configurations significantly reduce setup time and reduce compatibility issues that often occur in ML infrastructure setup.
We deploy these DLCs on Amazon EKS, creating a robust and scalable infrastructure for model fine-tuning. Organizations can use this combination to build and manage their training infrastructure with unprecedented flexibility. Amazon EKS handles the complex container orchestration, so you can launch training jobs that run within DLCs on your desired Amazon Elastic Compute Cloud (Amazon EC2) instance, producing a production-grade environment that can scale based on training demands while maintaining consistent performance.
AWS DLCs come with pre-configured support for EFA, enabling high-throughput, low-latency communication between EC2 nodes. An EFA is a network device that you can attach to your EC2 instance to accelerate AI, ML, and high performance computing applications. DLCs are pre-installed with EFA software that is tested and compatible with the underlying EC2 instances, so you don’t have to go through the hassle of setting up the underlying components yourself. For this post, we use setup scripts to create EKS clusters and EC2 instances that will support EFA out of the box.
Our solution uses PyTorch’s built-in support for Fully Sharded Data Parallel (FSDP) training, a cutting-edge technique that dramatically reduces memory requirements during training. Unlike traditional distributed training approaches where each GPU must hold a complete model copy, FSDP shards model parameters, optimizer states, and gradients across workers. The optimized implementation of FSDP within AWS DLCs makes it possible to train larger models with limited GPU resources while maintaining training efficiency.
For more information, see Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2.
For model deployment, we use Amazon Bedrock, a fully managed service for FMs. Although we can use AWS DLCs for model hosting, we use Amazon Bedrock for this post to demonstrate diversity in service utilization.
Finally, we implement the SeeAct agent, a sophisticated web automation tool, and demonstrate its integration with our hosted model on Amazon Bedrock. This combination creates a powerful system capable of understanding visual inputs and executing complex web tasks autonomously, showcasing the practical applications of our fine-tuned model.In the following sections, we demonstrate how to:
You must have the following prerequisites:
AmazonEC2FullAccess, AmazonSageMakerFullAccess, AmazonBedrockFullAccess, AmazonS3FullAccess, AWSCloudFormationFullAccess, AmazonEC2ContainerRegistryFullAccess. For more information about other IAM policies needed, see Minimum IAM policies.Run export AWS_REGION=<region_name> in your bash script from where you are running the commands.
In this section, we walk through the steps to create your EKS cluster and install the necessary plugins, operators, and other dependencies.
The simplest way to create an EKS cluster is to use the cluster configuration YAML file. You can use the following sample configuration file as a base and customize it as needed. Provide the EC2 key pair created as a prerequisite. For more configuration options, see Using Config Files.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: MyCluster
region: us-west-2
managedNodeGroups:
- name: p5
instanceType: p5.48xlarge
minSize: 0
maxSize: 2
desiredCapacity: 2
availabilityZones: ["us-west-2a"]
volumeSize: 1024
ssh:
publicKeyName: <your-ec2-key-pair>
efaEnabled: true
privateNetworking: true
## In case you have an On Demand Capacity Reservation (ODCR) and want to use it, uncomment the lines below.
# capacityReservation:
# capacityReservationTarget:
# capacityReservationResourceGroupARN: arn:aws:resource-groups:us-west-2:897880167187:group/eks_blog_post_capacity_reservation_resource_group_p5
Run the following command to create the EKS cluster:
eksctl create cluster --config-file cluster.yamlThe following is an example output:
YYYY-MM-DD HH:mm:SS [ℹ] eksctl version x.yyy.z
YYYY-MM-DD HH:mm:SS [ℹ] using region <region_name>
...
YYYY-MM-DD HH:mm:SS [✔] EKS cluster "<cluster_name>" in "<region_name>" region is ready
Cluster creation might take between 15–30 minutes. After it’s created, your local ~/.kube/config file gets updated with connection information to your cluster.
Run the following command line to verify that the cluster is accessible:
kubectl get nodes
In this step, you install the necessary plugins, operators and other dependencies on your EKS cluster. This is necessary to run the fine-tuning on the correct node and save the model.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
helm repo add eks https://aws.github.io/eks-charts
git clone -b v0.0.190 https://github.com/aws/eks-charts.git
cd eks-charts/stable
helm install efa ./aws-efa-k8s-device-plugin -n kube-system
cd ../..
aws-efa-k8s-device-plugin-daemonset by running the following command:kubectl delete daemonset aws-efa-k8s-device-plugin-daemonset -n kube-system
git clone https://github.com/aws-samples/aws-do-eks.git
cd aws-do-eks
git checkout f59007ee50117b547305f3b8475c8e1b4db5a1d5
curl -L -o patch-aws-do-eks.tar.gz https://github.com/aws/deep-learning-containers/raw/refs/heads/master/examples/dlc-llama-3-finetuning-and-hosting-with-agent/patch-aws-do-eks.tar.gz
ftar -xzf patch-aws-do-eks.tar.gz
cd patch-aws-do-eks/
git am *.patch
cd ../..
kubectl apply -f aws-do-eks/Container-Root/eks/deployment/etcd/etcd-deployment.yaml
fsx folder:
cd aws-do-eks/Container-Root/eks/deployment/csi/fsx/
fsx.conf file to modify the CLUSTER_NAME, CLUSTER_REGION, and CLUSTER_ZONE values to your cluster specific data:
vi fsx.conf
./deploy.sh
cd aws-do-eks/Container-Root/eks/deployment/kubeflow/training-operator/
./deploy.sh
deploy.sh from the following GitHub repo.cd aws-do-eks/Container-Root/eks/deployment/kubeflow/mpi-operator/
./deploy.sh
This section outlines the process for fine-tuning the Meta Llama 3.2 Vision model using PyTorch FSDP on Amazon EKS. We use the DLCs as the base image to run our training jobs.
Complete the following steps to configure the setup for fine-tuning:
fsdp folder:cd Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp
kubectl apply -f pvc.yaml
Monitor kubectl get pvc fsx-claim and make sure it reached BOUND status. You can then go to the Amazon EKS console to see an unnamed volume created without a name. You can let this happen in the background, but before you run the ./run.sh command to run the fine-tuning job in a later step, make sure the BOUND status is achieved.
HF_TOKEN: Add the Hugging Face token that you generated earlier.S3_LOCATION: Add the Amazon Simple Storage Service (Amazon S3) location where you want to store the fine-tuned model after the training is complete../deploy.sh
This line uses the values in the .env file to generate new YAML files that will eventually be used for model deployment.
./login-dlc.sh
./build.sh
./push.sh
In this step, we use the upstream DLCs and add the training scripts within the image for running the training.
Make sure that you have requested access to the Meta Llama 3.2 Vision model on Hugging Face. Continue to the next step after permission has been granted.
Execute the fine-tuning job:
./run.sh
For our use case, the job took 1.5 hours to complete. The script uses the following PyTorch command that’s defined in the .env file within the fsdp folder:
```
bash
torchrun --nnodes 1 --nproc_per_node 8
recipes/quickstart/finetuning/finetuning.py
--enable_fsdp --lr 1e-5 --num_epochs 5
--batch_size_training 2
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct
--dist_checkpoint_root_folder ./finetuned_model
--dist_checkpoint_folder fine-tuned
--use_fast_kernels
--dataset "custom_dataset" --custom_dataset.test_split "test"
--custom_dataset.file "recipes/quickstart/finetuning/datasets/mind2web_dataset.py"
--run_validation False --batching_strategy padding
```
You can use the ./logs.sh command to see the training logs in both FSDP workers.
After a successful run, logs from fsdp-worker will look as follows:
Sharded state checkpoint saved to /workspace/llama-recipes/finetuned_model_mind2web/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct
Checkpoint Time = 85.3276
Epoch 5: train_perplexity=1.0214, train_epoch_loss=0.0211, epoch time 706.1626197730075s
training params are saved in /workspace/llama-recipes/finetuned_model_mind2web/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct/train_params.yaml
Key: avg_train_prep, Value: 1.0532150745391846
Key: avg_train_loss, Value: 0.05118955448269844
Key: avg_epoch_time, Value: 716.0386156642023
Key: avg_checkpoint_time, Value: 85.34336999000224
fsdp-worker-1:78:5593 [0] NCCL INFO [Service thread] Connection closed by localRank 1
fsdp-worker-1:81:5587 [0] NCCL INFO [Service thread] Connection closed by localRank 4
fsdp-worker-1:85:5590 [0] NCCL INFO [Service thread] Connection closed by localRank 0
I0305 19:37:56.173000 140632318404416 torch/distributed/elastic/agent/server/api.py:844] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0305 19:37:56.173000 140632318404416 torch/distributed/elastic/agent/server/api.py:889] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0305 19:37:56.177000 140632318404416 torch/distributed/elastic/agent/server/api.py:902] Done waiting for other agents. Elapsed: 0.0037238597869873047 seconds
Additionally:
[rank8]:W0305 19:37:46.754000 139970058049344 torch/distributed/distributed_c10d.py:2429] _tensor_to_object size: 2817680 hash value: 9260685783781206407
fsdp-worker-0:84:5591 [0] NCCL INFO [Service thread] Connection closed by localRank 7
I0305 19:37:56.124000 139944709084992 torch/distributed/elastic/agent/server/api.py:844] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0305 19:37:56.124000 139944709084992 torch/distributed/elastic/agent/server/api.py:889] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0305 19:37:56.177000 139944709084992 torch/distributed/elastic/agent/server/api.py:902] Done waiting for other agents. Elapsed: 0.05295562744140625 seconds


After the jobs are complete, the fine-tuned model will exist in the FSx file system. The next step is to convert the model into Hugging Face format and save it in Amazon S3 so you can access and deploy the model in the upcoming steps:kubectl apply -f model-processor.yaml
The preceding command deploys a pod on your instance that will read the model from FSx, convert it to Hugging Face type, and push it to Amazon S3. It takes approximately 8–10 minutes for this pod to run. You can monitor the logs for this using ./logs.sh or kubectl logs -l app=model-processor.
Get the location where your model has been stored in Amazon S3. This is the same Amazon S3 location that was mentioned the .env file in an earlier step. Run the following command (provide the Amazon S3 location):aws s3 cp tokenizer_config.json <S3_LOCATION>://tokenizer_config.json
This is the tokenizer config that is needed by Amazon Bedrock to import Meta Llama models so they work with the Amazon Bedrock Converse API. For more details, see Converse API code samples for custom model import.
For this post, we use the Mind2Web dataset. We have implemented code that has been adapted from the Mind2Web code for fine-tuning. The adapted code is as follows:
git clone https://github.com/meta-llama/llama-cookbook &&
cd llama-cookbook &&
git checkout a346e19df9dd1a9cddde416167732a3edd899d09 &&
curl -L -o patch-llama-cookbook.tar.gz https://raw.githubusercontent.com/aws/deep-learning-containers/master/examples/dlc-llama-3-finetuning-and-hosting-with-agent/patch-llama-cookbook.tar.gz &&
tar -xzf patch-llama-cookbook.tar.gz &&
cd patch-llama-cookbook &&
git config --global user.email "[email protected]" &&
git am *.patch &&
cd .. &&
cat recipes/quickstart/finetuning/datasets/mind2web_dataset.py
After you fine-tune your Meta Llama 3.2 Vision model, you have several options for deployment. This section covers one deployment method using Amazon Bedrock. With Amazon Bedrock, you can import and use your custom trained models seamlessly. Make sure your fine-tuned model is uploaded to an S3 bucket, and it’s converted to Hugging Face format. Complete the following steps to import your fine-tuned Meta Llama 3.2 Vision model:


The process might take 10–15 minutes depending on the model size to complete.
After you import your custom model, you can invoke it using the same Amazon Bedrock API as the default Meta Llama 3.2 Vision model. Just replace the model name with your imported model’s Amazon Resource Name (ARN). For detailed instructions, refer to Amazon Bedrock Custom Model Import.
You can follow the prompt formats mentioned in the following GitHub repo. For example:
What are the steps to build a docker image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Running the agent workload involves using the SeeAct framework and browser automation to start an interactive session with the AI agent and perform the browser operations. We recommend completing the steps in this section on a local machine for browser access.
Clone the customized SeeAct repository, which contains example code that can work with Amazon Bedrock, as well as a couple of test scripts:
git clone https://github.com/OSU-NLP-Group/SeeAct.git
Complete the following steps to set up SeeAct in a local runtime environment:
python3.11 -m venv seacct-python-3-11
source seacct-python-3-11/bin/activate
cd SeeAct
curl -O https://raw.githubusercontent.com/aws/deep-learning-containers/master/examples/dlc-llama-3-finetuning-and-hosting-with-agent/patch-seeact.patch
git checkout 2fdbf373f58a1aa5f626f7c5931fe251afc69c0a
git apply patch-seeact.patch
cd SeeAct/seeact_package
pip install .
pip install -r requirements.txt
pip install -U boto3
playwright install
Make sure you’re using the latest version of Boto3 for these steps.
We added a small Python script to verify the functionality of Playwright, the browser automation tool used by SeeAct:
cd SeeAct/src
python test_playwright.py
You should see a browser launched and closed after a few seconds. You should also see a screenshot being captured in SeeAct/src/example.png showing google.com.

Modify the content of test_bedrock.py. Update the MODEL_ID to be your hosted Amazon Bedrock model ARN and set up the AWS connection.
export AWS_ACCESS_KEY_ID="replace with your aws credential"
export AWS_SECRET_ACCESS_KEY="replace with your aws credential"
export AWS_SESSION_TOKEN="replace with your aws credential"
Run the test:
cd SeeAct
python test_bedrock.py
After a successful invocation, you should see a log similar to the following in your terminal:
The image shows a dog lying down inside a black pet carrier, with a leash attached to the dog's collar.
If the botocore.errorfactory.ModelNotReadyException error occurs, retry the command in a few minutes.
The branch has already added support for BedrockEngine and SGLang for running inference with the fine-tuned Meta Llama 3.2 Vision model. The default option uses Amazon Bedrock inference.
To run the agent workflow, update self.model from src/demo_utils/inference_engine.py at line 229 to your Amazon Bedrock model ARN. Then run the following code:
cd SeeAct/src
python seeact.py -c config/demo_mode.toml
This will launch a terminal prompt like the following code, so you can input the task you want the agent to do:
Please input a task, and press Enter.
Or directly press Enter to use the default task: Find pdf of paper "GPT-4V(ision) is a Generalist Web Agent, if Grounded" from arXiv
Task:
In the following screenshot, we asked the agent to search for the website for DLCs.

Use the following code to clean the resources you created as part of this post:
cd Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp
kubectl delete -f ./fsdp.yaml ## Deletes the training fsdp job
kubectl delete -f ./etcd.yaml ## Deletes etcd
kubectl delete -f ./model-processor.yaml ## Deletes model processing YAML
cd aws-do-eks/Container-Root/eks/deployment/kubeflow/mpi-operator/
./remove.sh
cd aws-do-eks/Container-Root/eks/deployment/kubeflow/training-operator/
./remove.sh
## [VOLUME GETS DELETED] - If you want to delete the FSX volume
kubectl delete -f ./pvc.yaml ## Deletes persistent volume claim, persistent volume and actual volume
To stop the P5 nodes and release them, complete the following steps:
In this post, we presented an end-to-end workflow for fine-tuning and deploying the Meta Llama 3.2 Vision model using the production-grade infrastructure of AWS. By using AWS DLCs on Amazon EKS, you can create a robust, secure, and scalable environment for model fine-tuning. The integration of advanced technologies like EFA support and FSDP training enables efficient handling of LLMs while optimizing resource usage. The deployment through Amazon Bedrock provides a streamlined path to production, and the integration with SeeAct demonstrates practical applications in web automation tasks. This solution serves as a comprehensive reference point for engineers to develop their own specialized AI applications, adapt the demonstrated approaches, and implement similar solutions for web automation, content analysis, or other domain-specific tasks requiring vision-language capabilities.
To get started with your own implementation, refer to our GitHub repo. To learn more about AWS DLCs, see the AWS Deep Learning Containers Developer Guide. For more details about Amazon Bedrock, see Getting started with Amazon Bedrock.
For deeper insights into related topics, refer to the following resources:
Need help or have questions? Join our AWS Machine Learning community on Discord or reach out to AWS Support. You can also stay updated with the latest developments by following the AWS Machine Learning Blog.
Shantanu Tripathi is a Software Development Engineer at AWS with over 4 years of experience in building and optimizing large-scale AI/ML solutions. His experience spans developing distributed AI training libraries, creating and launching DLCs and Deep Learning AMIs, designing scalable infrastructure for high-performance AI workloads, and working on generative AI solutions. He has contributed to AWS services like Amazon SageMaker HyperPod, AWS DLCs, and DLAMIs, along with driving innovations in AI security. Outside of work, he enjoys theater and swimming.
Junpu Fan is a Senior Software Development Engineer at Amazon Web Services, specializing in AI/ML Infrastructure. With over 5 years of experience in the field, Junpu has developed extensive expertise across the full cycle of AI/ML workflows. His work focuses on building robust systems that power ML applications at scale, helping organizations transform their data into actionable insights.
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He helps customers harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Arindam Paul is a Sr. Product Manager in SageMaker AI team at AWS responsible for Deep Learning workloads on SageMaker, EC2, EKS, and ECS. He is passionate about using AI to solve customer problems. In his spare time, he enjoys working out and gardening.
Manuel Rioux est fièrement propulsé par WordPress