With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low-cost framework to run LLMs efficiently in a containerized environment.
In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.
The steps to implement the solution are as follows:
We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.
Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.
In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
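As a quick sanity check before you start, the following sketch reports which command line tools are missing from your PATH. The tool list (aws, eksctl, kubectl, docker) is an assumption based on the commands used later in this post, and missing_tools is a hypothetical helper name, not part of any AWS tooling.

```shell
# Print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper used only for illustration.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Assumed tool list, taken from the commands used in this post.
missing=$(missing_tools aws eksctl kubectl docker)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools are installed."
fi
```

If anything is reported as missing, install it before continuing with the cluster setup.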
If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated with AWS:
export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
Then complete the following steps:
Create a configuration file named eks_cluster.yaml with the following command:
cat > eks_cluster.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
addons:
- name: vpc-cni
  version: latest
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
iam:
  withOIDC: true
EOF
This configuration file contains the following parameters:
The cluster name is set to my-cluster in this example. You can change it to a name of your choice.
The Region is set to us-east-1 in this example. Change this to your desired Region. Because we’re using Inf2 instances, you should choose a Region where those instances are available.
For the vpc-cni add-on, the version value latest will install the latest available version.
After you create the eks_cluster.yaml file, you can create the EKS cluster by running the following command:
eksctl create cluster --config-file eks_cluster.yaml
This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.
During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.
To configure kubectl to connect to the new cluster, run the following code:
aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME
To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:
export ACCELERATED_AMI=$(aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
  --region $AWS_REGION \
  --query "Parameter.Value" \
  --output text)
Create the node group configuration file for eksctl:
cat > eks_nodegroup.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
managedNodeGroups:
  - name: neuron-group
    instanceType: inf2.48xlarge
    desiredCapacity: 1
    volumeSize: 512
    ami: "$ACCELERATED_AMI"
    amiFamily: AmazonLinux2
    iam:
      attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF
Run eksctl create nodegroup --config-file eks_nodegroup.yaml to create the node group. This will take approximately 5 minutes.
To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.
The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.
For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require Inferentia-based worker nodes, such as the Meta Llama 3.1-8B model.
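After the Neuron device plugin is running, one way to confirm that Kubernetes can see the Neuron cores is to list the allocatable aws.amazon.com/neuroncore resource per node. The following is a minimal sketch that assumes kubectl is already configured for your cluster; show_neuroncores is a hypothetical helper name, and the column labels are arbitrary.

```shell
# List each node with its allocatable NeuronCore count.
# show_neuroncores is a hypothetical helper; it only wraps kubectl get nodes.
# The backslashes escape the dots inside the resource name so that
# kubectl treats aws.amazon.com/neuroncore as a single key.
show_neuroncores() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,NEURONCORE:.status.allocatable.aws\.amazon\.com/neuroncore"
}

# Usage (requires access to the cluster):
# show_neuroncores
```

A node that shows no value in the NEURONCORE column most likely does not have the device plugin running yet.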
To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).
export ECR_REPO_NAME=vllm-neuron
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION
Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.
cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
# (the brackets in the search pattern are escaped so they match literally)
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
# Quote the requirement so the shell does not treat > as a redirect
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF
# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .
# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
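The deployment in the next step needs the full URI of the image you just pushed. As a small sketch, the helper below assembles that URI from the same parts used in the docker tag command. The function name ecr_image_uri and the account ID in the example are placeholders, not values from this solution.

```shell
# Build the full ECR image URI from its parts.
# ecr_image_uri is a hypothetical helper; the tag defaults to "latest".
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" tag="${4:-latest}"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Example with a placeholder account ID.
ecr_image_uri 123456789012 us-east-1 vllm-neuron latest
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/vllm-neuron:latest

# With the variables exported earlier in this post:
# export IMAGE_URI=$(ecr_image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" "$ECR_REPO_NAME")
```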
With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:
cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF
Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.
This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.
The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.
To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification along with the aws.amazon.com/neuroncore resource values and the --tensor-parallel-size argument in the container specification. Always verify that the total number of cores used (replicas * cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
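The two checks described above can be sketched as a small shell function. This is an illustration only: check_layout is a hypothetical helper, and the example values assume an inf2.48xlarge instance with 24 Neuron cores and a model with 32 attention heads, which is our understanding for Meta Llama 3.1-8B. Confirm both numbers for your own instance type and model.

```shell
# Check that replicas * cores per replica fits on the hardware and that
# the attention head count divides evenly by the tensor parallel degree.
# Arguments: replicas, TP degree, available cores, attention heads.
check_layout() {
  local replicas="$1" tp="$2" available="$3" heads="$4"
  local needed=$((replicas * tp))
  if [ "$needed" -gt "$available" ]; then
    echo "Need $needed NeuronCores but only $available are available" >&2
    return 1
  fi
  if [ $((heads % tp)) -ne 0 ]; then
    echo "$heads attention heads are not divisible by TP=$tp" >&2
    return 1
  fi
  echo "OK: $replicas replicas x $tp cores = $needed of $available NeuronCores"
}

# The layout used in this post (assumed 24 cores, 32 heads).
check_layout 3 8 24 32
# prints OK: 3 replicas x 8 cores = 24 of 24 NeuronCores
```

With these assumed numbers, a fourth replica would need 32 cores and fail the first check, and a TP degree of 12 would fail the second check because 32 is not divisible by 12.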
The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.
After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:
kubectl get deployments
This will show you the desired, current, and up-to-date number of replicas.
kubectl get pods -l app=neuronx-vllm -w
The -w flag will watch for changes. You’ll see the pods transitioning from "Pending" to "ContainerCreating" to "Running".
kubectl logs <pod-name>
The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.
To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.
spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10
The configuration is comprised of three probes:
These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.
With these probes in place, Kubernetes will do the following:
This configuration helps facilitate high availability and proper functioning of your vLLM deployment.
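To make the startup probe timing concrete, the following sketch computes roughly how long Kubernetes waits before it gives up on a container that never becomes healthy: the initial delay plus failureThreshold consecutive failed checks spaced periodSeconds apart. The function name startup_budget_seconds is hypothetical, and the result is an approximation that ignores probe timeouts.

```shell
# Approximate worst-case wait before the startup probe fails for good:
# initialDelaySeconds + failureThreshold * periodSeconds
startup_budget_seconds() {
  local initial_delay="$1" failure_threshold="$2" period="$3"
  echo $((initial_delay + failure_threshold * period))
}

# Values from the startupProbe shown earlier.
startup_budget_seconds 1800 30 10
# prints 2100
```

With the values used in this post, the budget is about 35 minutes, which leaves headroom over the roughly 15 minutes that the initial model compilation takes. If you lower initialDelaySeconds, consider raising failureThreshold so the total still covers compilation.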
Now you’re ready to access the pods.
List the pods with the neuronx-vllm label:
kubectl get pods -l app=neuronx-vllm
This command will output a list of pods, and you’ll need the name of the pod you want to forward.
Use kubectl port-forward to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:
kubectl port-forward <pod-name> 8000:8000
This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.
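Because the model can take several minutes to compile, the server might not answer right away. The following sketch retries a command until it succeeds or a maximum number of attempts is reached. The helper name retry_until is hypothetical, and the usage example assumes that the port-forward from the previous step is running and that the server exposes the /health endpoint used by the probes.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds, then the command to run.
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the health endpoint every 10 seconds, up to 30 times.
# retry_until 30 10 curl -sf http://localhost:8000/health
```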
Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.
In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.
To test the inference server, send a sample completion request:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "Explain the theory of relativity.",
    "max_tokens": 100
  }'
This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.
For more information about routing, see Route application and HTTP traffic with Application Load Balancers.
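As a starting point for the production setup described above, the following sketch writes a Kubernetes Service manifest of type LoadBalancer that selects all pods with the app: neuronx-vllm label. The Service name, file name, and external port 80 are assumptions for illustration; adapt them, and any load balancer annotations, to your environment.

```shell
# Write a Service manifest that spreads traffic across all pods
# labeled app=neuronx-vllm. The name and external port are assumptions.
cat > neuronx-vllm-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: neuronx-vllm-service
spec:
  type: LoadBalancer
  selector:
    app: neuronx-vllm
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
EOF

# Apply it with:
# kubectl apply -f neuronx-vllm-service.yaml
```

Unlike the port-forward approach, requests sent to the Service are distributed across every replica that passes its readiness probe.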
AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.
These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.
Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.
To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.
As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:
You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:
kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>
Alternatively, you can set up auto scaling:
You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.
Example scaling solutions include:
You should consider the following when scaling:
By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.
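For the deployment side of auto scaling, one possible sketch is a HorizontalPodAutoscaler that targets neuronx-vllm-deployment. The HPA name, the replica bounds, and the 70 percent CPU target are assumptions chosen for illustration. CPU utilization may not reflect load on the Neuron cores well, so a custom metric could be a better signal for your workload. The sketch also assumes that a metrics source such as the Kubernetes Metrics Server is installed.

```shell
# Write a HorizontalPodAutoscaler manifest for the vLLM deployment.
# Name, replica bounds, and CPU target are illustrative assumptions.
cat > neuronx-vllm-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: neuronx-vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: neuronx-vllm-deployment
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF

# Apply it with:
# kubectl apply -f neuronx-vllm-hpa.yaml
```

Keep maxReplicas within what your node group can host. New replicas stay in the Pending state until a node with enough free Neuron cores is available.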
Use the following code to delete the cluster created in this solution:
eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION
Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.
This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.
The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.
For the complete code and detailed implementation steps, visit the GitHub repository.
Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.
Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.
Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.
Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.