Enabling GPU Nodes for PyTorch Workloads on EKS with Autoscaling
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that lets you run Kubernetes on AWS without installing, operating, and maintaining your own control plane or nodes. Machine learning workloads that need GPU acceleration, such as PyTorch training, require GPU-backed nodes. This article walks through setting up GPU nodes for PyTorch workloads on EKS with autoscaling.
Prerequisites
- Basic knowledge of Kubernetes, EKS, and Terraform.
- AWS CLI and kubectl installed and configured.
- Docker installed.
- Helm installed.
1. Setting up the EKS Cluster with GPU Nodes using Terraform
Before applying the Terraform configuration:
- Ensure you have the AWS provider configured in your Terraform setup.
- Initialize the Terraform directory using `terraform init`.
Here's a snippet from a Terraform configuration that sets up an EKS cluster and a self-managed GPU node group:
```hcl
# Based on https://github.com/aws-ia/terraform-aws-eks-blueprints
# ... [other Terraform code]

# Cluster Configuration
module "eks" {
  # ... [other configuration]
  self_managed_node_groups = {
    gpu_node_group = {
      node_group_name = "gpu-node-group"
      ami_type        = "AL2_x86_64_GPU"
      capacity_type   = "ON_DEMAND"
      instance_types = [
        "g4dn.xlarge",
        "g4dn.2xlarge",
      ]
      # ... [other configuration]
      taints = {
        dedicated = {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
      # ... [other configuration]
    }
  }
}
```
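The `NO_SCHEDULE` taint above keeps ordinary pods off the GPU nodes: only pods that explicitly tolerate it can be scheduled there. For reference, the matching pod-spec fragment looks like this (the deployment in section 4 carries the same toleration):

```yaml
# Pod-spec fragment: tolerates the nvidia.com/gpu=true:NoSchedule taint
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```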
2. Building the PyTorch Container with GPU Support
To run PyTorch workloads on the GPU nodes, you need a container image with the necessary dependencies.

Your `requirements.txt` should contain:

```
torch
torchvision
```

Here's a Dockerfile that sets up a PyTorch environment with GPU support:
Your Dockerfile should contain:
```dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

WORKDIR /app

# Install Python and pip
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install requirements
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt

COPY app.py ./

# Run app
CMD ["python3", "./app.py"]
```
Your `app.py` can be as follows:
```python
import torch
import time

def main():
    # Check if CUDA (GPU support) is available
    if not torch.cuda.is_available():
        print("GPU not available. Exiting.")
        return

    # Set the device to GPU
    device = torch.device("cuda:0")

    iteration = 0
    while True:
        iteration += 1
        # Create two random tensors
        a = torch.randn(1000, 1000, device=device)
        b = torch.randn(1000, 1000, device=device)

        # Perform a matrix multiplication on GPU
        c = torch.matmul(a, b)
        print(f"Iteration {iteration}: Matrix multiplication completed on GPU!")
        print(f"Result tensor shape: {c.shape}")

        # Pause for a short duration before the next iteration
        time.sleep(2)

if __name__ == "__main__":
    main()
```
3. Pushing PyTorch GPU App to AWS ECR
Prerequisites:
- Docker installed and running.
- AWS CLI installed and configured with the necessary permissions.
Steps:

- **Set Environment Variables:**
  Set your AWS account ID and region as environment variables:

  ```shell
  export AWS_ACCOUNT=<your-account-id>
  export AWS_REGION=<your-region>
  ```

  Replace `<your-account-id>` with your AWS account ID and `<your-region>` with your AWS region (e.g., `us-west-1`).
- **Docker Build:**
  Navigate to the directory containing your Dockerfile and `app.py`, then build the Docker image:

  ```shell
  docker buildx build --platform linux/amd64 -t pytorch-gpu-app .
  ```
- **Create an ECR Repository (if you haven't already):**

  ```shell
  aws ecr create-repository --repository-name pytorch-gpu-app --region $AWS_REGION
  ```
- **Authenticate Docker to the ECR Registry:**

  ```shell
  aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com
  ```
- **Tag the Docker Image:**

  ```shell
  docker tag pytorch-gpu-app:latest $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/pytorch-gpu-app:latest
  ```
- **Push the Docker Image to ECR:**

  ```shell
  docker push $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/pytorch-gpu-app:latest
  ```
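The tag and push steps above all use the same ECR image URI, which follows a fixed pattern: `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. A minimal sketch assembling it from the environment variables (the account ID and region below are example placeholders, not real values):

```shell
# Example placeholder values; in practice these come from the exports above.
AWS_ACCOUNT=123456789012
AWS_REGION=us-west-1

# Assemble the full ECR image URI used by `docker tag` and `docker push`.
ECR_URI="$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/pytorch-gpu-app:latest"
echo "$ECR_URI"
```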
3.5 Creating an imagePullSecret for AWS ECR
Before deploying your PyTorch workload on EKS, if your Docker image is stored in a private AWS ECR repository, you'll need to create a Kubernetes secret (an `imagePullSecret`) to allow your EKS nodes to pull the image.

- Retrieve an authentication token to authenticate your Docker client to your registry:

  ```shell
  TOKEN=$(aws ecr get-login-password --region $AWS_REGION)
  ```
- Create the `imagePullSecret`:

  ```shell
  kubectl create secret docker-registry ecr-secret \
    --docker-server=$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com \
    --docker-username=AWS \
    --docker-password="${TOKEN}"
  ```

  You should see `secret/ecr-secret created`. This uses the `AWS_ACCOUNT` and `AWS_REGION` environment variables set earlier.
4. Deploying the PyTorch Workload on EKS
Before deploying our workloads, we need to enable the NVIDIA k8s-device-plugin, which advertises the `nvidia.com/gpu` resource on each GPU node to the scheduler:

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```
To deploy the PyTorch workload on the GPU nodes in EKS, apply the following `deployment.yaml` with `kubectl apply -f deployment.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch
  template:
    metadata:
      labels:
        app: pytorch
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        NodeGroup: gpu-node-group
      imagePullSecrets:
        - name: ecr-secret
      containers:
        - name: pytorch
          image: <your-account-id>.dkr.ecr.<your-region>.amazonaws.com/pytorch-gpu-app:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 2
              memory: 2Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 4
              memory: 4Gi
              nvidia.com/gpu: 1
```
5. Installing the Cluster Autoscaler
The Cluster Autoscaler automatically adjusts the size of the cluster, adding or removing nodes based on resource requirements and constraints. To install the Cluster Autoscaler on your EKS cluster, you can use the Helm package manager.
Ensure that the necessary RBAC roles and permissions are set up for the Cluster Autoscaler.
First, add the Cluster Autoscaler Helm repository:

```shell
$ helm repo add autoscaler https://kubernetes.github.io/autoscaler
```
Using Autodiscovery
This method allows the Cluster Autoscaler to discover node groups automatically from their Auto Scaling group tags:

```shell
$ helm install my-release autoscaler/cluster-autoscaler \
    --set 'autoDiscovery.clusterName'=<CLUSTER NAME>
```

Replace `<CLUSTER NAME>` with the name of your EKS cluster.
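The chart exposes more settings than the single flag above. A sketch of a `values.yaml` with a few commonly tuned values (the cluster name and region are placeholders; verify the exact keys against the chart version you install):

```yaml
# values.yaml sketch for the autoscaler/cluster-autoscaler chart
autoDiscovery:
  clusterName: my-eks-cluster    # placeholder: your EKS cluster name
awsRegion: us-west-1             # placeholder: region the cluster runs in
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-system-pods: false
```

You would then install with `helm install my-release autoscaler/cluster-autoscaler -f values.yaml`.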
6. Scaling Up GPU Nodes in EKS
Scaling up the number of GPU nodes in your EKS cluster ensures that you have sufficient resources to handle increased workloads. Here's how you can scale up the GPU nodes:
Update the Terraform Configuration:
Modify your existing Terraform configuration to increase the desired number of nodes in the gpu_node_group. For instance, if you initially set up 2 nodes and want to scale up to 4, you'd update the desired_capacity attribute.
```hcl
self_managed_node_groups = {
  gpu_node_group = {
    # ... [other configuration]
    desired_capacity = 4
    min_size         = 2
    max_size         = 6
    # ... [other configuration]
  }
}
```
The min_size and max_size attributes define the minimum and maximum number of nodes in the node group, respectively; the Cluster Autoscaler will only add or remove nodes within these bounds. Adjust these values based on your requirements.
Apply the Updated Configuration:
Run the following command to apply the updated configuration:

```shell
terraform apply
```
Terraform will show you a plan of the changes it will make. Review the plan to ensure it's making the desired changes, then confirm to apply.
Monitor the Scaling Process:
You can monitor the scaling process using the AWS Management Console or `kubectl`. To check the nodes being added to your cluster, run:

```shell
kubectl get nodes
```
This will display a list of all nodes in your cluster. You should see the new nodes being added.
Optimizing Costs:
Remember, adding more GPU nodes will increase your AWS bill. To optimize costs, consider using a mix of On-Demand and Spot Instances, or setting up Auto Scaling policies that scale down the number of nodes during off-peak hours.
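One way to mix capacity types is to add a second node group with `capacity_type = "SPOT"` alongside the On-Demand group. This is a sketch following the node-group shape used earlier (attribute names assumed to match the blueprints module version you pin; Spot capacity can be reclaimed by AWS, so keep interruption-sensitive work on On-Demand nodes):

```hcl
# Sketch: an additional Spot-backed GPU node group (assumed module attributes)
gpu_spot_node_group = {
  node_group_name = "gpu-spot-node-group"
  ami_type        = "AL2_x86_64_GPU"
  capacity_type   = "SPOT"
  instance_types  = ["g4dn.xlarge", "g4dn.2xlarge"]
  min_size        = 0
  max_size        = 4
}
```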
In conclusion, by following these steps you can efficiently set up GPU nodes for PyTorch workloads on EKS with autoscaling. This setup lets your machine learning workloads leverage GPUs for faster processing while benefiting from the elasticity provided by the Cluster Autoscaler. Remember to tear down resources when you're done (e.g., with `terraform destroy`) to avoid incurring unnecessary costs.