What's new at AWS
Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development
This launch lets customers run and manage their Kubernetes workloads on SageMaker HyperPod, purpose-built infrastructure for foundation model (FM) development that reduces model training time by up to 40%.
Many customers use Kubernetes to orchestrate their ML workflows because of its portability, scalability, and rich ecosystem of tools. However, handling hardware failures is not automated.
With this launch, customers can run deep health checks during cluster creation and get automated handling of hardware failures during ML training and fine-tuning.
In addition, HyperPod automatically replaces faulty nodes (self-healing, performant clusters) and resumes training from the last checkpoint on both AWS Trainium and NVIDIA GPUs, at a scale of more than a thousand accelerators.
EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability, including health status checks and visual dashboards.
Customers can use the HyperPod CLI, or their preferred tools, to submit, manage, and monitor workloads.
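To make the "resume from the last checkpoint" idea above more concrete, here is a minimal Python sketch of the general pattern. It is not HyperPod's implementation and uses no AWS API: the function names, the `step_*.json` file layout, and the `fail_at` parameter are all made up for illustration. It only shows how a job that saves progress periodically can pick up where it left off after a node failure.

```python
import json
import tempfile
from pathlib import Path


def latest_checkpoint(ckpt_dir: Path):
    """Return the checkpoint file with the highest step number, or None."""
    candidates = sorted(
        ckpt_dir.glob("step_*.json"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return candidates[-1] if candidates else None


def train(ckpt_dir: Path, total_steps: int, save_every: int, fail_at=None) -> int:
    """Toy training loop that resumes from the newest checkpoint, if any."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    ckpt = latest_checkpoint(ckpt_dir)
    state = json.loads(ckpt.read_text()) if ckpt else {"step": 0, "loss": 1.0}

    for step in range(state["step"] + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a hardware fault on a node.
            raise RuntimeError(f"simulated hardware failure at step {step}")
        # Stand-in for a real optimizer update.
        state = {"step": step, "loss": state["loss"] * 0.9}
        if step % save_every == 0:
            (ckpt_dir / f"step_{step}.json").write_text(json.dumps(state))
    return state["step"]


def demo():
    """Fail once mid-run, then restart and finish from the last checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = Path(tmp)
        try:
            train(ckpt_dir, total_steps=100, save_every=10, fail_at=37)
        except RuntimeError:
            pass  # in a managed cluster, the faulty node would be replaced here
        resumed_from = json.loads(latest_checkpoint(ckpt_dir).read_text())["step"]
        final_step = train(ckpt_dir, total_steps=100, save_every=10)
        return resumed_from, final_step
```

Calling `demo()` simulates a failure at step 37 with checkpoints saved every 10 steps, so the restarted run picks up from step 30 and only the steps after the last checkpoint are repeated. The checkpoint interval is the trade-off: saving more often costs more I/O but loses less work on each failure.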
What is Amazon EKS?
- An AWS-managed Kubernetes service for running Kubernetes in the AWS cloud as well as in on-premises data centers.
- It automatically manages the availability and scalability of the Kubernetes control plane nodes, which handle key cluster tasks.
- Amazon EKS integrates with AWS services such as Elastic Load Balancing, IAM, VPC, and CloudTrail, which is an added advantage.
Explore more about EKS: https://aws.amazon.com/eks/
Explore more about SageMaker HyperPod: https://aws.amazon.com/blogs/aws/amazon-sagemaker-hyperpod-introduces-amazon-eks-support/