
Saket

Setting Up a Spark Cluster on Kubernetes Using Helm

Introduction

Apache Spark is a powerful distributed data processing engine that can handle large-scale data processing tasks efficiently. As a cluster computing framework, it offers a complete solution to many common problems such as ETL and data warehousing, stream processing, and supervised and unsupervised learning for data analytics and predictive modelling.

Kubernetes, on the other hand, is a popular container orchestration platform that simplifies the deployment and management of containerized applications. Combining Spark and Kubernetes lets you harness the benefits of both technologies and run Spark workloads in a scalable and flexible manner. With Kubernetes, scaling the Spark cluster becomes straightforward: you simply scale the master or worker nodes with a single command. In this article, we'll guide you through the process of setting up a Spark cluster on Kubernetes using Helm, a package manager for Kubernetes.

Prerequisites:

Before proceeding, make sure you have the following prerequisites:

A Kubernetes cluster: Ensure you have a functioning Kubernetes cluster to deploy your Spark applications.

Helm installed: Install Helm, the package manager for Kubernetes, on your local machine.
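
As a quick sanity check, you can confirm that both tools are installed and that kubectl can reach your cluster:

kubectl version
helm version
kubectl get nodes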

Setting Up the Spark Cluster on Kubernetes:

Step 1: Add the Spark Helm Repository:
To deploy Spark on Kubernetes, we'll use the Bitnami Spark Helm chart. Open your terminal and add the Bitnami Helm repository:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
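You can verify that the chart is now available locally by searching the repository you just added:

helm search repo bitnami/spark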

Step 2: Customize the Configuration (Optional):
You can customize the Spark cluster configuration by creating a values.yaml file. This file lets you set parameters such as the number of master and worker nodes, CPU/memory allocation, the Spark version, and additional supporting jars added through init containers. Refer to the Bitnami Spark Helm chart documentation for the available configuration options.
For example, in values.yaml you can update the image tag to match the Spark version you need, as shown below:

...
...
image:
  registry: docker.io
  repository: bitnami/spark
  tag: 3.2.0
  digest: ""
...
...
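Likewise, here is a minimal sketch of how you might size the worker pool. The worker.replicaCount and worker.resources keys below are assumptions based on the Bitnami chart's conventions; check the chart's documented values for the authoritative names:

...
worker:
  replicaCount: 3        # number of Spark worker pods
  resources:             # CPU/memory per worker pod
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
...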

Step 3: Deploy the Spark Cluster:
With the customization (if any) done, it's time to deploy the Spark cluster using Helm:

helm install spark bitnami/spark -f path/to/your/values.yaml

This command deploys the Spark master and worker nodes as specified in the configuration, along with all the Kubernetes components (services, secrets, etc.) required for the Spark cluster to function properly.
It creates a StatefulSet each for the master and the workers, which can be scaled independently as requirements change (see the example below).
It also creates a headless service and a regular service for accessing the master and worker nodes.
All of these components are created and managed together as part of the Helm release.
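
For example, assuming the release is named spark and the chart follows the standard naming conventions (a worker StatefulSet called spark-worker and the app.kubernetes.io/instance=spark label), you could verify the pods and scale the workers like this:

# list the master and worker pods created by the release
kubectl get pods -l app.kubernetes.io/instance=spark

# scale the worker StatefulSet to 4 replicas
kubectl scale statefulset spark-worker --replicas=4

Scaling with kubectl is handy for quick experiments; for a persistent change, set the worker count in values.yaml and apply it with helm upgrade.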

Step 4: Monitor the Spark Cluster:
After the deployment is complete, you can monitor the Spark cluster by accessing the Spark Web UI. Find the service that exposes the Spark master by listing the services created for the release:

kubectl get svc

Then, access the Spark Web UI in your browser using the obtained IP and port 8080 (default Spark UI port).
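
If the service is not reachable from your machine, a port-forward is a quick alternative. The pod name spark-master-0 below is an assumption based on the default StatefulSet naming for a release called spark:

# forward the master's web UI to localhost
kubectl port-forward pod/spark-master-0 8080:8080

Then open http://localhost:8080 in your browser.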

Step 5: Submit Spark Applications:
With the Spark cluster up and running, you can now submit your Spark applications for processing. Using kubectl, you can exec into the master pod, a worker pod, or any other pod in the same namespace as the cluster, and run spark-submit from there.
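
For example, to open a shell in the master pod (again assuming the default pod name spark-master-0 for a release called spark):

kubectl exec -it spark-master-0 -- bash

Once inside the pod, you can submit the bundled SparkPi example: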


# Submit the SparkPi example to the standalone master running on Kubernetes
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master-headless:7077 \
  --deploy-mode cluster \
  --executor-memory 5G \
  --executor-cores 8 \
  /spark-home/examples/jars/spark-examples_versionxx.jar 80


The spark-submit arguments specify your Spark application: the application jar, the main class, the master URL of the cluster, and any required configurations such as executor memory and cores.

Conclusion:

Setting up a Spark cluster on Kubernetes using Helm brings together the power of Spark's distributed computing and Kubernetes' container orchestration capabilities. This combination allows you to scale your Spark workloads efficiently and take advantage of Kubernetes' resource management and fault tolerance features. With Helm's ease of use, you can quickly deploy and manage Spark clusters, making it a valuable approach for big data processing in cloud-native environments.
