Goal
Create a PySpark cluster with Python and Docker, based on this repo & on this video.
Here is the code repository
Description
We are going to create three environments:
1- Master: ports: 8080 & 7077
2- Worker-1: ports: 8081
3- Worker-2: ports: 8082
All of them are going to share the same volume, mounted at /opt/workspace.
They are also going to have the same source files.
To check that everything works, we are going to test this script:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("TestCluster") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

df = spark \
    .read.options(
        header='True',
        inferSchema='True',
        delimiter=";") \
    .csv("src/Travel Company New Clients.csv")

df.count()
df.printSchema()
df.select("FamilyMembers").distinct().show()

def myFunc(s):
    return [(s["FamilyMembers"], 1)]

lines = df.rdd.flatMap(myFunc).reduceByKey(lambda a, b: a + b)
famColumns = ["FamilyMembers", "Num_reg"]
dataColl = lines.toDF(famColumns)
dataColl.show(truncate=False)
Prerequisites
- Docker version 23.0.5
- Openjdk:8-jre-slim
- Python: 3
- Spark version: 3.4.0
- Hadoop version: 3
Steps:
Firstly, we should build the Apache Spark base (with the common installation, files, ...). All three containers are going to be based on this first image.
For this reason, we are going to split our flow into four parts (a possible project layout is sketched right after this list):
1- Create a bash script that builds the first image
2- Create a Dockerfile for the master based on (1)
3- Create a Dockerfile for the workers based on (1)
4- Create a docker-compose file to bring up (2 & 3)
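To make this concrete, the project folder could look roughly like this. The names Dockerfile.master, Dockerfile.worker and docker-compose.yml match what the compose file below expects; build.sh is just a placeholder name for the bash script of step (1):
.
├── .env                                # versions & directories (step 0)
├── Dockerfile                          # spark-base image (step 1)
├── Dockerfile.master                   # master image (step 2)
├── Dockerfile.worker                   # worker image (step 3)
├── docker-compose.yml                  # whole environment (step 4)
├── build.sh                            # builds spark-base (placeholder name)
├── script/
│   └── main.py                         # the test script shown above
└── src/
    └── Travel Company New Clients.csv  # the data file read by main.py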
0 - Env variables
These variables will be used in later steps for the versions and for the working directory; save them in a .env file:
# Version
## JDK
JDK_VERSION=8-jre-slim
## Hadoop
HAD_VERSION=3
## Spark
SPA_VERSION=3.4.0
# Directories
WORK_DIR=/opt/workspace
1 - Create Apache Spark
This might be the hardest step. First, we should know which tools we are going to use:
- OpenJDK: the basic runtime needed to run PySpark.
- Python 3: we could use Scala, but in my case I prefer Python.
- curl: we need it to download Spark from the official Apache archive.
- vim: only in case we need to edit something.
Variables
- jdk_version
- spark_version
- hadoop_version
- shared_wk : where we are going to save our main files
Main Dockerfile
For this reason, our Dockerfile should be:
# Layer - Image
ARG jdk_version
FROM openjdk:${jdk_version}
# Layer - Arguments in dockerfile
ARG spark_version
ARG hadoop_version
ARG shared_wk
ARG py_cmd
# Layer - OS + Directories
RUN apt-get update -y
RUN mkdir -p ${shared_wk}/data
RUN mkdir -p /usr/share/man/man1
RUN ln -s /usr/bin/python3 /usr/bin/python
# Layer - Prerequisites
RUN apt-get -y install curl
RUN apt-get -y install vim
RUN apt-get -y install python3
# Layer - Download and install Spark
RUN curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz
RUN tar -xvzf spark.tgz
RUN mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/
RUN echo "alias pyspark=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/pyspark" >> ~/.bashrc
RUN echo "alias spark-shell=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/bin/spark-shell" >> ~/.bashrc
RUN mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs
RUN rm spark.tgz
RUN echo "JAVA_HOME"
# Layer - Environment variables
ENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
# Layer - Move files to execute
## Data of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src
COPY ./src/* /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/src/
## Script of the execution
RUN mkdir -p /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script
COPY ./script/* /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/script/
# Layer - Volume & workdir
VOLUME ${shared_wk}
WORKDIR ${SPARK_HOME}
Bash script
To build this image, we are going to use a bash script:
#!/bin/bash
# Load variables from .env
set -o allexport;
source .env;
set +o allexport;
# Create Spark Base
docker build \
--build-arg jdk_version=${JDK_VERSION} \
--build-arg hadoop_version=${HAD_VERSION} \
--build-arg spark_version=${SPA_VERSION} \
--build-arg shared_wk=${WORK_DIR} \
-f Dockerfile \
-t spark-base .;
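To run it, assuming the script above is saved as build.sh (the name is not fixed anywhere in this post) in the same folder as the .env file:
chmod +x build.sh
./build.sh
# Optional sanity check: the image should exist and report Spark 3.4.0
docker run --rm spark-base bin/spark-submit --version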
2 - Create Apache Spark - Master
With this first image created, we can create the master's image:
FROM spark-base
CMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out;bin/spark-submit --master spark://spark-master:7077 script/main.py
3 - Create Apache Spark - Worker
Then, we need to create another image for the workers:
FROM spark-base
# Layer - Environment variables
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out
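docker-compose will build both of these images for us in the next step, but if you want to try them by hand first (assuming they are saved as Dockerfile.master and Dockerfile.worker, the names the compose file below refers to), something like this should work:
docker build -f Dockerfile.master -t spark-master-img .
docker build -f Dockerfile.worker -t spark-worker-img .
The -img tags here are arbitrary; the compose file builds its own images with its own names.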
4 - Environment
We need a docker-compose file to tie the whole structure together. I'll use this one:
version: "3.9"
volumes:
shared-workspace:
name: "Pyspark"
driver: local
services:
spark-master:
build:
context: .
dockerfile: Dockerfile.master
container_name: spark-master
ports:
- "8080:8080"
- "7077:7077"
volumes:
- shared-workspace:${WORK_DIR}
spark-worker-1:
build:
context: .
dockerfile: Dockerfile.worker
container_name: spark-worker-1
ports:
- "8081:8081"
volumes:
- shared-workspace:${WORK_DIR}
depends_on:
- spark-master
spark-worker-2:
build:
context: .
dockerfile: Dockerfile.worker
container_name: spark-worker-2
ports:
- "8082:8082"
volumes:
- shared-workspace:${WORK_DIR}
depends_on:
- spark-master
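With spark-base already built by the script from step (1), bringing the whole cluster up is one command (assuming the file above is saved as docker-compose.yml; on older Docker installs the command is docker-compose instead of docker compose):
docker compose up --build -d
# Check that the three containers are running
docker compose ps
The master web UI should answer on http://localhost:8080, and both workers should appear there as ALIVE.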
5 - Test it
Once everything is up and running, I'll use the container's terminal in Docker to run this command:
./bin/spark-submit --master spark://spark-master:7077 script/main.py
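If you prefer to launch it from the host instead of opening a shell inside the container, the same command can be run through docker exec (spark-master is the container name defined in the compose file, and the image's working directory is already $SPARK_HOME):
docker exec -it spark-master ./bin/spark-submit --master spark://spark-master:7077 script/main.py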
And it looks like it works great!
Have a nice day:)!