Sometimes it's useful to be able to run your own version of Apache Spark/Hudi on an AWS EMR cluster you provisioned. You get the best of both worlds: all the AWS tooling, plus the latest Spark and the latest Hudi.
This is a simple post on how you can accomplish this. First, create your EMR cluster; the following works for EMR 6.2.
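If you prefer the CLI to the console, something like the sketch below provisions such a cluster. This is only an illustration, not part of the original walkthrough; the cluster name, key pair, instance type/count, and region are placeholders to adjust for your account.
# Hypothetical example: provision an EMR 6.2 cluster with Spark via the AWS CLI
# (cluster name, MyKeyPair, instance type/count, and region are placeholders)
aws emr create-cluster \
--name "custom-spark-hudi" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark \
--ec2-attributes KeyName=MyKeyPair \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles \
--region us-east-1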
Step 1: Build Hudi and copy the spark-bundle over
On your local Mac/Linux box:
# You can get this from the cluster's status page
export EMR_MASTER=<your_emr_master_public_dns>
# So you can build your own bundles and deploy
export HUDI_REPO=/path/to/hudi/repo
cd ${HUDI_REPO}
mvn clean package -DskipTests -Dspark3
export HUDI_SPARK_BUNDLE=hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar
scp -i /path/to/key.pem ${HUDI_REPO}/packaging/hudi-spark-bundle/target/${HUDI_SPARK_BUNDLE} hadoop@${EMR_MASTER}:~/
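Optionally, verify the bundle actually landed on the master before moving on (same variables as above):
# Sanity check: the bundle should now be in the hadoop user's home directory on the master
ssh -i /path/to/key.pem hadoop@${EMR_MASTER} "ls -lh ~/${HUDI_SPARK_BUNDLE}"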
Step 2: Install Spark 3 with AWS Jars
ssh to the EMR master node.
ssh -i /path/to/key.pem hadoop@${EMR_MASTER}
Then proceed to download Spark 3.
# For hadoop-aws versions >= 3.2, the matching aws-java-sdk-bundle jar is needed alongside it.
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar
# Install a plain Apache Spark distribution on your own; its Hadoop version needs to match the one on the EMR cluster you created.
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar -zxvf spark-3.0.1-bin-hadoop3.2.tgz
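The spark-shell invocation in the next step uses a relative bin/spark-shell path, so change into the extracted distribution first:
# Run the remaining commands from inside the extracted Spark distribution
cd spark-3.0.1-bin-hadoop3.2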
Step 3: Fire up your spark-shell
You need to set the following environment variables:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export AWS_ACCESS_KEY_ID=<your key id>
export AWS_SECRET_ACCESS_KEY=<your access key>
Then set up the jars and launch the spark-shell:
export SCALA_VERSION=2.12
export SPARK_VERSION=3.0.1
export HUDI_JAR=~/hudi-spark-bundle_${SCALA_VERSION}-0.8.0-SNAPSHOT.jar
export AWS_JARS="${HOME}/hadoop-aws.jar,${HOME}/aws-java-sdk.jar"
export JARS="${HUDI_JAR},${AWS_JARS}"
bin/spark-shell \
--driver-memory 8g --executor-memory 8g \
--master yarn --deploy-mode client \
--num-executors 2 --executor-cores 4 \
--conf spark.rdd.compress=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.hadoop.yarn.timeline-service.enabled=false \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--conf spark.ui.proxyBase="" \
--jars ${JARS} \
--packages org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION} \
--conf "spark.memory.storageFraction=0.8" \
--conf "spark.driver.extraClassPath=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
--conf "spark.executor.extraClassPath=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70"
scala> // Spark UI is otherwise broken.
scala> sys.props.update("spark.ui.proxyBase", "")
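From here you can sanity-check the whole setup end to end. Below is a minimal sketch of a Hudi write from the shell; the table name, field names, and the s3a:// path are placeholders I made up for illustration, not part of the original post.
scala> // Hypothetical smoke test: write a tiny Hudi table to S3 (adjust bucket, path, and fields to your setup)
scala> import org.apache.spark.sql.SaveMode
scala> val df = Seq((1, "a", 1000L), (2, "b", 1000L)).toDF("id", "name", "ts")
scala> df.write.format("hudi").
     |   option("hoodie.table.name", "hudi_smoke_test").
     |   option("hoodie.datasource.write.recordkey.field", "id").
     |   option("hoodie.datasource.write.precombine.field", "ts").
     |   option("hoodie.datasource.write.partitionpath.field", "name").
     |   mode(SaveMode.Overwrite).
     |   save("s3a://<your-bucket>/tmp/hudi_smoke_test")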