Warda Liaqat

Big Data Processing, EMR with Spark and Hadoop | Python, PySpark

Introduction:

AWS offers a number of managed data-analytics services that can be of significant help when it comes to processing and analyzing large amounts of data.

Use Case:

To demonstrate our data processing job, we will use an EMR cluster and Amazon S3 (as the storage medium for our data) along with Python code and the PySpark library. We will run Python code against the Stack Overflow Annual Developer Survey 2021 data set and print out some results based on that data. Those results will then be stored back in S3.


In case you are just starting with Big Data, I would like to introduce some terms we are going to work with. You may skip the next few sections if you're already familiar with them.

EMR (Elastic MapReduce):

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Wanna dig deeper?

Amazon S3:

Amazon S3 is object storage built to store and retrieve any amount of data from anywhere.
S3 has a global namespace, though each bucket is created in a specific region. Simply put, it works like Google Drive.
Wanna dig deeper?

Apache Spark:

Apache Spark is an open-source, distributed processing system used for big data workloads. Wanna dig deeper?

Apache Hadoop:

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Wanna dig deeper?

Hadoop is designed to handle batch processing efficiently, whereas Spark is designed to handle real-time data efficiently. Hadoop is a high-latency computing framework with no interactive mode, whereas Spark is a low-latency computing framework that can process data interactively.

PySpark:

PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. Wanna dig deeper?

Outline

  1. Downloading a Data Set from Stack Overflow
  2. Setting up an Amazon S3 bucket with different folders
  3. Setting up an EMR cluster
  4. Writing Python code to perform some analysis on the data and print results
  5. Connecting to the cluster via SSH and using PySpark to load data from Amazon S3
  6. Viewing analysis results
  7. Cleanup
  8. Recap

Guided Procedure

1. Downloading a Data Set from Stack Overflow

Go to the Stack Overflow Annual Developer Survey page and download the latest data set, for 2021.
The download contains four files, but in this case we will only be using the "Survey Results Public" file.


2. Setting up an Amazon S3 bucket with different folders

  • Log in to the AWS Management Console
  • Navigate to S3 and then to Buckets
  • Create a new bucket named big-data-demo-bucket with versioning and encryption enabled
  • Click on your bucket once it has been created
  • Create two folders named bigdata-emr-logs (for storing EMR logs) and data-source (for storing our source data file), both with encryption enabled


  • Place your source data file in the data-source folder (if you prefer to script this setup, see the boto3 sketch below)
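
For readers who would rather set all of this up from code, here is a minimal boto3 sketch of the same steps. It assumes default AWS credentials are configured, the us-east-1 region, and that the survey CSV has already been downloaded to the working directory; the bucket and folder names are the ones used in this article.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')  # assumed region

bucket = 'big-data-demo-bucket'  # bucket names must be globally unique
s3.create_bucket(Bucket=bucket)

# Turn on versioning and default (SSE-S3) encryption for the bucket
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={'Status': 'Enabled'},
)
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}]
    },
)

# "Folders" in S3 are just key prefixes ending in a slash
s3.put_object(Bucket=bucket, Key='bigdata-emr-logs/')
s3.put_object(Bucket=bucket, Key='data-source/')

# Upload the survey file into the data-source folder
s3.upload_file('survey_results_public.csv', bucket,
               'data-source/survey_results_public.csv')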


3. Setting up an EMR cluster

  • Search for EMR
  • Click on create cluster


  • An S3 bucket is needed to store EMR logs. If you don't want to create one manually, EMR will automatically create a bucket for you. Here, select the bucket and log folder we created in the previous step
  • Select Spark with Hadoop and Zeppelin in the software configuration
  • In terms of hardware configuration, you can choose the EC2 instance type based on your needs
  • For testing purposes, 3 instances are sufficient. You may create as many as you need
  • Enable auto-termination, which will terminate your cluster if an error occurs during the creation process
  • You can also set the cluster to terminate automatically when it has been idle for too long
  • It is also completely up to you whether or not to enable scaling
  • Under security and access, select a key pair so you can SSH into the cluster after it launches
  • If you do not have a key pair, you can easily create one from the EC2 dashboard


  • Once you click on the Create cluster button, your cluster will be created. It usually takes 10 to 15 minutes for the cluster to become operational. If you prefer to create the cluster from code, see the boto3 sketch below
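
As a rough, non-authoritative equivalent of the console steps above, the cluster can also be created with boto3. The release label, instance type, key pair name, and idle timeout below are assumptions; adjust them to your own setup.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

response = emr.run_job_flow(
    Name='big-data-demo-cluster',
    ReleaseLabel='emr-6.5.0',                      # assumed EMR release
    LogUri='s3://big-data-demo-bucket/bigdata-emr-logs/',
    Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}, {'Name': 'Zeppelin'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',         # assumed instance type
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,                        # 1 master + 2 core nodes
        'Ec2KeyName': 'my-emr-key',                # hypothetical key pair name
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    AutoTerminationPolicy={'IdleTimeout': 3600},   # terminate after 1 hour idle (recent EMR releases)
)
print('Cluster ID:', response['JobFlowId'])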


4. Writing Python code to perform some analysis on the data and print results

Now let's write some code. A Spark job will run this code to analyze the data and print out the results.




from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# S3 locations for the input CSV and the Parquet output
S3_DATA_SOURCE_PATH = 's3://big-data-demo-bucket/data-source/survey_results_public.csv'
S3_DATA_OUTPUT = 's3://big-data-demo-bucket/data-output'


def main():
    # Create (or reuse) a Spark session for this application
    spark = SparkSession.builder.appName('BigDataDemoApp').getOrCreate()
    # Read the survey CSV from S3, using the first row as column headers
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in dataset: %s' % all_data.count())
    # Keep only US respondents who wrote their first line of code between ages 11 and 17
    selected_data = all_data.where((col('Country') == 'United States of America') & (col('Age1stCode') == '11 - 17 years'))
    print('Number of US respondents who first coded between ages 11 and 17: %s' % selected_data.count())
    # Write the filtered rows back to S3 as Parquet, replacing any previous output
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT)
    print('Selected data was successfully saved to: %s' % S3_DATA_OUTPUT)


if __name__ == '__main__':
    main()





What does this code do?

  • Setting up a Spark session
  • Reading the data from S3
  • Printing some results based on certain conditions
  • Creating an S3 folder to store the results

5. Connecting to the cluster via SSH and using PySpark to load data from Amazon S3

Don't forget to enable SSH connections before trying to SSH into your cluster: port 22 needs to be opened in the cluster's master security group (a boto3 sketch of this is shown below)
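
If you prefer to open the port from code rather than the console, here is a minimal boto3 sketch, assuming the default region; the security group ID and source IP are hypothetical placeholders.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')  # assumed region

ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',   # hypothetical ID of the ElasticMapReduce-master group
    IpProtocol='tcp',
    FromPort=22,
    ToPort=22,
    CidrIp='203.0.113.10/32',         # replace with your own public IP
)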


  • Follow the corresponding instructions to connect to your cluster, depending on whether you are using Windows or Mac


  • I'm on Windows, so I used PuTTY for the connection


  • Use the vi main.py command to create a Python file using the Vim editor
  • Press i on your keyboard and paste your code
  • Press the Esc key to exit insert mode
  • Type :wq to quit the editor and save your changes
  • If you type cat main.py, you can view your code
  • To submit this Spark job, use spark-submit main.py
  • Your job will begin executing (alternatively, you can submit it as an EMR step, as sketched below)
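
As an alternative to SSHing in and running spark-submit by hand, the script can also be submitted as an EMR step with boto3. This is only a sketch: it assumes main.py has been uploaded to a hypothetical location in the bucket and that the cluster ID is known.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',      # your cluster ID
    Steps=[{
        'Name': 'Survey analysis',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', 's3://big-data-demo-bucket/code/main.py'],  # hypothetical key
        },
    }],
)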


6. Viewing analysis results

  • The output of the three print statements can be viewed in the logs after the job completes


  • You can now see that the latest logs are stored in the logs folder in S3. Additionally, you'll notice that a new folder named data-output, containing the output results along with a _SUCCESS marker file, has been created for you (a quick way to verify the output is sketched below)
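
To double-check the output independently of the logs, the Parquet files can be read back with PySpark; here is a small sketch, run for example from a pyspark session on the master node.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('VerifyOutput').getOrCreate()

# Read the Parquet output written by the job and inspect it
results = spark.read.parquet('s3://big-data-demo-bucket/data-output')
print('Rows written: %s' % results.count())
results.show(5)   # preview the first few rows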


7. Cleanup

  • You can then terminate the cluster to save money


  • When you terminate a cluster, all EC2 instances associated with it will also be terminated. This can also be done from code, as sketched below
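
For completeness, here is a minimal boto3 sketch of terminating the cluster from code; the cluster ID is a placeholder.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

# Terminates the cluster and, with it, the associated EC2 instances
emr.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXXX'])  # your cluster ID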


8. Recap

This article showed how you can use EMR and Amazon S3 to process and analyze a large amount of data collected from the Stack Overflow Developer Survey and extract some useful insights.

We've reached the end of this article. Happy Clouding!

Let me know what you think about it!
