Warda Liaqat

Big Data Processing, EMR with Spark and Hadoop | Python, PySpark

Introduction:

AWS offers a number of managed data-analytics services that can be of significant help when it comes to processing and analyzing large amounts of data.

Use Case:

To demonstrate our data processing job, we will use an EMR cluster and Amazon S3 (as the storage medium for our data) along with Python code and the PySpark library. We will run Python code against the Stack Overflow Annual Developer Survey 2021 data set and print out some results based on that data. Those results will then be stored back in S3.


In case you are just starting with Big Data, I would like to introduce some terms we are going to work with. You may skip the next few sections if you're already familiar with them.

EMR (Elastic MapReduce):

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Wanna dig deeper?

Amazon S3:

Amazon S3 is object storage built to store and retrieve any amount of data from anywhere.
S3 has a global namespace, though each bucket is created in a specific region. Simply put, it works like Google Drive.
Wanna dig deeper?

Apache Spark:

Apache Spark is an open-source, distributed processing system used for big data workloads. Wanna dig deeper?

Apache Hadoop:

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Wanna dig deeper?

Hadoop is designed to handle batch processing efficiently, whereas Spark is designed to handle real-time data efficiently. Hadoop is a high-latency computing framework with no interactive mode, whereas Spark is a low-latency computing framework that can process data interactively.

PySpark:

PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. Wanna dig deeper?

Outline

  1. Downloading a Data Set from Stack Overflow
  2. Setting up an Amazon S3 bucket with different folders
  3. Setting up an EMR cluster
  4. Writing Python code to perform some analysis on the data and print results
  5. Connecting to the cluster via SSH and using PySpark to load data from Amazon S3
  6. Viewing analysis results
  7. Cleanup
  8. Recap

Guided Procedure

1. Downloading a Data Set from Stack Overflow

Go to the Stack Overflow Annual Developer Survey page and download the latest data set, for 2021.
The download contains four files, but in this case we will only be using the "Survey Results Public" file.


2. Setting up an Amazon S3 bucket with different folders

  • Log in to the AWS Management Console
  • Navigate to S3 and then to Buckets
  • Create a new bucket named big-data-demo-bucket with versioning and encryption enabled
  • Click on your bucket once it has been created
  • Create two folders named bigdata-emr-logs (for storing EMR logs) and data-source (for storing our source data file), both with encryption enabled


  • Place your source data file in the data-source folder (if you prefer to script this setup, see the boto3 sketch below)
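
For readers who would rather set all of this up from code, here is a minimal boto3 sketch of the same steps. It assumes default AWS credentials are configured, the us-east-1 region, and that the survey CSV has already been downloaded to the working directory; the bucket and folder names are the ones used in this article.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')  # assumed region

bucket = 'big-data-demo-bucket'  # bucket names must be globally unique
s3.create_bucket(Bucket=bucket)

# Turn on versioning and default (SSE-S3) encryption for the bucket
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={'Status': 'Enabled'},
)
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}]
    },
)

# "Folders" in S3 are just key prefixes ending in a slash
s3.put_object(Bucket=bucket, Key='bigdata-emr-logs/')
s3.put_object(Bucket=bucket, Key='data-source/')

# Upload the survey file into the data-source folder
s3.upload_file('survey_results_public.csv', bucket,
               'data-source/survey_results_public.csv')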


3. Setting up an EMR cluster

  • Search for EMR
  • Click on create cluster


  • An S3 bucket is needed to store EMR logs. If you don't want to create one manually, EMR will automatically create a bucket for you. Here, select the bucket and log folder we created in the previous step
  • Select Spark with Hadoop and Zeppelin in the software configuration
  • In terms of hardware configuration, you can choose the EC2 instance type based on your needs
  • For testing purposes, 3 instances are sufficient. You may create as many as you need
  • Enable auto-termination, which will terminate your cluster if an error occurs during the creation process
  • You can also set the cluster to terminate automatically when it has been idle for too long
  • It is also completely up to you whether or not to enable scaling
  • Under security and access, select a key pair so you can SSH into the cluster after it launches
  • If you do not have a key pair, you can easily create one from the EC2 dashboard


  • Once you click on the Create cluster button, your cluster will be created. It usually takes 10 to 15 minutes for the cluster to become operational. If you prefer to create the cluster from code, see the boto3 sketch below
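
As a rough, non-authoritative equivalent of the console steps above, the cluster can also be created with boto3. The release label, instance type, key pair name, and idle timeout below are assumptions; adjust them to your own setup.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

response = emr.run_job_flow(
    Name='big-data-demo-cluster',
    ReleaseLabel='emr-6.5.0',                      # assumed EMR release
    LogUri='s3://big-data-demo-bucket/bigdata-emr-logs/',
    Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}, {'Name': 'Zeppelin'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',         # assumed instance type
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,                        # 1 master + 2 core nodes
        'Ec2KeyName': 'my-emr-key',                # hypothetical key pair name
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    AutoTerminationPolicy={'IdleTimeout': 3600},   # terminate after 1 hour idle (recent EMR releases)
)
print('Cluster ID:', response['JobFlowId'])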


4. Writing Python code to perform some analysis on the data and print results

Now let's write some code. A Spark job will run this code to analyze the data and print out the results.




from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# S3 locations for the input CSV and the Parquet output
S3_DATA_SOURCE_PATH = 's3://big-data-demo-bucket/data-source/survey_results_public.csv'
S3_DATA_OUTPUT = 's3://big-data-demo-bucket/data-output'


def main():
    # Create (or reuse) a Spark session for this application
    spark = SparkSession.builder.appName('BigDataDemoApp').getOrCreate()
    # Read the survey CSV from S3, using the first row as column headers
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print('Total number of records in dataset: %s' % all_data.count())
    # Keep only US respondents who wrote their first line of code between ages 11 and 17
    selected_data = all_data.where((col('Country') == 'United States of America') & (col('Age1stCode') == '11 - 17 years'))
    print('Number of US respondents who first coded between ages 11 and 17: %s' % selected_data.count())
    # Write the filtered rows back to S3 as Parquet, replacing any previous output
    selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT)
    print('Selected data was successfully saved to: %s' % S3_DATA_OUTPUT)


if __name__ == '__main__':
    main()





What does this code do?

  • Setting up a Spark session
  • Reading the data from S3
  • Printing some results based on certain conditions
  • Creating an S3 folder to store the results

5. Connecting to the cluster via SSH and using PySpark to load data from Amazon S3

Don't forget to enable SSH connections before trying to SSH into your cluster: port 22 needs to be opened in the cluster's master security group (a boto3 sketch of this is shown below)
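
If you prefer to open the port from code rather than the console, here is a minimal boto3 sketch, assuming the default region; the security group ID and source IP are hypothetical placeholders.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')  # assumed region

ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',   # hypothetical ID of the ElasticMapReduce-master group
    IpProtocol='tcp',
    FromPort=22,
    ToPort=22,
    CidrIp='203.0.113.10/32',         # replace with your own public IP
)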


  • Follow the corresponding instructions to connect to your cluster, depending on whether you are using Windows or Mac


  • I'm on Windows, so I used PuTTY for the connection


  • Use the vi main.py command to create a Python file using the Vim editor
  • Press i on your keyboard and paste your code
  • Press the Esc key to exit insert mode
  • Type :wq to quit the editor and save your changes
  • If you type cat main.py, you can view your code
  • To submit this Spark job, use spark-submit main.py
  • Your job will begin executing (alternatively, you can submit it as an EMR step, as sketched below)
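
As an alternative to SSHing in and running spark-submit by hand, the script can also be submitted as an EMR step with boto3. This is only a sketch: it assumes main.py has been uploaded to a hypothetical location in the bucket and that the cluster ID is known.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',      # your cluster ID
    Steps=[{
        'Name': 'Survey analysis',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', 's3://big-data-demo-bucket/code/main.py'],  # hypothetical key
        },
    }],
)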


6. Viewing analysis results

  • The output of the three print statements can be viewed in the logs after the job completes


  • You can now see that the latest logs are stored in the logs folder in S3. Additionally, you'll notice that a new folder named data-output, containing the output results along with a _SUCCESS marker file, has been created for you (a quick way to verify the output is sketched below)
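
To double-check the output independently of the logs, the Parquet files can be read back with PySpark; here is a small sketch, run for example from a pyspark session on the master node.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('VerifyOutput').getOrCreate()

# Read the Parquet output written by the job and inspect it
results = spark.read.parquet('s3://big-data-demo-bucket/data-output')
print('Rows written: %s' % results.count())
results.show(5)   # preview the first few rows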


7. Cleanup

  • You can then terminate the cluster to save money


  • When you terminate a cluster, all EC2 instances associated with it will also be terminated. This can also be done from code, as sketched below
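
For completeness, here is a minimal boto3 sketch of terminating the cluster from code; the cluster ID is a placeholder.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # assumed region

# Terminates the cluster and, with it, the associated EC2 instances
emr.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXXX'])  # your cluster ID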


8. Recap

This article showed how you can use EMR and Amazon S3 to process and analyze a large amount of data collected from the Stack Overflow Developer Survey and extract some useful insights.

We've reached the end of this article. Happy Clouding!

Let me know what you think about it!
