Kasamba Lumwagi

Data Engineering 101: Introduction to Data Engineering

Before this week, I thought data engineering and data science were just similar fields. That changed: data engineering is about making quality data available from various sources, maintaining databases, building data pipelines, querying and preprocessing data, doing feature engineering, and developing data workflows with tools like Apache Hadoop, Spark, and Airflow. Data science, on the other hand, is about building ML algorithms and models, deploying them, applying statistical and mathematical knowledge, and measuring, optimizing, and improving results.

Week one is down, and I've got the basic layout of the topics we will be tackling. I'm finding it fairly manageable now that I have a clear path on what it will take to become a data engineer. The following tools serve greatly in data engineering:

1. Cloud Platforms (AWS, Azure, GCP)
Master one but get a good grasp of all; AWS is preferable since it is the most widely used. The cloud platform serves as the development environment for building data engineering applications on GCP, AWS, and Microsoft Azure.

2. Programming Language
I would recommend Python: it is quick to develop in and has libraries and frameworks that are well suited for data engineering tasks.
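
As a tiny taste of what that looks like, here is a minimal sketch of an extract-transform-load step using pandas (the file name `sales.csv` and the `amount` column are made up for illustration):

```python
import pandas as pd

# Extract: read a hypothetical CSV file (any file with an "amount" column works).
df = pd.read_csv("sales.csv")

# Transform: drop incomplete rows and add a derived column.
df = df.dropna(subset=["amount"])
df["amount_x2"] = df["amount"] * 2  # placeholder transformation for illustration

# Load: write the cleaned data out for the next stage of the pipeline.
df.to_csv("sales_clean.csv", index=False)
```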

3. SQL (Structured Query Language)
Makes it easy to manipulate databases.
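
For example, here is a self-contained sketch using Python's built-in sqlite3 module and a made-up `users` table:

```python
import sqlite3

# Use an in-memory database so the example is fully self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a small table and insert a few rows.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Amina", "KE"), ("Brian", "KE"), ("Chen", "CN")],
)

# Query: count users per country.
for country, total in cur.execute(
    "SELECT country, COUNT(*) AS total FROM users GROUP BY country"
):
    print(country, total)

conn.close()
```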

4. A Text Editor
Visual Studio Code.

5. Anaconda
Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.

6. Hadoop
An open-source framework that provides a distributed file system (HDFS) for big data sets. It allows users to process and transform big data sets into useful information using the MapReduce programming model.
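
To make the MapReduce model concrete, here is the classic word-count example sketched in plain Python (in a real Hadoop job the map and reduce steps would run in parallel across the cluster):

```python
from collections import defaultdict

# Toy "documents" standing in for blocks of a file stored in HDFS.
documents = ["big data is big", "data engineering is fun"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key
# (Hadoop does this automatically between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'engineering': 1, 'fun': 1}
```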

7. PySpark
PySpark is a data analytics tool created by the Apache Spark community for using Python with Spark. It lets us work with RDDs (Resilient Distributed Datasets) and DataFrames in Python. When dealing with huge amounts of data, it provides fast, real-time processing, in-memory computation, flexibility, and more, combining the simplicity of Python with the efficiency of Spark.
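
Here is a minimal sketch of PySpark's DataFrame API (assuming PySpark is installed, e.g. via `pip install pyspark`; the names and ages are made up):

```python
from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Build a small DataFrame in memory.
df = spark.createDataFrame(
    [("Amina", 34), ("Brian", 28), ("Chen", 45)],
    ["name", "age"],
)

# Transformations like filter() are lazy; show() triggers the actual computation.
df.filter(df.age > 30).show()

spark.stop()
```

On a real cluster, the same code runs unchanged while Spark distributes the work across the nodes.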

THE TOPICS TO BE COVERED ARE:
1). Data Engineering
-What’s Data Engineering
-Why Data Engineering
-Data Engineers — ML Engineers — Data Scientists

2). Python for Data Engineering
-Basic Python with Project
-Advanced Python with Project
-Techniques and Optimization

3). Scripting and Automation
-Shell Scripting
-CRON
-ETL

4). Relational Databases and SQL
-RDBMS
-Data Modeling
-Basic SQL
-Advanced SQL
-BigQuery

5). NoSQL Databases and MapReduce
-Unstructured Data
-Advanced ETL
-MapReduce
-Data Warehouses
-Data API

6). Data Analysis
-Pandas
-NumPy
-Web Scraping
-Data Visualization

7). Data Processing Techniques
-Batch Processing : Apache Spark
-Stream Processing: Spark Streaming
-Build Data Pipelines
-Target Databases
-Machine Learning Algorithms

8). Big Data
-Big data basics
-HDFS in detail
-Hadoop YARN
-Sqoop
-Hive
-Pig
-HBase

9). Workflows
-Introduction to Airflow
-Airflow hands-on project

10). Infrastructure
-Docker
-Kubernetes
-Business Intelligence

11). Cloud Computing
-AWS
-Google Cloud Platform
-Microsoft Azure

THE FIRE KEEPS ON BURNING.
