Open source software (OSS) frameworks have improved the quality of Big Data processing with their diverse set of tools addressing numerous use cases.
In fact, if you are a part of a team working on building a modern data architecture, chances are high you are using an open-source stack.
Similarly, cloud computing has enabled Big Data solutions that are scalable and cost-effective in the analytics space.
Open Source and Cloud: The Correlation
In the cloud ecosystem, many commercially available cloud services fall into one of three categories:
- Similar to an OSS: offers similar features (e.g., AWS Step Functions and Apache Airflow).
- Modeled after an OSS: follows or inherits the design principles of an existing open source framework (e.g., AWS Kinesis and Apache Kafka).
- Managed service of an OSS: takes care of deploying and maintaining the OSS framework so that it is ready to use (e.g., AWS RDS for PostgreSQL and PostgreSQL).
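
To keep the three categories straight while reading the tables that follow, they can be written down as a small data structure. This is only an illustrative sketch: the names `Relation`, `ServiceMapping`, `EXAMPLES`, and `oss_counterparts` are made up for this post and do not come from any AWS or OSS library.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """How a cloud service relates to an open source framework."""
    SIMILAR_TO = "Similar to"
    MODELED_AFTER = "Modeled after"
    MANAGED_SERVICE_OF = "Managed service of"


@dataclass(frozen=True)
class ServiceMapping:
    cloud_service: str
    oss_framework: str
    relation: Relation


# The three example pairs named in the list above.
EXAMPLES = [
    ServiceMapping("AWS Step Functions", "Apache Airflow", Relation.SIMILAR_TO),
    ServiceMapping("AWS Kinesis", "Apache Kafka", Relation.MODELED_AFTER),
    ServiceMapping("AWS RDS Postgres", "PostgreSQL", Relation.MANAGED_SERVICE_OF),
]


def oss_counterparts(relation: Relation) -> list[str]:
    """Return the OSS frameworks that have the given relation to a cloud service."""
    return [m.oss_framework for m in EXAMPLES if m.relation is relation]


if __name__ == "__main__":
    for m in EXAMPLES:
        print(f"{m.cloud_service} | {m.relation.value} | {m.oss_framework}")
```

Each row of the comparison tables in this post has the same shape: a cloud service, its relation, and one or more OSS counterparts.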
To understand more, let's touch upon the basics...
Getting to know the cloud
The first hurdle many of us face when getting to know cloud services is deciding where to start among the plethora of services available.
So, for ease of understanding, and irrespective of the cloud provider (AWS, Azure, GCP, etc.), let's group the Big Data related cloud services into these stages: data ingestion, data storage, data processing, data analysis and visualization, and deployment.
Now, let's try to understand the cloud ecosystem by comparing AWS cloud services with their equivalent open source frameworks. (A similar comparison can be drawn for Azure and GCP as well.)
Data Ingestion:
AWS Service | What it does | Relation with OSS | OSS Alternative |
---|---|---|---|
Kinesis | Stream Processing | Modelled After | Apache Kafka |
SQS | Message Queue | Similar to | RabbitMQ |
Managed Streaming for Kafka (MSK) | Stream Processing | Managed Service of | Apache Kafka |
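
One design principle Kinesis shares with Kafka is splitting a stream into ordered units (shards in Kinesis, partitions in Kafka) and routing each record by a key. Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The sketch below assumes the hash space is divided evenly across shards, which is a simplification: a real stream stores an explicit range per shard, and Kafka's default partitioner uses a different hash function. The function name `shard_for_key` is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the 128-bit hash into one of shard_count equal, contiguous ranges.
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    for key in ("sensor-1", "sensor-2", "sensor-1"):
        print(key, "-> shard", shard_for_key(key, shard_count=4))
```

Because the mapping depends only on the key, every record with the same key lands on the same shard, which is what preserves per-key ordering in both systems.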
Data Storage:
AWS Service | What it does | Relation with OSS | OSS Alternative |
---|---|---|---|
S3 | Object store | Similar to | Minio, Swift, Ceph, ... |
RDS | Relational database | Managed Service of | MariaDB, MySQL, Postgres |
DynamoDB | NoSQL database | Similar to | Apache Cassandra |
ElastiCache | In-memory cache | Managed Service of | Memcached, Redis |
Neptune | Graph database | Similar to | Neo4j |
Amazon QLDB | Ledger database | Modelled After | Hyperledger |
Amazon DocumentDB | Document database | Similar to | MongoDB |
AWS Lake Formation | Data lake | Similar to | HDFS |
EC2 EBS | Block storage for EC2 | Similar to | OpenEBS, Portworx |
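
For the "Managed Service of" rows, the practical consequence is that client code stays the same and only the endpoint changes. The sketch below builds a libpq-style connection URI for a self-hosted PostgreSQL and for an RDS instance. The helper `postgres_dsn`, the database and user names, and the RDS hostname are all hypothetical placeholders, not real endpoints.

```python
from urllib.parse import quote


def postgres_dsn(host: str, dbname: str, user: str, port: int = 5432) -> str:
    """Build a postgresql:// connection URI (password deliberately left out)."""
    return f"postgresql://{quote(user, safe='')}@{host}:{port}/{quote(dbname, safe='')}"


SELF_HOSTED_HOST = "localhost"
# Hypothetical RDS endpoint; a real one is assigned by AWS when the instance is created.
MANAGED_HOST = "analytics.example.us-east-1.rds.amazonaws.com"

self_hosted = postgres_dsn(SELF_HOSTED_HOST, "analytics", "etl_user")
managed = postgres_dsn(MANAGED_HOST, "analytics", "etl_user")

if __name__ == "__main__":
    print(self_hosted)
    print(managed)
```

Any PostgreSQL driver that accepts such a URI should work against either target, since the managed service runs the same database engine.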
Data Processing:
AWS Service | What it does | Relation with OSS | OSS Alternative |
---|---|---|---|
Elastic MapReduce (EMR) | Managed Hadoop clusters | Managed Service of | Hadoop |
Step Functions | Workflow Orchestrator | Similar to | Apache Airflow, Flyte |
AWS Glue | ETL | Managed Service of | Apache Spark |
Lambda | Serverless | Similar to | Knative, OpenFaaS, Fn |
Batch | Batch Job Computing | Similar to | Apache Airflow on Kubernetes |
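
What workflow orchestrators such as Step Functions and Airflow have in common is running tasks in dependency order. Step Functions describes this as a state machine and Airflow as a DAG, so the sketch below captures only the shared idea, using the standard library's `graphlib` (Python 3.9+). The task names and the `PIPELINE` and `execution_order` identifiers are made up for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
    "report": {"load"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A real orchestrator adds scheduling, retries, and state tracking on top of this ordering, and that is where the individual tools differ.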
Data Analysis & Visualization:
AWS Service | What it does | Relation with OSS | OSS Alternative |
---|---|---|---|
Amazon Redshift | Data warehousing | Similar to | Spark SQL, Apache Hive, Presto |
Athena | Interactive SQL queries on S3 | Similar to | Spark SQL, Apache Hive, Presto |
CloudSearch | Search | Similar to | Elasticsearch |
Elasticsearch Service | Search | Managed Service of | Elasticsearch |
QuickSight | Business analytics | Similar to | Apache Superset, Metabase |
Deployment:
AWS Service | What it does | Relation with OSS | OSS Alternative |
---|---|---|---|
Elastic Container Registry (ECR) | Container registry | Managed Service of | Docker Registry, Quay |
Elastic Container Service (ECS) | Container orchestration | Similar to | Kubernetes, Marathon |
Elastic Kubernetes Service (EKS) | Container orchestration | Managed Service of | Kubernetes |
CloudFormation | Infrastructure as code | Similar to | Terraform |
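
To make "infrastructure as code" concrete, here is a minimal sketch that generates a CloudFormation template declaring a single S3 bucket. The helper name `bucket_template` and the logical ID `RawDataBucket` are hypothetical; the template keys and the `AWS::S3::Bucket` resource type are standard CloudFormation.

```python
import json


def bucket_template(logical_id: str) -> str:
    """Return a minimal CloudFormation template (as JSON) with one S3 bucket."""
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal sketch: one S3 bucket",
        "Resources": {
            # The logical ID is just a name used inside the template.
            logical_id: {"Type": "AWS::S3::Bucket"},
        },
    }
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    print(bucket_template("RawDataBucket"))
```

The point is the same as with Terraform: the infrastructure is described in a text file that can be reviewed and version-controlled like any other code.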
Some notable cloud adoption figures related to Big Data:
- To date, AWS users have launched more than 15 million Hadoop clusters (EMR and containerized versions).
- "Container-as-a-service" (EKS, ECS) and "database-as-a-service" (RDS, DynamoDB) were the most commonly used managed services in 2020.
- Usage of database services is up 127% year over year.
Next Steps...
- You can see how these services are put to use in real-world use cases in this article.
- This whitepaper from AWS on Big Data is a good place to understand its services.
- And start getting hands-on by following this repo.
Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.
Follow for updates.