Data Lake on Google Cloud Platform
First of all, thank you for reading my post.
Second, this is a simple overview of a Data Lake on GCP.
Intro
Google, as you know, was the last of the big players to enter the cloud 'war'. But they did it in a very different way: they focused on managed services and on Big Data (and Machine Learning).
For Big Data, they are good (they ARE really good). After so many years of improving their internal workloads for such huge amounts of data, the services they provide are top class.
Words are cheap, show me the "pic" :)
A complete Data Lake solution diagram:
The truth is: the main component of the Data Lake is Google Cloud Storage. GCS, for short, is the place where you can store all your data. Even the other products use GCS in some way to store things. GCS is a powerful service in GCP, with many configuration options and ways to use it. In a Data Lake, we use it for unstructured data. For structured data, we commonly use Cloud SQL (up to 10 TB), Spanner (global relational database), Bigtable (low-latency NoSQL database) and BigQuery (data warehouse).
For each type of data, there is a service/product in GCP that fits it.
But to be honest with you, for structured data, BigQuery is the king. You can store data of any size and query it, often in a matter of seconds.
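To make the "one service per type of data" idea concrete, here is a minimal sketch in Python. The function name, its parameters, and the order of the checks are my own assumptions for illustration only; the mapping itself just follows the list above (for example, Cloud SQL only up to 10 TB).

```python
def suggest_store(structured, size_tb=0.0, analytics=False,
                  low_latency_nosql=False, global_relational=False):
    """Suggest a GCP storage service for a dataset.

    Hypothetical helper: the rules mirror the list in this post,
    but the order of the checks is an assumption, not official guidance.
    """
    if not structured:
        # Unstructured data (files, images, logs) goes to the lake itself.
        return "Cloud Storage"
    if analytics:
        # Data warehouse workloads.
        return "BigQuery"
    if low_latency_nosql:
        return "Bigtable"
    if global_relational or size_tb > 10:
        # Relational data that is global or too large for Cloud SQL.
        return "Spanner"
    return "Cloud SQL"


print(suggest_store(structured=False))                 # Cloud Storage
print(suggest_store(structured=True, size_tb=2))       # Cloud SQL
print(suggest_store(structured=True, analytics=True))  # BigQuery
```

A real decision also depends on cost, access patterns and team skills, so treat this as a memory aid rather than a rule.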
In BigQuery you can import data in batch or use the Streaming API. The Streaming API allows you to send up to 100,000 rows per second per table (the documented quota when I wrote this; check the current limits).
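As a rough sketch of how streaming inserts can look from Python: the helper below sends rows in small batches through `insert_rows_json`, which is a method of `google.cloud.bigquery.Client`. The helper names, the table ID in the comment, and the batch size of 500 are assumptions for illustration. The client is passed in as a parameter, so the helper itself does not create any connection.

```python
def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def stream_rows(client, table_id, rows, batch_size=500):
    """Stream rows (a list of dicts) into a BigQuery table in batches.

    `client` is assumed to be a google.cloud.bigquery.Client.
    Returns the list of per-row errors reported by BigQuery
    (an empty list means every row was accepted).
    """
    errors = []
    for batch in chunked(rows, batch_size):
        # insert_rows_json returns an empty list when the whole batch succeeds.
        errors.extend(client.insert_rows_json(table_id, batch))
    return errors


# Usage with a real client (needs the google-cloud-bigquery package and
# credentials; the table ID is a hypothetical placeholder):
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   rows = [{"user_id": 1, "event": "login"}]
#   errors = stream_rows(client, "my-project.my_dataset.events", rows)
```

Batching keeps each request small and makes it easier to stay under the streaming quotas.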
Of course, Data Lake is a huge topic to discuss, but here I'm trying to show you, in a pic and in a few words, how it is possible on GCP. You may have noticed that the 2 most important services for a data lake provided by Google Cloud are Cloud Storage and BigQuery. But they are not alone in the field: below I wrote down the other "friends" you can use to create a complete environment for your big data on GCP.
Ingestion
- gsutil
- Storage Transfer Service
- BigQuery API
- Cloud Pub/Sub
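For the ingestion step, files can also be uploaded to Cloud Storage from code instead of gsutil. The sketch below is only an illustration: the `raw/<source>/<year>/<month>/<day>/` object layout and the function names are my own assumptions, while `bucket()`, `blob()` and `upload_from_filename()` are methods of the `google-cloud-storage` client library. The storage client is passed in as a parameter.

```python
import datetime
import os


def raw_object_name(source, local_path, day):
    """Build a date-partitioned object name for a raw file (assumed layout)."""
    filename = os.path.basename(local_path)
    return f"raw/{source}/{day:%Y/%m/%d}/{filename}"


def upload_raw_file(storage_client, bucket_name, source, local_path, day=None):
    """Upload a local file to the raw zone of the lake and return its gs:// URI.

    `storage_client` is assumed to be a google.cloud.storage.Client.
    """
    day = day or datetime.date.today()
    name = raw_object_name(source, local_path, day)
    blob = storage_client.bucket(bucket_name).blob(name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{name}"


# Usage with a real client (needs the google-cloud-storage package and
# credentials; the bucket name is a hypothetical placeholder):
#
#   from google.cloud import storage
#   uri = upload_raw_file(storage.Client(), "my-data-lake", "crm", "/tmp/customers.csv")
```

Keeping the date in the object name makes it easy to load or reprocess one day of data later.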
Store
- Cloud Storage
- Bigtable
- BigQuery
- Spanner
Process
- Dataprep
- Dataflow
- Dataproc
Analyse
- BigQuery
Reporting
- Data Studio
Conclusion
As I told you before, this is a simple and straight-to-the-point post :)
Data Lake is only one part of the huge topic "Big Data", but it is the starting point. Since it is such a fundamental part, I'll try to bring more information about it in one more post. I will also write more about each of those services in a series of posts here on dev.to, which gives me a chance to write down the things I know and share information about this topic with the whole community.
As this is my first post, I'm really pushing myself (I'm not a writing person and, even more, I'm not a native English speaker/writer). Please give feedback in any way possible. I've been working hard to build the habit of writing, but without feedback I feel like I'm writing to no one.
I really appreciate your time, thank you so much!
Top comments (5)
Hi @giulianobr, I'm waiting for your series. I'm a new guy in the big data field, I'm going to build a data lake, and I'm still deciding between AWS and GCP for storage.
That's why I'm looking at posts like this one. Hopefully you will go deeper into it so that I can understand more about GCP.
@giulianobr continue, please.
Hi Tran! I will soon! Thanks for your comment :)
It was a very interesting article.
However, I could not read the characters in the image, so please tell me if there is an enlarged view.
Hi!
Sorry for that:
take a look here: drive.google.com/file/d/1deNAyAu7v...
Thank you!