Cons of ETL pipelines
Just another day at the workplace.
Five minutes after booting up:
You hear everyone complain that the production database is slow. You quickly start to investigate, exploring every possible cause on the dashboards...
Could it be that long-running slow query you had already raised a ticket for production support to fix? Or is it one of the queries filtering on a non-indexed column?
The 6th minute, and 15 minutes later:
Next, you hear the fellow data analysts lamenting their failed reports.
You now realize that your database server’s CPU has taken on a humongous query load, and your relational database system has gone for a toss into an eternal slumber.
And all of this because of one slow-running query in your ETL pipeline..!! Ding. Ding... Ding…!! We have a winner!!!
Alright, let’s back it up a little bit.
You probably did (or used) one of the following:
- `SELECT * FROM production_database.table WHERE updated_at BETWEEN x AND y;`
- Bulk exports and dumps every few minutes
- Long-running, forgotten zombie crontabs
Let’s put it this way... if your DB is under 500 GB, it’s OK to do selects. But once it exceeds 500 GB, unless you want your budget to take the hit, do not ever do a bulk select and transfer it over the wire to the destination database in the form of a pipeline.
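To make the damage concrete: the first pattern above usually looks something like this minimal Python sketch (using pymysql, with a hypothetical `orders` table and made-up connection details). Every run re-scans a window of rows on the production primary and drags the entire result set over the wire:

```python
import pymysql

def poll_changes(since, until):
    # Hypothetical host/credentials, for illustration only.
    conn = pymysql.connect(
        host="prod-db",
        user="etl",
        password="...",
        database="production_database",
    )
    try:
        with conn.cursor() as cur:
            # Re-executed every run; without an index on updated_at,
            # each poll is a full table scan on the primary.
            cur.execute(
                "SELECT * FROM orders WHERE updated_at BETWEEN %s AND %s",
                (since, until),
            )
            # Pulls the whole result set over the wire into memory.
            return cur.fetchall()
    finally:
        conn.close()
```

Multiply that by a few hundred gigabytes and a cron schedule of “once every few minutes”, and the eternal slumber from earlier starts to make sense.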
Enter CDC:
CDC, a.k.a. Change Data Capture, is there to assist this data-wrangling exercise. If your data is growing rapidly and your BI/BA folks need access to that sweet, sweet DWH (data warehouse), THIS right here is the way to go.
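One prerequisite worth flagging: CDC on MySQL reads the binary log, so the log has to be enabled and, for row-level capture, set to ROW format. A quick sanity check (connection details are hypothetical):

```python
import pymysql

conn = pymysql.connect(host="prod-db", user="etl", password="...")
with conn.cursor() as cur:
    # SHOW VARIABLES returns (Variable_name, Value) rows.
    cur.execute(
        "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')"
    )
    settings = dict(cur.fetchall())

# For binlog-based CDC you want: {'log_bin': 'ON', 'binlog_format': 'ROW'}
print(settings)
```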
You mean replication? I don’t wanna get stuck with my OLTP DB engine:
Well, you are not alone. For many reasons, data wranglers generally prefer not to use the same database engine as the OLTP system; the main ones being query performance, the need for triggers, and the ease of re-running transformation jobs. But doing so usually requires setting up Airflow clusters and connectors to transform and load.
Wouldn’t it be great if you could do all this without doing a bulk select from the production database? Of course you can...
CD-Stream is a cross-database, CDC-driven replication tool that currently supports replication between MySQL and Postgres.
The tool runs queues that process the events recorded in the source database’s binary logs and replicates them across to a destination database of an entirely different engine.
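CD-Stream’s internals aren’t reproduced here, but the general shape of binlog-driven replication looks roughly like this sketch, which tails the MySQL binlog with the python-mysql-replication library and replays inserts into Postgres via psycopg2. The table, columns, hosts, and credentials are all hypothetical:

```python
import psycopg2
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

# Connect to the source database as a replication client and tail its binlog.
stream = BinLogStreamReader(
    connection_settings={"host": "prod-db", "port": 3306,
                         "user": "repl", "passwd": "..."},
    server_id=100,                 # must be unique among replication clients
    blocking=True,                 # wait for new events instead of exiting
    resume_stream=True,
    only_events=[WriteRowsEvent],  # inserts only, to keep the sketch short
)

pg = psycopg2.connect(host="dwh", dbname="warehouse", user="etl", password="...")
cur = pg.cursor()

for event in stream:
    for row in event.rows:
        vals = row["values"]  # dict of column name -> inserted value
        cur.execute(
            "INSERT INTO orders (id, amount) VALUES (%s, %s)",
            (vals["id"], vals["amount"]),
        )
    pg.commit()  # commit once per binlog event
```

A real replicator also has to handle UpdateRowsEvent and DeleteRowsEvent, track binlog positions so it can resume after a crash, and buffer events through queues (as CD-Stream does) so a slow destination never backs up the reader.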
After the setup described on the project page (CD-Stream), you’ll find a directory called ‘sample’ in the project, which contains some intensive DDL and data-insertion scripts for you to evaluate and exercise.
And that, people, is how you wrangle production data!!!
Originally published at https://www.datawrangler.in on October 30, 2018.