I am a Python developer with 15+ years of experience. I worked as an automation QA engineer for 4+ years (creating a custom testing framework, maintaining it, and doing functional testing), and as you might have guessed, working on the same thing for 4 years takes the charm and challenge out of it. So I was looking for something new, and at the start of 2022 an email went around about a Python developer role in another team. I jumped on it and got it based on a recommendation.
The new role was with the Quantitative Research team. To give some background, I am not good at math, especially math whose equations only contain letters :). One thing I was sure about was my ability to learn whatever the task at hand required, produce results, and improve them as I gained more experience and knowledge.
I was happy that I would get to work with the scientific side of Python libraries (NumPy, SciPy, etc.) and with GCP (Google Cloud Platform). The first task I started with was creating a pipeline to load data into BigQuery for our ML pipeline. I ran into some design-decision issues simply because I was not following a cloud-native approach.
- GET requests to the data server timing out after the first 1000 requests.
I created a fully functional local Airflow pipeline, but it started timing out on GCP. Why? Because I was opening a new socket for every single request to the data host! That would have worked fine on a single VM or server, but not in the cloud. In hindsight, I should have used a requests Session from the start; this is where I learned why sessions matter.
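For illustration, here is a minimal sketch of the session-based approach using the `requests` library; the data-server URL and parameters are placeholders, not the real service:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One Session reuses the underlying TCP connection (keep-alive) instead of
# opening a brand-new socket for every single request.
session = requests.Session()

# Retry transient failures with backoff so a busy data server does not
# immediately surface as a timeout in the Airflow task.
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries, pool_maxsize=20))

# Placeholder URL for illustration only.
DATA_SERVER_URL = "https://data-server.example.com/api/records"

def fetch_record(record_id: str) -> dict:
    response = session.get(DATA_SERVER_URL,
                           params={"id": record_id}, timeout=30)
    response.raise_for_status()
    return response.json()
```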
- Re-running an Airflow job
I must confess that choosing BigQuery (BQ) was not my decision. Since BQ has no concept of duplicate or unique rows, we needed to make sure that if we re-ran the job for a given day, it first cleaned that day's data out of BQ. That forced me to structure the data differently: partition the table on a date column, and in case of a re-run just drop that day's partition. But how do you decide which partition to drop? All of these decisions changed my conventional developer thinking and helped me think like a cloud-native developer.
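As a rough sketch of one way to do this with the `google-cloud-bigquery` client (not necessarily the exact approach used here; the project, dataset, table, and column names are made up for illustration), loading into the day's partition decorator with `WRITE_TRUNCATE` replaces only that partition, so a re-run cleans up after itself:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; the real table is partitioned on a date column.
TABLE_ID = "my-project.my_dataset.daily_data"

def load_day(rows, run_date):
    # "table$YYYYMMDD" targets a single partition. Combined with
    # WRITE_TRUNCATE, the load replaces that partition only, so
    # re-running the job for one day never leaves duplicate rows.
    destination = f"{TABLE_ID}${run_date.strftime('%Y%m%d')}"
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        time_partitioning=bigquery.TimePartitioning(field="event_date"),
    )
    job = client.load_table_from_json(rows, destination, job_config=job_config)
    job.result()  # block until the load job finishes and raise on errors
```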
- Doing the backfill with a start date going back 6 years.
This, I must confess, was the hardest part of the whole pipeline. How do you do a one-off backfill for the past 6 years, and then re-run it periodically whenever the data for specific IDs turns out to be corrupt?
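As a sketch of one way to drive such a backfill with Airflow (the DAG id, dates, and task below are placeholders): setting `catchup=True` with a `start_date` six years back makes the scheduler create one run per day since that date.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(ds, **_):
    # Placeholder: the real task pulls that day's data from the data server
    # and loads it into the matching BigQuery partition (see sketch above).
    print(f"loading data for {ds}")

with DAG(
    dag_id="bq_daily_backfill",
    start_date=datetime(2016, 1, 1),   # roughly six years back
    schedule_interval="@daily",
    catchup=True,         # scheduler creates a run for every day since start_date
    max_active_runs=4,    # throttle the backfill so API and BQ quotas survive
) as dag:
    PythonOperator(task_id="load_day", python_callable=load_partition)
```

A targeted re-run for a corrupted date range can then be kicked off from the CLI, e.g. `airflow dags backfill -s 2019-03-01 -e 2019-03-07 bq_daily_backfill`, and because each run only touches its own partition, re-running stays safe.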
When I started this development, I was sure the whole thing could be completed in 1 month, but in reality I had to learn about Airflow, BigQuery, GCP limitations, and access rights! In the end, I spent close to 2.5 months getting the pipeline running in production, and I was very happy that my first pipeline had been working fine for 3+ months without any issues.