This article was originally posted on my self-hosted blog on August 10, 2021.
The Goal
"Automate an ETL processing pipeline for COVID-19 data using Python and cloud services": #CloudGuruChallenge – Event-Driven Python on AWS
I saw this challenge when it was first posted in September 2020, but my Python and AWS skills at the time were not nearly good enough to tackle it. Fast-forward ten months and I was finally ready to give it a shot.
The idea is simple: download some data, transform and merge it, load it into a database, and create some sort of visualization for it. In practice, of course, there were lots of choices to make and plenty of new things I needed to learn to be successful.
First things first: Python
The data sources are .csv files, updated daily, from the New York Times and Johns Hopkins University, and both are published on GitHub. I started by downloading the raw files locally, extracting them into dataframes with Pandas, and creating a separate module to do the work of transforming and merging the data. For my local script, I created a container class to act as a database, into which I could write each row of the resulting dataframe. This let me work out the logic for determining whether there was already data in the 'database', and therefore whether to write the entire dataset or just load any new rows that weren't already there.
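To give a sense of what that module does, here's a rough sketch of the extract-and-merge step plus the full-versus-incremental check. The URLs, column names, and function names are illustrative guesses, not the exact code from my repo.

```python
# A rough sketch of the transform-and-merge step (not my exact code):
# pull both daily CSVs, keep the columns I care about, and join them on date.
import pandas as pd

def extract(nyt_url: str, jhu_url: str) -> pd.DataFrame:
    """Download both raw CSVs and return a single merged dataframe."""
    nyt = pd.read_csv(nyt_url, parse_dates=["date"])    # date, cases, deaths (illustrative)
    jhu = pd.read_csv(jhu_url, parse_dates=["Date"])    # Date, Country/Region, Recovered (illustrative)
    # Keep only US rows from the JHU file and rename columns to match the NYT frame
    jhu = jhu[jhu["Country/Region"] == "US"]
    jhu = jhu.rename(columns={"Date": "date", "Recovered": "recovered"})
    merged = nyt.merge(jhu[["date", "recovered"]], on="date", how="inner")
    return merged.dropna()

def rows_to_load(merged: pd.DataFrame, latest_in_db) -> pd.DataFrame:
    """If the 'database' already has data, only return rows newer than its latest date."""
    if latest_in_db is None:        # empty database -> load everything
        return merged
    return merged[merged["date"] > latest_in_db]
```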
Along the way, I worked through my first major learning objective of this challenge: unit testing. Somewhat surprisingly, the online bootcamp I took over the winter didn't cover code testing at all, and I was intimidated by the idea. After some research, I chose pytest for its simplicity and easy syntax relative to Python's built-in unittest. With a little experimentation, I was able to write tests for many of my functions, and even dabbled a bit in test-first development.
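For the curious, a typical test ended up looking something like this - the module, fixture, and function names are illustrative (matching the sketch above) rather than lifted from my actual test suite.

```python
# test_transform.py -- run with `pytest`
import pandas as pd
import pytest

from etl import rows_to_load   # illustrative module/function names

@pytest.fixture
def merged_frame():
    return pd.DataFrame({
        "date": pd.to_datetime(["2021-08-01", "2021-08-02"]),
        "cases": [100, 110],
        "deaths": [1, 2],
        "recovered": [50, 55],
    })

def test_full_load_when_database_empty(merged_frame):
    # An empty 'database' means the whole dataset should be loaded
    assert len(rows_to_load(merged_frame, None)) == 2

def test_incremental_load_skips_existing_rows(merged_frame):
    # Only rows newer than the latest date already stored should come back
    result = rows_to_load(merged_frame, pd.Timestamp("2021-08-01"))
    assert list(result["date"]) == [pd.Timestamp("2021-08-02")]
```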
Decisions, decisions...
Once my Python function was working locally, I had to decide which step to take next, as there were a couple of choices. After some thinking, and discussing my ideas with my mentor, I went with my second learning objective: Terraform. I've worked a little with Infrastructure as Code in the form of AWS CloudFormation and the AWS Serverless Application Model, but I'd been meaning to try the provider-agnostic Terraform for several months.
I started a separate PyCharm project, wrote a quick little Lambda function handler, and dove into the Terraform tutorials. Once I got the hang of the basics, I found a Terraform Lambda module and started plugging my own values into the template. A sticking point here was figuring out how to get Pandas to operate as a Lambda layer - after failing to correctly build a layer myself (thank you, Windows), I found a prebuilt layer that worked perfectly and added it to my Terraform configuration as an S3 upload.
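The handler itself stayed deliberately thin, just a wrapper around the ETL module; something along these lines, with the module and helper names being placeholders rather than my real code:

```python
# handler.py -- Lambda entry point. Pandas comes from the prebuilt layer,
# so only the ETL module itself gets packaged with the function.
import etl  # placeholder module name

def lambda_handler(event, context):
    merged = etl.extract(etl.NYT_URL, etl.JHU_URL)                 # placeholder constants
    new_rows = etl.rows_to_load(merged, etl.latest_date_in_db())   # placeholder helper
    etl.write_rows(new_rows)
    return {"rows_written": len(new_rows)}
```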
I confirmed that the Terraform configuration worked when deployed from my local machine, and then turned my attention to setting up a GitHub Action for automatic deployment. I combined pytest and Terraform into one workflow, with the Terraform step dependent upon all tests passing, so that I had a full CI/CD pipeline from my local computer to GitHub and on to AWS via Terraform Cloud.
Starting to come together
With deployment just a git push away, it was time to start utilizing other AWS resources. This brought me to my third big learning objective: boto3. I recall being a bit overwhelmed by boto3 and its documentation last fall when I was working on the Resume Challenge. Fortunately, lots of practice reading documentation in the intervening months paid off, as it wasn't nearly as scary as I'd feared once I actually got started. I added SNS functionality first, so that I would get an email any time the database was updated or an error occurred. With that working nicely, it was time for another decision: what database to use?
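The notification piece is only a few lines of boto3 - roughly this, with the topic ARN supplied through an environment variable (the variable and function names here are mine, not necessarily what's in the repo):

```python
import os
import boto3

sns = boto3.client("sns")

def notify(subject: str, message: str) -> None:
    """Publish a short status message to the SNS topic (ARN passed in via env var)."""
    sns.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Subject=subject,
        Message=message,
    )

# e.g. notify("COVID ETL update", f"Loaded {len(new_rows)} new rows") on success,
# or notify("COVID ETL error", str(exc)) from an except block.
```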
I used DynamoDB for the Resume Challenge, but that was just one cell being atomically incremented. Much of my database experience since then has been with various RDS instances, so I wanted to gain some more experience with AWS's serverless NoSQL option. Back to the documentation I went, as well as to Google to figure out the best way to overcome the batch-writing limits. Before long, my Lambda function was behaving exactly how I wanted, with everything still being deployed by Terraform.
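The batch-writing limit turned out to be less painful than I expected, because the DynamoDB Table resource's batch_writer() context manager buffers puts into 25-item batches and retries unprocessed items for you. A sketch of the load step, with a placeholder table name and schema:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("covid-data")   # placeholder table name

def write_rows(rows) -> None:
    """Batch-write dataframe rows to DynamoDB.

    batch_writer() groups put_item calls into 25-item BatchWriteItem requests
    and retries unprocessed items, so the API's batch limit never surfaces here.
    """
    with table.batch_writer() as batch:
        for _, row in rows.iterrows():
            batch.put_item(Item={
                "date": row["date"].strftime("%Y-%m-%d"),   # partition key (placeholder schema)
                "cases": int(row["cases"]),
                "deaths": int(row["deaths"]),
                "recovered": int(row["recovered"]),
            })
```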
Finishing touches
At this point, I was cruising along, and it was a simple matter to create an EventBridge scheduled event to trigger my Lambda function once a day. It took a few tries to get the permissions and attachments set up correctly in Terraform, and once that was completed, I had to figure out the data visualization solution. I could have gone with Amazon QuickSight, but I explored a bit and settled on using a self-hosted instance of Redash. Since there was already an EC2 AMI with Redash installed, I was able to add that to my Terraform configuration (although I cheated a wee bit and created a security group and IAM role for the instance in the console, in the name of finally finishing this project).
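I wired the schedule up in Terraform, but for readers more comfortable with boto3, the same three pieces - a rule, a target, and permission for EventBridge to invoke the function - look roughly like this (names and ARNs are placeholders):

```python
# Equivalent of the Terraform wiring, expressed in boto3 for illustration.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:covid-etl"  # placeholder

# 1. A scheduled rule that fires once a day
rule = events.put_rule(Name="covid-etl-daily", ScheduleExpression="rate(1 day)")

# 2. Permission for EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName="covid-etl",              # placeholder function name
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# 3. The Lambda function attached to the rule as a target
events.put_targets(
    Rule="covid-etl-daily",
    Targets=[{"Id": "covid-etl", "Arn": FUNCTION_ARN}],
)
```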
With Redash up and running, and some simple visualizations scheduled to update daily, I reached the end of the project requirements earlier today. Huzzah!
Room for growth
I'm happy with how this project went. I invested nearly 50 hours to get it going, largely because of the number of topics I had to teach myself along the way - a hefty but worthwhile commitment over the past two weeks. A few things I think would get better with more learning and practice:
- I suspect my Terraform configuration is a little rough around the edges, and could probably be refactored a bit.
- Because so many things were new to me, I spent a lot of time in the console, manually coding and testing functionality in an account separate from the one I used for the finished product. It struck me, after almost everything was done, that this might have been an opportunity to learn more about using environment variables to create development and production stages. I'm not sure whether that would have been more useful for this application than using two accounts, but my workflow felt a bit kludgy.
- I spent a solid three hours rewriting my Terraform script because of what turned out to be an IAM permission scoping issue - yikes! I ended up going back to the Terraform configuration I had already been using, albeit with the right IAM permissions, because the module I was using for Lambda was more efficient at packaging code than Terraform's native config.
- My mentor and I worked through a lot of the Python together, and I found myself getting frustrated at my very basic understanding of object-oriented programming. While I didn't end up using any of my own classes in the final product, I can see that's a subject I should spend more time learning.
- It might have been nice to figure out some more complex visualizations, such as daily changes, but I wasn't sure how to go about that. I suspect my choice of querying my DynamoDB table directly from Redash, as opposed to porting the data to S3 for consumption by Athena or some other service, may have played a role in how complex I could get.
Aaaaaand done
Many long nights and many more rabbit holes later, I can finally present my finished product!
Click here for the GitHub repository, and click here for the dashboard.