1. Getting Accepted to the MLH Fellowship Program
I was accepted into the MLH Fellowship Program for the 2024 Spring batch and had the opportunity to work on Apache Airflow. Apache Airflow is an open-source workflow management platform for data engineering pipelines, and it is an established, widely used project in the data engineering community.
If you are interested in the program, I also wrote about how I prepared my application here.
Now that I have passed the mid-term of the MLH Fellowship program, I would like to share what I have done so far on the project.
2. My Contributions
Apache Airflow is a large repository, and I contributed to it in several areas. My contributions fall into five categories:
- Documentation
- Frontend
- Code Formatting
- Backend API
- Upgrading a Dependency (Connexion v3)
2.1 Documentation
First, I needed to set up a development environment before I could start contributing. Fortunately, Apache Airflow makes the process easy with Breeze; I just needed to follow the documentation to set it up. While reading the documentation, I found that some hyperlinks were broken. Fixing those broken links was my first contribution to the project.
- Added Supported Database Types #37376
On the Airflow Slack, a user requested a list of the supported database types. I added that list to the documentation using Sphinx.
2.2 Frontend
- Momento Warning #37281 (created an issue)
- Added shutdown color to the STATE_COLORS #37295
- Update searchBlogsPosts.js to avoid errors #956 (apache/airflow-site)
Since I'm into web development, I have a habit of keeping the browser console open to watch for warnings and errors. Whenever I found one, I fixed it right away.
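For context on the STATE_COLORS change above: Airflow lets deployments override the colors used for task states in the UI through airflow_local_settings.py. The snippet below is a minimal sketch of that mechanism with made-up values, not the actual diff from the PR.

```python
# airflow_local_settings.py -- minimal sketch of overriding Airflow UI state colors.
# The exact set of recognized state keys depends on your Airflow version.
STATE_COLORS = {
    "queued": "darkgray",
    "running": "lime",
    "success": "green",
    "failed": "red",
    "shutdown": "blue",  # the state the PR above added a default color for
}
```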
2.3 Code Formatting
- Applied D401 to airbyte files. #37370
- D105 checks - airflow.ti_deps #37578
- D105 Check on Amazon #37764
Since Apache Airflow has so many contributors (2,895 at the time of writing), code formatting checks are essential. The maintainers recently added two new docstring checks, D401 and D105.
Once a new check is introduced, a lot of existing files need to be updated, so the maintainers split the work into per-module tasks. I took a few modules and updated their code to follow the new rules; a small before/after example is sketched below.
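As an illustration (not taken from the actual PRs), D401 requires the first line of a docstring to be written in the imperative mood. A typical fix looks like this:

```python
# Hypothetical example of a D401 fix (imperative mood in the docstring summary).

def get_connection_before(conn_id: str) -> str:
    """Returns the connection ID unchanged."""  # fails D401: "Returns" is not imperative
    return conn_id


def get_connection_after(conn_id: str) -> str:
    """Return the connection ID unchanged."""  # passes D401
    return conn_id
```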
2.4 Backend API
- Filter Datasets by associated dag_ids (GET /datasets) #37512
Before this change, the endpoint could only filter datasets by their URI. Now it can also filter by the associated dag_ids, which makes it easy to see which datasets are connected to a given DAG.
Through implementing this feature, I learned a lot about SQLAlchemy, OpenAPI, and the structure of Airflow's unit tests.
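Just to illustrate how the endpoint might be called, here is a rough sketch; the base URL, credentials, and exact parameter names are assumptions and depend on your deployment and Airflow version.

```python
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"  # hypothetical local deployment

# Ask the REST API for datasets associated with specific DAGs.
resp = requests.get(
    f"{AIRFLOW_API}/datasets",
    params={"dag_ids": "example_dag_a,example_dag_b"},  # filter described above
    auth=("admin", "admin"),  # hypothetical basic-auth credentials
)
resp.raise_for_status()

for dataset in resp.json()["datasets"]:
    print(dataset["uri"])
```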
2.5 Upgrading a Dependency: Connexion v3
- Migrate to Connexion v3 #37638 (ongoing)
This is still an ongoing issue. I'm sure this will be the highlight of my internship.
Apache Airflow is upgrading Connexion from v2 to v3 to enhance security, but the upgrade surfaced many bugs. My teammate Sudipto and I are in charge of fixing those errors, and we created subtasks to tackle the bugs one by one.
What I have done so far:
- Fixed the Swagger configuration
- Fixed unit tests so they create HTTP requests the way a real user would (see the sketch below)
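To give a sense of what Connexion v3 looks like, here is a minimal sketch of a v3-style app and a test-client request. This is not Airflow's actual setup; the spec file name and route are made up, and details may differ between Connexion releases.

```python
# Minimal Connexion v3 sketch (not Airflow's real configuration).
# In v3 the framework is ASGI-based, and apps are created via connexion.FlaskApp
# (or connexion.AsyncApp) instead of wrapping a Flask app directly.
from connexion import FlaskApp

app = FlaskApp(__name__, specification_dir="openapi/")
app.add_api("api.yaml")  # hypothetical OpenAPI spec file


def test_health_endpoint():
    # Connexion v3 exposes a test client, so unit tests can issue
    # HTTP requests the same way an external user would.
    client = app.test_client()
    response = client.get("/api/v1/health")  # hypothetical route from api.yaml
    assert response.status_code == 200
```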
3. End Notes
Thank you for reading this far. Over the past 6 weeks, I've learned a lot, especially through my work on upgrading the dependency, which has helped me grow as a developer. I may write another post when I wrap up this internship.
I would like to express my gratitude to the MLH Fellowship program for providing me with this valuable learning opportunity, and to RBC (the Royal Bank of Canada) for sponsoring this internship position.