As a youth football coach and data engineer, I have long dreamed of having my own Moneyball moment.
But when coaching 10-year-old girls, we (the coaching team) are less concerned with batting averages and player valuations. Our primary goal is to get everyone to enjoy football and to develop as players and teammates. By getting these foundational parts right, we hope that many of today's team members will keep playing football for a long time.
But this doesn't mean that you can't use data to improve. Most clubs these days use standardized software to track attendance and team membership over time. That alone is extremely helpful for getting players to show up to practice and games, but the data can also be used when planning the season:
- Which days to practice?
- How many teams to register?
- Which level of difficulty (league) to pick?
- If you are bringing in an external coach, which day of the week would be the best one?
There are many more questions the data might answer, especially these days when teams keep track of scores, shots, running distance, and more.
I have been wanting to try DuckDB for a long time, and this seemed like the perfect excuse. Follow along to see how to scrape a website using Python and transform data using DBT. Best of all, it's all done on your local machine using DuckDB for persistence and querying.
Getting the Data
Within our club, we use Svenskalag.se, which has become a very popular system for managing sports teams in Sweden.
This system offers some basic reporting functionality, but you quickly run out of options if you want to do anything more than see how many training sessions each player has attended.
There is no public API available to extract the data, so the only option left is the dirtiest trick in the book: web scraping!
Using Scrapy, I fetched the data I needed (activities and attendance). Scrapy handled authentication using a form request in a very simple way:
yield scrapy.FormRequest(
    login_url,
    formdata={'UserName': username, 'UserPass': password},
    callback=self.parse,
)
Scrapy relies on XPath to extract data. I admit it: I rarely get those expressions right on the first try, so Chrome Developer Tools was a big help for testing them.
At the beginning I searched for individual elements to extract the data. However, after a while I noticed that all the data I needed was rendered as JavaScript/JSON within script tags. Example:
<script>
  var initData = {
    teams: [
      {
        id: 7335,
        name: "Soccer Dads"
      },
      {
        id: 9495,
        name: "Soccer Moms"
      }
    ]
  }
</script>
This made things a whole lot easier. By getting the text content from the script tag, I could use calmjs.parse to convert the JavaScript into a Python data structure. Much easier than finding tags and extracting text using XPath.
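As a rough illustration, here is a minimal sketch of that extraction step. The XPath expression and function name below are my own inventions for this example, and I'm leaning on calmjs.parse's extractor helper ast_to_dict, which flattens parsed JavaScript assignments into plain Python dicts:

import scrapy
from calmjs.parse import es5
from calmjs.parse.unparsers.extractor import ast_to_dict

def parse_init_data(response: scrapy.http.Response) -> dict:
    # Grab the raw text of the script tag that holds the rendered data.
    script_text = response.xpath(
        '//script[contains(text(), "initData")]/text()'
    ).get()
    # Parse the JavaScript, then flatten its assignments into a Python dict.
    program = es5(script_text)
    data = ast_to_dict(program)
    # For the example above, this yields something like:
    # {'initData': {'teams': [{'id': 7335, 'name': 'Soccer Dads'}, ...]}}
    return data['initData']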
Data Modelling
After fetching the data, I ended up with JSON objects that I stored in DuckDB. These needed to be transformed into something that could be analyzed more easily.
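The loading step itself is only a few lines with DuckDB's Python API. A minimal sketch, assuming made-up table and file names:

import json
import duckdb

def store_raw(items: list[dict], db_path: str = 'raw.duckdb') -> None:
    con = duckdb.connect(db_path)
    # One JSON document per row; all transformation is deferred to DBT.
    con.execute('CREATE TABLE IF NOT EXISTS raw_activities (payload JSON)')
    con.executemany(
        'INSERT INTO raw_activities VALUES (?)',
        [[json.dumps(item)] for item in items],
    )
    con.close()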
I decided to use DBT for this task, together with the DuckDB connector. DuckDB is especially brilliant when you are working locally on a project like this.
I had some issues at the beginning, but it turned out that DBT is very picky about the naming of the profile file (profiles.yml and NOT profiles.yaml) 🤦
In the DBT profile I configured DuckDB by attaching the database with the raw data and loading the extensions needed (ICU for time zone parsing). After that, it felt like any other DBT project I have worked on.
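For reference, a dbt-duckdb profile along those lines looks roughly like this (the project name, paths, and alias are placeholders; attach and extensions are dbt-duckdb settings):

football_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: analytics.duckdb    # database DBT builds its models into
      extensions:
        - icu                   # needed for time zone parsing
      attach:
        - path: raw.duckdb      # database holding the scraped raw data
          alias: raw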
As a frequent Snowflake user, I appreciate its simplicity when it comes to handling semi-structured data. It turns out DuckDB can do it just as well. I used UNNEST to pick apart the JSON payload, which was almost hassle-free (along the way I learnt that DuckDB lists use 1-based indexing). A pleasant surprise was that DuckDB also supports QUALIFY.
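To show what those two look like in practice, here is a self-contained toy example run through the DuckDB Python API (the players and columns are invented, not the real schema):

import duckdb

# Toy stand-in for the scraped data: one row per player with a list of sessions.
duckdb.sql("""
    WITH raw AS (
        SELECT * FROM (VALUES
            ('Alice', ['2024-05-02', '2024-05-09']),
            ('Beda',  ['2024-05-02'])
        ) AS t(player, sessions)
    ),
    exploded AS (
        -- UNNEST gives each list element its own row.
        -- (List indexing, e.g. sessions[1], is 1-based in DuckDB.)
        SELECT player, unnest(sessions) AS session_date
        FROM raw
    )
    -- QUALIFY filters on a window function without a subquery:
    -- keep only each player's most recent session.
    SELECT player, session_date
    FROM exploded
    QUALIFY row_number() OVER (PARTITION BY player ORDER BY session_date DESC) = 1
""").show()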
Analysis
Over time I have gotten to know my coaching colleagues pretty well, and they like Microsoft Excel. I recognize this all too well from my day job: you build fancy data models in a data warehouse, just for them to be exported into Microsoft Excel or Google Sheets.
But this time I came prepared and built a wide table containing all the data you could imagine, which can then be exported to a CSV file (using DuckDB) with a single command. Anyone can open it in Excel and be happy 🙂
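The export itself is a one-liner with DuckDB's COPY statement (the table and file names below are placeholders):

import duckdb

con = duckdb.connect('analytics.duckdb')
# COPY streams any query result straight into a CSV file that Excel opens happily.
con.execute("""
    COPY (SELECT * FROM attendance_wide)
    TO 'attendance.csv' (HEADER, DELIMITER ',')
""")
con.close()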
Summary
It is a breeze to get an analytics project up and running using tools such as DBT and DuckDB. And although fragile, web scraping can be a lifesaver and a tool worth having around.
You can find the source code for this project on GitHub. Don't be a stranger, I'd love to hear your feedback.
Now let's get back to enjoying the football played at Euro 2024 ⚽