This is a translated version of my tutorial originally published in Brazilian Portuguese. The repository with the code from this tutorial is on my GitLab profile.
Getting data and transforming it into information is the foundation of fields such as Data Science. Sometimes obtaining it is simple: for example, you can visit the Brazilian government website data.gov.br right now, download raw government data files, and analyze a .csv file (a plain-text format for tabular data) quickly and easily.
However, in some situations the data is harder to obtain. For example, the data you need for an analysis may only be available on a web page. In that case you can use Beautiful Soup, a Python library, to perform web scraping.
Beautiful Soup is one of the most popular Python libraries for collecting web data. It can extract data from HTML and XML files, and it has several methods that make searching for specific data on web pages simple and fast.
For this tutorial, we will extract data from Transfermarkt, a web platform that contains news and data about games, transfers, clubs and players from the football/soccer world.
We will collect the name, the country of the previous league and the price of each of the 25 most expensive players in the history of the AFC Ajax club. This information can be found on the Transfermarkt page.
The page that contains the information about AFC Ajax's 25 biggest signings
Extracting Data
Before obtaining the data itself, we will import the libraries the program needs: Beautiful Soup, Pandas and Requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd
After that, we will download the web page in our program using the requests library, which requests the information from the page, and the BeautifulSoup library, which transforms the data received by requests (a Response object) into a BeautifulSoup object that will be used in data extraction.
"""
To make the request to the page we have to inform the
website that we are a browser and that is why we
use the headers variable
"""
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
# endereco_da_pagina stands for the data page address
endereco_da_pagina = "https://www.transfermarkt.co.uk/ajax-amsterdam/transferrekorde/verein/610/saison_id//pos//detailpos/0/w_s//altersklasse//plus/1"
# The objeto_response variable will hold the downloaded web page
objeto_response = requests.get(endereco_da_pagina, headers=headers)
"""
Now we will create a BeautifulSoup object from our objeto_response.
The 'html.parser' parameter says which parser we will use when creating our object.
A parser is software responsible for converting an input into a data structure.
"""
pagina_bs = BeautifulSoup(objeto_response.content, 'html.parser')
pagina_bs is now a variable that contains all the HTML content of our data page.
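Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to do that and is not part of the original code: `baixar_pagina` is a hypothetical helper name and the 10-second timeout is an arbitrary choice. The last lines parse a small made-up HTML snippet, so the parsing step can be tried without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def baixar_pagina(url, headers, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    resposta = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    resposta.raise_for_status()
    return BeautifulSoup(resposta.content, "html.parser")


# Offline check of the parsing step, using a made-up HTML snippet
html_exemplo = b"<html><head><title>Exemplo</title></head><body><p>Oi</p></body></html>"
sopa_exemplo = BeautifulSoup(html_exemplo, "html.parser")
print(sopa_exemplo.title.string)
```

With the tutorial's variables, the call would look like `pagina_bs = baixar_pagina(endereco_da_pagina, headers)`.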
Now let's extract the data that is in our variable. Note that the information we need is in a table. Each row in this table represents a player, with his name, represented in HTML by an anchor (<a>) with the class "spielprofil_tooltip"; the country of the league he came from, represented as a flag image (<img>) with the class "flaggenrahmen" in the seventh column (<td>) of each row; and his cost, represented by a table cell (<td>) with the class "rechts hauptlink".
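To make that structure concrete, here is a minimal sketch that runs the three lookups against a single table row. The HTML is made up and heavily simplified (the real page has more columns and attributes), and "Player A", "Country A" and the price are placeholders, not real data.

```python
from bs4 import BeautifulSoup

# Made-up, simplified row that only mimics the classes described above
html_exemplo = """
<table>
  <tr>
    <td><a class="spielprofil_tooltip" href="#">Player A</a></td>
    <td><img class="flaggenrahmen" title="Country A" src="#"></td>
    <td class="rechts hauptlink">£10.00m</td>
  </tr>
</table>
"""

sopa = BeautifulSoup(html_exemplo, "html.parser")

# Each lookup filters by tag name and by the value of the class attribute
nome = sopa.find("a", {"class": "spielprofil_tooltip"}).text
pais = sopa.find("img", {"class": "flaggenrahmen"})["title"]
custo = sopa.find("td", {"class": "rechts hauptlink"}).text

print(nome, pais, custo)
```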
We will then get this data using the BeautifulSoup library.
First we will get the names of the players.
nomes_jogadores = [] # List that will receive all the players names
# The find_all() method returns all tags that match the restrictions given in parentheses
tags_jogadores = pagina_bs.find_all("a", {"class": "spielprofil_tooltip"})
# In our case, we are finding all anchors with the class "spielprofil_tooltip"
# Now we will get only the names of all players
for tag_jogador in tags_jogadores:
    nomes_jogadores.append(tag_jogador.text)
Now we will get the countries of the players' previous leagues.
pais_jogadores = [] # List that will receive all the names of the countries of the players' previous leagues.
tags_ligas = pagina_bs.find_all("td",{"class": None})
# tags_ligas now holds all the table cells that have no class attribute set
for tag_liga in tags_ligas:
    # The find() method will find the first image whose class is "flaggenrahmen" and that has a title
    imagem_pais = tag_liga.find("img", {"class": "flaggenrahmen", "title": True})
    # The imagem_pais variable will be a structure with all the image information,
    # including the title, which contains the name of the country of the flag image
    if imagem_pais is not None:  # We will test if we have found a match and then add it
        pais_jogadores.append(imagem_pais['title'])
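One detail worth checking in isolation is how attribute filters are passed to find(). In Beautiful Soup, the second argument is a dictionary of attribute filters and the third positional parameter is `recursive`, so every filter belongs in the same dictionary. The sketch below uses a made-up table cell with two flag images, only one of which has a title, to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up cell: the first image has no title attribute, the second one does
html_exemplo = """
<td>
  <img class="flaggenrahmen" src="#">
  <img class="flaggenrahmen" title="Country A" src="#">
</td>
"""
celula = BeautifulSoup(html_exemplo, "html.parser").td

# Filtering by class only returns the first image, which has no title
primeira = celula.find("img", {"class": "flaggenrahmen"})
# Adding "title": True to the same dictionary requires the attribute to exist
com_titulo = celula.find("img", {"class": "flaggenrahmen", "title": True})

print(primeira.has_attr("title"))
print(com_titulo["title"])
```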
Finally, we will get the players' prices.
custos_jogadores = []
tags_custos = pagina_bs.find_all("td", {"class": "rechts hauptlink"})
for tag_custo in tags_custos:
    texto_preco = tag_custo.text
    # The price text contains characters that we don't need, like £ (pounds) and m (million), so we'll remove them
    texto_preco = texto_preco.replace("£", "").replace("m", "")
    # We will now convert the value to a numeric variable (float)
    preco_numerico = float(texto_preco)
    custos_jogadores.append(preco_numerico)
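The conversion above assumes every cell looks like "£10.00m". If that does not hold, float() raises a ValueError. As a hedged sketch, the hypothetical helper `converter_preco` below handles a few other shapes. The "k" (thousand) suffix, the € symbol and non-numeric cells such as "-" are assumptions about what a page might contain, not something taken from the page used in this tutorial.

```python
def converter_preco(texto):
    """Convert a price string such as '£10.00m' to millions, or None if it cannot be parsed."""
    limpo = texto.strip().replace("£", "").replace("€", "")
    divisor = 1.0
    if limpo.endswith("m"):
        limpo = limpo[:-1]
    elif limpo.endswith("k"):
        # Assumed suffix for thousands: convert to millions
        limpo = limpo[:-1]
        divisor = 1000.0
    try:
        return float(limpo) / divisor
    except ValueError:
        # Anything that is not a number (for example "-") is reported as None
        return None


print(converter_preco("£10.00m"))
print(converter_preco("-"))
```

Returning None instead of raising keeps the loop running, but the caller then has to decide what to do with missing prices.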
Now that we have all the data we wanted, let's organize it to make any analysis easier. For this, we will use the pandas library and its DataFrame class, which represents a tabular data structure, similar to a common table.
# Creating a DataFrame with our data
# (columns: player, price in millions of pounds, country of origin)
df = pd.DataFrame({"Jogador": nomes_jogadores, "Preço (milhão de libra)": custos_jogadores, "País de Origem": pais_jogadores})
# Printing our gathered data
print(df)
Now we can see all the data we obtained with web scraping, organized in the DataFrame!
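Because the three lists are filled by separate searches, they can end up with different lengths, and pd.DataFrame raises a ValueError in that case. The sketch below checks the lengths first and then sorts the table by price. The values are placeholders rather than real transfer data, and the short column names and the CSV file name are only illustrative.

```python
import pandas as pd

# Placeholder values only, not real transfer data
nomes = ["Player A", "Player B", "Player C"]
custos = [10.0, 25.5, 7.2]
paises = ["Country A", "Country B", "Country C"]

# Checking the lengths first gives a clearer message than the pandas error
tamanhos = {len(nomes), len(custos), len(paises)}
if len(tamanhos) != 1:
    raise ValueError(f"Lists have different lengths: {sorted(tamanhos)}")

df_exemplo = pd.DataFrame({"Jogador": nomes, "Preço": custos, "País": paises})

# Most expensive player first, with a fresh 0..n-1 index
df_ordenado = df_exemplo.sort_values("Preço", ascending=False).reset_index(drop=True)
print(df_ordenado)

# To keep the result, it could be saved to a file (hypothetical file name):
# df_ordenado.to_csv("ajax_transferencias.csv", index=False)
```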
I hope this has helped in some way. If you have any problems or questions, feel free to leave a comment on this post or send me an email ;).