GitHub is a popular website for sharing open source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.
Repositories on GitHub can be tagged with topics. For example, the tensorflow repository has the topics python, machine-learning, deep-learning, and so on.
The page https://github.com/topics lists the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.
Project Outline
- 1. We're going to scrape https://github.com/topics
- 2. We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- 3. For each topic, we'll get the top 25 repositories listed on the topic page
- 4. For each repository, we'll grab the repo name, username, stars, and repo URL
- 5. By the end of the project, we'll create a CSV and an XLSX file in the following format: one row per repository, with the columns title, desc, repo_username, repo_name, repo_url, and repo_star
Install and import all the libraries needed
Before we begin, let's install the libraries with pip (openpyxl is the engine pandas uses to write .xlsx files):
pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl
Then, let's import them at the top of our script:
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
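To make sure everything is installed correctly, we can run a quick sanity check before writing the real functions (a minimal test, not part of the final script):

# quick sanity check: fetch the topics page and confirm we got HTML back
test_response = req.get("https://github.com/topics")
print(test_response.status_code)  # expect 200
print(test_response.text[:100])   # the first 100 characters of the HTML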
Write a function to download the page
def get_topic_link(base_url):
    response = req.get(url=base_url)
    response.raise_for_status()  # stop early if the request failed
    soup = BeautifulSoup(response.text, "html.parser")
    # each topic card on https://github.com/topics sits in a div with these classes
    tags = soup.find_all("div", class_="py-4 border-bottom d-flex flex-justify-between")
    topic_links = []
    for tag in tags:
        url_end = tag.find("a")["href"]  # relative link, e.g. /topics/3d
        topic_links.append(f"https://github.com{url_end}")
    return topic_links
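Calling the function should return a list of absolute topic URLs (a quick usage check; the exact results depend on the live page):

topic_links = get_topic_link("https://github.com/topics")
print(len(topic_links))  # number of topic cards found on the page
print(topic_links[:3])   # e.g. ['https://github.com/topics/3d', ...]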
Write a function to extract information
def get_info_topic(topic_link):
    response = req.get(topic_link)
    topic_soup = BeautifulSoup(response.text, "html.parser")
    topic_title = topic_soup.find("h1").text.strip()  # e.g. "3D"
    topic_desc = topic_soup.find("p").text.strip()
    return {
        "title": topic_title,
        "desc": topic_desc,
    }
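For a single topic page this returns a small dict with the title and description (the values come from the live page):

info = get_info_topic("https://github.com/topics/3d")
print(info["title"])  # e.g. "3D"
print(info["desc"])   # the topic's description paragraph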
def get_info_tags(topic_link):
    response = req.get(topic_link)
    info_soup = BeautifulSoup(response.text, "html.parser")
    # each repository card on a topic page sits in a div with these classes
    repo_tags = info_soup.find_all("div", class_="d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3")
    return repo_tags
def get_info(tag):
    anchors = tag.find_all("a")
    repo_username = anchors[0].text.strip()
    repo_name = anchors[1].text.strip()
    # the second anchor links to the repository itself, e.g. /mrdoob/three.js
    repo_url = f"https://github.com{anchors[1]['href']}"
    repo_star = tag.find("span", {"id": "repo-stars-counter-star"}).text.strip()
    # star counts are abbreviated, e.g. "86.4k" -> 86400
    if repo_star.endswith("k"):
        repo_value = int(float(repo_star[:-1]) * 1000)
    else:
        repo_value = int(repo_star)
    return {
        "repo_username": repo_username,
        "repo_name": repo_name,
        "repo_url": repo_url,
        "repo_star": repo_value,
    }
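Combining the last two functions, we can inspect one repository card from a topic page (the output values will vary as the page changes):

repo_tags = get_info_tags("https://github.com/topics/3d")
first_repo = get_info(repo_tags[0])
print(first_repo)  # {'repo_username': ..., 'repo_name': ..., 'repo_url': ..., 'repo_star': ...}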
Create a CSV file with the extracted information
def save_CSV(results):
    df = pd.DataFrame(results)
    df.to_csv("github.csv", index=False)
Create an XLSX file with the extracted information
def save_XLSX(results):
    df = pd.DataFrame(results)
    # to_excel needs an Excel engine such as openpyxl for .xlsx files
    df.to_excel("github.xlsx", index=False)
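Both helpers accept any list of flat dicts, so we can test them with a single made-up row before running the full scraper (the values below are illustrative only):

sample = [{
    "title": "3D",
    "desc": "3D refers to the use of three-dimensional graphics.",  # illustrative text
    "repo_username": "someuser",  # hypothetical values, not real scrape output
    "repo_name": "somerepo",
    "repo_url": "https://github.com/someuser/somerepo",
    "repo_star": 12300,
}]
save_CSV(sample)   # writes github.csv
save_XLSX(sample)  # writes github.xlsx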
Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for the repos scraped from a topic page
- Let's create a function to put them together
def main():
    base_url = "https://github.com/topics"
    topic_links = get_topic_link(base_url)  # list of URLs, e.g. https://github.com/topics/3d, https://github.com/topics/ajax, etc.
    results = []
    for topic_link in topic_links:
        print(f"Getting info from {topic_link}")
        topic_info = get_info_topic(topic_link)  # {"title": ..., "desc": ...}
        repo_tags = get_info_tags(topic_link)    # one tag per repository card
        for tag in repo_tags:
            repo_info = get_info(tag)
            # merge topic info and repo info into one flat row (dict union, Python 3.9+)
            results.append(topic_info | repo_info)
    save_CSV(results)
    save_XLSX(results)

if __name__ == "__main__":
    main()
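One detail worth noting: merging the topic dict and the repo dict uses the dict union operator (|), which requires Python 3.9 or newer. On older versions, dict unpacking produces the same result:

# equivalent merge for Python < 3.9
merged = {**topic_info, **repo_info}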
Conclusion
We're done here. I hope this simple project is a valuable addition to your Python web scraping practice.
GitHub: https://github.com/muchamadfaiz
Email: muchamadfaiz@gmail.com