GitHub is a popular website for sharing open source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.
Repositories on GitHub can be tagged with topics. For example, the tensorflow repository has the topics python, machine-learning, deep-learning, and so on.
The page https://github.com/topics lists the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.
Project Outline
- 1. We're going to scrape https://github.com/topics
- 2. We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- 3. For each topic, we'll get the top 25 repositories listed on the topic page
- 4. For each repository, we'll grab the repo name, username, stars, and repo URL
- 5. By the end of the project, we'll create a CSV and an XLSX file in the following format: one row per repository, with the columns title, desc, repo_username, repo_name, repo_url, and repo_star
Install and import all the libraries needed
Before we begin, let's install the libraries with pip (openpyxl is the engine pandas uses to write .xlsx files):
pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl
Then, let's import them at the top of our script:
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
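To make sure everything is installed correctly, we can run a quick sanity check before writing the real functions (a minimal test, not part of the final script):

# quick sanity check: fetch the topics page and confirm we got HTML back
test_response = req.get("https://github.com/topics")
print(test_response.status_code)  # expect 200
print(test_response.text[:100])   # the first 100 characters of the HTML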
Write a function to download the page
def get_topic_link(base_url):
    response = req.get(url=base_url)
    response.raise_for_status()  # stop early if the request failed
    soup = BeautifulSoup(response.text, "html.parser")
    # each topic card on https://github.com/topics sits in a div with these classes
    tags = soup.find_all("div", class_="py-4 border-bottom d-flex flex-justify-between")
    topic_links = []
    for tag in tags:
        url_end = tag.find("a")["href"]  # relative link, e.g. /topics/3d
        topic_links.append(f"https://github.com{url_end}")
    return topic_links
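Calling the function should return a list of absolute topic URLs (a quick usage check; the exact results depend on the live page):

topic_links = get_topic_link("https://github.com/topics")
print(len(topic_links))  # number of topic cards found on the page
print(topic_links[:3])   # e.g. ['https://github.com/topics/3d', ...]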
Write a function to extract information
def get_info_topic(topic_link):
    response = req.get(topic_link)
    topic_soup = BeautifulSoup(response.text, "html.parser")
    topic_title = topic_soup.find("h1").text.strip()  # e.g. "3D"
    topic_desc = topic_soup.find("p").text.strip()
    return {
        "title": topic_title,
        "desc": topic_desc,
    }
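For a single topic page this returns a small dict with the title and description (the values come from the live page):

info = get_info_topic("https://github.com/topics/3d")
print(info["title"])  # e.g. "3D"
print(info["desc"])   # the topic's description paragraph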
def get_info_tags(topic_link):
    response = req.get(topic_link)
    info_soup = BeautifulSoup(response.text, "html.parser")
    # each repository card on a topic page sits in a div with these classes
    repo_tags = info_soup.find_all("div", class_="d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3")
    return repo_tags
def get_info(tag):
    anchors = tag.find_all("a")
    repo_username = anchors[0].text.strip()
    repo_name = anchors[1].text.strip()
    # the second anchor links to the repository itself, e.g. /mrdoob/three.js
    repo_url = f"https://github.com{anchors[1]['href']}"
    repo_star = tag.find("span", {"id": "repo-stars-counter-star"}).text.strip()
    # star counts are abbreviated, e.g. "86.4k" -> 86400
    if repo_star.endswith("k"):
        repo_value = int(float(repo_star[:-1]) * 1000)
    else:
        repo_value = int(repo_star)
    return {
        "repo_username": repo_username,
        "repo_name": repo_name,
        "repo_url": repo_url,
        "repo_star": repo_value,
    }
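Combining the last two functions, we can inspect one repository card from a topic page (the output values will vary as the page changes):

repo_tags = get_info_tags("https://github.com/topics/3d")
first_repo = get_info(repo_tags[0])
print(first_repo)  # {'repo_username': ..., 'repo_name': ..., 'repo_url': ..., 'repo_star': ...}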
Create a CSV file with the extracted information
def save_CSV(results):
    df = pd.DataFrame(results)
    df.to_csv("github.csv", index=False)
Create an XLSX file with the extracted information
def save_XLSX(results):
    df = pd.DataFrame(results)
    # to_excel needs an Excel engine such as openpyxl for .xlsx files
    df.to_excel("github.xlsx", index=False)
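Both helpers accept any list of flat dicts, so we can test them with a single made-up row before running the full scraper (the values below are illustrative only):

sample = [{
    "title": "3D",
    "desc": "3D refers to the use of three-dimensional graphics.",  # illustrative text
    "repo_username": "someuser",  # hypothetical values, not real scrape output
    "repo_name": "somerepo",
    "repo_url": "https://github.com/someuser/somerepo",
    "repo_star": 12300,
}]
save_CSV(sample)   # writes github.csv
save_XLSX(sample)  # writes github.xlsx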
Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for the repos scraped from a topic page
- Let's create a function to put them together
def main():
    base_url = "https://github.com/topics"
    topic_links = get_topic_link(base_url)  # list of URLs, e.g. https://github.com/topics/3d, https://github.com/topics/ajax, etc.
    results = []
    for topic_link in topic_links:
        print(f"Getting info from {topic_link}")
        topic_info = get_info_topic(topic_link)  # {"title": ..., "desc": ...}
        repo_tags = get_info_tags(topic_link)    # one tag per repository card
        for tag in repo_tags:
            repo_info = get_info(tag)
            # merge topic info and repo info into one flat row (dict union, Python 3.9+)
            results.append(topic_info | repo_info)
    save_CSV(results)
    save_XLSX(results)

if __name__ == "__main__":
    main()
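One detail worth noting: merging the topic dict and the repo dict uses the dict union operator (|), which requires Python 3.9 or newer. On older versions, dict unpacking produces the same result:

# equivalent merge for Python < 3.9
merged = {**topic_info, **repo_info}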
Conclusion
We're done here. I hope this simple project is a valuable addition to your Python web scraping practice.
GitHub: https://github.com/muchamadfaiz
Email: muchamadfaiz@gmail.com