When I was learning web scraping, one of the ideas that came to mind was a GitHub scraper.
Here I will try my best to describe each step.
Let's start.
We have to install a few packages first.
- BeautifulSoup (beautifulsoup4)
- requests
- html5lib
pip install requests
pip install html5lib
pip install beautifulsoup4
- Then open https://github.com/yourusername
- Open DevTools.
- This is what I see when I open my dashboard and DevTools.
When we scrape a web page, we need an element's id, class name, or XPath to locate it.
We will be scraping the name, username, number of repos, followers, following, and profile image.
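To make the idea of locating elements concrete, here is a small sketch on a made-up HTML snippet. The markup, id, and class names are hypothetical and are not GitHub's. It uses Python's built-in html.parser so it runs without html5lib. BeautifulSoup has no XPath support of its own, so a CSS selector via select_one stands in for it here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; this is not GitHub's real HTML.
html = """
<div id="profile">
  <h1 class="title main">Jane Doe</h1>
  <a class="link" href="/jane">jane</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="profile")               # look up by id
by_class = soup.find("h1", class_="title")    # look up by a single class name
by_css = soup.select_one("#profile a.link")   # look up by CSS selector

print(by_id.name)
print(by_class.get_text())
print(by_css["href"])
```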
import requests
from bs4 import BeautifulSoup
import html5lib
- Import the modules.
r=requests.get("https://github.com/fredysomy")
soup=BeautifulSoup(r.content,'html5lib')
- Make a request to the website.
- Parse the HTML received as the response in r.content using BeautifulSoup and html5lib. From here we start scraping.
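The request step can also be wrapped in a small function that fails loudly when the page cannot be fetched. This is only a sketch: the helper names profile_url and fetch_profile_soup and the 10 second timeout are my own assumptions, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def profile_url(username):
    # Build the profile URL for a given username (hypothetical helper).
    return "https://github.com/" + username


def fetch_profile_soup(username, timeout=10):
    # Download the profile page and parse it with html5lib (hypothetical helper).
    r = requests.get(profile_url(username), timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(r.content, "html5lib")
```

Calling fetch_profile_soup("fredysomy") would then return the same kind of soup object used in the rest of this post.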
namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()
- Here we get the h1 element with the class name "vcard-names pl-2 pl-md-0" and assign it to the namediv variable.
- The name and username are in the span elements inside that h1.
- We find all span elements, select them by index (0: name, 1: username), and get the text using the getText() function.
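The lookup above breaks with an AttributeError if the h1 is not found, for example when the class names on the page change. Below is a slightly more defensive sketch of the same step. It runs on a simplified, made-up stand-in for the profile header (the names in it are placeholders), and it matches on the single class vcard-names instead of the full class string, which is my own choice rather than the original approach.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the profile header markup.
html = """
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Example Name</span>
  <span>exampleuser</span>
</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# A single class name still matches, because BeautifulSoup treats
# class as a multi-valued attribute.
namediv = soup.find("h1", class_="vcard-names")

name, u_name = None, None
if namediv is not None:
    spans = namediv.find_all("span")
    if len(spans) >= 2:
        name = spans[0].get_text(strip=True)
        u_name = spans[1].get_text(strip=True)

print(name, u_name)
```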
statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
- Here the same thing happens.
- Followers, following, and stargazers are inside the element with the class name flex-order-1 flex-md-order-none mt-2 mt-md-0, in the element with class mb-3 nested inside it. Let's get that and store it in the elements variable.
- Calling find_all('a') on elements returns a list of links, each containing a span with the count:
- Followers has index 0
- Following has index 1
- Stargazers has index 2
elements.find_all('a')[2].find('span').getText().strip(' ')
- Here we take the item at index 2 (the third a element) and call getText() on the span inside it. We use strip(' ') to remove unnecessary spaces from both ends of the result.
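One detail worth knowing: strip(' ') only removes space characters, while strip() with no argument also removes newlines and tabs. The sketch below shows the difference on a made-up snippet shaped like the stats block (the numbers and markup are invented for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first span deliberately contains a newline.
html = """
<div class="mb-3">
  <a href="#"><span> 12
  </span> followers</a>
  <a href="#"><span> 34 </span> following</a>
  <a href="#"><span> 5 </span> stars</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
elements = soup.find(class_="mb-3")
links = elements.find_all("a")

raw = links[0].find("span").getText()
print(repr(raw.strip(" ")))  # the newline survives, only spaces are removed
print(repr(raw.strip()))     # all leading and trailing whitespace is removed

# Read the three counters by position: 0 followers, 1 following, 2 stars.
followers, following, totstars = (
    a.find("span").getText().strip() for a in links[:3]
)
print(followers, following, totstars)
```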
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']
- The above code finds the image tag, and we read its src attribute.
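Indexing with ['src'] raises a KeyError if the attribute is missing, and a TypeError if find() returned None. A hedged alternative is the tag's get() method, shown here on a made-up img tag (the URL and the data-missing attribute are placeholders).

```python
from bs4 import BeautifulSoup

# Hypothetical image tag for illustration.
html = '<img class="avatar avatar-user" src="https://example.com/avatar.png">'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", class_="avatar-user")

# get() returns None instead of raising when the attribute is absent.
u_img = img.get("src") if img is not None else None
missing = img.get("data-missing")

print(u_img, missing)
```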
repo_num=soup.find(class_="UnderlineNav-body").find('span',class_="Counter").getText()
- Here we get the number of repos the user has.
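getText() returns a string, so the count needs a conversion if you want to do arithmetic with it. The sketch below runs on a made-up navigation snippet and assumes the counter is a plain integer. If the text is not purely digits, it leaves the value as None rather than guessing at the format.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the navigation bar with a counter.
html = """
<nav class="UnderlineNav-body">
  <a href="#">Repositories <span class="Counter"> 42 </span></a>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")

nav = soup.find(class_="UnderlineNav-body")
counter = nav.find("span", class_="Counter") if nav is not None else None
repo_text = counter.getText().strip() if counter is not None else ""

# Convert only when the text is all digits; otherwise keep None.
repo_num = int(repo_text) if repo_text.isdigit() else None
print(repo_num)
```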
That is all you need to scrape user details with Python.
Source Code
import requests
from bs4 import BeautifulSoup
import html5lib

# Fetch the profile page and parse it with the html5lib parser.
r = requests.get("https://github.com/fredysomy")
soup = BeautifulSoup(r.content, 'html5lib')

# Name and username are the first two spans inside the h1.
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
name = namediv.find_all('span')[0].getText()
u_name = namediv.find_all('span')[1].getText()

# Followers, following, and stars are the first three links in the stats block.
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
followers = elements.find_all('a')[0].find('span').getText().strip(' ')
following = elements.find_all('a')[1].find('span').getText().strip(' ')
totstars = elements.find_all('a')[2].find('span').getText().strip(' ')

# Profile image URL and repository count.
u_img = soup.find(class_="avatar avatar-user width-full border bg-white")['src']
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
- The idea is that the program should navigate to the element we want and then select the required data from it.
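As a rough illustration of that idea, the navigation steps can be collected into one function that returns a dictionary. This is a sketch under my own assumptions: the names text_of and parse_profile are hypothetical, the sample HTML is made up, and only three of the fields are covered to keep it short.

```python
from bs4 import BeautifulSoup


def text_of(tag):
    # Return stripped text for a tag, or None when the tag was not found.
    return tag.get_text(strip=True) if tag is not None else None


def parse_profile(html):
    # Navigate to each element, then select the data we need from it.
    soup = BeautifulSoup(html, "html.parser")
    namediv = soup.find("h1", class_="vcard-names")
    spans = namediv.find_all("span") if namediv is not None else []
    img = soup.find("img", class_="avatar-user")
    return {
        "name": text_of(spans[0]) if len(spans) > 0 else None,
        "username": text_of(spans[1]) if len(spans) > 1 else None,
        "image": img.get("src") if img is not None else None,
    }


# Hypothetical sample markup, not GitHub's real HTML.
sample = """
<img class="avatar avatar-user" src="https://example.com/a.png">
<h1 class="vcard-names"><span>Example Name</span><span>exampleuser</span></h1>
"""

profile = parse_profile(sample)
print(profile)
```

Returning None for missing pieces keeps the function from crashing when one part of the page changes while the rest still parses.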
Refer to some BeautifulSoup methods here.
I have also made a PyPI module to scrape GitHub. See it here and give it a star if you like it.
If you have any doubts or need clarification, comment down below.
Stay tuned for part 2, where we will scrape the user's repo details.