"Hackers loves to use scraping to harvest data.~Ankit Dobhal"
Welcome to My Blog
Hello, my computer geek friend!! This is a blog about scraping Wikipedia content using Python & bs4 (a Python module). So what exactly is web scraping, & where does this term come from? Let's try to understand!!
Web Scraping:
Web scraping is the process of extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. The practice has been around since the World Wide Web was born. Most of the time, search engines like Google use a crawling process to build their search results.
Scraping With Python:
Web scraping & crawling can be done with the help of some softwares but in Nowadays Python is gaining its popularty in the field of web scraping & crawling ,& as we all know python is one of the most famous & powerful scripting languages generally for hackers & shell coders.Python have some amazing & powerful modules & libraries which makes this scraping process so easy & useful,Their are two important modules in python one is requests & another is BeautifulSoup.
Let's write a Python script to scrape Wikipedia content, a Wikipedia searcher:
I have a basic understanding of how to make GET requests to websites using Python, so first of all I open up my VS Code editor and create a file named wikipy.py. Then I import the sys library (for command line arguments), the requests library (for downloading the Wikipedia page with a GET request), & my favorite library, BeautifulSoup from bs4 (to extract content from the Wikipedia page).
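A minimal sketch of those imports might look like this:

```python
# wikipy.py - imports for the Wikipedia searcher
import sys                     # read the search term from the command line
import requests                # send the GET request to Wikipedia
from bs4 import BeautifulSoup  # extract content from the downloaded page
```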
Now it's time to use the GET method to request data from the Wikipedia server. But wait, I want to create a Wikipedia searcher which will scrape data according to my command line argument. So let's create a variable named res to store the result of a GET request to the Wikipedia search URL, with my command line argument appended to it.
Note: I use the raise_for_status() method so that if any error status code comes back, this method will raise an exception & the whole script will terminate.
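A sketch of this step, building on the imports above; the exact Wikipedia URL & the way the command line argument is joined are assumptions on my part:

```python
# Assumed URL & argument handling: join all command line arguments
# so multi-word searches also work.
search_term = ' '.join(sys.argv[1:])
res = requests.get('https://en.wikipedia.org/wiki/' + search_term)
res.raise_for_status()  # raise an exception & stop the script on an HTTP error
```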
res downloads the whole page, but it is complicated to extract data from it because it is in raw HTML format, so now it is time to use BeautifulSoup to extract the data. So I am creating a variable named wiki to hold the parsed page.
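In code, that step looks something like:

```python
# Parse the raw HTML text into a navigable BeautifulSoup tree
wiki = BeautifulSoup(res.text, 'html.parser')
```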
Note: As you can see, in the wiki variable I use the BeautifulSoup function with two parameters. So what exactly are they? Let's understand. res.text is the text of the page downloaded with the help of the res variable, & html.parser is a parser which helps me structure the data as HTML. I want to scrape the p tag content for my command line argument, because the whole text content of a Wikipedia page is inside p tags; you can check this with the developer tools of Chrome & Firefox.
Now I am using the .select() function to select the p tags & a for loop to loop through them, then finally printing the text elements inside each p tag with the .getText() function.
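A sketch of this final step, assuming every p element on the page gets printed:

```python
# Select all <p> elements & print the text of each one
for elem in wiki.select('p'):
    print(elem.getText())
```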
Yeah, we did it in just 10 lines of code, bravo!!!
It's time to run the script with a command line argument >>
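An example invocation might look like this (the search term here is just an illustration):

```
$ python3 wikipy.py Web_scraping
```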
Thank you all for visiting my blog! You can also check my gist for the wikipy script; the link is below!!
wikipy.py
Follow me on GitHub & LinkedIn for more exciting blogs and scripts!
This blog is quoted from my blog website; visit the original blog ->
https://ankitdobhal.github.io/posts/2019/10/Scraping%20Wikipedia%20With%20Python/
Top comments (8)
Instead of scraping Wikipedia and consuming the foundation's bandwidth & server capacity, why not take advantage of the offline mirrors available: en.wikipedia.org/wiki/Wikipedia:Da...
I was just about to suggest the same. +1
How about a way to download the app icon from Google Play? Say I have a file with a list of several package names that I want icon links for. Get a list of links into an Excel file for each package name so I can download the images.
Trying to automate so I don't have to manually search each package and right-click to save the app icon.
The million dollar question is: how can you save the page locally with the related images?
You can use file handling.
Most browsers can save a web page with all resources (images, CSS, JavaScript), and you can use an embedded browser or a tool like Selenium to automate it.
Very informative
How do I select only 2 or 3 "p" elements?