WHAT IS WEB SCRAPING?
In a nutshell, web scraping is the automated extraction of publicly displayed data from the web, typically data that cannot be reached or extracted through an API.
Most of the time, web scraping is used for price or news monitoring, data gathering, automated research, automated web or platform engagement, and similar tasks.
Web scraping has become very popular and important in recent times due to its relevance in the current business world. However, this tutorial is about web scraping with Python, so without further ado, we’ll dive into what web scraping with Python looks like and the libraries needed to code a simple web scraper.
SCRAPING WITH SELENIUM
Python is widely known to be useful for many things in tech, but web scraping is one of the major domains where Python thrives.
WHAT IS SELENIUM?
Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
It provides extensions to emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification that lets you write interchangeable code for all major web browsers.
Meanwhile, Selenium is not the only module used for web scraping with Python; other major modules are just as popular. Each has its pros and cons, so you need to know which one fits the occasion. We’ll briefly compare these modules below.
The three most popular Python modules used for web scraping are as follows:
SCRAPY:
Scrapy is efficient and portable. Its major con is that it is not user friendly, especially for beginners.
BEAUTIFUL SOUP:
Beautiful Soup is easy to learn and understand. It has some cons too: it requires external dependencies and is less efficient than Scrapy.
SELENIUM:
Selenium is versatile and handles JavaScript-heavy pages well. However, Selenium is also not as efficient as Scrapy.
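To put the dependency point in perspective: Python’s standard library already ships a basic HTML parser, so for simple static pages you can extract links without installing anything. A minimal sketch (the sample HTML string here is made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

html = '<p><a href="/wiki/Python">Python</a> and <a href="/wiki/Selenium">Selenium</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/wiki/Python', '/wiki/Selenium']
```

Of course, this only works on static HTML; pages that render content with JavaScript are exactly where Selenium earns its keep.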
In this post we’ll use Selenium as our module for web scraping with Python; perhaps in my next web scraping post we’ll adopt one of the other modules mentioned above.
TALK IS CHEAP, LET THE CODING BEGIN…
We are about to code a web scraper that will go to Wikipedia, enter a query in the search box, and get the results and, optionally, the links too. Make no mistake, there is a module specifically for Wikipedia searches, called wikipedia.
However, the aim here is to show how one can access a public website, fill a form, submit it, explore the site’s contents and more. We’ll keep things simple in this particular post.
MY ASSUMPTIONS:
You have basic experience with HTML and CSS.
You have at least beginner-level Python coding experience; for instance, you are familiar with loops, functions, importing modules and the like.
Meanwhile, if you have not used Selenium before, please do yourself a favor and check out the module’s basic documentation here before you continue with this tutorial.
Firstly, we’ll import the necessary modules. But before importing them, you’ll need to download your web browser’s driver. I personally prefer Google’s ChromeDriver.
Be sure to download the same version as your browser. To check your browser version, click the three dots in Chrome, click “Help”, then “About Google Chrome”, and you’ll see the version you are using.
Once the download is done, extract the file, keep it somewhere close to your code folder, and note the path to the driver.
Now let’s import the necessary modules.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
from pprint import pprint
Now it’s time to create our main function for this code:
def get_wiki():
    # get the preferred keyword
    keyword = input('Enter a keyword to search:\n')
    link_dec = input('Do you need links? Kindly enter yes or no:\n').lower()

    # create the driver service and the driver instance
    d_path = Service('/home/you/Desktop/my_scraper/web_driver/chromedriver')
    driver = webdriver.Chrome(service=d_path)
Let’s explain the code above:
We created a function called get_wiki. The keyword variable gets a search term from the user, and the link_dec variable stores the user’s decision about whether links are needed.
Then we created the driver service and the driver instance. Now we’ll continue with more code inside the main function:
    # get the page and enter a keyword to search
    driver.get('https://en.wikipedia.org/wiki/Main_Page')
    search_box = driver.find_element(By.NAME, 'search')
    search_box.send_keys(keyword)
    search_box.send_keys(Keys.ENTER)
    time.sleep(3)

    # get the main content
    main_data = driver.find_element(By.ID, 'content')
    pprint(main_data.text)
I presume you have checked out the Selenium documentation as I advised earlier, and with your prior knowledge of HTML and CSS, you already know how to find the needed selectors and elements on the Wikipedia page.
You can open the website in a new window and explore the elements and selectors with Chrome’s developer tools while simulating the search. This lets you check what is working and what is not, in case you run into bugs.
So with the code above we load the page, enter the keyword in the search box, and press Enter. We wait for 3 seconds, get the results, and print them out using pprint.
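A fixed time.sleep(3) works, but it wastes time when the page loads faster and can fail when it loads slower. Selenium provides WebDriverWait for this; the underlying idea is simply to poll a condition until it succeeds or a deadline passes. Here is that polling idea in plain Python, with a hypothetical helper (this is not Selenium’s API, just a sketch of the concept):

```python
import time

def wait_for(condition, timeout=3.0, interval=0.1):
    """Poll a zero-argument condition until it returns a truthy
    value or the timeout expires. Returns the value, or None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    return None

# This condition is truthy immediately, so no waiting occurs.
print(wait_for(lambda: 'loaded'))  # loaded
```

In real Selenium code you would pass WebDriverWait an expected condition (for example, that an element is present) instead of an arbitrary lambda.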
Now let’s create an inner function inside the main function that will get the available links if required by the user:
    def show_links():
        """Get the links available in the contents"""
        links = driver.find_elements(By.TAG_NAME, 'a')
        for link in links:
            print(link.get_attribute('href'))
As you can see, the function above is self-explanatory. Next, we’ll call it if link_dec was “yes”, quit the driver, and finally call the main function:
    if link_dec == 'yes':
        show_links()

    driver.quit()

get_wiki()
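One refinement worth knowing about: find_elements can return anchors without an href, in which case get_attribute yields None, and the same link often appears many times on a page. A small hypothetical helper to drop empty values and duplicates while keeping order:

```python
def clean_links(hrefs):
    """Drop empty values and duplicates, preserving order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned

print(clean_links(['/wiki/Python', None, '/wiki/Python', '/wiki/Selenium']))
# ['/wiki/Python', '/wiki/Selenium']
```

Inside show_links you could collect the href values into a list and pass them through a helper like this before printing.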
Now let’s see all the codes in one place:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
from pprint import pprint

def get_wiki():
    # get the preferred keyword
    keyword = input('Enter a keyword to search:\n')
    link_dec = input('Do you need links? Kindly enter yes or no:\n').lower()

    # create the driver service and the driver instance
    d_path = Service('/home/you/Desktop/my_scraper/web_driver/chromedriver')
    driver = webdriver.Chrome(service=d_path)

    # get the page and enter a keyword to search
    driver.get('https://en.wikipedia.org/wiki/Main_Page')
    search_box = driver.find_element(By.NAME, 'search')
    search_box.send_keys(keyword)
    search_box.send_keys(Keys.ENTER)
    time.sleep(3)

    # get the main content
    main_data = driver.find_element(By.ID, 'content')
    pprint(main_data.text)

    def show_links():
        """Get the links available in the contents"""
        links = driver.find_elements(By.TAG_NAME, 'a')
        for link in links:
            print(link.get_attribute('href'))

    if link_dec == 'yes':
        show_links()

    driver.quit()

get_wiki()
CONCLUSION
From here you can do other things with your search results, like sending them to an email address, converting them to a PDF file and more.
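For instance, saving the scraped text to a file is a few lines of standard-library code; emailing or converting to PDF would build on the same idea. A sketch with a hypothetical save_results helper (the folder name and keyword are illustrative):

```python
from pathlib import Path

def save_results(keyword, text, folder='results'):
    """Write scraped text to <folder>/<keyword>.txt and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"{keyword}.txt"
    out_file.write_text(text, encoding='utf-8')
    return out_file

path = save_results('python', 'Python is a programming language...')
print(path)  # e.g. results/python.txt
```

In the scraper above, you could call a helper like this with keyword and main_data.text right after the pprint call.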
In my next web scraping with Python post, we’ll focus on other cool stuff like getting prices and updates on news, trading and more. We’ll also learn about Beautiful Soup, regex, sending emails with Python and more.
You can edit this code and use it on different sites or search engines like Google. Now that you have the basic knowledge, you can explore Selenium even more and create better scrapers than the one I built here.
Becoming better at anything requires curiosity, so get curious and explore the knowledge available on the internet about web scraping; you might want to check the popular programming communities for extra knowledge on the topic.
To automatically get notified when my next post on web scraping with Python (and subsequent ones) gets published, hit the follow button.
Get an affordable and seamless one-on-one Python training today from anywhere in the world. Location is never a barrier, and we have friendly learning tools to make your Python programming training a worthwhile experience.