Developers are always looking for ways to reuse their code, so I'm going to share with you the way I reuse my Selenium code.
First of all, install the library in your project:
pip install selenium
I really like to use Selenium because so much of the web relies on JavaScript, and most of the time you need to wait for the page to load completely before you start scraping.
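The script below also imports BeautifulSoup, tqdm, pandas and chromedriver-autoinstaller, so if you don't already have them you'll probably want to install those too:
pip install beautifulsoup4 tqdm pandas chromedriver-autoinstaller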
After that, run these commands on Linux in the same folder as your project (.py file):
wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
This downloads and extracts the Chrome WebDriver for Linux. If you want a more detailed walkthrough, this guide covers the setup step by step: https://trendoceans.com/how-to-install-and-setup-selenium-with-google-chrome-on-ubuntu/
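To check that the binary works, make it executable (if it isn't already) and ask it for its version from the same folder:
chmod +x chromedriver
./chromedriver --version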
Let's code!
Imports
# The lib that lets you choose and drive your browser
from selenium import webdriver
# Very useful for time.sleep() calls, to wait a few seconds on a page
import time
# Very nice for extracting information from HTML
from bs4 import BeautifulSoup
# I use it to show a progress bar in for loops
import tqdm
# Pandas: I use it to build DataFrames and export the scraped information to CSV
import pandas as pd
# You can use these to wait for a specific element to load
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# Needed to point Selenium at the local chromedriver binary on Linux
from selenium.webdriver.chrome.service import Service
# Manage drop-down components on web pages (quick sketch right after the imports)
from selenium.webdriver.support.ui import Select
# OS/system commands (paths and folders)
import os
# Downloads the matching webdriver into your project automatically (I only use it on Windows here)
import chromedriver_autoinstaller
# Used to check which operating system your bot is running on (Windows or Linux)
import platform
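Select is imported above but doesn't show up in the short example further down, so here is a quick sketch of how it works, assuming a hypothetical drop-down with id="country" on whatever page the driver is currently on:
# Hypothetical <select id="country"> element; adjust the locator to your page
dropdown = Select(driver.find_element(By.ID, 'country'))
dropdown.select_by_visible_text('Brazil')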
Implementation
# Check your operating system!
OP_SYSTEM = platform.system()
print(OP_SYSTEM)
if OP_SYSTEM.lower() == 'windows':
    chromedriver_autoinstaller.install()

# Create a folder to receive your downloads
folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'data')
os.makedirs(folder, exist_ok=True)

# Set Google Chrome options
options = webdriver.ChromeOptions()
# Define download settings
# Set a specific folder for files downloaded through Selenium (the default is your Downloads folder)
prefs = {
    "download.default_directory": folder,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True
}
options.add_experimental_option('prefs', prefs)
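These prefs tell Chrome where to save files, but not when a download has actually finished. A small helper like this can cover that (just a sketch: it assumes Chrome marks in-progress downloads with its usual .crdownload extension):
def wait_for_downloads(folder, timeout=60):
    # Poll the folder until no partially-downloaded .crdownload files remain
    for _ in range(timeout):
        if not any(name.endswith('.crdownload') for name in os.listdir(folder)):
            return True
        time.sleep(1)
    return False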
# This option hides the browser... to see the browser, comment out the line below
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--allow-running-insecure-content")
options.add_argument("--window-size=1920,1080")
options.add_argument("--disable-extensions")
options.add_argument("--proxy-server='direct://'")
options.add_argument("--proxy-bypass-list=*")
options.add_argument("--start-maximized")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
# Remove Selenium logs from the console (much cleaner!)
options.add_argument('--log-level=3')
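Since the browser runs headless, you never actually see the page. When something misbehaves, an easy trick (once the driver is created below) is to dump a screenshot of whatever the headless browser is rendering:
driver.save_screenshot('debug.png')  # writes a PNG of the current headless view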
# Choose the webdriver according to your system
# Windows or Linux
if OP_SYSTEM.lower() == 'windows':
    driver = webdriver.Chrome(options=options)
else:
    driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
driver.get("https://google.com")
# Your scraping code starts here; everything above is what I like to keep as the default in my scripts!
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys('What is Python?')
search_click = driver.find_element(By.NAME, 'btnK')
search_click.submit()
time.sleep(2)
titles = driver.find_elements(By.TAG_NAME, 'h3')
for title in titles:
    print(title.text)
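Instead of the fixed time.sleep(2), you could lean on the WebDriverWait/EC/By imports and wait explicitly for the results. And since BeautifulSoup, tqdm and pandas are imported but not used in this tiny example, here is a rough sketch of how they could fit in, parsing the rendered page and exporting the titles to a CSV in the data folder:
# Wait up to 10 seconds for at least one result heading to be present
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'h3')))

# Parse the rendered HTML with BeautifulSoup and collect the titles
soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = [h3.get_text(strip=True) for h3 in tqdm.tqdm(soup.find_all('h3'))]

# Export everything to a CSV in the data folder with pandas
pd.DataFrame({'title': titles}).to_csv(os.path.join(folder, 'titles.csv'), index=False)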
# Good practice: kill the process so you don't keep wasting resources
driver.close()
driver.quit()
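One last habit: if anything above raises an exception before these two lines, the headless Chrome process is left running in the background. A pattern worth considering (a generic sketch, not wired into this exact script) is to wrap the scraping part in try/finally so the cleanup always runs:
try:
    # ... scraping code goes here ...
    pass
finally:
    driver.close()
    driver.quit()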
Happy hacking!
Add me on LinkedIn!