Sadhan Sarker

Posted on Jan 29, 2020 • Edited on Feb 1, 2020

Python bypass anti-bot page and scrape it

#python #scrape #antibot #beginners

🔰 Today we are going to look at an awesome Python module. Scraping is fun, as you will know if you have tried it before. Scraping and crawling are often mentioned together, but they differ a little. Web crawling is basically what Google, Facebook, etc. do: they look for any information they can find. Scraping, on the other hand, targets particular websites for specific data, e.g. product information and prices.

Check Whether the Development Environment Is Ready

Before moving forward, we need to check whether Python is available. To do so, open a terminal or command line and run the command below:

python --version
Output: Python 2.7.16

Or,

python3 --version
Output: Python 3.8.0

If you see output like the above, everything is fine. Your Python version may differ from mine, so don't worry about that. If you see "not found" instead, install Python before continuing.
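The same check can also be done from inside Python. This is a minimal sketch: the helper name python_is_recent_enough and the minimum version (3, 6) are assumptions chosen for illustration, not requirements stated in this post.

```python
import sys

# Assumed minimum version for this sketch; adjust it to whatever your project needs.
MINIMUM_VERSION = (3, 6)


def python_is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the interpreter that is actually running this script.
print("Running Python", sys.version.split()[0])
print("Recent enough:", python_is_recent_enough())
```

Running it with both python and python3 shows which interpreter each command points to.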

Set Up a Virtual Environment

We need to create a virtual environment to avoid version conflicts between Python modules, dependencies, or libraries. It isolates each project, so every project's dependency versions can be maintained easily.

Open a terminal or command line, then create the environment for the project:

📗 macOS Users:-

pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate

📗 Windows Users:-

pip install virtualenv
virtualenv venv
venv\Scripts\activate

A venv folder will be created. Congratulations, we have successfully created a virtual environment.
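To confirm that the environment is really active, one option is to compare the interpreter prefixes. This is a small sketch, and the function name in_virtual_environment is made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.base_prefix differ from sys.prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("Inside a virtual environment:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

After activation, the printed prefix should point inside the venv folder.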

Install Required Libraries or Modules

Open a terminal or command line, then run the commands below:

pip install beautifulsoup4
pip install cfscrape
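A quick way to verify that both installs worked is to ask Python whether it can locate the modules. This is only a sketch; the helper missing_modules is a hypothetical name. Note that the beautifulsoup4 package is imported as bs4.

```python
import importlib.util

# Import names of the two packages installed above.
REQUIRED_MODULES = ["bs4", "cfscrape"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```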

Learn the Basics of How Scraping Works

📗 Create an app.py file with the following content:

import cfscrape
from bs4 import BeautifulSoup

def basic():
    # string html code sample
    html_text = '''
        <div>
            <h1 class="product-name">Product Name 1</h1>
            <h1 custom-attr="price" class="product-price">100</h1>
            <p class="product description">This is basic description 1</p>
        </div>
        <div>
            <h1 class="product-name">Product Name 2</h1>
            <h1 custom-attr="price" class="product-price">200</h1>
            <p class="product description">This is basic description 2</p>
        </div>
    '''
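    parsed_html = BeautifulSoup(html_text, 'html.parser')                       # parse an HTML string
    # BeautifulSoup does not download pages: passing a URL would parse the URL text itself.
    # Fetch the HTML first (for example with scraper.get(url).text) and pass that string in.
    # with open("from_file.html") as html_file:                                 # parse an HTML file
    #     parsed_html = BeautifulSoup(html_file, 'html.parser')

    print(parsed_html.select(".product-name")[0].text)                          # first product name
    print(parsed_html.select(".product-name")[1].text)                          # second product name
    print(parsed_html.select(".product.description")[0].text)                   # element with both classes
    print(parsed_html.find_all("h1", {"custom-attr": "price"})[0].text)         # first item of the full match list
    print(parsed_html.find("h1", {"custom-attr": "price"}).text)                # find() returns the first match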

if __name__ == '__main__':
    basic()

Now open a terminal and run python app.py to execute the file.
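Indexing each match by hand gets tedious once there are more than a couple of products. The sketch below loops over every div and collects the fields into dictionaries instead. It reuses the sample HTML from above; the helper name extract_products is hypothetical, and it assumes every div contains all three elements.

```python
from bs4 import BeautifulSoup

html_text = '''
<div>
    <h1 class="product-name">Product Name 1</h1>
    <h1 custom-attr="price" class="product-price">100</h1>
    <p class="product description">This is basic description 1</p>
</div>
<div>
    <h1 class="product-name">Product Name 2</h1>
    <h1 custom-attr="price" class="product-price">200</h1>
    <p class="product description">This is basic description 2</p>
</div>
'''


def extract_products(html):
    """Return one dict per <div>, assuming each div holds name, price and description."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for block in soup.find_all("div"):
        products.append({
            "name": block.select_one(".product-name").get_text(strip=True),
            "price": int(block.select_one(".product-price").get_text(strip=True)),
            "description": block.select_one(".product.description").get_text(strip=True),
        })
    return products


products = extract_products(html_text)
for product in products:
    print(product)
```

On a real page some elements may be missing, in which case select_one returns None and the lookup needs a guard.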

Learn Anti-Bot Scraping

📗 Update the app.py file so that it includes:


import cfscrape
from bs4 import BeautifulSoup

def anti_bot_scraping():
    target_url = "https://www.google.com"   # replace with the URL of an anti-bot protected website
    scraper = cfscrape.create_scraper()     # behaves like a requests session
    html_text = scraper.get(target_url).text
    parsed_html = BeautifulSoup(html_text, 'html.parser')
    print(parsed_html)

if __name__ == '__main__':
    anti_bot_scraping()

Now open a terminal and run python app.py to execute the file.
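Requests to protected sites can fail or time out, so it may help to wrap the call in a small retry loop. The following is a sketch under assumptions: fetch_html and its parameters are hypothetical names, and it only relies on the scraper offering a requests-style get() whose response has raise_for_status() and text.

```python
import time


def fetch_html(scraper, url, retries=3, delay_seconds=2.0, timeout=15):
    """Fetch `url` with a requests-style scraper, retrying a few times on failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = scraper.get(url, timeout=timeout)
            response.raise_for_status()  # treat HTTP error statuses as failures
            return response.text
        except Exception as error:
            # Broad catch keeps this sketch dependency-free; in real code,
            # narrow it to the request exceptions you expect.
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)
    raise RuntimeError("Giving up on {} after {} attempts".format(url, retries)) from last_error


# Hypothetical usage together with the scraper from app.py:
# import cfscrape
# html_text = fetch_html(cfscrape.create_scraper(), "https://www.google.com")
```

Passing the scraper in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real site.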

Note: Please don't misuse this knowledge. I'm sharing it only for learning and for fun.

Enjoy coding!

👌 Congratulations, and thank you!
Feel free to comment if you have any issues or questions.

Top comments (1)

yashc1998 • May 4 '20

Hey, can you provide me any article or tutorial thoroughly explaining the process and the methods used to bypass these websites?
