DEV Community

cuongld2

Introduction to web scraping and a real-world task

Recently, I was assigned a task to check whether the news listed on a page is too old.
That meant going through the source of each news page to get its date information.

I had heard about web scraping before, so I applied the technique to this task.
Here is what I learned.

I. What is web scraping

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale.

Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

More than a modern convenience, the true power of web scraping lies in its ability to build and power some of the world’s most revolutionary business applications. ‘Transformative’ doesn’t even begin to describe the way some companies use web scraped data to enhance their operations, informing executive decisions all the way down to individual customer service experiences.

II. What we'll need
We will do the web scraping in Python, so we need to know about the libraries below.

  • beautifulsoup4
    See the Beautiful Soup documentation for details.
    In short, Beautiful Soup parses HTML or XML into a tree we can search, so that we can pull out exactly the data we want.

  • requests
    Requests lets us send HTTP requests and read the responses from the web.
    See the Requests documentation for details.

III. Real-world task
As I mentioned earlier, the task is to check whether the news on the page is too old.

Basically, what we need to do is:

Open the page where all the news sites are listed
Get all the links to the news sites
Get the published date from each news page
Assert that the published date of each news item is less than 1 month before the current date

Here is the link to all the news we need to check: coccoc-newtab
--> Please open it in the CocCoc browser or Google Chrome. It might not work in other browsers like Firefox.

Below are the steps in detail.
1. Scroll the page

Selenium natively supports moving to an element, but that's not what we're looking for.

To scroll the page, we need to run some JavaScript.

Below is sample code showing how to do that.


def scroll_to_with_scroll_height(self, driver):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
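If the page loads more news as you scroll (an assumption about the new tab page, not something confirmed above), a single scroll may not reach the end of the list. Here is a minimal sketch that keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; it only relies on the driver's `execute_script` method.

```python
import time


def scroll_until_stable(driver, max_rounds=10, pause_seconds=1.0):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script('return document.body.scrollHeight;')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        # Give lazily loaded content a moment to render before measuring again
        time.sleep(pause_seconds)
        new_height = driver.execute_script('return document.body.scrollHeight;')
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```

The `max_rounds` cap is there so that an endless feed cannot keep the test running forever.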

2. Get all the URLs of the news sites

The zen news feed contains both news from sites and ads.

For this task, we only need the news from normal sites (not the ads).

We tell them apart using a CSS selector.

Below is the CSS selector for that:


from selenium.webdriver.common.by import By


class NewTabZenLocators:
    ZEN_NEWS_NOT_CONTAINS_ADS_ITEM_CSS_SELECTOR = ('div[class] > a:not(.qc-link)[href]:not(.context-content)'
                                                   ':not(.zen-ads__context):not([href*="utm"])'
                                                   ':not([data-click-url*="click"])')
    ZEN_NEWS_NOT_CONTAINS_ADS_ITEM = (By.CSS_SELECTOR, ZEN_NEWS_NOT_CONTAINS_ADS_ITEM_CSS_SELECTOR)

Method to find all the zen news elements:


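def find_all_current_zen_except_ads_elements(self, driver):
    # Wait until at least one non-ad news link is present before collecting them all
    self.wait_for_element(driver).until(
        ec.presence_of_element_located(NewTabZenLocators.ZEN_NEWS_NOT_CONTAINS_ADS_ITEM))
    # find_elements(by, value) replaces find_elements_by_css_selector, which recent Selenium 4 releases removed
    return driver.find_elements(*NewTabZenLocators.ZEN_NEWS_NOT_CONTAINS_ADS_ITEM)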


Method to get an attribute of the zen news elements:


def get_attribute_all_zen_except_ads_elements(self, driver, attribute_name):
    attribute_value = []
    for element in self.new_tab_zen_elem.find_all_current_zen_except_ads_elements(driver):
        attribute_value.append(element.get_attribute(attribute_name))
    return attribute_value

Remember that since we need the URL, the attribute_name is 'href'.
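Selenium's `get_attribute` returns `None` when an element lacks the attribute, and the same link can appear more than once in a feed. Before requesting each URL, it may help to tidy the list first. The helper below is a small sketch of that idea; the name `clean_urls` is hypothetical and not part of the original project.

```python
def clean_urls(hrefs):
    """Drop empty or non-HTTP values and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # Skip None, empty strings, and things like 'javascript:void(0)'
        if not href or not href.startswith(('http://', 'https://')):
            continue
        if href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned
```

It would be called on the list of 'href' values collected from the news elements.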
3. Check the response

We need to call the get method of requests to fetch each page, then assert on the response.

Here is how to implement it:

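import requests
from delayed_assert import expect

for url in url_list:
    response = None
    try:
        response = requests.get(url)
    # requests raises its own ConnectionError, which is not the built-in one
    except requests.exceptions.ConnectionError as e:
        print(e)
    expect(response is not None, f'Assert response not None for site {url}')
    # Only check the status code when a response actually came back
    if response is not None:
        expect(response.status_code == 200, f'Assert response status code for site {url}')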
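A slow site can hang the whole run if no timeout is set. The sketch below is one possible variant that passes a timeout and collects a problem description per URL instead of printing. The name `check_urls` and the `http_get` parameter are hypothetical; `http_get` is assumed to be `requests.get` or anything with the same call shape. It catches `OSError` because the exceptions raised by requests derive from `IOError`.

```python
def check_urls(urls, http_get, timeout=10):
    """Return {url: problem description} for every URL that fails the check."""
    problems = {}
    for url in urls:
        try:
            response = http_get(url, timeout=timeout)
        except OSError as error:
            # Covers connection failures and timeouts raised by requests
            problems[url] = f'request failed: {error}'
            continue
        if response.status_code != 200:
            problems[url] = f'unexpected status {response.status_code}'
    return problems
```

With requests installed, a call would look like `check_urls(url_list, requests.get)`, and an empty result means every site answered with status 200.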

4. Get the date

Here is how to get the published date as a string, using Beautiful Soup with html.parser:

import json
from json import JSONDecodeError

from bs4 import BeautifulSoup, SoupStrainer


def get_published_time_of_web_page(self, response_text):
    published_time = None
    # Template 1: <meta property="article:published_time" content="..."> inside <head>
    soup_instance = BeautifulSoup(response_text, features='html.parser', parse_only=SoupStrainer('head'))
    for item in soup_instance.find_all(name='meta'):
        if item.get('property') == 'article:published_time':
            published_time = item.get('content')
    if published_time is None:
        # Template 2: a JSON-LD <script type="application/ld+json"> block with a datePublished key
        json_ld_only = SoupStrainer('script', attrs={'type': 'application/ld+json'})
        soup_instance = BeautifulSoup(response_text, features='html.parser', parse_only=json_ld_only)
        for each_json in soup_instance.find_all('script'):
            if 'datePublished' in each_json.text.strip():
                try:
                    json_parse = json.loads(each_json.text.strip(), strict=False)
                    published_time = json_parse['datePublished']
                except (JSONDecodeError, KeyError, TypeError) as e:
                    # Malformed JSON, or datePublished is not a top-level key
                    print(e)
    return published_time

The current implementation covers 2 popular HTML templates used by news pages: the article:published_time meta tag and a JSON-LD script block.
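Some sites nest the article object inside a list or an "@graph" array in their JSON-LD, in which case reading `datePublished` from the top level finds nothing. As a rough sketch of how that case could be handled, the hypothetical helper below searches the parsed JSON recursively. The sample payload is made up for illustration.

```python
import json


def find_date_published(node):
    """Return the first 'datePublished' value found anywhere in parsed JSON-LD."""
    if isinstance(node, dict):
        if 'datePublished' in node:
            return node['datePublished']
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_date_published(child)
        if found is not None:
            return found
    return None


# Hypothetical JSON-LD payload where the article sits inside "@graph"
sample = json.loads('{"@graph": [{"@type": "WebSite"}, '
                    '{"@type": "NewsArticle", "datePublished": "2020-03-01T10:00:00+07:00"}]}')
print(find_date_published(sample))
```

The result of `json.loads` on each script block could be passed to this helper in place of the direct key lookup.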

Then we need to parse the date string into a date object using the dateutil library:

import dateutil.parser
import datetime


def parse_string_to_date(string_datetime):
    your_date = dateutil.parser.parse(string_datetime)
    return your_date.date()


def how_many_days_til_now(string_datetime):
    number_of_days = datetime.date.today() - parse_string_to_date(string_datetime)
    return number_of_days.days
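To make the "less than 1 month" rule easy to check by hand, here is a small worked sketch with a fixed "today". It swaps in the standard library's `datetime.fromisoformat` for dateutil, so it only accepts ISO 8601 strings, which is stricter than `dateutil.parser.parse`. The names `days_since` and `is_recent` and the 30-day default are illustrative assumptions.

```python
from datetime import date, datetime


def days_since(published_iso, today=None):
    """Number of calendar days between the published date and today."""
    published = datetime.fromisoformat(published_iso).date()
    return ((today or date.today()) - published).days


def is_recent(published_iso, today=None, max_age_days=30):
    """True when the page was published at most max_age_days ago."""
    return days_since(published_iso, today) <= max_age_days


# A page published on 2020-03-01, checked on two different days
print(days_since('2020-03-01T10:00:00+07:00', today=date(2020, 3, 20)))  # 19
print(is_recent('2020-03-01T10:00:00+07:00', today=date(2020, 4, 15)))   # False (45 days)
```

Passing `today` explicitly keeps the check repeatable, since a test that depends on the real current date gives different results over time.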

5. Soft assert

Sometimes we need soft assertions so that the test doesn't stop right after the first failure, because we still need to collect the failed results for the other sites.

There is the delayed_assert library for this in Python.

Below is how to use it.


from delayed_assert import expect, assert_expectations

expect(response is not None, f'Assert response not None for site {url}')
expect(response.status_code == 200, f'Assert response status code for site {url}')
expect(how_many_days_til_now(published_time) <= 30, f'Verify date of page {url}')
# else:
#     print(f'Url of the site which cannot get published date is : {url}')
# Raise once at the end, reporting every failed expectation collected above
assert_expectations()

Remember: if we do not call assert_expectations() at the end of the test, the test will always pass.
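To see why the final call matters, here is a simplified sketch of the soft-assert idea in plain Python. This is not how delayed_assert is implemented; the class `SoftAssertions` is hypothetical and only illustrates the pattern of recording failures and raising once at the end.

```python
class SoftAssertions:
    """Collects failed expectations and reports them all at once."""

    def __init__(self):
        self.failures = []

    def expect(self, condition, message):
        # A failed expectation is only recorded; nothing is raised here
        if not condition:
            self.failures.append(message)

    def assert_expectations(self):
        # Raise a single error listing every recorded failure, then reset
        failures, self.failures = self.failures, []
        if failures:
            raise AssertionError('Failed expectations:\n' + '\n'.join(failures))
```

Since `expect` never raises, a test that skips the final `assert_expectations()` call ends without any error, whatever was recorded.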

That's it.

Stay tuned for more

Note: If you feel this blog helped you and want to show your appreciation, feel free to drop by:

This will help me contribute more valuable content.
