Scraping Dynamic Web Pages Using Selenium And C#

Today’s websites are a lot different from those of yesteryear: content on a majority of websites is dynamic in nature. The content on dynamic pages varies from one user request to another based on the website visitor’s actions. Selenium, the popular test automation framework, is best known for testing dynamic web pages, but it can also be used extensively for scraping them.

Though there are many tools for scraping static web pages, Selenium is one of the preferred tools for scraping large volumes of data (e.g., images, links, text, etc.) in a relatively short amount of time. We have chosen C#, a popular backend programming language, for demonstrating dynamic web page scraping. As per the Stack Overflow Developer Survey 2020, C# holds the sixth position in the preferred programming languages category.

By the end of this Selenium C# tutorial, you will be in a comfortable position to scrape dynamic web pages and extract the meaningful information (from the page) that you intend to save for future use.

What is Web Scraping?

Web Scraping is a common technique primarily used for extracting information (or data) from websites. The HTML of the page from which relevant data has to be scraped is processed using the appropriate tools and stored in a database, an Excel sheet, etc., so that the data can be used for further analysis.

With larger amounts of data, scraping could add a significant load on the server that hosts the website. As long as the scraping activity does not disrupt the website’s services, it is perfectly fine to scrape the said website.

Prominent Use Cases of Web Scraping

Why scrape websites when doing so might add load on the server hosting them? Well, the answer lies in the umpteen scenarios where web scraping can be extremely useful. Web scraping can help unleash information related to customers, products, etc., which can be further used to make future decisions.

Owners of e-commerce websites can analyze data from different sources like their website, social media accounts, review websites, etc., to understand users’ buying patterns and improve their services. Product review scraping is a prominent use case that online businesses leverage for keeping a close watch on their competition.

Example – Dynamic web page scraping of the LambdaTest blog can give detailed insights into article views, author performance, and more. The data can be used for better content planning and getting the best out of the rockstar writers who contribute to our blog ☺.

Difference between Static & Dynamic Web Scraping

In static web pages, all the data on the page is available at the initial call to the site. You might not even need to maintain a connection to the server since all the information is now available locally. Hence, the HTML document can be downloaded, and data can be scraped using tools that let you scrape data from static pages.

On the other hand, dynamic content means that the data is generated from a request after the initial page load request. On dynamic pages, most of the functionality happens in response to the actions performed by the user and the JavaScript code that is executed in the web browser.

A plain HTTP agent is not suited for websites (or web applications) with a high level of dynamic interaction and interface automation. The Selenium C# library, which is widely used for automated testing of web applications, helps emulate human actions and renders dynamic JavaScript code.
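To make the distinction concrete, here is a minimal sketch (not part of the original walkthrough) contrasting the two approaches. It assumes the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver packages that are installed later in this tutorial:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using OpenQA.Selenium.Chrome;

class StaticVsDynamic
{
    static async Task Main()
    {
        /* Static approach: a plain HTTP request returns only the initial HTML.
           Content injected later by JavaScript is missing from this string. */
        using (var http = new HttpClient())
        {
            string rawHtml = await http.GetStringAsync("https://www.lambdatest.com/blog/");
            Console.WriteLine("Raw HTML length: " + rawHtml.Length);
        }

        /* Dynamic approach: Selenium drives a real browser, so the JavaScript
           runs and the rendered DOM (PageSource) includes the dynamic content. */
        using (var driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://www.lambdatest.com/blog/");
            Console.WriteLine("Rendered DOM length: " + driver.PageSource.Length);
        }
    }
}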

Scraping Dynamic Web Pages with Selenium C#

Due to Selenium’s capability in handling dynamic content generated using JavaScript, it is the preferred option for scraping dynamic web pages. Selenium is a popular automated testing framework used to validate applications across different browsers and operating systems.

Prerequisites for demonstrating web scraping with Selenium C#

We use Visual Studio for the implementation of test scenarios in C#. Here are the basic setup requirements for performing Selenium web scraping in C#:

  • Visual Studio IDE – For implementation, we use the Visual Studio (Community Edition), which can be downloaded from VS Community Download Page.
  • Browser Driver – It automates browser interaction from Selenium C# code. You have to download the browser driver for Selenium in accordance with the browser on which Selenium web scraping is performed.

Downloading and installing browser drivers is not required when dynamic web page scraping is done using a cloud-based Selenium Grid like LambdaTest. You can refer to our detailed Selenium WebDriver tutorial for a quick recap on Selenium WebDriver.

  • C# Packages – We showcase Selenium web scraping using the Selenium WebDriver and NUnit framework. The NUnit project requires reference to the following libraries (or packages):
  1. Selenium.WebDriver
  2. NUnit
  3. NUnit3TestAdapter
  4. Microsoft.NET.Test.Sdk

These are the standard set of packages that are used for automated browser testing with NUnit and Selenium. No additional packages are required for scraping dynamic web pages with C# and Selenium. Check out our tutorial on NUnit test automation with Selenium C# for a quick recap on NUnit for automation testing.

Setting up the Selenium C# Project

  1. We create a new project of type ‘NUnit Test Project (.NET Core)’ in Visual Studio.
  2. Name the project ‘WebScraping’ and press the ‘Create’ button.
  3. Once you have created the project, install the packages mentioned above using the Package Manager (PM) console, which can be accessed through ‘Tools’ -> ‘NuGet Package Manager’ -> ‘Package Manager Console.’
  4. For installing the packages, run the following commands in the PM console:
Install-Package Selenium.WebDriver
Install-Package NUnit
Install-Package NUnit3TestAdapter
Install-Package Microsoft.NET.Test.Sdk
Install-Package Selenium.WebDriver.ChromeDriver
  5. Run the Get-Package command on the PM console to confirm whether the above packages are installed successfully:
PM> get-package

Id                                            Versions                                       
--                                            --------                                         
Selenium.WebDriver                            {3.141.0}                                  
NUnit                                         {3.12.0}
NUnit3TestAdapter                             {3.16.1}
Microsoft.NET.Test.Sdk                        {16.5.0}
Selenium.WebDriver.ChromeDriver               {88.0.4324}

Now that the necessary packages are installed for the Selenium C# NUnit project, we proceed with the addition of NUnit test scenarios that demonstrate scraping dynamic web pages with Selenium.

Let’s begin writing the scraper. We will demonstrate scraping on two websites: the LambdaTest YouTube channel and the LambdaTest Blog.

In this first demonstration, we scrape the following data from the LambdaTest YouTube channel:

  • Video Title
  • Number of video views
  • Time of upload

At the time of writing this article, the LambdaTest YouTube channel had 79 videos, and we will scrape the requisite information from all the videos on the channel. Here is the Selenium web scraping test scenario that will be executed on Chrome (on Windows 10). The test is run on a cloud-based Selenium Grid provided by LambdaTest.

  1. Go to https://www.youtube.com/c/LambdaTest/videos.
  2. Scroll to the end of the page so that all the videos are loaded on the page.
  3. Scrape the video title, views, and upload details.
  4. Print the scraped information on the terminal.

Implementation

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Support.UI;
using System;
using System.Collections.ObjectModel;
using System.Collections.Specialized;
using System.IO;
using System.Reflection;
using System.Threading;
using OpenQA.Selenium.Safari;
using System.Collections.Generic;
using System.Web;

namespace WebScraping
{
    public class ScrapingTest
    {
        String test_url_1 = "https://www.youtube.com/c/LambdaTest/videos";
        static Int32 vcount = 1;
        public IWebDriver driver;

        /* LambdaTest Credentials and Grid URL */
        String username = "user-name";
        String accesskey = "access-key";
        String gridURL = "@hub.lambdatest.com/wd/hub";

        [SetUp]
        public void start_Browser()
        {
            /* Local Selenium WebDriver */
            /* driver = new ChromeDriver(); */
            DesiredCapabilities capabilities = new DesiredCapabilities();

            capabilities.SetCapability("user", username);
            capabilities.SetCapability("accessKey", accesskey);
            capabilities.SetCapability("build", "[C#] Demo of Web Scraping in Selenium");
            capabilities.SetCapability("name", "[C#] Demo of Web Scraping in Selenium");
            capabilities.SetCapability("platform", "Windows 10");
            capabilities.SetCapability("browserName", "Chrome");
            capabilities.SetCapability("version", "latest");

            driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities,
                TimeSpan.FromSeconds(600));
            driver.Manage().Window.Maximize();
        }

        [Test(Description = "Web Scraping LambdaTest YouTube Channel"), Order(1)]
        public void YouTubeScraping()
        {
            driver.Url = test_url_1;
            /* Explicit Wait to ensure that the page is loaded completely by reading the DOM state */
            var timeout = 10000; /* Maximum wait time of 10 seconds */
            var wait = new WebDriverWait(driver, TimeSpan.FromMilliseconds(timeout));
            wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

            Thread.Sleep(5000);

            /* Once the page has loaded, scroll to the end of the page to load all the videos */
            /* Scroll to the end of the page to load all the videos in the channel */
            /* Reference - https://stackoverflow.com/a/51702698/126105 */
            /* Get scroll height */
            Int64 last_height = (Int64)(((IJavaScriptExecutor)driver).ExecuteScript("return document.documentElement.scrollHeight"));
            while (true)
            {
                ((IJavaScriptExecutor)driver).ExecuteScript("window.scrollTo(0, document.documentElement.scrollHeight);");
                /* Wait to load page */
                Thread.Sleep(2000);
                /* Calculate new scroll height and compare with last scroll height */
                Int64 new_height = (Int64)((IJavaScriptExecutor)driver).ExecuteScript("return document.documentElement.scrollHeight");
                if (new_height == last_height)
                {
                    /* If the heights are the same, we have reached the end of the page */
                    break;
                }
                last_height = new_height;
            }

            By elem_video_link = By.CssSelector("ytd-grid-video-renderer.style-scope.ytd-grid-renderer");
            ReadOnlyCollection<IWebElement> videos = driver.FindElements(elem_video_link);
            Console.WriteLine("Total number of videos in " + test_url_1 + " are " + videos.Count);

            /* Go through the videos list and scrape each video to get its attributes */
            foreach (IWebElement video in videos)
            {
                string str_title, str_views, str_rel;
                IWebElement elem_video_title = video.FindElement(By.CssSelector("#video-title"));
                str_title = elem_video_title.Text;

                IWebElement elem_video_views = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[1]"));
                str_views = elem_video_views.Text;

                IWebElement elem_video_reldate = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[2]"));
                str_rel = elem_video_reldate.Text;

                Console.WriteLine("******* Video " + vcount + " *******");
                Console.WriteLine("Video Title: " + str_title);
                Console.WriteLine("Video Views: " + str_views);
                Console.WriteLine("Video Release Date: " + str_rel);
                Console.WriteLine("\n");
                vcount++;
            }
            Console.WriteLine("Scraping Data from LambdaTest YouTube channel Passed");
        }

        [TearDown]
        public void close_Browser()
        {
            driver.Quit();
        }
    }
}

Code Walkthrough

Now let’s decipher the code where we scraped vital information from the LambdaTest YouTube Channel.

Step 1 – Import the packages (or namespaces).

First, we import the namespaces or packages for Selenium Remote WebDriver, NUnit framework, and more.

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Support.UI;
......................................
......................................

Step 2 – Set the desired browser capabilities.

In the method marked with the [SetUp] attribute, we set the desired browser capabilities, which are created using the LambdaTest capabilities generator.

The test is run on a Selenium 3 Grid; hence, we have used DesiredCapabilities in the implementation. (A Selenium 4 sketch follows the snippet below.)

String username = "user-name";
String accesskey = "access-key";
String gridURL = "@hub.lambdatest.com/wd/hub";

[SetUp]
public void start_Browser()
{
    DesiredCapabilities capabilities = new DesiredCapabilities();

    capabilities.SetCapability("user", username);
    capabilities.SetCapability("accessKey", accesskey);
    capabilities.SetCapability("build", "[C#] Demo of Web Scraping in Selenium");
    capabilities.SetCapability("name", "[C#] Demo of Web Scraping in Selenium");
    capabilities.SetCapability("platform", "Windows 10");
    capabilities.SetCapability("browserName", "Chrome");
    capabilities.SetCapability("version", "latest");
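On a Selenium 4 Grid, DesiredCapabilities is deprecated in favor of browser-specific options classes. Purely as a hedged sketch (the ‘LT:Options’ vendor prefix follows LambdaTest’s W3C capability format, but verify it against their capabilities generator), the equivalent setup might look like this:

/* Hedged Selenium 4 sketch: ChromeOptions replaces DesiredCapabilities.
   The "LT:Options" key is an assumption to verify against the LambdaTest
   capabilities generator. */
var options = new ChromeOptions();
options.BrowserVersion = "latest";
options.AddAdditionalOption("LT:Options", new Dictionary<string, object>
{
    { "user", username },
    { "accessKey", accesskey },
    { "platformName", "Windows 10" },
    { "build", "[C#] Demo of Web Scraping in Selenium" },
    { "name", "[C#] Demo of Web Scraping in Selenium" }
});

driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL),
    options.ToCapabilities(), TimeSpan.FromSeconds(600));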

Step 3 – Create an instance of Selenium RemoteWebDriver.

An instance of RemoteWebDriver is created using the browser capabilities (generated in the previous step) and the access credentials of the LambdaTest platform. You can get the access details (i.e., user-name & access-key) from the LambdaTest Profile Page.

The LambdaTest Grid URL [i.e., @hub.lambdatest.com/wd/hub] is also passed as an argument to the RemoteWebDriver constructor.

driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities, TimeSpan.FromSeconds(600));

Step 4 – Navigate to the LambdaTest YouTube URL.

Using the WebDriver's Url property, we navigate to the URL under test.

driver.Url = test_url_1;

Step 5 – Wait for the page load to complete.

We want to start the test only when the loading of the web page is complete. Using Explicit Wait in Selenium, a WebDriverWait of 10 seconds is initiated.

var timeout = 10000;
var wait = new WebDriverWait(driver, TimeSpan.FromMilliseconds(timeout));

The document.readyState property describes the loading state of the document. It equates to ‘complete’ when the current HTML document (or page) and its resources have finished loading. An explicit wait is performed on document.readyState until its value equates to ‘complete.’ The ExecuteScript method of the IJavaScriptExecutor interface is used for executing JavaScript in the context of the current page.

wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

If the document is not loaded within the maximum wait duration (i.e., 10 seconds), a timeout exception is raised, and the remaining part of the test is not executed.
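If you want the timeout to be reported with a clearer message, one option (a small sketch, not part of the original test) is to catch the exception explicitly:

/* Sketch: surface the timeout as a descriptive NUnit failure */
try
{
    wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));
}
catch (WebDriverTimeoutException)
{
    Assert.Fail("Page did not reach readyState == 'complete' within 10 seconds");
}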

Step 6 – Load all the YouTube Videos on the page.

When we load the LambdaTest YouTube page, only 30 videos are available (or loaded) on the page initially. As we want to scrape the details of all the videos on the page, we perform a vertical scroll until the page’s end is reached.

The document.documentElement.scrollHeight property in JavaScript returns the height of the entire document. A while loop is run for scrolling till the end of the document (or page), and the window.scrollTo method in JavaScript scrolls to a specified set of coordinates in the document.

Int64 last_height = (Int64)(((IJavaScriptExecutor)driver).ExecuteScript("return document.documentElement.scrollHeight"));
while (true)
{
   ((IJavaScriptExecutor)driver).ExecuteScript("window.scrollTo(0, document.documentElement.scrollHeight);");
   Thread.Sleep(2000);
   /* Calculate new scroll height and compare with last scroll height */
   Int64 new_height = (Int64)((IJavaScriptExecutor)driver).ExecuteScript("return document.documentElement.scrollHeight");
   if (new_height == last_height)
   {
      /* If the heights are the same, we have reached the end of the page */
      break;
   }
   last_height = new_height;
}

At each step in the while loop, the document’s current height is checked to ensure that we scroll until the page’s end. Once the current height and the previous height (of the page) are the same, it means that we have reached the end of the page, and we break out of the while loop.

This is the page when the LambdaTest YouTube Channel is loaded in the web browser:

The LambdaTest YouTube Channel page after the ‘end of the page’ scroll is performed using the scrollTo method in JavaScript.

Step 7 – Create a ReadOnlyCollection of the VideoElements on the page.

This is the most important step when it comes to scraping dynamic web pages in Selenium. Whether it is static or dynamic web page scraping, we need to identify the WebElements that house (or contain) the items from which the relevant information has to be scraped. In the case of the LambdaTest YouTube Channel (or any YouTube channel page), all the videos are enclosed under a div with id: items and class: style-scope ytd-grid-renderer.

Inside the <div> container, every video is enclosed in a ytd-grid-video-renderer element with the class style-scope ytd-grid-renderer.

Here are the details for the first 2 videos obtained using the ‘Inspect Tool’ in Chrome browser:

A By locator is created in Selenium that uses the CSS selector ytd-grid-video-renderer.style-scope.ytd-grid-renderer.

A ReadOnlyCollection (or list) of type IWebElement is created that contains the WebElements located using the FindElements method (and the CSS selector obtained in the earlier step).

By elem_video_link = By.CssSelector("ytd-grid-video-renderer.style-scope.ytd-grid-renderer");
ReadOnlyCollection<IWebElement> videos = driver.FindElements(elem_video_link);

Since the page contains 79 videos (at the time of writing this article), the Count property on the created list (or ReadOnlyCollection) returns 79.

Console.WriteLine("Total number of videos in " + test_url_1 + " are " + videos.Count);

Step 8 – Parse the list of IWebElements to obtain the MetaData of the videos.

Parse through the list created in the earlier steps to obtain the video title, views, and upload date for each video in the list.

foreach (IWebElement video in videos)
{
    string str_title, str_views, str_rel;

    IWebElement elem_video_title = video.FindElement(By.CssSelector("#video-title"));
    str_title = elem_video_title.Text;

    IWebElement elem_video_views = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[1]"));
    str_views = elem_video_views.Text;

    IWebElement elem_video_reldate = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[2]"));
    str_rel = elem_video_reldate.Text;

    vcount++;
}

8.1) Scrape Video Title for every video (in the Video List/Channel).

The meta <div> in the “style-scope ytd-grid-video-renderer” class contains every video’s metadata on the page.

The video title of each video in the list is obtained by locating the element with the CSS selector #video-title and reading its Text property.


IWebElement elem_video_title = video.FindElement(By.CssSelector("#video-title"));
str_title = elem_video_title.Text;

8.2) Scrape Video Views for every video (in the list/Channel).

The WebElement that contains the video views is obtained using the FindElement method with an XPath locator. A dot (.) is used at the start of the XPath since we want the XPath search to be restricted to the required WebElement (i.e., the video).


IWebElement elem_video_views = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[1]"));

Now that we have located the WebElement containing the video views, the Text property of the WebElement is used to obtain the video views.

str_views = elem_video_views.Text;

8.3) Scrape Upload Details for every video (in the list/Channel).

Similar to step (8.2), the XPath of the WebElement that displays the video’s upload details is obtained using the ‘Inspect Tool’ in Chrome.

Once we have the XPath of the element, the FindElement method in Selenium is used to locate the element using that XPath.

IWebElement elem_video_reldate = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[2]"));

The Text property of the WebElement gives the upload details of the video.

str_rel = elem_video_reldate.Text;

Steps 8.1 through 8.3 are repeated for all the videos in the channel (or list). In our case, the total video count was 79; hence, the sub-steps in step (8) run 79 times. We print the details of each video on the terminal; a sketch of writing the same details to a CSV file follows below.
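For completeness, here is a minimal sketch (not part of the original test) of how the same loop could write the scraped details to a CSV file instead of the terminal. The file name lambdatest_videos.csv is an arbitrary choice, and the sketch assumes the System.IO and System.Collections.Generic namespaces from the import list above:

/* Hedged sketch: persist the scraped video details to a CSV file.
   Assumes the "videos" collection from step 7 is in scope. */
var rows = new List<string> { "Title,Views,Uploaded" };
foreach (IWebElement video in videos)
{
    string title = video.FindElement(By.CssSelector("#video-title")).Text;
    string views = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[1]")).Text;
    string rel = video.FindElement(By.XPath(".//*[@id='metadata-line']/span[2]")).Text;

    /* Escape embedded quotes and wrap each field so commas in titles
       do not break the CSV layout */
    rows.Add(string.Format("\"{0}\",\"{1}\",\"{2}\"", title.Replace("\"", "\"\""), views, rel));
}
File.WriteAllLines("lambdatest_videos.csv", rows);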

Execution

Here is the truncated execution snapshot from the VS IDE, which indicates that there are a total of 79 videos on the LambdaTest YouTube channel.

We obtained the Standard Output by doing ‘Copy All’ in the Standard Output area. As seen below, we could successfully perform dynamic web page scraping of the LambdaTest YouTube channel:

Web Scraping LambdaTest Blog Page

In this demonstration, we scrape the following data from the LambdaTest Blog:

  • Blog Title
  • Blog Link
  • Blog Author
  • Blog Views & Read Duration

Though the demonstration is limited to scraping data on the blog’s first page, it can be further extended to scrape relevant information from the blog’s subsequent pages; a sketch of such an extension follows below.
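As a hedged illustration of such an extension, the sketch below walks the first few pages. It assumes (this is not verified in the article) that the blog paginates via URLs of the form https://www.lambdatest.com/blog/page/2/, and it reuses the driver, wait, and test_url_2 members from the implementation below:

/* Hedged sketch: extend the scraper across the first few blog pages.
   The /page/N/ URL pattern is an assumption, not confirmed in the article. */
for (int page = 1; page <= 3; page++)
{
    driver.Url = (page == 1) ? test_url_2 : test_url_2 + "page/" + page + "/";
    wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

    ReadOnlyCollection<IWebElement> page_blogs =
        driver.FindElements(By.CssSelector("div.col-xs-12.col-md-12.blog-list"));
    Console.WriteLine("Page " + page + " has " + page_blogs.Count + " blog articles");
}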

Here is the Selenium web scraping test scenario that will be executed on Chrome (on Windows 10). The test is run on a cloud-based Selenium Grid provided by LambdaTest.

  1. Go to https://www.lambdatest.com/blog/.
  2. Scrape the blog title, blog author, blog permalink, blog views, and read duration for each blog article on the homepage of the LambdaTest blog.
  3. Print the scraped information on the terminal.

Implementation

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Support.UI;
using System;
using System.Collections.ObjectModel;
using System.Collections.Specialized;
using System.IO;
using System.Reflection;
using System.Threading;
using OpenQA.Selenium.Safari;
using System.Collections.Generic;
using System.Web;

namespace WebScraping
{
    public class BlogScrapingTest
    {
        String test_url_2 = "https://www.lambdatest.com/blog/";
        static Int32 vcount = 1;
        public IWebDriver driver;

        /* LambdaTest Credentials and Grid URL */
        String username = "user-name";
        String accesskey = "access-key";
        String gridURL = "@hub.lambdatest.com/wd/hub";

        [SetUp]
        public void start_Browser()
        {
            /* Local Selenium WebDriver */
            /* driver = new ChromeDriver(); */
            DesiredCapabilities capabilities = new DesiredCapabilities();

            capabilities.SetCapability("user", username);
            capabilities.SetCapability("accessKey", accesskey);
            capabilities.SetCapability("build", "[C#] Demo of Web Scraping in Selenium");
            capabilities.SetCapability("name", "[C#] Demo of Web Scraping in Selenium");
            capabilities.SetCapability("platform", "Windows 10");
            capabilities.SetCapability("browserName", "Chrome");
            capabilities.SetCapability("version", "latest");

            driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities,
            TimeSpan.FromSeconds(600));
            driver.Manage().Window.Maximize();
        }

        [Test(Description = "Web Scraping LambdaTest Blog Page"), Order(2)]
        public void LTBlogScraping()
        {
            driver.Url = test_url_2;
            /* Explicit Wait to ensure that the page is loaded completely by reading the DOM state */
            var timeout = 10000; /* Maximum wait time of 10 seconds */
            var wait = new WebDriverWait(driver, TimeSpan.FromMilliseconds(timeout));
            wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

            Thread.Sleep(5000);

            /* Find total number of blogs on the page */
            By elem_blog_list = By.CssSelector("div.col-xs-12.col-md-12.blog-list");
            ReadOnlyCollection<IWebElement> blog_list = driver.FindElements(elem_blog_list);
            Console.WriteLine("Total number of videos in " + test_url_2 + " are " + blog_list.Count);

            /* Reset the variable from the previous test */
            vcount = 1;

            /* Go through the blogs list and scrape each blog to get its attributes */
            foreach (IWebElement blog in blog_list)
            {
                string str_blog_title, str_blog_author, str_blog_views, str_blog_link;

                IWebElement elem_blog_title = blog.FindElement(By.ClassName("blog-titel"));
                str_blog_title = elem_blog_title.Text;

                IWebElement elem_blog_link = blog.FindElement(By.ClassName("blog-titel"));
                IWebElement elem_blog_alink = elem_blog_link.FindElement(By.TagName("a"));
                str_blog_link = elem_blog_alink.GetAttribute("href");

                IWebElement elem_blog_author = blog.FindElement(By.ClassName("user-name"));
                str_blog_author = elem_blog_author.Text;

                IWebElement elem_blog_views = blog.FindElement(By.ClassName("comm-count"));
                str_blog_views = elem_blog_views.Text;

                /* Print the scraped information on the terminal (step 3 of the scenario) */
                Console.WriteLine("******* Blog " + vcount + " *******");
                Console.WriteLine("Blog Title: " + str_blog_title);
                Console.WriteLine("Blog Link: " + str_blog_link);
                Console.WriteLine("Blog Author: " + str_blog_author);
                Console.WriteLine("Blog Views & Read Duration: " + str_blog_views);
                Console.WriteLine("\n");
                vcount++;
            }
        }

        [TearDown]
        public void close_Browser()
        {
            driver.Quit();
        }
    }
}

Code Walkthrough

Step (1) – Step (5)

These steps remain the same as the previous example. Please refer to the earlier section for a detailed explanation of those steps.

Step 6 – Create a ReadOnlyCollection of the Blogs on the page.

On the LambdaTest Blog page, we see that each blog article is enclosed under the following <div>.

<div class="col-xs-12 col-md-12 blog-list">

Hence, the FindElements method is used with a CSS selector to locate all the blog articles on the Blog home page.

Here are the details for the first 2 blog articles obtained using the ‘Inspect Tool’ in Chrome browser:

This is an overall view of the DOM, which shows that there are a total of 10 blogs on the blog home page:

A ReadOnlyCollection (or list) is created containing the WebElements located using the FindElements method (and the CSS selector). Since there are 10 blogs on the home page, the Count property of the list (or collection) will return 10.

/* Find total number of blogs on the page */
By elem_blog_list = By.CssSelector("div.col-xs-12.col-md-12.blog-list");
ReadOnlyCollection<IWebElement> blog_list = driver.FindElements(elem_blog_list);

Step 7 – Parse the list of IWebElements to obtain the MetaData of the blogs.

Parse through the list created in step (6) to scrape the required information for every blog in the list.

foreach (IWebElement blog in blog_list)
{
    string str_blog_title, str_blog_author, str_blog_views, str_blog_link;

    IWebElement elem_blog_title = blog.FindElement(By.ClassName("blog-titel"));
    str_blog_title = elem_blog_title.Text;

    IWebElement elem_blog_link = blog.FindElement(By.ClassName("blog-titel"));
    IWebElement elem_blog_alink = elem_blog_link.FindElement(By.TagName("a"));
    str_blog_link = elem_blog_alink.GetAttribute("href");

    IWebElement elem_blog_author = blog.FindElement(By.ClassName("user-name"));
    str_blog_author = elem_blog_author.Text;

    IWebElement elem_blog_views = blog.FindElement(By.ClassName("comm-count"));
    str_blog_views = elem_blog_views.Text;

    vcount++;
}

7.1) Scrape Blog Title for each blog (in the list).

The class name ‘blog-titel’ inside the parent class ‘col-xs-12 col-md-12 blog-list’ contains the href (or link to the blog post) and the blog title.

The FindElement method is used with the ClassName locator (i.e., blog-titel) to locate the WebElement that gives the blog title. The Text property of the located WebElement gives the title of each blog post in the list.

IWebElement elem_blog_title = blog.FindElement(By.ClassName("blog-titel"));
str_blog_title = elem_blog_title.Text;

7.2) Scrape Blog Post Link from every blog (in the list).

The class name ‘blog-titel’ inside the parent class ‘col-xs-12 col-md-12 blog-list’ also contains the href (or link to the blog post). We first locate that WebElement using the ClassName property.

IWebElement elem_blog_link = blog.FindElement(By.ClassName("blog-titel"));

Once we have located the WebElement [i.e., elem_blog_link], the FindElement method is applied to it with the TagName locator set to the anchor tag [i.e., ‘a’].

IWebElement elem_blog_alink = elem_blog_link.FindElement(By.TagName("a"));

On the located WebElement [i.e., elem_blog_alink], GetAttribute in Selenium is used to get the value of that element’s ‘href’ attribute.


str_blog_link = elem_blog_alink.GetAttribute("href");

The output is the ‘href’ (or permalink) of each blog post in the list.

7.3) Scrape Author Name for every blog (in the list).

The WebElement that gives the ‘Author Name’ is located using the ClassName property. As seen below, the “user-name” class contains the author’s name.

The FindElement method locates the WebElement using the “user-name” class.

IWebElement elem_blog_author = blog.FindElement(By.ClassName("user-name"));

The Text property of the WebElement gives the author name for the located WebElement [i.e., elem_blog_author].

str_blog_author = elem_blog_author.Text;

7.4) Scrape Blog Views & Read Duration for each blog (in the list).

The WebElement that gives the ‘Blog Views & Read Duration’ is located using the ClassName property. As seen below, the “comm-count” class contains the views and estimated time duration to read that blog article.

The FindElement method locates the WebElement using the “comm-count” class.

IWebElement elem_blog_views = blog.FindElement(By.ClassName("comm-count"));

The Text property of the WebElement gives the blog views & estimated time duration for the located WebElement [i.e., elem_blog_views].

str_blog_views = elem_blog_views.Text;

Execution

Here is the truncated execution snapshot from the VS IDE, indicating that the details of the 10 blogs were scraped successfully.

Shown below is the execution snapshot of both the test scenarios that demonstrated scraping dynamic web pages in Selenium:

We are done Scraping!

In this Selenium C# tutorial, we laid the foundation blocks for web scraping with Selenium C#. There is a lot of difference between scraping static web pages and scraping dynamic web pages. There are a number of tools like VisualScrapper, HTMLAgilityPack, etc., used for scraping static web pages. However, Selenium is the most preferred tool when it comes to dynamic web page scraping. The FindElements method in Selenium helps locate the list (or collection) of web elements, and the FindElement method is used on the items of that collection to scrape the relevant information from each object in the list.

Do let us know how you use Selenium for dynamic web page scraping, and please leave your feedback in the comments section…

Happy Scraping!
