DEV Community

Create a Simple Web Scraper in C#

Rachel Soderberg on July 29, 2019

Web scraping is a skill that can come in handy in a number of situations, mainly when you need to get a particular set of data from a website. I be...

Matthew F. • Edited

I found this useful, but I admit to getting a bit stuck when connecting each of your steps together. To help others in the future, here's a Gist that links everything together.

I admit its output isn't as neat as yours, so I have a mistake somewhere... but it's a start. One quick note: it's WPF rather than WinForms, so take that into consideration for all UI interactions.

Aaron L Carrick

Follow the link Matthew F. posted, but edit these lines and everything will work!

Regarding:
gist.github.com/CodeCommissions/43...

Change this:

        foreach (var term in QueryTerms)
        {
            articleLink = document.All.Where(x =>
                x.ClassName == "views-field views-field-nothing" &&
                (x.ParentElement.InnerHtml.Contains(term) || x.ParentElement.InnerHtml.Contains(term.ToLower())));

            // Overwriting articleLink above means we have to print its result for each of the QueryTerms.
            // Appending to a pre-declared IEnumerable (like a List) could mean taking this out of the main loop.
            if (articleLink.Any())
            {
                PrintResults(articleLink);
            }
        }

To this:

        foreach (var term in QueryTerms)
        {
            articleLink = document.All.Where(x =>
                x.ClassName == "views-field views-field-nothing" &&
                (x.ParentElement.InnerHtml.Contains(term) || x.ParentElement.InnerHtml.Contains(term.ToLower()))).Skip(1);

            // Overwriting articleLink above means we have to print its result for each of the QueryTerms.
            // Appending to a pre-declared IEnumerable (like a List) could mean taking this out of the main loop.
            if (articleLink.Any())
            {
                PrintResults(articleLink);
            }
        }

Note the added call:

.Skip(1)

The output was ugly because the first element in the IEnumerable was not filtered properly. Instead of spending a lot of time filtering through that mess, we simply skip the first element :)
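
The code comment in the snippet hints that appending to a pre-declared list could move the printing out of the main loop. Here is a minimal sketch of that idea, under some loud assumptions: plain strings stand in for each element's InnerHtml, `MatchCollector` and `CollectMatches` are hypothetical names that are not in the Gist, and the sample data is made up. It also swaps the `Contains(term) || Contains(term.ToLower())` pair for a single case-insensitive `IndexOf`, which matches more broadly than the Gist does, so treat it as an alternative rather than the Gist's exact behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchCollector
{
    // Returns every block that mentions at least one term, ignoring case.
    // Each block appears at most once, in its original order.
    // (Hypothetical helper; not part of the Gist or of AngleSharp.)
    public static List<string> CollectMatches(
        IEnumerable<string> blocks, IEnumerable<string> terms)
    {
        var termList = terms.ToList();
        return blocks
            .Where(block => termList.Any(term =>
                block.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }

    public static void Main()
    {
        // Assumption: made-up strings standing in for the InnerHtml of scraped elements.
        var blocks = new[]
        {
            "<a href=\"/news/1\">Ocean levels rising</a>",
            "<a href=\"/news/2\">Local bakery opens</a>",
            "<a href=\"/news/3\">New OCEAN research vessel</a>",
        };
        var queryTerms = new[] { "Ocean", "Research" };

        // Collect first, then print once, outside the per-term search.
        foreach (var match in CollectMatches(blocks, queryTerms))
        {
            Console.WriteLine(match);
        }
    }
}
```

Because each block is tested against all terms at once, a block that mentions two terms is collected only once, so the results can be printed a single time after the search.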

Matthew F.

Thanks for the fix ^_^
I've updated the Gist to include your suggestion.

aldoresendiz

Thank you for your contribution, but this is definitely not a beginner's project. I was in the middle of adding the missing parts of the project (the namespaces), and I'm done. I went through it because it looked like a very simple and easy project; now I'm off looking for something "easier".

Anjan Kant

This article (dev.to/anjankant/visual-studio-lea...) will also help you scrape a whole website.

Rachel Soderberg

Does this follow a similar method as I wrote above? I see it's using the HTML Agility Pack library, and I'm not familiar with that.

Anjan Kant

Yes Rachel, HTMLAgilityPack is a more advanced library that relies on XPath extraction and also uses LINQ. I have written about scraping websites in depth and have scraped a number of websites myself using HTMLAgilityPack. But you explained beautifully how to get started with web scraping.

Rachel Soderberg

Very cool! I'll have to check it out next time I have some free time for a personal project. Thanks for the recommendation, your articles look very good as well.

Anjan Kant

Thanks, Rachel, for taking the time. If you need any help, please message me.

Alexander248365

Hi Rachel,

Thank you for the post - I've just discovered AngleSharp!

How should I modify the search if I want to go to a site, set values in its search controls, and imitate clicking the Search button? For example, on the site app.toronto.ca/DevelopmentApplicat..., I want to set the filter New Development = 30 days, click Search, and read the results below.

Thank you so much,
Alexander

Yiğit İrez

This may be just me, but what I look for in a nicely written blog post such as this one, with the title "create-a-simple-web-scraper", is completeness, because it should be a foolproof starter for beginners.

The code here doesn't work without adding the missing parts and fixing implied wrong usage suggestions.

Rachel Soderberg • Edited

I'm sorry you cannot get it working, but I built the application from the ground up while writing the post. It absolutely does work and is in its fullest form; there are no missing parts, and I'm unsure what you mean by "implied wrong usage suggestions". Could you be more specific?

I would be glad to help you get the application working. Can you provide the error you're getting and perhaps a link to your code?

hanneslim

Thank you very much for your tutorial! It helped me a lot! I successfully built my own C# web scraper: nerd-corner.com/how-to-program-a-w...

Dung Hoang Seothetop • Edited

I found something useful from your post and want to apply it to my blog seothetop.com, I will create an xml sitemap generator to submit to Google search
Thank You so much

PhilB313

How would this work on a website that uses a login session?

techmax

I found something useful from your post and want to apply it to my blog mucintechmax.com.vn/, I will create an xml sitemap generator to submit to Google search
Thank You so much

Ibne Nahian

Thank you for this awesome post

tombohub

Can you tag those code samples with C# so they get syntax color?

TLX GROUP • Edited

I found something useful from your post and want to apply it to my blog tlx.asia/, I will create an xml sitemap generator to submit to Google search
Thank You so much