Web crawling is the act of having a program or script access a website, capture its content and discover any pages linked from that content. On the surface it is really only performing HTTP requests and parsing HTML, both of which can be accomplished quite easily in a variety of languages and frameworks.
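To make that concrete, here is a minimal sketch of the idea in C#: fetch a page with HttpClient and naively pull out linked URLs with a regex. The URL is a placeholder and a real crawler would use a proper HTML parser instead of a regex, but it shows how little is needed at the core.

```csharp
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class MinimalCrawlExample
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Fetch the page content (example.org is just a placeholder).
            var html = await client.GetStringAsync("https://example.org/");

            // Naively discover links by matching href attributes.
            foreach (Match match in Regex.Matches(html, "href=\"(.*?)\""))
            {
                Console.WriteLine(match.Groups[1].Value);
            }
        }
    }
}
```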
Web crawling is an extremely important tool for search engines or anyone wanting to analyse a website. The act of crawling a site, though, can consume a lot of resources for the site operator depending on how the site is crawled.
For example, if you crawl a 1000-page site in a few seconds, you've likely caused a not insignificant amount of server load on low-bandwidth hosting. What if you crawled a slow-loading page but your crawler didn't handle it properly, continuously re-querying the same page? What if you crawled pages that shouldn't be crawled at all? These things can lead to very upset website operators.
In a previous article, I wrote about the Robots.txt file and how it can help address these problems from the website operator's perspective. Web crawlers should (but don't have to) abide by the rules set out in that file. In addition to the Robots.txt file, there are some other things crawlers should do to avoid being blocked.
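For illustration only, here is a deliberately naive sketch of checking a path against a site's Disallow rules. It ignores user-agent groups, Allow rules and wildcards, so it is nowhere near a compliant implementation; that is what a proper Robots.txt parsing library is for.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class NaiveRobotsCheck
{
    static async Task<bool> IsAllowedAsync(HttpClient client, Uri site, string path)
    {
        // Download the site's robots.txt (assumes it exists and returns 200 OK).
        var robotsTxt = await client.GetStringAsync(new Uri(site, "/robots.txt"));

        // Collect every "Disallow:" rule and check the path against them.
        var disallowedRules = robotsTxt
            .Split('\n')
            .Select(line => line.Trim())
            .Where(line => line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            .Select(line => line.Substring("Disallow:".Length).Trim())
            .Where(rule => rule.Length > 0);

        return !disallowedRules.Any(rule => path.StartsWith(rule, StringComparison.Ordinal));
    }
}
```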
When crawling a website on a large scale, especially for commercial purposes, it is a good idea to provide a custom user agent, giving website operators a chance to restrict which pages can be crawled.
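If you are rolling your own requests with HttpClient, sending a descriptive user agent might look something like the sketch below; the agent string and contact URL are made up for illustration.

```csharp
using System.Net.Http;

class UserAgentExample
{
    static HttpClient CreateCrawlerClient()
    {
        var client = new HttpClient();

        // A product token plus a comment pointing at a page describing the crawler.
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "MyCompanyCrawler/1.0 (+https://example.com/crawler-info)");

        return client;
    }
}
```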
Crawl frequency is another aspect you will want to refine so you can crawl a site fast enough without becoming a performance burden. You will most likely want to limit crawling to a handful of requests a second. It is also a good idea to track how long requests are taking and throttle the crawler to compensate for potential site load issues.
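As a rough sketch of that idea (the delays and thresholds here are arbitrary values, not recommendations), you could keep a base delay between requests and stretch it out whenever responses start slowing down:

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class ThrottledFetcher
{
    private static readonly TimeSpan BaseDelay = TimeSpan.FromMilliseconds(500); // ~2 requests per second
    private static readonly TimeSpan SlowResponseThreshold = TimeSpan.FromSeconds(2);

    static async Task CrawlAsync(HttpClient client, Uri[] urls)
    {
        var delay = BaseDelay;

        foreach (var url in urls)
        {
            var stopwatch = Stopwatch.StartNew();
            await client.GetStringAsync(url);
            stopwatch.Stop();

            // Back off when responses are slow, otherwise return to the base delay.
            delay = stopwatch.Elapsed > SlowResponseThreshold
                ? TimeSpan.FromTicks(delay.Ticks * 2)
                : BaseDelay;

            await Task.Delay(delay);
        }
    }
}
```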
Actual footage of a server catching fire because of load, totally not from a TV Show
I spend my days programming in the world of .NET and needed a web crawler for a project of mine. There are some popular web crawlers already out there, including Abot and DotnetSpider, but for different reasons they didn't suit my needs.
I originally had Abot set up in my project, however I have been porting my project to .NET Core and Abot didn't support it. The library also depends on a no-longer-supported version of a library that parses Robots.txt files.
DotnetSpider does support .NET Core, but it is designed around an entirely different workflow involving message queues, model binding and built-in database writing. These are cool features but excessive for my needs.
I wanted a simple crawler, supporting async/await, with .NET Core support. Thus, InfinityCrawler was born!
TurnerSoftware / InfinityCrawler
A simple but powerful web crawler library for .NET
Features
- Obeys robots.txt (crawl delay & allow/disallow)
- Obeys in-page robots rules (`X-Robots-Tag` header and `<meta name="robots" />` tag)
- Uses sitemap.xml to seed the initial crawl of the site
- Built around a parallel task `async`/`await` system
- Swappable request and content processors, allowing greater customisation
- Auto-throttling (see below)
Licensing and Support
Infinity Crawler is licensed under the MIT license. It is free to use in personal and commercial projects.
There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.
Polite Crawling
The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that allow adjustments…
I'll be honest, I don't know why I called it InfinityCrawler - it sounded cool at the time so I just went with it.
This crawler is built on .NET Standard and builds upon both my SitemapTools and RobotsExclusionTools libraries. It uses the Sitemap library to help seed the list of URLs it should start crawling.
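To give a feel for the sitemap seeding idea, here is a plain XDocument sketch (not how SitemapTools actually does it) that pulls every `<loc>` URL out of the conventional /sitemap.xml location as the starting list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

class SitemapSeedExample
{
    static async Task<List<Uri>> GetSeedUrlsAsync(HttpClient client, Uri site)
    {
        // Assumes the sitemap lives at the conventional /sitemap.xml location.
        var xml = await client.GetStringAsync(new Uri(site, "/sitemap.xml"));
        var document = XDocument.Parse(xml);

        // Every <loc> element holds a URL the crawl can start from.
        return document
            .Descendants()
            .Where(element => element.Name.LocalName == "loc")
            .Select(element => new Uri(element.Value.Trim()))
            .ToList();
    }
}
```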
The crawler has built-in support for crawl frequency, including obeying the crawl delay defined in the Robots.txt file. It can detect slow requests and automatically throttle itself to avoid thrashing the website, then return to normal once performance improves.
using InfinityCrawler;

var crawler = new Crawler();

// siteUri is the Uri of the site to crawl; the user agent is passed through the crawl settings.
var results = await crawler.Crawl(siteUri, new CrawlSettings
{
    UserAgent = "Your Awesome Crawler User Agent Here"
});
InfinityCrawler, while available for use in any .NET project, is still in its early stages. I am happy with its core functionality, but it will likely go through a few rounds of restructuring as well as expanded testing.
I am personally pretty proud of how I implemented the async/await part, but I would love to talk to anyone who is an expert in this area of .NET to check my implementation and give pointers on how to improve it.
Top comments (5)
Thank you! Great read! InfinityCrawler seems like a neat solution for web crawling in .NET. Kudos on the implementation! If you're into optimizing your crawling experience, check out Crawlbase too!
I've just recently found out how search engines in general and web crawlers in particular work: litslink.com/blog/what-is-a-web-cr....
Now I'm studying Python and I'm wondering how much time it will take for a junior Python developer to make a decent web crawler.
Fun project! Giving a star as a mental note, this might be useful one day :)
What about the UserAgent in the crawl settings? Can I provide one like this: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"?
Yep, you can supply any user agent in the crawl settings (see example). Providing a user agent like that will work perfectly fine, but it circumvents a site's ability to direct, via the "robots.txt" file, what content can or can't be crawled.