Web crawling is the act of having a program or script access a website, capture its content and discover any pages linked from that content. On the surface it is really only performing HTTP requests and parsing HTML, both of which can be accomplished quite easily in a variety of languages and frameworks.
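To make that concrete, here is a minimal sketch of the idea in C#: fetch a page with HttpClient and naively pull out linked URLs with a regex. The URL is a placeholder and a real crawler would use a proper HTML parser instead of a regex, but it shows how little is needed at the core.

```csharp
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class MinimalCrawlExample
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Fetch the page content (example.org is just a placeholder).
            var html = await client.GetStringAsync("https://example.org/");

            // Naively discover links by matching href attributes.
            foreach (Match match in Regex.Matches(html, "href=\"(.*?)\""))
            {
                Console.WriteLine(match.Groups[1].Value);
            }
        }
    }
}
```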
Web crawling is an extremely important tool for search engines or anyone wanting to analyse a website. The act of crawling a site, though, can consume a lot of resources for the site operator depending on how the site is crawled.
For example, if you crawl a 1000-page site in a few seconds, you've likely caused a not insignificant amount of server load on low-bandwidth hosting. What if you crawled a slow-loading page but your crawler didn't handle it properly, continuously re-querying the same page? What if you crawled pages that shouldn't be crawled at all? These things can lead to very upset website operators.
In a previous article, I wrote about the Robots.txt file and how it can help address these problems from the website operator's perspective. Web crawlers should (but don't have to) abide by the rules set out in that file. In addition to the Robots.txt file, there are some other things crawlers should do to avoid being blocked.
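For illustration only, here is a deliberately naive sketch of checking a path against a site's Disallow rules. It ignores user-agent groups, Allow rules and wildcards, so it is nowhere near a compliant implementation; that is what a proper Robots.txt parsing library is for.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class NaiveRobotsCheck
{
    static async Task<bool> IsAllowedAsync(HttpClient client, Uri site, string path)
    {
        // Download the site's robots.txt (assumes it exists and returns 200 OK).
        var robotsTxt = await client.GetStringAsync(new Uri(site, "/robots.txt"));

        // Collect every "Disallow:" rule and check the path against them.
        var disallowedRules = robotsTxt
            .Split('\n')
            .Select(line => line.Trim())
            .Where(line => line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            .Select(line => line.Substring("Disallow:".Length).Trim())
            .Where(rule => rule.Length > 0);

        return !disallowedRules.Any(rule => path.StartsWith(rule, StringComparison.Ordinal));
    }
}
```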
When crawling a website on a large scale, especially for commercial purposes, it is a good idea to provide a custom user agent, giving website operators a chance to restrict which pages can be crawled.
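If you are rolling your own requests with HttpClient, sending a descriptive user agent might look something like the sketch below; the agent string and contact URL are made up for illustration.

```csharp
using System.Net.Http;

class UserAgentExample
{
    static HttpClient CreateCrawlerClient()
    {
        var client = new HttpClient();

        // A product token plus a comment pointing at a page describing the crawler.
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "MyCompanyCrawler/1.0 (+https://example.com/crawler-info)");

        return client;
    }
}
```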
Crawl frequency is another aspect you will want to refine so you can crawl a site fast enough without becoming a performance burden. You will most likely want to limit crawling to a handful of requests a second. It is also a good idea to track how long requests are taking and throttle the crawler to compensate for potential site load issues.
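As a rough sketch of that idea (the delays and thresholds here are arbitrary values, not recommendations), you could keep a base delay between requests and stretch it out whenever responses start slowing down:

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class ThrottledFetcher
{
    private static readonly TimeSpan BaseDelay = TimeSpan.FromMilliseconds(500); // ~2 requests per second
    private static readonly TimeSpan SlowResponseThreshold = TimeSpan.FromSeconds(2);

    static async Task CrawlAsync(HttpClient client, Uri[] urls)
    {
        var delay = BaseDelay;

        foreach (var url in urls)
        {
            var stopwatch = Stopwatch.StartNew();
            await client.GetStringAsync(url);
            stopwatch.Stop();

            // Back off when responses are slow, otherwise return to the base delay.
            delay = stopwatch.Elapsed > SlowResponseThreshold
                ? TimeSpan.FromTicks(delay.Ticks * 2)
                : BaseDelay;

            await Task.Delay(delay);
        }
    }
}
```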
Actual footage of a server catching fire because of load, totally not from a TV Show
I spend my days programming in the world of .NET and needed a web crawler for a project of mine. There are some popular web crawlers already out there, including Abot and DotnetSpider, but for different reasons they didn't suit my needs.
I originally had Abot set up in my project, however I have been porting my project to .NET Core and Abot didn't support it. The library also depends on a no-longer-supported version of a library that parses Robots.txt files.
DotnetSpider does support .NET Core, but it is designed around an entirely different workflow involving message queues, model binding and built-in database writing. These are cool features but excessive for my needs.
I wanted a simple crawler, supporting async/await, with .NET Core support. Thus, InfinityCrawler was born!
TurnerSoftware / InfinityCrawler
A simple but powerful web crawler library for .NET
Features
- Obeys robots.txt (crawl delay & allow/disallow)
- Obeys in-page robots rules (`X-Robots-Tag` header and `<meta name="robots" />` tag)
- Uses sitemap.xml to seed the initial crawl of the site
- Built around a parallel task `async`/`await` system
- Swappable request and content processors, allowing greater customisation
- Auto-throttling (see below)
Licensing and Support
Infinity Crawler is licensed under the MIT license. It is free to use in personal and commercial projects.
There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.
Polite Crawling
The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that allow adjustments…
I'll be honest, I don't know why I called it InfinityCrawler - it sounded cool at the time so I just went with it.
This crawler is built on .NET Standard and builds upon both my SitemapTools and RobotsExclusionTools libraries. It uses the Sitemap library to help seed the list of URLs it should start crawling.
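To give a feel for the sitemap seeding idea, here is a plain XDocument sketch (not how SitemapTools actually does it) that pulls every `<loc>` URL out of the conventional /sitemap.xml location as the starting list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

class SitemapSeedExample
{
    static async Task<List<Uri>> GetSeedUrlsAsync(HttpClient client, Uri site)
    {
        // Assumes the sitemap lives at the conventional /sitemap.xml location.
        var xml = await client.GetStringAsync(new Uri(site, "/sitemap.xml"));
        var document = XDocument.Parse(xml);

        // Every <loc> element holds a URL the crawl can start from.
        return document
            .Descendants()
            .Where(element => element.Name.LocalName == "loc")
            .Select(element => new Uri(element.Value.Trim()))
            .ToList();
    }
}
```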
The crawler has built-in support for crawl frequency, including obeying the crawl delay defined in the Robots.txt file. It can detect slow requests and automatically throttle itself to avoid thrashing the website, then return to normal once performance improves.
using InfinityCrawler;

var crawler = new Crawler();

// siteUri is the Uri of the site to crawl; the user agent is passed through the crawl settings.
var results = await crawler.Crawl(siteUri, new CrawlSettings
{
    UserAgent = "Your Awesome Crawler User Agent Here"
});
InfinityCrawler, while available for use in any .NET project, is still in its early stages. I am happy with its core functionality, but it will likely go through a few rounds of restructuring as well as expanded testing.
I am personally pretty proud of how I implemented the async/await part, but I would love to talk to anyone who is an expert in this area of .NET to check my implementation and give pointers on how to improve it.
Top comments (5)
Thank you! Great read! InfinityCrawler seems like a neat solution for web crawling in .NET. Kudos on the implementation! If you're into optimizing your crawling experience, check out Crawlbase too!
I've just recently found out how search engines in general and web crawlers in particular work: litslink.com/blog/what-is-a-web-cr....
Now I'm studying Python and I'm wondering how much time it will take for a junior Python developer to make a decent web crawler.
Fun project! Giving a star as a mental note, this might be useful one day :)
What about the UserAgent in the crawl settings? Can I provide one like this: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"?
Yep, you can supply any user agent in the crawl settings (see example). Providing a user agent like that will work perfectly fine, but it circumvents a site's ability to direct, via the "robots.txt" file, what content can or can't be crawled.