Something I never understood is why people overlook NodeJS for web scraping and reach for Python instead.
People seem to forget that JavaScript is the language of the web, and that, for better or worse, great libraries for it are released all the time.
In my experience with web scraping and web automation, each project has its own challenges, and NodeJS always seems to have the right tool for the job.
If you have to scrape a simple page, just use a library to make requests, or maybe even the native Fetch API will do.
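As a minimal sketch using Node's built-in fetch (Node 18+, in an ES module), with example.com standing in for the page you actually want:

```js
// Fetch a static page with Node's built-in fetch and pull out the <title>.
// The URL is just a placeholder.
const response = await fetch('https://example.com');
const html = await response.text();

// A quick regex is enough for a trivial static page; for anything more
// involved, an HTML parser such as cheerio is a better fit.
const title = html.match(/<title>(.*?)<\/title>/)?.[1];
console.log(title);
```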
When scraping a "JS-rendered" page (React, Svelte, Vue), use a browser automation library: Puppeteer (from Google), Selenium, or Playwright, the newer one maintained by Microsoft.
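Here is a rough sketch of that with Playwright (ES module, Node 18+); the URL and the h1 selector are just placeholders:

```js
import { chromium } from 'playwright';

// Launch headless Chromium and open the JS-rendered page.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// waitForSelector makes sure the client-side framework has rendered
// the element we want before we read it.
const heading = await page.waitForSelector('h1');
console.log(await heading.textContent());

await browser.close();
```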
Sometimes, even when using a browser automation library, websites may block you with bot detection. That is the main reason you should consider replacing Python with NodeJS in some cases: NodeJS has great tooling for avoiding bot detection while scraping, such as:
1. Puppeteer-extra, which comes with great plugins for Puppeteer and, more recently, Playwright support as well. You can use the stealth plugin (puppeteer-extra-plugin-stealth) to avoid detection, as shown in the first sketch after this list.
2. To avoid bot detection with plain request-based scraping, try swapping out the User-Agent and other small headers; libraries like Axios and got handle this fine with just a few adjustments (second sketch below).
3. And of course, now we have Crawlee, "a full-featured library that helps you build reliable crawlers with NodeJS": think Scrapy, but for NodeJS. It will surely make life easier for all of us, because it handles bot detection, storage, proxy management, and other scraping-related problems out of the box, which covers the majority of use cases (third sketch below).
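A minimal sketch of point 1, assuming puppeteer-extra and puppeteer-extra-plugin-stealth are installed alongside Puppeteer (ES module syntax, placeholder URL):

```js
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// The stealth plugin patches many of the fingerprints that common
// bot-detection scripts check for (navigator.webdriver, etc.).
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.title());
await browser.close();
```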
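For point 2, a sketch of a plain Axios request with a browser-like User-Agent; the UA string and URL are just examples:

```js
import axios from 'axios';

const { data } = await axios.get('https://example.com', {
  headers: {
    // Any realistic desktop UA string works here; this one is only an example.
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  },
});

console.log(data.length);
```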
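And for point 3, a sketch of a tiny Crawlee crawler, assuming a recent Crawlee v3 where the request handler context exposes pushData and enqueueLinks:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, pushData, enqueueLinks }) {
    // $ is the cheerio handle for the fetched page.
    await pushData({ url: request.url, title: $('title').text() });

    // Follow links found on the page (same-hostname strategy by default);
    // storage, retries, and proxy rotation are handled by the library.
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']);
```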
This post does not intend to claim that NodeJS should replace Python for web scraping, but to let people know that NodeJS can be not only suitable, but a great choice for your next scraping project 😁