I decided to learn web scraping this month. The first thing I did was watch these courses on Pluralsight:
Scraping Dynamic Web Pages with Python and Selenium
Scraping Your First Web Page with Python
Exploring Web Scraping with Python
Web scraping can be done with Python libraries like Requests and BeautifulSoup. This approach assumes that you know all the URLs in advance and only need to parse each page's source.
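To make that concrete, here is a minimal sketch of the Requests + BeautifulSoup approach. The function names `extract_links` and `scrape` are my own for illustration, and the choice to collect links is just an example; what you extract depends on the page you are scraping.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


def scrape(url):
    """Download a page whose URL is known in advance and parse its source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_links(response.text)
```

Calling `scrape("https://example.com/")` (a placeholder URL) would return the links found in that page's HTML. Keeping the parsing in its own function means you can try it on a saved HTML string without hitting the network.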
But if you need to scrape a dynamic page (e.g., a div that is rendered only after a specific button is clicked), you will need a library like Selenium to emulate user interactions.
Once I was confident with the basics, I took a step further and learned the Scrapy framework. It has a steeper learning curve than standalone Python libraries because you have to understand how objects flow through the framework. The main advantage is that you don't have to keep rewriting boilerplate code (writing data to files, handling URL requests, data modelling) because those pieces are already integrated into its pipeline.
Here are some Scrapy Pluralsight courses that helped me:
Crawling the Web with Python and Scrapy
Extracting Structured Data from the Web Using Scrapy