Dealing with a website that uses lots of JavaScript to render its content can be tricky. These days, more and more sites use frameworks like Angular, React, or Vue.js for their frontend.
These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API.
The problem you will typically encounter is that your headless browser downloads the HTML and the JavaScript, but cannot execute all of the JavaScript, so the page is never fully rendered.
There are two solutions to this problem. The first is to use a better headless browser. The second is to inspect the API calls made by the JavaScript frontend and reproduce them.
It can be challenging to scrape these single-page applications (SPAs) because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce what the JavaScript code does: manually inspect all the network calls with your browser inspector, and replicate the AJAX calls that return interesting data.
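As a rough illustration of replicating an AJAX call, here is a minimal sketch using only the standard library. The endpoint URL, the query parameters, and the `items`/`name` fields in the JSON body are all hypothetical; in practice you would copy the real ones from the Network tab of your browser inspector.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, standing in for one found in the browser's Network tab.
API_URL = "https://example.com/api/products"


def build_request(page, per_page=20):
    """Rebuild the request the frontend sends, with headers an XHR often carries."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0",
        },
    )


def extract_names(payload):
    """Pull the interesting fields out of the decoded JSON body (assumed shape)."""
    return [item["name"] for item in payload.get("items", [])]


def fetch_page(page):
    """Send the request and return the extracted names; no browser involved."""
    with urllib.request.urlopen(build_request(page), timeout=10) as response:
        return extract_names(json.load(response))
```

Because no page is rendered and no JavaScript is executed, this kind of request is usually much cheaper than driving a browser, at the cost of having to track the API whenever the site changes it.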
Depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser capable of interpreting and executing all the JavaScript code in order to render the page. That is what the next part is about.
Headless Chrome with Python
PhantomJS was the leader in this space; it was (and still is) heavily used for browser automation and testing. After hearing the news about the release of headless mode in Chrome, the PhantomJS maintainer said that he was stepping down as maintainer because, and I quote, "Google Chrome is faster and more stable than PhantomJS [...]". It looks like Chrome in headless mode is becoming the way to go when it comes to browser automation and dealing with JavaScript-heavy websites.
Prerequisites
You will need to install the selenium package:
pip install selenium
And of course, you need the Chrome browser and ChromeDriver installed on your system.
On macOS, you can simply use brew:
brew install chromedriver
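If you want to confirm the setup before going further, a small check along these lines can help. It is only a sketch: the function name `missing_prerequisites` is made up for this example, and it only verifies that `chromedriver` is on your PATH and that the `selenium` package is importable, not that their versions match your Chrome.

```python
import importlib.util
import shutil


def missing_prerequisites(binaries=("chromedriver",), packages=("selenium",)):
    """Return the names of required binaries and packages that cannot be found."""
    # shutil.which returns None when the executable is not on the PATH.
    missing = [name for name in binaries if shutil.which(name) is None]
    # find_spec returns None when a top-level package is not installed.
    missing += [name for name in packages if importlib.util.find_spec(name) is None]
    return missing
```

Calling `missing_prerequisites()` returns an empty list when both were found, and the names of whatever is missing otherwise.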
Taking a screenshot
We are going to use Chrome to take a screenshot of Nintendo's home page, which uses lots of JavaScript.
> chrome.py
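from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# Recent Selenium 4 releases dropped `options.headless`; pass the Chrome flag instead.
options.add_argument("--headless")
options.add_argument("--window-size=1920,1200")

# Selenium 4 takes the driver path through a Service object instead of executable_path.
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=options)
try:
    driver.get("https://www.nintendo.com/")
    driver.save_screenshot("screenshot.png")
finally:
    # Always shut the browser down, even if loading the page fails.
    driver.quit()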
The code is really straightforward. I just added the --window-size parameter because the default size was too small.
You should now have a nice screenshot of Nintendo's home page in screenshot.png.
Waiting for the page load
Most of the time, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to complete to get the fully rendered page.
A simple solution is to just time.sleep() for an arbitrary amount of time. The problem with this method is that you end up waiting either too long or not long enough, depending on your latency and internet connection speed.
The other solution is to use the WebDriverWait object from the Selenium API:
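from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
delay = 10  # maximum wait in seconds (assumed value; pick one that suits your site)
try:
    # Polls until an element with name="chart" is in the DOM, or raises after `delay` seconds.
    elem = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.NAME, 'chart')))
    print("Page is ready!")
except TimeoutException:
    print("Timeout")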
This is a great solution because it waits only as long as necessary for the element to be present in the page's DOM, up to the delay you set.
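To make the difference with a fixed sleep concrete, here is a simplified sketch of the polling idea behind this kind of wait. It is not Selenium's actual implementation; `wait_until` and its parameters are names invented for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds, instead of sleeping a fixed time.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The wait ends at the first poll where the condition holds, so the extra time spent is at most one poll interval, while the timeout caps the worst case.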
Conclusion
As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is managing it in production. If you scrape lots of different websites, the resource usage will be volatile.
There will be CPU spikes and memory spikes, just like with a regular Chrome browser. After all, your Chrome instance will execute untrusted and unpredictable third-party JavaScript code! Then there is also the zombie-process problem.
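One common source of leftover browser processes is a script that crashes before it reaches driver.quit(). A minimal sketch of one way to guard against that is a context manager that always quits the browser. The name `managed_browser` is hypothetical, and this only covers exceptions raised in Python; it does not help if the Python process itself is killed.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a driver with `factory` and guarantee quit() runs when the block exits."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so the browser is not left behind.
        driver.quit()
```

With Selenium, this could be used as `with managed_browser(lambda: webdriver.Chrome(options=options)) as driver:` followed by the scraping code.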
This is one of the reasons I started ScrapingBee, so that developers can focus on extracting the data they want, not managing headless browsers and proxies!
This was my first post about scraping. I hope you enjoyed it!
If you did, please let me know and I'll write more.
If you want to know more about ScrapingBee, you can read about it here.
Top comments (3)
Wow! This blog is like a friend guiding you through the maze of scraping single-page applications, especially those tricky ones packed with JavaScript. Plus, it sheds light on the challenges of managing resources in production. If you want an easier way to handle scraping tasks, check out Crawlbase.
What about sending direct HTTP requests to the API endpoints that the SPA uses instead of bringing a headless browser to the game?
This is exactly what I do. I also wrote a Chrome extension that will tell you when a page is built as an SPA and let you view the data or export it directly.
It will generate code snippets you can use to pull this data out using just HTTP requests, as well.