Splash is a JavaScript rendering service. I don't know much about this service yet. All I know is that it is one of many tools that can help me scrape sites that need JavaScript enabled to run. Splash also works well alongside Scrapy, the web scraping framework I'm currently learning. And as always, if a service can be installed using Docker, I'll give the Docker way a try.
Pulling the Image
As instructed on the Docker registry page, we can pull the latest Splash image using this docker command (the image is quite large, so prepare your internet connection):
docker pull scrapinghub/splash
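If you prefer to pin a specific version instead of latest, the registry also publishes tagged releases. For example (3.5 was an available tag at the time of writing; check the registry page for the current list):
docker pull scrapinghub/splash:3.5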
And when we check the listed images using docker image ls, we can see that it is indeed huge:
scrapinghub/splash latest 9364575df985 12 months ago 1.89GB
Run as a Container Service
You can name the service anything you want, but here let's call it splash-test. We forward port 8050:8050 so we can access it from the browser. Here is the full command to create and run the container:
docker run --name splash-test -p 8050:8050 -d scrapinghub/splash
Once it is created, you can check whether the service is running or stopped using docker container ls:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6e49662c03a7 scrapinghub/splash "python3 /app/bin/sp…" 48 seconds ago Up 46 seconds 0.0.0.0:8050->8050/tcp, :::8050->8050/tcp splash-test
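Besides checking the container list, you can also verify that Splash itself is responding. According to the Splash HTTP API docs, the service exposes a /_ping endpoint that returns a small JSON body with "status": "ok" (treat this as an assumption if your version differs):
curl http://localhost:8050/_ping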
You can also check the resources used by the service with docker stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
6e49662c03a7 splash-test 0.08% 181.8MiB / 6.043GiB 2.94% 1.09MB / 987kB 0B / 0B 37
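Memory usage tends to grow over long rendering sessions. The Splash documentation mentions a --maxrss option that you can append after the image name to cap resident memory, something like this (the 3000 MB value is just an example, not a recommendation):
docker run --name splash-test -p 8050:8050 -d scrapinghub/splash --maxrss 3000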
Render a JavaScript-Required Site
You can access the service in your browser at http://localhost:8050/.
If you have successfully followed along to this point, you can start rendering any website that needs JavaScript enabled to view its pages. For example, try https://www.transfermarkt.com/, because I found that this site can't be viewed when JavaScript is disabled in the browser. Fill the URL form with it and hit the green Render me! button.
As the result, you will see a snapshot image of the site, some statistics, and more importantly the raw HTML document, ready for you to scrape.
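Of course, for actual scraping you won't be clicking a button in the browser. The same render is available over Splash's HTTP API; for example, the render.html endpoint returns the JavaScript-rendered HTML directly (wait is the number of seconds Splash gives the page's scripts before returning):
curl "http://localhost:8050/render.html?url=https://www.transfermarkt.com/&wait=2" -o rendered.html
There are also sibling endpoints like render.png and render.json if you want a screenshot or structured output instead.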
That's it, have fun scraping!