Talles L

Crawling a website with wget

Here's an example that I've used to get all the pages from Paul Graham's website:

```shell
$ wget --recursive --level=inf --no-remove-listing --wait=6 --random-wait \
       --adjust-extension --no-clobber --continue -e robots=off \
       --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36" \
       --domains=paulgraham.com https://paulgraham.com
```
| Parameter | Description |
| --- | --- |
| `--recursive` | Enables recursive downloading (following links) |
| `--level=inf` | Sets the recursion depth to infinite |
| `--no-remove-listing` | Keeps the ".listing" files that are created to keep track of directory listings |
| `--wait=6` | Waits the given number of seconds between requests |
| `--random-wait` | Multiplies `--wait` by a random factor between 0.5 and 1.5 for each request |
| `--adjust-extension` | Ensures the ".html" extension is added to downloaded HTML files |
| `--no-clobber` | Does not redownload a file that already exists locally |
| `--continue` | Resumes downloading of partially downloaded files |
| `-e robots=off` | Ignores robots.txt instructions |
| `--user-agent` | Sends the given "User-Agent" header to the server |
| `--domains` | Comma-separated list of domains to be followed |
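
Before running a crawl like this in full, it can help to do a quick test pass first. Here's a minimal sketch of one way to do that (my own variation, not part of the command above), limiting recursion to a single level and using wget's `--spider` mode so that links are only checked, not saved:

```shell
# Test run: --level=1 keeps recursion to one hop, and --spider only checks
# that the linked pages exist instead of saving them to disk.
$ wget --recursive --level=1 --spider --wait=6 --random-wait \
       -e robots=off \
       --domains=paulgraham.com https://paulgraham.com
```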

Other useful parameters:

| Parameter | Description |
| --- | --- |
| `--page-requisites` | Downloads page requisites such as inlined images, sounds, and referenced stylesheets |
| `--span-hosts` | Allows following links that point to different hosts (including subdomains) |
| `--convert-links` | Converts links to local links (allowing offline viewing) |
| `--no-check-certificate` | Bypasses SSL certificate verification |
| `--directory-prefix=/my/directory` | Sets the destination directory |
| `--include-directories=posts` | Comma-separated list of directories allowed to be followed when crawling |
| `--reject "*?*"` | Rejects URLs that contain query strings |
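
To give a rough idea of how these can fit together, here's a sketch that mirrors the same site for offline browsing; the destination directory and the `posts` directory filter are placeholders, not paths from the original command:

```shell
# Illustrative combination of the parameters above: fetch page requisites,
# rewrite links for offline viewing, keep everything under /my/directory,
# only follow links under /posts, and skip URLs with query strings.
$ wget --recursive --level=inf --wait=6 --random-wait \
       --page-requisites --convert-links --adjust-extension \
       --directory-prefix=/my/directory \
       --include-directories=posts \
       --reject "*?*" \
       --domains=paulgraham.com https://paulgraham.com
```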
