I have seen plenty of examples of how to build a web scraper in many programming languages, mostly in Python using the Scrapy tool, but only a few in Go.
As many Go fans know, Go has tons of benefits when it comes to concurrency and parallelism. Combined with a modern framework, these features let us scrape the web easily and quickly. But first of all, let's start with a question: what does a web scraper do?
Explaining Web scraper like I'm five
Quoting Wikipedia's definition:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser
So web scraping is a technique used to extract data from websites over HTTP. Think of it this way: a web scraper is basically a robot that can read the data from a website the way the human brain can read this post. A web scraper can get the text from this post, extract the data from the HTML, and use it for many purposes.
Web scraper vs Web crawler
To keep this short: a web crawler is a bot that browses the web so that a search engine like Google can index new websites, while a web scraper is responsible for extracting the data from those websites.
Back to business
Now that we know what we are building, let's get our hands dirty. Our first step will be to create a simple server in Go with a ping endpoint. Using the standard library, it looks like this:
package main

import (
	"log"
	"net/http"
)

func ping(w http.ResponseWriter, r *http.Request) {
	log.Println("Ping")
	w.Write([]byte("ping"))
}

func main() {
	addr := ":7171"
	http.HandleFunc("/ping", ping)
	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}
That's it, we just created a simple server in Go. To test it, run it like any other Go program using go run main.go and you will see a log saying that our server is listening on port 7171. To test it with curl, just run curl -s 'http://127.0.0.1:7171/ping'
Thanks to the net/http library, we can easily create a server just by calling the ListenAndServe function with an address, and with HandleFunc we can register a handler for each endpoint of our API.
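If you would rather check the handler without starting the server or reaching for curl, the standard library's net/http/httptest package can call it in memory. The following is a minimal sketch, not part of the original program: callPing is a hypothetical helper name, and the ping handler is repeated here (without the log line) only so the snippet runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ping mirrors the handler from the post: it replies with the body "ping".
func ping(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ping"))
}

// callPing (hypothetical helper) sends an in-memory GET /ping request
// through a mux and returns the recorded status code and body.
// No network port is opened.
func callPing() (int, string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", ping)

	req := httptest.NewRequest(http.MethodGet, "/ping", nil)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)

	return rec.Code, rec.Body.String()
}

func main() {
	status, body := callPing()
	fmt.Println(status, body) // prints "200 ping"
}
```

The same pattern should work for any handler registered with HandleFunc, which makes it handy once the API grows beyond a single endpoint.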
Now we have a server up and running. Let's create another endpoint; this one will extract some data from the colly website.
First of all, we need to install the colly dependency. I highly recommend using Go modules for this: just run go mod init <project-name> and it will generate the go.mod file, which lists all the dependencies used in the project. Open go.mod and add the colly dependency in the require section:
require (
	github.com/gocolly/colly v1.2.0
)
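And that's it: Go modules will take care of downloading the dependency to your local machine. (On recent Go versions you may need to run go mod tidy so the dependency is actually fetched and recorded in go.sum.)
We are all set to extract data from websites, so let's create a function that gets all the links from any website: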
// Remember to add "encoding/json" and "github.com/gocolly/colly"
// to the import block of main.go.
func getData(w http.ResponseWriter, r *http.Request) {
	// Verify the "url" param exists
	URL := r.URL.Query().Get("url")
	if URL == "" {
		log.Println("missing URL argument")
		http.Error(w, "missing url parameter", http.StatusBadRequest)
		return
	}
	log.Println("visiting", URL)

	// Create a new collector, which will be in charge of collecting the data from the HTML
	c := colly.NewCollector()

	// Slice to store the data; initialized so an empty result serializes as [] instead of null
	response := []string{}

	// OnHTML registers a callback that runs whenever the collector finds an element
	// matching the selector. Here, every anchor tag with an href attribute triggers
	// the anonymous function below, which resolves the href to an absolute URL
	// and appends it to our slice
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			response = append(response, link)
		}
	})

	// Visit the website and report a failure to the client instead of ignoring it
	if err := c.Visit(URL); err != nil {
		log.Println("failed to visit", URL, ":", err)
		http.Error(w, "failed to visit url", http.StatusBadGateway)
		return
	}

	// Serialize our response slice to JSON
	b, err := json.Marshal(response)
	if err != nil {
		log.Println("failed to serialize response:", err)
		http.Error(w, "failed to serialize response", http.StatusInternalServerError)
		return
	}

	// Set the content type header and write the body for our endpoint
	w.Header().Set("Content-Type", "application/json")
	w.Write(b)
}
This function extracts all the links from whichever website is specified in the url param.
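Since the url param comes straight from the caller, it may be worth checking it before handing it to the collector. Here is a minimal sketch using only the standard net/url package. validateTarget is a hypothetical helper, not something colly or the code above provides, and the rules it enforces (http or https scheme, non-empty host) are my assumption about what counts as a reasonable target.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateTarget (hypothetical helper) parses raw and accepts it only if
// it is an absolute http or https URL with a host. It returns the
// normalized URL string on success.
func validateTarget(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("url must use http or https")
	}
	if u.Host == "" {
		return "", errors.New("url must include a host")
	}
	return u.String(), nil
}

func main() {
	for _, raw := range []string{"http://go-colly.org/", "ftp://example.com", "go-colly.org"} {
		target, err := validateTarget(raw)
		fmt.Printf("%q -> %q, err=%v\n", raw, target, err)
	}
}
```

In the handler, a check like this could replace the plain empty-string test, with a 400 response whenever it returns an error.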
And finally, let's add our new endpoint to our server
func main() {
	addr := ":7171"
	http.HandleFunc("/search", getData)
	http.HandleFunc("/ping", ping)
	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}
Done! We now have a new GET endpoint, /search, which receives the url param and extracts all the links from the specified website. To test it with curl, you can run curl -s 'http://127.0.0.1:7171/search?url=http://go-colly.org/' and it will respond with something like:
["http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/","http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/docs/","https://github.com/gocolly/colly/blob/master/LICENSE.txt","https://github.com/gocolly/colly","http://go-colly.org/contact/","http://go-colly.org/docs/","http://go-colly.org/services/","https://github.com/gocolly/colly","https://github.com/gocolly/site/","http://go-colly.org/sitemap.xml"]
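Notice that the same link can show up several times in the response, because the callback fires once per anchor tag on the page. If you only care about distinct links, one option is to filter the slice before serializing it. This is a small sketch of that idea; uniqueLinks is a hypothetical helper name and is not part of colly.

```go
package main

import "fmt"

// uniqueLinks (hypothetical helper) returns links with duplicates removed,
// keeping the order in which each link was first seen.
func uniqueLinks(links []string) []string {
	seen := make(map[string]struct{}, len(links))
	out := make([]string, 0, len(links))
	for _, link := range links {
		if _, ok := seen[link]; ok {
			continue // already emitted this link
		}
		seen[link] = struct{}{}
		out = append(out, link)
	}
	return out
}

func main() {
	links := []string{
		"http://go-colly.org/",
		"http://go-colly.org/docs/",
		"http://go-colly.org/",
	}
	fmt.Println(uniqueLinks(links)) // prints "[http://go-colly.org/ http://go-colly.org/docs/]"
}
```

In the handler, this would be called on the response slice right before json.Marshal.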
Conquer all the web
Congrats! You just created a web scraper API in Go.
Before conquering the whole web, remember to check the robots.txt file of each website to make sure you are allowed to extract that data. For those who are not aware of what robots.txt is: it is a file that many websites publish at their root to tell robots, such as web crawlers and web scrapers, which paths they may or may not access. For example, Google's is at https://www.google.com/robots.txt
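To make the idea more concrete, here is a deliberately simplified sketch of how a robot could read those rules. It uses only the standard library, and it only understands User-agent: * groups and plain Disallow prefixes (no Allow lines, no wildcards), so treat it as an illustration rather than a complete robots.txt parser. The names disallowedForAll and isAllowed and the sample file contents are made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedForAll (hypothetical helper) returns the Disallow prefixes
// listed in groups that apply to every robot ("User-agent: *").
func disallowedForAll(robots string) []string {
	var rules []string
	applies := false  // does the current group target "*"?
	inAgents := false // true while reading consecutive User-agent lines
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue // blank or malformed line
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		switch key {
		case "user-agent":
			if !inAgents {
				applies = false // a new group starts here
			}
			if value == "*" {
				applies = true
			}
			inAgents = true
		case "disallow":
			inAgents = false
			if applies && value != "" {
				rules = append(rules, value)
			}
		default:
			inAgents = false
		}
	}
	return rules
}

// isAllowed reports whether path matches none of the Disallow prefixes.
func isAllowed(path string, rules []string) bool {
	for _, rule := range rules {
		if strings.HasPrefix(path, rule) {
			return false
		}
	}
	return true
}

func main() {
	// Made-up robots.txt content for illustration.
	sample := "User-agent: *\nDisallow: /search\nDisallow: /private/\n\nUser-agent: examplebot\nDisallow: /\n"
	rules := disallowedForAll(sample)
	fmt.Println(rules)                            // prints "[/search /private/]"
	fmt.Println(isAllowed("/docs/", rules))       // prints "true"
	fmt.Println(isAllowed("/search?q=go", rules)) // prints "false"
}
```

For real use, a dedicated robots.txt library or whatever support your scraping framework offers is likely a safer choice than hand-rolled parsing.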
Next step
Keep going with the colly documentation; it has a lot of examples of web scrapers and web crawlers.