How to build a Web Scraper using golang with colly

#go #scraper #web #likeimfive

I have seen plenty of examples of how to build a web scraper in many programming languages, mostly in Python with the Scrapy framework, but only a few in golang.

As many golang fans know, golang has tons of benefits when it comes to concurrency and parallelism. Combined with a modern framework, these features let us scrape the web easily and quickly. But first of all, what does a web scraper do?

Explaining Web scraper like I'm five

Quoting Wikipedia's definition:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser

So web scraping is a technique used to extract data from websites over HTTP. Think of it this way: a web scraper is basically a robot that can read the data from a website, the way the human brain can read this post. A web scraper can get the text of this post, extract the data from the HTML, and use it for many purposes.

Web scraper vs Web crawler

To keep this short: a web crawler is a bot that browses the web so a search engine like Google can index new websites, while a web scraper is responsible for extracting the data from those websites.

Back to business

Now that we know what we are building, let's get our hands dirty. Our first step is to create a simple server in golang with a ping endpoint. Using the standard library, it looks like this:

package main

import (
    "log"
    "net/http"
)

func ping(w http.ResponseWriter, r *http.Request) {
    log.Println("Ping")
    w.Write([]byte("ping"))
}

func main() {
    addr := ":7171"

    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}

That's it, we just created a simple server with golang 👏🏻. To start it, run it like any other golang program using go run main.go, and you will see a log message saying that our server is listening on port 7171. To test it with curl, run curl -s 'http://127.0.0.1:7171/ping'

Thanks to the net/http package we can easily create a server just by calling the ListenAndServe function with an address, and with HandleFunc we can register a handler for each endpoint of our API.
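If you want to check the routing without starting the server or reaching for curl, the standard library's net/http/httptest package can send requests to a handler in memory. This is only a minimal sketch: newMux and get are hypothetical helper names that are not part of the program above, and it registers the handler on its own ServeMux instead of the default one.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
)

// ping mirrors the handler from the article (without the log line).
func ping(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ping"))
}

// newMux is a hypothetical helper that registers the endpoints on a
// dedicated ServeMux so they can be exercised in isolation.
func newMux() *http.ServeMux {
    mux := http.NewServeMux()
    mux.HandleFunc("/ping", ping)
    return mux
}

// get is a hypothetical helper: it sends an in-memory GET request for path
// to h and returns the status code and the response body.
func get(h http.Handler, path string) (int, string) {
    req := httptest.NewRequest(http.MethodGet, path, nil)
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, req)
    return rec.Code, rec.Body.String()
}

func main() {
    mux := newMux()
    for _, path := range []string{"/ping", "/missing"} {
        status, body := get(mux, path)
        fmt.Printf("%s -> %d %q\n", path, status, body)
    }
}
```

The registered path answers with the handler's body, while any path without a handler falls through to the mux's built-in 404 response.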

Now we have a server up and running. Let's create another endpoint; this one will extract some data from the colly website.

First of all, we need to install the colly dependency. To do this I highly recommend using Go modules: run go mod init <project-name>, which generates the go.mod file where all the dependencies used in the project are listed. Open go.mod and add the colly dependency in the require section:

require (
github.com/gocolly/colly v1.2.0
)

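And that's it, Go modules will take care of downloading the dependency to your local machine. Depending on your Go version, the download happens the next time you build or run the project, or you may need to fetch it explicitly with go get github.com/gocolly/colly@v1.2.0.

We are all set to extract data from websites, so let's create a function that gets all the links from any website: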

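// getData needs two more entries in the import block of main.go:
// "encoding/json" and "github.com/gocolly/colly".
func getData(w http.ResponseWriter, r *http.Request) {
    // Verify the "url" query parameter exists
    URL := r.URL.Query().Get("url")
    if URL == "" {
        log.Println("missing URL argument")
        http.Error(w, "missing url parameter", http.StatusBadRequest)
        return
    }
    log.Println("visiting", URL)

    // Create a new collector, which is in charge of collecting the data from the HTML
    c := colly.NewCollector()

    // Slice to store the links; starting with an empty (non-nil) slice
    // makes a page without links serialize as [] instead of null
    response := []string{}

    // OnHTML registers a callback that runs whenever the collector reaches
    // an element matching the selector. In this case, whenever our collector
    // finds an anchor tag with an href attribute, it calls the anonymous
    // function below, which resolves the href and appends it to our slice
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            response = append(response, link)
        }
    })

    // Visit the website; the collector is synchronous by default,
    // so the callback above has run by the time Visit returns
    if err := c.Visit(URL); err != nil {
        log.Println("failed to visit", URL, err)
        http.Error(w, "failed to visit url", http.StatusBadGateway)
        return
    }

    // Serialize our response slice to JSON
    b, err := json.Marshal(response)
    if err != nil {
        log.Println("failed to serialize response:", err)
        http.Error(w, "failed to serialize response", http.StatusInternalServerError)
        return
    }
    // Set the content type header and write the body for our endpoint
    w.Header().Set("Content-Type", "application/json")
    w.Write(b)
}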

This function will extract all links from any website specified in the url param.
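Since the url param comes straight from the caller, it may be worth checking it before handing it to the collector. The following is just a sketch using the standard net/url package; validateTarget is a hypothetical helper name, and restricting the scheme to http and https is my own assumption about what the endpoint should accept.

```go
package main

import (
    "errors"
    "fmt"
    "net/url"
)

// validateTarget is a hypothetical helper: it accepts only absolute
// http/https URLs with a host and returns the normalized URL string.
func validateTarget(raw string) (string, error) {
    if raw == "" {
        return "", errors.New("missing url parameter")
    }
    u, err := url.Parse(raw)
    if err != nil {
        return "", fmt.Errorf("invalid url: %w", err)
    }
    // Assumption: the scraper should only ever speak HTTP(S).
    if u.Scheme != "http" && u.Scheme != "https" {
        return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
    }
    if u.Host == "" {
        return "", errors.New("url has no host")
    }
    return u.String(), nil
}

func main() {
    for _, raw := range []string{"http://go-colly.org/", "go-colly.org", "ftp://example.com"} {
        target, err := validateTarget(raw)
        if err != nil {
            fmt.Println("rejected:", raw, "-", err)
            continue
        }
        fmt.Println("accepted:", target)
    }
}
```

In the handler, a failed validation could be answered with a 400 status before the collector is even created.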

And finally, let's add our new endpoint to our server

func main() {
    addr := ":7171"

    http.HandleFunc("/search", getData)
    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}

Done! We have our new search GET endpoint, which receives the url param and extracts all the links from the specified website. To test it with curl, run curl -s 'http://127.0.0.1:7171/search?url=http://go-colly.org/' and it will respond with something like:

["http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/","http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/docs/","https://github.com/gocolly/colly/blob/master/LICENSE.txt","https://github.com/gocolly/colly","http://go-colly.org/contact/","http://go-colly.org/docs/","http://go-colly.org/services/","https://github.com/gocolly/colly","https://github.com/gocolly/site/","http://go-colly.org/sitemap.xml"]
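Notice that the sample response contains the same link several times, because the page repeats it in its header, body and footer. If you only care about distinct links, one option is to filter the slice before serializing it. This is a small sketch; uniqueLinks is a hypothetical helper name, not something colly provides.

```go
package main

import "fmt"

// uniqueLinks is a hypothetical helper that drops repeated links while
// keeping the order in which each link first appeared.
func uniqueLinks(links []string) []string {
    seen := make(map[string]struct{}, len(links))
    out := make([]string, 0, len(links))
    for _, link := range links {
        if _, ok := seen[link]; ok {
            continue // already collected
        }
        seen[link] = struct{}{}
        out = append(out, link)
    }
    return out
}

func main() {
    links := []string{
        "http://go-colly.org/",
        "http://go-colly.org/docs/",
        "http://go-colly.org/",
        "https://github.com/gocolly/colly",
        "http://go-colly.org/docs/",
    }
    for _, link := range uniqueLinks(links) {
        fmt.Println(link)
    }
}
```

In getData this could be applied to the response slice right before the call to json.Marshal.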

Conquer all the web

Congrats! You just created a web scraper API in golang 🙌🏻.
Before conquering all the web, remember to check the robots.txt file of the website to make sure you are allowed to extract that data. For those who are not aware of what robots.txt is:
it is a file, published at the root of many websites, that tells robots such as web crawlers and web scrapers which parts of the site they may visit. For example, for Google it is https://www.google.com/robots.txt
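To get a feeling for how those instructions work, here is a deliberately simplified sketch that reads the content of a robots.txt file and decides whether a path is blocked for all robots. allowedByRobots is a hypothetical helper name, and the big assumption is that only the "User-agent: *" groups and plain Disallow prefixes matter: it ignores Allow rules, wildcards and agent-specific groups, so treat it as an illustration and not as a complete parser.

```go
package main

import (
    "bufio"
    "fmt"
    "strings"
)

// allowedByRobots is a hypothetical, simplified check: it returns false when
// path starts with a Disallow prefix listed in a "User-agent: *" group.
func allowedByRobots(robots, path string) bool {
    inStarGroup := false  // true while reading a group that applies to every robot
    prevWasAgent := false // consecutive User-agent lines belong to the same group
    sc := bufio.NewScanner(strings.NewReader(robots))
    for sc.Scan() {
        line := sc.Text()
        // Drop comments, which start with '#'.
        if i := strings.Index(line, "#"); i >= 0 {
            line = line[:i]
        }
        parts := strings.SplitN(line, ":", 2)
        if len(parts) != 2 {
            continue
        }
        key := strings.ToLower(strings.TrimSpace(parts[0]))
        value := strings.TrimSpace(parts[1])
        if key == "user-agent" {
            if !prevWasAgent {
                inStarGroup = false // a new group starts here
            }
            if value == "*" {
                inStarGroup = true
            }
            prevWasAgent = true
            continue
        }
        prevWasAgent = false
        // An empty Disallow value means nothing is blocked.
        if key == "disallow" && inStarGroup && value != "" && strings.HasPrefix(path, value) {
            return false
        }
    }
    return true
}

func main() {
    robots := "User-agent: *\nDisallow: /private/\n"
    fmt.Println(allowedByRobots(robots, "/private/data"))
    fmt.Println(allowedByRobots(robots, "/docs/"))
}
```

As far as I know, colly's Collector also has an IgnoreRobotsTxt field that is true by default, so setting it to false should make the collector respect robots.txt on its own; check the colly documentation for your version before relying on it.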