Golang HTML tokenizer

#go #tokenizer #html #parser


Looking to parse and extract HTML content in Go, the way you can in PHP or JavaScript by creating a new DOM document? In Go there are several ways to do it, using different packages depending on your requirements. Some of the options I found are:

gohtml: gohtml is an HTML5 tokenizer and parser implementation (the example below imports it as golang.org/x/net/html). Its parser returns a tree of nodes, and its tokenizer yields tokens one at a time; either way, content can be extracted by tag type, tag name, attributes, and text data.
goquery: goquery is built on the gohtml package and the CSS selector library Cascadia, giving it more power over content selection and extraction. Its syntax is similar to jQuery's.
godom: godom is a library that lets you manipulate the DOM in Go, much as you would in JavaScript. It compiles Go code to JavaScript using GopherJS.

For this demonstration I will use gohtml and its tokenizer.

Tokenization is lexical analysis: the input is split into tokens. HTML tokens include start tags, end tags, attribute names, and attribute values.

Tokenizing the document is the first step in parsing it into a tree of element and text nodes, similar to the DOM.

Types of HTML Tokens Supported:

html.StartTagToken: a start tag such as <html>
html.EndTagToken: an end tag such as </html>
html.SelfClosingTagToken: a self-closing tag such as <img .../>
html.TextToken: text content within a tag
html.CommentToken: an HTML comment such as <!-- comment -->
html.DoctypeToken: a document type declaration such as <!DOCTYPE html>

Example:

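package main
import (
 "fmt"
 "io"
 "strings"

 "golang.org/x/net/html"
)
func main() {
 tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
 for {
  // Next advances to the next token and reports its type.
  tokenType := tokenizer.Next()
  if tokenType == html.ErrorToken {
   // io.EOF marks the normal end of input; anything else is a real error.
   if tokenizer.Err() == io.EOF {
    return
   }
   fmt.Printf("Error: %v\n", tokenizer.Err())
   return
  }
  // Token returns the current token; String renders it back as HTML.
  token := tokenizer.Token()
  fmt.Printf("Token: %v\n", html.UnescapeString(token.String()))
 }
}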
const sampleHtml = `<!DOCTYPE html><html><head><style> body {background-color: powderblue;} h1 {color: red;} p {color: orange;}</style><title>Sample HTML Code</title><script src="my-script.js">abc</script></head><body><h1>Main title</h1><p id="demo"></p><a href="https://dev.to/">Dev Community</a><script>document.getElementById("demo").innerHTML = "Hello JavaScript!";</script></body></html>`

Output:

Token: <!DOCTYPE html>
Token: <html>
Token: <head>
Token: <style>
Token:  body {background-color: powderblue;} h1 {color: red;} p {color: orange;}
Token: </style>
Token: <title>
Token: Sample HTML Code
Token: </title>
Token: <script src="my-script.js">
Token: abc
Token: </script>
Token: </head>
Token: <body>
Token: <h1>
Token: Main title
Token: </h1>
Token: <p id="demo">
Token: </p>
Token: <a href="https://dev.to/">
Token: Dev Community
Token: </a>
Token: <script>
Token: document.getElementById("demo").innerHTML = "Hello JavaScript!";
Token: </script>
Token: </body>
Token: </html>

Here I simply checked for an error token (stopping at EOF) and printed every token as is.

We can also handle tokens by type, such as html.StartTagToken, html.EndTagToken, etc., as listed above, or by element name, such as html, h1, script, style, etc.

 // This loop replaces the one inside main in the example above.
 tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
 for {
  tokenType := tokenizer.Next()
  if tokenType == html.ErrorToken {
   if tokenizer.Err() == io.EOF {
    return
   }
   fmt.Printf("Error: %v\n", tokenizer.Err())
   return
  }
  token := tokenizer.Token()
  // For tag tokens, Data holds the tag name; for text tokens, the text itself.
  switch token.Data {
  case "script":
   fmt.Printf("Script Token: %v\n", html.UnescapeString(token.String()))
  case "style":
   fmt.Printf("Style Token: %v\n", html.UnescapeString(token.String()))
  default: // Also matches the text inside <script> and <style> tags
   fmt.Printf("Others: %v\n", html.UnescapeString(token.String()))
  }
 }


Top comments (1)

pejman hkh • Jul 20

I want to introduce my package here.
A tiny Golang HTML parser with no dependency, just like GoQuery with CSS selector.
GDP has been written in native Golang and is very light.

github.com/pejman-hkh/gdp
