DEV Community

Lars Quentin

Use XQuery for HTML (Web Crawling)

This is a very niche post. If you don't already know why this is a pain point, it is probably not worth your time.

XQuery is a great language for high-level XML processing: a Turing-complete, declarative language built on top of XPath. Unfortunately, it is not used often.

My personal take is that this is because most HTML out there is not well-formed XML, mostly because of tags that are never closed (such as <link ...> instead of <link .../>). As a result, the XML parser behind Saxon or BaseX will fail on it.
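To make the failure concrete, here is a minimal sketch using the XML parser that ships with the JDK instead of Saxon or BaseX (an assumption made to keep the example dependency-free; any strict XML parser should reject the same input). The class name `StrictParseDemo` and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParseDemo {
    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <link> is never closed, so </head> does not match it.
        String html = "<html><head><link rel=\"stylesheet\" href=\"style.css\"></head>"
                + "<body><p>Hi</p></body></html>";
        // The same document with a self-closing <link .../>.
        String xhtml = "<html><head><link rel=\"stylesheet\" href=\"style.css\"/></head>"
                + "<body><p>Hi</p></body></html>";

        System.out.println(isWellFormed(html));  // prints false
        System.out.println(isWellFormed(xhtml)); // prints true
    }
}
```

A single unclosed tag is enough to make the whole document unparseable, which is why a lenient front end is needed before XQuery can see the page at all.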

The solution is TagSoup, a SAX-compliant HTML parser: it reads real-world HTML but reports it to its consumer as if it were well-formed XML.
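Because TagSoup speaks SAX, it should slot in anywhere a standard `XMLReader` is accepted. The sketch below shows that idea under some assumptions: `SaxToDom`, `parseWith` and `jdkReader` are hypothetical names, and the demo passes the JDK's own reader so that it runs without extra dependencies. With the TagSoup jar on the classpath, you would pass `new org.ccil.cowan.tagsoup.Parser()` instead and feed it real HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {
    // Builds a DOM tree from whatever SAX events the given reader emits.
    // The reader is the only pluggable part: strict XML or lenient HTML.
    static Document parseWith(XMLReader reader, String markup) {
        try {
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
            DOMResult result = new DOMResult();
            // An identity transform copies the SAX event stream into the DOM.
            TransformerFactory.newInstance().newTransformer().transform(source, result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The JDK's strict reader, used here only so the sketch has no dependencies.
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: swap in new org.ccil.cowan.tagsoup.Parser() for real HTML.
        XMLReader reader = jdkReader();
        Document doc = parseWith(reader, "<html><body><a href='/x'>x</a></body></html>");
        System.out.println(doc.getElementsByTagName("a").getLength());
    }
}
```

The same substitution is what lets an XQuery processor consume HTML: it keeps talking to a SAX source and never needs to know the input was not XML.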

With that, you can now do actual web crawling! The rest is just plain XQuery.
