This is a very very niche post. If you don't know why this is a pain point, don't waste your time reading this.
XQuery is a great language for high-level XML processing: a Turing-complete declarative language built on top of XPath. Unfortunately, it is not used often.
My personal take is that this is because most HTML out there is not well-formed XML, mostly due to tags that are never closed (such as <link ...> instead of <link .../>). As a result, the XML parser in Saxon or BaseX will fail on it.
The solution is TagSoup, which provides a SAX-compliant HTML parser, pretending that it just parses XML.
- BaseX just uses it for you if you have it on the classpath!
- Saxon provides saxon:parse-html if TagSoup is on the classpath, which can be used after fetching with fn:unparsed-text
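To make the two options concrete, here is a minimal sketch of what fetching a page and listing its links might look like. It assumes BaseX with the TagSoup jar on the classpath and uses the html:parse and fetch:binary functions from BaseX's HTML and Fetch modules; the URL is a placeholder, and the Saxon variant in the comment is an untested equivalent built from the two functions named above.

```xquery
(: Sketch for BaseX, assuming TagSoup is on the classpath. :)
(: The URL is a placeholder; replace it with the page you want to crawl. :)
let $url  := 'https://example.org/'
(: fetch:binary returns the raw bytes; html:parse turns them into an XML document node. :)
let $page := html:parse(fetch:binary($url))
(: *:a matches <a> whether or not the parser puts elements in the XHTML namespace. :)
for $link in $page//*:a[@href]
return string($link/@href)

(: Assumed Saxon equivalent, again with TagSoup on the classpath:
   declare namespace saxon = "http://saxon.sf.net/";
   saxon:parse-html(unparsed-text('https://example.org/'))//*:a/@href/string()
:)
```

The namespace wildcard is a deliberate choice: it keeps the path expression working regardless of how the HTML parser is configured to handle namespaces.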
With that, you can now do actual web crawling! The rest is just plain XQuery.