There are many scenarios where you need to parse the contents of a website to extract data: search engines do it, many applications rely on it to pull information from other sites, integration tests use it to validate rendered pages, and plenty of tools depend on it. In this article, I will show how to build a website crawler in Java and how to parse a website's content and extract information from it.
Article originally posted on my personal website under How to parse a website in Java
Building an entire parser is quite complicated, but luckily there already exists a library that does the complicated parts for us: JSoup. It provides all the tools and APIs needed for parsing and extracting website data, and it is easy to learn and use, making it ideal for most such applications.
Establishing a connection using JSoup
For the examples, I will be using my own website. So, let’s get started. First, we must establish a connection to the website and retrieve the HTML document. This is how we connect to a site and parse the DOM using JSoup.
Connection connection = Jsoup.connect("https://petrepopescu.tech");
Document document = connection.execute().parse();
We could use a simplified version, with only Jsoup.connect("https://petrepopescu.tech").get(), but I chose the more elaborate version to show the steps that the library performs. Furthermore, we can easily use additional features, like setting the user agent (Jsoup.connect(url).userAgent(agent)) or a proxy (connection.proxy(host, port)), features that come in really handy when doing a complete site parse because they minimize the chances of your app being blocked by the site's firewall.
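To illustrate, here is a minimal, self-contained sketch of such a connection with a custom user agent and an optional proxy. The user agent string and the proxy host and port are placeholder values for the example, not something my site or JSoup requires.

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectionExample {
    public static void main(String[] args) throws IOException {
        // Build the connection and configure it before executing the request
        Connection connection = Jsoup.connect("https://petrepopescu.tech")
                .userAgent("Mozilla/5.0 (compatible; MyCrawler/1.0)") // placeholder user agent
                .timeout(10_000); // fail if the site does not respond within 10 seconds

        // Optionally route the request through a proxy (placeholder host and port)
        // connection.proxy("127.0.0.1", 8080);

        // Execute the request and parse the response body into a Document
        Document document = connection.execute().parse();
        System.out.println(document.title());
    }
}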
Parsing website elements
Now that we have successfully established a connection and JSoup has parsed the page, let us try to identify elements on the page and get their content. For the first example, I will be retrieving the categories and the links to their specific pages. Using the browser inspector, we can see that the categories are items in a list inside the div with the categories-3 ID.
The same approach can be applied to almost any website and any element on the page to discover the criteria needed for identification.
If possible, always use IDs since they are unique in a page, or at least should be unique. If this is not enough, you can do selection by class, type, and simple navigation in the structure of the page so that you get the exact element that you want. This is exactly what we are going to do now in order to discover all the category links and the category names.
First, we will find the element with the id categories-3, and next, we will search for the list items by looking for elements with the li tag. From those elements, we extract the href attribute and the text. Finally, we print the items.
// Find the div that holds the category list, then collect its list items
Element categoriesMenu = document.getElementById("categories-3");
List<Element> categories = categoriesMenu.getElementsByTag("li");

Map<String, String> categoriesLinks = new HashMap<>();
for (Element category : categories) {
    // Each list item contains an anchor with the category link and name
    Element link = category.getElementsByTag("a").get(0);
    String url = link.attr("href");
    String categoryName = link.text();
    categoriesLinks.put(categoryName, url);
}

for (Map.Entry<String, String> category : categoriesLinks.entrySet()) {
    System.out.println(category.getKey() + " - " + category.getValue());
}
Using selectors in JSoup to retrieve an element
Sometimes it is not that easy to get the exact element you want. It does not have an ID, multiple elements share the same ID, or maybe the ID is dynamic. For example, the articles present on the front page each have a unique ID that is generated per article and starts with "post-". What if you want to get the excerpt for each one? This is where the select method from JSoup comes into play. You can write a complex selector that will help with the identification and retrieval of the exact element you want.
For example, this is one way to retrieve the excerpts from the front page.
// Grab every article element on the page
Elements contentElements = document.select("article");
Map<String, String> articles = new HashMap<>();
for (Element element : contentElements) {
    // The excerpt sits in the element whose class attribute is "entry excerpt entry-summary"
    Elements postTextElements = element.getElementsByClass("entry excerpt entry-summary");
    if (postTextElements.size() == 1) {
        articles.put(element.attr("id"), postTextElements.get(0).text());
    }
}

for (Map.Entry<String, String> article : articles.entrySet()) {
    System.out.println(article.getKey() + " - " + article.getValue());
}
If we want to be even more specific, we can get all elements of type article with an id as described above using a more complex selector. Since this one takes a bit longer because of the regex, we can optimize it a bit by only searching inside the div with the actual content, and not the entire page. To do this, we simply replace the first line with this one: Elements contentElements = document.getElementById("grid-wrapper").select("article[id~=post-*]");
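As a small sketch of that optimization, building on the document object from earlier, we can also guard against the wrapper not being found; the grid-wrapper id comes from my site's markup, so adapt it to whatever page you are parsing.

// Restrict the search to the content wrapper so only that subtree is scanned
Element gridWrapper = document.getElementById("grid-wrapper");
if (gridWrapper != null) {
    // Select only article elements whose id matches the post- pattern
    Elements contentElements = gridWrapper.select("article[id~=post-*]");
    System.out.println("Found " + contentElements.size() + " articles");
}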
There are even more selectors available in JSoup that should satisfy all your needs. You can read about them on JSoup's official website. Just be careful when you use them, since some are faster than others. Make sure you always try to optimize the search by going as far down in the structure as you can, rather than searching the entire page.
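To give an idea of what the selector syntax looks like, here are a few illustrative queries against the same document; apart from grid-wrapper and entry-summary, which appear on my site, the names are made up for the example and will differ on a real page.

// Select elements with the entry-summary class inside article elements
Elements summaries = document.select("article .entry-summary");
// Select anchors whose href attribute starts with https
Elements secureLinks = document.select("a[href^=https]");
// Select list items that are direct children of an unordered list
Elements menuItems = document.select("ul > li");
// Combine an id with a descendant tag: all images inside the content wrapper
Elements wrapperImages = document.select("#grid-wrapper img");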
Conclusions
As we can see, JSoup offers all the tools needed to successfully parse a web page and retrieve data from it. With a bit of work and a bit of knowledge, you should be able to extract information from most websites, validate your web page structure in integration tests, or build an awesome tool that tracks the price of an item.
And, as always, you can download the full source code from my website.
Article originally posted on my personal website under How to parse a website in Java