Hi everyone. In this article we will build a simple web scraper and a small search application using well-known existing technologies that you perhaps didn’t know could do this.
From a practical point of view, the “product” we end up with will hardly be suitable for production-scale web crawling. But if you just need to crawl your own site, a competitor’s site or someone else’s, and want an advanced search syntax rather than just grep, this article should be useful for you.
The article should also be useful for those who are just starting with docker-compose or Manticore Search.
TL;DR
Our solution will be based on:
- wget crawling a site recursively
- tiny php script to pass the crawled content from wget to Manticore Search
- tiny php script to make a minimalistic search UI
- all wrapped in docker and docker-compose
Once we are done you should be able to run it via docker-compose like this:
domain=who.int docker-compose up
which will start crawling and indexing https://who.int and will immediately run another container with a web server, so you can search in the crawled pages:
Technologies
So what technologies will we use in our solution?
Wget
Everyone probably knows wget. When you need to download something in a terminal on Linux, FreeBSD or macOS, you will most likely use wget. But did you know that wget can do more than download a single file? It can easily serve as a simple web crawler that respects robots.txt, follows links and doesn’t overload your system. If not, now you know. True, it can’t distribute load across a network of crawling servers or even download in parallel. It’s not scalable at all, but it’s simple, and it’s a tried and trusted tool that suits our idea very well, since the whole job can be done in a single wget call:
wget -nv -r -H -nd --connect-timeout=2 --read-timeout=10 --tries=1 --follow-tags=a -R "*.css*,*.js*,*.png,*.jpg,*.gif" "http://${domain}/" --domains=${domain} 2>&1 | php load.php
Let’s go through the most important parameters:
- -nv reduces verbosity, since we don’t need the extra detail in wget’s output, which load.php will parse
- -r turns on recursive retrieval. Obviously the most important part for us
- -H enables spanning across hosts during recursive retrieval
- -nd disables creating directories during recursive retrieval
- --follow-tags=a limits the HTML tags to follow to just the hyperlink tag
- -R "*.css*,*.js*,*.png,*.jpg,*.gif" lists patterns to reject. We don’t need images or CSS/JS files for full-text search, so we skip them
- "http://${domain}/" is our starting point. It will be the first page wget downloads
- --domains=${domain} defines the domains to follow. In our case we limit the crawl to the domain we started from
- | php load.php pipes wget’s output to load.php (wget writes its log to stderr, so stderr has to be redirected into the pipe)
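To make the pipe more concrete: with -nv, wget logs roughly one line per saved file. The sketch below shows how such a line could be split into the URL and the local file name. The function name parseWgetLine and the sample line are illustrative assumptions, not taken from load.php, and the assumed line shape should be checked against your wget version.

```php
<?php
// Hypothetical helper (not from load.php): extract the URL and the local
// file name from one line of `wget -nv` log output.
function parseWgetLine(string $line): ?array
{
    // Assumed line shape: <date> <time> URL:<url> [<size>] -> "<file>" [<n>]
    if (preg_match('/URL:(\S+) \[[\d\/]+\] -> "([^"]+)"/', $line, $m)) {
        return ['url' => $m[1], 'file' => $m[2]];
    }
    return null; // not a "file saved" line, e.g. an error message
}

// Illustrative sample line, made up for this example.
$sample = '2020-05-01 12:00:00 URL:https://www.who.int/data/gho [125537] -> "data.5" [1]';
$parsed = parseWgetLine($sample);
echo $parsed['file'], ' ', $parsed['url'], PHP_EOL;
```

Lines that don’t match (errors, summaries) simply yield null and can be skipped by the caller.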
load.php
This is a simple and straightforward 15-line script which:
- makes a connection to Manticore Search using a MySQL library
- creates a new table if it doesn’t exist yet with the morphology settings we need
- reads info about downloaded pages from wget on STDIN
- reads each page and inserts it into Manticore
Here is the full script with each line commented:
As soon as wget downloads a page, it appears in Manticore and becomes searchable. Your data collection keeps growing until wget has nothing left to download or until you stop the container.
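The four steps above could be sketched roughly as follows. This is not the original load.php, only a minimal illustration under assumptions: the host name manticore, the MySQL-protocol port 9306, the exact CREATE TABLE options, and the helper names extractTitle and loadPages are all made up for this example. The table name rt and the fields title, url and body match the search query used later in the article.

```php
<?php
// A sketch of load.php's steps; NOT the original script.
// Assumptions: table "rt" with fields title, url, body; the table options
// below (html_strip, index_sp, morphology) are a guess at "the settings we need".

// Pull the <title> text out of a page, or return '' if there is none.
function extractTitle(string $html): string
{
    return preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)
        ? trim(html_entity_decode($m[1]))
        : '';
}

// Load every page that wget reports on $input (a stream) into Manticore.
function loadPages(mysqli $db, $input): void
{
    // Create the table if it doesn't exist yet.
    $db->query("CREATE TABLE IF NOT EXISTS rt (title text, body text, url text)"
        . " html_strip='1' index_sp='1' morphology='stem_en'");

    while (($line = fgets($input)) !== false) {
        // Each "file saved" log line names the URL and the local file.
        if (!preg_match('/URL:(\S+) \[[\d\/]+\] -> "([^"]+)"/', $line, $m)) {
            continue;
        }
        [, $url, $file] = $m;
        $html = @file_get_contents($file);
        if ($html === false) {
            continue; // the file may be gone, e.g. rejected by -R
        }
        $title = extractTitle($html);
        // Store the raw page; HTML stripping is left to Manticore.
        $db->query(sprintf(
            "INSERT INTO rt (title, url, body) VALUES ('%s', '%s', '%s')",
            $db->real_escape_string($title),
            $db->real_escape_string($url),
            $db->real_escape_string($html)
        ));
        echo "$file: $title $url ", strlen($html), " bytes\n";
    }
}

// Usage inside the php container, fed by the wget pipe (assumed host/port):
//   $db = new mysqli('manticore', '', '', '', 9306);
//   loadPages($db, STDIN);
```

Keeping the connection outside loadPages makes the loop easy to try against any stream, for example a file with saved wget log lines.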
Manticore Search
Another important component is Manticore Search.
Manticore is a lightweight database written in C++, created specifically for search, with powerful full-text search capabilities.
It can speak SQL over MySQL protocol as well as JSON over HTTP. What’s important for our purpose is that:
- it can strip HTML
- it has built-in NLP capabilities so we can split our texts into words, sentences and paragraphs efficiently and use stemmed forms of words (so e.g. “running” will find “run” etc.)
- its official Docker image doesn’t require any configuration by default, so we can get by with just 2 SQL queries: one to create a new table and another to add a new document to it
- it starts in milliseconds and is very cost-efficient in terms of RAM. No Java heap taking all your memory and no garbage collection ruining your search performance
- adding a document requires just one line of code to make an SQL query
So all we need to do to hook up Manticore in our case is these 3 lines in docker-compose.yml:
services:
  manticore:
    image: manticoresearch/manticore:3.4.0
Docker compose file
Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Our docker-compose YAML looks like this:
and there is also a Dockerfile for php+wget+mysql extension:
Please go through the comments in both files. In a nutshell, the Compose file defines 3 services:
- manticore: just uses the official image
- php: built from our own php/Dockerfile (php + wget + the mysqli extension), which also copies the load.php script into the image. Depends on manticore
- web: based on the official php+apache image. Depends on manticore
Feel free to change the port from 8082 to whatever you want. We also use the environment variable $domain to specify the domain to crawl. So when you run it like this:
domain=who.int docker-compose up
it runs the above 3 services and starts crawling:
snikolaev@dev:~/crawler$ domain=who.int docker-compose up
Starting crawler_manticore_1 … done
Recreating crawler_web_1 … done
Starting crawler_php_1 … done
...
php_1 | data.5: GHO https://www.who.int/data/gho 125537 bytes
php_1 | fact-sheets.4: Fact sheets https://www.who.int/news-room/fact-sheets 83345 bytes
php_1 | facts-in-pictures.3: Facts in pictures https://www.who.int/news-room/facts-in-pictures 70227 bytes
php_1 | publications.7: WHO | Publications https://www.who.int/publications/en/ 92069 bytes
php_1 | questions-answers.3: WHO | Online Q&A https://www.who.int/features/qa/en/ 78145 bytes
php_1 | popular.3: Health topics https://www.who.int/health-topics/ 123263 bytes
php_1 | ebola-virus-disease.8: Ebola virus disease https://www.who.int/health-topics/ebola/ 112116 bytes
Search bar
The last component we haven’t covered yet is index.php which runs when you open http://hostname:8082 (or another port if you changed it in the compose file). The full script is just 13 lines of code:
Here, unlike in load.php, we connect to Manticore over HTTP and use its JSON API endpoint /sql, which lets you send any SQL command over HTTP. In a production environment it might make more sense to use Manticore’s /json/search endpoint, which lets you break the request down into much more granular pieces. That often matters when your search form is not just one text area but has multiple fields, and in other cases too. But we don’t need all that now. The logic of the script is simple:
- render a simple search form with a text area named “search”
- if you press enter, the form sends the typed value as an HTTP parameter “search”
- then the script just takes the value and passes it to Manticore in a very compact and clear SQL query:
SELECT url, highlight({}, 'title') title, highlight({}, 'body') body FROM rt WHERE MATCH('{$_GET['search']}') LIMIT 10
- gets the results
- and renders them as HTML
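The flow just described could look roughly like the sketch below. It is not the original index.php: the function names, the host name manticore and the HTTP port 9308 are assumptions. Unlike the compact query above, it escapes quotes and backslashes in the user’s input before putting it into the SQL string, which seems a sensible precaution for anything user-facing.

```php
<?php
// Sketch of the search flow; NOT the original 13-line index.php.

// Escape a value for use inside a single-quoted SQL string literal.
function escapeSqlString(string $value): string
{
    // Backslashes first, then quotes, so added backslashes aren't doubled.
    return str_replace(['\\', "'"], ['\\\\', "\\'"], $value);
}

// Build the search statement for a user-supplied query.
function buildSearchSql(string $search): string
{
    return "SELECT url, highlight({}, 'title') title, highlight({}, 'body') body "
        . "FROM rt WHERE MATCH('" . escapeSqlString($search) . "') LIMIT 10";
}

// POST the statement to the /sql endpoint and decode the JSON reply.
// Assumption: Manticore is reachable as "manticore" on HTTP port 9308.
function runSql(string $sql, string $endpoint = 'http://manticore:9308/sql'): ?array
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query(['query' => $sql]),
    ]]);
    $json = @file_get_contents($endpoint, false, $context);
    return $json === false ? null : json_decode($json, true);
}

// Show the statement that would be sent for a sample query.
echo buildSearchSql("children's health"), PHP_EOL;
```

In a page script this would be called along the lines of runSql(buildSearchSql($_GET['search'] ?? '')), with the decoded rows then passed through htmlspecialchars or similar where appropriate before rendering.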
That’s it. Nothing complicated.
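For the multi-field case mentioned earlier, a request body for /json/search could be assembled like this. It is only a sketch: the two-field form (title and body) and the function name buildJsonSearch are hypothetical, and the table name rt is taken from the SQL query in this article.

```php
<?php
// Build a /json/search request body for a hypothetical two-field form.
// Each non-empty form field becomes its own "match" clause.
function buildJsonSearch(string $title, string $body, int $limit = 10): string
{
    $must = [];
    if ($title !== '') {
        $must[] = ['match' => ['title' => $title]];
    }
    if ($body !== '') {
        $must[] = ['match' => ['body' => $body]];
    }
    return json_encode([
        'index' => 'rt',
        'query' => ['bool' => ['must' => $must]],
        'limit' => $limit,
    ]);
}

echo buildJsonSearch('covid', 'vaccine'), PHP_EOL;
```

The point of this endpoint is that each field gets its own clause, so no query string has to be glued together by hand. A real form would also need to handle the case where every field is empty.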
What can it do?
Let’s now see what we can do with what we’ve built. Why didn’t we just dump wget output to files and use grep to search in them? Here is why:
First, our search engine not only finds what matches your query, but also highlights the results and ranks them properly using an improved ranking formula similar to BM25. For example, as you can see in this picture, the results containing “IPC precaution recommendations” go first since they have the whole phrase:
Second, you can use Manticore’s extended query syntax to do many interesting things. For example you might want to find only those documents that have “covid” and “caught” in the same sentence or paragraph:
Or you can match by a whole phrase, use OR and NOT and many more.
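Here are a few queries in the extended syntax that could be typed into the search box. The search terms are made up for illustration. Note that SENTENCE and PARAGRAPH only work when the table was created with sentence and paragraph indexing enabled (the index_sp setting).

```php
<?php
// Example full-text queries in Manticore's extended query syntax.
// The terms themselves are illustrative only.
$examples = [
    'same sentence'  => 'covid SENTENCE caught',
    'same paragraph' => 'covid PARAGRAPH caught',
    'exact phrase'   => '"IPC precaution recommendations"',
    'either word'    => 'covid | ebola',
    'exclude a word' => 'covid -ebola',
];

foreach ($examples as $what => $query) {
    printf("%-16s %s\n", $what . ':', $query);
}
```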
Third, do you remember that we turned on English stemming when doing CREATE TABLE? Here is how we can use it now: if I enter “coronaviruses”, it finds “coronavirus” too:
So even though the crawling part is very basic, the search part of our solution is quite powerful. You definitely can’t do anything like this with wget and grep alone.
How do I run it myself?
git clone https://github.com/manticoresoftware/demos.git manticore_demos
cd manticore_demos/crawler/
domain=who.int docker-compose up
If you are running it for the first time, you’ll have to wait a few minutes for docker to download the images and build the php service. Afterwards it will start crawling http://who.int, and the search UI will be available at http://localhost:8082 unless you run it on a remote server.
Thanks for reading! You can access the code here.