In life, we are always in search of something. Whether we are in search of the meaning of life or the most delicious tacos in town, we heavily rely on search engines to get the answers.
You may already use apps with powerful search capability such as Yelp, Uber, or Wikipedia. But did you know that these apps were built with Elasticsearch?
Elasticsearch is a free and open search and analytics engine for all types of data. It is known for its speed and scalability. Combined with its ability to index many types of content, Elasticsearch is used for numerous use cases such as application search, enterprise search, application performance monitoring, and security analytics, to name a few (paraphrased from Elastic).
If you are a developer who is looking to make data usable in real time and at scale, Elasticsearch is a great tool to have on your belt.
Elasticsearch is known as the heart of the Elastic Stack, which consists of Beats, Logstash, Elasticsearch, and Kibana.
Together, the Elastic Stack allows you to take data from any source, in any format, then search, analyze, and visualize it in real time (excerpt from Elastic). This blog will specifically focus on Elasticsearch.
By the end of this blog, you will be able to:
- understand how the components of Elastic Stack work together to search, analyze, and visualize data in real time
- install Elasticsearch and Kibana to run queries
- understand the basic architecture of Elasticsearch
- discuss how sharding and replication contribute to Elasticsearch's scalability and reliability
Complementary Video
If you prefer to learn by watching videos, I have created one explaining the concepts covered in this blog. If you would like more explanations on certain concepts, be sure to check it out!
Topics covered in the video:
- What is the Elastic Stack and what are the use cases of Elasticsearch and Kibana? 5:16
- Basic architecture of Elasticsearch 14:53
- What is Sharding? 18:21
- What is replication? 23:07
- Hands on Lab: Performing CRUD Operations with Elasticsearch and Kibana 25:03
- Q&A 49:17
Additional Resources
After mastering the concepts covered in this blog, learn how to perform CRUD operations with Elasticsearch and Kibana by reading this blog!
How do the products of the Elastic Stack work together?
The Elastic Stack consists of Beats, Logstash, Elasticsearch, and Kibana.
The best way to understand how these products work together is to put them into the context of a real-life project. These components are usually mixed and matched to serve your specific use case. For the purpose of this tutorial, we will go over a scenario that uses them all.
Imagine you are the lead developer responsible for the hottest outdoor gear e-commerce site. You currently have a full stack app connected to a database.
Millions of customers are searching for products on your site but the current architecture is struggling to keep up with search queries submitted by users.
This is where Elasticsearch comes in.
You would connect Elasticsearch to your app. When a user submits a search query on your website, the request is sent to the server. The server, in turn, sends a search query to Elasticsearch.
Elasticsearch sends the search results back to the server, which processes the info and sends the results back to the browser.
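As a rough sketch of that server-side step, the code below builds the JSON body a server might send to Elasticsearch's search API. The index name `products` and the field name `description` are assumptions chosen for illustration, not part of the scenario above:

```python
# Sketch of the server-side step: turn a user's search text into an
# Elasticsearch query body. The index name "products" and the field
# "description" are hypothetical, chosen for this example.
import json

def build_search_request(user_query):
    """Build the JSON body the server would send to /products/_search."""
    return {
        "query": {
            "match": {
                "description": user_query
            }
        }
    }

body = build_search_request("waterproof hiking boots")
print(json.dumps(body))
```

The `match` query shown here is Elasticsearch's standard full-text query; the server would send this body to the cluster and relay the hits back to the browser.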
At this point, you might be wondering: how do we get data into Elasticsearch?
That is where Beats and Logstash come into play.
Image source: devops_core
Beats is a collection of data shippers. When installed on your server, it collects and ships data to either Logstash or Elasticsearch.
Logstash is a data processing pipeline. The data Logstash receives (e.g. e-commerce orders and customer messages) is handled as events. These events are parsed, filtered, and transformed, then sent off to Elasticsearch, where the data is stored.
In Elasticsearch, data is stored as documents: units of information stored as JSON objects. A REST API is used to query these documents.
We will delve more into Elasticsearch in a bit. For now, know that it is responsible for performing searches and analysis on large volumes of data.
All the search and analysis on data would prove useless if we could not visualize it and gain insights from it!
Kibana provides a web interface to the data stored in Elasticsearch. It allows users to send queries to Elasticsearch using the same REST API. These queries can provide answers to questions such as "How many users visit our site daily?" or "What was the revenue for last month?"
Through Kibana dashboards, users can visualize the query results and gain insights from the data, as shown below!
Image source: Elastic
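A question like "How many users visit our site daily?" maps naturally to a date histogram aggregation in Elasticsearch's query DSL. The sketch below only constructs the query body; the index name `web_traffic` and the timestamp field `@timestamp` are assumptions for illustration:

```python
# Build (not send) an aggregation query body that counts documents per day,
# the kind of query Kibana issues under the hood. The index ("web_traffic")
# and field ("@timestamp") names here are hypothetical.
daily_visits_query = {
    "size": 0,  # return only aggregation buckets, not individual hits
    "aggs": {
        "visits_per_day": {
            "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day"
            }
        }
    }
}

# A server or script would send this body to GET web_traffic/_search;
# each returned bucket would hold one day's document count.
print(daily_visits_query)
```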
Now that we have an understanding of how components of Elastic Stack work together, let's delve more into Elasticsearch!
First, we will install Elasticsearch and Kibana to help us visualize and consolidate the concepts we will cover in this tutorial.
Installing Elasticsearch
To keep the blog brief, I will only be going over installation directions for Windows. But don't worry! The installation steps for macOS/Linux are shared in this blog.
Heads up!
If you are looking to run Elasticsearch and Kibana v8+ locally, the process will be different from the one shown in this blog. Check out the blog for running v8+ here!
Step 1: Download Elasticsearch
Go to the download link.
In the region highlighted with a green box, select the download option for your operating system.
You will see that Elasticsearch has been downloaded (orange box).
If you scroll down the page, you will see the installation steps. We will be using the commands specified in these steps to test whether the Elasticsearch server is running smoothly.
Step 2: Relocate and unzip the downloaded Elasticsearch
Where you relocate Elasticsearch is up to you, but for this tutorial, I have created a folder called Elastic_Stack in my Windows (C:) drive.
Move the downloaded Elasticsearch into the Elastic_Stack folder.
Right click on Elasticsearch to display the pop-up options and click on the extract all option. Once the downloaded Elasticsearch has been extracted, double click on the folder. You will see the following displayed on your screen.
Double click on the folder.
Click on the bin folder (red box).
Click on the region highlighted with a green box. It should reveal the file path to the bin folder. Copy this address, as we will be using it in the next step.
Step 3: Start the Elasticsearch server and ensure that everything is working properly
Search for the Command Prompt app on Windows (purple box) and click on the run as administrator option (red box).
In the Command Prompt terminal, change into the bin directory (cd) by providing the file path to the bin folder. This is the file path you copied in the previous step.
#In command prompt terminal
cd <filepath to bin folder in Elasticsearch>
Red box highlights the command we have used to change to the bin directory.
When you press enter, you will see that you have changed into the bin directory(blue box).
In the terminal, run the following command. If you are running on a non-Windows OS, run elasticsearch in the terminal instead.
#In command prompt terminal
elasticsearch.bat
You will see the cursor blinking for a while before you see the Elasticsearch server running!
You will see that the Elasticsearch server is running on localhost at port 9200 (red box).
Let's recap real quick. When a user (client) sends a request to the server, the server sends a search query to the Elasticsearch server. A REST API is used to query the documents, and this query is sent to the endpoint http://localhost:9200.
We will use the cURL command line tool to check whether the request is received by the Elasticsearch server.
Open up a new command prompt window(red box).
In the new terminal, run the following command.
#In new command prompt terminal
curl http://localhost:9200
When you run the command (white box), you will see the following JSON object displayed in your terminal (blue box). That means everything is working correctly and Elasticsearch was successfully installed.
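If you want to check that response programmatically rather than eyeball it, a small script can parse the JSON banner. The sample below is abbreviated, and the exact values (node name, cluster name, version number) will differ on your machine:

```python
import json

# Abbreviated sample of the JSON banner that GET http://localhost:9200
# returns; your node name, cluster name, and version number will differ.
sample_response = """
{
  "name": "MY-MACHINE",
  "cluster_name": "elasticsearch",
  "version": { "number": "7.10.0" },
  "tagline": "You Know, for Search"
}
"""

info = json.loads(sample_response)
print(f"Cluster '{info['cluster_name']}' is up, running version {info['version']['number']}")
```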
Leave these terminals open to keep the Elasticsearch server running.
Installing Kibana
Installing Kibana is very similar to installing Elasticsearch.
Step 1: Download Kibana
Kibana is a web interface for Elasticsearch. However, it ships with its own backend server that communicates with Elasticsearch.
Go to the download link.
In the region highlighted with a red box, select the download option for your operating system.
You will see that Kibana has been downloaded.
If you scroll down the page, you will see the installation steps. We will be using the commands specified in these steps to test whether the Kibana server is running correctly.
Step 2: Relocate and unzip the downloaded Kibana
Move the downloaded Kibana into the Elastic_Stack folder.
Right click on Kibana to display the options and click on the extract all option. Once Kibana has been extracted, double click on the folder.
Click on the bin folder (red box).
Click on the region highlighted with a green box. It should reveal the file path to the bin folder. Copy this address, as we will be using it in the next step.
Step 3: Run Kibana and ensure that everything is working properly
First, go back to the command prompt window that is running the Elasticsearch server. Make sure it is still running and it is not displaying any error messages.
Open up a new command prompt window.
In the Command Prompt terminal, change into the bin directory (cd) of Kibana by providing the file path to its bin folder. This is the path you copied in the previous step.
#In command prompt terminal
cd <filepath to bin folder in Kibana>
The command has been highlighted with a red box.
When you press enter, you will see that you have changed into the bin directory(blue box).
In the terminal, run the following command. If you are running on a non-Windows OS, run kibana in the terminal instead.
#In command prompt terminal
kibana.bat
You will see the cursor blinking for a while before you see Kibana running!
Open up a browser and go to http://localhost:5601.
You will see the following displayed on the browser.
Troubleshooting
If you are having trouble getting Kibana to work, try restarting your Elasticsearch server. Go to the command prompt terminal used for your Elasticsearch server. Press `control + c`. Then, run elasticsearch.bat in the same terminal.
Go back to your command prompt terminal for Kibana. Press `control + c` in that terminal. Then, run kibana.bat and go to http://localhost:5601 in your browser.
All right let's get back to the Kibana browser.
Click on the menu option (red box) to display a drop-down menu. Scroll down to the management section and click on the Dev Tools option (green box).
This console allows us to easily send queries to Elasticsearch.
All right, now that we have the installations out of the way, let's delve into the basic architecture of Elasticsearch! We will be using Kibana to look under the hood of Elasticsearch.
Basic architecture of Elasticsearch
Elasticsearch is a powerful search and analytics engine known for its distributed nature, speed, and scalability. This is due to its unique architecture.
We have just downloaded and run an Elasticsearch server. Little did we know, we were starting up a node (blue circle)! A node is a running instance of Elasticsearch that stores data. It has a unique id and a name.
Each node belongs to a cluster, which is a collection of connected nodes. When we started up a node, a cluster was formed automatically (pink box).
You can add one or many nodes to a cluster. These nodes are distributed across separate machines. By default, a node is assigned all of the following roles: master-eligible, data, ingest, and machine learning (if available). You can configure these roles and assign specific roles to certain nodes.
Each node in the cluster can handle HTTP requests from clients as well as communication between nodes. All nodes are aware of the other nodes within the same cluster and can forward HTTP requests to the node designated to handle a given request.
How is data stored within the node?
The basic unit of data stored in Elasticsearch is called a document. A document is a JSON object that contains whatever data you want to store in Elasticsearch.
For example, let's say you are building an app that helps users find the best food truck in their area. In order to build this app, you will need to store data about food trucks.
A document storing data about one food truck would look like the following.
{
  "name": "Pho King Rapidos",
  "cuisine": "Vietnamese and Mexican fusion"
}
Imagine if we had data about millions of food trucks. How would we be able to quickly search through the data to find the one we are looking for?
Searching for data is very similar to searching for a food item at a grocery store. Your search will be way more efficient if all the food items in the store are organized into specific aisles (fresh produce, meat, dairy, condiments, etc.).
Documents are organized in a similar way. Every document is grouped into an index. An index is a collection of documents that share similar traits and are logically related to each other, much like an aisle of a grocery store.
The cluster contains multiple nodes. Within nodes, relevant documents are grouped under indices.
As we would go to the produce aisle to find an apple, we would run search queries against the indices when searching for documents.
All right, let's look under the hood of Elasticsearch and see how we can get information about the node and cluster we have just created.
The Elasticsearch cluster exposes a REST API that receives HTTP requests. We can access this REST API with any HTTP client, such as Postman or cURL, but we will be using the Kibana Dev Tools console to do so.
Open up your Kibana Dev Tool. You should see the following on your screen.
We will start by checking the health status of our cluster.
Delete the content in the region highlighted in grey so we can write our own query.
The syntax of the query is very simple. You initiate the query by specifying an HTTP method (GET, POST, PUT, DELETE). Then, you specify the API you want to access and what you would like to accomplish (the command).
In this case, we want to retrieve (GET) the health status of our cluster. We specify that we want to access the cluster API and that we want information about its health.
So our query should look like this:
GET /_cluster/health
Copy and paste the query in the region highlighted with a red box.
Make sure the query is selected by clicking on it. Run the query by clicking on the arrow highlighted with an orange box.
You will see that a JSON object has been returned to you(green box). You can see that the name of the cluster is set to elasticsearch by default and the status of the cluster is set to green.
This means that our cluster is healthy!
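The status field can take three values: green means all primary and replica shards are allocated, yellow means all primaries are allocated but some replicas are not, and red means at least one primary shard is unallocated. A tiny helper makes the mapping explicit:

```python
# The "status" field in the _cluster/health response summarizes shard
# allocation. This helper just translates the three possible values.
def describe_cluster_status(status):
    meanings = {
        "green": "all primary and replica shards are allocated",
        "yellow": "all primary shards are allocated, but some replicas are not",
        "red": "at least one primary shard is unallocated; some data is unavailable",
    }
    return meanings.get(status, "unknown status")

print(describe_cluster_status("green"))
```

Keep yellow in mind for later: a single-node cluster cannot place replica shards anywhere, so its health often reports yellow once an index with replicas exists.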
Let's get a list of nodes that are in our cluster.
In order to get this information, we use the _cat API. The query syntax is very similar to the one we just sent.
We will send a GET request to the _cat API and use the command nodes?v to get the list of nodes in our cluster.
Your query will look like the following:
GET /_cat/nodes?v
Copy and paste the query into the dev tool. Select the query and send the query by clicking on the arrow.
You will see that basic info about our single node is displayed on the screen. It includes the node's IP address, name, and roles, as well as some performance measures.
Great job! It looks like our node and cluster have been created successfully.
Let's create an index for food_trucks.
You can create an index by specifying the PUT method followed by the name of the index.
Run the following query in the dev tool.
PUT food_trucks
You will see a JSON object returned to you. It will specify that an index named food_trucks has been successfully created.
You will also see that the value of shards_acknowledged is set to true. We will cover this shortly!
Armed with the basic understanding of Elasticsearch architecture, we are now ready to understand what factors make Elasticsearch so scalable and reliable!
Understanding the factors behind Elasticsearch's scalability and reliability
What is sharding?
In the previous step, upon creating an index, we saw that the shards_acknowledged value was set to true. What is a shard, anyway?
Earlier, I mentioned that related documents are grouped into an index. An index does not actually store documents, however. It is a virtual construct that keeps track of where documents are stored.
You can't find an index on disk. What actually exists on disk is a shard! A shard is where data is stored in Elasticsearch. This is also where searches are run!
When you create an index, one shard comes with it by default. You can also configure it so that you can create an index with multiple shards that are distributed across nodes.
Let's say we want to store 600K documents about food trucks in an index called Food Truck Index.
We have three nodes in our cluster, each of which can hold 200K documents. When we create the Food Truck Index, one shard comes with it by default. This shard is assigned to Node-1.
Remember that a shard is where data is stored. The number of documents a shard can hold depends on the capacity of the node.
We have three nodes that can only hold 200K documents each, but the entire index of 600K documents will not fit into any one of these nodes! If only we could divide these documents into smaller chunks and store them across these nodes...
Well, that is exactly what sharding is!
To make this possible, we add two additional shards to the index and distribute the shards across these nodes. Then, we store 200K documents per shard. Together, these shards are now able to store 600K food truck documents!
By dividing our documents into smaller chunks and storing these in shards that are distributed across nodes, we were able to store 600K documents. We accomplished all of this despite the fact that we do not have a single node that can store all that data! That is the beauty of sharding!
Sharding empowers Elasticsearch to adapt to support an increasing amount of data or demands placed on it.
So if our food truck app took off and our user base grew at breakneck speed, we would not have to worry about the increasing amount of data coming in. We could simply add more nodes and change the number of shards for the index we are working with!
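Under the hood, Elasticsearch decides which primary shard each document lands on with a routing formula along the lines of shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document's _id. The toy simulation below uses CRC32 as a deterministic stand-in for Elasticsearch's actual Murmur3-based hash, so the exact shard numbers are illustrative only:

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # in Elasticsearch, fixed when the index is created

def route_to_shard(doc_id):
    """Pick a primary shard for a document id, mimicking Elasticsearch's
    hash(_routing) % number_of_primary_shards rule. Elasticsearch uses a
    Murmur3-based hash; CRC32 is a deterministic stand-in here."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PRIMARY_SHARDS

# Each document id consistently maps to one of the three shards.
for i in range(6):
    doc_id = f"food_truck_{i}"
    print(doc_id, "-> shard", route_to_shard(doc_id))
```

This formula is also why the number of primary shards cannot simply be changed on an existing index: changing the divisor would send every existing document to the wrong shard.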
What is replication?
Our food truck app is gaining some serious momentum. Then, to our horror, one of the nodes goes down, taking its data with it into the dark abyss.
Can you imagine what a nightmare this would be if we had no back up mechanism to deal with something like this?
Thank goodness we have replication!
Replication creates copies of shards and keeps the copies in different nodes. If a node goes down, the copies stored in other nodes step up to the plate and serve requests like nothing happened.
Elasticsearch automatically replicates shards without us having to configure anything. It creates a copy (a replica shard) of each shard within the index.
Remember how we created an index called food_trucks earlier? Let's use Kibana to get more info about our index.
Go to the Dev Tool and run the following query.
GET /_cat/indices?v
If you look at the columns, you will see pri and rep (red box). These stand for primary shard (pri) and replica shard (rep).
Let's examine the index food_trucks, highlighted with a green box. You will see that upon creating an index, a primary shard and a replica shard were automatically created!
A replica shard is an identical copy of the primary shard. It functions in exactly the same way as the primary shard.
Since you should never put all of your eggs in one basket, replica shards are never stored on the same node as their primary shard. The primary shards and replica shards are distributed across the nodes in the manner shown below.
Even if a node goes down, you can rest easy knowing that a replica shard stored in another node will pick up the slack as if nothing had happened!
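The "never on the same node" rule can be sketched as a tiny allocation function. This is a simplification of Elasticsearch's real allocator, which weighs many more factors, but it captures the constraint described above:

```python
# Toy sketch of the allocation rule described above: a replica shard is
# never placed on the same node as its primary.
def allocate_replica(primary_node, nodes):
    """Return a node for the replica that differs from the primary's node,
    or None if no other node exists (the replica stays unassigned, which
    is why a single-node cluster reports yellow health)."""
    candidates = [n for n in nodes if n != primary_node]
    return candidates[0] if candidates else None

nodes = ["node-1", "node-2", "node-3"]
print(allocate_replica("node-1", nodes))  # replica lands on a different node
```

On our local single-node cluster, this is exactly why the food_trucks replica shard cannot be assigned anywhere.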
As you can see, sharding and replication contribute to Elasticsearch's scalability and reliability.
All right, we have achieved all of our goals! You deserve a round of applause and a long break for getting this far.
Now that you have a solid grasp of the important concepts of Elasticsearch, you are ready to explore more advanced skills like CRUD operations, mapping, analysis, and advanced queries. Go explore and see what you can do with Elasticsearch on your own!