Elasticsearch is a distributed database that lets us store, search, and analyze structured and unstructured data. One of its most common use cases is high-performance full-text search, which lets users retrieve documents based on a text query. To achieve this, it relies on the inverted index mechanism of Apache Lucene.
Basic Architecture
Data in Elasticsearch is organized into indices, comparable to tables in relational databases. An index is stored on a single node or across a cluster, which is formed by multiple nodes, each a physical or virtual machine. A single-node architecture can be represented as follows.
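To make the table analogy concrete, each record in an index is a JSON document, comparable to a row. Assuming a hypothetical products index, a document could be stored with a request like PUT products/_doc/1 and a body such as:

```json
{
  "name": "Wireless mouse",
  "price": 24.99,
  "in_stock": true
}
```

The field names here are illustrative; unlike a relational table, documents in the same index do not need to share a fixed schema up front.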
This is pretty straightforward, as all of our data is stored within a single node. That is usually the case when we want to test some features locally or set up a simple development environment. However, it is obviously not recommended for production if we want an acceptable level of availability.
Sharding
The previous architecture has several drawbacks. If we want to scale for data storage or throughput, we would need to scale the node vertically by adding more resources to it. Since vertical scaling has hard limits and quickly becomes expensive, it is usually not what we are looking for.
To achieve our scaling purposes, Elasticsearch provides a sharding mechanism. But how does it work?
Imagine that our previous node has 500GB of capacity; if we have 800GB of data to store, it obviously will not be able to handle it. Sharding is a way to divide our indices into smaller pieces, each of which is called a shard. Let us see how it looks in our example.
Now we can have our index divided into 2 shards, increasing the amount of data that can be stored. We can also have more than one shard per node, in case we want to scale out to more nodes in the future.
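In Elasticsearch, the number of primary shards is fixed when the index is created. Assuming the same hypothetical products index, the two-shard setup above could be created with a PUT products request and a settings body like:

```json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}
```

Replicas are set to 0 here only to keep the example focused on primary shards; replication is a separate mechanism.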
Aside from increasing storage, sharding also allows us to improve the performance of operations in our cluster. Since our data is distributed across more than one node, operations may be parallelized, with multiple machines working on the same query.
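How does a document end up on a particular shard? Elasticsearch computes shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id (the real implementation uses a murmur3 hash). A rough Python sketch of the idea, using CRC32 as a stand-in hash:

```python
import zlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Simplified stand-in for Elasticsearch's routing formula.
    Elasticsearch hashes the _routing value (the document id by
    default) with murmur3; CRC32 is used here just for illustration."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# Every document id maps deterministically to exactly one shard,
# so reads and writes for that id always go to the same place.
for doc_id in ["p-1", "p-2", "p-3", "p-4"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 2))
```

This formula is also why the number of primary shards cannot be changed after index creation: changing the divisor would reroute existing documents to different shards.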
Trade-offs
What is the trade-off of sharding our index? If sharding improves both storage and performance, why not define an incredibly high number of shards? With too many shards, our cluster needs to maintain metadata about every shard in order to operate, which can lead to poor performance; this is known as oversharding. So we need to put effort into correctly planning the number of shards that we need. Although we don't have a silver bullet for this planning, we can find insights here.
Considerations
Sharding is a mechanism that lets us scale our cluster horizontally, both in storage and in performance. When configured, it splits our index into smaller pieces of data.
We can define more than one shard per node if we plan to scale out in the future; however, we need to be aware of the effects of oversharding and avoid it.
We've seen that sharding our data improves our storage and performance, but what happens if one of our nodes becomes unavailable? In our architecture, we may lose a large part of our data. Is there a way to avoid it? In the next article we will talk about replication and how it helps solve this problem.