A bit of history
In 2021, Elasticsearch and Kibana (starting with version 7.11) were relicensed under ELv2 (Elastic License v2) and SSPL (Server Side Public License).
This change triggered a response from Amazon Web Services, which forked the projects and released OpenSearch (data store and search engine) and OpenSearch Dashboards (visualization and user interface) as Apache 2.0-licensed open-source projects.
At its core, OpenSearch is a distributed search and analytics engine. Distributed means that OpenSearch is backed by a cluster: a collection of nodes (cluster manager and data nodes).
Getting acquainted with the terminology is mandatory when starting to work with OpenSearch.
What is an index?
Indices are the largest unit of data in OpenSearch: the engine organizes data into indices, and each index is a collection of JSON documents.
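As a minimal sketch of what that looks like in practice, a JSON document is indexed with a single HTTP call (assuming a local cluster on localhost:9200; the movies index name and document body are hypothetical):

```bash
# Index a JSON document with id 1 into the "movies" index;
# the index is created automatically if it does not exist yet
curl -X PUT "localhost:9200/movies/_doc/1" \
  -H 'Content-Type: application/json' \
  -d '{"title": "The Matrix", "year": 1999}'
```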
Sharding an index is useful
Data in an index is partitioned across shards. The reasoning is that an index might be too large to fit on a single disk, whereas shards, being smaller, can be distributed and allocated across different nodes as needed.
Searches can run in parallel across different shards, speeding up query processing. When creating an index, you can define how many shards you want. Each shard is an independent Lucene index that can be hosted anywhere in the cluster.
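For illustration, a minimal sketch of defining the shard count at index creation (index name and values are hypothetical; note that the primary shard count cannot be changed on an existing index without reindexing, shrinking, or splitting):

```bash
# Create "my-index" with 3 primary shards and 1 replica per primary
curl -X PUT "localhost:9200/my-index" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'
```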
Replication is good
OpenSearch provides a fail-safe mechanism for events like node crashes by having two types of shards: primary and replica. By default, OpenSearch creates one replica shard for each primary shard.
Each shard can have several replicas (configured at index creation or changed later on), and to ensure high availability a replica is never placed on the same node as its primary shard. The primary shard is the main shard: it handles the indexing of documents and can also process queries.
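A sketch of changing the replica count on an existing index (index name and value are hypothetical); unlike the primary shard count, this setting can be updated at any time:

```bash
# Bump the number of replicas per primary shard to 2
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 2}}'
```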
Cluster health
OpenSearch cluster health can be either green, yellow, or red.
Green: all primary shards and their replicas are allocated to nodes.
Yellow: all primary shards are allocated to nodes, but some replicas are not.
Red: at least one primary shard is not allocated to any node.
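You can check the status with the cluster health API (assuming a local cluster):

```bash
# Overall cluster status: green, yellow, or red
curl "localhost:9200/_cluster/health?pretty"

# Drill down to per-index health to find the offending index
curl "localhost:9200/_cluster/health?level=indices&pretty"
```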
Rule of thumb
Shard size matters because it impacts both search latency and write performance. Too many small shards exhaust memory (JVM heap), while too few large shards prevent OpenSearch from properly distributing requests. A good rule of thumb is to keep shard size between 10–50 GB.
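To see whether your shards fall inside that range, the _cat/shards API lists each shard with its on-disk size, for example:

```bash
# One line per shard: index, shard number, primary/replica, state, size on disk,
# sorted largest first to spot shards outside the 10-50 GB sweet spot
curl "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store&s=store:desc"
```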
There is also a limit on how many shards a node can handle, so it's useful to check how many shards the cluster can accommodate by inspecting the cluster settings:
cluster.max_shards_per_node (Integer): Limits the total number of primary and replica shards for the cluster. The limit is calculated as follows: cluster.max_shards_per_node multiplied by the number of non-frozen data nodes. Shards for closed indexes do not count toward this limit. The default is 1000.
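A sketch of inspecting and, if needed, raising the limit (the value 1500 is an arbitrary example):

```bash
# Show the current value, falling back to the default of 1000
curl "localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node&pretty"

# Raise the limit cluster-wide; remember that every extra shard costs JVM heap
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.max_shards_per_node": 1500}}'
```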
- When adding or searching data within an index, that index is in an open state; the longer you keep indices open, the more shards you use.
- On the other hand, a closed index is not accessible for read or write operations and does not consume compute resources; however, it still occupies disk space. One way to reduce disk space is to shrink the index, i.e. move the data of an existing index into a new index with fewer primary shards (see the sketch after this list).
- Very important: red shards cause red indexes, and red indexes cause red clusters.
- Unassigned shards cannot be deleted; an unassigned shard is not a corrupted shard but a missing replica. You can list unassigned shards and the reason they are unassigned, as shown below.
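A sketch of the shrink operation mentioned above (index and node names are hypothetical). Before shrinking, the source index must be made read-only and a copy of every shard must sit on one node:

```bash
# 1. Block writes and gather a copy of every shard on a single node
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "node-1"
  }'

# 2. Shrink into a new index with a single primary shard
curl -X POST "localhost:9200/my-index/_shrink/my-index-shrunk" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.number_of_shards": 1}}'
```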
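And a quick way to find unassigned shards and why they are unassigned:

```bash
# List shards in the UNASSIGNED state together with the reason
curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
```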