This post lists a few advantages of using time-based indices in an OpenSearch cluster.
- Increasing / Decreasing the number of shards becomes easy
- Helps to plan cluster capacity and growth size
- Easily determine optimum number of shards
- Avoids having to reindex entire data
- Efficient Deletion and application of ISM
- Easy to include / exclude indices based on alias
- Snapshot and Restore becomes a breeze with day-wise indices
- Apply best_compression to day-wise indices
- Force-merge past indices
1. Increasing / Decreasing the number of shards becomes easy
Say an index template that uses day-wise indices is configured with 1 shard in its index settings. If the indexing rate becomes slow or the shard size grows too large (> 50 GB), the index template can easily be modified to increase number_of_shards to 3 or 5, and the change takes effect from the next day's index. Similarly, if a day-wise index pattern is configured with more shards than required (oversharded), reducing the count is just as easy.
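As a rough sketch, a composable index template for such day-wise indices could look like the following (the template name and index pattern are made up for illustration):

PUT _index_template/my_index_template
{
  "index_patterns": ["my_index-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

Bumping "number_of_shards" here only affects indices created after the change, which is exactly why the adjustment kicks in with the next day's index.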
2. Helps to plan cluster capacity and growth size
Let's say 100 events per second flow into an OpenSearch cluster and each event averages about 1 KB in size. Per day, that works out to:
86,400 seconds * 100 events/second = 8,640,000 events.
Since each event averages about 1 KB, the total size of those 8,640,000 events = 8,640,000 KB ≈ 8,640,000 / (1024 * 1024) GB ≈ 8.24 GB.
Thus, a day-wise index would be roughly 9 GB per day without any replicas. With 1 replica, the size per day would be ~18 GB, and the size for 30 days would be ~540 GB. This helps with capacity planning and estimating the cluster's growth rate.
3. Easily determine optimum number of shards
With a data set of about 9 GB per day, for a day-wise index we could start by setting "number_of_shards": 1 in the index template, since each primary shard would then be about 9 GB, which is pretty reasonable for a single shard. Shards of time-based indices can comfortably be as large as 40-50 GB.
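To sanity-check the assumption, the actual shard sizes of past day-wise indices can be inspected with the _cat API (the index pattern is illustrative):

GET _cat/shards/my_index-*?v&h=index,shard,prirep,store&s=index

If the store column consistently shows primaries well above ~50 GB, that is the signal to raise the shard count in the template.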
4. Avoids having to reindex entire data
If the data influx increases, we could simply set "number_of_shards": 3 in the index template, and this would take effect from the next day's index. The shard count can thus be changed without reindexing any existing data.
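Continuing the hypothetical template from point 1, the change is a single request (note that PUT replaces the whole template, so the full body is sent again):

PUT _index_template/my_index_template
{
  "index_patterns": ["my_index-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}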
5. Efficient Deletion and application of ISM
Let's say we need to retain data for up to 90 days. Any day-wise index older than 90 days can then be purged / deleted in its entirety, which is far more efficient than deleting individual records out of live indices. Application of Index State Management (ISM) also becomes much simpler with time-based indices.
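As a sketch, an ISM policy that deletes matching indices once they are 90 days old could look like this (the policy name and index pattern are assumptions; the ism_template block is what auto-attaches the policy to newly created matching indices):

PUT _plugins/_ism/policies/delete_after_90d
{
  "policy": {
    "description": "Delete day-wise indices older than 90 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ],
    "ism_template": {
      "index_patterns": ["my_index-*"],
      "priority": 100
    }
  }
}

Every new day-wise index matching my_index-* then picks up the policy at creation time, with no manual step per day.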
6. Easy to include / exclude indices based on alias
Let's assume the cluster needs to retain 90 days of data but only needs to search the last 60 days of it. Aliases to the rescue: define an alias in the index template so that it gets attached to every newly created day-wise index, and remove the alias from an index as soon as it becomes older than 60 days. This ensures that at any given point in time, the alias points to at most 60 day-wise indices.
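The alias itself would be declared under the aliases section of the index template so every new day-wise index gets it automatically; detaching it from an index that has aged past 60 days is then a single call (the alias and index names are illustrative):

POST _aliases
{
  "actions": [
    { "remove": { "index": "my_index-2021.09.05", "alias": "search-last-60-days" } }
  ]
}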
7. Snapshot and Restore becomes a breeze with day-wise indices
Say you have an index named my_index-2021.11.04, created on Nov 04, 2021. On Nov 05, 2021 at, say, 00:45 hours, when data is no longer being written to my_index-2021.11.04, a snapshot named snap-my_index-2021.11.04 could be triggered for that index. The snapshot would contain just my_index-2021.11.04, and if the index is ever deleted and needs to be restored, it can easily be restored from snap-my_index-2021.11.04.
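Assuming a snapshot repository named my_repo is already registered, both operations are one request each:

# Snapshot just the finished day-wise index
PUT _snapshot/my_repo/snap-my_index-2021.11.04?wait_for_completion=true
{
  "indices": "my_index-2021.11.04"
}

# Restore it later, e.g. after the index has been deleted
POST _snapshot/my_repo/snap-my_index-2021.11.04/_restore
{
  "indices": "my_index-2021.11.04"
}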
8. Apply best_compression to day-wise indices
The index template can be modified to set "codec": "best_compression"
in index settings i.e.
"settings": {
"codec": "best_compression"
}
Depending on the use case, this could save anywhere from 10% to 30% of disk space, or even more; your mileage will vary.
"codec": "best_compression" CANNOT be dynamically applied on existing open
indices. The index needs to closed first, then the setting applied dynamically and then the index needs to be opened.
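A sketch of that close / apply / open sequence for a past day-wise index (the index name is illustrative):

# Close the index before changing the codec
POST my_index-2021.11.03/_close

# Apply best_compression while the index is closed
PUT my_index-2021.11.03/_settings
{
  "index": { "codec": "best_compression" }
}

# Re-open the index
POST my_index-2021.11.03/_open

The codec only applies to segments written after the change, so a force merge (next point) is what actually rewrites the existing segments in the compressed format.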
9. Force-merge past indices
Since data gets written only to the current day's index, and assuming no updates happen to past data, all past indices are effectively read-only. Such indices can therefore be force-merged by setting "max_num_segments": 1, which boosts search speed tremendously.
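For example, yesterday's index (name illustrative) can be merged down to a single segment with:

POST my_index-2021.11.03/_forcemerge?max_num_segments=1

Since no more writes land on that index, the merged segment stays intact.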
Top comments (2)
I fully agree with everything described in here, but I have a challenge: daily assigning my policy to the newly created index.
With Elasticsearch I can leverage the Logstash output plugin for Elasticsearch and configure the policy I want in there, but how do I achieve the same thing with OpenSearch?
My indexes are created without a policy and we must apply the policy to the new index every day.
Thank you
Hey @anubisg1, sorry I missed your reply amidst a plethora of notifs. Did you figure it out? You'll need to create the policy first, and then it will auto-apply to all matching indices. Is there a reason you cannot create a policy?