Enabling full compression can reduce data size by 89.3%!
Compression is important for performance: more data can be stored and read into memory within a block, which reduces IOps and the cost of hosting data on servers or in the cloud. But over-compression can get expensive because of the time required for decompression, which could affect read performance; it can still be useful for applications that need cold storage or simply want the highest compression.
A configurable compression strategy is required to support unique storage requirements, allowing us to define how much and how often we want to compress data. For example, if your data file (Segment) is 100MB in size with 100K key-values, you can express your compression requirements like:
- Compress the full 100MB data file.
- Compress data every 4MB.
- Compress data every 1000 key-values.
- Compress all data but reset compression every 10th key-value.
- Do not compress all data but compress every 10th key-value.
- Compress binary-search-indexes & hash-indexes but not other data-blocks.
Note: data-block refers to a logical set of bytes (Array<Byte>) stored within a Segment, such as indexes, keys, values etc. A Segment itself is a data-block that stores other data-blocks within itself (Array<Array<Byte>>).
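For instance, the "Compress data every 4MB" requirement above maps onto the SegmentConfig settings covered later in this post. A rough sketch only (here the blocks are compressed with Snappy at a minimum of 20.0% savings; the elided builder options are the ones shown further down):

.setSegmentConfig(
  SegmentConfig
    .builder()
    //start a new compressible Segment block every 4MB
    .minSegmentSize(StorageUnits.mb(4))
    //compress each block - see "External compression" below for the full options
    .compression((UncompressedBlockInfo info) ->
      Collections.singletonList(new Compression.Snappy(20.0))
    )
    ...
)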
Compression strategies
You can combine, enable or disable any or all of the following compression strategies:
- Internal-compression includes prefix compression and duplicate value elimination.
- External-compression uses LZ4 and/or Snappy, which can be applied selectively to parts of a file or to the entire file.
Prefix compression
Prefix compression stores all keys in a compressed group format in their sorted order. Reading a single key from a group requires decompressing all keys that exist before the searched key within that group, so for read performance it is useful to leave some keys uncompressed. You can also prefix-compress all keys if you just want high compression.
The following Map is configured to compress 4 keys into a group, starting a new group at every 5th key. The boolean parameter keysOnly is set to true, which applies prefix-compression to keys only; if false, prefix-compression is applied to keys and all metadata that gets written with that key, which results in higher compression.
Map<Integer, String, Void> map =
  MapConfig
    .functionsOff(Paths.get("myMap"), intSerializer(), stringSerializer())
    .setSortedKeyIndex(
      SortedKeyIndex
        .builder()
        .prefixCompression(new PrefixCompression.Enable(true, PrefixCompression.resetCompressionAt(5)))
        ...
    )
    .get();
map.put(1, "one");
map.get(1); //Optional[one]
In the following configuration prefix compression is applied to every 5th key.
.prefixCompression(new PrefixCompression.Enable(false, PrefixCompression.compressAt(5)))
Prefix compression can also be disabled, which optionally allows optimising the sorted-index for direct binary-search without creating a dedicated binary-search byte array. You can read more about normaliseIndexForBinarySearch here.
.prefixCompression(new PrefixCompression.Disable(false))
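The boolean passed to PrefixCompression.Disable appears to be the normaliseIndexForBinarySearch flag mentioned above (an assumption based on the surrounding text), so enabling normalisation while keeping prefix compression disabled would look like:

.prefixCompression(new PrefixCompression.Disable(true)) //assumed: true = normaliseIndexForBinarySearch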
Duplicate value elimination
Time-series or event data like weather, electricity, solar etc. often contains duplicate values. Such duplicate values can be detected and eliminated with the following configuration.
ValuesConfig
  .builder()
  .compressDuplicateValues(true)
  .compressDuplicateRangeValues(true)
Duplicate value elimination is very cost-effective because it does not create or leave decompression markers on compressed data; instead, all decompression information for that key is embedded within an already existing 1-2 byte space.
Range values created by the range APIs like remove-range, update-range & expire-range are most likely to have duplicate values and can be eliminated/compressed with the compressDuplicateRangeValues(true) config.
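For a rough picture of the kind of data this targets, repeated readings written through the Map from the prefix-compression example above are exactly the duplicates this configuration eliminates (the values here are made up for illustration):

//hourly weather readings where the same value repeats across keys
map.put(1, "sunny");
map.put(2, "sunny");
map.put(3, "sunny");
map.put(4, "cloudy"); //with compressDuplicateValues(true) the repeated "sunny" values can be de-duplicated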
External compression
Every data-block written into a Segment file is compressible! A Segment file is nothing special, just another data-block that stores other data-blocks within itself.
You will find a compression property that configures external compression in all data-blocks that form a Segment - SortedKeyIndex, RandomKeyIndex, BinarySearchIndex, MightContainIndex, ValuesConfig & SegmentConfig.
Both Snappy and all LZ4 instances are supported.
The following snippet demos how to apply compression to a SortedKeyIndex/Linear-search-index. It tries to compress with LZ4 first at a minimum of 20.0% compression savings; if the savings were lower than 20.0%, Snappy is tried with the same minimum.
SortedKeyIndex
  .builder()
  .compressions((UncompressedBlockInfo info) ->
    Arrays.asList(
      //try running LZ4 with minimum 20.0% compression
      Compression.lz4Pair(
        new Pair(LZ4Instance.fastestJavaInstance(), new LZ4Compressor.Fast(20.0)),
        new Pair(LZ4Instance.fastestJavaInstance(), LZ4Decompressor.fastDecompressor())
      ),
      //if not try Snappy
      new Compression.Snappy(20.0)
    )
  )
  ...
UncompressedBlockInfo provides the data size (info.uncompressedSize()) of the data-block being compressed, which can optionally be used to determine if it should be compressed or not. For example: if the data size is already too small then you can disable compression by returning Collections.emptyList().
.compression(
  (UncompressedBlockInfo blockInfo) -> {
    if (blockInfo.uncompressedSize() < StorageUnits.mb(1)) {
      return Collections.emptyList();
    } else { //else do compression
      return {your compression};
    }
  }
)
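As one way to fill in the placeholder above, the lambda could fall back to the Snappy compression shown earlier for larger blocks (a sketch, reusing only the calls from the snippets above):

.compression(
  (UncompressedBlockInfo blockInfo) -> {
    //skip compression for blocks under 1MB
    if (blockInfo.uncompressedSize() < StorageUnits.mb(1)) {
      return Collections.emptyList();
    } else {
      //compress larger blocks with Snappy, requiring at least 20.0% savings
      return Collections.singletonList(new Compression.Snappy(20.0));
    }
  }
)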
How to apply compression at a file level? Similar to the above, you can apply file-level compression with SegmentConfig.
.setSegmentConfig(
  SegmentConfig
    .builder()
    ...
    .compression((UncompressedBlockInfo info) ->
      {your compression config here}
    )
)
How to limit compression by size & key-value count?
The minSegmentSize property sets the compressible size of a Segment; if the compression property above is defined for a data-block, that data gets compressed in every minSegmentSize-sized block.
The maxKeyValuesPerSegment property also controls the compressible limit of a Segment: along with minSegmentSize, it limits the maximum number of key-values stored within a compressible Segment.
.setSegmentConfig(
  SegmentConfig
    .builder()
    .minSegmentSize(StorageUnits.mb(4))
    .maxKeyValuesPerSegment(100000)
    ...
)
Summary
SwayDB's compression is highly configurable and can be tuned for unique storage requirements with different tradeoffs.
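Putting the pieces from this post together, a combined configuration might look roughly like the following (a sketch only; the elided builder options and the compression lambdas are the ones shown in the snippets above):

Map<Integer, String, Void> map =
  MapConfig
    .functionsOff(Paths.get("myMap"), intSerializer(), stringSerializer())
    //prefix-compress keys in groups of 4, resetting at every 5th key
    .setSortedKeyIndex(
      SortedKeyIndex
        .builder()
        .prefixCompression(new PrefixCompression.Enable(true, PrefixCompression.resetCompressionAt(5)))
        ...
    )
    //limit each compressible Segment by size and key-value count
    .setSegmentConfig(
      SegmentConfig
        .builder()
        .minSegmentSize(StorageUnits.mb(4))
        .maxKeyValuesPerSegment(100000)
        ...
    )
    .get();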
GitHub repos
- SwayDB on GitHub.
- Java examples repo.
- Kotlin examples repo.
- Scala examples repo.
- Documentation.