Ian Spence

Posted on Jul 27, 2019

All About Apache Cassandra: Snapshots

#cassandra #database #nosql #devops

Welcome to the inaugural post of "All About Apache Cassandra", a new series of posts I'll be writing about Apache Cassandra exclusively on dev.to!

The concept of a "Snapshot"

In computing, a snapshot is a point-in-time copy of data or state of a machine.

Have you ever ended up with a folder full of saved copies of the same essay, each under a slightly different name?

Each one of these files is a snapshot, as it's a copy of your essay at the point in time when you saved it.

Snapshots provide us with an easy way to undo changes and back up systems.

How Cassandra Stores Data

Cassandra breaks its data down by Keyspace (which is like a Database in MySQL or Schema in PostgreSQL), and Column Family (a table). Data for the column families is stored in SSTables (Sorted String Tables).

When Cassandra writes to disk it does so by writing a new SSTable. Every SSTable is immutable, so each flush produces a new SSTable rather than modifying an existing one. A process known as Compaction will automatically merge the tables into a single new table.

For example, if we have a column family named Users in the keyspace App, the data for that table would look like:



<Cassandra Data>/App/Users/
1-big-Data.db
2-big-Data.db
3-big-Data.db
4-big-Data.db

Each big-Data file is an SSTable. Compaction would take all of these tables and merge them into a single, new file, 5-big-Data.db.
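To make the flush-then-compact idea concrete, here is a minimal sketch in Python. It is a toy model and not how Cassandra is implemented: the `flush` and `compact` functions and the dictionary-based "SSTable" are assumptions made for illustration, and real SSTables also carry timestamps, tombstones and index files.

```python
from typing import Dict, List

# Toy model only: each "SSTable" is a key-sorted mapping that is never
# modified after it is created.
SSTable = Dict[str, str]


def flush(memtable: Dict[str, str]) -> SSTable:
    """Write out a brand new table sorted by key."""
    return dict(sorted(memtable.items()))


def compact(sstables: List[SSTable]) -> SSTable:
    """Merge tables (oldest first) into one new table; the newest value wins."""
    merged: Dict[str, str] = {}
    for table in sstables:
        merged.update(table)
    return dict(sorted(merged.items()))


tables = [
    flush({"bob": "v1", "alice": "v1"}),  # like 1-big-Data.db
    flush({"alice": "v2"}),               # like 2-big-Data.db
    flush({"carol": "v1"}),               # like 3-big-Data.db
]
compacted = compact(tables)               # like the new, merged file
print(compacted)  # {'alice': 'v2', 'bob': 'v1', 'carol': 'v1'}
```

Note that the input tables are left untouched by `compact`; the merged result is a separate, new table, which mirrors why the old files can be deleted only after the new one exists.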

How Cassandra Takes Snapshots

When you take a snapshot with Apache Cassandra, it creates a Hard Link of all live SSTables. Easy, right? Well, it's a little more complex than you might think.

What's a hard link?

A hard link is a file that points to the same data as another file.

In UNIX-like systems (Linux, macOS, BSD), the filesystem uses inodes to reference the data on disk, and the file names on your computer reference inodes. A hard link is a file name that shares the same inode as another file name.

Hard links are different from Soft Links (symbolic links), which only store the path of the actual file and tell the application reading them where to find it.
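The difference is easy to see for yourself. The following sketch uses only Python's standard library on a UNIX-like system; the file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "essay.txt")
    hard = os.path.join(tmp, "essay-hard.txt")
    soft = os.path.join(tmp, "essay-soft.txt")

    with open(original, "w") as f:
        f.write("draft one")

    os.link(original, hard)     # hard link: a second name for the same inode
    os.symlink(original, soft)  # soft link: a separate file that stores a path

    # Both names resolve to the same inode number.
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    # lstat() inspects the link itself, which has an inode of its own.
    soft_is_own_file = os.lstat(soft).st_ino != os.stat(original).st_ino
    # The soft link's content is just the path it points to.
    soft_target = os.readlink(soft)

    print(same_inode)               # True
    print(soft_is_own_file)         # True
    print(soft_target == original)  # True
```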

The Catch

There's a very important catch with hard links: The data on disk will only be deleted when all references to it are removed.

For example: Inode 1001 references 500MB of data on the disk and we have two files that point to Inode 1001. If we deleted the first link, the 500MB data will still be on disk because of the second link. We won't free up that 500MB from disk until all links have been deleted.
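Here is a small sketch of that behaviour in Python, assuming a UNIX-like filesystem. A 1KB file stands in for the 500MB of data, and the file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first-link")
    second = os.path.join(tmp, "second-link")

    with open(first, "wb") as f:
        f.write(b"x" * 1024)  # stand-in for the 500MB in the example

    os.link(first, second)
    links_before = os.stat(first).st_nlink   # two names point at the inode

    os.remove(first)                         # delete the first link
    links_after = os.stat(second).st_nlink   # one name is left

    # The data is still fully readable through the remaining link.
    with open(second, "rb") as f:
        surviving_bytes = len(f.read())

    print(links_before, links_after, surviving_bytes)  # 2 1 1024
```

Only when the link count drops to zero does the filesystem release the space.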

Where Cassandra Stores Snapshots

Since snapshots are hard links of existing files, these files must be on the same filesystem as your data. Therefore, snapshots are stored in the snapshots directory within your Column Family. If we took a snapshot named snapshot1 on our existing Users table, it would result in:



<Cassandra Data>/App/Users/
1-big-Data.db
2-big-Data.db
3-big-Data.db
4-big-Data.db
<Cassandra Data>/App/Users/snapshots/snapshot1
1-big-Data.db (Hard link)
2-big-Data.db (Hard link)
3-big-Data.db (Hard link)
4-big-Data.db (Hard link)

When compaction occurs and the SSTables are merged, our snapshot is unaffected:



<Cassandra Data>/App/Users/
5-big-Data.db
<Cassandra Data>/App/Users/snapshots/snapshot1
1-big-Data.db
2-big-Data.db
3-big-Data.db
4-big-Data.db

Because there is still at least one reference to the data of SSTables 1-4, the system will not delete that data from disk.
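The whole sequence can be simulated with ordinary files. This is only a sketch of the idea on a UNIX-like system, not Cassandra's code: `take_snapshot` and `fake_compaction` are hypothetical helpers, and the "compaction" here simply concatenates the files.

```python
import os
import tempfile


def live_files(table_dir):
    """Names of regular files directly inside the table directory."""
    return sorted(
        e for e in os.listdir(table_dir)
        if os.path.isfile(os.path.join(table_dir, e))
    )


def take_snapshot(table_dir, name):
    """Hard link every live file into snapshots/<name>."""
    snap_dir = os.path.join(table_dir, "snapshots", name)
    os.makedirs(snap_dir)
    for entry in live_files(table_dir):
        os.link(os.path.join(table_dir, entry), os.path.join(snap_dir, entry))
    return snap_dir


def fake_compaction(table_dir, new_name):
    """Write one merged file, then delete the files it replaced."""
    old = live_files(table_dir)
    merged = b""
    for entry in old:
        with open(os.path.join(table_dir, entry), "rb") as f:
            merged += f.read()
    with open(os.path.join(table_dir, new_name), "wb") as f:
        f.write(merged)
    for entry in old:
        os.remove(os.path.join(table_dir, entry))


with tempfile.TemporaryDirectory() as data_dir:
    users = os.path.join(data_dir, "App", "Users")
    os.makedirs(users)
    for n in range(1, 5):
        with open(os.path.join(users, f"{n}-big-Data.db"), "wb") as f:
            f.write(f"rows from flush {n}\n".encode())

    snap = take_snapshot(users, "snapshot1")
    fake_compaction(users, "5-big-Data.db")

    live_after = live_files(users)
    snapshot_after = sorted(os.listdir(snap))
    print(live_after)      # ['5-big-Data.db']
    print(snapshot_after)  # ['1-big-Data.db', '2-big-Data.db', '3-big-Data.db', '4-big-Data.db']
```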

How Snapshots Affect Capacity

As with any distributed storage system, calculating your capacity is not as simple as just looking at how much free disk space you have.

Snapshots can have a very large impact on your cluster's capacity, and great care is needed if you're considering taking snapshots on a schedule.

Because of that one important catch with hard links (see above), compaction jobs can dramatically increase the amount of data used by snapshots.

When you take a snapshot on a server with no activity whatsoever, that snapshot will not consume any (meaningful) amount of data on disk. But as soon as a new SSTable is written or compaction occurs, that snapshot will begin to take up space, potentially doubling the amount of data on your disk.

Let me try and break it down:

Let's assume you have a 100GB disk, and Cassandra is using 50GB of that for its data.

You take a new snapshot. This isn't going to use that much disk space. Still 50% full.
An operation that rewrites every SSTable on disk occurs, such as a major compaction, a cleanup or a scrub.
That snapshot now references old versions of SSTables that have been rewritten entirely, so it is essentially duplicated data. Now 100% full.

Of course, this is a slightly exaggerated example, but the reality is that snapshots and compaction can wreck your disk capacity and ruin your day.
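One way to see how much space a snapshot is really costing you is to count only the data that no live SSTable references any more. The sketch below does that by comparing inode numbers; `snapshot_only_bytes` is a hypothetical helper, the directory layout follows the examples in this post, and 50 bytes stand in for the 50GB above.

```python
import os
import tempfile


def snapshot_only_bytes(table_dir):
    """Bytes kept on disk only because a snapshot still links to them."""
    live_inodes = set()
    for name in os.listdir(table_dir):
        path = os.path.join(table_dir, name)
        if os.path.isfile(path):
            live_inodes.add(os.stat(path).st_ino)

    seen = set()
    total = 0
    for root, _dirs, files in os.walk(os.path.join(table_dir, "snapshots")):
        for name in files:
            st = os.stat(os.path.join(root, name))
            # Skip data shared with a live file, and count each inode once.
            if st.st_ino not in live_inodes and st.st_ino not in seen:
                seen.add(st.st_ino)
                total += st.st_size
    return total


with tempfile.TemporaryDirectory() as table_dir:
    # Five live "SSTables" of 10 bytes each: 50 bytes of data.
    for n in range(1, 6):
        with open(os.path.join(table_dir, f"{n}-big-Data.db"), "wb") as f:
            f.write(b"0123456789")

    # Take a snapshot: hard links only, so no extra data yet.
    snap_dir = os.path.join(table_dir, "snapshots", "snapshot1")
    os.makedirs(snap_dir)
    for n in range(1, 6):
        name = f"{n}-big-Data.db"
        os.link(os.path.join(table_dir, name), os.path.join(snap_dir, name))
    before = snapshot_only_bytes(table_dir)

    # Rewrite every live file, the way a cleanup or scrub would.
    for n in range(1, 6):
        old = os.path.join(table_dir, f"{n}-big-Data.db")
        with open(old, "rb") as f:
            data = f.read()
        with open(os.path.join(table_dir, f"{n + 5}-big-Data.db"), "wb") as f:
            f.write(data)
        os.remove(old)
    after = snapshot_only_bytes(table_dir)

    print(before, after)  # 0 50
```

The live data is still 50 bytes, so total usage has doubled, just like the 100GB disk in the example.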

Restoring From a Snapshot

Restoring from a snapshot is a destructive action that will result in the loss of any new data created after the snapshot was taken. You should only restore from a snapshot when you have no other option. There is no automated way to restore Cassandra to a snapshot; it is a manual process.

While the cluster is running:

Truncate the table you need to restore
Stop the Cassandra service

Why truncate? If any tombstones remain in the table, the restored data they cover will be deleted again after you restore. Truncating removes any tombstones from that table.

On all nodes in the cluster:

Navigate to the directory for the snapshot you wish to restore from
Copy over all SSTables in that directory into the data directory for the table

Finally:

Start up the service
Run a nodetool refresh
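The copy step on each node could be scripted along these lines. This is a sketch under assumptions, not an official restore tool: `copy_snapshot_into_table` is a hypothetical helper, the directory layout follows the examples in this post, and truncating, stopping and starting the service, and the refresh are still done by hand as described above.

```python
import os
import shutil
import tempfile


def copy_snapshot_into_table(table_dir, snapshot_name):
    """Copy every file from snapshots/<snapshot_name> into the table directory."""
    snap_dir = os.path.join(table_dir, "snapshots", snapshot_name)
    copied = []
    for entry in sorted(os.listdir(snap_dir)):
        src = os.path.join(snap_dir, entry)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(table_dir, entry))
            copied.append(entry)
    return copied


# Demo on a throwaway directory standing in for a truncated table.
with tempfile.TemporaryDirectory() as table_dir:
    snap_dir = os.path.join(table_dir, "snapshots", "snapshot1")
    os.makedirs(snap_dir)
    for name in ("1-big-Data.db", "2-big-Data.db"):
        with open(os.path.join(snap_dir, name), "wb") as f:
            f.write(b"snapshot data")

    copied = copy_snapshot_into_table(table_dir, "snapshot1")
    restored = sorted(
        e for e in os.listdir(table_dir)
        if os.path.isfile(os.path.join(table_dir, e))
    )
    print(copied)    # ['1-big-Data.db', '2-big-Data.db']
    print(restored)  # ['1-big-Data.db', '2-big-Data.db']
```

Copying (rather than moving) leaves the snapshot intact, so you can repeat the restore if something goes wrong.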

Command Line Reference

As with most operational tasks, you'll be using the nodetool application to manage snapshots on a server. These are some of the commands you might use.

Take a snapshot



nodetool snapshot [options] [keyspace]

Reference: https://cassandra.apache.org/doc/latest/tools/nodetool/snapshot.html

Take a snapshot with a specific name:
nodetool snapshot -t <name>

Take a snapshot on all tables in a specific keyspace:
nodetool snapshot <keyspace>

Take a snapshot on one or more specific tables:
nodetool snapshot -kt <comma separated list of keyspace.tableName>
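If you plan to script snapshots, a small helper that builds the argument list can keep the variations above straight. This is a sketch only: `snapshot_command` is a hypothetical function, and it uses just the `-t` and `-kt` options shown above.

```python
from typing import List, Optional, Sequence


def snapshot_command(
    name: Optional[str] = None,
    keyspace: Optional[str] = None,
    tables: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the argument list for a `nodetool snapshot` call."""
    cmd = ["nodetool", "snapshot"]
    if name:
        cmd += ["-t", name]
    if tables:
        # Entries are expected in keyspace.tableName form.
        cmd += ["-kt", ",".join(tables)]
    elif keyspace:
        cmd.append(keyspace)
    return cmd


named = snapshot_command(name="snapshot1", keyspace="App")
two_tables = snapshot_command(tables=["App.Users", "App.Orders"])
print(named)       # ['nodetool', 'snapshot', '-t', 'snapshot1', 'App']
print(two_tables)  # ['nodetool', 'snapshot', '-kt', 'App.Users,App.Orders']
```

On a node where nodetool is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.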

List all snapshots



nodetool listsnapshots

Reference: https://cassandra.apache.org/doc/latest/tools/nodetool/listsnapshots.html

Remove one or more snapshots



nodetool clearsnapshot [options] [keyspaces]

Reference: https://cassandra.apache.org/doc/latest/tools/nodetool/clearsnapshot.html

Remove all snapshots (on Cassandra 4.0 and later, add the --all flag):
nodetool clearsnapshot

Remove a snapshot with a specific name:
nodetool clearsnapshot -t <snapshot name>

Remove snapshots from a specific keyspace:
nodetool clearsnapshot <keyspace>
