NoSQL databases such as Apache Cassandra, MongoDB, Amazon DynamoDB, Azure DocumentDB and others are being adopted by enterprises to meet the data requirements of next-generation applications. Specifically, the Apache Cassandra database provides linear scalability, a flexible consistency model and a shared-nothing architecture that enable enterprises to scale their database footprint according to performance needs.
As mission-critical applications are deployed on Cassandra databases, enterprise data management requirements such as backup and recovery follow. Today, there are several options for backing up and recovering Cassandra, each with pros and cons that I will elaborate on in this blog post. A Cassandra database administrator (DBA) should understand these trade-offs before choosing an option and before validating the products and solutions available to meet enterprise backup and recovery goals.
Let me first point out the important features that DBAs should look for while planning the backup:
- Secondary storage footprint: Efficient storage utilization is key to any backup and recovery plan. Depending on the retention period of the backup data, the backup data set may be much larger than the primary data set. Hence, if storage management is not efficient, storage capital expenditure could increase multifold, defeating the purpose of data protection.
- RPO and RTO: Recovery point objective (RPO) is the maximum targeted period in which data might be lost from an IT service due to a major incident. Recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster. Typically, enterprise requirements for these next-generation applications dictate very small RPOs and RTOs.
- Backup performance: This is the time spent backing up the desired data set. Typically, backup performance should scale to match the data change rate at the primary cluster. The time spent taking the initial full backup of the primary cluster also matters, because it must not delay the periodic backups that follow.
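To make the footprint point concrete, here is a rough back-of-the-envelope sketch in Python. The function name and every figure in it are hypothetical, and it assumes the backup store applies no compression or de-duplication.

```python
def retained_backup_gb(full_gb, fulls_retained, incremental_gb, incrementals_retained):
    """Estimate the secondary storage held by backups, in GB.

    Assumes every retained copy is stored in full (no compression, no dedup).
    """
    return full_gb * fulls_retained + incremental_gb * incrementals_retained


# Hypothetical cluster: 1,000 GB primary data set, weekly full backups and
# daily 50 GB incrementals, all retained for four weeks.
primary_gb = 1000
total = retained_backup_gb(
    full_gb=primary_gb,
    fulls_retained=4,
    incremental_gb=50,
    incrementals_retained=28,
)
print(total, total / primary_gb)
```

With these made-up numbers the backup store holds 5,400 GB, or 5.4 times the primary data set, which is why retention policy drives secondary storage cost.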
Cassandra Backup Options
Now let’s take a look at four options available for performing Cassandra backup:
Snapshot-Based Backup
Cassandra provides a native node-level snapshot option via the nodetool command. A snapshot creates hard links to the existing Cassandra data files, and users are expected to delete the snapshot after backing up the files. This way, the backup is taken from the snapshot files even while the live Cassandra data keeps changing.
Pros:
- Simpler to manage: It requires only periodic snapshot commands on each node, a backup of the snapshot files and removal of the snapshot files after the backup.
Cons:
- Not a true backup: A node-level snapshot is not a cluster-consistent backup, and such a simplistic strategy will result in heavy repairs upon recovery.
- Potentially large RPO: The snapshot operation is not trivial, as it requires flushing all in-memory data to disk. Hence, frequent snapshots will significantly degrade the production cluster’s performance, which limits how often they can be taken.
- Large storage footprint due to data compaction at the primary cluster: As part of data compaction, Cassandra merges multiple data files on a node into a new file and, hence, reduces multiple versions of a record to a single, latest version. Compaction not only reduces data at the primary cluster, it also improves read performance, as it reduces the number of files to look up for a record. As a side effect, records that were already backed up reappear in the newly compacted file and are captured again by the next snapshot, so the same records end up in multiple backup versions. Filtering already captured records out of a compacted file is not a trivial task. Depending on factors such as workload type, compaction strategy and versioning interval, compaction may cause many times more data to be backed up, with a serious impact on CapEx planning.
- Snapshot storage management at the primary cluster: Cassandra expects users or admins to clean up snapshot files once they are no longer required. This requires a reliable job to clean up the snapshot folders; otherwise, they may exhaust storage and affect the primary cluster’s availability.
- Siloed solution: Enterprises require a multiplatform backup and recovery product, whereas native tools are built for one particular database and don’t extend to multiple databases.
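The snapshot cycle described above (take a snapshot, copy its files, clear it) can be sketched roughly as follows. This is a minimal per-node illustration, not a complete tool: the function names are hypothetical, and it assumes nodetool is on the PATH and that snapshots live under <data_dir>/<keyspace>/<table>/snapshots/<tag>. The `run` parameter exists only so the cycle can be exercised without a live node.

```python
import shutil
import subprocess
from pathlib import Path


def snapshot_command(tag, keyspace=None):
    """Build the nodetool call that takes a tagged snapshot."""
    cmd = ["nodetool", "snapshot", "-t", tag]
    if keyspace:
        cmd.append(keyspace)
    return cmd


def clear_command(tag):
    """Build the nodetool call that removes a tagged snapshot."""
    return ["nodetool", "clearsnapshot", "-t", tag]


def find_snapshot_dirs(data_dir, tag):
    """Locate snapshot folders, assuming <keyspace>/<table>/snapshots/<tag>."""
    pattern = f"*/*/snapshots/{tag}"
    return sorted(p for p in Path(data_dir).glob(pattern) if p.is_dir())


def backup_snapshot(data_dir, tag, dest, keyspace=None, run=subprocess.run):
    """Snapshot this node, copy the snapshot files to dest, then clear the snapshot."""
    run(snapshot_command(tag, keyspace), check=True)
    copied = []
    for snap in find_snapshot_dirs(data_dir, tag):
        keyspace_name, table_name = snap.parts[-4], snap.parts[-3]
        target = Path(dest) / tag / keyspace_name / table_name
        shutil.copytree(snap, target)
        copied.append(target)
    # Reached only if every copy succeeded, so a failed run keeps the snapshot
    # on disk for a retry (and keeps consuming primary storage until then).
    run(clear_command(tag), check=True)
    return copied
```

On a real node this might be called as `backup_snapshot("/var/lib/cassandra/data", "nightly", "/mnt/backup")`, where both paths are assumptions about the local setup. It still has to run on every node, and it does nothing about cluster consistency or the compaction-driven duplication described above.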
Incremental Backup
Cassandra supports incremental backup functionality. If enabled, every time Cassandra flushes memory content to a new file, a hard link to the new file is created in the incremental backup folder. Once these files have been backed up, the administrator is supposed to delete the hard links. The best part of incremental backup is that hard links are created only for files newly created by flushing in-memory data. Files created by compaction are not hard-linked.
Pros:
- Better storage utilization: There are no duplicate records in the backup, as the compacted files are not backed up.
- Point-in-time backup: Companies can achieve a better RPO, as backing up from the incremental backup folder is a continuous process and, hence, versioning can be done more granularly.
Cons:
- Space management at the primary cluster: With this option, companies need a reliable space management process on the primary cluster. The incremental backup folder must be emptied after being backed up. Failure to do so may cause serious space issues on the primary cluster.
- Creates lots of small files in the backup: Since incremental backup captures the new files created by flushing in-memory data from Cassandra, the files are typically small. This leads to many small files, which makes file management and recovery nontrivial (and fairly expensive relative to the RTO guarantees customers want and expect).
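As a rough illustration of the copy-then-delete routine for incremental backups, here is a minimal Python sketch. It assumes incremental backups are enabled (`incremental_backups: true` in cassandra.yaml) and that the hard links appear in a backups folder under each table directory; the function name and destination layout are hypothetical.

```python
import shutil
from pathlib import Path


def collect_incremental_files(data_dir, dest):
    """Copy files out of every <keyspace>/<table>/backups folder, then delete them."""
    collected = []
    for src in sorted(Path(data_dir).glob("*/*/backups/*")):
        if not src.is_file():
            continue
        keyspace_name, table_name = src.parts[-4], src.parts[-3]
        target_dir = Path(dest) / keyspace_name / table_name
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / src.name
        shutil.copy2(src, target)  # copy2 keeps the file's modification time
        src.unlink()  # remove the hard link only after the copy succeeded
        collected.append(target)
    return collected
```

Run frequently, a job like this keeps the backup folder on the primary cluster small, but the backup store still accumulates many small files, which is the file management cost noted above.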
Incremental Backup in Combination with the Snapshot
Incremental backup in combination with snapshots is another rudimentary way to take Cassandra point-in-time backups. Periodically, data is backed up from the snapshot, while incremental backup-based files are used for point-in-time needs. This means that incremental backup files already covered by a snapshot-based backup must be deleted periodically.
Pros:
- Fairly large backup files: Only the data written between periodic snapshots comes from the incremental backup.
- Point in time: It provides point-in-time backup and restore.
Cons:
- Space management in backup: After every snapshot, the incrementally backed-up data needs to be cleaned up. Also, for files backed up from the snapshot, the large backup storage footprint caused by compaction at the primary cluster is always a concern.
- Operationally very heavy: This method requires DBAs to script their own solutions, which does not scale to enterprise deployments.
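The cleanup step after each snapshot is the kind of script DBAs end up writing. A minimal sketch, assuming the incremental files in the backup store keep their original modification times and that the operator records when each snapshot was taken (both are assumptions, and the function name is hypothetical):

```python
import os
from pathlib import Path


def prune_covered_incrementals(incremental_root, snapshot_epoch):
    """Delete stored incremental files last modified at or before the snapshot time.

    snapshot_epoch is the snapshot time in seconds since the Unix epoch.
    """
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(Path(incremental_root).rglob("*")):
        if path.is_file() and os.stat(path).st_mtime <= snapshot_epoch:
            path.unlink()
            removed.append(path)
    return removed
```

Using modification time as a stand-in for "already covered by the snapshot" is a simplification; a production job would need a more reliable record of which files each snapshot contains.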
Commit-Log Backup in Combination with the Snapshot
This method is similar to incremental backup. Rather than backing up the newly added SSTables, DBAs archive the commit log. Periodic snapshot-based backup provides the bulk of the backup data, while the archived commit log is used for the point-in-time backup.
Pros:
- Point in time: Snapshot-based backups in combination with commit-log archival provide a point-in-time backup option.
Cons:
- Space management: Again, managing the space used by commit-log archival is the admin’s responsibility, which makes this an operationally heavy solution (OpEx).
- Restore complexity: The restore is a lot more complex, as part of the restore happens through commit-log replay.
- Storage overhead: Snapshot-based backup overuses storage because compaction duplicates data across backups. This results in very heavy use of secondary storage and, thus, very high CapEx.
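Commit-log archiving is configured through Cassandra's commitlog_archiving.properties file. The Python sketch below only renders the text of such a file, to show which settings are involved in archiving and in a point-in-time restore. The helper name, the use of /bin/cp and the archive directory are assumptions; check the exact semantics of each setting (including the time zone of the restore timestamp) against the documentation for your Cassandra version.

```python
from datetime import datetime


def archiving_properties(archive_dir, restore_to=None):
    """Render commitlog_archiving.properties content as a string.

    archive_dir: directory that receives archived commit-log segments.
    restore_to:  optional datetime to replay up to during a restore.
    """
    # %path and %name are placeholders Cassandra fills in for each segment.
    lines = [f"archive_command=/bin/cp %path {archive_dir}/%name"]
    if restore_to is not None:
        stamp = restore_to.strftime("%Y:%m:%d %H:%M:%S")
        lines += [
            # %from and %to are filled in by Cassandra during the replay.
            "restore_command=/bin/cp -f %from %to",
            f"restore_directories={archive_dir}",
            f"restore_point_in_time={stamp}",
        ]
    return "\n".join(lines) + "\n"


print(archiving_properties("/backup/commitlog", datetime(2024, 1, 15, 10, 30, 0)))
```

The restore settings would normally be filled in only on the nodes being recovered, after the snapshot files are back in place, which is part of what makes this restore path more involved.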
Overall, Cassandra provides multiple options for backup. Typically, snapshot-based backup in combination with commit-log archiving or incremental backup is used if point-in-time restore is a requirement. If it is not, simple snapshot-based backup is much easier to implement, though it is not a true backup and will result in heavy repairs.
Most of these options require significant manual effort and are error prone. They work for staging or preproduction and test/dev environments that are smaller in scale and can tolerate data loss and high RTO. However, large enterprise customers and production deployments have key requirements such as data reduction via novel compression and de-duplication techniques, handling of various failures and faster recovery with the desired consistency.
For mission-critical, always-on applications where minimal data loss and operational resiliency are important factors, administrators should consider leveraging next-generation data protection products from companies that specialize in this area.