Cassandra write and read process


Storage engine

  • Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-Tree.
  • Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can stall read performance and cause other problems.
  • Cassandra never re-writes or re-reads existing data, and never overwrites the rows in place.

How is data written?

The write process in Cassandra has the following stages:

  • Logging data in the commit log.
  • Writing data to the memTable.
  • Flushing data from the memTable.
  • Storing data on disk in SSTables.

Logging writes and memTable storage: whenever a write request arrives from a client, the data is first written to the memTable (an in-memory store) and to the commit log on disk (which survives even if power fails on a node). The commit log receives every write made to a Cassandra node, and these durable writes survive permanently. The memTable stores writes until it reaches a configurable limit, and is then flushed.

Flushing data from the memTable: To flush the data, Cassandra sorts memTables by token and then writes the data to disk sequentially. A partition index is also created on the disk that maps the tokens to a location on disk. When the memTable content exceeds the configurable threshold, the memTable is put in a queue that is flushed to disk. If the data to be flushed exceeds the queue size, Cassandra blocks writes until the next flush succeeds. You can manually flush a table using nodetool flush.

The following command shows a manual flush using nodetool.
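For example (the keyspace and table names below are placeholders; substitute your own schema):

    nodetool flush my_keyspace my_table

Run without any arguments, nodetool flush flushes the memTables of all tables in all keyspaces.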

To reduce the commit log replay time, the recommended best practice is to flush the memTable before you restart the nodes. Commit log replay is the process of reading the commit log to recover lost writes in the event of interrupted operations.
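One common way to do this before a planned restart (one option, not the only one) is nodetool drain, which flushes all memTables and stops the node from accepting further writes:

    nodetool drain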

Storing data on disk in SSTables: MemTables and SSTables are maintained per table. SSTables are immutable, not written to again after the memTable is flushed. 


For each SSTable, Cassandra creates these structures:

  • Partition index: A list of partition keys and the start position of rows in the data file written on disk.
  • Partition summary: A sample of the partition index stored in memory.
  • Bloom filter: A structure stored in memory that is used to check whether a requested partition is likely to exist in a given SSTable before accessing that SSTable on disk.

Compaction:

What is compaction and why is it required for tables in Cassandra? Whenever an insert, update, or delete request occurs, instead of overwriting the rows, Cassandra writes a new timestamped version of the inserted or updated data in another SSTable. This happens because SSTables are immutable.

Compaction merges the data in each SSTable by partition key, selecting the latest data for storage based on its timestamp. Cassandra can merge the data performantly, without random IO, because rows are sorted by partition key within each SSTable. After removing deleted data, columns, and rows, the compaction process consolidates SSTables into a single file. The old SSTable files are deleted as soon as any pending reads finish using the files. Disk space occupied by old SSTables becomes available for reuse.


During compaction, there is a temporary spike in disk space usage and disk I/O because the old and new SSTables co-exist. Disk space occupied by old SSTables becomes available for reuse when the new SSTable is ready. Cassandra 2.1 and later improves read performance after compaction because of incremental replacement of compacted SSTables. Instead of waiting for the entire compaction to finish and then throwing away the old SSTable, Cassandra can read data directly from the new SSTable even before it finishes writing.

As data is written to the new SSTable and reads are directed to it, the corresponding data in the old SSTables is no longer accessed and is evicted from the page cache. This begins an incremental process of caching the new SSTable while directing reads away from the old one, avoiding a dramatic cache miss. Cassandra provides predictable high performance even under heavy load.

Types of compaction:

  • SizeTieredCompactionStrategy (STCS): Recommended for write-intensive workloads.
    • Pros: Compacts write-intensive workloads very well.
    • Cons: Might hold onto stale data too long. Amount of memory needed increases over time.

The SizeTieredCompactionStrategy (STCS) initiates compaction when a set number (default is 4) of similar sized SSTables have accumulated. Compaction merges the SSTables to create one larger SSTable. As larger SSTables accumulate, the same process occurs, merging the larger SSTables into an even larger SSTable. At any given time, several SSTables of varying sizes are present. While this strategy works quite well to compact a write-intensive workload, when reads are needed, several SSTables still must be retrieved to find all the data for a row.

There is no guarantee that a row’s data will be restricted to a small number of SSTables. Also, predicting the eviction of deleted data is uneven, because SSTable size is the trigger for compaction, and SSTables might not grow quickly enough to merge and evict old data. As the largest SSTables grow in size, the amount of memory needed for compaction to hold both the new and old SSTables simultaneously can outstrip a typical amount of RAM on a node.
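As a sketch, STCS is set per table with CQL; the table name below is hypothetical, and min_threshold corresponds to the default of 4 similar-sized SSTables mentioned above:

    ALTER TABLE my_keyspace.events
    WITH compaction = { 'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4 };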

  • LeveledCompactionStrategy (LCS):  Recommended for read-intensive workloads.
    • Pros: Memory requirements are simple to predict. Read operations more predictable in latency. Stale data is evicted more frequently.
    • Cons: Much higher I/O utilization that can impact operation latency.

The leveled compaction strategy creates SSTables of a fixed, relatively small size (160 MB by default) that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the previous. Disk I/O is more uniform and predictable on higher than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables. This can improve performance for reads, because Cassandra can determine which SSTables in each level to check for the existence of row key data.
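A similar hedged sketch for LCS (hypothetical table name; sstable_size_in_mb matches the 160 MB default noted above):

    ALTER TABLE my_keyspace.user_profiles
    WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160 };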

  • DateTieredCompactionStrategy (DTCS): Recommended for time series and expiring TTL workloads.
    • Pros: Memory requirements are simple to predict. Read operations more predictable in latency. Stale data is evicted more frequently.
    • Cons: Much higher I/O utilization that can impact operation latency.

The DateTieredCompactionStrategy (DTCS) acts similarly to STCS, but instead of compacting based on SSTable size, DTCS compacts based on SSTable age.
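And a sketch for DTCS on a time-series table (hypothetical names; default_time_to_live is a separate table property that sets a table-wide TTL in seconds, which suits the expiring-TTL workloads DTCS targets):

    ALTER TABLE my_keyspace.sensor_readings
    WITH compaction = { 'class': 'DateTieredCompactionStrategy' }
    AND default_time_to_live = 86400;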

How is data updated?

  • In Cassandra, inserting a row with a primary key that already exists is treated as an upsert.
  • An upsert writes a new record to the database if the data didn’t exist before. If the data for that primary key already exists, a new record is written with a more recent timestamp.
  • During the read process, only the most recent version is retrieved; older timestamped data is marked for deletion.
  • Eventually, the updates are streamed to disk using sequential I/O and stored in a new SSTable.
  • If multiple versions of the column exist in the memTable, Cassandra flushes only the newer version of the column to disk, as described in the Compaction section.
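To make the upsert behaviour concrete, here is a minimal CQL sketch (the users table and its columns are hypothetical). Both INSERT statements succeed; the second simply writes a newer timestamped version of the same row:

    CREATE TABLE users (
      user_id int PRIMARY KEY,
      name    text,
      email   text
    );

    INSERT INTO users (user_id, name, email) VALUES (1, 'Asha', 'asha@example.com');

    -- Same primary key: no error, the row is upserted with a newer timestamp.
    INSERT INTO users (user_id, name, email) VALUES (1, 'Asha', 'asha@new-example.com');

A subsequent SELECT returns only the newest version; the older version is removed during compaction.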

How is data deleted?

Cassandra deletes data differently than a relational database does. A relational database might spend time scanning through data looking for expired data and throwing it away or an administrator might have to partition expired data by month. Data in a Cassandra column can have an optional expiration date called TTL (time to live). Use CQL to set the TTL in seconds for data. 
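For example, using the hypothetical users table from the upsert sketch above, a TTL can be set per write with USING TTL (in seconds):

    -- The row expires 24 hours (86400 seconds) after the write.
    INSERT INTO users (user_id, name) VALUES (2, 'Ravi') USING TTL 86400;

    -- A TTL can also be applied to individual columns on update.
    UPDATE users USING TTL 3600 SET email = 'ravi@example.com' WHERE user_id = 2;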

Facts about deleted data to consider are:

  • Cassandra does not immediately remove data marked for deletion from disk. The deletion occurs during compaction.
  • If you use the SizeTieredCompactionStrategy or DateTieredCompactionStrategy, you can drop data immediately by manually starting the compaction process, as shown in the example after this list. Before doing so, understand the disadvantages of the process. If you force compaction, one potentially very large SSTable is created from all the data, and another compaction will not be triggered for a long time. The data in the SSTable created during the forced compaction can grow very stale during this long period of non-compaction.
  • Deleted data can reappear if you do not run repair routinely.
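A manual (major) compaction is started with nodetool; the keyspace and table names are placeholders, and as noted above the result is one large SSTable:

    nodetool compact my_keyspace my_table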

How is data read?

Cassandra processes data at several stages on the read path to discover where the data is stored, starting with the data in the memTable and finishing with SSTables:


  • Checks the memTable.
  • Checks the row cache, if enabled.
  • Checks the Bloom filter.
  • Checks the partition key cache, if enabled.
  • Goes directly to the compression offset map if a partition key is found in the partition key cache, or checks the partition summary if not.
    • If the partition summary is checked, then the partition index is accessed.
  • Locates the data on disk using the compression offset map.
  • Fetches the data from the SSTable on disk.


When a read request arrives, it goes through the following stages:

  • MemTable: 
    • If the memTable has the desired partition data, then the data is read and then merged with the data from the SSTables. The SSTable data is accessed as shown in the following steps.
  • Row Cache:
    • The row cache keeps a subset of rows in memory.
    • If the row cache is enabled, Cassandra checks it after the memTable; if the desired partition data is there, it is read from the row cache, potentially saving two seeks to disk (a CQL sketch for enabling caching appears after this list).
    • The rows stored in row cache are frequently accessed rows that are merged and saved to the row cache from the SSTables.
    • If the desired partition data is not found in the row cache, then the Bloom filter is checked.
  • Bloom Filter:
    • Cassandra checks the Bloom filter to discover which SSTables are likely to have the requested partition data.
    • Each SSTable has a Bloom filter associated with it.
    • The key property of a Bloom filter is that when it reports a record as present in a particular SSTable, the record may or may not actually be there (false positives are possible).
    • But if the Bloom filter reports a record as not present, the record is guaranteed not to be in that SSTable (no false negatives).
    • After this, the partition key cache is checked.
  • Partition Key Cache:
    • The partition key cache, if enabled, stores a cache of the partition index in off-heap memory.
    • The key cache uses a small, configurable amount of memory.
    • If a partition key is found in the key cache, Cassandra can go directly to the compression offset map to find the compressed block on disk that holds the data.
    • The partition key cache can greatly improve performance compared with cold-start reads.
    • If a partition key is not found in the key cache, then the partition summary is searched.
  • Partition Summary:
    • The partition summary is an off-heap in-memory structure that stores a sampling of the partition index.
    • A partition index contains all partition keys, whereas a partition summary samples every X keys and maps the location of every Xth key in the index file.
    • For example, if an SSTable contains partition keys 1 through 100 and the partition summary samples every 20 keys, it stores the location of key 1 (the beginning of the SSTable file), the location of key 20 in the file, and so on.
    • After finding the range of possible partition key values, the partition index is searched.
  • Partition Index:
    • The partition index resides on disk and stores an index of all partition keys mapped to their offset.
    • If the partition summary has been checked for a range of partition keys, now the search passes to the partition index to seek the location of the desired partition key.
    • A single seek and sequential read of the columns over the passed-in range is performed.
    • Using the information found, the partition index now goes to the compression offset map to find the compressed block on disk that has the data.
    • If the partition index must be searched, two seeks to disk will be required to find the desired data.
  • Compression offset map:
    • The compression offset map stores pointers to the exact location on disk where the desired partition data is found.
    • It is stored in off-heap memory and is accessed by either the partition key cache or the partition index.
    • The desired compressed partition data is fetched from the correct SSTable(s) once the compression offset map identifies the disk location. The query receives the result set.
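As referenced in the Row Cache step above, here is a hedged CQL sketch for enabling caching on a table (hypothetical table name). 'keys': 'ALL' caches all partition keys, and 'rows_per_partition' bounds how many rows of each partition go into the row cache; note that the row cache also needs memory assigned via row_cache_size_in_mb in cassandra.yaml, which defaults to 0 (disabled):

    ALTER TABLE my_keyspace.users
    WITH caching = { 'keys': 'ALL', 'rows_per_partition': '100' };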

How do write patterns affect reads?

  • It is important to consider how write operations will affect read operations in the cluster.
  • This depends on the type of compaction strategy in use.
  • Using the SizeTieredCompactionStrategy or DateTieredCompactionStrategy tends to cause data fragmentation when rows are frequently updated.
  • The LeveledCompactionStrategy was designed to prevent fragmentation under this condition.
