Table of Contents
Cassandra is another no-sql database. Similar to Hbase it is also distributed column-oriented database to handle big data workloads across multiple nodes but it can support both Local File system and HDFS, whereas in Hbase the underlying file system is also HDFS.
It overcomes single point of failure by using a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. Each node frequently exchanges state information about itself and other nodes across the cluster using peer-to-peer gossip communication protocol.
Commit Log & MemTable
Commit log on each node will maintain write activity to ensure data durability. It will be sequentially written. Data is then indexed and written to an in-memory structure, called a memTable, which resembles a write-back cache. This is similar to memstore or WAL (Write Ahead Log) in Hbase.
Once memTable is full, the data is written to an SSTable (sorted string table) data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates SSTables using a process called compaction (This is same as major compaction in Hbase to apply delete markers on HFiles), discarding obsolete data marked for deletion with a tombstone. To ensure all data across the cluster stays consistent, various repair mechanisms are employed.
Cassandra is a partitioned row store database, where rows are organized into tables with a required primary key. Cassandra’s architecture allows any authorized user to connect to any node in any data centre and access data using the CQL language. Unlike Hbase commands, CQL language uses a similar syntax to SQL and works with table data. Developers can access CQL through cqlsh, DevCenter, and via JDBC/ODBC drivers for application languages like java, python etc. Typically, a cluster has one keyspace per application composed of many different tables.
Clients read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
Cassandra Key Components
- A peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. This is similar to hear-beat mechanism in HDFS to get the status of each node by the master.
- A peer-to-peer communication protocol to share location and state information about the other nodes in a Cassandra cluster. Gossip information is also stored locally by each node to use immediately when a node restarts.
- The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster.
- A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
- To prevent problems in gossip communications, use the same list of seed nodes for all nodes in a cluster. This is most critical the first time a node starts up. By default, a node remembers other nodes it has gossiped with between subsequent restarts. The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.
Note: In multiple data-center clusters, the seed list should include at least one node from each data center (replication group). More than a single seed node per data center is recommended for fault tolerance. Otherwise, gossip has to communicate with another data center when bootstrapping a node. Making every node a seed node is not recommended because of increased maintenance and reduced gossip performance. Gossip optimization is not critical, but it is recommended to use a small seed list (approximately three nodes per data center).
Data distribution and replication
How data is distributed and factors influencing replication.
In Cassandra, Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows.
Factors influencing replication include:
Virtual nodes (Vnodes):
Assigns data ownership to physical machines.
Before Cassandra 1.2, each node is assigned with a token. After Cassandra 1.2, each node is assigned with group of tokens as shown below
In the first half, each node is assigned a single token that represents a location in the ring. Each node stores data determined by mapping the partition key to a token value within a range from the previous node to its assigned value. Each node also contains copies of each row from other nodes in the cluster. For example, if the replication factor is 3, range E replicates to nodes 5, 6, and 1. Notice that a node owns exactly one contiguous partition range in the ring space.
The bottom half of diagram displays a ring with vnodes. Within a cluster, virtual nodes are randomly selected and non-contiguous. The placement of a row is determined by the hash of the partition key within many smaller partition ranges belonging to each node.
Replication strategy/data replication:
Determines the total number of replicas across the cluster.
We need to define the replication factor for each data center. Two replication strategies are available:
- SimpleStrategy: Use for a single data center only. SimpleStrategy places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering topology (rack or data center location).
- NetworkTopologyStrategy: Use NetworkTopologyStrategy when we have (or plan to have) our cluster deployed across multiple data centres. This strategy specifies how many replicas we want in each data center.
NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
- Two replicas in each data center: This configuration tolerates the failure of a single node per replication group and still allows local reads at a consistency level of ONE.
- Three replicas in each data center: This configuration tolerates either the failure of a one node per replication group at a strong consistency level of LOCAL_QUORUM or multiple node failures per data center using consistency level ONE.
Replication strategy is defined per keyspace, and is set during keyspace creation.
Partitions the data across the cluster.
A partitioner distributes the data across the nodes in the cluster and determines which node to place the first copy of data on. Basically, a partitioner is a hash function for computing the token of a partition key. Each row of data is uniquely identified by a partition key and distributed across the cluster by the value of the token. The read and write requests to the cluster are also evenly distributed and load balancing is simplified because each part of the hash range receives an equal number of rows on average.
- Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
- RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
- ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes
Note: If using virtual nodes (vnodes), we do not need to calculate the tokens. If not using vnodes, we must calculate the tokens to assign to the initial_token parameter in the cassandra.yaml file.
A snitch defines groups of machines into data centres and racks (the topology). We must configure a snitch when we create a cluster. All snitches use a dynamic snitch layer, which monitors performance and chooses the best replica for reading. It is enabled by default. Configure dynamic snitch thresholds for each node in the cassandra.yaml configuration file.
The default SimpleSnitch does not recognize data center or rack information. Use it for single-data center deployments or single-zone in public clouds. The GossipingPropertyFileSnitch is recommended for production. It defines a node’s data center and rack and uses gossip for propagating this information to other nodes.
There are many vtypes of snitches like dynamic snitching, simple snitching, RackInferringSnitch, PropertyFileSnitch, GossipingPropertyFileSnitch, Ec2Snitch, Ec2MultiRegionSnitch, GoogleCloudSnitch, CloudstackSnitch.