Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
The primary objective of a NoSQL database is to have
- simplicity of design,
- horizontal scaling, and
- finer control over availability.
These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
Features of Cassandra
Elastic scalability – Cassandra is highly scalable; it lets you add more hardware to accommodate more customers and more data as required.
Always on architecture – Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance – Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage – Cassandra accommodates structured, semi-structured, and unstructured data. It can dynamically accommodate changes to your data structures as your needs change.
Cassandra Query Language shell (cqlsh)
Using cqlsh, you can
- define a schema,
- insert data, and
- execute a query
A keyspace in Cassandra is a namespace that defines how data is replicated across nodes. A cluster typically contains one keyspace per application. A keyspace is created with the CREATE KEYSPACE statement.
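As a minimal sketch of the statement's shape, the Python helper below assembles a CREATE KEYSPACE statement as text. The helper name `create_keyspace_cql`, the keyspace name `demo`, and the replication factor of 3 are illustrative assumptions; the helper only formats a string and never contacts a cluster.

```python
def create_keyspace_cql(name: str, replication_factor: int = 3) -> str:
    """Return a CREATE KEYSPACE statement that uses SimpleStrategy.

    Hypothetical helper for illustration: it only formats text.
    """
    if not name.isidentifier():
        raise ValueError(f"unsafe keyspace name: {name!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # The printed statement is what you would type at the cqlsh prompt.
    print(create_keyspace_cql("demo"))
```

SimpleStrategy is chosen here only because it needs a single parameter; multi-data-center clusters would normally name a different replication class.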
When you start a Cassandra cluster, a partitioner distributes data across the nodes in the cluster based on the row key. Each node holds one or more tokens, which determine the node's position in the ring and the range of data it owns.
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are stored on disk sequentially and maintained for each Cassandra table.
A Bloom filter is an off-heap structure associated with each SSTable that checks whether any data for the requested row exists in the SSTable before doing any disk I/O.
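The idea of checking for a row before touching the disk can be sketched with a small Bloom filter. This is a teaching sketch, not Cassandra's implementation: the class name `BloomFilter`, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent", so the disk read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    sstable_filter = BloomFilter()
    for row_key in ("user:1", "user:2", "user:3"):
        sstable_filter.add(row_key)
    # A key that was added is always reported as possibly present.
    print(sstable_filter.might_contain("user:2"))
    # A key that was never added is usually, but not always, rejected.
    print(sstable_filter.might_contain("user:999"))
```

The trade-off is memory against accuracy: more bits per key lower the false-positive rate, at the cost of a larger structure per SSTable.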
Key components for configuring Cassandra
Gossip is a peer-to-peer communication protocol that nodes use to discover and share location and state information about the other nodes in a Cassandra cluster. Gossip information is also persisted locally by each node so that it can be used immediately when a node restarts.
A partitioner determines how to distribute the data across the nodes in the cluster and which node to place the first copy of data on. Basically, a partitioner is a hash function for computing the token of a partition key. Each row of data is uniquely identified by a partition key and distributed across the cluster by the value of the token. The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
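The mapping from partition key to token to node can be sketched as follows. SHA-256 stands in for Murmur3 (which is not in the Python standard library), and the 64-bit `RING_SIZE`, the function names, and the node names are illustrative assumptions rather than Cassandra's actual values.

```python
import bisect
import hashlib

RING_SIZE = 2**64  # illustrative token space, not Cassandra's actual range


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (SHA-256 stands in for Murmur3)."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def owner_of(token: int, node_tokens: dict) -> str:
    """Return the node whose token is the first one at or after `token`."""
    ring = sorted(node_tokens)
    index = bisect.bisect_left(ring, token) % len(ring)  # wrap around the ring
    return node_tokens[ring[index]]


if __name__ == "__main__":
    # Three hypothetical nodes spaced evenly around the ring.
    nodes = {0: "node-a", RING_SIZE // 3: "node-b", 2 * RING_SIZE // 3: "node-c"}
    for key in ("alice", "bob", "carol"):
        token = token_for(key)
        print(key, token, owner_of(token, nodes))
```

Because the token depends only on the partition key, any node can compute where a row's first copy lives without consulting a central directory.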
A snitch defines groups of machines into data centers and racks (the topology) that the replication strategy uses to place replicas.
You must configure a snitch when you create a cluster. All snitches use a dynamic snitch layer, which monitors performance and chooses the best replica for reading. It is enabled by default and recommended for use in most deployments. Configure dynamic snitch thresholds for each node in the cassandra.yaml configuration file.
Consistent hashing distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed.
With virtual nodes (vnodes), you no longer have to calculate and assign tokens to each node.
Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster.
If a node fails, the load is spread evenly across other nodes in the cluster.
Rebuilding a dead node is faster because it involves every other node in the cluster.
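The points above can be illustrated with a small consistent-hash ring in which every node holds many tokens. This is a sketch under stated assumptions: the class name `VnodeRing`, the 64 tokens per node, and deriving tokens by hashing the node name are all illustrative choices, and replication is left out entirely.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class VnodeRing:
    """Consistent-hash ring where each node holds several tokens (vnodes)."""

    def __init__(self, vnodes_per_node: int = 64) -> None:
        self.vnodes_per_node = vnodes_per_node
        self.tokens = []   # sorted token positions
        self.owner = {}    # token -> node name

    def add_node(self, node: str) -> None:
        # Tokens are derived from the node name here purely for repeatability.
        for i in range(self.vnodes_per_node):
            token = _hash(f"{node}#vnode{i}")
            self.owner[token] = node
            bisect.insort(self.tokens, token)

    def node_for(self, key: str) -> str:
        index = bisect.bisect_left(self.tokens, _hash(key)) % len(self.tokens)
        return self.owner[self.tokens[index]]


if __name__ == "__main__":
    ring = VnodeRing()
    for name in ("node-a", "node-b", "node-c"):
        ring.add_node(name)
    keys = [f"row-{i}" for i in range(10_000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("node-d")
    after = {k: ring.node_for(k) for k in keys}
    moved = [k for k in keys if before[k] != after[k]]
    # Only a fraction of the keys move, and every moved key lands on node-d.
    print(len(moved), all(after[k] == "node-d" for k in moved))
```

Since the new node's tokens are scattered around the ring, the keys it takes over come from all of the existing nodes rather than from a single neighbor.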
A partitioner determines how data is distributed across the nodes in the cluster (including replicas). Basically, a partitioner is a function for deriving a token representing a row from its partition key, typically by hashing. Each row of data is then distributed across the cluster by the value of the token.
Both the Murmur3Partitioner and RandomPartitioner use tokens to help assign equal portions of data to each node and evenly distribute data from all the tables throughout the ring or other grouping, such as a keyspace.
Moving Data from other Databases
The COPY command mirrors what the PostgreSQL RDBMS uses for file export/import.
The Cassandra bulk loader provides the ability to bulk load external data into a cluster.
Time-to-live (TTL) is an optional expiration period, in seconds, for values inserted into a column.
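The expiration behavior can be sketched with a value that becomes unreadable once its TTL has elapsed. This models only the visible effect, not how Cassandra stores or removes expired data; the class name `TtlColumn` and the injectable clock are assumptions for illustration. In CQL the period is supplied with a `USING TTL` clause on the write.

```python
import time
from typing import Callable, Optional


class TtlColumn:
    """A single value that stops being readable once its TTL has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at: Optional[float] = None

    def write(self, value: str, ttl_seconds: Optional[float] = None) -> None:
        self._value = value
        # No TTL means the value never expires.
        self._expires_at = (
            None if ttl_seconds is None else self._clock() + ttl_seconds
        )

    def read(self) -> Optional[str]:
        if self._expires_at is not None and self._clock() >= self._expires_at:
            return None  # expired: treated as if it had been deleted
        return self._value


if __name__ == "__main__":
    now = [1000.0]  # fake clock so the demo is deterministic
    column = TtlColumn(clock=lambda: now[0])
    column.write("session-token", ttl_seconds=60)
    print(column.read())  # still readable
    now[0] += 61
    print(column.read())  # expired, so nothing is returned
```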
Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas.
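How many replicas must respond for a read to reflect the latest write can be worked out with simple arithmetic. The sketch below assumes the usual majority rule (a quorum is half the replicas, rounded down, plus one) and the overlap rule that reads see the latest write when the read and write replica counts together exceed the replication factor; the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                                   # 2
    print(reads_see_latest_write(q, q, rf))    # True: quorum reads and writes overlap
    print(reads_see_latest_write(1, 1, rf))    # False: a read may miss the write
```

Lower replica counts give faster operations at the cost of possibly stale reads, which is the trade-off behind eventual consistency.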
Backup and Restoring Data
Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.
You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online.
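The three snapshot scopes can be sketched as argument lists for the nodetool utility. The helper below only builds the command and does not run it; the function name `snapshot_command` and the example names are hypothetical, and the `--tag` and `--table` spellings are assumed from recent nodetool versions, so check `nodetool help snapshot` on your installation.

```python
from typing import List, Optional


def snapshot_command(keyspace: Optional[str] = None,
                     table: Optional[str] = None,
                     tag: Optional[str] = None) -> List[str]:
    """Build (but do not run) a nodetool snapshot argument list."""
    if table and not keyspace:
        raise ValueError("a table snapshot also needs its keyspace")
    args = ["nodetool", "snapshot"]
    if tag:
        args += ["--tag", tag]       # label for the snapshot directory
    if table:
        args += ["--table", table]   # restrict to a single table
    if keyspace:
        args.append(keyspace)        # omit to snapshot every keyspace
    return args


if __name__ == "__main__":
    print(snapshot_command())                                   # all keyspaces
    print(snapshot_command("demo", tag="nightly"))              # one keyspace
    print(snapshot_command("demo", table="users", tag="t1"))    # one table
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.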
- The nodetool utility: a command-line interface for managing a Cassandra cluster.
- The Cassandra bulk loader (sstableloader): bulk loads external data into a cluster.
- The cassandra utility: starts the Cassandra Java server process.
- The cassandra-stress tool: a Java-based stress testing utility for benchmarking and load testing a Cassandra cluster.
- The sstablescrub utility: scrubs all the SSTables for the specified table.
- The sstablesplit utility
- The sstablekeys utility: dumps table keys.
- The sstableupgrade tool: upgrades the SSTables in the specified table (or snapshot) to match the current version of Cassandra.