Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data, providing high availability with no single point of failure. It is a type of NoSQL database, so let us first look at what a NoSQL database is.
A NoSQL database provides a mechanism to store and retrieve data that is modeled in ways other than the tabular relations used in relational databases. These databases are typically schema-free, support easy replication, are eventually consistent, and can handle huge amounts of data.
Put another way, a NoSQL environment is a simple, non-relational, large distributed system that enables rapid, ad hoc organization and analysis of extremely high-volume, disparate data types.
NoSQL databases have become the first alternative to relational databases, with the characteristics mentioned above, such as scalability, availability, and fault tolerance, being the key deciding factors. These databases have also become more widely understood relative to legacy relational databases. Further, they support the requirements of cloud applications.
Relational Database vs. NoSQL
What is Apache Cassandra?
Apache Cassandra is the right database when you need scalability and high availability (no single point of failure) without compromising performance. It is an open source, distributed, and decentralized storage system (database).
Cassandra’s technical roots can be found at companies recognized for their ability to manage big data effectively: Google, Amazon, and Facebook. Facebook open-sourced Cassandra in 2008, and the project entered the Apache Incubator in 2009.
The Architecture of Cassandra
The architecture of Cassandra greatly contributes to its ability to scale, perform, and offer continuous availability. Cassandra was built from scratch with the understanding that hardware and system failures can and do occur. This leads Cassandra to manage and protect data differently than a traditional RDBMS.
Cassandra has a peer-to-peer distributed architecture that is easy to set up and maintain. All nodes in Cassandra are the same: there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.
Its scalable architecture means that Cassandra can handle petabytes of information and thousands of concurrent users and operations per second across multiple data centers. Further, unlike master-slave or sharded systems, Cassandra has no single point of failure and is therefore capable of offering continuous availability.
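To make the idea of gossip more concrete, here is a minimal Python sketch of how cluster state can spread between equal peers without a master. It is a toy simulation, not Cassandra's actual wire protocol: the function names, the heartbeat counters, and the "one random peer per round" rule are all assumptions chosen for illustration.

```python
import random


def gossip_round(views, rng):
    """One round: every node exchanges and merges state with one random peer."""
    nodes = list(views)
    for node in nodes:
        views[node][node] += 1  # bump this node's own heartbeat version
        peer = rng.choice([n for n in nodes if n != node])
        known = views[node].keys() | views[peer].keys()
        # Both sides keep the highest heartbeat version seen for each node.
        merged = {
            n: max(views[node].get(n, 0), views[peer].get(n, 0)) for n in known
        }
        views[node] = dict(merged)
        views[peer] = dict(merged)


def simulate(node_count=8, seed=0, max_rounds=1000):
    """Run rounds until every node knows about every other node."""
    rng = random.Random(seed)
    names = [f"node{i}" for i in range(node_count)]
    # Initially each node only knows its own heartbeat.
    views = {n: {n: 1} for n in names}
    rounds = 0
    while rounds < max_rounds and any(len(v) < node_count for v in views.values()):
        gossip_round(views, rng)
        rounds += 1
    return rounds, views


if __name__ == "__main__":
    rounds, _ = simulate()
    print("rounds until every node knows the full cluster:", rounds)
```

Because each exchange merges two views, knowledge of the cluster spreads quickly even though no node plays a coordinating role.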
Distributing and Replicating Data
Cassandra provides automatic data distribution across all nodes that participate in a database cluster (ring). A developer or administrator does not have to do anything programmatically to distribute data across a cluster. Data is transparently partitioned across all nodes in either a randomized or an ordered fashion.
Cassandra also provides built-in, customizable replication, which stores redundant copies of data across the nodes that participate in a Cassandra cluster. This means that if any node in a cluster goes down, one or more copies of that node’s data are available on other machines in the cluster.
Configuring replication is not as complicated as it is in an RDBMS. A developer or administrator simply indicates how many copies of the data are needed, and Cassandra takes care of the rest. Replication options are also provided that allow data to be stored automatically in different physical racks.
Cassandra also facilitates replicating data across many different cloud platforms. A developer or administrator can implement a single Cassandra cluster that spans a cloud environment. When creating a new Cassandra database (keyspace), we simply indicate through a single command which data centers will hold copies of the new database; everything from that point forward is maintained automatically by Cassandra.
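The randomized partitioning described above can be pictured as a ring of tokens. The following Python sketch is only an illustration under stated assumptions: the TokenRing class, the single token per node, and the use of MD5 as the hash are choices made for this example rather than a description of Cassandra's internals.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the range ending at its token."""

    def __init__(self, nodes):
        # One token per node, derived from its name (an assumption for the sketch).
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        # MD5 is used only as a stable, well-spread hash for illustration.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key):
        # First node whose token is >= the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c"])
    for key in ["user:1", "user:2", "user:3"]:
        print(key, "->", ring.owner(key))
```

The client never chooses a node: the hash of the partition key alone decides where a row lives, which is why no programmatic work is needed to spread the data.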
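As a rough illustration of "indicate how many copies and let the database place them", the Python sketch below picks replicas by walking a ring clockwise and preferring nodes on racks that do not yet hold a copy. The RING layout, the node and rack names, and the replicas_for function are hypothetical, and the placement rule is a simplification rather than Cassandra's exact strategy.

```python
import bisect

# Hypothetical ring: (token, node, rack), sorted by token.
RING = [
    (0, "n1", "rack1"),
    (25, "n2", "rack1"),
    (50, "n3", "rack2"),
    (75, "n4", "rack2"),
]


def replicas_for(token, replication_factor, ring=RING):
    """Walk the ring clockwise from the token, preferring unseen racks."""
    tokens = [t for t, _, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen, seen_racks, skipped = [], set(), []
    for _, node, rack in walk:
        if len(chosen) == replication_factor:
            break
        if rack in seen_racks:
            skipped.append(node)  # same rack as an earlier replica
        else:
            chosen.append(node)
            seen_racks.add(rack)
    # Fall back to skipped nodes when there are fewer racks than replicas.
    chosen.extend(skipped[: replication_factor - len(chosen)])
    return chosen


if __name__ == "__main__":
    print(replicas_for(10, 2))  # ['n2', 'n3']
    print(replicas_for(10, 3))  # ['n2', 'n3', 'n4']
```

With two copies requested, the second copy lands on a different rack, so losing one rack does not lose the data.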
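The "single command" mentioned above is a CQL CREATE KEYSPACE statement. The small Python helper below only builds such a statement as text; the helper name, the keyspace name, and the data center names are assumptions for illustration, and the sketch does no validation or escaping of its inputs.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per data center: its name and the number of copies to keep there.
    parts += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    options = ", ".join(parts)
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {" + options + "};"
    )


if __name__ == "__main__":
    # Hypothetical keyspace with 3 copies in one data center and 2 in another.
    print(create_keyspace_cql("shop", {"dc_east": 3, "dc_west": 2}))
```

The resulting statement could be run through cqlsh or a driver session; after that, placing and maintaining the copies is left to Cassandra.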
Reading and Writing Data
When it comes to reading and writing data, Cassandra behaves as a “location independent” database. This means any node in a cluster may be read from or written to, which translates into a true read/write-anywhere design.
When data is written to Cassandra, it is first written to a commit log to ensure full data durability and safety. Data is also written to an in-memory structure called a ‘memtable’, which is eventually flushed to a disk structure called an ‘SSTable’ (sorted string table).
If one or more nodes responsible for a particular set of data are down, the data is written to another node, which holds it temporarily. Once the failed node is back online, the data is automatically updated on the correct node. Reads are performed in parallel across the cluster. A user can request data from any node, which becomes that user’s coordinator node, and the query result is assembled from one or more nodes holding the necessary data. If a node holding the data is down, Cassandra simply requests the data from another node that holds a replicated copy.
As we know, an RDBMS offers ACID transactions, whereas Cassandra offers the “AID” portion of ACID: data that is written is atomic, isolated, and durable.
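The order of operations on the write path can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the WritePathSketch class, its memtable_limit parameter, and the use of plain lists and dicts in place of on-disk files are all inventions for illustration.

```python
class WritePathSketch:
    """Toy model of commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable, key-sorted snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then update memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. persist the memtable as a sorted, immutable table; start a new one
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    store = WritePathSketch(memtable_limit=2)
    store.write("b", 1)
    store.write("a", 2)   # reaches the limit and triggers a flush
    store.write("a", 3)   # newer value stays in the memtable
    print(store.read("a"), store.read("b"), len(store.sstables))
```

Writes only append to the log and update memory, which is one reason they are cheap; the sorted tables are written later in bulk.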
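The behaviour described above, holding a write for a node that is down and reading from another replica in the meantime, can be sketched as follows. The Node class and the three functions are hypothetical names for this example, and the model ignores timestamps and consistency levels, which a real cluster would use to decide which value wins.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_node, key, value) kept for replicas that were down


def coordinator_write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each replica that is down."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            coordinator.hints.append((replica, key, value))


def replay_hints(coordinator):
    """Deliver stored hints to replicas that have come back online."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


def coordinator_read(replicas, key):
    """Return the value from the first live replica that has the key."""
    for replica in replicas:
        if replica.up and key in replica.data:
            return replica.data[key]
    return None


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    b.up = False
    coordinator_write(a, [b, c], "k", "v")   # b is down, so a keeps a hint
    print(coordinator_read([b, c], "k"))      # served by c
    b.up = True
    replay_hints(a)                           # b catches up
    print(b.data)
```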
Performance of Cassandra
Cassandra delivers high performance for both reads and writes, and that performance scales linearly as new nodes are added to a cluster.
Cassandra’s high performance was demonstrated at the “2011 High Performance Transactional System Workshop”, which showed both its ease of use and its linear performance capabilities. In an academic benchmark paper presented at the 2012 “Very Large Databases” conference in Istanbul, a team of performance engineers benchmarked Cassandra along with a number of other NoSQL and SQL databases. In comparing HBase with Cassandra, they found that Cassandra has:
10x more read throughput
8x faster read latency (up to 100x faster)
8x more write throughput
10x slower write latency (with the default configuration; that is, no write durability for HBase)
8x faster scan latency
4x more scan throughput
Managing and Monitoring Cassandra
Cassandra is largely self-managing; even so, there are many administration and monitoring tasks to carry out on the database. Most of these operations can be performed using DataStax OpsCenter.
DataStax OpsCenter is a web-based visual management and monitoring solution for Cassandra and other big data platforms.
Author: Shaham Jiffry is a Senior Database Tech Lead/Team Lead at CMS - Remote Technology Center of Bluecorp, with years of experience in technologies like SQL Server, Cassandra, Aerospike, Postgres, Redshift, Neptune, and numerous other SQL and NoSQL databases.