Cassandra core concepts broken down

The first of many (educational) Cassandra posts to come.

I am new to Cassandra, but definitely not new to databases or to the kinds of services that make Cassandra so interesting. As I learn more, I will create more posts to expand upon everything Cassandra. I have already learned far more than what is in this post and am very excited about sharing this knowledge… so look out for future posts!

There is a great deal to look at under the hood with Cassandra, but let's start with a good foundation.  In this post I will focus on a breakdown of Cassandra’s core concepts.

So what is Cassandra? The wiki page states that:

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients.

This description sounds fantastic, but what does it actually mean?  Let's try to break it down into easily digestible chunks.  Let's first address each of the industry phrases in that description with respect to Cassandra.  I will also touch upon the “Data model” concept at the end.

Distributed Database:

– This means that data and processes are spread across more than one machine.

Great! So the data and processes are spread across many machines, but why do we care?  We care because this means that no single machine (or “node”, as it is referred to in Cassandra lingo) holds all the data or handles all the requests.  Conceptually, it's similar to a load-balancing mechanism.  Each node is typically configured to house a portion of the data (the distributed pieces are loosely referred to here as “chunks”; Cassandra's own terms are partitions and token ranges).  Additionally, requests are by design routed according to how the data was distributed.
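To make the idea of "each node houses a portion of the data" a bit more concrete, here is a minimal sketch of a hash ring in Python. It is only an illustration of the general technique, not Cassandra's actual partitioner: the `TokenRing` class, the node names, and the use of MD5 as the hash are all assumptions made for this example.

```python
import bisect
import hashlib


class TokenRing:
    """Toy hash ring: each node owns the range of tokens ending at its own token."""

    def __init__(self, nodes):
        # Derive one token per node by hashing its name (an illustrative choice).
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        # MD5 is used only as a stable, well-spread hash, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, partition_key):
        # Walk clockwise to the first node token >= the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical node names and keys, purely for illustration.
ring = TokenRing(["node-a", "node-b", "node-c"])
placement = {key: ring.node_for(key) for key in ("user:1", "user:2", "user:3", "user:4")}
for key, node in placement.items():
    print(key, "->", node)
```

The point of the sketch is that any client or node can compute where a key lives from the key alone, so no central lookup table is needed.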

Now we are getting somewhere.  If your data or processing grows too large for your environment, all you have to do is add more nodes.  Likewise, if you need more parallelism, just add more nodes.

A distributed database architecture, built and configured correctly, also means that if a Cassandra node becomes unreachable, the service itself remains intact and usable.  This type of structure is said to have no single point of failure.

Finally, if a distributed system is well designed, it will scale well to n nodes. Cassandra is one of the best examples of such a system. Its performance scales almost linearly as data and new nodes are added. This means Cassandra can handle a very large amount of data without the steep performance degradation that many data storage solutions show as they grow.

High Availability:

A highly available system is one where, if there is a failure, the client will not notice any disruption.  This is usually achieved by having some sort of redundancy in place, such as additional servers, clusters, or data centers.

Multiple data centers:

First, the term “data center” typically refers to a physical location where servers live. With Cassandra, however, the term is used a bit differently.  Data centers, or DCs, are more of a designation for a logical grouping of nodes with a shared purpose.  So you could actually have multiple DCs in one physical data center.

Moving forward, having multiple DCs indicates, more often than not, that data is being replicated between the different DCs.  Reasons for having multiple DCs include, but are not limited to, replication, quicker regional access, and separation of data processing. With older data storage solutions, replication is typically difficult on many levels; with Cassandra it is a fairly trivial operation.
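As a rough illustration of how little is involved, replication across DCs is declared per keyspace in CQL using the `NetworkTopologyStrategy` replication class, with one replica count per DC. The Python helper below only builds that statement as a string; the function name `create_keyspace_cql`, the keyspace name, and the DC names are hypothetical, and in a real cluster the DC names would have to match the ones the cluster itself reports.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that places replicas in each named DC."""
    options = {"class": "NetworkTopologyStrategy", **replicas_per_dc}
    # CQL map literals use single-quoted string keys; replica counts are integers.
    rendered = ", ".join(
        f"'{name}': {value}" if isinstance(value, int) else f"'{name}': '{value}'"
        for name, value in options.items()
    )
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{rendered}}};"


# Hypothetical keyspace and DC names: three copies in one DC, two in the other.
statement = create_keyspace_cql("shop", {"dc_east": 3, "dc_west": 2})
print(statement)
```

The resulting statement would then be run through whatever CQL client you use; nothing here connects to a cluster.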

Replication:

Cassandra’s replication service is extremely powerful.  The replication architecture is referred to as masterless, which means, yep you guessed it, it has no master.  There is no slave concept either.

Replication in Cassandra is also configurable so that n + m nodes hold a copy of the data, but only m of them need to acknowledge a write before it is reported as successful.  This configuration allows for very fast responses, especially when replication is global.  Another helpful feature is that replication to the remaining nodes is done asynchronously, further decreasing the latency of confirming that data has been written.
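In Cassandra's own vocabulary, the total number of copies (n + m) is the replication factor, and the number of acknowledgements to wait for (m) is set by the consistency level of the request. The sketch below is a small simulation of why waiting for only m replicas is faster; it does not talk to Cassandra, and the function names and latency figures are assumptions chosen for illustration.

```python
def write_latency_ms(replica_latencies_ms, acks_required):
    """Latency the client sees: the time until the m-th fastest replica has acknowledged."""
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[acks_required - 1]


def quorum(replica_count):
    """Smallest majority of the replicas."""
    return replica_count // 2 + 1


# Hypothetical round-trip times: two local replicas and one in a remote data center.
latencies = [4, 6, 180]

wait_for_one = write_latency_ms(latencies, 1)
wait_for_quorum = write_latency_ms(latencies, quorum(len(latencies)))
wait_for_all = write_latency_ms(latencies, len(latencies))

print("one replica:", wait_for_one, "ms")
print("quorum:     ", wait_for_quorum, "ms")
print("all:        ", wait_for_all, "ms")
```

With these made-up numbers, only the "wait for every replica" case pays for the slow cross-DC round trip; the other two return as soon as the nearby replicas answer, while the remote copy is still written in the background.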

Data Model Introduction:

Cassandra has a three-container data model hierarchy, one within another.  Here they are, with their RDBMS counterpart terms, starting with the outermost and working our way in:

  • A keyspace is equivalent to a database
  • A column family is equivalent to a table
  • Finally, columns, which are still reminiscent of RDBMS column names but are created and accessed a bit differently. Visually, columns appear to be contained within a fourth container called a row, and each row is still identified by a unique key, similar to a typical RDBMS primary key.
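One way to picture this nesting is as dictionaries inside dictionaries. The sketch below is only a mental model, not how Cassandra stores data on disk, and the column family, row keys, and column names are made up for the example.

```python
# A single keyspace, modeled as: column family -> row key -> {column name: value}
keyspace = {
    "users": {  # column family (the RDBMS "table")
        "alice": {"email": "alice@example.com", "city": "Oslo"},
        "bob": {"email": "bob@example.com"},  # in this model, rows may hold different columns
    }
}


def get_column(ks, column_family, row_key, column, default=None):
    """Look up one column value by walking the containers from the outside in."""
    return ks.get(column_family, {}).get(row_key, {}).get(column, default)


alice_city = get_column(keyspace, "users", "alice", "city")
bob_city = get_column(keyspace, "users", "bob", "city", default="(not set)")
print(alice_city)
print(bob_city)
```

Note how every lookup starts from the row key, which mirrors the role the unique key plays in the list above.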

So, this wraps up the core concepts of Cassandra at a very high level. I am hoping to turn this into a full set of posts that cover Cassandra at all depths.

Hope you enjoyed this post and perhaps learned something.  If you find any of the information incorrect and/or out of date, then please comment.

Thank you
