Cassandra is an open-source, distributed, masterless (no central node), elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented NoSQL database.

Cluster

  • Cluster
    • Data center(s)
      • Rack(s)
        • Server(s)
          • Node (more accurately, a vnode)

  • Node: a single running Cassandra instance
  • Rack: a set of nodes
  • Data center: a set of racks
  • Cluster: the full set of nodes, which maps onto one complete token ring

Coordinator

When a client connects to a node and issues a read or write request, that node acts as a bridge between the client application and the nodes that own the requested data; it is called the coordinator. Based on the cluster configuration, the coordinator determines which node(s) in the ring should serve the request.

When connecting via CQL (cqlsh), the node specified with -h acts as the coordinator.

  • Any node in a cluster can become a coordinator.
  • Each client request can be coordinated by different nodes.
  • The coordinator manages the replication factor (replication factor: how many nodes a piece of new data should be copied to).
  • The coordinator applies the consistency level (consistency level: how many nodes in the cluster must acknowledge a read or write).

Partitioner

The partitioner determines how data is distributed across the cluster. In Cassandra, each row of a table is identified by a unique primary key. The partitioner is essentially a hash function that computes a token from the primary key; Cassandra then places each row in the cluster according to that token value.

Cassandra provides three different partitioners:

  • Murmur3Partitioner (default) – distributes data evenly across the cluster based on MurmurHash hash values.
  • RandomPartitioner – distributes data evenly across the cluster based on MD5 hash values.
  • ByteOrderedPartitioner – preserves an ordered distribution of data by the lexical order of key bytes.
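The key-to-token mapping can be sketched in a few lines of Python. This is a rough illustration in the spirit of RandomPartitioner (MD5-based tokens); the function name and the exact token range reduction are illustrative, not Cassandra's actual implementation.

```python
import hashlib

RING_SIZE = 2 ** 127  # RandomPartitioner tokens fall in [0, 2**127)

def md5_token(partition_key: str) -> int:
    """Hash a partition key to a token, as RandomPartitioner does with MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# The same key always maps to the same token, so the owning node
# is deterministic regardless of which coordinator receives the request.
t1 = md5_token("user:42")
t2 = md5_token("user:42")
assert t1 == t2 and 0 <= t1 < RING_SIZE
```

Because the hash output is effectively uniform, rows spread evenly over the ring, which is exactly why Murmur3Partitioner and RandomPartitioner balance load while ByteOrderedPartitioner (which preserves key order) can create hotspots.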

Virtual node (Vnode)

Each virtual node corresponds to a token value. Each token determines both the node's position in the ring and the contiguous range of hash values (and thus data) that the node is responsible for. Together, these contiguous token ranges form a closed ring.

Without virtual nodes, the number of tokens on the ring equals the number of machines in the cluster. For example, with 6 nodes in total, there are 6 tokens. With a replication factor of 3, each record exists on three nodes in the cluster. The simple scheme is to compute the hash of the row key and find which token range it falls into on the ring: the first replica goes on that node, and the remaining two copies go on the next two nodes clockwise around the ring.

A, B, C, D, E, and F denote key ranges; the actual values form the hash ring space. For example, the interval 0 to 2^32 is divided into 10 equal parts, each segment covering 1/10 of 2^32, and so on.

If you do not use virtual nodes, you must manually calculate and assign a token to each node in the cluster. Each token determines the node's position in the ring and the range of hash values that node is responsible for. In the upper half of the diagram, each node is assigned a single token representing a position in the ring; each node stores the rows whose row-key hashes fall into the single contiguous hash range that node owns. Each node also holds replicas of rows belonging to other nodes.

With virtual nodes, each node owns multiple smaller, discontinuous hash ranges. In the lower half of the diagram, the nodes in the cluster use virtual nodes; the virtual-node tokens are randomly chosen and non-contiguous. The location of a row is still determined by the hash of its row key, but it now falls into one of these smaller partitions.

The benefits of using virtual nodes

  • There is no need to manually compute and assign a token to each node.
  • There is no need to rebalance the cluster after nodes are added or removed.
  • Rebuilding a failed node is faster, since its data is streamed from many nodes.
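The ring mechanics described above can be sketched as follows. This is a toy model, not Cassandra's code: the hash is a stand-in for Murmur3, the vnode count and naming scheme are arbitrary, and replica placement follows the simple clockwise walk described earlier.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Toy hash standing in for Murmur3; only the ring mechanics matter here.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, vnodes_per_node=8):
    """Assign each physical node several pseudo-random tokens (its vnodes)."""
    ring = []  # sorted list of (token, node)
    for node in nodes:
        for i in range(vnodes_per_node):
            ring.append((token(f"{node}#vnode{i}"), node))
    ring.sort()
    return ring

def replicas(ring, key, rf=3):
    """Walk clockwise from the key's token, collecting rf distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    owners = []
    i = start
    while len(owners) < rf:
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

ring = build_ring(["node1", "node2", "node3", "node4", "node5", "node6"])
owners = replicas(ring, "user:42", rf=3)
assert len(owners) == 3 and len(set(owners)) == 3
```

Note how the replica walk skips vnodes belonging to a node already chosen; this is why vnodes spread a rebuilt node's data across many peers instead of just its ring neighbors.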

Replication

There are two available replication strategies currently available:

  • SimpleStrategy: used only for a single data center. The first replica is placed on the node determined by the partitioner; the remaining replicas are placed on the subsequent nodes clockwise around the ring.


  • NetworkTopologyStrategy: used for more complex, multi-data-center deployments. You can specify how many replicas are stored in each data center.

The replication strategy is specified when the keyspace is created, for example:

CREATE KEYSPACE Excelsior WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE KEYSPACE Excalibur WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2 };

The data center names dc1 and dc2 must match the names configured in the snitch. The topology strategy above stores 3 replicas in dc1 and 2 replicas in dc2.

Gossip

The gossip protocol is the internal communication technique used between nodes in a cluster. Gossip is an efficient, lightweight, reliable broadcast protocol for exchanging data between nodes; it is distributed, fault-tolerant, and peer-to-peer. Cassandra uses gossip for peer discovery and metadata propagation.

In practice, the gossip protocol is configured through the seeds (seed nodes) in the configuration file. Note that all nodes of the same cluster should use the same seed list; otherwise the cluster can split, i.e. two separate clusters can appear. Seed nodes let new nodes discover the other members of the cluster early. Each node exchanges information with a few other nodes; thanks to randomness and probability, the information eventually reaches every node in the cluster. Each node also stores the state of every other node it has learned about, so any single node knows about all the other nodes in the cluster. If a CQL client connects to any node, it can obtain the state of every node in the cluster; in other words, every node's view of cluster membership should be consistent.

The gossip process runs once per second and exchanges information with up to three other nodes, so all nodes quickly learn about the rest of the cluster. The whole process is decentralized: no node coordinates the gossip of the others, and each node independently selects one to three peers. Gossip always prefers live peers in the cluster, and probabilistically selects seed nodes and unreachable nodes as well.


The gossip exchange resembles a TCP three-way handshake. With a conventional broadcast protocol there is one message per round, flooded through the cluster; with gossip, each exchange consists of three messages and adds a degree of anti-entropy. This lets the data exchanged between nodes converge quickly.

First, a few seed nodes such as A and B are configured. Every participating node maintains the state of all nodes as node -> (key, value, version) entries, where a larger version number means newer data. A node P can only update its own state directly; it updates its view of other nodes' state indirectly, through the gossip protocol.

The general process is as follows.

SYN: Node A randomly selects some nodes and sends them digests only, i.e. (key, version) pairs without values, to keep messages small.

ACK: When node B receives the message, it merges it with its local state by comparing versions; the entry with the larger version is newer. For example, if node A reports C as (key, value_a, 2) while node B holds C as (key, value_b, 3), then B's version is newer, so the merged result is the data stored on B, which is then sent back to node A.

ACK2: Node A receives the ACK message and applies it to its local data.
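The SYN/ACK/ACK2 exchange above can be simulated in a few lines. This is a simplified sketch of a version-based anti-entropy merge, not Cassandra's actual gossip implementation; the state layouts and variable names are illustrative.

```python
def merge(local, incoming):
    """Keep whichever (value, version) has the larger version, per key."""
    for key, (value, version) in incoming.items():
        if key not in local or local[key][1] < version:
            local[key] = (value, version)

# Nodes A and B each hold per-key (value, version) state.
state_a = {"C": ("value_a", 2), "D": ("x", 5)}
state_b = {"C": ("value_b", 3)}

# SYN: A sends only digests (key -> version), not values.
digest_a = {k: v[1] for k, v in state_a.items()}

# ACK: B replies with entries where its version is newer, and requests
# the keys where A's digest is newer (or unknown to B).
ack_data = {k: v for k, v in state_b.items()
            if k in digest_a and v[1] > digest_a[k]}
ack_request = [k for k, ver in digest_a.items()
               if k not in state_b or state_b[k][1] < ver]

# ACK2: A applies B's newer data, then sends B the requested entries.
merge(state_a, ack_data)
merge(state_b, {k: state_a[k] for k in ack_request})

assert state_a["C"] == ("value_b", 3)   # B's version 3 won for key C
assert state_b["D"] == ("x", 5)         # A's entry for key D reached B
```

After one round-trip both nodes agree on the newest version of every key they exchanged, which is the convergence property the three-message exchange is designed for.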

Consistency (Consistency Level)

Consistency refers to whether all replicas of a row are up to date and in sync. Cassandra extends eventual consistency with the concept of tunable consistency per read or write operation: the client issuing a request can specify, via a consistency-level parameter, the consistency required for that request.

Write consistency

  • ANY: a write to any node succeeds. Even if all replica nodes are down, the write can still succeed after a hinted handoff event is recorded, but in that case the data cannot be read until at least one of the failed replica nodes recovers. Minimal write latency, and the write request will not fail; provides the lowest consistency and highest availability of all levels.
  • ALL: the write must reach the commit log and memtable of all replica nodes for the row. Provides the highest consistency and the lowest availability of all levels.
  • EACH_QUORUM: the write must reach the commit log and memtable of a quorum of replica nodes in each data center. For multi-data-center clusters, this strictly enforces the same consistency guarantee everywhere; use it, for example, if you want the write request to fail whenever any data center is down or cannot satisfy a quorum of replica writes.
  • LOCAL_ONE: a write to any replica node in the local data center succeeds. For multiple data centers, this matches the common requirement that at least one replica succeed without any cross-data-center communication.
  • LOCAL_QUORUM: writes to a quorum of replica nodes in the local data center succeed, avoiding cross-data-center communication. Cannot be used with SimpleStrategy. Used to ensure data consistency within the local data center.
  • LOCAL_SERIAL: writes to a quorum of replica nodes in the local data center succeed conditionally. Used by lightweight transactions to achieve linearizable consistency and avoid unconditional updates.
  • ONE: a write to any single replica node succeeds. Satisfies most users' needs; generally the replica node closest to the coordinator is written first.

Note:

Even when consistency level ONE or LOCAL_QUORUM is specified, the write is still sent to all replica nodes, including those in other data centers. The consistency level only determines how many replica nodes must report success before the client request is acknowledged.

Read consistency

  • ALL: queries all replica nodes and returns the value with the latest timestamp among all responses. If any replica node does not respond, the read fails. Provides the highest consistency and the lowest availability of all levels.
  • EACH_QUORUM: queries a quorum of replica nodes in each data center and returns the data with the latest timestamp. Used similarly to LOCAL_QUORUM.
  • LOCAL_SERIAL: same as SERIAL, but restricted to the local data center.
  • LOCAL_QUORUM: queries a quorum of replica nodes in the local data center and returns the data with the latest timestamp; avoids cross-data-center communication. Fails when used with SimpleStrategy.
  • LOCAL_ONE: returns data from the replica node in the local data center closest to the coordinator. Used in the same situations as the write level of the same name.
  • ONE: returns the result from the closest replica as determined by the snitch. By default, read repair is triggered in the background to bring the other replicas into sync. Provides the highest availability, but the result is not necessarily up to date.
  • QUORUM: returns the result once a quorum of replica nodes across data centers has responded, merged by latest timestamp. Ensures strong consistency, though the read can fail if a quorum is unavailable.
  • SERIAL: allows reading current (including uncommitted) data; if an uncommitted lightweight transaction is found during the read, it is committed first. Used with lightweight transactions.
  • TWO: returns the latest data from the two closest replicas. Similar to ONE.
  • THREE: returns the latest data from the three closest replicas. Similar to TWO.


About the QUORUM level

The QUORUM level ensures that data is written to a quorum of nodes. The quorum value is calculated from the following formula (rounded down):

(sum_of_replication_factors / 2) + 1

sum_of_replication_factors refers to the sum of the replication_factor settings of all data centers.

For example, with a single data center and a replication factor of 3, the quorum value is 2, meaning the cluster can tolerate 1 node down. If the replication factor is 6, the quorum value is 4, so at most 2 nodes down can be tolerated. With two data centers, each with replication factor 3, the sum is 6, so the quorum value is 4 and the cluster can tolerate at most 2 nodes down. With five data centers, each with replication factor 3, the quorum value is 8.
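The examples above follow directly from the formula; a minimal sketch (function name is illustrative):

```python
def quorum(replication_factors):
    """quorum = (sum of replication factors across data centers) // 2 + 1"""
    return sum(replication_factors) // 2 + 1

assert quorum([3]) == 2        # single DC, RF 3: tolerates 1 node down
assert quorum([6]) == 4        # single DC, RF 6: tolerates 2 nodes down
assert quorum([3, 3]) == 4     # two DCs, RF 3 each: sum 6, quorum 4
assert quorum([3] * 5) == 8    # five DCs, RF 3 each: sum 15, quorum 8
```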

If you want reads and writes to be strongly consistent, use the following formula:

(nodes_written + nodes_read) > replication_factor

For example, if an application reads and writes at the QUORUM level with a replication factor of 3, each operation involves 2 nodes. The number of nodes read plus the number of nodes written (4) is greater than the replication factor (3), so every read set overlaps the latest write set in at least one node, and consistency is ensured.
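The overlap condition can be checked mechanically. This sketch (function name illustrative) just encodes the inequality above:

```python
def strongly_consistent(nodes_written, nodes_read, replication_factor):
    """Reads see the latest write when the read and write sets must overlap."""
    return nodes_written + nodes_read > replication_factor

rf = 3
w = r = rf // 2 + 1                        # QUORUM on both sides: 2 nodes each
assert strongly_consistent(w, r, rf)       # 2 + 2 = 4 > 3: consistent
assert not strongly_consistent(1, 1, rf)   # ONE writes + ONE reads: 2, not > 3
```

This is why QUORUM reads paired with QUORUM writes give strong consistency, while ONE/ONE trades that guarantee for availability and latency.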

Link of this Article: Cassandra internal architecture
