
Consensus Algorithms Explained - Raft and Paxos

Reading time: 10 min
Yevhenii Hordashnyk
DevOps consultant, co-founder

Intro

While listening to an episode of the Kubernetes Podcast devoted to etcd, I noted that etcd uses the Raft consensus protocol under the hood to manage a highly available replicated log. Since I had heard of Raft several times before (Consul, ZooKeeper), I decided to take a deeper look into it and finally figure out its purpose.

Turns out, there is a perfect explanation of why we need a consensus protocol in the first place, and it's called the Byzantine fault. The Wikipedia page explains it really well, but let me give you a short version here. Imagine the Byzantine army besieging a city, deciding whether to storm it or keep up the siege. The army's nine generals decide to get ready for the storm but not to attack until they all agree on a common decision. A partial attack would allow the enemy to overpower the Byzantines and defeat them, which is worse than either a full-scale attack (Byzantium wins) or inaction (Byzantium does not lose). The situation gets more difficult still because there is an unknown number of treacherous generals who will try to bring about the halfhearted attack. This raises a question: how should the generals communicate to reach a common decision, and up to what number of treacherous generals is that even possible?

Let's take that question into the technology field. Given a distributed cluster of databases, how do we ensure consistency across the cluster, detect faulty nodes, and minimize the impact of a node being down? These and some additional questions are answered by Raft.
Byzantine fault

Paxos

All these questions were originally answered by Robert Shostak and Leslie Lamport, the latter being the creator of Paxos. Paxos is a consensus algorithm, i.e. an algorithm that actually solves the outlined questions, thus achieving overall system reliability in the presence of several faulty processes. However, the main problem with Paxos is that it is extremely difficult to understand and, I guess (since I didn't get it), even more difficult to implement. Let me just give you these quotes originally presented by Raft's author, Diego Ongaro:
The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos :-)
-- NSDI Reviewer

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the final system will be based on an unproven protocol.
-- Chubby authors (Google internal project)

Long story short, Raft emerged as an attempt to make the consensus algorithm easier to understand and implement. Spoiler alert: judging by the research and real-world Raft implementations, it was a big success.

Raft Consensus Algorithm

Unlike Paxos's authors, Raft's authors decided to simplify the algorithm and base it on a leader-follower approach (or, as it is more commonly called, a master-slave approach), where there is always a single master that receives read and write requests and replicates the changes to the read-only slaves. Here is how the authors justify that decision:
Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

Raft consensus

Raft defines three different node states: leader, follower, and candidate. The leader receives requests from the clients and replicates the changes to the followers. A candidate emerges once a leader becomes unavailable: when a follower hasn't received a message from the leader within a predefined time, it decides to promote itself to a candidate.

In addition to the states, Raft uses the notion of a `term`. It is an auto-incremented integer that represents one election cycle. Each time an election takes place, the term is incremented, e.g. 3 -> 4. It is necessary for keeping track of the latest election and for finding nodes with stale data.
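The states and the term counter can be sketched in a few lines of Python. This is a hypothetical minimal model for illustration, not any real implementation; the `Node` class and its fields are made up:

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class Node:
    name: str
    state: State = State.FOLLOWER
    term: int = 0  # auto-incremented election-cycle counter

    def start_election(self) -> None:
        # A follower whose heartbeat timer expires increments its term
        # and transitions to the candidate state.
        self.term += 1
        self.state = State.CANDIDATE

n = Node("node-1")
n.start_election()
print(n.state.name, n.term)  # CANDIDATE 1
```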

Election

Let's check how the election works. I strongly recommend keeping that awesome Raft Visualization diagram in front of you and playing around with it (click on one of the nodes and change its state).

  1. A follower has been receiving heartbeat requests from the leader, but the last one hasn't arrived within a specified time interval
  2. The follower increments its term and transitions to the Candidate state
  3. The candidate votes for itself and asks every other server for its vote, and this is how the election begins
  4. Now there are 3 possible outcomes: the candidate wins, the candidate loses, or a timeout occurs
A candidate wins if the majority of the servers voted for it, like in real democratic elections (yes, USA, I'm looking at you). How does a server decide who to vote for? It's as simple as voting for the server whose vote request arrived first; each server casts at most one vote per term.
Candidate wins
In this case, the leader-elect notifies the other servers that it has become the leader and that there should not be any more elections.
Candidate loses
This happens when the candidate is waiting for the votes to come in, but a request from a leader-elect arrives, notifying the candidate about a new leader (see Candidate wins). If the term of the leader-elect is the same as or greater than the candidate's, the candidate accepts the new leader and loses.
Timeout
A timeout happens when the period dedicated to the election process goes by and no candidate has either won or lost. It means that a new leader hasn't been elected, and it is time to increment the term and start from scratch.
A reasonable question appears: if the election failed and no candidate was able to get the majority of the votes, we need to hold the election again. But why are we sure it will work this time and we won't get into an infinite loop of re-elections until some kind of circuit breaker interferes?
I think Raft's authors picked the easiest yet very effective solution: randomness. Each term, every server gets a random election timeout value (remember, it is different for each server rather than a single value for the whole cluster), and this breaks the loop.
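The randomized-timeout trick can be illustrated with a small Python sketch. The 150-300 ms range is the one suggested in the original paper; the node names and the dictionary layout are made up for illustration:

```python
import random

random.seed(7)  # fixed seed only so the example is reproducible

def election_timeout() -> float:
    # Each server independently draws its own timeout (in seconds)
    # from the range suggested by the Raft paper: 150-300 ms.
    return random.uniform(0.150, 0.300)

timeouts = {f"node-{i}": election_timeout() for i in range(5)}

# The node whose timer fires first becomes a candidate and asks for
# votes; because the timeouts almost never coincide, it usually
# gathers a majority before any other node even times out.
first = min(timeouts, key=timeouts.get)
print(first, round(timeouts[first], 3))
```

Since every term draws fresh random values, even if two nodes split the vote this term, the odds of them colliding again next term are low.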

Replication

Once a leader is elected, it starts receiving queries from the clients. Each mutation (or write) query is considered either committed or not committed. What does that mean?

Once a leader receives a write query (Update, Insert, Delete in the SQL world; basically, any data modification as opposed to read-only queries), it sends the log entry to all the followers and waits for them to apply the same query on their side. Only once a majority of the servers have done so is the entry considered committed and a response sent to the client.
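A minimal sketch of the majority-commit rule, assuming hypothetical follower acknowledgement callbacks (real Raft pipelines AppendEntries RPCs; this just shows the counting logic):

```python
from typing import Callable, Dict

def replicate(entry: str,
              followers: Dict[str, Callable[[str], bool]],
              majority_needed: int) -> bool:
    """Send the entry to every follower and report whether it is
    committed, i.e. stored on a majority of the cluster."""
    acks = 1  # the leader's own copy counts toward the majority
    for name, append in followers.items():
        if append(entry):
            acks += 1
    return acks >= majority_needed

# 5-node cluster: leader + 4 followers, majority is 3.
followers = {
    "f1": lambda e: True,
    "f2": lambda e: True,
    "f3": lambda e: False,  # slow or offline follower
    "f4": lambda e: False,  # slow or offline follower
}
committed = replicate("SET x = 1", followers, majority_needed=3)
print(committed)  # True: leader + f1 + f2 = 3 of 5 nodes hold the entry
```

Note that the entry commits even though two followers are down, which is exactly why Raft tolerates minority failures.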

Let's say a leader got a query, sent it to the followers, and then went offline without committing it. Then a new leader takes its place and receives a new query from the client, which results in the query log diverging between the new leader and the followers that received the query from the old leader. In this case, the new leader considers its own log the authoritative one, finds the last point where the two logs agree, and overwrites everything after that point in the follower's log with its own entries. The next replicated entry will already be coherent across the cluster.
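The log repair step can be sketched like this. It is simplified on purpose: real Raft entries carry a term and are matched by (index, term) via the AppendEntries consistency check, whereas here the logs are plain lists compared by value:

```python
from typing import List

def reconcile(leader_log: List[str], follower_log: List[str]) -> List[str]:
    """Find the longest common prefix of the two logs, then replace the
    follower's divergent suffix with the leader's entries."""
    i = 0
    while (i < len(leader_log) and i < len(follower_log)
           and leader_log[i] == follower_log[i]):
        i += 1
    return follower_log[:i] + leader_log[i:]

leader   = ["a", "b", "c", "x"]  # the new leader's log
follower = ["a", "b", "d"]       # "d" came from the old, failed leader
print(reconcile(leader, follower))  # ['a', 'b', 'c', 'x']
```

After reconciliation the follower's log is byte-for-byte the leader's log, and the uncommitted entry "d" from the old leader is gone, which is safe precisely because it was never committed.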

Safety

That all sounds great as long as all the nodes know "the rules". What if some node (a follower or a candidate) goes offline for some time, then comes back and in that very term gets elected as the leader? Will it rewrite the entire replication log, basically erasing all the changes that happened while it was offline?

In order to avoid such a situation, a restriction on the replication log was added: a candidate's replication log must contain all committed entries (pay attention here: not-yet-committed entries don't count). Keeping in mind that a candidate also needs to receive a majority of the votes to become a leader, this means a candidate can win only if its replication log is at least as up to date as the logs of a majority of the nodes in the cluster.
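The restriction boils down to an "is the candidate's log at least as up to date as mine" check that each voter performs before granting its vote. Per the Raft paper, logs are compared first by the term of the last entry and then, if the terms are equal, by log length. A sketch (function and parameter names are mine):

```python
def log_is_up_to_date(candidate_last_term: int, candidate_last_index: int,
                      voter_last_term: int, voter_last_index: int) -> bool:
    """A voter grants its vote only if the candidate's log is at least
    as up to date as the voter's own log."""
    if candidate_last_term != voter_last_term:
        # Higher last term wins outright.
        return candidate_last_term > voter_last_term
    # Same last term: the longer log is more up to date.
    return candidate_last_index >= voter_last_index

# A node that was offline and missed committed entries gets rejected:
print(log_is_up_to_date(3, 5, 4, 7))  # False
```

Since committed entries are, by definition, present on a majority of nodes, a stale candidate cannot gather a majority of votes, and the scenario above cannot happen.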

Performance considerations

Although distributed systems that use Raft achieve consensus rather quickly, one needs to keep in mind that there will be split-vote cases, which are resolved by means of a timeout, meaning more time is spent electing a new leader. And the bigger the cluster gets and the smaller the pool of timeout values, the slower consensus is achieved.
Timeout value
According to the original study, the best range for the randomly selected timeout value is 150 ms to 300 ms.
Number of nodes
I wasn't able to find any official research on cluster performance depending on the number of nodes. However, the recommendation is to keep the number low and odd, with a minimum of 3 and a maximum of 7 (9 in very rare cases).

Proof of its success: Consul, NATS, and etcd all use Raft.

Afterword

In this article, I wanted to give you a basic understanding of what Raft is, which problems it aims to solve, and how it actually works. If this topic is of further interest to you, I recommend checking out the whitepaper (it is really easy and pleasant to read and understand) and the upcoming article about Raft implementations.
