Raft Consensus Algorithm

One of the algorithms for server architecture that’s been on my list to investigate for a while is the Raft Consensus Algorithm. Recently I had enough spare time to start engaging with some of the content on their GitHub page, and now I feel like I have a high-level understanding of the basic concept.

Dealing with faulty computers is notoriously difficult. I’d wager that a very large portion of server architectures in the real world don’t deal with computer crashes in a fault-tolerant way. Why? Well, the simple answer is that computer crashes in these environments are rare events. Over the last 10 years of working in the industry, I’ve only seen crashes of this kind in production a few times. So it’s a rare event, and it’s also a hard thing to architect around. As a result you end up in a situation where you depend on your servers to be running, and a crash of the machine your server is running on results in downtime until you respond and start your system up again.

I watched a few different videos on Raft, and one thing that struck me is their emphasis on making the algorithm easy to understand. That’s not to say it’s a simple system: handling all the edge cases is anything but simple. But it was refreshing to see an area of computing where some smart people spent significant effort making things easier to grasp. The fact that this algorithm has seen so much adoption since its inception is proof of sorts that they hit on something important.

So far I’ve watched three of the videos that are on the Raft GitHub page. The best was probably this one:

It’s pretty cool. One thing that wasn’t clear to me after the first video I watched was how this system deals with clients, and this video explains that better. The way I understand it, the distributed log they talk about is really a log of client requests. Those requests may or may not succeed once they are actually attempted against the system’s state machine. What the consensus algorithm provides is agreement across multiple servers on the order in which those requests will be processed. Once a majority of servers has committed a request to their logs, it is safe to attempt executing it and to return the result (whether success or failure) to the client.
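To make that flow concrete, here’s a minimal sketch in Go of how a leader might handle a client request under this model. The types and names are my own invention, not taken from the Raft paper or any real implementation, and replication is reduced to a simple acknowledgement count:

```go
package raftsketch

import "fmt"

// LogEntry holds a client command plus the term in which the leader received it.
// (Illustrative only; field names are my own, not from any real Raft library.)
type LogEntry struct {
	Term    int
	Command string
}

// StateMachine is whatever actually executes committed commands.
type StateMachine interface {
	Apply(cmd string) (result string, err error)
}

// Follower is a toy stand-in for a follower node; Append would really be
// an AppendEntries RPC with log-consistency checks.
type Follower interface {
	Append(e LogEntry) bool // true means the follower acknowledged the entry
}

// Leader is a toy stand-in for the Raft leader role.
type Leader struct {
	term      int
	log       []LogEntry
	followers []Follower
	sm        StateMachine
}

// HandleClientRequest mirrors the ordering described above: the command is
// committed to a majority of logs before it is executed, and the client only
// gets the result (success or failure) after that.
func (l *Leader) HandleClientRequest(cmd string) (string, error) {
	entry := LogEntry{Term: l.term, Command: cmd}
	l.log = append(l.log, entry) // 1. append to the leader's own log

	// 2. replicate and count acknowledgements, including the leader itself.
	acks := 1
	for _, f := range l.followers {
		if f.Append(entry) {
			acks++
		}
	}

	// 3. the entry is "committed" only once a majority of the cluster has it.
	clusterSize := len(l.followers) + 1
	if acks < clusterSize/2+1 {
		return "", fmt.Errorf("no majority (%d of %d), cannot commit", acks, clusterSize)
	}

	// 4. only now is the command applied to the state machine; the command
	//    itself may still fail, and that failure is simply the result
	//    returned to the client, not a consensus problem.
	return l.sm.Apply(cmd)
}
```

In real Raft the append step is an AppendEntries RPC that also carries the leader’s term and consistency-check information, and commitment is tracked via a commit index rather than counted inline; the sketch only mirrors the ordering of the steps.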

When I was first looking at this, I interpreted the log in this mechanism as what I’d call an event log: a record of the state changes in the system. But in this formulation that’s the output of the state machine, so the initial confusion set me on the wrong path a little bit.
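Spelling out that distinction with a couple of made-up types: the replicated log holds the commands going into the state machine, whereas what I was picturing was a record of the changes coming out of it.

```go
package raftsketch

// What Raft replicates: the input to the state machine, i.e. client commands
// that are agreed on before anything is executed.
type CommandEntry struct {
	Command string // e.g. "withdraw 50 from account A"
}

// What I was originally picturing: the output of the state machine, i.e. a
// record of state changes that have already happened.
type EventRecord struct {
	Event string // e.g. "account A balance changed from 120 to 70"
}
```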

I remember our team looking into ways to make our monolithic server architecture more fault tolerant. The observation was that the main server, responsible for all the state changes, was vulnerable to these computer crashes or network outages. One strategy we explored was having a backup server that replicated its state from the current live one, with some mechanism to hand responsibility over to the backup if there was a problem with the original. The problem was always going to be that failures were rare and any handover would likely need to be manual. I guess recovery times might have been quicker, though, if you knew it was likely safe to just instruct the backup to become the new live authority.

I find myself curious whether Raft would have been an option for us. The downside is that client requests would have more overhead, since they would need to be committed to the distributed log before being attempted. But the cool thing would be that by running three instances of these core servers you could tolerate any one of them crashing.
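That fault-tolerance claim comes from the majority-quorum rule: a cluster of 2f+1 servers can lose f of them and the survivors still form a majority. A trivial pair of helpers (my own naming, continuing the sketch above) makes the arithmetic explicit:

```go
package raftsketch

// QuorumSize is the number of servers that must hold an entry before it can
// be committed: a strict majority of the cluster.
func QuorumSize(clusterSize int) int {
	return clusterSize/2 + 1
}

// MaxFailures is how many servers can crash while the remainder can still
// reach a quorum. With 3 servers the quorum is 2, so 1 crash is survivable;
// with 5 servers the quorum is 3, so 2 crashes are.
func MaxFailures(clusterSize int) int {
	return clusterSize - QuorumSize(clusterSize)
}
```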

Here’s a question I don’t know the answer to, though: what happens if these logs of client requests contain sensitive data? In today’s GDPR world, what do you do architecturally to clean out any sensitive information?