The second chapter of Understanding Distributed Systems is out
In the second chapter of Understanding Distributed Systems, I explore the core building blocks at the heart of many distributed systems. What can you expect to learn from it as a reader?
In a distributed system, anything that can fail eventually will. But you can only mitigate a failure if you can detect it in the first place. Hence, the second chapter starts by introducing the concept of failure detection and its implementations.
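To make the idea concrete, here is a minimal sketch of one common approach, a heartbeat-based detector with a fixed timeout. It is not necessarily the implementation the chapter presents. The class name, the method names, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` seconds.

    Hypothetical sketch: names and structure are illustrative only.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the detector can be tested without sleeping
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def heartbeat(self, node_id):
        # Record that we just heard from this node.
        self.last_seen[node_id] = self.clock()

    def is_suspected(self, node_id):
        last = self.last_seen.get(node_id)
        if last is None:
            return True  # we have never heard from this node
        return self.clock() - last > self.timeout
```

Note that a timeout can only ever suspect a failure: a slow node or a slow network looks the same as a crashed node from the outside.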
On a single-threaded system, if you know when two operations were executed, you can say which one came before the other. And if you know the order of the operations, you can reason about their side effects. But there is no such thing as a global wall clock that all nodes in a distributed system follow. In the second chapter, you will learn about logical clocks, which measure the passing of time in terms of operations executed rather than seconds.
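As a rough illustration, the simplest logical clock is a Lamport clock: a counter that is incremented on every local operation and fast-forwarded whenever a message carrying a larger timestamp arrives. The sketch below assumes that variant; the class and method names are made up for this example.

```python
class LamportClock:
    """A counter that advances with operations, not with seconds."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local operation happened.
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as an operation; the returned timestamp
        # travels with the message.
        return self.tick()

    def receive(self, remote_time):
        # Jump ahead of the sender's timestamp if we are behind it,
        # then count the receive itself as an operation.
        self.time = max(self.time, remote_time) + 1
        return self.time
```

If one operation causally precedes another, its timestamp is guaranteed to be smaller. The converse does not hold: a smaller timestamp alone does not prove that one operation happened before the other.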
Now, suppose you want to guarantee that only a single node in the system can access a shared resource. How can the nodes decide which one should have access to it? A leader election algorithm solves this problem, and armed with the knowledge of how failure detectors and logical clocks work, you will learn how to implement a flavor of it.
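One way to picture how the two ingredients combine is a lease: a node becomes leader by acquiring a time-limited lease (the timeout plays the role of the failure detector), and a counter bumped on every acquisition orders successive leaders (the logical clock). The sketch below is an assumption-laden, single-process stand-in, not the algorithm from the chapter. `LeaseStore`, `try_acquire`, and `term` are hypothetical names, and a real store would need an atomic compare-and-swap.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared store that grants a time-limited lease."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so the example can be tested without sleeping
        self.holder = None
        self.expires_at = 0.0
        self.term = 0  # bumped each time a free or expired lease is taken

    def try_acquire(self, node_id):
        now = self.clock()
        if self.holder is None or now >= self.expires_at:
            # The lease is free or expired: this node becomes the leader.
            self.holder = node_id
            self.term += 1
            self.expires_at = now + self.ttl
            return True
        if self.holder == node_id:
            # The current leader renews its lease before it expires.
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds a valid lease
```

If the leader stops renewing, its lease expires and another node can take over under a higher term.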
A single node is a single point of failure: if it goes down, its local state goes down with it. To guarantee that the state remains available even if the node fails, the state needs to be replicated on multiple nodes. The second chapter will teach you how to leverage leader election to implement a state machine replication algorithm. While discussing replication, you will gain an understanding of what consistency is, why there are so many shades of it, and how it can be guaranteed.
Finally, you will use everything you have learned so far to implement distributed transactions, which guarantee that a group of operations spanning multiple nodes executes atomically: either all operations succeed or none do.
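A well-known protocol that provides this all-or-nothing guarantee is two-phase commit, sketched below with in-memory participants. Whether the chapter uses this exact protocol is an assumption, and `Participant` and `two_phase_commit` are hypothetical names. The sketch also ignores crashes and timeouts, which are what make the protocol hard in practice.

```python
class Participant:
    """A node taking part in the transaction (in-memory stand-in)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates whether the local work can succeed
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local operation can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit everywhere or abort everywhere."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote is enough to abort the transaction on every participant, including those that voted "yes".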