Testing and operating distributed systems

February 07, 2021

I am excited to announce that the first edition of my book about distributed systems is finally complete!

First things first, I have rebranded the book from “The Distributed Systems Manual” to “Understanding Distributed Systems,” as I felt it was more appropriate for the content. It’s the second time I have changed the title, but also the last, I promise :) I have also changed the book’s PDF format to include a larger margin on every page for notes, references, and comments.

Finally, I have added a brand-new part to the book that discusses how to test and operate distributed systems. Historically, developers, testers, and operators were part of different teams. The developers handed over their software to a team of QA engineers responsible for testing it. When the software passed that stage, it moved to an operations team responsible for deploying it to production, monitoring it, and responding to alerts.

This model is being phased out in the industry as it has become commonplace for the development team to also be responsible for testing and operating the software they write. This forces the developers to embrace an end-to-end view of their applications, acknowledging that faults are inevitable and need to be accounted for.

Chapter 18 describes the different types of tests (unit, integration, and end-to-end) that you can leverage to increase your confidence that your distributed applications work as expected.

Chapter 19 dives into continuous delivery and deployment pipelines used to release changes safely and efficiently to production.

Chapter 20 discusses how to use metrics and service-level indicators to monitor the health of distributed systems. It then describes how to define objectives that trigger alerts when breached. Finally, the chapter lists best practices for dashboard design.
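To give a flavor of the idea, here is a minimal sketch of an availability indicator checked against an objective. It is not taken from the book: the `RequestWindow` type, the function names, and the 99.9% target are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window (hypothetical shape)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """Service-level indicator: fraction of requests that succeeded."""
    if window.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return window.successful / window.total


def slo_breached(window: RequestWindow, objective: float = 0.999) -> bool:
    """Return True when the indicator falls below the objective (alert condition)."""
    return availability_sli(window) < objective
```

For example, a window with 10,000 requests and 9,980 successes has an indicator of 0.998, which is below the assumed 0.999 objective, so `slo_breached` would return `True`. A real alerting setup would likely also look at how quickly the error budget is being consumed rather than at a single window.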

Chapter 21 introduces the concept of observability and how it relates to monitoring. Then it describes how traces and logs can help developers debug their systems.

What’s next for the book? I have a lot more content I plan to add to it now that the fundamentals are covered, like CRDTs, control planes, and Byzantine consensus. Stay tuned!


Written by Roberto Vitillo
