I have released a new chapter of the Distributed Systems Manual: Resiliency Patterns. The chapter is all about failures and their mitigations.
Any failure that can happen will eventually happen at scale: hardware faults, software crashes, memory leaks, you name it. The more your system scales out, the more failures it will experience. Eventually, the only way to cope with them is through automated self-healing and defense mechanisms.
The chapter is packed with practical defense mechanisms that have helped the systems I have built stand the test of time and scale to millions of users. It starts by describing the most common sources of failure, like single points of failure and slow processes, and then goes on to describe various defense mechanisms, like circuit breakers, rate limiting, and load shedding.
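To give a flavor of one of these mechanisms, here is a minimal circuit breaker sketch in Python. It is an illustration rather than the chapter's own code: the names `CircuitBreaker` and `CircuitOpenError`, the default threshold and timeout, and the injectable `clock` parameter are all assumptions made for this example. The idea is that after a few consecutive failures the breaker stops calling the failing dependency and fails fast, then lets a single trial call through once a cool-down period has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Hypothetical defaults; real values depend on the dependency being protected.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of hitting the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to fall back or return an error right away. This sketch is not thread-safe and counts every exception as a failure; a production version would need locking and a narrower definition of what counts as a failure.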