Book Review: The Practice Of Cloud System Administration Volume 2 – Designing And Operating Large Distributed Systems

Hello everyone, I am back with another book review. This time, I am reviewing a book that I consider a classic. As always, let’s start with the table of contents:
Part I Design: Building it

Designing in a distributed world
Designing for Operations
Selecting a Service Platform
Application Architectures
Design Patterns for Scaling
Design Patterns for Resiliency

Part II Operations: Running it

Operations in a Distributed World
DevOps Culture
Service Delivery: The Build Phase
Service Delivery: The Deployment Phase
Upgrading Live Services
Automation
Design Documents
Oncall
Disaster Preparedness
Monitoring Fundamentals
Monitoring Architecture and Practice
Capacity Planning
Creating KPIs
Operational Excellence

Part III Appendices

Assessments
The Origins and Future of Distributed Computing and Clouds
Scaling Terminology and Concepts
Templates and Examples
Recommended Reading

Overall, a bit over 500 beautifully printed pages (as you would expect from Addison-Wesley).
As you can see from the ToC, the breadth of information contained in this book is tremendous. Every chapter could easily expand into a book of its own (and indeed, there are volumes that expand on many of the topics), yet this book manages to give the astute reader a ton of information. Heck, it is almost as if the information is condensed: just add water. The authors do not fall into the trap of sticking with a particular technology. They maintain a level of abstraction that, in my opinion, is about right: not so abstract that the book would be hard to apply in real-world situations, and yet not tied to a particular technology, which would instantly and severely date the book (it came out before container orchestration frameworks became as popular as they are today, but you will not notice). After all, large-scale distributed systems share a common set of characteristics, no matter what their implementation details or purpose are. The format is similar for all chapters: first an attention-grabbing introduction, then a nice discussion of the topic at hand, and finally exercises, most of them open-ended, so the reader can follow up on what has been discussed.
The potential audience of this book is both SREs and their managers. In particular, Part II of the book contains a ton of information relevant to both sides of the equation. If you manage SREs, you’d better be at least acquainted with the material, and this book is more than a fine introduction. If you need a book on how to use AWS/Azure/GCP or their specifics, this volume will NOT meet your expectations; as discussed, this book is more like a framework.
In case this is not obvious by now, I consider this book a must-read for anyone dealing with modern distributed systems, be it an SRE, SWE or Engineering Manager. I cannot praise this book enough. It is extremely well written, in certain cases it goes against the trends, and how can you go wrong with a book that considers a zombie outbreak a valid reason for a datacenter outage?
Further resources:
Companion Website
Thomas Limoncelli’s Twitter

PS. A book that everybody is recommending (and asking me about, in a variety of contexts) is Google’s SRE book. If you have not read it by now, you can start by going there to enjoy the book in its entirety. While the Google SRE book is an extremely useful resource, and without wanting to create a false dichotomy, it kind of overshadows this volume, which, in my humble opinion, is a better choice in certain regards. Specifically, while both books have a strong Google influence (one comes from Google, one of the authors of the other was a Google SRE), I find that the “Practice of …” is a more focused volume, something perhaps to be expected given that it is written by “only” three authors. So, do yourself a favour and read both books; there is a wealth of information contained therein.
