Reviews

Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Jennifer Petoff, Chris Jones, Niall Richard Murphy

x0pherl's review against another edition

3.0

The first thing that comes to mind about this book is how massive it is. Most of my peers who have read it have read one or two chapters, skimmed one or two more and called it a day. I can see why. However, I found a lot of value in each part of the book. Look, Google runs a LOT of stuff. And they run it really well.
I've worked in Ops teams, Dev teams and DevOps teams, and this book gives you plenty to think about wherever you fall in that spectrum, whether you work in an organization where SRE is being considered or not.
My favorite quote from the book, attributed to Joseph Bironas:

If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.

mrahim's review against another edition

5.0

This book details a lot of modern-day infra-related concepts and the rationale behind them.

Examples include the discussion around Service-Level Objectives (SLOs).

- If an SLO is 99.99%, then this leaves you with an error budget of 0.01% (about 52 minutes a year). What surprised me is how product teams pushed back against increasing this further without serious justification: allowing only 52 minutes of downtime a year hurts engineering velocity and makes it harder to embrace risk & move fast.

- The team should aim for higher than 99.99% anyway, but should not be penalised unless breached.

- An SLO should never be 100% as there are too many real-world factors that can't be controlled here such as ISPs having downtimes.

- How do you measure an SLO? A simple one would be (successful requests)/(total requests) as a %. You should also consider measuring it from the client side rather than the server side, since clients are the ones whose UX is impacted.

- Another idea is not to be "too available" or teams can become dependent on it, but I guess this depends on the services provided.
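The arithmetic behind the first and fourth points above can be sketched in a few lines of Python. This is only an illustration of the numbers quoted in this review, not code from the book: the function names are made up for the example, the year is assumed to be 365 days, and the SLO is assumed to be a simple request-based availability target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumption: leap years are ignored


def error_budget_minutes(slo_percent: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed by an availability SLO over the given period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return period_minutes * (100 - slo_percent) / 100


def availability_percent(successful_requests: int, total_requests: int) -> float:
    """Request-based availability: successful / total, as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * successful_requests / total_requests


def slo_met(successful_requests: int, total_requests: int, slo_percent: float) -> bool:
    """True if the measured availability is at or above the SLO target."""
    return availability_percent(successful_requests, total_requests) >= slo_percent


if __name__ == "__main__":
    # 0.01% of a 365-day year is roughly 52.6 minutes.
    print(f"{error_budget_minutes(99.99):.1f} minutes/year")
    # 50 failures out of 1,000,000 requests is 99.995%, which meets 99.99%.
    print(slo_met(999_950, 1_000_000, 99.99))
```

Whether the request counts come from server logs or client-side probes is left open here; as noted above, client-side numbers are closer to what users actually experience.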

Toil

- Hands-on time running a script is toil. If it can be automated away but hasn't been, then it is toil.

- Overhead is not the same as toil. Overhead is things like meetings, code reviews, etc.

- SREs @ Google are expected to spend <50% of their time (20h a week max) dealing with toil. Most teams are at about 33%.

- Engineers are expected to complain loudly if there is too much toil as it will increase attrition of the best engineers if not dealt with. This is because not enough time is spent on new projects, causing stagnation and low morale. If someone is content with toil, then others will give that work to them too.

Misc ideas

- Testing is used for known data whereas monitoring is used for unknown or unpredictable data

- Responses are measured differently based on context. Throughput is used for sending videos (guarantee sending media even if it takes longer), whereas latency is for searching for videos (faster results to the users).

- Paxos variations and how they are used at large-scale.

migueldavid's review against another edition

4.0

This book may be the first book to really show the outside world how Google engineers think. That is a good thing (they have gold pieces of knowledge to share) but also a bad thing (the book feels academic, too much "this is how we do it at Google" and just a bit condescending).

poetsofsweetpea's review against another edition

informative slow-paced

5.0

It's a lot of information! I'll probably have to listen to it again.

lwenkai's review against another edition

medium-paced

4.0

schwarmgiven's review against another edition

5.0

The Bible.

This book is so good--basically creating a field and defining a set of practices that needed a lot of definition.

Helpful for everyone regardless of where they are on their path of ensuring prod is up--covers onboarding, on-call, and postmortems with such honest clarity it is disarming.

The book is not all written with equal strength and clarity, but it is all worth reading and the good stuff is honestly life changing.

Strongly recommended to anyone that uses a computer.

Gold.

niharikaaaaaa9's review against another edition

It's hard to give this book a star rating - there were some essays that I really enjoyed and learned from, while others I found dry. Perhaps my only recommendation regarding this book would be to not read it cover to cover, and instead hone in on the parts of Site Reliability Engineering that interest you, and read just those essays.

Somewhat tangential, and likely due to recency bias, but one of the essays I liked the most was one of the last ones, in which the authors interviewed a host of Google software engineers, who had backgrounds in industries that also cared heavily about reliability - think air traffic control, working on the 911 system, nuclear power plant engineers, and lifeguards. The essay compared and contrasted those industries' definitions and standards for reliability with the Google SRE teams, and while there were quite a lot of similarities, there were also notable differences which I found interesting.

erikars's review against another edition

4.0

As a Google software engineer, I read this book largely from the perspective of better understanding the practices I've seen supporting the various services I've worked on over the years. As a SWE on a high traffic critical surface, I saw many of the best practices mentioned in this book develop. It was useful to see a snapshot of how they fit together with the broader (and itself evolving) SRE philosophy.

Although I found the philosophy most interesting, the bulk of the book was practical principles on how to run reliable services. Although many of the details are specific to Google, the general principles are not. The authors provided enough information on why certain practices worked well to allow others to make their own tradeoffs about what works in their environment.

Overall, this was an interesting and valuable read.

trnl's review against another edition

5.0

One of the best books I've ever read. It's a must for everyone who is connected to software, systems and architecture. I'm deeply impressed by how it covers everything, from the sociological (interrupts, burnout, communication) to the technological (distributed consensus, telemetry) and standardization.

gianouts's review against another edition

4.0

This book contains a number of insights into Google's SRE practice. It is a bit repetitive at times but this assists in drilling in some of the key points.
