A review by kwugirl
Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Niall Richard Murphy

3.0

We read this in the book club at work and skipped some of the later chapters. I'd been thinking that I should read all of it before marking this book as done, but...I can't quite remember anymore which chapters we skipped, and I know we did at least read most of it!

The general conclusion from our discussions was that there was a lot of value in getting this book published and in understanding some of the processes that Google has set up over the years. Most interesting to me was the idea of an "error budget" to align reliability folks and new feature developers--you're allowed up to a certain amount of downtime, so people have to assess the risk of adding new items and can't just turf reliability over as someone else's problem.
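
To make the error budget arithmetic concrete, here is a minimal sketch in Python. The 99.9% availability target, the 30-day window, and the function names are all assumptions picked for illustration, not figures or code from the book.

```python
# Hypothetical illustration of an error budget; the target and window
# below are assumed values, not numbers taken from the book.

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Downtime allowed (in minutes) for an availability target over a window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window_days * MINUTES_PER_DAY * (1.0 - target)


def budget_remaining(target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(target, window_days) - downtime_minutes


def can_ship_risky_change(target: float, downtime_minutes: float,
                          window_days: int = 30) -> bool:
    """Simplified policy: risky launches are fine only while budget remains."""
    return budget_remaining(target, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(can_ship_risky_change(0.999, downtime_minutes=20))
    print(can_ship_risky_change(0.999, downtime_minutes=50))
```

The point of the sketch is just that the budget is a shared number: once the downtime already spent exceeds it, the same check that reliability folks care about is the one that blocks new feature launches.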

At times we sensed a tone of Google arrogance, though it's hard to pin down. Some of these practices were in fact invented there, but others are described in a slightly patronizing way, as if they were a shiny revelation rather than practices that are now accepted as standard. That just made it look like Google engineers should go out to conferences and mingle with others in their own field a bit more.

Overall, since I'm not an SRE and have never worked in ops, I was glad to read this book to get a better understanding of that field. I was also grateful to have read it with practicing non-Google SREs, so I had their commentary and didn't just accept The Google Way as gospel.