So, why this post? Well…
I'm (almost) done with my exams, so I'll finally have much more free time. What are some technical books you would recommend to me? 📚 I'm interested in Kubernetes, containers, distributed systems, security (although I'm a noob here), and more.— Marko Mudrinić (@xmudrii) July 5, 2019
For those of you who don’t know Marko, he is a former GSoC student at the CNCF who worked on Kubernetes, and a contributor to the Kubernetes Cluster API, so I’m going to take the idea that Marko’s a n00b with a handful-sized pinch of salt.
Anyway, it’s a common enough request that it’s probably worth documenting my 2p here. What follows is mostly things and authors that have interested me. Other opinions are also available.
- Hightower, Kelsey, Brendan Burns, and Joe Beda. Kubernetes: Up and Running: Dive into the Future of Infrastructure, 2nd Edition. O'Reilly, 2019.
- Burns, Brendan and Craig Tracey. Managing Kubernetes. O'Reilly, 2018.
- Garrison, Justin, and Kris Nova. Cloud Native Infrastructure: Patterns for Scalable Infrastructure and Applications in a Dynamic Environment. O'Reilly, 2017.
On distributed systems
- Burns, Brendan. Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services. O'Reilly, 2018.
- Sridharan, Cindy. Distributed Systems Observability. O'Reilly, 2018. http://bit.ly/2FXnS3f.
On organizational practice
- Beyer, Betsy, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne. The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly, 2018. https://landing.google.com/sre/workbook/toc/.
- Forsgren, Nicole and Jez Humble. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. 2018.
- Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. https://landing.google.com/sre/sre-book/toc/index.html.
On cluster orchestration
- Choudhury, Diptanu Gon, and Timothy Perrett. Designing cluster schedulers for internet-scale services. Communications of the ACM 61 no. 6 (2018): 34-40. https://doi.org/10.1145/3190564
- Leung, Andrew, Andrew Spyker, and Tim Bozarth. Titus: introducing containers to the Netflix cloud. Communications of the ACM 61 no. 2 (2018): 38-45. https://doi.org/10.1145/3152529
- Burns, Brendan, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. Borg, Omega, and Kubernetes. Communications of the ACM 59 no. 5 (2016): 50-57. https://doi.org/10.1145/2890784
- Verma, Abhishek, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15). ACM, 2015. https://doi.org/10.1145/2741948.2741964.
- Bailis, Peter and Kyle Kingsbury. The Network is Reliable: An informal survey of real-world communications failures. ACM Queue 12 no. 7 (2014): 1-13. https://doi.org/10.1145/2643130. http://bit.ly/2JfqCuO.
- DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review 41 no. 6 (2007): 205-220. https://doi.org/10.1145/1323293.1294281
- Lamport, Leslie. Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32 no. 4 (2001): 51-58.
- Frazelle, Jessie. Research for practice: security for the modern age. Communications of the ACM 62 no. 1 (2019): 43-45. https://doi.org/10.1145/3287295. http://bit.ly/2JqTfUB.
- Saltzer, Jerome H., David P. Reed, and David D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems 2 no. 4 (1984): 277-288.