Building an SRE team 101

Ioannis Georgoulas
3 min read · Jan 10, 2022
Photo by Marcel Strauß on Unsplash

Site Reliability Engineering means something completely different from person to person. To me, SRE is a set of practices focused on keeping a system "good enough": striving for excellence rather than perfection while meeting customer requirements.

But how do you build a team for something that generic? Depending on how mature your organisation is with SRE practices (which I will not cover here, as you can read a philosophical view of this in the SRE book), you have to build a tech roadmap and a hiring plan accordingly. Start simple by defining what kind of SRE team makes sense for your organisation. Then, based on your organisation's maturity, set an SRE/Developer ratio; in my case, I set 1/10, that is, one SRE for every ten developers.
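As a rough illustration of what such a ratio means for a hiring plan, the target number of SREs can be derived from the developer headcount. This is only a sketch: the function name `sre_headcount`, the choice to round up, and the sample headcount of 45 are assumptions made for illustration, not part of the original plan.

```python
import math


def sre_headcount(developers: int, devs_per_sre: int = 10) -> int:
    """Return how many SREs are needed so that no SRE covers
    more than `devs_per_sre` developers (assumption: round up)."""
    if developers < 0 or devs_per_sre <= 0:
        raise ValueError("developers must be >= 0 and devs_per_sre must be > 0")
    return math.ceil(developers / devs_per_sre)


if __name__ == "__main__":
    # Hypothetical organisation with 45 developers and a 1/10 ratio.
    print(sre_headcount(45))
```

Rounding up keeps the ratio from being exceeded; an organisation that treats 1/10 as a loose guideline could just as reasonably round to the nearest whole number.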

I started with a Kitchen Sink SRE team, tackling the most unreliable and toil-heavy systems in the platform. These are not always user-facing, but they slow down the whole department (e.g. how you deploy your code!).

Hiring Plan

The "unicorn SRE" candidate has a software engineering background and has also worked on distributed systems as a systems engineer. OK, back to reality now: if you are looking only for these engineers, you break the first SRE principle by looking for perfection, not excellence, and your hiring will be rigid and slow.

Split your hiring into the following two types of SREs and try to balance your team based on everyone's skills; the categorisation is simplistic but, in my case, did the trick!

Application SRE (AppSRE):

  • Someone with solid software engineering experience. Which programming language doesn't matter, as long as candidates are willing to learn the one your team uses.
  • Has worked with the Cloud Provider of your choice. This is important, as the learning curve will be steep without foundational knowledge.

Infrastructure SRE (InfraSRE):

  • Someone with solid systems and distributed systems experience and in-depth knowledge of the Cloud Provider of your choice.
  • Can write code and understands the importance of peer review as part of anything we do.

For both of these types, there are some common things that you should look for:

  • Business use-case mindset. We are here to solve business problems, and most of the time, we have to make the appropriate trade-offs.
  • No fixation with tools. Everyone likes to play with the latest tech, but there is a right time and place for everything.
  • Collaboration and communication
  • Willingness to learn new things and be out of their comfort zone

Environment

After you hire all these excellent engineers, it's vital to have an environment where they can thrive and put their skills into action. For all of the project work you plan, I suggest evaluating alternatives with a proof of concept (POC) and deciding based on your use case and what the business needs.

Also, it's essential to give SREs time and support to learn something new, e.g. an AppSRE taking on more ops work, or an InfraSRE taking on more software work. You should expect that everyone on the team can work on anything with the proper support from their peers.

Last but not least, SREs should be the engineers who evangelise reliability engineering. They should be willing to work with other teams, get feedback, and adapt the system and the practices. There is a vital line SREs should not cross here: they should be the enablers and not the "doers"/"fixers" who apply production changes in a silo. By following that, you will spread the SRE mindset across all of engineering, and you should consider this way of working another "skill" to look for in candidates.

Building an SRE team comes with challenges, but it can be rewarding when everyone enjoys their work and the business enjoys the team's output.
