🗂 Link: Scale and velocity are driving the next generation of DevOps

This massive growth in scale has required an evolution in practices and organization to achieve success. Most of what technologists are aware of in this regard is labeled “DevOps,” but there is more nuance to it than that. The way infrastructure capacity is allocated becomes decoupled from specific hardware, so the infrastructure team has to adapt new tools. The way databases and message busses, among other things, are operated and made available to applications has become more “self-service”, and thus those teams have to see themselves as service providers rather than as infrastructure teams.

Source: Scale and velocity are driving the next generation of DevOps

5 Definitions of DevOps, or, ¯\_(ツ)_/¯

DevOpsDays Amsterdam - Thursday June 25th

I’ve tracked at least three different definitions of DevOps since the days of “agile infrastructure”:

  1. Using Puppet and Chef (and then Ansible and Chef) to replace Opsware and BladeLogic.
  2. Full stack engineers to setup EC2, load-balancers, and other Morlock shit.
  3. Full stack engineers are bad, but sort of the same thing. Also, you can’t have a DevOps “group” or title. But, you know, someone should do all that automation.
  4. Putting all the people on one team, having them focus on a product, and establishing a culture of caring and learning.
  5. SRE is not DevOps.

So…actually five. Maybe some of them just being footnotes on the evolving concept. (And, if you, dear reader, feel these are wrong, then let’s compromise and make the list six.)

All of them evolved around bringing down The Wall of Confusion, allowing “developers” to deploy their software to production more frequently, weekly, if not daily. And, of course, making sure production stays up. (You’re supposed to call that “resiliency” and instead of SLAs use SLOs and some other newly named metrics that answer the question “IS MY SHIT WORKING?” Whatever you do, just don’t say “uptime,” or you’re in for it and will be relegated to running the AS/400’s.)

I used to snide that the developers seemed to have been yanked out of DevOps, sometime around 2014 and 2015. All the talks I saw were, basically, operations talks. I haven’t really checked in on DevOps conference talks recently, but at the time, I don’t think there was much application development stuff. (I’m not sure if there ever was?)

None of this means that DevOps is not a thing. Not at all. It just means that the enterprise finds its own use for things. It also means there’s still weekly write-ups of what DevOps is – you know, those ones that are always lists of ideas, things you’re getting wrong, and how to start.

Autonomous product teams

This kind of thing is happening all the time

Nowadays, I try to stick to that forth one: you want to setup autonomous teams that have all the skills and responsibility/authority/tools needed to “own” the software being specified, designed, developed, and run. This means you have to, basically, remove-by-automating all the operations stuff it takes to stand-up environments, deploy things, and do all that “day 2” stuff.

(HEY! HEY! WANT TO BUY SOME ENTERPRISE SOFTWARE?!)

Now, I think this product-centric notion of DevOps is, well, kind of an over-extension of the term “DevOps.” But since SRE has sucked out the “ops” part (but, remember, dear reader, don’t commit the embarrassing act of saying SRE is DevOps – no, no, you’d never do that, right? SO SHAMEFUL! (SRE is totally different – no overlap or similar goals shared between them at all. I mean, they have separate groups, silos! COME ON!)), slicing “DevOps” back to just “Dev,” but with a product-not-project focus isn’t too shabby.

Anyhow. I came across a good overview of this product notion of DevOps, all the way back from 2016, while re-reading Schwartz’s evergreen excellent The Art of Business Value:

Agile approaches attempt to bring together developers and the business in an atmosphere of mutual respect and joint contribution. Until now, however, the focus has been on users of the software, product visionaries, and developers. Recent developments in the Agile world—notably DevOps—have broadened this idea of respect and inclusion to encompass Operations and Security. The DevOps model, in other words, looks to break down the silos that have resulted from technical specialization over the last few decades. But the DevOps spirit goes further, looking to eliminate the conflicting incentives of organizational silos and the inhumane behaviors that can result from those conflicting incentives.

 

Perhaps we can take this idea even further still. There is no reason why the DevOps team’s responsibility needs to stop at the border of what used to be considered IT. The team is part of a broader enterprise, whose collective knowledge, skills, and judgment need to be part of the value creation process.

Look a’ that guy! Business Value just effortlessly jets out of his pores like a peripatetic thought-monarch!

This is from an executives perspective, but it drives home the point we’re always trying to get to with software: doing whatever it takes to figure out, create, and give users features that are actually useful to them. Somewhere beyond that, if you’re lucky, it’ll help out “the business.” Also, it should implement The Unspoken User Story: user would like software to actually work.

Link: SRE: The Biggest Lie Since Kanban

That’s why SRE is a Big Lie – because it enables people to say they’re doing a thing that could help their organization succeed, and their dev and ops engineers to have a better career and life while doing so – but not really do it. Yes, there have been Big Lies before, which is why I cite Kanban as another example – but even if the new criminal is pretty much like the old criminal, you still put their picture up on the post office wall.

If something you’re selling is profoundly misused it’s your responsibility to be more up front about the issues.
Original source: SRE: The Biggest Lie Since Kanban

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey

Link: How to Monitor the SRE Golden Signals

[Summary from the post of metrics to use:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals

Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
Original source: Splunk acquires VictorOps to take it – and you – into site reliability engineering

Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
Original source: Splunk acquires VictorOps to take it – and you – into site reliability engineering

Link: Full Cycle Developers at Netflix

How Netflix thinks about standardized platforms and tools, plus their adaptation of DevOps and SRE.

“Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will enable my partners to answer their questions without needing me to be involved?” This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.”
Original source: Full Cycle Developers at Netflix

Good, simple explanation of Service Level Objectives (SLOs)

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Source: Building good SLOs – CRE life lessons

Pivotal Conversations: “Running like Google,” the CRE Program & Pivotal, with Andrew Shafer

The summary:

What does it really mean to “run like Google”? Is that even a good idea? Andrew Shafer comes back to the podcast to talk with Coté about how the Google SRE book and the newly announced Google CRE program start addressing those questions. We discuss some of the general principals, and “small” ones too that are in those bodies of work and how they represent an interesting evolution of it IT management is done. Many of the concepts that the DevOps and cloud-native community talks about pop in Google’s approach to operations and software delivery, providing a good, hyper-scale case study of how to do IT management and software development for distributed applications. We also discuss Pivotal’s involvement in the Google CRE program.

Check out the SoundCloud listing, or download the MP3 directly.