Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load, a user will consider it to be unavailable. I realized after the fact that the nuances of this were not considered in the phrasing of one of our questions. We asked, “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as whether the system is up or down, latency as the delay before a response is generated, and end-user response time as how long before the user receives the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs, availability means more than whether a system is up or down. If the response time or latency exceeds a certain threshold, the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey
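
One way to make that definition operational is to fold the latency threshold into the availability SLI itself, so a request only counts as “available” if it succeeds and responds in time. A minimal Python sketch, where the request record shape and the 500 ms threshold are purely illustrative assumptions (neither comes from the survey):

```python
from dataclasses import dataclass

# Illustrative request record; field names are assumptions for this sketch.
@dataclass
class Request:
    ok: bool            # True if no error was returned
    latency_ms: float   # end-to-end response time in milliseconds

def availability_sli(requests: list[Request], latency_threshold_ms: float = 500.0) -> float:
    """Fraction of requests that were truly 'available' to the user:
    they succeeded AND responded within the latency threshold."""
    if not requests:
        return 1.0  # no traffic: treat as available by convention
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)

# Example: the 10-second page counts against availability just like the error.
reqs = [Request(True, 120), Request(True, 10_000), Request(False, 80)]
print(f"availability: {availability_sli(reqs):.2%}")  # -> 33.33%
```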

Link: How to Monitor the SRE Golden Signals

[Summary of the metrics the post recommends; a sketch of computing them follows below:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is; related to utilization, but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, and often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed as 0–100%, and most useful for predictions (for the current state, Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate × Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals
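
For concreteness, here is a small Python sketch of how the first three signals might be derived from a window of request records. The record shape, window length, and values are assumptions for illustration; saturation and utilization would normally come from direct gauges (queue depth, CPU busy time) rather than the request log:

```python
import statistics

# Illustrative request records for one measurement window; fields are assumed.
# Each tuple: (succeeded, latency_ms)
window_s = 60.0
requests = [(True, 42.0), (True, 55.5), (False, 900.0), (True, 48.0)]

# Rate: requests per second over the window.
rate = len(requests) / window_s

# Errors: failed requests per second over the window.
errors = sum(1 for ok, _ in requests if not ok) / window_s

# Latency: response time including queue/wait time; a median or percentile
# is more robust than a mean, which one slow request can skew.
latency_p50_ms = statistics.median(lat for _, lat in requests)

# Saturation and utilization come from direct gauges, e.g.:
queue_depth = 0          # becomes non-zero roughly when the service saturates
utilization_pct = 37.0   # 0-100%, how busy the resource is

print(f"rate={rate:.2f}/s errors={errors:.2f}/s p50={latency_p50_ms}ms "
      f"queue={queue_depth} util={utilization_pct}%")
```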

Link: Q&A on the Book Agile Management

“Most of the available maturity models measure the degree to which the agile techniques and tools are deployed. I prefer to look at it from a different angle. First, define what your most important performance indicators are with respect to agility. For instance, time-to-market, employee satisfaction, customer satisfaction, and so on. Then benchmark these, if possible. And also follow their development over time, to determine whether they are improving or not.”
Original source: Q&A on the Book Agile Management

Link: How to build a business case for DevOps transformation

“Here are a few signs that your company should consider transitioning to DevOps:

Does it take a long time to deliver features?
Are features underutilized?
Do you not know the utilization of features?
Do you have downtime during maintenance or deployment windows?
Do your customers tell you your site is down before you know it?
Do outages occur repeatedly for the same reason?
Are customer feature requests implemented in a way that doesn’t actually fulfill the customer’s needs?”

Original source: How to build a business case for DevOps transformation

Good, simple explanation of Service Level Objectives (SLOs)

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Source: Building good SLOs – CRE life lessons
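
A common way to make “intends to take action to defend” concrete is an error budget: the fraction of requests the SLO permits to fail. A tiny sketch, with an assumed 99.9% target and made-up counts:

```python
# A minimal error-budget sketch; the target and counts are assumptions.
slo_target = 0.999          # SLO: 99.9% of requests succeed
total_requests = 1_000_000  # requests served so far this window
failed_requests = 620       # requests that violated the SLI

budget = (1 - slo_target) * total_requests   # failures the SLO tolerates
remaining = budget - failed_requests

print(f"error budget: {budget:.0f} requests, remaining: {remaining:.0f}")
# budget = 1000, remaining = 380: still within the SLO, with headroom to
# spend on risky releases; a negative value means users are being made unhappy.
```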

Core DevOps (tech) metrics, from Nicole Forsgren

Everyone always wants to know which metrics to use. While the answer is always a solid “it depends – what are your business goals? Then we can come up with some KPIs,” there’s a recurring set of technical metrics. Nicole lists them:

These IT performance metrics capture speed and stability of software delivery: lead time for changes (from code commit to code deploy), deployment frequency, mean time to restore (MTTR), and change fail rate. It’s important to capture all of these because they are in tension with each other (speaking to both speed and stability, reflecting priorities of both the dev and ops sides of the team), and they reflect overall goals of the team. These metrics as a whole have also been shown to drive organizational performance.

And then, further summarized by Daniel Bryant:

Key metrics for IT performance capture speed and stability of software delivery, and include: lead time for changes (from code commit to code deploy), deployment frequency, mean time to restore (MTTR), and change fail rate.
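
As an illustration only, the four metrics could be derived from a deployment log along these lines; the record shape and numbers here are invented for the sketch, not from the interview:

```python
from datetime import datetime, timedelta

# Illustrative deployment log; fields and values are assumptions.
# Each entry: (deployed_at, commit_at, failed, restored_at or None)
deploys = [
    (datetime(2018, 5, 1, 12), datetime(2018, 4, 30, 9), False, None),
    (datetime(2018, 5, 3, 15), datetime(2018, 5, 2, 14), True,
     datetime(2018, 5, 3, 16, 30)),
    (datetime(2018, 5, 7, 10), datetime(2018, 5, 4, 11), False, None),
]
window_days = 7

# Lead time for changes: code commit to code deploy.
lead_times = [d - c for d, c, _, _ in deploys]
avg_lead = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency: deploys per day over the window.
freq = len(deploys) / window_days

# Change fail rate: fraction of deploys that caused a failure.
fail_rate = sum(1 for _, _, failed, _ in deploys if failed) / len(deploys)

# Mean time to restore (MTTR): failed deploy to restoration.
restores = [r - d for d, _, failed, r in deploys if failed and r]
mttr = sum(restores, timedelta()) / len(restores) if restores else timedelta()

print(f"lead={avg_lead} freq={freq:.2f}/day fail={fail_rate:.0%} mttr={mttr}")
```

Note how the four pull in different directions: deploying more often tends to push the fail rate up unless stability practices improve too, which is why capturing all of them matters.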

Also in the interview, a concise DevOps definition:

I define DevOps as a technology transformation that drives value to organizations through an ability to deliver code with both speed and stability.

See the rest.