Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey
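
To make the distinction in the excerpt concrete, here is a minimal sketch of availability measured two ways: as plain up/down, and with a latency threshold past which a slow response counts as unavailable. The `Request` type, the `availability` function, the sample data, and the 2000 ms threshold are all assumptions for illustration; none of them come from the survey.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    ok: bool           # True if the response was not an error
    latency_ms: float  # time until the user received the response


def availability(requests: Sequence[Request],
                 threshold_ms: Optional[float] = None) -> Optional[float]:
    """Fraction of requests counted as available.

    With no threshold, only errors count against availability (up/down).
    With a threshold, a successful but too-slow response also counts
    as unavailable.
    """
    if not requests:
        return None  # no traffic, so no measurement
    good = sum(
        1
        for r in requests
        if r.ok and (threshold_ms is None or r.latency_ms <= threshold_ms)
    )
    return good / len(requests)


# Hypothetical sample: one error and one 10-second page load.
sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=10_000),
    Request(ok=False, latency_ms=80),
    Request(ok=True, latency_ms=300),
]

up_down = availability(sample)                            # errors only
with_threshold = availability(sample, threshold_ms=2000)  # errors + slow
```

On this sample the up/down view reports 3 of 4 requests as available, while the threshold view reports 2 of 4, because the 10-second response is treated as a failure.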

Link: How to Monitor the SRE Golden Signals

[Summary from the post of metrics to use:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals
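
As a rough illustration of the units in the summary above, the sketch below computes rate, error rate, a latency percentile, a saturation check, and the Utilization Law estimate that the post mentions but chooses not to rely on. All function names and sample numbers are hypothetical, and the nearest-rank percentile is just one common choice, not something the post prescribes.

```python
import math
from typing import Sequence


def per_second(count: int, window_s: float) -> float:
    """Rate or Errors: events per second over a measurement window."""
    return count / window_s


def percentile(values: Sequence[float], pct: float) -> float:
    """Latency: nearest-rank percentile of response times (ms)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_saturated(queue_depth: int) -> bool:
    """Saturation: a non-zero queue means work is waiting."""
    return queue_depth > 0


def utilization_law(rate: float, service_time_s: float, workers: int) -> float:
    """Utilization estimate: Rate x Service Time / Workers."""
    return rate * service_time_s / workers


# Hypothetical one-minute window.
request_rate = per_second(600, window_s=60)   # requests/sec
error_rate = per_second(6, window_s=60)       # errors/sec
p95_ms = percentile([20, 25, 30, 40, 500], 95)
saturated = is_saturated(queue_depth=3)
utilization = utilization_law(request_rate, service_time_s=0.05, workers=2)
```

With these numbers the request rate is 10 requests/sec and the Utilization Law gives 10 x 0.05 / 2 = 0.25, or 25%. A directly measured utilization figure, as the post prefers, would replace that last estimate.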

Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
Original source: Splunk acquires VictorOps to take it – and you – into site reliability engineering

Link: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M

“Sensu Enterprise is the commercial version of that project, and it costs between $99 and $999 depending on how many servers you’ll need to monitor your cloud environment. You also get customer service that you won’t get if you try to install the open-source project on your own, a key part of the strategy of many enterprise startups building around open-source projects.”
Original source: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M