Weblog

Link: Preliminary Analysis of the Site Reliability Engineer Survey

July 9, 2018

If the response takes too long to get to your phone, the system might as well be “unavailable”: ‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability.’

Continue Reading →
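To make the point about slow responses concrete, here is a minimal sketch of an availability indicator that can optionally count slow responses as unavailable. The `Request` type, the `availability` function, the 2000 ms threshold, and the sample data are all illustrative assumptions, not something taken from the survey.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    ok: bool            # True if the server returned a successful response
    duration_ms: float  # end-user response time in milliseconds


def availability(requests: List[Request],
                 slow_threshold_ms: Optional[float] = None) -> Optional[float]:
    """Fraction of requests counted as available.

    If slow_threshold_ms is given, a successful response that took longer
    than the threshold is counted as unavailable (hypothetical policy).
    """
    if not requests:
        return None
    good = 0
    for r in requests:
        too_slow = slow_threshold_ms is not None and r.duration_ms > slow_threshold_ms
        if r.ok and not too_slow:
            good += 1
    return good / len(requests)


# Made-up sample: three successes (one very slow) and one failure.
sample = [Request(True, 120), Request(True, 4500),
          Request(False, 80), Request(True, 300)]

print(availability(sample))        # 0.75 when only errors count
print(availability(sample, 2000))  # 0.5 when slow responses also count
```

The same traffic looks 75% available or 50% available depending on whether response time is folded into the definition, which is why the three survey options overlap.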

Link: Monitoring SRE's Golden Signals

July 9, 2018

Lists out how to get the metrics from various systems and software. Original source: Monitoring SRE’s Golden Signals

Link: How to Monitor the SRE Golden Signals

July 9, 2018

[Summary from the post of metrics to use:]
Rate: Request rate, in requests/sec.
Errors: Error rate, in errors/sec.
Latency: Response time, including queue/wait time, in milliseconds.
Saturation: How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization: How busy the resource or system is.

Continue Reading →
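As a rough illustration of the five metrics summarized above, the sketch below derives them from one measurement window. Everything here is an assumption for the example: the `Sample` record, the `golden_signals` function, treating HTTP 5xx as errors, reporting latency as a nearest-rank 95th percentile, and taking queue depth and busy time as inputs supplied by the caller.

```python
import math
from collections import namedtuple

# One observed request: HTTP status and response time (queue/wait included).
Sample = namedtuple("Sample", "status latency_ms")


def golden_signals(samples, window_s, queue_depth, busy_s):
    """Summarize one measurement window of length window_s seconds.

    queue_depth: current length of the work queue (saturation proxy).
    busy_s: seconds within the window the resource was busy.
    """
    n = len(samples)
    errors = sum(1 for s in samples if s.status >= 500)  # assumed error rule
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank 95th percentile; None when there were no requests.
    p95 = latencies[math.ceil(0.95 * n) - 1] if n else None
    return {
        "rate_per_s": n / window_s,
        "errors_per_s": errors / window_s,
        "latency_p95_ms": p95,
        "saturation_queue_depth": queue_depth,
        "utilization": busy_s / window_s,
    }


# Made-up window: 10 requests in 10 seconds, two of them server errors.
window = [Sample(500 if i >= 8 else 200, 10 * (i + 1)) for i in range(10)]
signals = golden_signals(window, window_s=10, queue_depth=0, busy_s=4)
print(signals)
```

In practice these numbers usually come from an existing metrics system rather than raw request records; the sketch only shows how the definitions relate to each other.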

Link: "An astonishing paper that may explain why it’s so difficult to patch."

July 9, 2018

The most important thing is to be able to fix The Broken quickly, not to make sure it never breaks. “They monitored 400 libraries. In 116 days, they saw 282 breaking changes! Each day, there’s 6.1% chance of breaking chg, for each lib you use!” Original source: “An astonishing paper that may explain why it’s so difficult to patch.”
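The quoted counts can be turned into a back-of-the-envelope rate. The sketch below assumes breaking changes are spread uniformly and independently across libraries and days, which the quote does not claim; the variable names and the `p_at_least_one` helper are illustrative. Under that assumption the counts work out to roughly 0.61% per library per day, an order of magnitude below the quoted 6.1%, so the quoted figure may rest on a different basis.

```python
# Counts taken from the quote above.
libraries = 400
days = 116
breaking_changes = 282

# Naive uniform estimate: breaking changes per library per day.
per_lib_per_day = breaking_changes / (libraries * days)
print(f"{per_lib_per_day:.2%}")  # 0.61%


def p_at_least_one(n_libs, p=per_lib_per_day):
    """Chance that at least one of n_libs libraries has a breaking change
    on a given day, assuming independence (a simplifying assumption)."""
    return 1 - (1 - p) ** n_libs


# With many dependencies the daily chance of some breakage grows quickly.
for n in (1, 10, 100):
    print(n, f"{p_at_least_one(n):.1%}")
```

Even at the lower per-library rate, a project with a hundred dependencies would see some breaking change on a large share of days, which supports the point about fixing quickly rather than preventing every break.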
