Q&A on the Book Evidence-Based Management

The most important issue in organizational data quality is whether you have the data you need to test whether your beliefs about the organization are really true. So if I believe my organization has a reliable back office in terms of transactions, do I have the data that show how many errors are made a day or a month for a given volume of transactions? Counts tell us almost nothing; we need rates, like errors/daily volume. If I am relying on my impressions, I am talking to myself.

Source: Q&A on the Book Evidence-Based Management
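
To make the counts-versus-rates point concrete, here is a minimal sketch. The `DailyCount` type, the `error_rate` helper, and the numbers are all made up for illustration; nothing here comes from the book.

```python
from dataclasses import dataclass


@dataclass
class DailyCount:
    """Hypothetical record of one day's back-office activity."""
    day: str
    transactions: int
    errors: int


def error_rate(c: DailyCount) -> float:
    """Errors per transaction for one day; 0.0 when there was no volume."""
    if c.transactions == 0:
        return 0.0
    return c.errors / c.transactions


# Illustrative numbers only: Monday has more errors, Tuesday has the worse rate.
days = [
    DailyCount("Mon", transactions=10_000, errors=50),
    DailyCount("Tue", transactions=2_000, errors=40),
]
rates = {d.day: error_rate(d) for d in days}
worst_by_count = max(days, key=lambda d: d.errors).day
worst_by_rate = max(days, key=error_rate).day
```

Ranked by raw count, Monday looks worse; ranked by rate, Tuesday does, which is the reason to divide by volume before drawing conclusions.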

Link: Pivotal and New Relic Deliver Visibility, Value, and Velocity

A nice listing of some metrics to monitor out of the box in PCF, and not just performance metrics, but meatware and product-related stuff too.
Original source: Pivotal and New Relic Deliver Visibility, Value, and Velocity

Link: PCF and New Relic at West

“Thomas said he plans to migrate hundreds of existing applications to Pivotal Cloud Foundry, which will also serve as the standard platform-of-excellence for all new applications. New Relic will ensure reliability, availability, and performance of those workloads as well as enable West’s ops team to monitor Pivotal Cloud Foundry itself.”
Original source: PCF and New Relic at West

Link: New Cloud Unicorn: PagerDuty Scores $1.3 Billion Valuation In $90 Million Round

“The company says it passed $100 million in annual recurring revenue in recent months”
Original source: New Cloud Unicorn: PagerDuty Scores $1.3 Billion Valuation In $90 Million Round

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey
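
A rough sketch of that idea: an availability number in which a slow response counts the same as a failed one. The `Request` type, the `availability` helper, and the 2,000 ms threshold are assumptions for illustration, not anything taken from the survey.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """Hypothetical record of one end-user request."""
    ok: bool           # False if an error appeared or the page failed to load
    latency_ms: float  # how long the user waited for the response


def availability(requests: Sequence[Request], threshold_ms: float) -> Optional[float]:
    """Fraction of requests that succeeded AND came back within threshold_ms.

    Returns None when there are no requests to judge.
    """
    if not requests:
        return None
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=10_000),  # "available" but took 10 seconds
    Request(ok=False, latency_ms=80),     # fast, but an error
    Request(ok=True, latency_ms=300),
]

up_or_down = availability(sample, threshold_ms=float("inf"))  # ignores latency
with_threshold = availability(sample, threshold_ms=2_000)     # assumed threshold
```

With no latency limit the sample looks 75% available; once the assumed threshold is applied, the 10-second page load counts against it and the figure drops to 50%.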

Link: How to Monitor the SRE Golden Signals

[Summary from the post of metrics to use:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals
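
As a rough illustration of the list above, the five signals can be derived from one measurement window. The `Window` fields, the `golden_signals` and `utilization_law` helpers, and the sample numbers are all hypothetical, and a real system would more likely report latency percentiles than a mean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Window:
    """Hypothetical raw measurements collected over one time window."""
    seconds: float             # length of the window
    requests: int              # requests completed in the window
    errors: int                # requests that failed
    latencies_ms: List[float]  # response times, including queue/wait time
    queue_depth: int           # requests waiting at the end of the window
    busy_seconds: float        # total time workers spent doing work
    workers: int               # number of workers


def golden_signals(w: Window) -> Dict[str, float]:
    """Compute the five signals using direct measurements."""
    mean_latency = sum(w.latencies_ms) / len(w.latencies_ms) if w.latencies_ms else 0.0
    return {
        "rate_per_sec": w.requests / w.seconds,
        "errors_per_sec": w.errors / w.seconds,
        "latency_ms": mean_latency,
        "saturation": float(w.queue_depth),
        # Measured directly: share of available worker time that was busy.
        "utilization_pct": 100.0 * w.busy_seconds / (w.seconds * w.workers),
    }


def utilization_law(rate_per_sec: float, service_time_s: float, workers: int) -> float:
    """The Utilization Law estimate (~Rate x Service Time / Workers), as a percent."""
    return 100.0 * rate_per_sec * service_time_s / workers


window = Window(seconds=60, requests=600, errors=6,
                latencies_ms=[100.0, 200.0, 300.0],
                queue_depth=0, busy_seconds=120, workers=4)
signals = golden_signals(window)
estimate = utilization_law(signals["rate_per_sec"], service_time_s=0.2, workers=4)
```

In this made-up window the directly measured utilization and the Utilization Law estimate agree because the assumed 0.2 s service time matches the busy time per request; in practice the two can diverge, which is one reason to prefer the direct measurement.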

Link: “An astonishing paper that may explain why it’s so difficult to patch.”

The most important thing is to be able to fix The Broken quickly, not make sure it never breaks.

“They monitored 400 libraries. In 116 days, they saw 282 breaking changes! Each day, there’s 6.1% chance of breaking chg, for each lib you use!”
Original source: “An astonishing paper that may explain why it’s so difficult to patch.”

Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
Original source: Splunk acquires VictorOps to take it – and you – into site reliability engineering

Link: AppDynamics touts the agility of a startup with the pocket of a global giant

“Our ability to close a customer when Cisco is involved is up to 50 percent faster.”

One of the best advantages of being part of a big tech company.
Original source: AppDynamics touts the agility of a startup with the pocket of a global giant

Link: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M

“Sensu Enterprise is the commercial version of that project, and it costs between $99 and $999 depending on how many servers you’ll need to monitor your cloud environment. You also get customer service that you won’t get if you try to install the open-source project on your own, a key part of the strategy of many enterprise startups building around open-source projects.”
Original source: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M

Link: Misunderstanding “Open Tracing” for the Enterprise

Enterprise systems management software is hard.

“OpenTracing doesn’t solve the interoperability problem, so what does the “open standard” attempting to solve? Well, for one thing, is that it allows those making gateways, proxies, and frameworks the ability to write instrumentation. That should, in theory, make it easier to get traces connected, but once again the requirement to change implementation details for each tool is a problem.”
Original source: Misunderstanding “Open Tracing” for the Enterprise

Link: Moogsoft gets $40m round D

“Moogsoft claims to have more than doubled revenue in the past year thanks to new customer wins. The startup counts Cisco Systems Inc., T-Mobile USA Inc., Intuit Inc. and other major tech firms among its users.”
Original source: Moogsoft gets $40m round D

Link: Datadog log monitoring software branches out as DevOps spreads

‘Enterprises initially implemented DevOps within specialized groups that owned a specific application and chose their own IT management tools, said Nancy Gohring, analyst at 451 Research. “Then one day, the enterprise woke up and saw it had 50 different tools, and in some cases, multiple instances of the same tool, each managed by different people,” she said.’
Original source: Datadog log monitoring software branches out as DevOps spreads

Link: New Relic CEO Lew Cirne – “Digital is the new front door” for business

“For its third quarter non-GAAP operating income was $2.7 million compared to an operating loss of $4.9 million for the same period last year. Revenue was $91.8 million for the third quarter, up 35% year-over-year.”
Original source: New Relic CEO Lew Cirne – “Digital is the new front door” for business

Link: Reading Up on Observability and Monitoring – Adron Hall

“key in understanding the difference in monitoring — the combing of data to determine the state or well-being of a system — versus observability — the view into and understanding of the state of events within a system.”
Original source: Reading Up on Observability and Monitoring – Adron Hall