Q&A on the Book Evidence-Based Management

The most important issue in organizational data quality is whether you have the data you need to test whether your beliefs about the organization are really true. So if I believe my organization has a reliable back office in terms of transactions, do I have the data that show how many errors are made a day or a month for a given volume of transactions? Counts tell us almost nothing; we need rates, like errors per daily volume. If I am relying on my impressions, I am talking to myself.

Source: Q&A on the Book Evidence-Based Management
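
To make the counts-versus-rates point concrete, here's a back-of-the-envelope sketch (all numbers invented): the same error count is trivial at one volume and a crisis at another.

```python
# Same absolute error count, very different error rates (invented numbers).
errors_per_day = 500

print(f"{errors_per_day / 1_000_000:.2%} at 1M transactions/day")  # 0.05%
print(f"{errors_per_day / 5_000:.2%} at 5K transactions/day")      # 10.00%
```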

Link: PCF and New Relic at West

“Thomas said he plans to migrate hundreds of existing applications to Pivotal Cloud Foundry, which will also serve as the standard platform-of-excellence for all new applications. New Relic will ensure reliability, availability, and performance of those workloads as well as enable West’s ops team to monitor Pivotal Cloud Foundry itself.”
Original source: PCF and New Relic at West

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey
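
A minimal sketch of that broader definition, assuming each request records a success flag and a latency, and treating anything slower than a made-up threshold as unavailable:

```python
# A response slower than the threshold counts against availability,
# even if it technically succeeded. Threshold and records are invented.
LATENCY_THRESHOLD_MS = 2_000

requests = [
    {"ok": True,  "latency_ms": 120},
    {"ok": True,  "latency_ms": 9_500},  # loaded eventually, but too slow to count
    {"ok": False, "latency_ms": 30},     # error page: plainly unavailable
    {"ok": True,  "latency_ms": 480},
]

good = sum(1 for r in requests if r["ok"] and r["latency_ms"] <= LATENCY_THRESHOLD_MS)
print(f"availability: {good / len(requests):.1%}")  # 50.0%
```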

Link: How to Monitor the SRE Golden Signals

[A summary, from the post, of the metrics to use:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a gauge.
Utilization — How busy the resource or system is. Usually expressed as 0–100%, and most useful for predictions (Saturation is probably the more useful signal). Note we are not using the Utilization Law to get this (~Rate × Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals
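
Here's one way those signals might be derived from a single window of raw request records; the record shape, the worker counts, and the 60-second window are all assumptions for illustration, not anything the post prescribes.

```python
def golden_signals(records, queue_depth, busy_workers, total_workers,
                   window_seconds=60):
    """Derive the five signals from one (non-empty) window of requests.

    Each record is assumed to look like {"ok": bool, "latency_ms": float};
    queue_depth and worker counts come from whatever the platform exposes.
    """
    n = len(records)
    latencies = sorted(r["latency_ms"] for r in records)
    return {
        "rate": n / window_seconds,                                         # requests/sec
        "errors": sum(1 for r in records if not r["ok"]) / window_seconds,  # errors/sec
        "latency_p95_ms": latencies[int(0.95 * (n - 1))],                   # includes queue/wait time
        "saturation": queue_depth,                      # non-zero once you're saturated
        "utilization_pct": 100.0 * busy_workers / total_workers,            # direct measurement, 0-100%
    }
```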

Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
Original source: Splunk acquires VictorOps to take it – and you – into site reliability engineering

Link: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M

“Sensu Enterprise is the commercial version of that project, and it costs between $99 and $999 depending on how many servers you’ll need to monitor your cloud environment. You also get customer service that you won’t get if you try to install the open-source project on your own, a key part of the strategy of many enterprise startups building around open-source projects.”
Original source: Monitoring continues to be a valuable part of the hybrid cloud, as Sensu raises $10M

Link: Misunderstanding “Open Tracing” for the Enterprise

Enterprise systems management software is hard.

“OpenTracing doesn’t solve the interoperability problem, so what [is] the “open standard” attempting to solve? Well, for one thing, [it] allows those making gateways, proxies, and frameworks the ability to write instrumentation. That should, in theory, make it easier to get traces connected, but once again the requirement to change implementation details for each tool is a problem.”
Original source: Misunderstanding “Open Tracing” for the Enterprise
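
For a sense of what “writing instrumentation” against the API looks like, here's a rough Python sketch: the handler depends only on the vendor-neutral opentracing package, while a concrete tracer still has to be installed per tool at startup (handle_request and route are hypothetical names).

```python
import opentracing  # vendor-neutral API; a concrete tracer is installed at startup

def handle_request(request):
    # A framework or proxy author writes this once against the API;
    # whichever tracer was registered as the global tracer receives the span.
    with opentracing.global_tracer().start_active_span("handle_request") as scope:
        scope.span.set_tag("http.method", request.method)
        return route(request)  # hypothetical downstream call
```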

Link: Datadog log monitoring software branches out as DevOps spreads

‘Enterprises initially implemented DevOps within specialized groups that owned a specific application and chose their own IT management tools, said Nancy Gohring, analyst at 451 Research. “Then one day, the enterprise woke up and saw it had 50 different tools, and in some cases, multiple instances of the same tool, each managed by different people,” she said.’
Original source: Datadog log monitoring software branches out as DevOps spreads

Link: With Loggly, SolarWinds scoops up another log service

“With the acquisition of Loggly, SolarWinds obtains an asset that was slow in getting started but has hit a patch of growth recently. As of September, we believe the company was on track to finish 2017 with roughly $10m in billings, up from mid-single digits in 2016. Founded in 2009 with a mission of offering a SaaS-based, easy-to-use logging product with helpful visualizations built using advanced analytics, Loggly had raised $47m in venture capital, including an $11.5m series D round in June 2016.” They estimate ~3,000 paying customers.
Original source: With Loggly, SolarWinds scoops up another log service

Link: Microsoft gets serious about monitoring

“Microsoft’s vision is to deliver tools that can offer a holistic view of services to application architects looking to optimize their software; performance information and debugging capabilities for DevOps and ops pros; insight into KPIs for executives; and information about customer usage to product owners. Microsoft doesn’t yet have a cohesive offering for all of the above, but it has the pieces to enable it and has begun delivering on some integrations across products.”
Original source: Microsoft gets serious about monitoring

Good, simple explanation of Service Level Objectives (SLOs)

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Source: Building good SLOs – CRE life lessons
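
As arithmetic, an availability SLO and its error budget are simple; here's a sketch with invented numbers (a 99.9% monthly target):

```python
# Invented numbers: 99.9% availability SLO over one month of traffic.
slo_target = 0.999
total_requests = 10_000_000
bad_requests = 4_200   # errors plus responses too slow to count as available

availability = 1 - bad_requests / total_requests   # 0.99958, so the SLO is met
error_budget = (1 - slo_target) * total_requests   # 10,000 bad requests allowed
print(f"availability {availability:.4%}, error budget used {bad_requests / error_budget:.0%}")
```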

ScienceLogic momentum

The company targets very large users, with 60% of its customers being MSPs, followed by enterprises at about 30%, and the rest coming from government agencies. It doesn’t report the number of direct customers, but its website boasts 47,000 organizations as users, many of them employing ScienceLogic via service providers. Average annual contract value for direct customers is $125,000.

Source: ScienceLogic targets new use case aimed at frustrated CMDB users

Pivotal Conversations: Cloud-native monitoring & PCF Metrics, with Todd Persen

This week’s podcast:

In this episode we talk with Todd Persen on the topic of monitoring cloud-native applications with Pivotal Cloud Foundry Metrics. We discuss the changing nature of monitoring in cloud-native platforms, how developers can now turn black boxes into white boxes, why time-series dominates the thought-technology in this space now, and the benefits of open source taking over most innovation in systems management. Richard is out this week, so Andrew Shafer returns to fill in as co-host.

Listen above, download the MP3 directly, and/or subscribe to the podcast feed if you haven’t.

Update on Dynatrace, around half a billion in revenue

From Nancy Gohring:

In 2015, Dynatrace recorded $466.6m in revenue, including $30m from services and $60m from SIGOS, the mobile network-testing company that Keynote acquired in 2006. Dynatrace’s APM revenue was $376.6m, representing 15% growth over the previous year, and making it twice as large by revenue as two of its primary competitors – New Relic and AppDynamics.

She writes fine reports.

Source: Dynatrace tackles integration of Keynote and Ruxit

Stackify caters to DevOps-oriented teams with ITMaaS monitoring tool (451 Report)

Looking for a production monitoring tool? Stackify launched recently, and my 451 report on them is up. Clients can read the full report, but here’s the 451 Take:

Our surveys of this space show steady interest in new tools and methods for monitoring and managing cloud-native applications. There are many entrenched tools from the Big Four (CA, IBM, HP, BMC) and other ‘legacy’ systems management vendors; these vendors have had mixed success in ‘keeping up,’ opening market gaps for Stackify and others. Early success in this market depends on good marketing and go-to-market models, with SolarWinds being an iconic example. The danger for a small company like Stackify comes in the form of well-moneyed and innovative competitors of all sizes and flavors. Stackify will have to quickly stake out its ground and begin expertly managing its deal funnel and pace of innovation – the core challenges of any startup.

If you’re not a client already, apply for a trial to check it out and put my name in as a reference.

Related, see a slice of the recent survey results on monitoring tools from 451’s TheInfoPro – chart above – free on their wonderful chart blog.

Source: Stackify caters to DevOps-oriented teams with ITMaaS monitoring tool (451 Report)