🗂 Link: Comparing the Predictive Ability of the Net Promoter Score and Satisfaction in 12 Industries

It’s behavioral intention vs. satisfaction. Another reason for why the NPS correlations with growth were nominally higher is because the NPS is a measure of behavioral intention (likelihood to recommend). In contrast, the ACSI is a measure of attitude. While attitudes do predict behavior, they tend to predict behavior through behavioral intentions. This idea is also supported by Pollack & Alexandrov (2013), who argue that satisfaction can be seen as an antecedent to repurchase intentions.

Source: Comparing the Predictive Ability of the Net Promoter Score and Satisfaction in 12 Industries

🗂 Link: Waste Management dumps legacy processes, drives digital change | CIO

We saw a spike of over 70% points for our new monthly bill-pay option. In the past, we’ve said that monthly billing is not convenient for us, but our customers told us that’s what they want. When we gave it to them, they rewarded us with an auto bill-pay rate that spiked, which is important because autopay is a leading indicator of how long a customer will stay with us. We saw a 40% jump in ecommerce revenue almost overnight. We are now levering those learnings in our commercial markets.

Source: Waste Management dumps legacy processes, drives digital change | CIO

🗂 Link: What Cloud Vendors Really Want From Their Customers

How much a customer spends on an annual basis is absolutely an indicator of strength, both internally and to the market. It is also a clear indicator of being able to effectively execute to the often-publicized overarching objective of expanding the customer base’s portfolios of products over the course of the relationship and since inception. Cloud vendors and the analysts that cover them also know that as the annual spend rises, the baseline spend grows, which can be hit with an increase at renewal.

Source: What Cloud Vendors Really Want From Their Customers

Link: How the subscription paradigm flips the cloud financials market

In the subscription world, you must understand the lifetime relationship with the customer – all the upsells and renewals and how they all build on one another. You also must understand the revenue, billings, and cash derived from those – again, over the entire lifetime of the customer relationship.

In an ASC-606 world, you must track all performance obligations, which is a fancy term for your promises – both the ones explicitly written in your customer agreement and all those pesky side terms that your sales rep slipped into the free-text field on the quote. You must also know all the implicit promises that people make in the deal or that have become ingrained in your business processes.

Source: How the subscription paradigm flips the cloud financials market

Link: Adobe Summit 2019 analysis – CX hype is countered by Chegg’s digital turnaround story

Chegg isn’t digital-only today. They still ship five million textbooks a year – but their mission has changed. And, as Narayan pointed out, they now need new metrics. Rosensweig said those metrics include subscriber growth, revenue growth, engagement, renewal, and conversion rates. But there’s an underlying metric: create something awesome.

What we learned was you not only have to get them to subscribe, you [also have to] lower your cost to customer acquisition, which at one point for us was 27 dollars. Now it’s $3.50. Our renewal rates were 63 percent; they’re now in the mid-80s for a monthly renewal.

Source: Adobe Summit 2019 analysis – CX hype is countered by Chegg’s digital turnaround story

Link: Defense Innovation Board proposes new metrics for assessing DOD software development

“The metrics can be broken down into four broad categories — deployment rate metrics, response rate metrics, code quality metrics, and program management, assessment, and estimation metrics. The DIB also provides general timeframes for what a ‘good’ score looks like for each metric.”
Original source: Defense Innovation Board proposes new metrics for assessing DOD software development

Tracking your improvement – “metrics”

This post is an early draft of a chapter in my book, Monolithic Transformation.

Tracking the health of your overall innovation machine can be both overly simplified and overly complex. What you want to measure is how well you’re doing at software development and delivery as it relates to improving your organization’s goals. You’ll use these metrics to track how your organization is doing at any given time and, when things go wrong, get a sense of what needs to be fixed. As ever with management, you can look at this as part of putting a small batch process in place: coming up with theories for how to solve your problems and verifying whether those theories worked in practice.

All that monitoring

In IT, most of the metrics you encounter are not actually business oriented; instead, they tell you about the health of your various IT systems and processes: how many nodes are left in a cluster, how much network traffic customers are bringing in, how many open bugs development has, or how many open tickets the help desk is dealing with on average.

Example of Pivotal Cloud Foundry’s Healthwatch metrics dashboard.

All of these metrics can be valuable, just as all of them can be worthless in any given context. Most of these technical metrics, coupled with ample logs, are needed to diagnose problems as they come and go. In recent years, there’ve been many advances in end-to-end tracing thanks to tools like Zipkin and Spring Sleuth. Log management is well into its newest wave of improvements, and monitoring and IT management analytics are just ascending another cycle of innovation — they call it “observability” now, that way you know it’s different this time!

Instead of looking at all of these technical metrics, I want to look at a few common metrics that come up over and over in organizations that are improving their software capabilities.

Six common cloud native metrics

Some metrics consistently come up when measuring cloud native organizations:

Lead Time

Lead time is how long it takes to go from an idea to running code in production; it measures how long your small batch loop takes. It includes everything in between: specifying the idea, writing and testing the code, passing any governance and compliance needs, planning for deployment and management, and then getting it up and running in production.

If your lead time is consistent enough, you have a grip on IT’s capability to help the business by creating and deploying new software and features. Being this machine for innovation through software is, as you’ll hopefully recall, the whole point of all this cloud native, agile, DevOps, and digital transformation stuff.

As such, you want to monitor your lead time closely. Ideally, it should be a week. Some organizations go longer, up to two weeks, and some are even shorter, like daily. Target and then track an interval that makes sense for you. If you see your lead time growing, you should stop everything, find the bottlenecks, and fix them. If the bottlenecks can’t be fixed, then you probably need to do less each release.
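
To make this concrete, here’s a minimal sketch of how you might compute lead time from a handful of hypothetical story records. The IDs, timestamps, and seven-day target are all made up for the example; in practice you’d pull the dates from your tracker and deploy logs.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: when an idea/story was accepted and when the
# resulting change was running in production (ISO 8601 timestamps).
stories = [
    {"id": "story-101", "accepted": "2019-03-01T09:00", "in_production": "2019-03-06T15:30"},
    {"id": "story-102", "accepted": "2019-03-04T10:00", "in_production": "2019-03-12T11:00"},
    {"id": "story-103", "accepted": "2019-03-05T13:00", "in_production": "2019-03-08T16:45"},
]

def lead_time_days(story):
    """Days from idea accepted to running code in production."""
    accepted = datetime.fromisoformat(story["accepted"])
    deployed = datetime.fromisoformat(story["in_production"])
    return (deployed - accepted).total_seconds() / 86400

lead_times = [lead_time_days(s) for s in stories]
print(f"median lead time: {median(lead_times):.1f} days")

# Flag the trend you actually care about: lead time creeping past your target.
TARGET_DAYS = 7
if median(lead_times) > TARGET_DAYS:
    print("lead time is growing -- find the bottlenecks, or do less each release")
```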

Velocity

Velocity shows how many features are typically deployed each week. Whether you call features “stories,” “story points,” “requirements,” or whatever else, you want to measure how many of them the team can complete each week; I’ll use the term “story.” Velocity tells you three things:

  1. Your progress toward improving, and your ongoing performance — at first, you want to find out what your team’s velocity is. The team will need to “calibrate” on what it’s capable of doing each week. Once you establish this baseline, if velocity goes down, something is going wrong and you can investigate.
  2. How much the team can deliver each week — once you know how many features your team can deliver each week, you can more reliably plan your roadmaps. If a team can only deliver, for example, 3 stories each week, asking them to deliver 20 stories in a month is absurd. They’re simply not capable of doing that. Ideally, this means your estimates are no longer, well, always wrong.
  3. Whether the scope of stories is getting too big or too small — if a previously reliably performing team’s velocity starts to drop, it means the stories are being scoped incorrectly: the team is taking on too much work, or someone is forcing it to. On the other hand, if the team is suddenly delivering more stories or finds itself with lots of extra time each week, it should take on more stories.

There are numerous ways to first calibrate on the number of stories a team can deliver each week, and managing that process at first is very important. As they calibrate, your teams will, no doubt, get it wrong for many releases, which is to be expected (and one of the motivations for picking small projects at first instead of big, important ones). Other reports, like burn down charts, can help illustrate how the team’s velocity is tracking toward delivering major releases (or each release) and help you monitor any deviation from what’s normal.
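
As a rough illustration, here’s one way you might track velocity against a calibrated baseline. The completion log and the baseline of three stories a week are invented for the sketch; your tracker would supply the real data.

```python
from collections import Counter

# Hypothetical completion log: (story id, ISO week the story was finished).
completed = [
    ("story-101", "2019-W10"), ("story-102", "2019-W10"), ("story-103", "2019-W10"),
    ("story-104", "2019-W11"), ("story-105", "2019-W11"),
    ("story-106", "2019-W12"),
]

velocity_by_week = Counter(week for _, week in completed)
baseline = 3  # what the team calibrated to after its first few weeks

for week, velocity in sorted(velocity_by_week.items()):
    note = ""
    if velocity < baseline:
        note = "  <- below baseline: stories scoped too big, or something is wrong"
    elif velocity > baseline:
        note = "  <- above baseline: the team can probably take on more"
    print(f"{week}: {velocity} stories{note}")
```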

Latency

In general, you want your software to be as responsive as possible. That is, you want it to be fast. We usually think of speed here as how fast the software runs and how quickly it can respond to requests. Latency is a slightly different way of thinking about speed: how long does a request take to process end-to-end and return to the user?

Latency is different from the raw “speed” of the network. For example, a fast network will send a static file very quickly, but if the request requires connecting to a database to create and then retrieve a custom view of last week’s Austrian sales, it will take a while and, thus, the latency will be much longer than downloading an already-made file.

From a user’s perspective, latency is important because an application that takes 3 minutes to respond versus 3 milliseconds might as well be “unavailable.” As such, latency is often the best way to measure if your software is working.

Measuring latency can be tricky…or really simple. Because it spans the entire transaction, you often need to rely on patching together a full view — or “trace” — of any given user transaction. This can be done by looking at logs, doing real or synthetic user-centric tracing, and using any number of application performance monitoring (APM) tools. Ideally, the platform you’re using will automatically monitor all user requests and also catalog the sub-processes and sub-sub-processes that make up each request. That way, you can start to figure out why things are so slow.
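
On the “really simple” end, before reaching for an APM tool, you can get a quick read on end-to-end latency with something like the sketch below. The URL is a placeholder and the percentile math is deliberately rough; point it at one of your own endpoints.

```python
import statistics
import time
import urllib.request

# Placeholder endpoint -- swap in one of your own services, e.g. the
# "last week's Austrian sales" view discussed above.
URL = "https://example.com/"

def timed_request(url):
    """Return end-to-end latency, in milliseconds, for one request."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()  # latency includes receiving the full body
    return (time.perf_counter() - start) * 1000

samples = sorted(timed_request(URL) for _ in range(20))
rough_p95 = samples[int(0.95 * (len(samples) - 1))]  # crude 95th percentile
print(f"p50: {statistics.median(samples):.0f} ms")
print(f"~p95: {rough_p95:.0f} ms")
```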

Error Rates

Often, your systems and software will tell you when there’s an error: an exception is thrown in the application layer because the email service is missing, an authentication service is unreachable so the user can’t log in, a disk is failing to write data. Tracking and monitoring these errors is, obviously, a good idea. They range from the “is smoke coming out of the box?” variety to more obscure ones, like a service being unreachable because DNS is misconfigured. Oftentimes, errors are roll-ups of other problems: when a web server fails and returns a 500 response code, it means something went wrong, but the error doesn’t usually tell you what happened.

Error rates also occur before production, while the software is being developed and tested. You can look at failed tests as error rates, as well as broken builds and failed compliance audits.

Fixing errors in development can be easier and more straightforward, whereas triaging and sorting through errors in production is an art. What’s important to track with errors is not just that one happened, but the rate at which they happen, perhaps errors per second. You’ll have to figure out an acceptable level of errors because there will be many of them. What you do about all these errors will be driven by your service targets. These targets may be foisted on you in the form of heritage Service Level Agreements, or you might have been lucky enough to negotiate some sane targets.

Chances are, a certain rate of errors will be acceptable (have you ever noticed that sometimes you just need to reload a web page?). Each part of your stack will throw off different errors: some are meaningless (perhaps they should really be warnings or even just informative notices, e.g., “you’re using an older framework that might be deprecated sometime in the next 30 years”), and others could be too costly, or even impossible, to fix (“1% of users’ audio uploads fail because their upload latency and bandwidth are too low”). And some errors may be important above all else: if an email server is just losing emails every 5 minutes…something is terribly wrong.

Generally, errors are collected from logs, but you could also poll the service in question, or it might send alerts to your monitoring systems, be that an IT management system or just your phone.
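
Here’s a small, hypothetical sketch of turning log lines into an errors-per-second rate and checking it against an acceptable level. The log format, window length, and threshold are assumptions for the example, not anything your stack necessarily produces.

```python
from collections import Counter

# Hypothetical web server log lines: "<timestamp> <status> <path>"
log_lines = [
    "2019-04-02T10:00:01 200 /cart",
    "2019-04-02T10:00:02 500 /cart",
    "2019-04-02T10:00:02 200 /login",
    "2019-04-02T10:00:03 503 /cart",
    "2019-04-02T10:00:04 200 /cart",
]

WINDOW_SECONDS = 4                 # span covered by the lines above
ACCEPTABLE_ERRORS_PER_SEC = 0.25   # whatever your service targets allow

status_counts = Counter(line.split()[1] for line in log_lines)
errors = sum(count for status, count in status_counts.items() if status.startswith("5"))
error_rate = errors / WINDOW_SECONDS

print(f"{errors} server errors over {WINDOW_SECONDS}s = {error_rate:.2f} errors/sec")
if error_rate > ACCEPTABLE_ERRORS_PER_SEC:
    print("error rate above the acceptable level -- time to triage")
```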

Mean-time-to-repair (MTTR)

If you can accept the reality that things will go wrong with software, how quickly you can fix those problems becomes a key metric. It’s bad when an error happens, but it’s really bad if it takes you a long time to fix it.

Tracking mean-time-to-repair is an ongoing measurement of how quickly you can recover from errors. As with most metrics, this gives you a target to improve towards and then allows you to make sure you’re not getting worse.

If you’re following cloud native practices and using a good platform, you can usually shrink your MTTR with the ability to roll back changes. If a release turns out to be bad (an error), you can back it out quickly, removing the problem. This doesn’t mean you should blithely roll out bad releases, of course.

Measuring MTTR might require tracking support tickets or otherwise manually recording the time between incident detection and fix. As you automate remediations, you might be able to capture those times more easily. As with most of these metrics, what becomes important in the long term is tracking changes to your acceptable MTTR and figuring out why the negative changes are happening.
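
The arithmetic itself is trivial once you have detection and fix times. A sketch, assuming you can export those two timestamps per incident from your ticketing system (the incidents below are invented):

```python
from datetime import datetime

# Hypothetical incident records: when the problem was detected and when
# the fix was confirmed in production.
incidents = [
    {"detected": "2019-05-01T08:15", "repaired": "2019-05-01T09:05"},
    {"detected": "2019-05-07T22:40", "repaired": "2019-05-08T00:10"},
    {"detected": "2019-05-15T14:00", "repaired": "2019-05-15T14:25"},
]

def minutes_to_repair(incident):
    """Minutes from incident detection to confirmed repair."""
    detected = datetime.fromisoformat(incident["detected"])
    repaired = datetime.fromisoformat(incident["repaired"])
    return (repaired - detected).total_seconds() / 60

repair_times = [minutes_to_repair(i) for i in incidents]
mttr = sum(repair_times) / len(repair_times)
print(f"MTTR over {len(incidents)} incidents: {mttr:.0f} minutes")
```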

Costs

Everyone wants to measure cost, and there are many costs to measure. In addition to the time spent developing software and the money spent on infrastructure, there are ratios you’ll want to track, like the number of applications per platform operator. Typically, these kinds of ratios give you a quick sense of how efficiently IT runs. If each application takes one operator, something is probably missing from your platform and process. T-Mobile, for example, manages 11,000 containers in production with just 8 platform operators.

There are also less direct costs, like opportunity and value lost due to waiting on slow release cycles. For example, the US Air Force calculated that it saved $391M by modernizing its software methodology. The point is that you obviously need to track the cost of what you’re doing, but you also need to track the cost of doing nothing, which might be much higher.

Business Value

“Comcast Cloud Foundry Journey — Part 2,” Greg Otto, Comcast, June 2017.

Of course, none of the metrics so far has measured the most valuable, but most difficult, metric: value delivered. How do you measure your software’s contribution to your organization’s goals? Measuring how the process and tools you use contribute to those goals is usually harder. This is the dicey plain of correlation versus causation.

Somehow, you need to come up with a scheme that shows and tracks how all this cloud native stuff you’re spending time and money on is helping the business grow. You want to measure value delivered over time to:

  1. Prove that you’re valuable and should keep living and get more funding,
  2. Figure out when you’re failing to deliver so that you can fix it

There are a few prototypes for linking cloud native activities to business value delivered. Let’s look at some examples:

  1. As described in the case study above, when the IRS replaced call centers that had poor availability with software, IT delivered clear business value. Latency and error rates decreased dramatically (with phone banks, only 37% of calls made it through), and the design improvements they discovered led to increased usage of the software, pulling people away from the phones. And, then, the results are clear: by the fall of 2017, this application had collected $440m in back taxes.
  2. Sometimes, delivering “value” means satisfying operational metrics rather than contributing dollars. This isn’t the best of all situations to be in, but if you’re told, for example, that in the next two years 60% of applications need to be “on the cloud,” then you know the business value you’re supposed to deliver on. In such cases, simply tracking the replatforming of applications to a cloud platform will probably suffice.
  3. Running existing businesses more efficiently is a popular goal, especially for large organizations. In this case, the value you deliver with cloud native will usually be speeding up business processes, removing wasted time and effort, and increasing quality. Duke Energy’s lineworker case is a good example here. Duke gave lineworkers a better, highly tuned application that queues and coordinates their work in the field. The software increased lineworkers’ productivity and reduced waste, directly creating business value in efficiencies.
  4. The US Air Force’s tanker scheduling case study is another good example here: by adopting a cloud native software model, they were able to ship the first version in 120 days and started saving $100,000s in fuel costs each week. Additionally, the USAF computed the cost of delay — using the old methods that took longer — at $391M, a handy financial metric to consider.
  5. And, then, of course, there comes raw competition. This most easily manifests itself as time-to-market, either to match competitors or to get new features out before them. Liberty Mutual’s ability to enter the Australian motorcycle market, from scratch, in six months is a good example. Others, like Comcast, demonstrate competing with major disruptors like Netflix.

It’s easy to get very nuanced and detailed when you’re mapping IT to business value. You need to keep things as simple as possible or, put another way, only as complex as needed. As with the examples above, clearly link your cloud native efforts to straightforward business goals. Simply “delivering on our commitment to innovation” isn’t going to cut it. If you’re suffering under vague strategic goals, make them more concrete before you start using them to measure yourself. On the other end, just lowering costs might be a bad goal to shoot for. I talk with many organizations that used outsourcing to deliver on the strategic goal of lowering costs and now find themselves incapable of creating software at the pace their business needs to compete.

Fleshing out metrics

I’ve provided a simplistic start at metrics above. Each layer of your organization will want to add more detail to get better telemetry on itself. Creating a comprehensive, umbrella metrics system is impossible, but there are many good templates to start with.

Pivotal has been developing a cloud native centric template of metrics, divided into 5 categories:

BuiltToAdapt Benchmark.

These metrics cover platform operations, product, and business metrics. Not all organizations will want to use all of the metrics, and there are usually some that are missing. But this 5 S’s template is a good place to start.

If you prefer to go down rabbit holes rather than shelter under umbrellas, there are more specialized metric frameworks to start with. Platform operators should probably start by learning how the Google SRE team measures and manages Google, while developers could start by looking at TK( need some good resource ).

Whatever the case, make sure the metrics you choose are

  1. targeting the end goal of putting a small batch process in place to create better software,
  2. reporting on your ongoing improvement towards that goal, and,
  3. alerting you that you’re slipping and need to fix something…or find a new job.

This post is an early draft of a chapter in my book, Monolithic Transformation.

Link: Preliminary Analysis of the Site Reliability Engineer Survey

If the response takes too long to get to your phone, the system might as well be “unavailable”:

‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.’
Original source: Preliminary Analysis of the Site Reliability Engineer Survey

Link: How to Monitor the SRE Golden Signals

[Summary from the post of metrics to use:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
Original source: How to Monitor the SRE Golden Signals
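
To show how those signals hang together, here’s a rough sketch that derives all five from a single minute’s worth of made-up request records. The queue depth, worker capacity, and in-flight count are stand-ins for whatever your service actually exposes.

```python
# Hypothetical one-minute window of request records from a single service:
# (latency in ms, HTTP status). Queue depth and concurrency are sampled separately.
requests = [(120, 200), (95, 200), (310, 500), (88, 200), (145, 200), (2050, 503)]
queue_depth = 0          # sampled from the service; non-zero means you're saturated
worker_capacity = 4      # concurrent requests the service can handle
in_flight = 3            # concurrent requests observed right now

window_seconds = 60
rate = len(requests) / window_seconds                       # requests/sec
errors_per_sec = sum(1 for _, s in requests if s >= 500) / window_seconds
latencies = sorted(latency for latency, _ in requests)
p50_latency = latencies[len(latencies) // 2]                # rough median, ms
saturation = queue_depth                                    # queue depth as the saturation signal
utilization = in_flight / worker_capacity * 100             # % busy, measured directly

print(f"rate={rate:.2f} req/s, errors={errors_per_sec:.2f}/s, "
      f"p50 latency={p50_latency} ms, saturation={saturation}, utilization={utilization:.0f}%")
```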

Link: Q&A on the Book Agile Management

“Most of the available maturity models measure the degree to which the agile techniques and tools are deployed. I prefer to look at it from a different angle. First, define what your most important performance indicators are with respect to agility. For instance, time-to-market, employee satisfaction, customer satisfaction, and so on. Then benchmark these, if possible. And also follow their development over time, to determine whether they are improving or not.”
Original source: Q&A on the Book Agile Management

Link: How to build a business case for DevOps transformation

“Here are a few signs that your company should consider transitioning to DevOps:

Does it take a long time to deliver features?
Are features underutilized?
Do you not know the utilization of features?
Do you have downtime during maintenance or deployment windows?
Do your customers tell you your site is down before you know it?
Do outages occur repeatedly for the same reason?
Are customer feature requests implemented in a way that doesn’t actually fulfill the customer’s needs?”

Original source: How to build a business case for DevOps transformation

Good, simple explanation of Service Level Objectives (SLOs)

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Source: Building good SLOs – CRE life lessons
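
As a back-of-the-envelope sketch of what defending an SLO looks like in numbers — the target and request counts below are invented — you can check availability against the objective and see how much error budget is left:

```python
# Hypothetical monthly numbers for one user-facing service.
SLO_TARGET = 0.995            # 99.5% of requests should succeed this month
total_requests = 1_200_000
failed_requests = 4_800

availability = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests   # failures the SLO tolerates
budget_remaining = error_budget - failed_requests

print(f"availability: {availability:.4%} (target {SLO_TARGET:.1%})")
if availability >= SLO_TARGET:
    print(f"SLO met; {budget_remaining:,.0f} failures of error budget left")
else:
    print("SLO missed -- users are being made unhappy; defend the objective")
```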

Core DevOps (tech) metrics, from Nicole Forsgren

Everyone always wants to know metrics. While the answer is always a solid “it depends – I mean, what are your business goals and then we can come up with some KPIs,” there’s a recurring set of technical metrics. Nicole lists some off:

These IT performance metrics capture speed and stability of software delivery: lead time for changes (from code commit to code deploy), deployment frequency, mean time to restore (MTTR), and change fail rate. It’s important to capture all of these because they are in tension with each other (speaking to both speed and stability, reflecting priorities of both the dev and ops sides of the team), and they reflect overall goals of the team. These metrics as a whole have also been shown to drive organizational performance.

And, then, further summarized by Daniel Bryant:

Key metrics for IT performance capture speed and stability of software delivery, and include: lead time for changes (from code commit to code deploy), deployment frequency, mean time to restore (MTTR), and change fail rate.
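
A quick sketch of what computing those four metrics might look like from deployment records; the records, 30-day window, and restore times here are made up, and in practice they’d come from your CI/CD pipeline and incident tracker.

```python
from datetime import datetime

# Hypothetical deployment records over a 30-day window.
deploys = [
    {"committed": "2019-06-01T09:00", "deployed": "2019-06-02T10:00", "failed": False},
    {"committed": "2019-06-05T11:00", "deployed": "2019-06-05T16:00", "failed": True},
    {"committed": "2019-06-12T14:00", "deployed": "2019-06-13T09:00", "failed": False},
    {"committed": "2019-06-20T10:00", "deployed": "2019-06-20T13:00", "failed": False},
]
restore_minutes = [45]   # time to restore service for the one failed change
WINDOW_DAYS = 30

def hours_between(start, end):
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

lead_time = sum(hours_between(d["committed"], d["deployed"]) for d in deploys) / len(deploys)
deploy_frequency = len(deploys) / WINDOW_DAYS                    # deploys per day
mttr = sum(restore_minutes) / len(restore_minutes)               # minutes to restore
change_fail_rate = sum(d["failed"] for d in deploys) / len(deploys)

print(f"lead time: {lead_time:.1f} h, deploy frequency: {deploy_frequency:.2f}/day, "
      f"MTTR: {mttr:.0f} min, change fail rate: {change_fail_rate:.0%}")
```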

Also in the interview, a concise DevOps definition:

I define DevOps as a technology transformation that drives value to organizations through an ability to deliver code with both speed and stability.

See the rest.

Remember “the attention economy”?

We pay more attention to time spent reading than number of visitors at Medium because, in a world of infinite content — where there are a million shiny attention-grabbing objects a touch away and notifications coming in constantly — it’s meaningful when someone is actually spending time. After all, for a currency to be valuable, it has to be scarce. And while the amount of attention people are willing to give to media and the Internet in general has skyrocketed — largely due to having a screen and connection with them everywhere — it eventually is finite.

That seems like a useful metric.

See also: http://en.m.wikipedia.org/wiki/Attention_economy
