Cloud-native at Comcast, working with Pivotal – Highlights

I’m doing a podcast with Comcast in a few weeks, so I’ve been going over all their public talks on their cloud-native efforts. They’ve been working with Pivotal since around 2014 and are one of the more impressive customer cases, with over 1,000 applications now on Pivotal Cloud Foundry.
Here are some highlights from the talks I’ve been watching. As always, things I put in square brackets are my own comments, the rest are quotes or summaries of what people said:

August, 2016 – Empowering Devops with Cloud Foundry – Sergey Matochkin, Neville George; Comcast

  • Sergey Matochkin.
  • Slides.
  • (17:00) Every deployment to production took at least 6 weeks, but most commonly around 2 months end-to-end. Which also means you need to plan capacity much in advance.
  • We started to use virtualization and containerization “well, well before Docker existed… it was some success, we had some improvements, but those improvements were marginal.”
  • Traditionally, it’d take at least 4-6 months to set up your dev/test infrastructure. But, luckily, virtualization came along.
  • (9:20) Business drivers… Comcast phone service, set-top boxes get DVRs, VoD, etc. All of these require apps on the backend, so the portfolio of apps starts to grow, and with the way things were before it meant they had to build a new datacenter every six months. Virtualization helped here, of course.
  • Also, virtualization allowed us to put a service layer [think “platform”] on-top of the infrastructure.
  • It’d take 4-6 weeks for a testing environment, but now it takes 10-15 minutes in a self-service portal.
  • Demo of using Pivotal Cloud Foundry for much of the automation needed to deploy and scale an application.
  • (~32:00) We used to have things like “order servers” and “make load-balancer changes” and somewhere in the bottom of the backlog was “write some code and do some testing.” [That is, they were focusing on items with low business value, below “the value line,” rather than customer features.]
  • “What Cloud Foundry essentially helped us with was to get all those unnecessary user stories out of our backlog so we can focus on the writing code, on testing, and deploying rather than managing infrastructure.”
  • (33:45) momentum/proof-points:
  • 9 PCF instances; 900+ developers; 2,000+ active apps, most of which are in “the critical path of our customer experience”; 4,100 application instances; 2,000 requests per second.
  • Lots of Slack/ChatOps usage for monitoring and such.

August 3rd, 2016 – Transforming the monolith at 20M tph – Nick Beenham, Comcast

  • Slides.
  • Existing state:
    • 250m transactions per day.
    • Would take 3 months to get a server useful, from moment of purchasing to using.
    • “Over a 100 services run by development teams.”
    • In functional, silo roles.
  • (3:45) “We knew we had that large, rigid infrastructure. [Pivotal] Cloud Foundry and its adoption really enables us to change that to gain the agility, to gain the elasticity at scale.”
  • Taking away roles to reduce finger-pointing and all the negative stuff, and moving to a unified team, of course.
  • (7:35) Anecdote of Nick going from “ops guy” to writing code and liking coding.
  • (12:18) The ESP router is a small router written in Go that translates SOAP requests as part of a strangler pattern. They had a decades-old SOA layer that they wanted to modernize, but they couldn’t strip it out because that would take too long. So they were going to duck-type as SOA, but do REST and microservices underneath. Strangler pattern, etc. This is what the ESP router does: it marshals and unmarshals between microservices and the SOAP stuff. But new things need to be done in the new style.
  • Also, “de-mingling data,” moving off Oracle RAC/GoldenGate for multi-site. Some simpler CRUD services to front the data.
  • (~15:00) Used to take a week+ to deploy the entire stack, but with Pivotal Cloud Foundry it takes minutes. It gives us a great deal of velocity that we’ve never had before. “Sometimes we’ll deploy multiple times an hour.”
  • (17:00) They went from 1,000s of lines of bash to deploy out to various WebLogic clusters; that has, for the most part, moved to Cloud Foundry.
  • Improving production updates: bringing new node up and shutting old node down slowly; canary updates, with a CI test suite, then switching over to a production install.
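The strangler approach behind the ESP router described above can be sketched roughly as follows. This is a minimal illustration, not Comcast’s implementation: the real router was written in Go, while this sketch uses Python, and every name in it (StranglerRouter, migrate, the GetAccount operation) is hypothetical. The new-style handlers are plain functions taking and returning dictionaries, standing in for REST/JSON calls to microservices. Callers keep speaking SOAP; operations that have been migrated are unmarshalled, sent to the new handler, and marshalled back, and everything else passes through to the legacy SOA layer untouched.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def _local(tag):
    """Strip an XML namespace like '{uri}Name' down to 'Name'."""
    return tag.split("}")[-1]


def parse_soap(envelope):
    """Unmarshal a SOAP envelope into (operation_name, params_dict)."""
    root = ET.fromstring(envelope)
    body = root.find(f"{{{SOAP_NS}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP envelope has no Body element or no operation")
    operation = body[0]
    params = {_local(child.tag): (child.text or "") for child in operation}
    return _local(operation.tag), params


def build_soap(operation, result):
    """Marshal a result dict back into a SOAP response envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    response = ET.SubElement(body, operation + "Response")
    for key, value in result.items():
        ET.SubElement(response, key).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


class StranglerRouter:
    """Hypothetical facade: looks like the old SOA layer to callers."""

    def __init__(self, legacy_handler):
        self.legacy_handler = legacy_handler  # takes and returns raw SOAP text
        self.migrated = {}                    # operation name -> new handler

    def migrate(self, operation, handler):
        """Register a new-style handler (dict in, dict out) for one operation."""
        self.migrated[operation] = handler

    def handle(self, envelope):
        operation, params = parse_soap(envelope)
        handler = self.migrated.get(operation)
        if handler is None:
            # Not migrated yet: pass the request through unchanged.
            return self.legacy_handler(envelope)
        return build_soap(operation, handler(params))


if __name__ == "__main__":
    request = (
        f'<soapenv:Envelope xmlns:soapenv="{SOAP_NS}"><soapenv:Body>'
        "<GetAccount><accountId>42</accountId></GetAccount>"
        "</soapenv:Body></soapenv:Envelope>"
    )
    router = StranglerRouter(legacy_handler=lambda env: "<legacy-response/>")
    router.migrate("GetAccount", lambda p: {"accountId": p["accountId"], "status": "active"})
    print(router.handle(request))
```

The point of the design is that migration happens one operation at a time: registering a handler moves that operation off the old layer without any change on the caller’s side.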

August 1st, 2016 – James Taylor – The Power of Partnership & Building a Cloud Native Tier-1 Platform

  • @jctbmwi8
  • “Sparrow, Service Activation Platform.”
  • “Helping someone put a smile on their face is one of the greatest gifts we can give each other.”
  • Their VP provides the feedback loop of things to focus on. Right now: reducing technical debt, reducing incidents, increasing velocity, experimentation.
  • (~6:30) “You can’t move forward – innovate – if you don’t have time to try new things.”
  • (~18:35) “If you’re spending time configuring a Docker container, that’s time you’re not spending coding or solving a problem.”
  • (13:51): “At the end of the day, [business] value is what puts money in everyone’s pocket. If our company, Comcast, can’t create something of value, no one’s gonna pay for us…if we can’t create value. So it’s important for us to understand ‘how can you create value?’”
  • (~22:02, starting epic rant!) “Who is our customer and what value do we bring to our customers…”
  • If you’re spending money on support, that’s cutting into your margins. A call coming in costs $8 right off the bat, then more as it takes longer. So you want to figure out preventing customer support problems… which points to understanding your customers more.
  • [A good overview of thinking about “value” in the context of a specific application, their customer activation center, Sparrow.] “If you have a [support] call rate of 30%, you’re probably cutting out all the value… So we try to figure out, how do we prevent calls?” [Very similar to IRS cloud-native story.]
  • “We’ve been holding technical workshops”: Internal training things every month with Pivotal people, leveraging Pivotal knowledge. With our development teams every month: webinar, or on-site visit.
  • Sparrow: 5 junior Java developers… we built it from scratch in parallel while existing teams maintained the platform… we then had to integrate the processes together… figure out decomposing the monolith platforms, etc….then we had to just cut off stuff when it was too much of a hassle.

August 17th, 2016 – Greg Otto SpringOne Platform keynote

  • Slides.
  • X1 boxes – a new release about once a month.
  • Processing tens of millions of transactions daily on the new platform, Pivotal Cloud Foundry.
  • “About a 75% lift in velocity as well as time to market, and the business is really feeling it.”
  • Developer reactions:
  • (Slide image: what customers are saying.)
  • Momentum Stats:
  • (Slide image: key stats from Otto’s keynote.)
    • 40 apps to 900 apps, 2015 to 2016
    • 300 AIs to 4,100 AIs, 2015 to 2016
  • All with “zero outbound marketing from my team, this all word of mouth from all those happy developers.”

June 9th, 2016 – Greg Otto CF Summit keynote

  • “Late last year in 2015” – live in production [on Pivotal Cloud Foundry] with business critical systems from our back-office systems on our Cloud Foundry environment.
  • We put Pivotal Cloud Foundry directly in the customer critical path.
  • Applications doing 30,000 events a second on Cloud Foundry.
  • Started in 2014, met with Pivotal.
  • Had sort of thrown all the people into the Pivotal Cloud Foundry pool, they had to do a lot of research and such.
  • But, people were really interested in the ease of working with the platform [the productivity improvements].
  • Successful prototype app 30 days after platform.
  • Idea to feature, before/after: “several weeks, at least”/“2-3 days”
  • Time-line and summary:
  • (Slide image: Otto’s time-line and summary.)

June, 2016 – Open source at Comcast story

  • Write-up.
  • “If Comcast has a problem to solve, there are three possible approaches: solve it themselves by making an investment in teams and resources; solve it through a commercial vendor that could build a product for them; or work with the open source community.”
  • OpenStack: “In addition to Linux, Comcast is a heavy user of OpenStack. They use a KVM hypervisor, and then a lot of data center orchestration is done through OpenStack for the coordination of storage and networking resources with compute and memory resources. Muehl said that Comcast has roughly a petabyte of memory and around a million virtual CPU cores that they are running under the OpenStack umbrella. As an operator, Comcast does a lot of things around operations, and they use Ansible to deploy and manage OpenStack at scale.”
  • Cloud Foundry: “They also use Cloud Foundry, but according to Muehl that work is in the very early stages at Comcast.”

May 2015 – Running Cloud Foundry at Comcast talk

  • Neville George, Sam Guerrero, Tim Leong, Sergey Matochkin
  • They wanted to make custom URLs.
  • Used Puppet for stuff.
  • (~8:30) Their requirements for a platform:
  • (Slide image: platform requirements.)
  • A lot of emphasis on self-service and the micro services benefits of operating independently, product management wise.
  • They use OpenStack, Docker, and [Pivotal] Cloud Foundry.
  • Pre-provisioning resources for a pool of containers that are ready to go, etc.
  • (~27) a couple applications in production today… we’ll be ramping up quickly.
  • (Either this video or the 2016 one, a few minutes from the end) Q: what’s the training model? A, Sergey: “I can’t say we have a really good training model…. We do brown-bags to have people aware. We focus on 12 factor application model… on overall microservices model, not just to shape application, but also data. Developers need to understand how they [do] applications for PaaS instead of traditional.”

Omni-channel at Target: 14% of 2016 sales were “digital,” with 68% fulfilled in-store

In 2014, more than 93% of our transactions took place in stores, less than 7% digital. That season we had just started shipping from a small number of stores. In 2015, that same timeframe, digital sales reached almost 10% of our total sales. We more than doubled our ship-from store-capability to nearly 500 stores. We fulfilled 41% of all our digital orders inside of a store.

For 2016, just a few months ago, just last year, digital sales climbed to 14%, more than twice what we did two years earlier. We double ship-from-stores again, more than 1,000 stores. Our stores were fulfilling 68% of our digital orders. We finished December with record digital growth, including record-breaking days on both Thanksgiving and Cyber Monday.

Always nice to see multi-year numbers.

Link

Advice on introducing DevOps from Merrill Corp & SPS Commerce – Highlights

Nicely moderated by Bridget. Some of my notes and highlights:

  • Amy talks about pace of change, sustaining it in the beginning, etc.
    • The amount of time it took us to get going was a surprise – it was longer than expected.
    • If you can start to show results early, it helps build up momentum. “Having enough wins, like that, really helped us to keep the momentum going while we were having a culture change like DevOps.”
    • It takes the right people to keep that energy going, but also to be able to go back to the business to show why we are putting these changes in place.
    • You’re going to be able to see the changes to the business right away.
  • Peg – tools, don’t try to fix the old ones, like ITIL service desk tools. Instead we just had Jenkins open tickets and such, automating the toil of dealing with old tools.
  • Global/offshore tactics, from Amy:
    • What with all the retrospective stuff, you need to be able to get teams together, physically. The collaboration angles are much better in person.
    • Set up each “shore” as an architectural and management island, and make them as independent as possible. They also need their own context, not held up by time zones, so they don’t need to wait 24-48 hours for authorizations and collaboration. [To my mind, this means taking advantage of the organizational de-coupling you can get with microservices.]
  • Starting change, even when the company needs it. Amy: You have to start with the business need, what’s the big driver behind a change like DevOps. [Managers often don’t make sure they figure this out, let alone disseminate it to staff.]

Vanguard’s thinking on microservices

Breaking up the monolith with good, old fashioned, OO-think:

Instead, Vanguard has begun a journey to break apart our monolithic legacy systems piece-by-piece by replacing them with microservices over time. With a microservices architecture, we remove the business logic and data logic from our applications and replace it with a set of re-usable modules of code that are built and deployed as independent entities. We then compliment this architecture by chunking out our user interfaces into modular purpose-built components.

De-coupling for stability and resiliency, among other things:

This service-based approach to application architecture provides a variety of advantages over the jumble of code that defines a non-modular monolithic application. First, services reduce redundancy by making sure there is only one copy of application logic for a given capability – regardless of how many applications leverage that logic. In the long run, this leads to lower development costs and increases speed to market. Second, since these services are deployed independently and built in a resilient manner, outages in one area of an application are less likely to bring down an entire system. In some instances, several of our services can be down without our clients being aware of a loss in functionality thanks to the ability of our applications to automatically react to a service that isn’t available. Finally, services enable our applications to scale easier. The marriage of cloud and services means we can quickly spin up infrastructure to handle surges in the number of transactions we need to handle without needing to scale up an entire application.

Vanguard CIO: Why we’re on a journey to evolve to a microservices architecture
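The “automatically react to a service that isn’t available” idea in the Vanguard quote can be illustrated with a small sketch. This is an assumption-laden example, not Vanguard’s code: the ServiceUnavailable exception, the helper call_with_fallback, and the account-page scenario are all hypothetical. It shows one common way to degrade gracefully, where an optional capability falls back to a default while core capabilities still fail loudly.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a downstream service cannot be reached."""


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Call the primary service; if it is unavailable, use the fallback instead."""
    try:
        return primary()
    except ServiceUnavailable:
        return fallback()


def render_account_page(get_balance, get_recommendations):
    """Build a page from two independent services (both hypothetical)."""
    # Core capability: no fallback, so an outage here still surfaces as an error.
    balance = get_balance()
    # Optional capability: degrade to an empty list if the service is down.
    recommendations = call_with_fallback(get_recommendations, lambda: [])
    return {"balance": balance, "recommendations": recommendations}


if __name__ == "__main__":
    def recommendations_down():
        raise ServiceUnavailable("recommendations service is not responding")

    print(render_account_page(lambda: 100, recommendations_down))
```

In practice this kind of logic usually comes with timeouts and circuit breakers so callers don’t wait on a dead service, but the shape is the same: the client never notices that one piece is missing.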

Australian government’s Cloud Foundry apps in production

Delivery teams are now able to build services faster and easier. In July 2016, DTA had 14 apps in production and 50 apps in development. In October 2016, the numbers increased to 47 apps in production and 225 apps in development.

Australian Government Cuts Release Time with Cloud Foundry, Iterates Faster – Cloud Foundry Live

If compliance is so important, bake it into the platform

Can we take that governance and work with the platform team to codify, to automate that which they were doing on a per application basis – that’s, quite frankly, slowing down the delivery of the software – can we take that governance and can we have them work with the platform team to codify, to actually automate on a per application basis, have them expose that as a service on the platform?

Cornelia Davis on governance and cloud-native, “Who Does What? Mapping Cloud Foundry Activities and Entitlements to IT Roles,” August 2016

In other words: you should not only automate the audit three-ring binders of compliance, but enforce as much as possible in the platform.
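As a rough illustration of what “enforce it in the platform” can look like, here is a minimal sketch in which the same set of policy checks runs against every application before a deploy is accepted. None of this is Cloud Foundry’s API: the AppManifest fields, the three policies, and the deploy function are hypothetical stand-ins for whatever rules a governance team would actually codify.

```python
from dataclasses import dataclass, field


@dataclass
class AppManifest:
    """Hypothetical description of an app, as the platform might see it."""
    name: str
    owner: str = ""
    routes: list = field(default_factory=list)
    bound_services: list = field(default_factory=list)


# Each policy returns a list of human-readable violations (empty means compliant).
def require_owner(app):
    return [] if app.owner else ["app must declare an owning team"]


def require_https_routes(app):
    return [f"route {r} must use https" for r in app.routes if not r.startswith("https://")]


def require_audit_logging(app):
    return [] if "audit-log" in app.bound_services else ["app must bind the audit-log service"]


POLICIES = [require_owner, require_https_routes, require_audit_logging]


def check_compliance(app, policies=POLICIES):
    """Run every policy and collect all violations, so teams see them in one pass."""
    violations = []
    for policy in policies:
        violations.extend(policy(app))
    return violations


def deploy(app):
    """Refuse the deploy when any policy is violated; otherwise accept it."""
    violations = check_compliance(app)
    if violations:
        raise PermissionError(f"{app.name} rejected: " + "; ".join(violations))
    return f"deployed {app.name}"
```

The governance team owns the policy list and the delivery teams get an immediate, automated answer, instead of a per-application review.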

The rest of the talk is good stuff on how to think through re-arranging your organization to be all DevOps-y, with the help of Pivotal Cloud Platform to automate all the infrastructure and middleware stuff.

Pacing cloud-native transformation, and actually doing the work to increase productivity

I like to tell large organizations that compared to the break-neck pace of “the silicon valley mindset,” they can operate at a leisurely pace. That pace is usually fast for these enterprises, but their problem set and risk profile is a lot different than hats on cats. Abby has a nice, short write-up that hits on this topic among others:

By the end of his first year, Safford and his teams had built prototypes and market tests and finished 16 new software projects.

At Home Depot, they were at about 140 to 150 projects after a year or so. However, it’s common in the first year to do a lot of replatforming of “simple,” mostly cloud-native compatible apps in there. You can do these at a pretty fast clip, with the rule of thumb being 10 apps in 10 weeks. This is in addition to new applications, but explains high numbers like those at Home Depot. I suspect the Allstate numbers are mostly net-new apps, though.

Goals:

Safford’s eventual goal is to shift Allstate software development to 70 percent extreme agile programming and 30 percent traditional scrum and waterfall. Where developers used to spend only 20 percent of their time coding software, today up to 90 percent of their days are spent programming. Each of his CompoZed development labs around the world has the same startup look and feel, including scooters parked in the hallways. This is not your grandfather’s insurance company anymore.

What you hear over and over again from organizations going cloud-native is that developers were spending lots of time in meetings, checking email, and otherwise not coding (and, yes, by “coding” I don’t mean just recklessly LOC’ing it up without design, and all that). Management had to spend much effort to get them back to coding.

As I fecklessly tell my seven year old when he’s struggling with homework: the only way to finish this quickly is to actually do the work.

(Also: nice write-up from Abby!)

Source: Don’t Forget People and Process in Your Digital Transformation

How JPMC is making IT more innovative with PaaS, public and private


A good, pretty long overview of JPMorgan Chase’s plans for doing cloud with a PaaS focus. Some highlights.

More than just private-IaaS and DIY-platforms:

Like most large U.S. banks, JPMorgan Chase has had some version of a private cloud for years, with virtualized servers, storage and networks that can be shared in a flexible way throughout the organization.

The bank is upgrading its private cloud to “platform as a service” — in other words, the cloud service will manage the infrastructure (servers, storage, and networks), so that developers don’t have to worry about that stuff.

On the multi-/hybrid-cloud thing:

By the second half of 2017, the bank plans to run proprietary applications on the public cloud. At the same time, it’s building a new, modern internal cloud, code-named Gaia.

While “hybrid-cloud” has been tedious vendor-marketing-drivel over the past ten years, pretty much all of the large organizations I work with at Pivotal have exactly this approach. Public, private, whatever: we want to do it all.

Shifting their emphasis to innovation:

“We aren’t looking to decrease the amount of money the firm is spending on technology. We’re looking to change the mix between run-the-bank costs versus innovation investment,” he said. “We’ve got to continue to be really aggressive in reducing the run-the bank costs and do it in a very thoughtful way to maintain the existing technology base in the most efficient way possible.” …Dollars saved by using lower-cost cloud infrastructure and platforms will be reinvested in technology, he said.

On appreciating the scale of “large organizations” that drive their very real challenges with adopting new ways of running IT:

The bank has 43,000 employees in IT; almost 19,000 are developers.

Good luck having the “we have no process by design” process with that setup.

On security, there’s a nice, almost syllogistic re-framing of “cloud security” here:

For years, banks have worried about using the public cloud out of security concerns and fears of what their regulators will say. Ever since the 2013 Target data breach, in which hackers stole card information from 40 million customers by breaking into the computers of an air conditioning company Target used, regulators have strongly urged banks to carefully vet and monitor all third parties, with a specific focus on security.

“We’re spending a significant amount of time to ensure that any applications we choose to run on a public cloud will have the same level of security and controls as those run internally,” Deasy said.

Most notable corporate security breaches over the years have involved on-premises IT (like the HVAC example above). The point is not to make sure that “cloud is as secure as [all that on-prem IT that’s been the source of most security problems in the past],” but to make sure that all IT has a rigorous approach to security. “Cloud” isn’t the security problem, doing a shitty job at security is the security problem.

Source: Unexpected Champion of Public Clouds: JPMorgan CIO Dana Deasy, Penny Crosman, American Banker

Automation at Goldman, The Computer takes out four people

Today, nearly 45 percent of trading is done electronically, according to Coalition, a U.K. firm that tracks the industry.

Pay:

Average compensation for staff in sales, trading, and research at the 12 largest global investment banks, of which Goldman is one, is $500,000 in salary and bonus, according to Coalition. Seventy-five percent of Wall Street compensation goes to these highly paid “front end” employees, says Amrit Shahani, head of research at Coalition… Investment bankers working on corporate mergers and acquisitions at large banks like Goldman make on average $700,000 a year, according to Coalition, and in a good year they can earn far more.

Automating those $700,000+ meat-sacks:

Goldman Sachs has already begun to automate currency trading, and has found consistently that four traders can be replaced by one computer engineer, Chavez said at the Harvard conference. Some 9,000 people, about one-third of Goldman’s staff, are computer engineers.

Finding the things to automate:

Though those “rainmakers” won’t be replaced entirely, Goldman has already mapped 146 distinct steps taken in any initial public offering of stock, and many are “begging to be automated,” he said.

To be all double-turns-out about the grim automation stuff, in theory, this could mean hiring more programmers and people who support those robots, bringing down those big chunks of cash from “rainmakers” and spreading it down to “lower” grade staff. Or, you know, the bank can just keep that money and trickle it up to execs and share-holders.

Source: As Goldman Embraces Automation, Even the Masters of the Universe Are Threatened

When to go private cloud

As represented with the star in the map above, according to CPI data, at labor efficiency of 1,000 VMs per engineer and 66% utilization, these enterprises are poised to beat public cloud on price regardless of whether they use a commercial orchestration software package, an OpenStack distribution or the OpenStack source.
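To see why labor efficiency and utilization are the two levers in that quote, here is a back-of-the-envelope sketch. It is not the 451 Research Cloud Price Index model, which the excerpt doesn’t spell out; the formula is a simplifying assumption, and the dollar figures in the example are made up purely for illustration. Only the 1,000 VMs per engineer and 66% utilization values come from the quote.

```python
def private_cost_per_used_vm_hour(infra_cost_per_vm_hour,
                                  engineer_cost_per_hour,
                                  vms_per_engineer,
                                  utilization):
    """Assumed model: (infrastructure + labor share) spread over the VMs actually in use.

    utilization is the fraction of provisioned VM capacity doing real work (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if vms_per_engineer <= 0:
        raise ValueError("vms_per_engineer must be positive")
    labor_per_vm_hour = engineer_cost_per_hour / vms_per_engineer
    return (infra_cost_per_vm_hour + labor_per_vm_hour) / utilization


if __name__ == "__main__":
    # Hypothetical dollar inputs; only 1,000 VMs/engineer and 66% come from the quote.
    efficient = private_cost_per_used_vm_hour(0.03, 60.0, 1000, 0.66)
    inefficient = private_cost_per_used_vm_hour(0.03, 60.0, 200, 0.40)
    print(f"efficient shop:   ${efficient:.3f} per used VM-hour")
    print(f"inefficient shop: ${inefficient:.3f} per used VM-hour")
```

Under this assumed model, fewer VMs per engineer raises the labor share of each VM-hour and idle capacity inflates everything, which is why a private cloud only competes on price when both numbers are high.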

And, on IaaS pricing:

But price still does matter: In a 451 Research custom study commissioned by Microsoft earlier this year, the biggest reason to change primary provider was price, cited by 34% of respondents. Consumers don’t necessarily want the cheapest cloud service, but they don’t want to feel ripped off. If there is a cheaper option elsewhere, it appears end users will take it into consideration.

Announcements on price cuts gather attention, and are a great publicity and discussion tool for service providers. We think cloud prices will continue to come down through 2017, and may spread beyond virtual machines into object storage, and perhaps even databases – virtual machines came down 7% globally in 2015, but the cost of our small application only came down 2.4%. The fact that margins are still healthy suggests providers aren’t sacrificing huge amounts of gross margin to give such cuts. If they are, it might be a few nickels and dimes here and there, but it’s more likely that they are reducing costs through better procurement and management. If we are in a cloud price war, we’ve yet to see it really get off the ground.

And, see more commentary on the topic of IaaS pricing.

Source: Cloud gross margins: The price war has yet to really kick off