Tag: SRE

  • Using AI to help with SRE, ops, etc.: The problem, he said, is that Claude “will get wrong correlation versus causation.” It’s like a new joiner on the team: they will think “oh, it’s a capacity problem, when actually you lost your cache.” “This is why we can’t trust LLMs for incident response,” said Palcuie.…

  • It’s real, enterprises just need to do the CISO work and SRE work

    After using Claude more and more for tasks in my personal life, my current zinger analyst take on the Squawk Box would be: “OpenAI talks about business strategy, Anthropic just does it.” It’s really getting close to a sci-fi personal assistant. It takes A LOT of work to get your rig (or “harness”) set up, and…

  • A great platform as a product paper, and a fun platform philosophy thereof

    I like this platform as a product paper a lot. You should check it out if you’re into DevOps, SRE, platform engineering, whatever. It’s also available in O’Reilly if you have that subscription and don’t want to lead-in yourself. Here are some fun parts: Adopting a product mindset starts with continually evaluating the business context to…

  • Automating bullshit – OpenAI ChatGPT removes office worker toil

    Current status: there’s about two weeks left before the work year is over. I’m lucky that I get a lot of time off at the end of the year. VMware gives a whole week. The idea of the world being shut down and, thus, feeling guilt-free about doing nothing is more appealing than it’s ever been.…

  • 🗂 Kubernetes, what it’s for

    > The reason that Kubernetes is successful is because people look at it and they don’t understand why they need it until they see it do stuff. Then they say “Oh my God, I need that!” I can’t say how many talks and presentations I’ve done in front of skeptical audiences where they don’t understand what…

  • Link: SRE: The Biggest Lie Since Kanban

    That’s why SRE is a Big Lie – because it enables people to say they’re doing a thing that could help their organization succeed, and their dev and ops engineers to have a better career and life while doing so – but not really do it. Yes, there have been Big Lies before, which is…

  • Link: Preliminary Analysis of the Site Reliability Engineer Survey

    If the response takes too long to get to your phone, the system might as well be “unavailable”: ‘If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We…

  • Link: Monitoring SRE’s Golden Signals

    Lists out how to get the metrics from various systems and software. Original source: Monitoring SRE’s Golden Signals

  • Link: How to Monitor the SRE Golden Signals

    [Summary from the post of metrics to use:]
    Rate: Request rate, in requests/sec
    Errors: Error rate, in errors/sec
    Latency: Response time, including queue/wait time, in milliseconds.
    Saturation: How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated,…

  • Link: Serverless Impacts on Business, Process and Culture

    ‘Sharples said the main interest stems from an enterprise love of microservices, where incremental delivery, agility and faster delivery are being embraced. “But we see adopters struggle with the operational complexity of managing and monitoring distributed systems, and that is where serverless has gotten their attention. You get the microservices benefits, but from a developer…

  • Link: Splunk acquires VictorOps to take it – and you – into site reliability engineering

    “Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.” Original source: Splunk acquires VictorOps to take it – and…

  • Link: Full Cycle Developers at Netflix

    How Netflix thinks about standardized platforms and tools, plus their adaptation of DevOps and SRE. “Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will…

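As a rough illustration of the Rate, Errors, Latency, and Saturation summary quoted in the “How to Monitor the SRE Golden Signals” entry above, here is a minimal Python sketch that computes the four signals for one observation window. It is not taken from the linked post: the `Request` record, the `golden_signals` function, and the choice of a nearest-rank 95th percentile for latency are all assumptions made for illustration, and the queue depth is simply passed in as whatever your system reports.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical record of one handled request (not from the linked post)."""
    duration_ms: float  # response time, including queue/wait time
    ok: bool            # False if the request ended in an error


def golden_signals(requests: List[Request], window_sec: float, queue_depth: int) -> dict:
    """Summarize one observation window as the four golden signals.

    Assumption: latency is reported as a nearest-rank 95th percentile;
    the quoted summary only says "response time ... in milliseconds".
    """
    if window_sec <= 0:
        raise ValueError("window_sec must be positive")

    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)

    p95: Optional[float] = None
    if durations:
        rank = max(1, math.ceil(0.95 * total))  # nearest-rank percentile
        p95 = durations[rank - 1]

    return {
        "rate_per_sec": total / window_sec,      # Rate: requests/sec
        "errors_per_sec": failed / window_sec,   # Errors: errors/sec
        "latency_p95_ms": p95,                   # Latency: milliseconds
        "saturation_queue_depth": queue_depth,   # Saturation: non-zero when queuing
    }


if __name__ == "__main__":
    sample = [Request(100, True), Request(200, True), Request(300, False), Request(400, True)]
    print(golden_signals(sample, window_sec=2.0, queue_depth=3))
```

Passing the queue depth in directly keeps the sketch independent of any particular monitoring system; in practice each signal would come from whatever metrics source you already have.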