<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Operability on alexos.dev</title>
    <link>https://alexos.dev/tags/operability/</link>
    <description>Recent content in Operability on alexos.dev</description>
    <generator>Hugo</generator>
    <language>en-gb</language>
    <lastBuildDate>Fri, 24 Sep 2021 19:28:00 -0100</lastBuildDate>
    <atom:link href="https://alexos.dev/tags/operability/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Developer-Friendly Runbooks: A Guide</title>
      <link>https://alexos.dev/2021/09/24/developer-friendly-runbooks-a-guide/</link>
      <pubDate>Fri, 24 Sep 2021 19:28:00 -0100</pubDate>
      <guid>https://alexos.dev/2021/09/24/developer-friendly-runbooks-a-guide/</guid>
      <description>&lt;figure&gt;&lt;img src=&#34;https://alexos.dev/images/runbook-doggo.jpg?width=600px&amp;amp;classes=shadow&#34;&gt;&lt;figcaption&gt;&#xA;      &lt;p&gt;&#xA;          &lt;a href=&#34;https://unsplash.com/photos/gySMaocSdqs&#34;&gt;Photo by Cookie the Pom on Unsplash&lt;/a&gt;&lt;/p&gt;&#xA;    &lt;/figcaption&gt;&#xA;&lt;/figure&gt;&#xA;&#xA;&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt; things go wrong - and yes, they will go wrong - it&amp;rsquo;s extremely helpful to have easy access to a set of runbooks to guide the unfortunate engineer through the steps needed to mitigate the problem as swiftly as possible. In this post I&amp;rsquo;m going to describe the approach we use for this where I work, which we&amp;rsquo;ve found to work very well.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Team Nimbus and the Agents of Chaos</title>
      <link>https://alexos.dev/2020/03/29/team-nimbus-and-the-agents-of-chaos/</link>
      <pubDate>Sun, 29 Mar 2020 18:00:00 -0100</pubDate>
      <guid>https://alexos.dev/2020/03/29/team-nimbus-and-the-agents-of-chaos/</guid>
      <description>&lt;p&gt;This blog post is about my team&amp;rsquo;s first ever Chaos Day - where we ran a series of experiments designed to test how our platform performed when we tried to disrupt it, or the workloads that run on it.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;This entry was originally &lt;a href=&#34;https://medium.com/john-lewis-software-engineering/team-nimbus-and-the-agents-of-chaos-ab257e41fe36&#34;&gt;posted on Medium&lt;/a&gt; under my employer&amp;rsquo;s publication.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;It was January 2020, and we had just gone through another Peak trading period - significant for a large retailer. The Digital Platform had performed extremely well. There were no incidents, no last-minute panic scaling, and no fall-backs enabled, even though the number of services and the overall complexity of the platform were significantly higher than this time last year. Not perhaps the backdrop to make a compelling case for running a series of complex operational test scenarios then? Well, we are not the sort to be resting on our laurels &amp;hellip;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
