Learning from KaiNexus Service Disruptions

Mar 31, 2021 8:37:00 AM

Here at KaiNexus, we take the continuous improvement practices that we preach to heart. We believe that every misstep and failure is an opportunity for improvement, and we tackle those problems head-on. We've had a couple of app outages this year that exposed a hole in our communication plan, which a cross-functional team then addressed. In this post, I want to share with you what happened, how we fixed it, and what the results were the next time we had an outage.

What Happened

In January, excessive API utilization slowed the KaiNexus application and caused sporadic unresponsiveness starting around 1:30. Normal function was restored by 5:00, but not before we had a collective heart attack as we realized we didn't have a process for communicating outages and updates internally. Even worse, we didn't have a process for communicating them efficiently to our customers, many of whom reached out, very concerned, during the outage.

We knew that we hadn't handled that well, and we identified communication as an opportunity for improvement. But we weren't in any rush: in the ten-ish years the app has been in existence, this had literally never happened before, so what were the odds that it would happen again?

Joke's on us, though, because a couple of weeks later we had another outage, caused by a separate and unrelated issue. This time, the app was down for about an hour.

Our Response to the January and February Incidents

The technical issues that occurred were flukes that we don't expect to recur, but we saw very clearly the need to improve the way we communicate outages to our team and to our customers. As a result, we assembled a cross-functional team with the goal of creating standard work to be called upon in the event of an emergency. Over the course of several work sessions, we identified:

The correct path for escalating potential outages
How to get immediate notice out to our team to trigger the emergency standard work plan
A flowchart detailing what each team needs to do as soon as that plan is activated
A process for communicating outages immediately to our customers, updating them when service is restored, and following up with a root cause analysis

Surprise! Google Cloud Outage a Few Weeks Later

We weren't expecting to get to try out our new emergency communication plan anytime soon, so imagine our surprise when a few weeks later, Google Cloud Platform (our hosting provider) experienced a multi-region service disruption that took our app offline yet again.

Within 5 minutes of confirming the outage, we'd identified the source (Google Cloud Platform), notified our team, and activated our emergency response plan. We knew exactly who needed to do what, in what order, and in less than 10 minutes we had an email to customers ready to go. I know an outage is nothing to be excited about, but to be honest, it was exciting to test out the plan we'd just finalized and see it work so smoothly.

We're sincerely sorry if these outages caused pain for you and your organization. Thank you for sticking with us on our mission to help organizations improve and grow better, and thank you for your patience as we improve our own processes too.