In January, users began experiencing slowness in the KaiNexus application: excessive API utilization caused sporadic unresponsiveness starting around 1:30. Normal function was restored by 5:00, but not before we had a collective heart attack as we realized we didn't have a process for communicating outages and updates internally. Even worse, we didn't have a process for communicating them efficiently to our customers, many of whom reached out, very concerned, during the outage.
We knew that we hadn't handled that well, and we identified communication as an opportunity for improvement. But we weren't in any rush, because in the ten-ish years the app has been in existence, an outage like this had literally never happened. So what were the odds that it would happen again?
Joke's on us, though, because it happened again in a separate and unrelated issue a couple of weeks later. This time, the app was down for about an hour.
The technical issues that occurred were flukes that we don't expect to recur, but we saw very clearly the need to improve the way we communicate outages to our team and to our customers. As a result, we assembled a cross-functional team with the goal of creating standard work to be called upon in the event of an emergency. Over the course of several work sessions, we identified
We weren't expecting to get to try out our new emergency communication plan anytime soon, so imagine our surprise when a few weeks later, Google Cloud Platform (our hosting provider) experienced a multi-region service disruption that took our app offline yet again.
Within 5 minutes of confirming the outage, we'd identified the source (Google Cloud Platform), notified our team, and activated our emergency response plan. We knew exactly who needed to do what, in what order, and in less than 10 minutes we had an email to customers ready to go. I know an outage is nothing to be excited about, but to be honest, it was exciting to test out the plan we'd just finalized and see it work so smoothly.
We're sincerely sorry if these outages caused pain for you and your organization. Thank you for sticking with us on our mission to help organizations improve and grow better, and thank you for your patience as we improve our own processes too.