As some of you may have noticed, SendHub was unavailable for 2 hours yesterday morning. This was on top of 4 hours of downtime last week. Embarrassing? Yes. Fixable? Hell yes. Downtime sucks, especially when the causes are outside of our immediate control. Not cool.
Yesterday morning, Heroku ran into some problems related to how they route requests to our application (which impacted many more sites than just us). I’ve already said being down isn’t cool, but what *is* cool is how our team handled this incident.
During the Outage
Although we had not practiced it yet as a team, we followed a procedure we like to call the ‘Emergency Response Mode’ (ERM) - a plan we created after our little faux-pas last week.
What’s the Impetus for an ERM?
After we discovered database connection problems on Thursday evening last week, we all joined a chat room to appraise the situation. It was great to have everyone eager to help after hours, but we dropped the ball in a number of ways. No one was communicating status updates to our users. Every engineer began diagnosing a different system and “flipping bits” in an attempt to resolve the issue, which meant we spent time on things that did not lead to a resolution. Reflecting on what happened, it was easy to see that we could have been a lot more effective by approaching the situation in a more organized and methodical way. Our ERM fixes that.
We know how frustrating it is when sites and services that we rely on are down, so we wanted to share an inside look at what exactly goes on during that time, and what our ‘Emergency Response Mode’ is (straight from our wiki).
Emergency Response Mode
This describes the team roles and processes in place when we are suffering a widespread service outage. As you might imagine, on small teams people fill multiple roles. This should be used by the Quarterback to make sure that the process goes smoothly.
== ROLES ==
Quarterback: (Primary, Secondary)
In charge of ensuring that the Emergency Response Mode is executed properly and according to the ERM communication agreements.
Social Lead: (Primary, Secondary)
In charge of posting updates to Twitter, Facebook and the blog about the current service outage. No updates on resolution should be provided in support unless previously communicated on social channels.
Support Lead: (Primaries, Secondary)
In charge of responding to users over Chat, Email, SMS & Phone.
Eng Lead: (Primary, Secondary)
In charge of diagnosing the critical production systems and services, and isolating the root cause of the problem. Also responsible for communicating with platform service providers and disseminating status updates to the rest of the team.
Ops Lead: (Primary, Secondary)
In charge of modifying the production environment. This is the only person allowed to make production changes. This ensures that no conflicting changes are made to the production environment.
Responsible for searching logs and gathering monitoring application metrics and keeping track of the status of platform service providers.
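For the engineers reading along, the "only the Ops Lead changes production" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not tooling we actually run: the `EmergencyResponse` class, the `on_duty` mapping, `NotOpsLeadError`, and the people named in the usage lines are all hypothetical, and it assumes exactly one person is on duty per role at a time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class NotOpsLeadError(PermissionError):
    """Raised when someone other than the on-duty Ops Lead tries to change production."""


@dataclass
class EmergencyResponse:
    # Hypothetical mapping of role name -> person currently filling it.
    # On a small team the same person may appear under several roles.
    on_duty: dict[str, str]
    # Every production change that was allowed, in order.
    change_log: list[str] = field(default_factory=list)

    def change_production(self, person: str, description: str) -> None:
        """Record a production change, but only if `person` is the Ops Lead."""
        ops_lead = self.on_duty["Ops Lead"]
        if person != ops_lead:
            raise NotOpsLeadError(
                f"{person} is not the Ops Lead; ask {ops_lead} to make this change"
            )
        self.change_log.append(f"{person}: {description}")


# Usage with made-up names: the Ops Lead's change is logged,
# anyone else's attempt raises NotOpsLeadError.
erm = EmergencyResponse(
    on_duty={"Quarterback": "alice", "Eng Lead": "carol", "Ops Lead": "bob"}
)
erm.change_production("bob", "enable maintenance mode")
```

Funnelling every change through one person is what keeps conflicting changes out of production; the log is a side benefit that comes in handy at the post mortem.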
== Process ==
- QB invokes the Emergency Response Mode by alerting the team and designating separate communication channels for technical (HipChat) and non-technical (Google chat) discussion
NOTE: Our “Bat-Signal” is currently in for repairs, until then chat will have to suffice!
- Monitors begin gathering data for a service outage to inform the technical team
- Engineering Lead opens communication channels to our platform service providers
- Operations Lead puts site into maintenance-mode so the site support channel remains open
- Engineering Lead provides a time of the next external communication update (e.g. 1 Hour) to social and support
- Social Lead posts a blog article explaining the situation
- Support Leads will assemble and respond to incoming inquiries
- The following steps will be repeated until a resolution is found
- Engineering Lead will begin working with Monitors and TechOps Lead to find the source of the problem
- Once the problem source has been identified, the Eng Lead will pass along a plan and resolution times to social and support so that they can be disseminated to our users
- Engineering Lead will work with TechOps Lead to follow the proposed plan to achieve a resolution
- Once a resolution has been achieved the Eng Lead will update social and support channels
- QB will immediately conduct a post mortem to analyze what has happened and determine how the team can improve response in the future
How do you like that? Since we were sitting around with time on our hands yesterday, we also added Olark to our maintenance-mode site so our support team could chat with our users when things went awry. Bonus points: we also made sure to tweet/fb/shout it from the rooftops that we were having service issues, so our users would be aware of this (and we couldn’t apologize enough). The only thing that makes up for downtime is not having downtime.
Feedback? Thoughts on how we can improve our ERM? Text us at (650) 830-5662
Ryan & the SendHub Team