Monday, January 21, 2008

Best practice: what to do when a service outage occurs

We have discussed the pros and cons of relying on service providers. Reliability is one issue: when something goes wrong, you want fast action and continuous status reports. 37Signals, provider of Basecamp project management and other services, responded well during a recent two-hour service interruption. The following log was posted on their Web site, keeping users fully informed of the problem and of progress toward its solution.

  • All systems are currently offline as we're experiencing network outage from our provider. We're working on it right now. No data has been lost, all our machines are still working, but they're not accessible from the internet. Sorry for the inconvenience — 10:03am CST (16:03 GMT) on January 18, 2008.
  • We have located the problem to be with the load balancer setup. A new unit is being installed. We should be back shortly. Again, we're terribly sorry for this disruption of service — 10:28am CST (16:28 GMT) on January 18, 2008.
  • The technicians at our service provider are still working on installing the new load balancer. We're breathing down their neck as heavily as we can. And we profusely apologize for this unacceptable interruption of service — 10:56am CST (16:56 GMT) on January 18, 2008.
  • The load balancer has been swapped and is currently being configured. We should be in the home stretch now. Again, we're incredibly sorry for this disruption. This is not how Fridays are supposed to be — 11:18am CST (17:18 GMT) on January 18, 2008.
  • Our service provider is still working on the configuration of the new load balancer. We're on their case every few minutes to get updates. It's hard not to be very disappointed that a simple load balancer replacement can take this long for someone who's supposed to be the best in the business. We're out of new ways to say we're sorry, so we'll just say it again: We're so sorry for this — 11:50am CST (17:50 GMT) on January 18, 2008.
  • The latest from our service provider is that the reason it's taking so long is because the configuration for the load balancer lives on a CF card that has also gone bad. The configuration is currently being rebuilt from scratch — 11:57am CST (17:57 GMT) on January 18, 2008.
  • Now we finally have access to the servers with remote access. We should be able to bring the services back one by one now. Looks like there's finally light at the end of the tunnel — 11:59am CST (17:59 GMT) on January 18, 2008.
  • Basecamp is now back. We're bringing back everything else we can as fast as possible. It might take a few minutes for the DNS access to update, but it's coming back — 12:05pm CST (18:05 GMT) on January 18, 2008.
  • All the products should be coming back online now as soon as the DNS updates. It seems like we're coming out of the woods entirely — 12:13pm CST (18:13 GMT) on January 18, 2008.

Could an outage like this occur if users ran the application themselves? Would in-house staff have responded as well as the 37Signals staff did?