This post on the blog of Basecamp (née 37signals) tells a wonderful and tragic and altogether human story of operating a modern web product. This kind of introspection is always good, though there’s a little too much self-blame for my taste. These things happen, and it’s best to learn from them and prevent them from happening again rather than spend time feeling like you messed up. I especially liked this passage:
We also need to raise our sense of urgency for rapid follow up on outage issues. That doesn’t mean we just add them to our list. We need to clear room for post-incident action explicitly. I will clarify the priorities and explicitly push out other work.
Follow-up work after an incident is absolutely crucial. It prevents these same problems from recurring, but more importantly it gives the team the confidence to continue innovating, knowing that there is one less thing to worry about.
Well done, and thanks for sharing your experience, Basecamp team!