When your software goes down, there are two audiences that need to know about it.
One: the people who are going to get frustrated and blame you for the inconvenience.
Two: the people who can fix the problem.
The first audience doesn’t need to know the details of the problem – they just need to know that you’re on top of fixing it, and how long they can expect to wait before full functionality is restored (insofar as you can make a realistic estimate about that).
The key with this audience is timeliness, convenience and professionalism – keep your cool, keep them informed, and make things as easy for them as possible.
The second audience is your DevOps team, the people who can fix the problem. They need to know the details and they need as few distractions as possible as they work to resolve the issue quickly. Hopefully, by the time you’re aware of an issue, they’re already working on it.
More than anything else, what you can do to help speed things along is protect them from the distraction of providing status updates to customers and to the rest of the company (that first audience). Your DevOps team needs to restore application uptime as quickly as possible, to minimize the financial damage the outage or performance issue causes (read more about the cost of downtime here).
Application Performance Management (APM) Tools and Uptime
Typically, your DevOps team is going to be working with one or more application performance management (APM) tools that, as their name suggests, monitor your application’s performance. These tools notify the team immediately when there’s an issue, so that you aren’t waiting for customers to notice a problem before DevOps begins investigating. Some of these tools can even be predictive, offering insight into troubling trends – giving your DevOps team the opportunity to fix the cause of a problem before it brings your application down.
Last month, New Relic shared a few quick stats that can be useful for your DevOps team to track for this purpose.
Using a Hosted Status Page to Manage Application Uptime
A hosted status page is an easy way to serve both audiences. For the frustrated end-user, it provides the most timely information about the status of your application. For your DevOps team, it takes the burden of communication off their hands, so they can focus on fixing the issues disrupting your application uptime as quickly as possible.
By integrating your status page with your APM tool, you give your customers the same immediacy that your DevOps team gets when there’s an issue. If your particular APM tool has been known to send “false alarms”, you can set your hosted status page to delay alerts, or to hold them until someone manually approves them. This way, you have full control over when your end-users are notified, but the message is ready and waiting as soon as your APM tool detects a problem.
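To make the delay and manual-approval options concrete, here is a minimal sketch of how such a rule might be expressed. It is not any particular status page product's API: `PendingAlert`, `PublishPolicy` and `should_publish` are names invented for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingAlert:
    """A customer-facing message drafted when the APM tool reports a problem."""
    message: str
    detected_at: datetime
    approved: bool = False


@dataclass
class PublishPolicy:
    """Hypothetical settings: hold alerts for a while and/or require sign-off."""
    delay: timedelta = timedelta(0)
    require_approval: bool = False


def should_publish(alert: PendingAlert, policy: PublishPolicy, now: datetime) -> bool:
    """Return True once the alert may be shown to end-users."""
    if policy.require_approval and not alert.approved:
        return False
    # The delay gives a false alarm time to clear before customers see it.
    return now - alert.detected_at >= policy.delay


detected = datetime(2024, 1, 1, 12, 0)
alert = PendingAlert("We are investigating slow page loads.", detected)
policy = PublishPolicy(delay=timedelta(minutes=5), require_approval=True)

# Ten minutes in, but nobody has approved the alert yet.
print(should_publish(alert, policy, detected + timedelta(minutes=10)))

alert.approved = True
# Approved, and the five-minute delay has passed.
print(should_publish(alert, policy, detected + timedelta(minutes=10)))
```

Setting `delay` to zero and `require_approval` to `False` corresponds to full automation: the message goes out the moment the problem is detected.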
Recall the three most important characteristics of communication with this audience:
- timely
- professional
- convenient
You want to make sure you don’t wait too long before alerting your end-users to a disruption in application uptime. You can fully automate this if you’re comfortable with that.
You need to make sure the message is intelligible to your audience, and demonstrates that you are aware that there’s an issue and that you are working to resolve it as quickly as possible. A hosted status page, even one integrated with an APM, should allow you to create a “translated message” that is informed by the technical issues detected by the APM but can be understood by an audience that doesn’t need or want the full technical details of why they’re having trouble using your application.
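One simple way to picture a “translated message” is a lookup from the kind of problem the APM tool reports to wording your customers can understand. The sketch below assumes a made-up event format; the alert types, field names and message text are all illustrative and do not come from any specific APM tool.

```python
# Hypothetical mapping from APM alert types to plain-language wording.
CUSTOMER_MESSAGES = {
    "high_error_rate": "Some requests are failing. We are aware of the issue and working on a fix.",
    "slow_response": "The application is responding more slowly than usual. We are investigating.",
    "service_down": "The application is currently unavailable. Our team is working to restore it.",
}

# Fallback for alert types that have no prepared wording.
DEFAULT_MESSAGE = "We are investigating an issue with the application and will post updates here."


def translate_alert(apm_event: dict) -> str:
    """Turn a technical APM event into a message for end-users.

    Technical fields such as host names and metric values are deliberately
    left out of the result.
    """
    return CUSTOMER_MESSAGES.get(apm_event.get("type"), DEFAULT_MESSAGE)


event = {"type": "high_error_rate", "host": "web-03", "error_rate": 0.27}
print(translate_alert(event))
```

The DevOps team still sees the full event; only the customer-facing wording is simplified.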
Finally, you want to make the flow of information to your end-users convenient for them. Again, your hosted status page should take care of this. It can give end-users several options for how they’d like to receive alerts and updates (e.g. Twitter, email, SMS/text, or a Slack or HipChat message).
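Letting each end-user choose a channel amounts to routing one update through whichever sender that subscriber picked. The sketch below only records what would be sent; the channel names, the subscriber format and the `notify` function are assumptions made for illustration, and a real status page would call the email, SMS or chat service instead.

```python
from typing import Callable, Dict, List

# Stand-in for real delivery: every "sent" update is recorded here.
outbox: List[str] = []


def make_sender(channel: str) -> Callable[[str, str], None]:
    """Build a fake sender that logs the update instead of delivering it."""
    def send(address: str, message: str) -> None:
        outbox.append(f"{channel} -> {address}: {message}")
    return send


# Hypothetical set of supported channels.
SENDERS: Dict[str, Callable[[str, str], None]] = {
    name: make_sender(name) for name in ("email", "sms", "slack")
}


def notify(subscribers: List[dict], message: str) -> int:
    """Send the update over each subscriber's chosen channel.

    Returns the number of subscribers skipped because their channel
    is not supported.
    """
    skipped = 0
    for sub in subscribers:
        sender = SENDERS.get(sub["channel"])
        if sender is None:
            skipped += 1
            continue
        sender(sub["address"], message)
    return skipped


subscribers = [
    {"channel": "email", "address": "ana@example.com"},
    {"channel": "sms", "address": "+15550100"},
    {"channel": "fax", "address": "555-0101"},  # unsupported, will be skipped
]
skipped = notify(subscribers, "We are investigating slow page loads.")
print(len(outbox), "sent,", skipped, "skipped")
```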