When your software goes down, there are two audiences that need to know about it.
One: the people who are going to get frustrated and blame you for the inconvenience.
Two: the people who can fix the problem.
The first audience doesn’t need to know the details of the problem – they just need to know that you’re on top of fixing it, and how long they can expect to wait before full functionality is restored (insofar as you can make a realistic estimate about that).
The key with this audience is timeliness, convenience and professionalism – keep your cool, keep them informed, and make things as easy for them as possible.
The second audience is your DevOps team, the people who can fix the problem. They need to know the details and they need as few distractions as possible as they work to resolve the issue quickly. Hopefully, by the time you’re aware of an issue, they’re already working on it.
More than anything else, what you can do to help speed things along is protect them from the distraction of providing status updates to customers and to the rest of the company (that first audience). It’s important that your DevOps team restore application uptime as quickly as possible, to minimize the financial damage that the outage or performance issue causes.
Last month, New Relic shared a few quick stats that can be useful for your DevOps team to track for this purpose.
A hosted status page is an easy way to serve both audiences. For the frustrated end-user, it provides the most timely information about the status of your application. For your DevOps team, it takes the burden of communication off their hands, so they can focus on fixing the issues disrupting your application uptime as quickly as possible.
By integrating your status page with your application performance monitoring (APM) tool, you give your customers the same immediacy that your DevOps team gets when there’s an issue. If your particular APM has been known to send “false alarms”, you can set your hosted status page to delay alerts, or to hold them until someone manually approves them. This way, you have full control over when your end-users are notified that there’s an issue, but the message is ready and waiting as soon as your APM detects a problem.
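The delay-or-approve behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any status page vendor's actual API: the `PendingAlert` class, the `should_publish` function and the five-minute default hold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """An alert received from the APM and held before publication (hypothetical model)."""
    message: str
    detected_at: float                   # time the APM raised the alert, in seconds
    approved: bool = False               # set to True once a person signs off
    resolved_at: Optional[float] = None  # set if the APM clears the alert again


def should_publish(alert: PendingAlert, now: float,
                   delay_seconds: float = 300.0,
                   require_approval: bool = False) -> bool:
    """Decide whether a held alert should go out to end-users."""
    # An alert that cleared within the hold window is treated as a false alarm.
    if alert.resolved_at is not None and \
            alert.resolved_at - alert.detected_at < delay_seconds:
        return False
    # In manual mode, nothing is published without explicit approval.
    if require_approval:
        return alert.approved
    # In delay mode, publish once the hold window has elapsed.
    return now - alert.detected_at >= delay_seconds
```

With these assumptions, an alert detected at time 0 stays hidden at 60 seconds and is published at 300 seconds, while in manual mode it stays hidden until `approved` is set, however much time has passed.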
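Recall the three most important characteristics of communication with this audience: timeliness, professionalism and convenience. The APM integration takes care of timeliness, which leaves professionalism and convenience.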
You need to make sure the message is intelligible to your audience, and demonstrates that you are aware that there’s an issue and that you are working to resolve it as quickly as possible. A hosted status page, even one integrated with an APM, should allow you to create a “translated message” that is informed by the technical issues detected by the APM but can be understood by an audience that doesn’t need or want the full technical details of why they’re having trouble using your application.
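One simple way to picture such a “translated message” is a lookup from the technical alert type to customer-facing wording, with a generic fallback. The sketch below is hypothetical throughout: the alert type names, the message text and the `customer_message` function are invented for illustration and do not come from any particular APM or status page product.

```python
# Hypothetical mapping from APM alert types to plain-language status messages.
CUSTOMER_MESSAGES = {
    "db_connection_errors": (
        "Some requests are failing. We are aware of the issue and are "
        "working to restore full service."
    ),
    "high_response_time": (
        "The application is responding more slowly than usual. "
        "We are working to bring performance back to normal."
    ),
}

# Used when the APM reports an alert type that has no prepared wording.
DEFAULT_MESSAGE = (
    "We are investigating an issue affecting the application and will "
    "post an update shortly."
)


def customer_message(alert_type: str) -> str:
    """Return the customer-facing text for a technical alert type."""
    return CUSTOMER_MESSAGES.get(alert_type, DEFAULT_MESSAGE)
```

Preparing the wording ahead of time means the technical detail stays with the DevOps team, and nobody has to compose a calm, readable message in the middle of an outage.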
Finally, you want to make the flow of information to your end-users convenient for them. Again, your hosted status page should take care of this. It can give end-users several options for how they’d like to receive alerts and updates (e.g., Twitter, email, SMS/text, or a Slack or HipChat message).