Best practice to prevent stuff happening and for when stuff actually happens!
In the IT world, outages and service disruption are a fact of life. Stuff hits the fan... Stuff happens! And it can happen to any service provider - even the most well designed and managed SaaS applications and platforms.
One of the reasons why stuff happens is failing to adhere to best practices. To minimize the potential for problems, here we run over some of the key points from the cloud platform management best practice playbook.
Best practice #1
Outages can and do just happen. However, we must do everything we can to prevent contributing to an outage and causing a self-inflicted loss of service. Planned maintenance is a key area where there is a potential for this type of problem.
Updates and other housekeeping tasks are routinely carried out by service providers. However, routine has a habit of breeding complacency. So the first point of best practice is to make sure:
> Technical procedures need to be thoroughly checked and properly thought through before execution.
Best practice #2
No matter how thoroughly you shakedown maintenance or other procedures and processes, there is always the potential for data loss. It’s not just a maintenance issue. Data loss is simply another fact of life and we must steadfastly defend against it. Consequently, the point of best practice here is to ensure that you:
> Regularly test and check DR and BC procedures to make sure you can achieve SLA targets.
Best practice #3
When stuff actually does happen, it is really important to communicate properly. This means being honest from the outset and relating as fully as possible what you know in an appropriate language for the audience.
This includes such information as:
- Explaining what has happened
- Indicate when service is (likely) going to be restored
- Issue updates regularly, even if there’s nothing new to report
The point of best practice here is to always:
> Follow best practice for incident management communication.
Be better at managing incidents with best practice
Recently, some customers of one leading incident management and status page company experienced an outage. A break down of the incident can be found on this Twitter thread by Gergely Orosz, a software engineer who has attracted over 71,000 Twitter followers in the software development community through his newsletter.
The StatusCast white paper ‘Best Practices For Managing Application Downtime’ lets you see how to do a great job of managing service outages. The paper shows you how to build processes and checklists for communicating with end users, customers, and partners where systems become unavailable.
The blog ‘Who’s Better at Downtime Communication: IT or Marketing?’ discusses the problem from the perspective of communication skills and how to deliver alerts so that they are on-brand and provide the alert information in a way that is appropriate to the audience.