Blog

IT outages are a fact of life – it’s how you handle them

11th May 2022

Best practice to prevent stuff happening and for when stuff actually happens!

In the IT world, outages and service disruption are a fact of life. Stuff hits the fan… Stuff happens! And it can happen to any service provider – even the most well designed and managed SaaS applications and platforms. 

One of the reasons why stuff happens is failing to adhere to best practices. To minimize the potential for problems, here we run over some of the key points from the cloud platform management best practice playbook.

Best practice #1

Outages can and do just happen. However, we must do everything we can to prevent contributing to an outage and causing a self-inflicted loss of service. Planned maintenance is a key area where there is a potential for this type of problem.

Updates and other housekeeping tasks are routinely carried out by service providers. However, routine has a habit of breeding complacency. So the first point of best practice is to make sure:

> Technical procedures need to be thoroughly checked and properly thought through before execution.

Best practice #2

No matter how thoroughly you shakedown maintenance or other procedures and processes, there is always the potential for data loss. It’s not just a maintenance issue. Data loss is simply another fact of life and we must steadfastly defend against it. Consequently, the point of best practice here is to ensure that you:

> Regularly test and check DR and BC procedures to make sure you can achieve SLA targets.

Best practice #3

When stuff actually does happen, it is really important to communicate properly. This means being honest from the outset and relating as fully as possible what you know in an appropriate language for the audience.

This includes such information as:

  • Explaining what has happened
  • Indicate when service is (likely) going to be restored
  • Issue updates regularly, even if there’s nothing new to report

The point of best practice here is to always:

> Follow best practice for incident management communication.

Be better at managing incidents with best practice

Recently, some customers of one leading incident management and status page company experienced an outage. A break down of the incident can be found on this Twitter thread by Gergely Orosz, a software engineer who has attracted over 71,000 Twitter followers in the software development community through his newsletter.

The StatusCast white paper ‘Best Practices For Managing Application Downtime’ lets you see how to do a great job of managing service outages. The paper shows you how to build processes and checklists for communicating with end users, customers, and partners where systems become unavailable.

The blog ‘Who’s Better at Downtime Communication: IT or Marketing?’ discusses the problem from the perspective of communication skills and how to deliver alerts so that they are on-brand and provide the alert information in a way that is appropriate to the audience.

 

 


Latest Posts

10th November 2022

Incident Management and Status Pages for Enterprise IT Departments

The Incident Management and Status Page solution that lets you organize your enterprise IT team and...


Read More
3rd October 2022

What Metrics Should Be Tracked Within Incident Management?

As digital services have become increasingly important to businesses and organizations, reducing...


Read More
22nd September 2022

What are the new stages of incident management?

Good communication is at the core of any incident management process, empowering stakeholders with...


Read More
See it for yourself

Test Drive it For Yourself with a Free Trial

StatusCast Dashboard
StatusCast Dashboard