Major Incident Handling

Modified on Fri, 5 Mar, 2021 at 11:19 AM

What is a Major Incident?

A major incident is a high-impact, urgent issue that usually affects the whole School, Trust or a major part of it. A major incident almost always results in services that affect teaching and learning becoming unavailable. There are two ways a major incident can affect teaching and learning:

Not Accessible – Preventing staff or students from accessing services such as the Internet, Microsoft 365 and any of its components, the local network or wireless infrastructure.
Disruption – A reduction in quality of service that dramatically affects response times, leading to lesson disruption, loss of work and frustration.

4 Stages of a Major Incident

Identification

When a major incident is identified it is important to agree some key responsibilities and the communication protocol that will be used:

Who does the investigation? – Internal and/or Third-Party Resource
Who needs to be communicated to? – Which Key Stakeholders should be informed of the incident. This also includes sending an update to stakeholders whose work may be delayed due to the incident.
Who will do the communicating? – Decide who will update the service status page and be responsible for sending out regular communications. Ideally not the person carrying out the technical work.
How often will the communication take place? – Hourly updates during operational hours are expected.
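The hourly cadence above can be sketched as a small helper that works out when the next update is due. This is only an illustration: the operational hours of 08:00 to 17:00, the name next_update_due and the decision to ignore weekends and holidays are all assumptions, not part of the agreed protocol.

```python
from datetime import datetime, time, timedelta

# Assumed operational hours; the article does not define them.
OPEN = time(8, 0)
CLOSE = time(17, 0)
UPDATE_INTERVAL = timedelta(hours=1)


def next_update_due(last_update: datetime) -> datetime:
    """Return when the next stakeholder update is due.

    Updates are hourly inside operational hours. If the next hourly
    slot falls outside them, the update moves to the start of the next
    operational period. Weekends and holidays are ignored in this
    sketch, and datetimes are treated as naive local time.
    """
    candidate = last_update + UPDATE_INTERVAL
    if candidate.time() < OPEN:
        # Before opening: wait until operations start the same day.
        return datetime.combine(candidate.date(), OPEN)
    if candidate.time() > CLOSE:
        # After closing: wait until operations start the next day.
        return datetime.combine(candidate.date() + timedelta(days=1), OPEN)
    return candidate
```

For example, an update sent at 09:40 would be followed by one at 10:40, while an update sent at 16:30 would roll over to 08:00 the next day under these assumed hours.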

There are five main groups that need to be informed of major incidents:

Technical team: Trust IT Team
Third Party: (optional) For services provided or maintained by external third parties under SLA.
Management: Head of ICT, Trust COO, affected Head Teachers and their deputies should all be informed.
Key stakeholders: Those department heads affected also need to be informed of major incidents and receive regular status updates.
Staff: Staff need to know which services may be unavailable due to a major incident.

The Incident Team work together to find a fix for the major incident and bring operations back to normal.

Communications

As well as updating the Service Status, a problem ticket should be created to discover and understand the root cause of the major incident. Addressing the root cause can help prevent similar major incidents in the future. Follow-up work may require further Change Requests.

It is recommended that key staff use the Subscribe facility on the Service Status page to ensure that they receive timely updates.

Below is an example email covering the key information regarding an incident. This information can also be used within the New Incident on the Service Status page.

We are sorry to report that we are currently aware of a problem affecting internet connectivity and are identifying the cause of the issue.

Who does this affect?

This problem affects all staff and students.

What does this affect?

All online resources such as Microsoft 365, Email, YouTube, and some telephony services.

What are my alternatives?

At present there are no workarounds.

Where is the incident?

This outage is limited to the Arrow Vale site only.

When did it start?

The incident started at 09:40 on 04/03/2021.

When is it likely to be resolved?

We are currently investigating the source of the problem but do not have a resolution time at present.

When will I receive my next update?

Another update will be issued within 60 minutes.

For further updates and support you can also visit:

IT Help Guides Portal | IT Service Status Page | IT Team Schedule & Availability
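The example email above follows a fixed set of questions, so it can be generated from a handful of fields. The following is a minimal sketch of that idea; the IncidentUpdate class, its field names and the render_update function are hypothetical and are not part of the helpdesk or Service Status system.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """Fields mirroring the questions in the example email (names are illustrative)."""
    summary: str
    who: str
    what: str
    alternatives: str
    where: str
    started: str
    resolution: str
    next_update: str


def render_update(update: IncidentUpdate) -> str:
    """Build the plain-text body in the same question order as the example email."""
    sections = [
        ("Who does this affect?", update.who),
        ("What does this affect?", update.what),
        ("What are my alternatives?", update.alternatives),
        ("Where is the incident?", update.where),
        ("When did it start?", update.started),
        ("When is it likely to be resolved?", update.resolution),
        ("When will I receive my next update?", update.next_update),
    ]
    lines = [update.summary, ""]
    for question, answer in sections:
        # Each question and answer is separated by a blank line.
        lines += [question, "", answer, ""]
    return "\n".join(lines).rstrip() + "\n"
```

Keeping the questions in one place means every update carries the same headings, and the same text can be pasted into a New Incident on the Service Status page.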

Investigation

During the investigation process it is useful to follow the Plan, Do, Check, Act improvement cycle, which can help create a consistent way of working for all members of the Incident Team. A summary of this is given in the to-do list at the end of this article.

Communication throughout the wider team is important to establish whether any changes (documented or undocumented) may have led to the incident, and whether any existing tickets may be a result of it. Teams is an ideal platform to use in this instance.

Resolution

It is good practice to implement the fix for the major incident as a change using the helpdesk system to ensure that the resolution is properly documented and implemented. Implementing the resolution as a change minimises the risk of a poorly planned resolution disrupting other services.

Monitoring

Once a resolution has been applied it is important to continue to monitor the systems and ask for feedback from key staff who were affected. Monitoring should take place at a similar time of day and under a similar load to when the incident occurred, to ensure a like-for-like comparison.

If the resolution was to "switch it off and on again" then further investigation and support is required to isolate the root cause and implement a more robust solution. This can mean that, although an incident is closed, it has generated further work to be carried out to prevent it from recurring.

SUMMARY TO-DO LIST

IDENTIFICATION
- CREATE - Team, Ticket, Status, Communications. (Include Ticket # within Status Page for easy referral).
INVESTIGATION
- PLAN - Systems to investigate, who should investigate.
- DO - Carry out investigation, contact third party support where available.
- CHECK - Monitoring, Change Log, Event Logs, Third Party Status.
- ACT - Agree and deliver resolution.
RESOLUTION
- ASSESS - Assess RISK and IMPACT of resolution.
- COMMUNICATE - Communicate the impact of the resolution (if necessary) & agree outage(s).
- DOCUMENT - Changes and work carried out should be documented within the ticket.
MONITORING
- REVIEW - Return to INVESTIGATION if RESOLUTION fails.
- DOCUMENT - Solution and changes made within ticket.
- IMPROVE - Outline requirements to prevent future incidents and improve overall solution.
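The flow through the four stages in the to-do list, including the return to INVESTIGATION when a resolution fails, can be sketched as a simple state transition. This is an illustrative model only: the Stage enum, the extra CLOSED state and the next_stage function are assumptions made for the example and do not describe any feature of the helpdesk system.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFICATION = "Identification"
    INVESTIGATION = "Investigation"
    RESOLUTION = "Resolution"
    MONITORING = "Monitoring"
    CLOSED = "Closed"  # assumed end state, not one of the four stages


def next_stage(current: Stage, resolution_held: bool = True) -> Stage:
    """Return the stage that follows `current`.

    If monitoring shows the resolution did not hold, the incident goes
    back to INVESTIGATION, as in the REVIEW step of the to-do list.
    """
    if current is Stage.MONITORING and not resolution_held:
        return Stage.INVESTIGATION
    if current is Stage.CLOSED:
        return Stage.CLOSED
    order = list(Stage)  # Enum members iterate in definition order
    return order[order.index(current) + 1]
```

A ticket that reaches MONITORING with resolution_held set to False loops back to INVESTIGATION, so it is not treated as closed until the resolution has been confirmed under monitoring.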