Major incident management is one of the highest-pressure disciplines in IT operations, yet many service desks treat it as an extension of normal incident handling rather than a distinct process. This guide walks through how to define, declare, and resolve major incidents efficiently — covering roles, communication, escalation, and the post-incident steps that prevent recurrence.
What Counts as a Major Incident
Not every outage is a major incident, and calling everything one exhausts your team and dilutes the label. Most organisations define a major incident as an unplanned disruption that meets one or more of these thresholds:
- A critical business service is completely unavailable
- A large number of users or business units are affected simultaneously
- There is significant financial, reputational, or regulatory exposure
- Normal incident resolution procedures are insufficient to restore service quickly
The exact thresholds should be documented in your incident classification policy and agreed with business stakeholders before an outage happens — not negotiated in the middle of one. Common triggers include full application outages, network failures affecting multiple sites, security breaches that interrupt service, and data unavailability affecting business operations.
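As a rough illustration, the documented criteria can be encoded as a simple check that returns which thresholds were met. This is a minimal sketch, not a prescribed implementation: the `IncidentImpact` fields, the function names, and the figure of 500 users are all hypothetical placeholders for whatever your classification policy actually specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentImpact:
    """Hypothetical summary of an incident's impact at triage time."""
    critical_service_down: bool        # a critical business service is completely unavailable
    users_affected: int                # users affected simultaneously
    significant_exposure: bool         # financial, reputational, or regulatory exposure
    normal_process_insufficient: bool  # routine procedures cannot restore service quickly


# Assumed figure for illustration only; agree the real one with business stakeholders.
LARGE_USER_THRESHOLD = 500


def major_incident_reasons(impact: IncidentImpact) -> list[str]:
    """Return the documented thresholds this incident meets (empty if none)."""
    reasons = []
    if impact.critical_service_down:
        reasons.append("critical service unavailable")
    if impact.users_affected >= LARGE_USER_THRESHOLD:
        reasons.append("large number of users affected")
    if impact.significant_exposure:
        reasons.append("significant exposure")
    if impact.normal_process_insufficient:
        reasons.append("normal procedures insufficient")
    return reasons


def is_major_incident(impact: IncidentImpact) -> bool:
    """An incident is major if it meets one or more thresholds."""
    return bool(major_incident_reasons(impact))
```

Returning the list of reasons rather than a bare yes/no means the declaration can record why the incident qualified, which is useful later in the post-incident review.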
Why a Separate Process Matters
Standard incident management is designed to handle routine disruptions at pace. Major incidents require coordinated effort across multiple teams, real-time executive communication, and a command structure that keeps decisions moving under pressure. Without a separate process, you get confusion over who is in charge, inconsistent stakeholder updates, and slower resolution times.
Roles and Responsibilities in a Major Incident

Clarity of ownership is the single biggest factor in how quickly a major incident gets resolved. Define these roles before you need them.
Major Incident Manager
This person owns the resolution process from declaration to closure. They do not fix the technical problem — they coordinate the people who do, remove blockers, run the bridge call, and ensure communication flows. In smaller organisations this role may sit with a senior service desk lead or IT manager. In larger environments it is often a dedicated role.
Technical Lead
The technical lead is the most qualified engineer for the affected system. They direct the diagnostic and remediation effort on the bridge call, delegate tasks to other technical staff, and report progress to the major incident manager at agreed intervals.
Communications Lead
Someone must own stakeholder updates. This is often underestimated. During a major incident, business leaders, end users, and sometimes customers need timely, accurate information. The communications lead drafts and sends updates, manages the status page if one exists, and fields inbound queries so the technical team can focus on resolution.
Resolver Groups
These are the subject-matter experts pulled in as needed — network engineers, database administrators, application owners, third-party vendors. Each resolver group should have a named contact and an escalation path documented in your runbooks before an incident occurs.
Declaring and Activating the Major Incident Process

The decision to declare a major incident should be fast and low-friction. Delays in declaration mean delays in assembling the right people.
A good activation checklist looks like this:
- Confirm the incident meets your documented severity criteria
- Assign a major incident manager and technical lead immediately
- Open a dedicated bridge call or war room — do not rely on email threads
- Create a major incident ticket that is separate from, but linked to, the originating incident record
- Send the first stakeholder notification within fifteen minutes of declaration, even if it only confirms that investigation is underway
- Identify and invite resolver groups based on the affected service and initial diagnosis
- Set a communication cadence — many teams use updates every thirty minutes during active resolution
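The timing items in the checklist above can be turned into concrete deadlines at the moment of declaration. The sketch below assumes the fifteen-minute first-notification window and thirty-minute cadence mentioned in the checklist, and it assumes the cadence is counted from the first notification rather than from declaration; the function name is hypothetical.

```python
from datetime import datetime, timedelta

FIRST_NOTIFICATION_WINDOW = timedelta(minutes=15)  # from the checklist above
UPDATE_CADENCE = timedelta(minutes=30)             # a common choice, not a rule


def update_schedule(declared_at: datetime, count: int = 4) -> list[datetime]:
    """Return the first-notification deadline followed by cadence-based update times.

    Assumption: the cadence starts from the first notification deadline.
    """
    first = declared_at + FIRST_NOTIFICATION_WINDOW
    return [first + i * UPDATE_CADENCE for i in range(count)]


if __name__ == "__main__":
    for due in update_schedule(datetime(2024, 1, 1, 9, 0)):
        print(due.strftime("%H:%M"))
```

For a declaration at 09:00, this yields update deadlines at 09:15, 09:45, 10:15, and 10:45.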
Keeping the Bridge Call Productive
A major incident bridge call can quickly become chaotic. The major incident manager should open every call with a thirty-second situation summary, assign a scribe to capture actions and findings, and keep the call focused on decisions and blockers rather than open-ended troubleshooting. Side conversations and diagnostic rabbit holes should happen off the main call with findings reported back at intervals.
Communication During a Major Incident

Poor communication during a major incident often causes as much damage as the outage itself. Business leaders who cannot get updates escalate through informal channels, creating noise that distracts the technical team. Customers who see no acknowledgement lose trust faster than they would from the outage alone.
Effective major incident communication follows these principles:
- Send updates on a fixed schedule, not only when there is news to share
- Use plain language that non-technical stakeholders can understand
- State clearly what is affected, what is not affected, and what is being done
- Commit to a realistic time for the next update rather than a resolution time you cannot guarantee
- Use a single authoritative channel — a status page, an ITSM notification, or an internal broadcast — rather than ad-hoc emails from multiple people
Most ITSM platforms allow you to send bulk notifications from the major incident ticket. This keeps the communication trail in one place and reduces the chance of contradictory messages going out.
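One way to keep updates consistent is to build every message from the same small template, so each one states what is affected, what is not, what is being done, and when the next update is due. The following is a minimal sketch; the `StatusUpdate` structure and its field names are hypothetical and not tied to any particular ITSM platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    """Hypothetical container for the fields every stakeholder update should carry."""
    affected: str             # what is affected, in plain language
    not_affected: str         # what is confirmed to be working
    action: str               # what is being done right now
    next_update_at: datetime  # when the next update will be sent


def render_update(update: StatusUpdate) -> str:
    """Render the update as plain text with one fixed line per principle."""
    return "\n".join([
        f"What is affected: {update.affected}",
        f"What is not affected: {update.not_affected}",
        f"What we are doing: {update.action}",
        f"Next update by: {update.next_update_at:%H:%M}",
    ])


if __name__ == "__main__":
    print(render_update(StatusUpdate(
        affected="Order portal login",
        not_affected="Order processing for existing sessions",
        action="Restarting the authentication service",
        next_update_at=datetime(2024, 1, 1, 10, 30),
    )))
```

Note that the template carries a next-update time and deliberately has no field for an estimated resolution time.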
Resolution, Workarounds, and Service Restoration

Resolution in a major incident context often happens in two stages: first a workaround that restores service to an acceptable level, then a permanent fix that addresses the root cause. These should be treated as separate milestones.
When a workaround is available:
- Communicate it clearly to affected users through the same channels used for updates
- Document it in the major incident record so it can be referenced in the post-incident review
- Do not close the major incident until the permanent fix is in place or a problem record has been raised to track the underlying cause
When service is fully restored:
- Confirm restoration with the business stakeholder who reported the impact, not just with internal technical checks
- Send a final stakeholder notification confirming resolution and the expected timeline for a post-incident review
- Close the originating incident records, or link them to the major incident ticket
Raising a Problem Record
Every major incident should result in a problem record unless the cause is already known and fixed. The problem record drives the root cause analysis and tracks any permanent remediation work. This is the link between incident management and problem management, and it is where most organisations lose continuity — the major incident gets closed and the underlying cause is never formally investigated.
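The closure rule described in this section can be expressed as a simple gate: a major incident is closable only once service is restored and either the permanent fix is in place or a problem record exists to track the cause. This is a sketch under those assumptions; the `MajorIncident` fields and the record identifier format are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MajorIncident:
    """Hypothetical minimal view of a major incident record."""
    service_restored: bool
    permanent_fix_in_place: bool
    problem_record_id: Optional[str] = None  # set once a problem record is raised


def can_close(incident: MajorIncident) -> bool:
    """Allow closure only if the underlying cause is fixed or formally tracked."""
    cause_handled = (
        incident.permanent_fix_in_place
        or incident.problem_record_id is not None
    )
    return incident.service_restored and cause_handled
```

A workaround alone therefore never satisfies the gate: `can_close(MajorIncident(True, False))` is `False` until a problem record identifier is attached.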
Post-Incident Review and Continuous Improvement

The post-incident review — sometimes called a post-mortem or after-action review — is where major incident management delivers its long-term value. It should happen within three to five business days of resolution while details are still fresh.
A structured post-incident review covers:
- A factual timeline of the incident from first detection to resolution
- What worked well in the response
- What slowed the response down
- Root cause findings from the linked problem record
- Specific action items with owners and due dates — not vague recommendations
The output should be shared with relevant stakeholders and tracked to completion. An action item that sits in a document nobody reads is not improvement — it is documentation theatre.
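Tracking to completion is easier when action items are held as structured data rather than prose. The sketch below assumes each item has an owner and a due date, as recommended above, and lists the ones that are open and past due; the `ActionItem` structure is hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical post-incident review action item."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]
```

A list like this can be reviewed at each service review until it is empty.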
Metrics to Track for Major Incidents
Tracking major incident performance over time helps you identify systemic weaknesses in your process. Useful metrics include:
- Time to declare (from first alert to major incident declaration)
- Time to assemble (from declaration to full resolver group on the bridge)
- Mean time to restore service for major incidents
- Number of major incidents per quarter by service or category
- Percentage of major incidents that result in a completed post-incident review
- Repeat major incidents linked to the same root cause
These metrics belong in your regular service review alongside standard incident and SLA data.
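The time-based metrics above reduce to differences between four timestamps per incident. The following sketch shows one way to compute them; the field names are hypothetical, and measuring restore time from first alert (rather than from declaration) is an assumption you should align with your own SLA definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class MajorIncidentTimes:
    """Hypothetical key timestamps captured on a major incident record."""
    first_alert: datetime
    declared: datetime
    assembled: datetime  # full resolver group on the bridge
    restored: datetime


def time_to_declare(t: MajorIncidentTimes) -> timedelta:
    return t.declared - t.first_alert


def time_to_assemble(t: MajorIncidentTimes) -> timedelta:
    return t.assembled - t.declared


def mean_time_to_restore(records: list[MajorIncidentTimes]) -> timedelta:
    """Average duration from first alert to restoration (assumed definition)."""
    seconds = mean((r.restored - r.first_alert).total_seconds() for r in records)
    return timedelta(seconds=seconds)
```

For example, two incidents restored 60 and 120 minutes after first alert give a mean time to restore of 90 minutes.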
Key Takeaways
- Define your major incident criteria in writing before an outage forces the decision under pressure
- Assign clear roles — major incident manager, technical lead, communications lead — and make sure everyone knows them
- Declare fast, open a bridge call, and send the first stakeholder update within fifteen minutes
- Communicate on a fixed schedule using plain language through a single authoritative channel
- Treat the workaround and the permanent fix as separate milestones, and raise a problem record for every major incident whose cause is not already known and fixed
- Run a structured post-incident review within three to five business days and track action items to completion
TIKTING supports major incident management with dedicated severity classification, linked problem records, bulk stakeholder notifications, and SLA tracking across incident lifecycles. Odysseus asset discovery feeds accurate configuration data into TIKTING so your resolver groups can see affected infrastructure immediately rather than hunting for it during a live outage — reducing time to diagnose and restoring service faster.