IT Major Incident Management: A Practical 2025 Guide

Major incident management is one of the highest-pressure disciplines in IT operations, yet many service desks treat it as an extension of normal incident handling rather than a distinct process. When payroll fails the day before payday, the difference between a forty-minute outage and a four-hour one usually comes down to process, not technical skill. This guide walks through how to define, declare, and resolve major incidents efficiently — covering roles, communication, escalation, and the post-incident steps that prevent recurrence.

What Is Major Incident Management?

Major incident management is the process for handling the highest-impact IT disruptions — outages that threaten critical business services, affect large numbers of users, or carry financial and regulatory consequences. It defines who takes charge, how resolver teams are assembled, how stakeholders are kept informed, and how the organisation learns from each event afterwards.

In the ITIL framework, originally maintained by Axelos and now by PeopleCert, major incident handling sits inside the incident management practice but follows its own accelerated procedure with dedicated roles and communication rules. ITIL 4 expects a separate procedure to exist but leaves the detail to you, and that design work, done calmly before an outage, is what this guide covers.

What Counts as a Major Incident

Not every outage is a major incident, and calling everything one exhausts your team and dilutes the label. Most organisations define a major incident as an unplanned disruption that meets one or more of these thresholds:

A critical business service is completely unavailable — for example, the ERP, the customer-facing website, or the clinical records system
A large number of users or business units are affected simultaneously — many organisations set a numeric trigger, such as more than 100 users or more than two sites
There is significant financial, reputational, or regulatory exposure — a payment platform outage costing an estimated five figures per hour clears the bar even if only a handful of staff use it
Normal incident resolution procedures are insufficient to restore service quickly, or the incident has breached its priority-one SLA target with no resolution in sight

The exact thresholds should be documented in your incident classification policy and agreed with business stakeholders before an outage happens — not negotiated in the middle of one. Common triggers include full application outages, network failures affecting multiple sites, security breaches that interrupt service, and data unavailability affecting business operations. If you already run a solid impact-and-urgency prioritisation model for standard tickets, the major incident threshold is simply the line above your highest priority band.
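
The thresholds above can be written down as a small check so that declaration does not depend on memory. The following Python sketch is illustrative only: the field names are hypothetical, and the 100-user, two-site, and 10,000-per-hour triggers are assumptions drawn from the examples in this section, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Facts known at triage time. All field names are hypothetical."""
    critical_service_down: bool = False
    users_affected: int = 0
    sites_affected: int = 0
    est_loss_per_hour: float = 0.0
    regulatory_exposure: bool = False
    p1_sla_breached: bool = False


# Example thresholds only; agree real values with business stakeholders.
USER_TRIGGER = 100
SITE_TRIGGER = 2
LOSS_TRIGGER = 10_000.0


def major_incident_reasons(facts: IncidentFacts) -> list[str]:
    """Return the documented criteria this incident meets (empty list = not major)."""
    reasons = []
    if facts.critical_service_down:
        reasons.append("critical service unavailable")
    if facts.users_affected > USER_TRIGGER or facts.sites_affected > SITE_TRIGGER:
        reasons.append("large user or site impact")
    if facts.est_loss_per_hour >= LOSS_TRIGGER or facts.regulatory_exposure:
        reasons.append("financial or regulatory exposure")
    if facts.p1_sla_breached:
        reasons.append("P1 SLA breached without resolution")
    return reasons


# A payment platform used by five staff still qualifies on financial exposure.
payment_outage = IncidentFacts(users_affected=5, est_loss_per_hour=25_000)
print(major_incident_reasons(payment_outage))  # ['financial or regulatory exposure']
```

Returning the list of matched criteria, rather than a bare yes or no, gives the declaring person a ready-made justification for the ticket.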

Major Incident or Just a P1?

A common source of confusion is the relationship between priority-one incidents and major incidents. They overlap but are not identical. A P1 is a classification on the priority matrix; a major incident is a declaration that invokes a different way of working. Some P1s are resolved in twenty minutes by a single engineer and never need a bridge call, while an incident logged as P2 can be promoted when its true scope emerges. The practical rule: priority describes the ticket, major incident status changes the process. Keep the two decisions separate.

Why a Separate Process Matters

Standard incident management is designed to handle routine disruptions at pace. Major incidents require coordinated effort across multiple teams, real-time executive communication, and a command structure that keeps decisions moving under pressure. Without a separate process, you get engineers troubleshooting in parallel without coordination, three different versions of the truth circulating among executives, and no record of what was tried and when. The fundamentals still rest on the same foundation described in our guide to incident management best practices — the major incident process builds an emergency command layer on top of them.

Roles and Responsibilities in Major Incident Management

Clarity of ownership is the single biggest factor in how quickly a major incident gets resolved. Define these roles before you need them, publish a rota so cover exists around the clock, and rehearse them so the first live incident is not also the first rehearsal.

Major Incident Manager

This person owns the resolution process from declaration to closure. They do not fix the technical problem — they coordinate the people who do, remove blockers, run the bridge call, and ensure communication flows. They also make the judgement calls engineers should not have to make mid-crisis: whether to invoke a vendor's emergency support contract, fail over to a secondary site, or wake the CIO. In smaller organisations the role may sit with a senior service desk lead; in larger environments it is often dedicated, with a formal on-call rotation. Whoever holds it needs authority to pull people from other work without negotiation.

Technical Lead

The technical lead is the most qualified engineer for the affected system. They direct the diagnostic and remediation effort on the bridge call, delegate tasks to other technical staff, and report progress to the major incident manager at agreed intervals. Critically, the technical lead should not be the person typing commands — the moment they go heads-down in a terminal, technical coordination stops. Pair them with hands-on engineers and keep them at the decision level.

Communications Lead

Someone must own stakeholder updates. This is often underestimated. During a major incident, business leaders, end users, and sometimes customers need timely, accurate information. The communications lead drafts and sends updates, manages the status page if one exists, and fields inbound queries so the technical team can focus on resolution. Where no dedicated role exists, the major incident manager absorbs this work — and both jobs suffer.

Scribe

A dedicated scribe captures the timeline as it happens: when the incident was detected, what hypotheses were tested, what changes were made, when each update went out. This record is invaluable twice — during the incident, when a new resolver joins the bridge and needs a sixty-second briefing, and afterwards, when the post-incident review needs facts rather than recollections. Without a scribe, the timeline gets reconstructed days later from fragmented chat logs and fallible memory.
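
One way to picture the scribe's record is as an append-only list of timestamped entries from which a short briefing can be produced on demand. The sketch below is a minimal illustration in Python; the class names, the entry kinds, and the choice to brief with the first entry plus the latest few are all assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    kind: str   # hypothetical kinds: "detected", "hypothesis", "change", "update"
    note: str


@dataclass
class Timeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def log(self, at: datetime, kind: str, note: str) -> None:
        self.entries.append(TimelineEntry(at, kind, note))

    def briefing(self, last_n: int = 3) -> str:
        """Summary for a resolver joining late: the first entry plus the latest few."""
        if not self.entries:
            return "No entries yet."
        ordered = sorted(self.entries, key=lambda e: e.at)
        picked = ordered[:1] + ordered[1:][-last_n:]
        return "\n".join(f"{e.at:%H:%M} [{e.kind}] {e.note}" for e in picked)


timeline = Timeline()
timeline.log(datetime(2025, 3, 4, 9, 2), "detected", "Payroll API returning errors")
timeline.log(datetime(2025, 3, 4, 9, 10), "hypothesis", "Suspect expired certificate")
timeline.log(datetime(2025, 3, 4, 9, 25), "change", "Certificate renewed on gateway")
timeline.log(datetime(2025, 3, 4, 9, 30), "update", "Second stakeholder update sent")
print(timeline.briefing(last_n=2))
```

The same entries later feed the post-incident review, so nothing has to be reconstructed from chat logs.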

Resolver Groups

These are the subject-matter experts pulled in as needed — network engineers, database administrators, application owners, third-party vendors. Each resolver group should have a named contact and an escalation path documented in your runbooks before an incident occurs. For vendor-dependent services, record the support contract reference, the severity definitions the vendor uses, and the phone number that reaches a human — hunting for these details during a live outage adds thirty minutes nobody can spare.

Declaring and Activating the Major Incident Process

The decision to declare a major incident should be fast and low-friction. Delays in declaration mean delays in assembling the right people. A useful internal benchmark: the interval from first credible evidence of a qualifying outage to formal declaration should be no more than ten to fifteen minutes. If your average is an hour, the problem is usually cultural: staff fear being blamed for a false alarm. Make it explicit in policy that a reasonable declaration later downgraded is a success of the process, not a failure of judgement. Good event management and monitoring shorten this window further, because a well-tuned alert often reaches the service desk before the first user call does.

A good activation checklist looks like this:

Confirm the incident meets your documented severity criteria
Assign a major incident manager and technical lead immediately
Open a dedicated bridge call or war room — do not rely on email threads
Create a major incident ticket that is separate from or linked to the originating incident record
Send the first stakeholder notification within fifteen minutes of declaration, even if it only confirms that investigation is underway
Identify and invite resolver groups based on the affected service and initial diagnosis
Set a communication cadence — many teams use updates every thirty minutes during active resolution
Nominate a scribe and start the live timeline from the first entry
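
The checklist above lends itself to simple tracking, so the major incident manager can see at a glance what is still outstanding. The Python sketch below is a rough illustration: the step labels are shortened paraphrases of the checklist, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Shortened labels for the activation steps listed above.
ACTIVATION_STEPS = [
    "confirm severity criteria",
    "assign major incident manager and technical lead",
    "open bridge call",
    "create major incident ticket",
    "send first stakeholder notification",
    "invite resolver groups",
    "set communication cadence",
    "nominate scribe",
]


def outstanding_steps(done: set[str]) -> list[str]:
    """Return the activation steps not yet completed, in checklist order."""
    return [step for step in ACTIVATION_STEPS if step not in done]


def first_notification_due(declared_at: datetime, within_minutes: int = 15) -> datetime:
    """Deadline for the first stakeholder notification after declaration."""
    return declared_at + timedelta(minutes=within_minutes)


done = {"confirm severity criteria", "open bridge call"}
declared_at = datetime(2025, 3, 4, 9, 5)
print(outstanding_steps(done))
print(first_notification_due(declared_at))  # 2025-03-04 09:20:00
```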

Declaration is also where escalation discipline matters. Functional escalation brings in deeper technical expertise; hierarchic escalation informs management and unlocks decisions such as emergency spend. Both routes should be pre-mapped rather than improvised — our guide to escalation management covers how to build those paths so they hold up under pressure.

Keeping the Bridge Call Productive

A major incident bridge call can quickly become chaotic. The major incident manager should open every call with a thirty-second situation summary, confirm the scribe is capturing actions and findings, and keep the call focused on decisions and blockers rather than open-ended troubleshooting. Side conversations and diagnostic rabbit holes should happen off the main call — in breakout channels or separate huddles — with findings reported back at intervals. Two more habits pay off: keep a visible list of open actions with named owners, and time-box every diagnostic avenue. If a hypothesis has produced nothing in twenty minutes, the major incident manager should ask whether it deserves another twenty.
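
The time-box habit can be supported with a very small helper that flags any diagnostic avenue that has run past its limit. This is a sketch under stated assumptions: the mapping of hypothesis descriptions to start times is a hypothetical structure, and the twenty-minute default simply mirrors the example in the text.

```python
from datetime import datetime, timedelta


def overdue_hypotheses(open_hypotheses: dict[str, datetime],
                       now: datetime,
                       time_box: timedelta = timedelta(minutes=20)) -> list[str]:
    """Return hypotheses that have been worked on for at least the time-box."""
    return sorted(h for h, started in open_hypotheses.items() if now - started >= time_box)


now = datetime(2025, 3, 4, 10, 0)
open_hypotheses = {
    "expired certificate on gateway": datetime(2025, 3, 4, 9, 35),
    "database connection exhaustion": datetime(2025, 3, 4, 9, 50),
}
print(overdue_hypotheses(open_hypotheses, now))  # ['expired certificate on gateway']
```

A flagged hypothesis is a prompt for a decision on the bridge, not an automatic stop.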

Diagnose by Hypothesis, Not by Guesswork

Under pressure, teams default to trying things. A better pattern is explicit hypothesis testing: state what you believe is wrong, what evidence would confirm it, and what you will do if it is confirmed or ruled out. The first question should always be: what changed? A large share of major incidents trace back to a recent change, which is why a disciplined change management process with a searchable change record is one of the fastest diagnostic tools you have. Check the change schedule, recent deployments, and expiring certificates before diving into packet captures.
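
To make the "what changed?" question concrete, the sketch below filters a list of change records down to those implemented against the affected service shortly before the incident began. The record schema, the change identifiers, and the 72-hour lookback are assumptions for illustration; a real change record will have its own fields and query interface.

```python
from datetime import datetime, timedelta


def recent_changes(changes: list[dict], service: str, incident_start: datetime,
                   lookback: timedelta = timedelta(hours=72)) -> list[dict]:
    """Changes to the service implemented within the lookback window, newest first.

    Each change is a dict with 'id', 'service' and 'implemented_at' keys
    (a hypothetical schema).
    """
    window_start = incident_start - lookback
    hits = [
        c for c in changes
        if c["service"] == service and window_start <= c["implemented_at"] <= incident_start
    ]
    return sorted(hits, key=lambda c: c["implemented_at"], reverse=True)


incident_start = datetime(2025, 3, 4, 9, 0)
changes = [
    {"id": "CHG-101", "service": "payroll", "implemented_at": datetime(2025, 3, 3, 22, 0)},
    {"id": "CHG-090", "service": "payroll", "implemented_at": datetime(2025, 2, 20, 22, 0)},
    {"id": "CHG-102", "service": "email", "implemented_at": datetime(2025, 3, 4, 7, 0)},
]
print([c["id"] for c in recent_changes(changes, "payroll", incident_start)])  # ['CHG-101']
```

This version matches only on the named service. In practice the search would usually widen to the services and infrastructure the affected service depends on.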

Communication During a Major Incident

Poor communication during a major incident often causes as much damage as the outage itself. Business leaders who cannot get updates escalate through informal channels, creating noise that distracts the technical team. Customers who see no acknowledgement lose trust over the silence more quickly than they would over the outage alone.

Effective major incident communication follows these principles:

Send updates on a fixed schedule, not only when there is news to share — silence is interpreted as things getting worse
Use plain language that non-technical stakeholders can understand; nobody outside IT needs to hear about BGP or connection pools
State clearly what is affected, what is not affected, and what is being done
Give a realistic estimate for the next update rather than a resolution time you cannot commit to
Use a single authoritative channel — a status page, an ITSM notification, or an internal broadcast — rather than ad-hoc emails from multiple people
Separate audiences: executives need impact and decisions, users need workarounds and expectations, customers need reassurance and honesty

Most ITSM platforms allow you to send bulk notifications from the major incident ticket. This keeps the communication trail in one place and reduces the chance of contradictory messages going out.

A Simple Update Template

Write your update template before you need it, and keep it to five lines any communications lead can complete in two minutes:

Status: investigating, identified, fixing, monitoring, or resolved
Impact: which services and user groups are affected, in business terms
Workaround: what affected users can do right now, if anything
Actions: what the response team is doing, in one plain sentence
Next update: a specific time, which you then honour even if nothing has changed

Consistency matters more than eloquence. Stakeholders who receive the same shaped message every thirty minutes stop phoning the service desk for news, and your inbound call volume during the outage drops noticeably.
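
The five-line template can be turned into a small function so that every update has the same shape and an invalid status is caught before it is sent. This Python sketch is only an illustration: the function name and the fallback wording for a missing workaround are assumptions.

```python
from datetime import datetime

VALID_STATUSES = ("investigating", "identified", "fixing", "monitoring", "resolved")


def render_update(status: str, impact: str, workaround: str,
                  actions: str, next_update: datetime) -> str:
    """Build the five-line stakeholder update described above."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    return "\n".join([
        f"Status: {status}",
        f"Impact: {impact}",
        f"Workaround: {workaround or 'None available yet'}",
        f"Actions: {actions}",
        f"Next update: {next_update:%H:%M}",
    ])


message = render_update(
    status="identified",
    impact="Payroll submissions are failing for all finance users",
    workaround="",
    actions="Engineers are renewing an expired certificate",
    next_update=datetime(2025, 3, 4, 10, 30),
)
print(message)
```

Requiring a specific next-update time as an argument makes it hard to send an update without committing to the next one.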

Resolution, Workarounds, and Service Restoration

Resolution in a major incident context often happens in two stages: first a workaround that restores service to an acceptable level, then a permanent fix that addresses the root cause. These should be treated as separate milestones, because conflating them either delays restoration while engineers chase perfection or lets the underlying fault survive because service came back.

When a workaround is available:

Communicate it clearly to affected users through the same channels used for updates
Document it in the major incident record so it can be referenced in the post-incident review
Assess its risk explicitly — a workaround that disables a safety control or degrades data integrity may be worse than a longer outage
Do not close the major incident until the permanent fix is in place or a problem record has been raised to track the underlying cause

When service is fully restored:

Confirm restoration with the business stakeholder who reported the impact, not just with internal technical checks — a green dashboard and a working user experience are not the same thing
Hold a short monitoring period, typically thirty to sixty minutes, before standing the bridge down, since premature stand-down followed by recurrence damages credibility badly
Send a final stakeholder notification confirming resolution and the expected timeline for a post-incident review
Close or link the originating incident records to the major incident ticket so reporting stays accurate

For incidents severe enough to threaten the survival of a service — a data centre loss, a ransomware event — the major incident process hands over to disaster recovery. Know where that boundary sits in advance; our guide to IT service continuity management covers how to define the invocation criteria. Security-driven major incidents should additionally follow your security incident response plan, ideally aligned with guidance from NIST, whose incident handling framework is the reference point for most security response programmes.

Raising a Problem Record

Every major incident should result in a problem record unless the cause is already known and fixed. The problem record drives the root cause analysis and tracks any permanent remediation work. This is the link between incident management and problem management, and it is where most organisations lose continuity — the major incident gets closed and the underlying cause is never formally investigated. Six months later the same failure recurs and the response starts from zero. Make the linked problem record a mandatory field for major incident closure and the gap disappears.
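
A mandatory-field rule of this kind is easy to express as a closure gate. The sketch below assumes a major incident record represented as a dictionary with hypothetical field names; a real ITSM platform would enforce this through its own workflow configuration.

```python
def closure_blockers(record: dict) -> list[str]:
    """Return the reasons a major incident cannot be closed yet (empty list = may close).

    The field names used here are hypothetical.
    """
    blockers = []
    if not record.get("restoration_confirmed_by_business"):
        blockers.append("confirm restoration with the business stakeholder")
    # Closure needs either a completed permanent fix or a linked problem record.
    if not (record.get("permanent_fix_in_place") or record.get("problem_record_id")):
        blockers.append("link a problem record or complete the permanent fix")
    return blockers


workaround_only = {"restoration_confirmed_by_business": True}
print(closure_blockers(workaround_only))
# ['link a problem record or complete the permanent fix']

tracked = {"restoration_confirmed_by_business": True, "problem_record_id": "PRB-0042"}
print(closure_blockers(tracked))  # []
```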

Post-Incident Review and Continuous Improvement

The post-incident review — sometimes called a post-mortem or after-action review — is where major incident management delivers its long-term value. It should happen within three to five business days of resolution while details are still fresh, and it should be scheduled before the bridge call ends so it cannot quietly slip.

A structured post-incident review covers:

A factual timeline of the incident from first detection to resolution, taken from the scribe's record
What worked well in the response — worth capturing explicitly, because good practice that goes unnamed gets lost
What slowed the response down — detection gaps, missing runbooks, unreachable contacts, tooling friction
Root cause findings from the linked problem record, using a structured technique such as five whys or a fishbone diagram
Specific action items with owners and due dates — not vague recommendations

Run the review blamelessly. The question is never who caused the outage but why the system made that failure possible and why it was not caught sooner. Teams that fear blame hide information, and hidden information guarantees repeat incidents. The output should be shared with relevant stakeholders and tracked to completion through your continual improvement register — an action item that sits in a document nobody reads is not improvement, it is documentation theatre.

Metrics to Track for Major Incidents

Tracking major incident performance over time helps you identify systemic weaknesses in your process. Useful metrics include:

Time to declare, from first alert to major incident declaration
Time to assemble, from declaration to full resolver group on the bridge
Mean time to restore service for major incidents, tracked separately from your general MTTR figures so a handful of severe outages do not distort the routine picture
Number of major incidents per quarter by service or category
Percentage of major incidents that result in a completed post-incident review, which should be 100 percent
Percentage of review action items completed on time
Repeat major incidents linked to the same root cause

These metrics belong in your regular service review alongside standard incident and SLA data. Service management standards such as ISO/IEC 20000, published by ISO, expect exactly this kind of evidenced measurement and review cycle, so the same data supports certification if you pursue it.
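
The timing metrics above can be derived directly from four timestamps on each major incident record. The following Python sketch is a minimal illustration under stated assumptions: the key names are hypothetical, and time to restore is measured here from first alert to restoration, which is one reasonable convention among several.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def incident_metrics(inc: dict) -> dict[str, float]:
    """Per-incident timings in minutes, from hypothetical timestamp fields."""
    return {
        "time_to_declare": minutes_between(inc["first_alert"], inc["declared"]),
        "time_to_assemble": minutes_between(inc["declared"], inc["assembled"]),
        "time_to_restore": minutes_between(inc["first_alert"], inc["restored"]),
    }


def quarterly_summary(incidents: list[dict]) -> dict[str, float]:
    """Mean of each timing across the quarter's major incidents."""
    if not incidents:
        return {}
    per_incident = [incident_metrics(i) for i in incidents]
    return {name: mean(m[name] for m in per_incident) for name in per_incident[0]}


incidents = [
    {"first_alert": datetime(2025, 1, 14, 9, 0), "declared": datetime(2025, 1, 14, 9, 12),
     "assembled": datetime(2025, 1, 14, 9, 30), "restored": datetime(2025, 1, 14, 10, 0)},
    {"first_alert": datetime(2025, 2, 3, 14, 0), "declared": datetime(2025, 2, 3, 14, 8),
     "assembled": datetime(2025, 2, 3, 14, 20), "restored": datetime(2025, 2, 3, 16, 0)},
]
print(quarterly_summary(incidents))
# {'time_to_declare': 10.0, 'time_to_assemble': 15.0, 'time_to_restore': 90.0}
```

With only a handful of major incidents per quarter, a mean is easily skewed by one long outage, so it is worth reading the per-incident figures alongside the summary.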

Prepare Before It Happens: Runbooks and Drills

The strongest predictor of a good major incident response is what was done before the incident. Three preparations pay for themselves many times over.

First, maintain response runbooks for your most critical services: architecture summary, dependencies, known failure modes, diagnostic starting points, vendor contacts, and failover steps. Keep them where they can be reached when the primary systems are down — a runbook stored only on the wiki that just went offline is a punchline, not a plan.

Second, keep configuration data current. Resolver groups lose serious time during live incidents working out what infrastructure sits behind a failing service. Accurate dependency mapping in a CMDB turns that from an investigation into a lookup. TIKTING links major incidents to affected configuration items, problem records, and bulk stakeholder notifications in one place, with Odysseus discovery data keeping the underlying asset picture current automatically.

Third, run drills. A twice-yearly simulated major incident — a tabletop walkthrough or a game-day exercise against a non-production environment — surfaces broken contact lists, unclear authority, and stale runbooks at a cost of two hours rather than a live outage. Rotate the people playing each role so cover is genuine rather than nominal.

Common Major Incident Management Pitfalls

Most process failures fall into a small number of recurring traps:

Declaring too slowly because nobody wants responsibility for a false alarm — fix it with explicit authority and a no-blame downgrade policy
Everyone troubleshooting, nobody coordinating — the major incident manager role exists precisely to prevent this
Executives joining the technical bridge and pulling engineers into status conversations — give leadership a separate briefing channel
Updates promised and then missed, which destroys trust faster than the outage itself
Closing the incident at workaround stage with no problem record, guaranteeing recurrence
Skipping the post-incident review when the resolution was quick, which is exactly when cheap lessons are available
Treating the process as a document rather than a capability — unrehearsed processes fail on first contact with a real outage

Key Takeaways

Define your major incident criteria in writing before an outage forces the decision under pressure, and give named people authority to declare
Assign clear roles — major incident manager, technical lead, communications lead, scribe — and rehearse them
Declare fast, open a bridge call, and send the first stakeholder update within fifteen minutes
Communicate on a fixed schedule using plain language through a single authoritative channel, with a pre-written template
Treat workaround and permanent fix as separate milestones and raise a problem record for every major incident
Run a blameless post-incident review within five business days and track action items to completion
Measure time to declare, time to assemble, and restore times, and drill the process at least twice a year

Frequently Asked Questions

What is the difference between an incident and a major incident?

An incident is any unplanned interruption or degradation of an IT service, most of which are resolved through normal service desk workflows. A major incident is the highest-impact category (a critical service down, many users affected, or serious business exposure), and it triggers a separate procedure with a dedicated major incident manager, a bridge call, and formal stakeholder communications rather than standard queue-based handling.

Who declares a major incident?

Declaration authority should be explicitly assigned, typically to the duty service desk manager, the on-call major incident manager, or a senior operations lead. What matters is that the named people can declare without seeking permission, and that policy protects them if an incident is later downgraded. Slow declarations caused by fear of false alarms cost far more than the occasional over-call.

What does a major incident manager do?

The major incident manager owns the response from declaration to closure. They run the bridge call, assign and track actions, decide on escalations such as invoking vendor support or failing over, ensure updates go out on schedule, and keep the technical team shielded from interruptions. They coordinate the fix rather than performing it, which is why the role must stay separate from hands-on troubleshooting.

How quickly should the first communication go out?

Within fifteen minutes of declaration, even if the only content is an acknowledgement that a major incident is being investigated. Early acknowledgement stops the flood of duplicate tickets and informal escalations that otherwise consume the service desk. After that, updates should follow a fixed cadence — commonly every thirty minutes — with each update naming the time of the next one.

How long after a major incident should the post-incident review happen?

Within three to five business days of resolution. Sooner than that, the team is often still completing cleanup work; later, details fade and the timeline becomes guesswork. Schedule the review before the bridge call is stood down, run it blamelessly, and publish action items with named owners and due dates so improvements are tracked rather than merely discussed.

Is every P1 incident a major incident?

No. Priority describes urgency and impact on the ticket; major incident status is a process decision. Many P1s are resolved quickly through normal channels and never need a bridge call, while an incident logged at lower priority can be promoted once its real scope becomes clear. Define the major incident threshold separately from the priority matrix and let the incident manager apply judgement.

How many major incidents per year is normal?

There is no universal benchmark: it depends on estate size, service criticality, and how strictly you apply the definition. The more useful signals are trends: rising declaration counts for the same services, repeat incidents sharing a root cause, or resolver groups taking longer to assemble after each declaration. Track those quarter on quarter and investigate the pattern rather than chasing an external norm.