IT Major Incident Management: A Practical Process Guide for 2025

June 20, 2026

Major incidents need a process of their own. Learn how to declare, manage, communicate, and review major incidents with a practical step-by-step framework.

Major incident management is one of the highest-pressure disciplines in IT operations, yet many service desks treat it as an extension of normal incident handling rather than a distinct process. This guide walks through how to define, declare, and resolve major incidents efficiently — covering roles, communication, escalation, and the post-incident steps that prevent recurrence.

What Counts as a Major Incident

Not every outage is a major incident, and calling everything one exhausts your team and dilutes the label. Most organisations define a major incident as an unplanned disruption that meets one or more of these thresholds:

  • A critical business service is completely unavailable
  • A large number of users or business units are affected simultaneously
  • There is significant financial, reputational, or regulatory exposure
  • Normal incident resolution procedures are insufficient to restore service quickly

The exact thresholds should be documented in your incident classification policy and agreed with business stakeholders before an outage happens — not negotiated in the middle of one. Common triggers include full application outages, network failures affecting multiple sites, security breaches that interrupt service, and data unavailability affecting business operations.
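
As a rough illustration, the thresholds above can be written down as a simple check that a triage analyst or an automation rule applies the same way every time. This is a minimal sketch, not a prescribed implementation: the field names and the 500-user threshold are assumptions made for the example, and the real values belong in your own classification policy.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Facts gathered at triage. All field names are illustrative."""
    critical_service_down: bool         # a critical business service is completely unavailable
    users_affected: int                 # number of users affected at the same time
    significant_exposure: bool          # financial, reputational, or regulatory exposure
    normal_process_insufficient: bool   # routine handling cannot restore service quickly


# Hypothetical threshold for "a large number of users"; agree the real one with the business.
LARGE_USER_THRESHOLD = 500


def is_major_incident(signal: IncidentSignal) -> bool:
    """Return True if at least one documented threshold is met."""
    return any([
        signal.critical_service_down,
        signal.users_affected >= LARGE_USER_THRESHOLD,
        signal.significant_exposure,
        signal.normal_process_insufficient,
    ])
```

Writing the criteria in this form makes it harder to renegotiate them during an outage, because the inputs are facts rather than opinions.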

Why a Separate Process Matters

Standard incident management is designed to handle routine disruptions at pace. Major incidents require coordinated effort across multiple teams, real-time executive communication, and a command structure that keeps decisions moving under pressure. Without a separate process, you get confusion over who is in charge, inconsistent stakeholder updates, and slower resolution times.

Roles and Responsibilities in a Major Incident

Clarity of ownership is the single biggest factor in how quickly a major incident gets resolved. Define these roles before you need them.

Major Incident Manager

This person owns the resolution process from declaration to closure. They do not fix the technical problem — they coordinate the people who do, remove blockers, run the bridge call, and ensure communication flows. In smaller organisations this role may sit with a senior service desk lead or IT manager. In larger environments it is often a dedicated role.

Technical Lead

The technical lead is the most qualified engineer for the affected system. They direct the diagnostic and remediation effort on the bridge call, delegate tasks to other technical staff, and report progress to the major incident manager at agreed intervals.

Communications Lead

Someone must own stakeholder updates. This is often underestimated. During a major incident, business leaders, end users, and sometimes customers need timely, accurate information. The communications lead drafts and sends updates, manages the status page if one exists, and fields inbound queries so the technical team can focus on resolution.

Resolver Groups

These are the subject-matter experts pulled in as needed — network engineers, database administrators, application owners, third-party vendors. Each resolver group should have a named contact and an escalation path documented in your runbooks before an incident occurs.

Declaring and Activating the Major Incident Process

The decision to declare a major incident should be fast and low-friction. Delays in declaration mean delays in assembling the right people.

A good activation checklist looks like this:

  • Confirm the incident meets your documented severity criteria
  • Assign a major incident manager and technical lead immediately
  • Open a dedicated bridge call or war room — do not rely on email threads
  • Create a major incident ticket that is separate from or linked to the originating incident record
  • Send the first stakeholder notification within fifteen minutes of declaration, even if it only confirms that investigation is underway
  • Identify and invite resolver groups based on the affected service and initial diagnosis
  • Set a communication cadence — many teams use updates every thirty minutes during active resolution
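
The timing items in the checklist can be turned into concrete clock times at the moment of declaration. The sketch below assumes the fifteen-minute window and thirty-minute cadence mentioned above, and it assumes the cadence is counted from the first notification deadline; the function and constant names are made up for the example.

```python
from datetime import datetime, timedelta

# Values taken from the checklist above; adjust them to your own policy.
FIRST_NOTIFICATION_WINDOW = timedelta(minutes=15)
UPDATE_CADENCE = timedelta(minutes=30)


def notification_schedule(declared_at: datetime, updates: int = 3) -> list[datetime]:
    """Return the first-notification deadline followed by `updates` scheduled updates.

    Assumption: the cadence starts from the first notification deadline,
    not from the declaration time.
    """
    first = declared_at + FIRST_NOTIFICATION_WINDOW
    return [first] + [first + UPDATE_CADENCE * n for n in range(1, updates + 1)]


if __name__ == "__main__":
    for due in notification_schedule(datetime(2026, 6, 20, 9, 0)):
        print(due.strftime("%H:%M"))
```

Posting these times in the bridge call or war room gives the communications lead a schedule to hold to, whether or not there is news.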

Keeping the Bridge Call Productive

A major incident bridge call can quickly become chaotic. The major incident manager should open every call with a thirty-second situation summary, assign a scribe to capture actions and findings, and keep the call focused on decisions and blockers rather than open-ended troubleshooting. Side conversations and diagnostic rabbit holes should happen off the main call with findings reported back at intervals.

Communication During a Major Incident

Poor communication during a major incident can cause as much damage as the outage itself. Business leaders who cannot get updates escalate through informal channels, creating noise that distracts the technical team. Customers who see no acknowledgement lose trust quickly, often faster than they would from the downtime alone.

Effective major incident communication follows these principles:

  • Send updates on a fixed schedule, not only when there is news to share
  • Use plain language that non-technical stakeholders can understand
  • State clearly what is affected, what is not affected, and what is being done
  • Give a realistic estimate for the next update rather than a resolution time you cannot commit to
  • Use a single authoritative channel — a status page, an ITSM notification, or an internal broadcast — rather than ad-hoc emails from multiple people
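
One way to apply these principles consistently is a fixed update template that always states what is affected, what is not, what is being done, and when the next update is due. The following is a sketch only: the function name, the wording of the labels, and the thirty-minute default are assumptions for illustration.

```python
from datetime import datetime, timedelta


def format_update(affected: str, not_affected: str, action: str,
                  sent_at: datetime, cadence_minutes: int = 30) -> str:
    """Build a plain-language stakeholder update.

    The message commits to a time for the next update rather than
    to a resolution time.
    """
    next_update = sent_at + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"Affected: {affected}",
        f"Not affected: {not_affected}",
        f"What we are doing: {action}",
        f"Next update by: {next_update:%H:%M}",
    ])


if __name__ == "__main__":
    print(format_update(
        affected="Email for all sites",
        not_affected="Chat and telephony",
        action="Engineers are restoring the mail gateway",
        sent_at=datetime(2026, 6, 20, 9, 15),
    ))
```

Because every update has the same four lines, stakeholders learn where to look and the communications lead has less to draft under pressure.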

Most ITSM platforms allow you to send bulk notifications from the major incident ticket. This keeps the communication trail in one place and reduces the chance of contradictory messages going out.

Resolution, Workarounds, and Service Restoration

Resolution in a major incident context often happens in two stages: first a workaround that restores service to an acceptable level, then a permanent fix that addresses the root cause. These should be treated as separate milestones.

When a workaround is available:

  • Communicate it clearly to affected users through the same channels used for updates
  • Document it in the major incident record so it can be referenced in the post-incident review
  • Do not close the major incident until the permanent fix is in place or a problem record has been raised to track the underlying cause

When service is fully restored:

  • Confirm restoration with the business stakeholder who reported the impact, not just with internal technical checks
  • Send a final stakeholder notification confirming resolution and the expected timeline for a post-incident review
  • Close the originating incident records or link them to the major incident ticket

Raising a Problem Record

Every major incident should result in a problem record unless the cause is already known and fixed. The problem record drives the root cause analysis and tracks any permanent remediation work. This is the link between incident management and problem management, and it is where most organisations lose continuity — the major incident gets closed and the underlying cause is never formally investigated.
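
The closure rule described in this section can be expressed as a small gate that a workflow or a reviewer applies before a major incident is closed. This is a sketch under stated assumptions: the record fields and the example problem reference are hypothetical and do not describe any particular ITSM platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MajorIncident:
    """Minimal major incident record. Field names are illustrative."""
    service_restored: bool
    permanent_fix_in_place: bool
    problem_record_id: Optional[str] = None  # hypothetical reference such as "PRB-123"


def can_close(incident: MajorIncident) -> bool:
    """Allow closure only when service is restored and the underlying
    cause is either fixed or tracked in a problem record."""
    if not incident.service_restored:
        return False
    return incident.permanent_fix_in_place or incident.problem_record_id is not None
```

A gate like this keeps a workaround from being mistaken for a resolution: service restored through a workaround alone is not enough to close the record.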

Post-Incident Review and Continuous Improvement

The post-incident review — sometimes called a post-mortem or after-action review — is where major incident management delivers its long-term value. It should happen within three to five business days of resolution while details are still fresh.

A structured post-incident review covers:

  • A factual timeline of the incident from first detection to resolution
  • What worked well in the response
  • What slowed the response down
  • Root cause findings from the linked problem record
  • Specific action items with owners and due dates — not vague recommendations

The output should be shared with relevant stakeholders and tracked to completion. An action item that sits in a document nobody reads is not improvement — it is documentation theatre.
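
Tracking to completion can be as simple as a list of action items that is checked for overdue entries at each service review. The sketch below assumes a minimal record with a description, an owner, and a due date; the names are illustrative rather than taken from any tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One post-incident review action. Field names are illustrative."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return the open action items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]
```

An item with no owner or no due date cannot be created in this structure at all, which is the point: vague recommendations do not fit.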

Metrics to Track for Major Incidents

Tracking major incident performance over time helps you identify systemic weaknesses in your process. Useful metrics include:

  • Time to declare (from first alert to major incident declaration)
  • Time to assemble (from declaration to full resolver group on the bridge)
  • Mean time to restore service for major incidents
  • Number of major incidents per quarter by service or category
  • Percentage of major incidents that result in a completed post-incident review
  • Repeat major incidents linked to the same root cause

These metrics belong in your regular service review alongside standard incident and SLA data.
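
The time-based metrics above can be derived from four timestamps per major incident. This is a minimal sketch: the field names are assumptions, and it assumes time to restore is measured from first alert to service restoration, which your own policy may define differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimeline:
    """Key timestamps for one major incident. Field names are illustrative."""
    first_alert: datetime
    declared: datetime
    assembled: datetime   # full resolver group on the bridge
    restored: datetime


def time_to_declare(t: IncidentTimeline) -> timedelta:
    """Time from first alert to major incident declaration."""
    return t.declared - t.first_alert


def time_to_assemble(t: IncidentTimeline) -> timedelta:
    """Time from declaration to full resolver group on the bridge."""
    return t.assembled - t.declared


def mean_time_to_restore(timelines: list[IncidentTimeline]) -> timedelta:
    """Average of (restored - first_alert) across incidents.

    Assumption: the clock starts at first alert, not at declaration.
    """
    seconds = mean((t.restored - t.first_alert).total_seconds() for t in timelines)
    return timedelta(seconds=seconds)
```

Capturing these timestamps on the major incident ticket during the incident is far easier than reconstructing them afterwards from chat logs.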

Key Takeaways

  • Define your major incident criteria in writing before an outage forces the decision under pressure
  • Assign clear roles — major incident manager, technical lead, communications lead — and make sure everyone knows them
  • Declare fast, open a bridge call, and send the first stakeholder update within fifteen minutes of declaration
  • Communicate on a fixed schedule using plain language through a single authoritative channel
  • Treat workaround and permanent fix as separate milestones, and raise a problem record unless the cause is already known and fixed
  • Run a structured post-incident review within three to five business days and track action items to completion

TIKTING supports major incident management with dedicated severity classification, linked problem records, bulk stakeholder notifications, and SLA tracking across incident lifecycles. Odysseus asset discovery feeds accurate configuration data into TIKTING so your resolver groups can see affected infrastructure immediately rather than hunting for it during a live outage — reducing time to diagnose and restoring service faster.
