A few months ago, I was helping a client engineering team review their new SOP and identify gaps. The task should have been simple. It wasn’t.

The SOP looked clean and technically sound. Then I saw the critical flaw in on-call escalation:

  • A five-minute “quick fix” window, but no reference to how escalation actually works
  • No defined escalation point
  • No criteria for when to escalate
  • No description of how the escalation chain works
  • No redundant escalation path

I picked up the phone and asked a very simple question: “How does on-call work?” The answer has stuck with me, even though the team has since fixed the problem. On-call was treated as a volunteer rotation among senior engineers, with a loose minimum number of shifts per year, not as a defined operational model.

On-call shouldn’t be volunteer-only, and it shouldn’t be limited to senior engineers by default.

The fix

The fix is straightforward:

  • A defined on-call schedule
  • A multi-level redundant on-call plan

For this team, the right answer was an on-call plan with redundant tiers running through engineering leadership and up to executive escalation. It worked roughly like this:

  1. An engineer who hit an outage they couldn’t fix called the primary on-call senior engineering contacts. We agreed to use two numbers for redundancy.
  2. If both contacts were unavailable, or if broader authority was needed, the issue escalated to a staff-level engineer, with the original senior engineers staying on bridge coverage unless formally relieved.
  3. If the staff engineer didn’t reply, it went to the on-call engineering manager, then continued up through executive leadership as needed.
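
For readers who think in code, the chain above can be sketched as an ordered list of tiers, each with one or more contacts. This is only a rough illustration, not the team's actual tooling: the tier names, the contact labels, and the `is_reachable` callback are all hypothetical placeholders. It also models only the "nobody answered" path, not escalation for broader authority or the rule about senior engineers staying on the bridge.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    contacts: Tuple[str, ...]  # tried in order; more than one gives redundancy


# Hypothetical chain mirroring the steps above. All labels are placeholders.
ESCALATION_CHAIN = (
    Tier("senior engineer", ("senior-primary", "senior-backup")),
    Tier("staff engineer", ("staff-on-call",)),
    Tier("engineering manager", ("manager-on-call",)),
    Tier("executive leadership", ("exec-on-call",)),
)


def escalate(
    chain: Sequence[Tier],
    is_reachable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Walk the chain in order and return (tier name, contact) for the
    first contact that answers, or None if nobody in the chain does."""
    for tier in chain:
        for contact in tier.contacts:
            if is_reachable(contact):
                return tier.name, contact
    return None


# Example: both senior numbers go unanswered, so the staff engineer picks up.
answered = {"staff-on-call"}
print(escalate(ESCALATION_CHAIN, lambda contact: contact in answered))
# prints ('staff engineer', 'staff-on-call')
```

The point of writing it down this way is the same as the point of the SOP: every tier has a named owner, and every failure to answer has a defined next step.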

Elegant? No.

Effective? Yes.

In engineering and incident management, the best solutions often put function over form.


This post was imported from the author’s LinkedIn.