Rethinking Resilience: Why RTOs, SLAs, and Third-Party Risk Metrics Miss the Point
Introduction: The Broken Promises of Business Continuity
The resilience industry has a trust problem. Not because systems fail (that’s inevitable), but because the response to failure is wrapped in red tape, vanity metrics, and meaningless jargon. Clients/customers don’t need perfection; they want honesty, clarity, and a sense that someone is in control. Yet traditional business continuity planning clings to outdated concepts like RTOs and boilerplate SLAs while ignoring what really matters: how people feel when things go wrong.
At PAI Consulting, we call this out for what it is: resilience theater. That’s why we’ve built our Realistic Resilience methodology around the truth: systems fail, third parties falter, and clients/customers will forgive you, as long as you treat them like adults.
Why RTOs Are Vanity Metrics
The Recovery Time Objective (RTO) is one of the most widely used (and widely misunderstood) metrics in resilience planning. It represents the maximum amount of time a system or function can be down before significant impact occurs. But in practice?
- RTOs are often guessed, not calculated.
- They're set without real input from IT or third parties.
- They are rarely tested or validated under realistic conditions.
- And worst of all, they are virtually never communicated to clients/customers.
This leads to absurd situations: a service outage occurs, and even if you're technically "within your RTO," clients/customers are furious because they had no idea what that meant. Or worse, they weaponize the RTO as a hard expectation, regardless of the root cause.
Instead of relying on fictional timelines, we advocate for percentage-based availability and transparency-driven communication.
SLAs: The Illusion of Control
Service Level Agreements (SLAs) often promise 99.9% uptime, 24/7 support, and rapid response. But what they really offer is legal cover. Vendors treat SLAs as compliance documents, not living operational commitments.
We’ve seen organizations get burned because their vendor hit the letter of the SLA while completely violating the spirit of trust and service. Even a 99.999% uptime guarantee still allows for roughly 5 minutes of downtime per year (about 26 seconds per month), and a more typical 99.9% guarantee allows for around 43 minutes per month. That downtime can matter a great deal if it occurs during a critical transaction.
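The downtime an uptime percentage permits is simple arithmetic, and it is worth running before signing anything. The sketch below is illustrative only: the function name `downtime_budget_minutes` and the 30-day month are assumptions made for the example, not part of any SLA standard.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an uptime percentage over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Assumes a 30-day month and a 365-day year for illustration.
    for target in (99.9, 99.99, 99.999):
        per_month = downtime_budget_minutes(target, 30)
        per_year = downtime_budget_minutes(target, 365)
        print(f"{target}% uptime: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

Under these assumptions, 99.9% works out to roughly 43 minutes per month, while 99.999% works out to roughly 5 minutes per year.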
And critically, SLAs typically do not distinguish between different kinds of outage:
- A full system crash counts the same as a degraded system with latency issues.
- A partial availability problem, where some users are affected but not all, may not even register as an SLA violation.
- Intermittent errors, performance slowdowns, or localized failures are often invisible in standard SLA reports.
This misalignment creates a dangerous blind spot. From the customer’s perspective, any degradation in performance feels like a failure. But under the SLA? Everything looks fine.
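To make the blind spot concrete, here is a minimal sketch that compares two ways of scoring the same month: a strict reading that counts only total outages, and a user-weighted reading that counts every minute in proportion to the share of users affected. The `Incident` record, both function names, and the sample figures are hypothetical and chosen only for illustration; real SLAs define downtime in their own terms.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes: float         # how long the incident lasted
    users_affected: float  # fraction of users affected, from 0.0 to 1.0


def strict_sla_availability(incidents, period_minutes):
    """Availability when only total outages (all users down) count."""
    down = sum(i.minutes for i in incidents if i.users_affected >= 1.0)
    return 100 * (1 - down / period_minutes)


def user_weighted_availability(incidents, period_minutes):
    """Availability when each minute counts by the share of users affected."""
    lost = sum(i.minutes * i.users_affected for i in incidents)
    return 100 * (1 - lost / period_minutes)


# Hypothetical month: one full outage and two partial degradations.
PERIOD_MINUTES = 30 * 24 * 60
sample = [
    Incident(minutes=20, users_affected=1.0),   # full crash
    Incident(minutes=180, users_affected=0.4),  # regional failure
    Incident(minutes=600, users_affected=0.1),  # intermittent errors
]

if __name__ == "__main__":
    print(f"Strict SLA view:    {strict_sla_availability(sample, PERIOD_MINUTES):.3f}%")
    print(f"User-weighted view: {user_weighted_availability(sample, PERIOD_MINUTES):.3f}%")
```

With these made-up numbers, the strict view stays above 99.9% while the user-weighted view drops below it, which is the gap between what the report says and what customers experienced.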
Instead of worshipping at the altar of five nines, Realistic Resilience encourages organizations to:
- Track actual availability month over month
- Share real mean time to recovery (MTTR) stats
- Be proactive in customer comms when things break
- Acknowledge and address partial, latent, or non-total outages as real customer-impacting incidents
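As a rough sketch of the first two points, monthly availability and MTTR can be computed directly from an incident log. The log format, the `monthly_report` function, and the sample incidents below are assumptions made for illustration; the sketch also assumes no incident spans a month boundary.

```python
import calendar
from collections import defaultdict
from datetime import datetime

# Hypothetical incident log: (start, end) of each customer-impacting incident.
incident_log = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 22)),
    (datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 8)),
    (datetime(2024, 2, 2, 11, 30), datetime(2024, 2, 2, 12, 15)),
]


def monthly_report(incidents):
    """Return availability and MTTR per (year, month), keyed by start date."""
    durations_by_month = defaultdict(list)
    for start, end in incidents:
        minutes = (end - start).total_seconds() / 60
        durations_by_month[(start.year, start.month)].append(minutes)

    report = {}
    for (year, month), durations in sorted(durations_by_month.items()):
        days_in_month = calendar.monthrange(year, month)[1]
        total_minutes = days_in_month * 24 * 60
        report[(year, month)] = {
            "incidents": len(durations),
            "availability_pct": 100 * (1 - sum(durations) / total_minutes),
            "mttr_minutes": sum(durations) / len(durations),
        }
    return report


if __name__ == "__main__":
    for (year, month), stats in monthly_report(incident_log).items():
        print(
            f"{year}-{month:02d}: {stats['availability_pct']:.3f}% available, "
            f"MTTR {stats['mttr_minutes']:.1f} min over {stats['incidents']} incident(s)"
        )
```

Months with no incidents do not appear in this report; a fuller version would list them at 100% so the month-over-month trend has no gaps.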
Third-Party Risk: More Than Questionnaires
Current third-party risk frameworks obsess over vendor questionnaires, due diligence checklists, and static scorecards. But when things go wrong, all that prep means nothing if there's no plan to communicate.
Worse, many organizations treat third parties as magical black boxes: “They’ll handle it.” No. You’re accountable to your clients/customers even for things you don’t directly control.
Realistic Resilience flips the script by embedding crisis communication and accountability into third-party relationships:
- We assume vendors will fail at some point.
- We require communications protocols, not just SLAs.
- We treat third-party disruptions as brand risks, not just ops risks.
Crisis Management Isn't Just for Disasters
One of the most dangerous misconceptions is that crisis management is only activated when there's a disaster, such as a cyberattack, natural catastrophe, or full-blown outage. But in the modern resilience environment, a 10-minute login issue at your SaaS provider could do more reputational damage than a day-long power outage.
Realistic Resilience advocates for micro-activation of crisis comms:
- Any issue that affects customer experience = activate the plan.
- Fast, plain-language updates trump silence and delay.
- Train teams to respond to perception, not just impact.
The difference between a crisis and an inconvenience is how you handle it.
What Clients/Customers Really Want: Trust, Not Perfection
Clients/customers are surprisingly forgiving, IF you're transparent.
- Tell them what's happening.
- Tell them what you're doing.
- Tell them when they’ll hear from you again.
They don’t care about your RTO. They care that you show up.
Realistic Resilience embraces this by aligning metrics with customer experience:
- Use uptime percentages, not recovery guesses
- Share real incident timelines, not idealized plans
- Replace "blame the vendor" responses with co-owned resolution strategies
Case Study: A Realistic Resilience Response in Action
A regional bank relying on a third-party SaaS provider experienced a partial service outage during peak hours. The vendor’s SLA technically allowed for up to 30 minutes of monthly downtime, and the system was restored in 22 minutes. But by the time the platform was live again, dozens of high-value clients had already submitted complaints.
Using the Realistic Resilience framework, the bank’s crisis team activated their communications protocol within the first 5 minutes after confirming the outage:
- A banner message was posted to the login screen acknowledging the issue.
- Clients were emailed within 15 minutes with a clear, jargon-free explanation.
- A follow-up message provided recovery confirmation and a brief postmortem within 2 hours.
Even though the SLA wasn’t violated, the team treated the event as a trust risk, not just a technical one. The result? Near-zero client churn, and several clients/customers responded with praise for the transparency.
Conclusion: Let’s Kill the Theater
If resilience planning continues to rely on fake deadlines, obscure metrics, and silence during actual disruptions, it will continue to fail. RTOs should not be front-line commitments. SLAs should not be escape hatches. Third-party risk should not be checkbox compliance.
At PAI Consulting, we don’t chase illusions. Realistic Resilience means planning for the messy, unpredictable, and very human reality of modern service delivery. And it means treating your clients/customers like people, not SLAs.
Because in the end, resilience isn’t about uptime; it’s about trust recovery.








