Most teams I talk to say they “have SLOs.”
Very few can tell me which one would page them right now if user requests were actively failing.
That gap is the problem.
SLOs are supposed to connect reliability to real user impact. In practice, they often become dashboards no one trusts, alerts no one responds to, and metrics leadership can’t act on. The result is surprise incidents, alert fatigue, and a growing sense that observability is expensive but ineffective.
The issue usually isn’t effort or tooling. It’s how SLOs break down once theory meets real systems.
The most common ways SLOs fail
1. SLIs measure systems, not users
Many SLOs are built on metrics that are easy to collect rather than signals that reflect user experience.
Examples:
- Pod availability instead of request success
- CPU or memory health instead of latency or error rates
- Internal queue depth instead of end-to-end behavior
These metrics can look healthy while users are actively impacted. When incidents happen, teams discover that the SLO never represented what customers actually felt.
If your SLO doesn’t answer “are users able to do what they came here to do?”, it won’t help during an outage.
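The difference between a system metric and a user-centric SLI can be made concrete with a minimal sketch. The function name and the numbers below are illustrative, assuming a request-based service where success and failure counts are available per measurement window:

```python
# Sketch: a user-centric SLI is the ratio of successful requests to total
# requests in a window, not a proxy like pod count or CPU health.
# Names and numbers here are illustrative, not from any particular stack.

def request_success_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally treated as meeting the objective
    return (total_requests - failed_requests) / total_requests

# Every pod can report "healthy" while a third of requests fail.
# The user-centric SLI catches what pod availability would miss.
print(request_success_sli(9_000, 3_000))  # ~0.67: users are clearly impacted
```

The same idea expressed against Prometheus would divide a success-counter rate by a total-counter rate; the point is that the numerator and denominator are requests, not infrastructure components.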
2. Objectives are chosen arbitrarily
“99.9%” shows up in more SLOs than almost any other number — not because it’s correct, but because it feels reasonable.
Objectives are often selected without considering:
- historical performance
- error budget consumption patterns
- business tolerance for failure
- recovery characteristics
When objectives aren’t grounded in reality, teams either page constantly or never page at all. Both outcomes erode trust in the system.
An SLO that no one believes in won’t change behavior.
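One way to ground an objective in historical performance rather than round numbers is to look at the worst recent window and leave some error-budget headroom below it. This is a sketch of that heuristic, not a standard formula; the function name, the headroom fraction, and the history values are all illustrative:

```python
# Sketch: derive an objective from history instead of defaulting to "three
# nines". Anchor it to the worst recent month, leaving some fraction of the
# error budget unspent even in that month. Heuristic and values are illustrative.

def grounded_objective(monthly_success_ratios: list[float],
                       headroom: float = 0.2) -> float:
    """Objective chosen so the worst recent month would still have
    `headroom` (a fraction of the budget) left over."""
    worst = min(monthly_success_ratios)
    worst_error_rate = 1 - worst
    # Size the budget so the worst observed month consumed (1 - headroom) of it.
    budget = worst_error_rate / (1 - headroom)
    return 1 - budget

history = [0.9992, 0.9987, 0.9995]  # last three months of success ratios
print(round(grounded_objective(history), 4))
```

With this history the result lands a little below 99.9%, which is the honest answer: promising 99.9% would mean the worst recent month already blew the budget, and the team would page on behavior it has no plan to change.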
3. Alerts are technically correct but operationally useless
Even when teams use burn-rate alerts, paging often fails because alerting wasn’t designed with on-call reality in mind.
Common problems:
- Alerts that fire during deploys but resolve themselves
- Alerts that trigger after customers are already complaining
- Pages that require deep investigation just to understand what broke
If an alert doesn’t clearly answer “what should I do right now?”, it becomes noise. Over time, teams stop responding with urgency — or at all.
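The first two problems above are exactly what multi-window burn-rate alerting is meant to fix: page only when both a long and a short window are burning budget fast, so a deploy blip that self-resolves never reaches a human. This is a minimal sketch of the pattern popularized by the Google SRE Workbook; the 14.4 threshold (which exhausts a 30-day budget in about two days) and the window pair are its suggested values, and the function names are illustrative:

```python
# Sketch: multi-window burn-rate paging. Page only when both a long (1h)
# and a short (5m) window burn the error budget fast. Threshold 14.4 and
# the window pair follow the Google SRE Workbook; names are illustrative.

def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'budget exactly spent at period end'
    we are burning. 1.0 is on pace; higher is worse."""
    return error_ratio / (1 - objective)

def should_page(err_1h: float, err_5m: float,
                objective: float = 0.999, threshold: float = 14.4) -> bool:
    # The 1h window filters out brief flapping (e.g. a deploy blip);
    # the 5m window confirms the problem is still happening right now.
    return (burn_rate(err_1h, objective) >= threshold
            and burn_rate(err_5m, objective) >= threshold)

print(should_page(err_1h=0.02, err_5m=0.03))  # sustained failure: True
print(should_page(err_1h=0.02, err_5m=0.0))   # already recovered: False
```

The short window is what makes the page actionable: if it has gone quiet, the system has recovered and nobody needs to be woken up to confirm it.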
4. Dashboards exist, but aren’t used during incidents
Many SLO dashboards look great in reviews but aren’t opened during outages.
Why?
- They’re too abstract
- They lack supporting signals
- They don’t answer immediate questions
During incidents, teams revert to ad-hoc metrics, logs, or tribal knowledge. The SLO becomes a post-incident artifact instead of a real-time decision tool.
That’s a sign the SLO was designed for reporting, not response.
Why smart teams still get this wrong
Most SLO failures aren’t caused by incompetence. They’re caused by incentives and defaults.
- Tooling makes it easy to measure the wrong things
- Examples focus on theory, not operational tradeoffs
- Leadership wants a reliability “number” instead of a reliability signal
- Alerting is designed without empathy for on-call engineers
Teams end up implementing SLOs because they feel like the right thing to do — but without aligning them to how incidents actually unfold.
What working SLOs look like in practice
Effective SLOs tend to share a few traits:
- SLIs reflect user-visible success or failure
- Objectives are tied to real-world tolerance, not round numbers
- Paging alerts are reserved for customer-impacting events
- Error budgets are discussed and acted on, not just tracked
- Dashboards are actively used during incidents
Most importantly, teams trust them.
They don’t eliminate incidents — they reduce surprise, confusion, and noise when incidents happen.
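"Discussed and acted on, not just tracked" implies a number a team can argue about in a planning meeting. A minimal sketch of that number, with an illustrative function name and illustrative traffic figures:

```python
# Sketch: fraction of the error budget spent so far in the current window.
# This is the figure that drives decisions ("60% spent with ten days left:
# freeze risky deploys"). Names and numbers are illustrative.

def budget_spent_fraction(failed: int, total: int,
                          objective: float = 0.999) -> float:
    """Share of the window's error budget consumed (1.0 = fully spent)."""
    allowed_failures = total * (1 - objective)
    if allowed_failures == 0:
        return 0.0  # no traffic yet, nothing spent
    return failed / allowed_failures

# 600 failures against a budget of ~1,000 allowed failures.
print(budget_spent_fraction(failed=600, total=1_000_000))  # ~0.6: 60% spent
```

A team that reviews this fraction weekly, and has agreed in advance what happens at 50% and 100%, is using the SLO as a decision tool rather than a report.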
Closing the gap
If your team has SLOs but still feels surprised by incidents or buried in alerts, the issue usually isn’t tooling — it’s alignment.
I help engineering teams running Kubernetes and Prometheus review their SLOs and alerting in short, fixed-scope engagements to identify what’s broken, what’s noise, and what actually matters.
If this sounds familiar, you can book a short exploratory call to see whether a focused SLO and alerting review would be useful.
