When you go from 8 engineers to 150 in 18 months, on-call breaks. Here's what we did.
The old way
Round-robin across the whole team. One engineer per week. It worked at 8. It collapsed at 40. People burned out, alerts got ignored, and our MTTR doubled.
The new way
Tiered response with AI as the first responder. Every page hits an AI runbook executor first. It correlates signals, runs diagnostics, and either auto-resolves (60% of cases) or pages a human with full context attached.
TagsEngineeringReliabilityCulture