How Small Engineering Teams Handle On-Call and Incidents Without a Dedicated SRE
Site Reliability Engineering was born inside Google, where the practice was written down, refined across hundreds of services, and staffed at a scale that most companies will never reach. A single SRE team there can be larger than an entire startup. The ideas that came out of that environment — error budgets, blameless postmortems, structured on-call — are genuinely good, and they’ve spread far beyond the place that invented them. The trouble is that the staffing model came along for the ride, and the staffing model does not survive contact with a company that has eleven engineers.
Most software companies are not Google. They have five people, or ten, or twenty, and every one of those people is shipping product. They still need their systems to stay up, still get paged at two in the morning, still have to explain to a customer why the dashboard was blank for forty minutes. Reliability work doesn’t disappear because the org chart is small. It just lands on people who already have other jobs, and the interesting question is how the better-run small teams absorb that work without quietly destroying the people doing it.
What SRE Work Looks Like When Nobody Holds the Title
Start by being honest about what the work actually is. On a small team, on-call is a rotation among the same engineers who wrote the code that’s now misbehaving. There is no separate group whose entire purpose is to keep the lights on, so the person carrying the pager this week is also the person who was supposed to finish a feature by Friday. When something breaks, incident response interrupts whatever sprint work was in flight, and the cost of that interruption is real even when the incident itself is small.
Monitoring tends to have a similar history. Someone set it up once, usually during a stressful week, and then it sat there. Alerts fire, half of them are noise, and nobody clearly owns the job of tuning them. Postmortems happen when an outage is bad enough to demand one and get skipped when everyone is tired, which means the team learns from its worst failures and quietly forgets the medium-sized ones that were trying to teach the same lesson. None of this is negligence. It’s what reliability looks like when it’s nobody’s full-time responsibility and everybody’s part-time problem.
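One low-effort way to give alert tuning an owner is to measure noise per alert rule and look at the worst offenders first. The sketch below assumes a hypothetical export of fired alerts, each tagged with whether the responder actually had to do something. The rule names, the `actionable` field, and the 50% threshold are illustrative assumptions, not features of any particular monitoring tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FiredAlert:
    rule: str          # hypothetical rule name, e.g. "api-5xx-rate"
    actionable: bool   # did the responder have to do anything?


def noisy_rules(alerts, max_noise=0.5, min_fires=3):
    """Return {rule: noise_ratio} for rules that are mostly noise.

    A rule is reported when it fired at least `min_fires` times and more
    than `max_noise` of those fires needed no action. Both thresholds are
    assumed defaults to adjust per team.
    """
    fired = defaultdict(int)
    noise = defaultdict(int)
    for alert in alerts:
        fired[alert.rule] += 1
        if not alert.actionable:
            noise[alert.rule] += 1
    return {
        rule: noise[rule] / count
        for rule, count in fired.items()
        if count >= min_fires and noise[rule] / count > max_noise
    }


# Made-up history purely for illustration.
history = [
    FiredAlert("disk-80-percent", False),
    FiredAlert("disk-80-percent", False),
    FiredAlert("disk-80-percent", True),
    FiredAlert("api-5xx-rate", True),
    FiredAlert("api-5xx-rate", True),
    FiredAlert("api-5xx-rate", False),
]
print(noisy_rules(history))
```

Run against a month of alert history, a report like this turns "half of them are noise" into a short list of specific rules someone can fix or delete.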
The On-Call Rotation That Doesn’t Burn People Out
The first thing well-run small teams get right is the shape of the rotation. Google’s own guidance on being on-call makes a point that scales down cleanly: on-call should be sustainable, which means bounded in scope and balanced against the rest of someone’s work. A rotation that pages a single hero every night is not a rotation; it’s a slow resignation letter.
In practice, small teams usually settle on weekly handoffs shared across everyone who can meaningfully respond, often three to six people. A week is long enough to avoid constant churn and short enough that no one dreads it for a month. Distributed teams sometimes get to do something genuinely nice here — a follow-the-sun arrangement where the person on-call is awake during their own daytime, so the rotation barely touches anyone’s sleep. What matters more than the exact cadence is the discipline around it. A clear escalation policy tells the primary responder who to wake up when they’re stuck, so being on-call doesn’t mean being alone with a problem you can’t solve. And limiting the scope of what actually pages a human — reserving the interruption for things that are both urgent and real — is what lets people rest during the weeks they aren’t holding the pager, and even during the weeks they are.
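The rotation, escalation, and paging rules described above are simple enough to sketch in a few lines. This is a minimal illustration, not a replacement for an on-call tool: the roster, the Monday handoff date, and the 15-minute escalation delay are all assumptions, and treating next week's primary as this week's secondary is just one common convention.

```python
from datetime import date

ENGINEERS = ["ana", "ben", "chen", "dee"]   # hypothetical roster
ROTATION_START = date(2024, 1, 1)           # a Monday; assumed handoff day
ESCALATION_DELAYS_MIN = [0, 15]             # assumed: page secondary after 15 min unacknowledged


def on_call(day, roster=ENGINEERS, start=ROTATION_START):
    """Return (primary, secondary) for the week containing `day`."""
    weeks_elapsed = (day - start).days // 7
    primary = roster[weeks_elapsed % len(roster)]
    secondary = roster[(weeks_elapsed + 1) % len(roster)]
    return primary, secondary


def paged_so_far(day, minutes_unacked):
    """Return everyone who should have been paged by now for an open alert."""
    chain = on_call(day)
    return [
        person
        for person, delay in zip(chain, ESCALATION_DELAYS_MIN)
        if minutes_unacked >= delay
    ]


def should_page(urgent, confirmed):
    """Interrupt a human only when the problem is both urgent and real."""
    return urgent and confirmed


print(on_call(date(2024, 1, 10)))
print(paged_so_far(date(2024, 1, 10), minutes_unacked=20))
```

Anything that fails `should_page` can still be recorded, but it goes to a ticket or a channel rather than to someone's phone.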
How Modern Tooling Collapsed the SRE Stack
The reason any of this is feasible without a dedicated role is that the tooling got dramatically better. Five or six years ago, assembling a credible reliability setup meant standing up Nagios for monitoring, wiring it to PagerDuty for alerts, running StatusPage for external communication, building a dashboard or two by hand, and keeping runbooks in a wiki that drifted out of date the moment they were written. Each piece was a project. Keeping them integrated and current was, more or less, a job.
That stack has collapsed into a handful of services that already know how to talk to each other. A typical small-team stack might use DevHelm for uptime monitoring and status pages, PagerDuty or Opsgenie for on-call routing, and Linear or Notion for postmortem tracking: three services doing work that used to take a full-time job to maintain. The work of gluing them together, which used to consume the time you didn’t have, is mostly done before you arrive. A monitor that detects an outage can route a page and update a status page without anyone manually flipping switches, and the engineer who would have been the integration glue gets to keep writing product instead.
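To make the "detect, page, update" flow concrete, here is a rough sketch of the logic such an integration performs. It does not call any real vendor API: `send_page` and `post_status` are hypothetical stand-ins for whatever the paging and status page services expose, and the three-consecutive-failures threshold is an assumption added to avoid flapping.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IncidentPipeline:
    """Fan a monitor's state change out to paging and the status page."""
    send_page: Callable[[str], None]     # stand-in for an on-call router
    post_status: Callable[[str], None]   # stand-in for a status page update
    failures_to_open: int = 3            # assumed: consecutive failed checks before acting
    _failures: int = field(default=0, init=False)
    _open: bool = field(default=False, init=False)

    def record_check(self, service: str, healthy: bool) -> None:
        if healthy:
            self._failures = 0
            if self._open:
                # Recovery closes the incident and tells users so.
                self._open = False
                self.post_status(f"{service}: resolved")
            return
        self._failures += 1
        if self._failures >= self.failures_to_open and not self._open:
            # Open exactly one incident: one page, one public update.
            self._open = True
            self.send_page(f"{service} is failing health checks")
            self.post_status(f"{service}: investigating")


# Lists stand in for the real services so the flow can be inspected.
pages: List[str] = []
statuses: List[str] = []
pipeline = IncidentPipeline(send_page=pages.append, post_status=statuses.append)
for healthy in [True, False, False, False, False, True]:
    pipeline.record_check("dashboard", healthy)
print(pages)
print(statuses)
```

The point of the sketch is the shape: one state machine decides that an incident exists, and both the page and the public update hang off that single decision, so nobody has to remember to do either by hand.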
Telling People What’s Happening
The part teams underestimate most is communication, and it’s the part that quietly decides how much trust survives an outage. When something is down, silence is read as either incompetence or indifference, and it erodes confidence faster than the downtime itself. Users will refresh, check their own connection, ask each other whether it’s just them, and in the absence of anything official they’ll write the story for you in support tickets and angry posts. A status page that updates the moment monitoring detects a problem replaces all of that guessing with a single place to look.
The deeper win is that automating this removes a task from a person at the exact moment they have no spare attention. During a live incident, the responder is debugging, and asking that same person to also draft customer updates and ping stakeholders is asking them to context-switch in the middle of a fire. PagerDuty’s widely used incident response guide is built around exactly this idea of clear roles and predictable communication, treating updates and postmortems as part of the response rather than an afterthought. When the status update is driven by the monitoring that already noticed the problem, the team communicates well precisely when it would otherwise communicate worst.
When You Actually Need to Hire an SRE
The part-time model has a ceiling, and the teams that handle reliability well also know how to recognize when they’ve hit it. The signals are not subtle once you look for them. Incidents that used to be occasional become weekly. The same class of failure keeps recurring because nobody has had a clear stretch of time to fix the root cause. Toil — the repetitive manual work of keeping things running — starts eating a meaningful fraction of the engineering week, and reliability debt compounds the way technical debt does, quietly, until it’s the only thing anyone talks about.
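These signals are easy to track if incidents are logged with even minimal structure. The following sketch assumes a hypothetical incident log with a date and a free-form failure class, plus a rough estimate of hours spent on toil. The field names and sample numbers are invented for illustration, and it deliberately leaves the thresholds to the team.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    day: date
    failure_class: str   # hypothetical label, e.g. "db-connection-pool"


def reliability_signals(incidents, weeks, toil_hours, team_hours):
    """Summarize incident rate, recurring failure classes, and toil share."""
    counts = Counter(incident.failure_class for incident in incidents)
    return {
        "incidents_per_week": len(incidents) / weeks,
        # Classes seen more than once suggest an unfixed root cause.
        "recurring_classes": {c: n for c, n in counts.items() if n >= 2},
        "toil_fraction": toil_hours / team_hours,
    }


# Made-up four-week window for a small team.
log = [
    Incident(date(2024, 3, 4), "db-connection-pool"),
    Incident(date(2024, 3, 11), "db-connection-pool"),
    Incident(date(2024, 3, 13), "expired-certificate"),
    Incident(date(2024, 3, 19), "db-connection-pool"),
    Incident(date(2024, 3, 27), "bad-deploy"),
]
signals = reliability_signals(log, weeks=4, toil_hours=60, team_hours=400)
print(signals)
```

Tracked month over month, numbers like these are what turn "it feels like we're always firefighting" into the measured need that justifies a dedicated owner.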
When that happens, the rational move is to make reliability someone’s actual job rather than everyone’s spare-time burden. Hiring a dedicated SRE, or even just naming an existing engineer as the owner of reliability and protecting their time, is what lets the team get ahead of the problem instead of perpetually reacting to it. The mistake is treating that hire as the starting point rather than the response to a real and measured need.
SRE Is a Set of Practices, Not a Headcount
The useful reframing is that SRE was always a discipline before it was a department. The on-call rotations, the blameless postmortems, the bias toward measuring reliability instead of guessing at it — those practices don’t require a particular number of people to be worth adopting. A small team can take the parts that fit, lean on tooling for the parts that used to demand a specialist, and reach a level of reliability that genuinely would have needed a dedicated team only a few years ago. The title is optional. The habits are not.
