Operations · 5 min read

What 'production-ready' actually means

Production-ready isn't a checklist your team produces. It's a question someone else can answer in fifteen minutes.

When an engineer says "this is production-ready," what do they mean?

In most teams, it means: "I built it, it works on my machine, the tests pass, and I think it's good." That's a useful baseline. It's not what production-ready means.

Production-ready means a different person — one who didn't build the system — can pick up an alert at 3am, find the runbook, follow it, and either fix the problem or escalate it correctly. Everything that supports that one sentence is the work of production-readiness. Everything that doesn't, isn't.

The fifteen-minute test

The clearest test we use is the fifteen-minute test. We pull an engineer who has never seen the system before. We give them a fictional alert: "this service's error rate is 3x normal, started fifteen minutes ago." We give them access. We give them fifteen minutes.

If at the end of fifteen minutes, that engineer can answer:

  1. What does this service do, in two sentences?
  2. Where do I see what's currently wrong?
  3. What are the three most likely causes, given the symptoms?
  4. What's the one thing I should not do without escalating?

…then the system is production-ready. If not, it isn't.

This test cuts past most checkbox approaches. You can have 95% test coverage and fail the fifteen-minute test. You can have a beautifully maintained Notion page of architecture diagrams and still fail. The test is grounded in a specific, realistic moment — and either the system supports that moment or it doesn't.

The minimum viable runbook

Most "runbooks" are aspirational documents written when the system was new and never updated. The minimum viable runbook is shorter than that and more useful. It contains:

Top of page: what this service does. Two sentences. Not a paragraph. Not an architecture diagram. Two sentences a sleepy engineer can absorb in five seconds.

Health. A direct link to the dashboard that answers "is this service okay?" Not three dashboards. One. The one with the answer.

The five most common alerts. For each alert: what it means, the three most likely causes ranked by frequency, and the diagnostic command or query to confirm each cause. Not the full set of possible alerts. The five that fire most often.

What not to do. Things that will make a small problem worse. For example: don't restart the leader without a quorum check. Don't run the migration in a hotfix. Don't roll back beyond version 4.7.2.

Escalation. Names, in order. Not "the platform team." Names. Fall back to the platform team's on-call rotation only if none of the named people can be reached.

That's it. Five sections. Two pages. Editable. If you can't fit your runbook into five sections, your service is doing too many things and your runbook should be five separate runbooks.
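
One way to keep a runbook honest is to treat the five sections as data and lint them. What follows is a minimal sketch, not a tool we are describing: the `Runbook` and `Alert` classes, their field names, and the `lint_runbook` function are all hypothetical, and the sentence count is a rough heuristic.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    name: str
    meaning: str
    likely_causes: List[str]   # ranked by frequency, most common first
    diagnostics: List[str]     # one command or query per cause, same order


@dataclass
class Runbook:
    summary: str               # what the service does, in two sentences
    health_dashboard: str      # the one dashboard link
    alerts: List[Alert]        # the alerts that fire most often, five at most
    do_not: List[str]          # actions that make a small problem worse
    escalation: List[str]      # named people, in order


def lint_runbook(rb: Runbook) -> List[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []

    # Heuristic: a sentence ends with punctuation followed by whitespace
    # or the end of the string, so "4.7.2" is not split.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", rb.summary) if s.strip()]
    if not sentences:
        problems.append("missing summary")
    elif len(sentences) > 2:
        problems.append("summary is longer than two sentences")

    if not rb.health_dashboard.strip():
        problems.append("missing health dashboard link")

    if not 1 <= len(rb.alerts) <= 5:
        problems.append("list between one and five alerts")
    for alert in rb.alerts:
        if len(alert.likely_causes) != 3:
            problems.append(f"{alert.name}: list exactly three likely causes")
        if len(alert.diagnostics) != len(alert.likely_causes):
            problems.append(f"{alert.name}: one diagnostic per cause")

    if not rb.do_not:
        problems.append("missing 'what not to do' section")
    if not rb.escalation:
        problems.append("missing named escalation contacts")
    return problems
```

A check like this could run in CI against whatever format the runbook is stored in. It can't tell you whether the content is true, only whether the shape is right.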

The observability minimum

Production-ready doesn't mean perfect observability. It means the floor: the minimum below which you can't responsibly diagnose problems.

The floor we use:

  • Every external request is traced. You can find any request in a tracing tool given a request ID, and see the full chain of internal calls.
  • Every error is logged with a structured payload, including a correlation ID that ties to the trace.
  • Every meaningful metric is on a dashboard. "Meaningful" means: would I check this in an incident? If yes, dashboard. If no, leave it off the dashboard.
  • The alert links to the dashboard. The alert tells you which dashboard to look at, not "go find it."

This isn't an exhaustive observability strategy. It's the floor. If you don't have these four things, you don't have production observability — you have a hope.
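
To make the second item on that floor concrete, here is a minimal sketch of a structured error log that carries a correlation ID, using only Python's standard `logging` and `json` modules. The field names (`correlation_id`, `context`), the helper names, and the example service and request ID are assumptions for illustration; your tracing tool decides what the real ID looks like.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Assumed to be the same ID the tracing tool uses for the request.
            "correlation_id": getattr(record, "correlation_id", None),
            "context": getattr(record, "context", {}),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_error(logger: logging.Logger, message: str, correlation_id: str, **context) -> None:
    """Log an error with the correlation ID and any extra structured fields."""
    logger.error(message, extra={"correlation_id": correlation_id, "context": context})


# Usage: write one error to an in-memory buffer and read it back.
buffer = io.StringIO()
logger = build_logger("checkout", buffer)
log_error(logger, "payment provider timeout", correlation_id="req-123", upstream="payments")
entry = json.loads(buffer.getvalue())
```

The point of the structure is the join: an engineer who has the log line can paste `correlation_id` into the tracing tool and land on the request, without guessing at timestamps.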

The deployment minimum

You can deploy. That's the minimum. You can deploy quickly, with one command, with a clear path to roll back, and with the deployment itself producing a trace that someone can follow if something goes wrong.

The corollary: you have actually deployed recently. A deployment process that works on paper but hasn't run in production in six weeks is not a working deployment process. It's a hope. Every two weeks of "no deploys" is one more hidden bug in the deploy path that someone is going to find at the worst possible time.
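
The "deployed recently" condition is easy to check mechanically. A minimal sketch, assuming you can get the timestamp of the last production deploy from somewhere; the two-week threshold and the function names are illustrative choices, not a rule.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: two weeks is the threshold; pick whatever fits your release cadence.
MAX_DEPLOY_AGE = timedelta(weeks=2)


def deploy_age(last_deploy: datetime, now: Optional[datetime] = None) -> timedelta:
    """Time elapsed since the last production deploy."""
    now = now or datetime.now(timezone.utc)
    return now - last_deploy


def deploy_is_stale(
    last_deploy: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_DEPLOY_AGE,
) -> bool:
    """True when the deploy path has gone unexercised for longer than max_age."""
    return deploy_age(last_deploy, now) > max_age


# Usage: a service last deployed six weeks before the check is flagged.
last = datetime(2024, 1, 1, tzinfo=timezone.utc)
checked_at = datetime(2024, 2, 12, tzinfo=timezone.utc)
stale = deploy_is_stale(last, now=checked_at)
```

Wired to a dashboard or a weekly report, a check like this turns "we should deploy more often" into a number someone has to look at.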

The team minimum

Production-ready services have at least two engineers who can respond to incidents on them. Not "the team supports it" — two named people who could be paged tonight and could fix the most common problems without phoning a friend.

If only one person can respond to a service, the service has a single point of failure that isn't on any architecture diagram: a human one.
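
The same check works as a few lines of code over the on-call roster. A minimal sketch, assuming the roster is available as a mapping from service name to named responders; the function name, services, and people here are made up.

```python
from typing import Dict, List


def single_points_of_failure(
    responders: Dict[str, List[str]], minimum: int = 2
) -> List[str]:
    """Return the services with fewer than `minimum` distinct named responders."""
    return sorted(
        service
        for service, people in responders.items()
        if len(set(people)) < minimum
    )


# Usage: hypothetical roster with one healthy service and two at risk.
rota = {
    "checkout": ["Ana", "Bo"],
    "search": ["Ana"],
    "billing": [],
}
flagged = single_points_of_failure(rota)
```

Here `flagged` contains `billing` and `search`. The list says nothing about whether the named people can actually fix the common problems; that part still takes a drill.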

The point

Production-ready isn't a feature you add. It's a property of the human and technical system around the code. The runbook, the dashboards, the deploy process, the on-call rotation — these are the production-ready features. Without them, your code is a science experiment that happens to be running on a server people use.

The next time someone on your team says "this is production-ready," ask them to walk you through the fifteen-minute test. Watch what happens.

Author

Diego Reyes

Head of Engineering
