
Northwind Logistics · Logistics

Rebuilding logistics dispatch for an 18x scale jump

Cut dispatch time and absorbed an 18x volume increase without adding headcount.

Client: Northwind Logistics
Industry: Logistics
Duration: 9 months
Team size: 5 engineers + 1 designer
Outcome: 47% faster builds

The challenge

Northwind operates a regional logistics network across Northern Europe. Their dispatch platform — the system that decides which driver gets which delivery and when — was built in 2017 on a Rails monolith with a hand-rolled queue. By the time we engaged, it was processing 240,000 dispatch decisions a day, and the team was projecting 4M decisions a day within eighteen months as a major retail customer ramped up volume.
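For a sense of scale, the figures above can be turned into rough per-second rates. This is a back-of-the-envelope sketch only: it assumes decisions are spread evenly across 24 hours, which real dispatch traffic is not, so peak rates would be noticeably higher than these averages.

```typescript
// Back-of-the-envelope load figures from the numbers quoted above.
// Assumption: decisions are spread evenly over the day (real traffic is peakier).
const SECONDS_PER_DAY = 24 * 60 * 60;

function averagePerSecond(decisionsPerDay: number): number {
  return decisionsPerDay / SECONDS_PER_DAY;
}

const currentPerDay = 240000;
const projectedPerDay = 4000000;

const growthFactor = projectedPerDay / currentPerDay; // ≈ 16.7x
const currentRate = averagePerSecond(currentPerDay); // ≈ 2.8 decisions/s
const projectedRate = averagePerSecond(projectedPerDay); // ≈ 46.3 decisions/s

console.log(
  `growth ${growthFactor.toFixed(1)}x, ` +
    `now ${currentRate.toFixed(1)}/s, projected ${projectedRate.toFixed(1)}/s`
);
```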

The platform was already showing strain. Decisions that should have taken under a second were occasionally taking thirty. The team had fallen into the habit of restarting the dispatch service every few hours. There was no observable backpressure mechanism, so when load spiked, decisions piled up silently and started reaching drivers who'd already finished their shifts.
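To illustrate what an observable backpressure mechanism means here, the sketch below shows a bounded queue that tells producers when it is full and refuses to hand out decisions that have gone stale. It is not Northwind's code: the `PendingDecision` shape, the `expiresAtMs` field, and the class name are all hypothetical, chosen only to make the idea concrete.

```typescript
// Hypothetical shape of a queued dispatch decision; not Northwind's schema.
interface PendingDecision {
  deliveryId: string;
  driverId: string;
  expiresAtMs: number; // e.g. the end of the driver's shift
}

type OfferResult = "accepted" | "rejected_full";

class BoundedDecisionQueue {
  private items: PendingDecision[] = [];
  private expiredCount = 0;
  private readonly maxDepth: number;

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  // Producers get an explicit signal when the queue is full,
  // instead of the backlog growing silently.
  offer(decision: PendingDecision): OfferResult {
    if (this.items.length >= this.maxDepth) {
      return "rejected_full";
    }
    this.items.push(decision);
    return "accepted";
  }

  // Returns the oldest decision still valid at `nowMs`,
  // discarding (and counting) any that expired while waiting.
  poll(nowMs: number): PendingDecision | undefined {
    while (this.items.length > 0) {
      const next = this.items.shift()!;
      if (next.expiresAtMs > nowMs) {
        return next;
      }
      this.expiredCount += 1;
    }
    return undefined;
  }

  // Both values are cheap to export as metrics.
  get depth(): number {
    return this.items.length;
  }

  get expired(): number {
    return this.expiredCount;
  }
}
```

The two things the original queue lacked are both visible in this sketch: a rejection the caller can act on, and counters (`depth`, `expired`) that can be graphed and alerted on.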

The leadership team had a choice: scale the existing system horizontally — which would have required substantial work given its design — or rebuild the dispatch core. They asked us to assess and recommend.

Our approach

We spent the first three weeks doing what we always do: reading the code, watching the team work, and instrumenting the existing system to understand its actual behavior under real load. The instrumentation surprised everyone, including us. The dispatch service spent about 8% of its time computing dispatch decisions and 71% of its time waiting on a database lock that the team didn't know existed. Of the remaining 21%, half was logging.
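The kind of breakdown described above comes from attributing wall-clock time to named phases and comparing their shares. The sketch below is a simplified stand-in for that instrumentation, not what we ran at Northwind: the phase names and the `PhaseProfile` class are assumptions made for the example, and the sample values are simply the percentages quoted in this section.

```typescript
// Phase names are hypothetical labels chosen for this sketch.
const PHASES = ["compute", "lock_wait", "logging", "other"] as const;
type Phase = (typeof PHASES)[number];

class PhaseProfile {
  private totals: Record<Phase, number> = {
    compute: 0,
    lock_wait: 0,
    logging: 0,
    other: 0,
  };

  // Add `elapsedMs` of wall-clock time to a phase.
  record(phase: Phase, elapsedMs: number): void {
    this.totals[phase] += elapsedMs;
  }

  // Time a synchronous function and attribute its duration to `phase`.
  time<T>(phase: Phase, fn: () => T): T {
    const start = Date.now();
    try {
      return fn();
    } finally {
      this.record(phase, Date.now() - start);
    }
  }

  // Share of total recorded time per phase, as percentages.
  shares(): Record<Phase, number> {
    const total = PHASES.reduce((sum, p) => sum + this.totals[p], 0);
    const result = { compute: 0, lock_wait: 0, logging: 0, other: 0 };
    for (const p of PHASES) {
      result[p] = total === 0 ? 0 : (this.totals[p] / total) * 100;
    }
    return result;
  }
}

// The proportions quoted above, expressed per 100 ms of service time.
const profile = new PhaseProfile();
profile.record("compute", 8);
profile.record("lock_wait", 71);
profile.record("logging", 10.5); // half of the remaining 21%
profile.record("other", 10.5);

console.log(profile.shares());
```

In a real service, `time` would wrap the database call, the decision logic, and the logging calls separately. A tracing library would give the same picture with less hand-rolled code.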

That single observation reframed the project. We weren't going to rebuild dispatch — we were going to refactor the lock contention out of the existing system, and then assess whether a rebuild was needed.

Two months later, after that refactor shipped, the existing system was running at 14% load with the same volume. We told Northwind they didn't need a rebuild. They needed a small set of targeted changes to make the system observable, durably scalable, and maintainable by their internal team.

What we shipped

  • An event-sourced dispatch decision log on Kafka, replacing the hand-rolled queue. This let us replay decisions for debugging and auditing — something the operations team had wanted for two years.
  • A new partitioning scheme for the dispatch decisions table, sharded by region. The lock contention we'd eliminated could have come back at higher volumes; partitioning gave us headroom for the projected 4M/day load and beyond.
  • A complete observability layer with OpenTelemetry. Every dispatch decision is now traced end-to-end, with a single dashboard the operations team checks each morning.
  • Production-grade infrastructure as code on Terraform, with three environments (dev, staging, prod) built from the same definitions, so they match each other apart from environment-specific settings.
  • A runbook covering the eight most common production scenarios, written with the on-call team and tested via tabletop exercise before we left.
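The replay capability in the first item rests on one idea: state is derived by applying an ordered log of events, so any past state can be reconstructed by replaying the log up to that point. The sketch below shows that idea in memory only. The event shapes, the `seq` field, and the function names are hypothetical, and it does not use Kafka or reflect Northwind's actual schema.

```typescript
// Hypothetical event shapes for illustration; not Northwind's schema.
type DispatchEvent =
  | { kind: "assigned"; seq: number; deliveryId: string; driverId: string }
  | { kind: "reassigned"; seq: number; deliveryId: string; driverId: string }
  | { kind: "cancelled"; seq: number; deliveryId: string };

// Derived state: which driver currently holds each open delivery.
type Assignments = Record<string, string>;

// Pure function: applying one event to a state yields the next state.
function apply(state: Assignments, event: DispatchEvent): Assignments {
  const next = { ...state };
  switch (event.kind) {
    case "assigned":
    case "reassigned":
      next[event.deliveryId] = event.driverId;
      break;
    case "cancelled":
      delete next[event.deliveryId];
      break;
  }
  return next;
}

// Replay the log in sequence order, up to and including `upToSeq`.
function replay(log: DispatchEvent[], upToSeq: number): Assignments {
  return log
    .filter((e) => e.seq <= upToSeq) // filter copies, so the input is not mutated
    .sort((a, b) => a.seq - b.seq)
    .reduce(apply, {} as Assignments);
}

const log: DispatchEvent[] = [
  { kind: "assigned", seq: 1, deliveryId: "D-1", driverId: "driver-a" },
  { kind: "assigned", seq: 2, deliveryId: "D-2", driverId: "driver-b" },
  { kind: "reassigned", seq: 3, deliveryId: "D-1", driverId: "driver-c" },
  { kind: "cancelled", seq: 4, deliveryId: "D-2" },
];

console.log(replay(log, 2)); // state as it stood after the second event
console.log(replay(log, 4)); // state after the full log
```

Because `apply` is pure and the log is append-only, answering "why did this driver get this delivery?" becomes a replay to the relevant sequence number rather than an archaeology exercise across log files.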

What we deliberately didn't ship

We didn't rebuild the dispatch core. We didn't introduce a microservices architecture. We didn't add machine learning to the dispatch decision logic, even though it was tempting and the team asked.

The constraint we held: every piece of software we shipped had to be something Northwind's internal team could maintain after we left. If we couldn't sit with their senior engineer and have her fully understand it in two hours, we didn't ship it.

The outcome

Nine months in, with us off the project for three months, Northwind's dispatch platform handles 18x the peak load it did when we started. Their on-call rotation has had three pages in the last quarter, all for unrelated network incidents. The internal team has shipped four substantial features without our involvement.

Build times for the platform are 47% faster end-to-end, mostly because we eliminated a build-time test step that was instrumenting the wrong thing. Cost per dispatch decision is down 62% — almost all of which came from removing logging that nobody read.

The most useful number isn't on this page. It's the number of times the team has called us since we left: zero.

Services

Custom Software Development · Cloud & DevOps

Technologies

TypeScript · Go · PostgreSQL · Kafka · AWS · Terraform

"Soft Routes did in nine weeks what our previous vendor couldn't ship in nine months. They wrote less code, asked sharper questions, and left us with a system our team can actually maintain."

Marta Eriksson · VP Engineering, Northwind Logistics
