02 Real-time, high-throughput systems

Everything is fine until the load shows up.

A system under load behaves like a different system. Code that's fine with one user can fall over at a thousand, or the second a feed drops and every client tries to reconnect at once. Those are the failures I design around.
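The reconnect stampede is the easiest of those to illustrate. A minimal sketch, not production code, of capped exponential backoff with full jitter, so clients that all lose the feed at the same moment don't all come back at the same moment. The function name and the `base` and `cap` defaults are illustrative assumptions.

```python
import random


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng=random) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    `base` and `cap` are hypothetical defaults, chosen only for illustration.
    """
    # Exponential ceiling: base, 2*base, 4*base, ... capped at `cap`.
    # The exponent is clamped so very long outages cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: draw uniformly from [0, ceiling] so that clients which
    # disconnected together spread their retries across the whole window.
    return rng.uniform(0.0, ceiling)
```

Without the jitter, every client computes the same delay and the surge just arrives a little later.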

01

The load test broke the design

On a live-streaming platform, the first architecture was fine until I load-tested it. The test was simple: could it absorb a sudden surge of viewers without degrading playback for the people already watching? It couldn't. Fixing that meant rebuilding the ingestion and distribution pipeline halfway through the project.

Presence, viewer counts, and chat moved to Redis pub/sub. The player got adaptive bitrate for weak connections. I'd rather find a broken design in a load test than in the launch metrics.
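The core of adaptive bitrate is a small decision: pick the best rendition the connection can actually sustain. A minimal sketch of that decision, assuming a hypothetical four-step ladder and a hypothetical 0.8 safety factor; real players also weigh buffer level and switch history, which this leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int


# Hypothetical ladder; the names and bitrates are illustrative only.
LADDER = [
    Rendition("240p", 400),
    Rendition("480p", 1200),
    Rendition("720p", 2800),
    Rendition("1080p", 5000),
]


def pick_rendition(throughput_kbps: float, ladder=LADDER,
                   safety: float = 0.8) -> Rendition:
    """Return the highest rendition that fits within a safety margin."""
    # Spend only a fraction of the measured throughput, so a small dip
    # in bandwidth does not immediately stall playback.
    budget = throughput_kbps * safety
    ordered = sorted(ladder, key=lambda r: r.bitrate_kbps)
    best = ordered[0]  # always fall back to the lowest rung
    for rendition in ordered:
        if rendition.bitrate_kbps <= budget:
            best = rendition
    return best
```

With these assumed numbers, a viewer measuring 4000 kbps gets 720p rather than 1080p, and a viewer on a very weak link still gets the lowest rung instead of nothing.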

02

Reliability lived in the reconnect

On a forex platform, reliability came down to the price feed. The WebSocket layer had to survive the ugly moments: a dropped connection, frames arriving out of order, a client reconnecting while holding a price that's already gone stale.

A trader staring at a spinner just waits. A trader acting on a stale price loses money, and so does whoever took the other side of that trade.
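A minimal sketch of the two guards that follow from that, under stated assumptions: the server stamps each tick with a per-symbol sequence number and a send time, and client and server clocks are roughly in sync. `Tick`, `PriceBook`, and the two-second `max_age_s` are hypothetical names and values, not the platform's actual code.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Tick:
    symbol: str
    seq: int        # assumed: per-symbol, strictly increasing on the server
    price: float
    sent_at: float  # assumed: server send time, seconds since the epoch


class PriceBook:
    def __init__(self, max_age_s: float = 2.0):
        self.max_age_s = max_age_s
        self._latest: Dict[str, Tick] = {}

    def apply(self, tick: Tick) -> bool:
        """Store the tick unless it is a duplicate or arrived out of order."""
        current = self._latest.get(tick.symbol)
        if current is not None and tick.seq <= current.seq:
            return False  # older than what we already hold: drop it
        self._latest[tick.symbol] = tick
        return True

    def tradable_price(self, symbol: str,
                       now: Optional[float] = None) -> Optional[float]:
        """Return the latest price, or None if it is missing or stale."""
        now = time.time() if now is None else now
        tick = self._latest.get(symbol)
        if tick is None or now - tick.sent_at > self.max_age_s:
            return None  # better to show a spinner than a stale price
        return tick.price
```

Returning `None` is the point: the UI can show a spinner, and the order button stays disabled until a fresh tick arrives after the reconnect.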

03

GPS, battery, and a map that lies

A field-operations app for a security firm tracked guards in real time. Poll GPS constantly and the battery's dead before the shift ends. Poll it slowly and the supervisor's map shows everyone where they were ten minutes ago.

So the polling rate moved with the situation: fast while a guard was on the move or near a checkpoint, slow while they were parked.
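A minimal sketch of that rule. The thresholds and intervals are hypothetical placeholders, not the values the app shipped with, and `gps_poll_interval_s` is an illustrative name.

```python
# All four constants are assumptions chosen for illustration.
MOVING_MPS = 0.5          # above this speed the guard counts as moving
NEAR_CHECKPOINT_M = 50.0  # within this distance a checkpoint is "near"
FAST_S = 5.0              # polling interval while moving or near a checkpoint
SLOW_S = 120.0            # polling interval while parked


def gps_poll_interval_s(speed_mps: float, metres_to_checkpoint: float) -> float:
    """Return how long to wait before the next GPS fix, in seconds."""
    if speed_mps >= MOVING_MPS or metres_to_checkpoint <= NEAR_CHECKPOINT_M:
        return FAST_S
    return SLOW_S
```

A guard walking a patrol route gets the fast interval; the same guard sitting in a parked car far from any checkpoint drops to the slow one and saves the battery.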

In practice

Live-streaming platform

A simulated viewer surge broke the first architecture in load testing. Rebuilt the ingestion and distribution pipeline halfway through the project, moved presence, viewer counts, and chat onto Redis pub/sub, and added adaptive bitrate for weak connections.

Forex price feed

In live trading, reliability comes down to the price feed. Rebuilt the WebSocket layer around the failure cases: dropped connections, out-of-order frames, clients reconnecting with a stale price.

Field-operations app

Constant GPS polling flattens the battery. Poll it rarely and the supervisor's map goes stale. Tied the rate to each guard's movement: fast while moving, slow while still.

I don't trust a system until I've watched it fall over under load.