Scalability Patterns

Cross-cutting techniques that apply regardless of framework. The order below is roughly the order to apply them.

Scale up before scaling out

A bigger box is often cheaper than the complexity of distributing the system. Profile first: fixing a missing index or an N+1 query usually buys more than any amount of horizontal scaling.
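To make the N+1 problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The authors/posts schema and the function names are made up for illustration. The first function issues one query per author, the second gets the same data in a single JOIN.

```python
import sqlite3

# Hypothetical schema, in-memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    CREATE INDEX idx_posts_author ON posts(author_id);  -- the index the join relies on
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more per author (N+1 round trips)."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """Same data in a single round trip."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts still get an empty list
            result[name].append(title)
    return result, 1
```

With two authors the loop version runs three queries and the join runs one. The gap grows linearly with the number of parent rows, which is why this usually shows up in a profile long before hardware becomes the limit.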

Statelessness enables horizontal scaling

Keep app servers stateless (no in-process session/state); push state to Postgres/Redis. Then you can run N identical instances behind a load balancer and add more under load. This is the foundation everything else builds on.
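A rough sketch of what "stateless" means in practice. SessionStore is a hypothetical in-memory stand-in for Redis or Postgres, and AppInstance and handle_add_to_cart are illustrative names. The instances keep nothing per user, so any of them can serve any request.

```python
class SessionStore:
    """Stand-in for an external store such as Redis or Postgres (in-memory here)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class AppInstance:
    """Holds no per-user state of its own; everything lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_add_to_cart(self, session_id, item):
        cart = self.store.get(session_id, [])
        cart = cart + [item]  # build a new list rather than mutating shared state
        self.store.set(session_id, cart)
        return {"served_by": self.name, "cart": cart}
```

If a load balancer sends the first request to one instance and the second to another, the cart still contains both items, because neither instance owned it.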

Caching layers (cheapest big win)

CDN for static assets and cacheable responses — never hit your origin for those.
Application cache (Redis) for hot queries and computed values.
DB query cache / materialized views for expensive aggregations.

Layer them; each absorbs load before it reaches the next.
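The application-cache layer is typically a cache-aside lookup with a time-to-live. A minimal sketch, using an in-process dict where a real deployment would use Redis. TTLCache and get_or_compute are illustrative names, not a library API.

```python
import time


class TTLCache:
    """Cache-aside with expiry: return a fresh cached value, else compute and store."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: the layer behind us never sees this request
        value = compute()  # miss or expired: fall through to the next layer
        self._entries[key] = (now + self.ttl, value)
        return value
```

The TTL is the knob that trades freshness for load: a longer TTL means fewer calls reach the database, at the cost of serving older values.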

Read/write splitting

Most workloads are read-heavy. Route reads to Postgres replicas, writes to the primary (see the Postgres scalability note). Often the single biggest database scaling lever.
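A naive sketch of the routing decision. ReadWriteRouter is a hypothetical helper, and the connection targets are plain strings here. Classifying statements by their leading keyword is a simplifying assumption; real setups usually route at the ORM, driver, or proxy level.

```python
import itertools


class ReadWriteRouter:
    """Send plain SELECTs to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        statement = sql.lstrip().lower()
        # SELECT ... FOR UPDATE takes row locks, so it has to go to the primary.
        is_plain_read = statement.startswith("select") and "for update" not in statement
        if is_plain_read and self._replica_cycle is not None:
            return next(self._replica_cycle)
        return self.primary
```

For example, with a primary and two replicas, consecutive SELECTs alternate between the replicas while an UPDATE always lands on the primary. With no replicas configured, everything falls back to the primary.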

Async everything that can be

If the user doesn't need the result right now, do it in the background. Return fast, process later via a queue/worker/Temporal. Keeps request latency low and smooths spiky load.
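The "return fast, process later" shape can be sketched with the standard library alone. The in-process queue and thread stand in for a real broker and worker fleet, and handle_request, start_worker, and the uppercase "work" are placeholders.

```python
import queue
import threading

jobs = queue.Queue(maxsize=1000)  # bounded; put() blocks if the queue is full
results = {}


def handle_request(job_id, payload):
    """Request path: enqueue and acknowledge immediately, without doing the work."""
    jobs.put((job_id, payload))
    return {"status": 202, "job_id": job_id}


def worker():
    """Background path: drain the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, payload = job
        results[job_id] = payload.upper()  # placeholder for the slow work
        jobs.task_done()


def start_worker():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The caller gets a 202 and a job id right away and can poll or be notified later. The time spent in the worker no longer counts toward request latency.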

Back-pressure & load shedding

Bound your queues. An unbounded queue under sustained overload just converts a fast failure into a slow death (memory exhaustion, ever-growing latency).
Shed load deliberately: rate-limit, return 429, or degrade gracefully when at capacity. A fast, honest rejection beats a timeout for everyone.
Apply timeouts and circuit breakers so one slow dependency doesn't cascade into total failure.
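Two of the points above, sketched under simple assumptions: a bounded queue that rejects with 429 when full, and a small circuit breaker that fails fast after repeated errors. The names accept and CircuitBreaker and the thresholds are illustrative. Timeouts are left to whatever client library makes the downstream call.

```python
import queue
import time

work = queue.Queue(maxsize=2)  # deliberately tiny bound for the example


def accept(request):
    """Shed load instead of queueing without limit."""
    try:
        work.put_nowait(request)
        return 202
    except queue.Full:
        return 429  # fast, honest rejection


class CircuitBreaker:
    """Open after max_failures consecutive errors; allow a trial call after reset_after seconds."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let one trial call through; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is already struggling, which gives that dependency room to recover.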

Autoscaling

Scale workers on a leading signal (queue depth, request latency) rather than a lagging one (CPU). Set sane min/max bounds so a traffic spike doesn't bankrupt you and a lull doesn't drop you to zero capacity.
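As a rough illustration, the scaling decision reduces to a clamped division. desired_workers and its parameters are hypothetical; a real autoscaler would also smooth the signal and add cooldowns.

```python
import math


def desired_workers(queue_depth, jobs_per_worker, min_workers, max_workers):
    """Workers needed to drain the backlog, clamped to [min_workers, max_workers]."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With a floor of 2 and a ceiling of 20, an empty queue still keeps 2 workers warm, and a backlog of 10,000 jobs is capped at 20 workers rather than 1,000.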

Measure everything (the meta-rule)

You cannot scale what you can't see. Track the golden signals — latency, traffic, errors, saturation — per service. Set SLOs and alert on them. Every scaling decision should follow a measurement, not a hunch.
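A small sketch of turning raw samples into an SLO check for two of the signals, latency and errors. The function names, the nearest-rank percentile method, and the thresholds are assumptions for illustration. In practice a metrics system computes these.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(latencies_ms, statuses, p99_target_ms, max_error_rate):
    """Compare observed p99 latency and 5xx rate against the SLO targets."""
    p99 = percentile(latencies_ms, 99)
    error_rate = sum(1 for status in statuses if status >= 500) / len(statuses)
    return {
        "p99_ms": p99,
        "error_rate": error_rate,
        "latency_ok": p99 <= p99_target_ms,
        "errors_ok": error_rate <= max_error_rate,
    }
```

An alert would fire when either flag is false. The scaling decision then starts from the signal that breached, not from a guess.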

The progression, in order

Profile & fix queries → add indexes → cache → read replicas → async/queues → horizontal scale (stateless) → partition → and only then, shard or split into services.
