Google Research Paper

The Tail at Scale

How rare latency hiccups become dominant performance killers in large distributed systems — and the ingenious techniques Google uses to tame them.

Jeff Dean & Luiz André Barroso (Google) · Communications of the ACM, 2013

Just as fault-tolerant computing creates a reliable whole out of less-reliable parts, tail-tolerant systems create a predictably responsive whole out of less-predictable parts. You can't eliminate all variability — so you design around it.

Why the Tail Matters at Scale

Modern web services respond to a single user request by fanning it out across hundreds or thousands of servers in parallel. Every server must respond before the user gets a result. Even if each individual server is fast 99% of the time, the math works brutally against you at scale — the slowest server dictates your response time.

The Scale Amplification Effect

If each server has a 1% chance of being slow (>1 second), what happens when you fan out to many servers?

  • 1 server: 1%
  • 10 servers: ~10%
  • 100 servers: 63%
  • 2,000 servers: ~100%

Probability of at least one server exceeding 1 second, given a per-server 99th-percentile latency of 1s.
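The percentages above follow from a one-line calculation: if each server independently has probability p of being slow, the chance that at least one of n servers is slow is 1 - (1 - p)^n. A minimal sketch:

```python
# Probability that at least one of n servers is slow, given a per-server
# probability p of exceeding the latency threshold.
def p_at_least_one_slow(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# Reproduce the fan-out table above with p = 1%.
for n in (1, 10, 100, 2000):
    print(f"{n:>5} servers: {p_at_least_one_slow(0.01, n):.1%}")
```

The same formula gives the "rare events become common" figure below: even with a 0.01% outlier rate, 2,000 servers yield roughly an 18% chance of hitting at least one slow server.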

From one real Google service:

  • 10ms: 99th-percentile latency for a single leaf request
  • 140ms: 99th-percentile latency when waiting for ALL leaf servers to respond
  • 50%: share of the total tail latency caused by the slowest 5% of requests alone

Three Foundational Observations

Rare events become common at scale

Even a 0.01% outlier rate means that in a system with 2,000 servers, nearly 1 in 5 user requests hits at least one slow server. The tail is no longer "rare" — it's the dominant experience.

Eliminating all variability is impractical

Shared resources, background daemons, garbage collection, power management, thermal throttling, network contention — there are too many sources of jitter to stamp out completely, especially in shared multi-tenant environments.

Fault-tolerance patterns apply to latency

Instead of preventing every possible latency spike, design systems that produce predictable end-to-end performance from unpredictable components — "tail-tolerant" software, analogous to fault-tolerant hardware.

Why Does Latency Variability Exist?

The paper catalogs the many sources of jitter that contribute to tail latency, spanning both software and hardware layers.

Shared Resources (software)

CPU cores, caches, memory bandwidth, and network bandwidth are contested by co-located applications and concurrent requests.

Background Daemons (software)

Background processes use few resources on average, but when they're scheduled they can cause multi-millisecond hiccups.

Maintenance Activities (software)

Distributed file system reconstruction, BigTable log compaction, and language-level garbage collection trigger periodic latency spikes.

Queueing Amplification (software)

Multiple layers of queues in servers and network switches each add their own variability, so delays compound hop by hop.

Power Limits & Throttling (hardware)

CPUs boost above average power temporarily, then thermally throttle — introducing unpredictable slowdowns under sustained load.

SSD Garbage Collection (hardware)

SSDs offer fast random reads, but periodic block garbage collection can inflate read latency by 100x, even with light write activity.

Energy Management (hardware)

Power-saving modes save energy but add latency when devices wake from inactive to active states, introducing cold-start penalties.

Global Resource Sharing (software)

Applications on different machines contend for shared global resources like network switches and distributed file systems.

Reducing Component Variability

Before applying tail-tolerant techniques, careful engineering can reduce the base level of variability. These are necessary but not sufficient at scale.

Differentiated Service Classes

Prefer scheduling interactive user-facing requests over batch operations. Keep low-level OS disk queues shallow so higher-level priority policies take effect quickly. Google's storage servers maintain their own priority queues rather than relying on deep OS-level queues.

Reducing Head-of-Line Blocking

Break expensive, long-running requests into sequences of smaller operations. This allows cheap, fast queries to interleave and avoid being stuck behind a single heavy computation. Google Search uses time-slicing to prevent a few expensive queries from delaying thousands of cheap ones.

Synchronized Disruption

Rather than letting background tasks (log compaction, GC) run randomly across machines, synchronize them into brief coordinated bursts. This confines the latency impact to one short window instead of having a few slow machines at all times, constantly pushing out the tail for every request.

Note on caching: While caching is essential for performance, it does not directly address tail latency unless the entire working set fits in cache. A cache miss on the critical path is still subject to all the variability above.

Tail-Tolerant Techniques

Since eliminating all variability is impossible, Google developed techniques that mask or work around latency problems rather than trying to prevent them. These fall into two categories based on their response timescale.

  • Short-Term: within-request, reacting within tens of milliseconds
  • Long-Term: cross-request, operating over tens of seconds to minutes
  • IR-Specific: techniques particular to information retrieval systems

Within-Request Techniques

These operate at the millisecond scale, reacting immediately before longer-term systems can kick in. They leverage the fact that most services already have data replicated for fault tolerance.

Short-Term

Hedged Requests

The simplest and most elegant technique. Instead of waiting for a single slow server, send the same request to multiple replicas and use whichever responds first.

1

Send the request to the primary replica believed to be most appropriate.

2

If no response arrives within the 95th-percentile expected latency, send a duplicate "hedge" to a second replica.

3

Use whichever response comes back first. Cancel the other request.

Real result from Google: In a benchmark reading 1,000 BigTable keys across 100 servers, hedging after just a 10ms delay reduced 99.9th-percentile latency from 1,800ms to 74ms — a 24x improvement — while adding only 2% extra requests.
99.9th percentile: 1,800ms → 74ms with only 2% overhead
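The three steps above can be sketched in a few lines. This is a toy client-side sketch, not Google's implementation: `query_replica` is a hypothetical stand-in for a real RPC, and the hedge delay would in practice be the observed 95th-percentile latency.

```python
import concurrent.futures as cf

def hedged_request(query_replica, replicas, hedge_delay_s):
    """Query the primary replica; if it has not answered within the hedge
    delay, fire a duplicate at a second replica and take the first result."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(query_replica, replicas[0])
        try:
            # Step 1 + 2: wait up to the hedge delay for the primary.
            return primary.result(timeout=hedge_delay_s)
        except cf.TimeoutError:
            # Step 2: primary is slow; send the hedge to a second replica.
            hedge = pool.submit(query_replica, replicas[1])
            # Step 3: take whichever answers first (a real system would
            # also cancel the losing request server-side).
            done, _ = cf.wait([primary, hedge], return_when=cf.FIRST_COMPLETED)
            return done.pop().result()
```

Because the hedge fires only after the 95th-percentile delay, at most ~5% of requests are duplicated, which is why the measured overhead stays small.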
Short-Term

Tied Requests

A more sophisticated evolution of hedged requests. The key insight: most latency variability comes from queueing delays before execution, not from execution itself. Tied requests attack queueing directly.

1

Send copies of the request to two servers simultaneously, each tagged with the identity of the other ("tied").

2

When either server begins executing the request, it immediately sends a cancellation to its tied counterpart.

3

The counterpart, if still queued, is immediately aborted or deprioritized. A tiny stagger (~1ms) between sends prevents both starting simultaneously when queues are empty.

Real result from Google: In BigTable reads with 3 replicas, tied requests reduced median latency by 16% and 99.9th-percentile latency by nearly 40%. Even with a large concurrent sorting job contending for the same disks, the latency profile with tied requests on the busy cluster was nearly identical to that of an idle cluster without them.
Enables workload consolidation — dramatic cost reduction with <1% disk overhead
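A toy model of the cancellation mechanism, with in-memory queues standing in for per-server request queues (the `Server` and `send_tied` names are illustrative, not from the paper):

```python
import collections

class Server:
    """A server with a request queue that honors tied-request cancellations."""
    def __init__(self, name):
        self.name = name
        self.queue = collections.deque()

    def enqueue(self, request):
        self.queue.append(request)

    def cancel(self, request):
        if request in self.queue:       # twin still queued: abort it
            self.queue.remove(request)

    def run_next(self):
        request = self.queue.popleft()  # begin executing
        peer, twin = request["tied_to"]
        peer.cancel(twin)               # cancellation message to the twin
        return request["payload"]

def send_tied(payload, a, b):
    """Enqueue copies on two servers, each tagged with the other's copy.
    In practice the second send is staggered by ~1ms so both copies don't
    start simultaneously when queues are empty."""
    ra, rb = {"payload": payload}, {"payload": payload}
    ra["tied_to"], rb["tied_to"] = (b, rb), (a, ra)
    a.enqueue(ra)
    b.enqueue(rb)
```

Whichever server dequeues its copy first wins; the twin is removed before it wastes any execution, which is why tied requests cost so little extra work.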

Cross-Request Techniques

These operate over seconds to minutes, addressing coarser-grain phenomena like load imbalance and machine-level performance degradation.

Long-Term

Micro-Partitions

Instead of assigning one partition per machine (making rebalancing slow and coarse), create many more partitions than machines. Google typically uses ~20 partitions per machine. This enables:

Fine-grained load shedding: Move a single micro-partition (~5% of load) instead of all-or-nothing.

20x faster rebalancing: Shifting one micro-partition takes 1/20th the time of moving a full partition.

Faster failure recovery: When a machine dies, many machines each pick up one small unit of work in parallel.
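The load-shedding granularity can be seen in a small sketch (the `shed_load` helper is hypothetical): with ~20 micro-partitions per machine, moving the single busiest one shifts roughly 5% of that machine's load.

```python
def shed_load(assignment, hot, cold):
    """Move the busiest micro-partition from the hot machine to the cold one.
    `assignment` maps machine name -> list of {"id", "load"} partitions."""
    part = max(assignment[hot], key=lambda p: p["load"])
    assignment[hot].remove(part)
    assignment[cold].append(part)
    return part

# With 20 micro-partitions per machine, one move sheds ~1/20th of the load,
# instead of the all-or-nothing move forced by one-partition-per-machine.
machines = {"hot": [{"id": i, "load": 1.0 + i * 0.01} for i in range(20)],
            "cold": []}
moved = shed_load(machines, "hot", "cold")
```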

Long-Term

Selective Replication

Build on micro-partitioning by detecting or predicting hot items and proactively creating extra replicas. Load balancers then spread requests for popular data across more machines without physically moving partitions.

Google Search creates additional copies of popular and important documents across multiple micro-partitions. The system even adjusts replication by language, following query language mix changes throughout the day and handling sudden shifts (e.g., an Asian data center outage routing traffic to North America).
Long-Term

Latency-Induced Probation

Continuously monitor each server's latency distribution. When a server becomes abnormally slow, temporarily exclude it from serving ("probation") while continuing to send shadow requests to track its recovery.

The paradox: removing serving capacity during high load actually improves latency. The slowness is usually temporary (network interference, CPU spikes from co-located jobs) and resolves itself. The system reincorporates the server automatically when metrics normalize.
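A minimal sketch of the monitoring loop, assuming a sliding window of latency samples per server and a simple median threshold (the class and method names are illustrative; a production system would use a more robust statistic):

```python
import collections
import statistics

class ProbationTracker:
    """Track recent latencies per server; exclude abnormally slow servers
    from live traffic while shadow requests keep their stats fresh."""
    def __init__(self, threshold_ms, window=100):
        self.threshold_ms = threshold_ms
        self.samples = collections.defaultdict(
            lambda: collections.deque(maxlen=window))

    def record(self, server, latency_ms):
        # Called for both live and shadow requests.
        self.samples[server].append(latency_ms)

    def on_probation(self, server):
        window = self.samples[server]
        return bool(window) and statistics.median(window) > self.threshold_ms

    def healthy_servers(self, servers):
        return [s for s in servers if not self.on_probation(s)]
```

Because shadow requests keep feeding `record`, a server whose slowness was temporary drops back under the threshold and is reincorporated automatically.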

Techniques for Search Systems

In large IR systems like Google Search, speed is itself a quality metric — returning good results quickly beats returning the best results slowly.

IR-Specific

"Good Enough" Results

Once a sufficient fraction of leaf servers have responded, return slightly incomplete results rather than waiting for every last server. The chance that a specific slow server holds the best result is less than 1 in 1,000 — odds further reduced by replicating important documents across multiple servers.

Return results once an acceptable fraction of the corpus has been searched.

Non-essential subsystems (ads, spelling correction) can be gracefully skipped if they don't respond in time.

Carefully tuned to ensure "good enough" results are rare, preserving quality while protecting latency.
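The idea can be sketched as a fan-out that stops waiting once enough leaves have answered or the deadline passes (`query_leaf` and `search` are hypothetical names, not Google's API):

```python
import concurrent.futures as cf

def search(query_leaf, leaves, min_fraction=0.95, deadline_s=0.5):
    """Fan out to all leaves; return once an acceptable fraction have
    responded or the deadline passes, skipping the stragglers."""
    needed = int(min_fraction * len(leaves))
    results = []
    pool = cf.ThreadPoolExecutor(max_workers=len(leaves))
    futures = [pool.submit(query_leaf, leaf) for leaf in leaves]
    try:
        for fut in cf.as_completed(futures, timeout=deadline_s):
            results.append(fut.result())
            if len(results) >= needed:   # good enough: don't wait for the rest
                break
    except cf.TimeoutError:
        pass                             # deadline hit: return what we have
    pool.shutdown(wait=False)            # leave stragglers to finish or be dropped
    return results
```

Tuning `min_fraction` and `deadline_s` is the knob the paper describes: set them so that incomplete results are rare, preserving quality while capping latency.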

IR-Specific

Canary Requests

In high fan-out systems, a single bad request could crash or hang thousands of servers simultaneously (untested code path, malicious input). Canary requests provide a safety net.

1

Before fanning out to thousands of leaf servers, send the request to just 1–2 "canary" servers first.

2

Only proceed with full fan-out if the canary responds successfully within a reasonable time.

3

If the canary crashes or hangs, the request is flagged as dangerous and blocked from further execution.

The latency overhead is minimal — waiting for 1 server has far less variability than waiting for thousands. Despite this small cost, canary requests are used for every request in all of Google's large fan-out search systems because the safety benefit is so high.
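The three steps can be sketched as a gate in front of the fan-out (`query_leaf` and `canary_then_fanout` are illustrative names; real systems use 1–2 canaries and per-query timeouts):

```python
import concurrent.futures as cf

def canary_then_fanout(query_leaf, leaves, canary_timeout_s=1.0):
    """Step 1: try one canary leaf. Step 2: fan out only if it succeeds
    in time. Step 3: otherwise flag the query and block full execution."""
    with cf.ThreadPoolExecutor(max_workers=len(leaves)) as pool:
        canary = pool.submit(query_leaf, leaves[0])
        try:
            first = canary.result(timeout=canary_timeout_s)
        except Exception as exc:   # crash, error, or hang on the canary
            raise RuntimeError("potentially dangerous query blocked") from exc
        rest = [pool.submit(query_leaf, leaf) for leaf in leaves[1:]]
        return [first] + [f.result() for f in rest]
```

The cost is one extra round trip to a single server, whose latency varies far less than the slowest of thousands.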

What About Mutations?

State-mutating operations are easier to handle because:

  • Latency-critical modifications are generally small in scale
  • Updates can often happen off the critical path, after responding to the user
  • Many services tolerate inconsistent update models
  • Quorum-based algorithms like Paxos need responses from only a majority of their (typically 3–5) replicas, making them inherently tail-tolerant

Hardware Trends

The future makes these techniques more important and more effective:

  • Aggressive power optimization and deep submicron fabrication increase device-level variability
  • Higher data center network bisection bandwidth reduces tied-request cost
  • RDMA and lower per-message overhead make cancellations arrive faster
  • Fine-grained requests improve multiplexing and reduce head-of-line blocking

The Big Takeaways

The paper's enduring contribution is a paradigm shift: stop trying to prevent every latency spike, and instead build systems that are resilient to them.