Gmail Smart Compose

Objective¶

Predict and suggest real-time phrase completions for email composition using compact on-device models with cloud validation. Serve over 1.8 billion users globally while balancing latency, accuracy, and rigorous privacy standards [1].

System Architecture¶

High-level: the client triggers a prediction (debounced), the edge service enriches and caches context, and inference returns candidates that are filtered and ranked before client-side rendering. The backend targets a 90th-percentile (P90) end-to-end latency under 60 ms [4], so that suggestions feel assistive rather than intrusive.
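The debounced trigger can be illustrated with a small timestamp-driven sketch. It is not the Gmail client's actual logic: the `Debouncer` class, its method names, and the 100 ms quiet period are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Debouncer:
    """Hypothetical helper: fire a prediction only after the user pauses typing."""
    quiet_ms: float = 100.0               # assumed pause length, not from the source
    _last_key_ms: Optional[float] = None  # time of the most recent keystroke
    _pending: bool = False                # True while a keystroke awaits a request

    def on_keystroke(self, now_ms: float) -> None:
        # Every keystroke restarts the quiet period.
        self._last_key_ms = now_ms
        self._pending = True

    def should_fire(self, now_ms: float) -> bool:
        # Fire at most once per pause, after quiet_ms without new keystrokes.
        if not self._pending or self._last_key_ms is None:
            return False
        if now_ms - self._last_key_ms >= self.quiet_ms:
            self._pending = False
            return True
        return False
```

Passing timestamps in explicitly keeps the sketch deterministic; a real client would more likely use a timer callback that is cancelled and rescheduled on each keystroke.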

Technical Approach¶

ML Model Evolution¶

RNN & LSTM: Early versions used seq2seq RNN and LSTM models. The word embeddings of the subject and previous message (the context) were averaged and fed into each decoding step.
Transformers: Shifted to self-attention based architectures for parallelism and long-range dependencies, operating primarily as decoder-only sequence predictors [2, 12].

Key Components¶

Context Caching: Encodes fixed context (subject, thread history) into cached Key-Value (KV) pairs so only the newly typed prefix computes attention.
Language Model: Compact Transformers hosted on TPU Pods, quantized (fp32 to int8/bf16) for inference speed [23].
Sampling & Ranking Layer: Uses a very narrow Beam Search (width 1-3) coupled with confidence thresholding to prevent user distraction.
Personalization: Uses Katz backoff n-gram models implemented as Weighted Finite Automata (WFA) for lightweight, efficient per-user adaptation [12]. The personal model's probabilities are interpolated with those of the global model.
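The sampling and ranking component above can be sketched as a narrow beam search followed by a confidence gate. This is a minimal illustration, not the production decoder: `beam_search`, `suggest`, the `TOY` probability table, and the 0.5 threshold are all hypothetical, and the "model" is a hard-coded lookup rather than a Transformer.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Tokens = Tuple[str, ...]
NextProbs = Callable[[Tokens], Dict[str, float]]
EOS = "</s>"


def beam_search(next_probs: NextProbs, prefix: Tokens,
                beam_width: int = 3, max_len: int = 5) -> List[Tuple[Tokens, float]]:
    """Return hypotheses as (tokens, log_prob); finished ones first, then length-capped ones."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]
    finished: List[Tuple[Tokens, float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, p in next_probs(prefix + tokens).items():
                if p <= 0.0:
                    continue
                if tok == EOS:
                    finished.append((tokens, logp + math.log(p)))
                else:
                    candidates.append((tokens + (tok,), logp + math.log(p)))
        # Keep only the best few partial hypotheses (the "narrow" beam).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len, if any
    return finished


def suggest(next_probs: NextProbs, prefix: Tokens,
            min_confidence: float = 0.5, beam_width: int = 3) -> Optional[str]:
    """Return the best completion, or None when the model is not confident enough."""
    hyps = [h for h in beam_search(next_probs, prefix, beam_width) if h[0]]
    if not hyps:
        return None
    tokens, logp = max(hyps, key=lambda h: h[1])
    if math.exp(logp) < min_confidence:
        return None  # stay silent rather than distract the user
    return " ".join(tokens)


# Toy next-token distribution standing in for the language model (made-up numbers).
TOY: Dict[Tokens, Dict[str, float]] = {
    ("thank",): {"you": 0.9, "goodness": 0.1},
    ("thank", "you"): {"for": 0.8, EOS: 0.2},
    ("thank", "you", "for"): {"your": 0.9, "the": 0.1},
    ("thank", "you", "for", "your"): {"help": 0.95, "time": 0.05},
}


def toy_next_probs(prefix: Tokens) -> Dict[str, float]:
    return TOY.get(prefix, {EOS: 1.0})
```

With this table, `suggest(toy_next_probs, ("thank",))` picks "you for your help" (sequence probability about 0.62), while raising `min_confidence` to 0.7 suppresses the suggestion entirely. Scoring by raw sequence probability favors short completions; a real ranker would likely normalize for length.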

Complexity Analysis & Metrics¶

Metric	Complexity / Value	Notes
Users Served	1.8 Billion+	Global deployment requiring robust load balancing
Latency Target	P90 < 60ms	Includes network, 20ms P50 inference [4,8]
Typing Saved	1B+ chars/week	Massively reduces repetitive idiomatic typing
Acceptance Rate	> 10%	Threshold for utility without annoyance [26]

System Design Interview Framework¶

In an ML System Design interview (“Design Gmail Smart Compose”), candidates should highlight:

Capacity Estimation: At ~450 billion requests/day (1.8B users * 5 emails * 50 predictions), average load is ~5.2M QPS; with a 2-3x peak-to-average factor, peak QPS hits 10-15M.
Bottlenecks vs. Trade-offs:
- Network latency is mitigated via edge serving, quantization, and context caching.
- Quality vs. Speed is mitigated by small beam widths and Speculative Decoding (TinyLMs mask latency while cloud TPU logic finishes validating).
API Design: Needs user_id, subject, thread_context, current_prefix, and metadata (locale/timestamp).
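The capacity arithmetic is easy to check directly. The sketch below multiplies the three per-user factors quoted in the estimate; the 2-3x peak-to-average ratio is an assumption introduced here to connect the daily total to the 10-15M peak figure.

```python
USERS = 1.8e9                  # users served
EMAILS_PER_USER_PER_DAY = 5    # emails composed per user per day
PREDICTIONS_PER_EMAIL = 50     # prediction requests per email
SECONDS_PER_DAY = 86_400
PEAK_TO_AVERAGE = (2.0, 3.0)   # assumed diurnal peak factor, not from the source

requests_per_day = USERS * EMAILS_PER_USER_PER_DAY * PREDICTIONS_PER_EMAIL
average_qps = requests_per_day / SECONDS_PER_DAY
peak_qps = tuple(average_qps * factor for factor in PEAK_TO_AVERAGE)

print(f"requests/day: {requests_per_day:.2e}")
print(f"average QPS:  {average_qps / 1e6:.1f}M")
print(f"peak QPS:     {peak_qps[0] / 1e6:.1f}M to {peak_qps[1] / 1e6:.1f}M")
```

The product comes to 4.5e11 requests per day, or roughly 5.2M QPS on average, which lands in the 10-15M range once the assumed peak factor is applied.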

Privacy, Security, and Ethics¶

Smart Compose relies heavily on privacy isolation:

Differential Privacy (DP): DP-SGD clips per-example gradients and injects noise, bounding any individual's influence on model weights [28].
Federated Learning (FL): Future on-device adaptations use Secure Aggregation to train on local data without centralizing it [28].
Data Scrubbing: Strict PII normalization (generic tokens like [NAME]) before training.
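The DP-SGD idea from the list above comes down to two operations per batch: clip each example's gradient, then add Gaussian noise to the sum. The sketch below shows one such update in plain Python. The function names and default hyperparameters are illustrative assumptions, and it omits the privacy accounting a real training pipeline would need.

```python
import math
import random
from typing import List, Optional, Sequence


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights: Sequence[float],
                per_example_grads: Sequence[Sequence[float]],
                lr: float = 0.1,
                max_norm: float = 1.0,
                noise_multiplier: float = 1.0,
                rng: Optional[random.Random] = None) -> List[float]:
    """One DP-SGD update: clip per example, sum, add noise, average, step."""
    rng = rng or random.Random()
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise scale is tied to the clipping bound, i.e. to the per-example sensitivity.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]
```

Clipping caps how much any single email can move the weights, and the noise hides whatever contribution remains. Setting `noise_multiplier=0.0` reduces the step to ordinary SGD on clipped gradients, which is handy for testing but provides no privacy guarantee.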

Pipeline / Data Flow¶

Client triggers after debounce or token boundary and sends prefix + metadata.
Edge app server attaches session context (cached encoded subject/thread) and routes to inference.
Inference service attends to cached context + prefix; decoder produces candidate sequences.
Post-processing filters for toxicity/PII and applies personalization interpolation with local signals.
Top candidate(s) returned; client renders ghost text and accepts on user action.
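The post-processing step (filtering plus personalization interpolation) can be sketched as a linear mix of two next-token distributions. This is a simplified stand-in: the personal model here is a plain dictionary rather than a Katz backoff WFA, and `interpolate`, `top_candidate`, the `blocked` set, and the 0.2 weight are hypothetical names and values.

```python
from typing import Dict, FrozenSet, Optional


def interpolate(global_probs: Dict[str, float],
                personal_probs: Dict[str, float],
                personal_weight: float = 0.2) -> Dict[str, float]:
    """Linear interpolation: (1 - w) * P_global + w * P_personal over the joint vocabulary."""
    vocab = set(global_probs) | set(personal_probs)
    return {
        tok: (1.0 - personal_weight) * global_probs.get(tok, 0.0)
        + personal_weight * personal_probs.get(tok, 0.0)
        for tok in vocab
    }


def top_candidate(global_probs: Dict[str, float],
                  personal_probs: Dict[str, float],
                  personal_weight: float = 0.2,
                  blocked: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Drop blocked candidates (toy toxicity/PII filter), then return the best one."""
    mixed = interpolate(global_probs, personal_probs, personal_weight)
    allowed = {tok: p for tok, p in mixed.items() if tok not in blocked}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)
```

For example, with a global model giving `{"regards": 0.6, "wishes": 0.4}` and a personal model giving `{"wishes": 0.9, "cheers": 0.1}`, a weight of 0.5 flips the top candidate from "regards" to "wishes". The weight controls how quickly a user's own phrasing overrides the global prior.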

Complexity Analysis¶

Metric	Complexity	Notes
Model size	10–100M params	Small enough for on-device / edge quantization and fast inference
Time complexity	O(seq_len) per token	Autoregressive decoding dominates; caching reduces repeated work
Space complexity	~50–200MB	Includes KV cache, model weights (quantized), and personal model artifacts
Latency target	P90 < 60ms	Includes network, inference, and post-filtering; client-side tiny LM can mask network delays
Throughput target	1000s reqs/s per replica; 10-15M QPS aggregate at peak	Scale via sharding, batching, and edge replication

Pros & Cons¶

Pros¶

Contextual Assistance: Reduces attention residue and saves up to 84% composition time in reply scenarios [7].
Scalability: Custom TPU hardware co-design makes a per-keystroke feature affordable at peak loads of 10-15M QPS.

Cons¶

Infrastructure cost: TPUs and Edge replica caches are expensive, demanding massive scale to amortize.
Privacy risk: Generative models can memorize and reproduce training text (extraction attacks), so edge cases require deep guardrails.