Gmail Smart Compose

Authors
Affiliations
Birmingham City University
Sunway College Kathmandu

Objective

Predict and suggest real-time phrase completions for email composition using compact on-device models with cloud validation. Serve over 1.8 billion users globally while balancing latency, accuracy, and rigorous privacy standards [1].

System Architecture

High-level: the client triggers a prediction after a debounce, the edge service enriches the request with cached context, and inference returns candidates that are filtered and ranked before the client renders them. The backend targets a strict end-to-end P90 latency of under 60 milliseconds [4], so that suggestions feel assistive rather than intrusive.
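The debounced trigger can be illustrated with a minimal sketch. The `Debouncer` class, its method names, and the 100 ms quiet period are assumptions made for illustration; the source does not state the actual debounce interval or client logic. Timestamps are passed in explicitly so the behaviour is easy to test.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Debouncer:
    """Fires at most one prediction request per typing pause."""

    quiet_ms: float = 100.0  # hypothetical pause threshold, not from the source
    _last_keystroke_ms: Optional[float] = field(default=None, init=False)
    _fired: bool = field(default=True, init=False)  # nothing pending at start

    def on_keystroke(self, now_ms: float) -> None:
        """Record a keystroke and mark a prediction as pending."""
        self._last_keystroke_ms = now_ms
        self._fired = False

    def should_fire(self, now_ms: float) -> bool:
        """Return True exactly once per pause lasting at least quiet_ms."""
        if self._fired or self._last_keystroke_ms is None:
            return False
        if now_ms - self._last_keystroke_ms >= self.quiet_ms:
            self._fired = True
            return True
        return False
```

With this shape, rapid keystrokes keep resetting the timer, so only a pause in typing sends a request; that keeps request volume down without delaying suggestions noticeably.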

Technical Approach

ML Model Evolution

Key Components

Complexity Analysis & Metrics

| Metric | Value | Notes |
| --- | --- | --- |
| Users Served | 1.8 billion+ | Global deployment requiring robust load balancing |
| Latency Target | P90 < 60 ms | Includes network; 20 ms P50 inference [4,8] |
| Typing Saved | 1B+ chars/week | Greatly reduces repetitive idiomatic typing |
| Acceptance Rate | > 10% | Threshold for utility without annoyance [26] |
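As an illustration of how the latency and acceptance metrics above might be computed from logs, here is a small sketch. The function names and the nearest-rank percentile definition are assumptions; production monitoring may use a different estimator.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def acceptance_rate(shown: int, accepted: int) -> float:
    """Fraction of displayed suggestions that the user accepted."""
    return accepted / shown if shown else 0.0


# Hypothetical latency samples in milliseconds, for illustration only.
latencies_ms = [12, 18, 20, 22, 25, 31, 38, 44, 52, 75]
print(percentile_nearest_rank(latencies_ms, 90))  # 52
print(acceptance_rate(shown=1000, accepted=120))  # 0.12
```

A tail percentile is used rather than the mean because a single slow request (75 ms here) would otherwise be hidden by the many fast ones.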

System Design Interview Framework

In an ML System Design interview (“Design Gmail Smart Compose”), candidates should highlight:

  1. Capacity Estimation: At ~450 billion requests/day (1.8B users * 5 emails * 50 predictions), average load is about 5.2M QPS and peak QPS hits 10-15M.

  2. Bottlenecks vs. Trade-offs:

    • Network and inference latency are reduced via edge serving, quantization, and context caching.

    • Quality vs. speed is balanced with small beam widths and speculative decoding: a tiny client-side LM displays a draft immediately, masking latency while the cloud TPU model finishes validating it.

  3. API Design: The prediction request needs user_id, subject, thread_context, current_prefix, and metadata (locale, timestamp).
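The capacity estimate can be checked by multiplying out the quoted factors (1.8B users, 5 emails per user per day, 50 predictions per email). The sketch below does that arithmetic; the peak-to-average factors of 2 and 3 are assumptions chosen for illustration, not figures from the source.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(users: int, emails_per_user: int,
                      predictions_per_email: int, peak_factor: float):
    """Return (requests per day, average QPS, peak QPS)."""
    daily = users * emails_per_user * predictions_per_email
    avg_qps = daily / SECONDS_PER_DAY
    return daily, avg_qps, avg_qps * peak_factor


for factor in (2, 3):  # assumed peak-to-average ratios
    daily, avg_qps, peak_qps = capacity_estimate(1_800_000_000, 5, 50, factor)
    print(f"daily={daily:,} avg_qps={avg_qps:,.0f} peak_qps(x{factor})={peak_qps:,.0f}")
# daily is 450,000,000,000; average is about 5.2M QPS;
# peaks are about 10.4M (x2) and 15.6M (x3).
```

Under these assumptions the product is about 450 billion requests per day, and a 2-3x peak factor lands close to the 10-15M peak QPS range.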

Privacy, Security, and Ethics

Smart Compose relies heavily on privacy isolation:

Pipeline / Data Flow

  1. Client triggers after debounce or token boundary and sends prefix + metadata.

  2. Edge app server attaches session context (cached encoded subject/thread) and routes to inference.

  3. Inference service attends to cached context + prefix; decoder produces candidate sequences.

  4. Post-processing filters for toxicity/PII and applies personalization interpolation with local signals.

  5. Top candidate(s) returned; client renders ghost text and accepts on user action.
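The steps above can be sketched end to end as follows. Everything here is a simplified stand-in: `ContextCache`, `BLOCKLIST`, `suggest`, and the linear interpolation weight `lam` are hypothetical names and choices, the encoder and generator are injected callables rather than real models, and a substring blocklist stands in for real toxicity/PII classifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]  # (completion text, global-model probability)


@dataclass
class PredictRequest:
    user_id: str
    subject: str
    thread_context: str
    current_prefix: str
    metadata: Dict[str, str] = field(default_factory=dict)


class ContextCache:
    """Step 2: encode subject/thread once per session and reuse the result."""

    def __init__(self, encode: Callable[[str, str], object]):
        self._encode = encode
        self._store: Dict[Tuple[str, str, str], object] = {}
        self.misses = 0

    def get(self, req: PredictRequest) -> object:
        key = (req.user_id, req.subject, req.thread_context)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(req.subject, req.thread_context)
        return self._store[key]


BLOCKLIST = {"password", "ssn"}  # hypothetical stand-in for safety filters


def is_safe(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKLIST)


def interpolate(p_global: float, p_personal: float, lam: float) -> float:
    """Assumed linear interpolation between global and personal scores."""
    return (1 - lam) * p_global + lam * p_personal


def suggest(req: PredictRequest, cache: ContextCache,
            generate: Callable[[object, str], List[Candidate]],
            personal_probs: Dict[str, float],
            lam: float = 0.2, top_k: int = 1) -> List[str]:
    context = cache.get(req)                             # step 2
    candidates = generate(context, req.current_prefix)   # step 3
    safe = [(t, p) for t, p in candidates if is_safe(t)]  # step 4: filter
    scored = [(t, interpolate(p, personal_probs.get(t, 0.0), lam))
              for t, p in safe]                          # step 4: personalize
    scored.sort(key=lambda c: c[1], reverse=True)
    return [t for t, _ in scored[:top_k]]                # step 5
```

Filtering before personalization means an unsafe candidate can never be boosted back into the top slot by local signals, which is one reason to order the post-processing this way.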

Complexity Analysis

| Metric | Complexity / Value | Notes |
| --- | --- | --- |
| Model size | 10–100M params | Small enough for on-device / edge quantization and fast inference |
| Time complexity | O(seq_len) per token | Autoregressive decoding dominates; caching reduces repeated work |
| Space complexity | ~50–200 MB | Includes KV cache, model weights (quantized), and personal model artifacts |
| Latency target | P95 < 50 ms | Includes network, inference, and post-filtering; client-side tiny LM can mask network delays |
| Throughput target | 1000s of requests/s (aggregated) | Scale via sharding, batching, and edge replication |
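To make the space figure more concrete, here is a back-of-the-envelope sketch of quantized weight size plus KV cache size. The model shape (12 layers, 768 hidden units, 512-token context, fp16 cache) is purely hypothetical; the source gives only the parameter range and the overall footprint.

```python
def weights_mb(params: int, bits_per_weight: int) -> float:
    """Size of the model weights in megabytes (1 MB = 1e6 bytes)."""
    return params * bits_per_weight / 8 / 1e6


def kv_cache_mb(layers: int, seq_len: int, d_model: int,
                bytes_per_value: int = 2) -> float:
    """Keys and values: two seq_len x d_model tensors per layer."""
    return 2 * layers * seq_len * d_model * bytes_per_value / 1e6


# Assumed configuration: 100M parameters quantized to int8, fp16 KV cache.
w = weights_mb(100_000_000, bits_per_weight=8)        # 100.0 MB
kv = kv_cache_mb(layers=12, seq_len=512, d_model=768)  # about 18.9 MB
print(f"weights={w:.1f} MB, kv_cache={kv:.1f} MB, total={w + kv:.1f} MB")
```

Under these assumptions the total is roughly 119 MB, which sits inside the 50–200 MB range; the KV cache term grows linearly with context length, which is why caching an encoded context rather than raw history matters.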

Pros & Cons

Pros

Cons

References & Citations

  1. Google Help: Use Smart Compose in Gmail

  2. Attention is All You Need / Transformer scale

  3. Gmail Smart Compose: Real-Time Assisted Writing (KDD 2019)

  4. Integrated Gmail Updates with Improved Looks and Handy: Real Efficiency Gains

  5. Google Research: Smart Compose: Using Neural Networks to Help Write Emails

  6. Weak Learner: Gmail Smart Compose Real-Time Assisted Writing Summary

  7. What is AI Inference? Complete Guide to AI Model Deployment

  8. The KPIs that actually matter for production AI agents

  9. Private Federated Learning in Gboard