19. LLM Serving Infrastructure

Authors
Affiliations
Birmingham City University
Sunway College Kathmandu

Objective

LLM serving infrastructure delivers low-latency, high-throughput inference for large language models across distributed GPU clusters. It handles request batching, caching, and dynamic scaling.
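To make the batching part of the objective concrete, here is a minimal offline sketch of size- and time-bounded request batching. It is only an illustration under stated assumptions: the `Request` class, the `form_batches` function, and the `max_batch_size` and `max_wait_s` parameters are hypothetical names introduced here, not the API of any particular serving framework. A production server would typically do this on a live queue rather than on a pre-collected list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical request record: an id and an arrival time in seconds."""
    request_id: str
    arrival_s: float


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_s: float
) -> List[List[Request]]:
    """Group requests into batches (illustrative policy, not a real scheduler).

    A batch is closed when it is full, or when the next request arrives more
    than `max_wait_s` after the first request already in the batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_s):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_s - current[0].arrival_s > max_wait_s
        )
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


example = [
    Request("a", 0.00),
    Request("b", 0.01),
    Request("c", 0.02),
    Request("d", 0.50),
]
batch_ids = [
    [r.request_id for r in batch]
    for batch in form_batches(example, max_batch_size=2, max_wait_s=0.05)
]
print(batch_ids)
```

A larger `max_batch_size` tends to raise GPU utilization and throughput, while a smaller `max_wait_s` bounds the queueing delay added to each request; the two knobs express the latency versus throughput balance named in the objective.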

System Architecture

[Mermaid diagram - flowchart showing core components and data flow]

[3-5 sentence description of architecture]

Technical Approach

Key Components

Pipeline / Data Flow

[Detailed description of request → processing → response flow]

Complexity Analysis

| Metric | Complexity | Notes |
| --- | --- | --- |
| Model size | Variable (7B-70B+) | [implications] |
| Time complexity | O(batch_size × seq_len²) | [notes] |
| Space complexity | ~2x model size for activations | [notes] |
| Latency target | p95 <1s per request | [real-time vs. batch] |
| Throughput target | 1000-10000 req/s per cluster | [per GPU/instance] |
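The figures in the table can be turned into rough back-of-envelope numbers. The sketch below is an estimate under stated assumptions, not a sizing method from this article: it assumes 2 bytes per parameter (16-bit weights), treats 1 GB as 10^9 bytes, and applies Little's law (average concurrency = arrival rate × average time in system) using the latency target as a stand-in for average latency, which overstates concurrency if the target is a p95. The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored at `bytes_per_param` bytes
    (2 for 16-bit formats). Activations and caches are not included.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


def in_flight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's law: average number of requests in the system at once."""
    return throughput_rps * latency_s


def relative_attention_cost(batch_size: int, seq_len: int) -> int:
    """Unitless cost proportional to batch_size * seq_len**2, per the table."""
    return batch_size * seq_len ** 2


small_model_gb = weight_memory_gb(7)      # 7B parameters
large_model_gb = weight_memory_gb(70)     # 70B parameters
concurrency = in_flight_requests(throughput_rps=1000, latency_s=1.0)
cost_ratio = relative_attention_cost(8, 2048) / relative_attention_cost(8, 1024)

print(f"7B weights:  {small_model_gb:.0f} GB")
print(f"70B weights: {large_model_gb:.0f} GB")
print(f"In-flight requests at 1000 req/s and 1 s latency: {concurrency:.0f}")
print(f"Cost ratio when seq_len doubles: {cost_ratio:.0f}x")
```

Under these assumptions, doubling the sequence length quadruples the quadratic term, and a cluster at the low end of the throughput target needs to hold on the order of a thousand requests in flight at once.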

Pros & Cons

Pros

Cons

Trade-offs

[1-2 paragraphs discussing key technical trade-offs]

Real-World Applications

Where This Pattern Appears

Production Considerations

[2-3 paragraphs on scaling, failure modes, monitoring, cost]

References & Citations

Citation 1: Architecture & Design

Title: [Paper/Blog Title on LLM Serving Infrastructure Architecture]

Citation 2: Performance & Benchmarks

Title: [Performance Benchmarks for LLM Serving Infrastructure]

Citation 3: Implementation Details

Title: [Implementation Details and Trade-offs]

Citation 4: Real-World Deployment

Title: [Production Deployment Insights]

Reproducibility Checklist