Load Balancing¶
SMG provides multiple load balancing policies to distribute requests across workers. Choosing the right policy depends on your workload characteristics.
Overview¶
Bucket¶
Request-length-based routing with adaptive boundaries. Designed for PD disaggregation workloads.
Power of Two¶
Load-aware selection without global state. Samples two workers, routes to the lighter one.
Consistent Hashing¶
Header-based routing with minimal redistribution on scaling. Ideal for session affinity.
Policy Comparison¶
| Policy | Load Aware | Cache Affinity | Session Affinity | Complexity | Best For |
|---|---|---|---|---|---|
| cache_aware | ✓ | ✓ | ✗ | O(prefix) | Production LLM |
| bucket | ✓ | ✗ | ✗ | O(n) | PD disaggregation |
| power_of_two | ✓ | ✗ | ✗ | O(1) | Load balancing |
| consistent_hashing | ✗ | ✗ | ✓ | O(log n) | Session affinity |
| prefix_hash | ✓ | Partial | ✗ | O(log n) | Lightweight caching |
| manual | ✗ | ✗ | ✓ | O(1) | Stateful chat |
| round_robin | ✗ | ✗ | ✗ | O(1) | Even distribution |
| random | ✗ | ✗ | ✗ | O(1) | Testing |
Cache-Aware¶
The recommended policy for production LLM inference. Maintains a multi-tenant radix tree that mirrors backend KV cache state, enabling accurate cache-hit prediction with integrated load balancing.
smg --policy cache_aware --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- Higher memory usage (radix tree per worker)
- O(prefix) selection time
- Requires tokenization
Use when: Production workloads with repeated prefixes—multi-turn conversations, RAG applications, batch processing with templates.
Learn more about Cache-Aware Routing →
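The idea of routing by matched prefix can be sketched in a few lines. This is a toy illustration, not SMG's implementation: it uses a plain per-token trie instead of a radix tree, and the names `PrefixIndex`, `pick_worker`, and the `min_match` threshold are hypothetical.

```python
class PrefixIndex:
    """Toy token trie recording which workers have served each prefix."""

    def __init__(self):
        self.children = {}
        self.workers = set()

    def insert(self, tokens, worker):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixIndex())
            node.workers.add(worker)

    def longest_match(self, tokens):
        """Return (matched_length, workers that hold that prefix)."""
        node, depth, owners = self, 0, set()
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, depth, owners = nxt, depth + 1, nxt.workers
        return depth, set(owners)


def pick_worker(index, tokens, loads, min_match=2):
    """Prefer workers with a cached prefix; otherwise fall back to load."""
    depth, owners = index.longest_match(tokens)
    # min_match is an assumed threshold below which a match is ignored.
    candidates = owners & set(loads) if depth >= min_match else set()
    if not candidates:
        candidates = set(loads)
    # Least-loaded candidate wins; sorting keeps tie-breaking deterministic.
    worker = min(sorted(candidates), key=lambda w: loads[w])
    index.insert(tokens, worker)  # remember where this prefix now lives
    return worker
```

The walk over the request's tokens is what gives the O(prefix) selection time listed above, and the stored token paths are the source of the extra memory.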
Bucket¶
Routes requests based on request text length using adaptive boundaries. Periodically adjusts boundaries based on observed load distribution.
smg --policy bucket --worker-urls http://w1:8000 http://w2:8000 http://w3:8000
Limitations¶
- O(n) complexity
- No cache locality
- Requires understanding of length distribution
Use when: PD disaggregation where prefill workers handle different request sizes, or workloads with bimodal request length distribution.
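A minimal sketch of length-based routing with adjustable boundaries follows. The source only says boundaries are adjusted periodically from the observed load distribution; the equal-count quantile rule in `rebalance`, and the `BucketRouter` class itself, are assumptions made for illustration.

```python
import bisect


class BucketRouter:
    """Route by request text length; one bucket per worker."""

    def __init__(self, workers, boundaries):
        # boundaries are sorted upper-exclusive cut points between buckets
        assert len(boundaries) == len(workers) - 1
        self.workers = list(workers)
        self.boundaries = list(boundaries)
        self.seen = []  # lengths observed since the last rebalance

    def route(self, text):
        n = len(text)
        self.seen.append(n)
        return self.workers[bisect.bisect_right(self.boundaries, n)]

    def rebalance(self):
        """Assumed rule: move cut points to equal-count quantiles."""
        if not self.seen:
            return
        ordered = sorted(self.seen)
        k = len(self.workers)
        self.boundaries = [ordered[(i * len(ordered)) // k] for i in range(1, k)]
        self.seen.clear()
```

Calling `rebalance()` on a timer would approximate the periodic adjustment; with a bimodal length distribution the cut points settle between the two modes.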
Power of Two Choices¶
Samples two random workers and selects the one with lower load. Provides good load distribution with minimal coordination overhead—a proven algorithm from distributed systems research.
smg --policy power_of_two --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- No cache locality
- Requires load metrics from workers
- May not find optimal worker
Use when: Heterogeneous workers with varying response times, or when cache locality doesn't matter.
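The selection rule is small enough to show in full. This is a generic sketch of the power-of-two-choices technique, assuming load is available as a number per worker; the `power_of_two` function and the `loads` mapping are illustrative names, not SMG's API.

```python
import random


def power_of_two(loads, rng=random):
    """Sample two distinct workers and return the one with lower load."""
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    return a if loads[a] <= loads[b] else b
```

With three or more workers the heaviest one is never chosen, because any pair that contains it resolves to the other member. The globally lightest worker is only chosen when it happens to be sampled, which is the "may not find optimal worker" limitation.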
Consistent Hashing¶
Provides header-based consistent routing using a hash ring. Minimizes redistribution when the worker pool is resized: with N workers, only about 1/N of the keys move when a worker is added or removed.
smg --policy consistent_hashing --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- No load awareness
- No cache locality
- Requires routing key header
Routing Headers¶
| Header | Description |
|---|---|
| X-SMG-Target-Worker | Direct routing by worker index (0-based) |
| X-SMG-Routing-Key | Consistent hash routing for session affinity |
Priority order: X-SMG-Target-Worker → X-SMG-Routing-Key → Implicit keys (Authorization, X-Forwarded-For, Cookie) → Random fallback
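The priority order can be read as a simple resolution function. The sketch below is an assumption-laden illustration: `select_worker` and `hash_lookup` are hypothetical names, header lookup is case-sensitive for brevity, and trying the implicit headers in the order listed is a guess based on how they are written above.

```python
import random

IMPLICIT_KEY_HEADERS = ("Authorization", "X-Forwarded-For", "Cookie")


def select_worker(headers, workers, hash_lookup, rng=random):
    """Resolve a worker following the documented priority order."""
    target = headers.get("X-SMG-Target-Worker")
    if target is not None:
        return workers[int(target)]  # direct routing by 0-based index
    key = headers.get("X-SMG-Routing-Key")
    if key is None:
        # Fall back to the first implicit key header that is present.
        key = next((headers[h] for h in IMPLICIT_KEY_HEADERS if h in headers), None)
    if key is not None:
        return hash_lookup(key)  # e.g. a hash-ring lookup
    return rng.choice(workers)  # random fallback
```

How SMG treats an out-of-range or non-numeric worker index is not stated in this page; the sketch simply lets Python raise.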
Use when: Session affinity needed, user-to-worker pinning, or consistent routing for stateful applications.
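To see why a hash ring limits redistribution, here is a minimal ring with virtual nodes. It is a generic sketch of the technique, not SMG's code: the `HashRing` class, the SHA-256 based hash, and the `vnodes` count are all illustrative choices.

```python
import bisect
import hashlib


def _h(value):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker) points
        for w in workers:
            self.add(w)

    def add(self, worker):
        # Each worker owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{worker}#{i}"), worker))

    def lookup(self, routing_key):
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self.ring, (_h(routing_key), ""))
        return self.ring[idx % len(self.ring)][1]
```

Adding a worker inserts new points without moving existing ones, so a key changes owner only if one of the new points lands between it and its old owner. Every key that moves therefore moves to the new worker, and the rest keep their assignment.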
Prefix Hash¶
A lightweight alternative to full cache-aware routing. Routes requests based on a hash of the first N tokens, using consistent hashing with load factor override.
smg --policy prefix_hash --prefix-token-count 256 --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- Prefix grouping, not exact matching
- Less precise than cache_aware
- Load factor can cause redistribution
Comparison with Cache-Aware¶
| Aspect | prefix_hash | cache_aware |
|---|---|---|
| Lookup | O(log n) | O(prefix_len) |
| Memory | O(workers × virtual_nodes) | O(total_tokens) |
| Precision | Prefix grouping | Exact matching |
Use when: Need some cache locality with predictable performance and lower memory footprint.
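A rough sketch of prefix hashing with a load override is given here. The source does not specify how the load factor is applied, so the rule used (skip a worker whose load exceeds `load_factor` times the average and walk on around the ring) is an assumption, as are the function names and the default values other than the 256-token prefix from the command above.

```python
import bisect
import hashlib


def _h(value):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(workers, vnodes=64):
    """Sorted (hash, worker) points, vnodes points per worker."""
    return sorted((_h(f"{w}#{i}"), w) for w in workers for i in range(vnodes))


def route_by_prefix(ring, tokens, loads, prefix_token_count=256, load_factor=1.25):
    # Only the first N tokens contribute to the routing key.
    prefix = ",".join(str(t) for t in tokens[:prefix_token_count])
    start = bisect.bisect_left(ring, (_h(prefix), ""))
    limit = load_factor * max(1.0, sum(loads.values()) / len(loads))
    # Walk clockwise until a worker under the (assumed) load limit is found.
    for step in range(len(ring)):
        worker = ring[(start + step) % len(ring)][1]
        if loads[worker] <= limit:
            return worker
    return ring[start % len(ring)][1]  # all overloaded: keep prefix affinity
```

Requests that share their first N tokens hash to the same ring position, which is the "prefix grouping" in the comparison table. The override is what can send such a request elsewhere when its usual worker is busy.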
Manual¶
Provides sticky session routing with explicit routing key mapping. Unlike consistent hashing, sessions stay with their assigned worker even when new workers are added.
smg --policy manual --assignment-mode min_load --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- No load balancing for existing sessions
- Requires X-SMG-Routing-Key header
- Memory grows with active sessions
Assignment Modes¶
| Mode | Description |
|---|---|
| random | Randomly select from healthy workers |
| min_load | Select worker with fewest active requests |
| min_group | Select worker with fewest routing keys assigned |
Use when: Stateful chat sessions where context is stored on workers, or when session continuity is critical.
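The sticky mapping and the three assignment modes can be sketched as follows. `ManualRouter` is a hypothetical class for illustration; health checks and session expiry are left out, and `active_requests` stands in for whatever load signal the router actually uses.

```python
import random


class ManualRouter:
    def __init__(self, workers, assignment_mode="random"):
        self.workers = list(workers)
        self.mode = assignment_mode
        self.assigned = {}  # routing key -> worker; grows with active sessions

    def add_worker(self, worker):
        self.workers.append(worker)

    def route(self, routing_key, active_requests):
        # Existing sessions always return to their assigned worker.
        if routing_key in self.assigned:
            return self.assigned[routing_key]
        if self.mode == "min_load":
            worker = min(self.workers, key=lambda w: active_requests.get(w, 0))
        elif self.mode == "min_group":
            counts = {w: 0 for w in self.workers}
            for w in self.assigned.values():
                counts[w] += 1
            worker = min(self.workers, key=counts.get)
        else:
            worker = random.choice(self.workers)
        self.assigned[routing_key] = worker
        return worker
```

Because the mapping is consulted before any load signal, a session stays put even after a new, idle worker joins. That is the difference from consistent hashing, and also why existing sessions are never rebalanced.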
Round Robin¶
Rotates through workers sequentially, guaranteeing even distribution over time. Skips unhealthy workers automatically.
smg --policy round_robin --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- No load awareness
- No cache locality
- Ignores request characteristics
Use when: All workers have equal capacity and you want predictable, even distribution.
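Rotation with health skipping is simple to illustrate. The `RoundRobin` class below is a sketch under the assumption that health is supplied as a set of worker names; it is not SMG's implementation.

```python
class RoundRobin:
    def __init__(self, workers):
        self.workers = list(workers)
        self.next_index = 0

    def pick(self, healthy):
        """Return the next healthy worker in rotation, or None if none are."""
        for _ in range(len(self.workers)):
            worker = self.workers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.workers)
            if worker in healthy:
                return worker
        return None
```

The rotation evens out request counts, not work: a worker that receives the long requests still ends up busier, which is what "ignores request characteristics" means in practice.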
Random¶
The simplest policy—each healthy worker has equal probability of selection. Zero state overhead.
smg --policy random --worker-urls http://w1:8000 http://w2:8000
Limitations¶
- No load awareness
- No cache locality
- Can create hot spots
Use when: Testing environments or completely homogeneous workloads where simplicity is preferred.
Choosing a Policy¶
Decision Guide¶
| Requirement | Recommended Policy |
|---|---|
| Production LLM inference | cache_aware |
| Session affinity (sticky sessions) | manual or consistent_hashing |
| PD disaggregation | bucket |
| Load balancing without cache | power_of_two |
| Lightweight cache locality | prefix_hash |
| Even distribution | round_robin |
| Testing/development | random |
Scenario Guide¶
RAG Applications¶
Recommended: cache_aware
Exploits common document prefixes for faster Time to First Token.
Multi-Tenant Platform¶
Recommended: consistent_hashing or manual
User-to-worker affinity for tenant isolation or stateful sessions.
PD Disaggregation¶
Recommended: bucket (prefill) + power_of_two (decode)
Length-based routing for prefill, load-based for decode workers.
What's Next?¶
Circuit Breakers¶
How SMG handles worker failures gracefully.