Load Balancing¶

SMG provides multiple load balancing policies to distribute requests across workers. The right policy depends on your workload characteristics.

Overview¶

Bucket¶

Request-length-based routing with adaptive boundaries. Designed for PD disaggregation workloads.

Power of Two¶

Load-aware selection without global state. Samples two workers, routes to the lighter one.

Consistent Hashing¶

Header-based routing with minimal redistribution on scaling. Ideal for session affinity.

Policy Comparison¶

Policy	Load Aware	Cache Affinity	Session Affinity	Complexity	Best For
`cache_aware`	✓	✓	✗	O(prefix)	Production LLM
`bucket`	✓	✗	✗	O(n)	PD disaggregation
`power_of_two`	✓	✗	✗	O(1)	Load balancing
`consistent_hashing`	✗	✗	✓	O(log n)	Session affinity
`prefix_hash`	✓	Partial	✗	O(log n)	Lightweight caching
`manual`	✗	✗	✓	O(1)	Stateful chat
`round_robin`	✗	✗	✗	O(1)	Even distribution
`random`	✗	✗	✗	O(1)	Testing

Cache-Aware¶

The recommended policy for production LLM inference. Maintains a multi-tenant radix tree that mirrors backend KV cache state, so the router can predict which worker already holds a request's prefix while still balancing load.

smg --policy cache_aware --worker-urls http://w1:8000 http://w2:8000
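The prefix-matching idea can be illustrated with a minimal sketch. It is not SMG's implementation: it uses one plain token trie per worker instead of a multi-tenant radix tree, and `PrefixTrie`, `pick_worker`, and `min_match` are hypothetical names chosen for illustration. The request goes to the worker with the longest recorded prefix match, falling back to the least-loaded worker when nothing matches.

```python
from typing import Dict, Sequence


class PrefixTrie:
    """Token trie recording the prefixes previously routed to one worker."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def match_len(self, tokens: Sequence[int]) -> int:
        """Length of the longest recorded prefix shared with `tokens`."""
        node, depth = self, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, depth = nxt, depth + 1
        return depth


def pick_worker(
    tokens: Sequence[int],
    tries: Dict[str, PrefixTrie],
    loads: Dict[str, int],
    min_match: int = 1,  # hypothetical threshold, not an SMG setting
) -> str:
    # Prefer the worker whose recorded prefixes overlap the request most.
    best = max(tries, key=lambda w: tries[w].match_len(tokens))
    if tries[best].match_len(tokens) < min_match:
        # No useful cache overlap: fall back to the least-loaded worker.
        best = min(loads, key=loads.get)
    # Record the request so later requests with this prefix follow it.
    tries[best].insert(tokens)
    return best
```

Selection cost grows with the prefix length walked in each trie, which is where the O(prefix) figure in the comparison table comes from.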

Limitations¶

Higher memory usage (radix tree per worker)
O(prefix) selection time
Requires tokenization

Use when: Production workloads with repeated prefixes, such as multi-turn conversations, RAG applications, and batch processing with templates.

Learn more about Cache-Aware Routing →

Bucket¶

Routes requests based on request text length using adaptive boundaries. Periodically adjusts boundaries based on observed load distribution.

smg --policy bucket --worker-urls http://w1:8000 http://w2:8000 http://w3:8000
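As a rough illustration of length-based routing with adaptive boundaries, the sketch below maps a request to a bucket by text length and periodically recomputes the boundaries from observed lengths. The quantile-based rebalancing rule is an assumption made for this example; the text above only says boundaries are adjusted from the observed load distribution. `BucketRouter` and its methods are hypothetical names.

```python
import bisect
from typing import List


class BucketRouter:
    """Maps request text length to a worker index via sorted boundaries."""

    def __init__(self, num_workers: int, initial_boundaries: List[int]) -> None:
        if len(initial_boundaries) != num_workers - 1:
            raise ValueError("need exactly num_workers - 1 boundaries")
        self.num_workers = num_workers
        self.boundaries = sorted(initial_boundaries)
        self.observed: List[int] = []

    def route(self, text: str) -> int:
        length = len(text)
        self.observed.append(length)
        # Index of the first boundary strictly greater than the length.
        return bisect.bisect_right(self.boundaries, length)

    def rebalance(self) -> None:
        """Assumed rule: place boundaries at equal-count quantiles."""
        if len(self.observed) < self.num_workers:
            return  # not enough samples to adjust
        ordered = sorted(self.observed)
        n = len(ordered)
        self.boundaries = [
            ordered[(i * n) // self.num_workers]
            for i in range(1, self.num_workers)
        ]
        self.observed.clear()
```

Calling `rebalance()` on a timer keeps each bucket receiving a similar share of requests as the length distribution drifts.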

Limitations¶

O(n) complexity
No cache locality
Requires understanding of length distribution

Use when: PD disaggregation where prefill workers handle different request sizes, or workloads with bimodal request length distribution.

Power of Two Choices¶

Samples two random workers and selects the one with lower load. Provides good load distribution with minimal coordination overhead, and is a well-studied technique from distributed systems research.

smg --policy power_of_two --worker-urls http://w1:8000 http://w2:8000
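The sampling step described above can be sketched in a few lines. This is a minimal illustration, assuming load is available as a simple per-worker integer (for example, in-flight requests); `power_of_two` and the `loads` mapping are hypothetical names, not SMG APIs.

```python
import random
from typing import Dict, Optional


def power_of_two(loads: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Sample two distinct workers and return the less loaded one."""
    rng = rng or random
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    # Ties go to the first sampled worker.
    return a if loads[a] <= loads[b] else b
```

Because only two workers are inspected per request, the cost stays O(1) regardless of pool size. The most loaded worker is never chosen when its load is strictly highest, although the least-loaded worker is not guaranteed to be picked either.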

Limitations¶

No cache locality
Requires load metrics from workers
May not pick the least-loaded worker

Use when: Heterogeneous workers with varying response times, or when cache locality doesn't matter.

Consistent Hashing¶

Provides header-based consistent routing using a hash ring. Minimizes redistribution when workers scale: only about 1/N of the keys move when a worker is added to or removed from a pool of N workers.

smg --policy consistent_hashing --worker-urls http://w1:8000 http://w2:8000
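To make the redistribution property concrete, here is a minimal hash ring sketch. The use of SHA-256, the virtual node count, and the `HashRing` class are illustrative assumptions rather than details of SMG's ring.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, workers: List[str], virtual_nodes: int = 100) -> None:
        # Each worker owns many points on the ring to smooth the distribution.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    before = HashRing(["w1", "w2", "w3", "w4"])
    after = HashRing(["w1", "w2", "w3", "w4", "w5"])
    keys = [f"user-{i}" for i in range(2000)]
    moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
    print(f"{moved / len(keys):.1%} of keys moved")  # expect roughly 1/5
```

Every key that changes owner after the scale-up lands on the new worker; assignments among the existing workers are untouched.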

Limitations¶

No load awareness
No cache locality
Requires routing key header

Routing Headers¶

Header	Description
`X-SMG-Target-Worker`	Direct routing by worker index (0-based)
`X-SMG-Routing-Key`	Consistent hash routing for session affinity

Priority order: X-SMG-Target-Worker → X-SMG-Routing-Key → Implicit keys (Authorization, X-Forwarded-For, Cookie) → Random fallback
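The priority order can be read as a simple resolution function. The sketch below is illustrative only: `resolve_routing` and its return values are hypothetical, and checking the implicit headers in the listed order (Authorization, then X-Forwarded-For, then Cookie) is an assumption.

```python
from typing import Mapping, Optional, Tuple

# Assumed order among implicit keys; it follows the listing above.
IMPLICIT_KEY_HEADERS = ("authorization", "x-forwarded-for", "cookie")


def resolve_routing(headers: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (mode, value): 'direct' index, 'hash' key, or 'random'."""
    # HTTP header names are case-insensitive, so normalize first.
    h = {name.lower(): value for name, value in headers.items()}
    if "x-smg-target-worker" in h:
        return ("direct", h["x-smg-target-worker"])
    if "x-smg-routing-key" in h:
        return ("hash", h["x-smg-routing-key"])
    for name in IMPLICIT_KEY_HEADERS:
        if name in h:
            return ("hash", h[name])
    return ("random", None)
```

For example, a request carrying both `X-SMG-Target-Worker: 0` and `X-SMG-Routing-Key: user-42` resolves to direct routing, because the target-worker header takes precedence.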

Use when: Session affinity needed, user-to-worker pinning, or consistent routing for stateful applications.

Prefix Hash¶

A lightweight alternative to full cache-aware routing. Routes requests based on a hash of the first N tokens, using consistent hashing with load factor override.

smg --policy prefix_hash --prefix-token-count 256 --worker-urls http://w1:8000 http://w2:8000
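A minimal sketch of the approach follows, under stated assumptions: the key is a hash of the first `prefix_token_count` tokens, and the "load factor override" is modeled as skipping any worker whose load exceeds `load_factor` times the average (the bounded-load variant of consistent hashing). The `load_factor` default, the override rule, and `pick_by_prefix` itself are illustrative, not SMG's documented behavior.

```python
import bisect
import hashlib
from typing import Dict, Sequence


def _hash(data: str) -> int:
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")


def pick_by_prefix(
    tokens: Sequence[int],
    loads: Dict[str, int],
    prefix_token_count: int = 256,
    load_factor: float = 1.25,   # assumed value, for illustration only
    virtual_nodes: int = 64,
) -> str:
    # Ring is rebuilt per call for brevity; a real router would cache it.
    ring = sorted(
        (_hash(f"{w}#{i}"), w) for w in loads for i in range(virtual_nodes)
    )
    points = [point for point, _ in ring]

    # Only the first N tokens contribute to the routing key.
    key = ",".join(map(str, tokens[:prefix_token_count]))
    start = bisect.bisect_right(points, _hash(key))

    # Assumed override: skip workers loaded above load_factor x average.
    limit = load_factor * max(1.0, sum(loads.values()) / len(loads))
    for step in range(len(ring)):
        worker = ring[(start + step) % len(ring)][1]
        if loads[worker] <= limit:
            return worker
    return ring[start % len(ring)][1]
```

Requests that share their first N tokens map to the same worker while it stays under the limit, which is why this gives prefix grouping rather than the exact matching of `cache_aware`.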

Limitations¶

Prefix grouping, not exact matching
Less precise than cache_aware
Load factor can cause redistribution

Comparison with Cache-Aware¶

Aspect	prefix_hash	cache_aware
Lookup	O(log n)	O(prefix_len)
Memory	O(workers × virtual_nodes)	O(total_tokens)
Precision	Prefix grouping	Exact matching

Use when: Need some cache locality with predictable performance and lower memory footprint.

Manual¶

Provides sticky session routing with explicit routing key mapping. Unlike consistent hashing, sessions stay with their assigned worker even when new workers are added.

smg --policy manual --assignment-mode min_load --worker-urls http://w1:8000 http://w2:8000

Limitations¶

No load balancing for existing sessions
Requires X-SMG-Routing-Key header
Memory grows with active sessions

Assignment Modes¶

Mode	Description
`random`	Randomly select from healthy workers
`min_load`	Select worker with fewest active requests
`min_group`	Select worker with fewest routing keys assigned
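The three assignment modes and the sticky behavior can be sketched as follows. This is a simplified illustration with hypothetical names (`ManualRouter`, `active_requests`); it ignores worker health and removal, which a real router would have to handle.

```python
import random
from typing import Dict, List


class ManualRouter:
    """Sticky routing: a key keeps its worker once assigned."""

    def __init__(self, workers: List[str], assignment_mode: str = "min_load") -> None:
        self.workers = list(workers)
        self.mode = assignment_mode
        self.assignments: Dict[str, str] = {}  # routing key -> worker
        self.active_requests: Dict[str, int] = {w: 0 for w in self.workers}

    def add_worker(self, worker: str) -> None:
        self.workers.append(worker)
        self.active_requests[worker] = 0

    def route(self, routing_key: str) -> str:
        # Existing sessions never move, even after the pool grows.
        if routing_key in self.assignments:
            return self.assignments[routing_key]
        if self.mode == "random":
            worker = random.choice(self.workers)
        elif self.mode == "min_load":
            worker = min(self.workers, key=lambda w: self.active_requests[w])
        elif self.mode == "min_group":
            counts = {w: 0 for w in self.workers}
            for assigned in self.assignments.values():
                counts[assigned] += 1
            worker = min(self.workers, key=lambda w: counts[w])
        else:
            raise ValueError(f"unknown assignment mode: {self.mode}")
        self.assignments[routing_key] = worker
        return worker
```

The `assignments` map is what makes memory grow with the number of active sessions, as noted in the limitations.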

Use when: Stateful chat sessions where context is stored on workers, or when session continuity is critical.

Round Robin¶

Rotates through workers sequentially, guaranteeing even distribution over time. Skips unhealthy workers automatically.

smg --policy round_robin --worker-urls http://w1:8000 http://w2:8000
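A minimal sketch of rotation with health skipping is shown below. The `RoundRobin` class and the `healthy` mapping are hypothetical; how SMG tracks worker health is not covered here.

```python
from typing import Dict, List, Optional


class RoundRobin:
    def __init__(self, workers: List[str]) -> None:
        self.workers = list(workers)
        self._next = 0

    def pick(self, healthy: Dict[str, bool]) -> Optional[str]:
        """Return the next healthy worker, or None if none is healthy."""
        # Try each worker at most once per call.
        for _ in range(len(self.workers)):
            worker = self.workers[self._next]
            self._next = (self._next + 1) % len(self.workers)
            if healthy.get(worker, False):
                return worker
        return None
```

With `w2` marked unhealthy in a pool of three, successive calls alternate between `w1` and `w3`.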

Limitations¶

No load awareness
No cache locality
Ignores request characteristics

Use when: All workers have equal capacity and you want predictable, even distribution.

Random¶

The simplest policy: each healthy worker has an equal probability of selection, and the router keeps no per-worker state.

smg --policy random --worker-urls http://w1:8000 http://w2:8000

Limitations¶

No load awareness
No cache locality
Can create hot spots

Use when: Testing environments or completely homogeneous workloads where simplicity is preferred.

Choosing a Policy¶

Decision Guide¶

Requirement	Recommended Policy
Production LLM inference	`cache_aware`
Session affinity (sticky sessions)	`manual` or `consistent_hashing`
PD disaggregation	`bucket`
Load balancing without cache	`power_of_two`
Lightweight cache locality	`prefix_hash`
Even distribution	`round_robin`
Testing/development	`random`

Scenario Guide¶

RAG Applications¶

Recommended: cache_aware

Exploits common document prefixes for faster Time to First Token.

Multi-Tenant Platform¶

Recommended: consistent_hashing or manual

User-to-worker affinity for tenant isolation or stateful sessions.

PD Disaggregation¶

Recommended: bucket (prefill) + power_of_two (decode)

Length-based routing for prefill, load-based for decode workers.

What's Next?¶

Circuit Breakers¶

How SMG handles worker failures gracefully.

Circuit Breakers →