Retries¶
SMG implements automatic retries with exponential backoff to handle transient failures gracefully without overwhelming recovering services.
Overview¶
Exponential Backoff¶
Space out retry attempts with increasing delays to give services time to recover.
Jitter¶
Add randomness to backoff timing to prevent thundering herd problems.
Smart Selection¶
Retry only on transient error codes, where a repeat attempt is likely to succeed.
Why Retries?¶
Transient failures are common in distributed systems:
- Network timeouts: Temporary network congestion or packet loss
- Worker overload: Temporary capacity limits (429 responses)
- Intermittent errors: Brief service interruptions during deployments
- Connection issues: Worker restart or network partition
Without retries, every transient failure becomes a client-visible error. With retries, SMG handles these automatically.
Exponential Backoff with Jitter¶
SMG uses exponential backoff with jitter to space out retry attempts:
delay = initial_backoff_ms * (backoff_multiplier ^ attempt)
delay = min(delay, max_backoff_ms)
delay = delay * (1 + random(-jitter_factor, +jitter_factor))
Example Progression¶
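With default settings (jitter omitted for clarity). The first retry uses an exponent of 0, so its delay equals initial_backoff_ms; fractional milliseconds are truncated: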
| Attempt | Calculated Delay |
|---|---|
| 1 | 50ms |
| 2 | 75ms |
| 3 | 112ms |
| 4 | 168ms |
| 5 | 253ms |
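The table above can be reproduced with a short sketch. This is an illustration of the formula, not SMG's actual implementation: the function name `backoff_delay_ms` is hypothetical, and it assumes retries are numbered from 1 as in the table, so the exponent is `retry_number - 1`.

```python
def backoff_delay_ms(retry_number, initial_backoff_ms=50,
                     backoff_multiplier=1.5, max_backoff_ms=30000):
    """Un-jittered delay before a given retry (1-based, as in the table)."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    # Exponential growth: the first retry waits exactly initial_backoff_ms.
    delay = initial_backoff_ms * backoff_multiplier ** (retry_number - 1)
    # Cap the delay so it never exceeds max_backoff_ms.
    return min(delay, max_backoff_ms)


delays = [backoff_delay_ms(n) for n in range(1, 6)]
print([int(d) for d in delays])  # [50, 75, 112, 168, 253]
```

Once the exponential term exceeds `max_backoff_ms`, every later retry waits the capped value.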
Why Jitter?¶
Without jitter, requests that fail simultaneously all retry at exactly the same time, potentially overwhelming the recovering service. Jitter spreads retries out randomly to prevent this "thundering herd" problem.
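As a rough illustration of the jitter step in the formula, the sketch below scales a delay by a random factor in the range `1 ± jitter_factor`. The helper name `apply_jitter` and the use of Python's `random.uniform` are assumptions made for the example; SMG's random source may differ.

```python
import random


def apply_jitter(delay_ms, jitter_factor=0.2, rng=random):
    """Scale delay_ms by a random factor in [1 - jitter_factor, 1 + jitter_factor]."""
    return delay_ms * (1 + rng.uniform(-jitter_factor, jitter_factor))


# Seeded only so the demonstration is repeatable.
rng = random.Random(42)

# 1000 requests that all failed at the same moment with a 100 ms base delay.
retry_delays = [apply_jitter(100, 0.2, rng) for _ in range(1000)]

# With jitter_factor=0.2 the retries are spread between about 80 ms and 120 ms
# instead of all firing at exactly 100 ms.
print(round(min(retry_delays)), round(max(retry_delays)))
```

A larger `jitter_factor` widens the window over which simultaneous failures are retried, at the cost of less predictable delays.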
Retryable Status Codes¶
SMG automatically retries requests that fail with these status codes:
| Code | Meaning | Why Retryable |
|---|---|---|
| 408 | Request Timeout | Temporary network issue |
| 429 | Too Many Requests | Worker temporarily overloaded |
| 500 | Internal Server Error | Transient server issue |
| 502 | Bad Gateway | Upstream temporarily unavailable |
| 503 | Service Unavailable | Service temporarily down |
| 504 | Gateway Timeout | Upstream timeout |
Requests with other status codes (e.g., 400 Bad Request, 401 Unauthorized) are not retried because they would likely fail again.
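The retry decision described above can be sketched as a set lookup combined with a retry budget check. The names `RETRYABLE_STATUS_CODES` and `should_retry` are hypothetical; only the status codes and the default of 5 retries come from this page.

```python
# Status codes listed as retryable in the table above.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})


def should_retry(status_code, retries_so_far, max_retries=5):
    """Return True if the response is retryable and retry budget remains."""
    return status_code in RETRYABLE_STATUS_CODES and retries_so_far < max_retries


print(should_retry(503, retries_so_far=0))  # True: transient, budget left
print(should_retry(400, retries_so_far=0))  # False: client error, not retried
print(should_retry(503, retries_so_far=5))  # False: retries exhausted
```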
Configuration¶
smg \
--worker-urls http://w1:8000 http://w2:8000 \
--retry-max-retries 5 \
--retry-initial-backoff-ms 50 \
--retry-max-backoff-ms 30000 \
--retry-backoff-multiplier 1.5 \
--retry-jitter-factor 0.2
Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --retry-max-retries | 5 | Maximum number of retry attempts |
| --retry-initial-backoff-ms | 50 | Initial delay before first retry (milliseconds) |
| --retry-max-backoff-ms | 30000 | Maximum backoff delay (milliseconds) |
| --retry-backoff-multiplier | 1.5 | Multiplier applied to delay after each retry |
| --retry-jitter-factor | 0.2 | Random jitter factor (0.0-1.0) to prevent thundering herd |
| --disable-retries | false | Disable automatic retries entirely |
Recommended Configurations¶
High-Availability¶
Balanced retries for production workloads.
smg \
--retry-max-retries 3 \
--retry-initial-backoff-ms 100 \
--retry-backoff-multiplier 2.0
Use when: Production APIs, multi-worker deployments
Batch Processing¶
Aggressive retries for offline workloads.
smg \
--retry-max-retries 10 \
--retry-initial-backoff-ms 100 \
--retry-max-backoff-ms 60000 \
--retry-backoff-multiplier 2.0
Use when: Batch inference, non-interactive pipelines
No Retries¶
Disable retries entirely.
smg --disable-retries
Use when: Client handles retries, testing failure scenarios
Interaction with Circuit Breakers¶
Retries and circuit breakers work together:
| Circuit State | Retry Behavior |
|---|---|
| Closed | Normal retries to the worker |
| Open | Worker skipped; retry goes to different worker |
| Half-Open | Limited test requests; failures don't count against retry budget |
When a circuit is open:
- Requests are rejected immediately (no retry to that worker)
- If other healthy workers exist, the retry goes to them
- If all circuits are open, the request fails
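The open-circuit rules above can be sketched as a filter over the worker list. This is a simplified model with hypothetical names (`CircuitState`, `pick_retry_worker`); it assumes workers without a recorded state are closed and ignores any limit on half-open test requests, so it should not be read as SMG's routing logic.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


def pick_retry_worker(workers, circuit_states):
    """Return the first worker whose circuit is not open, or None if all are open."""
    for worker in workers:
        # Assumption: a worker with no recorded state is treated as closed.
        if circuit_states.get(worker, CircuitState.CLOSED) is not CircuitState.OPEN:
            return worker
    return None  # every circuit is open, so the request fails


workers = ["http://w1:8000", "http://w2:8000"]

# w1's circuit is open, so the retry goes to w2.
print(pick_retry_worker(workers, {"http://w1:8000": CircuitState.OPEN}))

# All circuits are open: no worker is available.
print(pick_retry_worker(workers, {w: CircuitState.OPEN for w in workers}))
```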
Monitoring¶
Metrics¶
| Metric | Description |
|---|---|
| smg_worker_retries_total | Total retry attempts by worker type and endpoint |
| smg_worker_retries_exhausted_total | Requests that exhausted all retries by worker type and endpoint |
| smg_worker_retry_backoff_seconds | Histogram of backoff delays |
Useful PromQL Queries¶
Backoff Distribution¶
# Average backoff delay
rate(smg_worker_retry_backoff_seconds_sum[5m]) /
rate(smg_worker_retry_backoff_seconds_count[5m])
# 99th percentile backoff
histogram_quantile(0.99, sum by (le) (rate(smg_worker_retry_backoff_seconds_bucket[5m])))
Alert Thresholds¶
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Retry rate | >10/sec | >50/sec | Investigate worker health |
| Retry success rate | <80% | <50% | Check for persistent failures |
| Avg backoff | >5s | >15s | Workers may be overloaded |
Tuning Guidelines¶
| Symptom | Potential Adjustment |
|---|---|
| Excessive latency from retries | Reduce --retry-max-retries, decrease backoff times |
| Thundering herd on recovery | Increase --retry-jitter-factor |
| Retries exhausted too quickly | Increase --retry-max-retries, --retry-max-backoff-ms |
| Clients seeing too many errors | Increase retry count, check worker health |
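When tuning for latency, it can help to estimate how much waiting a configuration adds if every retry is used. The sketch below sums the capped backoff delays for a given set of flag values; `total_backoff_ms` is a hypothetical helper, it assumes the first retry waits exactly the initial backoff, and it counts only backoff time, not the time spent on the requests themselves.

```python
def total_backoff_ms(max_retries, initial_backoff_ms=50, backoff_multiplier=1.5,
                     max_backoff_ms=30000, jitter_factor=0.0):
    """Upper bound on cumulative backoff if all retries are exhausted."""
    total = 0.0
    for retry in range(max_retries):
        delay = min(initial_backoff_ms * backoff_multiplier ** retry, max_backoff_ms)
        # Use the upper edge of the jitter range for a worst-case estimate.
        total += delay * (1 + jitter_factor)
    return total


print(total_backoff_ms(5))                    # defaults, no jitter: 659.375
print(total_backoff_ms(3, 100, 2.0))          # High-Availability example: 700.0
print(total_backoff_ms(10, 100, 2.0, 60000))  # Batch Processing example: 102300.0
```

Passing `jitter_factor=0.2` scales each delay by 1.2, which gives the worst case with the default jitter setting.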
What's Next?¶
Health Checks¶
Proactive worker monitoring and failure detection.
Rate Limiting¶
Protect workers from overload with token bucket rate limiting.