Circuit Breakers¶
Circuit breakers prevent cascade failures by stopping traffic to unhealthy workers. They're essential for maintaining system stability when workers fail.
Overview¶
Fail Fast¶
Detect failing workers and stop sending traffic immediately—no wasted requests or timeout waits.
Self-Healing¶
Automatically test recovery and restore traffic when workers become healthy again.
Observable¶
Full Prometheus metrics for state transitions, failure counts, and recovery timing.
Why Circuit Breakers?¶
Without circuit breakers, a failing worker can cause:
- Wasted requests: Requests sent to failing workers time out
- Increased latency: Clients wait for timeouts before retrying
- Resource exhaustion: Connections pile up to dead workers
- Cascade failures: Retry storms overwhelm remaining workers
Circuit breakers fail fast—they detect failing workers and stop sending traffic immediately.
How It Works¶
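Each worker has its own circuit breaker with three states: closed, open, and half-open. In the closed state the worker receives traffic normally while consecutive failures are counted.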
Open State¶
Circuit tripped - requests rejected immediately.
- Worker isolated from pool
- No traffic sent to worker
- Transitions to half-open after timeout
Half-Open State¶
Testing recovery - limited probe requests allowed.
- Success → close circuit
- Any failure → reopen circuit
- Gradual traffic restoration
State Transitions¶
Closed → Open¶
The circuit opens when:
consecutive_failures >= failure_threshold
A single successful request resets consecutive_failures to zero. --cb-window-duration-secs is accepted and validated but is not yet consumed by the state machine: failures are tracked via a running consecutive-failure counter rather than a sliding window.
Open → Half-Open¶
After the circuit has been open for --cb-timeout-duration-secs, it transitions to half-open to test if the worker has recovered.
Half-Open → Closed¶
If --cb-success-threshold consecutive requests succeed, the circuit closes and normal operation resumes.
Half-Open → Open¶
If any request fails during half-open, the circuit immediately reopens.
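The four transitions above can be sketched as a small state machine. This is a minimal illustration, not the smg implementation: the class name, method names, and injectable clock are hypothetical, and the sketch does not cap how many probe requests are admitted while half-open.

```python
import time
from enum import Enum


class State(Enum):
    # Values mirror the 0/1/2 encoding used by smg_worker_cb_state.
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


class CircuitBreaker:
    """Illustrative per-worker breaker (hypothetical, not smg's code)."""

    def __init__(self, failure_threshold=10, success_threshold=3,
                 timeout_secs=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so time can be controlled in tests
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.opened_at = None

    def allow_request(self):
        # Open -> Half-Open once the circuit has been open for timeout_secs.
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.timeout_secs:
                return False
            self.state = State.HALF_OPEN
            self.consecutive_successes = 0
        return True

    def record_success(self):
        # Any success resets the consecutive-failure counter.
        self.consecutive_failures = 0
        if self.state is State.HALF_OPEN:
            self.consecutive_successes += 1
            # Half-Open -> Closed after enough consecutive successes.
            if self.consecutive_successes >= self.success_threshold:
                self.state = State.CLOSED

    def record_failure(self):
        self.consecutive_successes = 0
        if self.state is State.HALF_OPEN:
            # Half-Open -> Open on any failure.
            self._open()
        elif self.state is State.CLOSED:
            self.consecutive_failures += 1
            # Closed -> Open once the threshold is reached.
            if self.consecutive_failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
```

A caller would check `allow_request()` before routing to a worker and report the outcome with `record_success()` or `record_failure()`.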
Configuration¶
smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --cb-failure-threshold 5 \
  --cb-success-threshold 2 \
  --cb-timeout-duration-secs 30 \
  --cb-window-duration-secs 60
Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --cb-failure-threshold | 10 | Consecutive failures before circuit opens |
| --cb-success-threshold | 3 | Successes in half-open state to close circuit |
| --cb-timeout-duration-secs | 60 | Seconds before open circuit transitions to half-open |
| --cb-window-duration-secs | 120 | Accepted and validated (must be > 0) but not yet consumed by the state machine; see note under Closed → Open |
| --disable-circuit-breaker | false | Disable circuit breakers entirely |
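For deployments that generate the command line from code, the parameters can be held in a small config object. The sketch below is a hypothetical helper, not part of smg: the class and field names are assumptions, the defaults are copied from the table, and the only validation is the documented "window must be > 0" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerConfig:
    """Hypothetical holder for the --cb-* flags (not part of smg)."""

    failure_threshold: int = 10
    success_threshold: int = 3
    timeout_duration_secs: int = 60
    window_duration_secs: int = 120  # validated, not yet used by the state machine
    disabled: bool = False

    def __post_init__(self):
        # The docs state only this check; other checks are deliberately omitted.
        if self.window_duration_secs <= 0:
            raise ValueError("window_duration_secs must be > 0")

    def to_args(self):
        """Render the flags as a list suitable for subprocess-style invocation."""
        if self.disabled:
            return ["--disable-circuit-breaker"]
        return [
            "--cb-failure-threshold", str(self.failure_threshold),
            "--cb-success-threshold", str(self.success_threshold),
            "--cb-timeout-duration-secs", str(self.timeout_duration_secs),
            "--cb-window-duration-secs", str(self.window_duration_secs),
        ]
```

The resulting list would be appended to the rest of the `smg` arguments, such as `--worker-urls`.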
Configuration Examples¶
Tolerant Configuration¶
Allow occasional failures before tripping.
smg \
  --cb-failure-threshold 20 \
  --cb-success-threshold 5 \
  --cb-timeout-duration-secs 120
Use when: Flaky workers, network instability, batch processing
Tuning Guidelines¶
| Scenario | Recommendation |
|---|---|
| Flaky workers | Higher failure_threshold, shorter timeout |
| Critical availability | Lower failure_threshold, longer timeout |
| Fast recovery workers | Lower timeout, lower success_threshold |
| Slow recovery workers | Higher timeout, higher success_threshold |
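One rough way to compare settings is to estimate the shortest time a tripped worker stays out of normal rotation. The helper below is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes probes run one after another, each takes `probe_latency_secs`, and all of them succeed.

```python
def recovery_floor_secs(timeout_secs, success_threshold, probe_latency_secs):
    """Lower bound on the time from circuit open to circuit closed.

    The circuit stays open for timeout_secs, then needs success_threshold
    consecutive successful probes before it closes.
    """
    return timeout_secs + success_threshold * probe_latency_secs


# Default flags (60 s timeout, 3 successes) with an assumed 0.5 s probe latency.
default_floor = recovery_floor_secs(60, 3, 0.5)    # 61.5

# Tolerant configuration from above (120 s timeout, 5 successes).
tolerant_floor = recovery_floor_secs(120, 5, 0.5)  # 122.5
```

Real recovery takes longer whenever a probe fails, because any half-open failure reopens the circuit and restarts the timeout.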
Example Scenarios¶
Normal Operation¶
Worker Fails¶
Recovery¶
Monitoring¶
Metrics¶
| Metric | Description |
|---|---|
| smg_worker_cb_state | Current state per worker (0=closed, 1=open, 2=half-open) |
| smg_worker_cb_transitions_total | State transitions by worker and direction |
| smg_worker_cb_consecutive_failures | Current failure count per worker |
| smg_worker_cb_consecutive_successes | Current success count per worker |
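When inspecting a metrics scrape by hand, the numeric state gauge can be decoded into state names. The sketch below parses Prometheus text exposition lines for `smg_worker_cb_state`. It assumes the per-worker label is named `worker` (as in the alert annotation later on this page), that samples carry no timestamp, and that label values contain no `}`; the function name and sample input are hypothetical.

```python
import re

# Encoding taken from the metric description above.
STATE_NAMES = {0: "closed", 1: "open", 2: "half-open"}

_SAMPLE = re.compile(r'^smg_worker_cb_state\{([^}]*)\}\s+(\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def breaker_states(exposition_text):
    """Map each worker label to its decoded circuit state."""
    states = {}
    for line in exposition_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # skips comments, other metrics, and blank lines
        labels = dict(_LABEL.findall(match.group(1)))
        value = int(float(match.group(2)))
        states[labels.get("worker", "")] = STATE_NAMES[value]
    return states


# Made-up sample input for illustration only.
sample = (
    '# TYPE smg_worker_cb_state gauge\n'
    'smg_worker_cb_state{worker="http://w1:8000"} 0\n'
    'smg_worker_cb_state{worker="http://w2:8000"} 1\n'
)
states = breaker_states(sample)
# states == {"http://w1:8000": "closed", "http://w2:8000": "open"}
```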
Useful PromQL Queries¶
Transitions¶
# State transitions rate
rate(smg_worker_cb_transitions_total[5m])
# Consecutive failures per worker
smg_worker_cb_consecutive_failures
Alert Thresholds¶
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Open circuits | 1 worker | All workers | Investigate worker health |
| Transition rate | >10/min | >50/min | Check for flapping |
| Consecutive failures | >5 | >threshold | Worker likely failing |
Alerting Example¶
groups:
  - name: smg-circuit-breakers
    rules:
      - alert: CircuitBreakerOpen
        expr: smg_worker_cb_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.worker }}"
      - alert: AllCircuitsOpen
        expr: count(smg_worker_cb_state == 1) == count(smg_worker_cb_state)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "All worker circuit breakers are open"
Interaction with Other Features¶
Retries¶
When a circuit is open:
- Requests are rejected immediately (no retry to that worker)
- Other workers may be tried if available
When a circuit is half-open:
- Only a limited number of test requests are sent
- Failures don't count against retry budget
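The open-circuit behavior can be pictured as a retry loop that never selects a worker whose circuit is open. This is a simplified sketch under stated assumptions: the function names, the in-order selection policy, and the `is_open` and `send` callables are all hypothetical, and smg's actual routing and retry budget are not modeled.

```python
def pick_worker(workers, is_open, already_tried=()):
    """Return the first untried worker whose circuit is not open, else None."""
    for worker in workers:
        if worker in already_tried or is_open(worker):
            continue
        return worker
    return None


def send_with_retries(workers, is_open, send, max_attempts=3):
    """Try up to max_attempts distinct workers, skipping open circuits."""
    tried = []
    for _ in range(max_attempts):
        worker = pick_worker(workers, is_open, tried)
        if worker is None:
            break  # every remaining worker is open or already tried
        tried.append(worker)
        try:
            return send(worker)
        except Exception:
            continue  # fall through to the next eligible worker
    raise RuntimeError("no worker available")
```

A request to a pool where the first worker's circuit is open goes straight to the next eligible worker, so no attempt is spent on the isolated one.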
Health Checks¶
Circuit breakers and health checks work together:
| Health Check | Circuit Breaker | Worker State |
|---|---|---|
| Passing | Closed | Healthy, receiving traffic |
| Failing | Open | Unhealthy, no traffic |
| Passing | Open | Recovering, limited traffic |
Disabling Circuit Breakers¶
In some cases, you may want to disable circuit breakers:
smg --worker-urls http://w1:8000 --disable-circuit-breaker
What's Next?¶
Health Checks¶
Proactive worker monitoring and failure detection.
Rate Limiting¶
Protect workers from overload with token bucket rate limiting.
Metrics Reference¶
Complete list of circuit breaker metrics.