Multiple Workers¶
SMG can route across many workers simultaneously — local inference servers, remote cloud APIs, or a mix of both. This guide covers how to add workers and balance traffic across them.
Supported Worker Types¶
SMG connects to workers over HTTP or gRPC, and supports both local inference servers and remote API providers:
| Worker Type | Protocol | Example URL |
|---|---|---|
| SGLang | HTTP / gRPC | http://worker:8000 or grpc://worker:50051 |
| vLLM | gRPC | grpc://worker:50051 |
| TensorRT-LLM | gRPC | grpc://worker:50051 |
| OpenAI (GPT) | HTTP | https://api.openai.com |
| Anthropic (Claude) | HTTP | https://api.anthropic.com |
| xAI (Grok) | HTTP | https://api.x.ai |
| Google (Gemini) | HTTP | https://generativelanguage.googleapis.com |
| Any OpenAI-compatible API | HTTP | https://your-provider.com |
Static Workers via CLI¶
Pass multiple URLs to --worker-urls:
smg \
--worker-urls http://worker1:8000 http://worker2:8000 http://worker3:8000 \
--policy round_robin \
--host 0.0.0.0 \
--port 30000For gRPC workers, use the grpc:// scheme and provide --model-path so the gateway can load the tokenizer:
smg \
--worker-urls grpc://worker1:50051 grpc://worker2:50052 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy round_robinSee gRPC Workers for details on what gRPC mode enables.
Cloud API Workers¶
Route to cloud providers by setting --backend and passing the provider URL. API keys are either passed by the caller through the Authorization header (BYOK) or stored on the worker record when registering via the admin API. SMG auto-detects the provider (OpenAI, Anthropic, xAI, Gemini) from the model name and applies the correct API transformations.
export OPENAI_API_KEY=sk-...smg
--backend openai
--worker-urls https://api.openai.com
--host 0.0.0.0
--port 30000
export ANTHROPIC_API_KEY=sk-ant-...
smg
--backend openai
--worker-urls https://api.anthropic.com
--host 0.0.0.0
--port 30000
export XAI_API_KEY=xai-...
smg
--backend openai
--worker-urls https://api.x.ai
--host 0.0.0.0
--port 30000
export GEMINI_API_KEY=...
smg
--backend openai
--worker-urls https://generativelanguage.googleapis.com
--host 0.0.0.0
--port 30000
Dynamic Workers with IGW Mode¶
In Inference Gateway (IGW) mode, SMG starts with no workers and you add or remove them at runtime via the REST API:
smg --enable-igw --host 0.0.0.0 --port 30000Add a worker¶
curl -X POST http://localhost:30000/workers \
-H "Content-Type: application/json" \
-d '{"url": "http://worker1:8000"}'Response:
{
"status": "accepted",
"worker_id": "a1b2c3d4",
"url": "http://worker1:8000",
"location": "/workers/a1b2c3d4",
"message": "Worker addition queued for background processing"
}List workers¶
curl http://localhost:30000/workersRemove a worker¶
curl -X DELETE http://localhost:30000/workers/{worker_id}Worker configuration options¶
The POST /workers endpoint accepts additional fields:
{
"url": "http://worker:8000",
"api_key": "optional-key",
"runtime": "sglang",
"worker_type": "regular",
"priority": 50,
"cost": 1.0,
"labels": {"region": "us-east"}
}| Field | Default | Description |
|---|---|---|
url |
(required) | Worker URL (http://, grpc://, or https:// for cloud) |
api_key |
— | API key for authenticated workers |
runtime |
(auto-detect) | Runtime: sglang, vllm, trtllm, mlx, or external |
worker_type |
regular |
Type: regular, prefill, or decode |
priority |
50 |
Routing priority (0–100, higher = preferred) |
cost |
1.0 |
Cost multiplier for cost-aware routing |
labels |
{} |
Arbitrary metadata |
Verify¶
# List connected workers
curl http://localhost:30000/workers
# Check health
curl http://localhost:30000/health
# Send a request
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'Next Steps¶
- Monitoring — Track request rates, latency, and worker health
- gRPC Workers — Enable tokenization, chat templates, and tool parsing at the gateway
- PD Disaggregation — Separate prefill and decode onto specialized workers