Configuration Reference¶

Complete configuration reference for tuning SMG behavior.

Configuration Methods¶

SMG can be configured through:

Command-line arguments (highest priority)
Environment variables
Default values (lowest priority)
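
The precedence order above can be pictured with a small sketch. The helper name `resolve_setting` is hypothetical and is not part of SMG; it only illustrates "CLI argument first, then environment variable, then default" for an option that has an environment variable.

```python
import os


def resolve_setting(cli_value, env_name, default, environ=None):
    """Return the first value that is set: CLI argument, environment variable, default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value              # command-line argument: highest priority
    if env_name is not None and env_name in environ:
        return environ[env_name]      # environment variable: middle priority
    return default                    # built-in default: lowest priority


# Illustrative environment; REDIS_POOL_MAX maps to --redis-pool-max-size.
env = {"REDIS_POOL_MAX": "8"}
print(resolve_setting("32", "REDIS_POOL_MAX", "16", env))  # CLI value is used
print(resolve_setting(None, "REDIS_POOL_MAX", "16", env))  # environment value is used
print(resolve_setting(None, "REDIS_POOL_MAX", "16", {}))   # default is used
```

Options listed with `-` in their Environment row have no environment variable, so only the first and last steps apply to them.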

Worker Configuration¶

Host¶

Network interface to bind to.

Option	`--host`
Environment	-
Default	`0.0.0.0`

Value	Description
`127.0.0.1`	Localhost only
`0.0.0.0`	All IPv4 interfaces
`::`	All IPv6 interfaces
`::1`	IPv6 localhost

Port¶

Port for the main API server.

Option	`--port`
Environment	-
Default	`30000`

Worker URLs¶

List of worker URLs to route requests to.

Option	`--worker-urls`
Environment	-
Default	Empty
Format	Space-separated URLs

Examples:

--worker-urls http://worker1:8000 http://worker2:8000
--worker-urls http://[::1]:8000 http://192.168.1.1:8000  # IPv6 and IPv4
--worker-urls grpc://worker1:50051  # gRPC mode

Routing Policy Configuration¶

Load Balancing Policy¶

Controls how requests are distributed across workers.

Option	`--policy`
Environment	-
Default	`cache_aware`
Values	`random`, `round_robin`, `cache_aware`, `power_of_two`, `prefix_hash`, `consistent_hashing`, `bucket`, `manual`

Policy Comparison:

Policy	Use Case	KV Cache	Load Balance
`random`	Simple deployments	Poor	Fair
`round_robin`	Uniform workloads	Poor	Good
`power_of_two`	Variable workloads	Poor	Excellent
`cache_aware`	LLM inference	Excellent	Good
`prefix_hash`	Consistent routing by prefix	Good	Good
`consistent_hashing`	Session affinity via hash ring	Good	Good
`bucket`	Load balancing with bucket boundaries	Poor	Excellent
`manual`	Sticky sessions with LRU eviction	Good	Manual

Recommendation: Use cache_aware for LLM workloads to maximize KV cache hit rates.
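
To make the trade-offs in the comparison table more concrete, here is a minimal sketch of the general power-of-two-choices technique that the `power_of_two` policy is named after. It is an illustration of the textbook algorithm, not SMG's implementation; the function name and the `loads` dictionary are assumptions.

```python
import random


def pick_power_of_two(loads, rng=random):
    """Sample two distinct workers at random and return the less loaded one.

    loads maps a worker URL to its current number of active requests.
    """
    workers = list(loads)
    if not workers:
        raise ValueError("no workers available")
    if len(workers) == 1:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if loads[first] <= loads[second] else second


loads = {"http://w1:8000": 12, "http://w2:8000": 3, "http://w3:8000": 7}
print(pick_power_of_two(loads))
```

Because only two workers are compared per request, the most loaded worker is never chosen when there are at least two workers with distinct loads, which is why the technique balances load well without tracking cache contents.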

Cache-Aware Policy Options¶

Option	Description	Default
`--cache-threshold`	Cache threshold (0.0-1.0) for cache-aware routing	`0.3`
`--balance-abs-threshold`	Absolute threshold for load balancing trigger	`64`
`--balance-rel-threshold`	Relative threshold for load balancing trigger	`1.5`
`--eviction-interval`	Interval in seconds between cache eviction operations	`120`
`--max-tree-size`	Maximum size of the approximation tree	`67108864`
`--block-size`	KV cache block size for event-driven cache-aware routing	`16`

Prefix Hash Policy Options¶

Option	Description	Default
`--prefix-token-count`	Number of prefix tokens to use for hashing	`256`
`--prefix-hash-load-factor`	Load factor threshold for rebalancing	`1.25`

Manual Policy Options¶

Option	Description	Default
`--max-idle-secs`	Maximum idle time before eviction	`14400` (4 hours)
`--assignment-mode`	Mode for new routing key assignment	`random`

Assignment Modes:

random - Assign to a random worker
min_load - Assign to worker with fewest active requests
min_group - Assign to worker with fewest routing keys
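
The three assignment modes can be sketched as a single selection function. The `WorkerState` type and its field names are hypothetical and exist only for this illustration; tie-breaking (first worker wins) is also an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class WorkerState:
    url: str
    active_requests: int  # used by min_load
    routing_keys: int     # used by min_group


def assign_worker(workers, mode, rng=random):
    """Pick a worker for a routing key that has no assignment yet."""
    if not workers:
        raise ValueError("no workers available")
    if mode == "random":
        return rng.choice(workers)
    if mode == "min_load":
        return min(workers, key=lambda w: w.active_requests)
    if mode == "min_group":
        return min(workers, key=lambda w: w.routing_keys)
    raise ValueError(f"unknown assignment mode: {mode}")


workers = [
    WorkerState("http://w1:8000", active_requests=4, routing_keys=10),
    WorkerState("http://w2:8000", active_requests=9, routing_keys=2),
]
print(assign_worker(workers, "min_load").url)
print(assign_worker(workers, "min_group").url)
```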

Advanced Routing Options¶

Option	Description	Default
`--dp-aware`	Enable data parallelism aware scheduling	`false`
`--enable-igw`	Enable IGW (Inference Gateway) mode for multi-model support	`false`
`--dp-minimum-tokens-scheduler`	Enable minimum tokens scheduler for data parallel group	`false`
`--load-monitor-interval`	Interval in seconds between load monitor checks for `power_of_two` routing	`10`

PD Disaggregation Configuration¶

Prefill-Decode disaggregated mode separates prefill and decode operations across different workers.

Enable PD Mode¶

Option	`--pd-disaggregation`
Environment	-
Default	`false`

Prefill Servers¶

Option	`--prefill`
Format	`URL [BOOTSTRAP_PORT]`
Multiple	Yes (specify multiple times)

Examples:

--prefill http://prefill1:30001 9001 \
--prefill http://prefill2:30002 9002 \
--prefill http://prefill3:30003 none  # No bootstrap port
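
As a rough illustration of the `URL [BOOTSTRAP_PORT]` format, the sketch below parses the values that follow one `--prefill` flag. The function name is hypothetical, and treating both an omitted port and the literal `none` as "no bootstrap port" is an assumption based on the examples above.

```python
def parse_prefill(values):
    """Parse the values following one --prefill flag into (url, bootstrap_port)."""
    if not 1 <= len(values) <= 2:
        raise ValueError("expected URL [BOOTSTRAP_PORT]")
    url = values[0]
    bootstrap_port = None
    # Assumption: a missing second value and the literal "none" both mean no port.
    if len(values) == 2 and values[1].lower() != "none":
        bootstrap_port = int(values[1])
    return url, bootstrap_port


print(parse_prefill(["http://prefill1:30001", "9001"]))
print(parse_prefill(["http://prefill3:30003", "none"]))
```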

Decode Servers¶

Option	`--decode`
Format	URL
Multiple	Yes (specify multiple times)

Example:

--decode http://decode1:30003 \
--decode http://decode2:30004

PD-Specific Policies¶

Option	Description	Default
`--prefill-policy`	Specific policy for prefill nodes	Uses main `--policy`
`--decode-policy`	Specific policy for decode nodes	Uses main `--policy`

Worker Startup Configuration¶

Option	Description	Default
`--worker-startup-timeout-secs`	Timeout for worker startup and registration	`1800` (30 min)
`--worker-startup-check-interval`	Interval between worker startup checks	`30`

Service Discovery (Kubernetes)¶

Enable Service Discovery¶

Option	`--service-discovery`
Environment	-
Default	`false`

Note: Enabling service discovery automatically enables IGW mode.

Label Selector¶

Option	`--selector`
Format	`key=value` (space-separated for multiple)

Example:

--selector app=sglang-worker tier=inference
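
The sketch below shows one way to read a multi-value selector. It assumes that every `key=value` pair must match a pod's labels (the AND semantics of Kubernetes equality-based selectors); the helper names are hypothetical.

```python
def parse_selector(pairs):
    """Turn ["app=sglang-worker", "tier=inference"] into a label dictionary."""
    selector = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        selector[key] = value
    return selector


def pod_matches(labels, selector):
    """Assumption: a pod is selected only if it carries every requested label."""
    return all(labels.get(key) == value for key, value in selector.items())


selector = parse_selector(["app=sglang-worker", "tier=inference"])
print(pod_matches({"app": "sglang-worker", "tier": "inference", "zone": "a"}, selector))
print(pod_matches({"app": "sglang-worker"}, selector))
```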

Namespace¶

Option	`--service-discovery-namespace`
Environment	-
Default	All namespaces

Worker Port¶

Option	`--service-discovery-port`
Environment	-
Default	`80`

PD Service Discovery Selectors¶

Option	Description
`--prefill-selector`	Label selector for prefill server pods
`--decode-selector`	Label selector for decode server pods

HA Mesh Router Discovery¶

Option	Description
`--router-selector`	Label selector for router pod discovery in HA mesh mode (format: `key=value`)

Per-Worker Model ID Override¶

Option	Description
`--model-id-from`	Override each worker's `model_id` from pod metadata. Accepted values: `namespace`, `label:<key>`, or `annotation:<key>`.
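
The three accepted value forms can be illustrated with a small resolver. This is a sketch only: the function name and the choice to return `None` when a label or annotation is absent are assumptions, not documented SMG behavior.

```python
def resolve_model_id(spec, namespace, labels, annotations):
    """Resolve a --model-id-from value against a pod's metadata."""
    if spec == "namespace":
        return namespace
    kind, sep, key = spec.partition(":")
    if sep and key:
        if kind == "label":
            return labels.get(key)       # None if the label is missing (assumption)
        if kind == "annotation":
            return annotations.get(key)  # None if the annotation is missing (assumption)
    raise ValueError(f"unsupported --model-id-from value: {spec!r}")


labels = {"model": "llama-3-8b"}
annotations = {"example.com/model-id": "llama-3-8b-instruct"}
print(resolve_model_id("namespace", "inference", labels, annotations))
print(resolve_model_id("label:model", "inference", labels, annotations))
print(resolve_model_id("annotation:example.com/model-id", "inference", labels, annotations))
```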

Tokenizer Configuration¶

Model Path¶

Option	`--model-path`
Environment	-
Default	None
Description	HuggingFace model ID or local path for loading tokenizer

Tokenizer Path¶

Option	`--tokenizer-path`
Environment	-
Default	None
Description	Explicit tokenizer path (overrides model_path tokenizer)

Chat Template¶

Option	`--chat-template`
Environment	-
Default	None
Description	Path to chat template file

Disable Tokenizer Autoload¶

Option	`--disable-tokenizer-autoload`
Environment	-
Default	`false`
Description	Disable automatic tokenizer loading at startup and during worker registration. Useful when tokenizers are loaded on-demand via the API.

Tokenizer Cache (L0 - Exact Match)¶

Option	Description	Default
`--tokenizer-cache-enable-l0`	Enable L0 exact match cache	`false`
`--tokenizer-cache-l0-max-entries`	Maximum entries in L0 cache	`10000`

Tokenizer Cache (L1 - Prefix Matching)¶

Option	Description	Default
`--tokenizer-cache-enable-l1`	Enable L1 prefix matching cache	`false`
`--tokenizer-cache-l1-max-memory`	Maximum memory for L1 cache (bytes)	`52428800` (50MB)

Parser Configuration¶

Reasoning Parser¶

Option	`--reasoning-parser`
Environment	-
Default	None
Values	`deepseek-r1`, `qwen3`, etc.
Description	Parser for reasoning models with thinking tokens

Tool Call Parser¶

Option	`--tool-call-parser`
Environment	-
Default	None
Values	`json`, `qwen`, etc.
Description	Parser for tool-call/function-calling interactions

MCP Configuration¶

MCP Config Path¶

Option	`--mcp-config-path`
Environment	-
Default	None
Description	Path to MCP (Model Context Protocol) server configuration file

Backend Configuration¶

Backend Runtime¶

Option	`--backend`
Environment	-
Default	None (auto-detected)
Values	`sglang`, `vllm`, `trtllm`, `openai`, `anthropic`, `gemini`

History Backend¶

Option	`--history-backend`
Environment	-
Default	`memory`
Values	`memory`, `none`, `oracle`, `postgres`, `redis`

Storage Configuration¶

Oracle Database¶

Option	Environment	Description	Default
`--oracle-wallet-path`	`ATP_WALLET_PATH`	Path to Oracle ATP wallet directory	-
`--oracle-tns-alias`	`ATP_TNS_ALIAS`	Oracle TNS alias from tnsnames.ora	-
`--oracle-dsn`	`ATP_DSN`	Oracle connection descriptor/DSN	-
`--oracle-user`	`ATP_USER`	Oracle database username	-
`--oracle-password`	`ATP_PASSWORD`	Oracle database password	-
`--oracle-external-auth`	`ATP_EXTERNAL_AUTH`	Enable Oracle external authentication	`false`
`--oracle-pool-min`	`ATP_POOL_MIN`	Minimum connection pool size	`1`
`--oracle-pool-max`	`ATP_POOL_MAX`	Maximum connection pool size	`16`
`--oracle-pool-timeout-secs`	`ATP_POOL_TIMEOUT_SECS`	Pool timeout in seconds	`30`

PostgreSQL Database¶

Option	Environment	Description	Default
`--postgres-db-url`	`POSTGRES_DB_URL`	PostgreSQL connection URL	-
`--postgres-pool-max-size`	`POSTGRES_POOL_MAX`	Maximum pool size	`16`

Redis Database¶

Option	Environment	Description	Default
`--redis-url`	`REDIS_URL`	Redis connection URL	-
`--redis-pool-max-size`	`REDIS_POOL_MAX`	Maximum pool size	`16`
`--redis-retention-days`	`REDIS_RETENTION_DAYS`	Data retention in days (`-1` keeps data indefinitely)	`30`

WASM Configuration¶

Enable WebAssembly¶

Option	`--enable-wasm`
Environment	-
Default	`false`
Description	Enable WebAssembly support

Storage Hook WASM Component¶

Option	`--storage-hook-wasm-path`
Environment	-
Default	None
Description	Path to a WASM component implementing storage hooks. When set, wraps all storage backends with hook-based interceptors.

Schema Config File¶

Option	`--schema-config`
Environment	-
Default	None
Description	Path to a YAML schema config file for storage table/column remapping.

WebRTC Configuration¶

Option	Description	Default
`--webrtc-bind-addr`	Bind address for WebRTC UDP sockets (client-facing ICE candidate IP). Set to `127.0.0.1` for local development on the same machine.	`0.0.0.0` (auto-detect via routing table)
`--webrtc-stun-server`	STUN server for ICE candidate gathering (`host:port`). Set to your own STUN server for enterprise deployments that restrict outbound traffic to external STUN servers.	`stun.l.google.com:19302`

Mesh Server Configuration¶

High-availability mesh networking for multi-router coordination.

Option	Description	Default
`--enable-mesh`	Enable mesh server for HA multi-router coordination. Requires at least two SMG instances.	`false`
`--mesh-server-name`	Name for this mesh node. If not set, a random name is generated (e.g., `Mesh_a1b2`).	Auto-generated
`--mesh-host`	Bind address for the mesh server.	`0.0.0.0`
`--mesh-advertise-host`	Routable address advertised to other mesh peers. Required when `--mesh-host` is an unspecified bind address such as `0.0.0.0`.	Value of `--mesh-host`
`--mesh-port`	Port for the mesh server.	`39527`
`--mesh-peer-urls`	Peer mesh node addresses to join (format: `host:port`). Used for initial cluster formation.	(none)

Example:

smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527

Request Handling Configuration¶

Request Timeout¶

Option	`--request-timeout-secs`
Environment	-
Default	`1800` (30 minutes)
Description	Maximum time for request processing

Shutdown Grace Period¶

Option	`--shutdown-grace-period-secs`
Environment	-
Default	`180` (3 minutes)
Description	Time to wait for in-flight requests during shutdown

Maximum Payload Size¶

Option	`--max-payload-size`
Environment	-
Default	`536870912` (512MB)
Description	Maximum request payload size in bytes

CORS Configuration¶

Option	`--cors-allowed-origins`
Environment	-
Default	Empty
Format	Space-separated URLs

Example:

--cors-allowed-origins http://localhost:3000 https://example.com

Request ID Headers¶

Option	`--request-id-headers`
Environment	-
Default	None (uses common defaults)
Description	Custom HTTP headers to check for request IDs

Example:

--request-id-headers x-request-id x-trace-id x-correlation-id

Storage Context Headers¶

Option	`--storage-context-headers`
Environment	-
Default	Empty
Format	Space-separated `header=context_key` entries
Description	Maps request headers into storage hook request context

Example:

--storage-context-headers x-tenant-id=tenant_id x-user-id=user_id

This lets storage hooks read values such as tenant_id and user_id from the request context without hard-coding specific headers in the gateway.

Only map headers that are injected or sanitized by a trusted upstream. Client-supplied headers can otherwise spoof storage hook request context values.
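
The mapping from request headers to storage hook context keys can be sketched as follows. The helper names are hypothetical, and matching header names case-insensitively reflects how HTTP headers work in general rather than a documented SMG detail.

```python
def parse_header_mappings(entries):
    """Turn ["x-tenant-id=tenant_id", ...] into {header_name: context_key}."""
    mappings = {}
    for entry in entries:
        header, sep, context_key = entry.partition("=")
        if not sep or not header or not context_key:
            raise ValueError(f"expected header=context_key, got {entry!r}")
        mappings[header.lower()] = context_key  # HTTP header names are case-insensitive
    return mappings


def build_storage_context(request_headers, mappings):
    """Copy only the mapped headers into the storage hook request context."""
    context = {}
    for name, value in request_headers.items():
        context_key = mappings.get(name.lower())
        if context_key is not None:
            context[context_key] = value
    return context


mappings = parse_header_mappings(["x-tenant-id=tenant_id", "x-user-id=user_id"])
headers = {"X-Tenant-Id": "acme", "X-User-Id": "u-42", "Accept": "application/json"}
print(build_storage_context(headers, mappings))
```

Unmapped headers such as `Accept` never reach the context, which keeps the set of values a hook can see explicit.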

Rate Limiting Configuration¶

Concurrent Request Limit¶

Option	`--max-concurrent-requests`
Environment	-
Default	`-1` (unlimited)
Range	`-1` (unlimited) or any integer `1` or greater

Sizing Guide:

max_concurrent_requests = num_workers * requests_per_worker_capacity

Worker GPU Memory	Suggested per Worker
16GB	4-8
40GB	8-16
80GB	16-32
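
As a worked example of the sizing formula, the sketch below combines it with the suggested per-worker ranges from the table. The names `SUGGESTED_PER_WORKER` and `suggest_max_concurrent` are illustrative only.

```python
# Suggested requests per worker, keyed by GPU memory in GB (from the table above).
SUGGESTED_PER_WORKER = {16: (4, 8), 40: (8, 16), 80: (16, 32)}


def suggest_max_concurrent(num_workers, gpu_memory_gb):
    """Return the (low, high) range for --max-concurrent-requests."""
    low, high = SUGGESTED_PER_WORKER[gpu_memory_gb]
    # max_concurrent_requests = num_workers * requests_per_worker_capacity
    return num_workers * low, num_workers * high


low, high = suggest_max_concurrent(num_workers=4, gpu_memory_gb=80)
print(f"--max-concurrent-requests between {low} and {high}")
```

For four 80GB workers this gives a range of 64 to 128; the right point within the range depends on model size and sequence lengths, which the table does not capture.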

Queue Configuration¶

Option	Description	Default
`--queue-size`	Maximum requests waiting when rate limit reached	`100`
`--queue-timeout-secs`	Maximum time a request can wait in queue	`60`

Token Bucket Rate Limiting¶

Option	`--rate-limit-tokens-per-second`
Environment	-
Default	Same as `--max-concurrent-requests`
Description	Token bucket refill rate
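
For readers unfamiliar with the technique, here is a minimal sketch of a generic token bucket. It is not SMG's implementation: the class name, the explicit `now` timestamps (used to keep the example deterministic), and the separate `capacity` parameter are assumptions.

```python
class TokenBucket:
    """Generic token bucket: each request consumes one token; tokens refill over time."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)  # start full
        self.last_refill = now

    def try_acquire(self, now, cost=1.0):
        # Add the tokens earned since the last call, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_second=1.0)
print([bucket.try_acquire(now=0.0) for _ in range(3)])  # third request finds the bucket empty
print(bucket.try_acquire(now=1.0))                      # one token has refilled after 1 second
```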

Retry Configuration¶

Retry Options¶

Option	Description	Default
`--retry-max-retries`	Maximum retry attempts	`5`
`--retry-initial-backoff-ms`	Initial backoff delay (ms)	`50`
`--retry-max-backoff-ms`	Maximum backoff delay (ms)	`30000`
`--retry-backoff-multiplier`	Exponential backoff multiplier	`1.5`
`--retry-jitter-factor`	Jitter factor (0.0-1.0)	`0.2`
`--disable-retries`	Disable automatic retries	`false`

Backoff Formula:

delay = min(initial_backoff * multiplier^attempt, max_backoff) * (1 + random(0, jitter_factor))
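
The formula can be evaluated directly. The sketch below uses the documented defaults and assumes that `attempt` starts at 0 for the first retry, which the formula itself does not state; the function name is illustrative.

```python
import random


def backoff_delay_ms(attempt, initial_ms=50, multiplier=1.5, max_ms=30000,
                     jitter_factor=0.2, rng=random):
    """delay = min(initial * multiplier^attempt, max) * (1 + random(0, jitter_factor))"""
    base = min(initial_ms * multiplier ** attempt, max_ms)
    return base * (1 + rng.uniform(0, jitter_factor))


# With jitter disabled, the first retries wait 50, 75, 112.5, 168.75 and 253.125 ms.
for attempt in range(5):
    print(attempt, backoff_delay_ms(attempt, jitter_factor=0.0))
```

With the default jitter factor of `0.2`, each delay lands between the base value and 1.2 times the base value, and no base delay exceeds `--retry-max-backoff-ms`.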

Circuit Breaker Configuration¶

Option	Description	Default
`--cb-failure-threshold`	Failures before circuit opens	`10`
`--cb-success-threshold`	Successes needed to close in half-open state	`3`
`--cb-timeout-duration-secs`	Time before attempting recovery	`60`
`--cb-window-duration-secs`	Sliding window for tracking failures	`120`
`--disable-circuit-breaker`	Disable circuit breaker	`false`

Circuit Breaker States:

Closed: Normal operation, tracking failures
Open: All requests fail fast, circuit tripped
Half-Open: Testing if service recovered
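
The three states and the options above fit together roughly as in the following sketch. It is a simplified model under stated assumptions (a failure in the half-open state reopens the circuit, and timestamps are passed in explicitly); it is not SMG's implementation.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=10, success_threshold=3,
                 timeout_secs=60, window_secs=120):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.window_secs = window_secs
        self.state = "closed"
        self.failures = deque()       # timestamps of recent failures
        self.half_open_successes = 0
        self.opened_at = None

    def allow_request(self, now):
        # Open -> Half-Open once the timeout has elapsed.
        if self.state == "open" and now - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
            self.half_open_successes = 0
        return self.state != "open"

    def record_success(self, now):
        if self.state == "half_open":
            self.half_open_successes += 1
            if self.half_open_successes >= self.success_threshold:
                self.state = "closed"  # Half-Open -> Closed
                self.failures.clear()

    def record_failure(self, now):
        if self.state == "half_open":
            self._open(now)            # assumption: any failure while testing reopens
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_secs:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._open(now)            # Closed -> Open

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()


breaker = CircuitBreaker(failure_threshold=2, success_threshold=1, timeout_secs=10)
breaker.record_failure(now=0)
breaker.record_failure(now=1)
print(breaker.state, breaker.allow_request(now=5), breaker.allow_request(now=11), breaker.state)
```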

Health Check Configuration¶

Option	Description	Default
`--health-failure-threshold`	Failures before marking unhealthy	`3`
`--health-success-threshold`	Successes before marking healthy	`2`
`--health-check-timeout-secs`	Timeout for health check requests	`5`
`--health-check-interval-secs`	Interval between health checks	`60`
`--health-check-endpoint`	Health check endpoint path	`/health`
`--disable-health-check`	Disable all health checks	`false`
`--remove-unhealthy-workers`	Remove workers from the registry when marked unhealthy by health checks. Useful for ephemeral worker pools where failed workers should be deregistered.	`false`

Prometheus Metrics Configuration¶

Metrics Server¶

Option	Description	Default
`--prometheus-port`	Port for Prometheus metrics endpoint	`29000`
`--prometheus-host`	Host for Prometheus metrics server	`0.0.0.0`
`--prometheus-duration-buckets`	Custom histogram buckets	Default buckets

Example:

--prometheus-duration-buckets 0.001 0.005 0.01 0.025 0.05 0.1 0.25 0.5 1.0 2.5 5.0 10.0

OpenTelemetry Configuration¶

Enable Tracing¶

Option	`--enable-trace`
Environment	-
Default	`false`

OTLP Endpoint¶

Option	`--otlp-traces-endpoint`
Environment	-
Default	`localhost:4317`
Format	`host:port`

Example:

smg --enable-trace --otlp-traces-endpoint jaeger:4317

TLS/mTLS Security Configuration¶

Server TLS¶

For HTTPS on the gateway:

Option	Description
`--tls-cert-path`	Path to server certificate (PEM format)
`--tls-key-path`	Path to server private key (PEM format)

Client mTLS¶

For secure communication to workers (Python bindings):

Option	Description
`--client-cert-path`	Path to client certificate
`--client-key-path`	Path to client private key
`--ca-cert-paths`	Path(s) to CA certificate(s)

Control Plane Authentication¶

API Key (Worker Authorization)¶

Option	`--api-key`
Environment	-
Default	None
Description	API key for worker authorization (useful with dp-aware scheduling)

Control Plane API Keys¶

Option	`--control-plane-api-keys`
Environment	`CONTROL_PLANE_API_KEYS`
Format	`id:name:role:key`
Multiple	Yes

Example:

--control-plane-api-keys 'key1:Admin:admin:secret123' 'key2:ReadOnly:user:secret456'
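
The `id:name:role:key` format can be unpacked as in the sketch below. The type and function names are hypothetical, and splitting on only the first three colons (so that the secret itself may contain `:`) is an assumption rather than documented behavior.

```python
from typing import NamedTuple


class ControlPlaneKey(NamedTuple):
    id: str
    name: str
    role: str
    key: str


def parse_control_plane_key(entry):
    """Split one id:name:role:key entry into its four fields."""
    parts = entry.split(":", 3)  # assumption: the key may itself contain colons
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected id:name:role:key")
    return ControlPlaneKey(*parts)


print(parse_control_plane_key("key1:Admin:admin:secret123"))
```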

JWT/OIDC Authentication¶

Option	Environment	Description
`--jwt-issuer`	`JWT_ISSUER`	OIDC issuer URL
`--jwt-audience`	`JWT_AUDIENCE`	Expected audience claim
`--jwt-jwks-uri`	`JWT_JWKS_URI`	Explicit JWKS URI (auto-discovered if not set)
`--jwt-role-claim`	-	JWT claim containing role (default: `roles`)
`--jwt-role-mapping`	-	Role mapping from IDP to gateway role

JWT Role Mapping Example:

--jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user'
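
To illustrate how a role mapping might be applied, the sketch below translates identity-provider roles from a decoded token's claims into gateway roles. The helper names are hypothetical, and ignoring unmapped roles is an assumption.

```python
def parse_role_mapping(entries):
    """Turn ["Gateway.Admin=admin", ...] into {idp_role: gateway_role}."""
    mapping = {}
    for entry in entries:
        idp_role, sep, gateway_role = entry.partition("=")
        if not sep or not idp_role or not gateway_role:
            raise ValueError(f"expected idp_role=gateway_role, got {entry!r}")
        mapping[idp_role] = gateway_role
    return mapping


def gateway_roles(claims, mapping, role_claim="roles"):
    """Map the roles found in the configured claim; unmapped roles are ignored (assumption)."""
    raw = claims.get(role_claim, [])
    if isinstance(raw, str):
        raw = [raw]  # tolerate a single role given as a string
    return sorted({mapping[role] for role in raw if role in mapping})


mapping = parse_role_mapping(["Gateway.Admin=admin", "Gateway.User=user"])
claims = {"sub": "alice", "roles": ["Gateway.Admin", "Other.Role"]}
print(gateway_roles(claims, mapping))
```

The `role_claim` parameter corresponds to `--jwt-role-claim`, whose documented default is `roles`.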

Audit Logging¶

Option	`--disable-audit-logging`
Environment	-
Default	`false` (audit logging enabled)

Logging Configuration¶

Log Level¶

Option	`--log-level`
Environment	`RUST_LOG`
Default	`info`
Values	`debug`, `info`, `warn`, `error`

Per-Module Logging:

RUST_LOG=smg=debug,hyper=warn smg ...

Log Directory¶

Option	`--log-dir`
Environment	-
Default	None (console only)
Description	Directory to store log files

JSON Logs¶

Option	`--log-json`
Environment	-
Default	`false`
Description	Output logs as JSON (structured). Defaults to human-readable text logs.

Configuration Examples¶

Minimal Configuration¶

smg --worker-urls http://localhost:8000

High-Throughput Configuration¶

smg \
  --worker-urls http://w1:8000 http://w2:8000 http://w3:8000 http://w4:8000 \
  --policy cache_aware \
  --max-concurrent-requests 200 \
  --queue-size 400 \
  --queue-timeout-secs 60 \
  --retry-max-retries 3

Low-Latency Configuration¶

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --policy power_of_two \
  --max-concurrent-requests 50 \
  --queue-size 25 \
  --queue-timeout-secs 5 \
  --health-check-interval-secs 5 \
  --request-timeout-secs 30

PD Disaggregated Mode¶

smg \
  --pd-disaggregation \
  --prefill http://prefill1:30001 9001 \
  --prefill http://prefill2:30002 9002 \
  --decode http://decode1:30003 \
  --decode http://decode2:30004 \
  --prefill-policy cache_aware \
  --decode-policy round_robin

Kubernetes Service Discovery¶

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware

High-Availability Mesh¶

# Router 1
smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527 \
  --worker-urls http://worker1:8000

# Router 2
smg \
  --enable-mesh \
  --mesh-server-name router-2 \
  --mesh-advertise-host 192.168.1.11 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.10:39527 \
  --worker-urls http://worker2:8000

Secure Production Configuration¶

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --policy cache_aware \
  --max-concurrent-requests 100 \
  --tls-cert-path /etc/certs/server.crt \
  --tls-key-path /etc/certs/server.key \
  --jwt-issuer https://login.microsoftonline.com/tenant/v2.0 \
  --jwt-audience api://smg-gateway \
  --jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user' \
  --enable-trace \
  --otlp-traces-endpoint jaeger:4317 \
  --host 0.0.0.0 \
  --port 443

With Tokenizer and Parsers¶

smg \
  --worker-urls http://localhost:8000 \
  --model-path meta-llama/Llama-3-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser json

With Database Backend¶

# PostgreSQL
smg \
  --worker-urls http://localhost:8000 \
  --history-backend postgres \
  --postgres-db-url "postgres://user:pass@localhost:5432/smg" \
  --postgres-pool-max-size 32

# Redis
smg \
  --worker-urls http://localhost:8000 \
  --history-backend redis \
  --redis-url "redis://localhost:6379" \
  --redis-pool-max-size 32 \
  --redis-retention-days 7

Environment Variable Reference¶

Environment Variable	CLI Option	Description
`RUST_LOG`	`--log-level`	Log level
`ATP_WALLET_PATH`	`--oracle-wallet-path`	Oracle wallet path
`ATP_TNS_ALIAS`	`--oracle-tns-alias`	Oracle TNS alias
`ATP_DSN`	`--oracle-dsn`	Oracle DSN
`ATP_USER`	`--oracle-user`	Oracle username
`ATP_PASSWORD`	`--oracle-password`	Oracle password
`ATP_EXTERNAL_AUTH`	`--oracle-external-auth`	Enable Oracle external authentication
`ATP_POOL_MIN`	`--oracle-pool-min`	Oracle min pool size
`ATP_POOL_MAX`	`--oracle-pool-max`	Oracle max pool size
`ATP_POOL_TIMEOUT_SECS`	`--oracle-pool-timeout-secs`	Oracle pool timeout
`POSTGRES_DB_URL`	`--postgres-db-url`	PostgreSQL URL
`POSTGRES_POOL_MAX`	`--postgres-pool-max-size`	PostgreSQL max pool
`REDIS_URL`	`--redis-url`	Redis URL
`REDIS_POOL_MAX`	`--redis-pool-max-size`	Redis max pool
`REDIS_RETENTION_DAYS`	`--redis-retention-days`	Redis retention
`JWT_ISSUER`	`--jwt-issuer`	JWT issuer URL
`JWT_AUDIENCE`	`--jwt-audience`	JWT audience
`JWT_JWKS_URI`	`--jwt-jwks-uri`	JWKS URI
`CONTROL_PLANE_API_KEYS`	`--control-plane-api-keys`	Control plane API keys