Getting Started¶
Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.
Install¶
Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Intel and Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.
pip install smg
This installs both:
- smg serve (Python orchestration command for workers + gateway)
- smg launch (router launch path in Rust CLI)
cargo install smg
SMG only (gateway/router, no inference engine):
Multi-architecture images are available for x86_64 and ARM64.
docker pull lightseekorg/smg:latest
Available tags: latest (stable), v1.4.x (specific version), nightly (development, from ghcr.io/lightseekorg/smg:nightly).
SMG + Engine (all-in-one, ready to serve models):
Engine images bundle SMG with a specific inference engine (x86_64/CUDA only). Use these when you want a single container that can both route and serve.
# SGLang
docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
# vLLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0
# TensorRT-LLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
Tag format: {smg_version}-{engine}-{engine_version}. Browse all tags at ghcr.io/lightseekorg/smg.
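The tag format can be assembled mechanically, which helps when pinning versions in scripts. The sketch below is a hypothetical helper (engine_image is not part of SMG); it assumes the registry path ghcr.io/lightseekorg/smg shown in the pull commands and the three engine names used on this page.

```shell
#!/usr/bin/env bash
# Build an engine image reference from the documented tag format:
#   {smg_version}-{engine}-{engine_version}
# engine_image is a hypothetical helper name, not an SMG command.
engine_image() {
  local smg_version="$1" engine="$2" engine_version="$3"
  # Accept only the engine names that appear in the pull commands on this page.
  case "$engine" in
    sglang|vllm|trtllm) ;;
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  printf 'ghcr.io/lightseekorg/smg:%s-%s-%s\n' "$smg_version" "$engine" "$engine_version"
}
```

For example, docker pull "$(engine_image 1.4.1 sglang v0.5.10)" resolves to the SGLang image listed above.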
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
# Clone and build
git clone https://github.com/lightseekorg/smg.git
cd smg
cargo build --release
The binary is available at ./target/release/smg.
Choose one of these startup paths.
Option A: All-in-one with smg serve¶
smg serve launches backend worker process(es) and then starts SMG with generated worker URLs.
smg serve \
--backend sglang \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--connection-mode grpc \
--host 0.0.0.0 \
--port 30000
smg serve \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--host 0.0.0.0 \
--port 30000
smg serve \
--backend trtllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--host 0.0.0.0 \
--port 30000
| Option | Default | Description |
|---|---|---|
| --backend | sglang | Inference backend: sglang, vllm, or trtllm |
| --connection-mode | grpc | Worker connection mode: grpc or http (TensorRT-LLM only supports gRPC) |
| --data-parallel-size | 1 | Number of worker replicas (one per GPU) |
| --worker-base-port | 31000 | Base port for worker processes |
| --host | 127.0.0.1 | Router host |
| --port | 8080 | Router port |
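The table implies one invalid combination: the trtllm backend with http connection mode. A minimal pre-flight check, written as a hypothetical wrapper-script helper (check_serve_options is not an SMG command, and SMG may perform its own validation), could look like this. The defaults mirror the table.

```shell
#!/usr/bin/env bash
# Validate a backend / connection-mode pair before calling smg serve.
# check_serve_options is a hypothetical helper; defaults follow the options table.
check_serve_options() {
  local backend="${1:-sglang}"   # documented default backend
  local mode="${2:-grpc}"        # documented default connection mode
  case "$backend" in
    sglang|vllm|trtllm) ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
  case "$mode" in
    grpc|http) ;;
    *) echo "unknown connection mode: $mode" >&2; return 1 ;;
  esac
  # Per the table, TensorRT-LLM only supports gRPC.
  if [ "$backend" = "trtllm" ] && [ "$mode" = "http" ]; then
    echo "trtllm requires --connection-mode grpc" >&2
    return 1
  fi
  return 0
}
```

A wrapper script might call check_serve_options "$BACKEND" "$MODE" and only then run smg serve with the same values.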
Option B: Launch gateway only with smg launch¶
Use this when workers are already running or managed by another platform.
For gRPC workers:
smg launch \
--worker-urls grpc://localhost:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
For HTTP workers:
smg launch \
--worker-urls http://localhost:8000 \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
Step 2: Verify Core Endpoints¶
Health:
curl http://localhost:30000/health
curl http://localhost:30000/readiness
OpenAI-compatible chat completions:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
Responses API:
curl http://localhost:30000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"input": "Say hello in one sentence."
}'
Step 3: Choose Your Setup Track¶
Core Deployment¶
Operations and Security¶
Reliability and Data¶
Advanced Features¶
Worker Startup Recipes (Standalone)¶
Use these when workers are not started via smg serve.
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--tensor-parallel-size 1
python -m tensorrt_llm.commands.serve \
meta-llama/Llama-3.1-8B-Instruct \
--grpc \
--host 0.0.0.0 \
--port 50051 \
--backend pytorch \
--tp_size 1
For prefill-decode disaggregation, start separate prefill and decode workers:
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--grpc-mode \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Start SMG with bootstrap ports for SGLang coordination:
smg launch \
--pd-disaggregation \
--prefill grpc://localhost:50051 8998 \
--decode grpc://localhost:50052 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001 \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
Start SMG with bootstrap ports for SGLang coordination:
smg launch \
--pd-disaggregation \
--prefill http://localhost:8000 8998 \
--decode http://localhost:8001 \
--host 0.0.0.0 \
--port 30000
vLLM uses NIXL for KV cache transfer between prefill and decode workers:
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
Start SMG (no bootstrap ports needed — NIXL handles KV transfer):
smg \
--pd-disaggregation \
--prefill grpc://localhost:50051 \
--decode grpc://localhost:50052 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
Send a Request¶
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50
}'
Expected response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}
Verify Health¶
# Gateway health
curl http://localhost:30000/health
# Worker status
curl http://localhost:30000/workers
Deploy with Docker¶
For local deployment, run SMG in a container and point it at your worker:
docker pull lightseekorg/smg:latest
docker run -d \
--name smg \
-p 30000:30000 \
-p 29000:29000 \
lightseekorg/smg:latest \
--worker-urls http://host.docker.internal:8000 \
--policy cache_aware \
--prometheus-port 29000
Verify:
docker ps | grep smg
curl http://localhost:30000/health
All-in-one with engine images¶
Engine images include both SMG and an inference engine. Use serve to launch workers and the gateway together:
docker run -d --gpus all \
--name smg \
-p 30000:30000 \
-v /path/to/models:/models \
ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 \
serve \
--backend sglang \
--model-path /models/meta-llama/Llama-3.1-8B-Instruct \
--port 30000
Verify:
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
Deploy to Kubernetes (Quick Start)¶
Run SMG in-cluster and use service discovery to pick up worker pods automatically.
Start SMG with service discovery:
smg \
--service-discovery \
--selector app=sglang-worker \
--service-discovery-namespace inference \
--service-discovery-port 8000 \
--policy cache_aware
Required RBAC permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Verify:
kubectl get pods -n inference -l app=sglang-worker
curl http://localhost:30000/workers
Navigate by Category¶
Core Setup¶
- Multiple Workers — connect local or external worker endpoints
- gRPC Workers — gateway-side tokenization, parsing, and tool handling
- PD Disaggregation — split prefill and decode paths
- Service Discovery — Kubernetes pod-based worker registration
Operations¶
- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints
Reliability and Data¶
- Reliability Controls — concurrency limits, retries, and circuit breakers
- Data Connections — history backend setup for Postgres, Redis, and Oracle
- Tokenization and Parsing APIs — tokenize, detokenize, and parser endpoints
Advanced Features¶
- Load Balancing — policy selection and tuning
- Tokenizer Caching — L0/L1 cache setup for gRPC mode
- MCP in Responses API — configure and execute MCP tools through /v1/responses
Troubleshooting¶
Worker connection errors
**Symptoms:** Gateway logs show connection errors.
**Solutions:**
1. Verify the worker is running: `curl http://localhost:8000/health`
2. Check network connectivity between gateway and worker
3. If using Docker, ensure proper network configuration (`--network host` or a shared Docker network)
Request times out
**Symptoms:** Requests hang or return 504 errors.
**Solutions:**
1. Check worker health: `curl http://localhost:30000/workers`
2. Increase timeout: `--request-timeout-secs 120`
3. Check worker logs for errors
Model not found error
**Symptoms:** `model not found` in response.
**Solutions:**
1. The `model` field in requests must match the model loaded on the worker
2. Check available models: `curl http://localhost:30000/v1/models`
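Several of the checks above start with a health probe, and a worker that is still loading a model can fail the first few probes. The following is a minimal retry sketch, not an SMG tool: wait_for_url is a hypothetical helper, and it assumes the health endpoints return a 2xx status once the service is ready, which this page does not state explicitly.

```shell
#!/usr/bin/env bash
# Poll a URL until curl succeeds or the attempts run out.
# wait_for_url is a hypothetical helper; it assumes a healthy endpoint answers with 2xx.
wait_for_url() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -sS keeps output quiet but shows failures.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up waiting for $url after $attempts attempts" >&2
  return 1
}
```

For example, wait_for_url http://localhost:8000/health followed by wait_for_url http://localhost:30000/health checks the worker first and then the gateway, using the ports from the examples on this page.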