Monitoring¶
Set up Prometheus monitoring, OpenTelemetry tracing, and Grafana dashboards for SMG.
Metrics endpoint not responding
Enable Metrics¶
SMG exposes Prometheus metrics on a dedicated port with a 6-layer metric hierarchy.
Start SMG with metrics¶
smg \
--worker-urls http://worker:8000 \
--prometheus-port 29000 \
--prometheus-host 0.0.0.0Verify metrics endpoint¶
curl http://localhost:29000/metricsYou should see Prometheus-formatted metrics:
# HELP smg_http_requests_total Total HTTP requests
# TYPE smg_http_requests_total counter
smg_http_requests_total{method="POST",path="/v1/chat/completions"} 1234
...OpenTelemetry Tracing¶
SMG supports distributed tracing via OpenTelemetry.
Enable tracing¶
smg \
--worker-urls http://worker:8000 \
--enable-trace \
--otlp-traces-endpoint localhost:4317Configuration¶
| Flag | Default | Description |
|---|---|---|
--enable-trace |
false |
Enable OpenTelemetry tracing |
--otlp-traces-endpoint |
localhost:4317 |
OTLP gRPC collector endpoint |
Trace propagation¶
SMG automatically propagates W3C TraceContext headers to workers:
traceparent— Trace ID and span IDtracestate— Vendor-specific trace data
Prometheus Configuration¶
Basic configuration¶
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'smg'
static_configs:
- targets: ['localhost:29000']
metrics_path: /metricsKubernetes ServiceMonitor¶
For Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: smg
namespace: inference
labels:
app: smg
spec:
selector:
matchLabels:
app: smg
endpoints:
- port: metrics
interval: 15s
path: /metrics
namespaceSelector:
matchNames:
- inferenceKey Metrics by Layer¶
Layer 1: HTTP Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_http_requests_total |
Counter | Requests by method, path |
smg_http_request_duration_seconds |
Histogram | Request latency |
smg_http_responses_total |
Counter | Responses by path, status_code, error_code |
smg_http_connections_active |
Gauge | Active connections |
smg_http_rate_limit_total |
Counter | Rate limit decisions |
Layer 2: Router Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_router_requests_total |
Counter | Requests by router_type, model, endpoint |
smg_router_ttft_seconds |
Histogram | Time to first token (gRPC) |
smg_router_tpot_seconds |
Histogram | Time per output token (gRPC) |
smg_router_tokens_total |
Counter | Tokens by type (input/output) |
smg_router_stage_duration_seconds |
Histogram | Pipeline stage durations |
Layer 3: Worker Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_worker_health |
Gauge | Health status (1=healthy, 0=unhealthy) |
smg_worker_requests_active |
Gauge | Active requests per worker |
smg_worker_cb_state |
Gauge | Circuit breaker state |
smg_worker_retries_total |
Counter | Retry attempts |
Layer 5: MCP Metrics¶
| Metric | Type | Description |
|---|---|---|
smg_mcp_tool_calls_total |
Counter | Tool invocations by tool_name, result |
smg_mcp_tool_duration_seconds |
Histogram | Tool execution time |
smg_mcp_servers_active |
Gauge | Active MCP servers |
Grafana Dashboards¶
Essential panels¶
Request Rate
sum(rate(smg_http_requests_total[5m]))P99 Latency
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m]))Error Rate
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
/ sum(rate(smg_http_responses_total[5m]))/v1/responses Success Rate
sum(rate(smg_http_responses_total{path="/v1/responses",status_code=~"2.."}[5m]))
/ sum(rate(smg_http_responses_total{path="/v1/responses"}[5m]))Time to First Token (TTFT)
histogram_quantile(0.5, rate(smg_router_ttft_seconds_bucket[5m]))Tokens per Second
sum(rate(smg_router_tokens_total[5m]))Worker Health
sum(smg_worker_health)Alerting Rules¶
groups:
- name: smg
rules:
- alert: SMGHighErrorRate
expr: |
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
/ sum(rate(smg_http_responses_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on SMG"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: SMGWorkerUnhealthy
expr: smg_worker_health == 0
for: 1m
labels:
severity: warning
annotations:
summary: "SMG worker unhealthy"
description: "Worker {{ $labels.worker }} is unhealthy"
- alert: SMGHighLatency
expr: |
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High latency on SMG"
description: "P99 latency is {{ $value }}s"
- alert: SMGCircuitBreakerOpen
expr: smg_worker_cb_state == 1
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker open"
description: "Circuit breaker for {{ $labels.worker }} is open"
- alert: SMGHighTTFT
expr: |
histogram_quantile(0.95, rate(smg_router_ttft_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High time to first token"
description: "P95 TTFT is {{ $value }}s"
- alert: SMGRateLimitRejections
expr: rate(smg_http_rate_limit_total{result="rejected"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High rate limit rejections"
description: "{{ $value }} rejections/sec"Useful Queries¶
Request analysis¶
# Request rate by endpoint
sum by (path) (rate(smg_http_requests_total[5m]))
# Success rate
sum(rate(smg_http_responses_total{status_code="200"}[5m]))
/ sum(rate(smg_http_responses_total[5m]))
# Latency percentiles
histogram_quantile(0.50, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m]))LLM performance¶
# Tokens per second by model
sum by (model) (rate(smg_router_tokens_total[5m]))
# TTFT by model
histogram_quantile(0.5, sum by (model, le) (rate(smg_router_ttft_seconds_bucket[5m])))
# Input/output token ratio
sum(rate(smg_router_tokens_total{token_type="output"}[5m]))
/ sum(rate(smg_router_tokens_total{token_type="input"}[5m]))Worker analysis¶
# Load distribution
smg_worker_requests_active / ignoring(worker) group_left sum(smg_worker_requests_active)
# Unhealthy workers
count(smg_worker_health == 0)
# Circuit breaker states
count by (worker) (smg_worker_cb_state == 1)MCP tool analysis¶
# Tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m]))
/ sum(rate(smg_mcp_tool_calls_total[5m]))
# Most used tools
topk(10, sum by (tool_name) (rate(smg_mcp_tool_calls_total[5m])))
# Slowest tools
topk(5, histogram_quantile(0.95, sum by (tool_name, le) (rate(smg_mcp_tool_duration_seconds_bucket[5m]))))Verification¶
# Check metrics are being scraped
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="smg")'
# Query a metric
curl -s 'http://prometheus:9090/api/v1/query?query=smg_http_requests_total' | jq
# Check alerts
curl -s http://prometheus:9090/api/v1/alerts | jqTroubleshooting¶
Traces not appearing
1. Verify SMG is running with `--prometheus-port`:
```bash
ps aux | grep smg
```
2. Check the port is listening:
```bash
netstat -tlnp | grep 29000
```
3. Check firewall rules allow accessMissing metrics
1. Verify OTLP endpoint is reachable:
```bash
curl http://localhost:4317
```
2. Check SMG was started with `--enable-trace`
3. Verify collector is receiving spans1. Ensure the feature generating metrics is enabled
2. Some metrics only appear for specific router types (e.g., TTFT is gRPC-only)
3. Verify metric name spelling in queriesWhat's Next?¶
- Configure Logging — Structured log aggregation
- Configure TLS — Secure client-to-gateway traffic
- Metrics Reference — Complete metrics documentation