Monitoring¶

Set up Prometheus monitoring, OpenTelemetry tracing, and Grafana dashboards for SMG.


Enable Metrics¶

SMG exposes Prometheus metrics on a dedicated port. The metrics are organized into a 6-layer hierarchy.

Start SMG with metrics¶

smg \
  --worker-urls http://worker:8000 \
  --prometheus-port 29000 \
  --prometheus-host 0.0.0.0

Verify metrics endpoint¶

curl http://localhost:29000/metrics

You should see Prometheus-formatted metrics:

# HELP smg_http_requests_total Total HTTP requests
# TYPE smg_http_requests_total counter
smg_http_requests_total{method="POST",path="/v1/chat/completions"} 1234
...
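To check the endpoint from a script rather than by eye, the output can be parsed line by line. The following is a minimal Python sketch, assuming the simple `name{labels} value` form shown above. The function name `parse_metrics` is illustrative, and the parser deliberately ignores optional timestamps and escaped quotes in label values.

```python
import re

# Matches `name{labels} value` or `name value`. Simplified on purpose:
# optional timestamps and escaped quotes in label values are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not cover
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


SAMPLE = """\
# HELP smg_http_requests_total Total HTTP requests
# TYPE smg_http_requests_total counter
smg_http_requests_total{method="POST",path="/v1/chat/completions"} 1234
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

For anything beyond a quick check, a maintained parser for the Prometheus exposition format is a safer choice than this regular expression.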

OpenTelemetry Tracing¶

SMG supports distributed tracing via OpenTelemetry.

Enable tracing¶

smg \
  --worker-urls http://worker:8000 \
  --enable-trace \
  --otlp-traces-endpoint localhost:4317

Configuration¶

Flag	Default	Description
`--enable-trace`	`false`	Enable OpenTelemetry tracing
`--otlp-traces-endpoint`	`localhost:4317`	OTLP gRPC collector endpoint

Trace propagation¶

SMG automatically propagates W3C TraceContext headers to workers:

traceparent — Trace ID and span ID
tracestate — Vendor-specific trace data
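The `traceparent` header follows the W3C Trace Context layout `version-traceid-spanid-flags`, with a 32-character trace ID and a 16-character span ID in lowercase hex. The Python sketch below builds and parses such a header, which can help when checking what a worker receives. The function names are illustrative and are not part of SMG, and the validation is simplified.

```python
import re
import secrets

# version (2 hex) - trace ID (32 hex) - span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r'^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$'
)


def make_traceparent(sampled=True):
    """Build a version-00 traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


if __name__ == "__main__":
    header = make_traceparent()
    print(header)
    print(parse_traceparent(header))
```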

Prometheus Configuration¶

Basic configuration¶

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'smg'
    static_configs:
      - targets: ['localhost:29000']
    metrics_path: /metrics

Kubernetes ServiceMonitor¶

For Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: smg
  namespace: inference
  labels:
    app: smg
spec:
  selector:
    matchLabels:
      app: smg
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - inference

Key Metrics by Layer¶

Layer 1: HTTP Metrics¶

Metric	Type	Description
`smg_http_requests_total`	Counter	Requests by method, path
`smg_http_request_duration_seconds`	Histogram	Request latency
`smg_http_responses_total`	Counter	Responses by path, status_code, error_code
`smg_http_connections_active`	Gauge	Active connections
`smg_http_rate_limit_total`	Counter	Rate limit decisions

Layer 2: Router Metrics¶

Metric	Type	Description
`smg_router_requests_total`	Counter	Requests by router_type, model, endpoint
`smg_router_ttft_seconds`	Histogram	Time to first token (gRPC router only)
`smg_router_tpot_seconds`	Histogram	Time per output token (gRPC router only)
`smg_router_tokens_total`	Counter	Tokens by type (input/output)
`smg_router_stage_duration_seconds`	Histogram	Pipeline stage durations

Layer 3: Worker Metrics¶

Metric	Type	Description
`smg_worker_health`	Gauge	Health status (1=healthy, 0=unhealthy)
`smg_worker_requests_active`	Gauge	Active requests per worker
`smg_worker_cb_state`	Gauge	Circuit breaker state
`smg_worker_retries_total`	Counter	Retry attempts

Layer 5: MCP Metrics¶

Metric	Type	Description
`smg_mcp_tool_calls_total`	Counter	Tool invocations by tool_name, result
`smg_mcp_tool_duration_seconds`	Histogram	Tool execution time
`smg_mcp_servers_active`	Gauge	Active MCP servers

For the complete list of metrics, see the Metrics Reference.

Grafana Dashboards¶

Essential panels¶

Request Rate

sum(rate(smg_http_requests_total[5m]))

P99 Latency

histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m]))
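`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below reproduces that estimate for a single histogram. The bucket bounds and counts are invented for illustration, and it assumes the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        return math.nan  # not enough buckets to interpolate
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds, with cumulative request counts.
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (5.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.99, EXAMPLE_BUCKETS))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.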

Error Rate

sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
/ sum(rate(smg_http_responses_total[5m]))
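The error-rate expression divides one counter rate by another. The Python sketch below walks through that arithmetic, assuming two readings of each counter taken a known number of seconds apart. The function names and numbers are illustrative, and the real `rate()` additionally extrapolates to the window boundaries, which is omitted here.

```python
def counter_rate(first, last, window_seconds):
    """Per-second increase between two counter readings."""
    increase = last - first
    if increase < 0:
        # The counter was reset in between, so count from zero again.
        increase = last
    return increase / window_seconds


def error_ratio(error_rate, total_rate):
    """Fraction of responses that were errors, or None when there was no traffic."""
    if total_rate <= 0:
        return None
    return error_rate / total_rate


if __name__ == "__main__":
    window = 300  # seconds, matching the [5m] range in the query
    errors = counter_rate(40, 70, window)         # 5xx responses
    total = counter_rate(10_000, 11_500, window)  # all responses
    print(error_ratio(errors, total))
```

In this example 30 of 1500 responses in the window were errors, so the ratio is about 2%.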

/v1/responses Success Rate

sum(rate(smg_http_responses_total{path="/v1/responses",status_code=~"2.."}[5m]))
/ sum(rate(smg_http_responses_total{path="/v1/responses"}[5m]))

Time to First Token (TTFT)

histogram_quantile(0.5, rate(smg_router_ttft_seconds_bucket[5m]))

Tokens per Second

sum(rate(smg_router_tokens_total[5m]))

Worker Health

sum(smg_worker_health)

Alerting Rules¶

groups:
  - name: smg
    rules:
      - alert: SMGHighErrorRate
        expr: |
          sum(rate(smg_http_responses_total{status_code=~"5.."}[5m]))
          / sum(rate(smg_http_responses_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on SMG"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SMGWorkerUnhealthy
        expr: smg_worker_health == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "SMG worker unhealthy"
          description: "Worker {{ $labels.worker }} is unhealthy"

      - alert: SMGHighLatency
        expr: |
          histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on SMG"
          description: "P99 latency is {{ $value }}s"

      - alert: SMGCircuitBreakerOpen
        expr: smg_worker_cb_state == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open"
          description: "Circuit breaker for {{ $labels.worker }} is open"

      - alert: SMGHighTTFT
        expr: |
          histogram_quantile(0.95, rate(smg_router_ttft_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High time to first token"
          description: "P95 TTFT is {{ $value }}s"

      - alert: SMGRateLimitRejections
        expr: rate(smg_http_rate_limit_total{result="rejected"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit rejections"
          description: "{{ $value }} rejections/sec"
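Each rule above has a `for:` clause, which keeps the alert in a pending state until the expression has held continuously for that duration. The Python sketch below simulates that behavior over a series of evaluations. It is a simplified model, and the function name and inputs are illustrative.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return "inactive", "pending", or "firing" for each rule evaluation.

    condition_by_eval: booleans, True when the alert expression matched.
    eval_interval_s:   seconds between evaluations.
    for_s:             the rule's `for:` duration in seconds.
    """
    states = []
    active_since = None
    for index, matched in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not matched:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


if __name__ == "__main__":
    # 15s evaluation interval with `for: 1m`; one evaluation does not match.
    matches = [True, True, False, True, True, True, True, True]
    for state in alert_states(matches, 15, 60):
        print(state)
```

A single non-matching evaluation restarts the timer, which is why short `for:` values react faster but are more sensitive to brief spikes.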

Useful Queries¶

Request analysis¶

# Request rate by endpoint
sum by (path) (rate(smg_http_requests_total[5m]))

# Success rate (all 2xx responses)
sum(rate(smg_http_responses_total{status_code=~"2.."}[5m]))
/ sum(rate(smg_http_responses_total[5m]))

# Latency percentiles
histogram_quantile(0.50, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(smg_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(smg_http_request_duration_seconds_bucket[5m]))

LLM performance¶

# Tokens per second by model
sum by (model) (rate(smg_router_tokens_total[5m]))

# TTFT by model
histogram_quantile(0.5, sum by (model, le) (rate(smg_router_ttft_seconds_bucket[5m])))

# Output/input token ratio
sum(rate(smg_router_tokens_total{token_type="output"}[5m]))
/ sum(rate(smg_router_tokens_total{token_type="input"}[5m]))

Worker analysis¶

# Load distribution (each worker's share of all active requests)
smg_worker_requests_active / on() group_left sum(smg_worker_requests_active)

# Unhealthy workers
count(smg_worker_health == 0)

# Workers with an open circuit breaker (state 1)
count by (worker) (smg_worker_cb_state == 1)

MCP tool analysis¶

# Tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m]))
/ sum(rate(smg_mcp_tool_calls_total[5m]))

# Most used tools
topk(10, sum by (tool_name) (rate(smg_mcp_tool_calls_total[5m])))

# Slowest tools
topk(5, histogram_quantile(0.95, sum by (tool_name, le) (rate(smg_mcp_tool_duration_seconds_bucket[5m]))))

Verification¶

# Check metrics are being scraped
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="smg")'

# Query a metric
curl -s 'http://prometheus:9090/api/v1/query?query=smg_http_requests_total' | jq

# Check alerts
curl -s http://prometheus:9090/api/v1/alerts | jq
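The same checks can be scripted against the Prometheus HTTP API. The Python sketch below builds an instant-query URL and extracts values from the JSON response using only the standard library. The helper names are illustrative, and `http://prometheus:9090` is the address assumed in the commands above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def extract_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample's value is [timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(base_url, promql, timeout=5.0):
    """Run an instant query and return its samples (needs network access)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_vector(json.load(response))
```

For example, `instant_query("http://prometheus:9090", "smg_http_requests_total")` returns one `(labels, value)` pair per series. An empty list suggests the target is not being scraped yet.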

Troubleshooting¶

Metrics endpoint not responding

1. Verify SMG is running with `--prometheus-port`:
```bash
ps aux | grep smg
```

2. Check the port is listening:
```bash
netstat -tlnp | grep 29000
```

3. Check that firewall rules allow access to the metrics port.

Traces not appearing

1. Verify the OTLP endpoint is reachable:
```bash
curl http://localhost:4317
```
Because this is a gRPC port, curl will typically report a protocol error rather than a normal HTTP response. A "connection refused" error means the collector is not reachable.

2. Check that SMG was started with `--enable-trace`.

3. Verify that the collector is receiving spans.

Missing metrics

1. Ensure the feature that generates the metrics is enabled.

2. Some metrics only appear for specific router types (e.g., TTFT is gRPC-only).

3. Verify metric name spelling in queries.
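The port checks in this section can also be scripted, for example from a container that has no `netstat`. The Python sketch below attempts a TCP connection and reports whether it succeeded. The function name is illustrative, and ports 29000 and 4317 are the values used in the examples on this page.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("localhost", 29000)` checks the metrics port, and `port_is_open("localhost", 4317)` checks the OTLP collector. A successful connection only shows that something is listening, not that it is the expected service.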

What's Next?¶

Configure Logging — Structured log aggregation
Configure TLS — Secure client-to-gateway traffic
Metrics Reference — Complete metrics documentation