Metrics Reference

Complete reference for Prometheus metrics exposed by SMG. Metrics are organized in six layers matching the request lifecycle.


Metrics Endpoint

Metrics are exposed on the Prometheus port (default: 29000):

curl http://localhost:29000/metrics

The same listener also serves a WebSocket stream of real-time metric updates at /ws/metrics (used by the TUI and dashboards that need live state).

Configure via CLI:

smg --prometheus-port 29000 --prometheus-host 0.0.0.0
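
To inspect the endpoint's output without a Prometheus server, the text exposition format can be parsed directly. The following is a minimal sketch, not a full parser: it ignores timestamps and any line it does not understand. The sample payload, including its label values and HELP text, is hypothetical and stands in for a real response from /metrics.

```python
import re

# Hypothetical sample of what /metrics might return; label values are made up.
SAMPLE = '''\
# HELP smg_http_requests_total Total HTTP requests received by the gateway.
# TYPE smg_http_requests_total counter
smg_http_requests_total{method="POST",path="/v1/chat/completions"} 1027
smg_http_requests_total{method="GET",path="/health"} 54
smg_http_connections_active 3
'''

# metric_name, optional {labels}, then a single numeric value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
total_requests = sum(v for name, _, v in samples if name == "smg_http_requests_total")
print(total_requests)
```

In practice the text would come from an HTTP GET against the metrics port rather than from an inline string.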

Layer 1: HTTP Metrics

Metrics for incoming HTTP requests at the gateway edge.

smg_http_requests_total

Total HTTP requests received by the gateway.

Type: Counter
Labels: method, path

# Request rate by endpoint
sum by (path) (rate(smg_http_requests_total[5m]))

# Total request rate
sum(rate(smg_http_requests_total[5m]))
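
Most queries in this reference wrap a counter in rate(). As a rough illustration of what that computes, the sketch below derives a per-second rate from raw counter samples and treats any drop in value as a counter reset (for example, after a gateway restart). It is a simplification: Prometheus's own rate() also extrapolates to the edges of the range window, which is omitted here. The sample values are invented.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, treating a drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter restarted from zero, so the new value is the increase
    return total


def simple_rate(samples):
    """Per-second rate over the time span covered by the samples (no extrapolation)."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Invented scrape results: (seconds, counter value); the drop at t=120 is a reset.
scrapes = [(0, 100), (60, 160), (120, 30), (180, 90), (240, 150), (300, 210)]
print(simple_rate(scrapes))
```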

smg_http_request_duration_seconds

HTTP request duration from receipt to response.

Type: Histogram
Labels: method, path

# P99 latency by endpoint
histogram_quantile(0.99, sum by (path, le) (rate(smg_http_request_duration_seconds_bucket[5m])))

# Average latency
rate(smg_http_request_duration_seconds_sum[5m]) / rate(smg_http_request_duration_seconds_count[5m])
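
The quantile queries above are estimates, not exact percentiles: histogram_quantile() interpolates linearly inside the bucket that contains the requested rank. The sketch below mirrors that behavior for a single set of cumulative buckets, assuming the lowest bucket starts at 0. The bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for four `le` buckets.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

With these counts the 95th percentile lands in the 0.5 to 1.0 bucket and is reported as roughly 0.78 seconds, even though no observation of that exact value need exist. Narrower buckets around the latencies of interest give tighter estimates.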

smg_http_responses_total

HTTP responses by path, status, and error code.

Type: Counter
Labels: path, status_code, error_code

# Error rate (5xx responses)
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m])) / sum(rate(smg_http_responses_total[5m]))

# Success rate (2xx responses)
sum(rate(smg_http_responses_total{status_code=~"2.."}[5m])) / sum(rate(smg_http_responses_total[5m]))

# Success rate for /v1/responses
sum(rate(smg_http_responses_total{path="/v1/responses",status_code=~"2.."}[5m]))
/
sum(rate(smg_http_responses_total{path="/v1/responses"}[5m]))

smg_http_connections_active

Currently active HTTP connections.

Type: Gauge
Labels: None

smg_http_inflight_request_age_count

Distribution of in-flight request ages for Grafana heatmaps.

Type: Gauge
Labels: gt, le

Age buckets (seconds): 30, 60, 180, 300, 600, 1200, 3600, 7200, 14400, 28800, 86400
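
The sketch below shows one way an in-flight request's age could map to a gt/le label pair. It assumes each series covers the half-open range (gt, le], that the first range starts at 0, and that ages beyond the last bound fall into a +Inf range. None of this is stated by the reference itself, so treat the function as an illustration of the bucket list and not as SMG's implementation.

```python
import bisect

# Age bucket bounds in seconds, as listed in this reference.
AGE_BOUNDS = [30, 60, 180, 300, 600, 1200, 3600, 7200, 14400, 28800, 86400]


def age_bucket(age_seconds):
    """Return the (gt, le) label pair for an age, assuming ranges of the form (gt, le]."""
    i = bisect.bisect_left(AGE_BOUNDS, age_seconds)
    gt = "0" if i == 0 else str(AGE_BOUNDS[i - 1])            # assumed lower edge of first range
    le = "+Inf" if i == len(AGE_BOUNDS) else str(AGE_BOUNDS[i])  # assumed overflow range
    return gt, le


print(age_bucket(45))
print(age_bucket(100000))
```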


smg_http_rate_limit_total

Rate limiting decisions.

Type: Counter
Labels: result

Values: allowed, rejected

# Rejection rate
sum(rate(smg_http_rate_limit_total{result="rejected"}[5m])) / sum(rate(smg_http_rate_limit_total[5m]))

Layer 2: Router Metrics

Metrics for request routing and processing.

smg_router_requests_total

Requests processed by the router.

Type: Counter
Labels: router_type, backend_type, connection_mode, model, endpoint, streaming

Router types: openai, http, grpc
Backend types: regular, pd, external, harmony
Endpoints: chat, generate, responses, completions, rerank, embeddings, classify, messages, realtime, realtime_sessions, realtime_client_secrets, realtime_transcription
Streaming: true, false

# Request rate by model
sum by (model) (rate(smg_router_requests_total[5m]))

# Streaming vs non-streaming
sum by (streaming) (rate(smg_router_requests_total[5m]))

smg_router_request_duration_seconds

Total router request duration.

Type: Histogram
Labels: router_type, backend_type, connection_mode, model, endpoint

smg_router_request_errors_total

Router errors by type.

Type: Counter
Labels: router_type, backend_type, connection_mode, model, endpoint, error_type

Error types: no_workers, timeout, backend_error, validation_error, internal_error

# Error rate by type
sum by (error_type) (rate(smg_router_request_errors_total[5m]))

smg_router_stage_duration_seconds

Duration of individual pipeline stages (gRPC mode only).

Type: Histogram
Labels: router_type, stage

Stage names are emitted by the gRPC pipeline (e.g., tokenize, route, inference, detokenize, tool_parse).

# P99 tokenization latency
histogram_quantile(0.99, sum by (le) (rate(smg_router_stage_duration_seconds_bucket{stage="tokenize"}[5m])))

smg_router_ttft_seconds

Time to first token (gRPC streaming only).

Type: Histogram
Labels: router_type, backend_type, model, endpoint

# P50 TTFT by model
histogram_quantile(0.5, sum by (model, le) (rate(smg_router_ttft_seconds_bucket[5m])))

smg_router_tpot_seconds

Time per output token (gRPC streaming only).

Type: Histogram
Labels: router_type, backend_type, model, endpoint

# Average TPOT
rate(smg_router_tpot_seconds_sum[5m]) / rate(smg_router_tpot_seconds_count[5m])

smg_router_tokens_total

Token counts by type.

Type: Counter
Labels: router_type, backend_type, model, endpoint, token_type

Token types: input, output

# Tokens per second
sum by (token_type) (rate(smg_router_tokens_total[5m]))

# Input/output ratio
sum(rate(smg_router_tokens_total{token_type="output"}[5m])) / sum(rate(smg_router_tokens_total{token_type="input"}[5m]))

smg_router_generation_duration_seconds

Total generation time (first token to last token).

Type: Histogram
Labels: router_type, backend_type, model, endpoint

smg_router_upstream_responses_total

HTTP responses from upstream workers.

Type: Counter
Labels: router_type, status_code, error_code

Layer 3: Worker Metrics

Metrics for worker pool management and resilience.

smg_worker_pool_size

Number of workers in the pool.

Type: Gauge
Labels: worker_type, connection_mode, model

smg_worker_connections_active

Active connections per worker pool.

Type: Gauge
Labels: worker_type, connection_mode

smg_worker_requests_active

Active requests per worker.

Type: Gauge
Labels: worker

# Load distribution across workers (each worker's share of active requests)
smg_worker_requests_active / on() group_left() sum(smg_worker_requests_active)

smg_worker_health

Worker health status.

Type: Gauge
Labels: worker
Values: 1 = healthy, 0 = unhealthy

# Count healthy workers
sum(smg_worker_health)

# Alert on unhealthy workers
smg_worker_health == 0

smg_worker_health_checks_total

Health check results.

Type: Counter
Labels: worker_type, result

Results: success, failure


smg_worker_selection_total

Worker selection events by load balancer.

Type: Counter
Labels: worker_type, connection_mode, model, policy

smg_worker_errors_total

Worker-level errors by type.

Type: Counter
Labels: worker_type, connection_mode, error_type

Circuit Breaker Metrics

smg_worker_cb_state

Circuit breaker state per worker.

Type: Gauge
Labels: worker
Values: 0 = closed, 1 = open, 2 = half-open

# Workers with open circuits
count(smg_worker_cb_state == 1)

smg_worker_cb_transitions_total

Circuit breaker state transitions.

Type: Counter
Labels: worker, from, to

smg_worker_cb_outcomes_total

Request outcomes tracked by circuit breaker.

Type: Counter
Labels: worker, outcome

Outcomes: success, failure

smg_worker_cb_consecutive_failures

Consecutive failures per worker.

Type: Gauge
Labels: worker

smg_worker_cb_consecutive_successes

Consecutive successes per worker.

Type: Gauge
Labels: worker
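
To show how the circuit breaker metrics relate to each other, here is a toy state machine that tracks the same quantities: a numeric state (0, 1, 2), consecutive failure and success counts, and a list of from/to transitions. The thresholds, the class and method names, and the transition label strings are all hypothetical. The reference does not describe SMG's actual breaker logic, so this is only a sketch of a conventional design.

```python
from enum import IntEnum


class CircuitState(IntEnum):
    # Numeric values match those listed for smg_worker_cb_state.
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


class ToyCircuitBreaker:
    """Illustrative breaker; thresholds and behavior are assumptions, not SMG's."""

    def __init__(self, failure_threshold=3, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.transitions = []  # (from, to) pairs, analogous to the from/to labels

    def _transition(self, new_state):
        def label(state):
            return state.name.lower().replace("_", "-")
        self.transitions.append((label(self.state), label(new_state)))
        self.state = new_state

    def record(self, success):
        """Record one request outcome and update counters and state."""
        if success:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if (self.state == CircuitState.HALF_OPEN
                    and self.consecutive_successes >= self.success_threshold):
                self._transition(CircuitState.CLOSED)
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            tripped = (self.state == CircuitState.CLOSED
                       and self.consecutive_failures >= self.failure_threshold)
            if tripped or self.state == CircuitState.HALF_OPEN:
                self._transition(CircuitState.OPEN)

    def allow_probe(self):
        """Move open -> half-open; a real breaker would typically do this after a timeout."""
        if self.state == CircuitState.OPEN:
            self.consecutive_successes = 0
            self._transition(CircuitState.HALF_OPEN)


cb = ToyCircuitBreaker()
for _ in range(3):
    cb.record(success=False)   # third failure opens the circuit
cb.allow_probe()               # open -> half-open
cb.record(success=True)
cb.record(success=True)        # second success closes the circuit
print(int(cb.state), cb.transitions)
```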

Retry Metrics

smg_worker_retries_total

Retry attempts.

Type: Counter
Labels: worker_type, endpoint

smg_worker_retries_exhausted_total

Requests that exhausted all retries.

Type: Counter
Labels: worker_type, endpoint

smg_worker_retry_backoff_seconds

Retry backoff durations by attempt number.

Type: Histogram
Labels: attempt

Layer 4: Discovery Metrics

Metrics for service discovery.

smg_discovery_registrations_total

Worker registrations.

Type: Counter
Labels: source, result

Sources: static, kubernetes, consul, manual


smg_discovery_deregistrations_total

Worker deregistrations.

Type: Counter
Labels: source, reason

smg_discovery_sync_duration_seconds

Discovery sync duration.

Type: Histogram
Labels: source

smg_discovery_workers_discovered

Workers discovered per source.

Type: Gauge
Labels: source

Layer 5: MCP Tool Metrics

Metrics for Model Context Protocol tool execution.

smg_mcp_tool_calls_total

MCP tool invocations.

Type: Counter
Labels: model, tool_name, result

Results: success, error

# Tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m])) / sum(rate(smg_mcp_tool_calls_total[5m]))

# Most used tools
topk(10, sum by (tool_name) (rate(smg_mcp_tool_calls_total[5m])))

smg_mcp_tool_duration_seconds

Tool execution duration.

Type: Histogram
Labels: model, tool_name

smg_mcp_servers_active

Active MCP servers.

Type: Gauge
Labels: None

smg_mcp_tool_iterations_total

Tool loop iterations in Responses API.

Type: Counter
Labels: model

Layer 6: Database Metrics

Metrics for storage operations.

smg_db_operations_total

Database operations.

Type: Counter
Labels: storage_type, operation, result

Storage types: response, conversation, conversation_item
Operations: get, put, delete, list


smg_db_operation_duration_seconds

Database operation duration.

Type: Histogram
Labels: storage_type, operation

smg_db_connections_active

Active database connections.

Type: Gauge
Labels: storage_type

smg_db_items_stored

Items stored in database.

Type: Counter
Labels: storage_type

Cache Routing Metrics

smg_manual_policy_cache_entries

Entries in the cache-aware routing cache.

Type: Gauge
Labels: None

smg_worker_routing_keys_active

Active routing keys per worker (used by cache-aware policies).

Type: Gauge
Labels: worker

smg_manual_policy_branch_total

Manual policy execution branch counts for routing decisions.

Type: Counter
Labels: branch

smg_consistent_hashing_policy_branch_total

Consistent hashing policy execution branch counts for routing decisions.

Type: Counter
Labels: branch

smg_prefix_hash_policy_branch_total

Prefix hash policy execution branch counts for routing decisions.

Type: Counter
Labels: branch

Dashboard Queries Summary

# Request rate
sum(rate(smg_http_requests_total[5m]))

# Error rate
sum(rate(smg_http_responses_total{status_code=~"5.."}[5m])) / sum(rate(smg_http_responses_total[5m]))

# P99 latency
histogram_quantile(0.99, sum by (le) (rate(smg_http_request_duration_seconds_bucket[5m])))

# TTFT P50
histogram_quantile(0.5, sum by (le) (rate(smg_router_ttft_seconds_bucket[5m])))

# Tokens/sec
sum(rate(smg_router_tokens_total[5m]))

# Healthy workers
sum(smg_worker_health)

# Open circuits
count(smg_worker_cb_state == 1)

# Rate limit rejections
sum(rate(smg_http_rate_limit_total{result="rejected"}[5m]))

# MCP tool success rate
sum(rate(smg_mcp_tool_calls_total{result="success"}[5m])) / sum(rate(smg_mcp_tool_calls_total[5m]))

Histogram Buckets

The default histogram buckets (29 buckets from 1 ms to 7200 s) apply to every metric whose name ends with duration_seconds:

0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0,
10.0, 15.0, 30.0, 45.0, 60.0, 90.0, 120.0, 180.0, 240.0, 300.0,
480.0, 900.0, 1200.0, 1800.0, 2700.0, 3600.0, 5400.0, 7200.0

Configure custom buckets via CLI:

smg --prometheus-duration-buckets 0.01 0.1 0.5 1 5 10
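
When choosing custom buckets, it helps to see how observations land in them. Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound (le) is at least the observed value, plus the implicit +Inf bucket. The sketch below applies that rule to the six custom bounds from the command above. The observed durations are invented.

```python
import bisect

# Bounds taken from the --prometheus-duration-buckets example above.
CUSTOM_BUCKETS = [0.01, 0.1, 0.5, 1, 5, 10]


def cumulative_counts(observations, bounds):
    """Return cumulative counts per `le` bucket; the final entry is the +Inf bucket."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # Index of the smallest bound that is >= value (le is inclusive).
        first = bisect.bisect_left(bounds, value)
        for i in range(first, len(counts)):
            counts[i] += 1
    return counts


# Invented request durations in seconds.
durations = [0.005, 0.2, 0.2, 3, 42]
counts = cumulative_counts(durations, CUSTOM_BUCKETS)
for bound, count in zip(CUSTOM_BUCKETS + ["+Inf"], counts):
    print(f'le="{bound}": {count}')
```

Here the 42-second request is visible only in the +Inf bucket, so any quantile that falls there cannot be estimated above 10 seconds. If long-running requests matter, the largest custom bound should sit above the durations you expect to see.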