gRPC Pipeline¶

When workers communicate via gRPC, SMG acts as a complete OpenAI-compatible server. The gateway owns the request processing pipeline: chat templating, tokenization, reasoning extraction, tool call parsing, and MCP execution.

Overview¶

Tokenization Caching¶

A two-level tokenization cache (L0 and L1, enabled with `--tokenizer-cache-enable-l0` and `--tokenizer-cache-enable-l1`) reduces CPU overhead by 60-90% for repeated content.

Reasoning Extraction¶

Extract chain-of-thought content from thinking models (DeepSeek-R1, Qwen3, etc.).

Tool Call Parsing¶

Parse function calls and execute MCP tools with automatic result injection.

Pipeline Architecture¶

HTTP Mode¶

Gateway = Smart Proxy

SMG handles routing, load balancing, and failover. Workers run full OpenAI-compatible servers.

Responsibility Comparison¶

Capability	gRPC Mode (Gateway)	HTTP Mode (Worker)
Chat template	Gateway	Worker
Tokenization	Gateway (cached)	Worker
Load balancing	Token-aware	Request count
Reasoning extraction	Gateway	Worker
Tool call parsing	Gateway	Worker
MCP execution	Gateway	N/A
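The load balancing row can be illustrated with a small sketch. It assumes that "token-aware" means choosing the worker with the fewest in-flight tokens, which the gateway can only do because it tokenizes requests itself. The `Worker` class, its fields, and the worker URLs are hypothetical and not part of SMG.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    url: str
    in_flight_requests: int  # requests currently being served
    in_flight_tokens: int    # prompt tokens across those requests (assumed known)


def pick_by_request_count(workers: List[Worker]) -> Worker:
    """HTTP-mode style: the gateway only sees how many requests each worker has."""
    return min(workers, key=lambda w: w.in_flight_requests)


def pick_by_tokens(workers: List[Worker]) -> Worker:
    """gRPC-mode style (assumed): the gateway knows token counts and balances on them."""
    return min(workers, key=lambda w: w.in_flight_tokens)


# One long prompt on worker-a, three short prompts on worker-b.
workers = [
    Worker("grpc://worker-a:50051", in_flight_requests=1, in_flight_tokens=8000),
    Worker("grpc://worker-b:50051", in_flight_requests=3, in_flight_tokens=300),
]
```

With these made-up numbers, request counting sends the next request to `worker-a` even though it is busy with a long prompt, while token counting picks `worker-b`.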

Reasoning Parsers¶

Reasoning parsers extract chain-of-thought content from model outputs. They are essential for models that emit thinking tokens before their final response.

Configuration¶

Option	`--reasoning-parser`
Default	Auto-detected from model name
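As a rough illustration of auto-detection, the sketch below maps a model name to a parser using the patterns from the parser reference in this guide. It assumes case-insensitive substring matching and first-match-wins ordering. Both are assumptions, and `detect_reasoning_parser` is a hypothetical helper, not an SMG function.

```python
from typing import Optional

# Patterns copied from the parser reference table in this guide.
REASONING_PATTERNS = [
    ("deepseek_r1", ["deepseek-r1"]),
    ("qwen3", ["qwen3"]),
    ("qwen3_thinking", ["qwen-thinking"]),
    ("kimi", ["kimi"]),
    ("glm45", ["glm45", "glm47"]),
    ("step3", ["step3"]),
    ("minimax", ["minimax", "mm-m2"]),
]


def detect_reasoning_parser(model_name: str) -> Optional[str]:
    """Return the first parser whose pattern occurs in the model name, else None."""
    name = model_name.lower()
    for parser, patterns in REASONING_PATTERNS:
        if any(pattern in name for pattern in patterns):
            return parser
    return None
```

If a model name does not match as expected, passing `--reasoning-parser` explicitly avoids relying on detection.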

Supported Parsers¶

Qwen3

Pattern: *qwen3*
Initial state: Not in reasoning
Tokens: <think> / </think>

smg --reasoning-parser qwen3

Kimi

Pattern: *kimi*
Initial state: Not in reasoning
Tokens: Unicode markers

smg --reasoning-parser kimi

GLM-4.5

Pattern: *glm45*, *glm47*
Initial state: Not in reasoning
Tokens: <think> / </think>

smg --reasoning-parser glm45

Complete Parser Reference¶

Parser	Model Pattern	Initial State	Tokens
`deepseek_r1`	`deepseek-r1`	In reasoning	`</think>`
`qwen3`	`qwen3`	Not in reasoning	`<think>` / `</think>`
`qwen3_thinking`	`qwen-thinking`	In reasoning	`<think>` / `</think>`
`kimi`	`kimi`	Not in reasoning	Unicode markers
`glm45`	`glm45`, `glm47`	Not in reasoning	`<think>` / `</think>`
`step3`	`step3`	In reasoning	`<think>` / `</think>`
`minimax`	`minimax`, `mm-m2`	In reasoning	`<think>` appended
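The Initial State column decides whether text before the first `</think>` counts as reasoning. The following minimal sketch shows this for complete (non-streamed) output with `<think>` / `</think>` tokens. `split_reasoning` is a hypothetical helper and does not reflect how SMG implements its parsers.

```python
from typing import Tuple


def split_reasoning(text: str, in_reasoning: bool,
                    start: str = "<think>", end: str = "</think>") -> Tuple[str, str]:
    """Return (reasoning, content) for a complete model output."""
    before = ""
    if start in text:
        # An explicit start token always opens a reasoning block.
        before, _, text = text.partition(start)
        in_reasoning = True
    if not in_reasoning:
        return "", text  # no reasoning block at all
    # Everything up to the end token is reasoning; the rest is the answer.
    reasoning, _, after = text.partition(end)
    return reasoning.strip(), (before + after).strip()
```

A parser that starts "In reasoning" handles output whose opening `<think>` was already consumed by the chat template, so only `</think>` appears in the generated text.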

Output Format¶

When `separate_reasoning: true` is set in the request, extracted reasoning is returned in `reasoning_content`, separate from `content`:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }]
}
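A client can read the two fields as in the sketch below. It assumes `separate_reasoning` is a top-level field of the chat completion request body. The model name and the `read_message` helper are placeholders.

```python
import json

# Request body a client might send (field placement is an assumption).
request_body = {
    "model": "example-model",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 6 * 7?"}],
    "separate_reasoning": True,
}

# Response in the shape shown above.
response_text = """
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }]
}
"""


def read_message(response_json: str):
    """Return (reasoning, content); reasoning is None when the field is absent."""
    message = json.loads(response_json)["choices"][0]["message"]
    return message.get("reasoning_content"), message["content"]


reasoning, answer = read_message(response_text)
```

Using `.get` for `reasoning_content` keeps the client working when the field is missing.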

Tool Call Parsers¶

Tool call parsers extract function calls from model output and validate arguments against schemas.

Configuration¶

Option	`--tool-call-parser`
Default	Auto-detected from model name

Supported Parsers¶

DeepSeek

DeepSeek V3 tool format.

<tool_call>
get_weather(location="NYC")
</tool_call>
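To make the format concrete, here is a minimal sketch that parses the example above using Python's `ast` module. It follows only the shape shown in this guide (a function-style call with keyword arguments inside `<tool_call>` tags) and is not the parser SMG ships. `parse_function_style_calls` is a hypothetical name.

```python
import ast
import re
from typing import Any, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_function_style_calls(text: str) -> List[Dict[str, Any]]:
    """Extract name(keyword=literal, ...) calls wrapped in <tool_call> tags."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            node = ast.parse(body, mode="eval").body
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                continue
            # literal_eval only accepts literals, so no model-written code is executed.
            arguments = {kw.arg: ast.literal_eval(kw.value)
                         for kw in node.keywords if kw.arg is not None}
        except (SyntaxError, ValueError):
            continue  # malformed call: skip it rather than fail the whole response
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls
```

Positional arguments are ignored in this sketch because the example only uses keyword arguments.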

Qwen

Qwen model JSON tool calling format.

{"name": "get_weather", "arguments": {"location": "NYC"}}
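A JSON tool call like the one above can be checked with a few lines of Python. This is a sketch under the assumption that the call arrives as a single JSON object with `name` and `arguments` keys. `parse_json_tool_call` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Optional


def parse_json_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Return {"name", "arguments"} if text is a well-formed tool call, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None  # ordinary prose, not a tool call
    if not isinstance(obj, dict):
        return None
    name, arguments = obj.get("name"), obj.get("arguments")
    if not isinstance(name, str) or not isinstance(arguments, dict):
        return None
    return {"name": name, "arguments": arguments}
```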

Qwen XML

Qwen3-Coder / Qwen3.5+ XML format with parameter tags.

<tool_call><function=get_weather><parameter=location>NYC</parameter></function></tool_call>
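The XML form can be handled with regular expressions, as in the following sketch. It assumes the tags are well formed and not nested, and it leaves parameter values as strings. `parse_xml_tool_calls` is a hypothetical helper, not the SMG parser.

```python
import re
from typing import Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_xml_tool_calls(text: str) -> List[Dict[str, object]]:
    """Extract function names and string-valued parameters from <tool_call> blocks."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            arguments = {key.strip(): value.strip()
                         for key, value in PARAMETER_RE.findall(body)}
            calls.append({"name": name.strip(), "arguments": arguments})
    return calls
```

Converting values to numbers or booleans would need the tool schema, since the XML form carries no type information.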

Complete Parser Reference¶

Parser	Model Pattern	Format
`passthrough`	Default fallback	No parsing (returns text unchanged)
`json`	`gpt-*`, `claude-*`, `gemini-*`	Standard JSON function calls
`mistral`	`mistral-*`, `mixtral-*`	Mistral-specific format
`qwen`	`qwen`, `Qwen`	JSON tool calls
`qwen_xml`	`Qwen3-Coder`, `Qwen3.5`	XML with parameter tags
`pythonic`	`llama-4*`, `deepseek-*`	Python-style function syntax
`llama`	`llama-3.2*`	Python tag with JSON
`deepseek`	`deepseek-v3*`	XML with function syntax
`glm45_moe`	`glm-4.5`, `glm-4.6`	GLM 4.5/4.6 MoE format
`glm47_moe`	`glm-4.7*`	GLM 4.7 MoE format
`step3`	`step3`, `Step-3`	Step-3 model format
`kimik2`	`kimi-k2`, `Kimi-K2`	Kimi K2 model format
`minimax_m2`	`minimax`, `MiniMax`	MiniMax M2 model format

Tool Execution Flow¶

Parse: Extract tool calls from model output
Validate: Check arguments against tool schema
Execute: Run MCP tools or return to client
Inject: Add tool results back to conversation
Continue: Resume generation if needed
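The five steps above can be sketched as a loop. Everything here is illustrative. `generate` and `parse` stand in for the model and a tool call parser, the `tools` registry stands in for MCP servers, and the validation covers only required keys and basic JSON types. None of these names come from SMG.

```python
import json
from typing import Any, Callable, Dict, List

JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool, "object": dict, "array": list}


def validate_arguments(arguments: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Minimal check: required keys present, known keys have the declared type."""
    errors = [f"missing argument: {key}"
              for key in schema.get("required", []) if key not in arguments]
    properties = schema.get("properties", {})
    for key, value in arguments.items():
        expected = JSON_TYPES.get(properties.get(key, {}).get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"wrong type for argument: {key}")
    return errors


def run_tool_loop(generate: Callable, parse: Callable, tools: Dict[str, Dict[str, Any]],
                  messages: List[Dict[str, Any]], max_rounds: int = 4) -> str:
    output = ""
    for _ in range(max_rounds):
        output = generate(messages)
        calls = parse(output)                                    # 1. Parse
        if not calls:
            return output                                        # plain answer: done
        messages.append({"role": "assistant", "content": output})
        for call in calls:
            tool = tools.get(call["name"])
            if tool is None:
                result = {"error": f"unknown tool: {call['name']}"}
            else:
                errors = validate_arguments(call["arguments"], tool["schema"])  # 2. Validate
                if errors:
                    result = {"error": errors}
                else:
                    result = tool["run"](**call["arguments"])    # 3. Execute
            messages.append({"role": "tool", "name": call["name"],
                             "content": json.dumps(result)})     # 4. Inject
        # 5. Continue: the next iteration lets the model see the tool results.
    return output


# Stand-ins for the model, the parser, and one tool.
def fake_generate(messages):
    if any(m["role"] == "tool" for m in messages):
        return "It is sunny in NYC."
    return '{"name": "get_weather", "arguments": {"location": "NYC"}}'


def parse_json_call(output):
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return []
    return [obj] if isinstance(obj, dict) and "name" in obj else []


tools = {"get_weather": {
    "schema": {"type": "object", "required": ["location"],
               "properties": {"location": {"type": "string"}}},
    "run": lambda location: {"forecast": "sunny", "location": location},
}}

conversation = [{"role": "user", "content": "Weather in NYC?"}]
final_answer = run_tool_loop(fake_generate, parse_json_call, tools, conversation)
```

The `max_rounds` cap is a design choice of this sketch to prevent a model that keeps calling tools from looping forever.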

Configuration¶

Parser CLI Options¶

Option	Default	Description
`--reasoning-parser`	Auto	Reasoning parser type to use
`--tool-call-parser`	Auto	Tool call parser type to use
`--mcp-config-path`	None	Path to MCP server configuration file

MCP Integration¶

When MCP is configured, tool calls can be executed automatically:

smg \
  --mcp-config-path /path/to/mcp.json \
  --tool-call-parser llama

See the MCP Guide for detailed configuration.

Recommended Configurations¶

Tool Calling Model¶

Llama with MCP tool execution.

smg \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --tool-call-parser llama \
  --mcp-config-path /config/mcp.json

Full Pipeline¶

Complete configuration with all features.

smg \
  --model-path Qwen/Qwen3-32B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen \
  --mcp-config-path /config/mcp.json \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-enable-l1 \
  --worker-urls grpc://worker:50051

Monitoring¶

Pipeline Metrics¶

Metric	Description
`smg_router_stage_duration_seconds`	Time spent in each pipeline stage
`smg_mcp_tool_calls_total`	MCP tool invocations

Debug Logging¶

# Enable pipeline debug logging
RUST_LOG=smg::pipeline=debug smg ...

# Enable parser debug logging
RUST_LOG=smg::parsers=debug smg ...

Troubleshooting¶

Symptom	Cause	Solution
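Reasoning not extracted	Wrong parser	Confirm `--reasoning-parser` matches the model family; auto-detection depends on the model name
Tool calls not parsed	Format mismatch	Confirm `--tool-call-parser` matches the format the model emits
MCP tools timeout	Slow tool execution	Check the MCP server configuration passed via `--mcp-config-path`
Empty `reasoning_content`	Request did not ask for separated reasoning, or the model produced none	Set `separate_reasoning: true` in the request and confirm the model emits thinking tokens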

What's Next?¶

MCP Integration¶

Configure Model Context Protocol servers for tool execution.

MCP →

Cache-Aware Routing¶

Maximize KV cache hits with prefix-based routing.

Cache-Aware Routing →