gRPC Pipeline¶
When workers communicate via gRPC, SMG becomes a complete OpenAI-compatible server with a sophisticated request processing pipeline for reasoning extraction, tool call parsing, and MCP execution.
Overview¶
Tokenization Caching¶
A two-level tokenization cache reduces CPU overhead by 60-90% for repeated content.
Reasoning Extraction¶
Extract chain-of-thought content from thinking models (DeepSeek-R1, Qwen3, etc.).
Tool Call Parsing¶
Parse function calls and execute MCP tools with automatic result injection.
Pipeline Architecture¶
HTTP Mode¶
Gateway = Smart Proxy
SMG handles routing, load balancing, and failover. Workers run full OpenAI-compatible servers.
Responsibility Comparison¶
| Capability | gRPC Mode (Gateway) | HTTP Mode (Worker) |
|---|---|---|
| Chat template | Gateway | Worker |
| Tokenization | Gateway (cached) | Worker |
| Load balancing | Token-aware | Request count |
| Reasoning extraction | Gateway | Worker |
| Tool call parsing | Gateway | Worker |
| MCP execution | Gateway | N/A |
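The load-balancing row can be illustrated with a small sketch. It assumes that "token-aware" means choosing the worker with the fewest in-flight tokens, a signal the gateway has in gRPC mode because it tokenizes requests itself. The `Worker` class and both picker functions are hypothetical names for illustration, not SMG APIs.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical snapshot of one worker's current load."""
    name: str
    active_requests: int
    active_tokens: int


def pick_by_request_count(workers):
    # HTTP mode: the gateway does not tokenize, so request count is its load signal.
    return min(workers, key=lambda w: w.active_requests)


def pick_by_tokens(workers):
    # gRPC mode (assumed policy): the gateway knows token counts and can weigh real work.
    return min(workers, key=lambda w: w.active_tokens)


workers = [
    Worker("a", active_requests=1, active_tokens=8000),  # one very long prompt
    Worker("b", active_requests=3, active_tokens=600),   # three short prompts
]
print(pick_by_request_count(workers).name)  # a
print(pick_by_tokens(workers).name)         # b
```

The two policies disagree whenever request sizes are uneven, which is the case token-aware balancing is meant to handle.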
Reasoning Parsers¶
Reasoning parsers extract chain-of-thought content from model outputs. They are essential for models that emit thinking tokens before their final response.
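As a rough illustration of what extraction involves, the sketch below splits one complete (non-streaming) output on `<think>` / `</think>` markers. The function name `split_reasoning` and its parameters are hypothetical; the `in_reasoning` flag models parsers whose initial state is "in reasoning", where only the closing token is expected. SMG's own parsers may behave differently, for example when streaming.

```python
def split_reasoning(text, start="<think>", end="</think>", in_reasoning=False):
    """Split one complete model output into (reasoning, content)."""
    before = ""
    if in_reasoning:
        # Output starts inside the reasoning block; drop a redundant
        # opening token if the model emitted one anyway.
        text = text.lstrip()
        if text.startswith(start):
            text = text[len(start):]
    else:
        before, found, text = text.partition(start)
        if not found:
            return "", before.strip()  # no reasoning block at all
    reasoning, found, after = text.partition(end)
    if not found:
        after = ""  # reasoning was never closed; nothing follows it
    return reasoning.strip(), (before + after).strip()


print(split_reasoning("<think>Let me think step by step...</think>The answer is 42."))
print(split_reasoning("Let me think...</think>The answer is 42.", in_reasoning=True))
```

Both calls return the reasoning text and the final answer as separate strings, which is the split that `reasoning_content` and `content` carry in a response.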
Configuration¶
| Option | --reasoning-parser |
|---|---|
| Default | Auto-detected from model name |
Supported Parsers¶
Qwen3
- Pattern: `*qwen3*`
- Initial state: Not in reasoning
- Tokens: `<think>` / `</think>`
smg --reasoning-parser qwen3
Kimi
- Pattern: `*kimi*`
- Initial state: Not in reasoning
- Tokens: Unicode markers
smg --reasoning-parser kimi
GLM-4.5
- Pattern: `*glm45*`, `*glm47*`
- Initial state: Not in reasoning
- Tokens: `<think>` / `</think>`
smg --reasoning-parser glm45
Complete Parser Reference¶
| Parser | Model Pattern | Initial State | Tokens |
|---|---|---|---|
| `deepseek_r1` | `*deepseek-r1*` | In reasoning | `</think>` |
| `qwen3` | `*qwen3*` | Not in reasoning | `<think>` / `</think>` |
| `qwen3_thinking` | `*qwen-thinking*` | In reasoning | `<think>` / `</think>` |
| `kimi` | `*kimi*` | Not in reasoning | Unicode markers |
| `glm45` | `*glm45*`, `*glm47*` | Not in reasoning | `<think>` / `</think>` |
| `step3` | `*step3*` | In reasoning | `<think>` / `</think>` |
| `minimax` | `*minimax*`, `*mm-m2*` | In reasoning | `<think>` appended |
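Auto-detection from the model name can be pictured as glob matching against the patterns in this table. The sketch below is one plausible reading, not SMG's actual lookup: the case-insensitive comparison, the first-match-wins order, and the function name `detect_reasoning_parser` are all assumptions.

```python
from fnmatch import fnmatchcase

# (pattern, parser) pairs copied from the reference table. Matching order
# and lowercase comparison are assumptions of this sketch.
REASONING_PARSERS = [
    ("*deepseek-r1*", "deepseek_r1"),
    ("*qwen-thinking*", "qwen3_thinking"),
    ("*qwen3*", "qwen3"),
    ("*kimi*", "kimi"),
    ("*glm45*", "glm45"),
    ("*glm47*", "glm45"),
    ("*step3*", "step3"),
    ("*minimax*", "minimax"),
    ("*mm-m2*", "minimax"),
]


def detect_reasoning_parser(model_name):
    """Return the first parser whose pattern matches, or None."""
    name = model_name.lower()
    for pattern, parser in REASONING_PARSERS:
        if fnmatchcase(name, pattern):
            return parser
    return None


print(detect_reasoning_parser("deepseek-ai/DeepSeek-R1"))  # deepseek_r1
print(detect_reasoning_parser("Qwen/Qwen3-8B"))            # qwen3
```

When a model name matches no pattern, passing `--reasoning-parser` explicitly avoids relying on detection.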
Output Format¶
When separate_reasoning: true is set in the request:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }]
}
Tool Call Parsers¶
Tool call parsers extract function calls from model output and validate arguments against schemas.
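A minimal sketch of the parse-then-validate idea, using a JSON tool call of the `{"name": ..., "arguments": ...}` shape. The `WEATHER_TOOL` definition and `parse_tool_call` function are hypothetical, and the validation is deliberately shallow (required keys, unknown keys, top-level types); a real gateway would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling style.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

# Shallow mapping from JSON Schema type names to Python types.
_JSON_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def parse_tool_call(raw, tools):
    """Parse one JSON tool call; return (name, arguments) or raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name, args = call.get("name"), call.get("arguments", {})
    tool = next((t for t in tools if t["name"] == name), None)
    if tool is None:
        raise ValueError(f"unknown tool: {name!r}")
    schema = tool["parameters"]
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key!r}")
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            raise ValueError(f"argument {key!r} should be {spec['type']}")
    return name, args


print(parse_tool_call('{"name": "get_weather", "arguments": {"location": "NYC"}}', [WEATHER_TOOL]))
```

Rejecting a malformed call before execution keeps bad arguments from reaching a tool.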
Configuration¶
| Option | --tool-call-parser |
|---|---|
| Default | Auto-detected from model name |
Supported Parsers¶
DeepSeek
DeepSeek V3 tool format.
<tool_call>
get_weather(location="NYC")
</tool_call>
Qwen
Qwen model JSON tool calling format.
{"name": "get_weather", "arguments": {"location": "NYC"}}
Qwen XML
Qwen3-Coder / Qwen3.5+ XML format with parameter tags.
<tool_call><function=get_weather><parameter=location>NYC</parameter></function></tool_call>
Complete Parser Reference¶
| Parser | Model Pattern | Format |
|---|---|---|
| `passthrough` | Default fallback | No parsing (returns text unchanged) |
| `json` | `gpt-*`, `claude-*`, `gemini-*` | Standard JSON function calls |
| `mistral` | `mistral-*`, `mixtral-*` | Mistral-specific format |
| `qwen` | `qwen*`, `Qwen*` | JSON tool calls |
| `qwen_xml` | `Qwen3-Coder*`, `Qwen3.5*` | XML with parameter tags |
| `pythonic` | `llama-4*`, `deepseek-*` | Python-style function syntax |
| `llama` | `llama-3.2*` | Python tag with JSON |
| `deepseek` | `deepseek-v3*` | XML with function syntax |
| `glm45_moe` | `glm-4.5*`, `glm-4.6*` | GLM 4.5/4.6 MoE format |
| `glm47_moe` | `glm-4.7*` | GLM 4.7 MoE format |
| `step3` | `step3*`, `Step-3*` | Step-3 model format |
| `kimik2` | `kimi-k2*`, `Kimi-K2*` | Kimi K2 model format |
| `minimax_m2` | `minimax*`, `MiniMax*` | MiniMax M2 model format |
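To make the "XML with parameter tags" row concrete, here is a small regex-based sketch that reads the Qwen XML example shown earlier on this page. It is an illustration only: `parse_qwen_xml` is a hypothetical name, it handles complete text rather than a stream, and it returns every parameter value as a string.

```python
import re

# One <tool_call> wrapping one <function=NAME> element.
_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each <parameter=KEY>VALUE</parameter> inside the function body.
_PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_qwen_xml(text):
    """Return a list of {"name", "arguments"} dicts found in text."""
    calls = []
    for name, body in _CALL_RE.findall(text):
        args = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


sample = (
    "<tool_call><function=get_weather>"
    "<parameter=location>NYC</parameter>"
    "</function></tool_call>"
)
print(parse_qwen_xml(sample))
```

The result has the same name/arguments shape as the JSON format, so later stages do not need to know which wire format the model used.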
Tool Execution Flow¶
- Parse: Extract tool calls from model output
- Validate: Check arguments against tool schema
- Execute: Run MCP tools or return to client
- Inject: Add tool results back to conversation
- Continue: Resume generation if needed
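The five steps above can be sketched as a loop. Everything here is a stand-in: `generate` is a hypothetical model call, `tools` is a plain dictionary of Python callables rather than MCP servers, and the message shapes only approximate the OpenAI chat format. The sketch also covers only the "execute" branch, not returning the call to the client.

```python
import json


def run_tool_loop(generate, tools, messages, max_rounds=4):
    """Generate, run any requested tool, feed the result back, and repeat."""
    for _ in range(max_rounds):
        output = generate(messages)
        try:
            call = json.loads(output)                      # 1. parse
        except json.JSONDecodeError:
            call = None
        if not (isinstance(call, dict) and "name" in call):
            # Plain text: the model has finished answering.
            messages.append({"role": "assistant", "content": output})
            return messages
        name, args = call["name"], call.get("arguments", {})
        if name not in tools or not isinstance(args, dict):
            raise ValueError(f"invalid tool call: {call!r}")  # 2. validate
        result = tools[name](**args)                       # 3. execute
        messages.append({"role": "assistant", "tool_calls": [call]})
        messages.append(                                   # 4. inject
            {"role": "tool", "name": name, "content": json.dumps(result)}
        )
        # 5. continue: the next iteration resumes generation
    raise RuntimeError("tool loop did not finish within max_rounds")


def fake_generate(messages):
    # Scripted stand-in for a model: call the tool once, then answer.
    if messages[-1]["role"] == "tool":
        return "It is sunny in NYC."
    return '{"name": "get_weather", "arguments": {"location": "NYC"}}'


history = run_tool_loop(
    fake_generate,
    {"get_weather": lambda location: {"location": location, "forecast": "sunny"}},
    [{"role": "user", "content": "Weather in NYC?"}],
)
print(history[-1]["content"])  # It is sunny in NYC.
```

The `max_rounds` cap is a design choice of the sketch: it prevents a model that keeps requesting tools from looping forever.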
Configuration¶
Parser CLI Options¶
| Option | Default | Description |
|---|---|---|
| `--reasoning-parser` | Auto | Reasoning parser type to use |
| `--tool-call-parser` | Auto | Tool call parser type to use |
| `--mcp-config-path` | None | Path to MCP server configuration file |
MCP Integration¶
When MCP is configured, tool calls can be executed automatically:
smg \
  --mcp-config-path /path/to/mcp.json \
  --tool-call-parser llama
See the MCP Guide for detailed configuration.
Recommended Configurations¶
Tool Calling Model¶
Llama with MCP tool execution.
smg \
  --model-path meta-llama/Llama-3.2-70B-Instruct \
  --tool-call-parser llama \
  --mcp-config-path /config/mcp.json
Full Pipeline¶
Complete configuration with all features.
smg \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen \
  --mcp-config-path /config/mcp.json \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-enable-l1 \
  --worker-urls grpc://worker:50051
Monitoring¶
Pipeline Metrics¶
| Metric | Description |
|---|---|
| `smg_router_stage_duration_seconds` | Time spent in each pipeline stage |
| `smg_mcp_tool_calls_total` | MCP tool invocations |
Debug Logging¶
# Enable pipeline debug logging
RUST_LOG=smg::pipeline=debug smg ...
# Enable parser debug logging
RUST_LOG=smg::parsers=debug smg ...
Troubleshooting¶
| Symptom | Cause | Solution |
|---|---|---|
| Reasoning not extracted | Wrong parser | Check model and parser match |
| Tool calls not parsed | Format mismatch | Verify tool parser selection |
| MCP tools timeout | Slow tool execution | Check MCP server configuration |
| Empty reasoning_content | Reasoning not separated, or the model produced no thinking tokens | Set separate_reasoning: true in the request |
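For the last row, a quick way to check the request side is to build the body explicitly and inspect what comes back. The sketch below assumes `separate_reasoning` is a top-level field of the chat completion request body; the helper names and the `my-model` placeholder are hypothetical, and no network call is made.

```python
def build_chat_request(model, prompt):
    """Build a chat completion body that asks for separated reasoning."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "separate_reasoning": True,  # assumed to be a top-level request field
    }


def read_reasoning(response):
    """Return (reasoning_content, content) from a chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


body = build_chat_request("my-model", "What is 6 times 7?")

# Response in the shape shown under Output Format.
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "The answer is 42.",
            "reasoning_content": "Let me think step by step...",
        }
    }]
}
reasoning, answer = read_reasoning(response)
print(reasoning)
print(answer)
```

If `reasoning` comes back empty while the flag is set, the parser selection rows earlier in the table are the next thing to check.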
What's Next?¶
MCP Integration¶
Configure Model Context Protocol servers for tool execution.
Cache-Aware Routing¶
Maximize KV cache hits with prefix-based routing.