Tokenizer Caching¶
SMG implements a two-level tokenizer cache that dramatically reduces tokenization overhead for repeated content, achieving 60-90% cache hit rates in typical production workloads.
Overview¶
L1 Cache (Prefix Match)¶
Boundary-aligned prefix matching that tokenizes only the suffix on hit. Ideal for multi-turn conversations with growing context.
Memory Efficient¶
~2.2KB per L0 entry with configurable L1 memory bounds. Scale from 36MB (small) to 210MB (large) deployments.
Observable¶
Per-level hit/miss counters and L1 memory usage are tracked internally to support cache sizing and tuning.
Why Cache Tokenization?¶
Tokenization (converting text to token IDs) happens on every request. A single call is fast (~1-5ms), but the cost accumulates at high request rates, especially when much of the text repeats across requests.
Multi-Turn Conversations¶
Growing context with shared prefix. L1 cache tokenizes only new messages.
RAG Applications¶
Common document snippets across queries. Both L0 and L1 provide benefits.
Batch Processing¶
Similar prompt templates with variable parts. High L0 hit rates.
Cache Architecture¶
L1 Cache (Prefix Match)¶
Router-level cache storing tokens at special token boundaries for prefix reuse.
- Tokenize only the suffix on hit
- Cross-request deduplication
- Memory-bounded (configurable)
- Automatic boundary detection
Best for: Multi-turn conversations, growing contexts, incremental content
Special Token Boundaries (L1)¶
L1 cache identifies boundaries at special tokens for efficient prefix matching:
| Model Family | Boundary Tokens | Example |
|---|---|---|
| ChatML (Qwen, Yi) | <|im_start|>, <|im_end|> | Each message boundary |
| Llama 3 | <|begin_of_text|>, <|eot_id|>, <|start_header_id|> | Text start, turn end |
| GPT | <|endoftext|> | Document end |
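
To make the boundary idea concrete, the sketch below lists the prefixes of a prompt that end at a special token, longest first. It is only an illustration: the function name `boundary_prefixes` is hypothetical, and cutting immediately after each boundary token is an assumption, not a statement about how SMG places its cut points.

```python
import re

# Boundary tokens for the ChatML family, taken from the table above.
CHATML_BOUNDARIES = ["<|im_start|>", "<|im_end|>"]


def boundary_prefixes(text, boundary_tokens):
    """Return every prefix of `text` that ends right after a boundary token.

    Prefixes are returned longest first, which is the order a prefix cache
    would typically probe them in.
    """
    pattern = re.compile("|".join(re.escape(tok) for tok in boundary_tokens))
    ends = sorted({m.end() for m in pattern.finditer(text)}, reverse=True)
    return [text[:end] for end in ends]


if __name__ == "__main__":
    prompt = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
    for prefix in boundary_prefixes(prompt, CHATML_BOUNDARIES):
        print(repr(prefix))
```

Each returned prefix is a candidate cache key; text after the longest cached prefix is the only part that still needs tokenizing.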
Multi-Turn Conversation Example¶
Consider how caching helps a typical chat application:
Turn 2 (Warm)¶
System: You are a helpful assistant.
User: What is Python?
Assistant: Python is a programming language...
User: How do I install it?
L0: Miss (text changed)
L1: Hit! → Only tokenize new content (~0.5ms)
Result: Turn 2 tokenizes only ~20% of the content, saving ~2.5ms per request.
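
The two-turn behaviour above can be reproduced with a small toy model. This is a minimal sketch, not SMG's implementation: `TwoLevelCache` and `toy_tokenize` are hypothetical names, the tokenizer simply emits one token per character, only `<|im_end|>` is treated as a boundary, and memory bounds and eviction are left out.

```python
import re

BOUNDARY = re.compile(re.escape("<|im_end|>"))  # assumed boundary token


def toy_tokenize(text):
    """Stand-in tokenizer: one token (the code point) per character."""
    return [ord(ch) for ch in text]


class TwoLevelCache:
    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.l0 = {}              # full text -> tokens (exact match)
        self.l1 = {}              # boundary-aligned prefix -> tokens
        self.chars_tokenized = 0  # how much text actually hit the tokenizer

    def _run_tokenizer(self, text):
        self.chars_tokenized += len(text)
        return self.tokenize(text)

    def encode(self, text):
        if text in self.l0:
            return self.l0[text], "l0_hit"
        ends = [m.end() for m in BOUNDARY.finditer(text)]
        start, tokens, outcome = 0, [], "miss"
        for end in reversed(ends):  # probe the longest prefix first
            if text[:end] in self.l1:
                start, tokens, outcome = end, list(self.l1[text[:end]]), "l1_hit"
                break
        # Tokenize the uncached remainder segment by segment and record a
        # prefix entry at every boundary so later requests can reuse it.
        for end in [e for e in ends if e > start] + [len(text)]:
            tokens = tokens + self._run_tokenizer(text[start:end])
            if end in ends:
                self.l1[text[:end]] = list(tokens)
            start = end
        self.l0[text] = tokens
        return tokens, outcome


if __name__ == "__main__":
    turn1 = "system: You are a helpful assistant.<|im_end|>\nuser: What is Python?<|im_end|>\n"
    turn2 = turn1 + "assistant: Python is a programming language...<|im_end|>\nuser: How do I install it?<|im_end|>\n"
    cache = TwoLevelCache(toy_tokenize)
    print(cache.encode(turn1)[1], cache.chars_tokenized)
    print(cache.encode(turn2)[1], cache.chars_tokenized)
```

On the second call the L0 lookup misses because the full text changed, while the L1 lookup finds the prefix that ends at the last boundary of turn 1, so only the newly appended text is passed to the tokenizer.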
Configuration¶
Model & Tokenizer Paths¶
--model-path¶
HuggingFace model ID or local path to load the tokenizer from.
| Option | --model-path |
|---|---|
| Default | None |
Usage:
# HuggingFace model ID (downloads automatically)
smg --model-path meta-llama/Llama-3.1-8B-Instruct ...
# Local path to model directory
smg --model-path /models/llama-3.1-8b-instruct ...
# Local path to tokenizer.json file
smg --model-path /models/llama-3.1-8b-instruct/tokenizer.json ...
When pointing to a local directory, SMG looks for either a HuggingFace tokenizer.json or a tiktoken file (tiktoken.model or *.tiktoken). When pulling from the HuggingFace Hub, SMG additionally falls back to tokenizer_config.json and vocab.json in the downloaded snapshot if a primary tokenizer file is not present.
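
As a rough illustration of the local lookup described above, the following sketch resolves a tokenizer file from a path. The function name `find_local_tokenizer_file` is hypothetical, and the probing order (tokenizer.json, then tiktoken.model, then *.tiktoken) plus the alphabetical tie-break are assumptions; the text above only states which files are recognized.

```python
from pathlib import Path


def find_local_tokenizer_file(model_path):
    """Return the tokenizer file for a local path, or None if nothing matches.

    A path that already points at a file is returned unchanged. For a
    directory, the file names mentioned in the documentation are probed.
    """
    path = Path(model_path)
    if path.is_file():
        return path
    if not path.is_dir():
        return None
    # Assumed order: HuggingFace tokenizer first, then tiktoken files.
    for name in ("tokenizer.json", "tiktoken.model"):
        candidate = path / name
        if candidate.is_file():
            return candidate
    tiktoken_files = sorted(path.glob("*.tiktoken"))
    return tiktoken_files[0] if tiktoken_files else None
```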
--tokenizer-path¶
Explicit path to a tokenizer file. Overrides --model-path for tokenizer loading.
| Option | --tokenizer-path |
|---|---|
| Default | None |
When to use:
- When the tokenizer is stored separately from the model
- When using a custom tokenizer with a standard model
- When the model directory structure is non-standard
# Use model for metadata but separate tokenizer
smg \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-path /custom/tokenizers/llama3-tokenizer.json \
...
Chat Templates¶
Chat templates convert structured messages (system, user, assistant roles) into the prompt format expected by specific models. SMG uses Jinja2 templates, the same format used by HuggingFace Transformers.
--chat-template¶
Path to a Jinja2 chat template file.
| Option | --chat-template |
|---|---|
| Default | Auto-discovered from model |
Template discovery priority:
- Explicit --chat-template path (highest priority)
- chat_template.json in model directory
- chat_template.jinja in model directory
- Any .jinja file in model directory
- chat_template field in tokenizer_config.json
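
The priority order above can be read as a simple first-match search. The sketch below is illustrative only: `discover_chat_template` is a hypothetical helper, and picking the alphabetically first file when several .jinja files exist is an assumption, since the list does not say how ties are broken.

```python
import json
from pathlib import Path


def discover_chat_template(model_dir, explicit_path=None):
    """Return (source, path) for the first match in priority order, else None."""
    # 1. An explicit --chat-template path always wins.
    if explicit_path is not None:
        return ("explicit", Path(explicit_path))
    root = Path(model_dir)
    # 2-3. Well-known template file names in the model directory.
    for name in ("chat_template.json", "chat_template.jinja"):
        candidate = root / name
        if candidate.is_file():
            return (name, candidate)
    # 4. Any other .jinja file (tie-break order is an assumption).
    jinja_files = sorted(root.glob("*.jinja"))
    if jinja_files:
        return ("*.jinja", jinja_files[0])
    # 5. The chat_template field embedded in tokenizer_config.json.
    config = root / "tokenizer_config.json"
    if config.is_file():
        data = json.loads(config.read_text(encoding="utf-8"))
        if isinstance(data, dict) and data.get("chat_template"):
            return ("tokenizer_config.json", config)
    return None
```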
Template Variables¶
Chat templates use Jinja2 syntax with access to:
| Variable | Description |
|---|---|
| messages | Array of message objects with role and content |
| add_generation_prompt | Boolean to add assistant prompt prefix |
| tools | Optional array of tool definitions |
| documents | Optional array of document context |
Template Examples¶
Llama 3
<|begin_of_text|>{% for message in messages %}
<|start_header_id|>{{ message.role }}<|end_header_id|>
{{ message.content }}<|eot_id|>
{% endfor %}
{% if add_generation_prompt %}<|start_header_id|>assistant<|end_header_id|>
{% endif %}
L0 Cache Configuration¶
The L0 cache stores complete tokenization results for exact string matches.
--tokenizer-cache-l0-max-entries¶
Maximum number of entries in the L0 cache.
| Option | --tokenizer-cache-l0-max-entries |
|---|---|
| Default | 10000 |
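
An exact-match cache bounded by entry count can be sketched in a few lines. This is a toy model under stated assumptions: `ExactMatchCache` is a hypothetical name, and least-recently-used eviction is assumed here because the documentation does not say which eviction policy SMG applies once the limit is reached.

```python
from collections import OrderedDict


class ExactMatchCache:
    """Maps a full input string to its token IDs, bounded by entry count."""

    def __init__(self, max_entries=10_000):  # mirrors the documented default
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_tokenize(self, text, tokenize):
        if text in self._entries:
            self._entries.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._entries[text]
        self.misses += 1
        tokens = tokenize(text)
        self._entries[text] = tokens
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry (assumed LRU)
        return tokens

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Any change to the input text, even a single character, produces a different key and therefore a miss, which is why this level suits repeated templates and system prompts rather than growing conversations.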
L1 Cache Configuration¶
The L1 cache stores tokenization results at special token boundaries.
--tokenizer-cache-l1-max-memory¶
Maximum memory for the L1 cache in bytes.
| Option | --tokenizer-cache-l1-max-memory |
|---|---|
| Default | 52428800 (50 MB) |
Memory Planning¶
L0 Cache Sizing¶
Each L0 entry uses approximately 2.2 KB:
| Entries | Memory | Recommended For |
|---|---|---|
| 1,000 | ~2.2 MB | Development, testing |
| 10,000 | ~22 MB | Standard production |
| 25,000 | ~55 MB | High-repetition workloads |
| 50,000 | ~110 MB | Large-scale deployments |
| 100,000 | ~220 MB | Enterprise with many prompt variants |
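
The table follows directly from the ~2.2 KB per-entry estimate. A small helper makes it easy to size other entry counts or to work backwards from a memory budget; the function names are hypothetical, and decimal units (1 MB = 1,000,000 bytes) are assumed because that is what reproduces the table values.

```python
L0_BYTES_PER_ENTRY = 2_200  # ~2.2 KB per entry, as stated above


def l0_memory_mb(entries):
    """Approximate L0 memory in MB for a given entry count."""
    return entries * L0_BYTES_PER_ENTRY / 1_000_000


def max_l0_entries(budget_mb):
    """Largest entry count that fits within a memory budget in MB."""
    return int(budget_mb * 1_000_000 // L0_BYTES_PER_ENTRY)


if __name__ == "__main__":
    for entries in (1_000, 10_000, 25_000, 50_000, 100_000):
        print(f"{entries:>7} entries ~ {l0_memory_mb(entries):.1f} MB")
```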
L1 Cache Sizing¶
L1 cache is bounded by total memory:
| Memory | Recommended For |
|---|---|
| 25 MB | Memory-constrained environments |
| 50 MB | Standard deployments (default) |
| 100 MB | Multi-turn conversation heavy |
| 200 MB | Long context applications |
Total Cache Budget¶
Medium Deployment¶
- L0: 25,000 entries (~55 MB)
- L1: 50 MB
- Total: ~105 MB
Large Deployment¶
- L0: 50,000 entries (~110 MB)
- L1: 100 MB
- Total: ~210 MB
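
A deployment profile can be turned into flag values and a budget estimate with a short helper. This is a sketch: `plan_cache_flags` is a hypothetical name, the flags are the ones documented on this page, and the L1 value is converted with 1 MB = 1,048,576 bytes because that matches the documented default of 52428800 for 50 MB. The total is approximate in the same way as the figures above.

```python
L0_BYTES_PER_ENTRY = 2_200  # ~2.2 KB per L0 entry


def plan_cache_flags(l0_entries, l1_mb):
    """Return (cli_flags, estimated_total_mb) for a cache sizing plan."""
    l1_bytes = l1_mb * 1024 * 1024  # value expected by the L1 flag, in bytes
    flags = [
        "--tokenizer-cache-enable-l0",
        "--tokenizer-cache-l0-max-entries", str(l0_entries),
        "--tokenizer-cache-enable-l1",
        "--tokenizer-cache-l1-max-memory", str(l1_bytes),
    ]
    estimated_total_mb = l0_entries * L0_BYTES_PER_ENTRY / 1_000_000 + l1_mb
    return flags, estimated_total_mb


if __name__ == "__main__":
    flags, total = plan_cache_flags(25_000, 50)  # the medium profile above
    print(" ".join(flags))
    print(f"estimated total: ~{total:.0f} MB")
```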
Recommended Configurations¶
Multi-Turn Conversations¶
For chat applications with varying conversation lengths.
smg \
--model-path Qwen/Qwen2.5-7B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 20000 \
--tokenizer-cache-enable-l1 \
--tokenizer-cache-l1-max-memory 104857600
Expected: L0 catches exact repeats, L1 accelerates prefix sharing
Memory-Constrained¶
For deployments with limited memory.
smg \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 5000
Expected: Moderate benefit with minimal memory
No Caching¶
For stateless deployments or when memory is critical.
smg \
--model-path meta-llama/Llama-3.1-8B-Instruct
# Caching is disabled by default
Use when: Diverse, unique requests dominate
Complete Example¶
Production configuration with tokenizer and caching:
smg \
--worker-urls http://worker1:8000 http://worker2:8000 \
--policy cache_aware \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--chat-template /templates/llama3.jinja \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 25000 \
--tokenizer-cache-enable-l1 \
--tokenizer-cache-l1-max-memory 104857600 \
--host 0.0.0.0 \
--port 8080
Monitoring & Observability¶
The cache implementation tracks per-level hit/miss counters and L1 memory
usage internally (CacheStats and L1CacheStats in the tokenizer crate).
These statistics are not currently exported to the gateway's Prometheus
/metrics endpoint, so hit-rate monitoring must rely on application-level
logging or benchmark runs until dedicated metrics are wired up.
Sizing Signals to Watch¶
Without dedicated cache metrics, use these indirect signals when tuning
--tokenizer-cache-l0-max-entries and --tokenizer-cache-l1-max-memory:
- Rising tokenization latency at steady request rate suggests more unique prompts than L0 can retain; increase max-entries.
- Multi-turn chat traffic with growing context benefits from larger L1 memory budgets; set L1 based on the estimate of ~1 KB per active conversation described in L1 Cache Sizing.
- Resident process memory approaching the sum of L0 (~2.2 KB per entry) plus L1 (max-memory) bounds indicates you are near the configured cache budget.
Integration with Other Caching Layers¶
Tokenizer caching is part of SMG's three-level caching strategy:
| Layer | What's Cached | Benefit |
|---|---|---|
| Tokenizer L0/L1 | Token IDs | Skip tokenization |
| Router radix tree | Prefix → worker mapping | Consistent routing decisions |
| Worker KV cache | Attention states | Skip prefill computation |
What's Next?¶
Metrics Reference¶
Complete list of cache-related metrics.
Load Balancing¶
Compare all available routing policies.