Priority Scheduling

When the gateway is at capacity, a single global queue makes every request compete on first-come-first-served terms. A latency-sensitive chat completion ends up waiting behind whatever batch job happened to arrive first. The priority scheduler replaces that flat queue with a priority-aware admission layer: clients declare a class with the x-smg-priority header, the gateway admits higher classes first, reserves capacity for them, and — when it has to — preempts a lower-class request that has not yet started streaming.

The priority scheduler is opt-in. When it is disabled (the default), the gateway keeps its legacy concurrency-limit admission path, so existing deployments see zero behavior change.


A busy restaurant

The mechanics are easier to hold in your head with an analogy:

  • Tables are slots. The dining room seats a fixed number of guests — the gateway's live backend capacity. A guest occupies a table for the length of their meal, the way a request holds a slot until its response finishes streaming.
  • Guest tiers are priority classes. Walk-ins, regulars, and VIPs get different treatment. SMG has four tiers: system, interactive, default, and bulk.
  • The host is the scheduler. The host decides who gets seated now, who waits by the door, and in what order the waiting line is called.
  • Reserved tables are reserved slots. A few tables are kept open for VIPs even when the room looks full, so a VIP rarely waits.
  • Bumping a guest whose food has not arrived is preemption. If a VIP walks in and every table is taken, the host may clear a table occupied by a walk-in who has not been served yet — nobody's dinner gets thrown away. A guest already eating is never bumped.
  • Calling a long-waiting walk-in out of turn is starvation promotion. If someone has waited by the door far too long, the host seats them next even though higher tiers are still in line, so they are not stuck forever.

The rest of this page is the real mechanics behind that picture.


Priority classes

Every admitted request is assigned one of four classes. They are strictly ordered — higher classes win contention, get reservations, and may preempt lower ones.

Class Rank Intended traffic Preempts lower classes?
system highest Internal control-plane callers Yes
interactive high Latency-sensitive (chat completions, autocomplete) Yes
default middle Unlabeled traffic — what a request gets with no header No
bulk lowest Background / batch jobs No

A request's class comes from the x-smg-priority header, then is clamped down to the maximum class the tenant is allowed to use. The header can only ever lower a request's class relative to the tenant cap — it can never promote a tenant above its ceiling. A free-tier tenant cannot escalate itself to system by setting a header. See the reference page for the exact header contract.


Slots and capacity

The scheduler admits against a pool of slots. Total slot capacity is the gateway's live backend capacity — it is read from the worker fleet, not a static number, and the scheduler reacts when it changes (workers scaling up or down). A request holds exactly one slot from the moment it is admitted until its response body finishes draining (or it is preempted or the client disconnects), at which point the slot returns to the pool.

When a slot frees, the scheduler's dispatcher wakes and tries to admit waiting requests, highest class first.

Reserved slots

Each class can reserve a share of capacity. A reservation is a floor, not a fixed partition:

  • A higher class's unused reservation is held back from lower classes. If interactive has reserved slots it is not currently using, bulk and default cannot consume them — they stay available so an arriving interactive request is admitted immediately instead of queueing.
  • Once a higher class actually uses a reserved slot, the hold collapses one-for-one; the slot is simply in flight.
  • A class's own reservation never counts against its own headroom.

This is why system and interactive ship with reservations by default while default and bulk reserve nothing — interactive traffic should almost never wait behind a batch burst. The sum of all reservations must fit under capacity; if it does not at startup, the scheduler refuses to build and the gateway falls back to legacy admission.


Per-class queues

When no slot is immediately available, a request does not fail right away — it joins a per-class FIFO queue. Each class has its own queue with its own depth limit and its own wait timeout:

  • If the queue is already at its configured depth, the request is rejected immediately (429).
  • If the request waits longer than the class's timeout, it is rejected (408).

A client that disconnects while queued is not currently detected — its place is held until that timeout fires, because the cancel signal isn't yet wired to client disconnect at this stage. (The 499 code exists for this case but isn't emitted today.)

Higher classes have shorter queues and shorter timeouts (fail fast, the latency matters); lower classes have deeper queues and longer timeouts (wait patiently, throughput matters). The dispatcher drains queues in strict priority ordersystem first, then interactive, default, and bulk, fully draining a higher tier before serving a lower one. A sustained higher-priority flood will hold off a lower tier; the starvation guard below is what keeps that from lasting forever.


Preemption

Reservations alone are not enough. If bulk fills its share and spills into unreserved capacity at the same moment an interactive request arrives, reservations leave a gap. Preemption closes it.

When a preempt-capable request (system or interactive by default) finds no free slot and no reservation to draw on, the scheduler looks for a victim: a strictly-lower-class request that is still in flight but has not yet emitted its first response byte. If it finds one, it cancels that request and claims the freed slot.

The TTFT boundary is the rule that makes this safe:

  • Only pre-first-byte requests are eligible. A request that has already streamed a byte to its client is never preempted — the user is already seeing output, and truncating it would be worse than making the higher-priority request wait. The handoff is decided by a single atomic compare-and-swap on the victim: whichever happens first wins, "first byte emitted" or "selected for preemption," and the loser backs off.
  • Victim selection minimizes wasted work. Among eligible victims the scheduler prefers the lowest class (cheapest to cancel) and, within that class, the most-recently-admitted request (the least upstream work thrown away).
  • At most one victim per admission. No cascading preemptions.

A preempted request receives 503 with Retry-After: 1 and an X-SMG-Preempted: true header, so clients and proxies can tell a preemption apart from an ordinary overload and retry promptly. If the victim's slot does not free within a short budget, the preemptor simply falls through to the queue — the cancel has already fired, so the slot frees shortly regardless.


TTFT protection

The flip side of preemption is a guarantee for the request being served: once you have produced a first byte, you are safe. The scheduler tracks time-to-first-byte per in-flight request. The response-body wrapper marks the first data frame, and from that instant the request is no longer a preemption candidate. This bounds the blast radius of a priority surge — a higher-priority spike can reclaim slots from work that has not visibly started, but it can never tear down responses already streaming to users.


Starvation promotion

Strict priority ordering risks starving the lowest classes: if higher classes stay busy, a bulk request could wait at the head of its queue indefinitely. To prevent that, each class has a starvation threshold. When the request at the head of a queue has waited longer than its class's threshold, the dispatcher promotes it out of normal priority order and admits it next — even letting it consume a slot that a higher class had reserved but is not using.

The dispatcher checks for starved waiters lowest priority first (the classes most at risk), before it does its normal high-to-low drain. The tradeoff is deliberate: avoiding indefinite starvation is worth occasionally lending out a reserved-but-idle slot.


Fallback to legacy admission

The priority scheduler is designed to fail safe. It only takes over when it is both enabled and able to start cleanly:

  • If the scheduler is disabled (the default), the gateway wires the legacy concurrency-limit middleware — no scheduler is even constructed.
  • If the scheduler is enabled but misconfigured — unparsable YAML, or reservations that sum to more than the live capacity — the gateway logs the failure at ERROR and falls back to legacy admission rather than aborting startup. A broken scheduler config must never take the data plane down.

In both cases the request path is the familiar token-bucket concurrency limiter described in Rate Limiting. The admission path is chosen once at startup.


Observability

The scheduler emits Prometheus metrics for admission outcomes, queue waits, preemptions, priority clamps, and per-class capacity pressure. These are the signals you watch to size reservations and queue depths, and to detect clients mis-setting the priority header. The full list is in the Metrics Reference.


What's Next?

Rate Limiting

The legacy concurrency-limit path the scheduler falls back to.

Rate Limiting →

Circuit Breakers

Isolate failing workers to prevent cascade failures.

Circuit Breakers →

Metrics Reference

Scheduler admission, queue, and preemption metrics.

Metrics Reference →