Protocol: spec_decoding.proto
The entire draft→verify loop is defined by the gRPC services and message schema in
specsplit/proto/spec_decoding.proto.
Where it lives
specsplit/proto/spec_decoding.proto
Why this protocol exists
It lets the Draft Worker and Target Worker exchange a compact representation of speculation candidates:
- the prompt/context is represented as repeated int32 token IDs
- the speculative candidates are represented as a tree of TokenNodes
- the Target Worker can optionally reuse per-session KV cache state via session_id
This keeps the network payload small, so verification latency is dominated by the Target model's forward pass rather than by serialization cost.
Services
DraftService (Draft Worker)
- GenerateDrafts (DraftRequest) returns (DraftResponse)
- Ping (PingRequest) returns (PingResponse) (health/readiness)
TargetService (Target Worker)
- VerifyDrafts (VerifyRequest) returns (VerifyResponse)
  - supports per-session KV reuse when session_id is provided
- EndSession (EndSessionRequest) returns (EndSessionResponse)
  - releases GPU KV cache for a session
- Ping (PingRequest) returns (PingResponse) (health/readiness)
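The session lifecycle implied by VerifyDrafts and EndSession can be illustrated with a toy in-memory model. This is only a sketch: the SessionCache class and its methods are hypothetical names, plain token lists stand in for GPU KV tensors, and the real Target Worker may manage its cache quite differently. It assumes that an empty session_id means stateless verification, as the VerifyRequest field description states.

```python
from typing import Dict, List


class SessionCache:
    """Toy stand-in for per-session KV state (hypothetical, not the real worker)."""

    def __init__(self) -> None:
        # session_id -> tokens seen so far (a placeholder for GPU KV tensors)
        self._state: Dict[str, List[int]] = {}

    def verify(self, session_id: str, tokens: List[int]) -> bool:
        """Record tokens for the session; return True if prior state was reused."""
        if not session_id:
            return False  # empty session_id: stateless call, nothing is stored
        hit = session_id in self._state
        self._state.setdefault(session_id, []).extend(tokens)
        return hit

    def end_session(self, session_id: str) -> bool:
        """Drop the session's state; return True if there was state to release."""
        return self._state.pop(session_id, None) is not None


cache = SessionCache()
first = cache.verify("s1", [1, 2])      # no prior state for "s1"
second = cache.verify("s1", [3])        # state from the first call is reused
stateless = cache.verify("", [4])       # empty session_id bypasses the cache
released = cache.end_session("s1")      # analogous to the EndSession RPC
after_release = cache.verify("s1", [5])  # the session starts cold again
```

The boolean returned by verify plays the role that cache_hit plays in VerifyResponse: a client that never calls EndSession would leave state behind, which is why the protocol exposes an explicit release call.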
Key messages
TokenNode
Single node in the speculative tree:
- token_id: vocabulary index of the candidate token
- log_prob: log-probability assigned by the draft model
- children: child candidate nodes (branching)
- top_k_token_ids / top_k_probs: optional Top-K distribution data used for full-vocabulary residual computations
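To make the tree shape concrete, the fields above can be mirrored in a plain Python dataclass. This is a hand-written illustration, not the generated protobuf stub; the iter_paths helper is a hypothetical name introduced here to show how a forest of TokenNodes expands into candidate token sequences.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TokenNode:
    """Hypothetical Python mirror of the TokenNode message (not the generated stub)."""
    token_id: int
    log_prob: float
    children: List["TokenNode"] = field(default_factory=list)
    top_k_token_ids: List[int] = field(default_factory=list)
    top_k_probs: List[float] = field(default_factory=list)


def iter_paths(roots: List[TokenNode]) -> Iterator[List[int]]:
    """Yield the token IDs along every root-to-leaf path, left to right."""
    # Explicit stack instead of recursion; reversed() keeps left-to-right order.
    stack = [(root, [root.token_id]) for root in reversed(roots)]
    while stack:
        node, path = stack.pop()
        if not node.children:
            yield path
            continue
        for child in reversed(node.children):
            stack.append((child, path + [child.token_id]))


# A forest with two roots; the first root branches into two candidates.
forest = [
    TokenNode(5, -0.1, children=[TokenNode(7, -0.3), TokenNode(9, -1.2)]),
    TokenNode(6, -2.0),
]
paths = list(iter_paths(forest))
print(paths)  # [[5, 7], [5, 9], [6]]
```

Sibling paths share their common prefix inside the tree, so the prefix token 5 is stored once even though it appears in two candidate sequences.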
DraftRequest / DraftResponse
- DraftRequest
  - prompt_token_ids: current prompt/context token IDs
  - max_draft_len: tree depth (K / gamma)
  - num_beams: branching factor per level
  - temperature: 0 means greedy; >0 enables sampling
  - reset_cache: clear draft KV cache before generating (used after misses)
  - session_id: session identifier for thread-safe KV reuse
- DraftResponse
  - draft_tree: generated tree candidates (forest at root-level)
  - telemetry: server-side timing metadata
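Together, max_draft_len and num_beams bound how large a draft tree can get. The sketch below computes that bound under an assumption not confirmed by the source: that the forest has at most num_beams roots and that every node expands into at most num_beams children. The Draft Worker may well prune more aggressively, so treat the result as an upper bound; max_tree_nodes is a hypothetical helper.

```python
def max_tree_nodes(max_draft_len: int, num_beams: int) -> int:
    """Upper bound on node count: num_beams**1 + ... + num_beams**max_draft_len."""
    if max_draft_len < 0 or num_beams < 0:
        raise ValueError("max_draft_len and num_beams must be non-negative")
    return sum(num_beams ** depth for depth in range(1, max_draft_len + 1))


print(max_tree_nodes(4, 2))  # 2 + 4 + 8 + 16 = 30
print(max_tree_nodes(5, 1))  # a single chain of 5 tokens
```

With num_beams equal to 1 the tree degenerates into a single chain, which corresponds to classic non-branching speculative decoding.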
VerifyRequest / VerifyResponse
- VerifyRequest
  - draft_tree: draft candidates to verify
  - session_id: KV cache reuse key (empty means stateless verification)
  - temperature: 0 for greedy verification; >0 for stochastic verification
  - expected_prefix_length: orchestrator’s expected accepted prefix length
- VerifyResponse
  - accepted_token_ids: longest accepted prefix from the tree
  - correction_token_id + has_correction: correction token when draft rejection occurs
  - cache_hit: whether session KV cache was reused
  - telemetry: server-side timing metadata
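The relationship between accepted_token_ids and the correction token is easiest to see in the greedy case (temperature 0). The following is a minimal sketch, not the Target Worker's implementation: Node, greedy_verify, the prefix argument, and the toy_target function are all hypothetical, and a real verifier would score the whole tree in one batched forward pass rather than call the model once per level.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for TokenNode, reduced to the fields used here."""
    token_id: int
    children: List["Node"] = field(default_factory=list)


def greedy_verify(
    prefix: List[int],
    roots: List[Node],
    target_argmax: Callable[[List[int]], int],
) -> Tuple[List[int], Optional[int]]:
    """Return (accepted_token_ids, correction_token_id or None)."""
    accepted: List[int] = []
    level = roots
    while level:
        want = target_argmax(prefix + accepted)  # the target's greedy choice
        match = next((n for n in level if n.token_id == want), None)
        if match is None:
            return accepted, want  # rejection: the target's token is the correction
        accepted.append(match.token_id)
        level = match.children
    return accepted, None  # every level matched; no correction in this sketch


def toy_target(context: List[int]) -> int:
    """Toy target model: always predicts the previous token plus one."""
    return context[-1] + 1


tree = [Node(11, [Node(12, [Node(99)]), Node(50)]), Node(20)]
accepted, correction = greedy_verify([10], tree, toy_target)
print(accepted, correction)  # [11, 12] 13
```

A None correction corresponds to has_correction being false. Whether the real service also returns an extra token after a fully accepted path is not stated in the source, so this sketch leaves that case without one.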
Tests / CI hooks
- The CI workflow job proto-check compiles spec_decoding.proto and verifies the generated stub files exist.
- Unit tests validate key building blocks around token-tree transformations and verification math (see tests/unit/).