# SpecSplit — Project Guide

## What Is SpecSplit?
SpecSplit is a research framework for disaggregated speculative decoding — an approach to accelerating large language model (LLM) inference by splitting the workload across two networked GPUs:
| Worker | Role | Hardware |
|---|---|---|
| Draft Worker ("The Hare") | Generates speculative token trees with a small, fast LLM | Cheap GPU |
| Target Worker ("The Tortoise") | Verifies drafts via tree-attention with a large, accurate LLM | Expensive GPU |
An Orchestrator coordinates the draft→verify ping-pong loop and presents a single user-facing interface.
Key insight: The draft model is cheap to run, so speculating several tokens ahead is fast. The target model verifies the entire tree in one batched forward pass, amortizing its high per-token cost.
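To make the loop concrete, here is a minimal greedy-acceptance sketch in the spirit of `core/verification.py`, shown for a linear (single-branch) draft rather than a full tree. The function name and tensor shapes are illustrative assumptions, not the module's actual API:

```python
import torch

def greedy_accept(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Greedy verification of a linear draft.

    draft_tokens:  (k,)   tokens proposed by the draft model
    target_logits: (k, V) target-model logits at each draft position
    Returns the accepted prefix, with the target's own token substituted
    at the first disagreement (one guaranteed token of progress per round).
    """
    target_tokens = target_logits.argmax(dim=-1)      # what the target would emit
    mismatch = (draft_tokens != target_tokens).nonzero()
    if mismatch.numel() == 0:
        return draft_tokens                           # whole draft accepted for free
    first_bad = int(mismatch[0])
    return torch.cat(
        [draft_tokens[:first_bad], target_tokens[first_bad : first_bad + 1]]
    )
```

With a tree-shaped draft, the same rule is applied along each root-to-leaf path and the longest accepted path wins; the tree attention masking in the Target Worker is what lets all paths be scored in a single forward pass.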
## Repository Layout
```
specsplit/
├── proto/                      # gRPC protobuf definitions
│   └── spec_decoding.proto
├── core/                       # Shared utilities
│   ├── config.py               # Pydantic settings (env-overridable)
│   ├── serialization.py        # Tensor ↔ list conversion
│   ├── telemetry.py            # High-precision timing + JSON spans
│   └── verification.py         # Greedy tree verification math
├── workers/
│   ├── draft/                  # Draft Worker microservice
│   │   ├── engine.py           # Autoregressive generation + KV cache
│   │   └── service.py          # gRPC server/client bindings
│   ├── target/                 # Target Worker microservice
│   │   ├── engine.py           # Session-based KV-cached verification
│   │   ├── service.py          # gRPC server bindings
│   │   ├── tree_attn.py        # Custom tree attention masking (sketched below)
│   │   └── kv_cache.py         # Pre-allocated static KV cache
│   └── orchestrator/           # Pipeline coordinator
│       ├── client.py           # User-facing entry point
│       └── pipeline.py         # Async overlapped draft→verify loop
tests/
├── unit/                       # Fast, no-model tests
└── integration/                # Tests requiring model downloads
benchmarks/
├── runner.py                   # Benchmark harness (writes CSVs)
└── run.sh                      # Convenience wrapper around runner.py
docs/                           # You are here
```
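Of these files, `workers/target/tree_attn.py` is the most specialized: to verify a whole token tree in one forward pass, each node may only attend to itself and its ancestors (plus the shared prefix), never to sibling branches. A minimal sketch of such a mask builder, assuming the tree is encoded as parent indices (the real module's API may differ):

```python
import torch

def build_tree_mask(parents: list[int]) -> torch.Tensor:
    """Boolean attention mask for a token tree.

    parents[i] is the index of node i's parent within the tree,
    or -1 if node i hangs directly off the verified prefix.
    mask[i, j] is True iff node i may attend to node j, i.e. j is i
    itself or one of i's ancestors.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: a depth-2 tree with two branches off the first node.
#   0 ── 1
#    └── 2
print(build_tree_mask([-1, 0, 0]).int())
# tensor([[1, 0, 0],
#         [1, 1, 0],
#         [1, 0, 1]])
```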
## Installation

```bash
# Clone
git clone https://github.com/<your-org>/SpecSplit.git
cd SpecSplit

# Create a virtual environment (Python 3.10+)
uv venv && source .venv/bin/activate

# Install in editable mode with dev dependencies
uv pip install -e ".[dev]"

# Generate gRPC stubs from proto
make proto
```
## Environment Variables

All configuration is driven by Pydantic settings with a SPECSPLIT_ prefix.
Override any default via the environment (a minimal sketch of the pattern follows the table):
| Variable | Default | Description |
|---|---|---|
| `SPECSPLIT_DRAFT_MODEL_NAME` | `gpt2` | HuggingFace draft model ID |
| `SPECSPLIT_DRAFT_DEVICE` | `cuda:0` | Draft worker torch device |
| `SPECSPLIT_DRAFT_MAX_DRAFT_TOKENS` | `5` | Gamma — speculation depth |
| `SPECSPLIT_TARGET_MODEL_NAME` | `meta-llama/Llama-2-7b-hf` | Target model ID |
| `SPECSPLIT_TARGET_DEVICE` | `cuda:0` | Target worker torch device |
| `SPECSPLIT_TARGET_MAX_SESSIONS` | `16` | Max concurrent KV cache sessions |
| `SPECSPLIT_ORCH_DRAFT_ADDRESS` | `localhost:50051` | gRPC address of draft worker |
| `SPECSPLIT_ORCH_TARGET_ADDRESS` | `localhost:50052` | gRPC address of target worker |
| `SPECSPLIT_ORCH_MAX_ROUNDS` | `20` | Max draft→verify rounds per prompt |
| `SPECSPLIT_ORCH_MAX_OUTPUT_TOKENS` | `1024` | Max total tokens to generate per prompt |
| `SPECSPLIT_ORCH_MAX_DRAFT_TOKENS` | `5` | Draft tree depth (K / gamma) forwarded to Draft Worker |
| `SPECSPLIT_ORCH_DRAFT_TEMPERATURE` | `0.25` | Draft sampling temperature (0 = greedy) |
| `SPECSPLIT_ORCH_VERIFY_TEMPERATURE` | `0.0` | Verification sampling temperature (0 = greedy) |
| `SPECSPLIT_ORCH_USE_TARGET_KV_CACHE` | `true` | Enable target KV cache (stateful verification) |
| `SPECSPLIT_ORCH_TOKENIZER_MODEL` | `gpt2` | HF model name used for tokenizer |
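The prefixing is the standard pydantic-settings `env_prefix` pattern. Here is a minimal sketch of what the draft section of `core/config.py` could look like; the class and field names are assumptions, not the module's actual contents:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class DraftSettings(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="SPECSPLIT_DRAFT_",
        protected_namespaces=(),  # allow a field named model_name
    )

    model_name: str = "gpt2"   # overridden by SPECSPLIT_DRAFT_MODEL_NAME
    device: str = "cuda:0"     # overridden by SPECSPLIT_DRAFT_DEVICE
    max_draft_tokens: int = 5  # overridden by SPECSPLIT_DRAFT_MAX_DRAFT_TOKENS

settings = DraftSettings()  # reads the environment at instantiation
```

Because the environment is read at instantiation, a prefixed variable such as `SPECSPLIT_DRAFT_DEVICE=cuda:1` takes effect with no code changes.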
## Quick Start

### 1. Start the Target Worker

```bash
SPECSPLIT_TARGET_MODEL_NAME=meta-llama/Llama-3.1-70B \
SPECSPLIT_TARGET_DEVICE=cuda:0 \
python -m specsplit.workers.target.service
```

### 2. Start the Draft Worker

```bash
SPECSPLIT_DRAFT_MODEL_NAME=meta-llama/Llama-3.1-8B \
SPECSPLIT_DRAFT_DEVICE=cuda:1 \
python -m specsplit.workers.draft.service
```

### 3. Run the Orchestrator

```bash
python -m specsplit.workers.orchestrator.client \
  --prompt "What is the capital of France?" \
  --max-rounds 20 \
  --max-output-tokens 256 \
  --max-draft-tokens 3 \
  --draft-temperature 0.15 \
  --verify-temperature 0.15 \
  --use-target-cache
```
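Under the hood, `pipeline.py` overlaps the two halves of each round: while the Target Worker verifies draft *k*, the Draft Worker already speculates on the optimistic continuation. A minimal asyncio sketch of that shape (all names, signatures, and the dummy stub bodies are assumptions; the real workers are called over gRPC):

```python
import asyncio

async def draft(prefix: list[int], k: int) -> list[int]:
    # Placeholder for the gRPC call to the Draft Worker.
    return [0] * k

async def verify(prefix: list[int], spec: list[int]) -> list[int]:
    # Placeholder for the gRPC call to the Target Worker. Returns the
    # accepted prefix of `spec` (with a correction token on mismatch).
    return spec

async def pipeline(prompt_ids: list[int], k: int = 5, max_rounds: int = 20) -> list[int]:
    tokens = list(prompt_ids)
    spec = await draft(tokens, k)
    for _ in range(max_rounds):
        # Verify the current draft while optimistically drafting the
        # continuation, as if every speculative token will be accepted.
        verify_task = asyncio.create_task(verify(tokens, spec))
        optimistic = asyncio.create_task(draft(tokens + spec, k))
        accepted = await verify_task
        tokens += accepted
        if accepted == spec:
            spec = await optimistic  # the gamble paid off
        else:
            optimistic.cancel()      # mispredicted: redraft from the truth
            spec = await draft(tokens, k)
    return tokens

# Example: asyncio.run(pipeline([1, 2, 3]))
```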
## Remote Workers (optional)

For running Draft/Target workers behind a public endpoint (e.g., via ngrok),
use `scripts/manage_remote_worker.sh`.

- Create env files from `scripts/remote_worker.draft.env.example` and
  `scripts/remote_worker.target.env.example`.
- Start and manage the workers:

```bash
scripts/manage_remote_worker.sh start  ~/specsplit-target.env
scripts/manage_remote_worker.sh status ~/specsplit-target.env
scripts/manage_remote_worker.sh logs   ~/specsplit-target.env
scripts/manage_remote_worker.sh stop   ~/specsplit-target.env
scripts/manage_remote_worker.sh update ~/specsplit-target.env
```

See `scripts/start_documentation.md` for a worked example (including the
Orchestrator CLI invocation).
## Development Commands

| Command | Description |
|---|---|
| `make install` | Install package in editable mode |
| `make proto` | Regenerate gRPC Python stubs |
| `make test` | Run unit tests only |
| `make test-all` | Run full test suite (unit + integration) |
| `make lint` | Lint with ruff |
| `make typecheck` | Static analysis with mypy |
| `make format` | Auto-format with ruff |
| `make clean` | Remove build artifacts and caches |
## Contributing

- Create a feature branch from `main`.
- Write tests for any new functionality.
- Ensure `make lint test typecheck` passes.
- Open a PR with a clear description of the change.