# Running Experiments
This guide explains how to use the SpecSplit benchmarking harness to run reproducible experiments, sweep hyper-parameters, and collect per-request telemetry.
## Prerequisites
- Both workers running — see Project Guide for startup.
- A prompt dataset — JSONL file, one JSON object per line.
- Python environment — `uv pip install -e ".[dev]"` completed.
## Dataset Format
The benchmark script accepts JSONL. Each line needs at minimum a "prompt" field:
{"prompt": "Explain quantum computing to a 5-year-old."}
{"prompt": "Write a Python function that merges two sorted lists.", "id": "code-01"}
{"prompt": "What is the difference between TCP and UDP?"}
ShareGPT format is also supported — the script auto-extracts the first human turn from `"conversations"`:

```jsonl
{"conversations": [{"from": "human", "value": "What is RLHF?"}]}
```
Tip: For quick smoke tests, create a 5-line JSONL. For publication-grade results, use a 500+ prompt slice of ShareGPT or LMSYS-Chat-1M.
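For reference, a minimal loader that handles both formats could look like the sketch below. It mirrors the behaviour described above but is not the actual parsing code in `benchmarks/runner.py`, whose field handling may differ.

```python
import json
from pathlib import Path


def load_prompts(path: str) -> list[dict]:
    """Read a JSONL file and normalize each line to {"id": ..., "prompt": ...}."""
    records = []
    for i, line in enumerate(Path(path).read_text().splitlines()):
        if not line.strip():
            continue
        obj = json.loads(line)
        if "prompt" in obj:
            prompt = obj["prompt"]
        else:
            # ShareGPT-style record: take the first human turn.
            prompt = next(
                turn["value"]
                for turn in obj["conversations"]
                if turn["from"] == "human"
            )
        records.append({"id": obj.get("id", f"req-{i:04d}"), "prompt": prompt})
    return records
```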
## Basic Run
```bash
python benchmarks/runner.py \
  --prompts data/prompts.jsonl \
  --results-dir benchmarks/results/baseline \
  --draft-addr localhost:50051 \
  --target-addr localhost:50052 \
  --tokenizer gpt2 \
  --gamma 5
```
This runs every prompt with the default Gamma (K=5) and writes per-request metrics to CSV.
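After a run, a quick pandas pass over the output is a useful sanity check before any deeper analysis. This is a sketch; it assumes the per-request CSV lands somewhere under the `--results-dir` you passed, whatever the runner names it.

```python
import glob

import pandas as pd

# Load whichever CSV(s) the run produced under --results-dir.
frames = [pd.read_csv(f) for f in glob.glob("benchmarks/results/baseline/*.csv")]
df = pd.concat(frames, ignore_index=True)

# Column names follow the Metrics Reference table below.
print(df[["ttft_ms", "tpot_ms", "average_acceptance_rate"]].describe())
```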
## Gamma Sweep
Gamma (K) is the draft tree depth — the number of tokens the draft model speculates per round. It directly controls the throughput/acceptance-rate trade-off. This corresponds to `DraftWorkerConfig.max_draft_tokens` and interacts with the tree attention mask (`tree_attn.py`) and the static KV cache rollback depth (`kv_cache.py`).
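To build intuition before sweeping, the standard speculative-decoding estimate of tokens emitted per round can be computed offline. The sketch below assumes a simple chain draft with an i.i.d. per-token acceptance probability `alpha`, so treat it only as a rough guide for the tree-based setup described above.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft->verify round for a chain draft of
    length gamma, assuming each draft token is accepted i.i.d. with
    probability alpha (the verify step always contributes one token)."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


for gamma in (1, 3, 5, 8, 12):
    print(gamma, round(expected_tokens_per_round(0.7, gamma), 2))
```

The diminishing returns this estimate predicts for large gamma are exactly what the TPOT-vs-gamma plot should show empirically.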
Sweep multiple values in a single invocation:
```bash
python benchmarks/runner.py \
  --prompts data/prompts.jsonl \
  --results-dir benchmarks/results/gamma_sweep \
  --draft-addr localhost:50051 \
  --target-addr localhost:50052 \
  --tokenizer gpt2 \
  --gamma 1 3 5 8 12
```
The runner executes the full dataset once per gamma value and writes:
- `benchmarks/results/<run>/summary.csv` (per-gamma aggregation)
- `benchmarks/results/<run>/per_round.csv` (per-request, per-round details; see the completeness check below)
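Before plotting, it is worth confirming that every gamma value covered the full prompt set. A sketch, assuming `per_round.csv` carries `request_id` and `gamma` columns as in the metrics table below:

```python
import pandas as pd

per_round = pd.read_csv("benchmarks/results/gamma_sweep/per_round.csv")

# Every gamma value should have seen the same number of distinct requests.
print(per_round.groupby("gamma")["request_id"].nunique())
```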
## Metrics Reference
Each row in the output CSV contains:
| Column | Unit | Description |
|---|---|---|
| `request_id` | — | Unique identifier (from dataset `"id"` or auto-generated) |
| `gamma` | int | Draft depth (K) for this run |
| `prompt_length` | tokens | Estimated prompt token count |
| `generated_tokens` | tokens | Number of output tokens produced |
| `ttft_ms` | ms | Time-to-First-Token — latency from request start to the first token |
| `tpot_ms` | ms | Time-Per-Output-Token — average inter-token latency |
| `average_acceptance_rate` | 0–1 | Mean fraction of draft tokens accepted per round |
| `total_network_idle_ms` | ms | Cumulative gRPC round-trip overhead |
| `total_latency_ms` | ms | End-to-end wall-clock time for the request |
| `num_rounds` | int | Number of draft→verify iterations |
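A few useful quantities are simple combinations of these columns. The sketch below derives them with pandas; the CSV path is illustrative, and the throughput formula assumes `ttft_ms` covers only the first token.

```python
import pandas as pd

# Illustrative path; point this at the per-request CSV your run produced.
df = pd.read_csv("benchmarks/results/baseline/requests.csv")

# Fraction of wall-clock time spent waiting on gRPC round trips.
df["network_idle_fraction"] = df["total_network_idle_ms"] / df["total_latency_ms"]

# Approximate decode throughput (tokens/s), excluding the prefill phase.
decode_ms = df["total_latency_ms"] - df["ttft_ms"]
df["decode_tok_per_s"] = (df["generated_tokens"] - 1) / (decode_ms / 1000.0)

print(df.groupby("gamma")[["network_idle_fraction", "decode_tok_per_s"]].mean())
```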
## Overriding Configuration
```bash
# Limit generation length
python benchmarks/runner.py \
  --prompts data/prompts.jsonl \
  --results-dir benchmarks/results/custom \
  --max-output-tokens 128 \
  --max-rounds 10

# Point at custom worker addresses
SPECSPLIT_ORCH_DRAFT_ADDRESS=gpu1:50051 \
SPECSPLIT_ORCH_TARGET_ADDRESS=gpu2:50052 \
python benchmarks/runner.py \
  --prompts data/prompts.jsonl \
  --results-dir benchmarks/results/custom
```
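If you need to repeat the same run against several worker pairs, the environment-variable overrides are easy to script. A sketch; the host names are placeholders for your own draft/target machines.

```python
import os
import subprocess

# Placeholder (draft, target) address pairs.
host_pairs = [("gpu1:50051", "gpu2:50052"), ("gpu3:50051", "gpu4:50052")]

for i, (draft, target) in enumerate(host_pairs):
    env = dict(
        os.environ,
        SPECSPLIT_ORCH_DRAFT_ADDRESS=draft,
        SPECSPLIT_ORCH_TARGET_ADDRESS=target,
    )
    subprocess.run(
        [
            "python", "benchmarks/runner.py",
            "--prompts", "data/prompts.jsonl",
            "--results-dir", f"benchmarks/results/pair_{i}",
        ],
        env=env,
        check=True,
    )
```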
## Analyzing Results
Use `benchmarks/analyze_results.py` to generate plots/tables from the CSVs emitted by `benchmarks/runner.py`:

```bash
python benchmarks/analyze_results.py --results-dir benchmarks/results/custom
```
### Key Plots to Produce
- TPOT vs Gamma — shows where throughput gains saturate (see the plotting sketch after this list).
- Acceptance Rate vs Gamma — reveals the quality ceiling of the draft model.
- TTFT vs Gamma — shows first-token latency cost of deeper speculation.
- Network Idle Fraction — `total_network_idle_ms / total_latency_ms` highlights whether the bottleneck is compute or network.
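If you want a one-off figure outside `benchmarks/analyze_results.py`, the first two plots take only a few lines. A sketch, assuming `summary.csv` exposes `gamma`, `tpot_ms`, and `average_acceptance_rate` columns; adjust the names to whatever the aggregation actually emits.

```python
import matplotlib.pyplot as plt
import pandas as pd

summary = pd.read_csv("benchmarks/results/gamma_sweep/summary.csv").sort_values("gamma")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

ax1.plot(summary["gamma"], summary["tpot_ms"], marker="o")
ax1.set_xlabel("gamma (K)")
ax1.set_ylabel("TPOT (ms)")

ax2.plot(summary["gamma"], summary["average_acceptance_rate"], marker="o")
ax2.set_xlabel("gamma (K)")
ax2.set_ylabel("acceptance rate")

fig.tight_layout()
fig.savefig("benchmarks/results/gamma_sweep/gamma_tradeoff.png", dpi=150)
```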
## Reproducibility Checklist
- [ ] Pin model versions in your experiment log (e.g. `Qwen/Qwen2.5-0.5B`)
- [ ] Record GPU type and driver version (`nvidia-smi`)
- [ ] Use the same dataset JSONL across all runs
- [ ] Set `PYTHONHASHSEED=0` and `CUBLAS_WORKSPACE_CONFIG=:4096:8` for determinism
- [ ] Log the exact command used in each CSV filename or beside it (see the metadata sketch below)
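The last two items can be automated. The sketch below snapshots the environment next to the results directory; the filename and fields are illustrative rather than something the runner produces.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_run_metadata(results_dir: str, command: list[str]) -> None:
    """Record the exact command and environment beside the result CSVs."""
    meta = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(command),
        "python": sys.version,
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "nvidia_smi": subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip(),
    }
    Path(results_dir, "run_metadata.json").write_text(json.dumps(meta, indent=2))


write_run_metadata(
    "benchmarks/results/custom",
    ["python", "benchmarks/runner.py", "--prompts", "data/prompts.jsonl"],
)
```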
## Advanced: Async Overlapped Pipeline
The `pipeline.py` module implements an async overlapped execution mode where draft round N+1 is speculatively started while round N is being verified:
```python
from specsplit.workers.orchestrator.pipeline import run_speculative_loop_async

result = await run_speculative_loop_async(
    draft_stub, target_stub, prompt_ids, config
)
print(result.speculation_hits, result.speculation_misses)
```
The benchmark script does not yet use this mode, but it can be integrated for latency-focused experiments. Track `speculation_hits / total_rounds` to quantify the pipeline overlap benefit.
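Continuing the example above, a rough way to compute that ratio is sketched below; it assumes every round resolves to either a hit or a miss, which may not match how `total_rounds` is actually counted.

```python
hits, misses = result.speculation_hits, result.speculation_misses
total_rounds = hits + misses  # assumption: each round is either a hit or a miss
overlap_rate = hits / total_rounds if total_rounds else 0.0
print(f"pipeline overlap hit rate: {overlap_rate:.1%}")
```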