Toward Accelerated LLM Inference: Porting and Evaluating Diffusion-Based Speculative Decoding on TPU
Autoregressive decoding is inherently sequential: generating n tokens requires n target-model forward passes.
We port DFlash—a diffusion-based speculative decoding method that drafts a whole 16-token block in a single parallel forward pass—from GPU/PyTorch to TPU/JAX within the vLLM TPU inference stack (tpu-inference). We benchmark Qwen3-4B across 9 datasets (math, code, chat) on TPU V5P.
Problem and Motivation
The Decoding Bottleneck
LLM inference consists of two stages with very different computational profiles. Prefilling processes the entire input prompt in a single forward pass—all tokens are known in advance, so the computation is fully parallelizable across the prompt length, similar to training. Decoding, however, is inherently sequential: each new token depends on the one generated before it. Latency grows linearly with output length, making decoding the dominant bottleneck for long-form tasks like chain-of-thought reasoning (500–2000 tokens) and code generation (200–1000 tokens).
Speculative Decoding
Since full parallelization is fundamentally incompatible with autoregressive decoding, acceleration techniques must relax the sequential dependency in controlled ways. Speculative decoding does this by having a lightweight draft model propose a sequence of candidate tokens, then having the full target model verify them all in a single batched forward pass. If the draft is good, multiple tokens are accepted per verification step, reducing the number of expensive target-model calls. In the worst case (first token rejected), it degenerates to standard decoding with negligible overhead.
Our Contribution
We port DFlash—a diffusion-based speculative decoding method—from GPU/PyTorch to TPU/JAX inside the tpu-inference runtime.
We evaluate performance in both standalone loops (isolating model compute) and the full vLLM serving pipeline (including scheduling, KV cache management, and rejection sampling).
We compare against Eagle3 (an autoregressive drafter) and verify that output quality is preserved.
TPUs are uniquely suited to this approach: their matrix-unit (MXU) architecture favors large, dense, data-parallel operations—exactly the pattern DFlash uses when predicting a 16-token block in one pass. Critically, TPU verification cost is flat from K=16 to K=128 (0.97×), while GPU verification scales to 2.3× at longer contexts. This means DFlash + TPU is the only combination where both draft and verification costs remain constant as block size grows, opening the door to much wider draft blocks than are practical on GPU.
Background
Speculative Decoding
Speculative decoding introduces parallelism into the decoding stage by using a fast, cheap draft model to propose a sequence of n candidate tokens, then having the full target model verify all candidates in a single batched forward pass. At each position, the draft token is checked against the target model’s distribution using a rejection-sampling rule. Verification proceeds from position 1 onward: the first rejected token causes all remaining drafts to be discarded, and the target model samples from its own distribution at that position.
In the worst case (first draft token rejected), speculative decoding degenerates to standard decoding with negligible additional cost. In typical cases, multiple tokens are accepted per target-model call, reducing average latency while preserving the exact output distribution. The key metric is τ (average acceptance length)—how many draft tokens are accepted per verification step. Higher τ means more tokens generated per expensive target-model forward pass.
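The acceptance rule above can be sketched in a few lines of numpy. This is a standalone illustration, not the vLLM rejection sampler; for brevity it omits the extra "bonus" token that standard speculative decoding samples when all k drafts survive.

```python
import numpy as np

def verify_draft(draft_tokens, p_draft, p_target, rng):
    """Verify a block of draft tokens against the target distribution.

    draft_tokens: (k,) proposed token ids
    p_draft, p_target: (k, V) probabilities at each draft position
    Accepts left to right; on the first rejection, resamples from the
    residual distribution and discards the rest of the block.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept with probability min(1, p_target(tok) / p_draft(tok)).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            out.append(int(tok))
        else:
            # Resampling from the residual max(0, p_target - p_draft),
            # renormalized, preserves the exact target distribution.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(residual.size, p=residual)))
            break
    return out
```

When draft and target distributions coincide, every token is accepted; the average accepted count per call is exactly the τ metric discussed above.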
Draft Model Approaches
Draft models span a wide range of speed–quality tradeoffs:
- N-gram drafting: non-neural and extremely fast, but provides a poor approximation of the target distribution, yielding low acceptance rates and limited speedup.
- Eagle3: a small one-layer transformer that reuses target-model context (hidden states and last-token embeddings). Produces much better drafts but remains autoregressive—each draft token depends on the previous one, requiring O(k) sequential forward passes for k draft tokens. This sequential proposal cost grows linearly with block size.
- Large diffusion models: explored as drafters, but their memory footprint outweighs latency benefits. Small diffusion models tend to produce drafts that poorly align with the target distribution.
DFlash Architecture
DFlash replaces sequential drafting with a diffusion-style block drafter that predicts an entire fixed-size token block (16 tokens) in a single forward pass using non-causal attention. The draft model is not a full language model—it has no embedding layer or LM head of its own. It consists of a small transformer stack (4 decoder layers) with a custom attention pattern, and it reuses the target model’s embedding and LM head for both input and output.
Input: The block of positions to be predicted is represented as token IDs (including mask placeholders for unknown future positions), passed through the target model’s embedding layer. Additionally, the draft receives target hidden states from a subset of the target model’s layers (layers [1, 9, 17, 25, 33] for Qwen3-4B), which are concatenated and projected via an FC layer + RMSNorm. This projected vector conditions the draft on what the target “thinks” at the current position.
Attention: Each DFlash decoder layer uses a custom attention mechanism where queries come from the block, while keys and values come from both the target context and the block. Crucially, attention is non-causal within the block: all positions attend to each other bidirectionally. At K=16, each position sees 15 neighbors; at K=128, each position would see 127 neighbors—providing fundamentally richer conditioning than autoregressive drafters, which can only see past positions.
Output: The block of hidden states produced by the draft is fed into the target model’s LM head to obtain logits. The draft model never has its own vocabulary projection—it always uses the target’s, ensuring draft proposals and target verification share the same vocabulary space. This design reduces draft cost from O(k) to O(1) while remaining expressive enough to achieve high acceptance rates. Trained on just 289K samples, DFlash outperforms Eagle3 (trained on 1.4M samples) in inference acceleration.
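The input-attention-output flow above can be sketched with toy shapes. This is a minimal numpy sketch: the dimensions, the identity stand-in for the 4-layer draft stack, and all parameter names are illustrative, not the real DFlash code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; the real model uses Qwen3-4B's hidden
# size and vocabulary, with 5 captured layers and a 16-token block).
V, D, K, N_CAPTURE = 32, 8, 16, 5
MASK_ID = V - 1  # placeholder id for unknown future positions

# Parameters DFlash reuses from the target model (shared vocab space).
embed = rng.normal(size=(V, D))                 # target embedding table
lm_head = rng.normal(size=(D, V))               # target LM head
fc_proj = rng.normal(size=(N_CAPTURE * D, D))   # hidden-state projection

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def dflash_draft_step(captured_hidden, draft_stack):
    """One draft pass: propose a K-token block in a single forward.

    captured_hidden: (N_CAPTURE, D) hidden states from selected target layers
    draft_stack: stand-in for the non-causal 4-layer draft transformer
    """
    # 1. Embed the all-mask block with the *target's* embedding table.
    block = embed[np.full(K, MASK_ID)]                          # (K, D)
    # 2. Condition on the target: concat captured states, FC + RMSNorm.
    cond = rms_norm(np.concatenate(captured_hidden) @ fc_proj)  # (D,)
    # 3. Run the draft stack (non-causal attention within the block).
    h = draft_stack(block + cond)                               # (K, D)
    # 4. Project with the *target's* LM head; greedy draft tokens.
    return np.argmax(h @ lm_head, axis=-1)                      # (K,)

draft_tokens = dflash_draft_step(rng.normal(size=(N_CAPTURE, D)),
                                 draft_stack=lambda x: x)  # identity stub
```

The key structural points survive even at toy scale: the drafter owns no embedding or vocabulary projection, and the whole block is produced in one call rather than K sequential ones.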
Methods: GPU → TPU Migration
Key Engineering Steps
- Dual KV cache architecture: The GPU reference uses `DynamicCache` with KV concatenation. On TPU, paged KV (vLLM PagedAttention) serves the target model, while the draft model uses a separate static JAX KV cache with `dynamic_update_slice`, matching the GPU architecture's per-layer static caches. The KV cache manager was extended to allocate `draft_layer.{i}` specs for all draft layers instead of hardcoding a single layer (as was done for Eagle3).
- Non-causal attention kernel: DFlash's draft attention is explicitly `is_causal=False`: all block positions attend to each other bidirectionally. This was the main parity risk, since the existing TPU decode path was optimized around causal ragged paged attention. We route draft layers through TPU `flash_attention` with `causal=False`, while keeping the target model on ragged paged causal attention. The reference uses token-axis K/V concatenation (not additive fusion), which we match exactly.
- Sequence-length inflation fix: We discovered that `attn_metadata.seq_lens` included unverified draft tokens (~15 phantom tokens per step), corrupting the proposer's context buffer, KV cache positions, and RoPE embeddings. Each decode step inflated the sequence length by the full draft block size rather than the accepted count, causing positional-encoding drift and stale context. Fixing this single bug by using `num_tokens_no_spec` (the actual accepted count) nearly doubled performance: τ jumped from 2.49 to 4.48, speedup from 1.30× to 2.31×.
- Target hidden-state extraction: An auxiliary capture path was added to the Qwen3 target model. Hidden states from layers [1, 9, 17, 25, 33] are concatenated along the feature dimension, projected via an FC layer + RMSNorm, and passed to the draft model as contextual conditioning. The layer selection follows the DFlash checkpoint configuration and is deterministic.
- Method registration & dispatch: DFlash was integrated using the same pattern as Eagle3: a new `"dflash"` method branch in `tpu_runner.py`, a `DFlashProposer` class under `spec_decode/jax/`, and dispatch routing in `speculative_decoding_manager.py`. Compilation prewarm support and precompile helpers were extended for the `dflash` path.
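The sequence-length fix comes down to choosing the right write offset into the draft's static cache. A numpy stand-in for the `jax.lax.dynamic_update_slice` update (function and variable names here are illustrative, not the tpu-inference code):

```python
import numpy as np

def update_draft_kv(cache, new_kv, num_accepted):
    """Write a draft block's K/V into a static cache buffer.

    The offset must be the verified token count (what num_tokens_no_spec
    reports), not the previous length plus the full draft block: offsetting
    by the block size leaves ~15 phantom tokens per step in the cache and
    drifts the RoPE positions, which is the inflation bug described above.
    """
    out = cache.copy()  # jax.lax.dynamic_update_slice is also functional
    out[num_accepted:num_accepted + new_kv.shape[0]] = new_kv
    return out
```

The functional copy-then-write mirrors JAX semantics, where `dynamic_update_slice` returns a new array rather than mutating the cache in place.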
The implementation preserves the full DFlash contract: extract target hidden-state features, run a lightweight 4-layer block drafter with non-causal attention, reuse the target’s embedding layer and LM head for logits, then verify the full 16-token draft block in one target forward pass.
It adds zero new vLLM dependencies—DFlash runs entirely within the tpu-inference runtime.
Rejection and acceptance are handled centrally by the existing rejection sampler; the proposer only returns draft token IDs for active requests.
Results
We benchmark Qwen3-4B (target) + DFlash-b16 (draft) across 9 datasets spanning math, code, and chat tasks on TPU V5P (4 chips). The standalone loop achieves an overall 3.01× speedup with τ = 5.42, reaching 3.72× on math benchmarks (τ = 6.71). Math tasks see the highest acceptance because reasoning chains are more predictable for the drafter; chat tasks (1.96×) show lower τ due to higher entropy. On math benchmarks where GPU comparison data is available, TPU achieves 94.9% of GPU paper τ, and exceeds GPU on Math500 (τ = 8.80 vs 7.84).
In the full vLLM serving pipeline (with scheduling, batch management, and rejection sampling), DFlash achieves 2.31× speedup at τ = 4.48. The gap between standalone and pipeline τ (6.67 vs 4.48) comes from vLLM orchestration overhead, not model compute. Output mismatches (bf16 floating-point divergence in batch-16 verify vs single-token baseline) do not indicate correctness loss.
Additional Findings
Inference Demo
Pick a dataset prompt and compare decoded outputs and throughput across baseline decoding, DFlash (GPU/TPU), and Eagle3.
Conclusion
We demonstrate that diffusion-based speculative decoding transfers effectively to TPU. The DFlash port achieves 94% of GPU draft quality (τ=6.67 standalone, 94.9% of GPU paper τ on math benchmarks) and delivers meaningful acceleration in both standalone (3.01×) and serving-pipeline (2.31×) settings. Output quality is preserved: token mismatches arise from bf16 floating-point divergence in batch verification, not correctness errors, and final answers match on math benchmarks.
Key Findings
Step Profiling — GSM8K on TPU V4
Time breakdown of a single speculative decoding step: where the compute actually goes.
Under sync-barrier measurement, core compute (draft forward + verify forward) accounts for only 17.2% of total step time. The remaining 82.8% appears as overhead, but this is misleading: JAX’s lazy evaluation already pipelines host-device operations well.
The real bottleneck is vLLM orchestration: the rejection-sampling loop, request scheduling, and KV cache management. Verification alone consumes 59% of step time, with the two LM-head matmuls (draft logits + verify logits) at ~30%.
This profile directly motivates two Future Work items: (1) optimizing the vLLM scheduling path to close the standalone–pipeline τ gap, and (2) approximate or fused LM-head approaches to reduce the 30% matmul overhead.
Speculative Methods Compared
DFlash standalone vs DFlash vLLM pipeline vs Eagle3: speedup comparison on math benchmarks (TPU V4).
Three speculative decoding configurations are compared on math benchmarks using TPU V4. DFlash standalone runs the draft-verify loop without vLLM overhead; DFlash pipeline runs inside the full vLLM serving stack; Eagle3 is an autoregressive drafter baseline.
The standalone–pipeline gap (τ 6.67 vs 4.48) quantifies vLLM orchestration cost. Eagle3’s lower speedup comes from its O(k) sequential drafting: 16 draft tokens require 16 serial forward passes vs DFlash’s single pass.
Despite Eagle3’s competitive acceptance rates (it was trained on 1.4M samples vs DFlash’s 289K), its sequential proposal bottleneck limits throughput. This demonstrates that proposal speed, not just draft quality, determines end-to-end speculative decoding performance.
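The proposal-cost argument can be captured with a back-of-envelope model. The τ value and the 10% relative draft-pass cost below are illustrative assumptions, not measured numbers:

```python
def spec_decode_speedup(tau, t_target, t_draft_pass, draft_passes):
    """Expected speedup over plain autoregressive decoding.

    One speculative step costs draft_passes * t_draft_pass (proposal)
    plus t_target (one batched verify) and yields tau tokens on average;
    the baseline yields one token per t_target.
    """
    return tau * t_target / (draft_passes * t_draft_pass + t_target)

# Same draft quality (tau), different proposal cost:
dflash_like = spec_decode_speedup(tau=5.0, t_target=1.0,
                                  t_draft_pass=0.1, draft_passes=1)   # O(1)
eagle3_like = spec_decode_speedup(tau=5.0, t_target=1.0,
                                  t_draft_pass=0.1, draft_passes=16)  # O(k)
```

Even with identical acceptance, the 16 serial draft passes more than double the step time in this toy model, which is the effect the Eagle3 comparison shows.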
- Pipeline overhead dominates the standalone–pipeline gap: profiling shows that 82.8% of step time appears as overhead under sync-barrier measurement, but JAX’s lazy evaluation already pipelines operations well. The real bottleneck is vLLM orchestration (scheduling, rejection sampling loop), not host-device transfers. Verification alone accounts for 59% of step time; the two LM head matmuls account for ~30%.
- TPU verification cost is flat: verification cost ratio at K=128/K=16 is 0.97× on TPU at all context lengths tested (L=64–1024), compared to ~2.3× on GPU at L=1024. The entire hardware contrast lives in attention handling—FFN is memory-bandwidth-bound and flat on both hardware (GPU FFN: 1.09×, TPU FFN: 0.95×). TPU’s paged attention kernel (RPA v3) and systolic pipeline absorb attention compute within the weight-loading window.
- Diffusion + TPU intersection: only the combination of diffusion drafting (O(1) draft cost regardless of block size) and TPU (flat verification) makes larger block sizes (K>16) viable. An autoregressive drafter at K=128 requires 8× sequential passes on either hardware. GPU DFlash draft cost is 1.22× at K=128 (near-flat), but GPU verification attention scales to 2.51× at L=1024. Neither alone is sufficient; only the bottom-right cell of the 2×2 matrix (diffusion + TPU) keeps both sides flat.
- TPU advantage grows with context length: GPU verification penalty at K=128 grows from ~1.08× (L=64) to ~2.3× (L=1024). Speculative decoding benefits the most on long-generation tasks (chain-of-thought, code generation)—exactly where the TPU advantage is strongest.
- Iterative refinement does not help: replacing mask tokens with predicted tokens for refinement passes degrades τ from 6.18 to ~2.5, because the model was trained on mask-token inputs and sees predicted tokens as out-of-distribution.
Hardware Generations
DFlash Speedup: TPU V4 vs V5P
Speedup comparison across TPU generations: V4 vs V5P on matching benchmarks.
V5P’s autoregressive baseline is 1.69× faster than V4, meaning V5P starts from a higher throughput floor. DFlash speedup ratios appear slightly lower on V5P because the same absolute tok/s improvement yields a smaller ratio against a faster baseline.
However, absolute DFlash TPS on V5P is higher across all benchmarks. For capacity planning, absolute throughput matters more than the ratio—V5P DFlash serves more tokens per second than V4 DFlash on every dataset.
The τ values are similar across generations (same draft model checkpoint, same attention pattern), confirming that draft quality is a property of the model, not the hardware. The throughput difference comes from V5P’s faster MXU and HBM bandwidth.
Cost Efficiency — TPU vs GPU
Cost per million tokens at GCP on-demand pricing: V5P $2.10/hr, V4 $3.22/hr, GPU A100 ~$5.07/hr.
Each bar shows the dollar cost to generate one million tokens on that hardware, calculated as: (price_per_hour / tokens_per_second) × (1,000,000 / 3600). Lower bars mean more cost-effective inference.
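The same formula in code, using the on-demand prices above. The throughput values are placeholders for illustration, not our measured TPS:

```python
def cost_per_mtok(price_per_hour, tokens_per_second):
    """Dollar cost to generate one million tokens on a given machine."""
    return (price_per_hour / tokens_per_second) * (1_000_000 / 3600)

# GCP on-demand prices from above; hypothetical throughputs.
v5p_cost = cost_per_mtok(2.10, 300.0)   # TPU V5P at 300 tok/s
a100_cost = cost_per_mtok(5.07, 150.0)  # A100 at 150 tok/s
```

Because price divides by throughput, a DFlash speedup lowers $/Mtok by the same factor, which is why the cheap-and-fast V5P configuration compounds.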
TPU V5P with DFlash is the most cost-efficient option across all benchmarks. Although V5P’s on-demand price ($2.10/hr per chip) is lower than V4 ($3.22/hr) and GPU A100 ($5.07/hr), the cost advantage is amplified by DFlash’s throughput boost: higher tok/s at lower $/hr compounds into dramatically lower $/Mtok.
The dashed line shows the average V5P cost across benchmarks. Math benchmarks achieve the lowest cost because their high τ translates to the highest DFlash throughput. Chat benchmarks are more expensive per token but still cheaper than the GPU baseline on every task.
Note: prices are GCP on-demand and don’t reflect committed-use discounts, spot pricing, or reserved capacity, which would further favor TPU (Google offers deeper discounts on its own hardware).
Future Work
- Wider block sizes (K=64, K=128): TPU’s flat verification cost enables training DFlash drafters at larger block sizes. Our measurements show both draft and verify cost remain flat through K=128 on TPU (draft: 0.95×, verify: 0.97×). At K=128, each draft position sees 127 bidirectional neighbors versus only 15 at K=16, providing fundamentally richer conditioning—an advantage exclusive to diffusion-style parallel drafters. No published work has trained a target-conditioned block-diffusion drafter at K≥64; this fills a key gap in the literature.
- vLLM pipeline optimization: the τ gap between standalone (6.67) and pipeline (4.48) stems from vLLM orchestration overhead (scheduling, rejection sampling loop), not model compute. Instrumenting and optimizing `speculative_decoding_manager.py`'s scheduling path could recover a significant portion of this gap.
- LM head optimization: the two LM head matmuls (draft logits + verify logits) account for ~30% of step time. Approximate or fused approaches such as top-k projection could reduce this cost substantially.
- Context-position τ correlation: preliminary analysis suggests acceptance rate may improve as generation progresses (more KV context → better drafter conditioning). If confirmed, this would be a compounding advantage: longer conversations yield both better τ and no additional verification cost on TPU.
- K-ceiling characterization: the flat region has been measured through K=256 (1.02×). Extending measurements to K=512 and K=1024 would characterize where the memory-bandwidth-bound regime transitions to compute-bound, establishing the full design space for wider drafters.
- Upstream contribution: the implementation requires zero vLLM changes and is ready for a `tpu-inference` PR. A small 10-line vLLM change would enable `vllm serve --speculative-method dflash`.