What is speculative decoding in large language model inference optimization?

Updated May 17, 2026

Short answer

Speculative decoding speeds up LLM inference by having a small, fast draft model propose several tokens ahead, which a larger target model then verifies in a single forward pass.

Deep explanation

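Speculative decoding pairs a fast draft model with a larger target model. The draft model generates a short run of candidate tokens autoregressively, and the target model then scores all of them in one parallel forward pass. The system keeps the longest prefix of candidates that the target model agrees with, discards the rest, and resumes drafting from the first rejected position. Because several tokens can be committed per target-model pass instead of one, latency per generated token drops. With the standard acceptance rules (exact match under greedy decoding, or rejection sampling when sampling), the output follows the same distribution as decoding with the target model alone, so output quality is preserved. The speedup depends on how often the draft model's proposals are accepted: a draft model that is poorly aligned with the target produces high rejection rates, and the drafting work is wasted.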
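
The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant only, not a production implementation. The `NextToken` interface, the two toy models, and the parameter `k` are hypothetical names chosen for this example, and the per-position calls to the target stand in for what a real system would compute in one batched forward pass. The sketch also omits the extra "bonus" token that real implementations usually take from the target when every drafted token is accepted.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[int]], int]


def target_only_decode(target_next: NextToken, prompt: List[int],
                       max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(draft_next: NextToken, target_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verify rounds)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    verify_rounds = 0
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens, one after another.
        n = min(k, limit - len(out))
        draft: List[int] = []
        for _ in range(n):
            draft.append(draft_next(out + draft))

        # 2. The target model scores every drafted position. A real system
        #    does this in a single parallel pass; here it is simulated.
        verify_rounds += 1
        verified = [target_next(out + draft[:i]) for i in range(n)]

        # 3. Keep the longest agreeing prefix, then substitute the target's
        #    own token at the first disagreement.
        accepted = 0
        while accepted < n and draft[accepted] == verified[accepted]:
            accepted += 1
        out.extend(draft[:accepted])
        if accepted < n:
            out.append(verified[accepted])
    return out, verify_rounds


# Toy models (assumptions for illustration): the draft agrees with the
# target except when the last token is 5.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 5 else toy_target(prefix)


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_draft, toy_target, [1], 12, k=4)
    print(tokens, rounds)
```

In this sketch the speculative output is identical to `target_only_decode` by construction, since every committed token is either confirmed or supplied by the target. Only the number of verify rounds changes with draft quality: a draft that never agrees with the target degrades to one committed token per round.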
