How does speculative decoding improve ChatGPT inference latency without sacrificing output quality?

Updated May 15, 2026

Short answer

Speculative decoding uses a smaller model to draft tokens which are verified by a larger model, reducing latency.

Deep explanation

Speculative decoding is a technique where a fast, small model generates candidate token sequences, and a larger model verifies or corrects them. This reduces the number of expensive forward passes required by the large model.

If the small model’s predictions are correct, multiple tokens are accepted in one step. If not, the large model corrects the sequence. This improves throughput while maintaining high-quality outputs.

It is widely used in large-scale LLM systems to reduce inference latency significantly.

Unlock with a Pro subscription to view this section.

View pricing