How does cross-request KV-cache sharing improve throughput in ChatGPT systems?

Updated May 15, 2026

Short answer

Cross-request KV-cache sharing reuses computation for identical or similar prompt prefixes across multiple requests.

Deep explanation

In large-scale ChatGPT deployments, many users send similar prompts (e.g., system instructions or repeated prefixes). Cross-request KV-cache sharing allows reuse of attention computations for identical prefixes, reducing redundant GPU work.

This requires careful indexing of KV states and strict consistency guarantees to ensure correct outputs. It works best with shared system prompts or templated inputs.

This optimization significantly increases throughput but requires complex cache invalidation and security isolation strategies.

Unlock with a Pro subscription to view this section.

View pricing