How does cross-request KV-cache sharing improve throughput in ChatGPT systems?
Updated May 15, 2026
Short answer
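Cross-request KV-cache sharing stores the key/value (KV) tensors computed for a prompt prefix and reuses them for later requests that begin with exactly the same tokens. The shared prefix is then processed once instead of once per request.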
Deep explanation
In large-scale ChatGPT deployments, many requests begin with the same tokens, such as a common system prompt or a templated preamble. Cross-request KV-cache sharing lets the server reuse the keys and values already computed for that identical prefix instead of recomputing them in every request's prefill pass, which removes redundant GPU work.
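To make the saving concrete, here is a toy Python model (an illustration, not how ChatGPT is implemented) that counts how many prompt tokens need a prefill forward pass with and without exact-prefix reuse. It stores every cached prefix in a set, which is only practical for tiny inputs; open-source servers such as vLLM and SGLang instead track cached KV in hashed fixed-size blocks or a radix tree. The function name `prefill_tokens` and the token IDs are hypothetical.

```python
from typing import Sequence


def prefill_tokens(requests: Sequence[Sequence[int]]) -> tuple[int, int]:
    """Toy model: count prefill tokens with and without exact-prefix KV reuse."""
    cached: set[tuple[int, ...]] = set()  # token prefixes whose KV we pretend is stored
    without_sharing = 0
    with_sharing = 0
    for request in requests:
        tokens = tuple(request)
        without_sharing += len(tokens)
        # Find the longest prefix of this request that is already cached.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tokens[:i] in cached:
                hit = i
                break
        # Only the uncached suffix needs a forward pass.
        with_sharing += len(tokens) - hit
        # Record the new prefixes so later requests can reuse them.
        for i in range(hit + 1, len(tokens) + 1):
            cached.add(tokens[:i])
    return without_sharing, with_sharing


system_prompt = list(range(100))  # hypothetical 100-token shared system prompt
requests = [system_prompt + [1000 + 10 * k + j for j in range(10)] for k in range(3)]
print(prefill_tokens(requests))  # (330, 130)
```

With three requests that share a 100-token system prompt, only 130 of 330 prompt tokens need a forward pass; the other 200 reuse cached KV. Reuse stops at the first differing token, so merely similar prompts gain nothing past that point.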
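This requires indexing cached KV states by the exact token sequence that produced them. Because the keys and values at each position depend on every earlier token, a cached entry can be reused only when the whole prefix up to that point matches token for token, under the same model weights; anything looser risks returning incorrect outputs. Sharing works best when many requests start with the same system prompt or template.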
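One common way to meet the indexing requirement is to split prompts into fixed-size token blocks and key each block by a hash chained from its parent block, seeded with a model identifier. The sketch below assumes that scheme (it resembles vLLM's automatic prefix caching but is not its code); `block_keys`, `PrefixBlockIndex`, the block size of 4, and the model names are hypothetical, and the KV payloads are placeholder strings.

```python
import hashlib


def block_keys(tokens: list[int], model_id: str, block_size: int = 4) -> list[str]:
    """Chain-hash each full block of tokens so its key encodes the entire prefix before it."""
    keys: list[str] = []
    parent = model_id  # seed with the model/version so KV from other weights never matches
    for b in range(len(tokens) // block_size):  # a trailing partial block is not cached
        block = tokens[b * block_size:(b + 1) * block_size]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixBlockIndex:
    """Maps chained block keys to (placeholder) KV payloads."""

    def __init__(self) -> None:
        self.blocks: dict[str, str] = {}

    def match(self, keys: list[str]) -> int:
        """Return how many leading blocks of a request are already cached."""
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        return hits

    def insert(self, keys: list[str]) -> None:
        for key in keys:
            self.blocks.setdefault(key, f"kv-for-{key[:8]}")  # stand-in for real KV tensors


index = PrefixBlockIndex()
a = list(range(1, 11))          # 10 tokens -> 2 full blocks of 4
index.insert(block_keys(a, "model-v1"))
b = a[:8] + [99, 98, 97, 96]    # same first 8 tokens, then new ones
print(index.match(block_keys(b, "model-v1")))  # 2
c = [0] + a[1:]                 # differs from a only in the first token
print(index.match(block_keys(c, "model-v1")))  # 0
print(index.match(block_keys(a, "model-v2")))  # 0 (different weights)
```

Chaining the hash is what enforces consistency: request `c` differs from `a` only in its first token, yet none of its blocks match, because every later key inherits that difference.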
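This optimization can substantially raise throughput: requests that hit the cache skip the prefill pass over the shared prefix and spend GPU time only on their new tokens, which also lowers time-to-first-token. The cost is operational complexity. The cache needs invalidation whenever the model weights change, eviction under limited GPU memory, and security isolation, since the latency difference between a cache hit and a miss can reveal whether another user recently sent the same prefix.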
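The invalidation and isolation requirements, plus the eviction any memory-bounded cache needs, can be sketched with an LRU map whose keys include an isolation scope and a model identifier. This is a minimal sketch under stated assumptions: scoping by a tenant string is one possible isolation policy (a deployment might instead share only operator-controlled system prompts across users), `ScopedLRUKVCache` and its methods are hypothetical, and capacity is counted in blocks rather than bytes of GPU memory.

```python
from collections import OrderedDict
from typing import Optional


class ScopedLRUKVCache:
    """Toy KV-block cache with LRU eviction, per-scope isolation, and model invalidation."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        # (scope, model_id, block_key) -> placeholder KV payload; order tracks recency.
        self._entries: OrderedDict[tuple[str, str, str], str] = OrderedDict()

    def get(self, scope: str, model_id: str, block_key: str) -> Optional[str]:
        key = (scope, model_id, block_key)
        if key not in self._entries:
            return None  # a miss: other scopes and other models never match
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, scope: str, model_id: str, block_key: str, kv: str) -> None:
        key = (scope, model_id, block_key)
        self._entries[key] = kv
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used block

    def invalidate_model(self, model_id: str) -> None:
        """Drop every block computed with the given model weights."""
        for key in [k for k in self._entries if k[1] == model_id]:
            del self._entries[key]


cache = ScopedLRUKVCache(capacity_blocks=2)
cache.put("tenant-a", "model-v1", "k1", "kv1")
cache.put("tenant-a", "model-v1", "k2", "kv2")
print(cache.get("tenant-b", "model-v1", "k1"))  # None: another scope cannot see it
print(cache.get("tenant-a", "model-v1", "k1"))  # kv1 (k1 becomes most recent)
cache.put("tenant-a", "model-v1", "k3", "kv3")  # over capacity: evicts k2
print(cache.get("tenant-a", "model-v1", "k2"))  # None
cache.invalidate_model("model-v1")              # e.g. after a weight update
print(cache.get("tenant-a", "model-v1", "k3"))  # None
```

A real cache would also reference-count blocks so that KV still in use by a running request is never evicted; the sketch omits this.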