
One post tagged with "Ceph"

Ceph storage-related content


KV Caching with vLLM, LMCache, and Ceph

· 17 min read
Kyle Bader
Chief Architect, Data and AI, Ceph at IBM
Tushar Gohad
Distinguished Engineer, Intel

Inference accounts for 90% of the machine learning costs of deployed AI systems, and it is no surprise that inference optimization is a burgeoning topic in the research community. IDC estimates that global enterprises will invest $307 billion in AI solutions in 2025, a number expected to grow aggressively year over year.

Understanding the workload

Unlike training, inference for autoregressive language models only involves the forward pass, which itself is broken into two distinct phases: prefill and decode. Each phase has a distinct workload profile – prefill tends to be compute bound, consuming every ounce of floating-point throughput the system can muster, while the decode phase that follows is limited principally by memory bandwidth.
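To make the two phases concrete, here is a self-contained toy sketch in PyTorch: a single attention head with random stand-in projection weights, no batching, and no real model. It is meant only to show the shape of the work in each phase, not how vLLM or any other engine implements it.

```python
import torch

D = 64                                   # head dimension (toy value)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))   # stand-in projection weights

def attend(q, K, V):
    # Scaled dot-product attention of one query against all cached keys/values.
    scores = (q @ K.T) / D ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def prefill(x):                          # x: (prompt_len, D) token embeddings
    # Compute-bound: one big parallel matmul over every prompt token at once.
    K, V = x @ Wk, x @ Wv
    q_last = x[-1] @ Wq
    return attend(q_last, K, V), (K, V)  # output for the last token + KV cache

def decode_step(x_new, kv_cache):        # x_new: (D,) embedding of the newest token
    # Memory-bound: tiny compute per step, but the entire KV cache is read each time.
    K, V = kv_cache
    K = torch.cat([K, (x_new @ Wk).unsqueeze(0)])
    V = torch.cat([V, (x_new @ Wv).unsqueeze(0)])
    return attend(x_new @ Wq, K, V), (K, V)

prompt = torch.randn(16, D)              # 16 "prompt tokens"
out, cache = prefill(prompt)             # prefill phase: whole prompt at once
for _ in range(4):                       # decode phase: one token at a time
    out, cache = decode_step(torch.randn(D), cache)
```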

The computational cost of attention grows quadratically with sequence length, so both prefill and decode become more expensive with every additional token. Prefill is easy to parallelize: all prompt tokens are known up front when a request arrives at the model API, so they can be processed together across GPUs. During decode, the transformer's multi-head attention mechanism must compute attention across all previous tokens for each newly generated one, including the prompt(s) and the response generated so far. This complicates the deployment of inference services at a time when context lengths are growing rapidly to accommodate larger code bases, longer documents, and retrieval-augmented generation.

With KV caching, the computed key and value tensors that correspond to token sequences in a prompt are saved for later and retrieved when those sequences appear in a subsequent prompt, avoiding the cost of recomputation (GPU hours) and reducing the time between when the prompt is submitted and the first response token arrives (time-to-first-token, or TTFT).
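The idea can be sketched with a toy prefix-keyed store for K/V tensors. This is not LMCache's actual API; the class and method names below are hypothetical, and real implementations (vLLM's paged KV cache, LMCache's tiered backends) manage memory at block granularity rather than whole prefixes.

```python
import hashlib
import torch

class PrefixKVStore:
    """Toy store mapping a token prefix to the K/V tensors computed for it."""

    def __init__(self):
        self._store: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

    @staticmethod
    def _key(token_ids: list[int]) -> str:
        # Hash the token sequence so equal prefixes map to the same entry.
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def put(self, token_ids: list[int], keys: torch.Tensor, values: torch.Tensor) -> None:
        self._store[self._key(token_ids)] = (keys, values)

    def get(self, token_ids: list[int]):
        """Return cached K/V for the longest matching prefix, plus its length."""
        for end in range(len(token_ids), 0, -1):
            hit = self._store.get(self._key(token_ids[:end]))
            if hit is not None:
                return hit, end          # hit: only tokens[end:] still need prefill
        return None, 0                   # miss: the full prompt must be prefilled
```

A request whose prompt shares a cached prefix only needs to prefill the remaining suffix, which is where the GPU-hour and TTFT savings described above come from.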