As large language models scale to longer context windows and serve more concurrent users, the key-value (KV) cache has emerged as a major memory bottleneck in production inference systems. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the resulting KV cache can occupy up to 180 GB of memory. For reference, a 7-billion-parameter model's parameters consume 14 GB of GPU memory, while the KV cache for the same model can require around 72 GB.
Compressing the KV cache reduces memory pressure, increases batch sizes, and directly improves throughput without retraining the base model. Over the past two years, several distinct compression strategies have emerged from research. This article breaks down the ten most important ones, with emphasis on how each works and where it fits in a practical inference pipeline.
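A back-of-the-envelope calculation reproduces the 180 GB figure. The 48-layer, 7,168-hidden configuration and fp16 precision below are illustrative assumptions for a 30B-class model, not a specific published architecture:

```python
# Back-of-envelope KV cache sizing. The 48-layer / 7168-hidden configuration
# and fp16 (2 bytes/element) are illustrative assumptions for a 30B-class model.
def kv_cache_bytes(n_layers, hidden, seq_len, batch, bytes_per_elem=2):
    # 2 tensors per layer (K and V), each of shape [batch, seq_len, hidden]
    return 2 * n_layers * hidden * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(n_layers=48, hidden=7168, seq_len=1024, batch=128) / 1e9
print(f"{gb:.0f} GB")  # -> 180 GB
```

The same formula makes the batch-size dependence explicit: doubling the batch doubles the cache, which is why compression translates directly into serving capacity.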
Token Eviction with H2O (Heavy Hitter Oracle)
H2O, published at NeurIPS 2023, is one of the foundational token eviction methods. Its core observation is that a small portion of tokens contributes the majority of attention score mass during generation; these are called Heavy Hitters (H2). H2O dynamically retains a balance of recent tokens and H2 tokens, maintaining a fixed KV cache size across Transformer layers. The selection process is driven by cumulative attention scores averaged across all queries and tokens.
Attention weight distribution follows a power law, which means evicting low-scoring tokens incurs minimal accuracy loss in practice. H2O is a decoding-phase method and does not reduce prefill computation, which remains a limitation for long-context prompts. With 20% heavy hitters, H2O improves throughput over Hugging Face Accelerate by up to 29× on OPT-6.7B and OPT-30B.
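In sketch form, the eviction rule keeps the union of a recency window and the tokens with the largest cumulative attention scores. The minimal example below (function name and budgets are illustrative, not from the H2O codebase) shows the selection for one head:

```python
import numpy as np

# Hedged sketch of H2O-style eviction for one attention head: keep the
# `n_recent` newest tokens plus the `n_heavy` tokens with the largest
# cumulative attention scores. Names and budgets are illustrative.
def h2o_keep_indices(attn, n_heavy=2, n_recent=2):
    # attn: [n_queries, n_keys] attention weights from decoding steps
    scores = attn.sum(axis=0)                 # cumulative score per key token
    n_keys = attn.shape[1]
    recent = set(range(n_keys - n_recent, n_keys))
    # rank non-recent tokens by cumulative score, take the top n_heavy
    candidates = [int(i) for i in np.argsort(-scores) if int(i) not in recent]
    return sorted(set(candidates[:n_heavy]) | recent)

attn = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                 [0.6, 0.1, 0.1, 0.10, 0.10]])
print(h2o_keep_indices(attn))  # token 0, a heavy hitter, survives eviction
```

Token 0 dominates the cumulative score, so it is retained even though it is the oldest position, which is exactly the behavior a pure sliding window cannot reproduce.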
StreamingLLM (Attention Sink Retention)
StreamingLLM is designed for scenarios where LLMs must handle very long or infinite input streams. Its strategy is to always retain the KV states of the first few tokens, which serve as attention sinks, and combine them with a sliding window of the most recent tokens up to the available memory budget.
The insight is that initial tokens, regardless of their semantic content, function as structural anchors that receive disproportionately high attention throughout generation. Dropping them causes significant accuracy degradation, while preserving them alongside a recency window stabilizes outputs. StreamingLLM is fast and hardware-friendly but does not use importance scoring, which means it can discard semantically important middle-context tokens. It is best suited for streaming dialogue applications where recent context dominates.
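The cache policy itself reduces to a few lines of index arithmetic. This sketch assumes 4 sink tokens and a 1,020-token window so the cache holds a fixed 1,024 entries; the function name and budget values are illustrative:

```python
# Minimal sketch of StreamingLLM's cache policy: retain the first `n_sink`
# tokens (attention sinks) plus a sliding window of the most recent tokens.
# Names and budgets are illustrative, not from the paper's code.
def streaming_cache_indices(seq_len, n_sink=4, window=1020):
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

idxs = streaming_cache_indices(seq_len=10_000)
print(len(idxs))  # cache stays fixed at 1024 entries regardless of stream length
```

Note that everything between the sinks and the window is dropped unconditionally, which is the source of the middle-context weakness described above.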
SnapKV (Observation Window Compression)
SnapKV addresses the prefill stage specifically, targeting long-prompt scenarios. It uses a small observation window at the end of the prompt to predict token importance. The attention scores from queries in this observation window are aggregated to vote for important positions (the heavy hitters) in the prefix.
Unlike H2O, SnapKV employs a pooling layer over the observation window's attention scores to select clustered important KV positions per attention head, rather than using a flat cumulative importance score across the entire sequence. This head-specific selection makes SnapKV more accurate than H2O at the same cache budget. SnapKV has become a widely used baseline for prefill-phase compression and is directly comparable to H2O on benchmarks such as LongBench.
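The pooling step is what distinguishes SnapKV's selection: neighbors of a strongly voted position are kept too, so retained positions form clusters. The sketch below for a single head uses an illustrative pool size of 3; names and shapes are assumptions, not the paper's code:

```python
import numpy as np

# Hedged sketch of SnapKV-style selection for one head: aggregate the
# observation-window queries' attention over the prefix, apply a 1-D
# max-pool so clustered positions are kept together, then take the top-k.
def snapkv_select(attn_obs, k, pool=3):
    # attn_obs: [obs_window, prefix_len] attention from observation-window queries
    votes = attn_obs.sum(axis=0)              # aggregated vote per prefix position
    padded = np.pad(votes, pool // 2)
    # max-pool: each position inherits the strongest vote in its neighborhood
    pooled = np.array([padded[i:i + pool].max() for i in range(len(votes))])
    return sorted(int(i) for i in np.argsort(-pooled)[:k])

attn_obs = np.array([[0.05, 0.50, 0.05, 0.05, 0.05, 0.30],
                     [0.05, 0.40, 0.05, 0.10, 0.10, 0.30]])
print(snapkv_select(attn_obs, k=3))  # the cluster around position 1 is kept
```

With a flat cumulative score, positions 1 and 5 would win individually; the pooling pass instead keeps positions 0-2 as a contiguous cluster around the strongest vote.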
PyramidKV / PyramidInfer (Layer-Wise Pyramidal Allocation)
A key limitation of H2O and SnapKV is that they apply a uniform compression budget across all Transformer layers. PyramidKV addresses this by allocating different cache sizes per layer based on attention pattern structure. The complementary system, PyramidInfer, extends this to the prefill phase itself.
PyramidInfer finds that the number of important keys and values that influence future generation decreases layer by layer, and extracts them by measuring consistency in attention weights across recent tokens. By computing fewer keys and values in deeper layers during prefill rather than pruning a pre-computed cache, PyramidInfer reduces memory earlier in the pipeline. Experimental results show PyramidInfer improves throughput by 2.2× compared to Hugging Face Accelerate, with over 54% GPU memory reduction in the KV cache.
The intuition aligns with how information funnels through Transformer depth: early layers need richer context, while deeper layers converge on a smaller set of salient tokens. Assigning compression budgets proportionally to each layer's actual information density is more efficient than applying a flat budget uniformly.
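One simple way to realize a pyramidal allocation is a linearly decaying per-layer budget that preserves the same total as a flat scheme. The linear schedule below is an illustrative assumption, not the papers' exact allocation rule:

```python
# Illustrative pyramidal per-layer budget: earlier layers get a larger KV
# budget, deeper layers a smaller one, with the same total as a flat
# allocation. The linear decay schedule is an assumption for illustration.
def pyramid_budgets(n_layers, total_budget):
    weights = [n_layers - i for i in range(n_layers)]  # linearly decaying
    s = sum(weights)
    return [round(total_budget * w / s) for w in weights]

budgets = pyramid_budgets(n_layers=4, total_budget=1000)
print(budgets)  # [400, 300, 200, 100]: rich early layers, lean deep layers
```

A flat scheme would give every layer 250 tokens; the pyramid spends the same total where the attention structure suggests it matters most.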
KV Cache Quantization: KIVI
KIVI, published at ICML 2024, is a plug-and-play 2-bit KV cache quantization algorithm that requires no fine-tuning. It quantizes the key cache per-channel and the value cache per-token.
The asymmetric scheme is motivated by observed distributional differences: keys exhibit larger channel-wise outliers, while values are better represented per-token. With this hardware-friendly design, KIVI enables models including Llama-2, Falcon, and Mistral to maintain comparable generation quality while reducing combined peak memory (model weights and KV cache) by 2.6×. This enables up to 4× larger batch sizes and increases throughput by 2.35× to 3.47× on real inference workloads. The 2.6× figure covers model weights and KV cache together: at 2-bit precision the KV cache reduction alone is more aggressive, and it is this reduction that drives the batch size scaling.
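The asymmetric layout can be sketched as two calls to the same uniform quantizer with different reduction axes. This is a simplified illustration: the real method adds grouping, zero-point handling, and a small full-precision residual window, all omitted here:

```python
import numpy as np

# Hedged sketch of KIVI's asymmetric layout: 2-bit uniform quantization of
# the key cache per-channel (stats over axis=0, tokens) and the value cache
# per-token (stats over axis=1, channels). Simplified vs. the real method.
def quant_2bit(x, axis):
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / 3.0  # 2 bits -> levels 0..3
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequant(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 64))              # [tokens, channels]
values = rng.normal(size=(8, 64))
k_q, k_s, k_lo = quant_2bit(keys, axis=0)    # per-channel scales for keys
v_q, v_s, v_lo = quant_2bit(values, axis=1)  # per-token scales for values
# rounding error is bounded by half a quantization step in every channel
print(bool(np.all(np.abs(dequant(k_q, k_s, k_lo) - keys) <= k_s / 2 + 1e-9)))
```

Storing `q` as 2-bit codes plus one `scale`/`lo` pair per channel (or per token) is where the memory saving comes from.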
KVQuant (Calibrated Mixed-Precision Quantization)
While KIVI applies a fixed asymmetric scheme, KVQuant takes a calibrated, multi-component approach to low-bit KV cache quantization. It combines per-channel key quantization, pre-RoPE key quantization (which avoids quantizing keys after positional embeddings have distorted the distribution), sensitivity-weighted non-uniform quantization that derives quantization levels from calibration data rather than fixed grids, and a dense-and-sparse decomposition that handles extreme outlier values separately from the bulk distribution.
This combination allows KVQuant to push quantization to very low bit widths, including sub-4-bit, with better accuracy than fixed-precision schemes, targeting deployments that must support extremely long contexts (the paper evaluates up to 10 million tokens of context). For production systems with stable workloads, the calibration cost is amortized across inference runs.
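The dense-and-sparse idea is the easiest component to sketch: entries beyond a percentile threshold are stored exactly in a sparse map, and only the remaining bulk is handed to the low-bit quantizer. The 99th-percentile threshold below is an illustrative choice:

```python
import numpy as np

# Hedged sketch of a dense-and-sparse decomposition in the spirit of
# KVQuant: extreme outliers are kept exactly in a sparse structure and
# only the bulk distribution is quantized. Threshold is illustrative.
def dense_sparse_split(x, pct=99.0):
    thresh = np.percentile(np.abs(x), pct)
    mask = np.abs(x) > thresh
    sparse = {(int(r), int(c)): float(x[r, c]) for r, c in np.argwhere(mask)}
    dense = np.where(mask, 0.0, x)  # outlier-free bulk, safe to quantize low-bit
    return dense, sparse

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[0, 0] = 50.0                      # inject an extreme outlier
dense, sparse = dense_sparse_split(x)
print(sparse)        # only the outlier is stored exactly
print(dense[0, 0])   # 0.0: the bulk no longer contains the outlier
```

Without the split, the single outlier would stretch the quantization range and destroy resolution for every other value in the tensor.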
TurboQuant (Near-Optimal Online KV Cache Quantization)
TurboQuant is Google Research's latest contribution in this space, accepted at ICLR 2026. It targets a known weakness in all prior quantization methods: MSE-optimal scalar quantizers introduce systematic bias in inner product estimation, which compounds across attention computations. TurboQuant addresses this through a two-stage pipeline.
The first stage, PolarQuant (AISTATS 2026), applies a random orthogonal rotation to each key and value vector before quantization. This rotation redistributes variance uniformly across all coordinates without altering the mathematical content, so that each coordinate can be quantized accurately with a simple, analytically computed scalar quantizer. No training or calibration is required. The second stage applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) correction to the quantization residual, which produces an unbiased inner product estimator. Together, the two stages achieve at least 6× memory reduction and up to 8× faster attention computation on NVIDIA H100 GPUs at 3-bit precision, operating within a factor of roughly 2.7 of the information-theoretic limit. Because TurboQuant uses random matrices rather than learned ones, it applies to any model at inference time with no offline preparation.
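The two properties that make the rotation trick work, preservation of inner products and spreading of energy, are easy to verify numerically. The QR construction below is an illustrative stand-in for the paper's exact rotation procedure:

```python
import numpy as np

# Hedged sketch of the rotation idea behind PolarQuant: a random orthogonal
# matrix spreads a vector's energy across coordinates while preserving all
# inner products, so a simple per-coordinate scalar quantizer becomes
# effective. QR of a Gaussian matrix is an illustrative construction.
rng = np.random.default_rng(0)
d = 64
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal rotation

v = np.zeros(d)
v[0] = 10.0                 # adversarial input: all energy in one coordinate
rv = Q @ v                  # rotated copy

u = rng.normal(size=d)
print(bool(np.isclose(u @ v, (Q @ u) @ rv)))     # inner products preserved
print(bool(np.abs(rv).max() < np.abs(v).max()))  # peak magnitude shrinks
```

After rotation the worst-case spike is gone, so a scalar quantizer with one analytically chosen scale handles every coordinate; attention scores are unaffected because the rotation cancels in every query-key inner product.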
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
MQA and GQA are architectural modifications that reduce the KV cache by design rather than compressing an existing one. In MQA, all query heads share a single key and value head, dramatically reducing cache size. GQA groups multiple query heads to share a smaller set of key-value heads, offering a middle ground between full multi-head attention and MQA. Both require either training from scratch or fine-tuning; without proper training, applying them to pre-trained models usually results in degraded performance.
GQA has since become the de facto standard in modern open-weight LLMs. In Llama 2, only the 70B model used GQA; the 7B and 13B variants used standard multi-head attention. Llama 3 extended GQA across both the 8B and 70B sizes. Mistral applied GQA from its initial 7B release in September 2023. For practitioners selecting or deploying new model families, GQA is now a baseline expectation rather than an optional optimization.
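The cache savings follow directly from the KV-head count. The dimensions below (64 query heads, 8 KV heads, head dimension 128, fp16) are assumed, Llama-2-70B-style values used for illustration:

```python
# KV cache bytes per token, per layer, as a function of KV-head count.
# 64 query heads / 8 KV heads / head_dim 128 and fp16 are assumed,
# Llama-2-70B-style values for illustration.
def kv_bytes_per_token_layer(n_kv_heads, head_dim=128, bytes_per_elem=2):
    return 2 * n_kv_heads * head_dim * bytes_per_elem  # one K and one V entry

mha = kv_bytes_per_token_layer(n_kv_heads=64)  # full multi-head attention
gqa = kv_bytes_per_token_layer(n_kv_heads=8)   # grouped-query attention
mqa = kv_bytes_per_token_layer(n_kv_heads=1)   # multi-query attention
print(mha // gqa, mha // mqa)  # 8 64: GQA shrinks the cache 8x, MQA 64x
```

Because the query heads are untouched, model quality degrades far less than the cache-size ratio suggests, which is why the 8× point (GQA) became the standard compromise.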
Multi-Head Latent Attention (MLA): DeepSeek
MLA is DeepSeek's architectural solution to KV cache memory, first introduced in DeepSeek-V2 (May 2024) and carried forward in DeepSeek-V3 and DeepSeek-R1. It is an attention mechanism equipped with low-rank key-value joint compression. Rather than storing full-dimensional key and value tensors per token, MLA projects them into a compressed latent vector during inference, storing the latent representation instead.
The results are the most dramatic of any technique on this list. Compared to DeepSeek's prior 67B dense model, DeepSeek-V2 with MLA reduces the KV cache by 93.3% while achieving performance superior to standard multi-head attention. This is not a marginal improvement: it fundamentally changes the memory economics of serving large models, enabling significantly longer context windows and larger batch sizes on the same hardware. Research has also shown that MLA consistently offers greater expressive power than GQA under the same KV cache budget, providing a theoretical basis for the empirical gains. Among architectural approaches, MLA is currently the most validated at scale in open-weight models.
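The core mechanism, caching one small latent vector per token and reconstructing keys and values from it at attention time, can be sketched with random stand-in matrices. All dimensions and weights below are illustrative assumptions, not DeepSeek's trained parameters:

```python
import numpy as np

# Hedged sketch of MLA's low-rank joint KV compression: cache one latent
# vector per token instead of full keys and values, reconstructing both
# via up-projections when attention is computed. Dimensions and matrices
# are illustrative random stand-ins.
rng = np.random.default_rng(0)
d_model, d_latent, n_tokens = 1024, 64, 16

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

h = rng.normal(size=(n_tokens, d_model))   # hidden states
latent = h @ W_down                        # [n_tokens, d_latent] -> cached
k, v = latent @ W_up_k, latent @ W_up_v    # reconstructed on the fly

full_cache = 2 * n_tokens * d_model        # elements for separate K and V
mla_cache = n_tokens * d_latent            # elements for the latent cache
print(f"cache reduction: {full_cache / mla_cache:.0f}x")
```

In practice the up-projections can be folded into the attention matrices, so the reconstruction adds little runtime cost while the cache shrinks by the full-to-latent dimension ratio.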
Low-Rank KV Cache Compression (Palu / LoRC)
Low-rank compression targets the hidden dimension of KV tensors rather than the sequence length or bit width. Palu is a post-training KV cache compression framework that reduces cache size through low-rank projection of key and value weight matrices. It proposes a medium-grained, group-head low-rank decomposition that balances accuracy and reconstruction overhead, and uses an efficient rank search algorithm based on Fisher information to automatically assign larger ranks to more sensitive weight matrices and smaller ranks to less important ones.
Related methods in this family include LoRC, SVDq, CSKV, and ReCalKV, all of which exploit the observation that key and value matrices across attention heads exhibit significant low-rank structure, particularly for longer contexts. Low-rank methods are orthogonal to both quantization and token eviction and can be stacked with either for compounded compression. This family remains relatively underexplored compared to eviction-based methods, making it an active area of research.
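The underlying mechanic, truncated SVD of a projection matrix so the cache stores a narrow intermediate product, can be shown on a synthetic matrix with a decaying spectrum. The matrix, spectrum, and rank below are illustrative constructions, not Palu's actual decomposition or rank search:

```python
import numpy as np

# Hedged sketch of the low-rank idea behind Palu-style methods: factor a
# projection matrix with a truncated SVD so only a rank-sized intermediate
# needs caching. The synthetic matrix and fixed rank are illustrative.
rng = np.random.default_rng(0)
d, rank = 256, 32
# synthetic projection matrix with a fast-decaying spectrum (near low-rank)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
S = np.diag(1.0 / (1 + np.arange(d)) ** 2)
W_v = U @ S @ U.T

u, s, vt = np.linalg.svd(W_v)
A = u[:, :rank] * s[:rank]        # [d, rank] down-projection factor
B = vt[:rank]                     # [rank, d] up-projection factor
err = np.linalg.norm(W_v - A @ B) / np.linalg.norm(W_v)
print(bool(err < 0.01))  # small relative error at 8x fewer cached dims
```

When real key/value matrices are close to low-rank, as the papers above report, storing the rank-sized product in the cache and applying `B` at attention time trades a small reconstruction error for an 8× (here) reduction in the cached hidden dimension.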
Key Takeaways:
- KV cache growth is proportional to both sequence length and batch size, making compression essential for high-throughput serving.
- Token eviction (H2O, StreamingLLM, SnapKV) is training-free and hardware-compatible but discards tokens permanently; SnapKV selects clustered important KV positions per head via pooled attention scores, not flat cumulative scores.
- Quantization (KIVI, KVQuant, TurboQuant) reduces memory without removing tokens. KIVI achieves a 2.6× combined peak memory reduction (model weights + KV cache) at 2-bit precision; TurboQuant achieves 6× memory reduction at 3-bit precision with no calibration, operating near the information-theoretic limit.
- Low-rank methods (Palu, LoRC, MLA) target hidden-dimension redundancy and remain underexplored relative to token eviction.
- Architectural solutions (GQA, MLA) must be incorporated at training time. In Llama 2, only the 70B model used GQA; Llama 3 extended it across all sizes. MLA achieves a 93.3% KV cache reduction in DeepSeek-V2.
- The 2026 research frontier is shifting toward latent-space compaction (Attention Matching, 50× compaction) and reasoning-aware compression (TriAttention, 10.7× memory reduction on AIME25 at matched accuracy).
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
