KV-Cache Quantization for LLM Inference

Research papers on quantizing key-value caches to reduce memory footprint and extend context length

45 papers · 3 topics · Generated March 29, 2026

KV Cache Quantization Methods

15 papers

Core algorithms for quantizing key-value caches — vector quantization, non-uniform quantization, and mixed-precision approaches that reduce memory with minimal accuracy loss
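
As a concrete starting point for this family, the sketch below shows the simplest member: per-channel asymmetric round-to-nearest quantization of a cached key/value tensor to 4-bit codes, with the scale and zero point kept around for dequantization. It is a minimal illustration in numpy; the function names are ours, not drawn from any paper in this list.

    import numpy as np

    def quantize_kv_4bit(x, axis=0):
        # Per-channel asymmetric round-to-nearest quantization to 4 bits.
        # x: float KV tensor of shape (num_tokens, head_dim); axis=0 gives
        # each channel of the head dimension its own scale and zero point.
        lo = x.min(axis=axis, keepdims=True)
        hi = x.max(axis=axis, keepdims=True)
        scale = np.where(hi > lo, (hi - lo) / 15.0, 1.0)  # 16 levels: 0..15
        q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo  # scale/offset must be stored alongside the codes

    def dequantize_kv_4bit(q, scale, lo):
        # Reconstruct approximate float values from the 4-bit codes.
        return q.astype(np.float32) * scale + lo

Real systems additionally pack two 4-bit codes per byte; whether keys and values should be quantized per channel or per token is one of the design axes the papers in this topic explore.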

Xiaodan Zhu et al. · 2020 · 1422 cites · OpenAlex

Only eighteen months ago, I joined Leo, Horacio, Mónica, and Nuria on a tour of the delights that the Barcelona venue would be offering us in September 2020. Together with Chengqing, we made great plan...

Amir Gholami et al. · 2022 · 980 cites · OpenAlex

This chapter provides approaches to the problem of quantizing the numerical values in deep neural network computations, covering the advantages/disadvantages of current methods. Over the past decade, ...

Albert Gu, Tri Dao · 2023 · 967 cites · OpenAlex

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time a...

Stefania Sardellitti, Gesualdo Scutari, Sergio Barbarossa · 2015 · 908 cites · OpenAlex

Migrating computational intensive tasks from mobile devices to more resourceful cloud servers is a promising technique to increase the computational capacity of mobile devices while saving their batte...

M. G. Aartsen et al. · 2017 · 795 cites · OpenAlex

The IceCube Neutrino Observatory is a cubic-kilometer-scale high-energy neutrino detector built into the ice at the South Pole. Construction of IceCube, the largest neutrino detector built to date, wa...

Adam Leach · 2021 · 581 cites · OpenAlex

Deep generative models are a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of ...

Noah Hollmann et al. · 2025 · 450 cites · OpenAlex

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science [1,2]. The fundamental ...

Saleh Ashkboos et al. · 2024 · 400 cites · Semantic Scholar

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way t...
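
The intuition behind rotation-based schemes can be shown in a few lines: multiplying keys and queries by the same orthogonal matrix leaves every attention logit unchanged (since R Rᵀ = I) while smearing outlier channels across all dimensions, shrinking the per-channel dynamic range a low-bit quantizer must cover. The toy sketch below uses a random orthogonal matrix from a QR decomposition; QuaRot itself uses Hadamard transforms, so treat this as an illustration of the invariance rather than the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 128
    # Random orthogonal rotation (illustrative stand-in for a Hadamard transform).
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))

    keys = rng.normal(size=(1024, d))
    keys[:, 7] *= 50.0                      # inject one outlier channel
    queries = rng.normal(size=(64, d))

    rk, rq = keys @ R, queries @ R          # rotate both sides identically
    # Attention logits K Q^T are preserved exactly (up to float error) ...
    assert np.allclose(keys @ queries.T, rk @ rq.T)
    # ... while the outlier's energy is spread over all 128 channels:
    print(np.abs(keys).max(axis=0)[7], np.abs(rk).max(axis=0).max())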

Humza Naveed et al. · 2023 · 351 cites · OpenAlex

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contribution...

Wei Fang et al. · 2023 · 322 cites · OpenAlex

Spiking neural networks (SNNs) aim to realize brain-inspired intelligence on neuromorphic chips with high energy efficiency by introducing neural dynamics and spike properties. As the emerging spiking...

Md Maruf Hossain Shuvo et al. · 2022 · 300 cites · OpenAlex

Successful integration of deep neural networks (DNNs) or deep learning (DL) has resulted in breakthroughs in many areas. However, deploying these highly accurate models for data-driven, learned, autom...

Qiao Lan et al. · 2021 · 224 cites · OpenAlex

In the 1940s, Claude Shannon developed information theory, focusing on quantifying the maximum data rate that can be supported by a communication channel. Guided by this fundamental work, the main ...

DeepSeek-AI et al. · 2024 · 206 cites · OpenAlex

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepS...
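
For readers unfamiliar with the 671B-total / 37B-active arithmetic, the generic top-k gating step below shows how a token can touch only a handful of experts, so only a small slice of the total parameters participates in each forward pass. This is a textbook MoE routing sketch, not DeepSeek-V3's actual (considerably more elaborate) routing scheme.

    import numpy as np

    def moe_topk_route(x, gate_w, k=8):
        # Generic top-k expert routing (illustrative). Each token is sent
        # to only k of n_experts, so only a fraction of the model's total
        # parameters is active for that token.
        logits = x @ gate_w                          # (tokens, n_experts)
        topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of chosen experts
        w = np.take_along_axis(logits, topk, axis=-1)
        w = np.exp(w - w.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax over the top-k
        return topk, w                               # which experts, with what weight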

Long-Context KV Compression

15 papers

Techniques enabling million-token context lengths by compressing or evicting KV entries — sliding-window hybrids, adaptive precision, and entropy coding
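
In its simplest form, the sliding-window hybrid named above reduces to a tiny eviction rule: keep a few initial "sink" tokens plus the most recent window, and drop the middle of the sequence. A minimal sketch, assuming the cache is stored as a list of per-token (key, value) pairs:

    def evict_sliding_window(cache, n_sink=4, window=1024):
        # Keep the first n_sink tokens (attention sinks) and the most
        # recent `window` tokens; evict everything in between.
        # cache: list of (key, value) pairs, oldest token first.
        if len(cache) <= n_sink + window:
            return cache
        return cache[:n_sink] + cache[-window:]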

Tianqi Chen, Carlos Guestrin · 2016 · 45396 cites · OpenAlex

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientis...

Humza Naveed et al. · 2023 · 351 cites · OpenAlex

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contribution...

Preeti Ranjan Panda, Nikil Dutt, Alexandru Nicolau · 2000 · 224 cites · OpenAlex

Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on processor cores. In addition to a data cache that interfaces with slower off-chip m...

DeepSeek-AI et al. · 2024 · 206 cites · OpenAlex

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepS...

Siying Dong et al. · 2021 · 174 cites · OpenAlex

This article is an eight-year retrospective on development priorities for RocksDB, a key-value store developed at Facebook that targets large-scale distributed systems and that is optimized for Solid ...

Soren Kejser Jensen, Torben Bach Pedersen, Christian Thomsen · 2017 · 152 cites · OpenAlex

The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to en...

Ebtesam Almazrouei et al. · 2023 · 112 cites · OpenAlex

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B,...

Shulin Zeng et al. · 2024 · 102 cites · OpenAlex

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techni...

Lei Kou et al. · 2022 · 100 cites · OpenAlex

In recent years, with the development of wind energy, the number and scale of wind farms have grown rapidly. Since offshore wind farms offer the advantages of stable wind speeds, clean energy, ...

DeepSeek-AI et al. · 2024 · 97 cites · OpenAlex

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated fo...

Juncheng Yang, Yao Yue, K. V. Rashmi · 2021 · 92 cites · OpenAlex

Modern web services use in-memory caching extensively to increase throughput and reduce latency. There have been several workload analyses of production systems that have fueled research in improving ...

Zeyu Han et al. · 2024 · 90 cites · OpenAlex

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant com...

Abutalib Aghayev et al. · 2019 · 89 cites · OpenAlex

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file syste...

Yixin Song et al. · 2024 · 77 cites · OpenAlex

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the desig...

Semantic & Structured KV Pruning

15 papers

Methods that leverage semantic structure — chunk-level compression, attention-aware token eviction, and PCA-based decorrelation for aggressive yet accurate cache reduction
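
Attention-aware token eviction, one of the techniques named above, can be sketched as a score-and-keep step: accumulate the attention mass each cached token has received over past decode steps, then retain only the highest-scoring entries. The names and shapes below are illustrative assumptions, not any specific paper's interface.

    import numpy as np

    def evict_by_attention(keys, values, attn_history, keep=256):
        # keys, values: (num_tokens, head_dim) cached tensors.
        # attn_history: (decode_steps, num_tokens) attention weights that
        # past queries assigned to each cached token.
        scores = attn_history.sum(axis=0)            # accumulated mass per token
        if len(scores) <= keep:
            return keys, values, np.arange(len(scores))
        idx = np.sort(np.argsort(scores)[-keep:])    # top-`keep`, original order
        return keys[idx], values[idx], idx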

Ebtesam Almazrouei et al. · 2023 · 112 cites · OpenAlex

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B,...

Zeyu Han et al. · 2024 · 90 cites · OpenAlex

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant com...

Zhisheng Ye et al. · 2023 · 73 cites · OpenAlex

Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU acceler...

Keivan Alizadeh et al. · 2024 · 73 cites · OpenAlex

Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. Proceedings of the 62nd Annual Meeting of the Association f...

Yang An et al. · 2024 · 43 cites · OpenAlex

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language m...

Guangji Bai et al. · 2024 · 38 cites · OpenAlex

The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however,...

Mengwei Xu et al. · 2024 · 32 cites · OpenAlex

Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from...

Yuanchun Li et al. · 2024 · 30 cites · OpenAlex

Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users effici...

Zhenyu Zhang et al. · 2023 · 27 cites · OpenAlex

Hyde-IKV is a dynamic management system for the Key-Value (KV) cache in Small Language Models (SLMs). It tackles the memory bottleneck of long-context inference by intelligently prioritizing and compr...

Zhongwei Wan et al. · 2023 · 23 cites · OpenAlex

Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding and language generation, and thus have the potential to make a substant...

Zixuan Zhou et al. · 2024 · 21 cites · OpenAlex

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inferenc...

Chenggang Zhao et al. · 2025 · 20 cites · OpenAlex

The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconn...

Xindi Wang et al. · 2024 · 18 cites · OpenAlex

In Multimodal Sentiment Analysis (MSA), data noise arising from various sources introduces aleatoric uncertainty (AU), significantly impacting model performance. Current efforts to add...

Xupeng Miao et al. · 2023 · 18 cites · OpenAlex

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computati...