KV-Cache Quantization for LLM Inference

Research papers on quantizing key-value caches to reduce memory footprint and extend context length

45 papers · 3 topics · Generated March 29, 2026

KV Cache Quantization Methods

15 papers

Core algorithms for quantizing key-value caches — vector quantization, non-uniform quantization, and mixed-precision approaches that reduce memory with minimal accuracy loss
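
As a concrete starting point for this family, the sketch below shows the simplest member: per-channel asymmetric round-to-nearest quantization of a cached key/value tensor to 4-bit codes, with the scale and zero point kept around for dequantization. It is a minimal illustration in numpy; the function names are ours, not drawn from any paper in this list.

    import numpy as np

    def quantize_kv_4bit(x, axis=0):
        # Per-channel asymmetric round-to-nearest quantization to 4 bits.
        # x: float KV tensor of shape (num_tokens, head_dim); axis=0 gives
        # each channel of the head dimension its own scale and zero point.
        lo = x.min(axis=axis, keepdims=True)
        hi = x.max(axis=axis, keepdims=True)
        scale = np.where(hi > lo, (hi - lo) / 15.0, 1.0)  # 16 levels: 0..15
        q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo  # scale/offset must be stored alongside the codes

    def dequantize_kv_4bit(q, scale, lo):
        # Reconstruct approximate float values from the 4-bit codes.
        return q.astype(np.float32) * scale + lo

Real systems additionally pack two 4-bit codes per byte; whether keys and values should be quantized per channel or per token is one of the design axes the papers in this topic explore.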

Xiaodan Zhu et al. · 2020 · 1422 cites · OpenAlex

Only eighteen months ago, I joined Leo, Horacio, Mónica, and Nuria on a tour of the delights that the Barcelona venue would be offering us in September 2020. Together with Chengqing, we made great plan...

Amir Gholami et al. · 2022 · 980 cites · OpenAlex

This chapter provides approaches to the problem of quantizing the numerical values in deep neural network computations, covering the advantages/disadvantages of current methods. Over the past decade, ...

Albert Gu, Tri Dao · 2023 · 967 cites · OpenAlex

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time a...

Stefania Sardellitti, Gesualdo Scutari, Sergio Barbarossa · 2015 · 908 cites · OpenAlex

Migrating computational intensive tasks from mobile devices to more resourceful cloud servers is a promising technique to increase the computational capacity of mobile devices while saving their batte...

M. G. Aartsen et al. · 2017 · 795 cites · OpenAlex

The IceCube Neutrino Observatory is a cubic-kilometer-scale high-energy neutrino detector built into the ice at the South Pole. Construction of IceCube, the largest neutrino detector built to date, wa...

Adam Leach · 2021 · 581 cites · OpenAlex

Deep generative models are a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of ...

Noah Hollmann et al. · 2025 · 450 cites · OpenAlex

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science [1,2]. The fundamental ...

Saleh Ashkboos et al. · 2024 · 400 cites · Semantic Scholar

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way t...
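
The intuition behind rotation-based schemes can be shown in a few lines: multiplying keys and queries by the same orthogonal matrix leaves every attention logit unchanged (since R Rᵀ = I) while smearing outlier channels across all dimensions, shrinking the per-channel dynamic range a low-bit quantizer must cover. The toy sketch below uses a random orthogonal matrix from a QR decomposition; QuaRot itself uses Hadamard transforms, so treat this as an illustration of the invariance rather than the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 128
    # Random orthogonal rotation (illustrative stand-in for a Hadamard transform).
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))

    keys = rng.normal(size=(1024, d))
    keys[:, 7] *= 50.0                      # inject one outlier channel
    queries = rng.normal(size=(64, d))

    rk, rq = keys @ R, queries @ R          # rotate both sides identically
    # Attention logits K Q^T are preserved exactly (up to float error) ...
    assert np.allclose(keys @ queries.T, rk @ rq.T)
    # ... while the outlier's energy is spread over all 128 channels:
    print(np.abs(keys).max(axis=0)[7], np.abs(rk).max(axis=0).max())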

Humza Naveed et al. · 2023 · 351 cites · OpenAlex

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contribution...

Wei Fang et al. · 2023 · 322 cites · OpenAlex

Spiking neural networks (SNNs) aim to realize brain-inspired intelligence on neuromorphic chips with high energy efficiency by introducing neural dynamics and spike properties. As the emerging spiking...

Md Maruf Hossain Shuvo et al. · 2022 · 300 cites · OpenAlex

Successful integration of deep neural networks (DNNs) or deep learning (DL) has resulted in breakthroughs in many areas. However, deploying these highly accurate models for data-driven, learned, autom...

Qiao Lan et al. · 2021 · 224 cites · OpenAlex

In the 1940s, Claude Shannon developed information theory, focusing on quantifying the maximum data rate that can be supported by a communication channel. Guided by this fundamental work, the main ...

DeepSeek-AI et al. · 2024 · 206 cites · OpenAlex

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepS...
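
For readers unfamiliar with the 671B-total / 37B-active arithmetic, the generic top-k gating step below shows how a token can touch only a handful of experts, so only a small slice of the total parameters participates in each forward pass. This is a textbook MoE routing sketch, not DeepSeek-V3's actual (considerably more elaborate) routing scheme.

    import numpy as np

    def moe_topk_route(x, gate_w, k=8):
        # Generic top-k expert routing (illustrative). Each token is sent
        # to only k of n_experts, so only a fraction of the model's total
        # parameters is active for that token.
        logits = x @ gate_w                          # (tokens, n_experts)
        topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of chosen experts
        w = np.take_along_axis(logits, topk, axis=-1)
        w = np.exp(w - w.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax over the top-k
        return topk, w                               # which experts, with what weight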

Long-Context KV Compression

15 papers

Techniques enabling million-token context lengths by compressing or evicting KV entries — sliding-window hybrids, adaptive precision, and entropy coding
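
In its simplest form, the sliding-window hybrid named above reduces to a tiny eviction rule: keep a few initial "sink" tokens plus the most recent window, and drop the middle of the sequence. A minimal sketch, assuming the cache is stored as a list of per-token (key, value) pairs:

    def evict_sliding_window(cache, n_sink=4, window=1024):
        # Keep the first n_sink tokens (attention sinks) and the most
        # recent `window` tokens; evict everything in between.
        # cache: list of (key, value) pairs, oldest token first.
        if len(cache) <= n_sink + window:
            return cache
        return cache[:n_sink] + cache[-window:]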

Tianqi Chen, Carlos Guestrin · 2016 · 45396 cites · OpenAlex

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientis...

Humza Naveed et al. · 2023 · 351 cites · OpenAlex

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contribution...

Preeti Ranjan Panda, Nikil Dutt, Alexandru Nicolau · 2000 · 224 cites · OpenAlex

Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on processor cores. In addition to a data cache that interfaces with slower off-chip m...

DeepSeek-AI et al. · 2024 · 206 cites · OpenAlex

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepS...

Siying Dong et al. · 2021 · 174 cites · OpenAlex

This article is an eight-year retrospective on development priorities for RocksDB, a key-value store developed at Facebook that targets large-scale distributed systems and that is optimized for Solid ...

Soren Kejser Jensen, Torben Bach Pedersen, Christian Thomsen · 2017 · 152 cites · OpenAlex

The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to en...

Ebtesam Almazrouei et al. · 2023 · 112 cites · OpenAlex

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B,...

Shulin Zeng et al. · 2024 · 102 cites · OpenAlex

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techni...

Lei Kou et al. · 2022 · 100 cites · OpenAlex

In recent years, with the development of wind energy, the number and scale of wind farms have grown rapidly. Since offshore wind farms offer the advantages of stable wind speeds, clean energy, ...

DeepSeek-AI et al. · 2024 · 97 cites · OpenAlex

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated fo...

Juncheng Yang, Yao Yue, K. V. Rashmi · 2021 · 92 cites · OpenAlex

Modern web services use in-memory caching extensively to increase throughput and reduce latency. There have been several workload analyses of production systems that have fueled research in improving ...

Zeyu Han et al. · 2024 · 90 cites · OpenAlex

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant com...

Abutalib Aghayev et al. · 2019 · 89 cites · OpenAlex

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file syste...

Yixin Song et al. · 2024 · 77 cites · OpenAlex

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the desig...

Semantic & Structured KV Pruning

15 papers

Methods that leverage semantic structure — chunk-level compression, attention-aware token eviction, and PCA-based decorrelation for aggressive yet accurate cache reduction
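
Attention-aware token eviction, one of the techniques named above, can be sketched as a score-and-keep step: accumulate the attention mass each cached token has received over past decode steps, then retain only the highest-scoring entries. The names and shapes below are illustrative assumptions, not any specific paper's interface.

    import numpy as np

    def evict_by_attention(keys, values, attn_history, keep=256):
        # keys, values: (num_tokens, head_dim) cached tensors.
        # attn_history: (decode_steps, num_tokens) attention weights that
        # past queries assigned to each cached token.
        scores = attn_history.sum(axis=0)            # accumulated mass per token
        if len(scores) <= keep:
            return keys, values, np.arange(len(scores))
        idx = np.sort(np.argsort(scores)[-keep:])    # top-`keep`, original order
        return keys[idx], values[idx], idx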

Ebtesam Almazrouei et al. · 2023 · 112 cites · OpenAlex

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B,...

Zeyu Han et al. · 2024 · 90 cites · OpenAlex

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant com...

Zhisheng Ye et al. · 2023 · 73 cites · OpenAlex

Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU acceler...

Keivan Alizadeh et al. · 2024 · 73 cites · OpenAlex

Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. Proceedings of the 62nd Annual Meeting of the Association f...

Yang An et al. · 2024 · 43 cites · OpenAlex

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language m...

Guangji Bai et al. · 2024 · 38 cites · OpenAlex

The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however,...

Mengwei Xu et al. · 2024 · 32 cites · OpenAlex

Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from...

Yuanchun Li et al. · 2024 · 30 cites · OpenAlex

Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users effici...

Zhenyu Zhang et al. · 2023 · 27 cites · OpenAlex

Hyde-IKV is a dynamic management system for the Key-Value (KV) cache in Small Language Models (SLMs). It tackles the memory bottleneck of long-context inference by intelligently prioritizing and compr...

Zhongwei Wan et al. · 2023 · 23 cites · OpenAlex

Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding and language generation, and thus have the potential to make a substant...

Zixuan Zhou et al. · 2024 · 21 cites · OpenAlex

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inferenc...

Chenggang Zhao et al. · 2025 · 20 cites · OpenAlex

The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconn...

Xindi Wang et al. · 2024 · 18 cites · OpenAlex

In Multimodal Sentiment Analysis (MSA), data noise arising from various sources introduces aleatoric uncertainty (AU), significantly impacting model performance. Current efforts to add...

Xupeng Miao et al. · 2023 · 18 cites · OpenAlex

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computati...