KV Cache Quantization

16 days ago

Beyond TurboQuant: a true 2-bit KV quantization algorithm for long-context inference arrives

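The author, Zhongzhu Zhou, is a Senior Research Scientist at TogetherAI with a PhD from the University of Sydney, whose research focuses on efficient machine learning systems, covering the co-design of training and inference algorithms with systems, as well as LLM compression and quantization. The team members are all from TogetherAI, the University of Sydney, and the University of Illinois Urbana-Champaign. Together AI was founded in June 2022 by former Apple executive Vipul Ved Pra ...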

Tencent News (腾讯网)

The evolution of KV cache management architectures: from contiguous allocation to a unified hybrid memory architecture

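Anyone who has deployed an LLM in production knows that model weights are only half the problem. The other half is the KV cache: the runtime memory that stores attention state so the model does not have to recompute everything from scratch when generating each token. How well this memory is managed determines whether a system is a stuttering demo or a usable inference service. This article traces the 5 eras of KV cache management ...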

Tencent News (腾讯网)

Google releases a compression algorithm: memory cut 6x, speed up 8x, zero accuracy loss

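Google Research published a blog post yesterday introducing a compression algorithm called TurboQuant, which will be formally presented at ICLR 2026 next month. In one sentence: it compresses the KV cache of large models to 3 bits, cutting memory usage by 6x and speeding up inference by 8x, with zero accuracy loss. Zero. Not "close to zero", not "negligible"; across all benchmarks ...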

Morning Overview on MSN

Google unveiled TurboQuant, a method that cuts the memory bottleneck slowing large AI models

Companies running large language models face a persistent bottleneck: the memory consumed by key-value caches during ...

红板报 (Flipboard China) on MSN

KV cache no longer has to be kept wholesale! Baidu & Fudan use "return on investment" to reallocate ...

80% KV cache compression with a performance loss of only 0.52% ...

Kuai Technology (快科技)

Google's new paper crashes memory stock prices! KV cache compressed 6x

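2026-03-26 23:31:06. Source: 量子位 (QbitAI). Author: 梦晨. Editor: 若风. Shares of two memory-chip giants plunged, with no earnings blowup and no supply-chain disruption; Google had simply shown a paper that will formally debut at ICLR 2026. Google Research introduced the TurboQuant compression algorithm, which compresses the KV cache, the most memory-hungry part of AI inference ...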

heise online

TurboQuant: Google aims to curb the memory hunger of large LLMs

Google's TurboQuant reduces the KV cache of large language models to 3 bits. Accuracy is said to be preserved while speed increases severalfold. Google Research has published new technical details about its compression ...

Hackaday

vector quantization

Large language models (LLMs) aren’t actually giant computer brains. Instead, they are massive vector spaces in which the probabilities of tokens occurring in a specific order are encoded. Billions of ...
