TurboQuant: KV Cache Compression
Status: Planned
Source: src/bit_axon/quantization/turboquant.py (stub)
Abstract
TurboQuant is a planned KV cache compression technique for reducing the memory footprint of long-context inference in Bit-Axon. As context lengths grow to the target 64K tokens, the KV cache for the 8 sliding-window attention layers becomes the dominant memory consumer. TurboQuant aims to compress cached key and value tensors to lower precision with minimal quality loss, enabling the full 64K context to fit within the memory budget of a 16 GB MacBook Air.
Planned Feature
TurboQuant is referenced from ICLR 2026 submissions and is not yet implemented. The source file currently contains a stub. The details below describe the planned design.
Key Contributions (Planned)
- KV cache quantization — Compress cached \(\mathbf{K}\) and \(\mathbf{V}\) tensors from FP16 to 4-bit representations.
- Integration with SWA layers — Applied selectively to the 8 sliding-window attention layers (Zone 2) where KV caches are maintained.
- Memory target — Reduce total inference memory for 64K context from ~2,900 MB to under 2,500 MB.
Mathematical Foundations
KV Cache Memory Model
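For sliding-window attention with window size \(W\), batch size \(B\), model dimension \(d_{\text{model}}\), and \(b\) bytes per element, the KV cache per layer requires:

\[
M_{\text{KV}} = 2 \cdot B \cdot W \cdot d_{\text{model}} \cdot b \ \text{bytes}
\]
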
where the factor 2 accounts for separate \(\mathbf{K}\) and \(\mathbf{V}\) tensors. For Bit-Axon's Zone 2 layers:
- \(W = 4096\), \(d_{\text{model}} = 2560\), \(B = 1\) (single batch)
- 8 layers with KV caches
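- FP16 (2 bytes per element):

\[
M_{\text{KV, total}} = 8 \times 2 \times 1 \times 4096 \times 2560 \times 2 = 335{,}544{,}320 \ \text{bytes} \approx 335.5 \ \text{MB}
\]
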
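The arithmetic for the Zone 2 configuration can be checked with a few lines of Python. This is a minimal sketch: the helper name `kv_cache_bytes` is made up for illustration and is not part of the Bit-Axon code base, and MB is taken as 10^6 bytes, which is the convention that reproduces the 335.5 MB figure.

```python
def kv_cache_bytes(window: int, d_model: int, batch: int = 1,
                   bytes_per_element: float = 2.0, layers: int = 1) -> float:
    """KV cache size in bytes: 2 tensors (K and V) x batch x window x d_model."""
    return 2 * batch * window * d_model * bytes_per_element * layers


# Zone 2 values from the list above: W = 4096, d_model = 2560, B = 1, FP16.
per_layer = kv_cache_bytes(window=4096, d_model=2560)
total = kv_cache_bytes(window=4096, d_model=2560, layers=8)

print(f"per layer: {per_layer / 1e6:.1f} MB")  # 41.9 MB
print(f"8 layers:  {total / 1e6:.1f} MB")      # 335.5 MB
```

Because the window is fixed at \(W = 4096\), this total does not change with the prompt length once the window is full.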
Quantized KV Cache
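TurboQuant targets 4-bit quantization of the KV cache. Relative to FP16 (16 bits per element), the compression ratio is:

\[
\text{CR} = \frac{16 \ \text{bits}}{4 \ \text{bits}} = 4\times
\]

The quantized KV cache memory is then:

\[
M_{\text{KV, Q4}} = \frac{335.5 \ \text{MB}}{4} \approx 83.9 \ \text{MB}
\]
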
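The 4x ratio counts only the 4-bit codes. Most practical schemes also store a scale per group of elements, which lowers the effective ratio slightly. The sketch below shows both cases. The group size of 64 and the FP16 scale are illustrative assumptions, not decisions recorded in the TurboQuant plan, and the function names are hypothetical.

```python
from typing import Optional

FP16_BITS = 16


def effective_bits(code_bits: int, group_size: Optional[int] = None,
                   scale_bits: int = 16) -> float:
    """Bits per element, optionally including one scale per group (assumed layout)."""
    if group_size is None:
        return float(code_bits)
    return code_bits + scale_bits / group_size


def compressed_mb(fp16_mb: float, code_bits: int,
                  group_size: Optional[int] = None) -> float:
    """Scale an FP16 cache size by the effective bits per element."""
    return fp16_mb * effective_bits(code_bits, group_size) / FP16_BITS


fp16_mb = 335.54432  # 8-layer FP16 KV cache from the memory model

ideal = compressed_mb(fp16_mb, code_bits=4)
with_scales = compressed_mb(fp16_mb, code_bits=4, group_size=64)

print(f"codes only:        {ideal:.1f} MB")        # 83.9 MB
print(f"with group scales: {with_scales:.1f} MB")  # 89.1 MB
```

Under these assumptions the metadata costs about 5 MB, so the ~83.9 MB figure is best read as a lower bound.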
Quantization Function
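The planned quantization maps each FP16 value \(x\) to the 4-bit index of its nearest representable value:

\[
q(x) = \arg\min_{i \in \{0, \dots, 15\}} \left| x - c_i \right|, \qquad c_i \in \mathcal{C}_{4\text{-bit}}
\]

where \(\mathcal{C}_{4\text{-bit}} = \{c_0, \dots, c_{15}\}\) is the set of representable values in the 4-bit format. Dequantization reconstructs an approximation by looking the index back up:

\[
\hat{x} = c_{q(x)}
\]
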
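A nearest-value quantizer of this kind can be sketched in plain Python. Since the scheme is not yet chosen, the code assumes a uniform codebook of 16 levels in \([-1, 1]\) and an absmax scale so that inputs fit that range. Both are placeholders, and `quantize` and `dequantize` are illustrative names rather than functions from `turboquant.py`.

```python
from typing import List, Tuple

# Assumption: 16 uniformly spaced levels in [-1, 1]. NF4 or a learned
# codebook would replace this list without changing the functions below.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Return 4-bit indices of the nearest codebook entries and the absmax scale."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(v / scale - CODEBOOK[i]))
        for v in values
    ]
    return indices, scale


def dequantize(indices: List[int], scale: float) -> List[float]:
    """Look each index up in the codebook and undo the scaling."""
    return [CODEBOOK[i] * scale for i in indices]


x = [0.5, -1.2, 0.03, 2.0]
idx, s = quantize(x)
x_hat = dequantize(idx, s)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
print(idx, x_hat, max_err)
```

With a uniform codebook, the reconstruction error per element is at most half a step, here `s / 15`.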
Attention Quality Under Quantization
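The attention computation with quantized KV replaces the cached tensors with their dequantized approximations \(\hat{\mathbf{K}}\) and \(\hat{\mathbf{V}}\):

\[
\text{Attn}(\mathbf{Q}, \hat{\mathbf{K}}, \hat{\mathbf{V}}) = \text{softmax}\!\left( \frac{\mathbf{Q} \hat{\mathbf{K}}^{\top}}{\sqrt{d_k}} \right) \hat{\mathbf{V}}
\]

where \(d_k\) is the per-head key dimension.
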
The quality loss is bounded by the quantization error:
The specific quantization scheme (NF4, uniform, or learned) is to be determined during implementation.
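Until the scheme is fixed, the effect on attention outputs can be estimated empirically by comparing full-precision attention against attention over a quantized-then-dequantized cache. The sketch below does this for a single query on random data. The uniform 16-level codebook, the per-row absmax scale, and the toy sizes are all assumptions for illustration, and it reports a measured error rather than the analytical bound.

```python
import math
import random
from typing import List

CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]  # assumed uniform 4-bit levels


def fake_quant(row: List[float]) -> List[float]:
    """Quantize a row to the nearest codebook level and dequantize it again."""
    scale = max(abs(v) for v in row) or 1.0
    return [scale * min(CODEBOOK, key=lambda c: abs(v / scale - c)) for v in row]


def attention(q: List[float], keys: List[List[float]],
              values: List[List[float]]) -> List[float]:
    """Single-query scaled dot-product attention."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]


random.seed(0)
tokens, dim = 16, 8  # toy sizes, far smaller than W = 4096
q = [random.gauss(0, 1) for _ in range(dim)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]

ref = attention(q, K, V)
approx = attention(q, [fake_quant(k) for k in K], [fake_quant(v) for v in V])
err = max(abs(a - b) for a, b in zip(ref, approx))
print(f"max abs output error: {err:.4f}")
```

Running the same comparison with candidate codebooks (NF4, uniform, learned) is one way to choose between them before committing to an implementation.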
Implementation Plan
Integration Points
| Component | Integration |
|---|---|
| `SlidingWindowAttention` | Replace FP16 KV cache with quantized cache |
| `KVCache` | Add quantize/dequantize methods |
| `turboquant.py` | Core quantization primitives |
Planned API
```python
# Planned (not yet implemented)
from bit_axon.quantization.turboquant import TurboQuant

quantizer = TurboQuant(bits=4)

# During inference:
# quantizer.compress(kv_cache)    # Compress after each attention step
# quantizer.decompress(kv_cache)  # Decompress for attention computation
```
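To make the compress/decompress round trip concrete, here is a self-contained sketch that packs two 4-bit codes into each byte. `SketchQuantizer` is a hypothetical stand-in and not the planned `TurboQuant` class: its method signatures, the symmetric signed range, and the single absmax scale are all assumptions, and it works on flat Python lists rather than real cache tensors.

```python
from typing import List, Tuple


class SketchQuantizer:
    """Hypothetical stand-in for the planned TurboQuant class (not the real API)."""

    def __init__(self, bits: int = 4) -> None:
        if bits != 4:
            raise ValueError("this sketch only packs 4-bit codes")
        self.qmax = 7  # symmetric signed range [-7, 7]

    def compress(self, values: List[float]) -> Tuple[bytes, float, int]:
        """Return packed 4-bit codes, the scale, and the original length."""
        scale = (max(abs(v) for v in values) or 1.0) / self.qmax
        codes = [round(v / scale) + 8 for v in values]  # shift into 1..15
        if len(codes) % 2:
            codes.append(8)  # pad with the code for 0.0
        packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
        return packed, scale, len(values)

    def decompress(self, packed: bytes, scale: float, n: int) -> List[float]:
        """Unpack two codes per byte and rescale to floats."""
        codes: List[int] = []
        for byte in packed:
            codes.extend((byte & 0x0F, byte >> 4))
        return [(c - 8) * scale for c in codes[:n]]


quantizer = SketchQuantizer(bits=4)
x = [0.5, -1.2, 0.03, 2.0, -0.7]
packed, scale, n = quantizer.compress(x)
restored = quantizer.decompress(packed, scale, n)
print(len(packed), restored)
```

Five values occupy 3 bytes of codes plus one scale, compared with 10 bytes in FP16. A real implementation would likely keep the packed buffer inside `KVCache` and dequantize only the window needed for each attention step.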
Memory Budget Impact
| Configuration | KV Cache Memory | Total Inference Memory |
|---|---|---|
| FP16, 4K context | 335.5 MB | ~2,500 MB |
| FP16, 64K context | N/A (exceeds window) | ~2,900 MB |
| TurboQuant Q4, 64K context | ~83.9 MB | ~2,500 MB (target) |
References
- Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. (Related: NF4 quantization.)
- Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. (Related: KV cache management.)
- Liu, Z., et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. arXiv:2402.02750. (Related: KV cache quantization.)
See also
- Memory Budget — Current memory analysis and TurboQuant impact projections
- Quantization Guide — NF4 weight quantization (implemented)
- SWA + MoE — Where KV cache is used (layers 9–16 only)
- API — Quantization — Current quantization Python API