TurboQuant: KV Cache Compression¶

Status: Planned Source: src/bit_axon/quantization/turboquant.py (stub)

Abstract¶

TurboQuant is a planned KV cache compression technique for reducing the memory footprint of long-context inference in Bit-Axon. As context lengths grow to the target 64K tokens, the KV cache for the 8 sliding-window attention layers becomes the dominant memory consumer. TurboQuant aims to compress cached key and value tensors to lower precision with minimal quality loss, enabling the full 64K context to fit within the memory budget of a 16 GB MacBook Air.

Planned Feature

TurboQuant is referenced from ICLR 2026 submissions and is not yet implemented. The source file currently contains a stub. The details below describe the planned design.

Key Contributions (Planned)¶

KV cache quantization — Compress cached \(\mathbf{K}\) and \(\mathbf{V}\) tensors from FP16 to 4-bit representations.
Integration with SWA layers — Applied selectively to the 8 sliding-window attention layers (Zone 2) where KV caches are maintained.
Memory target — Reduce total inference memory for 64K context from ~2,900 MB to under 2,500 MB.

Mathematical Foundations¶

KV Cache Memory Model¶

For sliding-window attention with window size \(W\), the KV cache per layer requires:

\[ M_{\text{KV}} = 2 \times B \times W \times d_{\text{model}} \times \text{sizeof}(\text{dtype}) \]

where the factor 2 accounts for separate \(\mathbf{K}\) and \(\mathbf{V}\) tensors. For Bit-Axon's Zone 2 layers:

\(W = 4096\), \(d_{\text{model}} = 2560\), \(B = 1\) (single batch)
8 layers with KV caches
FP16 (2 bytes per element):

\[ M_{\text{KV}}^{\text{FP16}} = 8 \times 2 \times 4096 \times 2560 \times 2 = 335.5 \text{ MB} \]

Quantized KV Cache¶

TurboQuant targets 4-bit quantization of the KV cache. The compression ratio is:

\[ r = \frac{\text{sizeof}(\text{FP16})}{\text{sizeof}(\text{Q4})} = \frac{16}{4} = 4\times \]

The quantized KV cache memory:

\[ M_{\text{KV}}^{\text{Q4}} = \frac{M_{\text{KV}}^{\text{FP16}}}{4} = 83.9 \text{ MB} \]

Quantization Function¶

The planned quantization maps FP16 values to 4-bit indices:

\[ \mathbf{K}_{\text{quantized}} = Q(\mathbf{K}) = \text{argmin}_{\mathbf{K}' \in \mathcal{C}_{4\text{-bit}}} \|\mathbf{K} - \mathbf{K}'\|_2 \]

where \(\mathcal{C}_{4\text{-bit}}\) is the set of representable values in the 4-bit format. Dequantization reconstructs an approximation:

\[ \hat{\mathbf{K}} = DQ(\mathbf{K}_{\text{quantized}}) \approx \mathbf{K} \]

Attention Quality Under Quantization¶

The attention computation with quantized KV:

\[ \text{Attn}(\mathbf{Q}, \hat{\mathbf{K}}, \hat{\mathbf{V}}) = \text{softmax}\left(\frac{\mathbf{Q}\hat{\mathbf{K}}^T}{\sqrt{d_h}}\right)\hat{\mathbf{V}} \]

The quality loss is bounded by the quantization error:

\[ \|\text{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) - \text{Attn}(\mathbf{Q}, \hat{\mathbf{K}}, \hat{\mathbf{V}})\| \leq f(\|\mathbf{K} - \hat{\mathbf{K}}\|, \|\mathbf{V} - \hat{\mathbf{V}}\|) \]

The specific quantization scheme (NF4, uniform, or learned) is to be determined during implementation.

Implementation Plan¶

Integration Points¶

Component	Integration
`SlidingWindowAttention`	Replace FP16 KV cache with quantized cache
`KVCache`	Add quantize/dequantize methods
`turboquant.py`	Core quantization primitives

Planned API¶

# Planned (not yet implemented)
from bit_axon.quantization.turboquant import TurboQuant

quantizer = TurboQuant(bits=4)
# During inference:
# quantizer.compress(kv_cache)  # Compress after each attention step
# quantizer.decompress(kv_cache)  # Decompress for attention computation

Memory Budget Impact¶

Configuration	KV Cache Memory	Total Inference Memory
FP16, 4K context	335.5 MB	~2,500 MB
FP16, 64K context	N/A (exceeds window)	~2,900 MB
TurboQuant Q4, 64K context	~83.9 MB	~2,500 MB (target)

References¶

Dettmers, T., et al. (2024). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. (Related: NF4 quantization.)
Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. (Related: KV cache management.)
Liu, Z., et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. arXiv:2402.02750. (Related: KV cache quantization.)