Quantization

bit_axon.quantization

Functions

quantize_nf4
`quantize_nf4(weight: array, group_size: int = 64) -> tuple[array, array, array]`

Quantize a weight matrix to 4-bit NormalFloat (NF4) format.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `weight` | `array` | Weight matrix to quantize. | *required* |
| `group_size` | `int` | Number of elements per quantization group. | `64` |

Returns:

| Type | Description |
| ---- | ----------- |
| `tuple[array, array, array]` | Tuple of `(packed_weights, scales, biases)`. |
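The exact packing layout used by `bit_axon` is not documented here, but group-wise NF4 quantization can be sketched with NumPy. This is an illustration under assumptions: the helper names `quantize_nf4_sketch` / `dequantize_nf4_sketch` are hypothetical, the 16 NF4 levels are the standard NormalFloat-4 table, and the sketch uses symmetric absmax scaling, so it returns no `biases` term.

```python
import numpy as np

# The 16 NF4 levels: quantiles of a standard normal, rescaled to [-1, 1].
NF4_LEVELS = np.array([
    -1.0, -0.6961928010, -0.5250730515, -0.3949174881,
    -0.2844413817, -0.1847734302, -0.0910500363, 0.0,
    0.0795802996, 0.1595552862, 0.2461123914, 0.3379152417,
    0.4407098293, 0.5626170039, 0.7229568362, 1.0,
])

def quantize_nf4_sketch(weight, group_size=64):
    """Group-wise NF4: absmax-scale each group to [-1, 1], snap each
    element to the nearest NF4 level, pack two 4-bit codes per byte."""
    flat = weight.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True)      # per-group absmax
    normed = flat / np.where(scales == 0, 1, scales)
    # Index of the nearest NF4 level for every element (values in 0..15).
    codes = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1).astype(np.uint8)
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]       # two codes per byte
    return packed, scales.squeeze(1)

def dequantize_nf4_sketch(packed, scales, group_size=64):
    """Unpack 4-bit codes and rescale by the per-group absmax."""
    codes = np.empty((packed.shape[0], group_size), dtype=np.uint8)
    codes[:, 0::2] = packed >> 4
    codes[:, 1::2] = packed & 0x0F
    return NF4_LEVELS[codes] * scales[:, None]
```

With `group_size=64` each group stores 32 bytes of codes plus one scale, and the round-trip error per element is bounded by half the widest gap between adjacent NF4 levels times the group's absmax.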

replace_linear_with_quantized
`replace_linear_with_quantized(module: Module, group_size: int = 64, bits: int = 4)`

Recursively replace `nn.Linear` layers with `nn.QuantizedLinear`.

Skips layers whose input dimension is smaller than `group_size`.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `module` | `Module` | Root module to traverse. | *required* |
| `group_size` | `int` | Quantization group size. | `64` |
| `bits` | `int` | Quantization bit width. | `4` |

Returns:

| Type | Description |
| ---- | ----------- |
| `Module` | The modified module (mutated in-place). |
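The traversal can be sketched in plain Python. The `Module`, `Linear`, and `QuantizedLinear` classes below are minimal hypothetical stand-ins for the real `bit_axon`/`nn` types, and `replace_linear_sketch` only illustrates the recursive walk-and-swap pattern, including the skip for layers too small to form one quantization group.

```python
class Module:
    """Minimal stand-in: children are Module-valued attributes."""
    def children(self):
        return {k: v for k, v in vars(self).items() if isinstance(v, Module)}

class Linear(Module):
    def __init__(self, in_dim, out_dim):
        self.in_dim, self.out_dim = in_dim, out_dim

class QuantizedLinear(Module):
    def __init__(self, in_dim, out_dim, group_size, bits):
        self.in_dim, self.out_dim = in_dim, out_dim
        self.group_size, self.bits = group_size, bits

def replace_linear_sketch(module, group_size=64, bits=4):
    """Walk the module tree; swap each Linear whose input dimension is at
    least group_size for a QuantizedLinear. Mutates module in place."""
    for name, child in module.children().items():
        if isinstance(child, Linear):
            if child.in_dim < group_size:
                continue  # too small to form a single quantization group: skip
            setattr(module, name,
                    QuantizedLinear(child.in_dim, child.out_dim, group_size, bits))
        else:
            replace_linear_sketch(child, group_size, bits)
    return module
```

Because the function mutates its argument, the return value is the same object that was passed in; returning it simply allows call-site chaining.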