Quantization¶
Shrink a 3.2B-parameter model from 6.4 GB down to 1.76 GB so it fits comfortably on a 16 GB MacBook. This guide covers NF4 quantization (implemented), the QLoRA training workflow, merge and re-quantize pipelines, and planned quantization methods.
Why Quantize?¶
A 3.2B-parameter model in full FP16 precision needs roughly 6.4 GB of memory just for weights. On a MacBook Air M4 with 16 GB of unified memory, that leaves barely enough room for the KV cache, activations, and the operating system.
Quantization reduces weight precision from 16-bit floating point down to 4-bit integers, cutting weight memory by roughly 4× with minimal accuracy loss.
The real constraint is RAM, not compute
Apple Silicon has plenty of compute throughput. The bottleneck is fitting the model into 16 GB of unified memory alongside macOS, context windows, and KV caches. Quantization is what makes that possible.
Memory Savings¶
| Configuration | Weight Memory | Inference Memory (4K ctx) | Inference Memory (64K ctx) |
|---|---|---|---|
| FP16 (unquantized) | ~6,400 MB | Does not fit | Does not fit |
| Q4 (NF4) | ~1,760 MB | ~2,500 MB | ~2,900 MB |
| QLoRA training (4-bit base) | ~1,760 MB | ~3,200–3,700 MB | n/a |
Q4 drops weight storage from 6.4 GB to 1.76 GB (a 3.6× reduction) and leaves room for 64K context windows with KV caches under 3 GB.
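As a back-of-envelope check of these numbers: NF4 stores a 4-bit code per weight plus one FP16 scale and one FP16 bias per 64-weight group, i.e. 4.5 effective bits per weight. The small gap between this estimate and the table's ~1,760 MB is just rounding convention (MB vs. MiB):

```python
params = 3.2e9
group_size = 64

fp16_bytes = params * 2                         # 16 bits per weight
bits_per_weight = 4 + (16 + 16) / group_size    # 4-bit code + FP16 scale/bias per group = 4.5
q4_bytes = params * bits_per_weight / 8

print(f"{fp16_bytes / 1e6:,.0f} MB")            # 6,400 MB
print(f"{q4_bytes / 1e6:,.0f} MB")              # 1,800 MB, close to the ~1,760 MB above
print(f"{fp16_bytes / q4_bytes:.1f}x smaller")  # 3.6x smaller
```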
NF4 Quantization (Implemented)¶
Bit-Axon uses 4-bit NormalFloat (NF4) quantization, an affine quantization scheme optimized for normally distributed neural network weights. It groups weights into blocks of `group_size` (default 64), computes a per-group scale and bias, and packs each weight into 4 bits.
Under the hood, Bit-Axon delegates to MLX primitives:
- `mx.quantize(weight, group_size, bits=4)` – packs FP16 weights into 4-bit integers with per-group scales and biases
- `mx.dequantize(packed, scales, biases, group_size, bits)` – unpacks back to FP16
- `nn.QuantizedLinear.from_linear(linear, group_size, bits)` – replaces an `nn.Linear` layer with a quantized version that runs 4-bit matmuls natively on Apple Silicon
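For a single layer, the conversion looks like this (a minimal sketch; the layer dimensions are arbitrary):

```python
import mlx.core as mx
import mlx.nn as nn

# A plain linear layer (dimensions chosen arbitrarily for illustration)
linear = nn.Linear(256, 256)

# Swap in a 4-bit quantized equivalent; the forward pass now runs
# quantized matmuls natively on Apple Silicon
qlinear = nn.QuantizedLinear.from_linear(linear, group_size=64, bits=4)

x = mx.random.normal((1, 256))
y = qlinear(x)  # same call signature as nn.Linear
```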
CLI¶
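```bash
bit-axon quantize ./model --bits 4 --group-size 64
```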
This loads the FP16 model from `./model`, replaces every `nn.Linear` with `nn.QuantizedLinear`, and saves the quantized weights to `./model/q4`.
| Flag | Default | Description |
|---|---|---|
| `--output` / `-o` | `<model>/q4` | Output directory |
| `--bits` / `-b` | `4` | Quantization bit width |
| `--group-size` / `-g` | `64` | Group size for affine quantization |
Python API¶
quantize_nf4¶
Pack a single weight tensor into 4-bit NormalFloat format:
```python
import mlx.core as mx
from bit_axon.quantization import quantize_nf4, dequantize_nf4

# weight: mx.array of shape (output_dim, input_dim), dtype float16, e.g.:
weight = mx.random.normal((1024, 1024)).astype(mx.float16)

packed, scales, biases = quantize_nf4(weight, group_size=64)
# packed: uint32 array (each element stores 8 × 4-bit weights)
# scales: float16 array of shape (output_dim, input_dim // group_size)
# biases: float16 array of shape (output_dim, input_dim // group_size)

# Unpack back to FP16
restored = dequantize_nf4(packed, scales, biases, group_size=64, bits=4)
```
replace_linear_with_quantized¶
Recursively walk a model and replace all `nn.Linear` layers with `nn.QuantizedLinear`:
```python
from bit_axon import BitAxonModel, BitAxonConfig
from bit_axon.quantization import replace_linear_with_quantized

config = BitAxonConfig()
model = BitAxonModel(config)

# Replace every nn.Linear with nn.QuantizedLinear (in-place)
model = replace_linear_with_quantized(model, group_size=64, bits=4)
# model is now fully quantized and ready for inference
```
MoE support
`replace_linear_with_quantized` handles MoE expert lists correctly. It walks both dict-style children (named layers) and list-style children (expert arrays inside `MixtureOfExperts`), quantizing every expert's linear layers.
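The walk itself is straightforward to picture. Here is a sketch of the traversal pattern, not Bit-Axon's actual code, assuming MLX's dict-like `nn.Module` containers:

```python
import mlx.nn as nn

def replace_linears_sketch(module: nn.Module, group_size: int = 64, bits: int = 4):
    # Hypothetical illustration of the recursive walk. MLX modules are
    # dict-like, so items() yields named children; list values hold
    # expert arrays such as those inside an MoE layer.
    for name, child in list(module.items()):
        if isinstance(child, nn.Linear):
            # dict-style child: replace the layer in place
            setattr(module, name, nn.QuantizedLinear.from_linear(child, group_size, bits))
        elif isinstance(child, list):
            # list-style children: quantize each expert's linear layers
            for i, item in enumerate(child):
                if isinstance(item, nn.Linear):
                    child[i] = nn.QuantizedLinear.from_linear(item, group_size, bits)
                elif isinstance(item, nn.Module):
                    replace_linears_sketch(item, group_size, bits)
        elif isinstance(child, nn.Module):
            replace_linears_sketch(child, group_size, bits)
    return module
```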
How It Works¶
```
FP16 Weight Matrix
┌───────────────────────────────────────────────────┐
│ w0 w1 w2 w3 w4 w5 ...   shape: (out, in), float16 │
└───────────────────────────────────────────────────┘
        │
        ▼ split into groups of 64
┌───────────────────────────────────────────────────┐
│ Group 0: [w0..w63]   → scale0, bias0, 4-bit codes │
│ Group 1: [w64..w127] → scale1, bias1, 4-bit codes │
│ ...                                               │
└───────────────────────────────────────────────────┘
        │
        ▼ pack 8 codes per uint32
┌───────────────────────────────────────────────────┐
│ packed: uint32 array   ← 4× smaller than float16  │
│ scales: float16 array  ← 1 scale per group of 64  │
│ biases: float16 array  ← 1 bias per group of 64   │
└───────────────────────────────────────────────────┘
```
Each group of 64 weights gets its own affine mapping: `w_quantized = (w - bias) / scale`. The 4-bit codes are packed 8 per uint32 word. During inference, `nn.QuantizedLinear` unpacks on the fly and computes matmuls in the quantized domain, with no FP16 intermediates for the weight matrix.
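You can watch the round trip directly with the MLX primitives; the reconstruction error stays small because each 64-weight group gets its own scale and bias (a quick sketch with random data):

```python
import mlx.core as mx

# Random stand-in for a weight matrix (last dim must be a multiple of group_size)
w = mx.random.normal((128, 256)).astype(mx.float16)

# Pack into 4-bit codes with per-group scales and biases
packed, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Unpack and measure the worst-case per-weight error
w_hat = mx.dequantize(packed, scales, biases, group_size=64, bits=4)
print(mx.abs(w - w_hat).max())  # small, but nonzero -- quantization is lossy
```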
QLoRA: Quantization in the Training Workflow¶
QLoRA (Quantized Low-Rank Adaptation) freezes the base model in Q4 and trains only small LoRA or DoRA adapters on top. This gives you fine-tuning quality close to full FP16 training while keeping memory usage at ~3.2–3.7 GB.
```
┌───────────────────────────────────────────────┐
│ Base Model (frozen, Q4)                       │
│ ┌───────────────────────────────────────────┐ │
│ │ nn.QuantizedLinear (4-bit weights)        │ │
│ └─────────────────────┬─────────────────────┘ │
│                       │                       │
│                       ▼                       │
│ ┌───────────────────────────────────────────┐ │
│ │ LoRA: A @ B (rank 8, float16)   ← trained │ │
│ │ or DoRA: magnitude + direction  ← trained │ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────┘
```
Training with QLoRA¶
```bash
bit-axon train data.json \
  --lora-rank 8 \
  --quantize-bits 4 \
  --quantize-group-size 64
```
Under the hood, the training pipeline does:
- Load the model in FP16
- Quantize all `nn.Linear` → `nn.QuantizedLinear` (Q4, group_size=64)
- Apply LoRA/DoRA adapters on top of the frozen quantized layers
- Train only adapter parameters (`lora_a`, `lora_b`, and optionally the DoRA magnitude `.m`)
- Save adapter weights only (a few MB)
Python API¶
```python
from bit_axon import BitAxonModel, BitAxonConfig
from bit_axon.quantization import replace_linear_with_quantized
from bit_axon.training import apply_lora_to_model

# Step 1: Load model
config = BitAxonConfig()
model = BitAxonModel(config)

# Step 2: Quantize base to Q4
model = replace_linear_with_quantized(model, group_size=64, bits=4)

# Step 3: Wrap with LoRA adapters (rank 8)
model = apply_lora_to_model(
    model,
    rank=8,
    alpha=16,
    use_dora=False,
    target_modules=["attention", "moe"],
)

# Step 4: Train only adapter parameters
# Only lora_a, lora_b (and .m for DoRA) are trainable.
# Base quantized weights are frozen.
```
Do not update base weights during QLoRA
`Trainer.get_trainable_params()` filters strictly to adapter parameters. If you write a custom training loop, make sure you freeze the quantized base weights; otherwise you'll be computing gradients through quantized matmuls with degraded precision.
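If you do roll your own loop, one way to enforce this with MLX's standard freeze/unfreeze API might look like the following sketch (here `loss_fn` is a placeholder for your loss, and the key names follow the `lora_a`/`lora_b` convention above):

```python
import mlx.nn as nn
import mlx.optimizers as optim

# Freeze everything, including the quantized base weights...
model.freeze()
# ...then unfreeze only the adapter parameters by key name
model.unfreeze(keys=["lora_a", "lora_b"], strict=False)

optimizer = optim.AdamW(learning_rate=1e-4)

def loss_fn(model, batch):
    # Placeholder: compute your training loss (e.g. cross-entropy) here
    ...

# nn.value_and_grad differentiates only model.trainable_parameters(),
# i.e. the unfrozen adapter weights
loss_and_grad = nn.value_and_grad(model, loss_fn)

def train_step(batch):
    loss, grads = loss_and_grad(model, batch)
    optimizer.update(model, grads)
    return loss
```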
Merge and Re-Quantize¶
After training, you'll want to merge the LoRA/DoRA adapters back into the base model and re-quantize for efficient inference.
```
Q4 Base + LoRA Adapter
          │
          ▼ merge_adapters()
FP16 Base (LoRA fused in, dequantized)
          │
          ▼ quantize_model()
Q4 Merged Model (ready for deployment)
```
CLI¶
```bash
bit-axon merge ./base-model \
  --adapter ./adapter-checkpoint \
  --output ./merged-model \
  --bits 4 \
  --group-size 64
```
By default, the merge command re-quantizes after fusing adapters. To keep the merged model in FP16 (e.g., for further processing), use `--no-re-quantize`:
```bash
bit-axon merge ./base-model \
  --adapter ./adapter-checkpoint \
  --output ./merged-model \
  --no-re-quantize
```
Python API¶
```python
from bit_axon.training import load_and_merge

# End-to-end: load base + adapter, merge, re-quantize, save
load_and_merge(
    base_model_path="./base-model",
    adapter_path="./adapter-checkpoint",
    output_dir="./merged-model",
    quantize_after_merge=True,
    bits=4,
    group_size=64,
    lora_rank=8,
)
```
For finer control over each step:
```python
from bit_axon.training import (
    merge_adapters,
    dequantize_model,
    quantize_model,
    save_merged_model,
)

# model, config, and tokenizer are assumed to be loaded already

# Step 1: Merge LoRA/DoRA adapters into the base
model = merge_adapters(model)  # calls .fuse() on every LoRALinear/DoRALinear

# Step 2: Dequantize from Q4 to FP16
model = dequantize_model(model)  # QuantizedLinear → nn.Linear (float16)

# Step 3: Re-quantize to Q4
model = quantize_model(model, bits=4, group_size=64)

# Step 4: Save
save_merged_model(model, output_dir="./merged-model", config=config, tokenizer=tokenizer)
```
Merge then quantize separately for evaluation
The full pipeline evaluates perplexity on the merged (unquantized) model before re-quantizing. This gives a clean quality metric without quantization noise. Only the final deployment model gets re-quantized.
Planned Quantization Methods¶
Bit-Axon has two additional quantization schemes in development. These are not yet implemented; the corresponding modules contain stubs.
Ternary Quantization (1.58-bit BitNet)¶
File: `src/bit_axon/quantization/ternary.py` (stub)
Ternary (1.58-bit) quantization represents each weight as one of three values: {-1, 0, +1}. This eliminates multiplications from matmuls entirely (they reduce to sign flips and additions) and is the core idea behind BitNet b1.58.
| Precision | Bits per weight | Memory (3.2B) |
|---|---|---|
| FP16 | 16 | ~6,400 MB |
| NF4 | 4 | ~1,760 MB |
| Ternary | 1.58 | ~700 MB |
Status
The ternary module (`quantization/ternary.py`) is a stub with no implementation yet. It is planned for a future release.
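Until the module lands, the core scheme is easy to sketch. BitNet b1.58 uses "absmean" scaling: divide by the mean absolute weight, round, and clip to {-1, 0, +1}. The following is an illustration of the published technique, not Bit-Axon's eventual implementation:

```python
import mlx.core as mx

def ternary_quantize(w: mx.array, eps: float = 1e-5):
    # BitNet b1.58 absmean scheme: a single floating-point scale per tensor
    scale = mx.abs(w).mean() + eps
    codes = mx.clip(mx.round(w / scale), -1, 1)  # each code is in {-1, 0, +1}
    return codes, scale

def ternary_dequantize(codes: mx.array, scale: mx.array) -> mx.array:
    # Matmuls against `codes` reduce to sign flips and additions;
    # the one real multiply (by `scale`) happens once per output
    return codes * scale
```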
TurboQuant KV Cache Compression¶
File: `src/bit_axon/quantization/turboquant.py` (stub)
TurboQuant compresses the KV cache, which grows linearly with sequence length, to reduce memory usage for long-context inference. The technique is based on ICLR 2026 research on learned KV cache quantization.
For 64K context windows, KV cache memory can dominate. TurboQuant aims to keep the total inference footprint under 3 GB even at maximum context length.
Status
The TurboQuant module (`quantization/turboquant.py`) is a stub with no implementation yet. It is planned for a future release.
Quick Reference¶
```bash
# Quantize a model
bit-axon quantize ./model --bits 4 --group-size 64

# Train with QLoRA
bit-axon train data.json --lora-rank 8

# Merge adapters and re-quantize
bit-axon merge ./base-model --adapter ./adapter --output ./merged

# Run inference (auto-quantizes on load)
bit-axon run --model ./model --prompt "Hello, world!"
```

```python
# Quantize
from bit_axon.quantization import quantize_nf4, replace_linear_with_quantized
packed, scales, biases = quantize_nf4(weight, group_size=64)
model = replace_linear_with_quantized(model, group_size=64, bits=4)

# Merge
from bit_axon.training import load_and_merge
load_and_merge("./base", "./adapter", "./output", quantize_after_merge=True)
```
See also¶
- Training Guide – QLoRA training with quantized base weights
- Memory Budget – Detailed memory analysis and context length strategy
- TurboQuant Paper – Planned KV cache compression for long contexts
- CLI Reference – `quantize` and `merge` command options
- API – Quantization – `quantize_nf4` and `replace_linear_with_quantized` Python API