Quantization

Shrink a 3.2B-parameter model from 6.4 GB down to 1.76 GB so it fits comfortably on a 16 GB MacBook. This guide covers NF4 quantization (implemented), the QLoRA training workflow, merge and re-quantize pipelines, and planned quantization methods.


Why Quantize?

A 3.2B-parameter model in full FP16 precision needs roughly 6.4 GB of memory just for weights. On a MacBook Air M4 with 16 GB of unified memory, that leaves barely enough room for the KV cache, activations, and the operating system.

Quantization reduces weight precision from 16-bit floating point down to 4-bit integers, cutting weight memory by roughly 4× with minimal accuracy loss.

The real constraint is RAM, not compute

Apple Silicon has compute to spare. The bottleneck is fitting the model into 16 GB of unified memory alongside macOS, context windows, and KV caches. Quantization is what makes that fit possible.

Memory Savings

| Configuration | Weight Memory | Inference Memory (4K ctx) | Inference Memory (64K ctx) |
|---|---|---|---|
| FP16 (unquantized) | ~6,400 MB | Does not fit | Does not fit |
| Q4 (NF4) | ~1,760 MB | ~2,500 MB | ~2,900 MB |
| QLoRA training (4-bit base) | ~1,760 MB | ~3,200–3,700 MB | n/a |

Q4 drops weight storage from 6.4 GB to 1.76 GB (a 3.6× reduction) and leaves room for 64K context windows with KV caches under 3 GB.
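
The arithmetic behind the table is simple: each group of 64 weights stores 64 four-bit codes plus one FP16 scale and one FP16 bias, i.e. 4.5 effective bits per weight. A back-of-envelope check (pure arithmetic, not Bit-Axon code):

```python
# Effective bits per weight for group-wise 4-bit quantization:
# 4 code bits + one fp16 scale + one fp16 bias shared by each group of 64.
group_size = 64
effective_bits = 4 + (16 + 16) / group_size   # 4.5 bits/weight

params = 3.2e9
fp16_mb = params * 16 / 8 / 1e6               # weight bytes at 16 bits
q4_mb = params * effective_bits / 8 / 1e6     # weight bytes at 4.5 bits

print(f"effective bits/weight: {effective_bits}")
print(f"FP16: {fp16_mb:.0f} MB, Q4: {q4_mb:.0f} MB, ratio: {fp16_mb / q4_mb:.2f}x")
```

This lands near the ~1,760 MB figure above; the exact number depends on which layers are quantized and on MB-vs-MiB rounding.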


NF4 Quantization (Implemented)

Bit-Axon uses 4-bit NormalFloat (NF4) quantization, an affine quantization scheme optimized for normally-distributed neural network weights. It groups weights into blocks of group_size (default 64), computes a per-group scale and bias, and packs each weight into 4 bits.

Under the hood, Bit-Axon delegates to MLX primitives:

  • mx.quantize(weight, group_size, bits=4) β€” packs FP16 weights into 4-bit integers with per-group scales and biases
  • mx.dequantize(packed, scales, biases, group_size, bits) β€” unpacks back to FP16
  • nn.QuantizedLinear.from_linear(linear, group_size, bits) β€” replaces an nn.Linear layer with a quantized version that runs 4-bit matmuls natively on Apple Silicon

CLI

Quantize a model to 4-bit
bit-axon quantize ./model --bits 4 --group-size 64

This loads the FP16 model from ./model, replaces every nn.Linear with nn.QuantizedLinear, and saves the quantized weights to ./model/q4.

Full options
bit-axon quantize ./model \
  --output ./model-q4 \
  --bits 4 \
  --group-size 64

| Flag | Default | Description |
|---|---|---|
| --output / -o | <model>/q4 | Output directory |
| --bits / -b | 4 | Quantization bit width |
| --group-size / -g | 64 | Group size for affine quantization |

Python API

quantize_nf4

Pack a single weight tensor into 4-bit NormalFloat format:

import mlx.core as mx
from bit_axon.quantization import quantize_nf4, dequantize_nf4

# weight: mx.array of shape (output_dim, input_dim), dtype float16
packed, scales, biases = quantize_nf4(weight, group_size=64)

# packed: uint32 array (each element stores 8 × 4-bit weights)
# scales: float16 array of shape (output_dim, input_dim // group_size)
# biases: float16 array of shape (output_dim, input_dim // group_size)

# Unpack back to FP16
restored = dequantize_nf4(packed, scales, biases, group_size=64, bits=4)

replace_linear_with_quantized

Recursively walk a model and replace all nn.Linear layers with nn.QuantizedLinear:

from bit_axon import BitAxonModel, BitAxonConfig
from bit_axon.quantization import replace_linear_with_quantized

config = BitAxonConfig()
model = BitAxonModel(config)

# Replace every nn.Linear with nn.QuantizedLinear (in-place)
model = replace_linear_with_quantized(model, group_size=64, bits=4)

# model is now fully quantized and ready for inference

MoE support

replace_linear_with_quantized handles MoE expert lists correctly. It walks both dict-style children (named layers) and list-style children (expert arrays inside MixtureOfExperts), quantizing every expert's linear layers.
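
The traversal this note describes can be sketched with toy stand-ins. The classes below are illustrative placeholders, not the MLX or Bit-Axon types; the point is that both attribute-style children and list-style children (MoE expert arrays) must be visited:

```python
class Linear:            # stand-in for nn.Linear
    pass

class QuantizedLinear:   # stand-in for nn.QuantizedLinear
    def __init__(self, linear, group_size=64, bits=4):
        self.group_size, self.bits = group_size, bits

def replace_linears(module, group_size=64, bits=4):
    """Recursively swap Linear children for QuantizedLinear, in place."""
    for name, child in vars(module).items():
        if isinstance(child, Linear):
            setattr(module, name, QuantizedLinear(child, group_size, bits))
        elif isinstance(child, list):                 # e.g. MoE expert arrays
            for i, item in enumerate(child):
                if isinstance(item, Linear):
                    child[i] = QuantizedLinear(item, group_size, bits)
                elif hasattr(item, "__dict__"):       # expert sub-module
                    replace_linears(item, group_size, bits)
        elif hasattr(child, "__dict__"):              # nested sub-module
            replace_linears(child, group_size, bits)
    return module
```

Skipping the list branch is the classic bug here: dict-only traversals silently leave every expert's linear layers in FP16.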

How It Works

FP16 Weight Matrix
┌──────────────────────────────┐
│ w₁  w₂  w₃  w₄  w₅  w₆ ...   │  shape: (out, in), float16
└──────────────────────────────┘
           │
           ▼  split into groups of 64
┌───────────────────────────────────────────────────┐
│ Group 0: [w₁..w₆₄]   → scale₀, bias₀, 4-bit codes │
│ Group 1: [w₆₅..w₁₂₈] → scale₁, bias₁, 4-bit codes │
│ ...                                               │
└───────────────────────────────────────────────────┘
           │
           ▼  pack 8 codes per uint32
┌──────────────────────────────┐
│ packed: uint32 array         │  4× smaller than float16
│ scales: float16 array        │  1 scale per group of 64
│ biases: float16 array        │  1 bias per group of 64
└──────────────────────────────┘

Each group of 64 weights gets its own affine mapping: w_quantized = (w - bias) / scale. The 4-bit codes are packed 8 per uint32 word. During inference, nn.QuantizedLinear unpacks on the fly and computes matmuls in the quantized domain, with no FP16 intermediates for the weight matrix.
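
The same stages can be emulated in plain NumPy. This is an illustrative min/max affine grid, not Bit-Axon's actual NF4 kernel (MLX's codebook, packing order, and rounding differ in detail), but the group/scale/bias/packing structure is the same:

```python
import numpy as np

def quantize_group_affine(w, group_size=64, bits=4):
    """Group-wise affine quantization: 4-bit codes, one scale/bias per group.
    Assumes in_dim is divisible by group_size."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)            # per-group bias
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / (2**bits - 1)               # per-group step
    scale = np.where(scale == 0, 1.0, scale)              # avoid div by zero
    codes = np.round((groups - w_min) / scale).astype(np.uint32)  # 0..15

    # Pack 8 consecutive 4-bit codes into each uint32 word.
    flat = codes.reshape(out_dim, -1)
    packed = np.zeros((out_dim, flat.shape[1] // 8), dtype=np.uint32)
    for i in range(8):
        packed |= flat[:, i::8] << (4 * i)
    return packed, scale.squeeze(-1), w_min.squeeze(-1)

def dequantize_group_affine(packed, scale, bias, group_size=64):
    """Unpack codes and map back: w ≈ code * scale + bias."""
    out_dim = packed.shape[0]
    codes = np.zeros((out_dim, packed.shape[1] * 8), dtype=np.uint32)
    for i in range(8):
        codes[:, i::8] = (packed >> (4 * i)) & 0xF
    groups = codes.reshape(out_dim, -1, group_size).astype(np.float32)
    return (groups * scale[..., None] + bias[..., None]).reshape(out_dim, -1)
```

The round-trip error is bounded by half a quantization step per weight, i.e. scale/2 within each group.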


QLoRA: Quantization in the Training Workflow

QLoRA (Quantized Low-Rank Adaptation) freezes the base model in Q4 and trains only small LoRA or DoRA adapters on top. This gives you fine-tuning quality close to full FP16 training while keeping memory usage at ~3.2–3.7 GB.

┌─────────────────────────────────────────────┐
│ Base Model (frozen, Q4)                     │
│  ┌───────────────────────────────────────┐  │
│  │ nn.QuantizedLinear (4-bit weights)    │  │
│  └───────────────┬───────────────────────┘  │
│                  │                          │
│                  ▼                          │
│  ┌───────────────────────────────────────┐  │
│  │ LoRA: A @ B (rank 8, float16)         │  │  ← trained
│  │ or DoRA: magnitude + direction        │  │  ← trained
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

Training with QLoRA

Fine-tune with QLoRA (4-bit base + LoRA adapters)
bit-axon train data.json \
  --lora-rank 8 \
  --quantize-bits 4 \
  --quantize-group-size 64

Under the hood, the training pipeline does:

  1. Load model in FP16
  2. Quantize all nn.Linear → nn.QuantizedLinear (Q4, group_size=64)
  3. Apply LoRA/DoRA adapters on top of the frozen quantized layers
  4. Train only adapter parameters (lora_a, lora_b, and optionally DoRA magnitude .m)
  5. Save adapter weights only (a few MB)

Python API

from bit_axon import BitAxonModel, BitAxonConfig
from bit_axon.quantization import replace_linear_with_quantized
from bit_axon.training import apply_lora_to_model

# Step 1: Load model
config = BitAxonConfig()
model = BitAxonModel(config)

# Step 2: Quantize base to Q4
model = replace_linear_with_quantized(model, group_size=64, bits=4)

# Step 3: Wrap with LoRA adapters (rank 8)
model = apply_lora_to_model(
    model,
    rank=8,
    alpha=16,
    use_dora=False,
    target_modules=["attention", "moe"],
)

# Step 4: Train only adapter parameters
# Only lora_a, lora_b (and .m for DoRA) are trainable.
# Base quantized weights are frozen.

Do not update base weights during QLoRA

Trainer.get_trainable_params() filters strictly to adapter parameters. If you write a custom training loop, make sure you freeze the quantized base weights; otherwise you'll be computing gradients through quantized matmuls with degraded precision.
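
In a custom loop, that filter reduces to name-based selection over the parameter tree. A minimal sketch, assuming a flat {name: tensor} dict; the suffixes mirror the parameter names listed above, but this is not Trainer's actual implementation:

```python
# Adapter parameter name suffixes: LoRA factors plus the DoRA magnitude ".m".
ADAPTER_SUFFIXES = ("lora_a", "lora_b", ".m")

def select_adapter_params(params: dict) -> dict:
    """Keep only adapter parameters from a flat {name: tensor} dict."""
    return {
        name: value
        for name, value in params.items()
        if name.endswith(ADAPTER_SUFFIXES)  # tuple form checks any suffix
    }
```

Everything this function drops stays frozen; only the returned dict should be handed to the optimizer.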


Merge and Re-Quantize

After training, you'll want to merge the LoRA/DoRA adapters back into the base model and re-quantize for efficient inference.

Q4 Base + LoRA Adapter
         │
         ▼  merge_adapters()
FP16 Base (LoRA fused in, dequantized)
         │
         ▼  quantize_model()
Q4 Merged Model (ready for deployment)

CLI

Merge adapters and re-quantize
bit-axon merge ./base-model \
  --adapter ./adapter-checkpoint \
  --output ./merged-model \
  --bits 4 \
  --group-size 64

By default, the merge command re-quantizes after fusing adapters. To keep the merged model in FP16 (e.g., for further processing), use --no-re-quantize:

Merge without re-quantizing
bit-axon merge ./base-model \
  --adapter ./adapter-checkpoint \
  --output ./merged-model \
  --no-re-quantize

Python API

from bit_axon.training import load_and_merge

# End-to-end: load base + adapter, merge, re-quantize, save
load_and_merge(
    base_model_path="./base-model",
    adapter_path="./adapter-checkpoint",
    output_dir="./merged-model",
    quantize_after_merge=True,
    bits=4,
    group_size=64,
    lora_rank=8,
)

For finer control over each step:

from bit_axon.training import (
    merge_adapters,
    dequantize_model,
    quantize_model,
    save_merged_model,
)

# Step 1: Merge LoRA/DoRA adapters into the base
model = merge_adapters(model)  # calls .fuse() on every LoRALinear/DoRALinear

# Step 2: Dequantize from Q4 to FP16
model = dequantize_model(model)  # QuantizedLinear → nn.Linear (float16)

# Step 3: Re-quantize to Q4
model = quantize_model(model, bits=4, group_size=64)

# Step 4: Save
save_merged_model(model, output_dir="./merged-model", config=config, tokenizer=tokenizer)

Merge then quantize separately for evaluation

The full pipeline evaluates perplexity on the merged (unquantized) model before re-quantizing. This gives a clean quality metric without quantization noise. Only the final deployment model gets re-quantized.


Planned Quantization Methods

Bit-Axon has two additional quantization schemes in development. Neither is implemented yet; the corresponding modules contain stubs.

Ternary Quantization (1.58-bit BitNet)

File: src/bit_axon/quantization/ternary.py (stub)

Ternary (1.58-bit) quantization represents each weight as one of three values: {-1, 0, +1}. This eliminates multiplications entirely from matmuls (they reduce to sign flips and additions) and is the core idea behind BitNet b1.58.

| Precision | Bits per weight | Memory (3.2B) |
|---|---|---|
| FP16 | 16 | ~6,400 MB |
| NF4 | 4 | ~1,760 MB |
| Ternary | 1.58 | ~700 MB |
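
Since the module is still a stub, here is a sketch of the absmean ternarization described in the BitNet b1.58 paper, not Bit-Axon code: scale each weight matrix by its mean absolute value, round, and clip to {-1, 0, +1}, keeping one FP16 scale for dequantization.

```python
import numpy as np

def ternarize_absmean(w, eps=1e-8):
    """BitNet b1.58-style ternarization: clip(round(w / mean|w|), -1, 1)."""
    scale = np.abs(w).mean() + eps                        # one scalar per matrix
    codes = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return codes, scale                                   # w ≈ codes * scale

# A ternary matmul then needs no multiplies: each weight contributes
# +x, -x, or nothing to the accumulator.
```

The real savings come from a packed storage format and kernels that exploit the three-valued codes; the sketch only shows the quantization rule.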

Status

The ternary module (quantization/ternary.py) is a stub with no implementation yet. It is planned for a future release.

TurboQuant KV Cache Compression

File: src/bit_axon/quantization/turboquant.py (stub)

TurboQuant compresses the KV cache, which grows linearly with sequence length, to reduce memory usage for long-context inference. The technique is based on ICLR 2026 research on learned KV cache quantization.

For 64K context windows, KV cache memory can dominate. TurboQuant aims to keep the total inference footprint under 3 GB even at maximum context length.
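The pressure is easy to quantify: the cache holds one K and one V vector per layer, KV head, and position, so memory grows linearly with sequence length and drops 4× if entries shrink from 16 bits to 4. The config values below are hypothetical placeholders, not Bit-Axon's architecture:

```python
def kv_cache_mb(n_layers, n_kv_heads, head_dim, seq_len, bits):
    """KV cache size in MB: K and V, one entry per layer/head/position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elems * bits / 8 / 1e6

# Hypothetical config for illustration only.
cfg = dict(n_layers=32, n_kv_heads=4, head_dim=64, seq_len=65536)
fp16 = kv_cache_mb(**cfg, bits=16)
q4 = kv_cache_mb(**cfg, bits=4)
print(f"64K ctx KV cache: FP16 {fp16:.0f} MB vs 4-bit {q4:.0f} MB")
```

Whatever the real architecture's constants, the linear term in seq_len is what a compressed cache attacks.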

Status

The TurboQuant module (quantization/turboquant.py) is a stub with no implementation yet. It is planned for a future release.


Quick Reference

# Quantize a model
bit-axon quantize ./model --bits 4 --group-size 64

# Train with QLoRA
bit-axon train data.json --lora-rank 8

# Merge adapters and re-quantize
bit-axon merge ./base-model --adapter ./adapter --output ./merged

# Run inference (auto-quantizes on load)
bit-axon run --model ./model --prompt "Hello, world!"

# Quantize
from bit_axon.quantization import quantize_nf4, replace_linear_with_quantized
packed, scales, biases = quantize_nf4(weight, group_size=64)
model = replace_linear_with_quantized(model, group_size=64, bits=4)

# Merge
from bit_axon.training import load_and_merge
load_and_merge("./base", "./adapter", "./output", quantize_after_merge=True)

See also