Skip to content

Memory BudgetΒΆ

Bit-Axon targets the MacBook Air M4 with 16 GB unified memory, of which roughly 8 GB is available for the model. This page breaks down every byte.

Available MemoryΒΆ

Item Size Notes
Physical RAM 16,384 MB MacBook Air M4 16 GB
macOS system usage ~2,500 MB macOS 15 + services
Other apps ~2,000 MB Browser, IDE, etc.
MLX overhead ~500 MB Framework runtime
Metal GPU safety margin ~1,500 MB Swap prevention buffer
Available for model ~8,000 MB This is the constraint

Maximizing Available Memory

Closing all other apps may yield ~10 GB, but ~8 GB is the baseline for stable operation. All design decisions assume this budget.

Weight Memory by PrecisionΒΆ

Precision Weight Size Fits?
FP16 ~6,400 MB ❌ No
Q8 (affine) ~3,200 MB ⚠️ Tight
Q4 (affine, group=64) ~1,760 MB βœ… Default
1.58-bit (Ternary) ~640 MB πŸ”¬ Experimental

FP16 weights alone consume 6.4 GB β€” nearly the entire budget before any activations or KV cache. Q4 quantization is the default strategy. MLX's nn.QuantizedLinear runs 4-bit matmul as a single fused Metal kernel, delivering 2–3Γ— speedup over FP16 while using 3.6Γ— less memory.

Parameter Count BreakdownΒΆ

Component Count FP16 Size Q4 Size
embed_tokens 65.5M 131 MB 36 MB
input_proj 5.2M 10.5 MB 2.9 MB
Layers 1–8: AxonSSMBlock (Γ—8) ~482M 964 MB 265 MB
Layers 9–16: AxonSWAMoEBlock (Γ—8) ~1,120M 2,240 MB 616 MB
Layers 17–24: AxonSSMMoEBlock (Γ—8) ~1,200M 2,400 MB 660 MB
output_proj 5.2M 10.5 MB 2.9 MB
lm_head (tied with embed) 0 0 MB 0 MB
Total ~3.2B ~6,400 MB ~1,760 MB

Weight Tying Savings

embed_tokens.weight and lm_head.weight share the same tensor, saving ~64 MB in FP16 (~18 MB in Q4).

SSM State Memory (Fixed, Layers 1–8 and 17–24)ΒΆ

The 16 SSM layers maintain a fixed-size hidden state regardless of context length:

  • SSM state: \((1, 7680, 16)\) per layer Γ— 16 layers = ~3.7 MB
  • Conv cache: \((1, 3, 7680)\) per layer Γ— 16 layers = ~0.7 MB
  • Total SSM state: ~4.4 MB β€” negligible

SWA KV Cache (Layers 9–16, 32 heads, head_dim=80)ΒΆ

Only the 8 SWA layers produce a KV cache. Formula:

\[\text{KV Cache} = 8 \text{ layers} \times 2_{(K+V)} \times 32 \text{ heads} \times 80_{d_k} \times \text{seq\_len} \times 2 \text{ bytes}\]
Context KV Cache (FP16) KV Cache (TurboQuant 3-bit)
1,024 80 MB ~14 MB
4,096 320 MB ~53 MB
16,384 1,280 MB ~213 MB
32,768 2,560 MB ~427 MB
65,536 5,120 MB ~853 MB

64K Context Without TurboQuant

At 65,536 tokens, the FP16 KV cache alone is ~5.1 GB β€” exceeding the entire available budget. TurboQuant is essential for 64K context.

Total Inference MemoryΒΆ

Precision 4K ctx 16K ctx 32K ctx 64K ctx
FP16 (no TQ) ~7,233 MB ~8,393 MB ~9,673 MB ~12,233 MB
Q4 (no TQ) ~2,593 MB ~3,353 MB ~4,633 MB ~7,193 MB
Q4 + TurboQuant 3-bit ~2,326 MB ~2,486 MB ~2,700 MB ~3,126 MB

Key conclusions:

  • βœ… Q4 alone fits within 8 GB up to ~16K context
  • βœ… Q4 + TurboQuant fits within 8 GB at full 64K context with ~5 GB to spare
  • ❌ FP16 inference exceeds 8 GB at any context length beyond 4K

QLoRA Training MemoryΒΆ

Full fine-tuning (FP16 weights + gradients + Adam optimizer) requires ~38.4 GB β€” physically impossible on 16 GB. QLoRA (Q4 frozen base + LoRA adapters) compresses training to:

Component Size Notes
Base weights Q4 (frozen) ~1,760 MB No gradients needed
LoRA params (rank=16, ~0.5%) ~32 MB Trainable
LoRA gradients (FP16) ~32 MB
LoRA Adam optimizer (\(m\), \(v\), FP32) ~256 MB
Activations (checkpointed, \(B=1\), \(T=2048\)) ~500–1,000 MB Gradient checkpointing enabled
SWA KV cache (\(T=2048\)) ~160 MB
MLX overhead ~500 MB
Total ~3,240–3,740 MB Fits within 8 GB βœ…

Thermal ManagementΒΆ

The fanless MacBook Air M4 dissipates all heat passively. Bit-Axon uses a thermal-aware training scheduler that monitors SoC temperature via macOS powermetrics:

CoolingSchedulerΒΆ

Temperature Action Rationale
< 75Β°C Normal speed No throttling needed
75–85Β°C Halve LoRA rank Same memory footprint, less compute
85–95Β°C 0.5s pause per step Forced cooling period
> 95Β°C Pause training Resume from checkpoint when cool

ThermalPolicyΒΆ

class ThermalPolicy:
    PAUSE_TEMP = 85    # Β°C β€” insert micro-pauses
    STOP_TEMP = 95     # Β°C β€” halt training entirely
    RESUME_TEMP = 75   # Β°C β€” resume normal operation
    CHECK_INTERVAL = 1 # seconds between temperature checks

MoE Thermal BenefitΒΆ

MoE sparsity directly reduces thermal load:

  • Only ~1.4B of 3.2B parameters are active per token (~44%)
  • 56% of the chip's compute units remain idle on each forward pass
  • This is roughly equivalent to running a 1.4B dense model in terms of heat generation
  • High likelihood of sustained training on a fanless MacBook Air

Context Length StrategyΒΆ

Scenario Context Recommended Config Total Memory
Chatbot 4K Q4 ~2.5 GB
Code generation 8K Q4 ~2.8 GB
Document summary 16K Q4 ~3.3 GB
PDF analysis 32K Q4 + TurboQuant recommended ~2.7 GB
Full 64K 64K Q4 + TurboQuant required ~3.1 GB

Weight Tying SavingsΒΆ

embed_tokens:  (32000, 2048) = 65.5M params
lm_head:       (32000, 2048) = 65.5M params  ← same tensor
                                ─────────────
Savings:                       65.5M params = ~131 MB (FP16) / ~36 MB (Q4)

By tying embed_tokens.weight = lm_head.weight, the model avoids storing a duplicate copy of the embedding matrix. This is particularly valuable at Q4 where every ~36 MB counts toward the 8 GB ceiling.


See alsoΒΆ