Memory BudgetΒΆ
Bit-Axon targets the MacBook Air M4 with 16 GB unified memory, of which roughly 8 GB is available for the model. This page breaks down every byte.
Available MemoryΒΆ
| Item | Size | Notes |
|---|---|---|
| Physical RAM | 16,384 MB | MacBook Air M4 16 GB |
| macOS system usage | ~2,500 MB | macOS 15 + services |
| Other apps | ~2,000 MB | Browser, IDE, etc. |
| MLX overhead | ~500 MB | Framework runtime |
| Metal GPU safety margin | ~1,500 MB | Swap prevention buffer |
| Available for model | ~8,000 MB | This is the constraint |
Maximizing Available Memory
Closing all other apps may yield ~10 GB, but ~8 GB is the baseline for stable operation. All design decisions assume this budget.
Weight Memory by PrecisionΒΆ
| Precision | Weight Size | Fits? |
|---|---|---|
| FP16 | ~6,400 MB | β No |
| Q8 (affine) | ~3,200 MB | β οΈ Tight |
| Q4 (affine, group=64) | ~1,760 MB | β Default |
| 1.58-bit (Ternary) | ~640 MB | π¬ Experimental |
FP16 weights alone consume 6.4 GB β nearly the entire budget before any activations or KV cache. Q4 quantization is the default strategy. MLX's nn.QuantizedLinear runs 4-bit matmul as a single fused Metal kernel, delivering 2β3Γ speedup over FP16 while using 3.6Γ less memory.
Parameter Count BreakdownΒΆ
| Component | Count | FP16 Size | Q4 Size |
|---|---|---|---|
embed_tokens | 65.5M | 131 MB | 36 MB |
input_proj | 5.2M | 10.5 MB | 2.9 MB |
| Layers 1β8: AxonSSMBlock (Γ8) | ~482M | 964 MB | 265 MB |
| Layers 9β16: AxonSWAMoEBlock (Γ8) | ~1,120M | 2,240 MB | 616 MB |
| Layers 17β24: AxonSSMMoEBlock (Γ8) | ~1,200M | 2,400 MB | 660 MB |
output_proj | 5.2M | 10.5 MB | 2.9 MB |
lm_head (tied with embed) | 0 | 0 MB | 0 MB |
| Total | ~3.2B | ~6,400 MB | ~1,760 MB |
Weight Tying Savings
embed_tokens.weight and lm_head.weight share the same tensor, saving ~64 MB in FP16 (~18 MB in Q4).
SSM State Memory (Fixed, Layers 1β8 and 17β24)ΒΆ
The 16 SSM layers maintain a fixed-size hidden state regardless of context length:
- SSM state: \((1, 7680, 16)\) per layer Γ 16 layers = ~3.7 MB
- Conv cache: \((1, 3, 7680)\) per layer Γ 16 layers = ~0.7 MB
- Total SSM state: ~4.4 MB β negligible
SWA KV Cache (Layers 9β16, 32 heads, head_dim=80)ΒΆ
Only the 8 SWA layers produce a KV cache. Formula:
| Context | KV Cache (FP16) | KV Cache (TurboQuant 3-bit) |
|---|---|---|
| 1,024 | 80 MB | ~14 MB |
| 4,096 | 320 MB | ~53 MB |
| 16,384 | 1,280 MB | ~213 MB |
| 32,768 | 2,560 MB | ~427 MB |
| 65,536 | 5,120 MB | ~853 MB |
64K Context Without TurboQuant
At 65,536 tokens, the FP16 KV cache alone is ~5.1 GB β exceeding the entire available budget. TurboQuant is essential for 64K context.
Total Inference MemoryΒΆ
| Precision | 4K ctx | 16K ctx | 32K ctx | 64K ctx |
|---|---|---|---|---|
| FP16 (no TQ) | ~7,233 MB | ~8,393 MB | ~9,673 MB | ~12,233 MB |
| Q4 (no TQ) | ~2,593 MB | ~3,353 MB | ~4,633 MB | ~7,193 MB |
| Q4 + TurboQuant 3-bit | ~2,326 MB | ~2,486 MB | ~2,700 MB | ~3,126 MB |
Key conclusions:
- β Q4 alone fits within 8 GB up to ~16K context
- β Q4 + TurboQuant fits within 8 GB at full 64K context with ~5 GB to spare
- β FP16 inference exceeds 8 GB at any context length beyond 4K
QLoRA Training MemoryΒΆ
Full fine-tuning (FP16 weights + gradients + Adam optimizer) requires ~38.4 GB β physically impossible on 16 GB. QLoRA (Q4 frozen base + LoRA adapters) compresses training to:
| Component | Size | Notes |
|---|---|---|
| Base weights Q4 (frozen) | ~1,760 MB | No gradients needed |
| LoRA params (rank=16, ~0.5%) | ~32 MB | Trainable |
| LoRA gradients (FP16) | ~32 MB | |
| LoRA Adam optimizer (\(m\), \(v\), FP32) | ~256 MB | |
| Activations (checkpointed, \(B=1\), \(T=2048\)) | ~500β1,000 MB | Gradient checkpointing enabled |
| SWA KV cache (\(T=2048\)) | ~160 MB | |
| MLX overhead | ~500 MB | |
| Total | ~3,240β3,740 MB | Fits within 8 GB β |
Thermal ManagementΒΆ
The fanless MacBook Air M4 dissipates all heat passively. Bit-Axon uses a thermal-aware training scheduler that monitors SoC temperature via macOS powermetrics:
CoolingSchedulerΒΆ
| Temperature | Action | Rationale |
|---|---|---|
| < 75Β°C | Normal speed | No throttling needed |
| 75β85Β°C | Halve LoRA rank | Same memory footprint, less compute |
| 85β95Β°C | 0.5s pause per step | Forced cooling period |
| > 95Β°C | Pause training | Resume from checkpoint when cool |
ThermalPolicyΒΆ
class ThermalPolicy:
PAUSE_TEMP = 85 # Β°C β insert micro-pauses
STOP_TEMP = 95 # Β°C β halt training entirely
RESUME_TEMP = 75 # Β°C β resume normal operation
CHECK_INTERVAL = 1 # seconds between temperature checks
MoE Thermal BenefitΒΆ
MoE sparsity directly reduces thermal load:
- Only ~1.4B of 3.2B parameters are active per token (~44%)
- 56% of the chip's compute units remain idle on each forward pass
- This is roughly equivalent to running a 1.4B dense model in terms of heat generation
- High likelihood of sustained training on a fanless MacBook Air
Context Length StrategyΒΆ
| Scenario | Context | Recommended Config | Total Memory |
|---|---|---|---|
| Chatbot | 4K | Q4 | ~2.5 GB |
| Code generation | 8K | Q4 | ~2.8 GB |
| Document summary | 16K | Q4 | ~3.3 GB |
| PDF analysis | 32K | Q4 + TurboQuant recommended | ~2.7 GB |
| Full 64K | 64K | Q4 + TurboQuant required | ~3.1 GB |
Weight Tying SavingsΒΆ
embed_tokens: (32000, 2048) = 65.5M params
lm_head: (32000, 2048) = 65.5M params β same tensor
βββββββββββββ
Savings: 65.5M params = ~131 MB (FP16) / ~36 MB (Q4)
By tying embed_tokens.weight = lm_head.weight, the model avoids storing a duplicate copy of the embedding matrix. This is particularly valuable at Q4 where every ~36 MB counts toward the 8 GB ceiling.
See alsoΒΆ
- β Architecture Overview
- Quantization Guide β NF4 quantization implementation and QLoRA training
- Training Guide β QLoRA memory usage during training
- TurboQuant Paper β Planned KV cache compression for 64K context
- Inference Guide β KV cache behavior during generation