A Three-Layer Sandwich Architecture for Running a 3.2B LLM on M4 MacBook
Introduction

A MacBook Air M4 has 16GB of unified memory. Train a 3B model with PyTorch and the fan spins up within minutes; on the fanless model, thermal throttling kicks in. Bit-Axon is a 3.2B parameter hybrid language model that solves this constraint at the architecture level. The core idea is a three-layer sandwich structure: 24 layers divided into three segments, each using a different computation paradigm.

    Layers 1-8:   SSM        → context absorption  (O(n), linear)
    Layers 9-16:  SWA + MoE  → deep reasoning      (O(1) attention memory)
    Layers 17-24: Pure MoE   → output synthesis    (sparse compute)

This isn't just an intuitive division. Each segment addresses one of the three fundamental limitations of the Transformer architecture: quadratic complexity, memory explosion, and compute density.

This post covers the mathematical foundations of each layer group, MLX framework optimizations, and thermal-aware training: the complete design for running an LLM on a MacBook.

...
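The 24-layer split can be sketched as a simple index-to-paradigm mapping. This is a minimal illustration under the segment boundaries stated above, not Bit-Axon's actual code; the function and segment names are hypothetical placeholders.

```python
# Minimal sketch of the three-segment "sandwich" layout.
# (Hypothetical helper, not Bit-Axon's real implementation.)

def block_type(layer_idx: int) -> str:
    """Map a 0-based layer index to its segment's computation paradigm."""
    if layer_idx < 8:      # layers 1-8: linear-time SSM mixing
        return "ssm"
    if layer_idx < 16:     # layers 9-16: sliding-window attention + MoE
        return "swa_moe"
    return "moe"           # layers 17-24: sparse expert layers

layout = [block_type(i) for i in range(24)]
print(layout.count("ssm"), layout.count("swa_moe"), layout.count("moe"))
# → 8 8 8
```

Each segment gets an equal third of the depth, so the stack reads bottom-to-top as ingest, reason, emit.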
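To see why the 16GB budget forces architecture-level decisions, a back-of-envelope calculation for the weights alone is useful. The precisions below are illustrative assumptions, not something this excerpt specifies for Bit-Axon.

```python
# Rough weight-memory footprint of a 3.2B-parameter model at common
# precisions (illustrative arithmetic only; activations, KV cache,
# optimizer state, and the OS all consume additional memory).
PARAMS = 3.2e9

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: {gb:.1f} GB for weights alone")
# fp32 weights alone would take 12.8 GB of the 16 GB budget;
# fp16 halves that to 6.4 GB, leaving headroom for everything else.
```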