<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Apple-Silicon on Devlog in the SKY</title>
    <link>https://skyoo2003.github.io/en/tags/apple-silicon/</link>
    <description>Recent content in Apple-Silicon on Devlog in the SKY</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0900</lastBuildDate>
    <atom:link href="https://skyoo2003.github.io/en/tags/apple-silicon/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Three-Layer Sandwich Architecture for Running a 3.2B LLM on an M4 MacBook</title>
      <link>https://skyoo2003.github.io/en/posts/2026/04/19/three-layer-sandwich-llm/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0900</pubDate>
      <guid>https://skyoo2003.github.io/en/posts/2026/04/19/three-layer-sandwich-llm/</guid>
      <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;A MacBook Air M4 has 16GB of unified memory. Train a 3B model with PyTorch on a fan-cooled MacBook and the fan spins up within minutes; on the fanless Air, thermal throttling kicks in instead. &lt;a href=&#34;https://github.com/skyoo2003/bit-axon&#34;&gt;Bit-Axon&lt;/a&gt; is a 3.2B-parameter hybrid language model that addresses this constraint at the architecture level.&lt;/p&gt;
&lt;p&gt;The core idea is a &lt;strong&gt;three-layer sandwich structure&lt;/strong&gt;: 24 layers divided into three segments of eight, each using a different computation paradigm.&lt;/p&gt;



&lt;div class=&#34;goat svg-container &#34;&gt;
  
    &lt;svg
      xmlns=&#34;http://www.w3.org/2000/svg&#34;
      font-family=&#34;Menlo,Lucida Console,monospace&#34;
      
        viewBox=&#34;0 0 736 57&#34;
      &gt;
      &lt;g transform=&#39;translate(8,16)&#39;&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;0&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;L&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;0&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;L&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;0&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;L&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;8&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;8&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;8&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;16&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;y&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;16&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;y&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;16&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;y&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;24&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;24&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;24&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;32&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;32&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;32&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;48&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;1&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;56&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;1&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;56&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;9&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;56&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;7&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;64&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;-&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;64&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;-&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;64&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;-&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;72&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;8&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;72&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;1&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;72&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;2&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;80&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;:&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;80&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;6&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;80&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;4&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;88&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;:&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;88&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;:&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;104&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;104&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;104&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;112&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;112&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;112&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;120&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;120&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;120&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;128&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;128&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;128&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;136&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;136&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;136&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;144&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;144&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;144&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;152&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;152&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;152&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;160&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;160&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;160&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;168&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;168&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;168&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;176&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;176&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;176&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;184&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;184&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;184&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;192&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;192&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;192&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;200&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;200&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;200&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;208&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;208&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;208&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;216&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;216&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;216&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;224&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;224&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;224&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;232&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;232&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;232&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;240&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;240&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;240&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;248&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;248&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;256&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;█&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;256&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;S&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;264&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;S&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;264&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;S&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;272&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;P&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;272&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;W&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;272&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;M&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;280&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;u&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;280&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;A&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;288&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;288&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;+&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;296&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;296&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;+&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;304&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;M&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;312&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;A&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;312&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;M&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;312&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;320&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;x&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;320&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;320&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;E&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;328&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;328&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;E&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;336&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;344&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;-&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;352&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;S&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;360&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;S&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;368&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;M&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;440&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;→&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;440&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;→&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;440&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;→&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;456&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;C&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;456&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;D&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;456&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;O&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;464&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;464&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;464&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;u&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;472&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;472&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;472&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;480&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;480&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;p&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;480&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;p&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;488&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;488&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;u&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;496&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;x&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;496&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;496&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;504&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;504&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;512&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;512&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;520&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;520&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;520&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;y&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;528&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;b&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;528&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;528&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;536&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;536&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;536&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;544&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;544&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;i&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;544&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;h&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;552&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;552&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;552&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;560&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;p&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;560&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;g&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;560&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;568&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;568&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;i&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;576&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;i&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;576&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;(&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;576&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;584&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;584&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;O&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;592&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;592&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;(&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;592&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;(&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;600&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;600&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;l&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;608&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;(&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;608&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;)&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;608&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;i&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;616&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;O&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;616&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;624&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;(&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;624&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;624&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;632&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;1&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;632&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;632&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;640&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;)&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;640&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;640&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;648&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;656&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;m&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;656&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;656&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;+&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;664&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;664&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;t&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;672&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;m&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;672&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;i&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;672&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;680&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;680&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;o&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;680&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;p&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;688&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;688&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;n&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;688&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;a&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;696&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;y&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;696&#39; y=&#39;20&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;)&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;696&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;r&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;704&#39; y=&#39;4&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;)&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;704&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;s&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;712&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;e&lt;/text&gt;
&lt;text text-anchor=&#39;middle&#39; x=&#39;720&#39; y=&#39;36&#39; fill=&#39;currentColor&#39; style=&#39;font-size:1em&#39;&gt;)&lt;/text&gt;
&lt;/g&gt;

    &lt;/svg&gt;
  
&lt;/div&gt;
&lt;p&gt;This division is not just intuition. Each segment addresses one of three fundamental limitations of the Transformer architecture: &lt;strong&gt;quadratic complexity, memory explosion, and compute density&lt;/strong&gt;. This post covers the mathematical foundations of each layer group, MLX framework optimizations, and thermal-aware training, which together form the complete design for running an LLM on a MacBook.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>A MacBook Air M4 has 16GB of unified memory. Train a 3B model with PyTorch on a fan-cooled MacBook and the fan spins up within minutes; on the fanless Air, thermal throttling kicks in instead. <a href="https://github.com/skyoo2003/bit-axon">Bit-Axon</a> is a 3.2B-parameter hybrid language model that addresses this constraint at the architecture level.</p>
<p>The core idea is a <strong>three-layer sandwich structure</strong>: 24 layers divided into three segments of eight layers, each segment using a different computation paradigm.</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 736 57"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>L</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>L</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>L</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='32' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>7</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='88' y='20' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='104' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='104' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='112' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='120' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='128' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='136' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='144' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='152' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='160' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='160' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='168' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='168' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='176' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='176' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='184' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='192' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='200' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='208' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='216' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='216' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='224' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='224' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='232' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='232' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='240' y='36' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='248' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='248' y='20' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'>█</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='264' y='20' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='264' y='36' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='272' y='20' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='272' y='36' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='280' y='20' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='288' y='36' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='296' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='296' y='20' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='304' y='36' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='312' y='4' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='312' y='20' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='312' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='320' y='4' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='320' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='320' y='36' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='328' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='328' y='20' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='336' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='344' y='4' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='352' y='4' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='360' y='4' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='368' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='440' y='4' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='440' y='20' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='440' y='36' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='456' y='4' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='456' y='20' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='456' y='36' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='464' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='464' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='464' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='472' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='472' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='472' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='480' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='480' y='20' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='480' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='488' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='488' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='496' y='4' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='496' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='496' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='504' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='504' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='512' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='512' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='520' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='520' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='520' y='36' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='528' y='4' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='528' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='528' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='536' y='4' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='536' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='536' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='544' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='544' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='544' y='36' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='552' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='552' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='552' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='560' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='560' y='20' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='560' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='568' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='568' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='576' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='576' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='576' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='584' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='584' y='20' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='592' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='592' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='592' y='36' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='600' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='600' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='608' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='608' y='20' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='608' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='616' y='4' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='616' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='624' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='624' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='624' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='632' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='632' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='632' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='640' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='640' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='640' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='648' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='656' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='656' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='656' y='36' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='664' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='664' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='672' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='672' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='672' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='680' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='680' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='680' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='688' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='688' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='688' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='696' y='4' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='696' y='20' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='696' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='704' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='704' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='712' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='720' y='36' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p>The split is not arbitrary. Each segment targets one of three fundamental limitations of the Transformer architecture: <strong>quadratic complexity, memory explosion, and compute density</strong>. This post covers the mathematical foundations of each layer group, the MLX framework optimizations, and thermal-aware training, which together make up the complete design for running an LLM on a MacBook.</p>
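<p>The layer layout in the diagram above can be written down as a small schedule. This is only a sketch of the mapping from layer index to block type: the names <code>BlockKind</code>, <code>SEGMENTS</code>, and <code>block_kind</code> are hypothetical and are not taken from the Bit-Axon code base.</p>

```python
from enum import Enum

NUM_LAYERS = 24                   # total depth stated in the post
SEGMENT_SIZE = NUM_LAYERS // 3    # three equal segments of 8 layers


class BlockKind(Enum):
    """Block types as labelled in the diagram (enum names are hypothetical)."""
    AXON_SSM = "Pure Axon-SSM"    # layers 1-8
    SWA_MOE = "SWA + MoE"         # layers 9-16
    SSM_MOE = "SSM + MoE"         # layers 17-24


SEGMENTS = (BlockKind.AXON_SSM, BlockKind.SWA_MOE, BlockKind.SSM_MOE)


def block_kind(layer: int) -> BlockKind:
    """Return the block type for a 1-based layer index."""
    if not 1 <= layer <= NUM_LAYERS:
        raise ValueError(f"layer must be in 1..{NUM_LAYERS}, got {layer}")
    return SEGMENTS[(layer - 1) // SEGMENT_SIZE]


# Full 24-entry schedule, one block type per layer.
schedule = [block_kind(i) for i in range(1, NUM_LAYERS + 1)]

# Print the layer range covered by each segment.
for kind in SEGMENTS:
    layers = [i for i, k in enumerate(schedule, start=1) if k is kind]
    print(f"Layer {layers[0]}-{layers[-1]}: {kind.value}")
```

<p>Running it prints the same three ranges as the diagram: layers 1-8, 9-16, and 17-24.</p>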
<h2 id="why-mlx-over-pytorch">Why MLX Over PyTorch?</h2>
<p>The reason for choosing MLX on Apple Silicon is straightforward: it is the framework designed from the start to <strong>properly leverage unified memory</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>PyTorch (MPS)</th>
          <th>MLX</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Memory transfer</td>
          <td>Explicit CPU ↔ GPU copies required</td>
          <td>Unified memory zero-copy</td>
      </tr>
      <tr>
          <td>Compilation</td>
          <td><code>torch.compile</code> (experimental on MPS)</td>
          <td><code>@mx.compile</code> (stable)</td>
      </tr>
      <tr>
          <td>Apple Silicon optimization</td>
          <td>General-purpose backend</td>
          <td>Native optimization</td>
      </tr>
      <tr>
          <td>SwiftUI integration</td>
          <td>Requires model conversion (e.g. to Core ML)</td>
          <td>Native app support via MLX Swift</td>
      </tr>
  </tbody>
</table>
<p>PyTorch&rsquo;s MPS backend supports the Apple Silicon GPU, but tensors still have to be moved explicitly between the CPU and the <code>mps</code> device, and each move is a copy. On a MacBook Air with 16GB of unified memory this overhead matters: every transfer consumes memory bandwidth, briefly holds two copies of the same data, and adds inference latency.</p>
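<p>To put rough numbers on that pressure, here is a back-of-the-envelope sketch using only the two figures from this post (3.2B parameters, 16GB of memory). The precisions are illustrative and not taken from Bit-Axon, and the "duplicated" column assumes the worst case in which a full copy of the weights is alive on both sides of a transfer at once. The helper names are hypothetical.</p>

```python
PARAMS = 3.2e9            # parameter count stated in the post
UNIFIED_MEMORY_GB = 16    # MacBook Air M4 configuration from the post


def weights_gb(params: float, bits_per_param: int) -> float:
    """Size of the weights in decimal gigabytes at a given precision."""
    return params * bits_per_param / 8 / 1e9


def footprint_report(bits_per_param: int) -> str:
    """One-line summary of single vs. duplicated weight footprint."""
    single = weights_gb(PARAMS, bits_per_param)
    duplicated = 2 * single   # assumption: source and destination both live
    share = duplicated / UNIFIED_MEMORY_GB
    return (f"{bits_per_param:>2}-bit: {single:.1f} GB once, "
            f"{duplicated:.1f} GB if duplicated ({share:.0%} of memory)")


for bits in (16, 8, 4):       # illustrative precisions, not from the post
    print(footprint_report(bits))
```

<p>At 16-bit precision the weights alone are 6.4 GB, so a worst-case duplicate takes 12.8 GB, or 80% of the machine&rsquo;s memory, before activations or the operating system are counted.</p>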
<p>MLX, on the other hand, is designed directly for Apple&rsquo;s unified memory architecture. Because the CPU and GPU share the same physical memory, arrays do not need to be moved between devices. The <code>@mx.compile</code> decorator traces a function and fuses its operations into fewer GPU kernels, which cuts kernel-launch overhead and memory traffic and in practice tends to outperform PyTorch&rsquo;s MPS backend.</p>
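<p>A minimal sketch of both points, assuming MLX is installed (<code>pip install mlx</code>) and the code runs on Apple Silicon. GELU is used here only as a stand-in element-wise kernel; it is not claimed to be part of Bit-Axon, and the function names <code>gelu_reference</code> and <code>run_mlx_demo</code> are hypothetical.</p>

```python
import math


def gelu_reference(x: float) -> float:
    """Exact GELU on a Python float, handy for sanity-checking the MLX version."""
    return x * (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0


def run_mlx_demo():
    """Run one op on CPU and GPU over the same arrays, then a compiled kernel."""
    import mlx.core as mx  # assumption: MLX is installed on an Apple Silicon Mac

    @mx.compile
    def gelu(x):
        # Several element-wise ops that the compiler can fuse together.
        return x * (1 + mx.erf(x / math.sqrt(2))) / 2

    a = mx.random.normal((4, 1024))
    b = mx.random.normal((4, 1024))

    # The same arrays are used on both devices; there is no explicit transfer.
    on_cpu = mx.add(a, b, stream=mx.cpu)
    on_gpu = mx.add(a, b, stream=mx.gpu)

    out = gelu(on_gpu)
    mx.eval(on_cpu, out)  # MLX is lazy; force the computation here
    return out
```

<p>On a Mac, calling <code>run_mlx_demo()</code> returns an array of shape <code>(4, 1024)</code>. The device is chosen per operation through the <code>stream</code> argument rather than per array, which is what removes the need for copies.</p>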



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 280 153"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>┌</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='0' y='68' fill='currentColor' style='font-size:1em'>└</text>
<text text-anchor='middle' x='0' y='84' fill='currentColor' style='font-size:1em'>┌</text>
<text text-anchor='middle' x='0' y='100' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='0' y='116' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='0' y='132' fill='currentColor' style='font-size:1em'>└</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='8' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='8' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='8' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='16' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='16' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='16' y='116' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='16' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='24' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='24' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='24' y='100' fill='currentColor' style='font-size:1em'>G</text>
<text text-anchor='middle' x='24' y='116' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='32' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='32' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='32' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='32' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='32' y='100' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='32' y='116' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='32' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='40' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='40' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='40' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='40' y='100' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='40' y='116' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='40' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='48' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='48' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='48' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='48' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='48' y='116' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='48' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='56' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='56' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='56' y='116' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='56' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='64' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='64' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='64' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='72' y='68' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='72' y='84' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='72' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>┐</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='80' y='68' fill='currentColor' style='font-size:1em'>┘</text>
<text text-anchor='middle' x='80' y='84' fill='currentColor' style='font-size:1em'>┐</text>
<text text-anchor='middle' x='80' y='100' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='80' y='116' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='80' y='132' fill='currentColor' style='font-size:1em'>┘</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>←</text>
<text text-anchor='middle' x='96' y='100' fill='currentColor' style='font-size:1em'>←</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='112' y='100' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='120' y='100' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='128' y='100' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='136' y='100' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='152' y='100' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='176' y='100' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='184' y='20' fill='currentColor' style='font-size:1em'>┌</text>
<text text-anchor='middle' x='184' y='52' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='184' y='68' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='184' y='84' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='184' y='116' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='184' y='132' fill='currentColor' style='font-size:1em'>└</text>
<text text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'>L</text>
<text text-anchor='middle' x='192' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='192' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>X</text>
<text text-anchor='middle' x='200' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='200' y='68' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='200' y='84' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='200' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='208' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='208' y='68' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='208' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='208' y='116' fill='currentColor' style='font-size:1em'>G</text>
<text text-anchor='middle' x='208' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='216' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='216' y='36' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='216' y='68' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='216' y='84' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='216' y='116' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='216' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='224' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='224' y='68' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='224' y='116' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='224' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='232' y='68' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='232' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='240' y='68' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='240' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='248' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='248' y='68' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='248' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='256' y='20' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='256' y='100' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='256' y='132' fill='currentColor' style='font-size:1em'>─</text>
<text text-anchor='middle' x='264' y='20' fill='currentColor' style='font-size:1em'>┐</text>
<text text-anchor='middle' x='264' y='52' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='264' y='68' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='264' y='116' fill='currentColor' style='font-size:1em'>│</text>
<text text-anchor='middle' x='264' y='132' fill='currentColor' style='font-size:1em'>┘</text>
</g>

    </svg>
  
</div>
<p>This difference is dramatic with a 4-bit quantized 3.2B model. With PyTorch, weights are typically staged in CPU memory and then copied to the GPU device, so peak memory at load time can approach double the model size. MLX allocates once in unified memory and is done.</p>
<h2 id="three-layer-architecture-design-philosophy">Three-Layer Architecture: Design Philosophy</h2>
<p>To understand the sandwich architecture, we first need to understand <strong>why the layers are divided this way</strong>.</p>
<p>The Transformer&rsquo;s core problem is attention&rsquo;s O(n²) complexity. When sequence length grows 16x, from 4K to 64K, attention compute increases by 256x. State Space Models (SSMs) avoid this with O(n) complexity, but they compress the history into a fixed-size state and so struggle with the precise token-to-token dependencies that attention handles.</p>
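<p>The 256x figure can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope illustration, not a measurement of Bit-Axon: the helper name <code>attention_pairs</code> is hypothetical, and the memory estimate assumes one head&rsquo;s score matrix is fully materialized in fp16, which fused attention kernels avoid.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def attention_pairs(seq_len):
    """Number of query-key pairs that full self-attention scores for one head."""
    return seq_len * seq_len

short, long = 4 * 1024, 64 * 1024
ratio = attention_pairs(long) // attention_pairs(short)
print(ratio)  # 256: 16x more tokens means 16**2 more pairs

# Assumption for illustration: the score matrix is stored naively in fp16 (2 bytes each).
bytes_per_score = 2
naive_gib = attention_pairs(long) * bytes_per_score / 2**30
print(naive_gib)  # 8.0 GiB for a single head if fully materialized
</code></pre></div>
<p>Even as a loose upper bound on one buffer, the quadratic term is what makes long contexts expensive on a 16GB machine.</p>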
<p>Bit-Axon&rsquo;s approach is to <strong>hierarchically combine the strengths of both paradigms</strong>:</p>
<ul>
<li><strong>Context absorption (SSM)</strong>: Linear complexity is essential when ingesting 64K tokens. Full attention over 64K tokens in every layer is impractical within 16GB of memory.</li>
<li><strong>Deep reasoning (SWA + MoE)</strong>: Semantic relationships, causal inference, and complex pattern matching require attention, but only over a local window — not the entire sequence.</li>
<li><strong>Output synthesis (SSM + MoE)</strong>: During final token generation, the reasoning is already complete and we&rsquo;re synthesizing representations. SSM&rsquo;s linear compute is sufficient. MoE selectively applies expert knowledge to boost quality.</li>
</ul>
<p>This design follows a principle of <strong>assigning minimum complexity to each layer group</strong>. Attention only where it&rsquo;s needed; SSM everywhere else.</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>@staticmethod
</span></span><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">_get_layer_type</span>(layer_idx: <span style="color:#8be9fd;font-style:italic">int</span>, total_layers: <span style="color:#8be9fd;font-style:italic">int</span>) <span style="color:#ff79c6">-&gt;</span> <span style="color:#8be9fd;font-style:italic">str</span>:
</span></span><span style="display:flex;"><span>    third <span style="color:#ff79c6">=</span> total_layers <span style="color:#ff79c6">//</span> <span style="color:#bd93f9">3</span>  <span style="color:#6272a4"># 8 layers each</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">if</span> layer_idx <span style="color:#ff79c6">&lt;</span> third:           <span style="color:#6272a4"># Layers 0-7: Pure SSM</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> <span style="color:#f1fa8c">&#34;ssm&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">elif</span> layer_idx <span style="color:#ff79c6">&lt;</span> <span style="color:#bd93f9">2</span> <span style="color:#ff79c6">*</span> third:     <span style="color:#6272a4"># Layers 8-15: SWA + MoE</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> <span style="color:#f1fa8c">&#34;swa_moe&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">else</span>:                           <span style="color:#6272a4"># Layers 16-23: SSM + MoE</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> <span style="color:#f1fa8c">&#34;ssm_moe&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="layers-1-8-pure-axon-ssm-context-absorption">Layers 1-8: Pure Axon-SSM (Context Absorption)</h2>
<p>The first 8 layers are pure Mamba-style <strong>State Space Models (SSM)</strong>. No attention means no KV cache: each layer carries a fixed-size recurrent state, so inference memory stays constant, O(1), in sequence length. This is why 64K context is possible.</p>
<h3 id="mathematical-foundations-of-ssm">Mathematical Foundations of SSM</h3>
<p>SSMs start from continuous-time state space models:</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 352 41"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>'</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='24' y='20' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='48' y='20' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='96' y='20' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='112' y='20' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'>B</text>
<text text-anchor='middle' x='128' y='20' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='136' y='20' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='144' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='152' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='160' y='20' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='216' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='224' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='232' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='248' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='256' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='264' y='4' fill='currentColor' style='font-size:1em'>q</text>
<text text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='272' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='280' y='20' fill='currentColor' style='font-size:1em'>q</text>
<text text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='288' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='296' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='296' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='304' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='304' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='312' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='312' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='320' y='4' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='320' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='328' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='336' y='20' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p>Where <code>x(t)</code> is the input, <code>h(t)</code> is the state vector, <code>y(t)</code> is the output, and <code>A/B/C/D</code> are learnable parameter matrices. Discretizing the continuous model:</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 184 41"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='32' y='4' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='32' y='20' fill='currentColor' style='font-size:1em'>=</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>Ā</text>
<text text-anchor='middle' x='48' y='20' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>{</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='88' y='20' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>}</text>
<text text-anchor='middle' x='104' y='20' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='112' y='20' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>+</text>
<text text-anchor='middle' x='120' y='20' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='128' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>B̄</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>t</text>
</g>

    </svg>
  
</div>
<p>Discretization uses the <strong>Zero-Order Hold (ZOH)</strong> method, with the step size <code>dt</code> produced by learnable projections. Letting <code>dt</code> take a different value for each token is the idea behind Mamba&rsquo;s selectivity: the state update speed adjusts based on the input.</p>
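<p>To make the role of the step size concrete, here is a minimal sketch of ZOH for a single diagonal entry, using the standard formulas <code>Ā = exp(dt·A)</code> and <code>B̄ = (exp(dt·A) - 1) / A · B</code>. The function name <code>zoh_discretize</code> and the numeric values are illustrative assumptions, not taken from the Bit-Axon source.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

def zoh_discretize(a, b, dt):
    """Zero-order hold for one diagonal entry: returns (a_bar, b_bar)."""
    a_bar = math.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

a, b = -2.0, 1.0  # hypothetical values; a must be negative for a decaying state
for dt in (0.01, 0.1, 1.0):
    a_bar, b_bar = zoh_discretize(a, b, dt)
    print(dt, round(a_bar, 4), round(b_bar, 4))
</code></pre></div>
<p>A small <code>dt</code> keeps <code>Ā</code> close to 1, so the previous state is mostly preserved and the current input is barely written. A large <code>dt</code> pushes <code>Ā</code> toward 0, so the state is largely overwritten by the current input. For small steps <code>B̄</code> is close to <code>dt·B</code>, which is why some implementations use that simpler form.</p>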
<h3 id="axonssm-implementation-details">AxonSSM Implementation Details</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">AxonSSM</span>(nn<span style="color:#ff79c6">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __init__(self, config: BitAxonConfig):
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>in_proj <span style="color:#ff79c6">=</span> nn<span style="color:#ff79c6">.</span>Linear(D, <span style="color:#bd93f9">2</span> <span style="color:#ff79c6">*</span> E, bias<span style="color:#ff79c6">=</span><span style="color:#ff79c6">False</span>)          <span style="color:#6272a4"># Split input into x and z branches</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>conv1d <span style="color:#ff79c6">=</span> nn<span style="color:#ff79c6">.</span>Conv1d(E, E, kernel_size<span style="color:#ff79c6">=</span>d_conv, groups<span style="color:#ff79c6">=</span>E)  <span style="color:#6272a4"># Depthwise causal convolution</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>x_proj <span style="color:#ff79c6">=</span> nn<span style="color:#ff79c6">.</span>Linear(E, d_state <span style="color:#ff79c6">*</span> <span style="color:#bd93f9">2</span> <span style="color:#ff79c6">+</span> <span style="color:#bd93f9">1</span>, bias<span style="color:#ff79c6">=</span><span style="color:#ff79c6">False</span>)   <span style="color:#6272a4"># Project to B, C, dt params</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>dt_proj <span style="color:#ff79c6">=</span> nn<span style="color:#ff79c6">.</span>Linear(<span style="color:#bd93f9">1</span>, E, bias<span style="color:#ff79c6">=</span><span style="color:#ff79c6">True</span>)                <span style="color:#6272a4"># Per-channel step sizes</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>out_proj <span style="color:#ff79c6">=</span> nn<span style="color:#ff79c6">.</span>Linear(E, D, bias<span style="color:#ff79c6">=</span><span style="color:#ff79c6">False</span>)              <span style="color:#6272a4"># Output projection</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>A_log <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>log(mx<span style="color:#ff79c6">.</span>arange(<span style="color:#bd93f9">1</span>, d_state <span style="color:#ff79c6">+</span> <span style="color:#bd93f9">1</span>))            <span style="color:#6272a4"># Diagonal SSM state matrix</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>D <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>ones((E,))                                    <span style="color:#6272a4"># Skip connection parameter</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Key design decisions:</p>
<ul>
<li><strong><code>A_log</code> initialization</strong>: Initialized as <code>log(1), log(2), ..., log(d_state)</code> so that <code>A = -exp(A_log)</code> is a negative diagonal matrix. This guarantees the state decays exponentially over time, providing numerical stability.</li>
<li><strong>Causal convolution (<code>conv1d</code>)</strong>: A 1D convolution with kernel size 4 extracts local context first. This aligns with the intuition: &ldquo;look at the pattern of the last 4 tokens first, then reflect that in the SSM state.&rdquo;</li>
<li><strong>Gating</strong>: The <code>z</code> branch controls information flow with SiLU activation. In the form <code>y = SiLU(z) * SSM(x)</code>, the SSM output is selectively weighted.</li>
</ul>
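<p>The first and third decisions can be sketched in plain Python. This is an illustration under assumptions, not the Bit-Axon code: <code>d_state = 4</code>, <code>dt = 0.1</code>, and the helper names <code>silu</code> and <code>gated_output</code> are made up for the example.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

d_state = 4  # hypothetical small state size for illustration
A_log = [math.log(n) for n in range(1, d_state + 1)]
A = [-math.exp(v) for v in A_log]       # roughly -1, -2, -3, -4

dt = 0.1  # hypothetical step size
decay = [math.exp(dt * a) for a in A]   # per-step multiplier on each state entry

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def gated_output(ssm_out, z):
    """y = SiLU(z) * SSM(x) for one channel."""
    return silu(z) * ssm_out

print(decay)
print(gated_output(2.0, 0.0), gated_output(2.0, 3.0))
</code></pre></div>
<p>Because every entry of <code>A</code> is negative, every decay factor lies strictly between 0 and 1 for any positive <code>dt</code>, so the state cannot blow up. Entries with larger magnitude forget faster, which gives the state a spread of time scales.</p>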
<h3 id="parallel-scan-algorithm">Parallel Scan Algorithm</h3>
<p>The recurrence <code>h_t = Āh_{t-1} + B̄x_t</code> costs O(n), but each step depends on the previous one, so at first glance it cannot be parallelized. Mamba gets around this with an <strong>associative scan</strong>: the per-step updates compose associatively, so they can be combined in parallel.</p>
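<p>The idea behind the scan can be shown with scalars. Each step is an affine map of the state, and composing two such maps gives another affine map, so prefixes can be combined in a logarithmic number of rounds. The sketch below uses a Hillis-Steele style scan for clarity; the function names are hypothetical and this is not the kernel Bit-Axon or Mamba ships.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def combine(left, right):
    """Compose two affine updates (h maps to a*h + b), applying left first."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def sequential_scan(a, bx):
    """Reference: h_t = a_t * h_(t-1) + bx_t with h starting at 0."""
    h, out = 0.0, []
    for a_t, bx_t in zip(a, bx):
        h = a_t * h + bx_t
        out.append(h)
    return out

def associative_scan(a, bx):
    """Inclusive scan in ceil(log2(n)) rounds; each round's inner loop is independent."""
    elems = list(zip(a, bx))
    n = len(elems)
    offset = 1
    for _ in range(max(n - 1, 0).bit_length()):
        nxt = list(elems)
        for i in range(offset, n):  # every i could run in parallel
            nxt[i] = combine(elems[i - offset], elems[i])
        elems = nxt
        offset *= 2
    return [b for _, b in elems]

a = [0.9, 0.5, 0.8, 0.7, 0.6]
bx = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential_scan(a, bx))
print(associative_scan(a, bx))
</code></pre></div>
<p>Both functions produce the same states up to floating-point rounding. The scan does more total work than the sequential loop, but the work within each round is independent, which is what a GPU needs.</p>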
<p>Bit-Axon implements this in chunks:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">_ssm_scan_parallel</span>(self, x, dt, B_in, C_in):
</span></span><span style="display:flex;"><span>    step <span style="color:#ff79c6">=</span> config<span style="color:#ff79c6">.</span>ssm_scan_step  <span style="color:#6272a4"># default 64</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">for</span> j <span style="color:#ff79c6">in</span> <span style="color:#8be9fd;font-style:italic">range</span>(d_state):     <span style="color:#6272a4"># Each state dimension processed independently</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">for</span> i <span style="color:#ff79c6">in</span> <span style="color:#8be9fd;font-style:italic">range</span>(<span style="color:#bd93f9">0</span>, L, step):
</span></span><span style="display:flex;"><span>            S <span style="color:#ff79c6">=</span> <span style="color:#8be9fd;font-style:italic">min</span>(step, L <span style="color:#ff79c6">-</span> i)
</span></span><span style="display:flex;"><span>            dtA_chunk <span style="color:#ff79c6">=</span> dtA[:, i : i <span style="color:#ff79c6">+</span> S, :]
</span></span><span style="display:flex;"><span>            dtx_chunk <span style="color:#ff79c6">=</span> dtx[:, i : i <span style="color:#ff79c6">+</span> S, :]
</span></span><span style="display:flex;"><span>            B_chunk <span style="color:#ff79c6">=</span> B_in[:, i : i <span style="color:#ff79c6">+</span> S, j]
</span></span><span style="display:flex;"><span>            C_chunk <span style="color:#ff79c6">=</span> C_in[:, i : i <span style="color:#ff79c6">+</span> S, j]
</span></span></code></pre></td></tr></table>
</div>
</div><p>A chunk size of 64 is chosen to suit the Apple Silicon GPU&rsquo;s SIMD-group width (Metal&rsquo;s counterpart to a CUDA warp) and its memory-compute balance. Too small and kernel launch overhead dominates; too large and memory usage increases.</p>
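<p>Chunking does not change the result as long as the final state of one chunk is carried into the next. The following sketch shows only that bookkeeping, with the same <code>min(step, L - i)</code> boundary handling as the excerpt. The within-chunk work is done sequentially here for readability, and the names <code>scan_chunk</code> and <code>chunked_scan</code> are hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def scan_chunk(a, bx, h0):
    """Run h_t = a_t * h_(t-1) + bx_t over one chunk, starting from state h0."""
    h, out = h0, []
    for a_t, bx_t in zip(a, bx):
        h = a_t * h + bx_t
        out.append(h)
    return out, h

def chunked_scan(a, bx, step=64):
    """Process the sequence in chunks of `step`, carrying the state across chunks."""
    L = len(a)
    h, out = 0.0, []
    for i in range(0, L, step):
        S = min(step, L - i)  # the last chunk may be shorter
        ys, h = scan_chunk(a[i:i + S], bx[i:i + S], h)
        out.extend(ys)
    return out

a = [0.9] * 150
bx = [float(t % 7) for t in range(150)]
same = chunked_scan(a, bx, step=64) == chunked_scan(a, bx, step=150)
print(same)  # True: chunks of 64, 64 and 22 match a single pass
</code></pre></div>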
<h3 id="segment-sum-optimization">Segment Sum Optimization</h3>
<p>The core operation of the parallel scan, the segment sum (<code>segsum</code>), is compiled natively for MLX:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">segsum</span>(x: mx<span style="color:#ff79c6">.</span>array) <span style="color:#ff79c6">-&gt;</span> mx<span style="color:#ff79c6">.</span>array:
</span></span><span style="display:flex;"><span>    <span style="color:#f1fa8c">&#34;&#34;&#34;Parallel segment sum for hardware-efficient computation&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    seq_len <span style="color:#ff79c6">=</span> x<span style="color:#ff79c6">.</span>shape[<span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>]
</span></span><span style="display:flex;"><span>    cs <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>cumsum(x, axis<span style="color:#ff79c6">=-</span><span style="color:#bd93f9">1</span>)
</span></span><span style="display:flex;"><span>    diff <span style="color:#ff79c6">=</span> cs[<span style="color:#ff79c6">...</span>, :, <span style="color:#ff79c6">None</span>] <span style="color:#ff79c6">-</span> cs[<span style="color:#ff79c6">...</span>, <span style="color:#ff79c6">None</span>, :]
</span></span><span style="display:flex;"><span>    mask <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>tril(mx<span style="color:#ff79c6">.</span>ones((seq_len, seq_len), dtype<span style="color:#ff79c6">=</span>diff<span style="color:#ff79c6">.</span>dtype), <span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">return</span> diff <span style="color:#ff79c6">*</span> mask
</span></span></code></pre></td></tr></table>
</div>
</div><p>This operation is compiled with <code>@mx.compile</code> and runs natively on the Apple Silicon GPU.</p>
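<p>To make the indexing concrete, here is a minimal pure-Python sketch of the same computation on a 1-D input. The name <code>segsum_reference</code> is hypothetical and exists only for illustration. It mirrors the cumulative-sum difference and the strictly lower-triangular mask above, without any MLX dependency.</p>

```python
from itertools import accumulate

def segsum_reference(x):
    """out[i][j] = x[j+1] + ... + x[i] when i > j, otherwise 0."""
    cs = list(accumulate(x))  # running (cumulative) sum
    n = len(x)
    # cs[i] - cs[j] is the sum of the segment after j up to and including i;
    # the condition i > j plays the role of the strictly lower-triangular mask.
    return [[cs[i] - cs[j] if i > j else 0 for j in range(n)] for i in range(n)]

result = segsum_reference([1, 2, 3])
# rows of result: [0, 0, 0], [2, 0, 0], [5, 3, 0]
```

<p>Every pairwise segment sum comes out of one cumulative sum and one subtraction, which is what lets the real version run as a single parallel tensor operation.</p>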
<h2 id="layers-9-16-swa--moe-deep-reasoning">Layers 9-16: SWA + MoE (Deep Reasoning)</h2>
<p>The middle 8 layers combine <strong>Sliding Window Attention (SWA)</strong> with <strong>Mixture of Experts (MoE)</strong>. This segment handles the model&rsquo;s <strong>reasoning capability</strong>.</p>
<h3 id="sliding-window-attention">Sliding Window Attention</h3>
<p>Standard attention computes dot products for all token pairs, giving O(n²) complexity. SWA restricts each token to at most <code>window_size</code> recent tokens, reducing this to O(n × window_size).</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">_make_sliding_window_mask</span>(self, seq_len: <span style="color:#8be9fd;font-style:italic">int</span>, kv_len: <span style="color:#8be9fd;font-style:italic">int</span>, q_offset: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">0</span>):
</span></span><span style="display:flex;"><span>    q_pos <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>arange(q_offset, q_offset <span style="color:#ff79c6">+</span> seq_len)
</span></span><span style="display:flex;"><span>    k_pos <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>arange(kv_len)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Causal constraint: tokens can&#39;t see the future (causal_offset is not shown in this excerpt)</span>
</span></span><span style="display:flex;"><span>    causal_mask <span style="color:#ff79c6">=</span> k_pos[<span style="color:#ff79c6">None</span>, :] <span style="color:#ff79c6">&lt;=</span> (q_pos[:, <span style="color:#ff79c6">None</span>] <span style="color:#ff79c6">+</span> causal_offset)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Window constraint: limited attention range</span>
</span></span><span style="display:flex;"><span>    window_mask <span style="color:#ff79c6">=</span> (q_pos[:, <span style="color:#ff79c6">None</span>] <span style="color:#ff79c6">+</span> causal_offset) <span style="color:#ff79c6">-</span> k_pos[<span style="color:#ff79c6">None</span>, :] <span style="color:#ff79c6">&lt;</span> self<span style="color:#ff79c6">.</span>window_size
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Combined: 0 inside the causal window, -inf everywhere else</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">return</span> mx<span style="color:#ff79c6">.</span>where(causal_mask <span style="color:#ff79c6">&amp;</span> window_mask, <span style="color:#bd93f9">0.0</span>, <span style="color:#ff79c6">-</span>mx<span style="color:#ff79c6">.</span>inf)
</span></span></code></pre></td></tr></table>
</div>
</div><p>The window size of 4096 is a deliberate choice. The design assumes that most natural-language dependencies resolve within 4K tokens and that longer-range dependencies have already been absorbed by the SSM layers (1-8). SWA thus performs <strong>local refinement on top of the context that SSM has absorbed</strong>.</p>
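<p>The saving is easy to see on a toy sequence. The sketch below is a pure-Python illustration, not the model code: <code>sliding_window_allowed</code> is a hypothetical helper, and it assumes no cache offset, so query and key positions line up. It applies the same two conditions as the mask above and counts how many query-key pairs are actually scored.</p>

```python
def sliding_window_allowed(seq_len: int, window_size: int) -> list[list[bool]]:
    """allowed[q][k] is True when query position q may attend to key position k."""
    return [
        # causal: key is not in the future; window: key is recent enough
        [q >= k and window_size > q - k for k in range(seq_len)]
        for q in range(seq_len)
    ]

allowed = sliding_window_allowed(seq_len=5, window_size=3)
attended_pairs = sum(sum(row) for row in allowed)  # pairs actually scored
causal_pairs = 5 * (5 + 1) // 2                    # pairs under full causal attention
```

<p>With five tokens and a window of three, 12 pairs are scored instead of 15. The gap widens as the sequence grows, because each additional token adds at most <code>window_size</code> pairs.</p>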
<h3 id="kv-cache-trimming">KV Cache Trimming</h3>
<p>The core memory optimization of SWA is KV cache trimming:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">KVCache</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __init__(self, window_size: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">|</span> <span style="color:#ff79c6">None</span> <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">None</span>):
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>window_size <span style="color:#ff79c6">=</span> window_size
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>k <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">None</span>  <span style="color:#6272a4"># Cached keys, sequence on axis 2</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>v <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">None</span>  <span style="color:#6272a4"># Cached values, sequence on axis 2</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> <span style="color:#50fa7b">update_and_fetch</span>(self, xk: mx<span style="color:#ff79c6">.</span>array, xv: mx<span style="color:#ff79c6">.</span>array) <span style="color:#ff79c6">-&gt;</span> <span style="color:#8be9fd;font-style:italic">tuple</span>[mx<span style="color:#ff79c6">.</span>array, mx<span style="color:#ff79c6">.</span>array]:
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">if</span> self<span style="color:#ff79c6">.</span>k <span style="color:#ff79c6">is</span> <span style="color:#ff79c6">None</span>:  <span style="color:#6272a4"># First call: nothing cached yet</span>
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>k, self<span style="color:#ff79c6">.</span>v <span style="color:#ff79c6">=</span> xk, xv
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">else</span>:
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>k <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>concatenate([self<span style="color:#ff79c6">.</span>k, xk], axis<span style="color:#ff79c6">=</span><span style="color:#bd93f9">2</span>)
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>v <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>concatenate([self<span style="color:#ff79c6">.</span>v, xv], axis<span style="color:#ff79c6">=</span><span style="color:#bd93f9">2</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">if</span> self<span style="color:#ff79c6">.</span>window_size <span style="color:#ff79c6">is</span> <span style="color:#ff79c6">not</span> <span style="color:#ff79c6">None</span>:
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>k <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>k[:, :, <span style="color:#ff79c6">-</span>self<span style="color:#ff79c6">.</span>window_size:]  <span style="color:#6272a4"># Trim to window</span>
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>v <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>v[:, :, <span style="color:#ff79c6">-</span>self<span style="color:#ff79c6">.</span>window_size:]
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> self<span style="color:#ff79c6">.</span>k, self<span style="color:#ff79c6">.</span>v
</span></span></code></pre></td></tr></table>
</div>
</div><p>This is why KV cache memory stays bounded at O(window_size) even when processing 64K sequences. Entries that fall outside the window are discarded, and because SWA never attends to them, nothing the attention would have used is lost.</p>
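<p>The bounded-growth behaviour can be reproduced with a toy cache that stores token ids instead of key/value tensors. This is only a sketch: <code>WindowedCache</code> is a hypothetical stand-in, and <code>collections.deque</code> with <code>maxlen</code> replaces the explicit slice used above.</p>

```python
from collections import deque

class WindowedCache:
    """Keeps only the most recent window_size entries (stand-in for K/V tensors)."""

    def __init__(self, window_size: int):
        self.entries = deque(maxlen=window_size)

    def update_and_fetch(self, new_entries):
        self.entries.extend(new_entries)  # oldest entries fall off automatically
        return list(self.entries)

cache = WindowedCache(window_size=4)
for token_id in range(10):
    visible = cache.update_and_fetch([token_id])
# After ten tokens only the last four remain: visible is [6, 7, 8, 9]
```

<p>However many tokens are fed in, the cache never holds more than <code>window_size</code> entries, which is the property the memory table later in this post relies on.</p>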
<h3 id="moe-implementation">MoE Implementation</h3>
<p>MoE dynamically selects experts per token to sparsify computation:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">SharedExpertMoE</span>(nn<span style="color:#ff79c6">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __call__(self, x: mx<span style="color:#ff79c6">.</span>array) <span style="color:#ff79c6">-&gt;</span> mx<span style="color:#ff79c6">.</span>array:
</span></span><span style="display:flex;"><span>        gates <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>gate(x)                      <span style="color:#6272a4"># (batch, seq_len, num_experts)</span>
</span></span><span style="display:flex;"><span>        gates <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>softmax(gates, axis<span style="color:#ff79c6">=-</span><span style="color:#bd93f9">1</span>)        <span style="color:#6272a4"># Softmax over experts</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Top-k expert selection</span>
</span></span><span style="display:flex;"><span>        inds <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>stop_gradient(mx<span style="color:#ff79c6">.</span>argpartition(<span style="color:#ff79c6">-</span>gates, kth<span style="color:#ff79c6">=</span>k<span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>, axis<span style="color:#ff79c6">=-</span><span style="color:#bd93f9">1</span>)[<span style="color:#ff79c6">...</span>, :k])
</span></span><span style="display:flex;"><span>        scores <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>take_along_axis(gates, inds, axis<span style="color:#ff79c6">=-</span><span style="color:#bd93f9">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Expert processing</span>
</span></span><span style="display:flex;"><span>        y <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>switch_mlp(x, inds)
</span></span><span style="display:flex;"><span>        y <span style="color:#ff79c6">=</span> (y <span style="color:#ff79c6">*</span> scores[<span style="color:#ff79c6">...</span>, <span style="color:#ff79c6">None</span>])<span style="color:#ff79c6">.</span>sum(axis<span style="color:#ff79c6">=-</span><span style="color:#bd93f9">2</span>)  <span style="color:#6272a4"># Weighted combination</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The <strong><code>gather_mm</code> optimization</strong> is what makes this routing cheap. Mathematically, expert routing is just &ldquo;multiply the input by the selected expert&rsquo;s weight matrix,&rdquo; but a naive implementation either runs every expert on every token or copies the selected weights out for each token before multiplying.</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">SwitchLinear</span>(nn<span style="color:#ff79c6">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __call__(self, x: mx<span style="color:#ff79c6">.</span>array, indices: mx<span style="color:#ff79c6">.</span>array) <span style="color:#ff79c6">-&gt;</span> mx<span style="color:#ff79c6">.</span>array:
</span></span><span style="display:flex;"><span>        B, L, K <span style="color:#ff79c6">=</span> indices<span style="color:#ff79c6">.</span>shape
</span></span><span style="display:flex;"><span>        flat_idx <span style="color:#ff79c6">=</span> indices<span style="color:#ff79c6">.</span>reshape(<span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>)
</span></span><span style="display:flex;"><span>        x_flat <span style="color:#ff79c6">=</span> x<span style="color:#ff79c6">.</span>reshape(<span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>, <span style="color:#bd93f9">1</span>, D)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        w_t <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>weight<span style="color:#ff79c6">.</span>swapaxes(<span style="color:#ff79c6">-</span><span style="color:#bd93f9">1</span>, <span style="color:#ff79c6">-</span><span style="color:#bd93f9">2</span>)
</span></span><span style="display:flex;"><span>        out <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>gather_mm(x_flat, w_t, rhs_indices<span style="color:#ff79c6">=</span>flat_idx,
</span></span><span style="display:flex;"><span>                          sorted_indices<span style="color:#ff79c6">=</span>sorted_indices)
</span></span></code></pre></td></tr></table>
</div>
</div><p><code>mx.gather_mm</code> is a native MLX operation that <strong>gathers whole matrices from a stacked operand by index, then multiplies</strong>. Here <code>rhs_indices</code> picks, for each token, the weight matrix of its assigned expert, so there is no need to iterate over all expert weights. Setting <code>sorted_indices=True</code> tells MLX that the indices are already sorted, which may allow a faster implementation because tokens routed to the same expert sit next to each other.</p>
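<p>The routing arithmetic itself fits in a few lines of plain Python. The sketch below is illustrative only: <code>matvec</code>, <code>route_token</code>, and the toy expert weights are hypothetical, and the Python loop stands in for what a fused gather-and-multiply kernel does in one call. It selects the top-2 experts for a single token and mixes their outputs by gate score, without renormalizing the scores, as in the snippet above.</p>

```python
def matvec(matrix, vector):
    """Multiply a nested-list matrix (rows x cols) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def route_token(x, gates, expert_weights, k=2):
    """Apply only the k highest-scoring experts to x and mix them by gate score."""
    chosen = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
    out = [0.0] * len(expert_weights[0])
    for e in chosen:  # the other experts' weights are never touched
        y = matvec(expert_weights[e], x)
        out = [o + gates[e] * yi for o, yi in zip(out, y)]
    return chosen, out

# Four toy experts; expert e simply scales its input by (e + 1).
expert_weights = [[[e + 1.0, 0.0], [0.0, e + 1.0]] for e in range(4)]
gates = [0.1, 0.5, 0.1, 0.3]  # already softmax-normalized
chosen, out = route_token([1.0, 2.0], gates, expert_weights)
```

<p>Experts 1 and 3 are selected, so only two of the four weight matrices are ever multiplied. The output is approximately <code>[2.2, 4.4]</code>, that is 0.5 times expert 1 plus 0.3 times expert 3.</p>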
<p>A <strong>Shared Expert</strong> is an additional MLP applied to all tokens:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6272a4"># Shared expert gating</span>
</span></span><span style="display:flex;"><span>shared_out <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>shared_expert(x)
</span></span><span style="display:flex;"><span>gate <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>sigmoid(self<span style="color:#ff79c6">.</span>shared_expert_gate(x))
</span></span><span style="display:flex;"><span>output <span style="color:#ff79c6">=</span> gated_expert_output <span style="color:#ff79c6">+</span> gate <span style="color:#ff79c6">*</span> shared_out
</span></span></code></pre></td></tr></table>
</div>
</div><p>The shared expert exists to guarantee <strong>common knowledge that Top-2 routing might miss</strong>. Things like &ldquo;basic grammar of natural language&rdquo; or &ldquo;general world knowledge&rdquo; should always be applied, not left to expert routing.</p>
<h3 id="parameter-activation-efficiency">Parameter Activation Efficiency</h3>
<p>With Top-2 out of 8 experts, only 25% of MoE FFN parameters participate in computation per token. Including the shared expert, activated parameters per token are approximately 1.4B — just 44% of the total 3.2B.</p>
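<p>The two percentages follow directly from the figures quoted above. A quick check, using only numbers stated in this post (the variable names are illustrative):</p>

```python
total_params_b = 3.2   # total parameters, in billions
active_params_b = 1.4  # parameters activated per token, in billions
num_experts, top_k = 8, 2

routed_fraction = top_k / num_experts               # share of routed experts used per token
active_fraction = active_params_b / total_params_b  # share of all parameters used per token

summary = f"{routed_fraction:.0%} of routed experts, {active_fraction:.0%} of all parameters"
```

<p>The routed share is exactly 25%, and 1.4B out of 3.2B rounds to 44%.</p>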
<h2 id="layers-17-24-ssm--moe-output-synthesis">Layers 17-24: SSM + MoE (Output Synthesis)</h2>
<p>The final 8 layers drop attention entirely, using only SSM + MoE. Linear recurrence combined with sparse experts enables <strong>fast output generation</strong>.</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">AxonSSMMoEBlock</span>(nn<span style="color:#ff79c6">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __call__(self, x, cache<span style="color:#ff79c6">=</span><span style="color:#ff79c6">None</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># SSM with residual</span>
</span></span><span style="display:flex;"><span>        residual <span style="color:#ff79c6">=</span> x
</span></span><span style="display:flex;"><span>        x <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>input_norm(x)
</span></span><span style="display:flex;"><span>        ssm_out, ssm_cache <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>ssm(x, cache<span style="color:#ff79c6">=</span>cache)
</span></span><span style="display:flex;"><span>        x <span style="color:#ff79c6">=</span> residual <span style="color:#ff79c6">+</span> ssm_out
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># MoE with residual</span>
</span></span><span style="display:flex;"><span>        residual <span style="color:#ff79c6">=</span> x
</span></span><span style="display:flex;"><span>        x <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>post_ssm_norm(x)
</span></span><span style="display:flex;"><span>        x <span style="color:#ff79c6">=</span> residual <span style="color:#ff79c6">+</span> self<span style="color:#ff79c6">.</span>moe(x)
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> x, ssm_cache
</span></span></code></pre></td></tr></table>
</div>
</div><p>Why no attention in the final segment? In autoregressive generation, what matters most is the <strong>representation of the last token</strong>. At this point, the SWA layers (9-16) have already completed reasoning, and layers 17-24 are synthesizing the reasoning result into a final token distribution. Synthesis doesn&rsquo;t need attention&rsquo;s global context — SSM&rsquo;s linear compute and MoE&rsquo;s expert knowledge are sufficient.</p>
<h2 id="detailed-memory-budget-analysis">Detailed Memory Budget Analysis</h2>
<p>Running a model on a MacBook Air M4 (16GB unified memory) requires precise memory management. macOS allocates about 6-8GB to the system, leaving roughly 8GB for the model.</p>
<h3 id="weight-memory">Weight Memory</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Memory (FP16)</th>
          <th>Memory (4-bit)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full model</td>
          <td>3.2B</td>
          <td>~6,400 MB</td>
          <td>~1,600 MB</td>
      </tr>
      <tr>
          <td>SSM layers (8)</td>
          <td>~0.8B</td>
          <td>~1,600 MB</td>
          <td>~400 MB</td>
      </tr>
      <tr>
          <td>SWA+MoE layers (8)</td>
          <td>~1.6B</td>
          <td>~3,200 MB</td>
          <td>~800 MB</td>
      </tr>
      <tr>
          <td>SSM+MoE layers (8)</td>
          <td>~0.8B</td>
          <td>~1,600 MB</td>
          <td>~400 MB</td>
      </tr>
  </tbody>
</table>
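<p>The table entries are plain parameter-count arithmetic, which the following sketch reproduces. <code>weight_memory_mb</code> is a hypothetical helper. The last two lines go one step beyond the table and rest on an assumption: if each group of 64 weights also stores a 16-bit scale and a 16-bit bias, the effective cost is about 4.5 bits per weight rather than 4.</p>

```python
def weight_memory_mb(params_billions: float, bits_per_weight: float) -> float:
    """Weight storage in decimal megabytes (1 MB = 1e6 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e6

segments = {"SSM": 0.8, "SWA+MoE": 1.6, "SSM+MoE": 0.8}  # billions of parameters
fp16_mb = {name: weight_memory_mb(p, 16) for name, p in segments.items()}
int4_mb = {name: weight_memory_mb(p, 4) for name, p in segments.items()}

# Assumption: one 16-bit scale and one 16-bit bias per group of 64 weights
# adds 32 / 64 = 0.5 bits per weight on top of the 4-bit payload.
effective_bits = 4 + (16 + 16) / 64
full_model_4bit_with_overhead_mb = weight_memory_mb(3.2, effective_bits)
```

<p>The per-segment values sum to the 6,400 MB (FP16) and 1,600 MB (4-bit) shown for the full model. Under the stated assumption about group metadata, the 4-bit figure would be closer to 1,800 MB.</p>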
<h3 id="inference-memory-kv-cache--activations">Inference Memory (KV Cache + Activations)</h3>
<table>
  <thead>
      <tr>
          <th>Sequence Length</th>
          <th>KV Cache (SWA 8 layers)</th>
          <th>Activation Memory</th>
          <th>Total Inference Memory</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>4K</td>
          <td>~200 MB</td>
          <td>~400 MB</td>
          <td>~600 MB</td>
      </tr>
      <tr>
          <td>16K</td>
          <td>~200 MB</td>
          <td>~600 MB</td>
          <td>~800 MB</td>
      </tr>
      <tr>
          <td>64K</td>
          <td>~200 MB</td>
          <td>~1,200 MB</td>
          <td>~1,400 MB</td>
      </tr>
  </tbody>
</table>
<p>The KV cache doesn&rsquo;t grow with sequence length because of <code>window_size</code> trimming. Only 4096 entries of KV cache are maintained regardless of whether the sequence is 4K or 64K.</p>
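<p>Putting the two tables together gives a rough headroom estimate. The sketch below only adds up figures already quoted in this post (4-bit weights, KV cache, activations) against the roughly 8GB left for the model, treating 8GB as 8,000 MB for simplicity.</p>

```python
WEIGHTS_4BIT_MB = 1_600  # full model, 4-bit weights
AVAILABLE_MB = 8_000     # roughly 8GB left after macOS takes its share

# (KV cache MB, activation MB) per sequence length, from the table above
inference_mb = {"4K": (200, 400), "16K": (200, 600), "64K": (200, 1_200)}

totals = {seq: kv + act for seq, (kv, act) in inference_mb.items()}
headroom = {seq: AVAILABLE_MB - WEIGHTS_4BIT_MB - total for seq, total in totals.items()}
```

<p>Even at 64K tokens the estimate leaves about 5,000 MB unused, and all of the growth from 4K to 64K comes from activations rather than from the KV cache.</p>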
<h3 id="4-bit-nf4-quantization">4-bit NF4 Quantization</h3>
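<p>Quantization is the core technique here: going from FP16 to 4 bits cuts weight memory by roughly 4x. NF4 (NormalFloat 4) is a 4-bit data type whose quantization levels are placed for normally distributed weights, so it typically loses less information than uniform int4 quantization. Note that the <code>mx.quantize</code> call in the snippet below performs MLX&rsquo;s group-wise quantization, which stores a scale and a bias for every group of 64 weights rather than using an NF4 lookup table.</p>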
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">QuantizedSwitchLinear</span>(nn<span style="color:#ff79c6">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __init__(self, input_dims, output_dims, num_experts,
</span></span><span style="display:flex;"><span>                 group_size<span style="color:#ff79c6">=</span><span style="color:#bd93f9">64</span>, bits<span style="color:#ff79c6">=</span><span style="color:#bd93f9">4</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Per-group quantization: scale factor and bias per 64 elements</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>weight, self<span style="color:#ff79c6">.</span>scales, self<span style="color:#ff79c6">.</span>biases_quant <span style="color:#ff79c6">=</span> \
</span></span><span style="display:flex;"><span>            mx<span style="color:#ff79c6">.</span>quantize(weight, group_size<span style="color:#ff79c6">=</span>group_size, bits<span style="color:#ff79c6">=</span>bits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __call__(self, x: mx<span style="color:#ff79c6">.</span>array, indices: mx<span style="color:#ff79c6">.</span>array):
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Single fused operation: quantized weights + gather</span>
</span></span><span style="display:flex;"><span>        out <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>gather_qmm(
</span></span><span style="display:flex;"><span>            x_flat, self<span style="color:#ff79c6">.</span>weight, self<span style="color:#ff79c6">.</span>scales, self<span style="color:#ff79c6">.</span>biases_quant,
</span></span><span style="display:flex;"><span>            rhs_indices<span style="color:#ff79c6">=</span>flat_idx, group_size<span style="color:#ff79c6">=</span>self<span style="color:#ff79c6">.</span>group_size, bits<span style="color:#ff79c6">=</span>self<span style="color:#ff79c6">.</span>bits
</span></span><span style="display:flex;"><span>        )
</span></span></code></pre></td></tr></table>
</div>
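</div><p><code>mx.gather_qmm</code> fuses expert selection (the gather over <code>rhs_indices</code>) with the quantized matrix multiplication in a single operation. The 4-bit weights are dequantized inside the kernel rather than being materialized as a separate full-precision tensor, which saves memory bandwidth.</p>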
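<p>To make the "scale factor and bias per 64 elements" comment concrete, here is a minimal plain-Python sketch of per-group 4-bit affine quantization. It is an illustration only, not MLX's kernel: the helper names (<code>quantize_group</code>, <code>dequantize_group</code>, <code>quantize</code>) are hypothetical, and the min/max mapping is an assumption about how each group's scale and bias are chosen.</p>

```python
import math


def quantize_group(values, bits=4):
    """Quantize one group to integer codes plus a (scale, bias) pair.

    Assumption: bias is the group minimum and scale spreads the group's
    range over the 2**bits - 1 available steps.
    """
    levels = (1 << bits) - 1          # 15 steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~ code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=64, bits=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Toy weights: 128 values, so two groups of 64.
weights = [math.sin(0.1 * i) for i in range(128)]
groups = quantize(weights)

restored = [
    value
    for codes, scale, bias in groups
    for value in dequantize_group(codes, scale, bias)
]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)} max_error={max_error:.4f}")
```

<p>Each group stores 64 small integer codes plus one scale and one bias, so the rounding error of any weight is bounded by half of its group's scale.</p>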
<h3 id="final-memory-layout">Final Memory Layout</h3>



<pre tabindex="0"><code>Total available memory: ~8,000 MB
├─ Model weights (4-bit): ~1,600 MB
├─ KV cache (fixed): ~200 MB
├─ Activations (4K ctx): ~400 MB
├─ OS reserve: ~1,000 MB
└─ Remaining: ~4,800 MB (available for other work)
</code></pre>
<p>With 4-bit quantization alone, a 4K context uses about 2.2GB and a 64K context about 3.2GB, so the model runs comfortably on a 16GB MacBook.</p>
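<p>The arithmetic behind the memory layout can be checked with a few lines of Python. The figures are the approximate ones from the layout above; the dictionary keys and variable names are illustrative only.</p>

```python
# Approximate figures from the memory layout, in MB.
total_available_mb = 8000
budget_mb = {
    "model_weights_4bit": 1600,
    "kv_cache_fixed": 200,
    "activations_4k_ctx": 400,
    "os_reserve": 1000,
}

# Model footprint at 4K context: everything except the OS reserve.
inference_mb = sum(
    size for name, size in budget_mb.items() if name != "os_reserve"
)

# Headroom left once every line item is accounted for.
remaining_mb = total_available_mb - sum(budget_mb.values())

print(f"inference footprint: {inference_mb / 1000:.1f} GB")
print(f"remaining: {remaining_mb} MB")
```

<p>The 2.2GB figure for a 4K context is the sum of the weights, the KV cache, and the activations, and the remaining 4,800 MB is what is left after the OS reserve is subtracted as well.</p>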
<h2 id="thermal-aware-training">Thermal-Aware Training</h2>
<p>Bit-Axon&rsquo;s most practical innovation is its <strong>thermal-aware training pipeline</strong>. A three-tier thermal policy enables sustained training on a fanless MacBook Air.</p>
<h3 id="thermal-policy-implementation">Thermal Policy Implementation</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>@dataclass
</span></span><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">ThermalPolicy</span>:
</span></span><span style="display:flex;"><span>    max_speed_temp: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">75.0</span>    <span style="color:#6272a4"># Below this: full-speed training</span>
</span></span><span style="display:flex;"><span>    pause_temp: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">85.0</span>        <span style="color:#6272a4"># At or above this: pause training</span>
</span></span><span style="display:flex;"><span>    stop_temp: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">95.0</span>         <span style="color:#6272a4"># At or above this: stop training</span>
</span></span><span style="display:flex;"><span>    pause_duration: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">0.5</span>      <span style="color:#6272a4"># Cool-down pause duration (seconds)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">CoolingScheduler</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __init__(self, monitor, policy: ThermalPolicy <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">None</span>):
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_monitor <span style="color:#ff79c6">=</span> monitor
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_policy <span style="color:#ff79c6">=</span> policy <span style="color:#ff79c6">or</span> ThermalPolicy()
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_total_pause_time: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">0.0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> <span style="color:#50fa7b">check_before_step</span>(self, step: <span style="color:#8be9fd;font-style:italic">int</span>) <span style="color:#ff79c6">-&gt;</span> <span style="color:#ff79c6">None</span>:
</span></span><span style="display:flex;"><span>        temp <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>_monitor<span style="color:#ff79c6">.</span>temperature
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">if</span> temp <span style="color:#ff79c6">&gt;=</span> self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>stop_temp:  <span style="color:#6272a4"># 95°C threshold</span>
</span></span><span style="display:flex;"><span>            <span style="color:#ff79c6">raise</span> ThermalShutdownError(
</span></span><span style="display:flex;"><span>                <span style="color:#f1fa8c">f</span><span style="color:#f1fa8c">&#34;SoC temperature </span><span style="color:#f1fa8c">{</span>temp<span style="color:#f1fa8c">:</span><span style="color:#f1fa8c">.1f</span><span style="color:#f1fa8c">}</span><span style="color:#f1fa8c">C exceeds stop threshold&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">while</span> self<span style="color:#ff79c6">.</span>_monitor<span style="color:#ff79c6">.</span>temperature <span style="color:#ff79c6">&gt;=</span> self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>pause_temp:  <span style="color:#6272a4"># 85°C threshold; re-read sensor each pass</span>
</span></span><span style="display:flex;"><span>            time<span style="color:#ff79c6">.</span>sleep(self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>pause_duration)  <span style="color:#6272a4"># Wait 0.5s</span>
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>_total_pause_time <span style="color:#ff79c6">+=</span> self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>pause_duration
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> <span style="color:#50fa7b">should_reduce_batch</span>(self) <span style="color:#ff79c6">-&gt;</span> <span style="color:#8be9fd;font-style:italic">bool</span>:
</span></span><span style="display:flex;"><span>        temp <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>_monitor<span style="color:#ff79c6">.</span>temperature
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">return</span> self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>max_speed_temp <span style="color:#ff79c6">&lt;=</span> temp <span style="color:#ff79c6">&lt;</span> self<span style="color:#ff79c6">.</span>_policy<span style="color:#ff79c6">.</span>pause_temp
</span></span></code></pre></td></tr></table>
</div>
</div>
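<p>The three thresholds split the temperature range into four bands. As a minimal sketch of how a reading maps to an action, the snippet below redefines <code>ThermalPolicy</code> so that it runs on its own; the <code>thermal_action</code> helper and its string labels are hypothetical and are not part of Bit-Axon's API.</p>

```python
from dataclasses import dataclass


@dataclass
class ThermalPolicy:
    max_speed_temp: float = 75.0   # Below this: full-speed training
    pause_temp: float = 85.0       # At or above this: pause training
    stop_temp: float = 95.0        # At or above this: stop training
    pause_duration: float = 0.5    # Cool-down pause duration (seconds)


def thermal_action(temp: float, policy: ThermalPolicy) -> str:
    """Map a temperature reading to a (hypothetical) action label."""
    if temp >= policy.stop_temp:
        return "stop"
    if temp >= policy.pause_temp:
        return "pause"
    if temp >= policy.max_speed_temp:
        return "reduce_batch"
    return "full_speed"


policy = ThermalPolicy()
readings = [62.0, 75.0, 84.9, 85.0, 96.5]
actions = [thermal_action(t, policy) for t in readings]
print(actions)
```

<p>The checks run from the hottest band down, so a reading that sits exactly on a threshold falls into the stricter band, matching the <code>&gt;=</code> comparisons in <code>CoolingScheduler</code>.</p>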


<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 576 73"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='8' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='24' y='20' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>&lt;</text>
<text text-anchor='middle' x='40' y='20' fill='currentColor' style='font-size:1em'>7</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>≥</text>
<text text-anchor='middle' x='48' y='20' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>7</text>
<text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='64' y='52' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>°</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='72' y='52' fill='currentColor' style='font-size:1em'>°</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>°</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>°</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='88' y='20' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='104' y='20' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='104' y='36' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='104' y='52' fill='currentColor' style='font-size:1em'>→</text>
<text text-anchor='middle' x='120' y='4' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='120' y='20' fill='currentColor' style='font-size:1em'>W</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='128' y='4' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='128' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='136' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='136' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='136' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='144' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='144' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='152' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='152' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='160' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='160' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='160' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='168' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='168' y='20' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='168' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='176' y='20' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='176' y='52' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='184' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='184' y='52' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='192' y='4' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='192' y='20' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='200' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='200' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='200' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='208' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='208' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='208' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='216' y='4' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='216' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='216' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='224' y='4' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='224' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='224' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='232' y='4' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='232' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='232' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='240' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='240' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='240' y='52' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='248' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='248' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='248' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='248' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='256' y='4' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='256' y='20' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='256' y='52' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='264' y='20' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='272' y='4' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='272' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='272' y='52' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='280' y='4' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='280' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='280' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='280' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='288' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='288' y='52' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='296' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='296' y='20' fill='currentColor' style='font-size:1em'>z</text>
<text text-anchor='middle' x='296' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='296' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='304' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='304' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='304' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='304' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='312' y='4' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='312' y='36' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='312' y='52' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='320' y='4' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='320' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='320' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='328' y='4' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='328' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='328' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='328' y='52' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='336' y='20' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='336' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='336' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='344' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='344' y='36' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='344' y='52' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='352' y='20' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='352' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='360' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='360' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='360' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='368' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='368' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='368' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='376' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='376' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='376' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='384' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='384' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='384' y='52' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='392' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='392' y='52' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='400' y='20' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='400' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='400' y='52' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='408' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='408' y='52' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='416' y='20' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='416' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='416' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='424' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='424' y='36' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='424' y='52' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='432' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='432' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='432' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='440' y='20' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='440' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='440' y='52' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='448' y='20' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='448' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='456' y='20' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='456' y='36' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='456' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='464' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='464' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='464' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='472' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='472' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='472' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='480' y='20' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='480' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='480' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='488' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='488' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='496' y='20' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='496' y='52' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='504' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='512' y='20' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='520' y='20' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='528' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='536' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='544' y='20' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='552' y='20' fill='currentColor' style='font-size:1em'>h</text>
<text text-anchor='middle' x='560' y='20' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<h3 id="temperature-monitoring">Temperature Monitoring</h3>
<p>Real-time temperature is read from the Apple Silicon SoC via the macOS <code>powermetrics</code> command-line tool, which also reports fan speed, power consumption, and thermal throttling status. On fanless models, thermal throttling kicks in around 100°C, so halting training at 95°C leaves a safety margin before throttling begins.</p>
<h3 id="dynamic-batch-size-adjustment">Dynamic Batch Size Adjustment</h3>
<p>When <code>should_reduce_batch()</code> returns <code>True</code>, the training loop halves the batch size. A smaller batch means less GPU work per step, which lowers heat generation. Once the temperature drops back below 75°C, the original batch size is restored.</p>
<p>This mechanism automatically balances training speed against thermal safety. No manual intervention is needed: training runs at full speed whenever the temperature allows and backs off only when it has to.</p>
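<p>The four temperature bands from the diagram and the batch-size rule described above can be sketched as a small policy function. This is a minimal illustration, not Bit-Axon&rsquo;s actual code: <code>ThermalZone</code>, <code>classify</code>, and <code>next_batch_size</code> are hypothetical names, the temperature is passed in as a plain number instead of being read from <code>powermetrics</code>, and halving relative to the original batch size (rather than repeatedly) is an assumption.</p>

```python
from enum import Enum


class ThermalZone(Enum):
    NORMAL = "normal"      # below 75 C: full-speed training
    WARNING = "warning"    # 75-85 C: reduce the batch size
    DANGER = "danger"      # 85-95 C: pause between steps
    CRITICAL = "critical"  # 95 C and above: halt training


class ThermalShutdownError(RuntimeError):
    """Raised when the SoC reaches the critical temperature."""


def classify(temp_c: float) -> ThermalZone:
    """Map a temperature in Celsius to one of the four zones."""
    if temp_c >= 95.0:
        return ThermalZone.CRITICAL
    if temp_c >= 85.0:
        return ThermalZone.DANGER
    if temp_c >= 75.0:
        return ThermalZone.WARNING
    return ThermalZone.NORMAL


def next_batch_size(temp_c: float, original: int, current: int) -> int:
    """Pick the batch size for the next step (hypothetical policy)."""
    zone = classify(temp_c)
    if zone is ThermalZone.CRITICAL:
        raise ThermalShutdownError(f"SoC at {temp_c:.1f} C")
    if zone is ThermalZone.NORMAL:
        return original               # cooled down: restore the full batch
    if zone is ThermalZone.WARNING:
        return max(1, original // 2)  # assumed: halve once, never below 1
    return current                    # DANGER: keep the size, caller pauses


# Simulated readings: heat up, then cool down.
batch = 8
for temp in [70.0, 78.0, 88.0, 74.0]:
    batch = next_batch_size(temp, original=8, current=batch)
    print(temp, classify(temp).value, batch)
```

<p>In this sketch the batch size goes from 8 to 4 in the warning band, stays at 4 in the danger band, and returns to 8 once the reading falls below 75°C.</p>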
<h2 id="sequence-packing-and-training-efficiency">Sequence Packing and Training Efficiency</h2>
<p><strong>Sequence packing</strong> maximizes GPU utilization:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
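</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">SequencePacker</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> __init__(self, max_seq_len: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">2048</span>, eos_token_id: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">151645</span>):
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>max_seq_len <span style="color:#ff79c6">=</span> max_seq_len
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>eos_token_id <span style="color:#ff79c6">=</span> eos_token_id
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_buffer_ids: <span style="color:#8be9fd;font-style:italic">list</span>[<span style="color:#8be9fd;font-style:italic">int</span>] <span style="color:#ff79c6">=</span> []
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_buffer_mask: <span style="color:#8be9fd;font-style:italic">list</span>[<span style="color:#8be9fd;font-style:italic">int</span>] <span style="color:#ff79c6">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">def</span> <span style="color:#50fa7b">add_example</span>(self, token_ids: <span style="color:#8be9fd;font-style:italic">list</span>[<span style="color:#8be9fd;font-style:italic">int</span>], loss_mask: <span style="color:#8be9fd;font-style:italic">list</span>[<span style="color:#8be9fd;font-style:italic">int</span>]):
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Insert EOS separator if buffer is not empty</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">if</span> self<span style="color:#ff79c6">.</span>_buffer_ids:
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>_buffer_ids<span style="color:#ff79c6">.</span>append(self<span style="color:#ff79c6">.</span>eos_token_id)
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>_buffer_mask<span style="color:#ff79c6">.</span>append(<span style="color:#bd93f9">0</span>)  <span style="color:#6272a4"># Don&#39;t compute loss on separators</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Append the new example to the buffer</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_buffer_ids<span style="color:#ff79c6">.</span>extend(token_ids)
</span></span><span style="display:flex;"><span>        self<span style="color:#ff79c6">.</span>_buffer_mask<span style="color:#ff79c6">.</span>extend(loss_mask)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#6272a4"># Yield complete batches when buffer is full</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ff79c6">while</span> <span style="color:#8be9fd;font-style:italic">len</span>(self<span style="color:#ff79c6">.</span>_buffer_ids) <span style="color:#ff79c6">&gt;=</span> self<span style="color:#ff79c6">.</span>max_seq_len:
</span></span><span style="display:flex;"><span>            <span style="color:#ff79c6">yield</span> PackedBatch(
</span></span><span style="display:flex;"><span>                token_ids<span style="color:#ff79c6">=</span>self<span style="color:#ff79c6">.</span>_buffer_ids[:self<span style="color:#ff79c6">.</span>max_seq_len],
</span></span><span style="display:flex;"><span>                loss_mask<span style="color:#ff79c6">=</span>self<span style="color:#ff79c6">.</span>_buffer_mask[:self<span style="color:#ff79c6">.</span>max_seq_len]
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>            <span style="color:#6272a4"># Carry the overflow into the next packed sequence</span>
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>_buffer_ids <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>_buffer_ids[self<span style="color:#ff79c6">.</span>max_seq_len:]
</span></span><span style="display:flex;"><span>            self<span style="color:#ff79c6">.</span>_buffer_mask <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>_buffer_mask[self<span style="color:#ff79c6">.</span>max_seq_len:]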
</span></span></code></pre></td></tr></table>
</div>
</div><p>Sequence packing concatenates multiple training examples into a single sequence so that no compute is spent on padding tokens. For example, four 512-token examples fill one 2048-token sequence almost exactly; the three separator tokens push the last three tokens into the next sequence. An EOS token separates consecutive examples, and <code>loss_mask=0</code> ensures that no loss is computed on the separators.</p>
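<p>To make the arithmetic concrete, the packing logic can be sketched as a standalone function and run on four 512-token examples. This is an illustrative sketch under stated assumptions rather than the project&rsquo;s code: <code>pack_examples</code> is a hypothetical helper, plain tuples stand in for <code>PackedBatch</code>, and the EOS id of 0 is chosen only to make separators easy to spot.</p>

```python
def pack_examples(examples, max_seq_len, eos_token_id):
    """Concatenate (token_ids, loss_mask) pairs and cut fixed-size chunks.

    Returns (packed, leftover): packed is a list of full
    (token_ids, loss_mask) chunks, leftover is the unfinished tail.
    """
    buf_ids, buf_mask, packed = [], [], []
    for token_ids, loss_mask in examples:
        if buf_ids:
            buf_ids.append(eos_token_id)  # separator between examples
            buf_mask.append(0)            # no loss on the separator
        buf_ids.extend(token_ids)
        buf_mask.extend(loss_mask)
        while len(buf_ids) >= max_seq_len:
            packed.append((buf_ids[:max_seq_len], buf_mask[:max_seq_len]))
            buf_ids = buf_ids[max_seq_len:]
            buf_mask = buf_mask[max_seq_len:]
    return packed, (buf_ids, buf_mask)


# Four 512-token examples, each with loss enabled on every token.
examples = [([i] * 512, [1] * 512) for i in range(1, 5)]
packed, (rest_ids, rest_mask) = pack_examples(examples, 2048, eos_token_id=0)

print(len(packed))            # full 2048-token sequences produced
print(len(rest_ids))          # tokens carried over to the next sequence
print(packed[0][1].count(0))  # masked separator positions in the first sequence
```

<p>Padding each of the four examples to 2048 tokens would instead process 8192 positions, three quarters of them padding.</p>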
<h2 id="orpo-training-sft-and-preference-alignment-in-one-pass">ORPO Training: SFT and Preference Alignment in One Pass</h2>
<p>Bit-Axon supports <strong>ORPO (Odds Ratio Preference Optimization)</strong>. ORPO&rsquo;s key advantage is that no separate reference model is needed — supervised fine-tuning and preference alignment happen simultaneously in a single model.</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">orpo_loss</span>(chosen_logps, rejected_logps, beta<span style="color:#ff79c6">=</span><span style="color:#bd93f9">0.1</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Odds ratio computation</span>
</span></span><span style="display:flex;"><span>    log_odds <span style="color:#ff79c6">=</span> (chosen_logps <span style="color:#ff79c6">-</span> rejected_logps) <span style="color:#ff79c6">-</span> \
</span></span><span style="display:flex;"><span>               (log1mexp(chosen_logps) <span style="color:#ff79c6">-</span> log1mexp(rejected_logps))
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Sigmoid penalty</span>
</span></span><span style="display:flex;"><span>    loss <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">-</span>mx<span style="color:#ff79c6">.</span>mean(nn<span style="color:#ff79c6">.</span>log_sigmoid(beta <span style="color:#ff79c6">*</span> log_odds))
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">return</span> loss
</span></span></code></pre></td></tr></table>
</div>
</div><p>ORPO&rsquo;s total loss consists of two components:</p>
<ol>
<li><strong>NLL loss</strong>: Cross-entropy loss on the chosen sequence (standard SFT)</li>
<li><strong>Odds ratio penalty</strong>: Penalizes a low log odds ratio between the chosen and rejected sequences, pushing the model to assign higher odds to the chosen one</li>
</ol>
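<p>A small numeric check helps show what the odds ratio penalty measures. The sketch below uses only the Python standard library and scalar probabilities; <code>log_odds_ratio</code> and <code>odds_ratio_penalty</code> are hypothetical names, and the one-line <code>log1mexp</code> here is a simple stand-in based on <code>math.log1p</code>, not the MLX implementation used in the project.</p>

```python
import math


def log1mexp(x: float) -> float:
    """log(1 - exp(x)) for negative x, using log1p for accuracy."""
    return math.log1p(-math.exp(x))


def log_odds_ratio(chosen_logp: float, rejected_logp: float) -> float:
    """log(odds(chosen) / odds(rejected)), where odds(p) = p / (1 - p)."""
    return (chosen_logp - rejected_logp) - (
        log1mexp(chosen_logp) - log1mexp(rejected_logp)
    )


def odds_ratio_penalty(chosen_logp: float, rejected_logp: float,
                       beta: float = 0.1) -> float:
    """-log(sigmoid(beta * log_odds_ratio)), written as log(1 + exp(-z))."""
    z = beta * log_odds_ratio(chosen_logp, rejected_logp)
    return math.log1p(math.exp(-z))


# Chosen response has probability 0.6, rejected has 0.2.
# The odds are 0.6/0.4 = 1.5 and 0.2/0.8 = 0.25, so the odds ratio is 6.
ratio = log_odds_ratio(math.log(0.6), math.log(0.2))
print(math.exp(ratio))

# Swapping chosen and rejected makes the penalty larger.
good = odds_ratio_penalty(math.log(0.6), math.log(0.2))
bad = odds_ratio_penalty(math.log(0.2), math.log(0.6))
print(good, bad)
```

<p>The penalty shrinks as the chosen response becomes more likely relative to the rejected one, which is the gradient signal that replaces a separate reference model.</p>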
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">compute_orpo_loss</span>(model, chosen_ids, chosen_labels,
</span></span><span style="display:flex;"><span>                      rejected_ids, rejected_labels, beta<span style="color:#ff79c6">=</span><span style="color:#bd93f9">0.1</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Two forward passes through the same model (no reference model needed)</span>
</span></span><span style="display:flex;"><span>    logits_chosen <span style="color:#ff79c6">=</span> model(chosen_ids)
</span></span><span style="display:flex;"><span>    logits_rejected <span style="color:#ff79c6">=</span> model(rejected_ids)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># NLL loss on chosen sequences</span>
</span></span><span style="display:flex;"><span>    nll_loss <span style="color:#ff79c6">=</span> cross_entropy_loss(logits_chosen, chosen_labels)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Preference comparison</span>
</span></span><span style="display:flex;"><span>    chosen_logps <span style="color:#ff79c6">=</span> get_logps(logits_chosen, chosen_labels)
</span></span><span style="display:flex;"><span>    rejected_logps <span style="color:#ff79c6">=</span> get_logps(logits_rejected, rejected_labels)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Combined objective</span>
</span></span><span style="display:flex;"><span>    orpo_penalty <span style="color:#ff79c6">=</span> orpo_loss(chosen_logps, rejected_logps, beta)
</span></span><span style="display:flex;"><span>    total_loss <span style="color:#ff79c6">=</span> nll_loss <span style="color:#ff79c6">+</span> orpo_penalty
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">return</span> total_loss
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="numerical-stability">Numerical Stability</h3>
<p>The <code>log1mexp</code> function provides numerically stable computation of <code>log(1 - exp(x))</code>:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> <span style="color:#50fa7b">log1mexp</span>(x: mx<span style="color:#ff79c6">.</span>array) <span style="color:#ff79c6">-&gt;</span> mx<span style="color:#ff79c6">.</span>array:
</span></span><span style="display:flex;"><span>    threshold <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>array(<span style="color:#ff79c6">-</span>_LN2)  <span style="color:#6272a4"># -ln(2)</span>
</span></span><span style="display:flex;"><span>    use_branch1 <span style="color:#ff79c6">=</span> x <span style="color:#ff79c6">&lt;</span> threshold
</span></span><span style="display:flex;"><span>    x_branch1 <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>where(use_branch1, x, mx<span style="color:#ff79c6">.</span>zeros_like(x))
</span></span><span style="display:flex;"><span>    x_branch2 <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>where(<span style="color:#ff79c6">~</span>use_branch1, x, mx<span style="color:#ff79c6">.</span>zeros_like(x))
</span></span><span style="display:flex;"><span>    branch1 <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>log1p(<span style="color:#ff79c6">-</span>mx<span style="color:#ff79c6">.</span>exp(x_branch1))          <span style="color:#6272a4"># For x &lt; -ln(2): exp(x) is small</span>
</span></span><span style="display:flex;"><span>    branch2 <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>log(<span style="color:#ff79c6">-</span>mx<span style="color:#ff79c6">.</span>expm1(x_branch2))         <span style="color:#6272a4"># For x &gt;= -ln(2): exp(x) is near 1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#ff79c6">return</span> mx<span style="color:#ff79c6">.</span>where(use_branch1, branch1, branch2)
</span></span></code></pre></td></tr></table>
</div>
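</div><p>As <code>x</code> approaches 0 from below, <code>exp(x)</code> approaches 1, so evaluating <code>1 - exp(x)</code> directly cancels most of the significant digits and, in low precision, can round to exactly 0 and yield <code>-inf</code>. Splitting the input range at <code>-ln(2)</code> lets each side use the form, <code>expm1</code>-based or <code>log1p</code>-based, that stays accurate there.</p>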
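<p>A scalar sketch of the same split, written in plain Python with the standard <code>math</code> module rather than MLX so it runs anywhere. It uses the commonly recommended assignment (<code>expm1</code> near zero, <code>log1p</code> for more negative inputs). The names <code>log1mexp_naive</code> and <code>LN2</code> are illustrative and not taken from the Bit-Axon source.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

LN2 = math.log(2.0)


def log1mexp_naive(x):
    """Direct evaluation of log(1 - exp(x)); breaks down as x approaches 0."""
    return math.log(1.0 - math.exp(x))


def log1mexp(x):
    """Stable log(1 - exp(x)) for negative x, split at -ln(2)."""
    if x &gt;= 0.0:
        raise ValueError("log1mexp is only defined for negative x")
    if x &gt; -LN2:
        # exp(x) is close to 1 here, so expm1 avoids the cancellation in 1 - exp(x).
        return math.log(-math.expm1(x))
    # exp(x) is small here, so log1p(-exp(x)) keeps full precision.
    return math.log1p(-math.exp(x))
</code></pre></div>
<p>With <code>x = -1e-20</code>, the naive form computes <code>1.0 - 1.0</code> and fails on <code>log(0)</code>, while the stable form returns roughly <code>log(1e-20)</code>.</p>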
<h3 id="qlora-and-dora">QLoRA and DoRA</h3>
<p>Training uses <strong>QLoRA</strong> (Quantized Low-Rank Adaptation): 4-bit quantized base weights are frozen and only low-rank adapters are trained.</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>@dataclass
</span></span><span style="display:flex;"><span><span style="color:#ff79c6">class</span> <span style="color:#50fa7b">TrainingConfig</span>:
</span></span><span style="display:flex;"><span>    quantize_bits: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">4</span>
</span></span><span style="display:flex;"><span>    quantize_group_size: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">64</span>
</span></span><span style="display:flex;"><span>    lora_rank: <span style="color:#8be9fd;font-style:italic">int</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">8</span>
</span></span><span style="display:flex;"><span>    lora_dropout: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">0.0</span>
</span></span><span style="display:flex;"><span>    lora_scale: <span style="color:#8be9fd;font-style:italic">float</span> <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">20.0</span>
</span></span><span style="display:flex;"><span>    use_dora: <span style="color:#8be9fd;font-style:italic">bool</span> <span style="color:#ff79c6">=</span> <span style="color:#ff79c6">True</span>  <span style="color:#6272a4"># Weight-Decomposed LoRA</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><strong>DoRA (Weight-Decomposed Low-Rank Adaptation)</strong> is a LoRA variant that decomposes weights into magnitude and direction:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#ff79c6">def</span> __call__(self, x):
</span></span><span style="display:flex;"><span>    y <span style="color:#ff79c6">=</span> self<span style="color:#ff79c6">.</span>linear(x)
</span></span><span style="display:flex;"><span>    z <span style="color:#ff79c6">=</span> (self<span style="color:#ff79c6">.</span>dropout(x) <span style="color:#ff79c6">@</span> self<span style="color:#ff79c6">.</span>lora_a) <span style="color:#ff79c6">@</span> self<span style="color:#ff79c6">.</span>lora_b
</span></span><span style="display:flex;"><span>    out <span style="color:#ff79c6">=</span> y <span style="color:#ff79c6">+</span> (self<span style="color:#ff79c6">.</span>scale <span style="color:#ff79c6">*</span> z)<span style="color:#ff79c6">.</span>astype(x<span style="color:#ff79c6">.</span>dtype)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#6272a4"># Preserve original magnitude (DoRA core)</span>
</span></span><span style="display:flex;"><span>    denom <span style="color:#ff79c6">=</span> mx<span style="color:#ff79c6">.</span>sqrt(self<span style="color:#ff79c6">.</span>_dora_w_sq_norm <span style="color:#ff79c6">+</span> cross <span style="color:#ff79c6">+</span> d_sq)
</span></span><span style="display:flex;"><span>    out <span style="color:#ff79c6">=</span> (self<span style="color:#ff79c6">.</span>m <span style="color:#ff79c6">/</span> denom)<span style="color:#ff79c6">.</span>astype(x<span style="color:#ff79c6">.</span>dtype) <span style="color:#ff79c6">*</span> out
</span></span></code></pre></td></tr></table>
</div>
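</div><p>DoRA can outperform standard LoRA because it <strong>controls weight magnitude explicitly during training</strong>. Standard LoRA adds a low-rank update to the frozen weights, which changes their magnitude and direction together. DoRA divides the adapted weight by its norm (<code>denom</code>) and rescales it with a separate magnitude vector <code>m</code>, which keeps magnitude from drifting as a side effect of the low-rank update and improves training stability.</p>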
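<p>The three terms under the square root read like the expansion of a squared norm. The sketch below checks that identity in plain Python, assuming that for each output row <code>cross</code> is 2·scale·⟨W, ΔW⟩ and <code>d_sq</code> is scale²·‖ΔW‖², where ΔW is the low-rank update. These meanings are inferred from the variable names and are not confirmed by the source; the function names are hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def row_norm_expansion(w_row, delta_row, scale):
    """Norm of (w + scale * delta) from three separately computed terms."""
    w_sq = sum(w * w for w in w_row)  # squared norm of the frozen base row
    cross = 2.0 * scale * sum(w * d for w, d in zip(w_row, delta_row))
    d_sq = scale * scale * sum(d * d for d in delta_row)
    return math.sqrt(w_sq + cross + d_sq)


def row_norm_direct(w_row, delta_row, scale):
    """Same norm computed by materializing the adapted row first."""
    return math.sqrt(sum((w + scale * d) ** 2 for w, d in zip(w_row, delta_row)))


def dora_rescale(out_value, m, denom):
    """Rescale one output so the effective weight row has magnitude m."""
    return (m / denom) * out_value
</code></pre></div>
<p>Expanding the norm this way means ‖W‖² of the frozen base weight can be computed once and cached, which is presumably what <code>_dora_w_sq_norm</code> holds.</p>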
<h2 id="model-configuration-summary">Model Configuration Summary</h2>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total parameters</td>
          <td>3.2B</td>
          <td>Including MoE</td>
      </tr>
      <tr>
          <td>Active parameters</td>
          <td>~1.4B</td>
          <td>With Top-2 routing</td>
      </tr>
      <tr>
          <td>vocab_size</td>
          <td>32,000</td>
          <td>BPE vocabulary size</td>
      </tr>
      <tr>
          <td>hidden_dim</td>
          <td>2,560</td>
          <td>Model hidden dimension</td>
      </tr>
      <tr>
          <td>num_layers</td>
          <td>24</td>
          <td>3 segments × 8 layers</td>
      </tr>
      <tr>
          <td>num_heads</td>
          <td>32</td>
          <td>Number of heads (head_dim=80)</td>
      </tr>
      <tr>
          <td>ssm_d_state</td>
          <td>16</td>
          <td>SSM state vector dimension</td>
      </tr>
      <tr>
          <td>ssm_d_conv</td>
          <td>4</td>
          <td>SSM 1D convolution kernel</td>
      </tr>
      <tr>
          <td>ssm_scan_step</td>
          <td>64</td>
          <td>Parallel scan chunk size</td>
      </tr>
      <tr>
          <td>swa_window_size</td>
          <td>4,096</td>
          <td>Sliding window size</td>
      </tr>
      <tr>
          <td>moe_num_experts</td>
          <td>8</td>
          <td>Number of experts</td>
      </tr>
      <tr>
          <td>moe_top_k</td>
          <td>2</td>
          <td>Active experts per token</td>
      </tr>
      <tr>
          <td>moe_shared_expert</td>
          <td>true</td>
          <td>Shared expert enabled</td>
      </tr>
      <tr>
          <td>max_seq_len</td>
          <td>65,536</td>
          <td>Maximum sequence length</td>
      </tr>
      <tr>
          <td>Quantization</td>
          <td>4-bit NF4</td>
          <td>Group size 64</td>
      </tr>
  </tbody>
</table>
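<p>The table values imply a few derived quantities. The sketch below is a back-of-the-envelope estimate in Python under stated assumptions: one 16-bit scale per quantization group, 16-bit K/V cache entries, no grouped-query attention (K and V each hold <code>hidden_dim</code> values per token), and attention only in the 8 sliding-window layers. The class and function names are illustrative and not taken from the Bit-Axon source, and the results are estimates, not measurements.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    hidden_dim: int = 2560
    num_layers: int = 24
    num_heads: int = 32
    swa_window_size: int = 4096
    max_seq_len: int = 65536
    total_params: float = 3.2e9
    quantize_bits: int = 4
    quantize_group_size: int = 64

    @property
    def head_dim(self):
        return self.hidden_dim // self.num_heads

    @property
    def layers_per_segment(self):
        return self.num_layers // 3


def quantized_weight_bytes(cfg, scale_bits=16):
    """ASSUMPTION: every parameter is quantized, with one scale per group."""
    bits_per_weight = cfg.quantize_bits + scale_bits / cfg.quantize_group_size
    return cfg.total_params * bits_per_weight / 8


def kv_cache_bytes(cfg, cached_tokens, bytes_per_value=2):
    """ASSUMPTION: K and V store hidden_dim values per token in each attention layer."""
    attention_layers = cfg.layers_per_segment
    return 2 * attention_layers * cached_tokens * cfg.hidden_dim * bytes_per_value
</code></pre></div>
<p>Under these assumptions the weights come to about 1.7 GB, the sliding-window K/V cache to 320 MiB, and an unwindowed cache at the full 65,536-token context to 5 GiB.</p>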
<h2 id="key-insights">Key Insights</h2>
<h3 id="1-solve-hardware-constraints-with-architecture">1. Solve Hardware Constraints with Architecture</h3>
<p>A fanless notebook&rsquo;s thermal limits can&rsquo;t be solved with software tuning alone. SSM&rsquo;s linear complexity reduces compute, MoE&rsquo;s sparse activation saves memory bandwidth, and the thermal scheduler dynamically adjusts training speed. All three are needed for sustained training on a fanless MacBook.</p>
<h3 id="2-match-your-framework-to-your-hardware">2. Match Your Framework to Your Hardware</h3>
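<p>MLX&rsquo;s zero-copy unified memory is the decisive factor that makes model inference possible on a 16GB MacBook: arrays live in one memory pool shared by the CPU and GPU, so no second copy of the weights is made when work moves between devices. PyTorch workflows that copy tensors between CPU and GPU buffers can need up to double the memory on the same hardware. Choosing the framework that matches your hardware is the first step in optimization.</p>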
<h3 id="3-assign-minimum-complexity-to-each-layer-segment">3. Assign Minimum Complexity to Each Layer Segment</h3>
<p>Context absorption gets SSM (O(n)), reasoning gets SWA (O(n × w)), output gets SSM+MoE (linear + sparse). Attention exists in only 8 of 24 layers. Assigning only the minimum necessary computation to each segment manages total complexity more effectively than &ldquo;putting attention in every layer.&rdquo;</p>
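<p>To make the gap between O(n × w) and O(n²) concrete, the sketch below counts query-key pairs per attention layer under causal masking. It is a counting exercise in plain Python, not a measurement of Bit-Axon, and the function names are illustrative.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def full_attention_pairs(n):
    """Causal full attention: the i-th token attends to i positions."""
    return n * (n + 1) // 2


def sliding_window_pairs(n, w):
    """Causal sliding window: each token attends to at most w positions."""
    k = min(n, w)
    # The first k tokens see a growing prefix; the remaining n - k tokens see exactly k.
    return full_attention_pairs(k) + (n - k) * k
</code></pre></div>
<p>At n = 65,536 and w = 4,096, full causal attention scores about 2.15 billion pairs per layer and the sliding window about 0.26 billion, roughly an 8x reduction.</p>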
<h3 id="4-reference-free-alignment-is-essential-for-edge-devices">4. Reference-Free Alignment is Essential for Edge Devices</h3>
<p>ORPO requires no reference model, making preference alignment possible in 16GB of memory. PPO and DPO, as usually implemented, keep a frozen reference model in memory alongside the policy, which roughly doubles the weight footprint and makes them impractical on a 16GB MacBook. Edge device constraints directly influence algorithm selection.</p>
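<p>A minimal sketch of the reference-free penalty, assuming the ORPO odds-ratio form: the log odds of each response come from its average log-probability, and the penalty is the negative log-sigmoid of their difference, scaled by <code>beta</code>. It uses scalar, sequence-averaged log-probabilities in plain Python rather than MLX arrays. The helper names and the place where <code>beta</code> is applied are assumptions, not the Bit-Axon implementation.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def log1mexp(x):
    """Stable log(1 - exp(x)) for negative x."""
    if x &gt; -math.log(2.0):
        return math.log(-math.expm1(x))
    return math.log1p(-math.exp(x))


def log_odds(logp):
    """log(p / (1 - p)) computed from log p without leaving log space."""
    return logp - log1mexp(logp)


def log_sigmoid(z):
    """Stable log(sigmoid(z)) = min(z, 0) - log(1 + exp(-|z|))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def orpo_penalty(chosen_logp, rejected_logp, beta=0.1):
    """Odds-ratio penalty; both inputs come from the same model."""
    log_odds_ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    return -beta * log_sigmoid(log_odds_ratio)
</code></pre></div>
<p>Both log-probabilities come from the one model being trained, which is why no second set of weights has to be resident. When the two responses are equally likely the penalty is <code>beta · ln 2</code>, and it shrinks as the chosen response becomes more likely than the rejected one.</p>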
<h2 id="conclusion">Conclusion</h2>
<p>Bit-Axon is an experiment in running LLMs on edge devices. The three-layer sandwich architecture assigns computation suited to hardware constraints, MLX maximizes unified memory utilization, and thermal-aware training enables sustainable training within physical limits.</p>
<p>Combined, these make a 3.2B model practical on a fanless MacBook. The combination of 16GB unified memory, 4-bit quantization, and Apple Silicon&rsquo;s efficient GPU opens new possibilities for running LLMs on consumer devices.</p>
<p>Full source code at <a href="https://github.com/skyoo2003/bit-axon">github.com/skyoo2003/bit-axon</a>, model weights on <a href="https://huggingface.co/skyoo2003/bit-axon">HuggingFace</a>.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
