<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://amohan.dev/feed.xml" rel="self" type="application/atom+xml"/><link href="https://amohan.dev/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-01-26T00:44:36+00:00</updated><id>https://amohan.dev/feed.xml</id><title type="html">blank</title><subtitle>Immigrant Software Dev, have some opinions on geopolitics </subtitle><entry><title type="html">Backporting FP8 to the RTX 3090 (No H100 Required)</title><link href="https://amohan.dev/blog/2026/fp8-as-storage-imma-ampere/" rel="alternate" type="text/html" title="Backporting FP8 to the RTX 3090 (No H100 Required)"/><published>2026-01-25T23:15:00+00:00</published><updated>2026-01-25T23:15:00+00:00</updated><id>https://amohan.dev/blog/2026/fp8-as-storage-imma-ampere</id><content type="html" xml:base="https://amohan.dev/blog/2026/fp8-as-storage-imma-ampere/"><![CDATA[<p>NVIDIA’s FP8 story is usually told like this: <em>“If you want to experiment with FP8 numerics, you need an H100 (or at least a very new GPU with FP8 support, like an RTX 4090).”</em></p> <p>I disagree.</p> <p>Call it: <strong>backporting FP8-style numerics experiments to the RTX 3090.</strong></p> <p>Not because Ampere magically does FP8 compute (it doesn’t), and not because this makes an RTX 3090 “faster” than Hopper (it won’t).</p> <p>But because a lot of FP8 research and engineering is really about:</p> <ul> <li><strong>how you store weights</strong> (bytes on the wire)</li> <li><strong>when and where you expand them</strong> (decode)</li> <li><strong>what scaling/quantization contract you enforce</strong></li> </ul> <p>You can explore a surprising amount of that on consumer Ampere, if you’re willing to treat FP8 as a <em>storage format</em> and map the math onto hardware that <em>is</em> available.</p> <p>Quick note: if you see an acronym you don’t
recognize, jump to the <a href="#glossary">glossary</a>.</p> <p>Code: <a href="https://github.com/poad42/cuda-fp8-ampere">https://github.com/poad42/cuda-fp8-ampere</a></p> <h2 id="the-plan">The plan</h2> <p>Ampere (sm_86) has extremely capable <abbr title="Hardware matrix-multiply units inside NVIDIA GPUs">tensor cores</abbr>, but it doesn’t have native FP8 tensor-core <abbr title="Matrix Multiply-Accumulate (tensor core instruction path)">MMA</abbr>. What it <em>does</em> have is a very fast path for <strong>INT8 tensor cores</strong> (<abbr title="Integer Matrix Multiply-Accumulate (INT8 tensor core path)">IMMA</abbr> / <abbr title="Warp Matrix Multiply-Accumulate (CUDA API for tensor cores)">WMMA</abbr>).</p> <p>So the project becomes:</p> <blockquote> <p>Keep weights stored as <strong>1-byte FP8 bit patterns</strong> in VRAM, decode/scale/quantize on the fly, and use <strong>INT8 tensor cores</strong> for the matmul.</p> </blockquote> <p>That’s the whole framing: <strong>democratize FP8 research</strong> by making the storage + numerics experimentable on hardware people actually have.</p> <h2 id="fp8-as-storage-in-one-paragraph">FP8-as-storage, in one paragraph</h2> <p>I am not trying to do “FP8 compute.” I’m trying to store weights in a compact FP8 format and only expand them when needed.</p> <p>The VRAM part is simple: <strong>FP16/BF16 weights cost 2 bytes/weight</strong>, while <strong>FP8 weights cost 1 byte/weight</strong>. So for large weight matrices, storing FP8 can cut the resident weight footprint (and the bandwidth to stream it) by close to <strong>2×</strong>.</p> <p>In practice you also store <strong>scale factors</strong> (e.g. 
one FP16 scale per output channel), but that overhead is tiny compared to the full $N\times K$ weight matrix.</p> <p>Conceptually:</p> <ol> <li>Store weights as FP8 bytes (E4M3) — literally <code class="language-plaintext highlighter-rouge">uint8</code> bit patterns.</li> <li>Decode FP8 → FP16 on the fly using a 256-entry <abbr title="Lookup table (256-entry map from FP8 byte to FP16 value)">LUT</abbr>.</li> <li>Apply per-output-channel (per-column) scale.</li> <li>Quantize to INT8 so the tensor cores can consume it.</li> <li>Run <abbr title="Integer Matrix Multiply-Accumulate (INT8 tensor core path)">IMMA</abbr> (INT8×INT8→INT32 accumulate), then write FP16 output.</li> </ol> <p>That’s the whole “FP8 without FP8 MMA” idea.</p> <h2 id="whats-actually-new-here-and-what-isnt">What’s actually new here (and what isn’t)</h2> <p>Three honesty bullets up front:</p> <ul> <li><strong>This is not a claim that Ampere beats BF16/FP16 <abbr title="NVIDIA CUDA Basic Linear Algebra Subprograms (highly optimized GEMM library)">cuBLAS</abbr>.</strong> In fact, for pure compute, cuBLAS is usually hard to beat.</li> <li><strong>This is not full FP8 training.</strong> There’s no backward pass here.</li> <li><strong>This project focuses on FP8(E4M3) storage.</strong> Extending to E5M2 is conceptually similar (another decode path), but I didn’t build it into this writeup.</li> </ul> <p>So what <em>is</em> interesting?</p> <h3 id="bit-level-fp8-handling-lut-decode">Bit-level FP8 handling (LUT decode)</h3> <p>I store FP8 weights as raw <code class="language-plaintext highlighter-rouge">uint8</code> bit patterns and decode them with a 256-entry LUT. 
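</p> <p>The storage contract is small enough to sketch in plain Python. Here <code class="language-plaintext highlighter-rouge">e4m3_to_float</code> follows the OCP E4M3 definition (bias 7, no infinities, <code class="language-plaintext highlighter-rouge">0x7F</code>/<code class="language-plaintext highlighter-rouge">0xFF</code> reserved for NaN); the scale placement and round-to-nearest quantizer below are illustrative assumptions, not necessarily what the repo’s CUDA kernels do:</p>

```python
import math

def e4m3_to_float(byte):
    """Decode one FP8 E4M3 byte (OCP flavor: bias 7, no infinities)."""
    sign = -1.0 if byte // 128 == 1 else 1.0
    exp = (byte // 8) % 16
    man = byte % 8
    if exp == 15 and man == 7:                 # 0x7F / 0xFF: the only NaNs
        return math.nan
    if exp == 0:                               # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Steps 1-2: the 256-entry LUT; decoding a stored weight is a table lookup.
LUT = [e4m3_to_float(b) for b in range(256)]

# Steps 3-4: per-output-channel scale, then saturating quantize to int8.
def decode_scale_quant(u8_row, col_scale, q_step):
    out = []
    for b in u8_row:
        v = LUT[b] * col_scale                 # fp8 bits to real-valued weight
        q = int(round(v / q_step))             # round-to-nearest for IMMA input
        out.append(max(-127, min(127, q)))     # saturate to int8 range
    return out
```

<p>For example, <code class="language-plaintext highlighter-rouge">decode_scale_quant([0x7E, 0x00], 1.0, 1.0)</code> returns <code class="language-plaintext highlighter-rouge">[127, 0]</code>: the largest-magnitude E4M3 value (448.0) saturates to 127.</p> <p>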
Since there are only 256 possible FP8 bytes, decode is conceptually:</p> <ul> <li><code class="language-plaintext highlighter-rouge">u8</code> → <code class="language-plaintext highlighter-rouge">fp16</code> via <code class="language-plaintext highlighter-rouge">LUT[u8]</code></li> </ul> <p>No <code class="language-plaintext highlighter-rouge">__byte_perm</code> tricks here — it’s mostly about making that decode cheap enough to hide behind the tensor-core pipe.</p> <h3 id="scaling--quantization-as-a-first-class-contract">Scaling + quantization as a first-class contract</h3> <p>The weights aren’t “just FP8.” They’re <strong>FP8 bits + per-output-channel scale</strong>. The kernel makes that explicit: decode → apply scale → saturating quantize to int8 → IMMA.</p> <h3 id="stochastic-rounding-sr-important-but-not-implemented-here">Stochastic rounding (SR): important, but not implemented here</h3> <p>If you’re interested in FP8 <em>training dynamics</em>, stochastic rounding matters a lot. This project doesn’t implement SR (no backward pass), but if I were pushing this toward “training-like” experiments on older GPUs, SR would be near the top of the list.</p> <h2 id="glossary">Glossary (quick definitions)</h2> <ul> <li><strong>FP8(E4M3)</strong>: an 8-bit float format. 
Great for storage, not great for high-accuracy math.</li> <li><strong>MMA</strong>: matrix multiply-accumulate (the tensor core instruction family).</li> <li><strong>IMMA / WMMA</strong>: NVIDIA’s tensor core path for int8 matrix multiply (instruction path / CUDA API).</li> <li><strong>cuBLAS / cuBLASLt</strong>: NVIDIA’s GPU linear algebra libraries (GEMM).</li> <li><strong>cp.async</strong>: an Ampere instruction to asynchronously copy from global memory to shared memory.</li> <li><strong>l2pin</strong>: using “persisting L2” cache hints to keep hot tensors resident longer.</li> <li><strong>Per-column scale</strong>: one scale factor per output channel; common in quantized inference.</li> <li><strong>LUT decode</strong>: since there are only 256 FP8 bit patterns, decode can be a table lookup.</li> </ul> <h2 id="the-pipeline-in-one-diagram">The pipeline, in one diagram</h2> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A (fp16/bf16)                B (uint8 fp8-e4m3 bits)         col_scales (u16 bits)
[M,K] row-major              [N,K] (represents KxN col-major)      [N]
      |                               |                               |
      |                               | (LUT in __constant__)         |
      |                               v                               |
      |                        fp8 -&gt; fp16 decode                     |
      |                               |                               |
      |                               +-----------(per-column)--------+
      |                                           scale
      |                               |
      |                               v
      |                        fp16 -&gt; int8 (sat)
      |                               |
      +--------------- int8 A --------+
                      (act quant)
                                      |
                                      v
                            WMMA/IMMA (int8) accumulate (int32)
                                      |
                                      v
                             D (fp16) written as [N,M]
                             (represents MxN col-major)
</code></pre></div></div> <p>If you’ve never written CUDA kernels: that diagram is basically the whole story.</p> <h2 id="baseline-pytorch-decode--matmul">Baseline: PyTorch decode + matmul</h2> <p>Before writing any custom kernel, I wanted a baseline that matches the real workflow:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># weights stored as FP8 bytes
</span><span class="n">B_u8</span> <span class="o">=</span> <span class="p">...</span>  <span class="c1"># [N,K] uint8
</span>
<span class="c1"># decode fp8 -&gt; fp16 every iteration
</span><span class="n">B_fp16</span> <span class="o">=</span> <span class="n">LUT</span><span class="p">[</span><span class="n">B_u8</span><span class="p">]</span> <span class="o">*</span> <span class="n">scales</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span>

<span class="c1"># compute in fp16 using standard matmul
</span><span class="n">out</span> <span class="o">=</span> <span class="n">A</span> <span class="o">@</span> <span class="n">B_fp16</span><span class="p">.</span><span class="n">T</span>
</code></pre></div></div> <p>That’s the easiest FP8-as-storage implementation: store FP8 bytes, decode on demand, then use <abbr title="NVIDIA CUDA Basic Linear Algebra Subprograms (highly optimized GEMM library)">cuBLAS</abbr>.</p> <p>Two additional baselines are useful:</p> <ul> <li><strong>Decode + matmul + downcast output to FP8</strong>: what the pipeline looks like if you want to store the output/activation in FP8.</li> <li><strong>Matmul-only with fp16 weights cached</strong>: not apples-to-apples (you’re no longer storing FP8), but it’s a useful upper bound.</li> </ul> <h3 id="baseline-numbers-pytorch">Baseline numbers (PyTorch)</h3> <p>Measured on RTX 3090 Ti (sm_86), CUDA-visible, shape $M=N=K=4096$.</p> <table> <thead> <tr> <th>Path</th> <th>What it includes</th> <th style="text-align: right">Time / iter</th> <th style="text-align: right">Effective TOPS</th> <th style="text-align: right">Peak alloc</th> </tr> </thead> <tbody> <tr> <td>Fused extension</td> <td>custom kernel (<code class="language-plaintext highlighter-rouge">fp8imma_ext.imma_fp8_v4_act</code>)</td> <td style="text-align: right">2.914 ms</td> <td style="text-align: right">47.17</td> <td style="text-align: right">120.1 MiB</td> </tr> <tr> <td>Naive Torch</td> <td>decode FP8→fp16 each iter + fp16 matmul</td> <td style="text-align: right">2.267 ms</td> <td style="text-align: right">60.63</td> <td style="text-align: right">248.1 MiB</td> </tr> <tr> <td>Naive Torch (end-to-end)</td> <td>decode + fp16 matmul + downcast output to FP8</td> <td style="text-align: right">2.322 ms</td> <td style="text-align: right">59.18</td> <td style="text-align: right">248.1 MiB</td> </tr> <tr> <td>Torch matmul only</td> <td>fp16 weights cached (no decode)</td> <td style="text-align: right">1.828 ms</td> <td style="text-align: right">75.17</td> <td style="text-align: right">120.1 MiB</td> </tr> </tbody> </table> <p>Notes (important, and easy to misread):</p> <ul> <li>The “matmul only” baseline assumes fp16 weights 
are already resident. That defeats the FP8 VRAM savings.</li> <li>“Peak alloc” here is per-call peak allocated bytes; it does not include already-resident fp16 cached weights.</li> </ul> <p>The naive decode+matmul being fast is not a paradox — <abbr title="NVIDIA CUDA Basic Linear Algebra Subprograms (highly optimized GEMM library)">cuBLAS</abbr> is extremely optimized, and the decode step is embarrassingly parallel. My main motivation for the fused kernel is controlling memory traffic and keeping the pipeline “weight storage = FP8 bytes” end-to-end.</p> <h2 id="fusing-it-into-one-kernel">Fusing it into one kernel</h2> <p>Once you accept that IMMA wants int8 fragments, the kernel is a pipeline problem:</p> <ul> <li><strong>Where does decode happen?</strong> (constant memory LUT vs texture vs global)</li> <li><strong>Where does scaling happen?</strong> (apply scale in fp16, or bake it into an int8 conversion)</li> <li><strong>Where does activation quant happen?</strong> (register path vs shared-memory staging)</li> <li><strong>How do you feed tensor cores continuously?</strong> (avoid stalls from decode/scale/quant)</li> </ul> <p>I ended up implementing variants as a way to test hypotheses.</p> <h3 id="variants-experiments">Variants (experiments)</h3> <ul> <li><strong>v2</strong>: baseline fused path (FP8→INT8 JIT + IMMA). 
Keep it simple and measure.</li> <li><strong>v2_i8lut</strong>: “what if I precompute a per-column FP8→INT8 table in shared memory?” (sounds clever; didn’t win).</li> <li><strong>v3_act_f16</strong>: fused activation quantization, register path.</li> <li><strong>v4_act_f16</strong>: <abbr title="cp.async: Ampere async copy from global memory to shared memory">cp.async</abbr> staging for activations + shared-memory quantization, then IMMA.</li> <li><strong>texscale</strong>: load per-column scales via TEX.</li> <li><strong>l2pin</strong>: <abbr title="Persisting-L2 cache hints to keep B/scales resident longer">persisting-L2</abbr> hints for B/scales.</li> </ul> <h3 id="kernel-benchmark-numbers">Kernel benchmark numbers</h3> <p>Measured via <code class="language-plaintext highlighter-rouge">./build/gpu_bench</code> on RTX 3090 Ti (sm_86), driver 590.48.01, CUDA 13.1.</p> <p>Shape: M=N=K=4096, <code>--warmup 10 --iters 50</code>.</p> <table> <thead> <tr> <th>Benchmark</th> <th style="text-align: right">Time / iter</th> <th style="text-align: right">Throughput</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v2</code></td> <td style="text-align: right">2.714 ms</td> <td style="text-align: right">50.63 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v2_l2pin</code></td> <td style="text-align: right">2.744 ms</td> <td style="text-align: right">50.09 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v4_act_f16</code></td> <td style="text-align: right">2.818 ms</td> <td style="text-align: right">48.77 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v4_act_f16_l2pin</code></td> <td style="text-align: right">2.851 ms</td> <td style="text-align: right">48.21 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v4_act_f16_texscale</code></td> <td style="text-align: 
right">2.824 ms</td> <td style="text-align: right">48.66 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v4_act_f16_texscale_l2pin</code></td> <td style="text-align: right">2.854 ms</td> <td style="text-align: right">48.16 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v2_i8lut</code></td> <td style="text-align: right">3.369 ms</td> <td style="text-align: right">40.79 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">imma_fp8_jit_v3_act_f16</code></td> <td style="text-align: right">5.606 ms</td> <td style="text-align: right">24.52 TOPS</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">int8gemm</code> (<abbr title="cuBLASLt: cuBLAS 'Lt' API for flexible/fast GEMMs">cuBLASLt</abbr> baseline)</td> <td style="text-align: right">1.164 ms</td> <td style="text-align: right">118.06 TOPS</td> </tr> </tbody> </table> <p>Notes:</p> <ul> <li><code class="language-plaintext highlighter-rouge">*_l2pin</code> can vary with driver/GPU state and other workloads.</li> <li>The <code class="language-plaintext highlighter-rouge">int8gemm</code> cuBLASLt number is <em>not</em> FP8-as-storage; it’s a ceiling for int8 TC GEMM on this machine.</li> </ul> <h2 id="why-do-this-at-all">Why do this at all?</h2> <p>After seeing the tables, the fair question is:</p> <blockquote> <p>If naive Torch is already fast, why bother?</p> </blockquote> <p>Because “fast” depends on what you’re measuring.</p> <p>On pure matmul throughput, a highly tuned fp16/bf16 GEMM can absolutely win. This project is about a different constraint: <strong>weight storage and weight movement</strong>.</p> <p>If your weights are truly stored in FP8 (1 byte/weight), then compared to fp16/bf16 (2 bytes/weight) you’re targeting <em>up to</em> <strong>2× less weight traffic</strong> and <strong>2× lower resident weight footprint</strong>.
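</p> <p>Back-of-envelope, for the 4096×4096 weight matrices used in the benchmarks above (plain arithmetic; the one-fp16-scale-per-output-channel layout is the one described earlier):</p>

```python
N = K = 4096                       # weight matrix shape [N, K]

fp16_bytes  = N * K * 2            # fp16/bf16 storage: 2 bytes/weight
fp8_bytes   = N * K * 1            # FP8 storage: 1 byte/weight (E4M3 bits)
scale_bytes = N * 2                # one fp16 scale per output channel

total_fp8 = fp8_bytes + scale_bytes
print(fp16_bytes / 2**20)          # 32.0 (MiB)
print(total_fp8 / 2**20)           # 16.0078125 (MiB)
print(fp16_bytes / total_fp8)      # ~2x: the scale overhead is noise
```

<p>The scales cost 8 KiB against a 16 MiB weight blob, so the roughly-2× resident-footprint claim survives the bookkeeping.</p> <p>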
That’s a real lever for memory-bound inference workloads — even if you pay some extra compute to decode/scale/quantize.</p> <p>Practically, the “democratizing FP8 research” win is:</p> <ul> <li>you can keep the storage format honest (FP8 bytes in VRAM)</li> <li>you can experiment with scaling/quantization contracts</li> <li>you can measure the cost of decode/quant instead of hiding it in a pre-processing step</li> </ul> <p>So I view this as a tool for exploration: <strong>FP8-as-storage end-to-end</strong>, on hardware that doesn’t officially “support FP8.”</p> <h2 id="try-it-yourself">Try it yourself</h2> <p>Repo: <a href="https://github.com/poad42/cuda-fp8-ampere">https://github.com/poad42/cuda-fp8-ampere</a></p> <p>Build:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git submodule update <span class="nt">--init</span> <span class="nt">--recursive</span>
cmake <span class="nt">-S</span> <span class="nb">.</span> <span class="nt">-B</span> build <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>Release
cmake <span class="nt">--build</span> build <span class="nt">-j</span>
</code></pre></div></div> <p>Run tests:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd </span>build
ctest <span class="nt">--output-on-failure</span>
</code></pre></div></div> <p>Run the kernel benches:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./build/gpu_bench <span class="nt">--bench</span> imma_fp8_jit_v2 <span class="nt">--M</span> 4096 <span class="nt">--N</span> 4096 <span class="nt">--K</span> 4096 <span class="nt">--warmup</span> 10 <span class="nt">--iters</span> 50
./build/gpu_bench <span class="nt">--bench</span> imma_fp8_jit_v4_act_f16 <span class="nt">--M</span> 4096 <span class="nt">--N</span> 4096 <span class="nt">--K</span> 4096 <span class="nt">--warmup</span> 10 <span class="nt">--iters</span> 50
./build/gpu_bench <span class="nt">--bench</span> imma_fp8_jit_v4_act_f16_texscale <span class="nt">--M</span> 4096 <span class="nt">--N</span> 4096 <span class="nt">--K</span> 4096 <span class="nt">--warmup</span> 10 <span class="nt">--iters</span> 50
</code></pre></div></div> <p>Run the Torch baselines (including end-to-end downcast):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span> .venv_torch_cuda312/bin/activate
python scripts/bench_torch_vs_fp8imma.py <span class="nt">--M</span> 4096 <span class="nt">--N</span> 4096 <span class="nt">--K</span> 4096 <span class="nt">--kChunk</span> 32 <span class="nt">--report_mem</span> <span class="nt">--downcast_out_fp8</span>
</code></pre></div></div> <h2 id="next-steps">Next steps</h2> <p>If I had another weekend:</p> <ul> <li>Add a tiny numerical correctness harness (reference decode + GEMM with tolerances).</li> <li>Report a more honest memory metric: <em>resident weights + peak workspace</em>, not just per-call peak alloc.</li> <li>Try more realistic shapes (transformer-ish M, larger N, varying K) instead of only 4096³.</li> </ul> <hr/> <p>If you want to dig into the code, the repo contains:</p> <ul> <li>a CUDA kernel library (C++ API + C ABI)</li> <li>a benchmark harness (<code class="language-plaintext highlighter-rouge">gpu_bench</code>)</li> <li>a minimal PyTorch extension</li> <li>smoke tests (CTest + torch compile/import test)</li> </ul>]]></content><author><name></name></author><category term="technical-deep-dive"/><category term="cuda"/><category term="gpu-optimization"/><category term="quantization"/><category term="tensor-cores"/><category term="fp8"/><summary type="html"><![CDATA[Storing FP8 weights as bytes on Ampere, decoding via a LUT, scaling, quantizing to INT8, and using IMMA tensor cores—so you can experiment with FP8-like numerics without Hopper.]]></summary></entry><entry><title type="html">Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to Run Vision Transformers</title><link href="https://amohan.dev/blog/2025/shard-optimizing-vision-transformers-edge-npu/" rel="alternate" type="text/html" title="Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to Run Vision Transformers"/><published>2025-12-12T02:59:00+00:00</published><updated>2025-12-12T02:59:00+00:00</updated><id>https://amohan.dev/blog/2025/shard-optimizing-vision-transformers-edge-npu</id><content type="html" xml:base="https://amohan.dev/blog/2025/shard-optimizing-vision-transformers-edge-npu/"><![CDATA[<h2 id="the-unsupported-hardware-problem">The “Unsupported” Hardware Problem</h2> <p>If you look at the spec sheet for the Rockchip RK3588 (the chip inside the Orange Pi 5), it looks like 
a beast. It promises <strong>6 TOPS</strong> of NPU performance. For $100, that’s a steal.</p> <p>But if you try to run modern AI on it—specifically the <strong>Vision Encoder</strong> from <strong>SmolVLM</strong>—that promise falls apart.</p> <p>The standard Computer Vision SDK (<code class="language-plaintext highlighter-rouge">rknn-toolkit2</code>) is optimized for older, predictable CNNs (like ResNet). When I fed it the <strong>SigLIP</strong> Vision Transformer used by SmolVLM, the driver choked. Even though the model is “smol,” the massive Attention matrices it generates triggered cryptic hex errors and refused to compile.</p> <p>This left me with one option: running the model on the CPU. The result? A single image inference took <strong>~30 seconds</strong>. The 6 TOPS accelerator sat idle while the CPU struggled.</p> <p>I didn’t accept that. I decided to reverse-engineer the NPU to find out exactly why it was failing, and how to force it to run at full speed.</p> <h2 id="context-why-do-it-the-hard-way-first-principles">Context: Why do it the hard way? (First Principles)</h2> <p><em>A quick note for those following the ecosystem:</em> You might see projects like <strong>QEngineering</strong> running the newer <strong>SmolVLM-v2</strong> on Rockchip’s <code class="language-plaintext highlighter-rouge">rknn-llm</code> SDK.</p> <p>That approach uses a specialized “black box” toolchain designed specifically for Transformers. Rockchip engineers have likely already implemented complex memory management inside that SDK to handle these models.</p> <p>My project targets the original <strong>SmolVLM-v1</strong>, but more importantly, I built it on the <strong>legacy <code class="language-plaintext highlighter-rouge">rknn-toolkit2</code> stack</strong>. <strong>Why hack the legacy stack?</strong> I wanted to take a “First Principles” approach. I didn’t want to use a black-box solver. 
I wanted to understand <strong>why</strong> the hardware was crashing on Attention layers and if I could find universal architectural patterns—like manual tiling and graph sharding—that could force <em>any</em> Transformer to run on <em>any</em> constrained edge accelerator.</p> <h2 id="the-detective-work-what-is-error-0xe010">The Detective Work: What is Error <code class="language-plaintext highlighter-rouge">0xe010</code>?</h2> <p>Rockchip doesn’t publish a public Instruction Set Architecture (ISA). When I tried to compile the Attention layers, the driver kept spitting out an undocumented error: <code class="language-plaintext highlighter-rouge">REGTASK Overflow (0xe010)</code>.</p> <p>I hypothesized this was a memory overflow. Even though the model parameters are small (~96M), the <strong>intermediate activation matrices</strong> for a 1024-token sequence are huge (~25MB).</p> <p>I wrote a script to generate synthetic ONNX graphs to probe the hardware limits:</p> <ul> <li><strong>8KB Tensor:</strong> Pass.</li> <li><strong>16KB Tensor:</strong> Pass.</li> <li><strong>32KB Tensor:</strong> Pass.</li> <li><strong>32.1KB Tensor:</strong> <strong>CRASH.</strong></li> </ul> <p><strong>Discovery:</strong> The NPU has a hardware-enforced <strong>32KB L1 SRAM Scratchpad</strong> for vector operations.</p> <p>The standard compiler was trying to shove a <strong>25MB</strong> Attention matrix into a <strong>32KB</strong> slot.</p> <h2 id="the-fix-nano-tiling--the-poison-pill">The Fix: Nano-Tiling &amp; The “Poison Pill”</h2> <p>To solve the 32KB limit, I wrote a <strong>“Nano-Tiling”</strong> algorithm in PyTorch. I manually sliced the massive 1024-token sequence into tiny <code class="language-plaintext highlighter-rouge">32x32</code> tiles that fit perfectly into the 32KB scratchpad.</p> <p>But here is where it got messy. 
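</p> <p>A quick sanity check on the tile arithmetic before getting to the messy part (fp16 tiles and the buffer counts below are my assumptions, not something read out of the driver):</p>

```python
SCRATCHPAD = 32 * 1024                    # the probed 32KB L1 SRAM ceiling

def tile_bytes(rows, cols, elem_bytes=2):
    """Scratchpad footprint of one fp16 activation tile."""
    return rows * cols * elem_bytes

full_attn = tile_bytes(1024, 1024)        # a full attention-score tile: 2 MiB
nano      = tile_bytes(32, 32)            # one 32x32 nano-tile: 2 KiB

# The full tile blows the 32KB budget 64x over; a nano-tile leaves room
# for ~16 such buffers (inputs, outputs, double-buffering) under the cap.
overflow = full_attn / SCRATCHPAD         # 64.0
fits     = SCRATCHPAD // nano             # 16
```

<p>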
The <code class="language-plaintext highlighter-rouge">rknn</code> compiler is “smart.” It looked at my tiled graph, decided it was inefficient, and fused the operators back together into a single giant block… which immediately crashed the hardware again.</p> <p>I had to trick the compiler. I needed a way to tell it: <em>“Do not merge these nodes.”</em></p> <p>I introduced a topological barrier I call the <strong>“Poison Pill.”</strong> I injected a dummy operation that looks mathematically significant to the dependency graph (preventing fusion) but is mathematically irrelevant to the model output.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The "Poison Pill"
# 1. Take a slice (forcing a strided access)
</span><span class="n">slice_x</span> <span class="o">=</span> <span class="n">x</span><span class="p">[...,</span> <span class="p">:</span><span class="mi">1</span><span class="p">]</span>

<span class="c1"># 2. Apply a non-linear op (breaks compiler fusion heuristics)
# 3. Scale it down to near-zero so it doesn't affect the math
</span><span class="n">poison</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">sigmoid</span><span class="p">(</span><span class="n">slice_x</span><span class="p">)</span> <span class="o">*</span> <span class="mf">1e-6</span> 

<span class="c1"># 4. Inject dependency
# The compiler sees 'out' depends on 'poison' and creates a barrier.
</span><span class="n">out</span> <span class="o">=</span> <span class="n">out</span> <span class="o">+</span> <span class="n">poison</span>
</code></pre></div></div> <p>By injecting this into the graph, I successfully forced the compiler to respect my tiling logic.</p> <h2 id="the-siglip-cliff-solving-accuracy-collapse">The “SigLIP Cliff”: Solving Accuracy Collapse</h2> <p>Getting it to run was step one. Getting it to be <em>right</em> was step two. When I first got the NPU running, the output was garbage. The cosine similarity compared to the original model was <strong>0.02</strong> (pure noise).</p> <p>The culprit was the architecture of <strong>SigLIP</strong>. Unlike standard models, SigLIP has massive activation “spikes” (values around <strong>300.0</strong>) sitting next to tiny visual signals (values around <strong>0.05</strong>).</p> <p>NPU quantization (INT8) works by mapping the range to -128/+127.</p> <ul> <li>If you zoom out to capture the <strong>300.0</strong>, the <strong>0.05</strong> rounds down to 0. <strong>Signal lost.</strong></li> <li>If you zoom in to capture the <strong>0.05</strong>, the <strong>300.0</strong> overflows to infinity. <strong>Math crash.</strong></li> </ul> <p>I implemented a <strong>“Sandwich” Domain Shift</strong>:</p> <ol> <li><strong>CPU Pre-Scale:</strong> Multiply the input by <code class="language-plaintext highlighter-rouge">0.1</code>. 
Now the max value is 30.0 (Safe for FP16).</li> <li><strong>NPU Execution:</strong> Run the heavy compute in this scaled-down “safe zone.”</li> <li><strong>CPU Post-Scale:</strong> Multiply the output by <code class="language-plaintext highlighter-rouge">10.0</code>.</li> </ol> <p>This simple trick restored the signal fidelity from 0.02 to <strong>0.999</strong> (effectively bit-exact).</p> <h2 id="the-architecture-custom-runtime-scheduler">The Architecture: Custom Runtime Scheduler</h2> <p>Finally, to bypass driver timeouts caused by the sheer number of tiles (thousands of tiny operations), I physically cut the model graph into <strong>26 separate binary files</strong> (shards).</p> <p>I wrote a custom <strong>User-Space Runtime</strong> in Python that acts as an orchestrator. It manually loads these shards onto the RK3588’s 3 separate NPU cores and fires them in a synchronized round-robin schedule (Core 0 -&gt; Core 1 -&gt; Core 2).</p> <h2 id="the-results">The Results</h2> <p>By ignoring the vendor’s “Unsupported” warnings and re-architecting the software to match the silicon’s physical reality, the results were drastic.</p> <table> <thead> <tr> <th style="text-align: left">Metric</th> <th style="text-align: left">CPU Baseline (PyTorch)</th> <th style="text-align: left">SHARD (My Method)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Latency</strong></td> <td style="text-align: left">~30.0 seconds</td> <td style="text-align: left"><strong>&lt; 1.8 seconds</strong></td> </tr> <tr> <td style="text-align: left"><strong>Speedup</strong></td> <td style="text-align: left">1x</td> <td style="text-align: left"><strong>15x</strong></td> </tr> <tr> <td style="text-align: left"><strong>Accuracy</strong></td> <td style="text-align: left">Reference</td> <td style="text-align: left"><strong>0.999 (FP32 Match)</strong></td> </tr> </tbody> </table> <h2 id="conclusion">Conclusion</h2> <p>This project challenged the binary notion of “Supported Hardware.” 
The RK3588 didn’t support the SigLIP encoder out of the box on the standard SDK, but the silicon was always capable of it. It just needed an engineer to dig into the register overflow codes and manage the memory manually.</p> <p>If you want to see the full code, including the tiling logic and the runtime orchestrator, check out the repo below.</p> <p><a href="https://github.com/poad42/smolvlm_rk3588_full_npu_native"><strong>View Source on GitHub</strong></a></p>]]></content><author><name></name></author><category term="technical-deep-dive"/><category term="edge-ai"/><category term="npu"/><category term="optimization"/><category term="transformers"/><category term="hardware"/><category term="reverse-engineering"/><summary type="html"><![CDATA[Reverse-engineering the Rockchip RK3588 NPU to run SmolVLM 15x faster by discovering hardware limits, defeating compiler optimizations, and building a custom sharding runtime]]></summary></entry></feed>