Adhitya's Blog

Technical deep-dives and project insights

Backporting FP8 to the RTX 3090 (No H100 Required)

Storing FP8 weights as bytes on Ampere, decoding via a LUT, scaling, quantizing to INT8, and using IMMA tensor cores—so you can experiment with FP8-like numerics without Hopper.

12 min read · January 25, 2026

2026 · cuda gpu-optimization quantization tensor-cores fp8 · technical-deep-dive

Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to Run Vision Transformers

Reverse-engineering the Rockchip RK3588 NPU to run SmolVLM 15x faster by discovering hardware limits, defeating compiler optimizations, and building a custom sharding runtime.

6 min read · December 12, 2025

2025 · edge-ai npu optimization transformers hardware reverse-engineering · technical-deep-dive