Backporting FP8 to the RTX 3090 (No H100 Required)
Storing FP8 weights as bytes on Ampere, decoding them via a lookup table (LUT), scaling, quantizing to INT8, and using the INT8 (IMMA) tensor cores, so you can experiment with FP8-like numerics without Hopper.
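As a rough illustration of the decode, scale, and quantize steps named in the summary above, here is a minimal CPU-side sketch in plain Python. It is not the post's implementation: the E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, a single NaN pattern per sign, no infinities), the per-tensor scale, the clamp to [-127, 127], and the names `decode_e4m3`, `E4M3_LUT`, and `fp8_bytes_to_int8` are all assumptions made for the example. The IMMA matrix multiply itself is not shown.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one byte as FP8 E4M3 (assumed format: bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return math.nan  # the only NaN encoding per sign in this layout
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6.
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


# 256-entry lookup table: one float per possible byte value.
E4M3_LUT = [decode_e4m3(b) for b in range(256)]


def fp8_bytes_to_int8(raw: bytes, scale: float) -> list[int]:
    """Decode FP8 bytes via the LUT, scale, and quantize to the INT8 range."""
    out = []
    for b in raw:
        value = E4M3_LUT[b]
        if math.isnan(value):
            out.append(0)  # assumption: treat NaN weights as zero
            continue
        q = round(value * scale)  # Python rounds halves to even
        out.append(max(-127, min(127, q)))  # symmetric clamp
    return out


# Hypothetical usage: map the largest E4M3 magnitude (448) onto 127.
weights = bytes([0x38, 0xB8, 0x7E, 0x00])  # 1.0, -1.0, 448.0, 0.0
int8_weights = fp8_bytes_to_int8(weights, scale=127.0 / 448.0)
```

On a GPU the same table would more plausibly sit in shared or constant memory, with the resulting INT8 values feeding the integer tensor-core path; that part depends on details the summary does not give.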
Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to Run Vision Transformers
Reverse-engineering the Rockchip RK3588 NPU to run SmolVLM 15x faster by discovering hardware limits, defeating compiler optimizations, and building a custom sharding runtime.