I've been optimizing ternary operations for BitNet b1.58 and found significant overhead in the current implementation. To address it, I wrote a dependency-free C kernel (sparse-ternary-fma) that uses a 2-bit weight encoding and AVX-512 instructions.

Benchmarks on an Intel Xeon (N=4096):

- Throughput (dense): 2.38x faster (8.21 GFLOPS vs 3.45 GFLOPS for the AVX2 baseline)
- Throughput (sparse, 80% zeros): 26.12x faster (23.25 GFLOPS vs 0.89 GFLOPS for the scalar baseline)
- Memory: 4x denser (2 bits per weight vs the standard 8 bits)

The kernel packs 4 trits per byte and uses a sparsity-aware FMA that skips zero-valued weights, which is critical for 1.58-bit quantization efficiency.

A PR is pending on the Microsoft BitNet repo. The code is open source here: https://github.com/microsoft/BitNet/pull/365
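To make the packing and zero-skipping idea concrete, here is a minimal scalar sketch. It is not the code from the PR: the 2-bit code assignment (00 = 0, 01 = +1, 10 = -1), the function names pack_trits and ternary_dot, and the float activations are all assumptions chosen for illustration. The actual kernel uses AVX-512, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-bit codes; the real kernel's encoding may differ. */
enum { TRIT_ZERO = 0x0, TRIT_POS = 0x1, TRIT_NEG = 0x2 };

/* Pack n ternary weights (-1, 0, +1) into 2-bit fields, 4 per byte.
 * `packed` must hold at least (n + 3) / 4 bytes. */
static void pack_trits(const int8_t *w, size_t n, uint8_t *packed)
{
    for (size_t i = 0; i < (n + 3) / 4; i++)
        packed[i] = 0;

    for (size_t i = 0; i < n; i++) {
        assert(w[i] >= -1 && w[i] <= 1);
        uint8_t code = (w[i] > 0) ? TRIT_POS
                     : (w[i] < 0) ? TRIT_NEG
                                  : TRIT_ZERO;
        /* Weight i occupies bits 2*(i%4) .. 2*(i%4)+1 of byte i/4. */
        packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}

/* Scalar reference: dot product of packed trits with activations x.
 * A ternary multiply reduces to add, subtract, or skip. */
static float ternary_dot(const uint8_t *packed, const float *x, size_t n)
{
    float acc = 0.0f;

    for (size_t i = 0; i < n; i += 4) {
        uint8_t byte = packed[i / 4];
        if (byte == 0)
            continue; /* all four weights are zero: skip the group */

        for (size_t j = 0; j < 4 && i + j < n; j++) {
            unsigned code = (byte >> (2 * j)) & 0x3u;
            if (code == TRIT_POS)
                acc += x[i + j];
            else if (code == TRIT_NEG)
                acc -= x[i + j];
        }
    }
    return acc;
}
```

With weights {1, 0, -1, 0, 0, 0, 0, 0}, the first packed byte is 0x21 and the second is 0x00, so the second group of four is skipped with a single byte compare. A vectorized version would presumably apply the same test to many weights at once using mask registers, but that part is left out here.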