Transformer operations optimized by fusing computations into matrix multiplications

Original: After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transf
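The quoted post is cut off and does not show the rewrite itself, so the following is only a minimal sketch of what "GEMM + epilogue" usually means: a matrix multiply whose bias add and activation are applied to each accumulator before the result is stored. The names gemm_with_epilogue, unfused_reference, relu, and gelu are hypothetical and chosen for illustration. They are not taken from the post or from any library, and plain Python lists stand in for GPU tiles.

```python
import math


def gemm_with_epilogue(a, b, bias, epilogue):
    """Return epilogue(a @ b + bias), fusing the epilogue into the GEMM loop.

    a is m x k, b is k x n, bias has length n. The epilogue runs on each
    accumulator before it is stored, so the raw product is never
    materialized as a separate matrix.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused step: bias add and activation on the fresh accumulator.
            out[i][j] = epilogue(acc + bias[j])
    return out


def unfused_reference(a, b, bias, epilogue):
    """Same math in three separate passes, each producing a full matrix."""
    m, k, n = len(a), len(b), len(b[0])
    prod = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
    biased = [[prod[i][j] + bias[j] for j in range(n)] for i in range(m)]
    return [[epilogue(x) for x in row] for row in biased]


def relu(x):
    return x if x > 0.0 else 0.0


def gelu(x):
    # Exact (erf-based) GELU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    bias = [0.0, -10.0]
    print(gemm_with_epilogue(a, b, bias, relu))
```

Both functions compute the same values. The fused version only differs in that it avoids the intermediate matrices, which on real hardware would mean fewer round trips to memory. This sketch covers element-wise epilogues only; an epilogue such as softmax also needs a reduction across each output row, which is not shown here.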

Source: x.com
