CODA Rewrites Transformer Blocks as GEMM-Epilogue Programs for Training Efficiency
Tags: AI · Infrastructure
Authors from MIT (including Tri Dao and Yoon Kim) propose CODA, a GPU kernel abstraction that reparameterizes Transformer non-attention operations as GEMM-plus-epilogue programs. The approach covers nearly all non-attention computation in forward and backward passes, addressing the memory-bound bottleneck from normalization, activations, and residual updates. Both human- and LLM-authored CODA kernels achieve high performance.
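To make the "GEMM-plus-epilogue" idea concrete, here is a minimal sketch in pure Python. It is not CODA's API or its kernels: the names `unfused`, `gemm_with_epilogue`, and `epilogue`, the choice of bias + GELU + residual as the fused work, and the tile size are all assumptions made for illustration. It shows only the dataflow: the elementwise work is applied to each output tile while that tile is still held locally, so the output is written once instead of being re-read by separate passes.

```python
import math


def gelu(x):
    # tanh approximation of GELU, used here only as an example activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def unfused(a, b, bias, residual):
    """Reference path: a GEMM followed by two separate full passes over the output."""
    m, k, n = len(a), len(b), len(b[0])
    # Pass 1: plain matrix multiply, output fully materialized
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)] for i in range(m)]
    # Pass 2: bias + activation (re-reads and re-writes every element)
    c = [[gelu(c[i][j] + bias[j]) for j in range(n)] for i in range(m)]
    # Pass 3: residual add (another full read and write)
    return [[c[i][j] + residual[i][j] for j in range(n)] for i in range(m)]


def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled GEMM that applies `epilogue(value, row, col)` before the single store.

    `acc` stands in for the on-chip accumulator of a GPU kernel (an analogy,
    not a model of real hardware).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Accumulate one output tile locally
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    for j in cols:
                        acc[i, j] += a[i][p] * b[p][j]
            # Epilogue: elementwise work happens here, then one write per element
            for (i, j), value in acc.items():
                out[i][j] = epilogue(value, i, j)
    return out


# Hypothetical example data: 3x3 operands so the 2x2 tiling has ragged edges
a = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-2.0, 1.0, 0.75]]
b = [[1.0, 0.5, -1.0], [0.0, 2.0, 1.0], [-0.5, 1.0, 0.25]]
bias = [0.1, -0.2, 0.3]
residual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

fused = gemm_with_epilogue(a, b, lambda v, i, j: gelu(v + bias[j]) + residual[i][j])
reference = unfused(a, b, bias, residual)
```

Both paths compute the same values up to floating-point rounding. The difference a real kernel would care about is memory traffic, which this sketch does not measure.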
Technical significance
Expressing non-attention Transformer operations as GEMM epilogues could yield meaningful training speedups, particularly for long-context models where memory bandwidth is the binding constraint. That LLM-authored CODA kernels also achieve high performance suggests that AI-assisted GPU kernel optimization is becoming practical.