A running set of technical posts on efficient transformer systems, scientific machine learning, and implementation details.
Scaling attention to 1M tokens on a single GPU
The story of FLARE: Fast Low-rank Attention Routing Engine
Attention, the core mechanism of transformers, becomes prohibitively expensive at large sequence lengths. This post explains the ideas behind FLARE, an attention operator designed to retain global communication while scaling to million-token regimes on a single GPU. Rather than focusing on architectural novelty for its own sake, the goal is to understand attention as a communication operator and ask a simple question: can we keep the benefits of global message passing without paying the full quadratic cost? ...
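To make the quadratic cost concrete, here is a minimal NumPy sketch of standard scaled dot-product attention. This is not the FLARE operator; the function and variable names are illustrative assumptions. The point is that the (N, N) score matrix dominates both compute and memory as the sequence length N grows.

```python
# Minimal sketch (not FLARE): standard scaled dot-product attention,
# illustrating the quadratic cost discussed in the post.
# Names and shapes are illustrative, not taken from any FLARE code.
import numpy as np

def standard_attention(Q, K, V):
    # Q, K, V: (N, d). The score matrix S is (N, N), so compute and
    # memory both scale as O(N^2) in the sequence length N.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (N, N) score matrix
    S = S - S.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)     # row-wise softmax
    return P @ V                           # (N, d) output

# At N = 1,000,000 tokens, the (N, N) float32 score matrix alone would
# occupy roughly 4 TB, which is why sub-quadratic operators are needed
# to reach million-token regimes on a single GPU.
N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (4096, 64)
```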