A running set of technical posts on efficient transformer systems, scientific machine learning, and implementation details.
Scaling attention to 1M tokens on a single GPU
The story of FLARE: Fast Low-rank Attention Routing Engine
Attention, the core mechanism of transformers, becomes prohibitively expensive at large sequence lengths. This post explains the ideas behind FLARE, an attention operator designed to retain global communication while scaling to million-token regimes on a single GPU. Rather than focusing on architectural novelty for its own sake, the goal is to understand attention as a communication operator and ask a simple question: can we keep the benefits of global message passing without paying the full quadratic cost? ...
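To make the quadratic cost concrete, here is a minimal NumPy sketch of standard scaled dot-product attention. This is not the FLARE operator; the function and variable names are illustrative assumptions. The point is that the (N, N) score matrix dominates both compute and memory as the sequence length N grows.

```python
# Minimal sketch (not FLARE): standard scaled dot-product attention,
# illustrating the quadratic cost discussed in the post.
# Names and shapes are illustrative, not taken from any FLARE code.
import numpy as np

def standard_attention(Q, K, V):
    # Q, K, V: (N, d). The score matrix S is (N, N), so compute and
    # memory both scale as O(N^2) in the sequence length N.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (N, N) score matrix
    S = S - S.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)     # row-wise softmax
    return P @ V                           # (N, d) output

# At N = 1,000,000 tokens, the (N, N) float32 score matrix alone would
# occupy roughly 4 TB, which is why sub-quadratic operators are needed
# to reach million-token regimes on a single GPU.
N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (4096, 64)
```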