
Vedant Puri
PhD Candidate, Carnegie Mellon University
Efficient Transformer Architectures | Scientific Machine Learning
I design transformer architectures with explicit attention to scaling and memory efficiency. My recent work, FLARE, enables million-token regimes on a single GPU. I implement new architectures directly in PyTorch and Triton. My background spans high-performance computing, numerical analysis, and computational fluid dynamics.
Research Interests
- Efficient attention architectures
- Numerical methods for ML and for PDEs
- Scientific machine learning
Featured Work
FLARE - Fast Low-rank Attention Routing Engine
- Derived a flexible low-rank reformulation of self-attention via latent routing (a minimal sketch follows this list).
- Reduced the cost of global communication in self-attention from quadratic to linear in sequence length.
- Demonstrated scaling to 1M tokens on a single H100 GPU, with over a 200x speedup over vanilla self-attention.
- Implemented attention modules in PyTorch and Triton with reproducible scaling experiments.
- Evaluated across PDE surrogate modeling, NLP, and vision benchmarks.
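A minimal PyTorch sketch of the latent-routing idea, reading it as gather/scatter cross-attention through a small set of learned latent tokens, which is what makes global communication linear in sequence length. This is an illustration only, not the released FLARE code; the class name, `num_latents`, and the gather/scatter split are placeholders.

```python
import torch
import torch.nn as nn

class LatentRoutingAttention(nn.Module):
    """Hypothetical module: tokens communicate only through M learned latent tokens."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * dim ** -0.5)
        # Latents attend to all tokens (gather), then tokens attend to the latents (scatter).
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); cost is O(seq_len * num_latents) rather than O(seq_len^2).
        lat = self.latents.expand(x.shape[0], -1, -1)   # (batch, num_latents, dim)
        lat, _ = self.gather(lat, x, x)                 # latents summarize the sequence
        out, _ = self.scatter(x, lat, lat)              # tokens read back from the latents
        return out

x = torch.randn(2, 4096, 256)
y = LatentRoutingAttention(dim=256)(x)                  # y.shape == (2, 4096, 256)
```

For a fixed number of latents, both memory and compute grow linearly with sequence length, which is the property exploited in the million-token experiments above.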

FLARE for Language Modeling (Ongoing dissertation work)
Decoder-only architectures | 2025–Present
- Extending FLARE to decoder-only next-token prediction with causal attention.
- Enabling an adaptive attention-state size to control memory and compute during training and inference (see the sketch after this list).
- Implementing fused Triton kernels for causal training, prefill, and decode.
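A generic illustration of why a fixed-size attention state matters at decode time, using a standard linear-attention-style recurrence: per-layer memory stays constant instead of growing with the number of generated tokens. This is not the FLARE decoder; it is only a sketch of the memory trade-off, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

def decode_step(state, norm, q, k, v):
    """One autoregressive step with a constant-size state.

    state: (d, d) running sum of outer(k, v); norm: (d,) running sum of k.
    q, k, v: (d,) features of the current token; q and k are non-negative feature maps.
    """
    state = state + torch.outer(k, v)        # update cost and memory do not grow with history
    norm = norm + k
    out = (q @ state) / (q @ norm + 1e-6)    # read-out for the current query
    return state, norm, out

d = 64
state, norm = torch.zeros(d, d), torch.zeros(d)
for _ in range(8):                            # decode loop: memory stays O(d*d), not O(seq_len*d)
    q, k, v = torch.randn(3, d)
    q, k = F.elu(q) + 1, F.elu(k) + 1         # positive feature maps, linear-attention style
    state, norm, out = decode_step(state, norm, q, k, v)
```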
Hybrid equation-based + data-driven PDE modeling framework
- Introduced smooth neural fields as nonlinear spatial ansatz functions in equation-based reduced-order modeling.
- Retained physics-based Galerkin time evolution while learning expressive low-dimensional representations (see the sketch after this list).
- Attained a 200x speedup over full-order simulations in transport-dominated regimes.
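A minimal sketch of one way to evolve a neural-field ansatz with a Galerkin-style projection: the solution is represented as u(x; theta(t)), and dtheta/dt is obtained by least-squares projection of the PDE right-hand side onto the network's tangent space. This uses a placeholder right-hand side and explicit Euler in time, and is illustrative rather than the framework's actual implementation.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, jacrev

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
params = dict(net.named_parameters())
x = torch.linspace(-1.0, 1.0, 128).unsqueeze(-1)            # collocation points

def u_of_theta(flat_theta):
    # Rebuild the parameter dict from a flat vector, then evaluate the ansatz at x.
    theta, i = {}, 0
    for name, p in params.items():
        n = p.numel()
        theta[name] = flat_theta[i:i + n].view_as(p)
        i += n
    return functional_call(net, theta, (x,)).squeeze(-1)     # (num_points,)

def rhs(u):
    # Placeholder PDE right-hand side f(u); a real solver would assemble spatial derivatives.
    return -u

flat = torch.cat([p.detach().flatten() for p in params.values()])
dt = 1e-2
for _ in range(10):                                          # explicit Euler in time
    J = jacrev(u_of_theta)(flat)                             # tangent-space basis, (num_points, num_params)
    f = rhs(u_of_theta(flat))
    dtheta = torch.linalg.lstsq(J, f.unsqueeze(-1)).solution.squeeze(-1)  # Galerkin/least-squares projection
    flat = flat + dt * dtheta
```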

Previous Work: Computational fluid dynamics on HPC systems
I previously worked on turbulence simulation and analysis workflows in high-performance computing settings, with emphasis on spectral element methods and large-scale post-processing. This background in numerical methods and PDE solvers informs how I design stable and efficient transformer architectures for scientific ML.
Velocity magnitude for flow past a wall-mounted cube at a Reynolds number of 3,900 based on cube height. Computation performed with the spectral element code NEK5000 at the Argonne Leadership Computing Facility.
Not Work
Not So Up-to-Date Photography Portfolio
For the past decade, I have used a Canon DSLR as an excuse to walk around and photograph people, geometry, and city texture.
Hobbies
- Sports: squash, golf, CrossFit
Open Source
Julia Open Source Tools
- SciMLOperators.jl: Operator abstractions for SciML and PDE workflows
- LinearSolve.jl: Linear solver interface for scientific machine learning
Below is a non-exhaustive list of Julia projects that I have contributed to.