The race to make large language models faster and cheaper to run has largely been fought at two levels: the model architecture and the hardware. But there is a third, often underappreciated frontier: the GPU kernel. A kernel is the low-level computational routine that actually executes a mathematical operation on the GPU. Writing a good one requires understanding not just the math, but the exact memory layout, instruction scheduling, and hardware quirks of the chip you are targeting. Most ML practitioners never write kernels directly; they rely on libraries like FlashAttention or Triton to do it for them.
Meet FlashQLA, QwenLM's contribution to this layer. Released under the MIT License and built on the TileLang compiler framework, it is a high-performance linear attention kernel library specifically optimized for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.
What Is Linear Attention and Why Does It Matter?
To understand what FlashQLA solves, it helps to understand what standard softmax attention costs. In a conventional Transformer, the attention mechanism has O(n²) complexity, meaning that doubling the sequence length quadruples the computation. This is the fundamental bottleneck that makes processing long documents, long code files, or long conversations expensive.
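To make the quadratic cost concrete, here is a minimal PyTorch sketch of single-head softmax attention (illustrative only, ignoring causal masking and not taken from any of the codebases discussed here). The (n, n) score matrix is the O(n²) term:

```python
import torch

# Standard softmax attention for a single head: the (n, n) score matrix
# is the O(n^2) cost in both compute and memory.
n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))

scores = (q @ k.T) / d ** 0.5            # (n, n): doubling n quadruples this
weights = torch.softmax(scores, dim=-1)  # row-wise normalization
out = weights @ v                        # (n, d) outputs
```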
Linear attention replaces the softmax with a formulation that reduces this to O(n) complexity, making it scale much more favorably with sequence length. The Gated Delta Network (GDN) is one such linear attention mechanism, and it has been integrated into Qwen's hybrid model architecture, where GDN layers alternate with standard full attention layers. This hybrid design attempts to get the best of both worlds: the expressiveness of full attention where it is most needed, and the efficiency of linear attention everywhere else.
GDN uses what is called a 'gated' formulation: it applies an exponentially decaying gate to control how much past context is carried forward. This gate is key to how FlashQLA achieves its performance gains.
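The flavor of this can be sketched as a recurrence over a fixed-size state. The following is a deliberately simplified illustration (a scalar decay gate per token), not the exact GDN delta-rule update:

```python
import torch

# Simplified gated linear attention: a fixed-size (d, d) state replaces
# the (n, n) score matrix, so the loop is O(n) in sequence length.
n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
a = torch.sigmoid(torch.randn(n))  # per-token decay gates in (0, 1)

S = torch.zeros(d, d)              # running state: size independent of n
out = torch.empty(n, d)
for t in range(n):
    # Decay the past, then write the current key/value outer product.
    S = a[t] * S + torch.outer(k[t], v[t])
    out[t] = q[t] @ S              # read the state with the query
```

Because each step multiplies the state by a gate below one, a token's influence shrinks exponentially with distance, which is the property FlashQLA later exploits.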
The Problem with Existing Kernels
Before FlashQLA, the standard implementation for GDN operations came from the Flash Linear Attention (FLA) library, which uses Triton kernels (Triton being OpenAI's Python-based GPU programming language). While Triton makes kernel authoring more accessible, it comes with trade-offs: the kernels it produces are not always optimally scheduled for specific hardware, particularly on NVIDIA's Hopper architecture (the H100 and H200 GPU generation).
The Hopper architecture introduced new features like warpgroup-level Tensor Core operations and asynchronous data pipelines that Triton cannot always exploit to their full potential. This is the gap FlashQLA is designed to fill.
What FlashQLA Does Differently
FlashQLA applies operator fusion and performance optimization to both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN Chunked Prefill. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared with the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs.
Three technical innovations drive these gains:
1. Gate-driven automatic intra-card context parallelism: Context parallelism (CP) refers to splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split mathematically valid, because the gate's decay means that tokens far apart in a sequence have diminishing influence on each other (a simplified chunked formulation is sketched after this list). This allows FlashQLA to automatically enable intra-card CP under tensor parallelism (TP), long-sequence, and small-head-count settings, improving GPU Streaming Multiprocessor (SM) utilization without requiring manual configuration.
2. Hardware-friendly algebraic reformulation: FlashQLA reformulates, to a certain extent, the mathematical computation of GDN Chunked Prefill's forward and backward flows to reduce overhead on three types of GPU hardware units: Tensor Cores (which handle matrix multiplications), CUDA Cores (which handle scalar and vector operations), and the Special Function Unit (SFU, which handles operations like exponentials and square roots). Critically, this is done without sacrificing numerical precision, an important guarantee when the reformulation is being used for model training.
3. TileLang fused warp-specialized kernels: Rather than decomposing the computation into independent sequential kernels (too slow) or fusing everything into a single monolithic kernel (too rigid to optimize), FlashQLA takes a middle path. It uses TileLang to build several key fused kernels and manually implements warpgroup specialization, a technique that assigns different warpgroups (groups of 128 threads on Hopper) to specialized roles, such as one warpgroup moving data from global memory to shared memory while another concurrently runs Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA Core computation is what allows FlashQLA to approach the theoretical peak throughput of the hardware.
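To see why the gate makes splitting a sequence valid, here is a hedged sketch extending the toy recurrence above into chunked form: within a chunk the work collapses into dense causal matmuls (the shape Tensor Cores like), and only a small decayed state crosses chunk boundaries. The function name and the scalar-gate simplification are mine for illustration; this is not FlashQLA's actual update rule or kernel schedule:

```python
import torch

def chunked_gated_linear_attention(q, k, v, a, chunk=256):
    """Illustrative chunked form of the gated recurrence sketched earlier."""
    n, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype)  # state carried across chunks
    out = torch.empty_like(v)
    for s in range(0, n, chunk):
        qc, kc, vc = q[s:s+chunk], k[s:s+chunk], v[s:s+chunk]
        g = torch.cumprod(a[s:s+chunk], dim=0)   # cumulative decay in chunk
        # Contribution of everything before this chunk, via the state,
        # attenuated by the cumulative gate.
        inter = (g[:, None] * qc) @ S
        # Causal attention within the chunk as one dense matmul.
        # NOTE: dividing by the cumulative product is numerically naive;
        # real kernels keep this stable (e.g., log-space gates).
        scores = (g[:, None] * qc) @ (kc / g[:, None]).T
        intra = torch.tril(scores) @ vc
        out[s:s+len(qc)] = inter + intra
        # Decay the old state and absorb this chunk's keys/values.
        S = g[-1] * S + ((g[-1] / g)[:, None] * kc).T @ vc
    return out
```

Conceptually, it is these per-chunk dense matmuls that a warp-specialized kernel can overlap with the asynchronous loads of the next chunk's data.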
Benchmarks
FlashQLA was benchmarked against two baselines: the FLA Triton kernel (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8, on NVIDIA H200 GPUs. The benchmarks used the head configurations from the Qwen3.5 and Qwen3.6 model families, with head dimensions hv ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 through TP8.
The forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths. The backward (BWD) benchmarks examine the relationship between total token count within a batch and latency for a single update step.
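For readers who want to reproduce this kind of measurement, a standard CUDA-event timing loop looks like the following. The `run_kernel` callable is a hypothetical stand-in for the kernel under test; the actual FlashQLA and FLA entry points are documented in their repos and are not shown here:

```python
import torch

def mean_kernel_latency_ms(run_kernel, warmup=10, iters=100):
    """Generic single-kernel latency measurement with CUDA events."""
    for _ in range(warmup):          # warm up: JIT compilation, caches
        run_kernel()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_kernel()
    end.record()
    torch.cuda.synchronize()         # wait until all launches complete
    return start.elapsed_time(end) / iters   # mean latency in ms
```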
Key Takeaways
- FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang, specifically optimized for the Gated Delta Network (GDN) Chunked Prefill forward and backward passes.
- It achieves a 2–3× forward speedup and a 2× backward speedup over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs (SM90+), with efficiency gains most pronounced in pretraining and edge-side agentic inference.
- Three core innovations drive the performance gains: gate-driven automatic intra-card context parallelism; hardware-friendly algebraic reformulation that reduces Tensor Core, CUDA Core, and SFU overhead without losing numerical precision; and TileLang fused warp-specialized kernels that overlap data movement, Tensor Core computation, and CUDA Core computation.
- GDN is a linear attention mechanism with O(n) complexity, used in Qwen's hybrid model architecture alongside standard full attention layers, making efficient GDN kernels critical for both training and long-context inference at scale.
- FlashQLA is open-source under the MIT License and requires SM90 or above, CUDA 12.8+, and PyTorch 2.8+, with a simple pip install and both high-level and low-level Python APIs available for integration.
Check out the GitHub Repo and Technical details.
