The team behind Kimi.ai (Moonshot AI) has just made a significant contribution to the open-source AI infrastructure space. They released FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The FlashKDA library is available on GitHub under an MIT license. It delivers prefill speedups of 1.72× to 2.22× over the flash-linear-attention baseline on NVIDIA H20 GPUs, and works as a drop-in backend for the popular flash-linear-attention library.
What Is Kimi Delta Attention, and Why Does It Matter?
To understand FlashKDA, it helps to first understand where it sits in the LLM attention landscape.
Standard softmax attention has quadratic complexity with respect to sequence length, meaning that as you feed longer context into a model, compute costs grow extremely fast. This has driven a wave of research into linear attention mechanisms, which approximate or replace the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space: a linear attention mechanism that refines Gated DeltaNet with a finer-grained, channel-wise gating mechanism, enabling more effective use of limited finite-state RNN memory.
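As a rough schematic (not KDA's exact published update rule), the contrast between the two families looks like this. Softmax attention materializes a T × T score matrix:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad \text{cost } O(T^{2}d).$$

Gated linear attention instead carries a fixed-size recurrent state with a per-step decay, so cost grows linearly in $T$:

$$S_t = \operatorname{diag}(\alpha_t)\, S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t, \qquad \text{cost } O(T d^{2}).$$

KDA belongs to the second family: its channel-wise gating makes the decay $\alpha_t$ a full per-channel vector rather than a scalar, and the actual update also includes a delta-rule correction modulated by beta, which is omitted from this sketch; consult the Kimi Linear paper for the precise formulation.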
KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio, three KDA layers for every one global attention layer, which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at 1 million context length compared to full attention. FlashKDA is the production-grade CUDA kernel that makes that architecture fast during prefill.
Concretely, the KDA forward pass takes in queries (q), keys (k), values (v), a gate before activation (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate parameter per head), dt_bias (gate bias), and lower_bound (gate lower bound, ranging from -5.0 to 0). The sigmoid activation on beta is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states, which is useful for multi-turn inference where you want to carry state across requests.
The recurrent formulation means the model can efficiently process long sequences during generation. But efficient prefill of these architectures still requires highly optimized GPU kernels, which is exactly what FlashKDA delivers.
Under the Hood: CUTLASS on Hopper
FlashKDA is built on CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS allows developers to write kernels that take full advantage of NVIDIA's Tensor Core architecture, and it is the same foundation used by libraries like FlashAttention-3.
The library targets SM90 and above, meaning NVIDIA's Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is predominantly CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
The core API is flash_kda.fwd, which takes the following inputs:
- `q`, `k`, `v`, `g`: all in bf16 with shape `[B, T, H, K]` or `[B, T, H, V]` (where `g` is the gate before activation)
- `beta`: bf16 beta logits with shape `[B, T, H]` (sigmoid applied internally)
- `scale`: fp32 scalar scaling factor
- `out`: bf16 output tensor with shape `[B, T, H, V]`
- `A_log`, `dt_bias`, `lower_bound`: fp32 gate parameters
- `initial_state`, `final_state`: optional bf16 or fp32 recurrent states
- `cu_seqlens`: optional int64 cumulative sequence lengths for variable-length batching
One current constraint: the kernel requires K = V = 128 for the head dimension.
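Putting those pieces together, a fixed-length call looks roughly like the sketch below. The argument names follow the parameter list above, and the per-head shapes of the gate parameters are assumptions; check the repository's tests/test_fwd.py for the exact signature.

```python
# Minimal sketch of a fixed-length flash_kda.fwd call. Argument names follow the
# parameter list described in the article; the exact signature, argument order,
# and gate-parameter shapes are assumptions -- check tests/test_fwd.py in the repo.
import torch
import flash_kda

B, T, H, D = 1, 8192, 96, 128                      # K = V = 128 is currently required
dev, bf16 = "cuda", torch.bfloat16

q    = torch.randn(B, T, H, D, device=dev, dtype=bf16)
k    = torch.randn(B, T, H, D, device=dev, dtype=bf16)
v    = torch.randn(B, T, H, D, device=dev, dtype=bf16)
g    = torch.randn(B, T, H, D, device=dev, dtype=bf16)   # gate before activation
beta = torch.randn(B, T, H, device=dev, dtype=bf16)      # sigmoid applied in-kernel
out  = torch.empty(B, T, H, D, device=dev, dtype=bf16)   # output buffer [B, T, H, V]

# fp32 gate parameters; the per-head shape (H,) is an assumption
A_log       = torch.zeros(H, device=dev, dtype=torch.float32)
dt_bias     = torch.zeros(H, device=dev, dtype=torch.float32)
lower_bound = torch.full((H,), -5.0, device=dev, dtype=torch.float32)

flash_kda.fwd(q, k, v, g, beta,
              scale=D ** -0.5, out=out,
              A_log=A_log, dt_bias=dt_bias, lower_bound=lower_bound)
```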
The variable-length batching support via cu_seqlens is particularly notable for production use. In real inference serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a key requirement for high-throughput serving systems.
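The cu_seqlens convention appears to be the familiar cumulative-prefix-sum scheme used by FlashAttention-style varlen APIs: a 1-D int64 tensor of length num_seqs + 1 marking where each packed sequence begins. The sketch below shows how the six-request benchmark case would be packed; whether FlashKDA expects exactly this layout is an assumption based on the shapes listed above.

```python
# Sketch of building cu_seqlens for the six-request varlen benchmark case.
# Follows the FlashAttention-style convention (int64 prefix sums starting at 0);
# whether FlashKDA expects exactly this packed layout is an assumption.
import torch

seq_lens = [1300, 547, 2048, 963, 271, 3063]        # per-request lengths, sum = 8192
lens = torch.tensor(seq_lens, dtype=torch.int64)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), lens.cumsum(0)]).cuda()
# cu_seqlens -> tensor([   0, 1300, 1847, 3895, 4858, 5129, 8192])

total_T = int(cu_seqlens[-1])                        # 8192
# q, k, v, g are then shaped [1, total_T, H, 128] with the six requests
# concatenated along the time axis, and cu_seqlens is passed to flash_kda.fwd
# so the kernel knows where each request begins and ends.
```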
Benchmark Results: 1.72× to 2.22× on H20
The benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) at a sequence length of T=8192, head dimension D=128, and two head count configurations: H=96 and H=64. Each benchmark ran with 30 warmup iterations, 200 measurement iterations, and 5 repeats.
For H=96:
| Case | `flash_kda` (ms) | `fla_chunk_kda` (ms) | Speedup |
|---|---|---|---|
| Fixed length | 2.6219 | 4.5052 | 1.72× |
| Varlen, `seq_lens=[1300, 547, 2048, 963, 271, 3063]` | 2.3420 | 4.5717 | 1.95× |
| Varlen, `seq_lens=1024 × 8` | 2.0100 | 4.4668 | 2.22× |
For H=64:
| Case | `flash_kda` (ms) | `fla_chunk_kda` (ms) | Speedup |
|---|---|---|---|
| Fixed length | 1.6199 | 2.9587 | 1.83× |
| Varlen, `seq_lens=[1300, 547, 2048, 963, 271, 3063]` | 1.7027 | 3.0595 | 1.80× |
| Varlen, `seq_lens=1024 × 8` | 1.3930 | 3.0412 | 2.18× |
The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8, eight sequences of length 1024 summing to T=8192). The fixed-length case sets the floor of the range at 1.72×. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
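For readers who want to reproduce this kind of comparison on their own hardware, the warmup / measurement / repeat methodology quoted above is the standard way to time GPU kernels. The harness below is a generic illustration of that pattern using CUDA events; it is not the FlashKDA repository's benchmark script.

```python
# Generic GPU-kernel timing harness illustrating the warmup / measure / repeat
# methodology mentioned above. This is NOT the FlashKDA benchmark script;
# `run_kernel` is a stand-in callable for whichever kernel is being measured.
import torch

def time_kernel(run_kernel, warmup=30, iters=200, repeats=5):
    per_repeat_ms = []
    for _ in range(repeats):
        for _ in range(warmup):                       # warm up caches, clocks, JIT
            run_kernel()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            run_kernel()
        end.record()
        torch.cuda.synchronize()                      # wait for all launches to finish
        per_repeat_ms.append(start.elapsed_time(end) / iters)   # mean ms per call
    return min(per_repeat_ms)                         # report the best repeat
```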
Integration with flash-linear-attention
One of the most practical aspects of FlashKDA is its integration story. Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, which means existing codebases using flash-linear-attention don't need manual wiring to benefit from the faster kernel. The integration is tracked in flash-linear-attention PR #852.
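In practice the drop-in behavior looks something like the sketch below. The import path and the chunk_kda call shown here are assumptions (they may differ across flash-linear-attention versions); the point is simply that the call site stays the same and the faster kernel is selected automatically once FlashKDA is installed.

```python
# Hedged sketch of the drop-in integration: existing code keeps calling
# flash-linear-attention's chunk_kda, and with FlashKDA installed the faster
# CUTLASS kernel is dispatched automatically (per flash-linear-attention PR #852).
# The import path and argument names below are assumptions and may vary by version.
import torch
from fla.ops.kda import chunk_kda  # assumed module path

B, T, H, D = 1, 8192, 64, 128
bf16 = torch.bfloat16
q, k, v, g = (torch.randn(B, T, H, D, device="cuda", dtype=bf16) for _ in range(4))
beta = torch.rand(B, T, H, device="cuda", dtype=bf16)

# Same call site as before FlashKDA was installed; no code changes are required.
o, final_state = chunk_kda(q, k, v, g, beta, output_final_state=True)
```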
Installation is straightforward:
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
The correctness test suite (tests/test_fwd.py) runs exact-match verification against a PyTorch reference implementation and cross-validates against flash-linear-attention. This gives AI developers a reliable baseline for auditing kernel behavior before deploying in production.
Key Takeaways
- FlashKDA is Moonshot AI's open-source CUTLASS-based CUDA kernel for Kimi Delta Attention (KDA), delivering 1.72×–2.22× prefill speedup over the `flash-linear-attention` baseline on NVIDIA H20 GPUs.
- KDA extends Gated DeltaNet with fine-grained, channel-wise gating; it is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at 1M context length.
- The kernel targets SM90+ hardware (NVIDIA Hopper: H100, H20 and above), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.
- Variable-length batching is natively supported via the `cu_seqlens` parameter, allowing multiple sequences of different lengths to be packed into a single kernel call, a critical feature for high-throughput inference serving.
- Once installed, FlashKDA is auto-dispatched from `flash-linear-attention`'s `chunk_kda`, making it a drop-in performance upgrade for any existing codebase already using the `flash-linear-attention` library, with no architecture changes required.
Check out the GitHub Repo.
