The race to make large language models faster and cheaper to run has been fought largely at two levels: model architecture and hardware. But there is a third, often underappreciated level: the GPU kernel. The kernel is the low-level computational routine that actually performs the mathematical operations on the GPU. Writing a good one requires understanding not only the mathematics, but also the exact memory layout, instruction scheduling, and hardware quirks of the chip you're targeting. Most ML practitioners never write kernels directly; instead, they rely on libraries such as FlashAttention or languages such as Triton.
Meet FlashQLA, QwenLM's contribution to this layer. Released under the MIT license and built on the TileLang compiler framework, it is a high-performance linear attention kernel library optimized specifically for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.
What is linear attention and why does it matter?
To understand what FlashQLA solves, it helps to understand the cost of standard softmax attention. In a traditional Transformer, the attention mechanism has O(n²) complexity, meaning that doubling the sequence length quadruples the computation. This is the fundamental bottleneck that makes it expensive to process long documents, long code files, or long conversations.
Linear attention replaces the softmax with a formulation that reduces the cost to O(n), making it scale far more favorably with sequence length. Gated Delta Network (GDN) is one such linear attention mechanism, and it is integrated into Qwen's hybrid model architecture, where GDN layers alternate with standard full attention layers. This hybrid design aims for the best of both worlds: the expressiveness of full attention where it is needed most, and the efficiency of linear attention everywhere else.
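In schematic form, the difference comes down to how the attention product is associated. The following is a generic illustration of the linear attention trick, not FlashQLA's exact GDN formulation:

```latex
% Softmax attention materializes an n-by-n score matrix:
\mathrm{Attn}(Q,K,V) \;=\; \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right)V
\qquad\Rightarrow\qquad O(n^{2}) \text{ in sequence length } n
% Dropping the softmax lets the product be re-associated, so the n-by-n matrix
% is never formed; K^{\top}V is a small d-by-d running state built token by token:
(QK^{\top})\,V \;=\; Q\,(K^{\top}V)
\qquad\Rightarrow\qquad O(n) \text{ in sequence length } n
```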
GDN uses what is called a 'gated' formulation: it applies an exponentially decaying gate that controls how much of the previous context is carried forward at each step. This gate is the key to how FlashQLA achieves its performance gains.
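Concretely, a gated delta-rule state update looks roughly like the plain-PyTorch recurrence below. Exact formulations, gate parameterizations, and normalizations vary across papers and implementations, so treat this as an illustrative reference rather than FlashQLA's fused kernel:

```python
# Illustrative, unoptimized sketch of a gated delta-rule recurrence (one common
# formulation from the linear-attention literature). This is NOT FlashQLA's fused
# chunked kernel; it just shows the role of the decay gate alpha and write gate beta.
import torch
import torch.nn.functional as F

def gdn_recurrence(q, k, v, alpha, beta):
    """q, k: (n, d_k); v: (n, d_v); alpha, beta: (n,) gates in (0, 1).
    Returns per-token outputs of shape (n, d_v) from a (d_v, d_k) recurrent state."""
    n, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)
    outputs = []
    for t in range(n):
        # Exponentially decay the old state, erase what is currently stored under
        # key k_t (the delta-rule correction), then write the new value v_t.
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k[t], k[t])) \
            + beta[t] * torch.outer(v[t], k[t])
        outputs.append(S @ q[t])
    return torch.stack(outputs)

# Toy usage: 128 tokens, key dim 16, value dim 32; keys are L2-normalized,
# as is typical for delta-rule updates, so the recurrence stays stable.
q = torch.randn(128, 16)
k = F.normalize(torch.randn(128, 16), dim=-1)
v = torch.randn(128, 32)
alpha = torch.rand(128) * 0.1 + 0.9     # decay gates close to 1
beta = torch.rand(128) * 0.5            # write strengths
out = gdn_recurrence(q, k, v, alpha, beta)   # (128, 32)
```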
The problem with existing kernels
Before FlashQLA, the standard implementation of GDN operations came from the Flash Linear Attention (FLA) library, whose kernels are written in Triton, OpenAI's Python-based GPU programming language. While Triton makes kernel authoring far more accessible, it comes with trade-offs: the kernels it produces are not always well tuned for specific hardware, especially NVIDIA's Hopper architecture (the H100 and H200 GPU generation).
The Hopper architecture introduced features such as warpgroup-level Tensor Core operations and asynchronous data pipelines that Triton cannot always exploit to their full potential. FlashQLA is designed to fill this gap.
What does FlashQLA do differently?
FlashQLA applies operator fusion and performance optimization to both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN chunked prefill. The result is a 2-3× speedup on the forward pass and a 2× speedup on the backward pass compared to the FLA Triton kernels across several scenarios on NVIDIA Hopper GPUs.
Three technological innovations drive these benefits:
1. Gate-driven automatic intra-card context parallelism: Context parallelism (CP) means splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA uses the exponential decay property of the GDN gate to make this partitioning mathematically valid, because the gate's decay means distant tokens in a sequence have progressively less influence on each other (a simplified sketch of the idea follows this list). This lets FlashQLA automatically enable intra-card CP under tensor parallelism (TP), long-sequence, and small-head-count settings, improving GPU streaming multiprocessor (SM) utilization without manual configuration.
2. Hardware-friendly algebraic rewrites: FlashQLA restructures the mathematics of the GDN chunked-prefill forward and backward passes to reduce the load on three kinds of GPU hardware units: the Tensor Cores (which handle matrix multiplication), the CUDA cores (which handle scalar and vector operations), and the Special Function Units (SFUs, which handle operations such as exponentiation and square roots). Critically, this is done without sacrificing numerical precision, an important guarantee when the kernels are used for model training.
3. TileLang fused, warp-specialized kernels: Instead of decomposing the computation into many separate kernels launched one after another (too slow) or fusing everything into a single monolithic kernel (too hard to optimize), FlashQLA takes a middle path. It uses TileLang to build a small number of key fused kernels and manually implements warpgroup specialization, a technique that assigns different warpgroups (groups of 128 threads on Hopper) to particular roles, so that one warpgroup moves data from global memory to shared memory while another simultaneously runs Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA core computation allows FlashQLA to reach the theoretical peak throughput of the hardware.
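The intuition behind the first innovation can be shown with a deliberately simplified recurrence: a scalar decay gate only, with the delta-rule term dropped, and nothing resembling the real kernel's tiling. Because each step only decays the state and adds to it, every chunk of the sequence can be processed independently into a (total decay, local contribution) pair, and the pairs are then stitched together with a cheap combine pass:

```python
# Simplified illustration of why a decaying gate makes chunk-level splitting valid.
# Scalar-gated linear recurrence only (no delta-rule term); this shows the algebra,
# not FlashQLA's actual partitioning scheme or kernel code.
import torch

def sequential_state(k, v, g):
    """Reference: S_t = g_t * S_{t-1} + k_t v_t^T, processed token by token."""
    S = torch.zeros(k.shape[1], v.shape[1])
    for t in range(k.shape[0]):
        S = g[t] * S + torch.outer(k[t], v[t])
    return S

def chunked_state(k, v, g, chunk=16):
    """Each chunk independently yields (total_decay, local_contribution); the
    expensive per-chunk work is parallelizable, the final combine is cheap."""
    partials = []
    for s in range(0, k.shape[0], chunk):
        kc, vc, gc = k[s:s+chunk], v[s:s+chunk], g[s:s+chunk]
        local = torch.zeros(k.shape[1], v.shape[1])
        for t in range(kc.shape[0]):
            local = gc[t] * local + torch.outer(kc[t], vc[t])
        partials.append((gc.prod(), local))   # (decay applied to incoming state, local update)
    S = torch.zeros(k.shape[1], v.shape[1])
    for decay, local in partials:             # cheap sequential combine across chunks
        S = decay * S + local
    return S

torch.manual_seed(0)
k, v = torch.randn(64, 8), torch.randn(64, 8)
g = torch.rand(64) * 0.1 + 0.9                # decay gates in (0.9, 1.0)
assert torch.allclose(sequential_state(k, v, g), chunked_state(k, v, g), atol=1e-5)
```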
Benchmarks
FlashQLA was benchmarked against two baselines: the FLA Triton kernels (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8 on an NVIDIA H200 GPU. The benchmarks used head configurations from the Qwen3.5 and Qwen3.6 model families, with hV ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 to TP8.
Forward (FWD) benchmarks measure single-kernel latency across models and TP settings at different batch lengths. Backward (BWD) benchmarks examine how the latency of a single update step scales with the total number of tokens in a batch.
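For context, single-kernel latency of this kind is typically measured with CUDA events around warmed-up launches. The snippet below is a generic sketch of that technique (timing a stand-in matmul), not FlashQLA's own benchmark script:

```python
# Generic sketch of how single-kernel GPU latency is typically measured in PyTorch
# (CUDA events around warmed-up launches). The kernel timed here is just a matmul.
import torch

def kernel_latency_ms(fn, warmup=10, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):                   # warm up: compilation, caches, clocks
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters    # average milliseconds per launch

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    print(f"matmul: {kernel_latency_ms(lambda: a @ b):.3f} ms")
```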

Key takeaways
- FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang, optimized specifically for Gated Delta Network (GDN) chunked-prefill forward and backward passes.
- It achieves a 2-3× forward speedup and a 2× backward speedup over the FLA Triton kernels across multiple scenarios on NVIDIA Hopper GPUs (SM90+); the efficiency gains are most pronounced in pretraining and edge-side agentic inference.
- Three main innovations drive the performance gains: gate-driven automatic intra-card context parallelism; hardware-friendly algebraic rewrites that reduce Tensor Core, CUDA core, and SFU overhead without losing numerical precision; and TileLang fused, warp-specialized kernels that overlap data movement, Tensor Core computation, and CUDA core computation.
- GDN is a linear attention mechanism with O(n) complexity, used in Qwen's hybrid model architecture alongside standard full attention layers, which makes efficient GDN kernels important for both training and long-context inference at scale.
- FlashQLA is open source under the MIT license. It requires SM90 or above, CUDA 12.8+, and PyTorch 2.8+, installs with a simple pip install, and exposes both high-level and low-level Python APIs for integration.
Check out the GitHub repo for code and technical details.