Flipping the Script: Why Short Convolutions Don't Need Linear Attention
A perspective from the CTO of Liquid AI on our LFM2 architecture
The ML community has lately been discussing an intriguing architectural question: why do Linear Attention, Mamba, and Delta Net architectures seem to require short convolutions to function effectively? A recent blog post by Moonshot AI’s Su Jianlin went deep into this dependency and offered several hypotheses, and Liquid AI’s Harold Benoit nicely summarized the key insight in a recent tweet.
The Conventional Wisdom
The prevailing theory frames linear attention variants as performing a form of online learning (essentially extended Test-Time Training, or eTTT) over key-value pairs. In this view, the state update follows:
S_t = S_{t-1} - η_t ∇_{S_{t-1}} L(S_{t-1}; k_t, v_t)
where η_t is a step size and the loss L measures how effectively we can retrieve v_t from the state S_{t-1} given k_t.
The critical insight here is that when K=V, this framework collapses to a trivial solution: the identity function. More generally, the greater the overlap between keys and values, the less there is to learn in this TTT framework.
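To make the collapse concrete, here is a minimal NumPy sketch. It assumes a squared-error retrieval loss 0.5 * ||S k - v||^2, whose gradient step is the familiar delta rule; the loss choice, the step size, and all function names are illustrative assumptions rather than the exact formulation of any particular model.

```python
import numpy as np

def delta_rule_step(S, k, v, lr=1.0):
    """One gradient step on the loss 0.5 * ||S @ k - v||^2 with respect to S."""
    error = S @ k - v                      # retrieval error for this (key, value) pair
    return S - lr * np.outer(error, k)     # gradient of the loss is outer(error, k)

def run_online_learning(keys, values, lr=1.0):
    """Stream (key, value) pairs through the state, starting from S = 0."""
    S = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        S = delta_rule_step(S, k, v, lr)
    return S

rng = np.random.default_rng(0)
d, steps = 8, 2000
keys = rng.normal(size=(steps, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys keep lr=1.0 stable

# Case 1: K = V. The state can only learn to copy its input.
S_same = run_online_learning(keys, keys)

# Case 2: values come from a different (hypothetical) linear map W of the keys.
W = rng.normal(size=(d, d))
S_map = run_online_learning(keys, keys @ W.T)

print("distance to identity when K = V:", np.abs(S_same - np.eye(d)).max())
print("distance to W when V = W K:     ", np.abs(S_map - W).max())
```

In the first case the state converges to the identity matrix, so nothing about the sequence is stored. In the second case the state recovers W, which is the kind of non-trivial content the online-learning view expects the state to capture.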
This explains why having K and V derived from the same input x through only channel-mixing layers (like QKV projections in multi-head attention) limits learnable content. Short convolutions with kernel size ≥2 applied to at least the keys solve this by ensuring that k_t contains information from past keys, enriching the learnable dynamics.
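The effect of a short convolution on the keys can be illustrated with a small NumPy sketch. The depthwise causal convolution below, the projection matrix `W_k`, and the tap layout are assumptions chosen for illustration; real implementations typically use a fused depthwise 1D convolution kernel.

```python
import numpy as np

def causal_short_conv(x, kernel):
    """Depthwise causal convolution along the time axis.

    x:      (T, d) sequence of token features
    kernel: (K, d) per-channel taps; kernel[0] weights the current token,
            kernel[j] weights the token j steps in the past
    """
    T, d = x.shape
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, d)), x], axis=0)  # left-pad only: no future leakage
    out = np.zeros((T, d))
    for j in range(K):
        # padded[K-1-j : K-1-j+T] is x shifted j steps into the past
        out += kernel[j] * padded[K - 1 - j : K - 1 - j + T]
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.normal(size=(T, d))
W_k = rng.normal(size=(d, d))        # hypothetical channel-mixing key projection
kernel = rng.normal(size=(2, d))     # kernel size 2

def keys_plain(x):
    return x @ W_k                   # k_t depends on x_t only

def keys_conv(x):
    return causal_short_conv(x @ W_k, kernel)  # k_t depends on x_t and x_{t-1}

# Perturb a single token and see which key positions react.
x_perturbed = x.copy()
x_perturbed[2] += 1.0

def changed_rows(a, b):
    return np.flatnonzero(np.abs(a - b).max(axis=1) > 1e-12)

changed_plain = changed_rows(keys_plain(x_perturbed), keys_plain(x))
changed_conv = changed_rows(keys_conv(x_perturbed), keys_conv(x))
print("positions affected without conv:", changed_plain)
print("positions affected with conv:   ", changed_conv)
```

Without the convolution, a token influences only its own key. With a kernel of size 2, it also influences the next key, so k_t is no longer a pure channel-wise function of the same input that produces v_t.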
Turning the Question Upside Down
But here is where things get interesting. Instead of asking why linear attention needs short convolutions, we posed a different question entirely: Do short convolutions actually need linear attention at all?
Our LFM2 architecture answers with a decisive NO.
The LFM2 Architecture: Simplicity Meets Performance
LFM2 takes a radically different approach. Each block consists of:
A Grouped Query Attention (GQA) layer as global sequence mixer
Multiple double-gated short convolutions as local sequence mixers
The “usual suspects” of dense, residual, and normalization layers in the form of SwiGLU MLPs and RMSNorms, all applied token-wise, i.e., independently for each token.
But what’s absent? Linear attention, linear RNNs, SSMs, or any Mamba-style recurrences.
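The double-gated short convolution listed among the block components can be sketched as follows. This is one plausible reading of "double-gated", assuming an input-dependent gate is applied both before and after the convolution; the weight names, shapes, and the absence of any nonlinearity on the gates are illustrative assumptions, not a confirmed description of the LFM2 implementation.

```python
import numpy as np

def causal_short_conv(x, kernel):
    """Depthwise causal conv: x is (T, d), kernel is (K, d), kernel[0] = current token."""
    T, d = x.shape
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, d)), x], axis=0)
    out = np.zeros((T, d))
    for j in range(K):
        out += kernel[j] * padded[K - 1 - j : K - 1 - j + T]
    return out

def double_gated_short_conv(x, W_in, kernel, W_out):
    """x: (T, d); W_in: (d, 3d); kernel: (K, d); W_out: (d, d)."""
    # One token-wise projection produces both gates and the signal to be mixed.
    gate_in, gate_out, h = np.split(x @ W_in, 3, axis=1)
    y = causal_short_conv(gate_in * h, kernel)  # first gate, then local mixing over time
    return (gate_out * y) @ W_out               # second gate, then output projection

rng = np.random.default_rng(0)
T, d, K = 8, 4, 3
x = rng.normal(size=(T, d))
W_in = rng.normal(size=(d, 3 * d))
W_out = rng.normal(size=(d, d))
kernel = rng.normal(size=(K, d))

out = double_gated_short_conv(x, W_in, kernel, W_out)

# Perturbing token 1 should only affect outputs 1 .. 1 + K - 1 (causal and local).
x_perturbed = x.copy()
x_perturbed[1] += 1.0
out_perturbed = double_gated_short_conv(x_perturbed, W_in, kernel, W_out)
affected = np.flatnonzero(np.abs(out_perturbed - out).max(axis=1) > 1e-12)
print("output shape:", out.shape, "affected positions:", affected)
```

The only operation that mixes information across tokens is the convolution itself; both gates and both projections act independently per token, which is what makes this a purely local sequence mixer.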
Our Design Philosophy
During LFM2’s development, we defined two clear objectives:
Strong LLM capability: We validated each architecture candidate against our internal benchmark suite, which covers, among others, knowledge capacity, low-resource multilingual capabilities, mathematical, coding, and logical reasoning, and in-context recall evals. This ensured that we were not sacrificing critical architectural capabilities for simplicity or efficiency.
Edge device efficiency: We profiled each candidate on real hardware, specifically a Galaxy S24 Ultra with a Qualcomm Snapdragon ARM CPU, using an XNNPack-based inference stack. This ensured that we hold the real-world performance crown on edge devices rather than a purely theoretical “paper only” efficiency metric.
To ensure efficient architecture search, we started from a strong Llama3-style transformer baseline and evolved from there.
The Results Speak for Themselves
The LFM2 series demonstrates that combining softmax self-attention with gated local convolutions gives us:
Strong benchmark performance across diverse tasks, proving that linear attention and linear RNNs are not necessary ingredients for success
Fast inference on edge devices, validating our architectural choices for practical deployment
More concretely, during our ablations and architecture search we observed that removing the short convolutions entirely, i.e., using no local mixers, results in a significant drop on parts of our internal evaluation suite. This suggests that some local mixing, whether through a linear recurrence, a short convolution, or sliding window attention, is essential for certain LLM capabilities in addition to the global softmax attention. In the other direction, including additional linear attention layers or adding more GQA layers during our LFM2 architecture search did not provide a meaningful advantage on the eval side but incurred slower inference, especially at growing context length.
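One way to see why extra attention layers cost more as context grows is to compare the per-layer state that has to be kept, and read, during decoding. The back-of-the-envelope sketch below uses hypothetical layer sizes that are not LFM2's actual configuration, and it counts cached elements only, ignoring data types and implementation details.

```python
def gqa_kv_cache_elems(context_len, n_kv_heads, head_dim):
    """Keys and values cached for every past token of one GQA layer."""
    return 2 * context_len * n_kv_heads * head_dim

def short_conv_state_elems(kernel_size, d_model):
    """A causal conv of size K only needs the last K - 1 inputs per channel."""
    return (kernel_size - 1) * d_model

# Hypothetical layer sizes, chosen only for illustration.
n_kv_heads, head_dim, d_model, kernel_size = 8, 64, 2048, 3

for context_len in (1_024, 8_192, 32_768):
    kv = gqa_kv_cache_elems(context_len, n_kv_heads, head_dim)
    conv = short_conv_state_elems(kernel_size, d_model)
    print(f"{context_len:>6} tokens: KV cache {kv:>10,} elems, conv state {conv:,} elems")
```

The attention cache grows linearly with context length and is read in full at every decoding step, while the convolution state stays constant, which is consistent with the slowdown observed when adding more GQA layers.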
A Deeper Pattern
Interestingly, this isn’t an isolated finding. Rom Parnichkun from our research team presented a paper at ICML 2025 showing that many linear structures, e.g., linear attention and SSMs, fail catastrophically when their short convolution gates are removed, becoming unable to memorize information effectively. Moreover, it shows that many linear attention models and SSMs generally struggle to fully realize their theoretical memory capacity, with or without convolutions.
This suggests a broader principle: rather than being a necessary supplement that makes linear attention work, short convolutions might be doing most of the heavy lifting themselves.
Looking Forward
The success of LFM2 challenges conventional wisdom about what’s necessary for effective sequence modeling. By demonstrating that short convolutions can stand on their own without linear attention mechanisms, we are opening new directions for efficient, on-device AI systems.
For the research community, this raises fascinating questions: What other architectural assumptions might we be carrying unnecessarily? And how much simpler can we make our models while maintaining or even improving performance?
The journey with LFM2 has taught us that sometimes the best way forward is to flip the question entirely. Instead of asking how to fix complex mechanisms, perhaps we should ask whether we need them at all.
Want to dive deeper into LFM2 and architecture research? Check out our Hugging Face organization: https://huggingface.co/LiquidAI or consider joining our team: https://jobs.ashbyhq.com/liquid-ai

