Liquid AI’s New LFM2-24B-A2B Hybrid Architecture Blends Attention with Convolutions to Solve the Scaling Bottlenecks of Modern LLMs

By Naveed Ahmad | 25/02/2026 | 4 Mins Read


The generative AI race has long been a game of ‘bigger is better.’ But as the industry hits the limits of power consumption and memory bottlenecks, the conversation is shifting from raw parameter counts to architectural efficiency. The Liquid AI team is leading this charge with the release of LFM2-24B-A2B, a 24-billion-parameter model that redefines what we should expect from edge-capable AI.

https://www.liquid.ai/blog/lfm2-24b-a2b

The ‘A2B’ Architecture: A 1:3 Ratio for Efficiency

The ‘A2B’ in the model’s name stands for Attention-to-Base. In a standard Transformer, every layer uses softmax attention, which scales quadratically (O(N²)) with sequence length. This leads to huge KV (key-value) caches that consume VRAM.

Liquid AI bypasses this by using a hybrid structure. The ‘Base’ layers are efficient gated short-convolution blocks, while the ‘Attention’ layers use Grouped Query Attention (GQA).

In the LFM2-24B-A2B configuration, the model uses a 1:3 ratio:

    • Total Layers: 40
    • Convolution Blocks: 30
    • Attention Blocks: 10

By interspersing a small number of GQA blocks among a majority of gated convolution layers, the model retains the high-resolution retrieval and reasoning of a Transformer while maintaining the fast prefill and low memory footprint of a linear-complexity model.
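As a sketch, the 1:3 interleaving described above can be expressed as a simple layer schedule. The block names and the exact placement of the attention blocks are illustrative assumptions; only the 40-layer total and the 30/10 split come from the article:

```python
# Illustrative layer schedule for a 40-layer hybrid stack:
# 10 GQA attention blocks spread among 30 gated short-convolution blocks.
# The exact placement in LFM2 is an assumption; only the 30/10 split is stated.

def build_layer_schedule(total_layers=40, attention_layers=10):
    """Place one GQA block after every three conv blocks (1:3 ratio)."""
    period = total_layers // attention_layers  # 4: conv, conv, conv, attn
    schedule = []
    for i in range(total_layers):
        if (i + 1) % period == 0:
            schedule.append("gqa_attention")
        else:
            schedule.append("gated_short_conv")
    return schedule

layers = build_layer_schedule()
print(layers.count("gated_short_conv"), layers.count("gqa_attention"))  # 30 10
```

Keeping the attention blocks this sparse bounds KV-cache growth to the 10 GQA layers, which is where the memory savings over an all-attention stack come from.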

Sparse MoE: 24B Intelligence on a 2B Budget

The most important element of LFM2-24B-A2B is its Mixture of Experts (MoE) design. While the model contains 24 billion parameters, it only activates 2.3 billion parameters per token.

This is a game-changer for deployment. Because the active parameter path is so lean, the model can fit into 32GB of RAM. This means it can run locally on high-end consumer laptops, desktops with integrated GPUs (iGPUs), and dedicated NPUs without needing a data-center-grade A100. It effectively delivers the knowledge density of a 24B model with the inference speed and energy efficiency of a 2B model.
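Back-of-the-envelope arithmetic shows why the 32GB figure is plausible. A minimal sketch, assuming common quantization widths (the byte-per-parameter choices are illustrative, not Liquid AI's shipped formats):

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight storage in GiB, ignoring KV cache and activations."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

total_b, active_b = 24.0, 2.3  # figures from the article

# All 24B weights must be resident, but only the ~2.3B active path
# is read per token, which is what drives latency and energy use.
for bits, label in [(16, "fp16/bf16"), (8, "int8"), (4, "4-bit")]:
    resident = weight_memory_gb(total_b, bits / 8)
    active = weight_memory_gb(active_b, bits / 8)
    print(f"{label}: ~{resident:.1f} GiB resident, ~{active:.1f} GiB active path")
```

At fp16 all 24B weights would need roughly 45 GiB, but at 8-bit or 4-bit precision the full model drops well under the 32GB budget, while per-token compute tracks only the ~2.3B active path.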


    Benchmarks: Punching Up

Liquid AI reports that the LFM2 family follows a predictable, log-linear scaling behavior. Despite its smaller active parameter count, the 24B-A2B model consistently outperforms larger rivals.

    • Logic and Reasoning: In tests like GSM8K and MATH-500, it rivals dense models twice its size.
    • Throughput: When benchmarked on a single NVIDIA H100 using vLLM, it reached 26.8K total tokens per second at 1,024 concurrent requests, significantly outpacing OpenAI’s gpt-oss-20b and Qwen3-30B-A3B.
    • Long Context: The model features a 32k-token context window, optimized for privacy-sensitive RAG (Retrieval-Augmented Generation) pipelines and local document analysis.
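Dividing the reported aggregate by the concurrency gives the per-request decode rate at full load; a quick sketch using only the two figures from the article:

```python
# Aggregate-to-per-stream throughput for the reported H100/vLLM benchmark.
total_tps = 26_800   # total tokens per second (from the article)
concurrency = 1_024  # concurrent requests (from the article)

per_stream_tps = total_tps / concurrency
print(f"~{per_stream_tps:.1f} tokens/s per concurrent request")  # ~26.2
```

Roughly 26 tokens per second per stream even at 1,024-way saturation is comfortably within interactive-use territory, which is what makes the aggregate number meaningful rather than just a batch-throughput headline.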

    Technical Cheat Sheet

    Property Specification
    Total Parameters 24 Billion
    Active Parameters 2.3 Billion
    Architecture Hybrid (Gated Conv + GQA)
    Layers 40 (30 Base / 10 Attention)
    Context Length 32,768 Tokens
    Training Data 17 Trillion Tokens
    License LFM Open License v1.0
    Native Support llama.cpp, vLLM, SGLang, MLX

    Key Takeaways

    • Hybrid ‘A2B’ Architecture: The model uses a 1:3 ratio of Grouped Query Attention (GQA) to gated short convolutions. By employing linear-complexity ‘Base’ layers for 30 out of 40 layers, the model achieves much faster prefill and decode speeds with a significantly reduced memory footprint compared to traditional all-attention Transformers.
    • Sparse MoE Efficiency: Despite having 24 billion total parameters, the model only activates 2.3 billion parameters per token. This sparse Mixture of Experts design allows it to deliver the reasoning depth of a large model while maintaining the inference latency and energy efficiency of a 2B-parameter model.
    • True Edge Capability: Optimized via hardware-in-the-loop architecture search, the model is designed to fit in 32GB of RAM. This makes it fully deployable on consumer-grade hardware, including laptops with integrated GPUs and NPUs, without requiring expensive data-center infrastructure.
    • State-of-the-Art Performance: LFM2-24B-A2B outperforms larger rivals like Qwen3-30B-A3B and OpenAI’s gpt-oss-20b in throughput. Benchmarks show it hits roughly 26.8K tokens per second on a single H100, exhibiting near-linear scaling and high efficiency in long-context tasks up to its 32k-token window.

Check out the technical details and model weights.



