Meet ‘Kani-TTS-2’: A 400M-Parameter Open-Source Text-to-Speech Model that Runs in 3GB VRAM with Voice Cloning Support

By Naveed Ahmad | 15/02/2026






The landscape of generative audio is shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, compute-expensive TTS systems: it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.

Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.
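For anyone who wants to try the model locally, the checkpoints can be pulled straight from Hugging Face with the huggingface_hub client. This is a minimal sketch; the repository ID is a placeholder, so check the nineninesix.ai organization page on Hugging Face for the actual EN and PT repo names.

```python
# Minimal sketch: fetching the released checkpoints from Hugging Face.
# The repo_id below is a placeholder -- check the nineninesix.ai organization
# on Hugging Face for the actual EN/PT repository names.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nineninesix/kani-tts-2-en")  # hypothetical repo ID
print("Model files downloaded to:", local_dir)
```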

The Architecture: LFM2 and NanoCodec

Kani-TTS-2 follows the ‘Audio-as-Language’ philosophy. Rather than using a conventional mel-spectrogram pipeline, it converts raw audio into discrete tokens with a neural codec.

The system relies on a two-stage process:

1. The Language Backbone: The model is built on LiquidAI’s LFM2 (350M) architecture. This backbone generates ‘audio intent’ by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
2. The Neural Codec: It uses NVIDIA’s NanoCodec to turn these tokens into 22 kHz waveforms.

With this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the ‘robotic’ artifacts found in older TTS systems.
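To make the two-stage flow concrete, here is a conceptual sketch in Python. The class and method names are illustrative placeholders rather than the real Kani-TTS-2 API; the point is simply the order of operations: the language backbone emits discrete audio tokens, and the codec decodes them into a waveform.

```python
# Conceptual sketch of the two-stage 'Audio-as-Language' pipeline; all names
# here are illustrative placeholders, not the actual Kani-TTS-2 API.

def synthesize(text, lm, codec, sample_rate=22_050):
    # Stage 1: the LFM2-based backbone autoregressively predicts discrete
    # audio tokens ("audio intent") for the input text.
    audio_tokens = lm.generate_audio_tokens(text)   # hypothetical call
    # Stage 2: NVIDIA NanoCodec decodes those tokens into a 22 kHz waveform.
    waveform = codec.decode(audio_tokens)           # hypothetical call
    return waveform, sample_rate
```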

Efficiency: 10,000 Hours in 6 Hours

The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.

While that scale is impressive, the speed of training is the real story. The research team trained the model in only 6 hours on a cluster of 8 NVIDIA H100 GPUs, showing that large datasets no longer require weeks of compute time when paired with efficient architectures like LFM2.
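A quick back-of-the-envelope check, assuming a single pass over the data, puts those numbers in perspective:

```python
# Back-of-the-envelope check on the quoted training run, assuming a single
# pass over the 10,000-hour English dataset.
dataset_hours = 10_000      # hours of speech data
wall_clock_hours = 6        # reported training time
num_gpus = 8                # NVIDIA H100s

gpu_hours = wall_clock_hours * num_gpus            # 48 GPU-hours in total
audio_per_gpu_hour = dataset_hours / gpu_hours     # ~208 hours of audio per GPU-hour
print(f"{gpu_hours} GPU-hours, ~{audio_per_gpu_hour:.0f} h of audio per GPU-hour")
```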

Zero-Shot Voice Cloning and Performance

The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings.

• How it works: You provide a short reference audio clip.
• The result: The model extracts the distinctive characteristics of that voice and applies them to the generated speech instantly (a minimal sketch of this flow follows the list).
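Here is an illustrative version of that flow in Python, with placeholder method names rather than the actual Kani-TTS-2 API:

```python
# Illustrative sketch of zero-shot cloning via speaker embeddings; the method
# names are placeholders, not the actual Kani-TTS-2 API.

def clone_and_speak(model, reference_wav, text):
    # Extract a speaker embedding from a short reference clip (no fine-tuning).
    speaker_embedding = model.embed_speaker(reference_wav)       # hypothetical call
    # Condition generation on that embedding so the new text is spoken in the same voice.
    return model.synthesize(text, speaker=speaker_embedding)     # hypothetical call
```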

From a deployment perspective, the model is highly accessible:

• Parameter Count: 400M (0.4B) parameters.
• Speed: It features a Real-Time Factor (RTF) of 0.2, meaning it can generate 10 seconds of speech in roughly 2 seconds (a quick way to check this on your own hardware is sketched after this list).
• Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
• License: Released under the Apache 2.0 license, allowing commercial use.
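Verifying the RTF figure yourself is straightforward: time the synthesis call and divide by the duration of the audio it produces. The helper below assumes a generic synthesize function that returns a waveform array; it is not tied to any specific Kani-TTS-2 API.

```python
# Verifying the Real-Time Factor locally: RTF = generation time / audio duration,
# so RTF 0.2 means ~2 s of compute for 10 s of speech.
import time

def measure_rtf(synthesize, text, sample_rate=22_050):
    start = time.perf_counter()
    waveform = synthesize(text)                    # hypothetical inference call
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds                 # lower is better; 0.2 is the claim
```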

    Key Takeaways

• Efficient Architecture: The model uses a 400M-parameter backbone based on LiquidAI’s LFM2 (350M). The ‘Audio-as-Language’ approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.
• Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
• Instant Zero-Shot Cloning: There is no need for fine-tuning to replicate a specific voice. Given a short reference audio clip, the model uses speaker embeddings to synthesize text in the target speaker’s voice immediately.
• High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can generate 10 seconds of audio in roughly 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
• Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.

Check out the model weights on Hugging Face. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter.







