
    Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags

    By Naveed Ahmad, 15/05/2026 (updated 15/05/2026)


    Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. Supertone bills it as lightning-fast, on-device, multilingual, and accurate TTS.

    What Changed from v2 to v3

    Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 ISO language codes in total. There is also a special na fallback for text whose language is unknown or outside the supported set.

    The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is much smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently released the Voice Builder, which lets developers create custom, edge-native TTS models from their own voice recordings.
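As a quick sanity check on those two figures, the arithmetic below divides the quoted footprint by the quoted parameter count. It is only a back-of-the-envelope sketch: treating "MB" as 10^6 bytes is an assumption, and the asset bundle may contain files other than weights, so the result says nothing definitive about how the weights are stored.

```python
# Figures quoted in the article. Treating "MB" as 10**6 bytes is an assumption.
PARAMS = 99e6         # ~99M parameters across the public ONNX assets
ASSET_BYTES = 404e6   # 404 MB total disk footprint

# Average storage per parameter implied by the two figures
bytes_per_param = ASSET_BYTES / PARAMS

# Size the weights alone would occupy if each were a 32-bit float (4 bytes)
fp32_size_mb = PARAMS * 4 / 1e6

print(f"{bytes_per_param:.2f} bytes per parameter")
print(f"{fp32_size_mb:.0f} MB if every parameter were stored as float32")
```

Roughly four bytes per parameter is consistent with 32-bit weights plus a small amount of overhead, but the article does not state the storage precision.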

    One new capability in v3 that wasn’t present in v2 is expression tag support. Supertonic 3 supports simple inline expression tags. These let you embed prosodic cues directly into input text, with no separate preprocessing step and no separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.

    Architecture and Runtime

    The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for better text-speech alignment and uses a Self-Purifying Flow Matching technique during training to remain robust against noisy data labels.
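To make the "few inference steps" idea concrete, here is a toy sketch of sampling from a flow-matching model by integrating its vector field with fixed Euler steps. It is not Supertonic's implementation: the real velocity field is a neural network conditioned on text, whereas `ideal_velocity` below is a hand-written stand-in for a single known target, and all names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]

def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup: one fixed "noise" sample and one known target point.
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]

def ideal_velocity(x: Vector, t: float) -> Vector:
    # Along the straight path x_t = (1 - t) * noise + t * target,
    # the velocity is the constant vector (target - noise).
    return [b - a for a, b in zip(noise, target)]

sample = euler_sample(ideal_velocity, noise, num_steps=2)
print(sample)
```

Because the velocity along a straight path is constant, two Euler steps land exactly on the target in this toy case. A learned field is only approximately straight, so a real model trades a little accuracy for the low step count.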

    On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on an A100 GPU, and uses significantly less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.

    Reading Accuracy

    Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS intelligibility metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. The system’s efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. Additionally, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation uses onnxruntime-web for pure client-side execution.
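The WER and CER procedure described above comes down to an edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below shows that calculation on a made-up transcript; it assumes a non-empty reference and skips the case, punctuation, and whitespace normalization that real evaluations usually apply first.

```python
from typing import Sequence

def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "the quick brown fox"
transcript = "the quick brown box"  # hypothetical ASR output for the synthesized audio

print(wer(reference, transcript))  # one wrong word out of four
print(cer(reference, transcript))  # one wrong character out of nineteen
```

The same `edit_distance` serves both metrics; only the unit of comparison changes, which is why CER suits languages where splitting on whitespace does not yield words.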

    Text Normalization

    A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms (financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph) without any preprocessing pipeline or phonetic annotations. The financial expression “$5.2M” should read as “five point two million dollars,” and “$450K” as “four hundred fifty thousand dollars.” All four competing systems failed this. The technical unit “2.3h” should read as “two point three hours” and “30kph” as “thirty kilometers per hour.” All four competitors also failed this category. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
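For a sense of what a separate preprocessing pipeline would otherwise have to do, here is a small rule-based sketch that expands only the abbreviated-dollar pattern from the examples above. It is not how Supertonic normalizes text (the article says the model needs no such step); the regex, the helper names, and the limit to amounts below 1,000 per scale are all illustrative assumptions.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = {"K": "thousand", "M": "million", "B": "billion"}

def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + int_to_words(rest) if rest else "")

# Matches amounts such as $5.2M or $450K: up to three integer digits,
# optional decimals, then a K/M/B scale suffix.
MONEY = re.compile(r"\$(\d{1,3})(?:\.(\d+))?([KMB])\b")

def expand_money(match: re.Match) -> str:
    whole, frac, scale = match.group(1), match.group(2), match.group(3)
    words = int_to_words(int(whole))
    if frac:
        # Decimal digits are read out one by one: ".25" -> "point two five"
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return f"{words} {SCALES[scale]} dollars"

def normalize_money(text: str) -> str:
    return MONEY.sub(expand_money, text)

print(normalize_money("Revenue hit $5.2M, up from $450K."))
```

Even this one category needs a number speller and a pattern per notation; dates, phone numbers, and units would each add their own rules, which is the maintenance burden built-in normalization avoids.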

    https://github.com/supertone-inc/supertonic

    Getting Started

    Install the Python SDK with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:

    from supertonic import TTS
    tts = TTS(auto_download=True)
    style = tts.get_voice_style(voice_name="M1")
    text = "A gentle breeze moved through the open window while everyone listened to the story."
    wav, duration = tts.synthesize(text, voice_style=style, lang="en")
    tts.save_audio(wav, "output.wav")
    print(f"Generated {duration:.2f}s of audio")

    Marktechpost’s Visual Explainer

    Supertonic 3: Developer Guide

    Overview

    Supertonic 3: On-Device TTS,
    Now in 31 Languages

    Supertonic 3 is a lightweight, open-weight text-to-speech system by Supertone Inc. It runs entirely via ONNX Runtime on your device: no cloud, no API call, no data leaving your machine. v3 expands from 5 to 31 languages, adds expression tags, reduces reading failures, and stays compatible with the v2 ONNX interface.

    31 Languages
    ~99M Parameters
    404 MB ONNX Assets
    MIT Code License

    What’s New in v3

    Four Core Improvements Over Supertonic 2

    Version 3 is a focused upgrade: same inference contract, meaningfully better output.

    • 🌐
      31 languages: Expanded from the 5-language v2 release (en, ko, es, pt, fr). Now includes Japanese, Arabic, German, Hindi, Russian, Turkish, Vietnamese, and 19 more ISO codes, plus a special na fallback for unknown languages.
    • ✅
      More stable reading: Fewer repeat and skip failures, especially on short and long utterances. This was a known limitation in v2 that v3 directly addresses.
    • 🎭
      Expression tags: Supported inline in text, without any separate preprocessing or external model.
    • 🔊
      Higher speaker similarity: Improved similarity across the shared-language set compared with Supertonic 2. Voices are more consistent across languages.

    Installation

    Get Running in Under a Minute

    Install the Python SDK via pip. On first run, model assets are downloaded automatically from Hugging Face; no manual setup is required.

    pip install supertonic

    Quick Start

    Basic Python Usage

    The SDK auto-downloads model assets on first run. Specify a voice, pass your text with a language code, and save the WAV output.

    from supertonic import TTS

    # Auto-downloads ONNX assets on first run
    tts = TTS(auto_download=True)

    # Select a preset voice (M1-M5 male, F1-F5 female)
    style = tts.get_voice_style(voice_name="M1")

    text = "A gentle breeze moved through the open window."

    # synthesize() returns (wav_array, duration_in_seconds)
    wav, duration = tts.synthesize(text, voice_style=style, lang="en")

    tts.save_audio(wav, "output.wav")
    print(f"Generated {duration:.2f}s of audio")

    text = "I can't believe it  that actually worked!"
    wav, duration = tts.synthesize(text, voice_style=style, lang="en")
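Building on the quick-start calls, the sketch below wraps them in a small helper that renders several texts, each with its own language code, using one shared voice style. `synthesize_many` is a hypothetical helper, not part of the SDK; it relies only on the three methods shown in the snippet above and assumes they behave as that snippet suggests.

```python
from typing import Iterable, List, Tuple

def synthesize_many(tts, items: Iterable[Tuple[str, str, str]],
                    voice_name: str = "M1") -> List[Tuple[str, float]]:
    """Render (text, lang, output_path) triples with one shared voice style.

    `tts` is expected to expose get_voice_style, synthesize and save_audio
    as used in the quick-start snippet; nothing else about the SDK is assumed.
    """
    style = tts.get_voice_style(voice_name=voice_name)
    results = []
    for text, lang, path in items:
        wav, duration = tts.synthesize(text, voice_style=style, lang=lang)
        tts.save_audio(wav, path)
        results.append((path, duration))
    return results

# Example job list: one entry per output file (hypothetical texts and paths).
jobs = [
    ("The meeting starts at nine.", "en", "meeting_en.wav"),
    ("La reunión empieza a las nueve.", "es", "meeting_es.wav"),
]
```

With the SDK installed, calling `synthesize_many(TTS(auto_download=True), jobs)` would write both files. Passing the `tts` object in, rather than constructing it inside the helper, also keeps the assets loaded once across all items.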

    Languages

    31 Supported Languages + na Fallback

    All 31 languages share the same model architecture and ONNX inference pipeline. Use the na code for text whose language is unknown or outside the supported set.

    en English

    ko Korean

    ja Japanese

    ar Arabic

    bg Bulgarian

    cs Czech

    da Danish

    de German

    el Greek

    es Spanish

    et Estonian

    fi Finnish

    fr French

    hi Hindi

    hr Croatian

    hu Hungarian

    id Indonesian

    it Italian

    lt Lithuanian

    lv Latvian

    nl Dutch

    pl Polish

    pt Portuguese

    ro Romanian

    ru Russian

    sk Slovak

    sl Slovenian

    sv Swedish

    tr Turkish

    uk Ukrainian

    vi Vietnamese
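In application code, the language list and the na fallback above might be combined into a small lookup that maps whatever locale tag the app has (for example from a browser or OS setting) onto a supported code. This is a sketch with hypothetical names; it assumes "hi" is the two-letter code for Hindi, and it is not a function the SDK provides.

```python
# The 31 two-letter codes from the list above ("hi" is Hindi).
SUPPORTED_LANGS = frozenset({
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi",
    "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro",
    "ru", "sk", "sl", "sv", "tr", "uk", "vi",
})
FALLBACK_LANG = "na"  # for unknown or unsupported languages

def resolve_lang(tag):
    """Map a locale tag such as 'pt-BR' or 'EN_us' to a supported code, else 'na'."""
    if not tag:
        return FALLBACK_LANG
    # Keep only the primary subtag: "pt-BR" -> "pt", "EN_us" -> "en"
    primary = tag.replace("_", "-").split("-")[0].lower()
    return primary if primary in SUPPORTED_LANGS else FALLBACK_LANG

print(resolve_lang("pt-BR"))  # pt
print(resolve_lang("zh-CN"))  # na
```

The result can be passed as the `lang` argument in the quick-start call, so unsupported locales degrade to the fallback instead of raising at synthesis time.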

    Text Normalization

    Handles Complex Inputs Without Pre-Processing

    Supertonic 3 reads financial expressions, dates, phone numbers, and technical units correctly out of the box, with no G2P module or phonetic annotations required. Below: Supertonic vs. four leading commercial/open-source systems.

    Category              Input Example             Supertonic 3   ElevenLabs / OpenAI / Gemini / Microsoft
    Financial Expression  $5.2M / $450K             ✓              ✗ All four failed
    Time & Date           4:45 PM, Wed Apr 3        ✓              ✗ All four failed
    Phone Number          (212) 555-0142 ext. 402   ✓              ✗ All four failed
    Technical Unit        2.3h at 30kph             ✓              ✗ All four failed

    Deployment & Assets

    Runs Everywhere: 11 Platforms, No GPU Required

    The public ONNX assets run on CPU in fixed-voice mode with no GPU dependency. Browser support is via WebGPU and WASM through onnxruntime-web. Audio output is 16-bit WAV; batch inference is supported.

    • 🐍 Python: ONNX Runtime
    • 🟨 Node.js: Server-side JS
    • 🌐 Browser: WebGPU / WASM
    • ☕ Java: JVM
    • ⚙️ C++: High-perf
    • 🔷 C#: .NET
    • 🔵 Go: Go runtime
    • 🍎 Swift / iOS: Native
    • 🦀 Rust: Systems
    • 💙 Flutter: Cross-platform

    • 📄 Code license: MIT
    • 🤖 Model license: OpenRAIL-M
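For readers unfamiliar with what "16-bit WAV" output involves, the sketch below writes floating-point samples as 16-bit PCM using only Python's standard library. It is independent of the SDK's own `save_audio`; the 44100 Hz sample rate and the sine-wave input are assumptions made for the demo, not properties of Supertonic.

```python
import math
import os
import struct
import tempfile
import wave

def write_wav16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range values
        # Scale to the signed 16-bit range and pack as little-endian
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)             # mono
        wf.setsampwidth(2)             # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))

# Demo: one second of a 440 Hz tone at an assumed 44100 Hz sample rate.
SAMPLE_RATE = 44100
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "tone_16bit.wav")
write_wav16(out_path, tone, SAMPLE_RATE)
```

Each sample costs two bytes, so one second of mono audio at this rate is about 88 KB before the header, which makes sizing buffers for batch output straightforward.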

    Key Takeaways

    • Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
    • New in v3: expression tags, more stable reading on short and long utterances, and improved speaker similarity vs. v2
    • v2-compatible public ONNX interface; existing integrations upgrade without changing inference code
    • Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being significantly smaller
    • v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
    • Native output of 16-bit WAV files, ensuring high-fidelity audio for engineering applications
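The throughput figures quoted in this article come in two forms, a real-time factor (RTF) and a "times faster than real time" multiple. The sketch below shows how the two relate, assuming the common convention that RTF is synthesis time divided by audio duration (the article does not define it). The `timed` helper is a hypothetical way to measure your own runs.

```python
import time

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds

def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# A 167x faster-than-real-time run produces 167 s of audio per second of compute.
print(round(real_time_factor(1.0, 167.0), 4))  # 0.006
# An RTF of 0.3 corresponds to roughly 3.3x faster than real time.
print(round(speedup(0.3), 2))                  # 3.33
```

Under this convention any RTF below 1.0 is faster than real time, so the two numbers describe very different hardware rather than contradicting each other.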

