Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. It is lightning-fast, on-device, multilingual, and accurate TTS.
What Changed from v2 to v3
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 total ISO language codes. There is also a special na fallback for text whose language is unknown or outside the supported set.
The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is far smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched Voice Builder, which lets developers create custom, edge-native TTS models from their own voice recordings.
One new capability in v3 that was not present in v2 is expressive tag support. Supertonic 3 supports simple expression tags for cues such as breathing and laughter. These let you embed prosodic cues directly into input text with no separate preprocessing step or separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.
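In practice, an application may also need to target engines that do not understand such tags. The sketch below strips inline expression tags from a payload as a fallback; the tag names `<breath>` and `<laugh>` are hypothetical placeholders for illustration, since the exact tag vocabulary is defined by the Supertonic 3 release.

```python
import re

# Hypothetical tag names (<breath>, <laugh>) used for illustration only;
# consult the Supertonic 3 docs for the tags the model actually recognizes.
EXPRESSION_TAG = re.compile(r"<(?:breath|laugh)>\s?")

def strip_expression_tags(text: str) -> str:
    """Remove inline expression tags, e.g. for engines that don't support them."""
    return EXPRESSION_TAG.sub("", text).strip()

tagged = "Well, <breath> that was unexpected. <laugh> Let's go again."
print(strip_expression_tags(tagged))
# -> Well, that was unexpected. Let's go again.
```

The tagged string itself would be passed unchanged to the synthesizer when the engine supports expressive tags.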
Architecture and Runtime
The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for superior text-speech alignment and uses a Self-Purifying Flow Matching technique during training to stay robust against noisy data labels.
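The low-step-count property can be illustrated with a toy 1-D flow-matching sampler (not Supertonic's actual model): integrate a vector field dx/dt = v(x, t) from noise toward data with a few Euler steps. Here a closed-form field for a straight-line transport path stands in for the learned network, and a fixed target value stands in for an audio latent.

```python
import random

TARGET = 3.0  # stand-in for a data sample (e.g., an audio latent)

def v(x: float, t: float) -> float:
    # For the straight-line path x_t = (1 - t) * x0 + t * x1, the
    # conditional target field is v = (x1 - x_t) / (1 - t).
    return (TARGET - x) / (1.0 - t)

def sample(steps: int = 2) -> float:
    x = random.gauss(0.0, 1.0)  # start from the simple (noise) distribution
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x += v(x, t) * dt       # Euler step along the vector field
        t += dt
    return x

random.seed(0)
print(sample(steps=2))  # reaches TARGET (up to float rounding)
```

Because the toy field is exact, even 2 Euler steps land on the target; a trained network only approximates such a field, but the same few-step integration is what makes flow matching fast at inference time.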
On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on an A100 GPU, and uses significantly less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.
Reading Accuracy
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS intelligibility metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. The system's efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. Additionally, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation leverages onnxruntime-web for pure client-side execution.
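The WER computation described above is a word-level edit distance. A minimal self-contained sketch, where `reference` is the original passage and `hypothesis` is the ASR transcript of the synthesized audio:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

CER is computed the same way over characters instead of words, which is why it is preferred for languages without clear word boundaries.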
Text Normalization
A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms without any preprocessing pipeline or phonetic annotations: financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. The financial expression "$5.2M" should be read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars." All four competing systems failed this. The technical unit "2.3h" should be read as "two point three hours" and "30kph" as "thirty kilometers per hour." All four competitors also failed this category. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
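To make the task concrete, here is an illustrative-only sketch of one such normalization rule, turning "$5.2M"-style money expressions into speakable text. This is not Supertonic's implementation, which handles far more surface forms; the sketch covers only single-digit whole parts for brevity.

```python
import re

DIGITS = "zero one two three four five six seven eight nine".split()
SCALE = {"K": "thousand", "M": "million", "B": "billion"}

def say_digits(s: str) -> str:
    """Spell a digit string one digit at a time, e.g. '2' -> 'two'."""
    return " ".join(DIGITS[int(c)] for c in s)

def normalize_money(text: str) -> str:
    # Matches forms like $5.2M (single-digit whole part, for brevity).
    def repl(m: re.Match) -> str:
        whole, frac, scale = m.group(1), m.group(2), m.group(3)
        spoken = DIGITS[int(whole)]
        if frac:
            spoken += " point " + say_digits(frac)
        return f"{spoken} {SCALE[scale]} dollars"
    return re.sub(r"\$(\d)(?:\.(\d+))?([KMB])", repl, text)

print(normalize_money("Revenue hit $5.2M this year."))
# -> Revenue hit five point two million dollars this year.
```

Building rules like this for phone numbers, dates, times, and units across 31 languages is exactly the preprocessing burden the built-in normalization removes.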
Getting Started
The Python SDK installs with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:
from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
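With the duration that synthesize returns, you can also measure the real-time factor (RTF) on your own hardware. A small sketch, where the callable is assumed to return a (wav, duration) pair as in the quickstart; the `fake` synthesizer is a stand-in for illustration:

```python
import time

def real_time_factor(synthesize, text: str) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced.
    Values below 1.0 mean faster than real time."""
    start = time.perf_counter()
    _, audio_seconds = synthesize(text)  # expected to return (wav, duration)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Stand-in synthesizer for illustration; with the real SDK you would pass
# lambda t: tts.synthesize(t, voice_style=style, lang="en").
fake = lambda text: (b"", 2.0)  # pretends to produce 2 s of audio instantly
print(real_time_factor(fake, "hello") < 1.0)  # True: faster than real time
```

This is the metric behind the 0.3x figure reported for the E-ink device above: 0.3 seconds of compute per second of audio.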
Key Takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: expressive tags for cues such as breathing and laughter, more stable reading on short and long utterances, and improved speaker similarity vs. v2
- v2-compatible public ONNX interface: existing integrations upgrade without changing inference code
- Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being significantly smaller
- v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
- Native output of 16-bit WAV files, ensuring high-fidelity audio for engineering applications
Check out the GitHub Repo and Hugging Face Space.
