Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. It is lightning-fast, on-device, multilingual, and accurate TTS.
What Changed from v2 to v3
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 total ISO language codes. There is also a special na fallback for text whose language is unknown or outside the supported set.
The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is far smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched Voice Builder, which lets developers create custom, edge-native TTS models from their own voice recordings.
One new capability in v3 that was not present in v2 is expressive tag support. Supertonic 3 supports simple expression tags for cues such as breathing and laughter. These let you embed prosodic cues directly into input text with no separate preprocessing step or separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.
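In practice, an application may also need to target engines that do not understand such tags. The sketch below strips inline expression tags from a payload as a fallback; the tag names `<breath>` and `<laugh>` are hypothetical placeholders for illustration, since the exact tag vocabulary is defined by the Supertonic 3 release.

```python
import re

# Hypothetical tag names (<breath>, <laugh>) used for illustration only;
# consult the Supertonic 3 docs for the tags the model actually recognizes.
EXPRESSION_TAG = re.compile(r"<(?:breath|laugh)>\s?")

def strip_expression_tags(text: str) -> str:
    """Remove inline expression tags, e.g. for engines that don't support them."""
    return EXPRESSION_TAG.sub("", text).strip()

tagged = "Well, <breath> that was unexpected. <laugh> Let's go again."
print(strip_expression_tags(tagged))
# -> Well, that was unexpected. Let's go again.
```

The tagged string itself would be passed unchanged to the synthesizer when the engine supports expressive tags.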
Architecture and Runtime
The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for superior text-speech alignment and uses a Self-Purifying Flow Matching technique during training to stay robust against noisy data labels.
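The low-step-count property can be illustrated with a toy 1-D flow-matching sampler (not Supertonic's actual model): integrate a vector field dx/dt = v(x, t) from noise toward data with a few Euler steps. Here a closed-form field for a straight-line transport path stands in for the learned network, and a fixed target value stands in for an audio latent.

```python
import random

TARGET = 3.0  # stand-in for a data sample (e.g., an audio latent)

def v(x: float, t: float) -> float:
    # For the straight-line path x_t = (1 - t) * x0 + t * x1, the
    # conditional target field is v = (x1 - x_t) / (1 - t).
    return (TARGET - x) / (1.0 - t)

def sample(steps: int = 2) -> float:
    x = random.gauss(0.0, 1.0)  # start from the simple (noise) distribution
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x += v(x, t) * dt       # Euler step along the vector field
        t += dt
    return x

random.seed(0)
print(sample(steps=2))  # reaches TARGET (up to float rounding)
```

Because the toy field is exact, even 2 Euler steps land on the target; a trained network only approximates such a field, but the same few-step integration is what makes flow matching fast at inference time.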
On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on an A100 GPU, and uses significantly less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.
Reading Accuracy
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS intelligibility metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. The system's efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. Additionally, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation leverages onnxruntime-web for pure client-side execution.
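The WER computation described above is a word-level edit distance. A minimal self-contained sketch, where `reference` is the original passage and `hypothesis` is the ASR transcript of the synthesized audio:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

CER is computed the same way over characters instead of words, which is why it is preferred for languages without clear word boundaries.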
Text Normalization
A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms without any preprocessing pipeline or phonetic annotations: financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. The financial expression "$5.2M" should be read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars." All four competing systems failed this. The technical unit "2.3h" should be read as "two point three hours" and "30kph" as "thirty kilometers per hour." All four competitors also failed this category. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
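To make the task concrete, here is an illustrative-only sketch of one such normalization rule, turning "$5.2M"-style money expressions into speakable text. This is not Supertonic's implementation, which handles far more surface forms; the sketch covers only single-digit whole parts for brevity.

```python
import re

DIGITS = "zero one two three four five six seven eight nine".split()
SCALE = {"K": "thousand", "M": "million", "B": "billion"}

def say_digits(s: str) -> str:
    """Spell a digit string one digit at a time, e.g. '2' -> 'two'."""
    return " ".join(DIGITS[int(c)] for c in s)

def normalize_money(text: str) -> str:
    # Matches forms like $5.2M (single-digit whole part, for brevity).
    def repl(m: re.Match) -> str:
        whole, frac, scale = m.group(1), m.group(2), m.group(3)
        spoken = DIGITS[int(whole)]
        if frac:
            spoken += " point " + say_digits(frac)
        return f"{spoken} {SCALE[scale]} dollars"
    return re.sub(r"\$(\d)(?:\.(\d+))?([KMB])", repl, text)

print(normalize_money("Revenue hit $5.2M this year."))
# -> Revenue hit five point two million dollars this year.
```

Building rules like this for phone numbers, dates, times, and units across 31 languages is exactly the preprocessing burden the built-in normalization removes.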
Getting Started
The Python SDK installs with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:
from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
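With the duration that synthesize returns, you can also measure the real-time factor (RTF) on your own hardware. A small sketch, where the callable is assumed to return a (wav, duration) pair as in the quickstart; the `fake` synthesizer is a stand-in for illustration:

```python
import time

def real_time_factor(synthesize, text: str) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced.
    Values below 1.0 mean faster than real time."""
    start = time.perf_counter()
    _, audio_seconds = synthesize(text)  # expected to return (wav, duration)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Stand-in synthesizer for illustration; with the real SDK you would pass
# lambda t: tts.synthesize(t, voice_style=style, lang="en").
fake = lambda text: (b"", 2.0)  # pretends to produce 2 s of audio instantly
print(real_time_factor(fake, "hello") < 1.0)  # True: faster than real time
```

This is the metric behind the 0.3x figure reported for the E-ink device above: 0.3 seconds of compute per second of audio.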
Key Takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: expressive tags for cues such as breathing and laughter, more stable reading on short and long utterances, and improved speaker similarity vs. v2
- v2-compatible public ONNX interface: existing integrations upgrade without changing inference code
- Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being significantly smaller
- v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
- Native output of 16-bit WAV files, ensuring high-fidelity audio for engineering applications
Check out the GitHub Repo and Hugging Face Space.
