Fish Audio Releases Fish Audio S2: A New Era of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

By Naveed Ahmad | 11/03/2026 | 5 Mins Read


The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio's release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. The release provides a framework for zero-shot voice cloning and granular emotional control using a Dual-Auto-Regressive (AR) approach.

Architecture: The Dual-AR Framework and RVQ

The fundamental technical distinction in Fish Audio S2-Pro is its hierarchical Dual-AR architecture. Traditional TTS models often struggle with the trade-off between sequence length and acoustic detail. S2-Pro addresses this by bifurcating the generation process into two specialized stages: a 'Slow AR' model and a 'Fast AR' model.

1. The Slow AR Model (4B Parameters): This component operates on the time axis. It is responsible for processing linguistic input and producing semantic tokens. By employing a larger parameter count (roughly 4 billion), the Slow AR model captures long-range dependencies, prosody, and the structural nuances of speech.
2. The Fast AR Model (400M Parameters): This component processes the acoustic dimension. It predicts the residual codebooks for each semantic token. This smaller, faster model ensures that the high-frequency details of the audio (timbre, breathiness, and texture) are generated with high efficiency.
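The two-stage generation loop can be sketched in a few lines of Python. This is a toy illustration of the Slow/Fast split described above, not Fish Audio's actual API: the function names and the dummy arithmetic inside them are stand-ins, and the real models are large Transformers.

```python
def slow_ar_step(text_tokens, semantic_history):
    """Stand-in for the 4B 'Slow AR' model: emits one semantic
    token per time step, conditioned on text and prior tokens."""
    return (sum(text_tokens) + len(semantic_history)) % 256  # dummy prediction

def fast_ar_codes(semantic_token, num_codebooks=8):
    """Stand-in for the 400M 'Fast AR' model: predicts the residual
    codebook indices (the acoustic detail) for one semantic token."""
    return [(semantic_token + level) % 1024 for level in range(num_codebooks)]

def generate(text_tokens, num_steps=4):
    """Dual-AR loop: the slow model advances along the time axis,
    the fast model fills in the codebook (depth) axis per step."""
    semantic, acoustic = [], []
    for _ in range(num_steps):
        s = slow_ar_step(text_tokens, semantic)   # time axis (slow model)
        semantic.append(s)
        acoustic.append(fast_ar_codes(s))         # depth axis (fast model)
    return semantic, acoustic
```

The point of the split is visible in the loop shape: the expensive model runs once per time step, while the cheap model runs once per step per codebook level.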

This approach relies on Residual Vector Quantization (RVQ). In this setup, raw audio is compressed into discrete tokens across multiple layers (codebooks). The first layer captures the primary acoustic features, while subsequent layers capture the 'residuals', the errors left over from the previous layer. This allows the model to reconstruct high-fidelity 44.1kHz audio while maintaining a manageable token count for the Transformer architecture.
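A minimal sketch of the RVQ encode/decode cycle, assuming small NumPy codebooks for illustration (real codecs learn the codebooks and operate on encoder features, not raw vectors):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one feature frame with residual vector quantization:
    each codebook layer quantizes what the previous layers left over."""
    residual = frame.astype(float)
    indices = []
    for cb in codebooks:                       # cb shape: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest codeword to the residual
        indices.append(idx)
        residual = residual - cb[idx]          # pass the remaining error down
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```

Each frame becomes one small integer per layer, which is what keeps the Transformer's token count manageable at 44.1kHz.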

Emotional Control via In-Context Learning and Inline Tags

Fish Audio S2-Pro achieves what the developers describe as 'absurdly controllable emotion' through two primary mechanisms: zero-shot in-context learning and natural-language inline control.

In-Context Learning (ICL):

Unlike older generations of TTS that required explicit fine-tuning to mimic a specific voice, S2-Pro uses the Transformer's ability to perform in-context learning. By providing a reference audio clip, ideally between 10 and 30 seconds, the model extracts the speaker's identity and emotional state. The model treats this reference as a prefix in its context window, allowing it to continue the "sequence" in the same voice and style.
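Conceptually, the prefix construction amounts to sequence concatenation. The sketch below is hypothetical (the special-token values and layout are illustrative, not the repository's actual prompt format), but it shows why no fine-tuning is needed: the clone target is just more context.

```python
def build_icl_prompt(ref_text_tokens, ref_audio_tokens, target_text_tokens):
    """Zero-shot cloning as in-context learning: the reference clip's
    transcript and audio tokens form a prefix that the model simply
    continues, carrying the speaker's timbre and emotional state forward."""
    BOS, SEP = -1, -2   # illustrative special tokens, not the real vocabulary
    return ([BOS] + ref_text_tokens + [SEP]
            + ref_audio_tokens + [SEP]
            + target_text_tokens)
```

The model's next-token predictions after this prefix are the audio tokens for the target text, spoken in the reference voice.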

Inline Control Tags:

The model supports dynamic emotional transitions within a single generation pass. Because the model was trained on data containing descriptive linguistic markers, developers can insert natural-language tags directly into the text prompt. For example:

[whisper] I have a secret [laugh] that I can't tell you.

The model interprets these tags as instructions to modify the acoustic tokens in real time, adjusting pitch, intensity, and rhythm without requiring a separate emotional embedding or external control vector.
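For intuition, here is a small parser showing how a tagged prompt decomposes into (tag, text) segments. This is purely illustrative tooling, not part of Fish Audio's pipeline; in S2-Pro the tags stay in the prompt and steer generation directly.

```python
import re

TAG_RE = re.compile(r"\[(\w+)\]")

def parse_inline_tags(prompt):
    """Split a prompt like '[whisper] hello [laugh] bye' into
    (tag, text) segments; a tag of None means neutral delivery."""
    segments, last_tag, pos = [], None, 0
    for m in TAG_RE.finditer(prompt):
        text = prompt[pos:m.start()].strip()
        if text:
            segments.append((last_tag, text))
        last_tag, pos = m.group(1), m.end()
    tail = prompt[pos:].strip()
    if tail:
        segments.append((last_tag, tail))
    return segments
```

Applied to the example above, the prompt splits into a whispered span followed by a laughing span, which is exactly the mid-utterance transition the model performs in one pass.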

Performance Benchmarks and SGLang Integration

When integrating TTS into real-time applications, the primary constraint is 'Time to First Audio' (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with benchmarks on NVIDIA H200 hardware reaching roughly 100ms.

Several technical optimizations contribute to this performance:

• SGLang and RadixAttention: S2-Pro is designed to work with SGLang, a high-performance serving framework. It uses RadixAttention, which allows for efficient Key-Value (KV) cache management. In a production environment where the same "master" voice prompt (reference clip) is used repeatedly, RadixAttention caches the prefix's KV states. This eliminates the need to re-compute the reference audio for every request, significantly reducing prefill time.
• Multi-Speaker Single-Pass Generation: The architecture allows multiple speaker identities to be present within the same context window. This enables the generation of complex dialogues or multi-character narrations in a single inference call, avoiding the latency overhead of switching models or reloading weights for different speakers.
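The prefill saving from prefix caching can be shown with a toy cache. This is a deliberate simplification of RadixAttention (which shares prefixes in a radix tree, not a flat dictionary); `expensive_prefill` stands in for the transformer prefill over the voice prompt.

```python
class PrefixKVCache:
    """Toy model of prefix KV caching: states for a shared voice-prompt
    prefix are computed once, then reused by every later request."""

    def __init__(self):
        self.cache = {}
        self.prefill_calls = 0   # counts how often we paid for prefill

    def expensive_prefill(self, prefix_tokens):
        """Stand-in for the costly transformer prefill pass."""
        self.prefill_calls += 1
        return [t * 2 for t in prefix_tokens]   # dummy "KV states"

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key not in self.cache:
            self.cache[key] = self.expensive_prefill(prefix_tokens)
        return self.cache[key]
```

With a fixed master voice prompt, only the first request pays the prefill cost; every subsequent request starts generating almost immediately, which is where the TTFA gain comes from.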

Technical Implementation and Data Scaling

The Fish Speech repository provides a Python-based implementation built on PyTorch. The model was trained on a diverse dataset comprising over 300,000 hours of multilingual audio. This scale is what enables the model's robust performance across different languages and its ability to handle 'non-verbal' vocalizations like sighs or hesitations.

The training pipeline involves:

1. VQ-GAN Training: Training the quantizer to map audio into a discrete latent space.
2. LLM Training: Training the Dual-AR transformers to predict these latent tokens based on text and acoustic prefixes.
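The stage-2 objective is ordinary next-token prediction over the latent audio tokens. A minimal sketch of that loss, assuming per-step logits are already computed (the real training uses PyTorch's batched cross-entropy, not this loop):

```python
import math

def next_token_nll(logits_seq, target_tokens):
    """Average negative log-likelihood of each target latent token
    given its prefix; logits_seq[i] are the model's scores at step i."""
    total = 0.0
    for logits, tgt in zip(logits_seq, target_tokens):
        z = max(logits)                                   # for stability
        log_norm = z + math.log(sum(math.exp(x - z) for x in logits))
        total += -(logits[tgt] - log_norm)                # -log softmax[tgt]
    return total / len(target_tokens)
```

Stage 1 fixes the token vocabulary (the VQ-GAN codebooks); stage 2 then reduces speech generation to language modeling over that vocabulary.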

The VQ-GAN used in S2-Pro is specifically tuned to minimize artifacts during the decoding process, ensuring that even at high compression ratios, the reconstructed audio remains 'clean' (indistinguishable from the source to the human ear).

    Key Takeaways

• Dual-AR Architecture (Slow/Fast): Unlike single-stage models, S2-Pro splits duties between a 4B-parameter 'Slow AR' model (for linguistic and prosodic structure) and a 400M-parameter 'Fast AR' model (for acoustic refinement), optimizing for both detail and speed.
• Sub-150ms Latency: Engineered for real-time conversational AI, the model achieves a Time-to-First-Audio (TTFA) of ~100ms on high-end hardware, making it suitable for live agents and interactive applications.
• Hierarchical RVQ Encoding: By using Residual Vector Quantization, the system compresses 44.1kHz audio into discrete tokens across multiple layers. This allows the model to reconstruct complex vocal textures, including breaths and sighs, without the computational bloat of raw waveforms.
• Zero-Shot In-Context Learning: Developers can clone a voice and its emotional state by providing a 10–30 second reference clip. The model treats this as a prefix, adopting the speaker's timbre and prosody without requiring additional fine-tuning.
• RadixAttention & SGLang Integration: Optimized for production, S2-Pro leverages RadixAttention to cache the KV states of voice prompts. This allows for nearly instant generation when the same speaker is used repeatedly, drastically reducing prefill overhead.

Check out the Model Card and Repo.



