    Tencent AI Open Sources Covo-Audio: A 7B Speech Language Model and Inference Pipeline for Real-Time Audio Conversations and Reasoning

    By Naveed Ahmad | 26/03/2026


    Tencent AI Lab has released Covo-Audio, a 7B-parameter end-to-end Large Audio Language Model (LALM). The model is designed to unify speech processing and language intelligence by directly processing continuous audio inputs and generating audio outputs within a single architecture.

    System Architecture

    The Covo-Audio framework consists of four main components designed for seamless cross-modal interaction:

    • Audio Encoder: The model uses Whisper-large-v3 as its primary encoder due to its robustness against background noise and diverse accents. This component operates at a frame rate of 50 Hz.
    • Audio Adapter: To bridge the encoder and the LLM, a specialized adapter employs three downsampling modules, combining linear and convolution layers to reduce the frame rate from 50 Hz to 6.25 Hz.
    • LLM Backbone: The system is built on Qwen2.5-7B-Base, which has been adapted to process interleaved sequences of continuous acoustic features and text tokens.
    • Speech Tokenizer and Decoder: The tokenizer, based on WavLM-large, uses a codebook size of 16,384 to produce discrete audio tokens at 25 Hz. The decoder employs a Flow-Matching (FM) based framework and a BigVGAN vocoder to reconstruct high-fidelity 24 kHz waveforms.
    Paper: https://arxiv.org/pdf/2602.09823
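As a quick sanity check on these frame rates, the sketch below computes how many frames each stage emits for a 10-second clip. It assumes the three adapter modules each halve the frame rate (consistent with 50 Hz / 2³ = 6.25 Hz, though the article only states the endpoints):

```python
def frames(duration_s: float, rate_hz: float) -> int:
    """Number of frames a stage emits for a clip of the given duration."""
    return int(duration_s * rate_hz)

ENCODER_HZ = 50.0    # Whisper-large-v3 encoder output
ADAPTER_HZ = 6.25    # 50 Hz / 2**3, assuming three 2x downsampling modules
TOKENIZER_HZ = 25.0  # WavLM-based speech tokenizer output

clip = 10.0  # seconds of input audio
print(frames(clip, ENCODER_HZ))    # 500 encoder frames
print(frames(clip, ADAPTER_HZ))    # 62 frames reaching the LLM (62.5 truncated)
print(frames(clip, TOKENIZER_HZ))  # 250 discrete speech tokens
```

The 8x reduction is what keeps long audio inputs affordable for the 7B backbone: the LLM sees roughly six positions per second of speech rather than fifty.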

    Hierarchical Tri-modal Interleaving

    A core contribution of this work is the Hierarchical Tri-modal Speech-Text Interleaving strategy. Unlike conventional methods that operate only at the word or character level, this framework aligns continuous acoustic features ($a_c$), discrete speech tokens ($a_d$), and natural-language text ($t$).

    The model uses two main patterns:

    1. Sequential Interleaving ($a_c \rightarrow t \rightarrow a_d$): Continuous features, text, and discrete tokens are arranged in a progressive chain.
    2. Parallel Integration ($a_c \rightarrow t \,|\, a_d$): Continuous features are aligned with a coupled text-discrete unit.

    The hierarchical aspect ensures structural coherence by using word-level interleaving for fine-grained alignment and sentence-level interleaving to preserve global semantic integrity in long-form utterances. The training process involved a two-stage pre-training pipeline processing a total of 2T tokens.
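The two patterns can be illustrated with toy token lists. This is a hypothetical sketch only: real continuous features are embedding vectors rather than strings, and the exact way text and discrete units are coupled in the parallel pattern is an assumption here:

```python
a_c = ["ac0", "ac1"]  # continuous acoustic features (vectors in reality)
t   = ["hello"]       # text tokens for the same phrase
a_d = ["ad0", "ad1"]  # discrete speech tokens for the same phrase

def sequential(a_c, t, a_d):
    """Sequential interleaving: a_c -> t -> a_d in one progressive chain."""
    return a_c + t + a_d

def parallel(a_c, t, a_d):
    """Parallel integration: a_c followed by coupled text/discrete pairs."""
    padded_t = t + ["<pad>"] * (len(a_d) - len(t))  # pad text to align lengths
    return a_c + list(zip(padded_t, a_d))

print(sequential(a_c, t, a_d))  # ['ac0', 'ac1', 'hello', 'ad0', 'ad1']
print(parallel(a_c, t, a_d))    # ['ac0', 'ac1', ('hello', 'ad0'), ('<pad>', 'ad1')]
```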

    Intelligence-Speaker Decoupling

    To mitigate the high cost of constructing large-scale dialogue data for specific speakers, the research team proposed an Intelligence-Speaker Decoupling strategy. This approach separates dialogue intelligence from voice rendering, allowing flexible voice customization using minimal text-to-speech (TTS) data.

    The method reformats high-quality TTS recordings into pseudo-conversations with a masked text loss. By excluding the text-response portion from the loss calculation, the model preserves its reasoning abilities while inheriting the naturalness of the TTS speaker. This enables personalized interaction without the need for extensive, speaker-specific dialogue datasets.
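A minimal sketch of the masked-text-loss idea follows, with illustrative marker tokens (the actual token names and masking convention are not specified in the article):

```python
# A TTS recording reformatted into a pseudo-conversation. The text-response
# span is excluded from the loss, so the model learns the speaker's voice
# (audio response) without its reasoning being retrained on synthetic text.
tokens = ["<user_audio>", "q1", "q2",       # user turn (audio)
          "<text_response>", "t1", "t2",    # text response (masked out of loss)
          "<audio_response>", "s1", "s2"]   # TTS speaker audio (trained)

def loss_mask(tokens):
    """1 = token contributes to the loss, 0 = excluded (text response)."""
    mask, in_text = [], False
    for tok in tokens:
        if tok == "<text_response>":
            in_text = True
        elif tok == "<audio_response>":
            in_text = False
        mask.append(0 if in_text else 1)
    return mask

print(loss_mask(tokens))  # [1, 1, 1, 0, 0, 0, 1, 1, 1]
```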

    Full-Duplex Voice Interaction

    Covo-Audio evolved into Covo-Audio-Chat-FD, a variant capable of simultaneous dual-stream communication. The audio encoder is reformulated into a chunk-streaming approach, and the user and model streams are chunk-interleaved in a 1:4 ratio. Each chunk represents 0.16 s of audio.
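Notably, the chunk size lines up exactly with the adapter's frame rate (1 / 6.25 Hz = 0.16 s), so each chunk corresponds to one post-adapter frame. The sketch below checks that arithmetic and shows one reading of the 1:4 ratio (one user chunk per four model chunks; chunk identifiers are hypothetical):

```python
CHUNK_S = 0.16                       # chunk duration from the article
ENCODER_HZ, ADAPTER_HZ = 50.0, 6.25  # stage frame rates

# Frames covered by one chunk at each stage (rounded to whole frames).
enc_frames_per_chunk = round(CHUNK_S * ENCODER_HZ)  # 8 encoder frames
ada_frames_per_chunk = round(CHUNK_S * ADAPTER_HZ)  # 1 adapter frame
print(enc_frames_per_chunk, ada_frames_per_chunk)   # 8 1

# 1:4 user-to-model chunk interleaving (chunk ids are illustrative).
user_chunks = ["u0", "u1"]
model_chunks = [f"m{i}" for i in range(8)]
stream = []
for i, u in enumerate(user_chunks):
    stream.append(u)                               # one user chunk...
    stream.extend(model_chunks[4 * i: 4 * i + 4])  # ...then four model chunks
print(stream)
```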

    The system manages conversational states via special architectural tokens:

    • THINK Token: Indicates a listening-only state while the model waits to respond.
    • SHIFT Token: Signals the transition to the model's speaking turn.
    • BREAK Token: Detects interruption signals (barge-ins), triggering the model to stop speaking immediately and switch back to listening.

    For multi-turn scenarios, the model implements a recursive context-filling strategy, where continuous audio features from user input and generated tokens from previous turns are prepended as historical context.
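The three control tokens can be read as a tiny state machine. The sketch below is an interpretation of the described behavior, not the model's actual implementation:

```python
def step(state: str, token: str) -> str:
    """Return the next conversational state given a control token."""
    if token == "THINK":
        return "listening"   # keep listening; no reply yet
    if token == "SHIFT":
        return "speaking"    # model takes the speaking turn
    if token == "BREAK":
        return "listening"   # user barge-in: stop speaking at once
    return state             # non-control tokens leave the state unchanged

state = "listening"
for tok in ["THINK", "SHIFT", "BREAK"]:
    state = step(state, tok)
    print(tok, "->", state)
```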

    Audio Reasoning and Reinforcement Learning

    To enhance complex reasoning, the model incorporates Chain-of-Thought (CoT) reasoning and Group Relative Policy Optimization (GRPO). The model is optimized using a verifiable composite reward function:

    $$R_{total} = R_{accuracy} + R_{format} + R_{consistency} + R_{thinking}$$

    This structure allows the model to optimize for correctness ($R_{accuracy}$), structured output adherence ($R_{format}$), logical coherence ($R_{consistency}$), and reasoning depth ($R_{thinking}$).
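A trivial sketch of the composite reward, with placeholder component values (the paper defines how each term is actually computed):

```python
def total_reward(accuracy: float, fmt: float, consistency: float,
                 thinking: float) -> float:
    """R_total = R_accuracy + R_format + R_consistency + R_thinking."""
    return accuracy + fmt + consistency + thinking

# Hypothetical rollout: correct answer, partial format and coherence credit,
# shallow reasoning. The weights/values here are illustrative only.
print(total_reward(accuracy=1.0, fmt=0.5, consistency=0.5, thinking=0.25))  # 2.25
```

Because each component is verifiable, GRPO can rank groups of sampled rollouts by this scalar without a learned reward model.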

    Evaluation and Performance

    Covo-Audio (7B) shows competitive or superior results across several evaluated benchmarks, with the strongest claims applying to models of comparable scale and selected speech/audio tasks. On the MMAU benchmark, it achieved an average score of 75.30%, the highest among the evaluated 7B-scale models. It particularly excelled in music understanding with a score of 76.05%. On the MMSU benchmark, Covo-Audio achieved a leading 66.64% average accuracy.

    Regarding its conversational variants, Covo-Audio-Chat demonstrated strong performance on URO-Bench, particularly in speech reasoning and spoken dialogue tasks, outperforming models such as Qwen3-Omni on the Chinese track. For empathetic interaction on the VStyle benchmark, it achieved state-of-the-art results in Mandarin for anger (4.89), sadness (4.93), and anxiety (5.00).

    The research team notes an 'early-response' issue in the GaokaoEval full-duplex setting, where unusually long silent pauses between vocal fragments can cause premature responses. This 'early-response' behavior correlates with the model's pause-handling success metric and is identified as a key direction for future optimization.

    Key Takeaways

    • Unified End-to-End Architecture: Covo-Audio is a 7B-parameter model that natively processes continuous audio inputs and generates high-fidelity audio outputs within a single, unified architecture. It eliminates the need for cascaded ASR-LLM-TTS pipelines, reducing error propagation and information loss.
    • Hierarchical Tri-modal Interleaving: The model employs a specialized strategy to align continuous acoustic features, discrete speech tokens, and natural-language text. By interleaving these modalities at both the word and sentence levels, it preserves global semantic integrity while capturing fine-grained prosodic nuances.
    • Intelligence-Speaker Decoupling: The Tencent research team introduces a technique to decouple dialogue intelligence from specific voice rendering. This allows flexible voice customization using lightweight text-to-speech (TTS) data, significantly reducing the cost of building personalized conversational agents.
    • Native Full-Duplex Interaction: The Covo-Audio-Chat-FD variant supports simultaneous listening and speaking. It uses special architectural tokens (THINK, SHIFT, and BREAK) to manage complex real-time dynamics such as smooth turn-taking, backchanneling, and user barge-ins.
    • Strong Parameter Efficiency: Despite its compact 7B scale, Covo-Audio achieves state-of-the-art or highly competitive performance across core benchmarks, including MMAU, MMSU, and URO-Bench. It frequently matches or exceeds the performance of much larger systems, such as 32B-parameter models, on audio and speech understanding tasks.

    Check out the Paper, the model on Hugging Face, and the Repo.




