    OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning

    By Naveed Ahmad | 28/04/2026 | 6 Mins Read


    Understanding what’s happening in an audio clip is a deceptively hard problem. Transcribing spoken words is the easy part. A truly capable system also needs to recognize who is speaking, detect their emotional state, interpret background sounds, analyze musical content, and answer time-grounded questions like ‘what did the speaker say at the 2-minute mark?’. Tackling all of that has traditionally required stitching together multiple specialized systems.

    The OpenMOSS team, MOSI.AI, and the Shanghai Innovation Institute have released MOSS-Audio: an open-source audio understanding model designed to unify all of these capabilities within a single foundation model.

    What MOSS-Audio Actually Does

    MOSS-Audio supports speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning over real-world audio. Its capability set breaks down into several distinct areas:

    • Speech & Content Understanding accurately recognizes and transcribes spoken content, supporting both word-level and sentence-level timestamp alignment.
    • Speaker, Emotion & Event Analysis identifies speaker traits, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
    • Scene & Sound Cue Extraction pulls meaningful signals from background sounds, environmental noise, and non-speech cues to infer scene context and atmosphere.
    • Music Understanding analyzes musical style, emotional progression, and instrumentation.
    • Audio Question Answering & Summarization handles questions and summaries across speech, podcasts, meetings, and interviews.
    • Complex Reasoning performs multi-hop reasoning over audio content, powered by both chain-of-thought training and reinforcement learning.

    In practical terms, a single MOSS-Audio model can do all of the above without switching between different specialized systems.
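    The GitHub repository defines the actual loading and inference API. Purely as a hypothetical sketch of what ‘one model, many tasks’ could look like in practice, a Hugging Face-style call might resemble the following; the model ID, processor class, and argument names here are assumptions, not details confirmed by the release.

```python
# Hypothetical usage sketch only -- the real classes, model ID, and prompt/audio
# argument format must be taken from the OpenMOSS/MOSS-Audio repository.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor  # assumed entry points

model_id = "OpenMOSS/MOSS-Audio-8B-Instruct"  # assumed Hub name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

waveform, sr = sf.read("meeting.wav")

# One model, several task types, selected purely through the text prompt.
prompts = [
    "Transcribe the speech with sentence-level timestamps.",
    "Describe the speaker's emotional state and any background sounds.",
    "What did the speaker say at the 2-minute mark?",
]
for prompt in prompts:
    inputs = processor(text=prompt, audio=waveform, sampling_rate=sr, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```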

    Four Model Variants

    The team released four variants at launch: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The naming convention is worth understanding when you’re deciding which to use. The Instruct variants are optimized for direct instruction following, making them well suited for production pipelines where you want predictable, structured outputs. The Thinking variants provide stronger chain-of-thought reasoning capabilities, better suited for tasks requiring multi-hop inference. The 4B models use Qwen3-4B as the LLM backbone, and the 8B models use Qwen3-8B, resulting in total model sizes of roughly 4.6B and 8.6B parameters respectively.

    https://github.com/OpenMOSS/MOSS-Audio

    The Architecture: Three Components Working Together

    MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by the MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz. These representations are then projected into the language model’s embedding space by the adapter, and finally consumed by the LLM for auto-regressive text generation.
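    To make the data flow concrete, here is a minimal structural sketch of that encoder-adapter-LLM pipeline. The class names, shapes, and the simple linear adapter are illustrative assumptions for exposition, not the actual MOSS-Audio implementation.

```python
# Structural sketch of the described pipeline; shapes and modules are assumed.
import torch
import torch.nn as nn

class AudioToTextPipeline(nn.Module):
    """Audio encoder -> modality adapter -> LLM, as described in the article."""

    def __init__(self, encoder: nn.Module, llm: nn.Module, enc_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder                      # MOSS-Audio-Encoder: waveform -> 12.5 Hz features
        self.adapter = nn.Linear(enc_dim, llm_dim)  # projects encoder features into the LLM embedding space
        self.llm = llm                              # Qwen3-4B / Qwen3-8B backbone

    def forward(self, waveform: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        audio_feats = self.encoder(waveform)        # (batch, T_audio, enc_dim), ~12.5 frames per second
        audio_embeds = self.adapter(audio_feats)    # (batch, T_audio, llm_dim)
        # The audio embeddings are concatenated with the text prompt embeddings, and the
        # LLM generates the answer auto-regressively over the joint sequence.
        joint = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(joint)
```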

    The research team trained the encoder from scratch rather than relying on off-the-shelf audio frontends. Their reasoning: a dedicated encoder delivers more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.

    Two architectural innovations within MOSS-Audio are worth understanding in detail.

    DeepStack Cross-Layer Feature Injection: A common weakness in audio models is that relying solely on the encoder’s top-layer features tends to lose low-level acoustic information, things like prosody, transient events, and local time-frequency structure. MOSS-Audio addresses this with a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder’s final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model’s early layers. This preserves multi-granularity information ranging from low-level acoustic details to high-level semantic abstractions, helping the model retain rhythm, timbre, transients, and background structure that a single high-level representation cannot fully capture.
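    A rough sketch of that injection pattern in code follows; the tapped layer indices, the projection type, and how the projected features reach the LLM’s early layers are all assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of DeepStack-style cross-layer feature injection.
import torch.nn as nn

class CrossLayerInjector(nn.Module):
    """Project selected intermediate encoder layers so each can be injected into
    one of the LLM's early layers, instead of relying only on the encoder's top layer."""

    def __init__(self, tap_layers: list, enc_dim: int, llm_dim: int):
        super().__init__()
        self.tap_layers = tap_layers  # e.g. [4, 8, 12]: earlier/intermediate encoder layers (assumed)
        self.projections = nn.ModuleList(
            [nn.Linear(enc_dim, llm_dim) for _ in tap_layers]  # one independent projection per tapped layer
        )

    def forward(self, encoder_hidden_states: list) -> list:
        # encoder_hidden_states: one (batch, T, enc_dim) tensor per encoder layer
        injected = [
            proj(encoder_hidden_states[idx])
            for proj, idx in zip(self.projections, self.tap_layers)
        ]
        # Each projected feature map would then be added to the hidden states of a
        # corresponding early LLM layer, preserving prosody, transients, and local
        # time-frequency detail that the final-layer features alone tend to lose.
        return injected
```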

    Time-Aware Representation: Time is a critical dimension in audio that text models aren’t naturally equipped to handle. MOSS-Audio addresses this through a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This lets the model learn ‘what happened when’ within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection, without requiring a separate localization head or post-processing pipeline.
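    A toy illustration of the idea is below; the marker token format and the spacing between markers are assumptions, and the released model defines its own scheme.

```python
# Toy illustration of time-marker insertion: explicit time tokens are interleaved
# with audio frame placeholders at a fixed interval (format and spacing assumed).
FRAME_RATE_HZ = 12.5   # encoder output rate described in the article
MARKER_EVERY_S = 2.0   # assumed spacing between time markers

def insert_time_markers(num_frames: int) -> list:
    """Return a token sequence of frame placeholders with <time=Xs> markers."""
    frames_per_marker = int(FRAME_RATE_HZ * MARKER_EVERY_S)
    sequence = []
    for i in range(num_frames):
        if i % frames_per_marker == 0:
            sequence.append(f"<time={i / FRAME_RATE_HZ:.1f}s>")
        sequence.append("<audio_frame>")
    return sequence

# 5 seconds of audio -> markers at 0.0s, 2.0s, and 4.0s interleaved with the frames
print(insert_time_markers(int(5 * FRAME_RATE_HZ))[:4])
```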

    Benchmark Performance

    The numbers are strong. On general audio understanding, MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 across four benchmarks: 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming the majority of open-source models. That includes larger models: Step-Audio-R1 at 33B scores 70.67, and Qwen3-Omni-30B-A3B-Instruct at 30B scores 67.91. For further context, Kimi-Audio (7B) scores 61.14 and MiMo-Audio-7B scores 62.97 on the same average. The 4B Thinking variant scores 68.37, meaning the smaller model with chain-of-thought training beats all larger open-source instruct-only competitors.
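    For reference, the headline figure is simply the mean of the four per-benchmark scores:

```python
# The reported 71.08 average is the mean of the four benchmark scores.
scores = {"MMAU": 77.33, "MMAU-Pro": 64.92, "MMAR": 66.53, "MMSU": 75.52}
print(f"{sum(scores.values()) / len(scores):.3f}")  # 71.075, i.e. the reported 71.08
```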

    On speech captioning, evaluated with an LLM-as-a-Judge methodology across 13 fine-grained dimensions including gender, age, accent, pitch, volume, speed, texture, clarity, fluency, emotion, tone, persona, and summary, the MOSS-Audio-Instruct variants lead in 11 out of 13 dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score of 3.7252.

    On automatic speech recognition (ASR) spanning 12 evaluation dimensions, including health conditions, code-switching, dialects, singing, and non-speech scenarios, MOSS-Audio-8B-Instruct achieves the lowest overall CER (Character Error Rate) of 11.30 across all tested models.
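    As a refresher, CER is the character-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal reference implementation is below; the example strings are invented for illustration and are not drawn from the benchmark.

```python
# Character Error Rate: Levenshtein distance over characters / reference length, in percent.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    dp = list(range(len(h) + 1))           # dp[j]: distance of empty reference vs h[:j]
    for i, rc in enumerate(r, start=1):
        prev, dp[0] = dp[0], i             # prev holds dp[i-1][j-1]
        for j, hc in enumerate(h, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion (reference char missing)
                        dp[j - 1] + 1,     # insertion (extra hypothesis char)
                        prev + (rc != hc)) # substitution, or match at no cost
            prev = cur
    return 100.0 * dp[-1] / len(r)

print(f"{cer('open source audio model', 'open sauce audio model'):.2f}%")  # ~8.70%
```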


    Key Takeaways

    • Single Model, Full Audio Stack: MOSS-Audio unifies speech transcription, speaker and emotion analysis, environmental sound understanding, music analysis, audio captioning, time-aware QA, and complex reasoning into one open-source model, eliminating the need to chain multiple specialized systems together.
    • Two Architectural Innovations Drive Performance: DeepStack Cross-Layer Feature Injection preserves multi-granularity acoustic information by injecting features from intermediate encoder layers directly into the LLM’s early layers, while time-marker insertion during pretraining gives the model explicit temporal awareness for timestamp-grounded tasks.
    • Best-in-Class Benchmark Results at Efficient Scale: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 on general audio understanding benchmarks, outperforming all open-source models including 30B+ systems, while the 4B Thinking variant alone beats every larger open-source instruct-only competitor.
    • Dominant Timestamp ASR Accuracy: MOSS-Audio-8B-Instruct scores 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech, dramatically outperforming both Qwen3-Omni-30B-A3B-Instruct (833.66) and the closed-source Gemini-3.1-Pro (708.24) on the same benchmark.


