Audio understanding has always been the multimodal frontier that lags behind vision. While image-language models have rapidly scaled toward real-world deployment, building open models that robustly reason over speech, environmental sounds, and music, especially at length, has remained quite hard. Researchers at NVIDIA and the University of Maryland are now taking a direct swing at that gap.
The research team has released Audio Flamingo Next (AF-Next), the most capable model in the Audio Flamingo series and a fully open Large Audio-Language Model (LALM) trained on internet-scale audio data.
AF-Next comes in three specialized variants for different use cases: AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning.
What Is a Large Audio-Language Model (LALM)?
A Large Audio-Language Model (LALM) pairs an audio encoder with a decoder-only language model to enable question answering, captioning, transcription, and reasoning directly over audio inputs. Think of it as the audio equivalent of a vision-language model like LLaVA or GPT-4V, but designed to handle speech, environmental sounds, and music simultaneously within a single unified model.
The Architecture: Four Components Working in a Pipeline
AF-Next is built around four main components. First is the AF-Whisper audio encoder, a custom Whisper-based encoder further pre-trained on a larger and more diverse corpus, including multilingual speech and multi-talker ASR data. Given an audio input, the model resamples it to 16 kHz mono and converts the waveform into a 128-channel log mel-spectrogram using a 25 ms window and 10 ms hop size. The spectrogram is processed in non-overlapping 30-second chunks through AF-Whisper, which outputs features at 50 Hz, after which a stride-2 pooling layer is applied. The hidden dimension is 1280.
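The frame rates above imply a simple token budget per clip. The following is a minimal sketch of that arithmetic, not the authors' code: the helper name `audio_token_budget` is hypothetical, and it assumes the final partial chunk is padded to a full 30 seconds.

```python
import math

# Constants taken from the article's description of the audio front end.
HOP_MS = 10             # mel-spectrogram hop size
CHUNK_SECONDS = 30      # non-overlapping chunk length
ENCODER_RATE_HZ = 50    # AF-Whisper output feature rate
POOL_STRIDE = 2         # pooling applied after the encoder


def audio_token_budget(duration_s: float) -> dict:
    """Estimate frame and token counts for a clip (hypothetical helper).

    Assumption: the last partial chunk is padded to a full 30 s chunk.
    """
    chunks = max(1, math.ceil(duration_s / CHUNK_SECONDS))
    mel_frames_per_chunk = CHUNK_SECONDS * 1000 // HOP_MS
    tokens_per_chunk = CHUNK_SECONDS * ENCODER_RATE_HZ // POOL_STRIDE
    return {
        "chunks": chunks,
        "mel_frames": chunks * mel_frames_per_chunk,
        "audio_tokens": chunks * tokens_per_chunk,
        # Time covered by one audio token after pooling.
        "token_stride_ms": 1000 * POOL_STRIDE / ENCODER_RATE_HZ,
    }


if __name__ == "__main__":
    print(audio_token_budget(30 * 60))
```

Under these assumptions, a 30-minute clip maps to 60 chunks and 45,000 audio tokens at a 40 ms stride, which leaves room for text within a 128K-token context.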
Second is the audio adaptor, a 2-layer MLP that maps AF-Whisper’s audio representations into the language model’s embedding space. Third is the LLM backbone: Qwen-2.5-7B, a decoder-only causal model with 7B parameters, 36 transformer layers, and 16 attention heads, with context length extended from 32k to 128k tokens through additional long-context training.
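As a rough illustration of the adaptor's role, the sketch below implements a 2-layer MLP in plain Python. The article states only the depth and the 1280-dimensional encoder output; the GELU activation, the random initialization, and the toy widths `HIDDEN_DIM` and `LLM_DIM` are assumptions chosen for illustration and do not reflect the real model's sizes.

```python
import math
import random

AUDIO_DIM = 1280   # encoder hidden dimension, from the article
HIDDEN_DIM = 16    # toy width (assumption, not the real model's)
LLM_DIM = 16       # toy LLM embedding width (assumption)


def gelu(x: float) -> float:
    """Exact GELU; the activation choice itself is an assumption."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Create (weights, bias) with a small uniform random init."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def adaptor(audio_feature, layer1, layer2):
    """Project one audio feature vector into the LLM embedding space."""
    hidden = [gelu(h) for h in linear(audio_feature, layer1)]
    return linear(hidden, layer2)


rng = random.Random(0)
layer1 = make_linear(AUDIO_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)
audio_feature = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
audio_embedding = adaptor(audio_feature, layer1, layer2)
```

Each pooled audio frame passes through this projection independently, so the adaptor changes the feature width but not the number of audio tokens.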
A subtle but important architectural detail is Rotary Time Embeddings (RoTE). Standard positional encodings in transformers index a token by its discrete sequence position i. RoTE replaces this: instead of the standard RoPE rotation angle θ ← −i · 2π, RoTE uses θ ← −τ_i · 2π, where τ_i is each token’s absolute timestamp. For audio tokens produced at a fixed 40 ms stride, discrete time positions are interpolated before being fed into the RoTE module. This yields positional representations grounded in actual time rather than sequence order, a core design choice enabling the model’s temporal reasoning, particularly for long audio. Finally, the fourth component, a streaming TTS module, enables voice-to-voice interaction.
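The RoTE idea described above can be sketched on a single 2-D feature pair. This is an illustrative toy, not the model's implementation: the function names are hypothetical, and the per-pair frequency scale `freq` is an assumption borrowed from how RoPE is usually implemented (with `freq=1.0` the angle matches the article's θ ← −τ_i · 2π).

```python
import math


def token_timestamps(num_tokens: int, stride_s: float = 0.04):
    """Absolute timestamps (seconds) for audio tokens at a 40 ms stride."""
    return [i * stride_s for i in range(num_tokens)]


def rotate_pair(x: float, y: float, tau: float, freq: float = 1.0):
    """Rotate one 2-D feature pair by theta = -tau * 2*pi * freq."""
    theta = -tau * 2.0 * math.pi * freq
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)


def attention_score(q, k, tau_q: float, tau_k: float, freq: float = 1.0):
    """Dot product of a query and key after time-based rotation."""
    qx, qy = rotate_pair(q[0], q[1], tau_q, freq)
    kx, ky = rotate_pair(k[0], k[1], tau_k, freq)
    return qx * kx + qy * ky


if __name__ == "__main__":
    q, k = (1.0, 0.5), (0.3, -0.8)
    # Same 0.5 s gap at two different offsets in the recording.
    print(attention_score(q, k, 1.0, 1.5, freq=0.1))
    print(attention_score(q, k, 10.0, 10.5, freq=0.1))
```

Because both vectors are rotated by angles proportional to their timestamps, the score depends only on the time gap between them, so the two printed values agree up to floating-point error.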
Temporal Audio Chain-of-Thought: The Key Reasoning Recipe
Chain-of-Thought (CoT) prompting has improved reasoning across text and vision models, but prior audio CoT work showed only small gains because training datasets were limited to short clips with simple questions. AF-Next addresses this with Temporal Audio Chain-of-Thought, where the model explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer, encouraging faithful evidence aggregation and reducing hallucination over long recordings.
To train this capability, the research team created AF-Think-Time, a dataset of question–answer–thinking-chain triplets curated from challenging audio sources including trailers, movie recaps, mystery stories, and long-form multi-party conversations. AF-Think-Time consists of roughly 43K training samples, with an average of 446.3 words per thinking chain.
Training at Scale: 1 Million Hours, 4 Stages
The final training dataset comprises roughly 108 million samples and roughly 1 million hours of audio, drawn from both existing publicly released datasets and raw audio collected from the open internet and subsequently labeled synthetically. New data categories include over 200K long videos spanning 5 to 30 minutes for long-form captioning and QA; multi-talker speech understanding data covering speaker identification, interruption identification, and target speaker ASR; roughly 1 million samples for multi-audio reasoning across multiple simultaneous audio inputs; and roughly 386K safety and instruction-following samples.
Training follows a four-stage curriculum, each stage with distinct data mixtures and context lengths. Pre-training has two sub-stages: Stage 1 trains only the audio adaptor while keeping both AF-Whisper and the LLM frozen (max audio 30 seconds, 8K token context); Stage 2 additionally fine-tunes the audio encoder while still keeping the LLM frozen (max audio 1 minute, 8K token context). Mid-training also has two sub-stages: Stage 1 performs full fine-tuning of the entire model, adding AudioSkills-XL and newly curated data (max audio 10 minutes, 24K token context); Stage 2 introduces long-audio captioning and QA, down-sampling the Stage 1 mixture to half its original blend weights while expanding context to 128K tokens and audio to 30 minutes. The model resulting from mid-training is released as AF-Next-Captioner. Post-training applies GRPO-based reinforcement learning focusing on multi-turn chat, safety, instruction following, and selected skill-specific datasets, producing AF-Next-Instruct. Finally, CoT-training starts from AF-Next-Instruct, applies SFT on AF-Think-Time, then GRPO using the post-training data mixture, producing AF-Next-Think.
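The curriculum above can be summarized as a small data structure. This is only a reading aid built from the figures in the text: the `Stage` class and `stage_for` helper are hypothetical, fields the article does not state are left as `None`, and the trainable set for the second mid-training sub-stage is assumed to match the first.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Optional[Tuple[str, ...]]  # None = not stated in the article
    max_audio_s: Optional[int]            # None = not stated in the article
    context_k: Optional[int]              # context length in K tokens
    method: Optional[str] = None          # None = not stated in the article
    produces: Optional[str] = None        # released checkpoint, if any


FULL = ("adaptor", "encoder", "llm")

CURRICULUM = (
    Stage("pre-training/1", ("adaptor",), 30, 8),
    Stage("pre-training/2", ("adaptor", "encoder"), 60, 8),
    Stage("mid-training/1", FULL, 600, 24),
    # Assumption: full fine-tuning continues in the second sub-stage.
    Stage("mid-training/2", FULL, 1800, 128, produces="AF-Next-Captioner"),
    Stage("post-training", None, None, None,
          method="GRPO", produces="AF-Next-Instruct"),
    Stage("cot-training", None, None, None,
          method="SFT on AF-Think-Time, then GRPO",
          produces="AF-Next-Think"),
)


def stage_for(checkpoint: str) -> Stage:
    """Return the stage whose output is the named checkpoint."""
    for stage in CURRICULUM:
        if stage.produces == checkpoint:
            return stage
    raise KeyError(checkpoint)


if __name__ == "__main__":
    for stage in CURRICULUM:
        print(stage.name, stage.context_k, stage.produces)
```

Laid out this way, the release order is easy to see: the Captioner checkpoint comes from the end of mid-training, Instruct builds on it through post-training, and Think builds on Instruct.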
One notable contribution from the research team is hybrid sequence parallelism, which makes 128K-context training feasible on long audio. Without it, audio token expansion blows past standard context windows and the quadratic memory cost of self-attention becomes infeasible. The solution combines Ulysses attention, which uses all-to-all collectives to distribute sequence and head dimensions within nodes where high-bandwidth interconnects are available, with Ring attention, which circulates key-value blocks across nodes via point-to-point transfers. Ulysses handles intra-node communication efficiently; Ring scales across nodes.
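To make the intra-node versus cross-node split concrete, the sketch below shows one plausible way to group GPU ranks for the two schemes. It illustrates only the grouping and sharding arithmetic, not the communication itself or the authors' implementation; the function names and the node-major rank layout are assumptions.

```python
def build_groups(num_nodes: int, gpus_per_node: int):
    """Group ranks for hybrid sequence parallelism (illustrative layout).

    Ulysses groups: ranks on the same node (all-to-all over fast links).
    Ring groups: ranks with the same local index on different nodes
    (point-to-point key-value exchange).
    """
    ulysses = [[node * gpus_per_node + local
                for local in range(gpus_per_node)]
               for node in range(num_nodes)]
    ring = [[node * gpus_per_node + local
             for node in range(num_nodes)]
            for local in range(gpus_per_node)]
    return ulysses, ring


def tokens_per_rank(seq_len: int, num_nodes: int, gpus_per_node: int) -> int:
    """Sequence shard held by each rank when the sequence is split evenly."""
    world = num_nodes * gpus_per_node
    if seq_len % world != 0:
        raise ValueError("sequence length must divide evenly across ranks")
    return seq_len // world


def heads_per_rank(num_heads: int, ulysses_size: int) -> int:
    """Ulysses splits attention heads across its group after the all-to-all."""
    if num_heads % ulysses_size != 0:
        raise ValueError("head count must be divisible by the Ulysses group")
    return num_heads // ulysses_size


if __name__ == "__main__":
    ulysses, ring = build_groups(num_nodes=2, gpus_per_node=4)
    print("Ulysses groups:", ulysses)
    print("Ring groups:", ring)
    print("Tokens per rank:", tokens_per_rank(128 * 1024, 2, 4))
```

With 2 nodes of 4 GPUs, a 131,072-token sequence leaves 16,384 tokens on each rank, which is why splitting along the sequence dimension keeps per-GPU attention memory manageable.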
Benchmark Results: Strong Across the Board
On MMAU-v05.15.25, the most widely used audio reasoning benchmark, AF-Next-Instruct achieves an average accuracy of 74.20 vs. Audio Flamingo 3’s 72.42, with AF-Next-Think reaching 75.01 and AF-Next-Captioner pushing to 75.76. Gains appear across all three subcategories: sound (79.87), music (75.3), and speech (72.13). On the more challenging MMAU-Pro benchmark, AF-Next-Think (58.7) surpasses the closed-source Gemini-2.5-Pro (57.4).
Music understanding sees particularly strong gains. On Medley-Solos-DB instrument recognition, AF-Next reaches 92.13 vs. Audio Flamingo 2’s 85.80. On SongCaps music captioning, GPT5 coverage and correctness scores jump from 6.7 and 6.2 (AF3) to 8.8 and 8.9 respectively.
Long-audio understanding is where AF-Next most clearly separates itself. On LongAudioBench, AF-Next-Instruct achieves 73.9, outperforming both Audio Flamingo 3 (68.6) and the closed-source Gemini 2.5 Pro (60.4). On the speech-inclusive variant (+Speech), AF-Next reaches 81.2 vs. Gemini 2.5 Pro’s 66.2. On ASR, AF-Next-Instruct sets new lows among LALMs with a Word Error Rate of 1.54 on LibriSpeech test-clean and 2.76 on test-other. On VoiceBench, AF-Next-Instruct achieves the highest scores on AlpacaEval (4.43), CommonEval (3.96), and OpenBookQA (80.9), surpassing Audio Flamingo 3 by over 14 points on OpenBookQA. On CoVoST2 speech translation, AF-Next shows a particularly notable 12-point improvement over Phi-4-mm on Arabic EN→X translation (21.9 vs. 9.9).
Key Takeaways
Here are the key takeaways:
- A Fully Open Audio-Language Model at Internet Scale: AF-Next is considered the first LALM to scale audio understanding to internet-scale data, roughly 108 million samples and 1 million hours of audio.
- Temporal Audio Chain-of-Thought Solves Long-Audio Reasoning: Rather than reasoning like prior CoT approaches, AF-Next explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer. This makes the model significantly more faithful and interpretable on long recordings of up to 30 minutes, a problem prior models largely sidestepped.
- Three Specialized Variants for Different Use Cases: The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning, allowing practitioners to select the right model based on their task rather than using a one-size-fits-all checkpoint.
- Beats Closed Models on Long Audio Despite Being Smaller: On LongAudioBench, AF-Next-Instruct scores 73.9, outperforming the closed-source Gemini 2.5 Pro (60.4) and Audio Flamingo 3 (68.6). On the more challenging speech-inclusive variant, the gap widens further, with AF-Next reaching 81.2 vs. Gemini 2.5 Pro’s 66.2.
