Articles Stock
IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference

By Naveed Ahmad · 30/04/2026 · 7 Mins Read


IBM has released two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.

The pair targets a specific problem that enterprise AI teams know well: most production-grade automatic speech recognition (ASR) systems either demand heavy compute or sacrifice accuracy to stay within budget. IBM's bet is that careful architecture choices can let you have it both ways.

What These Models Actually Do

Granite Speech 4.1 2B is a compact, efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR, specifically targeting latency-sensitive deployments, and supports English, French, German, Spanish, and Portuguese, but not Japanese. That is a meaningful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM also quietly released a third variant alongside these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what, and exactly when, is a requirement.

Word Error Rate (WER) is the primary metric for measuring transcription quality. Lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B scores a mean WER of 5.33. Drilling into benchmark detail: on LibriSpeech clean, the model achieves a WER of 1.33, and 2.5 on LibriSpeech other.
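To make the metric concrete, here is a minimal sketch of WER as word-level Levenshtein distance divided by reference length (our own illustration; real leaderboards typically normalize casing and punctuation before scoring):

```python
# Minimal WER sketch: word-level edit distance over reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = ref[i - 1] != hyp[j - 1]
            dp[i][j] = min(dp[i - 1][j - 1] + sub_cost,  # substitute/match
                           dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1)             # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five -> WER of 0.2, i.e. 20%.
print(wer("ibm released two speech models",
          "ibm released too speech models"))  # 0.2
```

A perfect transcript scores 0.0; note that heavy insertions can push WER above 100%.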

The Architecture, Explained

Both models share the same three-component design at a high level: a speech encoder, a modality adapter, and a language model. The decoding mechanism, however, diverges considerably.

The first component is the speech encoder. The architecture uses 16 Conformer blocks trained with Connectionist Temporal Classification (CTC) with two classification heads, one for graphemic (character-level) outputs and one for BPE units, using frame importance sampling to focus on informative parts of the audio. A Conformer is a neural network layer that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies). CTC is a training technique that lets the model learn from audio-text pairs without needing exact frame-level alignment.
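CTC's alignment-free trick is easiest to see at decode time: a greedy CTC decode picks the most likely symbol per frame, merges consecutive repeats, and drops the blank symbol. The sketch below is our own toy illustration (the `_` blank is an assumed placeholder, not IBM's decoder):

```python
BLANK = "_"  # assumed CTC blank symbol for this sketch

def ctc_greedy_collapse(frame_symbols):
    """Collapse per-frame CTC outputs: merge consecutive repeats,
    then drop blanks. A blank between two identical symbols is what
    lets CTC emit a genuine double letter."""
    out, prev = [], None
    for sym in frame_symbols:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Eight audio frames collapse to a three-character transcript.
print(ctc_greedy_collapse(list("cc_aa_tt")))  # cat
# Repeated frames of "l" merge, but a blank-separated "l_l" survives:
print(ctc_greedy_collapse(list("hhe_l_lo")))  # hello
```

This is why CTC models can train on (audio, text) pairs alone: any frame-level path that collapses to the target counts as correct.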

The second component is a speech-text modality adapter. A 2-layer window query transformer (Q-Former) operates on blocks of 15 acoustic embeddings of dimension 1024 coming from the last Conformer block, downsampling by a factor of 5 using 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10, resulting in a 10 Hz acoustic embedding rate for the LLM. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).

The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of granite-4.0-1b-base with 128k context length, fine-tuned on all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor: granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context, adapted with LoRA at rank 128 applied to both attention and MLP layers.
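To see why rank-128 LoRA is a light-touch adaptation, count parameters: adapting a weight matrix of shape (d_out, d_in) adds low-rank factors B (d_out × r) and A (r × d_in), so r·(d_in + d_out) trainable parameters instead of d_in·d_out. The hidden size below is illustrative, not the actual granite-4.0-1b-base dimension:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA learns the update dW = B @ A with A: (r, d_in), B: (d_out, r),
    # while the original weight stays frozen.
    return r * (d_in + d_out)

d = 2048   # hypothetical hidden size, for illustration only
r = 128    # the rank used in the NAR editor's adapters
full = d * d                       # params in one square projection
lora = lora_param_count(d, d, r)   # params LoRA actually trains
print(lora, full, f"{lora / full:.1%}")  # 524288 4194304 12.5%
```

Even at an unusually high rank like 128, the adapter trains a small fraction of each adapted matrix, which is part of why the NAR variant's training run is so short.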

    The Autoregressive vs. Non-Autoregressive Tradeoff

This is where the two models diverge most sharply, and it has direct consequences for production deployment.

In the standard Granite Speech 4.1 2B, text is generated autoregressively: one token at a time, each depending on every token before it. This produces accurate, stable transcripts with full support for AST, keyword-biased recognition, and punctuation, but is inherently sequential and slower at scale.

Granite Speech 4.1 2B-NAR takes a fundamentally different approach. Rather than decoding tokens one by one, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the NLE (Non-autoregressive LLM-based Editing) architecture. Concretely: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and then a bidirectional LLM predicts edits (copy, insert, delete, or replace) at all positions simultaneously in a single pass.
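A toy sketch of the editing idea (our own illustration; the edit vocabulary and slot format here are assumptions, not IBM's actual scheme): every hypothesis token gets one edit, every insertion slot optionally fills, and the whole edit sequence applies in one parallel pass:

```python
def apply_edits(hypothesis, edits, inserts):
    """Apply per-position edits predicted in parallel.
    edits[i]  : ("copy",), ("delete",) or ("replace", new_token)
    inserts[i]: token to insert before position i, or None."""
    out = []
    for i, tok in enumerate(hypothesis):
        if inserts[i] is not None:
            out.append(inserts[i])
        op = edits[i]
        if op[0] == "copy":
            out.append(tok)
        elif op[0] == "replace":
            out.append(op[1])
        # ("delete",) contributes nothing
    return out

# A rough CTC hypothesis with a duplicated word and a dropped article:
hyp     = ["the", "cat", "sat", "sat", "on", "mat"]
edits   = [("copy",)] * 3 + [("delete",)] + [("copy",)] * 2
inserts = [None] * 5 + ["the"]
print(apply_edits(hyp, edits, inserts))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

The speed win comes from the edits being independent given bidirectional context: one forward pass predicts all of them, instead of one pass per output token.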

The NAR model measured an RTFx of roughly 1820 on a single H100 GPU using batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
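The RTFx arithmetic is worth spelling out, since it is how the "under two seconds" claim falls out of the reported number:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor multiplier: how many times faster than
    real time the audio was processed."""
    return audio_seconds / processing_seconds

# Invert it: at RTFx ~1820, one hour of audio takes 3600 / 1820 seconds.
processing = 3600 / 1820
print(f"{processing:.2f} s")  # 1.98 s
```

Note that this is batched throughput (batch size 128), not single-stream latency; a single short utterance will not finish 1820x faster than its own duration.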

Training Data and Infrastructure

The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public corpora for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on roughly 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and Switchboard.

The infrastructure gap between the two is equally telling. The standard model's training completed in 30 days (26 days for the encoder and 4 days for the projector) on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs, a much lighter run that reflects the architectural simplicity of editing over full autoregressive generation.

    Key Takeaways

Here are five quick takeaways:

    • IBM released two open ASR models, Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive), both ~2B parameters and Apache 2.0 licensed.
    • The standard model achieves a mean WER of 5.33 on the Open ASR Leaderboard and supports 6 languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing, making it competitive with models several times its size.
    • The NAR model trades capabilities for speed: it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.
    • The architecture has three core components: a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10 Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
    • A third variant, Granite Speech 4.1 2B-Plus, extends the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.

Check out the Granite Speech 4.1 2B and Granite Speech 4.1 2B (NAR) model cards on Hugging Face.






    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
