**NVIDIA Unveils Low-Latency Speech Transcription Model: Nemotron Speech ASR**
NVIDIA’s Nemotron Speech ASR is finally here, and it is designed for low-latency voice agents and live captioning. This streaming English transcription model is built on a FastConformer encoder with an RNNT decoder and is optimized for both streaming and batch workloads on modern NVIDIA GPUs.
**What Is Nemotron Speech ASR?**
Nemotron Speech ASR is a 600M-parameter model that uses a cache-aware FastConformer encoder with 24 layers and an RNNT decoder. The encoder takes 16 kHz mono audio and processes it in chunks of at least 80 ms. The model is designed to take advantage of modern NVIDIA GPUs and offers a balance between latency and accuracy.
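To make the input format concrete, here is a minimal sketch of how 16 kHz mono audio splits into 80 ms chunks. The helper names `samples_per_chunk` and `iter_chunks` are hypothetical and not part of NeMo or any NVIDIA API; the sketch only illustrates the arithmetic.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz mono, as stated for the encoder input
CHUNK_MS = 80            # minimum chunk duration mentioned in the article


def samples_per_chunk(sample_rate_hz=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Number of audio samples in one chunk of `chunk_ms` milliseconds."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples, chunk_len):
    """Yield consecutive non-overlapping chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# One second of silence stands in for real audio (illustrative only).
audio = [0.0] * SAMPLE_RATE_HZ
chunks = list(iter_chunks(audio, samples_per_chunk()))
print(samples_per_chunk(), len(chunks), len(chunks[-1]))
```

At 16 kHz, an 80 ms chunk holds 1,280 samples, so one second of audio yields twelve full chunks plus a shorter trailing one.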
**How Does it Work?**
Conventional streaming ASR can waste compute and add latency because of overlapping windows. Nemotron Speech ASR avoids this by keeping a cache of encoder states for all self-attention and convolution layers. This approach eliminates recomputation of overlapping context, leading to:
* Non-overlapping frame processing, so compute scales linearly with audio length
* Predictable memory growth, because cache size grows with sequence length rather than with concurrency-related duplication
* Stable latency under load, which is essential for turn taking and interruption handling in voice agents
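The compute saving can be illustrated with a toy cost model that simply counts how many frames the encoder touches. This is a simplified sketch, not NVIDIA's implementation: the function names and the `context_frames` parameter are assumptions, and real cost also depends on layer types and cache handling.

```python
def frames_encoded_buffered(total_frames, chunk_frames, context_frames):
    """Buffered baseline: every step re-encodes past context plus the new chunk."""
    encoded = 0
    for start in range(0, total_frames, chunk_frames):
        new = min(chunk_frames, total_frames - start)
        context = min(context_frames, start)  # cannot look back before frame 0
        encoded += context + new
    return encoded


def frames_encoded_cached(total_frames, chunk_frames):
    """Cache-aware streaming: past encoder states are reused, so only new frames are encoded."""
    encoded = 0
    for start in range(0, total_frames, chunk_frames):
        encoded += min(chunk_frames, total_frames - start)
    return encoded


# Hypothetical numbers chosen only to show the gap.
buffered = frames_encoded_buffered(total_frames=100, chunk_frames=10, context_frames=30)
cached = frames_encoded_cached(total_frames=100, chunk_frames=10)
print(buffered, cached)
```

In this toy setup the buffered approach encodes 340 frames for 100 frames of audio, while the cached approach encodes each frame exactly once.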
**Accuracy vs. Latency**
Nemotron Speech ASR is evaluated on several datasets, including AMI, Earnings22, Gigaspeech, and LibriSpeech. The results show that the model achieves:
* About 7.84% WER at 0.16 s chunk size
* About 7.22% WER at 0.56 s chunk size
* About 7.16% WER at 1.12 s chunk size
This shows the tradeoff between latency and accuracy. Larger chunks provide more phonetic context and slightly lower WER, but even the 0.16 s mode keeps WER under 8% while remaining usable for real-time agents.
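One way to act on these numbers is to pick the largest-context setting that still fits a latency budget. The sketch below hard-codes the three WER figures reported above; `pick_chunk` is a hypothetical helper, and treating chunk size as the only latency constraint is a simplifying assumption.

```python
# Approximate WER (%) by chunk size (seconds), taken from the figures above.
WER_BY_CHUNK_S = {0.16: 7.84, 0.56: 7.22, 1.12: 7.16}


def pick_chunk(max_chunk_s):
    """Return (chunk_s, wer) with the lowest WER among chunks <= max_chunk_s, or None."""
    candidates = [(c, w) for c, w in WER_BY_CHUNK_S.items() if c <= max_chunk_s]
    if not candidates:
        return None
    return min(candidates, key=lambda cw: cw[1])


print(pick_chunk(0.6))  # budget allows the 0.56 s mode
print(pick_chunk(0.1))  # no listed mode fits this budget
```

Because the chunk size is an inference-time setting, a choice like this can be changed per deployment without retraining the model.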
**Throughput and Concurrency**
The cache-aware design has a significant impact on concurrency. On an NVIDIA H100 GPU, Nemotron Speech ASR supports about 560 concurrent streams at a 320 ms chunk size, roughly 3x the concurrency of a baseline streaming system at the same latency target.
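As a rough back-of-the-envelope check, the reported figures imply the following load on a single GPU. The function names are hypothetical, and because the "roughly 3x" factor is approximate, the baseline estimate is only indicative.

```python
def chunks_per_second(streams, chunk_ms):
    """Total chunks the GPU must process per second across all real-time streams."""
    return streams * 1000 / chunk_ms


def implied_baseline_streams(streams, speedup):
    """Estimated baseline concurrency given a reported speedup factor."""
    return streams / speedup


# Figures reported for H100: about 560 streams at 320 ms, roughly 3x the baseline.
load = chunks_per_second(streams=560, chunk_ms=320)
baseline = implied_baseline_streams(streams=560, speedup=3.0)
print(load, round(baseline))
```

So 560 real-time streams at 320 ms correspond to about 1,750 chunks per second, and the baseline system would handle on the order of 190 streams at the same latency target.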
**Training and Ecosystem Integration**
Nemotron Speech ASR is trained primarily on the English portion of NVIDIA’s Granary dataset, along with a large mixture of public speech corpora, for a total of about 285k hours of audio. Datasets include YouTube Commons, YODAS2, Mosel, LibriLight, Fisher, Switchboard, WSJ, VCTK, VoxPopuli, and multiple Mozilla Common Voice releases.
**Key Takeaways**
1. Nemotron Speech ASR is a 0.6B-parameter English streaming model that uses a cache-aware FastConformer encoder with an RNNT decoder and operates on 16 kHz mono audio with input chunks of at least 80 ms.
2. The model exposes four inference-time chunk configurations, about 80 ms, 160 ms, 560 ms, and 1.12 s, which let engineers trade latency for accuracy without retraining while keeping WER around 7.2% to 7.8% on common ASR benchmarks.
3. Cache-aware streaming removes overlapping window recomputation so each audio frame is encoded once, which yields about 3 times more concurrent streams on H100, more than 5 times on RTX A5000, and up to 2 times on DGX B200 compared to a buffered streaming baseline at similar latency.
4. In an end-to-end voice agent with Nemotron Speech ASR, Nemotron 3 Nano 30B, and Magpie TTS, measured median time to final transcription is about 24 ms, and server-side voice-to-voice latency on RTX 5090 is around 500 ms, which makes ASR a small fraction of the total latency budget.
5. Nemotron Speech ASR is released as a NeMo checkpoint under the NVIDIA Permissive Open Model License with open weights and training details, so teams can self-host, fine-tune, and profile the full stack for low-latency voice agents and speech applications.
**Get Started with Nemotron Speech ASR**
The Nemotron Speech ASR model weights are available on Hugging Face.
