**The Future of Speech-to-Text is Here: Introducing Voxtral Transcribe 2**
Hey there, fellow tech enthusiasts! I’m stoked to share the latest breakthrough in AI-powered speech-to-text technology: the launch of Voxtral Transcribe 2, a game-changing duo designed to tackle the complexities of multilingual production workloads. In this post, I’ll dive into the nitty-gritty details of this innovative technology, so buckle up!
The Voxtral Transcribe 2 family consists of two models: Voxtral Mini Transcribe V2 and Voxtral Realtime. Both are optimized for high-quality transcription, but with different use cases in mind.
**Voxtral Mini Transcribe V2: The Batch Beast**
If you’re looking for top-notch transcription quality and diarization across different domains and languages, Voxtral Mini Transcribe V2 is the way to go. This model is perfect for applications like subtitles, closed captions, and high-stakes projects that require precision. You can access it as an efficient audio-input model through the Mistral API, at a reasonable price of $0.003 per minute.
Some of the cool features of this model include:
* Speaker diarization: Identify speakers with exact start and end times, making it perfect for conferences, interviews, and multi-party calls.
* Context biasing: Feed the model up to 100 words or phrases to bias transcription towards specific names or domain phrases. This feature is optimized for English, but feel free to experiment with other languages.
* Word-level timestamps: Get per-word start and end timestamps for subtitles, alignment, and searchable audio workflows.
* Noise robustness: Accurate transcription even in noisy environments like factory floors or call centers.
* Longer audio support: Up to 3 hours of audio in a single request? Yeah, it can handle that!
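To show how diarization plus timestamps might come together in practice, here’s a minimal sketch that renders speaker-labeled segments as SRT captions. The response shape (a list of segments with `speaker`, `start`, `end`, and `text` fields) is an assumption for illustration, not the documented API schema.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render speaker-labeled segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    return "\n".join(blocks)

# Hypothetical diarized output for a two-speaker call:
segments = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 2.5, "text": "Welcome, everyone."},
    {"speaker": "SPEAKER_1", "start": 2.5, "end": 4.1, "text": "Thanks for having me."},
]
print(segments_to_srt(segments))
```

Swap in the real response fields once you’ve inspected an actual API payload; the SRT formatting itself stays the same.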
**Voxtral Realtime: The Speed Demon**
Voxtral Realtime is a 4B-parameter multilingual powerhouse that’s all about speed. As one of the first open-weights models to hit accuracy comparable to offline systems with a delay of under 500 ms, this real-time model is a big deal.
Voxtral Realtime’s architecture is all about balancing accuracy and latency, with a sliding-window and causal attention mechanism that makes it super efficient. The latency vs. accuracy trade-off is explicit, folks: choose from options ranging from 80 ms to 2.4 s. At around 480 ms of delay, Voxtral Realtime hits accuracy comparable to top-notch offline and real-time systems. At 2.4 s, it’s on par with Voxtral Mini Transcribe V2 on the FLEURS benchmark.
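One way to reason about that knob: since larger delays buy accuracy, pick the biggest delay your latency budget allows. The 80 ms and 2.4 s endpoints come from the post; the intermediate candidate values below are hypothetical placeholders, not a documented list of supported settings.

```python
# Hypothetical delay steps; only the 80 ms and 2.4 s endpoints (and the
# ~480 ms sweet spot) are mentioned in the post.
CANDIDATE_DELAYS_S = [0.08, 0.24, 0.48, 1.2, 2.4]

def pick_delay(max_acceptable_s: float) -> float:
    """Pick the largest candidate delay that fits the latency budget.

    Larger delays trade latency for accuracy, so we take the biggest one
    we can afford; fall back to the minimum if the budget is very tight.
    """
    fitting = [d for d in CANDIDATE_DELAYS_S if d <= max_acceptable_s]
    return max(fitting) if fitting else CANDIDATE_DELAYS_S[0]
```

For example, a 500 ms budget lands on the 480 ms setting, which is right where the post says accuracy becomes comparable to offline systems.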
**Deployment Time**
Voxtral Mini Transcribe V2 is available through the Mistral audio transcription API at $0.003 per minute, while Voxtral Realtime is available through the Mistral API at $0.006 per minute. You can also get Voxtral Realtime as an open-weights model on Hugging Face under Apache 2.0 with official vLLM Realtime support.
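A quick back-of-the-envelope cost estimate using the per-minute prices quoted above ($0.003/min for Voxtral Mini Transcribe V2, $0.006/min for Voxtral Realtime). The model-name strings here are just dictionary keys for this sketch, not official API identifiers.

```python
# Placeholder keys, not official model IDs.
PRICE_PER_MINUTE = {
    "voxtral-mini-transcribe-v2": 0.003,
    "voxtral-realtime": 0.006,
}

def transcription_cost(model: str, audio_seconds: float) -> float:
    """Cost in USD for transcribing the given amount of audio."""
    return PRICE_PER_MINUTE[model] * audio_seconds / 60

# Three hours of audio (the batch model's per-request maximum):
batch_cost = transcription_cost("voxtral-mini-transcribe-v2", 3 * 3600)
realtime_cost = transcription_cost("voxtral-realtime", 3 * 3600)
print(f"batch: ${batch_cost:.2f}, realtime: ${realtime_cost:.2f}")
```

So a full three-hour batch request runs about $0.54, and the same audio streamed in real time about $1.08.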
The Mistral Studio audio playground lets you play around with up to 10 audio files, toggle diarization, select timestamp granularity, and configure context bias phrases.
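Since context biasing accepts up to 100 words or phrases, a small client-side guard can tidy your phrase list before you send it. This is a sketch of one possible pre-flight check; the API's actual validation and error behavior aren't assumed here.

```python
MAX_BIAS_PHRASES = 100  # cap described in the post

def prepare_bias_phrases(phrases: list[str]) -> list[str]:
    """Strip, deduplicate (case-insensitively), and enforce the 100-phrase cap."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for p in (s.strip() for s in phrases):
        if p and p.lower() not in seen:
            seen.add(p.lower())
            cleaned.append(p)
    if len(cleaned) > MAX_BIAS_PHRASES:
        raise ValueError(
            f"too many bias phrases: {len(cleaned)} > {MAX_BIAS_PHRASES}"
        )
    return cleaned
```

Deduplicating first means near-identical entries don't eat into the 100-phrase budget.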
**The Short Version**
In a nutshell, Voxtral Transcribe 2 consists of two models: Voxtral Mini Transcribe V2 for batch transcription and diarization, and Voxtral Realtime for speed and low-latency streaming transcription. Both support 13 languages and offer a range of features to suit your specific needs.
The real-time model, Voxtral Realtime, uses a 4B structure with sliding-window and causal attention, and supports configurable transcription delay from 80 ms to 2.4 s. The batch model, Voxtral Mini Transcribe V2, provides diarization, context biasing, word-level timestamps, noise robustness, and supports up to 3 hours of audio per request at $0.003 per minute.
Deployment options include the closed batch API at $0.003 per minute, the real-time API at $0.006 per minute, and open weights for Voxtral Realtime.
Ready to learn more about Voxtral Transcribe 2? Check out the technical details and model weights.
