    A Hands-On Coding Tutorial for Microsoft VibeVoice Covering Speaker-Aware ASR, Real-Time TTS, and Speech-to-Speech Pipelines

    By Naveed Ahmad · 13/04/2026 (Updated: 13/04/2026) · 9 Mins Read


    In this tutorial, we explore Microsoft VibeVoice in Colab and build a complete hands-on workflow for both speech recognition and real-time speech synthesis. We set up the environment from scratch, install the required dependencies, verify support for the latest VibeVoice models, and then walk through advanced capabilities such as speaker-aware transcription, context-guided ASR, batch audio processing, expressive text-to-speech generation, and an end-to-end speech-to-speech pipeline. As we work through the tutorial, we interact with practical examples, test different voice presets, generate long-form audio, launch a Gradio interface, and learn how to adapt the system to our own files and experiments.

    !pip uninstall -y transformers -q
    !pip install -q git+https://github.com/huggingface/transformers.git
    !pip install -q torch torchaudio accelerate soundfile librosa scipy numpy
    !pip install -q huggingface_hub ipywidgets gradio einops
    !pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "flash-attn optional"
    !git clone -q --depth 1 https://github.com/microsoft/VibeVoice.git /content/VibeVoice 2>/dev/null || echo "Already cloned"
    !pip install -q -e /content/VibeVoice
    
    
    print("="*70)
    print("IMPORTANT: If this is your first run, restart the runtime now!")
    print("Go to: Runtime -> Restart runtime, then run from CELL 2.")
    print("="*70)
    
    
    import torch
    import numpy as np
    import soundfile as sf
    import warnings
    import sys
    from IPython.display import Audio, display
    
    
    warnings.filterwarnings('ignore')
    sys.path.insert(0, '/content/VibeVoice')
    
    
    import transformers
    print(f"Transformers version: {transformers.__version__}")
    
    
    try:
        from transformers import VibeVoiceAsrForConditionalGeneration
        print("VibeVoice ASR: Available")
    except ImportError:
        print("ERROR: VibeVoice not available. Please restart the runtime and run Cell 1 again.")
        raise
    
    
    SAMPLE_PODCAST = "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
    SAMPLE_GERMAN = "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav"
    
    
    print("Setup complete!")

    We prepare the whole Google Colab environment for VibeVoice by installing and updating all of the required packages. We clone the official VibeVoice repository, configure the runtime, and verify that the dedicated ASR support is available in the installed Transformers version. We also import the core libraries and define sample audio sources, making our tutorial ready for the later transcription and speech generation steps.
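    The availability check used above can be generalized into a small helper. This is a minimal sketch using only the standard library's importlib; the json module stands in for the transformers import so the snippet runs anywhere:

```python
# Minimal sketch: report whether a class or function can be imported from a
# module. In the notebook this pattern guards the
# VibeVoiceAsrForConditionalGeneration import; json/dumps stand in here so the
# example is self-contained.
import importlib

def feature_available(module_name: str, attr_name: str) -> bool:
    """Return True if `attr_name` exists in an importable `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)

print(feature_available("json", "dumps"))        # True
print(feature_available("json", "no_such_fn"))   # False
```

    In the notebook, `feature_available("transformers", "VibeVoiceAsrForConditionalGeneration")` would tell us whether a runtime restart is still needed.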

    from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
    
    
    print("Loading VibeVoice ASR model (7B parameters)...")
    print("First run downloads ~14GB - please wait...")
    
    
    asr_processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-ASR-HF")
    asr_model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
        "microsoft/VibeVoice-ASR-HF",
        device_map="auto",
        torch_dtype=torch.float16,
    )
    
    
    print(f"ASR model loaded on {asr_model.device}")
    
    
    def transcribe(audio_path, context=None, output_format="parsed"):
        inputs = asr_processor.apply_transcription_request(
            audio=audio_path,
            prompt=context,
        ).to(asr_model.device, asr_model.dtype)
    
        output_ids = asr_model.generate(**inputs)
        generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
        result = asr_processor.decode(generated_ids, return_format=output_format)[0]
    
        return result
    
    
    print("="*70)
    print("ASR DEMO: Podcast Transcription with Speaker Diarization")
    print("="*70)
    
    
    print("\nPlaying sample audio:")
    display(Audio(SAMPLE_PODCAST))
    
    
    print("\nTranscribing with speaker identification...")
    result = transcribe(SAMPLE_PODCAST, output_format="parsed")
    
    
    print("\nTRANSCRIPTION RESULTS:")
    print("-"*70)
    for segment in result:
        speaker = segment['Speaker']
        start = segment['Start']
        end = segment['End']
        content = segment['Content']
        print(f"\n[Speaker {speaker}] {start:.2f}s - {end:.2f}s")
        print(f"  {content}")
    
    
    print("\n" + "="*70)
    print("ASR DEMO: Context-Aware Transcription")
    print("="*70)
    
    
    print("\nComparing transcription WITH and WITHOUT context hotwords:")
    print("-"*70)
    
    
    result_no_ctx = transcribe(SAMPLE_GERMAN, context=None, output_format="transcription_only")
    print(f"\nWITHOUT context: {result_no_ctx}")
    
    
    result_with_ctx = transcribe(SAMPLE_GERMAN, context="About VibeVoice", output_format="transcription_only")
    print(f"WITH context:    {result_with_ctx}")
    
    
    print("\nNotice how 'VibeVoice' is recognized correctly when context is provided!")

    We load the VibeVoice ASR model and processor to convert speech into text. We define a reusable transcription function that supports inference with optional context and multiple output formats. We then test the model on sample audio to observe speaker diarization and compare the improvements in recognition quality from context-aware transcription.
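    The parsed segments can also be post-processed before display. As one illustrative helper of our own (not part of the VibeVoice API), we can merge consecutive segments from the same speaker, assuming the 'Speaker'/'Start'/'End'/'Content' keys shown in the demo above:

```python
# Hypothetical post-processing sketch: merge consecutive segments spoken by the
# same speaker, extending the end time and concatenating the text.
def merge_segments(segments):
    merged = []
    for seg in segments:
        if merged and merged[-1]["Speaker"] == seg["Speaker"]:
            merged[-1]["End"] = seg["End"]
            merged[-1]["Content"] += " " + seg["Content"]
        else:
            merged.append(dict(seg))  # copy so the input list is untouched
    return merged

# Toy segments in the same shape as the parsed ASR output above.
sample = [
    {"Speaker": 0, "Start": 0.0, "End": 2.1, "Content": "Hello and"},
    {"Speaker": 0, "Start": 2.1, "End": 4.0, "Content": "welcome back."},
    {"Speaker": 1, "Start": 4.2, "End": 6.5, "Content": "Thanks for having me."},
]
for seg in merge_segments(sample):
    print(f"[Speaker {seg['Speaker']}] {seg['Start']:.2f}s-{seg['End']:.2f}s: {seg['Content']}")
```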

    print("\n" + "="*70)
    print("ASR DEMO: Batch Processing")
    print("="*70)
    
    
    audio_batch = [SAMPLE_GERMAN, SAMPLE_PODCAST]
    prompts_batch = ["About VibeVoice", None]
    
    
    inputs = asr_processor.apply_transcription_request(
        audio=audio_batch,
        prompt=prompts_batch
    ).to(asr_model.device, asr_model.dtype)
    
    
    output_ids = asr_model.generate(**inputs)
    generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    transcriptions = asr_processor.decode(generated_ids, return_format="transcription_only")
    
    
    print("\nBatch transcription results:")
    print("-"*70)
    for i, trans in enumerate(transcriptions):
        preview = trans[:150] + "..." if len(trans) > 150 else trans
        print(f"\nAudio {i+1}: {preview}")
    
    
    from transformers import AutoModelForCausalLM
    from vibevoice.modular.modular_vibevoice_text_tokenizer import VibeVoiceTextTokenizerFast
    
    
    print("\n" + "="*70)
    print("Loading VibeVoice Realtime TTS model (0.5B parameters)...")
    print("="*70)
    
    
    tts_model = AutoModelForCausalLM.from_pretrained(
        "microsoft/VibeVoice-Realtime-0.5B",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ).to("cuda" if torch.cuda.is_available() else "cpu")
    
    
    tts_tokenizer = VibeVoiceTextTokenizerFast.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
    tts_model.set_ddpm_inference_steps(20)
    
    
    print(f"TTS model loaded on {next(tts_model.parameters()).device}")
    
    
    VOICES = ["Carter", "Grace", "Emma", "Davis"]
    
    
    def synthesize(text, voice="Grace", cfg_scale=3.0, steps=20, save_path=None):
        tts_model.set_ddpm_inference_steps(steps)
        input_ids = tts_tokenizer(text, return_tensors="pt").input_ids.to(tts_model.device)
    
        output = tts_model.generate(
            inputs=input_ids,
            tokenizer=tts_tokenizer,
            cfg_scale=cfg_scale,
            return_speech=True,
            show_progress_bar=True,
            speaker_name=voice,
        )
    
        audio = output.audio.squeeze().cpu().numpy()
        sample_rate = 24000
    
        if save_path:
            sf.write(save_path, audio, sample_rate)
            print(f"Saved to: {save_path}")
    
        return audio, sample_rate

    We extend the ASR workflow by processing multiple audio files together in batch mode. We then switch to the text-to-speech side of the tutorial by loading the VibeVoice real-time TTS model and its tokenizer. We also define the speech synthesis helper function and voice presets to generate natural audio from text in the subsequent stages.
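    Before feeding very long scripts to synthesize(), it can help to split them at sentence boundaries so each call stays short. The chunking helper below is our own sketch, not part of the VibeVoice API:

```python
# A minimal sketch (an assumption on our side, not a VibeVoice feature):
# split long text into roughly fixed-size chunks at sentence boundaries,
# suitable for feeding to synthesize() one chunk at a time.
import re

def chunk_text(text, max_chars=200):
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

demo = "First sentence here. Second sentence follows. " * 5
for i, chunk in enumerate(chunk_text(demo, max_chars=120)):
    print(i, len(chunk), chunk[:40])
```

    Each chunk can then be synthesized separately and the resulting waveforms concatenated before saving.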

    print("\n" + "="*70)
    print("TTS DEMO: Basic Speech Synthesis")
    print("="*70)
    
    
    demo_texts = [
        ("Hello! Welcome to VibeVoice, Microsoft's open-source voice AI.", "Grace"),
        ("This model generates natural, expressive speech in real-time.", "Carter"),
        ("You can choose from multiple voice presets for different styles.", "Emma"),
    ]
    
    
    for text, voice in demo_texts:
        print(f"\nText: {text}")
        print(f"Voice: {voice}")
        audio, sr = synthesize(text, voice=voice)
        print(f"Duration: {len(audio)/sr:.2f} seconds")
        display(Audio(audio, rate=sr))
    
    
    print("\n" + "="*70)
    print("TTS DEMO: Compare All Voice Presets")
    print("="*70)
    
    
    comparison_text = "VibeVoice produces remarkably natural and expressive speech synthesis."
    print(f'\nSame text with different voices: "{comparison_text}"\n')
    
    
    for voice in VOICES:
        print(f"Voice: {voice}")
        audio, sr = synthesize(comparison_text, voice=voice, steps=15)
        display(Audio(audio, rate=sr))
        print()
    
    
    print("\n" + "="*70)
    print("TTS DEMO: Long-form Speech Generation")
    print("="*70)
    
    
    long_text = """
    Welcome to today's technology podcast! I'm excited to share the latest advances in artificial intelligence and speech synthesis.
    
    
    Microsoft's VibeVoice represents a breakthrough in voice AI. Unlike traditional text-to-speech systems, which struggle with long-form content, VibeVoice can generate coherent speech for extended durations.
    
    
    The key innovation is the ultra-low frame-rate tokenizers operating at 7.5 hertz. This preserves audio quality while dramatically improving computational efficiency.
    
    
    The system uses a next-token diffusion framework that combines a large language model for context understanding with a diffusion head for high-fidelity audio generation. This enables natural prosody, appropriate pauses, and expressive speech patterns.
    
    
    Whether you're building voice assistants, creating podcasts, or developing accessibility tools, VibeVoice offers a powerful foundation for your projects.
    
    
    Thank you for listening!
    """
    
    
    print("Generating long-form speech (this takes a moment)...")
    audio, sr = synthesize(long_text.strip(), voice="Carter", cfg_scale=3.5, steps=25)
    print(f"\nGenerated {len(audio)/sr:.2f} seconds of speech")
    display(Audio(audio, rate=sr))
    
    
    sf.write("/content/longform_output.wav", audio, sr)
    print("Saved to: /content/longform_output.wav")
    
    
    print("\n" + "="*70)
    print("ADVANCED: Speech-to-Speech Pipeline")
    print("="*70)
    
    
    print("\nStep 1: Transcribing input audio...")
    transcription = transcribe(SAMPLE_GERMAN, context="About VibeVoice", output_format="transcription_only")
    print(f"Transcription: {transcription}")
    
    
    response_text = f"I understood you said: {transcription} That's an interesting topic about AI technology!"
    
    
    print(f"\nStep 2: Generating speech response...")
    print(f"Response: {response_text}")
    
    
    audio, sr = synthesize(response_text, voice="Grace", cfg_scale=3.0, steps=20)
    
    
    print(f"\nStep 3: Playing generated response ({len(audio)/sr:.2f}s)")
    display(Audio(audio, rate=sr))
    

    We use the TTS pipeline to generate speech from different example texts and listen to the outputs across multiple voices. We compare voice presets, create a longer podcast-style narration, and save the generated waveform as an output file. We also combine ASR and TTS into a speech-to-speech workflow, where we first transcribe audio and then generate a spoken response from the recognized text.
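    As an optional post-processing step before saving, we can peak-normalize the generated waveform so every clip plays back at a consistent level. This helper is an assumption on our side, not something VibeVoice requires, and it works on any sequence of float samples:

```python
# Our own sketch: scale a waveform so its largest magnitude equals `peak`,
# which keeps saved clips at a consistent playback level.
def peak_normalize(samples, peak=0.95):
    """Return a new list scaled so max |sample| equals `peak` (no-op on silence)."""
    max_amp = max(abs(s) for s in samples)
    if max_amp == 0:
        return list(samples)
    scale = peak / max_amp
    return [s * scale for s in samples]

# Toy waveform standing in for the numpy array returned by synthesize().
quiet = [0.0, 0.1, -0.2, 0.05]
loud = peak_normalize(quiet)
print(round(max(abs(s) for s in loud), 6))
```

    With the real output, `sf.write(path, peak_normalize(audio.tolist()), sr)` (or the equivalent numpy scaling) would apply the same idea before download.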

    import gradio as gr
    
    
    def tts_gradio(text, voice, cfg, steps):
        if not text.strip():
            return None
        audio, sr = synthesize(text, voice=voice, cfg_scale=cfg, steps=int(steps))
        return (sr, audio)
    
    
    demo = gr.Interface(
        fn=tts_gradio,
        inputs=[
            gr.Textbox(label="Text to Synthesize", lines=5,
                       value="Hello! This is VibeVoice real-time text-to-speech."),
            gr.Dropdown(choices=VOICES, value="Grace", label="Voice"),
            gr.Slider(1.0, 5.0, value=3.0, step=0.5, label="CFG Scale"),
            gr.Slider(5, 50, value=20, step=5, label="Inference Steps"),
        ],
        outputs=gr.Audio(label="Generated Speech"),
        title="VibeVoice Realtime TTS",
        description="Generate natural speech from text using Microsoft's VibeVoice model.",
    )
    
    
    print("\nLaunching interactive TTS interface...")
    demo.launch(share=True, quiet=True)
    
    
    from google.colab import files
    import os
    
    
    print("\n" + "="*70)
    print("UPLOAD YOUR OWN AUDIO")
    print("="*70)
    
    
    print("\nUpload an audio file (wav, mp3, flac, etc.):")
    uploaded = files.upload()
    
    
    if uploaded:
        for filename, data in uploaded.items():
            filepath = f"/content/{filename}"
            with open(filepath, 'wb') as f:
                f.write(data)
    
            print(f"\nProcessing: {filename}")
            display(Audio(filepath))
    
            result = transcribe(filepath, output_format="parsed")
    
            print("\nTranscription:")
            print("-"*50)
            if isinstance(result, list):
                for seg in result:
                    print(f"[{seg.get('Start',0):.2f}s-{seg.get('End',0):.2f}s] Speaker {seg.get('Speaker',0)}: {seg.get('Content','')}")
            else:
                print(result)
    else:
        print("No file uploaded - skipping this step")
    
    
    print("\n" + "="*70)
    print("MEMORY OPTIMIZATION TIPS")
    print("="*70)
    
    
    print("""
    1. REDUCE ASR CHUNK SIZE (if out of memory with long audio):
       output_ids = asr_model.generate(**inputs, acoustic_tokenizer_chunk_size=64000)
    
    
    2. USE BFLOAT16 DTYPE:
       model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
           model_id, torch_dtype=torch.bfloat16, device_map="auto")
    
    
    3. REDUCE TTS INFERENCE STEPS (faster but lower quality):
       tts_model.set_ddpm_inference_steps(10)
    
    
    4. CLEAR GPU CACHE:
       import gc
       torch.cuda.empty_cache()
       gc.collect()
    
    
    5. GRADIENT CHECKPOINTING FOR TRAINING:
       model.gradient_checkpointing_enable()
    """)
    
    
    print("\n" + "="*70)
    print("DOWNLOAD GENERATED FILES")
    print("="*70)
    
    
    output_files = ["/content/longform_output.wav"]
    
    
    for filepath in output_files:
        if os.path.exists(filepath):
            print(f"Downloading: {os.path.basename(filepath)}")
            files.download(filepath)
        else:
            print(f"File not found: {filepath}")
    
    
    print("\n" + "="*70)
    print("TUTORIAL COMPLETE!")
    print("="*70)
    
    
    print("""
    WHAT YOU LEARNED:
    
    
    VIBEVOICE ASR (Speech-to-Text):
     - 60-minute single-pass transcription
     - Speaker diarization (who said what, when)
     - Context-aware hotword recognition
     - 50+ language support
     - Batch processing
    
    
    VIBEVOICE REALTIME TTS (Text-to-Speech):
     - Real-time streaming (~300ms latency)
     - Multiple voice presets
     - Long-form generation (~10 minutes)
     - Configurable quality/speed
    
    
    RESOURCES:
     GitHub:     https://github.com/microsoft/VibeVoice
     ASR Model:  https://huggingface.co/microsoft/VibeVoice-ASR-HF
     TTS Model:  https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B
     ASR Paper:  https://arxiv.org/pdf/2601.18184
     TTS Paper:  https://openreview.net/pdf?id=FihSkzyxdv
    
    
    RESPONSIBLE USE:
     - This is for research/development only
     - Always disclose AI-generated content
     - Do not use for impersonation or fraud
     - Follow applicable laws and regulations
    """)

    We build an interactive Gradio interface that lets us type text and generate speech in a more user-friendly way. We also upload our own audio files for transcription, inspect the outputs, and apply the memory optimization suggestions to improve execution in Colab. Finally, we download the generated files and summarize the full set of capabilities that we explored throughout the tutorial.
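    When judging the real-time claims on our own hardware, a tiny timing wrapper (plain Python, not VibeVoice-specific) lets us measure how long a transcribe() or synthesize() call takes:

```python
# A generic timing helper (our own sketch): run any callable and return its
# result together with the elapsed wall-clock time in seconds.
import time

def timed_call(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would wrap synthesize(...) instead.
value, elapsed = timed_call(sum, range(1000))
print(value, f"{elapsed * 1000:.3f} ms")
```

    Dividing the elapsed time by the generated audio duration (len(audio)/sr) gives a real-time factor to compare against the ~300 ms latency figure quoted above.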

    In conclusion, we gained a solid practical understanding of how to run and experiment with Microsoft VibeVoice on Colab for both ASR and real-time TTS tasks. We learned how to transcribe audio with speaker information and hotword context, as well as how to synthesize natural speech, compare voices, create longer audio outputs, and connect transcription with generation in a unified workflow. Through these experiments, we saw how VibeVoice can serve as a powerful open-source foundation for voice assistants, transcription tools, accessibility systems, interactive demos, and broader speech AI applications, while also learning the optimization and deployment considerations needed for smoother real-world use.


