
    A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing

    By Naveed Ahmad, 26/04/2026 (updated 26/04/2026)


    # time and shutil are used further down (time.sleep, shutil.which).
    import shutil
    import time

    import numpy as np
    import matplotlib.pyplot as plt

    # Left panel: VRAM over time. Right panel: per-request latency.
    fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

    # Unpack the (time, MB) samples recorded during the earlier runs.
    tk, mk = zip(*mem_kvc); tb, mb = zip(*mem_base)
    axes[0].plot(tk, mk, label="with kvcached", linewidth=2, color="#1f77b4")
    axes[0].plot(tb, mb, label="baseline (static)", linewidth=2,
                 linestyle="--", color="#d62728")
    # Dotted horizontal lines mark each configuration's idle VRAM.
    axes[0].axhline(idle_kvc,  color="#1f77b4", alpha=.3, linestyle=":")
    axes[0].axhline(idle_base, color="#d62728", alpha=.3, linestyle=":")
    axes[0].set_xlabel("time (s)"); axes[0].set_ylabel("GPU memory used (MB)")
    axes[0].set_title("VRAM under a bursty workload\n(dotted = idle-baseline VRAM)")
    axes[0].grid(alpha=.3); axes[0].legend()

    axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
    axes[1].set_ylabel("request latency (s)")
    axes[1].set_title(f"Latency across {len(lat_kvc)} requests")
    axes[1].grid(alpha=.3)

    plt.tight_layout()
    plt.savefig("/content/kvcached_single_model.png", dpi=120, bbox_inches="tight")
    plt.show()
    
    
    print("\n--- Single-model summary --------------------------------------------")
    print(f"  Idle VRAM    kvcached: {idle_kvc:>6.0f} MB   "
          f"baseline: {idle_base:>6.0f} MB  "
          f"(savings: {idle_base - idle_kvc:>5.0f} MB)")
    print(f"  Peak VRAM    kvcached: {max(mk):>6.0f} MB   "
          f"baseline: {max(mb):>6.0f} MB")
    print(f"  Median lat.  kvcached: {np.median(lat_kvc):>6.2f} s   "
          f"baseline: {np.median(lat_base):>6.2f} s")
    print(f"  VRAM flex    kvcached: peak-idle = {max(mk)-min(mk):>5.0f} MB  "
          f"(baseline cannot release -- static pool)")
    
    
    print("\n=== Experiment 3: Two LLMs sharing one GPU (kvcached on each) ===")
    pA, lA = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path="/tmp/mA.log")
    try:
        wait_ready(PORT_A)
        pB, lB = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path="/tmp/mB.log")
        try:
            wait_ready(PORT_B)
            print(f"  Both models loaded. Idle VRAM: {vram_used_mb():.0f} MB")

            # Alternate single bursts between the two models while sampling VRAM.
            sampler = MemorySampler(); sampler.start()
            for i in range(4):
                port, model = ((PORT_A, MODEL_A) if i % 2 == 0
                               else (PORT_B, MODEL_B))
                print(f"  round {i+1}: driving {model}")
                bursty_workload(port, model, n_bursts=1, burst_size=4, pause=0)
                time.sleep(5)
            sampler.stop()
            t, m = zip(*sampler.samples)

            plt.figure(figsize=(11, 4.2))
            plt.plot(t, m, color="#c2410c", linewidth=2)
            plt.xlabel("time (s)"); plt.ylabel("GPU memory used (MB)")
            plt.title("Two LLMs on one T4 via kvcached — memory flexes per active model")
            plt.grid(alpha=.3); plt.tight_layout()
            plt.savefig("/content/kvcached_multillm.png", dpi=120,
                        bbox_inches="tight")
            plt.show()
        finally:
            # Always stop the second server, even if the workload fails.
            shutdown(pB, lB)
    finally:
        # Always stop the first server, even if the second never came up.
        shutdown(pA, lA)
    
    
    print("\n=== Bonus: kvcached ships CLI tools ===")
    print("  kvtop  — live per-instance KV memory monitor (like nvtop for kvcached)")
    print("  kvctl  — set/limit per-instance memory budgets in shared memory")
    for tool in ("kvtop", "kvctl"):
        path = shutil.which(tool)
        print(f"    {tool}: {path or 'not on PATH'}")
    print("\nAll plots saved to /content/. Done.")
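
The listing above uses a `MemorySampler` object and reads its `samples` as (time, MB) pairs, but the helper itself is not shown in this section. As a rough sketch only, and not the tutorial's own implementation, a background sampler could look like the following. The injected `probe` callable, the `interval` parameter, and the `summarize` helper are all assumptions made for illustration; on a GPU machine the probe would presumably be something like the `vram_used_mb` function used above.

```python
import threading
import time
from typing import Callable, Dict, List, Tuple


class MemorySampler:
    """Hypothetical stand-in: polls a probe on a background thread and
    records (elapsed_seconds, megabytes) pairs in `samples`."""

    def __init__(self, probe: Callable[[], float], interval: float = 0.5):
        self.probe = probe          # assumption: any zero-argument callable returning MB
        self.interval = interval    # seconds between readings
        self.samples: List[Tuple[float, float]] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        t0 = time.monotonic()
        while not self._stop.is_set():
            self.samples.append((time.monotonic() - t0, self.probe()))
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self.interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def summarize(samples: List[Tuple[float, float]]) -> Dict[str, float]:
    """Lowest reading, peak reading, and their difference (the 'flex')."""
    values = [mb for _, mb in samples]
    return {
        "min_mb": min(values),
        "peak_mb": max(values),
        "flex_mb": max(values) - min(values),
    }


if __name__ == "__main__":
    # Fake readings so the sketch runs without a GPU.
    readings = iter([1000.0, 1400.0, 1900.0, 1200.0])
    sampler = MemorySampler(probe=lambda: next(readings, 1000.0), interval=0.02)
    sampler.start()
    time.sleep(0.2)
    sampler.stop()
    print(summarize(sampler.samples))
```

Injecting the probe keeps the sampling loop testable without a GPU. Unlike the `MemorySampler()` call in the listing, this version needs the probe passed in explicitly, and each instance can be started only once because it wraps a single thread.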




    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy.
