    A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing

    By Naveed Ahmad, 26/04/2026 (updated 26/04/2026)


    import shutil
    import time

    import numpy as np
    import matplotlib.pyplot as plt

    # Single-model results: VRAM timeline (left) and request latency (right).
    fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

    # mem_kvc / mem_base come from the earlier runs and are expected to hold
    # (time_s, vram_mb) samples; split them into separate x and y series.
    tk, mk = zip(*mem_kvc)
    tb, mb = zip(*mem_base)
    axes[0].plot(tk, mk, label="with kvcached", linewidth=2, color="#1f77b4")
    axes[0].plot(tb, mb, label="baseline (static)", linewidth=2,
                 linestyle="--", color="#d62728")
    # Dotted horizontal lines mark the idle VRAM of each configuration.
    axes[0].axhline(idle_kvc, color="#1f77b4", alpha=.3, linestyle=":")
    axes[0].axhline(idle_base, color="#d62728", alpha=.3, linestyle=":")
    axes[0].set_xlabel("time (s)")
    axes[0].set_ylabel("GPU memory used (MB)")
    axes[0].set_title("VRAM under a bursty workload\n(dotted = idle-baseline VRAM)")
    axes[0].grid(alpha=.3)
    axes[0].legend()

    # Per-request latency distribution for both configurations.
    axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
    axes[1].set_ylabel("request latency (s)")
    axes[1].set_title(f"Latency across {len(lat_kvc)} requests")
    axes[1].grid(alpha=.3)

    plt.tight_layout()
    plt.savefig("/content/kvcached_single_model.png", dpi=120, bbox_inches="tight")
    plt.show()
    
    
    print("\n--- Single-model summary --------------------------------------------")
    print(f"  Idle VRAM    kvcached: {idle_kvc:>6.0f} MB   "
          f"baseline: {idle_base:>6.0f} MB  "
          f"(savings: {idle_base - idle_kvc:>5.0f} MB)")
    print(f"  Peak VRAM    kvcached: {max(mk):>6.0f} MB   "
          f"baseline: {max(mb):>6.0f} MB")
    print(f"  Median lat.  kvcached: {np.median(lat_kvc):>6.2f} s   "
          f"baseline: {np.median(lat_base):>6.2f} s")
    # "Flex" is the spread between the highest and lowest sampled VRAM.
    print(f"  VRAM flex    kvcached: peak-idle = {max(mk)-min(mk):>5.0f} MB  "
          f"(baseline cannot release -- static pool)")
    
    
    print("\n=== Experiment 3: Two LLMs sharing one GPU (kvcached on each) ===")
    pA, lA = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path="/tmp/mA.log")
    try:
        wait_ready(PORT_A)
        pB, lB = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path="/tmp/mB.log")
        try:
            wait_ready(PORT_B)
            print(f"  Both models loaded. Idle VRAM: {vram_used_mb():.0f} MB")

            # Sample GPU memory in the background while the two servers
            # take turns handling a burst of requests.
            sampler = MemorySampler()
            sampler.start()
            for i in range(4):
                port, model = ((PORT_A, MODEL_A) if i % 2 == 0
                               else (PORT_B, MODEL_B))
                print(f"  round {i+1}: driving {model}")
                bursty_workload(port, model, n_bursts=1, burst_size=4, pause=0)
                time.sleep(5)  # idle gap between rounds
            sampler.stop()
            t, m = zip(*sampler.samples)

            plt.figure(figsize=(11, 4.2))
            plt.plot(t, m, color="#c2410c", linewidth=2)
            plt.xlabel("time (s)")
            plt.ylabel("GPU memory used (MB)")
            plt.title("Two LLMs on one T4 via kvcached: memory flexes per active model")
            plt.grid(alpha=.3)
            plt.tight_layout()
            plt.savefig("/content/kvcached_multillm.png", dpi=120,
                        bbox_inches="tight")
            plt.show()
        finally:
            # Always stop the second server, even if the workload fails.
            shutdown(pB, lB)
    finally:
        # Always stop the first server as well.
        shutdown(pA, lA)
    
    
    print("\n=== Bonus: kvcached ships CLI tools ===")
    print("  kvtop: live per-instance KV memory monitor (like nvtop for kvcached)")
    print("  kvctl: set/limit per-instance memory budgets in shared memory")
    # Report whether each CLI tool is discoverable on the current PATH.
    for tool in ("kvtop", "kvctl"):
        path = shutil.which(tool)
        print(f"    {tool}: {path or 'not on PATH'}")
    print("\nAll plots saved to /content/. Done.")
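
The listing above calls `vram_used_mb()` and `MemorySampler`, which are defined in an earlier part of the tutorial that is not reproduced here. For readers who want to run this section on its own, the following is a minimal sketch of what such helpers might look like. It is an assumption, not the tutorial's actual code: the `probe`, `interval`, and `gpu_index` parameters are hypothetical, and the only interface taken from the listing is `start()`, `stop()`, and a `samples` list of `(seconds, MB)` pairs.

```python
import subprocess
import threading
import time


def vram_used_mb(gpu_index=0):
    """Return the memory currently in use on one GPU, in MB, via nvidia-smi.

    Assumption: nvidia-smi is installed and on PATH.
    """
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip().splitlines()[0])


class MemorySampler:
    """Poll a memory probe on a background thread.

    Hypothetical reconstruction: records (elapsed_seconds, mb) tuples in
    `samples`, which is the shape the plotting code unpacks with zip(*...).
    """

    def __init__(self, probe=vram_used_mb, interval=0.5):
        self.probe = probe          # callable returning the current usage in MB
        self.interval = interval    # seconds between samples
        self.samples = []
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        t0 = time.monotonic()
        while not self._stop.is_set():
            self.samples.append((time.monotonic() - t0, self.probe()))
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self.interval)

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Passing a different `probe` callable makes the sampler usable without a GPU, for example to dry-run the plotting code with synthetic readings.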




    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy.
