Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs
Large language model serving often wastes GPU memory because engines pre-reserve large static KV cache regions per model, even when requests are bursty or idle. Meet ‘kvcached’, a library that enables a virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached has been developed by a research team from Berkeley’s Sky Computing …
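To see why static pre-reservation is costly, consider the standard KV cache sizing formula: each layer stores a K and a V tensor per token. The sketch below computes the per-sequence footprint for illustrative, Llama-style model dimensions (the specific shape values are assumptions, not figures from the article):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The factor of 2 accounts for storing both the K and the V tensor
    at every transformer layer; bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 8B-class model shape (assumed values, with grouped-query attention)
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{per_seq / 2**30:.1f} GiB per 8K-token sequence")  # → 1.0 GiB
```

At roughly 1 GiB per 8K-token sequence, an engine that statically reserves cache space for hundreds of concurrent sequences ties up tens of GiB of GPU memory whether or not requests actually arrive, which is the waste kvcached's elastic allocation targets.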