Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools into infrastructure powering software development at large, the underlying inference engines serving these requests are under growing strain. Researchers at the LightSeek Foundation have released TokenSpeed, an open-source LLM inference engine licensed under MIT and designed specifically for the demands of agentic workloads. The TokenSpeed engine is currently in preview status.
Why Agentic Inference is a Different Problem
To understand why TokenSpeed's design choices matter, it helps to understand what makes agentic inference hard. Coding agents don't behave like a typical chatbot turn: contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks don't fully capture this behavior.

TokenSpeed is designed to maximize both. The objective is to maximize per-GPU TPM while maintaining a per-user TPS floor, typically 70 TPS and sometimes 200 TPS or higher.
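For intuition, the trade-off between the two metrics can be sketched with back-of-envelope arithmetic (our own illustrative numbers, not figures from the TokenSpeed team): aggregate per-GPU TPM is roughly the number of concurrent requests multiplied by the per-user TPS floor, so raising the floor shrinks how many users one GPU can carry.

```python
# Back-of-envelope relationship between the two serving metrics.
# Illustrative only; real engines scale sublinearly as batch size grows.

def per_gpu_tpm(concurrent_users: int, per_user_tps: float) -> float:
    """Aggregate decode tokens per minute for one GPU, assuming every
    concurrent request sustains the same per-user TPS."""
    return concurrent_users * per_user_tps * 60

# A 70 TPS-per-user floor served to 32 concurrent agent sessions:
print(per_gpu_tpm(32, 70))   # 134,400 tokens/minute on one GPU
# Raising the floor to 200 TPS at the same TPM budget supports fewer users:
print(134_400 / (200 * 60))  # ~11 concurrent users
```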
Architecture: Five Interlocking Subsystems
TokenSpeed's architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a safe KV resource reuse constraint, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.
The modeling layer uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program on different subsets of the data, a common pattern in distributed deep learning. Rather than requiring developers to hand-write the communication logic between processes, TokenSpeed lets developers specify I/O placement annotations at module boundaries; a lightweight static compiler then automatically generates the necessary collective operations during model construction.
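TokenSpeed's actual annotation syntax is not shown here, so the following is a hypothetical Python sketch of the general pattern: a module declares how its inputs and outputs are placed across ranks, and a compile step inserts the collective (here, an all-gather) needed to satisfy the next module's declared input placement. All names are assumptions, not TokenSpeed's real API.

```python
# Hypothetical sketch of boundary-annotated SPMD; requires an initialized
# torch.distributed process group to actually run the collective.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    # Declared placements at the module boundary: input is replicated on
    # every rank, output is sharded along the feature dimension. A static
    # compiler reading these annotations would emit the all-gather below
    # automatically whenever a downstream module expects replicated input.
    input_placement = "replicated"
    output_placement = "sharded(dim=-1)"

    def __init__(self, in_features: int, out_features: int, tp_size: int):
        super().__init__()
        self.local = torch.nn.Linear(in_features, out_features // tp_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.local(x)  # each rank computes only its shard of the output

def gather_output(shard: torch.Tensor, tp_size: int) -> torch.Tensor:
    # The collective a compiler pass would generate to turn a sharded
    # output back into a replicated tensor.
    chunks = [torch.empty_like(shard) for _ in range(tp_size)]
    dist.all_gather(chunks, shard)
    return torch.cat(chunks, dim=-1)
```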
The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. Because these correctness constraints are encoded in the type system instead of being left to runtime convention, errors in KV cache management, one of the most error-prone areas of LLM serving, are caught earlier. The execution plane is implemented in Python to preserve development velocity, enabling faster feature iteration and lower cognitive load for developers.
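The control plane itself is C++, so the snippet below is only a conceptual illustration, in Python, of what explicit lifecycle transitions with KV-block ownership look like; it is not TokenSpeed's implementation, and the states and transition table are assumed for the example.

```python
# Illustrative-only request/KV-cache state machine. TokenSpeed enforces
# these invariants via the C++ type system at compile time; here they are
# checked at runtime purely to show the idea of legal-transition checking.
from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    PREEMPTED = auto()   # KV blocks released or swapped out
    FINISHED = auto()

LEGAL_TRANSITIONS = {
    RequestState.QUEUED:     {RequestState.PREFILLING},
    RequestState.PREFILLING: {RequestState.DECODING, RequestState.PREEMPTED},
    RequestState.DECODING:   {RequestState.FINISHED, RequestState.PREEMPTED},
    RequestState.PREEMPTED:  {RequestState.PREFILLING},
    RequestState.FINISHED:   set(),
}

class Request:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.state = RequestState.QUEUED
        self.kv_blocks: list[int] = []   # KV cache block IDs owned by this request

    def transition(self, new_state: RequestState) -> None:
        if new_state not in LEGAL_TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        if new_state in (RequestState.PREEMPTED, RequestState.FINISHED):
            self.kv_blocks.clear()       # ownership of KV blocks is released
        self.state = new_state
```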
The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, meaning it is not locked to NVIDIA hardware. The development team has also built one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, since num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, the TokenSpeed MLA kernel has been adopted by vLLM.
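The plugin API is not documented in this article, so the sketch below shows, in generic Python, what a registry-plus-selection kernel subsystem of this kind typically looks like; every name here is hypothetical.

```python
# Hypothetical kernel registry with backend-aware selection, to illustrate
# the "registry and selection model" idea; not TokenSpeed's plugin API.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator a plugin uses to make an op implementation selectable."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap

def select_kernel(op: str, backend: str) -> Callable:
    """Pick the backend-specific implementation, falling back to a reference one."""
    return _KERNELS.get((op, backend), _KERNELS[(op, "reference")])

@register_kernel("mla_decode", "reference")
def mla_decode_reference(*args, **kwargs):
    ...  # slow, portable pure-PyTorch fallback would live here

@register_kernel("mla_decode", "cuda_blackwell")
def mla_decode_blackwell(*args, **kwargs):
    ...  # vendor-tuned kernel contributed by an accelerator plugin
```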
Finally, TokenSpeed integrates SMG, a PyTorch-native component, as a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.
Benchmark Results Against TensorRT-LLM on NVIDIA B200
It is worth noting upfront that these benchmarks cover single (non-disaggregated) deployment only. PD (prefill-decode) disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.

Together with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror production coding-agent traffic, and benchmarked against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.
For coding agents running above 70 TPS per user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS per user. TP4 here refers to tensor parallelism across four GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
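For readers unfamiliar with the technique, here is a minimal, generic illustration of tensor-parallel weight sharding (standard TP, not TokenSpeed-specific code), with layer sizes chosen purely for illustration:

```python
# Minimal TP4 sharding sketch: each of 4 GPUs holds a quarter of the output
# features, computes a partial result, and the shards are gathered back.
import torch

tp_size = 4
hidden, ffn = 1024, 4096                      # illustrative layer sizes
full_weight = torch.randn(ffn, hidden)

# Column-parallel split: rank r keeps output features [r*ffn/4, (r+1)*ffn/4).
shards = torch.chunk(full_weight, tp_size, dim=0)

x = torch.randn(1, hidden)
partials = [x @ w.T for w in shards]          # each rank's local matmul
out = torch.cat(partials, dim=-1)             # an all-gather collective in practice
assert torch.allclose(out, x @ full_weight.T, atol=1e-3)
```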
In the MLA kernel, the gains are more pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM's MLA across all five typical prefill workloads for coding agents with long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with long prefix KV cache.
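The fold can be pictured with plain tensor shapes. The sketch below uses assumed sizes (a 4-token speculative draft, 16 heads) and ordinary PyTorch ops rather than the actual CUDA kernel, just to show why merging the two axes produces a GEMM M dimension that fills a Tensor Core tile:

```python
# Shape-level illustration of folding the query-sequence axis into the head
# axis for MLA decode; sizes are assumptions for the example.
import torch

batch, q_len, num_heads, head_dim, kv_len = 8, 4, 16, 128, 8192

# With speculative decoding, each step carries q_len > 1 draft tokens.
q = torch.randn(batch, q_len, num_heads, head_dim)
# MLA stores one shared latent KV per token rather than per-head K/V.
kv = torch.randn(batch, kv_len, head_dim)

# Unfolded, the BMM1 (score) GEMM has M = q_len (4) or num_heads (16),
# each too small to fill an MMA tile on its own. Folded, M = 4 * 16 = 64.
q_folded = q.reshape(batch, q_len * num_heads, head_dim)
scores = torch.bmm(q_folded, kv.transpose(1, 2))   # [batch, 64, kv_len]
```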
Key Takeaways
- TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads. (Available in preview.)
- Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
- On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS per user on Kimi K2.5.
- The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.
