When your application can call many different LLMs with very different costs and capabilities, who should decide which one answers each request? The Salesforce AI Research team introduces ‘xRouter’, a tool-calling-based routing system that targets this gap with a reinforcement-learning-based router that learns when to answer locally and when to call external models, while tracking cost at the token level.
What is xRouter?
xRouter is a tool-calling-based orchestration system built on Qwen2.5-7B-Instruct as the router backbone. The router is an instruction-tuned model with tool-calling capabilities that decides which downstream model to invoke, how to prompt it, and whether to synthesize or select an answer. The implementation uses DAPO, Distributional Advantage Policy Optimization, inside the Verl reinforcement learning framework, and exposes an OpenAI-compatible API.
The router operates over more than 20 LLM tools in the full system. These tools span premium, standard, budget and specialized tiers, including GPT-5, GPT-4.1, GPT-5-Mini, GPT-5-Nano, o3, Kimi K2, DeepSeek-R1, Qwen3-235B variants and GPT-OSS models. The offloading pool is a 12-model subset that includes GPT-5, GPT-5-Mini, GPT-5-Nano, GPT-4o, GPT-4.1, o3, o3-Pro, o4-Mini, GPT-OSS-120B, GPT-OSS-20B and two Gemini-2.5 variants.
Cost-Aware Reward and Success Gating
Routing is framed as a reinforcement learning problem. For each episode, the reward combines a binary success signal and a cost penalty. The research team defines a reward that gives a fixed bonus when the final answer is correct, then subtracts a term proportional to the total normalized cost of all model calls. If the answer is wrong, the reward is zero no matter how cheap it was.
According to the model weights page, reward = quality − λ × normalized_cost, where λ is a cost penalty coefficient. Episodes with failures effectively have zero quality. This ‘success-gated, cost-shaped’ objective forces the router to first achieve correctness, then optimize cost among successful strategies. In practice, training uses three cost penalty settings, which produce the xRouter-7B-1, xRouter-7B-2 and xRouter-7B-3 variants.
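The success-gated reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's training code: the function name, the `success_bonus` default of 1.0 and the example λ value are assumptions chosen only to make the arithmetic visible.

```python
from typing import Sequence


def success_gated_reward(
    is_correct: bool,
    call_costs: Sequence[float],
    lam: float,
    success_bonus: float = 1.0,  # assumed value for the fixed bonus
) -> float:
    """Return the episode reward: zero on failure, bonus minus cost penalty on success.

    call_costs holds the normalized cost of every model call in the episode.
    """
    if not is_correct:
        # Failure gates the reward to zero, however cheap the episode was.
        return 0.0
    total_cost = sum(call_costs)
    return success_bonus - lam * total_cost


# Two correct episodes: the cheaper routing decision earns the higher reward.
cheap = success_gated_reward(True, [0.1], lam=0.5)
pricey = success_gated_reward(True, [0.1, 0.7], lam=0.5)
failed = success_gated_reward(False, [0.05], lam=0.5)
```

Under these assumed numbers, `cheap` is larger than `pricey`, and `failed` is zero even though it spent the least. A larger λ widens the gap between cheap and expensive successes, which is the lever that separates the three trained variants.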
Training Data and Signal Design
xRouter training data comes from Reasoning360, which includes math, code and general reasoning tasks with difficulty estimates derived from a strong reference model, Qwen3-32B. The research team stratifies samples into easy, medium and hard bands, and adds simpler chit-chat, retrieval and factual questions to teach the router when it can answer directly without delegation. Each sample includes descriptions and prices for models from different tiers. The system also refreshes the model catalog and perturbs costs to avoid overfitting to a static price table.
Failed trajectories, such as wrong answers from expensive models or unnecessary calls when the router could have answered itself, still incur full cost and receive zero reward. This produces a clean learning signal, where correctness gates reward and cost shapes the routing policy.
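A rough sketch of the two data-side ideas, difficulty banding and price perturbation, might look like the following. Everything specific here is an assumption made for illustration: the article does not say how difficulty is scored or how large the perturbations are, so the reference-model pass rate, the band thresholds, the ±20 percent jitter and the placeholder catalog prices are all hypothetical.

```python
import random
from typing import Dict, Optional


def difficulty_band(pass_rate: float, easy_min: float = 0.8, hard_max: float = 0.3) -> str:
    """Map a reference-model pass rate to a band (thresholds are hypothetical)."""
    if pass_rate >= easy_min:
        return "easy"
    if pass_rate <= hard_max:
        return "hard"
    return "medium"


def perturb_prices(
    catalog: Dict[str, float],
    max_rel_change: float = 0.2,  # assumed jitter of +/- 20 percent
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Return a copy of the catalog with each price scaled by a random factor."""
    rng = rng or random.Random()
    return {
        model: price * (1.0 + rng.uniform(-max_rel_change, max_rel_change))
        for model, price in catalog.items()
    }


# Placeholder tier names and prices, not real provider pricing.
catalog = {"premium-model": 10.0, "standard-model": 2.0, "budget-model": 0.4}
noisy_catalog = perturb_prices(catalog, rng=random.Random(0))
bands = [difficulty_band(p) for p in (0.9, 0.5, 0.1)]
```

Showing the router a slightly different price table for each sample means it cannot memorize "model X is always the cheap one" and has to read the prices it is given.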
How Does the Router Behave at Inference Time?
The router supports three execution modes. It can answer directly from the backbone without calling tools. It can call one or more downstream models, then synthesize a response using its own reasoning over their outputs. It can also call downstream models and use a special select_response tool to pick one of the replies as the final answer. These modes are implemented through function calls in an OpenAI-style interface, which the orchestration engine executes through LiteLLM and SGLang.
Empirically, trained xRouter instances use a mix of direct and synthesized responses. Off-the-shelf routers such as GPT-4o, GPT-4.1, GPT-5, Qwen2.5-7B and Qwen3-8B tend to answer directly most of the time, even when instructed to offload when uncertain. This is an important behavioral difference and explains part of the efficiency gain.
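To make the three modes concrete, here is a small sketch of how a select_response tool could be declared in the OpenAI function-calling schema, and how an orchestrator might resolve the final answer. Only the tool name select_response comes from the source. The `response_index` parameter, the description text and the `resolve_final_answer` helper are hypothetical, and the sketch deliberately uses plain dictionaries rather than calling LiteLLM or SGLang.

```python
from typing import List, Optional, Tuple

# Tool declaration in the OpenAI function-calling format.
# The parameter name "response_index" is an assumption for illustration.
SELECT_RESPONSE_TOOL = {
    "type": "function",
    "function": {
        "name": "select_response",
        "description": "Pick one downstream reply as the final answer.",
        "parameters": {
            "type": "object",
            "properties": {"response_index": {"type": "integer"}},
            "required": ["response_index"],
        },
    },
}


def resolve_final_answer(
    router_text: str,
    downstream_replies: List[str],
    selected_index: Optional[int] = None,
) -> Tuple[str, str]:
    """Return (mode, final_answer) for one episode."""
    if not downstream_replies:
        # Mode 1: the router answered from its own backbone.
        return "direct", router_text
    if selected_index is not None:
        # Mode 3: the router called select_response on one reply.
        return "select", downstream_replies[selected_index]
    # Mode 2: the router wrote its own answer after reading the replies.
    return "synthesize", router_text


direct = resolve_final_answer("4", [])
synthesized = resolve_final_answer("merged answer", ["reply A", "reply B"])
selected = resolve_final_answer("", ["reply A", "reply B"], selected_index=1)
```

The select path returns a downstream reply verbatim, while the synthesize path lets the router rewrite or combine what it received.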
Quantitative Results and Cost Utility
On static routing baselines across Minerva, MATH-500, Olympiad Bench, AIME-24, AMC-23, Codeforces, Code-Contests and Human-EvalPlus, xRouter-7B variants consistently improve accuracy compared to using the same base model as an untrained router. xRouter-7B-2, for example, reaches near GPT-5 accuracy on Olympiad Bench while using about one eighth of the GPT-5 evaluation cost.
In the system-level comparison on LiveCodeBenchv5, GPQADiamond, AIME25, MT-Bench, IFEval and LiveBench, xRouter-7B-3 achieves the best average accuracy on LiveCodeBenchv5 among all tested systems, and does so at moderate cost. Across tasks such as GPQA, xRouter variants reach around 80 to 90 percent of GPT-5 accuracy while consuming less than one fifth of the cost. The research team summarizes that their cost-aware reward can reduce inference cost by up to 80 percent at similar completion rates. The model weights card on Hugging Face reports up to 60 percent cost reduction for similar quality under other settings.
The research team also defines ‘cost utility’ as accuracy divided by cost. Open-source single models with very low API prices often reach higher cost utility, but with lower absolute accuracy. xRouter sits in the middle, trading some cost utility for stronger task performance, which is usually what production systems care about.
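The cost utility metric is simple enough to compute by hand, but a short example shows why the highest ratio is not always the best choice. The accuracy and cost figures below are made-up placeholders, not results from the paper.

```python
def cost_utility(accuracy: float, cost: float) -> float:
    """Accuracy divided by cost, as defined in the article."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return accuracy / cost


# Hypothetical systems as (accuracy, cost in arbitrary units).
systems = {
    "cheap-open-model": (0.50, 0.5),
    "router": (0.80, 2.0),
    "frontier-model": (0.90, 10.0),
}
utilities = {name: cost_utility(acc, cost) for name, (acc, cost) in systems.items()}
best_ratio = max(utilities, key=utilities.get)
```

With these placeholder numbers the cheap model has the best ratio despite the lowest accuracy, which mirrors the article's point that cost utility alone can favor systems that are too weak for production use.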
Key Takeaways
- xRouter is a tool-calling router built on Qwen2.5-7B-Instruct that learns to select among more than 20 external LLMs with a reinforcement learning policy that is explicitly cost aware.
- The router uses a success-gated reward: tasks only get positive reward when the final answer is correct, and within successful trajectories it applies a cost penalty term of λ times normalized cost, which yields three xRouter-7B variants with different cost-accuracy trade-offs.
- Training on Reasoning360 with difficulty stratification and synthetic easy queries teaches xRouter when to answer directly and when to offload, while perturbing prices and model pools improves robustness to changing provider catalogs.
- Across math, coding and reasoning benchmarks, xRouter-7B models achieve near GPT-5 accuracy on hard tasks like Olympiad Bench and around 80 to 90 percent of GPT-5 accuracy on GPQA, while cutting offloading cost by up to 60 to 80 percent depending on the evaluation setup.
Editorial Notes
xRouter is a practical step toward cost-aware orchestration for heterogeneous LLM fleets. It shows that a mid-size router, trained with DAPO on Reasoning360 using a success-gated, cost-shaped reward, can consistently approach GPT-5 accuracy while reducing offloading cost by up to 60 to 80 percent.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
