Stanford Researchers Released AgentFlow: In-the-Flow Reinforcement Learning (RL) for Modular, Tool-Using AI Agents


TL;DR: AgentFlow is a trainable agent framework with four modules (Planner, Executor, Verifier, Generator) coordinated by an explicit memory and toolset. The planner is optimized in the loop with a new on-policy method, Flow-GRPO, which broadcasts a trajectory-level outcome reward to every turn and applies token-level PPO-style updates with KL regularization and group-normalized advantages. On ten benchmarks, a 7B backbone tuned with Flow-GRPO reports gains of +14.9% (search), +14.0% (agentic), +14.5% (math), and +4.1% (science) over strong baselines.

What is AgentFlow?

AgentFlow formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). At each turn, the Planner proposes a sub-goal and selects a tool plus context; the Executor calls the tool; the Verifier signals whether to continue; the Generator emits the final answer on termination. A structured, evolving memory records states, tool calls, and verification signals, constraining context growth and making trajectories auditable. Only the planner is trained; the other modules can be fixed engines.
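The loop is easy to picture in code. Below is a minimal, illustrative Python sketch of the four-module cycle with an explicit memory; all class and function names here are assumptions for exposition, not the AgentFlow API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    # One record per turn: (sub-goal, tool name, tool result, verdict).
    records: list = field(default_factory=list)

def run_agent(query, planner, executor, verifier, generator, max_turns=10):
    """Hypothetical Planner-Executor-Verifier-Generator loop (not the repo API)."""
    memory = Memory()
    for _ in range(max_turns):
        # Planner (the only trained module): propose a sub-goal, pick a tool.
        subgoal, tool_name, tool_input = planner(query, memory)
        # Executor: call the selected tool with the planner-chosen context.
        result = executor(tool_name, tool_input)
        # Verifier: signal whether to continue or terminate.
        verdict = verifier(query, subgoal, result, memory)
        memory.records.append((subgoal, tool_name, result, verdict))
        if verdict == "stop":
            break
    # Generator: emit the final answer from the query plus the memory.
    return generator(query, memory)
```

Because the memory holds every state, tool call, and verification signal, the resulting trajectory is auditable turn by turn, and only the planner's policy needs gradient updates.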

The public implementation showcases a modular toolkit (e.g., base_generator, python_coder, google_search, wikipedia_search, web_search) and ships quick-start scripts for inference, training, and benchmarking. The repository is MIT-licensed.

https://arxiv.org/pdf/2510.05592

Training method: Flow-GRPO

Flow-GRPO (Flow-based Group Refined Policy Optimization) converts long-horizon, sparse-reward optimization into tractable single-turn updates (see the sketch after this list):

  • Final-outcome reward broadcast: a single, verifiable trajectory-level signal (LLM-as-judge correctness) is assigned to every turn, aligning local planning steps with global success.
  • Token-level clipped objective: importance-weighted ratios are computed per token, with PPO-style clipping and a KL penalty against a reference policy to prevent drift.
  • Group-normalized advantages: variance reduction across groups of on-policy rollouts stabilizes updates.
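To make the objective concrete, here is a minimal PyTorch-style sketch of such an update under stated assumptions: the tensor shapes, the function name flow_grpo_loss, the simple k1 KL estimator, and the hyperparameter defaults are illustrative, not taken from the paper's implementation.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, kl_coef=0.01):
    """Sketch of a Flow-GRPO-style update (shapes and names assumed).

    logp_new, logp_old, logp_ref: (G, T) per-token log-probs for a group
    of G rollouts, T planner tokens each.
    rewards: (G,) trajectory-level outcome rewards (e.g., 0/1 judge score).
    """
    # Group-normalized advantage: one scalar per rollout, broadcast to
    # every token of every turn so local steps share the global outcome.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv.unsqueeze(-1).expand_as(logp_new)                # (G, T)

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Crude (k1) KL estimate against a frozen reference policy to limit drift.
    kl = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl
```

Note that with binary outcome rewards, group normalization zeroes the advantage whenever all rollouts in a group share the same verdict, so the learning signal comes from groups with mixed successes and failures.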

Understanding the results and benchmarks

Benchmarks. The research team evaluates four task types: knowledge-intensive search (Bamboogle, 2Wiki, HotpotQA, Musique), agentic reasoning (GAIA textual split), math (AIME-24, AMC-23, Game of 24), and science (GPQA, MedQA). GAIA is a tooling-oriented benchmark for general assistants; the textual split excludes multimodal requirements.

Main numbers (7B backbone after Flow-GRPO). Average gains over strong baselines: +14.9% (search), +14.0% (agentic), +14.5% (math), +4.1% (science). The research team states that their 7B system surpasses GPT-4o on the reported suite. The project page also reports training effects such as improved planning quality, reduced tool-calling errors (by up to 28.4% on GAIA), and positive trends with larger turn budgets and model scale.

Ablations. Online Flow-GRPO improves performance by +17.2% vs. a frozen-planner baseline, whereas offline supervised fine-tuning of the planner degrades performance by −19.0% on their composite metric.


Key Takeaways

  • Modular agent, planner-only training. AgentFlow structures an agent into Planner–Executor–Verifier–Generator with an explicit memory; only the Planner is trained in-loop.
  • Flow-GRPO converts long-horizon RL into single-turn updates. A trajectory-level outcome reward is broadcast to every turn; updates use token-level PPO-style clipping with KL regularization and group-normalized advantages.
  • Research-team-reported gains on 10 benchmarks. With a 7B backbone, AgentFlow reports average improvements of +14.9% (search), +14.0% (agentic/GAIA textual), +14.5% (math), and +4.1% (science) over strong baselines, and states it surpasses GPT-4o on the same suite.
  • Tool-use reliability improves. The research team reports reduced tool-calling errors (e.g., on GAIA) and better planning quality under larger turn budgets and model scale.

AgentFlow formalizes tool-using agents into four modules (planner, executor, verifier, generator) and trains only the planner in-loop via Flow-GRPO, which broadcasts a single trajectory-level reward to every turn with token-level PPO-style updates and KL control. Reported results on ten benchmarks show average gains of +14.9% (search), +14.0% (agentic/GAIA textual split), +14.5% (math), and +4.1% (science); the research team additionally states that the 7B system surpasses GPT-4o on this suite. The implementation, tools, and quick-start scripts are MIT-licensed in the GitHub repo.


Check out the Technical Paper, GitHub Page, and Project Page.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.


