Computer-use agents have been restricted to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model built around a hybrid action space that lets an agent interleave low-level GUI actions with high-level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success and reduces steps on OSWorld, and transfers to WindowsAgentArena without Windows-specific training.
What does hybrid action change?
Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation as a single function with a clear signature and a docstring. A click or a key press remains available when no programmatic path exists. The agent learns to alternate between both modes. The goal is to reduce cascading errors and to cut step counts. The research team positions this as a bridge between GUI-only CUAs and tool-centric agent frameworks.
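To make the idea concrete, here is a minimal sketch of a hybrid action space in Python. It is not UltraCUA's implementation: the class names `GuiAction` and `ToolCall`, the `set_font_size` tool, and the `execute` dispatcher are all hypothetical and only illustrate how one policy output type can cover both modes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class GuiAction:
    """A low-level primitive: a click, a key press, or a scroll (hypothetical type)."""
    kind: str       # e.g. "click", "type", "scroll"
    argument: str   # e.g. screen coordinates or the text to type


@dataclass
class ToolCall:
    """A high-level call that stands in for a multi-step GUI sequence (hypothetical type)."""
    name: str
    kwargs: dict = field(default_factory=dict)


# The agent emits one of the two kinds of action at every step.
Action = Union[GuiAction, ToolCall]


def set_font_size(size: int) -> str:
    """Hypothetical tool: set the font size of the current selection."""
    return f"font size set to {size}"


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {"set_font_size": set_font_size}


def execute(action: Action) -> str:
    """Dispatch a tool call to its function, or fall back to a GUI primitive."""
    if isinstance(action, ToolCall):
        return TOOLS[action.name](**action.kwargs)
    # A real agent would drive the mouse or keyboard here; this sketch only logs.
    return f"gui:{action.kind}({action.argument})"
```

In this sketch, `execute(ToolCall("set_font_size", {"size": 14}))` runs the tool in one step, while `execute(GuiAction("click", "120,340"))` takes the primitive path that is used when no tool applies.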
Scaled tool acquisition
UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation, integrates open-source implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.
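One way to picture a tool that hides a GUI sequence is a function wrapping a documented keyboard shortcut, registered per domain and described by its signature and docstring. The sketch below is illustrative only: the `tool` decorator, `REGISTRY`, `RecordingKeyboard`, and both example tools are assumptions rather than part of UltraCUA's library, and the shortcuts shown are VS Code's defaults on Linux and Windows.

```python
import inspect
from collections import defaultdict

# Hypothetical registry: domain name -> {tool name -> callable}.
REGISTRY = defaultdict(dict)


def tool(domain):
    """Register a function as a tool for the given domain (hypothetical decorator)."""
    def decorator(fn):
        REGISTRY[domain][fn.__name__] = fn
        return fn
    return decorator


class RecordingKeyboard:
    """Stand-in for a real input backend; it only records what would be sent."""
    def __init__(self):
        self.events = []

    def press(self, *keys):
        self.events.append(("press", keys))

    def type(self, text):
        self.events.append(("type", text))


@tool("vscode")
def toggle_line_comment(kb):
    """Toggle a comment on the current line (Ctrl+/)."""
    kb.press("ctrl", "/")


@tool("vscode")
def quick_open(kb, filename: str):
    """Open a file by name through Quick Open (Ctrl+P)."""
    kb.press("ctrl", "p")
    kb.type(filename)
    kb.press("enter")


def describe(domain):
    """Render each tool as 'name(signature): docstring' for the model's prompt."""
    return [
        f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}"
        for name, fn in sorted(REGISTRY[domain].items())
    ]
```

Calling `describe("vscode")` returns one line per registered tool. Limiting how many of those lines are shown to the model is one simple way to keep the prompt short.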
Verifiable synthetic tasks and trajectories
Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic engine. An evaluator-first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. An instruction-first pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi-app workflows. Chrome has 2,826 tasks. The LibreOffice suite sums to 5,885 tasks. Multi-app tasks reach 2,113.
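The evaluator-first idea, composing small checks into a task-level verifier, can be sketched as follows. This is an assumption-laden illustration, not the paper's code: the state dictionary layout and the helpers `file_exists`, `url_open`, and `all_of` are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical snapshot of the environment after the agent finishes.
State = Dict[str, List[str]]
Verifier = Callable[[State], bool]


def file_exists(path: str) -> Verifier:
    """Atomic check: the given path is present in the file listing."""
    return lambda state: path in state.get("files", [])


def url_open(url: str) -> Verifier:
    """Atomic check: the given URL is open in a browser tab."""
    return lambda state: url in state.get("open_tabs", [])


def all_of(*checks: Verifier) -> Verifier:
    """Compose atomic checks; the task passes only if every check passes."""
    return lambda state: all(check(state) for check in checks)


# A composed verifier defines the success condition first; a task instruction
# such as "download the report and keep the page open" is then written to match it.
task_check = all_of(file_exists("report.pdf"), url_open("https://example.com"))
```

Because the success condition is an executable predicate, the same function can serve as a reward signal during reinforcement learning.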
A multi-agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.
Training Approach
Training has two stages. Stage 1 is supervised fine-tuning. The models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories. Loss is applied turn-wise to avoid over-weighting early steps. Stage 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks that are sampled by difficulty. The policy optimization follows a GRPO variant with clip-higher, and removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool-use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K tokens by controlling the number of exposed tools.
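The reinforcement learning stage can be sketched in a few lines. Everything specific here is an assumption: the article does not give the reward formula, so `combined_reward` simply adds a small bonus (`tool_bonus`, a placeholder value) to a 0/1 outcome, and the clipping bounds `eps_low` and `eps_high` are illustrative defaults rather than the paper's settings. The group-normalized advantage and the asymmetric ("clip-higher") surrogate follow the general GRPO-style recipe, with no KL term.

```python
from statistics import mean, pstdev


def combined_reward(task_success: bool, used_tool: bool, tool_bonus: float = 0.1) -> float:
    """Assumed reward shape: sparse outcome plus a small tool-use term."""
    return (1.0 if task_success else 0.0) + (tool_bonus if used_tool else 0.0)


def group_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within a group of rollouts sampled for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token surrogate with a wider upper clip bound than lower bound.

    `ratio` is the new-policy probability divided by the old-policy probability.
    No KL penalty is added, matching the description of the variant.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Four rollouts of one task: two succeed, two fail, none use tools.
rewards = [combined_reward(s, False) for s in (True, False, True, False)]
advantages = group_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative advantages, so the update pushes probability toward the action choices that led to verified success. The wider upper bound lets low-probability but useful actions, such as a rarely chosen tool call, grow faster than a symmetric clip would allow.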
Results on OSWorld
UltraCUA improves success at both 7B and 32B scales. Under a 15-step budget, UltraCUA-32B reaches 41.0% success, while OpenCUA-32B reaches 29.7%, an absolute gain of 11.3 points. UltraCUA-7B reaches 28.9%, while UI-TARS-1.5-7B reaches 23.4%. Gains persist under a 50-step budget. A per-domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross-application tasks. Average steps decrease against baselines. These shifts indicate better action selection rather than simply more attempts.
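The gains quoted above can be checked with simple arithmetic. The helper names below are hypothetical, and the relative figures are derived here from the reported success rates rather than taken from the article.

```python
def absolute_gain(new: float, base: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(new - base, 1)


def relative_gain(new: float, base: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (new - base) / base, 1)


# 15-step budget on OSWorld, success rates in percent.
gain_32b = absolute_gain(41.0, 29.7)      # UltraCUA-32B vs OpenCUA-32B -> 11.3
gain_7b = absolute_gain(28.9, 23.4)       # UltraCUA-7B vs UI-TARS-1.5-7B -> 5.5
rel_32b = relative_gain(41.0, 29.7)       # -> 38.0
rel_7b = relative_gain(28.9, 23.4)        # -> 23.5
```

The 11.3-point figure matches the reported absolute gain for the 32B comparison, and the 7B comparison works out to 5.5 points.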
Cross-platform transfer on WindowsAgentArena
UltraCUA trains only on Ubuntu-based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7% success. This exceeds UI-TARS-1.5-7B at 18.1% and a Qwen2 baseline trained with Windows data at 13.5%. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero-shot platform generalization.
Key Takeaways
- UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which reduces long, error-prone action chains.
- The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding more than 17,000 verifiable computer-use tasks for training and evaluation.
- Training follows a two-stage recipe, supervised fine-tuning on successful hybrid trajectories followed by online reinforcement learning on verifiable tasks, which optimizes when to call tools versus act in the GUI.
- On OSWorld, UltraCUA reports an average 22% relative improvement over base models and 11% fewer steps, which indicates gains in reliability and efficiency.
- The 7B model reaches 21.7% success on WindowsAgentArena without Windows-specific training, which shows cross-platform generalization of the hybrid action policy.
UltraCUA moves computer-use agents from brittle primitive action chains to a hybrid action policy that integrates GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine that yields more than 17,000 verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. Reported results include a 22% relative improvement on OSWorld with 11% fewer steps, and 21.7% success on WindowsAgentArena without Windows-specific training, which indicates cross-platform transfer of the policy.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.