Poolside AI launched the first two models in its Laguna family: Laguna M.1 and Laguna XS.2. Alongside these, the company is releasing pool, a lightweight terminal-based coding agent and dual Agent Client Protocol (ACP) client and server: the same environment Poolside uses internally for agent RL training and evaluation, now available as a research preview.
What Are These Models, and Why Should You Care?
Both Laguna M.1 and Laguna XS.2 are Mixture-of-Experts (MoE) models. Instead of activating all parameters for every token, MoE models route each token through only a subset of specialized sub-networks called 'experts.' This gives a large total parameter count, and the capacity headroom that comes with it, while only paying the compute cost of a much smaller "activated" parameter count at inference time.
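To make the routing idea concrete, here is a minimal PyTorch sketch of top-k expert routing with sigmoid gating (the gating style XS.2 reportedly uses). The dimensions, expert count, and k are illustrative placeholders, not Laguna's actual configuration.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Toy MoE layer: each token is routed to only k of n experts,
    so compute scales with k while total capacity scales with n."""
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                  # x: [tokens, d_model]
        gate = torch.sigmoid(self.router(x))               # sigmoid gating scores
        weights, idx = gate.topk(self.k, dim=-1)           # keep top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # normalize kept weights
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # dispatch expert by expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```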
Laguna M.1 is a 225B total parameter MoE model with 23B activated parameters, trained from scratch on 30T tokens using 6,144 interconnected NVIDIA Hopper GPUs. It completed pre-training at the end of last year and serves as the foundation for the entire Laguna family. On benchmarks, it reaches 72.5% on SWE-bench Verified, 67.3% on SWE-bench Multilingual, 46.9% on SWE-bench Pro, and 40.7% on Terminal-Bench 2.0.
Laguna XS.2 is the second-generation MoE and Poolside's first open-weight model, built on everything learned since training M.1. At 33B total parameters with 3B activated per token, it is designed for agentic coding and long-horizon work on a local machine, compact enough to run on a Mac with 36 GB of RAM via Ollama. It scores 68.2% on SWE-bench Verified, 62.4% on SWE-bench Multilingual, 44.5% on SWE-bench Pro, and 30.1% on Terminal-Bench 2.0. Poolside will also release Laguna XS.2-base soon for practitioners who want to fine-tune.
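For local use, a run might look like the following with the official Ollama Python client. The model tag is a placeholder, since the published tag name isn't stated here; check the Ollama model library for the real one.

```python
# Hypothetical local chat with Laguna XS.2 via the Ollama Python client
# (pip install ollama). The tag "laguna-xs-2" is a placeholder, not confirmed.
import ollama

response = ollama.chat(
    model="laguna-xs-2",  # placeholder tag
    messages=[{"role": "user",
               "content": "Refactor this bash loop into a Python script."}],
)
print(response["message"]["content"])
```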
Architecture: The Efficiency Decisions in XS.2
XS.2 uses sigmoid gating with per-layer rotary scales, alongside a mixed Sliding Window Attention (SWA) and global attention layout in a 3:1 ratio across 40 total layers: 30 SWA layers and 10 global attention layers. Sliding Window Attention limits each token's attention to a local window of 512 tokens rather than the full sequence, dramatically cutting KV cache memory. The global attention layers, at a 1-in-4 ratio, preserve long-range dependencies without paying the full cost everywhere. The model also quantizes the KV cache to FP8, further reducing memory per token.
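A back-of-envelope calculation shows why the 3:1 layout matters for memory. Only the layer split (30 SWA, 10 global), the 512-token window, the 131,072-token context, and the FP8 (1-byte) cache come from the announcement; the KV head count and head dimension below are made-up placeholders.

```python
def kv_cache_bytes(seq_len, n_swa_layers=30, n_global_layers=10,
                   window=512, n_kv_heads=8, head_dim=128, bytes_per_elt=1):
    """Rough KV-cache size for a mixed SWA/global stack at FP8 (1 byte/elt)."""
    per_token_layer = 2 * n_kv_heads * head_dim * bytes_per_elt  # K and V
    swa = n_swa_layers * per_token_layer * min(seq_len, window)  # window caps SWA cache
    glob = n_global_layers * per_token_layer * seq_len           # global keeps everything
    return swa + glob

full = kv_cache_bytes(131_072, n_swa_layers=0, n_global_layers=40)
mixed = kv_cache_bytes(131_072)
print(f"all-global FP8 cache: {full / 2**30:.2f} GiB")
print(f"mixed 3:1 layout    : {mixed / 2**30:.2f} GiB ({mixed / full:.0%} of full)")
```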
Under the hood, XS.2 uses 256 experts with 1 shared expert, supports a context window of 131,072 tokens, and features native reasoning support: interleaved thinking between tool calls, with per-request control over enabling or disabling thinking.
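Collected in one place, the reported XS.2 specs look roughly like this; the key names are informal labels for reference, not the model's actual config schema.

```python
# Reported Laguna XS.2 configuration, summarized from the announcement.
LAGUNA_XS2 = {
    "total_params": "33B",
    "active_params_per_token": "3B",
    "layers": 40,  # 30 sliding-window + 10 global attention (3:1)
    "attention": {"swa_window": 512, "global_every": 4, "kv_cache_dtype": "fp8"},
    "experts": {"routed": 256, "shared": 1},
    "context_window": 131_072,
    "reasoning": "interleaved thinking between tool calls, toggleable per request",
}
```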
Training: Three Areas Where Poolside Pushed Hard
Poolside trains all its models from scratch using its own data pipeline, its own training codebase (Titan), and its own agent RL infrastructure. Three areas saw particular investment for Laguna.
AutoMixer: Optimizing the Data Mix Automatically. Data curation, and the mix of data that goes into training, has an outsized impact on final model performance. Rather than relying on manual heuristics, Poolside developed an automixing framework that trains a swarm of roughly 60 proxy models, each on a different data mix, and measures performance across key capability groups: code, math, STEM, and common sense. Surrogate regressors are then fit to approximate how changes in dataset proportions affect downstream evaluations, giving a learned mapping from data mix to performance that can be directly optimized; a toy sketch follows below. The approach is inspired by prior work including Olmix, MDE, and RegMix, adapted to Poolside's setting with richer data groupings.
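Here is a minimal sketch of that idea under stated assumptions: fit a surrogate from mix proportions to eval score, then search the probability simplex for the best predicted mix. The regressor choice and the random stand-in scores are illustrative, not Poolside's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_proxies, n_sources = 60, 5                                  # ~60 proxy runs

mixes = rng.dirichlet(np.ones(n_sources), size=n_proxies)     # proxy data mixes
scores = rng.random(n_proxies)                                # stand-in for measured evals

surrogate = GradientBoostingRegressor().fit(mixes, scores)    # learned mix -> score map

candidates = rng.dirichlet(np.ones(n_sources), size=100_000)  # search the simplex
best_mix = candidates[surrogate.predict(candidates).argmax()]
print("predicted-best mix:", np.round(best_mix, 3))
```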
On the data side, both Laguna models were trained on more than 30T tokens. Poolside's diversity-preserving data curation approach, which retains portions of mid- and lower-quality buckets alongside top-quality data to avoid STEM bias, yields roughly 2× more unique tokens compared to precision-focused pipelines, with the gain persisting at longer training horizons. A separate deduplication analysis also showed that global deduplication disproportionately removes high-quality data, informing how the team tuned its pipeline. Synthetic data contributes about 13% of the final training mix in Laguna XS.2, with the Laguna series using roughly 4.4T+ synthetic tokens in total.
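As a hedged illustration of the diversity-preserving idea (the actual pipeline is not public), bucket-level retention might look like this, where lower-quality buckets are thinned rather than dropped outright; all rates here are made up.

```python
import random

# Placeholder keep-rates per quality bucket; lower buckets are down-sampled,
# not discarded, to preserve diversity and unique-token count.
RETENTION = {"high": 1.00, "mid": 0.50, "low": 0.15}

def keep_document(quality_bucket: str, rng: random.Random) -> bool:
    return rng.random() < RETENTION[quality_bucket]

rng = random.Random(0)
corpus = [("doc_a", "high"), ("doc_b", "mid"), ("doc_c", "low"), ("doc_d", "mid")]
kept = [doc for doc, bucket in corpus if keep_document(bucket, rng)]
print(kept)
```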
Muon Optimizer. Rather than AdamW, the most common optimizer in large-model training, Poolside used a distributed implementation of the Muon optimizer through all training stages of both models. In preliminary pre-training ablations, the research team achieved the same training loss as an AdamW baseline in roughly 15% fewer steps, with large absolute evaluation uplifts on the final model, and achieved learning-rate transfer across model scales. An additional benefit: Muon requires only one state per parameter rather than two, reducing memory requirements for both training and checkpointing. During pre-training of Laguna M.1, optimizer overhead was less than 1% of the training step time.
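For reference, here is a compact single-GPU sketch of the Muon update, following Keller Jordan's public reference implementation rather than Poolside's distributed version. Note the single momentum buffer where AdamW would keep two moment tensors.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (quintic iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # coefficients from the reference impl
    X = G / (G.norm() + eps)
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

@torch.no_grad()
def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    momentum.mul_(beta).add_(grad)        # the single per-parameter state tensor
    param.add_(newton_schulz(momentum), alpha=-lr)
```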
Poolside also runs periodic hash checks on model weights across training replicas to catch silent data corruption (SDC) from faulty GPUs, specifically errors in arithmetic logic and pipeline registers, which, unlike DRAM and SRAM, are typically not covered by ECC protection.
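A hedged sketch of what cross-replica weight hashing can look like in PyTorch; the function names and the SHA-256 choice are illustrative, as the article does not specify Poolside's mechanism.

```python
import hashlib
import torch
import torch.distributed as dist

def weights_digest(model: torch.nn.Module) -> str:
    """Hash all parameters into one digest, in a deterministic order."""
    h = hashlib.sha256()
    for name, p in sorted(model.state_dict().items()):
        t = p.detach().cpu().flatten().contiguous()
        h.update(name.encode())
        h.update(t.view(torch.uint8).numpy().tobytes())  # raw bytes; works for bf16 too
    return h.hexdigest()

def check_replicas(model: torch.nn.Module) -> None:
    """Compare digests across data-parallel replicas; any mismatch flags SDC."""
    digests = [None] * dist.get_world_size()
    dist.all_gather_object(digests, weights_digest(model))
    if len(set(digests)) != 1:
        raise RuntimeError(f"possible silent data corruption: {set(digests)}")
```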
Async On-Policy Agent RL. This is arguably the most complex piece of the Laguna training stack. Poolside built a fully asynchronous online RL system in which actor processes pull tasks from a dataset, spin up sandboxed containers, and run the production agent binary against each task using the freshly deployed model. The resulting trajectories are scored, filtered, and written to Iceberg tables, while the trainer continuously consumes these records and produces the next checkpoint, with inference and training running asynchronously in parallel and throughput tuned to balance off-policy staleness.
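The following is a toy, self-contained asyncio sketch of that actor/trainer decoupling. Sandboxed containers, the agent binary, Iceberg writes, and the RL update are all replaced with stand-ins; only the asynchronous produce/consume structure mirrors the description above.

```python
import asyncio
import random

async def actor(actor_id, tasks, trajectories):
    """Pull tasks, run a rollout, score and filter, then store the trajectory."""
    while True:
        try:
            task = tasks.get_nowait()
        except asyncio.QueueEmpty:
            return
        await asyncio.sleep(random.random() * 0.01)   # stand-in for a sandboxed rollout
        reward = random.random()                      # stand-in for trajectory scoring
        if reward > 0.3:                              # filter before storage
            await trajectories.put((task, reward))    # stand-in for an Iceberg write

async def trainer(trajectories):
    """Continuously consume trajectories while actors keep producing."""
    step = 0
    while True:
        task, reward = await trajectories.get()
        step += 1                                     # stand-in for a training step
        print(f"step {step}: task={task}, reward={reward:.2f}")
        trajectories.task_done()

async def main():
    tasks, trajectories = asyncio.Queue(), asyncio.Queue()
    for t in range(8):
        tasks.put_nowait(t)
    actors = [asyncio.create_task(actor(i, tasks, trajectories)) for i in range(3)]
    train = asyncio.create_task(trainer(trajectories))
    await asyncio.gather(*actors)                     # actors drain the task queue
    await trajectories.join()                         # trainer finishes pending records
    train.cancel()

asyncio.run(main())
```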
Key Takeaways
- Poolside releases its first open-weight model: Laguna XS.2 is a 33B total parameter MoE model with only 3B activated parameters per token, available under an Apache 2.0 license and compact enough to run locally on a Mac with 36 GB of RAM via Ollama.
- Strong benchmark performance at small scale: Laguna XS.2 scores 68.2% on SWE-bench Verified and 44.5% on SWE-bench Pro, while the larger Laguna M.1 (225B total, 23B activated) reaches 72.5% on SWE-bench Verified and 46.9% on SWE-bench Pro, both trained from scratch on 30T tokens.
- Muon optimizer matches AdamW's training loss in roughly 15% fewer steps: Poolside replaced AdamW with a distributed implementation of the Muon optimizer, gaining training efficiency plus lower memory requirements, since Muon keeps only one state per parameter instead of two.
- AutoMixer replaces manual data mixing with learned optimization: Instead of handcrafted data recipes, Poolside trains a swarm of ~60 proxy models on different data mixes and fits surrogate regressors to optimize dataset proportions, with synthetic data making up ~13% of Laguna XS.2's final training mix out of 4.4T+ total synthetic tokens.
- Fully asynchronous agent RL with GPUDirect RDMA weight transfer: Poolside's RL system runs inference and training in parallel, transferring hundreds of gigabytes of BF16 weights between nodes in under 5 seconds via GPUDirect RDMA, using a token-in, token-out actor design and the CISPO algorithm for off-policy training stability (a sketch of the CISPO objective follows below).
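Since CISPO is named but not detailed here, the following is a hedged PyTorch sketch of a CISPO-style objective as described in the MiniMax-M1 paper: clip and detach the importance-sampling weight itself, then apply it to a REINFORCE-style term so off-policy tokens still contribute gradients. The hyperparameters are placeholders, not Poolside's settings.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """CISPO-style loss: clip the detached IS weight, not the policy update."""
    ratio = torch.exp(logp_new - logp_old)                     # importance weight r_t
    clipped = ratio.clamp(1 - eps_low, 1 + eps_high).detach()  # clipped, stop-gradient
    return -(clipped * advantages * logp_new).mean()           # grads flow via logp_new
```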
Check out the Model Weights and Technical details.
