MiniMax has officially open-sourced MiniMax M2.7, making the model weights publicly available on Hugging Face. First announced on March 18, 2026, MiniMax M2.7 is MiniMax's most capable open-source model to date, and its first model to actively participate in its own development cycle, a significant shift in how large language models are built and iterated.
What Is MiniMax M2.7?
MiniMax M2.7 is part of MiniMax's M2 series of Mixture-of-Experts (MoE) models. MoE is an architectural design in which only a subset of the total parameters is 'activated' during any inference pass, which makes the model significantly faster and cheaper to serve than a dense model of comparable output quality.
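The sparse-activation idea behind MoE can be sketched in a few lines. The toy layer below routes each token to only its top-k experts, so only k of the expert matrices are ever multiplied per token; the shapes, function name, and routing scheme are illustrative assumptions, not MiniMax's actual implementation.

```python
import numpy as np

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Toy Mixture-of-Experts layer: route a token to its top-k experts.

    x: (d,) token embedding; expert_weights: (n_experts, d, d) per-expert
    matrices; router_weights: (n_experts, d) router projection.
    """
    logits = router_weights @ x                    # router score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    # Only top_k of the n_experts matrices are multiplied: sparse activation.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
y = moe_forward(rng.normal(size=d),
                rng.normal(size=(n_experts, d, d)),
                rng.normal(size=(n_experts, d)),
                top_k=2)
print(y.shape)  # only 2 of the 16 experts contributed to this output
```

The serving win is that compute per token scales with `top_k`, not `n_experts`, while total parameter count (and hence capacity) scales with `n_experts`.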
MiniMax M2.7 is built around three core capability areas: expert software engineering, expert office work, and what MiniMax calls Agent Teams, its native multi-agent collaboration. MiniMax M2.7 can build complex agent harnesses and complete highly elaborate productivity tasks, leveraging capabilities such as Agent Teams, complex Skills, and dynamic tool search.
SOTA Benchmark Performance: SWE-Pro and Terminal Bench 2
On SWE-Pro, which covers multiple programming languages, MiniMax M2.7 achieved a 56.22% accuracy rate, matching GPT-5.3-Codex. SWE-Pro tasks span log analysis, bug troubleshooting, code security review, and machine learning workflow debugging, much closer to the messy reality of production systems than standard algorithmic coding tests.
On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), both of which demand a high degree of system-level comprehension, MiniMax M2.7 performs solidly. The model excels not only at code generation but can also deeply understand the operational logic and collaborative dynamics of software systems.
On the repo-level code generation benchmark VIBE-Pro, MiniMax M2.7 scored 55.6%, nearly on par with Opus 4.6, meaning that requirements involving Web, Android, iOS, or simulation tasks can be handed directly to MiniMax M2.7 to complete. It also demonstrates a strong advantage on benchmarks closer to real-world engineering scenarios: SWE Multilingual (76.5) and Multi SWE Bench (52.7).
Production Debugging: Under Three Minutes
When faced with production alerts, MiniMax M2.7 can correlate monitoring metrics with deployment timelines to perform causal reasoning, run statistical analysis on trace samples and propose precise hypotheses, proactively connect to databases to verify root causes, pinpoint missing index migration files in the code repository, and use non-blocking index creation to stop the bleeding before submitting a merge request. The MiniMax team reports that on several occasions this reduced recovery time for live production incidents to under three minutes. From observability analysis and database expertise to SRE-level decision-making, this positions MiniMax M2.7 as something beyond a code-generation model.
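The first step in that chain, correlating alert onset with recent deployments, is a simple temporal join. Here is a minimal sketch under assumed inputs; the function name, time window, and sample services are all illustrative, not part of MiniMax's tooling.

```python
from datetime import datetime, timedelta

def correlate_alerts_with_deploys(alerts, deploys, window_minutes=30):
    """Flag deployments that landed shortly before each alert fired.

    alerts and deploys are lists of (timestamp, label) tuples; any deploy
    within `window_minutes` before an alert is a candidate root cause.
    """
    window = timedelta(minutes=window_minutes)
    suspects = []
    for alert_time, alert_name in alerts:
        for deploy_time, service in deploys:
            if timedelta(0) <= alert_time - deploy_time <= window:
                suspects.append((alert_name, service, alert_time - deploy_time))
    # Rank by recency: the deploy closest to the alert is the prime suspect.
    return sorted(suspects, key=lambda s: s[2])

alerts = [(datetime(2026, 3, 18, 10, 15), "p99_latency_spike")]
deploys = [
    (datetime(2026, 3, 18, 10, 5), "orders-service"),
    (datetime(2026, 3, 18, 9, 0), "auth-service"),
]
print(correlate_alerts_with_deploys(alerts, deploys))
```

An agent's advantage over this static heuristic is the follow-through: verifying the suspect against database state and the migration history rather than stopping at the correlation.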
The Self-Evolution Architecture
To test the boundaries of autonomous improvement, MiniMax M2.7 was tasked with optimizing a model's programming performance on an internal scaffold. It ran entirely autonomously, executing an iterative loop of 'analyze failure trajectories → plan changes → modify scaffold code → run evaluations → compare results → decide to keep or revert changes' for over 100 rounds. During this process, MiniMax M2.7 discovered effective optimizations on its own: systematically searching for the optimal combination of sampling parameters such as temperature, frequency penalty, and presence penalty; designing more specific workflow guidelines (such as automatically searching for the same bug pattern in other files after a fix); and adding loop detection to the scaffold's agent loop. This achieved a 30% performance improvement on internal evaluation sets.
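The keep-or-revert loop described above can be sketched as a simple hill climber. In this sketch, `evaluate` and `propose_change` are deterministic stand-ins for the real steps (running the eval suite; the model planning and applying a scaffold edit), so everything here is an illustrative assumption:

```python
import copy
import random

def evaluate(scaffold):
    """Stand-in for running the eval suite; peaks at temperature 0.7."""
    return -(scaffold["temperature"] - 0.7) ** 2

def propose_change(scaffold):
    """Stand-in for the model planning and applying a scaffold edit."""
    candidate = copy.deepcopy(scaffold)
    candidate["temperature"] = round(random.uniform(0.0, 1.0), 2)
    return candidate

def self_improve(scaffold, rounds=100):
    best_score = evaluate(scaffold)
    for _ in range(rounds):
        candidate = propose_change(scaffold)          # plan + modify scaffold code
        score = evaluate(candidate)                   # run evaluations
        if score > best_score:                        # compare results
            scaffold, best_score = candidate, score   # keep the change
        # otherwise revert: the previous scaffold stays untouched
    return scaffold, best_score

random.seed(0)
final, best = self_improve({"temperature": 0.2}, rounds=100)
```

The revert branch is what makes 100+ unattended rounds safe: a regression never survives past the round that introduced it.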
Within MiniMax's own reinforcement learning team workflows, M2.7 now handles 30%–50% of the workflow end-to-end, with human researchers stepping in only for critical decisions and discussions.
MLE Bench Lite: Testing Autonomous ML Experimentation
The MiniMax team also tested MiniMax M2.7 on MLE Bench Lite, OpenAI's open-sourced suite of 22 machine learning competitions runnable on a single A30 GPU, covering almost all stages of the ML workflow.
For this evaluation, the MiniMax team designed a simple three-component harness: short-term memory, self-feedback, and self-optimization. After each iteration round, the agent generates a short-term memory markdown file, performs self-criticism on the current results, and provides optimization directions for the next round. Three trials were run, each with a 24-hour window for iterative evolution.
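The three components compose into one loop per round: run, criticize, persist. A minimal sketch follows, with `run_experiment` and `critique` as deterministic placeholders for the real training run and the model's self-criticism; the file name and scoring are assumptions for illustration.

```python
from pathlib import Path

def run_experiment(directions):
    """Stand-in for one round of training/submission; returns a metric."""
    return 0.5 + 0.01 * len(directions)  # illustrative only

def critique(score, history):
    """Stand-in for the agent's self-criticism step."""
    trend = "improving" if not history or score > history[-1] else "stalled"
    return f"Score {score:.3f} ({trend}); try a stronger feature pipeline."

def harness_loop(workdir, rounds=3):
    memory_file = Path(workdir) / "short_term_memory.md"
    history, directions = [], []
    for _ in range(rounds):
        score = run_experiment(directions)       # run this round
        feedback = critique(score, history)      # self-feedback
        directions.append(feedback)              # self-optimization input for next round
        history.append(score)
        # Persist short-term memory between rounds as a markdown log.
        memory_file.write_text(
            "# Short-term memory\n"
            + "\n".join(f"- Round {j}: {d}" for j, d in enumerate(directions))
        )
    return history
```

Writing memory to disk rather than keeping it in context is what lets a 24-hour run survive context-window limits and restarts.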
The best run achieved 9 gold medals, 5 silver medals, and 1 bronze medal. The average medal rate across the three runs was 66.6%, a result second only to Opus-4.6 (75.7%) and GPT-5.4 (71.2%), and tied with Gemini-3.1 (66.6%).
Professional Office Work and Finance
Beyond software engineering, MiniMax M2.7 targets professional office tasks. In the GDPval-AA evaluation, which measures domain expertise and task-delivery capability across 45 models, MiniMax M2.7 achieved an ELO score of 1495, the highest among open-source models, second only to Opus 4.6, Sonnet 4.6, and GPT-5.4, and surpassing GPT-5.3.
On Toolathon, MiniMax M2.7 achieved an accuracy of 46.3%, reaching the global top tier. In MM Claw testing, an evaluation MiniMax built from real-world usage patterns on the OpenClaw personal agent platform, MiniMax M2.7 maintained a 97% skill compliance rate across 40 complex skills (each exceeding 2,000 tokens) and achieved an overall accuracy of 62.7%, approaching Sonnet 4.6.
In finance, MiniMax M2.7 can autonomously read a company's annual reports and earnings call transcripts, cross-reference multiple research reports, independently design assumptions and build a revenue forecast model, and produce PPT and Word research reports from templates: understanding, making judgments, and producing output like a junior analyst.
Key Takeaways
- MiniMax M2.7 is now officially open source, with weights available on Hugging Face, making a frontier-grade agentic model freely available for developers to deploy and build on.
- MiniMax M2.7 achieves SOTA performance on real-world software engineering benchmarks, scoring 56.22% on SWE-Pro (matching GPT-5.3-Codex) and 57.0% on Terminal Bench 2, tests that measure production-level reasoning, not just code generation.
- MiniMax M2.7 is the first model to actively participate in its own development, running over 100 autonomous rounds of scaffold optimization and achieving a 30% performance improvement, an early, concrete example of AI-assisted AI development in practice.
- The model is built for real agentic deployments, maintaining 97% skill adherence across 40 complex skills (each exceeding 2,000 tokens), supporting native Agent Teams with stable role boundaries, and handling 30–50% of MiniMax's internal RL team workflows autonomously.
- MiniMax M2.7 is the highest-ranked open-source model on GDPval-AA, with an ELO score of 1495 across 45 models, demonstrating strong professional-work capabilities spanning office document editing, financial analysis, and multi-round high-fidelity task delivery.
Check out the Technical details and Model Weights.
