How can an AI system solve complicated olympiad-level math problems in clear natural language while also checking that its own reasoning is actually correct? DeepSeek AI has released DeepSeekMath-V2, an open-weights large language model optimized for natural language theorem proving with self verification. The model is built on DeepSeek-V3.2-Exp-Base, runs as a 685B parameter mixture of experts, and is available on Hugging Face under an Apache 2.0 license.
In evaluations, DeepSeekMath-V2 reaches gold-level scores on IMO 2025 and CMO 2024, and achieves 118 of 120 points on Putnam 2024 when used with scaled test-time compute.
Why Are Final Answer Rewards Not Enough?
Most recent math reasoning models use reinforcement learning that rewards only the final answer on benchmarks such as AIME and HMMT. This approach pushed models from weak baselines to near saturation on short-answer contests in about one year. (Hugging Face)
However, the DeepSeek research team points out two structural problems:
- A correct numeric answer does not guarantee correct reasoning. The model may reach the right number through algebraic errors that cancel out.
- Many tasks, such as olympiad proofs and theorem proving, require a complete argument in natural language. These tasks do not have a single final numeric answer, so standard answer-based rewards do not apply.
DeepSeekMath-V2 therefore optimizes proof quality instead of pure answer accuracy. The system evaluates whether a proof is complete and logically sound, and uses that evaluation as the main learning signal.
Training a Verifier before the Generator
The core design is verifier-first. The DeepSeek research team trains an LLM-based verifier that can read a problem and a candidate proof, then output both a natural language analysis and a discrete quality score from the set {0, 0.5, 1}.
The initial reinforcement learning data comes from Art of Problem Solving contests. The research team crawled 17,503 proof-style problems from olympiads, team selection tests, and post-2010 problems that explicitly require proofs. These problems form the base set for cold-start RL. Candidate proofs come from a DeepSeek-V3.2 reasoning model that is prompted to iteratively refine its own solutions, which increases detail but also creates many imperfect proofs. Human experts label these proofs using the 0, 0.5, 1 rubric, based on rigor and completeness.
The verifier is trained with Group Relative Policy Optimization (GRPO). The reward has two components:
- A format reward, which checks that the verifier output follows a fixed template, including an analysis section and a final score in a box.
- A score reward, which penalizes the absolute difference between the predicted score and the expert score.
This stage produces a verifier that can grade olympiad-style proofs in a consistent manner.
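The two reward components above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the template check and the way the format reward gates the score term are assumptions, and only the {0, 0.5, 1} rubric and the absolute-difference penalty come from the description above.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the verifier output follows the fixed template:
    an analysis section followed by a boxed final score."""
    # The exact template is not published; this check is an assumption.
    has_analysis = "analysis" in output.lower()
    has_boxed_score = re.search(r"\\boxed\{(0|0\.5|1)\}", output) is not None
    return 1.0 if (has_analysis and has_boxed_score) else 0.0

def score_reward(predicted: float, expert: float) -> float:
    """Penalize the absolute gap between the predicted and expert
    score; both live in {0, 0.5, 1}, so the gap is at most 1."""
    return 1.0 - abs(predicted - expert)

def verifier_reward(output: str, predicted: float, expert: float) -> float:
    # One plausible combination: the format reward gates the score term.
    return format_reward(output) * score_reward(predicted, expert)
```

A badly formatted output earns zero reward regardless of how close the predicted score is, which matches the role of a format reward as a hard constraint.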
Meta Verification to Control Hallucinated Critiques
A verifier can still game the reward. It can output the correct final score while inventing fake issues in the analysis. This would satisfy the numeric objective but make the explanations unreliable.
To address this, the research team introduces a meta verifier. The meta verifier reads the original problem, the proof, and the verifier's analysis, and then evaluates whether the analysis is faithful. It scores aspects such as restatement of steps, identification of real defects, and consistency between the narrative and the final score.
The meta verifier is also trained with GRPO, with its own format and score rewards. Its output, a meta quality score, is then used as an extra reward term for the base verifier. Analyses that hallucinate problems get low meta scores, even when the final proof score is correct. In experiments, this raises the average meta-evaluated quality of analyses from around 0.85 to 0.96 on a validation split, while keeping proof score accuracy stable.
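Schematically, the meta quality score enters the verifier's reward as an additional term. The sketch below only illustrates that structure; the `meta_weight` blend is an assumption, since the paper's exact weighting is not given in this article.

```python
def verifier_reward_with_meta(fmt: float, score_term: float,
                              meta_score: float, meta_weight: float = 0.5) -> float:
    """Base verifier reward augmented with the meta verifier's
    faithfulness score, so that an analysis which hallucinates issues
    is penalized even when its final proof score is correct."""
    # fmt: format reward in {0, 1}
    # score_term: 1 - |predicted score - expert score|
    # meta_score: meta-evaluated quality of the analysis in [0, 1]
    # meta_weight is an assumption, not a published hyperparameter.
    return fmt * (score_term + meta_weight * meta_score)
```

Under this shape, two verifier outputs with the same score accuracy are separated by how faithful their written analyses are.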
Self-Verifying Proof Generator and Sequential Refinement
Once the verifier is strong, the DeepSeek research team trains the proof generator. The generator takes a problem and outputs both a solution and a self-assessment that follows the same rubric as the verifier.
The reward for the generator combines three signals:
- The verifier score on the generated proof.
- The agreement between the self-reported score and the verifier score.
- The meta verification score of the self-assessment.
Formally, the main reward uses weights α = 0.76 for the proof score and β = 0.24 for the self-assessment component, multiplied by a format term that enforces the output structure. This pushes the generator to write proofs that the verifier accepts, and to be honest about remaining issues. If it claims that a flawed proof is perfect, it loses reward through disagreement and low meta scores.
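The weighted combination can be sketched as follows. The weights α = 0.76 and β = 0.24 and the multiplicative format term come from the description above; the exact form of the agreement term and the inner blend between agreement and meta faithfulness are assumptions for illustration.

```python
ALPHA, BETA = 0.76, 0.24  # published weights: proof score vs. self-assessment

def generator_reward(fmt: float, verifier_score: float,
                     self_score: float, meta_score: float) -> float:
    """Generator reward: verifier-judged proof quality plus an honesty
    term, gated by a format check on the output structure."""
    # Agreement term: reward self-reported scores close to the verifier's
    # score (exact functional form is an assumption).
    agreement = 1.0 - abs(self_score - verifier_score)
    # Self-assessment component blends agreement with the meta verifier's
    # faithfulness score (the 50/50 split is an assumption).
    self_eval = 0.5 * agreement + 0.5 * meta_score
    return fmt * (ALPHA * verifier_score + BETA * self_eval)
```

The key property survives any reasonable choice of inner blend: claiming that a flawed proof is perfect lowers the agreement and meta terms, so honesty about remaining issues is rewarded.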
DeepSeek also exploits the 128K token context limit of the base model. For hard problems, the generator often cannot repair all issues in a single pass, because the refined proof plus analysis would exceed the context. In that case, the system runs sequential refinement. It generates a proof and self-assessment, feeds them back as context, and asks the model to produce a new proof that fixes the previously detected issues. This loop can repeat several times, subject to the context budget.
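The refinement loop can be sketched as below. The `model` callable, its return signature, the stopping rule, and the character-count proxy for the token budget are all assumptions; only the feed-back-and-repair structure and the context limit come from the text.

```python
def sequential_refinement(model, problem: str,
                          max_iters: int = 4,
                          context_budget: int = 128_000):
    """Iteratively regenerate a proof, feeding the previous proof and
    its self-assessment back as context, until the self-assessment
    reports no remaining issues or the budget runs out."""
    # model(prompt) -> (proof, self_analysis, self_score) is hypothetical.
    prompt = problem
    best = None
    for _ in range(max_iters):
        proof, analysis, self_score = model(prompt)
        best = proof
        if self_score == 1.0:  # self-assessment finds no remaining issues
            break
        # Feed the proof and its critique back so the next pass can
        # repair the previously detected issues.
        prompt = (f"{problem}\n\nPrevious proof:\n{proof}\n\n"
                  f"Issues found:\n{analysis}\n\nFix these issues.")
        if len(prompt) > context_budget:  # crude stand-in for token counting
            break
    return best
```

Because each round appends the prior proof and analysis, the prompt grows with every iteration, which is why the 128K context window caps how many refinement passes are feasible on hard problems.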
Scaling Verification and Auto Labeling
As the generator improves, it produces harder proofs, which are costly to label by hand. To keep training data fresh, the research team introduces an automatic labeling pipeline based on scaled verification.
For each candidate proof, the system samples several independent verifier analyses, then evaluates each analysis using the meta verifier. If multiple high-quality analyses converge on the same serious issues, the proof is labeled as incorrect. If no valid issues survive meta checking, the proof is labeled as correct. In the final training iterations this pipeline replaces human labels, with spot checks confirming good agreement with experts.
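The voting logic of this pipeline can be sketched as a filter-then-vote procedure. The thresholds (`quality_threshold`, `votes_needed`) and the representation of analyses as issue lists are assumptions; the article only specifies the filter-by-meta-score and converge-on-issues structure.

```python
from collections import Counter

def auto_label(analyses, meta_scores,
               quality_threshold: float = 0.9,
               votes_needed: int = 2) -> str:
    """Label a candidate proof from several independent verifier
    analyses, keeping only those the meta verifier deems faithful."""
    # analyses: list of issue lists, one per sampled verifier analysis.
    # meta_scores: meta-evaluated faithfulness of each analysis in [0, 1].
    issue_votes = Counter()
    for issues, meta in zip(analyses, meta_scores):
        if meta < quality_threshold:
            continue  # discard analyses flagged as hallucinated critiques
        for issue in issues:
            issue_votes[issue] += 1
    # Several high-quality analyses agreeing on one serious issue
    # means the proof is incorrect; no surviving issue means correct.
    if any(v >= votes_needed for v in issue_votes.values()):
        return "incorrect"
    return "correct"
```

Filtering before voting is what makes the labels trustworthy: a single hallucinated critique cannot flip a proof to "incorrect", because it is discarded by the meta check or outvoted.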
Competition and Benchmark Results
The research team evaluated DeepSeekMath-V2 on several fronts:
On an internal set of 91 CNML-level problems covering algebra, geometry, number theory, combinatorics, and inequalities, DeepSeekMath-V2 achieves the highest mean proof score in every category when compared against Gemini 2.5 Pro and GPT-5 Thinking High, as measured by their verifier.
On IMO Shortlist 2024, sequential refinement with self verification improves both pass@1 and best-of-32 quality metrics as the maximum number of refinement iterations increases.
On IMO ProofBench, under expert evaluation, DeepSeekMath-V2 outperforms DeepMind's DeepThink IMO Gold on the Basic subset and stays competitive on the Advanced subset, while clearly beating other large models.
For full competitions, it reports:
- IMO 2025: 5 of 6 problems solved, gold medal level.
- CMO 2024: 4 problems fully solved plus partial credit on 1 more, gold medal level.
- Putnam 2024: 11 of 12 problems solved completely and the remaining problem with minor errors, for 118 of 120 points, above the best human score of 90.
Key Takeaways
- DeepSeekMath-V2 is a 685B parameter model built on DeepSeek-V3.2-Exp-Base, designed for natural language theorem proving with self verification, and released as open weights under the Apache 2.0 license.
- The main innovation is a verifier-first training pipeline with a GRPO-trained verifier and meta verifier that score proofs on rigor, not only final answers, which directly addresses the gap between correct answers and correct reasoning.
- A proof generator is then trained against this verifier and meta verifier, using rewards that combine proof quality, agreement with self-evaluation, and analysis faithfulness, plus sequential refinement under the 128K context to iteratively repair proofs.
- With scaled test-time compute and large verification budgets, DeepSeekMath-V2 reaches gold-level performance on IMO 2025 and CMO 2024 and scores 118 of 120 on Putnam 2024, surpassing the best human score that year.
Editorial Notes
DeepSeekMath-V2 is an important step toward self-verifiable mathematical reasoning, because it directly tackles the gap between correct final answers and correct reasoning, using a verifier, meta verifier, and proof generator trained with GRPO on olympiad-style proofs and deployed at 685B scale to reach gold-level performance on IMO 2025 and CMO 2024 and a near-perfect 118 of 120 score on Putnam 2024. Overall, this release shows that self-verifiable mathematical reasoning with open weights is now practically achievable for competition-level problems.
Check out the full paper, the model weights on Hugging Face, and the repo.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.