Overview
High-difficulty, high-quality mathematical data is scarce, constraining the upper bound of LLM reasoning. Existing synthesis pipelines mostly rewrite or mutate human-written problems, limiting diversity and true difficulty control. We introduce MathSmith, a from-scratch problem synthesis framework that samples concept–explanation pairs from PlanetMath and forges new problems guided by predefined difficulty strategies and reinforcement learning. Our contributions are: (1) a contamination-resistant pipeline that starts from randomly sampled concept–explanation pairs and constructs problems via stepwise rationales; (2) nine difficulty strategies that serve as soft constraints to induce deep, multi-step reasoning; (3) a reinforcement learning stage that jointly optimizes structural validity, reasoning complexity (via deep thinking CoT length), and answer consistency using GRPO; (4) a weakness-focused variant generator that traces every problem to its source concepts for targeted improvement; and (5) extensive experiments demonstrating strong gains on hard benchmarks, and favorable scaling with both data volume and model size.
The MathSmith workflow, comprising the phases of concept and explanation collection, supervised fine-tuning, and reinforcement learning.
Dataset Preview
Preview of the MathSmith-HC-Problems dataset. These problems are synthesized by MathSmith-HC-Problem-Synthesizer.
Preview of MathSmith-HC-Problems with long-CoT solutions. These problems are synthesized by MathSmith-HC-Problem-Synthesizer, with answers sampled by Qwen3-30B-A3B (thinking mode).
Pipeline
The MathSmith framework consists of four main stages:
- Concept Collection. Randomly sample concept–explanation pairs from PlanetMath to ensure data independence.
- Supervised Fine-tuning (SFT). Train the model on collected concept–explanation pairs to establish foundational understanding.
- Reinforcement Learning (RL). Optimize the model using GRPO with rewards based on structural validity, reasoning complexity, and answer consistency.
- Weakness-Focused Self-Improvement. Iteratively identify and address model weaknesses by generating targeted problem variants.
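The RL stage above combines three reward signals. A minimal sketch of how such signals could be merged into one scalar for GRPO-style training is shown below; the helper checks, weights, and `target_len` cap are illustrative assumptions, not the authors' implementation.

```python
def check_structure(problem: str) -> bool:
    # Toy structural-validity check (assumption): non-empty and phrased as a question.
    return bool(problem.strip()) and problem.strip().endswith("?")

def cot_length(cot: str) -> int:
    # Proxy for reasoning complexity: whitespace token count of the deep-thinking CoT.
    return len(cot.split())

def answers_match(answer: str, reference: str) -> bool:
    # Toy answer-consistency check: exact match after light normalization.
    return answer.strip().lower() == reference.strip().lower()

def composite_reward(problem: str, cot: str, answer: str, reference: str,
                     w_valid: float = 1.0, w_complex: float = 0.5,
                     w_consist: float = 1.0, target_len: int = 2048) -> float:
    """Score one synthesized problem along the three reward axes (weights are hypothetical)."""
    valid = 1.0 if check_structure(problem) else 0.0
    # Reward longer CoT traces, capped so length alone cannot dominate the reward.
    complexity = min(cot_length(cot) / target_len, 1.0)
    consistent = 1.0 if answers_match(answer, reference) else 0.0
    return w_valid * valid + w_complex * complexity + w_consist * consistent
```

In a GRPO setup, rewards like this would be computed per sampled completion within a group and then normalized against the group mean to form advantages.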
The MathSmith weakness-focused improvement pipeline. Problems are traced to concept explanations, which serve as a basis for generating targeted variants to strengthen model weaknesses.
Experiments
- Benchmarks. Five datasets across two difficulty levels: Easy & Medium (GSM8K, MATH-500) and Hard (AIME2024, AIME2025, OlympiadBench).
- Baselines. MetaMath, NuminaMath-COT, OpenMathInstruct-2, and PromptCOT. From each baseline, 50K problems are uniformly sampled and their solutions regenerated with the same teacher model to ensure a fair comparison.
- Evaluation settings. Both short-CoT and long-CoT are evaluated: short-CoT uses Qwen2.5-7B-Instruct and Qwen3-8B; long-CoT uses Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B. All methods are fully fine-tuned under identical configurations using LLaMA-Factory.
- Metrics. Single-turn answer accuracy is adopted for all benchmarks.
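Single-turn answer accuracy can be sketched as below: one sampled answer per problem, exact match against the reference after light normalization. The normalization rules are an assumption for illustration.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    def norm(s: str) -> str:
        # Hypothetical normalization: trim whitespace, lowercase, drop a trailing period.
        return s.strip().lower().rstrip(".")
    assert len(predictions) == len(references)
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```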
Results
- Overall performance. On Hard benchmarks, MathSmith-HC shows clear advantages under the long-CoT setting: Qwen3-8B achieves an average of 71.8% (+9.8% relative gain), while DeepSeek-R1-Distill-Qwen-7B reaches 51.0% (+15.6% relative gain).
- Data scaling. Using OlympiadBench as the evaluation set, expanding the training data from 50K to 200K further widens the performance gap between MathSmith-HC and strong baselines.
- Model scaling. Gains are limited on small models (1.7B/4B) but grow significantly with larger model sizes, indicating that MathSmith data enhances deep reasoning ability in larger models.
- Problem difficulty. In “thinking” mode, the reasoning chains triggered by MathSmith-HC/Hard are notably longer, reflecting higher reasoning complexity.
- Weakness-focused improvement. Generating variants targeting weak concepts outperforms random sampling, improving both the internal Practice Set and external benchmarks.
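Because every problem traces back to its source concepts, weak concepts can be ranked directly from Practice Set results. The following sketch (names and data layout are assumptions, not the paper's code) selects the lowest-accuracy concepts to seed the next batch of targeted variants.

```python
from collections import defaultdict

def weakest_concepts(results: list[tuple[str, bool]], k: int = 3) -> list[str]:
    """results: (concept, is_correct) pairs from the internal Practice Set.
    Returns the k concepts with the lowest per-concept accuracy."""
    stats = defaultdict(lambda: [0, 0])  # concept -> [correct, total]
    for concept, correct in results:
        stats[concept][0] += int(correct)
        stats[concept][1] += 1
    # Lowest accuracy first; these concepts seed the next variant batch.
    return sorted(stats, key=lambda c: stats[c][0] / stats[c][1])[:k]
```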
Baseline performance under equal data and training conditions. MathSmith achieves consistently better generalization on challenging problems.
Performance on the Olympiad benchmark under varying training data volumes (50K–200K). MathSmith-HC scales more effectively than baseline methods.
Accuracy on the Olympiad benchmark across the Qwen3 model size series. MathSmith-HC yields greater gains with larger models.
Average reasoning trace token length under thinking mode (Qwen3-30B-A3B) across various open-source math datasets. Problems synthesized by MathSmith variants elicit significantly longer reasoning, reflecting higher complexity.
Effect of weakness-focused problem generation vs. random sampling
Citation
@article{zhan2025mathsmith,
  title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
  author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
  journal={arXiv preprint arXiv:2508.05592},
  year={2025}
}