ICML 2026 Oral Presentation

Characterizing, Evaluating, and Optimizing Complex Reasoning

Large Reasoning Models should do more than reach the right answer. The Thinking Reward Model (TRM) makes reasoning trace quality measurable and usable for test-time selection and reinforcement learning.

  1. School of Artificial Intelligence, Shanghai Jiao Tong University
  2. Shanghai Artificial Intelligence Laboratory
  3. University of Science and Technology of China
  4. The Chinese University of Hong Kong
  5. Nanjing University
  6. Peking University

Corresponding authors: Yafu Li and Yu Cheng

TRM overview. Characterize reasoning quality with ME2, evaluate free-form traces through DAG-based pairwise comparison, and train a Thinking Reward Model for selection and policy optimization.

Reasoning Quality, Made Explicit

Large Reasoning Models now produce long traces with rich intermediate structure. A correct final answer can still hide redundant detours, fragile local steps, or reasoning paths that are hard to compare and optimize.

TRM organizes the problem around three questions: how to characterize reasoning quality, how to evaluate free-form traces, and how to optimize models with that signal. The resulting Thinking Reward Model scores reasoning trace quality; it complements final-answer verification rather than replacing it.

Characterizing

Define trace quality through macro/micro structure and efficiency/effectiveness objectives.

Evaluating

Turn free-form traces into DAGs so progression, branching, and merging are visible.

Optimizing

Train TRM on pairwise preferences and use it for selection and reinforcement learning.

Three Pieces of the Pipeline

TRM keeps the quality signal focused on the reasoning trace itself. It first names the quality dimensions, then obtains structured pairwise preferences, and finally trains a reward model for downstream use.

01

ME2 Principle

ME2 characterizes reasoning quality along two axes: macro versus micro granularity, and efficiency versus effectiveness.

02

DAG-based Pairwise Evaluation

Free-form traces are converted into DAGs, making progression, branching, and merging explicit before pairwise comparison.

03

Thinking Reward Model

TRM learns from preference pairs to score reasoning trace quality, providing a reward that is complementary to answer verifiers.
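
The page does not state the training objective. A common choice for learning a scalar reward from preference pairs is the Bradley-Terry loss, and the sketch below assumes it. The function name `pairwise_loss` is illustrative and is not taken from the released code.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair (assumed objective).

    Computes -log(sigmoid(score_chosen - score_rejected)), which equals
    log(1 + exp(-margin)), in a numerically stable form.
    """
    margin = score_chosen - score_rejected
    # Stable softplus(-margin): avoids overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the preferred trace is scored further above the rejected one, and equals log 2 when the two scores are tied.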

What Counts as Better Thinking?

ME2 gives a compact vocabulary for trace quality. It separates whether reasoning is globally organized from whether each local step is useful and valid.

The ME2 principle divides reasoning quality by macro versus micro granularity and efficiency versus effectiveness objectives.

Macro-Efficiency

Does the trace avoid unnecessary branches, repeated detours, and overlong global plans?

Macro-Effectiveness

Does the overall structure stay aligned with the problem and move toward a solution?

Micro-Efficiency

Are individual steps concise, non-redundant, and placed where they actually help?

Micro-Effectiveness

Are local calculations, claims, and transitions valid enough to support the trace?
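
One way to record a pairwise comparison under ME2 is a small record with one verdict per dimension. The following is a minimal sketch. The class name, the +1/0/-1 encoding, and the unweighted aggregation are assumptions for illustration, since the page does not specify how the four dimensions are combined.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ME2Judgment:
    """Per-dimension verdict for trace A versus trace B (hypothetical schema).

    Each field is +1 if A is better, -1 if B is better, and 0 for a tie.
    """
    macro_efficiency: int
    macro_effectiveness: int
    micro_efficiency: int
    micro_effectiveness: int


def overall_preference(judgment: ME2Judgment) -> int:
    """Assumed aggregation: the sign of the unweighted sum of the verdicts."""
    total = sum(getattr(judgment, f.name) for f in fields(judgment))
    return (total > 0) - (total < 0)
```

For example, `overall_preference(ME2Judgment(1, 1, 0, -1))` returns `1`, meaning trace A is preferred overall.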

How the DAG Is Built

A trace is first split into atomic reasoning steps. Edges are then inferred from semantic dependencies, using the generation order as a topological order so later steps can depend on earlier ones, but not the other way around.

A reasoning DAG abstraction showing progression, branching, and merging structures.
  1. Step partitioning. Break the reasoning trace into smaller units that can serve as graph nodes.
  2. Dependency inference. For each new node, select semantically relevant parents from earlier candidate nodes.
  3. Structure compression. Merge simple linear chains into super-nodes so the graph keeps the important structure visible.
  4. Pairwise judgment. Compare traces under the ME2 dimensions, then keep stable non-tie preferences.
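
The dependency-inference and compression steps can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: `depends_on` is a hypothetical stand-in for whatever model infers semantic dependencies, and the chain-merging rule (a node with exactly one parent, whose only child is that node) is one plausible reading of "simple linear chains".

```python
from typing import Callable, Dict, List


def build_dag(steps: List[str],
              depends_on: Callable[[str, str], bool]) -> Dict[int, List[int]]:
    """Map each step index to the indices of the earlier steps it depends on.

    Only earlier steps are candidate parents, so generation order is a
    topological order and the result is acyclic by construction.
    """
    return {i: [j for j in range(i) if depends_on(step, steps[j])]
            for i, step in enumerate(steps)}


def compress_chains(parents: Dict[int, List[int]]) -> List[List[int]]:
    """Group nodes into super-nodes by merging simple linear chains."""
    child_count = {i: 0 for i in parents}
    for parent_list in parents.values():
        for p in parent_list:
            child_count[p] += 1

    group: Dict[int, int] = {}
    for i in sorted(parents):  # parents always precede their children
        ps = parents[i]
        if len(ps) == 1 and child_count[ps[0]] == 1:
            group[i] = group[ps[0]]  # extend the parent's chain
        else:
            group[i] = i  # root, branch child, or merge point starts a new group

    supernodes: Dict[int, List[int]] = {}
    for node, head in group.items():
        supernodes.setdefault(head, []).append(node)
    return list(supernodes.values())
```

Branch points and merge points stay as separate super-nodes, so the compressed graph still shows where the trace forks and where it rejoins.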

From Correct Traces to TRM-Preference

TRM-Preference is built from verified-correct reasoning traces so that supervision focuses on trace quality rather than final-answer correctness. For each problem, multiple reasoning models generate candidate traces; only traces whose final answers are accepted by rule-based verifiers enter the preference pipeline.

The remaining traces are compared with DAG-based pairwise evaluation. Because every compared trace already has a correct final answer, the resulting preferences reflect reasoning quality rather than final-answer correctness. The final dataset contains 103K training preference pairs and 1.5K validation pairs.

01 Generate candidate reasoning traces from open-source reasoning models.
02 Filter by final-answer correctness with rule-based verifiers.
03 Convert traces to DAGs and compare them under ME2.
04 Train a Thinking Reward Model to score reasoning trace quality.
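
Steps 02 and 03 can be sketched as a filter followed by pairwise judging. In this sketch, `is_correct` and `judge` are hypothetical callables standing in for the rule-based verifier and the DAG-based ME2 comparison. Checking stability by swapping the presentation order is an assumption; the page says only that stable non-tie preferences are kept.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def build_preference_pairs(
    traces: List[str],
    is_correct: Callable[[str], bool],
    judge: Callable[[str, str], int],
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs among verified-correct traces.

    judge(a, b) returns +1 if a is preferred, -1 if b is preferred, 0 for a tie.
    """
    verified = [t for t in traces if is_correct(t)]
    pairs = []
    for a, b in combinations(verified, 2):
        forward = judge(a, b)
        backward = judge(b, a)
        # Assumed stability check: the verdict must be a non-tie and must
        # flip sign when the two traces are presented in the opposite order.
        if forward != 0 and forward == -backward:
            pairs.append((a, b) if forward > 0 else (b, a))
    return pairs
```

Pairs that tie, or whose verdict changes with presentation order, are dropped rather than labeled.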

Compact Results

TRM reaches 88.6% validation accuracy on pairwise trace preference, above ReasonFlux-PRM-7B (62.5%) and Qwen2.5-Math-PRM-7B (46.3%). Downstream experiments use this trace-quality score for Best-of-N selection and as an auxiliary RL reward.

Validation. 88.6% preference accuracy on TRM-Preference validation.
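
Pairwise preference accuracy is the fraction of held-out pairs in which the reward model scores the preferred trace higher. A minimal sketch, assuming `score` is a hypothetical callable that wraps the reward model and that ties count as errors:

```python
from typing import Callable, Iterable, Tuple


def preference_accuracy(pairs: Iterable[Tuple[str, str]],
                        score: Callable[[str], float]) -> float:
    """Fraction of (chosen, rejected) pairs where chosen gets the higher score.

    Equal scores are counted as incorrect (an assumption of this sketch).
    Raises ZeroDivisionError if no pairs are given.
    """
    pairs = list(pairs)
    correct = sum(score(chosen) > score(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)
```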

Test-time selection. Best-of-N selection brings up to 19.3% improvement.
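
Best-of-N selection with a trace-quality score reduces to an argmax over sampled candidates. The sketch below assumes a hypothetical `score` callable that returns the TRM score for a trace; how candidates are sampled is outside its scope.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate trace with the highest score.

    If several candidates share the top score, the first one is returned.
    """
    return max(candidates, key=score)
```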

RL. TRM usually adds 2% to 4% over verifier reward, including 3.9% STEM and 3.8% Math gains on Llama-3.1-8B-Instruct.
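
The page describes TRM as an auxiliary reward alongside the verifier but does not give the combination rule. One simple possibility, shown purely as an assumption, is a weighted sum; the function name, the 0/1 verifier reward, and the default `weight` are all hypothetical.

```python
def combined_reward(is_correct: bool, trm_score: float, weight: float = 0.5) -> float:
    """Verifier reward plus a weighted trace-quality bonus (assumed form)."""
    verifier_reward = 1.0 if is_correct else 0.0
    return verifier_reward + weight * trm_score
```

With this form, two correct rollouts receive different rewards when their trace-quality scores differ, which is the extra signal a verifier alone cannot provide.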

Preference validation. TRM reaches 88.6% accuracy, compared with 62.5% for ReasonFlux-PRM-7B and 46.3% for Qwen2.5-Math-PRM-7B.
Test-time selection. TRM-guided Best-of-N selection improves downstream accuracy, with gains up to 19.3%.
Reinforcement learning. TRM commonly adds 2% to 4% over verifier reward, including 3.9% STEM and 3.8% Math gains on Llama-3.1-8B-Instruct.
Reasoning trace quality. TRM supplies a trace-quality reward that is distinct from and complementary to final-answer verification.

Code, Model, and Data

The project releases TRM-Preference, TRM-8B weights, and code for scoring, reward-model training, and TRM-guided policy optimization.

Citation

@article{zhang2026characterizing,
  title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
  author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2602.08498},
  year={2026}
}