ICML 2026 Oral Presentation

Characterizing, Evaluating, and Optimizing Complex Reasoning

Large Reasoning Models should do more than reach the right answer. The Thinking Reward Model (TRM) makes reasoning trace quality measurable and usable for test-time selection and reinforcement learning.

Haoran Zhang¹,²
Yafu Li²,⁴,†
Zhi Wang⁵,²
Zhilin Wang²,³
Shunkai Zhang²,⁶
Xiaoye Qu²
Yu Cheng⁴,²,†

¹School of Artificial Intelligence, Shanghai Jiao Tong University
²Shanghai Artificial Intelligence Laboratory
³University of Science and Technology of China
⁴The Chinese University of Hong Kong
⁵Nanjing University
⁶Peking University

†Corresponding authors: Yafu Li and Yu Cheng


Overview

Reasoning Quality, Made Explicit

Large Reasoning Models now produce long traces with rich intermediate structure. A correct final answer can still hide redundant detours, fragile local steps, or reasoning paths that are hard to compare and optimize.

TRM organizes the problem around three questions: how to characterize reasoning quality, how to evaluate free-form traces, and how to optimize models with that signal. The resulting Thinking Reward Model scores reasoning trace quality; it complements final-answer verification rather than replacing it.

Characterizing

Define trace quality through macro/micro structure and efficiency/effectiveness objectives.

Evaluating

Turn free-form traces into DAGs so progression, branching, and merging are visible.

Optimizing

Train TRM on pairwise preferences and use it for selection and reinforcement learning.

Approach

Three Pieces of the Pipeline

TRM keeps the quality signal focused on the reasoning trace itself. It first names the quality dimensions, then obtains structured pairwise preferences, and finally trains a reward model for downstream use.

ME² Principle

ME² characterizes reasoning quality along two axes: macro versus micro granularity, and efficiency versus effectiveness.

DAG-based Pairwise Evaluation

Free-form traces are converted into DAGs, making progression, branching, and merging explicit before pairwise comparison.

Thinking Reward Model

TRM learns from preference pairs to score reasoning trace quality, providing a reward that is complementary to answer verifiers.
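
As a rough illustration of learning from preference pairs, the sketch below uses a Bradley-Terry style objective, a common choice for pairwise reward models. The page does not state TRM's exact training loss, so the loss form and the function names `bradley_terry_loss` and `pairwise_accuracy` are assumptions made for illustration.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred trace outranks the other one.

    Assumption: a Bradley-Terry objective, -log(sigmoid(margin)).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(score_pairs) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked in the right order."""
    score_pairs = list(score_pairs)
    return sum(chosen > rejected for chosen, rejected in score_pairs) / len(score_pairs)
```

The loss shrinks as the preferred trace's score pulls ahead of the other trace's score, and `pairwise_accuracy` is the kind of metric reported for preference validation.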

Characterization

What Counts as Better Thinking?

ME² gives a compact vocabulary for trace quality. It separates whether reasoning is globally organized from whether each local step is useful and valid.

The ME² principle divides reasoning quality by macro versus micro granularity and efficiency versus effectiveness objectives.

Macro-Efficiency

Does the trace avoid unnecessary branches, repeated detours, and overlong global plans?

Macro-Effectiveness

Does the overall structure stay aligned with the problem and move toward a solution?

Micro-Efficiency

Are individual steps concise, non-redundant, and placed where they actually help?

Micro-Effectiveness

Are local calculations, claims, and transitions valid enough to support the trace?

Structured Evaluation

How the DAG Is Built

A trace is first split into atomic reasoning steps. Edges are then inferred from semantic dependencies, using the generation order as a topological order so later steps can depend on earlier ones, but not the other way around.
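
The ordering constraint can be sketched as follows: each step may only pick parents among earlier steps, so the step indices are a valid topological order by construction. The relevance function, the threshold, the parent cap, and the fallback to the previous step are all illustrative assumptions; `token_overlap` is a toy stand-in for whatever semantic model performs dependency inference.

```python
from typing import Callable


def build_reasoning_dag(
    steps: list[str],
    relevance: Callable[[str, str], float],
    threshold: float = 0.3,
    max_parents: int = 2,
) -> dict[int, list[int]]:
    """Map each step index to its parent indices; parents always precede the child."""
    parents: dict[int, list[int]] = {}
    for child in range(len(steps)):
        # Candidates are restricted to earlier steps, which rules out cycles.
        scored = [(relevance(steps[p], steps[child]), p) for p in range(child)]
        keep = sorted((s, p) for s, p in scored if s >= threshold)[-max_parents:]
        chosen = sorted(p for _, p in keep)
        # Assumption: with no relevant parent, link to the previous step.
        if not chosen and child > 0:
            chosen = [child - 1]
        parents[child] = chosen
    return parents


def token_overlap(a: str, b: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase tokens (hypothetical)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A step with two parents is a merge, and a step with two children is a branch, so both structures fall out of the parent map directly.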

A reasoning DAG abstraction showing progression, branching, and merging structures.

Step partitioning. Break the reasoning trace into smaller units that can serve as graph nodes.
Dependency inference. For each new node, select semantically relevant parents from earlier candidate nodes.
Structure compression. Merge simple linear chains into super-nodes so the graph keeps the important structure visible.
Pairwise judgment. Compare traces under the ME² dimensions, then keep stable non-tie preferences.
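
The structure compression step above can be sketched as grouping nodes into maximal linear chains. This is a minimal illustration, assuming the graph is given as a parent map whose indices follow generation order; the merge rule used here (a node joins its parent's group when it is the sole child of its sole parent) is one plausible reading of "simple linear chains", not necessarily the authors' exact criterion.

```python
def compress_linear_chains(parents: dict[int, list[int]]) -> list[list[int]]:
    """Group node indices into super-nodes, one group per maximal linear chain."""
    children: dict[int, list[int]] = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for p in node_parents:
            children[p].append(node)

    groups: list[list[int]] = []
    group_of: dict[int, int] = {}
    # Indices follow generation order, so every parent is grouped before its child.
    for node in sorted(parents):
        node_parents = parents[node]
        if len(node_parents) == 1 and len(children[node_parents[0]]) == 1:
            # Sole child of a sole parent: extend the parent's chain.
            gid = group_of[node_parents[0]]
            groups[gid].append(node)
        else:
            # Roots, branch arms, and merge points start a new super-node.
            gid = len(groups)
            groups.append([node])
        group_of[node] = gid
    return groups
```

Branch and merge points stay as group boundaries, so the compressed graph keeps exactly the structure that pairwise comparison needs to see.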

Data Construction

From Correct Traces to TRM-Preference

TRM-Preference is built from verified-correct reasoning traces so that supervision focuses on trace quality rather than final-answer correctness. For each problem, multiple reasoning models generate candidate traces; only traces whose final answers are accepted by rule-based verifiers enter the preference pipeline.

The remaining traces are compared with DAG-based pairwise evaluation. The final dataset contains 103K training preference pairs and 1.5K validation pairs. Because every compared trace already reaches a correct answer, the preferences decouple reasoning quality from final-answer correctness.

01 Generate candidate reasoning traces from open-source reasoning models.

02 Filter by final-answer correctness with rule-based verifiers.

03 Convert traces to DAGs and compare them under ME².

04 Train a Thinking Reward Model to score reasoning trace quality.
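
Steps 02 and 03 can be sketched as a filter followed by pairwise comparison. The `is_correct` and `judge` callables are hypothetical placeholders for the rule-based verifier and the DAG-based ME² judge. Reading "stable" as "the verdict survives swapping the presentation order" is an assumption; the page does not define the stability check.

```python
from itertools import combinations


def stable_preference(judge, a, b):
    """Return (preferred, other) if the judge agrees in both orders, else None.

    The judge returns "first", "second", or "tie" for the pair it is shown.
    Assumption: stability means consistency under order swapping.
    """
    forward = judge(a, b)
    backward = judge(b, a)
    if forward == "first" and backward == "second":
        return (a, b)
    if forward == "second" and backward == "first":
        return (b, a)
    return None  # tie or order-dependent verdict: discard the pair


def build_preference_pairs(traces, is_correct, judge):
    """Keep verified-correct traces, then collect stable non-tie preferences."""
    correct = [t for t in traces if is_correct(t)]
    pairs = []
    for a, b in combinations(correct, 2):
        result = stable_preference(judge, a, b)
        if result is not None:
            pairs.append(result)
    return pairs
```

Filtering before comparison is what keeps the supervision about trace quality: an incorrect trace never appears on either side of a pair.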

Evidence

Compact Results

TRM reaches 88.6% validation accuracy on pairwise trace preference, above ReasonFlux-PRM-7B (62.5%) and Qwen2.5-Math-PRM-7B (46.3%). Downstream experiments use this trace-quality score for Best-of-N selection and as an auxiliary RL reward.
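
Best-of-N selection with a trace-quality score reduces to an argmax over candidates. In the sketch below, `trm_score` is a hypothetical callable standing in for a call to the released reward model; its real interface is not described on this page.

```python
def best_of_n(candidates, trm_score):
    """Return the candidate trace with the highest trace-quality score.

    Ties resolve to the earliest candidate, following Python's max().
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate trace")
    return max(candidates, key=trm_score)
```

For example, `best_of_n(sampled_traces, trm_score)` would pick one trace out of N samples for the same problem, with no ground-truth answer needed at selection time.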

Validation. 88.6% preference accuracy on TRM-Preference validation.

Test-time selection. TRM-guided Best-of-N selection improves downstream accuracy by up to 19.3%.

RL. TRM usually adds 2% to 4% over verifier reward, including 3.9% STEM and 3.8% Math gains on Llama-3.1-8B-Instruct.
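
One simple way to use a trace-quality score as an auxiliary RL reward is to add it, scaled, to the verifier reward. The additive form, the binary verifier reward, and the default `weight` are assumptions for illustration; the page does not specify how the two signals are combined.

```python
def combined_reward(answer_correct: bool, trm_score: float, weight: float = 0.5) -> float:
    """Verifier reward plus a weighted trace-quality bonus (hypothetical combination)."""
    # Assumption: the verifier contributes 1.0 for an accepted answer, else 0.0.
    verifier_reward = 1.0 if answer_correct else 0.0
    return verifier_reward + weight * trm_score
```

Under this form, two rollouts with the same verified answer receive different rewards when their traces differ in quality, which is the extra signal a verifier alone cannot provide.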

Figure: **Preference validation.** TRM reaches 88.6% accuracy, compared with 62.5% for ReasonFlux-PRM-7B and 46.3% for Qwen2.5-Math-PRM-7B.

Figure: **Test-time selection.** TRM-guided Best-of-N selection improves downstream accuracy, with gains up to 19.3%.

Figure: **Reinforcement learning.** TRM commonly adds 2% to 4% over verifier reward, including 3.9% STEM and 3.8% Math gains on Llama-3.1-8B-Instruct.

Figure: **Reasoning trace quality.** TRM supplies a trace-quality reward that is distinct from and complementary to final-answer verification.

Resources

Code, Model, and Data

The project releases TRM-Preference, TRM-8B weights, and code for scoring, reward-model training, and TRM-guided policy optimization.

GitHub Repository. Code for training, scoring, and TRM-guided reinforcement learning.
TRM-8B Model. Released Thinking Reward Model weights for reasoning trace quality scoring.
TRM-Preference Dataset. 103K training preference pairs and 1.5K validation pairs.
arXiv:2602.08498. Paper page for Characterizing, Evaluating, and Optimizing Complex Reasoning.

Reference

Citation

@article{zhang2026characterizing,
  title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
  author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2602.08498},
  year={2026}
}