Olympiad Combinatorics Benchmark

ComBench

A benchmark for rigorous proof reasoning and constructive realization in Olympiad-level combinatorics.

ComBench evaluates large language models on 100 human-annotated competition-level combinatorics problems. It separates proof quality from explicit witness construction through rubric-guided judging and deterministic verifier-gated scoring.

Shunkai Zhang*, Haoran Zhang*, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng

Figure: ComBench annotation and verification pipeline.

100 Competition-level problems
50 Analysis-centric records
50 Construction-centric records
5 Primary categories
15 Competition sources
2000-2025 Problem year range

Overview

ComBench is designed to expose whether a model can both prove a combinatorial claim and realize the requested object as a complete, checkable witness.

Figure: Distribution of ComBench categories across 100 problems.

01

Curated Olympiad problems

Records are drawn from major Olympiad-style competitions and grouped into five combinatorics categories.

02

Rubric-guided proof judging

Each item has problem-specific 0/1/6/7 grading guidelines for proof-side evaluation.

03

Executable construction checks

Construction-centric records require a witness payload checked by deterministic Python verifiers.
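To make the idea of a verifier-checked witness concrete, here is a minimal sketch of what a deterministic verifier could look like. The task (a 2-coloring of the edges of K_5 with no monochromatic triangle), the payload layout, and the names `verify_witness` and `pentagon_coloring` are all hypothetical illustrations, not ComBench's actual problems or verifier interface.

```python
from itertools import combinations


def verify_witness(payload: dict) -> tuple[bool, str]:
    """Check a claimed 2-coloring of the edges of K_n with no monochromatic triangle.

    Hypothetical payload layout: {"n": int, "colors": {"i-j": 0 or 1}} with i < j.
    """
    n = payload.get("n")
    colors = payload.get("colors")
    if not isinstance(n, int) or n < 3 or not isinstance(colors, dict):
        return False, "malformed payload"

    # Every edge of K_n must be colored exactly once, with color 0 or 1.
    expected_edges = {f"{i}-{j}" for i, j in combinations(range(n), 2)}
    if set(colors) != expected_edges:
        return False, "edge set mismatch"
    if any(c not in (0, 1) for c in colors.values()):
        return False, "invalid color"

    # Reject the witness if any triangle uses a single color.
    for a, b, c in combinations(range(n), 3):
        used = {colors[f"{a}-{b}"], colors[f"{a}-{c}"], colors[f"{b}-{c}"]}
        if len(used) == 1:
            return False, f"monochromatic triangle {a},{b},{c}"
    return True, "ok"


def pentagon_coloring() -> dict:
    """Classic witness for n = 5: pentagon edges get color 0, diagonals color 1."""
    colors = {}
    for i, j in combinations(range(5), 2):
        colors[f"{i}-{j}"] = 0 if (j - i) % 5 in (1, 4) else 1
    return {"n": 5, "colors": colors}
```

The check is deterministic because it depends only on the payload: the same witness always yields the same verdict, and an incomplete or malformed payload is rejected rather than repaired.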

Design

The benchmark separates annotation, verification, and scoring so that proof reasoning and construction realization can be analyzed independently.

Stage 1

Specification and rubric construction

Human-expert intent, reference answers, and reference witnesses are converted into construction instructions and item-specific grading guidelines.

Stage 2

Verifier generation and audit

Each construction task receives deterministic verifier code and a semantic audit for target fidelity and constraint coverage.

Stage 3

Verifier-gated scoring

Proof scores are kept separate from construction validity; a high proof score is then calibrated if the required witness fails verification.
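The gating step can be sketched as a small function. The source does not state the exact calibration rule, so everything specific here is an assumption: that "high" means a rubric score of 6 or 7, that a failed witness caps the score at 1, and the name `gated_score` itself.

```python
RUBRIC_SCORES = (0, 1, 6, 7)  # the 0/1/6/7 rubric levels named in the overview


def gated_score(proof_score: int, needs_witness: bool, witness_ok: bool,
                cap: int = 1) -> int:
    """Return the proof score after verifier gating.

    ASSUMPTION: scores of 6 or 7 count as "high", and a high score is capped
    at `cap` when a required witness fails. The real rule may differ.
    """
    if proof_score not in RUBRIC_SCORES:
        raise ValueError(f"unexpected rubric score: {proof_score}")
    if needs_witness and not witness_ok and proof_score >= 6:
        return cap
    # Low scores, analysis-centric items, and verified witnesses pass through.
    return proof_score
```

Keeping the raw proof score and the verifier verdict as separate inputs means both can still be reported independently, with gating applied only when the final score is computed.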

Figures: Annotation pipeline diagram; verifier-gated scoring diagram.

Evaluation

ComBench reports average performance and Best@4 under rubric-guided and verifier-gated evaluation.
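As a rough illustration of the two reported metrics, the sketch below computes an average score and a Best@4 score from per-problem samples. It assumes four samples per problem, that Best@4 takes the maximum over those samples, and that scores are normalized to a 0-100 scale by dividing by the 7-point maximum. None of these details are confirmed by the source, and `avg_and_best_at_k` is a hypothetical name.

```python
def avg_and_best_at_k(samples: list[list[float]], max_score: float = 7.0,
                      k: int = 4) -> tuple[float, float]:
    """Return (Avg, Best@k) on a 0-100 scale.

    `samples[i]` holds the k final scores for problem i.
    ASSUMPTION: Best@k is the per-problem maximum, averaged over problems.
    """
    if not samples or any(len(s) != k for s in samples):
        raise ValueError(f"expected {k} samples for every problem")
    n = len(samples)
    mean_of_means = sum(sum(s) / k for s in samples) / n
    mean_of_bests = sum(max(s) for s in samples) / n
    scale = 100.0 / max_score
    return mean_of_means * scale, mean_of_bests * scale


# Two toy problems, four sampled scores each.
avg, best = avg_and_best_at_k([[7, 0, 1, 6], [1, 1, 0, 0]])
print(f"Avg = {avg:.1f}, Best@4 = {best:.1f}")
```

Under this reading, Best@4 is never below Avg for the same model, which matches the pattern in the reported results.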

65.4 Top overall Avg. score from GPT-5.5
75.3 Top overall Best@4 score from GPT-5.5
83.7 Top construction-centric Best@4 from Kimi-K2.6
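| Rank | Model | Analysis Avg. | Analysis Best@4 | Construction Avg. | Construction Best@4 | Overall Avg. | Overall Best@4 |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 62.4 | 72.9 | 68.4 | 77.7 | 65.4 | 75.3 |
| 2 | Gemini-3.1-Pro | 56.1 | 69.7 | 64.5 | 78.3 | 60.3 | 74.0 |
| 3 | Kimi-K2.6 | 43.5 | 60.6 | 63.4 | 83.7 | 53.5 | 72.1 |
| 4 | DeepSeek-V4-Pro | 37.8 | 56.6 | 52.6 | 67.7 | 45.2 | 62.1 |
| 5 | Qwen3.6-Max | 21.4 | 32.9 | 28.4 | 39.1 | 24.9 | 36.0 |
| 6 | SU-01 | 20.9 | 30.3 | 28.8 | 41.1 | 24.8 | 35.7 |
| 7 | GLM-5.1 | 21.6 | 36.0 | 25.6 | 37.1 | 23.6 | 36.6 |
| 8 | Qwen3.6-35B | 17.9 | 26.6 | 22.7 | 32.0 | 20.3 | 29.3 |
| 9 | Nemotron-Cascade | 21.8 | 32.9 | 17.4 | 28.0 | 19.6 | 30.4 |
| 10 | Gemma-4-31B-IT | 16.1 | 24.3 | 17.5 | 30.9 | 16.8 | 27.6 |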
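The overall columns appear to be the equal-weight mean of the analysis and construction columns, as one would expect from a 50/50 split of the 100 problems. The source does not state this, so treat it as an observation. The short check below uses the figures from the results table; the name `inconsistent_rows` and the 0.06 tolerance (to allow for one-decimal rounding) are assumptions of this sketch.

```python
# (model, analysis avg, analysis best@4, construction avg, construction best@4,
#  overall avg, overall best@4), copied from the results table.
ROWS = [
    ("GPT-5.5", 62.4, 72.9, 68.4, 77.7, 65.4, 75.3),
    ("Gemini-3.1-Pro", 56.1, 69.7, 64.5, 78.3, 60.3, 74.0),
    ("Kimi-K2.6", 43.5, 60.6, 63.4, 83.7, 53.5, 72.1),
    ("DeepSeek-V4-Pro", 37.8, 56.6, 52.6, 67.7, 45.2, 62.1),
    ("Qwen3.6-Max", 21.4, 32.9, 28.4, 39.1, 24.9, 36.0),
    ("SU-01", 20.9, 30.3, 28.8, 41.1, 24.8, 35.7),
    ("GLM-5.1", 21.6, 36.0, 25.6, 37.1, 23.6, 36.6),
    ("Qwen3.6-35B", 17.9, 26.6, 22.7, 32.0, 20.3, 29.3),
    ("Nemotron-Cascade", 21.8, 32.9, 17.4, 28.0, 19.6, 30.4),
    ("Gemma-4-31B-IT", 16.1, 24.3, 17.5, 30.9, 16.8, 27.6),
]


def inconsistent_rows(rows, tol: float = 0.06) -> list[str]:
    """List models whose overall scores differ from the mean of the two subsets."""
    bad = []
    for model, a_avg, a_best, c_avg, c_best, o_avg, o_best in rows:
        avg_ok = abs((a_avg + c_avg) / 2 - o_avg) <= tol
        best_ok = abs((a_best + c_best) / 2 - o_best) <= tol
        if not (avg_ok and best_ok):
            bad.append(model)
    return bad


print(inconsistent_rows(ROWS))
```

An empty list means every row is consistent with the equal-weight reading, up to rounding.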
Figure: Primary proof-error taxonomy, used to categorize below-full-credit proof samples.

Key finding

Proof quality and witness validity diverge

Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but achieves the strongest construction-centric Best@4, showing that constructive realization is not merely a by-product of stronger proof reasoning.

The dominant proof failure is Missing Core Mechanism at 41.2%, followed by Wrong Mathematical Target at 20.0%.

Resources

The public code release is available now. Paper, dataset, and full result artifacts will be linked here when public URLs are finalized.

arXiv

Preprint metadata and PDF link will be updated after release.

arXiv coming soon

GitHub

Evaluation pipeline, verifier runtime, data-building utilities, tests, and toy examples.

GitHub

Dataset HF

Dataset and result artifacts are reserved for the public benchmark release.

Dataset HF coming soon

Daily Paper

The Hugging Face Daily Paper entry will be linked after the arXiv page is available.

Daily Paper coming soon

Citation

@misc{combench2026,
  title        = {ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics},
  author       = {Zhang, Shunkai and Zhang, Haoran and Luo, Yun and Cheng, Qianjia and Lei, Haodi and Li, Yizhuo and Zhan, Runzhe and Wang, Zhilin and Xu, Bangjie and Su, Yucheng and Han, Xinmiao and Qu, Xiaoye and Liu, Dongrui and Lin, Zhouchen and Qiao, Yu and Ding, Ning and Li, Yafu and Cheng, Yu},
  year         = {2026},
  note         = {Preprint coming soon}
}