Research paper: “Inference-Time Scaling for Generalist Reward Modeling”

DeepSeek released a new paper introducing Self-Principled Critique Tuning (SPCT) for Generative Reward Modeling (GRM): “Inference-Time Scaling for Generalist Reward Modeling”. The paper is under review at COLM 2025, as of April 2025.

Reward modeling (RM) is hard, especially in domains where there’s no clear positive or negative signal (unlike, say, coding, where you can execute the programs and compare to expected outputs).

In general domains, reward generation is more challenging, as the criteria for rewards are more diverse and complex, and there are often no explicit reference or ground truth. Generalist reward modeling is thus crucial for improving the performance of LLMs in broader applications.

Liu et al. argue that generalist RM requires:

1. Flexibility for different input types
2. Accurate reward generation in various domains
3. (For inference-time scaling) Generate higher-quality reward signals given more inference compute
4. (For inference-time scaling) Learn scalable behaviors for better performance-compute scaling

This post contains my lightly-edited notes about this paper.

Generalist Reward Modeling

Challenge 4 is vague in my opinion; I guess the argument is that inference-time scaling is not economical or perhaps not sufficiently scalable given their RM needs.
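To make "more inference compute" concrete for myself: one simple way a pointwise reward model can spend extra compute is to sample several independent scorings of the same responses and aggregate them. The sketch below assumes sum-voting over sampled scores; the function name `vote` and the numbers are my own illustration, not something taken from the paper's code.

```python
from typing import List


def vote(sampled_scores: List[List[int]]) -> List[int]:
    """Aggregate k sampled score lists (one score per response) by summing.

    Assumption: every sample scores the same n responses in the same order.
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled score list")
    n = len(sampled_scores[0])
    if any(len(scores) != n for scores in sampled_scores):
        raise ValueError("all samples must score the same number of responses")
    # zip(*...) groups the k scores that belong to each response.
    return [sum(per_response) for per_response in zip(*sampled_scores)]


# Illustrative numbers: a single sample prefers response 1, three samples prefer response 0.
one_sample = vote([[6, 7]])
three_samples = vote([[6, 7], [8, 5], [7, 4]])
```

Summing also widens the range of possible final scores as the number of samples grows, so the aggregated reward is finer-grained than any single sample.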

To overcome Challenge 1, they argue that “pointwise generative reward modeling (GRM) [can] unify the scoring of single, paired, and multiple responses within pure language representation”.
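Here is how I picture the "unify" claim: if the model writes a critique in plain text that ends with one score per response, the same parser handles one, two, or many responses. This is a minimal sketch under my own assumptions; the `Scores: [...]` output format and the helper names are hypothetical, not the paper's actual prompt or output template.

```python
import re
from typing import List

# Hypothetical output convention: the critique ends with "Scores: [s1, s2, ...]".
SCORE_LINE = re.compile(r"Scores:\s*\[([^\]]*)\]")


def parse_scores(critique: str) -> List[int]:
    """Extract one integer score per response from a generated critique."""
    match = SCORE_LINE.search(critique)
    if match is None:
        raise ValueError("no score line found in critique")
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]


def best_indices(scores: List[int]) -> List[int]:
    """Indices of the top-scoring responses (ties are all returned)."""
    if not scores:
        raise ValueError("need at least one score")
    top = max(scores)
    return [i for i, s in enumerate(scores) if s == top]


single = parse_scores("The answer is mostly correct. Scores: [7]")      # absolute rating
paired = parse_scores("Response 2 follows the format. Scores: [4, 9]")  # preference
multiple = parse_scores("Responses 2 and 3 tie. Scores: [3, 8, 8]")     # ranking with a tie
```

A single score is an absolute rating, two scores give a pairwise preference, and more than two give a ranking, all from the same text format.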

To overcome Challenge 2, they propose a new learning method: Self-Principled Critique Tuning (SPCT).

SPCT has two phases:

Rejective fine-tuning
Rule-based online RL

Rejective fine-tuning is used to cold-start the reward model, and rule-based online RL “reinforces generalist reward generation by advancing the generated principles and critiques”.
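To keep the two phases straight, I sketched what each could look like in code. This is my reading, not the paper's implementation: I assume a sampled trajectory counts as correct when the ground-truth best response gets the strictly highest score, that rejective fine-tuning keeps only correct trajectories, and that the online RL reward is +1 or -1 on the same check. All names and the exact reward values are assumptions.

```python
from typing import Dict, List


def is_correct(scores: List[int], best_index: int) -> bool:
    """True when the labeled-best response has the strictly highest score."""
    top = scores[best_index]
    return all(s < top for i, s in enumerate(scores) if i != best_index)


def rejective_filter(trajectories: List[Dict], best_index: int) -> List[Dict]:
    """Phase 1 (assumed): keep sampled trajectories whose scores match the label."""
    return [t for t in trajectories if is_correct(t["scores"], best_index)]


def rule_based_reward(scores: List[int], best_index: int) -> float:
    """Phase 2 (assumed): outcome reward for online RL, +1 if correct else -1."""
    return 1.0 if is_correct(scores, best_index) else -1.0


# Illustrative trajectories: each holds the generated text and the extracted scores.
trajectories = [
    {"text": "principles + critique A", "scores": [9, 4]},
    {"text": "principles + critique B", "scores": [5, 5]},
    {"text": "principles + critique C", "scores": [3, 8]},
]
kept = rejective_filter(trajectories, best_index=0)
rewards = [rule_based_reward(t["scores"], best_index=0) for t in trajectories]
```

Note that a tie counts as incorrect here, which is a design choice on my part: it pushes the model to separate responses rather than hedge with equal scores.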

What principles and critiques are we talking about here? While they find that humans can create effective principles (i.e. rubrics for assessing the quality of a generation), they propose learning these principles from data.

Boosting reward quality with principles

In section 2.2, they conduct an experiment to demonstrate the value of principles. The benefits of principles:

Generalist RM requires to generate high-quality rewards beyond specific domains (Hendrycks et al., 2021; Jimenez et al., 2024), where the criteria for rewards are more diverse and complex, and there are often no explicit reference or ground truth. To this end, for general domains, we adopt principles to guide reward generation in place of artificial rules. Principles for LLMs are first introduced in Constitutional AI (Bai et al., 2022b; Sharma et al., 2025), which are handicraft criteria that guide the LLMs or curated classifiers to construct safe data pipelines.

With principles, the reward generation of GRMs changes to
\[\mathcal{R} = \boldsymbol{C} \sim r_{\theta}\left(x, \{y_i\}_{i=1}^n, \{p_i\}_{i=1}^m\right)\]
where \(\{p_i\}_{i=1}^m\) denotes the principles. We conduct a preliminary experiment to examine the influence of proper principles on reward quality, with the Chat Hard subset of Reward Bench (Lambert et al., 2024) and the IFEval subset of the PPE benchmark (Frick et al., 2025). We used GPT-4o-2024-08-06 to generate the principles and then pointwise rewards four times for each sample. And we filtered the principles whose according rewards are aligned with the ground truth. We tested different LLMs with principles generated by themselves and the filtered principles, and compared them with the default setting with no principle guidance.

…

We found that the self-generated principles barely improve performance, but the filtered principles could significantly boost the reward quality. This indicates that proper principles better guide reward generation under correctly summoned criteria.

I have no idea what “filtered the principles whose according rewards are aligned with the ground truth” means. Presumably they selected some subset of the generated principles: those that, used together, give the best reward model as assessed by Reward Bench?
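A more literal reading is also possible: for each benchmark example, the principles and pointwise rewards were generated four times, and a set of principles was kept only if the rewards generated alongside it ranked the ground-truth-preferred response on top. The sketch below implements that per-sample filter. It is only my guess at the procedure; the class, the function names, and the example principles are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Generation:
    principles: List[str]  # principles produced in this sample
    rewards: List[int]     # pointwise rewards produced alongside them, one per response


def agrees_with_label(rewards: List[int], preferred: int) -> bool:
    """True when the preferred response has the strictly highest reward."""
    best = rewards[preferred]
    return all(r < best for i, r in enumerate(rewards) if i != preferred)


def filter_principles(generations: List[Generation], preferred: int) -> List[List[str]]:
    """Keep the principles from samples whose rewards match the ground truth."""
    return [g.principles for g in generations if agrees_with_label(g.rewards, preferred)]


# Four samples for one example where response 0 is the ground-truth preference.
generations = [
    Generation(["Follows the length constraint"], [8, 5]),
    Generation(["Friendly tone"], [6, 7]),
    Generation(["Follows every explicit instruction"], [9, 4]),
    Generation(["Detailed answer"], [7, 7]),
]
kept_principles = filter_principles(generations, preferred=0)
```

Under this reading, "filtered principles" are an oracle of sorts: they were selected using the labels, which would explain why they help much more than self-generated ones.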
