May 28, 2024
9 mins

RLHF Workflow: From Reward Modeling to Online RLHF

Online iterative RLHF builds on reward modeling as an approximation of human preferences, growing its training dataset across iterations and aligning models better than previous techniques
Paper Link

Key Takeaways

This technical report delves into the intriguing world of online iterative Reinforcement Learning from Human Feedback (RLHF), a powerful technique used for aligning Large Language Models (LLMs) with human preferences. While offline RLHF methods rely on a fixed, pre-collected dataset, online iterative RLHF continuously gathers new data during the training process. This approach offers significant advantages, particularly in overcoming the limitations of offline methods and achieving better alignment with human expectations. The report provides a detailed "recipe" for implementing online iterative RLHF, aiming to empower the open-source community to leverage this powerful technique.

Key highlights include:

  • Understanding Online Iterative RLHF: The report clearly explains how online iterative RLHF works, emphasizing its ability to address the distribution shift issue faced by offline methods.
  • Human Feedback Approximation: Recognizing the cost of obtaining human feedback, the report introduces the use of a proxy preference model trained on diverse open-source datasets.
  • Practical Implementation Details: A detailed guide for implementing online iterative RLHF is provided, covering aspects like preference dataset selection, reward model training, exploration strategies, and benchmark evaluations.
  • Impact of Online Iterative RLHF: The results clearly demonstrate the superiority of online iterative RLHF over offline methods.


LLMs are incredibly effective at text generation, and Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for aligning these models with human values and preferences. Think of ChatGPT, Claude, and Gemini – all these revolutionary models employ RLHF to understand and adapt to our expectations. The more user-interaction data a foundation-model company has, the better it can align a model and the more effective the model becomes.

The core idea behind RLHF is to incorporate human preferences into the training process. Imagine you're teaching a child to draw. You wouldn't just show them a bunch of drawings and expect them to learn. Instead, you'd provide feedback on their work, guiding them towards creating better drawings. RLHF works similarly: we "reward" the LLM for producing responses aligned with human preferences and "penalize" it for responses that deviate from those preferences.

Previous RLHF Approaches and Their Challenges

Traditional RLHF methods can be broadly categorized as either Deep RL-based approaches (using algorithms like Proximal Policy Optimization - PPO) or offline direct preference learning approaches (like Direct Preference Optimization - DPO).

DRL-based frameworks, like those used in ChatGPT and Claude, involve two stages. First, a reward model is trained to predict the "reward" associated with a given response. Second, a DRL algorithm like PPO is used to fine-tune the LLM to maximize this reward signal. While effective, these methods can be computationally demanding and challenging to tune, especially for resource-constrained open-source projects.

Offline direct preference learning algorithms like DPO directly learn from human preference datasets without explicitly constructing a reward function. These methods are generally easier to tune and require fewer computational resources. However, they rely on a fixed, offline preference dataset, which can lead to over-optimization and poor performance on out-of-distribution data.

Think of it like trying to learn to cook from a cookbook with a limited number of recipes. You might master those specific dishes, but struggle when faced with new ingredients or unfamiliar cooking techniques. Similarly, LLMs trained on a fixed dataset might excel in those specific areas covered by the data but falter when faced with prompts or situations outside that dataset.

The paper highlights this issue at the core:

Therefore, the distribution shift between policies is usually very large, and it is unlikely that we can learn the optimal policy solely from a pre-collected dataset.

Online Iterative RLHF

In contrast to offline methods, online iterative RLHF tackles the over-optimization challenge by continuously collecting new data during the training process. Imagine adding new recipes to your cookbook as you learn and explore new culinary techniques. This continuous learning allows you to adapt to new situations and improve your cooking skills over time.

Similarly, online iterative RLHF expands the training dataset by deploying intermediate models, gathering human feedback on their responses, and incorporating this new data into subsequent training iterations. This process helps mitigate the distribution shift issue, leading to better generalization and improved alignment.
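The loop described above can be sketched in a few lines. This is a hypothetical outline, not a specific library's API: `generate_responses`, `preference_model`, and `train_dpo` are placeholder callables standing in for the sampling step, the proxy labeler, and the preference-learning update.

```python
def online_iterative_rlhf(policy, prompts, dataset, preference_model,
                          train_dpo, generate_responses, iterations=3):
    """Grow the preference dataset with fresh model samples each round."""
    for _ in range(iterations):
        # 1. Deploy the current policy to produce candidate response pairs.
        candidates = generate_responses(policy, prompts)
        # 2. Label each pair with the proxy preference model (a stand-in
        #    for human annotators).
        new_pairs = [preference_model(p, a, b) for p, a, b in candidates]
        # 3. Accumulate all data seen so far and retrain the policy on it.
        dataset.extend(new_pairs)
        policy = train_dpo(policy, dataset)
    return policy
```

Because the dataset is extended rather than replaced, later iterations train on both the original offline data and samples from every intermediate policy, which is what mitigates the distribution shift.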

Human Feedback Approximation

Ideally, we would gather feedback from actual humans for online iterative RLHF. However, this can be expensive and time-consuming. As an alternative, the report suggests using a proxy preference model trained on a diverse set of open-source preference datasets. This model can then provide feedback on the LLM's responses, approximating human judgment.


Before diving into the details of online iterative RLHF, let's establish a common understanding of the key concepts:

Reward Modeling as Human Feedback Approximation

Reward modeling plays a crucial role in RLHF by providing a mechanism for capturing human preferences. In essence, a reward model predicts the "reward" associated with a given LLM response. A higher reward indicates better alignment with human preferences.

Preference Datasets

Preference datasets are essential for training reward models. These datasets consist of prompts paired with multiple responses, where human annotators have provided preferences between those responses. For example, a dataset might contain a prompt like "Write a poem about nature," along with two different poems generated by the LLM. Human annotators would then indicate which poem they prefer.
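A single entry in such a dataset can be pictured as a prompt plus a chosen/rejected response pair. The field names below are illustrative, not a fixed schema used by any particular dataset.

```python
# Hypothetical layout of one pairwise preference record.
record = {
    "prompt": "Write a poem about nature",
    "chosen": "Leaves whisper softly in the autumn breeze...",
    "rejected": "Nature is trees and stuff.",
}

def is_valid_pair(rec: dict) -> bool:
    """A usable record has a prompt and two distinct responses."""
    return bool(rec.get("prompt")) and rec.get("chosen") != rec.get("rejected")
```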

Bradley-Terry Reward Model and Preference Model

One common approach to reward modeling is the Bradley-Terry (BT) model, a classic model from preference learning. The BT model assigns a scalar score to each response, and the probability of preferring one response over another is determined by the difference between their scores. In other words, the response with a higher score is more likely to be preferred.

Alternatively, a preference model can directly predict the probability of preferring one response over another without explicitly assigning scores. This model takes the prompt and two responses as input and outputs the probability of preferring the first response over the second.
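As a minimal sketch, the Bradley-Terry preference probability reduces to a sigmoid of the score difference between the two responses:

```python
import math

def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry: P(a preferred over b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))
```

Equal scores give a 50/50 preference, and the probability approaches 1 as the score gap grows; a preference model skips the scores and outputs this probability directly.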

The paper provides a clear visualization of these concepts in Figure 2.

Iterative Policy Optimization

Now, let's delve into the heart of online iterative RLHF - the iterative policy optimization process.

Supervised Fine-Tuning (SFT)

Before applying RLHF, the LLM is typically fine-tuned on a large dataset of instructions and responses. This step, known as Supervised Fine-tuning (SFT), helps the LLM develop a basic understanding of instructions and generate coherent responses.

Iterative Direct Preference Learning: Theoretical Insights and Algorithmic Principles

The core principle of online iterative RLHF is to continuously refine the LLM's policy (the way it generates responses) based on new data collected during training. This process involves two key components:

Iterative Finetuning

In each iteration, the LLM's policy is updated using a direct preference learning algorithm (like DPO) on the accumulated data, which includes both the initial offline dataset and the new data collected in previous iterations. This iterative fine-tuning allows the LLM to gradually align its responses with human preferences.
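For one preference pair, the DPO objective can be sketched as below. This assumes the sequence-level log-probabilities under the current policy and the frozen reference policy are precomputed; `beta` is the usual strength hyperparameter controlling deviation from the reference.

```python
import math

def dpo_pair_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    The loss shrinks as the policy raises the chosen response's
    log-probability (relative to the reference) above the rejected one's.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree, the margin is zero and the loss is log 2; pushing probability mass toward the chosen response drives it down.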

Iterative DPO and Practical Implementation Details

The report proposes an implementation of online iterative RLHF using DPO. To enhance exploration and prevent over-optimization on the existing data, the algorithm uses a non-symmetric structure with two agents:

  • Main agent: This agent focuses on exploiting the existing data, representing the best policy learned so far.
  • Enhancer: This agent explores new areas of the response space, generating responses that are different from the main agent's responses.

The enhancer plays a crucial role in mitigating the distribution shift issue. By exploring new areas of the response space, it helps gather data from regions where the existing reward model might be less accurate.

The exploration strategy employed combines temperature tuning with rejection sampling:

  • Temperature tuning: Adjusting the sampling temperature of the main agent introduces variability in its responses.
  • Rejection sampling: A "best-of-n" sampling approach is used, where multiple responses are generated for each prompt, and the best (and worst) response is selected based on the reward model's evaluation.
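The best-of-n selection step can be sketched as follows, with `reward_fn` a placeholder for the trained reward model's scoring call:

```python
def best_and_worst_of_n(prompt, responses, reward_fn):
    """Score n sampled responses and return (best, worst).

    The pair forms a new preference example: best = chosen,
    worst = rejected.
    """
    scored = sorted(responses, key=lambda r: reward_fn(prompt, r))
    return scored[-1], scored[0]
```

Pairing the highest- and lowest-scoring samples maximizes the preference margin in each newly collected example.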

Figure 6 illustrates this process, highlighting how the historical dataset grows with each iteration, leading to a more comprehensive and diverse training set.

Evaluation of the Model

The effectiveness of online iterative RLHF is evaluated using a combination of standard benchmarks, each assessing a different aspect of the model's performance:

  • Conversation quality: AlpacaEval-2, MT-Bench, and Chat-Arena-Hard evaluate the model's ability to generate human-like and engaging responses in both single-turn and multi-turn conversations.
  • Academic tasks: GSM-8K, MMLU, HumanEval, TruthfulQA, ARC, and MBPP measure the model's capabilities in reasoning, problem-solving, and knowledge acquisition.

Main Results

The results unequivocally demonstrate the benefits of online iterative RLHF:

  • Conversation quality: The resulting model, SFR-Iterative-DPO-LLaMA-3-8B-R, significantly outperforms other open-source models with comparable sizes on conversation and instruction-following benchmarks. Notably, it even surpasses much larger models trained with offline DPO or PPO.
  • Academic tasks: The model's performance on academic benchmarks remains comparable to the SFT baseline, indicating that online iterative RLHF does not significantly degrade reasoning or factual accuracy.

The ablation study provides further insights into the impact of different design choices:

  • Length bias: The model trained with a length penalty in the reward function effectively mitigates the length bias (tendency to generate longer responses), leading to improved performance on certain benchmarks.
  • Reward model: Using a reward model trained on a diverse set of datasets results in better performance compared to using a reward model with a stronger bias towards longer responses.
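A length penalty of the kind described can be sketched as a per-token deduction from the raw reward; the function name and the `alpha` coefficient here are illustrative, not the paper's exact formulation.

```python
def length_penalized_reward(raw_reward: float, response_tokens: int,
                            alpha: float = 0.001) -> float:
    """Subtract a per-token penalty so longer responses are not
    rewarded merely for being longer."""
    return raw_reward - alpha * response_tokens
```

With this adjustment, a verbose response must earn genuinely higher raw reward to outrank a concise one.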

Business Implications

The paper's findings have significant business implications, extending beyond the commonly known use cases of RLHF.

Enhanced Customer Service

Online Iterative RLHF can lead to more natural and helpful conversational AI agents for customer service applications. This could lead to increased customer satisfaction and reduced reliance on human agents for routine queries.

Personalized Learning and Education

Iterative RLHF can power AI tutors and learning companions that adapt to individual student needs and learning styles, providing more effective personalized education experiences.

Creative Content Generation

Beyond writing, iterative RLHF can be used to generate other creative content, such as music, code, and even business strategies. This opens new avenues for businesses to leverage AI for creative tasks and idea generation.

Reduced Development Costs and Time

By leveraging open-source datasets and the "recipe" provided in the paper, businesses can significantly reduce the cost and time required to develop high-performing conversational AI systems.


The report makes a compelling case for the effectiveness of online iterative RLHF in aligning LLMs with human preferences. It provides a detailed and practical guide for implementing this technique, making it accessible to the open-source community. The results highlight the significant improvements achieved over offline methods, opening new possibilities for developing more human-aligned and powerful LLMs.

Why is Online Iterative RLHF more Effective?

The effectiveness of online iterative RLHF stems from its ability to address the key limitations of offline methods. Here's why it works so well:

Mitigating Distribution Shift

By continuously collecting new data from evolving policies, online iterative RLHF effectively tackles the distribution shift issue, preventing over-optimization on the initial dataset and improving generalization. It's like constantly adding new ingredients and recipes to your cooking repertoire, allowing you to handle a wider range of culinary challenges.

Refined Reward Model

The iterative process allows the reward model to continuously learn and improve, adapting to the LLM's evolving policies. This ensures the feedback provided remains relevant and effective throughout the training process.

Enhanced Exploration

By incorporating exploration strategies like temperature tuning and rejection sampling, online iterative RLHF encourages the LLM to venture into new areas of the response space. This helps discover novel and creative responses that might not be found within the confines of the initial dataset.

In essence, online iterative RLHF fosters a dynamic and adaptive learning process, enabling the LLM to continuously refine its responses and align them with the ever-evolving intricacies of human preferences.
