March 20, 2024

PERL: Parameter Efficient RLHF

PERL, or Parameter Efficient Reinforcement Learning, could be a groundbreaking technique for reducing the memory and time it takes to align a model before releasing it to the world. The paper shows that using LoRA gets you close to the benchmarks of standard RLHF techniques, and hence the same quality of output. The business implication is cost: alignment can be done efficiently, on premises, on any open source model.

Key Takeaways

  • From Google Deepmind
  • RLHF has proven to be a strong method for LLM alignment. It's computationally expensive though.
  • PERL uses Low-Rank Adaptation (LoRA) to perform reward modeling and RLHF on instruction-tuned models, saving on computation and costs.
  • Across 7 different benchmarks, PERL performs on par with traditional techniques while training faster and consuming less memory.

Another great paper from Google DeepMind with very practical applications, and another one where they have not shared the code. Something tells me that will be the new normal going forward.


Pretrained LLMs like GPT-4, Mistral, and Gemini need to be aligned with human preferences so that they can be tuned towards what counts as a good answer and what counts as a bad one. This alignment helps with instruction tuning and lets you finetune models for behaviors that do not have a typical loss function. Typically this is done via two techniques: reward modeling and Reinforcement Learning from Human Feedback.


RLHF, or Reinforcement Learning from Human Feedback, aligns a model with human preferences. Conceptually, an unaligned model will answer just about anything without any control: some answers may be right, some incorrect, and some correct but not tuned to human preferences. RLHF teaches the model how it should answer.

In RLHF, we train a small reward model (RM) on the human preference data, and then use that reward model to tune the parameters of the LLM.

From the paper

A typical model training process consists of pretraining, Supervised Finetuning (SFT), reward modeling, and then RLHF. Reinforcement learning starts from the weights of the post-SFT stage; that is, from a model that can already answer questions or follow instructions based on SFT examples. Then we train a reward model and build the reinforcement learning pipeline.

Reward Modeling

To perform reinforcement learning (RL from here on), we need a reward score for each generated response. This is the target the RL algorithm maximizes. A popular approach is model-based. From the paper:

A popular approach consists of deriving this reward from preference pairs. In this preference pair comparison formulation, we learn a reward from preferences expressed over pairs of candidate responses. Given an input x, we sample a pair of responses (y1, y2) ∼ π from one or more models. We collect preference labels over the candidates from humans.

We collect human preferences over pairs of candidate responses from the model. These preferences are used to train a separate "reward model" that can predict which response is preferred given the input.
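The standard way to turn such preference pairs into a training signal is a Bradley-Terry style pairwise loss. Here is a minimal sketch of that loss (the function is my own illustration, not code from the paper):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the preferred response's score above the other's."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ranked pair incurs a much smaller loss than a misranked one.
print(round(pairwise_reward_loss(2.0, 0.0), 3))  # → 0.127
print(round(pairwise_reward_loss(0.0, 2.0), 3))  # → 2.127
```

In practice the scalar scores come from a reward head on the LLM and the loss is averaged over a batch of labeled pairs, but the objective is exactly this comparison.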

Reinforcement Learning

The anchor model (obtained via SFT) is further fine-tuned with reinforcement learning, where the goal is to maximize the cumulative reward given by the reward model, with hyperparameters controlling the learning rate and the LoRA setup. The process is highly complex and resource-hungry. From the paper:

While RLHF has been shown to be an effective method for alignment [Stiennon et al., 2020, Bai et al., 2022b], the complexity and computational cost of its process has hindered its adoption: the RL loop, for example, requires at least twice the memory of standard fine-tuning, as it needs multiple instances of LLMs, such as a reward model and an anchor model (for KL regularization) to train properly. Another challenge to the wider adoption of RLHF lies in the difficulty to collect enough high quality training data to create an effective reward model.

Parameter Efficient Reinforcement Learning

The key idea behind PERL is to use "LoRA" adapters: small, trainable modules that can be attached to the pre-trained language model. Instead of fine-tuning the entire huge model, the authors train only these tiny LoRA adapters while keeping the backbone language model frozen.

For reward model training, the researchers attach LoRA adapters to each attention projection matrix of the language model. Only these adapters are trained on the preference data, while the rest of the model remains fixed. Once trained, the adapters are easily combined with the original model to get the reward model. It's like putting a little accessory on your language model.
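To see why this is cheap, here is a toy numpy sketch of a single LoRA-adapted projection. The dimensions and initialization follow the usual LoRA recipe (zero-initialized up-projection, so the adapter starts as a no-op); they are illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size and LoRA rank, r << d

W = rng.normal(size=(d, d))          # frozen attention projection weight
A = rng.normal(size=(r, d)) * 0.01   # LoRA down-projection (trainable)
B = np.zeros((d, r))                 # LoRA up-projection, zero-init (trainable)

x = rng.normal(size=d)
y = W @ x + B @ (A @ x)              # adapted forward pass: (W + BA) x

# With B = 0 at initialization, the adapter leaves the frozen model unchanged.
assert np.allclose(y, W @ x)

trainable_fraction = (A.size + B.size) / W.size
print(trainable_fraction)            # 2*d*r / d*d = 1/32 ≈ 3.1% at this toy size
```

The trainable fraction shrinks linearly as the hidden size grows, which is how the paper reaches under 0.1% of parameters at LLM scale.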

The reinforcement learning process is similar: LoRA policy adapters are attached to the anchor model, and only these small adapters are trained with reinforcement learning to maximize the reward from the reward model. The policy is also regularized to keep it from drifting too far from the anchor.
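The regularization is typically a KL penalty folded into the reward. A hedged sketch of that effective reward (the beta value and the per-sample KL estimate are common conventions, not specifics from the paper):

```python
import math

def kl_regularized_reward(rm_score: float, logp_policy: float,
                          logp_anchor: float, beta: float = 0.1) -> float:
    """Effective reward in the RL loop: the reward model's score minus a KL
    penalty that discourages the policy from drifting away from the anchor.
    Uses the common per-sample estimate KL ≈ log pi(y|x) - log pi_anchor(y|x)."""
    kl_estimate = logp_policy - logp_anchor
    return rm_score - beta * kl_estimate

# The further the policy drifts from the anchor, the bigger the penalty.
print(kl_regularized_reward(1.0, -2.0, -2.5))  # mild drift:  1.0 - 0.1*0.5 = 0.95
print(kl_regularized_reward(1.0, -2.0, -6.0))  # heavy drift: 1.0 - 0.1*4.0 = 0.6
```

Because only the LoRA adapters differ between policy and anchor, the frozen backbone can even be shared between the two, which is one source of PERL's memory savings.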

So in essence, instead of updating billions of parameters during training, only the tiny LoRA modules are updated. This dramatically reduces memory requirements and speeds up training, because there are far fewer trainable parameters to deal with.

It's like clipping a small, trainable attachment onto the large pre-trained model to steer its behavior, rather than rebuilding the model itself.


Our experiments across datasets show that we can LoRA train reward models to match the performance of fully tuned ones with less than 0.1% of the parameters being tuned. We also observe that LoRA generally becomes more effective at matching full-tuning as we increase the LLM size. Across all our datasets, the reward models can be tuned with approximately 50% of the memory needed for full-tuning. The training converges in a similar number of steps for LoRA and full tuning, and the LoRA models are roughly 50% faster to train. The HBM usage and training speedup vary a little with each dataset as they depend on sequence lengths of the examples, which is different for all datasets.


The open source community is where people discovered that LoRA could help them finetune models at a fraction of the resources and cost. But those models were still not aligned. With DPO, they could use preference data to teach a model what a good answer looks like. RLHF remained difficult because of how resource-hungry it is, despite being one of the most effective methods for alignment.

With PERL, many open source developers and businesses can align models the right way, which is indeed to give a model lots of examples marked with preferences. Running the reward model and RL loop under LoRA means costs come down dramatically, and enterprises can do this more frequently to ensure a model performs as they desire.

Moreover, enterprises where AI is deployed (whether via OpenAI or open source models) can take the responses the model generates, collect preferences from users, and use that feedback to improve both current and future deployments, producing significantly more useful and directionally actionable outputs.


Why Clio AI?

Unlock the most obvious-yet-hidden-in-plain-sight growth hack - enable your employees to work on important things, and reduce their cognitive load and time to resolve blockers.

Fast, efficient, and in-context information to make every employee a super performer.

Spend time thinking not searching. Get a demo today.
