After pre-training on vast datasets and supervised fine-tuning with diverse instruction sets, Large Language Models (LLMs) have achieved remarkable capabilities in text generation. However, even when LLMs generate seemingly reasonable sequences, free from grammatical errors and redundant words, they may still produce content that lacks truthfulness or accuracy. Are there any methods to mitigate these shortcomings? Researchers at OpenAI have framed these issues as the challenge of LLM alignment. Currently, one of the most prominent approaches to address these challenges is Reinforcement Learning from Human Feedback (RLHF). To implement RLHF, OpenAI adopted the Proximal Policy Optimization (PPO) algorithm.

Multi-turn instruction tuning

Most instruction-following studies and benchmarks overlook the multi-turn instruction-following capability of LLMs, which is actually a more common demand in real-world scenarios. It is therefore hardly an exaggeration to say that multi-turn conversation ability is one of the most significant capabilities of LLMs.

Parrot: enhancing multi-turn instruction following for LLMs

A multi-turn example: contextual information needs to be utilized by LLMs. (Image source: Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models)


The most common way humans interact with LLMs is multi-turn conversation. Parrot1 presents a solution aimed at enhancing multi-turn instruction following for LLMs.

  1. Dataset Collection: the authors propose training a specialized Parrot-Ask model, based on LLaMA, to generate queries in the style of the available real user-ChatGPT logs, then employ Parrot-Ask to interact with an assistant LLM and thus collect 40K multi-turn instruction-tuning examples.
  2. Training the Parrot-Ask Model: training this model is the inverse of standard instruction tuning. Compared to common supervised fine-tuning methods, the Parrot-Ask model is trained to predict query tokens instead of assistant output tokens (a minimal loss-masking sketch appears after the figure below). Concretely, the authors use LLaMA-13B-Chat and 90K ShareGPT conversations to train this model.
  3. CaPO Dataset Collection: the authors sample 10K examples that rely on contextual information and adopt three strategies to generate negative responses, thus collecting a 10K Context-Aware Preference Optimization (CaPO) dataset.
The process of Parrot. (a) First, train the Parrot-Ask model on real user-ChatGPT logs to learn how real users pose queries, and utilize it to iteratively interact with ChatGPT to collect multi-turn instruction-response pairs. (b) Then, for queries that rely heavily on context for answering, construct negative responses with three strategies to simulate three types of error cases. Finally, use the collected data to train the Parrot-Chat model by (c) instruction tuning and (d) context-aware preference optimization. (Image source: Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models)

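To make the inverse objective in step 2 concrete, here is a minimal sketch, not the authors' code, of how the label mask flips between standard SFT and Parrot-Ask-style training; the function, role labels, and token ids are illustrative assumptions.

```python
# Minimal sketch of "inverse" instruction tuning: standard SFT supervises
# assistant tokens, while a Parrot-Ask-style model is supervised on the
# user-query tokens instead. Names and token ids are illustrative.

from typing import List, Tuple

IGNORE_INDEX = -100  # conventional ignore label for cross-entropy loss


def build_labels(
    turns: List[Tuple[str, List[int]]],  # [(role, token_ids), ...] for one conversation
    train_ask_model: bool,               # True -> supervise queries, False -> standard SFT
) -> Tuple[List[int], List[int]]:
    """Flatten a multi-turn conversation into (input_ids, labels).

    Tokens of the role we want the model to predict keep their ids as labels;
    all other tokens get IGNORE_INDEX so they contribute no loss.
    """
    target_role = "user" if train_ask_model else "assistant"
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == target_role:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy usage with made-up token ids: only user turns are supervised.
conversation = [("user", [11, 12, 13]), ("assistant", [21, 22]), ("user", [14, 15])]
ids, labels = build_labels(conversation, train_ask_model=True)
print(labels)  # [11, 12, 13, -100, -100, 14, 15]
```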

Reward Model

HelpSteer2

dataset: most of the prompts used in HelpSteer2 are sourced from ShareGPT, a platform where ChatGPT users voluntarily share their conversations.

method details: We initially have a reward model $P(a|x, y)$, which predicts the attributes $a$ conditioned on prompt $x$ and response $y$. From it we can derive the attribute-conditioned distribution $P(y|a, x)$ using Bayes’ rule: $$\begin{align}P(y|a, x) &= \frac{P(a|x, y)P(x, y)}{P(a, x)}\\&= \frac{P(a|x, y)P(x, y)}{\sum_{y'}{P(a|x, y')P(x, y')}} \\&=\frac{P(a|x, y)P(y|x)}{\sum_{y'}{P(a|x, y')P(y'|x)}} \\ &\propto P(a|x, y)P(y|x)\end{align}$$ This means we can obtain a SteerLM model $P(y|a, x)$ from a reward model $P(a|x, y)$ and a language model $P(y|x)$. However, $P(y|a, x)$ is the optimal model, which we cannot obtain directly. We therefore assume an approximate SteerLM model $Q_{\theta}(y|a, x)$, and what we need to do is measure the distance between $P(y|a, x)$ and $Q_{\theta}(y|a, x)$ and minimize it: $$\min_{\theta}\mathbb{E}_{a,x \sim P(x)P(a)} D_{KL}(P(y|a, x)\,||\,Q_{\theta}(y|a, x))$$ Given that the desired distribution $P(y|a, x)$ remains unknown, we expand the objective above (the $\log P(y|a, x)$ term and the normalization constant of $P(y|a, x)$ do not depend on $\theta$ and are dropped): $$\begin{align} &=\min_{\theta}\mathbb{E}_{a,x\sim P(x)P(a)}\mathbb{E}_{y\sim P(y|a,x)}\left[\log P(y|a, x)-\log Q_{\theta}(y|a, x)\right] \\&=\min_{\theta}-\mathbb{E}_{a,x\sim P(x)P(a),\, y\sim P(y|a,x)} \log Q_{\theta}(y|a, x) \\&= \min_{\theta} -\mathbb{E}_{a,x \sim P(x)P(a)} \sum_{y}P(a|y,x)P(y|x) \log Q_{\theta}(y|a, x) \end{align}$$ In practice, the data we use to train the model do not come from the distribution $P(a)P(x)P(a|y, x)P(y|x)$; for the sake of training efficiency, we instead sample responses from an original SteerLM model $Q'(y|a, x)$. This leads to:

$$\begin{align} &\min_{\theta} -\mathbb{E}_{a,x \sim P(x)P(a)} \sum_{y}P(a|y,x)P(y|x) \log Q_{\theta}(y|a, x) \\ &= \min_{\theta}-\mathbb{E}_{a,x \sim P(x)P(a),\, y \sim Q'(y|a,x)}\frac{P(a|y, x)P(y|x)}{Q'(y|a,x)} \log Q_{\theta}(y|a,x) \end{align}$$

At this point, we optimize this objective. Its gradient is: $$\begin{align}\nabla_{\theta} \mathcal{L} = -\mathbb{E}_{a,x \sim P(x)P(a)} \mathbb{E}_{y \sim Q'(y|a,x)}\frac{P(a|y,x)P(y|x)}{Q'(y|a, x)} \nabla_{\theta} \log Q_{\theta} (y|a,x)\end{align}$$ Here, we estimate the expectation $\mathbb{E}_{y \sim Q'(y|a, x)}$ with $n$ samples $y_i \sim Q'(y|a,x)$, and let $$\begin{align} w_{i} = \frac{P(a|y_i,x)P(y_i|x)}{Q'(y_i|a,x)} \end{align}$$ Considering that $w_i$ can take very large values, we normalize $w_i$ to $w'_{i}$ so that $\sum_i w'_i=1$: $$\begin{align} w'_{i} = \frac{w_i}{\sum_{k=1}^{n} w_k}\end{align}$$ The gradient can then be estimated as: $$\begin{align}\nabla_{\theta} \mathcal{L} \approx -\sum_{i=1}^{n} w'_i \nabla_{\theta} \log Q_{\theta}(y_i|a,x) \end{align}$$ To reduce variance, we subtract a baseline estimated using $Q_{\theta}$ itself, based on the identity: $$\begin{align} \mathbb{E}_{y \sim Q_{\theta}(y|a, x)} \nabla_{\theta} \log Q_{\theta}(y|a,x) = 0 \end{align}$$ We estimate this expectation with the same samples from $Q'$ via importance sampling: $$\begin{align} \mathbb{E}_{y\sim Q_{\theta}(y|a,x)} \nabla_{\theta} \log Q_{\theta}(y|a,x) \approx \sum_{i=1}^{n} b'_i \nabla_{\theta} \log Q_{\theta}(y_i|a, x) \approx 0 \end{align}$$ where $b'_i=\frac{b_i}{\sum_i b_i}$ with $b_i = \frac{Q_{\theta}(y_i|a,x)}{Q'(y_i|a,x)}$. Subtracting the baseline gives: $$\begin{align} \nabla_{\theta} \mathcal{L} \approx - \sum_{i=1}^{n} (w'_i - b'_i)\nabla_{\theta} \log Q_{\theta}(y_i|a,x) \end{align}$$ From the above, only $w_i$ is related to the optimal distribution $P(y|a, x)$, and only $b_i$ is associated with our target distribution $Q_\theta(y|a,x)$, so monitoring the KL divergence between $w'_i$ and $b'_i$ during training is a reasonable option: $$\begin{align} D_{KL}(w'_i \,||\, b'_i) = \sum_{i} w'_i \log \frac{w'_i}{b'_i} \end{align}$$
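
The estimator above can be sketched as follows, assuming the per-sample log-probabilities $\log Q_{\theta}(y_i|a,x)$, $\log Q'(y_i|a,x)$, and $\log P(a|y_i,x)+\log P(y_i|x)$ are computed elsewhere (names and shapes are illustrative). The surrogate loss is built so that its gradient matches the $(w'_i - b'_i)$ estimate, and the KL monitor is returned alongside it.

```python
# Sketch of the importance-weighted objective with a self-normalized baseline.
# Inputs are 1-D tensors over n sampled responses y_i for one (a, x) pair:
#   log_q_theta: log Q_theta(y_i | a, x), requires grad
#   log_q_prime: log Q'(y_i | a, x), frozen sampling model
#   log_p_ref:   log P(a | y_i, x) + log P(y_i | x)

import torch


def steerlm2_loss(log_q_theta, log_q_prime, log_p_ref):
    with torch.no_grad():
        # w'_i: normalized importance weights toward the optimal P(y|a, x)
        w = torch.softmax(log_p_ref - log_q_prime, dim=0)
        # b'_i: normalized weights toward the current model Q_theta(y|a, x)
        b = torch.softmax(log_q_theta - log_q_prime, dim=0)
        # KL(w' || b'), a useful quantity to monitor during training
        kl_monitor = torch.sum(w * (torch.log(w) - torch.log(b)))
    # Surrogate loss: its gradient is -sum_i (w'_i - b'_i) * grad log Q_theta(y_i|a,x)
    loss = -torch.sum((w - b) * log_q_theta)
    return loss, kl_monitor


# Toy usage with n = 4 sampled responses.
log_q_theta = torch.tensor([-3.0, -2.5, -4.0, -3.5], requires_grad=True)
log_q_prime = torch.tensor([-3.1, -2.4, -3.9, -3.6])
log_p_ref = torch.tensor([-2.8, -2.9, -4.2, -3.3])
loss, kl = steerlm2_loss(log_q_theta, log_q_prime, log_p_ref)
loss.backward()
```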

Human Preference Optimization

Make LLMs refuse to answer unknown questions

R-Tuning, introduced in (Zhang et al., 20232), aims to equip Large Language Models (LLMs) with the ability to decline to answer unknown questions. It leverages instruction tuning, following a two-step process:

  1. Uncertainty Identification: The model is first evaluated on the training data. By running the model over the training data once and comparing each prediction with its label, the instruction-tuning data is split into uncertain data and certain data.
  2. Refusal-Aware Data Construction: Certainty or uncertainty expressions are appended to the labels according to this split (a minimal construction sketch appears after the figure below). The newly constructed “refusal-aware data” is then used to fine-tune the LLM, enabling it to recognize and decline unknown questions.
The workflow of constructing refusal-aware data. (Image source: R-Tuning: Teaching Large Language Models to Refuse Unknown Questions)

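The two steps can be sketched in a few lines of code. The suffix templates below are hypothetical stand-ins for the paper's actual padding prompt, and `predict` is an assumed helper that returns the model's one-pass answer for a question.

```python
# Sketch of refusal-aware data construction: split examples by whether the
# model's own prediction matches the label, then append a sureness expression.
# The suffix wording is a hypothetical template, not the paper's exact prompt.

SURE_SUFFIX = " Are you sure you accurately answered the question? I am sure."
UNSURE_SUFFIX = " Are you sure you accurately answered the question? I am unsure."


def build_refusal_aware_data(examples, predict):
    """examples: list of {'question': str, 'answer': str}; predict: question -> model answer."""
    refusal_aware = []
    for ex in examples:
        prediction = predict(ex["question"])
        # Certain if the one-pass prediction matches the reference label.
        is_certain = prediction.strip() == ex["answer"].strip()
        suffix = SURE_SUFFIX if is_certain else UNSURE_SUFFIX
        refusal_aware.append({"question": ex["question"], "answer": ex["answer"] + suffix})
    return refusal_aware
```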

The purpose of R-Tuning is to alleviate hallucination in LLMs when they face unknown questions. However, it does not take human preference over responses into consideration.

Direct Preference Optimization

DPO (Direct Preference Optimization)3, which evolved from the pair-wise formulation of the reward model introduced in InstructGPT4, simplifies the RLHF (Reinforcement Learning from Human Feedback) process into a one-step optimization. The loss function is reformulated as follows: $$\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}[\log\sigma(\beta\log\frac {\pi_{\theta}(y_w|x)} {\pi_{ref}(y_w|x)} - \beta\log\frac {\pi_{\theta}(y_l|x)} {\pi_{ref}(y_l|x)} )]$$ where $y_w$ represents the accepted or preferred response, while $y_l$ represents the rejected or less preferred one. This formulation shows that DPO directly optimizes the margin between the policy's log-probability gain on the preferred response and that on the rejected response, effectively enhancing the model's ability to generate preferred outputs.
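
A minimal sketch of this loss (assuming the per-sequence log-probabilities of $y_w$ and $y_l$ under the policy and the frozen reference model are computed elsewhere; the function name is illustrative and $\beta=0.1$ is just a commonly used default):

```python
# DPO loss sketch: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```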

KTO

Sometimes, a preference dataset in paired format is hard to obtain. In such cases, we can use a set of preference data where each sample only carries a label of ‘1’ for acceptance or ‘-1’ for rejection. KTO5 is proposed for this scenario. Its mathematical formulation is: $$\mathcal{L}_{KTO}(\pi_\theta,\pi_{ref}) = \mathbb{E}_{x,y\sim{D}}[\lambda_y-v(x,y)]$$ where $$r_\theta(x, y) = \log\frac{\pi_{\theta}(y|x)} {\pi_{ref}(y|x)}$$ $$z_0 = \text{KL}(\pi_{\theta}(y^{\prime}|x) \,||\, \pi_{ref}(y^{\prime}|x))$$ $$v(x, y) = \begin{cases} \lambda_D\sigma(\beta(r_{\theta}(x,y) - z_0))\ \text{if}\ y \sim y_{desirable}|x\newline \lambda_U\sigma(\beta(z_0-r_{\theta}(x, y)))\ \text{if}\ y\sim y_{undesirable}|x\end{cases}$$ For stable training, the authors choose not to backpropagate through $z_0$.
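
A minimal sketch of the per-example value $v(x, y)$ and the resulting loss (assuming the policy/reference log-probabilities, a boolean desirability mask, and an externally estimated $z_0$ tensor are provided; names and default hyperparameters are illustrative):

```python
# KTO loss sketch: lambda_y - v(x, y), averaged over the batch.
# z0 is the KL estimate computed from unrelated responses y'; it is detached
# so no gradient flows through it, matching the stability note above.

import torch


def kto_loss(policy_logps, ref_logps, is_desirable, z0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    r = policy_logps - ref_logps  # r_theta(x, y) = log pi_theta / pi_ref
    z0 = z0.detach()
    v = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (r - z0)),  # desirable branch
        lambda_u * torch.sigmoid(beta * (z0 - r)),  # undesirable branch
    )
    lam = torch.where(is_desirable,
                      torch.full_like(r, lambda_d),
                      torch.full_like(r, lambda_u))
    return (lam - v).mean()
```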

SPPO

Self-Play Preference Optimization (SPPO)6 approximates the Nash equilibrium and thereby enjoys a theoretical convergence guarantee. This method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, something that symmetric pairwise losses such as Direct Preference Optimization (DPO) or Identity Preference Optimization (IPO) struggle to achieve.

More concretely, the researchers deploy PairRM7 as the preference model.

Figure 4. SPPO algorithm illustration with pseudo code. (Image source: Self-Play Preference Optimization for Language Model Alignment)

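To my understanding of the SPPO paper, each iteration fits a squared-error objective that pushes the policy's log-ratio against the previous-iteration policy toward a centered, scaled win probability estimated by the preference model (PairRM here). Treat the exact form, the names, and the $\eta$ scaling below as assumptions rather than the authors' implementation.

```python
# Hedged sketch of an SPPO-style per-response objective: match the log-ratio
# log(pi_theta / pi_t) to eta * (P(y beats the current policy | x) - 1/2),
# where the win probability comes from a preference model such as PairRM.
# All inputs are assumed to be torch tensors; eta is a tuning hyperparameter.


def sppo_loss(policy_logps, prev_policy_logps, win_prob, eta):
    """Squared error between the log-ratio and the centered, scaled win rate."""
    log_ratio = policy_logps - prev_policy_logps
    target = eta * (win_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()
```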