Rewrite-Retrieve-Read

This work introduces a new framework, Rewrite-Retrieve-Read, which replaces the previous retrieve-then-read pipeline for retrieval-augmented LLMs by approaching the problem from the perspective of query rewriting. In this framework, a small language model is adopted as a trainable rewriter that tailors queries to the downstream LLM.

Figure 1. Overview of the proposed pipeline. (a) standard retrieve-then-read method. (b) LLM as a query rewriter. (c) pipeline with a trainable rewriter. (Image source: Query Rewriting for Retrieval-Augmented Large Language Models)

As the figure shows, a complex query can be split into several sub-queries, which helps the retriever recall precise contexts more efficiently. In practice, the authors train the rewriter with reinforcement learning, which is undoubtedly a costly process.
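
As a rough illustration, the pipeline in panel (c) can be sketched as below. The `small_rewriter`, `retriever`, and `reader_llm` objects and the prompt templates are assumptions for illustration only, not the authors' exact implementation.

```python
# Minimal sketch of the rewrite-retrieve-read pipeline (panel (c) above).
# `small_rewriter`, `retriever`, and `reader_llm` are assumed to expose simple
# generate()/search() methods; the prompts are illustrative, not the paper's.

def rewrite_retrieve_read(small_rewriter, retriever, reader_llm,
                          question: str, top_k: int = 5) -> str:
    # 1) A small trainable LM rewrites the question into a search-friendly query.
    rewrite_prompt = (
        "Rewrite the question into a query for a search engine.\n"
        f"Question: {question}\nQuery:"
    )
    query = small_rewriter.generate(rewrite_prompt)

    # 2) Retrieve contexts with the rewritten query instead of the raw question.
    contexts = retriever.search(query, top_k=top_k)

    # 3) The frozen downstream LLM reads the contexts and answers.
    read_prompt = "\n\n".join(contexts) + f"\n\nQuestion: {question}\nAnswer:"
    return reader_llm.generate(read_prompt)
```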

EfficientRAG

Standard RAG struggles to handle complex questions such as multi-hop queries. In this paper, the authors introduce EfficientRAG, which iteratively generates new queries to gather the evidence that multi-hop questions require.
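
The iterative querying can be sketched roughly as follows; the `query_generator` module and its interface are hypothetical stand-ins for the paper's lightweight components, shown only to convey the loop structure.

```python
# A generic iterative-retrieval loop in the spirit of EfficientRAG: a lightweight
# module (not the downstream LLM) proposes a follow-up query from the chunks
# gathered so far. Component names and interfaces here are hypothetical.

def iterative_retrieve(query_generator, retriever, question: str,
                       max_hops: int = 3, top_k: int = 5) -> list[str]:
    collected: list[str] = []
    query = question
    for _ in range(max_hops):
        chunks = retriever.search(query, top_k=top_k)
        collected.extend(chunks)
        # Decide whether enough evidence has been gathered; if not, form the next query.
        query, done = query_generator.next_query(question, collected)
        if done:
            break
    return collected
```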

RankRAG

LLMs are not good at reading a large number of chunked contexts (e.g., top-100), even with a long context window. RankRAG designs a RAG instruction-tuning pipeline in which a single language model performs both high-recall context ranking and high-quality answer generation. Its most significant contribution is that context ranking and answer generation are handled jointly within one framework.

Figure 1. The pipeline of RankRAG. (Image source: RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs)

  1. Stage I: Supervised Fine-Tuning (SFT). The authors use 128K SFT examples in total (e.g., OpenAssistant, Dolly, SODA, ELI5, Self-Instruct, Unnatural Instructions), adopt a multi-turn conversation format, use the previous turns of conversation between user and assistant as the context, and compute the loss only on the assistant's final response.
  2. Stage II: Unified Instruction-Tuning for Ranking and Generation. Stage II consists of the following parts:
    1. SFT data from Stage I: retained to maintain the instruction-following capability.
    2. Context-rich QA data: i) standard QA and reading comprehension datasets: DROP, NarrativeQA, Quoref, ROPES, NewsQA, TAT-QA; ii) conversational QA datasets: HumanAnnotatedConvQA, SyntheticConvQA.
    3. Retrieval-augmented QA data: SQuAD, WebQuestions. In these two datasets, not all retrieved contexts contain the answer, so they can be regarded as providing ‘hard-negative’ contexts.
    4. Context ranking data: MS MARCO passage ranking dataset.
    5. Retrieval-augmented ranking data: SQuAD, WebQuestions. For each example, the gold context is combined with other contexts retrieved by BM25, and the LLM is trained to explicitly identify all contexts relevant to the question. Finally, all of the above data are cast into a standardized QA form ($x$, $c$, $y$), where $x$ is the question, $c$ is the corresponding context, and $y$ is the target answer.
      Figure 2. Casting question, context, and answer into the standardized QA form. (Image source: RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs)

      RankRAG Inference: Retrieve-Rerank-Generate Pipeline. The RankRAG inference process works as follows: 1) the retriever $\mathcal{R}$ retrieves the top-$N$ contexts from the knowledge base; 2) the RankRAG model computes a relevance score between the question and each of the $N$ retrieved contexts and retains only the top-$k$; 3) the top-$k$ contexts, along with the question, are assembled into a long prompt and fed into the RankRAG model to generate the final answer.
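
A minimal sketch of this retrieve-rerank-generate flow, assuming the RankRAG model exposes separate relevance-scoring and generation calls (method names are illustrative, not the paper's API):

```python
# Sketch of RankRAG inference: one model both reranks the retrieved contexts and
# generates the answer. `relevance()` and `generate()` are assumed method names.

def rankrag_inference(retriever, rankrag_model, question: str,
                      n_retrieve: int = 100, k: int = 5) -> str:
    # 1) Retrieve top-N candidate contexts from the knowledge base.
    candidates = retriever.search(question, top_k=n_retrieve)

    # 2) Score each context with the RankRAG model and keep only the top-k.
    ranked = sorted(candidates,
                    key=lambda c: rankrag_model.relevance(question, c),
                    reverse=True)
    top_contexts = ranked[:k]

    # 3) Feed the top-k contexts plus the question back into the same model.
    prompt = "\n\n".join(top_contexts) + f"\n\nQuestion: {question}\nAnswer:"
    return rankrag_model.generate(prompt)
```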

ChatQA2

ChatQA2 (project) is designed to bridge the gap between open-source LLMs and proprietary models in long-context understanding and retrieval-augmented generation. The current RAG pipeline has a few issues:

  1. Top-k chunk-wise retrieval introduces non-negligible context fragmentation, which hurts accurate answer generation. For instance, many retrievers only support chunks of up to 512 tokens, which is rather short by today's standards.
  2. A small top-k usually leads to low recall, while a much larger k can degrade generation, because current LLMs are not good at utilizing many chunks at once. To address these issues, the authors fine-tune LLMs to support longer-context understanding and leverage a long-context retriever to recall relevant chunks (a minimal sketch follows below).
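
A minimal sketch of that retrieval setup, assuming a long-context retriever over larger chunks and an illustrative context budget; all names and numbers here are assumptions, not the paper's configuration.

```python
# Illustrative sketch of long-context RAG as described above: retrieve larger
# chunks with a long-context retriever, then keep only as many top-ranked chunks
# as fit the reader's context budget. Chunk sizes and budgets are assumptions.

def retrieve_for_long_context_reader(retriever, question: str,
                                     top_k: int = 20,
                                     context_budget_tokens: int = 32_000,
                                     tokens_per_chunk: int = 1_200) -> list[str]:
    # Larger chunks reduce fragmentation; a long-context retriever can embed them.
    ranked_chunks = retriever.search(question, top_k=top_k)

    # Cap the number of chunks so the concatenated context fits the reader LLM.
    max_chunks = max(1, context_budget_tokens // tokens_per_chunk)
    return ranked_chunks[:max_chunks]
```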

Some interesting tips:

  1. Separate chunks with special characters rather than the model's reserved beginning and ending (BOS/EOS) tokens.
  2. SFT data shorter than 32K tokens is drawn from LongAlpaca 12K and GPT-4 samples from Open Orca; SFT data with lengths between 32K and 128K tokens comes from synthetic datasets.
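
For tip 1, a hypothetical illustration of joining chunks with a dedicated separator string instead of the tokenizer's reserved tokens; the separator value here is an assumption, not the exact characters used in the paper.

```python
# Hypothetical illustration of tip 1: mark chunk boundaries with a dedicated
# separator string rather than the reserved BOS/EOS tokens, which the tokenizer
# adds once around the whole sequence. The separator value is an assumption.
CHUNK_SEP = "<chunk_sep>"

def build_context(chunks: list[str]) -> str:
    # Chunk boundaries are marked with CHUNK_SEP; BOS/EOS wrap the full sequence later.
    return CHUNK_SEP.join(chunks)
```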