Rewrite-Retrieve-Read
This work introduces a new framework, Rewrite-Retrieve-Read1, in place of the previous retrieve-then-read pipeline for retrieval-augmented LLMs, approaching the problem from the perspective of query rewriting. In this framework, a small language model is adopted as a trainable rewriter to cater to the downstream LLM.
Figure 1. Overview of the proposed pipeline. (a) Standard retrieve-then-read method. (b) LLM as a query rewriter. (c) Pipeline with a trainable rewriter. (Image source: Query Rewriting for Retrieval-Augmented Large Language Models)
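Below is a minimal Python sketch of this pipeline (variant (c) in Figure 1). The `rewrite`, `search`, and `reader_llm` functions are hypothetical placeholders for the trainable small rewriter, the retriever/search engine, and the frozen downstream LLM, not the paper's actual implementation.

```python
# Sketch of the rewrite-retrieve-read pipeline (variant (c) in Figure 1).
# `rewrite`, `search`, and `reader_llm` are placeholders; swap in a trained
# small rewriter, a real retrieval backend, and a frozen LLM.

def rewrite(question: str) -> str:
    """Small trainable rewriter: turn the raw question into a search-friendly query."""
    return question  # placeholder: identity rewrite

def search(query: str, k: int = 5) -> list[str]:
    """Retriever / web search: return top-k documents for the rewritten query."""
    return [f"doc {i} for: {query}" for i in range(k)]  # placeholder hits

def reader_llm(prompt: str) -> str:
    """Frozen downstream LLM that reads the retrieved context and answers."""
    return "answer"  # placeholder generation

def rewrite_retrieve_read(question: str) -> str:
    query = rewrite(question)                      # 1. rewrite
    docs = search(query)                           # 2. retrieve
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return reader_llm(prompt)                      # 3. read

print(rewrite_retrieve_read("Who won the 2022 World Cup?"))
```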
EfficientRAG
Standard RAG struggles to handle complex questions such as multi-hop queries. In this paper, the authors introduce EfficientRAG, which iteratively generates new queries over successive retrieval rounds.
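A rough sketch of such an iterative retrieval loop is below; the `retrieve`, `next_query`, and `answerable` helpers are hypothetical stand-ins, not EfficientRAG's actual components.

```python
# Sketch of an iterative multi-hop retrieval loop in the spirit of EfficientRAG:
# keep generating follow-up queries from newly retrieved chunks until the
# question looks answerable or a hop budget is reached.
# `retrieve`, `next_query`, and `answerable` are hypothetical placeholders.

def retrieve(query: str, k: int = 3) -> list[str]:
    return [f"chunk for '{query}' #{i}" for i in range(k)]  # placeholder retriever

def next_query(question: str, chunks: list[str]) -> str:
    # Placeholder for generating the next-hop query from what was retrieved so far.
    return f"follow-up about {question}"

def answerable(question: str, chunks: list[str]) -> bool:
    return len(chunks) >= 6  # placeholder stopping test

def iterative_retrieve(question: str, max_hops: int = 3) -> list[str]:
    collected: list[str] = []
    query = question
    for _ in range(max_hops):
        collected.extend(retrieve(query))
        if answerable(question, collected):
            break
        query = next_query(question, collected)  # new query for the next hop
    return collected

print(iterative_retrieve("Which country is the director of Film X from?"))
```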
RankRAG
LLMs are not good at reading many chunked contexts (e.g., top-100), even with a long context window. RankRAG designs an RAG instruction-tuning pipeline that uses a single language model to achieve both high-recall context extraction and high-quality content generation. Its most significant contribution is that both context ranking and answer generation are handled within one framework.
Figure 1. The pipeline of RankRAG. (Image source: RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs)
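A simplified sketch of the rank-then-generate idea is below, assuming the same model is prompted once per context for relevance scoring and once for generation; `llm_relevance_score` and `llm_generate` are placeholders, not RankRAG's exact prompts.

```python
# Sketch of RankRAG-style inference: the same instruction-tuned LLM first
# scores each retrieved context for relevance, then generates the answer
# from only the top-ranked ones. Both functions below are placeholders for
# two prompts to the same model.

def llm_relevance_score(question: str, context: str) -> float:
    # Placeholder scoring: simple word overlap instead of an LLM relevance judgment.
    return float(len(set(question.lower().split()) & set(context.lower().split())))

def llm_generate(question: str, contexts: list[str]) -> str:
    return "answer"  # placeholder generation

def rankrag_answer(question: str, retrieved: list[str], top_k: int = 5) -> str:
    # Rank the (possibly large, e.g. top-100) retrieved list with the LLM itself ...
    ranked = sorted(retrieved, key=lambda c: llm_relevance_score(question, c), reverse=True)
    # ... then read only the best top_k contexts when generating.
    return llm_generate(question, ranked[:top_k])

docs = ["Paris is the capital of France", "bananas are yellow", "France is in Europe"]
print(rankrag_answer("What is the capital of France", docs, top_k=2))
```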
- Stage I: Supervised Fine-Tuning (SFT). The authors use 128K SFT examples in total (e.g., OpenAssistant, Dolly, SODA, ELI5, Self-Instruct, Unnatural Instructions), adopt the multi-turn conversation format, use the previous turns of conversation between user and assistant as the context, and compute the loss only on the last response from the assistant (see the masking sketch after this list).
- Stage II: Unified Instruction-Tuning for Ranking and Generation
Stage II consists of the following parts:
- SFT data from Stage I: included to maintain the instruction-following capability.
- Context-rich QA data: i) standard QA and reading-comprehension datasets: DROP, NarrativeQA, Quoref, ROPES, NewsQA, TAT-QA; ii) conversational QA datasets: HumanAnnotatedConvQA, SyntheticConvQA.
- Retrieval-augmented QA data: SQuAD, WebQuestions. In these two datasets, not all retrieved contexts contain the answer, so they can be thought of as involving 'hard-negative' contexts.
- Context ranking data: MS MARCO passage ranking dataset.
- Retrieval-augmented ranking data: SQuAD, WebQuestions. For each example, a gold context is combined with other contexts retrieved using BM25, and the LLM is trained to explicitly identify all contexts relevant to the question.
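As noted under Stage I, the loss is computed only on the last assistant response. A minimal sketch of that label masking is below, using a toy tokenizer and the common `-100` ignore index; the details are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of the Stage-I loss masking: the previous conversation turns
# serve as context and only the final assistant response contributes to the
# loss (labels for earlier tokens are set to -100, the ignore index used by
# typical cross-entropy implementations). The tokenizer here is a toy
# whitespace tokenizer, purely for illustration.

IGNORE_INDEX = -100

def tokenize(text: str) -> list[int]:
    return [hash(tok) % 50_000 for tok in text.split()]  # toy tokenizer

def build_sft_example(turns: list[tuple[str, str]]) -> tuple[list[int], list[int]]:
    """turns: list of (role, text); the last turn is the assistant response."""
    input_ids: list[int] = []
    labels: list[int] = []
    for i, (role, text) in enumerate(turns):
        ids = tokenize(f"{role}: {text}")
        input_ids.extend(ids)
        if i == len(turns) - 1 and role == "assistant":
            labels.extend(ids)                        # train on the final response
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask all earlier turns
    return input_ids, labels

ids, labels = build_sft_example([
    ("user", "What is RAG?"),
    ("assistant", "Retrieval-augmented generation."),
    ("user", "Why rank contexts?"),
    ("assistant", "Because too many chunks hurt generation."),
])
print(labels)
```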
Finally, all of the above data are cast into a standardized QA form ($x$, $c$, $y$), where $x$ is the question, $c$ is the corresponding context, and $y$ is the target output answer.
Figure 2. Converting a question, context, and answer into the standardized QA form. (Image source: RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs)
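A small sketch of casting an example into this ($x$, $c$, $y$) form follows; the prompt wording is illustrative rather than the exact template shown in Figure 2.

```python
# Sketch of casting heterogeneous training data into the standardized
# (x, c, y) QA form described above. The prompt wording below is illustrative,
# not the exact template from the paper (see Figure 2 for that).

def to_standard_qa(x: str, c: list[str], y: str) -> dict[str, str]:
    context_block = "\n".join(f"[{i + 1}] {passage}" for i, passage in enumerate(c))
    prompt = (
        f"Context:\n{context_block}\n\n"
        f"Question: {x}\n"
        "Answer:"
    )
    return {"input": prompt, "output": y}

# A ranking example can be cast into the same form by making the target output
# the ids of the relevant contexts instead of a free-form answer.
qa = to_standard_qa(
    x="Which contexts are relevant to: who wrote Hamlet?",
    c=["Hamlet was written by William Shakespeare.", "Bananas are yellow."],
    y="[1]",
)
print(qa["input"])
```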
ChatQA2
ChatQA22 (project) is designed to bridge the gap between open-source LLMs and proprietary models in long-context understanding and retrieval-augmented generation capabilities. The current RAG pipeline has a few issues:
- Top-k chunk-wise retrieval introduces non-negligible fragmentation of the context needed to generate accurate answers. For instance, many retrievers only support 512 tokens per chunk, which is too short from today's point of view.
- A small top-k usually leads to low recall, while a much larger k can lead to worse generation, as current LLMs are not good at utilizing too many chunks at the same time. To address these issues, the authors finetune LLMs to support longer-context understanding and leverage a long-context retriever to recall relevant chunks.
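The sketch below illustrates the resulting trade-off: a long-context retriever allows longer chunks, and a long-context LLM can then consume a larger top-k. The chunk size (1.2K tokens) and context budget (100K tokens) are assumed numbers for illustration, not the paper's configuration.

```python
# Sketch of the ChatQA2-style trade-off: longer chunks reduce fragmentation,
# and a long-context LLM can consume many ranked chunks up to a token budget.
# Token counts are approximated with whitespace tokens; the sizes are
# illustrative numbers only.

def chunk(text: str, chunk_tokens: int = 1200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_tokens]) for i in range(0, len(words), chunk_tokens)]

def select_chunks(ranked_chunks: list[str], context_budget_tokens: int = 100_000) -> list[str]:
    selected, used = [], 0
    for c in ranked_chunks:  # assumed already ranked by a long-context retriever
        n = len(c.split())
        if used + n > context_budget_tokens:
            break
        selected.append(c)
        used += n
    return selected

doc = "word " * 10_000
chunks = chunk(doc)
print(len(chunks), "chunks;", len(select_chunks(chunks)), "fit within the budget")
```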
Some interesting tips:
- separate chunks with special characters (e.g., …) rather than the reserved beginning and ending tokens
- SFT data shorter than 32K is from LongAlpaca 12K and GPT-4 samples from Open Orca; SFT data with length between 32K and 128K is from synthetic datasets.
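A tiny sketch of the chunk-separation tip; the separator string below is a placeholder, since the exact characters used in the paper are not reproduced above.

```python
# Join retrieved chunks with an explicit separator string instead of reusing
# the model's reserved beginning/ending tokens. The separator is a placeholder,
# not the paper's exact choice of special characters.

CHUNK_SEPARATOR = "\n<chunk>\n"  # placeholder separator

def pack_context(chunks: list[str]) -> str:
    return CHUNK_SEPARATOR.join(chunks)

print(pack_context(["First retrieved chunk.", "Second retrieved chunk."]))
```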
- Ma et al., Query Rewriting for Retrieval-Augmented Large Language Models, 2023 ↩︎
- Xu et al., ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities, 2024 ↩︎