April 5, 2024
3 mins

RAFT: Adapting Language Model to Domain Specific RAG

With RAFT, an LLM is fine-tuned to ignore distracting documents and focus on the relevant information for a given task. Combined with chain-of-thought (CoT) reasoning, this helps the model weight relevant information correctly, significantly improving generation quality and downstream task performance.
Paper Link

Key Takeaways

  • RAFT is a novel training recipe that improves the ability of Large Language Models (LLMs) to answer questions in an "open-book" setting within a specific domain.
  • RAFT trains LLMs to identify and ignore irrelevant documents while focusing on relevant information to answer questions accurately.
  • This technique outperforms existing methods like supervised fine-tuning and RAG on various benchmarks, including PubMed, HotpotQA, and Gorilla API Bench.
  • RAFT demonstrates the potential of smaller, fine-tuned models to achieve comparable performance to larger, generic LLMs in domain-specific question-answering tasks.

Introduction

Large Language Models (LLMs) have revolutionized various natural language processing tasks, demonstrating remarkable capabilities in general knowledge reasoning. However, adapting these models to specialized domains, such as legal documents, medical records, or company-specific information, remains a challenge.

This paper focuses on improving LLMs' performance in domain-specific question answering (QA) tasks. Existing approaches, such as supervised fine-tuning and Retrieval-Augmented Generation (RAG), have limitations. Supervised fine-tuning often fails to leverage external knowledge sources, while RAG struggles to handle irrelevant information effectively.

Core Insight

RAFT addresses these limitations by introducing a novel training recipe that combines the strengths of supervised fine-tuning and RAG. The core idea is to train LLMs to differentiate between relevant and irrelevant documents while answering questions in an "open-book" setting. This enables the model to focus on pertinent information and generate accurate answers.

In simple terms

Imagine preparing for an open-book exam. You wouldn't simply memorize the entire textbook; instead, you'd learn to identify and focus on relevant sections. RAFT applies this principle to LLMs, training them to ignore "distractor" documents and extract key information from relevant ones. This improves their ability to answer questions accurately within a specific domain.

Methodology

RAFT involves the following steps (a minimal sketch of the resulting training example follows the list):

  1. Data Preparation: The training data consists of question-answer pairs and associated documents. These documents are categorized as either "oracle" documents (containing the answer) or "distractor" documents (irrelevant to the answer).
  2. Training with Distractors: The LLM is trained to answer questions using the provided documents, including both oracle and distractor documents. This helps the model learn to identify and ignore irrelevant information.
  3. Chain-of-Thought Reasoning: RAFT encourages the model to generate answers in a chain-of-thought style, explaining its reasoning process and citing relevant passages from the documents. This improves transparency and reasoning ability.
  4. Evaluation: The trained model is evaluated on domain-specific QA benchmarks, where it is provided with questions and top-k retrieved documents. The model's ability to identify relevant information and generate accurate answers is assessed.
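Below is a minimal sketch of how a single RAFT-style training example might be assembled from steps 1-3. The function name `build_raft_example`, the `[Document N]` context layout, and the field names are illustrative assumptions, not the paper's exact format; the `##begin_quote##`/`##end_quote##` markers follow the citation style the authors describe for chain-of-thought answers.

```python
import random

def build_raft_example(question, oracle_doc, distractor_docs, cot_answer, num_distractors=4):
    """Assemble one RAFT-style training example: the question plus a shuffled
    mix of the oracle document and sampled distractor documents, paired with a
    chain-of-thought completion that cites the oracle passage."""
    docs = random.sample(distractor_docs, k=num_distractors) + [oracle_doc]
    random.shuffle(docs)  # avoid teaching the model a positional shortcut
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": cot_answer}

# Hypothetical usage: the completion reasons over the context and quotes the
# oracle passage before stating the final answer.
example = build_raft_example(
    question="Which enzyme does drug X inhibit?",
    oracle_doc="...drug X is a selective inhibitor of enzyme Y...",
    distractor_docs=["Unrelated trial results...", "A review of enzyme Z...",
                     "Marketing copy for drug W...", "A dosage guideline..."],
    cot_answer=("The relevant passage states: ##begin_quote## drug X is a selective "
                "inhibitor of enzyme Y ##end_quote##, so the answer is enzyme Y."),
)
```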

Evaluation

RAFT consistently outperforms existing methods on various benchmarks, demonstrating significant improvements in domain-specific question answering. The inclusion of distractor documents during training enhances the model's robustness and ability to handle irrelevant information effectively.

Results

Compared with the base Llama-2 instruction-tuned model, RAFT with RAG does much better at extracting information and at staying robust to distractors. The gain can be as large as 35.25% on HotpotQA and 76.35% on the Torch Hub evaluation. Compared with domain-specific fine-tuning (DSF) on the same dataset, RAFT is better at relying on the provided context to solve the problem, with large gains on HotpotQA and the HuggingFace dataset (30.87% and 31.41%, respectively).

Effect of Chain of Thought

Incorporating a reasoning chain not only guides the model to the right answer but also helps it understand the task, improving overall accuracy and robustness.

A key finding here is:

incorporating a portion of the training data without the oracle document in the context (P = 80%) appears to enhance the model’s performance on RAG tasks.

That is, sometimes training your LLM without the corresponding correct context can be beneficial for the downstream task of answering questions about the documents. This is counterintuitive; the "Why does this work?" section below explains why it happens.
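Here is a minimal sketch of how that mixing fraction might be applied when building the training set. The function name `make_training_set`, the dictionary keys, and the prompt layout are assumptions for illustration; only the idea of withholding the oracle document from a (1 - P) fraction of examples comes from the paper.

```python
import random

def make_training_set(qa_pairs, p_oracle=0.8, num_distractors=4):
    """Mix examples with and without the oracle document in context.
    For a fraction p_oracle of examples the oracle document is included;
    for the rest, the context holds only distractors, nudging the model to
    also internalise domain knowledge rather than always copy from context."""
    dataset = []
    for qa in qa_pairs:  # each qa: question, oracle_doc, distractors, cot_answer
        docs = random.sample(qa["distractors"], k=num_distractors)
        if random.random() < p_oracle:
            docs = docs + [qa["oracle_doc"]]   # oracle present
        # else: oracle deliberately withheld
        random.shuffle(docs)
        context = "\n\n".join(docs)
        dataset.append({
            "prompt": f"{context}\n\nQuestion: {qa['question']}\nAnswer:",
            "completion": qa["cot_answer"],
        })
    return dataset
```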

RAFT generalizes to top-k RAG results

RAFT maintains consistent performance even when the number of documents retrieved at test time varies, because the model is fine-tuned to ignore irrelevant text (distractor documents). In real-world applications, the amount of available information fluctuates; for example, a search engine might return a different number of results depending on the query.

Many RAG pipelines that train only on oracle (relevant) documents, without distractors, end up generating imperfect output because the model never learns which parts of the prompt to ignore.

RAFT generalizes because of the way it's trained:

  1. Training with Distractors: As mentioned earlier, RAFT includes distractor documents during training. This forces the model to learn how to differentiate between relevant and irrelevant information. This skill becomes crucial when dealing with varying numbers of documents at test time.
  2. Varying Distractor Numbers: RAFT experiments with different numbers of distractor documents during training. This exposes the model to different levels of "noise" and helps it develop robustness to varying amounts of context.

In fact, this also generalizes to answering real-time search queries via an LLM (the Perplexity AI use case): the model recognizes which parts of the prompt are not relevant and ignores them, producing a more accurate answer.
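As a small illustration, a RAFT-style model can use the same prompt template at inference regardless of how many documents the retriever returns. The helper name `build_inference_prompt` and the `retriever.search(...)` call in the usage comment are hypothetical placeholders, not an API from the paper.

```python
def build_inference_prompt(question, retrieved_docs):
    """Assemble the test-time prompt from whatever the retriever returns.
    Because the model was fine-tuned with distractors in context, the same
    template works whether the retriever returns 3 documents or 10."""
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(retrieved_docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical usage with a retriever whose top-k varies per query:
# prompt = build_inference_prompt(query, retriever.search(query, k=top_k))
# answer = model.generate(prompt)
```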

Business Implications

Beyond basic question-answering applications, RAFT has significant potential to benefit businesses in various ways:

  • Improved Customer Service: RAFT can be used to develop intelligent chatbots that can access and process domain-specific information, such as product manuals or internal knowledge bases, to provide accurate and efficient customer support.
  • Enhanced Decision Making: By enabling LLMs to analyze and extract insights from domain-specific documents, RAFT can support data-driven decision making in areas like finance, healthcare, and marketing.
  • Streamlined Information Retrieval: RAFT can improve information retrieval systems by enabling LLMs to identify and prioritize relevant documents, reducing information overload and saving time.

Conclusion

RAFT presents a promising approach for training LLMs to excel in domain-specific question answering. Its ability to handle irrelevant information and adapt to different retrieval settings makes it a valuable tool for various business applications. As research in this area continues, we can expect to see even more powerful and efficient LLMs that can effectively leverage domain-specific knowledge to solve complex problems.

Why does this work?

RAFT gives better results than existing methods for domain-specific question answering because it trains LLMs to be more robust to irrelevant information. This is achieved by including distractor documents during the training process.

  • Existing methods: Traditional supervised fine-tuning typically trains LLMs on question-answer pairs without considering the context of external documents. This can lead to models that are good at memorizing answers but struggle to extract relevant information from documents when needed. On the other hand, existing RAG methods often rely on the retriever to provide only relevant documents, which doesn't prepare the model for real-world scenarios where irrelevant information might be present.
  • RAFT's approach: RAFT addresses this by explicitly training the LLM with both relevant and irrelevant documents. This forces the model to learn how to differentiate between the two and focus on the information that is actually helpful for answering the question. This makes the model more robust to the imperfections of real-world retrieval systems, where irrelevant documents might be included in the top-k results.

In essence, RAFT prepares the LLM for the "open-book" exam scenario by simulating the presence of distractor documents during training. This allows the model to develop the ability to sift through irrelevant information and focus on what's truly important for answering the question accurately.

Relevant Links

Blog via Microsoft

UC Berkeley's post on RAFT

Github Repo


Why Clio AI?

Unlock the most obvious-yet-hidden-in-plain-sight growth hack - enable your employees to work on important things, and reduce their cognitive load and time to resolve blockers.

Fast, efficient, and in-context information to make every employee a super performer.

Spend time thinking not searching. Get a demo today.

By signing up for a demo, you agree to our Privacy Policy.