April 5, 2024
4 mins

Gecko - Versatile Text Embeddings Distilled from Large Language Models

Gecko - from Google DeepMind - is a new text embedding model that uses a two-step LLM distillation process to create a high-quality training dataset, which leads to better model performance.

Key Takeaways:

  • Gecko is a compact and versatile text embedding model that achieves strong retrieval performance by distilling knowledge from large language models (LLMs).
  • Gecko utilizes a two-step distillation process:
    • Generating diverse, synthetic paired data using an LLM: This involves prompting an LLM to generate relevant tasks and queries for a large corpus of unlabeled passages.
    • Refining data quality by retrieving and relabeling passages: An initial embedding model retrieves candidate passages for each query, and the LLM reranks them to identify positive and hard negative passages.
  • Gecko outperforms existing text embedding models with similar size and dimensionality on the Massive Text Embedding Benchmark (MTEB).
  • Gecko demonstrates strong zero-shot performance, indicating its ability to generalize to new tasks and domains.
  • Gecko offers potential for various business applications, including improving search engines, building chatbots, and analyzing customer feedback.

Introduction

Text embedding models play a crucial role in various natural language processing (NLP) and generative AI tasks by representing text as dense vectors. These embeddings capture semantic relationships between words and sentences, enabling applications like document retrieval, sentence similarity, classification, and clustering.

While recent efforts have focused on developing general-purpose text embedding models, these models often require vast amounts of training data. Gecko addresses this challenge by leveraging the knowledge contained within large language models (LLMs) through a knowledge distillation approach. This allows Gecko to achieve strong performance with a compact model size and lower dimensionality compared to other models.

Approach

Gecko applies insights from knowledge distillation through a two-step, LLM-powered process that builds the training data for the embedding model.

Figure: Overview of Gecko. Gecko is a versatile text embedding model trained on a variety of tasks including document retrieval, semantic similarity, and classification. To train Gecko, we utilize FRet where queries are generated from LLMs, and their positive and negative passages are mined by LLMs.

Step 1: LLM-based Diverse Query Generation

  1. Start with a large corpus of unlabeled passages.
  2. Use a few-shot prompted LLM to generate a relevant task and query for each passage. This ensures that the training data covers a diverse range of tasks and linguistic patterns.
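
To make Step 1 concrete, here is a minimal sketch of such a generation loop. The few-shot examples, the prompt template, and the `call_llm` helper are illustrative assumptions, not the exact prompt or API from the paper.

```python
# Sketch of FRet Step 1: few-shot prompting an LLM to produce a (task, query) pair
# for each unlabeled passage. `call_llm` stands in for any text-in/text-out LLM client.
from typing import Callable

FEW_SHOT_EXAMPLES = """\
Passage: The Eiffel Tower was completed in 1889 for the World's Fair in Paris.
Task: Given a query, find a passage that answers the question.
Query: when was the eiffel tower completed

Passage: Mix flour, sugar, and butter, then bake until golden.
Task: Given a query, retrieve a passage with step-by-step instructions.
Query: how to make shortbread
"""

def generate_task_and_query(passage: str, call_llm: Callable[[str], str]) -> dict:
    prompt = (
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Passage: {passage}\n"
        "Task:"  # the LLM is expected to continue with a task line and a query line
    )
    completion = call_llm(prompt)
    lines = [l.strip() for l in completion.splitlines() if l.strip()]
    task = lines[0]
    query = lines[1].removeprefix("Query:").strip()
    return {"passage": passage, "task": task, "query": query}
```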

Step 2: LLM-based Positive and Negative Mining

  1. Embed the concatenated task and query using a pretrained embedding model to obtain nearest neighbor passages.
  2. Use an LLM to rerank the retrieved passages based on their relevance to the query.
  3. Select the top-ranked passage as the positive target and a lower-ranked passage as the hard negative example. This step refines the data quality and helps the model learn nuanced distinctions.
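
A minimal sketch of Step 2 follows, assuming a pre-computed corpus index, an initial embedding model `embed`, and an LLM-based relevance scorer `score_relevance` (all hypothetical stand-ins). The top-ranked passage becomes the positive and a lower-ranked one the hard negative, mirroring the steps above in simplified form.

```python
# Sketch of FRet Step 2: retrieve candidates with an initial embedder, rerank them
# with an LLM, then pick the positive and a hard negative from the reranked list.
from typing import Callable, Sequence, Tuple
import numpy as np

def mine_positive_and_negative(
    task: str,
    query: str,
    corpus: Sequence[str],
    corpus_emb: np.ndarray,                        # (num_passages, dim), L2-normalized
    embed: Callable[[str], np.ndarray],            # initial embedding model
    score_relevance: Callable[[str, str], float],  # LLM-based query/passage scorer
    k: int = 20,
    negative_rank: int = 10,
) -> Tuple[str, str]:
    # 1. Nearest neighbors of the concatenated task + query under the initial embedder.
    q_emb = embed(f"{task} {query}")
    top_k = np.argsort(-(corpus_emb @ q_emb))[:k]
    candidates = [corpus[i] for i in top_k]

    # 2. Rerank the candidates using the LLM's relevance judgments.
    reranked = sorted(candidates, key=lambda p: score_relevance(query, p), reverse=True)

    # 3. Top-ranked passage -> positive; a lower-ranked passage -> hard negative.
    positive = reranked[0]
    hard_negative = reranked[min(negative_rank, len(reranked) - 1)]
    return positive, hard_negative
```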

The resulting dataset, FRet, is then combined with human-annotated data and used to fine-tune Gecko. This combination of LLM-generated and LLM-ranked data with human-annotated data allows Gecko to achieve strong performance on a variety of tasks.

This two-step distillation process is key to Gecko's success, as it allows the model to leverage the vast knowledge and understanding of LLMs to create a high-quality training dataset.

Context/Related Work

Gecko builds upon several existing concepts in NLP and machine learning:

Text Embedding Models: Existing models like SBERT, Universal Sentence Encoder, and Sentence T5 aim to provide general-purpose embeddings for various NLP tasks. However, they often struggle to generalize across different domains and tasks. Gecko addresses this limitation by leveraging LLM knowledge and a diverse training dataset.

Contrastive Learning: This technique involves training models to distinguish between similar and dissimilar examples. Gecko utilizes contrastive learning by employing LLM-ranked positive and hard negative passages during training.
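
As a rough illustration (not Gecko's exact objective), a dual encoder trained this way typically minimizes a softmax cross-entropy over in-batch positives plus the mined hard negatives, along these lines:

```python
# Minimal in-batch contrastive loss with one hard negative per query; shapes and the
# temperature value are illustrative.
import numpy as np

def in_batch_contrastive_loss(q, pos, neg, temperature=0.05):
    """q, pos, neg: L2-normalized arrays of shape (batch, dim)."""
    batch = q.shape[0]
    # Each query is scored against every positive in the batch (in-batch negatives)
    # and against its own LLM-mined hard negative.
    logits = np.concatenate([q @ pos.T, np.sum(q * neg, axis=1, keepdims=True)], axis=1)
    logits /= temperature                          # (batch, batch + 1)
    # Softmax cross-entropy where the correct "class" for query i is its own positive.
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(batch), np.arange(batch)].mean()
```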

Synthetic Data Generation: LLMs can be used to generate synthetic data for training NLP models. Gecko leverages this capability to create a diverse and task-agnostic training dataset, overcoming the limitations of manually labeled data.

Retrieval with Instructions: Recent research explores incorporating instructions into retrieval tasks to guide the model's behavior. Gecko adopts this concept by generating task descriptions along with queries, allowing the model to adapt to different retrieval objectives.

Training

Gecko is based on a 1.2B parameter pre-trained transformer language model that undergoes two additional training stages: pre-finetuning and fine-tuning.

Gecko's overall training recipe has three components:

Pre-finetuning

Gecko starts with a pre-trained language model, which is further trained on a large corpus of text pairs through self-supervised tasks. This exposes the model to diverse textual data and improves its ability to capture semantic relationships. Training on text pairs has been shown to improve the performance of smaller-scale dual encoders on various downstream tasks, including document retrieval and semantic similarity.
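
For intuition, such self-supervised pairs can come from naturally occurring document structure; the sketch below pairs a title with its body text (an assumed schema, not necessarily the paper's exact pre-finetuning mixture):

```python
# Illustrative pre-finetuning pairs: naturally occurring (title, body) structure serves
# as cheap query/passage supervision, with no human labels required.
def make_prefinetuning_pairs(documents):
    """documents: iterable of dicts with 'title' and 'body' keys (assumed schema)."""
    for doc in documents:
        title, body = doc.get("title", "").strip(), doc.get("body", "").strip()
        if title and body:
            yield {"query": title, "passage": body}
```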

FRet Dataset Generation

FRet stands for Few-shot prompted Retrieval dataset.

This is the core of Gecko's knowledge distillation process. It involves two steps:

  • LLM-based Diverse Query Generation: An LLM is prompted to generate diverse tasks and corresponding queries for a large corpus of web passages. This ensures that the training data covers a wide range of tasks and linguistic patterns.
  • LLM-based Positive and Negative Mining: An initial embedding model retrieves candidate passages for each generated query. Then, the LLM reranks these passages to identify the most relevant positive passage and a hard negative passage. This step refines the data quality and helps the model learn nuanced distinctions.

Unified Fine-tuning Mixture

Gecko is fine-tuned on a mixture of FRet and other academic datasets formatted in a unified way, containing task descriptions, queries, positive passages, and negative passages. This allows the model to learn from both synthetic and human-annotated data, further improving its performance and versatility.
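
As a sketch of what one unified training example might look like (field names are illustrative and reuse the Batman example from the Analysis section below; the paper's exact schema may differ):

```python
# One example in the unified fine-tuning format: a task description, a query, a
# positive passage, and hard negatives, shared across FRet and academic datasets.
example = {
    "dataset": "fret",  # or an academic dataset in the mixture
    "task": "Given a query, find a passage that allows you to check whether the query is true or not.",
    "query": "Batman is from DC comics",
    "positive_passage": "The Batman is an American superhero film based on the DC Comics character ...",
    "hard_negative_passages": [
        '"One of my employees wants to dress up in Batman attire," Gaskins said. ...'
    ],
}

# At training time the task description is prepended to the query, so the model can
# condition on what kind of retrieval is being asked for.
model_input = f'task: {example["task"]} query: {example["query"]}'
```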

Results

Gecko demonstrates superior performance on the MTEB benchmark compared to other text embedding models with similar size and dimensionality. Notably, Gecko achieves strong results even when trained solely on the synthetic FRet dataset, highlighting its zero-shot generalization capabilities.

Gecko also shows promising results on multilingual retrieval tasks, even though the FRet dataset is only available in English. This suggests that the knowledge distilled from LLMs can be effectively transferred to other languages.

Analysis

Several factors contribute to Gecko's success:

LLM-based Relabeling

Using LLMs to identify better positive and negative passages significantly improves the quality of the training data, leading to better model performance. In many cases, the LLM-mined positive passage answers the generated query better than the original seed passage, even though the query was generated from that seed passage. Training on these mined positives instead of the seed passages improves quality.

Example: 

Seed Passage: Tagged: Batman, Robin, DC, DC Comics, Comics, …

Generated Task: Given a query, find a passage that allows you to check whether the query is true or not.

Generated Query: Batman is from DC comics

LLM-mined Positive: The Batman is an American superhero film based on the DC Comics character of the same name. Produced by DC Films and distributed by Warner Bros. Pictures, it is a reboot of the Batman film franchise.

LLM-mined Negative: "One of my employees wants to dress up in Batman attire," Gaskins said. "As long as he’s at work, I told him it was fine." New York Times News Service contributed to this report.

Diversity of FRet

The diversity of tasks and queries within FRet allows Gecko to learn general-purpose representations that can be applied to various generative AI and NLP tasks. Interestingly, the unified formatting significantly affects embedding quality, as it helps the model better separate different tasks.

Qualitative Analysis

The LLM generates diverse tasks and queries by conditioning on seed passages. It is often able to surface a passage that answers the generated query more directly than the seed passage itself. Furthermore, LLM-ranked hard negatives create a challenging training signal, forcing the model to learn nuanced distinctions.

This two-step LLM distillation process effectively brings the LLM's diverse domain knowledge and global ranking preferences into the text embedding model.

Business Applications

Gecko's versatility and performance open doors for various business applications:

  • Improved Search Engines: Gecko can enhance search engine results by better understanding user queries and retrieving the most relevant documents.
  • Chatbot Development: Gecko can be used to build more sophisticated chatbots that can understand user intent and provide accurate and relevant responses.
  • Customer Feedback Analysis: Gecko can analyze customer feedback to identify key themes and issues, helping businesses improve their products and services.
  • Content Recommendation: Gecko can be used to recommend relevant content to users based on their interests and past behavior.
  • Multilingual Applications: Gecko's multilingual capabilities can be leveraged to develop applications that work across different languages, such as cross-lingual information retrieval and machine translation.

Conclusion

Gecko presents a novel approach for training versatile text embedding models by distilling knowledge from LLMs. Its strong performance and zero-shot capabilities make it a promising tool for various NLP tasks and business applications. As LLM technology continues to advance, we can expect further improvements in Gecko's capabilities and its potential impact on the field of NLP.

Why does it work?

The top reason why Gecko works so well is its two-step LLM distillation process. This process allows Gecko to leverage the vast knowledge and understanding of LLMs to create a high-quality training dataset, which in turn leads to better model performance.

Here's how Gecko's approach differs from previous methods:

  • Traditional text embedding models are often trained on large, manually labeled datasets. However, creating these datasets is time-consuming, expensive, and can introduce biases and lack of diversity.
  • Gecko, on the other hand, uses LLMs to automatically generate a diverse and task-agnostic training dataset. This dataset, called FRet, contains a wide range of tasks and queries, ensuring that the model learns general-purpose representations.
  • Furthermore, Gecko uses LLMs to refine the quality of the training data by identifying the most relevant positive and hard negative passages for each query. This step is crucial because it helps the model learn nuanced distinctions and improves its overall performance.

In essence, Gecko's LLM distillation process allows it to learn from the knowledge and reasoning capabilities of LLMs, which ultimately leads to better text embedding models. This approach is more efficient and scalable than relying on manually labeled data, and it has the potential to revolutionize the way text embedding models are trained.
