May 17, 2024
3 mins

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Infini-attention uses a compressive memory and efficient retrieval to give a transformer practically unbounded context within a bounded memory footprint.

Key Takeaways

  • Longer Context for LLMs: This paper presents a new technique, called Infini-attention, which allows large language models (LLMs) to process much longer sequences of text than was previously possible. This is done by incorporating a "compressive memory" that stores information from previous parts of the text and allows the model to "remember" it when processing later parts.
  • Bounded Memory and Computation: Despite being able to handle longer contexts, Infini-attention does not require a huge increase in memory or computational resources. This is because the compressive memory has a fixed size, regardless of how long the input text is.
  • Improved Performance: Experiments show that Infini-attention leads to better performance on tasks involving long sequences of text, such as summarizing entire books or retrieving information from long documents.


Imagine trying to summarize a long book or article. As humans, we can easily refer back to earlier parts of the text to understand the context and key points. However, this has been a major challenge for LLMs, which typically have a limited "context window" and can only process a certain amount of text at a time. This paper tackles this problem head-on by introducing Infini-attention, a new way for LLMs to handle very long sequences of text effectively.


LLMs like GPT-4 have been impressive in their ability to generate human-quality text and perform various language-based tasks. However, they have a limitation: they can only process a certain amount of text at once due to the nature of their attention mechanism.

The attention mechanism in LLMs works by comparing each word in the input text to every other word, which becomes computationally expensive as the text gets longer. This limits the "context window" of the model, meaning it can only "remember" and use information from a recent portion of the text.


Infini-attention addresses this limitation by introducing a "compressive memory" alongside the traditional attention mechanism. This memory stores information from earlier parts of the text in a compressed format, allowing the model to access and utilize it when processing later parts.

Here's how it works:

Local Attention

Within each segment of text, the model uses the standard attention mechanism to focus on relevant parts.
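As a minimal NumPy sketch, this local step is ordinary scaled dot-product attention applied to one segment at a time (the names and shapes here are illustrative, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Q, K, V):
    """Standard scaled dot-product attention over one segment."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # pairwise similarities within the segment
    return softmax(scores) @ V      # attention-weighted sum of values

# Toy segment: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = local_attention(Q, K, V)      # (4, 8) local context
```

Because this runs only within a segment, its quadratic cost stays bounded by the segment length rather than the full document length.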

Compressive Memory

Instead of creating a separate memory system, Infini-attention cleverly reuses the "query", "key", and "value" states that are already calculated as part of the standard attention mechanism. These states represent different aspects of the input text and their relationships. The memory is essentially an "associative matrix" where each "key" is linked with its corresponding "value". Think of it like a dictionary where you look up a word (key) to find its meaning (value).

When a new segment of text is processed, the key-value pairs from that segment are used to update the associative matrix. This is done incrementally, meaning the memory is constantly evolving and adapting to new information.
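A rough sketch of this incremental update, assuming the paper's choice of σ = ELU + 1 as the positive nonlinearity (the variable names are illustrative):

```python
import numpy as np

def elu1(x):
    # sigma = ELU(x) + 1, which keeps activations positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def update_memory(M, z, K, V):
    """Fold one segment's key-value pairs into the fixed-size memory.

    M: (d_key, d_value) associative matrix, z: (d_key,) normalization term.
    """
    sK = elu1(K)                 # (seg_len, d_key)
    M = M + sK.T @ V             # bind each key to its value
    z = z + sK.sum(axis=0)       # accumulate key mass for later normalization
    return M, z

# The memory stays (d_key, d_value) no matter how many segments are folded in.
d_key, d_val = 8, 8
M = np.zeros((d_key, d_val))
z = np.zeros(d_key)
rng = np.random.default_rng(0)
for _ in range(3):               # three segments, one fixed-size memory
    K = rng.standard_normal((4, d_key))
    V = rng.standard_normal((4, d_val))
    M, z = update_memory(M, z, K, V)
```

The key point the sketch makes concrete: the memory's size depends only on the key and value dimensions, not on how much text has been processed.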

Memory Retrieval

Querying the Memory: When processing a new segment, the model uses the "query" state, which represents the current context, to retrieve relevant information from the memory.

Linear Attention Mechanism: This retrieval process is based on a "linear attention" mechanism. It computes the similarity between the query and the keys stored in the memory and uses these similarities to retrieve a weighted combination of the corresponding values, much as cosine similarity (a normalized dot product) is used to compare vector embeddings.
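Under the same assumptions as the update step (σ = ELU + 1, and an `M`, `z` memory accumulated from earlier segments), retrieval reduces to one matrix product plus a normalization, so its cost is linear in segment length:

```python
import numpy as np

def elu1(x):
    # sigma = ELU(x) + 1, the same positive nonlinearity used for updates
    return np.where(x > 0, x + 1.0, np.exp(x))

def retrieve(M, z, Q, eps=1e-6):
    """Linear-attention read from the compressive memory.

    M: (d_key, d_value) associative matrix, z: (d_key,) normalization term,
    Q: (seg_len, d_key) queries; returns (seg_len, d_value) retrieved values.
    """
    sQ = elu1(Q)                                 # (seg_len, d_key)
    return (sQ @ M) / ((sQ @ z)[:, None] + eps)  # normalized associative lookup

# Build a memory from one stored segment, then query it.
rng = np.random.default_rng(0)
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
M = elu1(K).T @ V                  # associative matrix from stored keys/values
z = elu1(K).sum(axis=0)
out = retrieve(M, z, K)            # (4, 8) values read back from memory
```

Note there is no softmax over the whole history here; the stored normalization term `z` plays that role, which is what keeps retrieval cheap.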

Combined Context

The model now has two sources of information: the local context from the current segment (obtained through the standard attention mechanism) and the global context from the entire text history (retrieved from the compressive memory). A "gating" mechanism is used to combine these two sources of information. This gating mechanism learns to balance the importance of local and global context depending on the specific task and input text.
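The gate can be sketched as a learned scalar β that is squashed through a sigmoid to weight the two context sources (β and the array names here are illustrative stand-ins, not the paper's exact parameterization):

```python
import numpy as np

def combine(A_local, A_mem, beta):
    """Blend memory-retrieved context with local attention via a learned gate."""
    g = 1.0 / (1.0 + np.exp(-beta))   # sigmoid(beta) in [0, 1]
    return g * A_mem + (1.0 - g) * A_local

A_local = np.ones((4, 8))             # stand-in local attention output
A_mem = np.zeros((4, 8))              # stand-in memory retrieval
blended = combine(A_local, A_mem, 0.0)   # beta = 0 -> equal blend
mostly_mem = combine(A_local, A_mem, 20.0)  # large beta -> mostly memory
```

During training, β is learned like any other parameter, so each attention head settles on its own balance between local detail and long-range memory.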

The combined context, incorporating both local and global information, forms a rich representation of the text that the model can use for downstream tasks like summarization or question answering.

By efficiently storing and retrieving information, Infini-attention allows a model to have a much broader understanding of the text than traditional attention mechanisms, leading to improved performance on tasks involving long sequences.


The researchers tested Infini-attention on various tasks involving long sequences of text and compared it to existing methods:

Long-Context Language Modeling

Infini-attention outperformed other models, like Transformer-XL and Memorizing Transformer, in predicting the next word in a sequence, while using significantly less memory.

Passkey Retrieval

A 1B-parameter LLM with Infini-attention was able to accurately retrieve a "passkey" hidden within a text one million tokens long, even though it was trained only on examples up to 5,000 tokens long. This demonstrates the model's ability to generalize to much longer contexts than it was trained on.

Book Summarization

An 8B LLM with Infini-attention achieved state-of-the-art results on a book summarization task, demonstrating its ability to understand and summarize long and complex texts.

Business Impact

The ability to process longer sequences of text opens doors for several exciting applications:

  • Improved Summarization and Question Answering: LLMs could provide more accurate and comprehensive summaries of long documents or answer complex questions that require understanding the entire context.
  • Enhanced Chatbots and Conversational AI: Chatbots could engage in longer and more meaningful conversations with users, remembering and building upon previous interactions.
  • Advanced Content Creation: LLMs could generate longer and more coherent pieces of writing, such as scripts, poems, or even novels, with a consistent storyline and character development.
  • Efficient Information Retrieval: LLMs could become powerful search engines, capable of understanding and retrieving information from vast amounts of text data, such as legal documents, scientific papers, or historical archives.


Infini-attention presents a significant step forward in enabling LLMs to handle long and complex texts. It allows models to maintain a global understanding of the context while efficiently using computational resources. This has the potential to unlock a wide range of applications and further enhance the capabilities of LLMs in various domains.

