Resources - Clio AI

Research Insights

Our insights into the latest research and publications, with context on their features, breakthroughs, and business applications.

All Research Insights

Deliberation in Latent Space via Differentiable Cache Augmentation

This paper introduces a novel approach to enhance LLMs by augmenting the key-value cache with latent embeddings generated by an offline coprocessor. The method is differentiable, efficient, and improves reasoning performance on a variety of tasks.

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Explore OREO, a novel offline RL algorithm for enhancing LLM multi-step reasoning. Learn how it outperforms DPO with soft Bellman optimization and fine-grained credit assignment.

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Explore the Mixture-of-Transformers (MoT) architecture, a novel approach for creating scalable multi-modal foundation models. MoT reduces pretraining costs while maintaining performance in text, image, and speech generation tasks.

Training Language Models to Self-Correct via Reinforcement Learning

Google DeepMind introduces SCoRe - a technique that teaches LLMs to self-correct at inference time, resulting in better outputs and corrected mistakes without needing human supervision.

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling inference-time compute lets models spend more tokens on a problem and hence produce better generations. A direct precursor to OpenAI's o1 (Strawberry).

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

With iterative preference learning and Monte Carlo Tree Search, LLMs can reason better and generate high-quality outputs on related tasks. OpenAI's Strawberry is based on this same idea.

Gemma 2: Improving Open Language Models at a Practical Size

The Gemma 2 research paper by Google DeepMind details knowledge distillation, grouped-query attention, and extensive safety training.

MUSCLE: A Model Update Strategy for Compatible LLM Evolution

MUSCLE - a paper from Apple research - proposes strategies for updating LLMs that do not alter model behavior in a negative manner.

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

This paper from DeepMind introduces the LOFT benchmark to test long-context models against techniques like retrieval, in-context learning, and SQL.

Mixture of A Million Experts

This paper from Google DeepMind scales MoE to a million experts, each with a single layer, and then uses vector-based retrieval to pick the top-k experts for any given query at runtime.
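As a rough illustration of the retrieval step described above, the sketch below scores every expert key against a query vector and keeps the k best matches. The function name `top_k_experts`, the toy keys, and the exhaustive dot-product scan are assumptions made for illustration and are not taken from the paper; a system with a million experts would need a far more efficient lookup than scanning every key.

```python
import heapq

def top_k_experts(query, expert_keys, k):
    """Return indices of the k expert keys with the largest dot product with the query."""
    # Hypothetical scoring: one dot product per expert key (exhaustive scan).
    scores = [sum(q * w for q, w in zip(query, key)) for key in expert_keys]
    # nlargest returns the indices ordered from best score to worst.
    return heapq.nlargest(k, range(len(expert_keys)), key=scores.__getitem__)

# Toy example: four 2-dimensional expert keys (illustrative values only).
expert_keys = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
]
selected = top_k_experts([1.0, 0.2], expert_keys, k=2)
print(selected)
```

Only the selected experts would then be evaluated for that query, which is what keeps the per-query cost small even when the pool of experts is very large.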

Transformers meet Neural Algorithmic Reasoners

This paper from Google DeepMind combines Transformers with Neural Algorithmic Reasoning, resulting in an architecture where LLMs are good at reasoning tasks.

Mixture-of-Agents Enhances Large Language Model Capabilities

Together AI introduces Mixture of Agents - a group of LLMs that, when combined, can outperform GPT-4 and other top LLMs.

CRAG - Comprehensive RAG Benchmark

CRAG by Meta AI (FAIR) proposes a comprehensive evaluation framework for RAG systems, covering both straightforward and state-of-the-art industry-level systems.

Contextual Position Encoding: Learning to Count What’s Important

CoPE helps LLMs get better at counting tasks by making positional encoding depend on context, unlike traditional token-based approaches.

LoRA Learns Less and Forgets Less

Analyzing LoRA to understand whether it can add new knowledge to an LLM.

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

MoRA builds upon the ideas of LoRA and PEFT to efficiently fine-tune an LLM with high-rank updates, using a square matrix instead of low-rank matrices.
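To make the square-matrix idea concrete, here is a back-of-the-envelope sketch, assuming the update for a d x k weight matrix must fit the same trainable-parameter budget as a rank-r LoRA adapter. The helper names and the 4096 x 4096, r = 8 sizes are illustrative assumptions, not figures from the paper.

```python
import math

def lora_param_count(d, k, r):
    """Trainable parameters of a rank-r LoRA pair: A is (d x r), B is (r x k)."""
    return r * (d + k)

def square_side_for_budget(d, k, r):
    """Largest side s of a square matrix with s * s <= the LoRA parameter budget."""
    return math.isqrt(lora_param_count(d, k, r))

# Hypothetical layer size and LoRA rank, chosen only for illustration.
d, k, r = 4096, 4096, 8
budget = lora_param_count(d, k, r)
side = square_side_for_budget(d, k, r)

print(f"LoRA: {budget} parameters, update rank at most {r}")
print(f"Square matrix: {side} x {side} = {side * side} parameters, rank at most {side}")
```

Under these assumptions the same budget of 65,536 parameters allows a rank of up to 256 instead of 8. How the square matrix is mapped onto the d x k weight is outside the scope of this sketch.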

RLHF Workflow: From Reward Modeling to Online RLHF

Online iterative RLHF builds on reward modeling as an approximation of human preferences, iteratively updating the dataset to align the model better than previous techniques.

Better & Faster Large Language Models via Multi-token Prediction

This multi-token prediction paper by Meta shows that multiple prediction heads are memory-efficient, perform better, and are faster at inference compared to current next-token predictors.

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

OpenELM by Apple is a new approach that uses layer-wise scaling, resulting in more efficient LLMs.

AutoCodeRover: Autonomous Program Improvement

AutoCodeRover from NUS provides a novel framework that looks beyond code generation to genuine problem solving with the help of AI.

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

This paper from OpenAI discusses training an LLM to prioritize instructions in a hierarchical order, starting from the system prompt and alignment, then the user prompt, tool output, and so on.

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Megalodon by Meta AI is a new LLM architecture that tackles the problems in transformers and can support unlimited context length using a new attention technique.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Infini-attention uses a compressive memory cache and efficient retrieval to enable practically unbounded context for a transformer within a bounded memory footprint.

ReFT: Representation Finetuning for Language Models

ReFT changes representations at different layers of an LLM using a technique called intervention, instead of changing weights/parameters as PEFT does. It gives better performance on common benchmarks and tasks.

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

MoD is a new Mixture-of-Depths approach by Google DeepMind that dynamically allocates compute across input tokens.

RAFT: Adapting Language Model to Domain Specific RAG

With RAFT, an LLM is fine-tuned to ignore distracting documents and focus on relevant information for any given task. Along with CoT, this enables the model to give the right weight to relevant information, significantly improving generation and downstream task output.

Gecko - Versatile Text Embeddings Distilled from Large Language Models

Gecko - from Google DeepMind - is a new embedding model that uses a two-step LLM distillation process to create a high-quality training dataset, leading to better model performance.

Jamba - A hybrid Transformer-Mamba Language Model

Jamba by AI21 Labs combines transformer layers with Mamba (SSM) layers and adds MoE layers in between to get a compute-efficient model with high throughput.

Evolutionary Optimization of Model Merging Recipes

Evolutionary model merge uses evolutionary algorithms to automatically discover optimal ways to combine diverse open-source models. The resulting model harnesses the capabilities of its parent models without requiring extensive additional training data or compute, which makes foundation model development more accessible and efficient.

Dense X Retrieval: Proposition based Retrieval for RAG

Proposition-based retrieval performs significantly better than existing techniques like paragraph-based and sentence-based retrieval in RAG apps. This paper by Tencent investigates and quantifies how much better it is on Wikipedia articles.

PERL: Parameter Efficient RLHF

PERL, or Parameter Efficient Reinforcement Learning, could be a groundbreaking technique for reducing the memory and time needed to align a model before releasing it to the world. This paper shows that with LoRA you get close to the same benchmark results as standard RLHF techniques, and hence comparable output quality. The business implication is lower cost: alignment can be done efficiently on premises with any open-source model.

CaLM - Composition of LLMs by augmentation

CaLM provides composition for LLMs, much as libraries do in a programming language. It's a powerful method for combining the skills of multiple LLMs depending on the use case.

Helpful Resources

Enterprise Deep-dives

DSPy: A Programming Model for Self-Improving Language Model Pipelines

Chain of Thought Prompting Demystified

Generative AI for Enterprises - Use Cases, Experimentation, Iterations, and Deployments

A custom model for your use case