May 17, 2024
4 mins

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

OpenELM by Apple is a new family of LLMs that uses layer-wise scaling to allocate parameters more efficiently across transformer layers.
Paper Link

Key Takeaways

  • OpenELM is a family of decoder-only transformer-based LLMs, focusing on transparency and reproducibility.
  • It utilizes a "layer-wise scaling" approach for efficient parameter allocation, leading to improved accuracy compared to other open-source LLMs.
  • OpenELM outperforms comparable open LLMs such as OLMo (by 2.36% average accuracy at the roughly 1-billion-parameter scale) while requiring 2x fewer pre-training tokens.
  • The release includes the complete training and evaluation framework, empowering open research and development.

Introduction

The field of natural language processing (NLP) is rapidly evolving, with LLMs like GPT-3 and Llama leading the charge. However, many of these models are released without their training data, code, or evaluation pipelines, hindering transparency and reproducibility in research. OpenELM addresses this challenge by offering a state-of-the-art open LLM together with its complete framework for training and evaluation.

Background

LLMs are typically built using isotropic transformer architectures, meaning each layer has the same configuration. However, this may not be the most efficient way to allocate parameters within the model. OpenELM introduces a novel approach called "layer-wise scaling," which allows for a more dynamic distribution of parameters across different layers.

Pretraining

Architecture

OpenELM adopts a decoder-only transformer architecture, similar to other LLMs like GPT-3. However, it incorporates several key innovations:

  • Layer-wise scaling: This technique adjusts the number of attention heads and the feed-forward network (FFN) width multiplier in each transformer layer, so layers closer to the input are allotted fewer parameters while layers closer to the output receive more. This non-uniform allocation yields better accuracy for a given parameter budget (see the sketch after this list).
  • No learnable bias parameters: Unlike some LLMs, OpenELM does not use learnable bias parameters in any fully-connected layers.
  • Pre-normalization and positional embedding: The model uses RMSNorm for pre-normalization and RoPE for encoding positional information.
  • Advanced attention and activation functions: OpenELM replaces multi-head attention with grouped query attention and the standard feed-forward network with SwiGLU, leading to improved performance.
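To make layer-wise scaling concrete, here is a minimal sketch of how per-layer head counts and FFN widths could be interpolated between the first and last layer. The scaling bounds and base configuration below are illustrative placeholders, not the exact values from the paper.

```python
# Minimal sketch of layer-wise scaling: the number of attention heads and the
# FFN width grow linearly from the first transformer layer to the last.
# The bounds and base dimensions below are illustrative, not the paper's values.

def layer_wise_scaling(num_layers, base_heads=16, head_dim=64,
                       alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    model_dim = base_heads * head_dim
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)            # 0.0 at the first layer, 1.0 at the last
        a = alpha[0] + (alpha[1] - alpha[0]) * t  # scale factor for attention heads
        b = beta[0] + (beta[1] - beta[0]) * t     # FFN width multiplier
        configs.append({
            "layer": i,
            "num_heads": max(1, round(base_heads * a)),
            "ffn_dim": int(model_dim * b),
        })
    return configs

for cfg in layer_wise_scaling(num_layers=6):
    print(cfg)
```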

Data

OpenELM is pre-trained on a large corpus of publicly available text drawn from RefinedWeb, PILE, RedPajama, and Dolma, totaling approximately 1.8 trillion tokens.

Training Details

The training process involves several key hyperparameters and optimizations:

  • Training iterations: OpenELM variants are trained for 350,000 iterations using the CoreNet framework.
  • Optimizer and learning rate schedule: The model uses the AdamW optimizer with a cosine learning rate schedule (sketched below).
  • Weight decay and gradient clipping: Weight decay is set to 0.1, and gradient clipping is applied to prevent exploding gradients.
  • Distributed training: For larger models, techniques like FSDP and activation checkpointing are used for efficient training on multiple GPUs.
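As a rough illustration of the setup above, here is a minimal PyTorch sketch combining AdamW, a warmup-plus-cosine schedule, and gradient clipping. The peak learning rate, warmup length, and clipping threshold are placeholders rather than the paper's exact values.

```python
import torch

# Stand-in model; in practice this would be an OpenELM-style transformer.
model = torch.nn.Linear(512, 512)

# AdamW with weight decay 0.1, as in the training details above.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Warmup followed by a cosine learning-rate schedule over the full run
# (350k iterations in the paper; warmup length and peak LR are placeholders).
total_iters, warmup_iters = 350_000, 5_000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_iters)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters - warmup_iters)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_iters])

for step in range(10):  # a few steps for illustration; the real run is far longer
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```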

Evaluation

OpenELM's performance is evaluated using various tasks across different frameworks, including:

  • Standard zero-shot tasks: These tasks assess the model's common-sense reasoning abilities without any in-context examples (a minimal evaluation sketch follows this list).
  • OpenLLM leaderboard tasks: This benchmark evaluates the model on multiple-choice reasoning and knowledge tasks such as ARC, HellaSwag, MMLU, TruthfulQA, and WinoGrande.
  • LLM360 leaderboard tasks: This framework assesses the model's capabilities across a wider range of tasks, including reasoning, knowledge understanding, and bias detection.
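For readers who want to reproduce zero-shot numbers, here is a minimal sketch using EleutherAI's lm-evaluation-harness, which the paper relies on for these tasks. The model identifier, tokenizer handling, and task list are assumptions for illustration; the OpenELM repository documents the exact setup.

```python
# Minimal sketch of zero-shot evaluation with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). The model id, trust_remote_code flag, and task list are
# assumptions for illustration; a compatible tokenizer may also need to be set.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=apple/OpenELM-1_1B,trust_remote_code=True",
    tasks=["arc_easy", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,   # zero-shot
    batch_size=8,
)
print(results["results"])
```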

Experimental Results

Pre-training

OpenELM demonstrates superior performance compared to other open-source LLMs across different evaluation frameworks. For instance, a 1.1 billion parameter OpenELM model achieves significantly higher accuracy than the 1.2 billion parameter OLMo model while requiring only half the amount of pre-training data.

"OpenELM with 1.1 billion parameters outperforms OLMo, which has 1.2 billion parameters, by 2.36% while requiring 2x fewer pre-training tokens."

Instruction Tuning

Instruction tuning, i.e., fine-tuning the model on instruction-following data, further enhances OpenELM's capabilities, yielding a consistent 1-2% improvement in average accuracy across tasks and evaluation frameworks.

PEFT

Parameter-efficient fine-tuning (PEFT) methods like LoRA and DoRA can be effectively applied to OpenELM, allowing for further performance improvements on specific tasks without the need to fine-tune the entire model.
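As a sketch of how a PEFT method such as LoRA could be attached, the example below uses Hugging Face's peft library. The model identifier, target module names, and LoRA hyperparameters are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch of attaching LoRA adapters with Hugging Face's peft library.
# The model id, target module names, and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-1_1B", trust_remote_code=True)  # custom modeling code on the Hub

lora_cfg = LoraConfig(
    r=8,                     # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj"],  # assumed projection names; inspect the model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```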

Benchmarking

Benchmarking on consumer-grade hardware reveals that OpenELM, while demonstrating superior accuracy, is currently slower than OLMo in terms of token throughput. This is primarily due to the naive implementation of RMSNorm, which requires numerous small kernel launches compared to the more optimized LayerNorm used in OLMo.

"Our analysis reveals that a significant portion of OpenELM's processing time can be attributed to our naive implementation of RMSNorm."

However, the researchers acknowledge this bottleneck and plan to explore optimization strategies to improve OpenELM's inference efficiency in future work.
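To see why an unfused norm can become a bottleneck, here is a generic, naive RMSNorm in PyTorch (not OpenELM's actual code): each elementwise operation dispatches its own small kernel, whereas a fused LayerNorm or RMSNorm performs the whole normalization in a single launch.

```python
import torch
import torch.nn as nn

class NaiveRMSNorm(nn.Module):
    """Unfused RMSNorm: each elementwise op below launches its own GPU kernel."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # RMS(x) = sqrt(mean(x^2) + eps); normalize, then apply the learned scale.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)
print(NaiveRMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```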

Business Implications

OpenELM's open-source nature and impressive performance have several potential business implications:

Democratization of NLP research

OpenELM empowers researchers and developers with access to state-of-the-art LLM technology, fostering innovation and accelerating progress in the field.

Cost-effective LLM development

The efficient architecture and training process of OpenELM can lead to reduced costs associated with LLM development and deployment.

Customization and specialization

OpenELM's open-source nature allows businesses to customize and fine-tune the model for specific use cases, leading to more tailored and effective solutions.

On-device LLMs

This research signals Apple's interest in running smaller LLMs on device to support tasks such as summarization and search.

Conclusion

OpenELM represents a significant step forward in the development of open-source LLMs. Its layer-wise scaling approach, comprehensive framework, and impressive performance make it a valuable resource for NLP researchers and developers. As the model continues to evolve and improve, we can expect OpenELM to play a crucial role in shaping the future of NLP technology.

Critical Analysis

While OpenELM offers numerous advantages, there are some areas for improvement:

Inference speed

As discussed earlier, the current implementation of RMSNorm hinders OpenELM's inference speed compared to other models. Optimizations in this area are crucial for real-world applications.

Model size

While OpenELM offers various model sizes, it might be beneficial to explore even smaller and more efficient models for resource-constrained environments.

Safety and bias mitigation

As with any LLM, addressing potential biases and ensuring responsible use are essential considerations. OpenELM's open-source nature facilitates this process, allowing for community-driven efforts to improve safety and fairness.

GitHub: https://github.com/apple/corenet/tree/main/projects/openelm

HuggingFace

