The Mamba in the Llama, OP-RAG in LLMs, and OLMoE

#22 | Latest AI Research Explained Simply

In this issue: x4 industry news, x3 AI research papers

Sep 2nd ~ Sep 8th

🗞️ Industry News in 1 Line

  1. ♥ 2.2k We’ve seen data centers at the bottom of the ocean, but a new company called Lumen Orbit is reaching for the skies. Training AI models takes a lot of power, and Lumen Orbit is betting that abundant solar energy can supply it by putting data centers in space.

    Lumen Orbit Demo

    a visualization of Lumen Orbit’s datacenter

  2. ♥ 31k Ilya Sutskever, co-founder and former chief scientist of OpenAI, has started a new company called Safe Superintelligence Inc. It has raised $1B in funding and is looking for smart people who want to solve one of the most challenging problems of our age: safe AGI.

  3. ♥ 8.4k Replit has announced the early-access release of Replit Agent, an AI assistant that can do more than just write code. People are loving it and have been comparing it to Cursor, another popular coding assistant.

  4. ♥ 3k HailuoAI, a Chinese AI company, has launched an impressive text-to-video model capable of creating incredibly realistic videos of people and landscapes.

    HailuoAI

    A man packing his luggage during the 19th Century

Your 24/7 Personal Genius, Free for the Taking

Ever wished for a brilliant assistant who works round the clock, never tires, and thinks at lightning speed?

🧠Supercharge Your Workflow with HubSpot's FREE AI Assistant Kit

This power-packed bundle includes:

  • AI Tool Overview: Your digital Swiss Army knife

  • Assistant Guide: Speak AI, get results

  • Task Delegator: Become an AI orchestrator

  • Efficiency Calculator: Measure your skyrocketing productivity

  • Improvement Framework: Stay ahead of the curve

Stop working harder. Start working smarter. Genius-level smarter.

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Wang et al. [Cornell University, University of Geneva, Together AI, Princeton University]

♥ 200   LLM Attention   bycloud’s pick  

The Architecture Pipeline of the Model

Introduction to linear RNNs

Transformer models have been doing wonders in language tasks, but they've got a couple of issues when it comes to practical use:

  1. They're pretty slow when generating really long sequences of text.

  2. They need a ton of memory to store all their "thoughts": the key-value (KV) cache grows with the length of the sequence (see the rough estimate below).
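
To make the memory point concrete, here is a rough back-of-the-envelope estimate of the KV cache for a 100k-token context. The configuration below is an assumed Llama-2-7B-like setup, used purely for illustration:

```python
# Rough KV-cache size for a Llama-2-7B-like Transformer (assumed configuration).
num_layers = 32        # decoder layers
hidden_dim = 4096      # model width
bytes_per_value = 2    # fp16
seq_len = 100_000      # tokens kept in context

# Each layer caches one key vector and one value vector per token.
kv_cache_bytes = 2 * num_layers * seq_len * hidden_dim * bytes_per_value
print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB")   # ~52 GB, growing linearly with seq_len
```

A linear RNN, by contrast, carries a fixed-size state no matter how long the sequence gets.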

Now, there's this other type of model architecture called linear RNNs that can be just as good at language tasks, but they're way faster and more efficient when it comes to actually using them in the real world.

The researchers had a clever idea: What if we could take all the smarts from those big, pre-trained Transformer models and somehow transfer that knowledge into these speedy linear RNN models? 

In this paper, they used some smart tricks to reuse parts of the Transformer model and created a hybrid that keeps some of the original Transformer bits but runs much more efficiently.

How Do Linear RNNs Work?

Let's break down how attention in Transformers relates to linear RNNs like Mamba in a more friendly, easy-to-understand way. Here's the gist:

In Transformer models:

  1. Multi-head attention is a core component.

  2. Each attention head processes the input sequence in parallel.

  3. For each position in the sequence, three linear projections are computed:

    • Query (Q): represents the current position's "request" for information

    • Key (K): represents what each position offers, which is matched against queries to determine how relevant that position is

    • Value (V): contains the actual information to be aggregated

These projections are created by multiplying the input with learned weight matrices (WQ, WK, WV). The attention mechanism then uses these projections to compute weighted sums of values, where the weights are determined by the compatibility of queries and keys.
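
To ground these names, here is a minimal single-head attention sketch; the shapes and variable names are illustrative rather than taken from any particular codebase:

```python
import torch

def attention_head(x, W_Q, W_K, W_V):
    """One attention head. x: (seq_len, d_model); W_Q/W_K/W_V: (d_model, d_head)."""
    Q = x @ W_Q                                   # queries: what each position asks for
    K = x @ W_K                                   # keys: what each position offers
    V = x @ W_V                                   # values: the information to aggregate
    scores = (Q @ K.T) / K.shape[-1] ** 0.5       # query-key compatibility
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(causal, float("-inf"))  # only attend to past positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                            # weighted sum of values
```

Every position's output is computed in parallel from the full sequence, which is what makes training fast but inference memory-hungry.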

In linear RNNs (like Mamba):

  1. The processing is sequential rather than parallel.

  2. The model maintains a hidden state that is updated at each step.

  3. The current input and previous hidden state are used to compute the next hidden state and output.
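
A linear RNN replaces that all-pairs computation with a fixed-size state updated one token at a time. A deliberately simplified sketch (this is the generic linear-RNN form, not Mamba's full selective state-space update):

```python
import torch

def linear_rnn(x, A, B, C):
    """x: (seq_len, d_in); A: (d_state, d_state); B: (d_in, d_state); C: (d_state, d_out)."""
    h = torch.zeros(A.shape[0])        # fixed-size hidden state, independent of sequence length
    outputs = []
    for x_t in x:                      # sequential: one token per step
        h = A @ h + x_t @ B            # fold the new token into the state
        outputs.append(h @ C)          # read the output out of the state
    return torch.stack(outputs)
```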

In this paper, they create a mapping of the learned weights from the Transformer's attention mechanism to initialize the linear RNN. They propose a modified Mamba architecture that can directly use the weights from the Transformer's attention blocks. This initialization allows the linear RNN to leverage the patterns and relationships learned by the Transformer, but in a format that can be processed sequentially.

This approach effectively transfers the knowledge embedded in the Transformer's parallel attention mechanism into a sequential processing format, maintaining performance while improving inference efficiency.
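
Conceptually, the initialization reuses the attention projections as the starting point for the recurrent layer's projections: keys feed what gets written into the state, queries feed what gets read out, and values carry the content. The sketch below is only a loose illustration of that idea; the attribute names are assumptions, and the paper's actual procedure works head by head on a modified Mamba block before continuing with distillation:

```python
def init_linear_rnn_from_attention(attn_head, rnn_layer):
    """Loose sketch: copy a Transformer head's projections into a Mamba-style layer."""
    # Keys decide what is written into the state   -> state-input projection (assumed name B_proj)
    rnn_layer.B_proj.weight.data.copy_(attn_head.W_K.weight.data)
    # Queries decide what is read out of the state -> readout projection (assumed name C_proj)
    rnn_layer.C_proj.weight.data.copy_(attn_head.W_Q.weight.data)
    # Values carry the content itself              -> content projection (assumed name x_proj)
    rnn_layer.x_proj.weight.data.copy_(attn_head.W_V.weight.data)
    # Mamba-specific parameters (e.g. the learned step size) are trained during
    # the distillation stage rather than copied from the Transformer.
```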

Benchmark Results of Linear RNNs

The distilled hybrid models, which keep 50% or 25% of the original attention layers, show comparable or even slightly superior performance to their teacher models on chat benchmarks like MT-Bench and AlpacaEval. These distilled models outperform other linear RNN models trained from scratch with significantly more data, such as Falcon Mamba, which was trained on over 5T tokens.

In general benchmarks, the hybrid models, especially Mamba2-Llama3 variants, consistently outperform pure Mamba models and other linear RNNs across a wide range of tasks, including reasoning, question-answering, and common-sense inference.

In Defense of RAG in the Era of Long-Context Language Models

Yu et al. [NVIDIA]

♥ 924   LLM RAG

Introduction to Order-Preserve Retrieval-Augmented Generation

A few years ago, LLMs had small context windows, which limited their usefulness. RAG was initially an effective way to work around the limited context window, as it allowed models to access a larger corpus of information by retrieving only the relevant chunks of text.

These days, LLMs can handle very long sequences (up to millions of tokens), which reduces the need to store and update data in a vector database just to perform RAG.

This paper will show that while long-context LLMs can handle vast amounts of text, they might struggle to prioritize relevant information within the entire context. This could lead to the model being overwhelmed and producing less accurate answers. To overcome this, the paper proposes a new approach called Order-Preserve Retrieval-Augmented Generation (OP-RAG):

  • Order Preservation: Instead of simply ranking retrieved chunks by relevance and feeding them to the LLM in descending order, OP-RAG maintains their original order within the document.

  • Improved Focus on Relevant Information: This strategy helps the LLM better understand the context of the retrieved information, reducing the chances of being distracted by irrelevant passages.

How does Order-Preserve Retrieval-Augmented Generation Work?

Imagine you have a very long book and you want to find the answer to a specific question. Instead of reading the whole book, you could use a "smart search" tool (like the book's table of contents) that points you to the most relevant passages. Here's how OP-RAG is similar to this searching technique (a short code sketch follows the steps):

  1. Divide and Conquer: The long book is divided into smaller sections, like chapters. Each chapter is a "chunk" of information.

  2. Find the Best Chapters: The "smart search" tool compares your question to each chapter, giving each chapter a "relevance score". The chapters most similar to your question get higher scores.

  3. Order Matters: Instead of just showing you the chapters with the highest scores, the tool remembers the order of the chapters in the book. So even if a chapter with a slightly lower score comes later in the book, it's still shown to you in its original position.

  4. Read the Relevant Chapters: You can then read the selected chapters in order, which helps you understand the context of the information and find the best answer to your question.
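
The numbered steps above boil down to one change at the end of a standard retrieval pipeline: sort the selected chunks by their position in the document instead of by score. A compact sketch, with a toy cosine-similarity scorer and pre-computed embeddings assumed for illustration:

```python
import numpy as np

def op_rag_retrieve(question_emb, chunk_embs, chunks, top_k=8):
    """Pick the top_k most relevant chunks, then restore their original document order."""
    # Relevance score: cosine similarity between the question and each chunk.
    sims = chunk_embs @ question_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(question_emb) + 1e-8
    )
    top_idx = np.argsort(-sims)[:top_k]   # vanilla RAG would keep this score-ranked order
    top_idx = np.sort(top_idx)            # OP-RAG: present chunks in their original order
    return "\n\n".join(chunks[i] for i in top_idx)
```

The joined chunks are then placed in the prompt alongside the question, exactly as in ordinary RAG.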

Why does order matter in RAG?

Imagine you're looking for information about a historical event. A chapter about the event's causes might have a lower score than a chapter about its consequences, but it's important to read the cause chapter first to understand the whole story. Order-preserving retrieval helps you follow the flow of information and get a more complete picture.

OP-RAG is like this "smart search" tool, but for language models. It helps LLMs focus on the most relevant information in a long text and understand the context of that information, leading to more accurate and comprehensive answers. In the following image we can see that normal RAG orders the chunks according to their relevance, while OP-RAG presents the chunks in the order they appear in the original source, regardless of relevance.

Retrieved tokens in different RAG mechanisms.

Results and Evaluation of Order-Preserve Retrieval-Augmented Generation

The paper shows that OP-RAG outperforms both long-context LLMs without retrieval and the SELF-ROUTE approach in terms of answer quality and efficiency. OP-RAG got higher F1 scores on the EN.QA dataset while using considerably fewer tokens than its counterparts. Furthermore, it also performs well on the EN.MC dataset, achieving comparable or even higher accuracy with fewer tokens compared to long-context LLMs.

OLMoE: Open Mixture-of-Experts Language Models

Muennighoff et al. [Allen Institute for AI, Contextual AI, University of Washington, Princeton University]

♥ 889   LLM MOE

Introduction to OLMoE

Traditional dense LLMs, like those in the GPT or Llama family, typically require massive amounts of computational power and memory to train and run, which limits their accessibility and deployability. The researchers aim to solve this problem by developing and open-sourcing a Mixture-of-Experts (MoE) model called OLMoE-1B-7B.

MoE models are a type of neural network architecture that uses a gating mechanism to selectively activate only a subset of the model's parameters for each input, potentially offering better efficiency and performance. Building on this, the researchers created a base MoE model (OLMoE-1B-7B) that achieves similar performance to much larger models (e.g., comparable MMLU scores to Llama2-13B) while being about 10 times less computationally expensive.

MoE vs Dense LM

Comparison of the architecture of dense LMs and MoE models like OLMoE

Inner-Workings of OLMoE

Here’s how the OLMoE-1B-7B Mixture-of-Experts (MoE) model works:

  1. Input Processing: When an input token enters the model, it goes through multiple layers. In each layer, there's a special MoE module.

  2. The Router: Within the MoE module, there's a component called the router. The router acts like a traffic director for the input: it's a learned component that decides which experts should process it.

  3. Expert Selection: The router looks at the input and assigns it to multiple experts (in this case, 8 out of 64 available experts per layer). Each expert is like a specialized mini-network within the larger model.

  4. Probability Assignment: The router doesn't just choose experts; it also assigns a probability to each chosen expert. This probability represents how relevant or important that expert is for processing the current input.

  5. Expert Processing: Each selected expert then processes the input independently. They're like specialized workers, each applying their unique skills to the task.

  6. Combining Expert Outputs: After the experts process the input, their outputs are combined. The combination isn't equal – it's weighted by the probabilities the router assigned. So, experts deemed more relevant by the router have a stronger influence on the final output for that layer.

  7. Layer Output: The combined result from the experts becomes the output of the MoE module for that layer. This output then moves on to the next layer in the model.

  8. Auxiliary Processes: During training, the model also uses additional components to ensure it works well:

    1. A load balancing mechanism to prevent overuse of certain experts

    2. A "z-loss" to keep the router's decision-making stable

This process repeats for each layer in the model, allowing the MoE to dynamically use different combinations of specialized sub-networks (experts) for different inputs. This approach aims to make the model more efficient and potentially more capable than a traditional dense model of similar size.
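
Steps 2 through 7 fit in a few lines of code. The sketch below is a simplified stand-in (8 experts selected out of 64 per layer, as in OLMoE, but with the expert networks reduced to plain MLPs and the training losses left out), not the actual OLMoE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=2048, n_experts=64, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # the "traffic director"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                             # x: (d_model,) for a single token
        logits = self.router(x)                       # one score per expert
        probs = F.softmax(logits, dim=-1)             # router probabilities
        top_p, top_i = probs.topk(self.top_k)         # keep only the 8 most relevant experts
        # Each chosen expert processes the token; their outputs are combined,
        # weighted by the router's probabilities for those experts.
        out = sum(p * self.experts[i](x) for p, i in zip(top_p, top_i.tolist()))
        # During training, OLMoE also adds a load-balancing loss (to keep expert usage even)
        # and a router z-loss on the logits; both are omitted from this sketch.
        return out
```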

Real-World Implications of OLMoE

OLMoE-1B-7B outperformed other models with similar active parameter counts (around 1B) across various tasks like MMLU, HellaSwag, ARC-Challenge, PIQA, and WinoGrande. After instruction tuning (SFT) and preference tuning (DPO), OLMoE-1B-7B-Instruct shows an over 10x gain on GSM8k (math problem solving).

Moreover, OLMoE-1B-7B-Instruct achieves an 84% score on AlpacaEval, surpassing much larger dense models like Llama2-13B-Chat. The model's ability to perform well with fewer active parameters means it could be deployed on less powerful hardware, making advanced AI capabilities more accessible to a wider range of users and organizations.
