Mamba-2, LLM Hallucinations, and Diffusion on Syntax trees
The AI Timeline #9
In this issue: x3 industry news, x3 AI research papers
June 3rd ~ June 9th
Industry News in 1 Line
♥ 1.1k Stability AI launched Stable Audio Open, an open-source model for generating up to 47 seconds of audio samples and sound effects from text prompts. It is particularly useful for sound designers, musicians, and creative communities, as it allows customization and fine-tuning on user-provided audio data.
♥ 490 The new Qwen2 series, which comprises five model sizes (Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B), offers enhanced capabilities in coding, mathematics, and multilingual tasks and supports context lengths of up to 128K tokens. The models are available under the Apache 2.0 and Qianwen licenses for commercial use and development.
♥ 6.7k KWAI released a text-to-video model called KLING which can compete with OpenAI's Sora. The model demos show diverse capabilities, including generating lifelike animations of car mirrors reflecting sunsets, a small figure exploring an art gallery, and an astronaut running smoothly on the moon's surface.
Diffusion on Syntax Trees for Program Synthesis
Kapur et al. [University of California, Berkeley]
♥ 5.4k Diffusion
Introduction to Diffusion on Syntax Trees
Existing LLMs can be tweaked to produce code instead of English sentences. Although this approach works reasonably well for simpler code bases, it has a fundamental flaw: LLMs generate code token by token without the ability to observe the program's runtime output. This makes it difficult to fix errors and make adjustments based on the program's output, as there is no feedback loop.
This paper introduces a novel approach to program synthesis using neural diffusion models that operate on syntax trees, allowing the model to iteratively refine programs while ensuring syntactic validity. This method enables the model to observe the program's output at each step, akin to a debugging process.
Understanding Diffusion on Syntax Trees
The paper suggests using denoising diffusion models, traditionally used in image generation, to work with syntax trees of programming languages. The core idea is to iteratively refine a program's syntax tree by applying small, syntactically valid mutations, and then using a neural network to reverse these mutations, effectively "denoising" the program back to its intended form.
The process begins by introducing "noise" into a target program's syntax tree, creating a series of increasingly mutated versions of the original program. These mutations are small to ensure that the resulting programs remain syntactically valid according to the rules of the programming language's grammar.
A neural network is then trained to reverse these mutations, guided by the original image that the program is intended to render. This creates a feedback loop where the network can observe the effects of its mutations on the program's output and adjust accordingly, much like debugging in traditional programming.
To facilitate this process, the authors employ a policy that defines how mutations are applied and reversed. They also introduce a value network that helps predict the "distance" between the current mutated program and the target program, aiding in the search for the correct sequence of mutations to apply.
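To make the search loop concrete, here is a minimal Python sketch of how such value-guided denoising might look. This is our own illustration under stated assumptions: `render`, `images_match`, `denoiser.propose_edits`, and `value_net` are hypothetical placeholders, not the authors' actual API.

```python
import heapq
import itertools

def synthesize(target_image, initial_program, denoiser, value_net,
               render, images_match, max_steps=1000):
    """Hypothetical value-guided search: iteratively "denoise" a syntax tree
    until its rendering matches the target image."""
    counter = itertools.count()   # tie-breaker so heapq never compares programs
    frontier = [(value_net(render(initial_program), target_image),
                 next(counter), initial_program)]
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, program = heapq.heappop(frontier)   # expand the most promising tree
        current_image = render(program)
        if images_match(current_image, target_image):
            return program
        # The denoiser proposes small, syntactically valid mutations,
        # conditioned on both the current rendering and the target image.
        for edited in denoiser.propose_edits(program, current_image, target_image):
            score = value_net(render(edited), target_image)  # predicted distance to target
            heapq.heappush(frontier, (score, next(counter), edited))
    return None
```

The value network's predicted distance steers the search toward syntax trees whose renderings look closer to the target, which is what lets the method behave like an automated debugger.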
The architecture of the neural network is designed to work with the syntax trees and the specific grammar of the programming language being used. It includes a vision-language model for denoising, an image encoder for processing the visual output of the programs, and a transformer-based model for predicting the necessary edits to the syntax tree.
Results for Diffusion on Syntax Trees
This method demonstrated superior performance in inverse graphics tasks compared to baseline methods. The model successfully recovered programs from input sketches, suggesting its potential for interpreting and digitizing hand-drawn designs. In the following chart, you can see that the diffusion search (blue line) consistently outperforms other methods in constructing 2D geometry.
Although this approach works well for 2D graphics, it may not transfer to other domains without additional work. It also has limitations, such as the inability to handle variable bindings, loops, strings, and continuous parameters.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Dao and Gu [Princeton University, Carnegie Mellon University]
♥ 818 Mamba
Introduction to Mamba-2
Large language models keep getting larger, and the attention mechanism at their core scales quadratically with sequence length, which means they require ever more compute. For reference, if a model needs to generate a sequence that is twice as long, attention needs roughly four times as many resources.
State-space models (SSMs), on the other hand, offer linear scaling and have shown promising results, but their development has been somewhat isolated from the broader efforts to optimize Transformers. This paper introduces a novel framework called State Space Duality (SSD), which establishes theoretical connections between SSMs and attention mechanisms through structured semiseparable matrices.
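As a back-of-the-envelope illustration (ours, not the paper's), the snippet below contrasts how attention-style quadratic cost and SSM-style linear cost grow as the sequence length doubles:

```python
def relative_cost(seq_len, base_len=1024):
    """Rough scaling only: constants, hidden size, and hardware are ignored."""
    quadratic = (seq_len / base_len) ** 2   # attention-style cost
    linear = seq_len / base_len             # SSM-style cost
    return quadratic, linear

for length in (1024, 2048, 4096):
    q, l = relative_cost(length)
    print(f"len={length}: attention ~{q:.0f}x, SSM ~{l:.0f}x")
# len=1024: attention ~1x, SSM ~1x
# len=2048: attention ~4x, SSM ~2x
# len=4096: attention ~16x, SSM ~4x
```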
This framework allows for the design of a new architecture, Mamba-2, which refines the selective SSM of its predecessor to achieve significantly faster performance (2-8×) while maintaining competitive language modeling capabilities.
Understanding State Space Models in Mamba-2 Architecture
Before we discuss Mamba-2, let's revisit a few basic concepts:
State Space Models (SSMs) are a way to transform sequences of data. They are defined by a set of parameters that dictate how input sequences are converted into output sequences. You can think of it as a matrix multiplying a vector, where the matrix represents the transformation rules of the SSM, and the vector represents the input sequence.
Semi-separable matrices are a special kind of matrix with a unique structure. They are defined by the property that certain submatrices have a limited rank, which is a measure of complexity or information content. These matrices can be represented in a compact form, which is efficient both in terms of storage and computational resources needed for operations like multiplication.
This paper establishes that the way SSMs transform sequences is equivalent to multiplying by a semi-separable matrix. In other words, the complex operations of an SSM can be understood and executed as operations on these structured matrices, so existing algorithms for structured matrix multiplication can be used to compute SSMs efficiently.
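Here is a small numerical sketch of that equivalence, using a scalar state and made-up parameters rather than anything from the paper: running a time-varying linear recurrence produces the same output as multiplying the input sequence by a lower-triangular (semiseparable) matrix.

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)    # per-step state-transition scalars ("A_t")
b = rng.normal(size=T)          # input projections ("B_t")
c = rng.normal(size=T)          # output projections ("C_t")
x = rng.normal(size=T)          # input sequence

# Run the SSM as a recurrence: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t
h, y_recurrent = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_recurrent[t] = c[t] * h

# Build the equivalent lower-triangular semiseparable matrix:
# M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s  for s <= t
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

y_matrix = M @ x
print(np.allclose(y_recurrent, y_matrix))   # True: same sequence transformation
```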
The Mamba-2 Architecture simplifies neural network design by removing certain linear projections and adding a normalization layer for stability. It uses a parallel approach to generate parameters at the start, enhancing efficiency and allowing for tensor parallelism.
The architecture takes advantage of multihead patterns for sequence transformations, where each "head" independently processes part of the input. This modular approach is scalable and adaptable to different model sizes. Mamba-2 also incorporates kernel feature maps for flexibility and normalization terms for output consistency, making it a robust and efficient framework for machine learning tasks.
Performance Benchmarks of Mamba-2
The study compares various hybrid models combining SSD, MLP, and attention layers, all trained on the Pile dataset to 300B tokens at a 2.7B scale. The models are:
Transformer++: Interleaves 32 attention and 32 gated MLP layers.
Mamba-2: Consists of 64 SSD layers.
Mamba-2-MLP: Interleaves 32 SSD and 32 gated MLP layers.
Mamba-2-Attention: Combines 58 SSD layers with 6 strategically placed attention layers.
Mamba-2-MLP-Attention: Interleaves 28 SSD layers with 4 attention layers and 32 gated MLP layers.
Mamba-2-Attention outperforms both Transformer++ and pure Mamba-2, with lower perplexity (lower is better) and higher zero-shot accuracy across benchmarks. Moreover, adding MLP layers to Mamba-2 decreases model quality but increases training and inference speed thanks to the MLP's hardware efficiency. SSD is 2 to 8 times faster than Mamba's scan implementation and outpaces FlashAttention-2 at longer sequence lengths due to its efficient use of matrix multiplication units on GPUs.
However, Mamba-2 may not be as efficient as a Transformer for short sequences due to the number of SSD layers required.
To Believe or Not to Believe Your LLM
Yadkori et al. [Google DeepMind]
♥ 668 LLM
Introduction
It is no mystery that LLMs lie, and they lie a lot! You must have seen that Google's AI recently had a bit of a "whoops" moment with its AI Overviews feature, from generating overly "woke" search summaries to spitting out answers that left users scratching their heads.
This misinformation arises due to two types of uncertainty: epistemic, which stems from a lack of knowledge about the ground truth, and aleatoric, which arises from inherent randomness in the problem, such as when multiple valid answers exist. This paper aims to reliably detect when only epistemic uncertainty is high, indicating that the model's output may not be trustworthy.
The authors introduce an iterative prompting procedure that leverages the chain rule of probability to construct a joint distribution over multiple responses, allowing them to detect when a model's output is influenced by its previous responses, thus revealing epistemic uncertainty.
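A rough, hypothetical sketch of that idea (our own simplified prompt template, not the paper's): sample answers normally, then re-ask the question with one of those answers repeated in the prompt and check whether the model's responses shift.

```python
def epistemic_probe(model, question, n_samples=8, n_repeats=5):
    """Hypothetical illustration: if repeating a previously sampled answer in the
    prompt changes what the model says next, its responses depend on its own prior
    outputs, which signals epistemic (knowledge) uncertainty rather than aleatoric."""
    baseline = [model.sample(question) for _ in range(n_samples)]

    shifted = []
    for answer in set(baseline):
        # Re-ask with the sampled answer repeated, as if it were established.
        primed = (question + "\n"
                  + f"A possible answer is: {answer}\n" * n_repeats
                  + question)
        shifted.append((answer, [model.sample(primed) for _ in range(n_samples)]))

    # If the primed samples diverge strongly from the baseline samples, the joint
    # distribution over responses is far from independent, i.e. mutual information
    # between answers is high, and the output is likely a hallucination.
    return baseline, shifted
```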
Why Do Large Language Models Lie?
The paper discusses how repeating certain phrases or questions can influence the LM's responses. It's like if you keep telling a friend that your favorite ice cream flavor is strawberry, they might start to believe it, even if you actually prefer chocolate.
They found that if you repeat an incorrect answer many times, the LM might start to think it's correct. But if the LM was very confident about the correct answer to begin with, it won't be easily swayed by the repetition of the wrong one.
Model Architecture
The core model architecture used in this study is centered around mutual information (MI) estimation, with specific adaptations to accommodate practical constraints. This is a clever trick to estimate the correctness without having to look at every possible thing the LM could say, which would take forever. It's like estimating how many jellybeans are in a jar without counting them all one by one.
Mutual Information Estimation:
Algorithm 1: This algorithm estimates the MI for a given joint distribution μ, typically set to Q, the language model's distribution. The estimation focuses on unique elements in the sample, ensuring duplicates do not skew the results.
Empirical Distribution: Due to the infinite potential support of Q, the method approximates Q using an empirical distribution derived from a finite sample, collecting representative elements in a set S.
Bias and Error Control:
Missing Mass (Uk): The missing mass quantifies the probability mass of elements not observed in the finite sample. This is crucial for bounding the error in MI estimation.
Theorem 4.6: This theorem provides a non-asymptotic bound on the estimation error, accounting for sample size, missing mass, and a bias term (γ). The bound ensures that the estimation error diminishes as the sample size (k) increases.
Epistemic Uncertainty Scoring I_k(γ, x): This score measures the epistemic uncertainty for a given query x, derived from the MI estimate. It serves as a basis for designing abstention policies that handle hallucinations in the language model's responses; a rough sketch of the scoring idea follows below.
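As a loose illustration of that scoring idea (not Algorithm 1 verbatim), a plug-in estimate of mutual information over paired responses could look like the following, with a small γ floor standing in for the bias term:

```python
from collections import Counter
from math import log

def plugin_mutual_information(pairs, gamma=1e-9):
    """Rough plug-in MI estimate over paired responses (y1, y2).
    A clearly positive MI suggests the second answer depends on the first,
    which the paper reads as a sign of epistemic uncertainty."""
    n = len(pairs)
    joint = Counter(pairs)                    # empirical joint distribution
    first = Counter(y1 for y1, _ in pairs)    # marginal of the first response
    second = Counter(y2 for _, y2 in pairs)   # marginal of the second response

    mi = 0.0
    for (y1, y2), count in joint.items():     # iterate over unique observed pairs
        p_joint = count / n
        p_prod = (first[y1] / n) * (second[y2] / n)
        mi += p_joint * log((p_joint + gamma) / (p_prod + gamma))
    return mi

# Perfectly dependent pairs give MI close to log(2); independent pairs give ~0.
print(plugin_mutual_information([("a", "a"), ("b", "b"), ("a", "a"), ("b", "b")]))
```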
Benchmarks and Results
The paper tested this new method on three sets of questions to see how well the model could tell when to answer and when to abstain:
TriviaQA: A set of general knowledge questions.
AmbigQA: Questions that might have more than one correct answer.
WordNet: A set of questions about different types of things, like "Name a type of fruit."
They looked at how often the LM gave the right answer (precision) and how often it chose to answer (recall). They wanted the LM to answer correctly as often as possible, but also not to answer when it wasn't sure. This method worked well! It was better than just using the most likely answer the LM came up with. It was especially good when questions had more than one correct answer, which is trickier for LMs.
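For clarity, here is how such an abstention policy could be scored in the precision/recall sense described above; the scoring function and example numbers are our own illustration, not the paper's evaluation code.

```python
def precision_recall(records, threshold):
    """records: list of (uncertainty_score, answer_is_correct) pairs.
    The model answers only when its uncertainty score is below the threshold."""
    answered = [correct for score, correct in records if score < threshold]
    precision = sum(answered) / len(answered) if answered else 1.0
    recall = len(answered) / len(records)        # fraction of queries answered
    return precision, recall

# Sweeping the threshold trades answering more often (higher recall)
# against answering correctly more often (higher precision).
records = [(0.1, True), (0.2, True), (0.4, False), (0.9, False)]
print(precision_recall(records, threshold=0.3))  # (1.0, 0.5)
print(precision_recall(records, threshold=1.0))  # (0.5, 1.0)
```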
In the following chart, we can see that while the first-order S.E. method has similar recall and error rates to those of the proposed M.E. method on low-entropy queries, its recall values are nearly zero for queries with higher entropy.