Mamba-2, LLM Hallucinations, and Diffusion on Syntax trees
The AI Timeline #9
In this issue: x3 industry news, x3 AI research papers
June 3rd ~ June 9th
Industry News in 1 Line
♥ 1.1k Stability AI launched Stable Audio Open, an open-source model for generating up to 47 seconds of audio samples and sound effects from text prompts. It is particularly useful for sound designers, musicians, and creative communities, as it allows customization and fine-tuning on user-provided audio data.
♥ 490 The new Qwen2 series, which comprises five model sizes (Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B), offers enhanced capabilities in coding, mathematics, and multilingual tasks and supports context lengths of up to 128K tokens. The models are available under the Apache 2.0 and Qianwen licenses for commercial use and development.
♥ 6.7k KWAI released a text-to-video model called KLING which can compete with OpenAI's Sora. The model demos show diverse capabilities, including generating lifelike animations of car mirrors reflecting sunsets, a small figure exploring an art gallery, and an astronaut running smoothly on the moon's surface.
Diffusion on Syntax Trees for Program Synthesis
Kapur et al. [University of California, Berkeley]
♥ 5.4k Diffusion
Introduction to Diffusion on Syntax Trees
Existing LLMs can be tweaked to produce code instead of English sentences. Although this approach works reasonably well for simpler code bases, it has a fundamental flaw: LLMs generate code token by token without the ability to observe the program's runtime output. This makes it difficult to fix errors and make adjustments based on the program's output, as there is no feedback loop.
This paper introduces a novel approach to program synthesis using neural diffusion models that operate on syntax trees, allowing the model to iteratively refine programs while ensuring syntactic validity. This method enables the model to observe the program's output at each step, akin to a debugging process.
Understanding Diffusion on Syntax Trees
The paper suggests using denoising diffusion models, traditionally used in image generation, to work with syntax trees of programming languages. The core idea is to iteratively refine a program's syntax tree by applying small, syntactically valid mutations, and then using a neural network to reverse these mutations, effectively "denoising" the program back to its intended form.
The process begins by introducing "noise" into a target program's syntax tree, creating a series of increasingly mutated versions of the original program. These mutations are small to ensure that the resulting programs remain syntactically valid according to the rules of the programming language's grammar.
A neural network is then trained to reverse these mutations, guided by the original image that the program is intended to render. This creates a feedback loop where the network can observe the effects of its mutations on the program's output and adjust accordingly, much like debugging in traditional programming.
To facilitate this process, the authors employ a policy that defines how mutations are applied and reversed. They also introduce a value network that helps predict the "distance" between the current mutated program and the target program, aiding in the search for the correct sequence of mutations to apply.
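To make the search loop concrete, here is a minimal Python sketch of how such value-guided denoising might look. This is our own illustration under stated assumptions: `render`, `images_match`, `denoiser.propose_edits`, and `value_net` are hypothetical placeholders, not the authors' actual API.

```python
import heapq
import itertools

def synthesize(target_image, initial_program, denoiser, value_net,
               render, images_match, max_steps=1000):
    """Hypothetical value-guided search: iteratively "denoise" a syntax tree
    until its rendering matches the target image."""
    counter = itertools.count()   # tie-breaker so heapq never compares programs
    frontier = [(value_net(render(initial_program), target_image),
                 next(counter), initial_program)]
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, program = heapq.heappop(frontier)   # expand the most promising tree
        current_image = render(program)
        if images_match(current_image, target_image):
            return program
        # The denoiser proposes small, syntactically valid mutations,
        # conditioned on both the current rendering and the target image.
        for edited in denoiser.propose_edits(program, current_image, target_image):
            score = value_net(render(edited), target_image)  # predicted distance to target
            heapq.heappush(frontier, (score, next(counter), edited))
    return None
```

The value network's predicted distance steers the search toward syntax trees whose renderings look closer to the target, which is what lets the method behave like an automated debugger.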
The architecture of the neural network is designed to work with the syntax trees and the specific grammar of the programming language being used. It includes a vision-language model for denoising, an image encoder for processing the visual output of the programs, and a transformer-based model for predicting the necessary edits to the syntax tree.
Results for Diffusion on Syntax Trees
This method demonstrated superior performance in inverse graphics tasks compared to baseline methods. The model successfully recovered programs from input sketches, suggesting its potential for interpreting and digitizing hand-drawn designs. In the following chart, you can see that the diffusion search (blue line) consistently outperforms other methods in constructing 2D geometry.
Although this approach works well for 2D graphics, it may not transfer to other domains without additional work. It also has limitations, such as the inability to handle variable bindings, loops, strings, and continuous parameters.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Dao and Gu [Princeton University, Carnegie Mellon University]
♥ 818 Mamba
Introduction to Mamba-2
Large language models keep getting larger, and the attention mechanism at their core scales quadratically with sequence length, which means they require ever more compute. For reference, if a model needs to generate a sequence that is twice as long, attention needs roughly four times as many resources.
State-space models (SSMs), on the other hand, offer linear scaling and have shown promising results, but their development has been somewhat isolated from the broader efforts to optimize Transformers. This paper introduces a novel framework called State Space Duality (SSD), which establishes theoretical connections between SSMs and attention mechanisms through structured semiseparable matrices.
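As a back-of-the-envelope illustration (ours, not the paper's), the snippet below contrasts how attention-style quadratic cost and SSM-style linear cost grow as the sequence length doubles:

```python
def relative_cost(seq_len, base_len=1024):
    """Rough scaling only: constants, hidden size, and hardware are ignored."""
    quadratic = (seq_len / base_len) ** 2   # attention-style cost
    linear = seq_len / base_len             # SSM-style cost
    return quadratic, linear

for length in (1024, 2048, 4096):
    q, l = relative_cost(length)
    print(f"len={length}: attention ~{q:.0f}x, SSM ~{l:.0f}x")
# len=1024: attention ~1x, SSM ~1x
# len=2048: attention ~4x, SSM ~2x
# len=4096: attention ~16x, SSM ~4x
```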
This framework allows for the design of a new architecture, Mamba-2, which refines the selective SSM of its predecessor to achieve significantly faster performance (2-8×) while maintaining competitive language modeling capabilities.
Understanding State Space Models in Mamba-2 Architecture
Before we discuss Mamba-2, let's revisit a few basic concepts:
State Space Models (SSMs) are a way to transform sequences of data. They are defined by a set of parameters that dictate how input sequences are converted into output sequences. You can think of it as a matrix multiplying a vector, where the matrix represents the transformation rules of the SSM, and the vector represents the input sequence.
Semi-separable matrices are a special kind of matrix with a unique structure. They are defined by the property that certain submatrices have a limited rank, which is a measure of complexity or information content. These matrices can be represented in a compact form, which is efficient both in terms of storage and computational resources needed for operations like multiplication.
This paper establishes that the way SSMs transform sequences is equivalent to multiplying by a semi-separable matrix. In other words, the complex operations of an SSM can be understood and executed as operations on these structured matrices, so existing algorithms for structured matrix multiplication can be used to compute SSMs efficiently.
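Here is a small numerical sketch of that equivalence, using a scalar state and made-up parameters rather than anything from the paper: running a time-varying linear recurrence produces the same output as multiplying the input sequence by a lower-triangular (semiseparable) matrix.

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)    # per-step state-transition scalars ("A_t")
b = rng.normal(size=T)          # input projections ("B_t")
c = rng.normal(size=T)          # output projections ("C_t")
x = rng.normal(size=T)          # input sequence

# Run the SSM as a recurrence: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t
h, y_recurrent = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_recurrent[t] = c[t] * h

# Build the equivalent lower-triangular semiseparable matrix:
# M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s  for s <= t
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

y_matrix = M @ x
print(np.allclose(y_recurrent, y_matrix))   # True: same sequence transformation
```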
The Mamba-2 Architecture simplifies neural network design by removing certain linear projections and adding a normalization layer for stability. It uses a parallel approach to generate parameters at the start, enhancing efficiency and allowing for tensor parallelism.
The architecture takes advantage of multihead patterns for sequence transformations, where each "head" independently processes part of the input. This modular approach is scalable and adaptable to different model sizes. Mamba-2 also incorporates kernel feature maps for flexibility and normalization terms for output consistency, making it a robust and efficient framework for machine learning tasks.
Performance Benchmarks of Mamba-2
The study compares various hybrid models combining SSD, MLP, and attention layers, all trained on the Pile dataset to 300B tokens at a 2.7B scale. The models are:
Transformer++: Interleaves 32 attention and 32 gated MLP layers.
Mamba-2: Consists of 64 SSD layers.
Mamba-2-MLP: Interleaves 32 SSD and 32 gated MLP layers.
Mamba-2-Attention: Combines 58 SSD layers with 6 strategically placed attention layers.
Mamba-2-MLP-Attention: Interleaves 28 SSD layers with 4 attention layers and 32 gated MLP layers.
Mamba-2-Attention outperforms both Transformer++ and pure Mamba-2, with lower perplexity (lower is better) and higher zero-shot accuracy across benchmarks. Moreover, adding MLP layers to Mamba-2 decreases model quality but increases training and inference speed thanks to the MLP's hardware efficiency. SSD is 2 to 8 times faster than Mamba's scan implementation and outpaces FlashAttention-2 at longer sequence lengths due to its efficient use of matrix multiplication units on GPUs.
However, Mamba-2 may not be as efficient as a Transformer for short sequences due to the number of SSD layers required.
To Believe or Not to Believe Your LLM
Yadkori et al. [Google DeepMind]
♥ 668 LLM
Introduction
It is no mystery that LLMs lie, and they lie a lot! You must have seen that Google's AI recently had a bit of a "whoops" moment with its AI Overviews feature, from generating overly "woke" search summaries to spitting out answers that left users scratching their heads.
This misinformation arises due to two types of uncertainty: epistemic, which stems from a lack of knowledge about the ground truth, and aleatoric, which arises from inherent randomness in the problem, such as when multiple valid answers exist. This paper aims to reliably detect when only epistemic uncertainty is high, indicating that the model's output may not be trustworthy.
The authors introduce an iterative prompting procedure that leverages the chain rule of probability to construct a joint distribution over multiple responses, allowing them to detect when a model's output is influenced by its previous responses, thus revealing epistemic uncertainty.
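A rough, hypothetical sketch of that idea (our own simplified prompt template, not the paper's): sample answers normally, then re-ask the question with one of those answers repeated in the prompt and check whether the model's responses shift.

```python
def epistemic_probe(model, question, n_samples=8, n_repeats=5):
    """Hypothetical illustration: if repeating a previously sampled answer in the
    prompt changes what the model says next, its responses depend on its own prior
    outputs, which signals epistemic (knowledge) uncertainty rather than aleatoric."""
    baseline = [model.sample(question) for _ in range(n_samples)]

    shifted = []
    for answer in set(baseline):
        # Re-ask with the sampled answer repeated, as if it were established.
        primed = (question + "\n"
                  + f"A possible answer is: {answer}\n" * n_repeats
                  + question)
        shifted.append((answer, [model.sample(primed) for _ in range(n_samples)]))

    # If the primed samples diverge strongly from the baseline samples, the joint
    # distribution over responses is far from independent, i.e. mutual information
    # between answers is high, and the output is likely a hallucination.
    return baseline, shifted
```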
Why Do Large Language Models Lie?
The paper discusses how repeating certain phrases or questions can influence the LM's responses. It's like if you keep telling a friend that your favorite ice cream flavor is strawberry, they might start to believe it, even if you actually prefer chocolate.
They found that if you repeat an incorrect answer many times, the LM might start to think it's correct. But if the LM was very confident about the correct answer to begin with, it won't be easily swayed by the repetition of the wrong one.
Model Architecture
The core model architecture used in this study is centered around mutual information (MI) estimation, with specific adaptations to accommodate practical constraints. This is a clever trick to estimate the correctness without having to look at every possible thing the LM could say, which would take forever. It's like estimating how many jellybeans are in a jar without counting them all one by one.
Mutual Information Estimation:
Algorithm 1: This algorithm estimates the MI for a given joint distribution μ, typically set to Q, the language model's distribution. The estimation focuses on unique elements in the sample, ensuring duplicates do not skew the results.
Empirical Distribution: Due to the infinite potential support of Q, the method approximates Q using an empirical distribution derived from a finite sample, collecting representative elements in a set S.
Bias and Error Control:
Missing Mass (Uk): The missing mass quantifies the probability mass of elements not observed in the finite sample. This is crucial for bounding the error in MI estimation.
Theorem 4.6: This theorem provides a non-asymptotic bound on the estimation error, accounting for sample size, missing mass, and a bias term (γ). The bound ensures that the estimation error diminishes as the sample size (k) increases.
Epistemic Uncertainty Scoring I_k(γ, x): This score measures the epistemic uncertainty for a given query x, derived from the MI estimate. It serves as a basis for designing abstention policies that handle hallucinations in the language model's responses; a rough sketch of the scoring idea follows below.
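As a loose illustration of that scoring idea (not Algorithm 1 verbatim), a plug-in estimate of mutual information over paired responses could look like the following, with a small γ floor standing in for the bias term:

```python
from collections import Counter
from math import log

def plugin_mutual_information(pairs, gamma=1e-9):
    """Rough plug-in MI estimate over paired responses (y1, y2).
    A clearly positive MI suggests the second answer depends on the first,
    which the paper reads as a sign of epistemic uncertainty."""
    n = len(pairs)
    joint = Counter(pairs)                    # empirical joint distribution
    first = Counter(y1 for y1, _ in pairs)    # marginal of the first response
    second = Counter(y2 for _, y2 in pairs)   # marginal of the second response

    mi = 0.0
    for (y1, y2), count in joint.items():     # iterate over unique observed pairs
        p_joint = count / n
        p_prod = (first[y1] / n) * (second[y2] / n)
        mi += p_joint * log((p_joint + gamma) / (p_prod + gamma))
    return mi

# Perfectly dependent pairs give MI close to log(2); independent pairs give ~0.
print(plugin_mutual_information([("a", "a"), ("b", "b"), ("a", "a"), ("b", "b")]))
```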
Benchmarks and Results
The paper tested this new method on three sets of questions to see how well the model could tell when to answer and when to abstain:
TriviaQA: A set of general knowledge questions.
AmbigQA: Questions that might have more than one correct answer.
WordNet: A set of questions about different types of things, like "Name a type of fruit."
They looked at how often the LM gave the right answer (precision) and how often it chose to answer (recall). They wanted the LM to answer correctly as often as possible, but also not to answer when it wasn't sure. This method worked well! It was better than just using the most likely answer the LM came up with. It was especially good when questions had more than one correct answer, which is trickier for LMs.
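For clarity, here is how such an abstention policy could be scored in the precision/recall sense described above; the scoring function and example numbers are our own illustration, not the paper's evaluation code.

```python
def precision_recall(records, threshold):
    """records: list of (uncertainty_score, answer_is_correct) pairs.
    The model answers only when its uncertainty score is below the threshold."""
    answered = [correct for score, correct in records if score < threshold]
    precision = sum(answered) / len(answered) if answered else 1.0
    recall = len(answered) / len(records)        # fraction of queries answered
    return precision, recall

# Sweeping the threshold trades answering more often (higher recall)
# against answering correctly more often (higher precision).
records = [(0.1, True), (0.2, True), (0.4, False), (0.9, False)]
print(precision_recall(records, threshold=0.3))  # (1.0, 0.5)
print(precision_recall(records, threshold=1.0))  # (0.5, 1.0)
```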
In the following chart, we can see that while the first-order S.E. method has similar recall and error rates to those of the proposed M.E. method on low-entropy queries, its recall values are nearly zero for queries with higher entropy.