Differential Transformers, Intelligence at the Edge of Chaos, and LLMs Can't Truly Reason
#27 | Latest AI Research Explained Simply
In this issue: 3x industry news, 3x AI research papers
Oct 7th ~ Oct 13th
🗞️ Industry News in 1 Line
♥ 2.2k Google has released Imagen 3, its latest text-to-image model, which offers a higher degree of photorealism and better overall quality. It is now available to all Gemini users.
♥ 3.7k OpenAI aims to enhance AI models' ability to develop machine learning solutions and excel in Kaggle competitions with the release of MLE-bench, a new benchmark designed to evaluate how effectively AI agents perform in machine learning engineering.
♥ 8.4k The 7th edition of the State of AI Report was published earlier this week, and it offers a useful year-in-review of the field. You can read the entire report on Google Slides without entering your email address or clicking on ads.
Intelligence at the Edge of Chaos
Zhang et al. [Yale University, Columbia University, Northwestern University, Idaho State University]
♥ 1.4k Intelligence Theory bycloud’s pick
What Makes LLMs Intelligent?
Until now, we have assumed that the only way to create artificial intelligence is by training it on data that already contains inherent intelligence, such as human-generated language or expert-annotated datasets. This approach limits our understanding of how intelligence emerges and may not be the most efficient or accurate way to develop artificial intelligence systems.
This paper argues that intelligence can emerge from modeling simple systems that exhibit complex behaviors, even when the underlying process generating the data lacks inherent intelligence. The researchers investigate this by training LLMs on different elementary cellular automata (ECA) and evaluating their performance on downstream tasks.
How to Build Intelligent LLMs?
The researchers developed a sophisticated approach to investigate how complexity in simple systems might lead to intelligent behavior in artificial intelligence models. Here's a breakdown of their methodology:
They started by simulating various elementary cellular automata (ECA) rules. These are simple systems that generate patterns based on specific rules. The researchers created sequences of binary vectors for each ECA rule, representing how the system changes over time. They began with a random initial state and let it evolve for 1000 time steps. To create diverse training data, they extracted random windows from these sequences, each covering 60 time steps and 100 spatial dimensions.
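The data-generation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the grid width of 200 cells, the periodic boundary, and the helper names (`eca_step`, `simulate_eca`, `random_window`) are all assumptions made for the example.

```python
import random

def eca_step(state, rule):
    """Apply one update of an elementary cellular automaton.

    Assumes a periodic (wrap-around) boundary, which the article does not specify.
    """
    n = len(state)
    nxt = []
    for i in range(n):
        left, center, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right  # neighborhood as a 3-bit number
        nxt.append((rule >> idx) & 1)              # that bit of the rule number is the new cell
    return nxt

def simulate_eca(rule, width, steps, rng):
    """Start from a random binary state and record every state over `steps` updates."""
    state = [rng.randint(0, 1) for _ in range(width)]
    history = [state]
    for _ in range(steps):
        state = eca_step(state, rule)
        history.append(state)
    return history

def random_window(history, n_steps, n_cells, rng):
    """Cut a random (time x space) window out of the full evolution."""
    t0 = rng.randrange(len(history) - n_steps + 1)
    x0 = rng.randrange(len(history[0]) - n_cells + 1)
    return [row[x0:x0 + n_cells] for row in history[t0:t0 + n_steps]]

rng = random.Random(0)
history = simulate_eca(rule=110, width=200, steps=1000, rng=rng)  # width is an assumed value
window = random_window(history, n_steps=60, n_cells=100, rng=rng)
print(len(window), len(window[0]))  # prints: 60 100
```

Each row of `window` is one binary state vector, so a model trained on this data sees 60 consecutive states and learns to predict the next one.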
Next, they adapted a GPT-2 language model to work with this binary data. Instead of processing text, the model now predicts the next state in the ECA sequence. They trained separate models on data from different ECA rules for up to 10,000 epochs, using early stopping to prevent overfitting. The training process involved careful optimization techniques, including a specialized learning rate schedule and gradient clipping.
To test if the models developed intelligent behavior, the researchers designed three downstream tasks: an easy reasoning task, a hard reasoning task, and a chess move prediction task. The reasoning tasks were inspired by the Abstraction and Reasoning Corpus, and they required the models to infer transformation rules and apply them to new scenarios. The chess task asked the models to predict the next move in high-level chess games. Importantly, they only trained new input and output layers for these tasks, keeping the main model frozen. This approach allowed them to measure the inherent capabilities the models gained from training on the ECA data.
Testing Intelligent LLMs
This paper found a clear positive correlation between the complexity of the ECA rules used for pretraining and the models' performance on complex tasks. Models trained on more complex rules generally performed better, especially on the harder tasks. For the reasoning tasks, they measured efficiency (how quickly models reached 80% accuracy), while for chess, they reported final accuracy. They observed that models trained on rules from Wolfram's Class III (chaotic) and Class IV (complex) outperformed those trained on simpler rules.
Interestingly, they discovered a "sweet spot" of complexity that was most conducive to intelligence: rules that were challenging but not too chaotic. The researchers also analyzed the models' attention patterns and found that models trained on more complex rules tended to integrate more information from past states, which suggests they learned more sophisticated strategies even for simple prediction tasks.
Differential Transformer
Ye et al. [Microsoft Research, Tsinghua University]
♥ 1.6k Transformer Theory
Introduction to Differential Transformer
If you know anything about LLMs, you know that the vast majority of models use Transformers in their architecture. Although Transformers helped scientists win a Nobel Prize, they are still not perfect. One major problem is that they tend to over-allocate attention to irrelevant context.
This means that when processing text, these models often focus too much on unimportant parts of the input, which can lead to issues like poor information retrieval, hallucinations (generating incorrect or nonsensical information), and difficulties in handling long contexts. This problem arises because the standard attention mechanism in Transformers assigns non-negligible attention scores to irrelevant parts of the context, which can drown out the important information.
This paper introduces a new architecture called DIFF Transformer which uses a differential attention mechanism that calculates attention scores by subtracting two separate softmax attention maps. By doing this subtraction, the DIFF Transformer can amplify attention to relevant context while reducing attention to irrelevant parts (noise).
How do Differential Transformers Work?
The Differential Transformer (DIFF Transformer) is a new architecture designed to improve upon the standard Transformer model. The main innovation of DIFF Transformer is its differential attention mechanism. In this mechanism, the model computes two separate attention maps and then subtracts one from the other. This subtraction helps to cancel out "noise" in the attention, allowing the model to focus more sharply on relevant information.
The process works like this:
The input is projected into query and key vectors, which are split into two sets.
Two separate attention maps are computed using these sets.
One attention map is subtracted from the other, creating a "differential" attention score.
This differential score is then used to weigh the value vectors, determining how much each part of the input contributes to the output.
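The four steps above can be sketched for a single head in plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: the paper scales the second map by a learnable scalar before subtracting, whereas here `lam` is just a fixed argument, and the sketch leaves out causal masking, multiple heads, and the per-head normalization. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(Q, K):
    """Standard scaled dot-product attention map: softmax(Q K^T / sqrt(d)), row by row."""
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]

def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Differential attention: (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    out = []
    for r1, r2 in zip(A1, A2):
        weights = [a - lam * b for a, b in zip(r1, r2)]  # the "differential" scores
        out.append([
            sum(w * v[c] for w, v in zip(weights, V))    # weigh the value vectors
            for c in range(len(V[0]))
        ])
    return out

# Toy inputs: 2 tokens, 2-dimensional queries, keys, and values (made-up numbers).
Q1 = [[1.0, 0.0], [0.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0]]
Q2 = [[0.5, 0.5], [0.5, 0.5]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(diff_attention(Q1, K1, Q2, K2, V, lam=0.5))
```

Attention mass that both maps assign to the same positions cancels in the subtraction, which is the noise-cancelling intuition; with `lam=0` the function reduces to ordinary softmax attention.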
Multi-head differential attention.
The model uses multiple "heads" of this differential attention, each operating independently. The outputs from these heads are then normalized (to keep the values in a manageable range) and combined. The overall structure of DIFF Transformer is similar to the standard Transformer:
It's made up of multiple layers and each layer first applies the differential attention mechanism.
This is followed by a feed-forward network.
There are also normalization steps and residual connections (ways to preserve information from earlier in the process).
This architecture is compatible with existing optimization techniques for Transformers, which makes it easier to implement and scale up.
Results and Evaluation of Differential Transformer
The paper reports that the Differential Transformer outperforms the standard Transformer in several key areas:
Scalability: DIFF Transformer shows better performance across various model sizes (from 830M to 13.1B parameters) and amounts of training data. It achieves comparable results to larger Transformer models while using fewer parameters (about 60-65% of the Transformer's size) or less training data.
Long-context handling: When evaluated on longer sequences (up to 64K tokens), DIFF Transformer maintains better performance than the standard Transformer, especially in leveraging information from earlier parts of the context.
Key information retrieval: In tasks requiring the extraction of specific information from a large context, DIFF Transformer significantly outperforms the standard Transformer.
In-context learning: DIFF Transformer shows superior performance in many-shot classification tasks across various datasets. It also demonstrates greater robustness to the ordering of examples in the context, which has been a persistent issue for standard Transformers.
Many-shot in-context learning accuracy on different datasets.
Reduction of hallucinations: In both summarization and question-answering tasks, DIFF Transformer produces fewer hallucinations (incorrect or unsupported statements) compared to the standard Transformer.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Mirzadeh et al. [Apple]
♥ 4.9k LLM Reasoning
Mathematical Reasoning Benchmarks for LLMs
When a new LLM is released, it is tested on a number of benchmarks to see how good it is. One popular benchmark is GSM8K, which tests a model's mathematical reasoning on grade-school math problems. While GSM8K is widely used to assess LLMs' mathematical reasoning, it only provides a single metric on a fixed set of questions, which limits how much it can tell us about a model's capabilities. Moreover, the popularity of GSM8K increases the risk of inadvertent data contamination, which could skew results.
Although many models are performing well on GSM8K, it is still unclear whether LLMs have genuinely advanced in mathematical reasoning or are simply pattern-matching.
To address these issues, this paper introduces GSM-Symbolic, an improved benchmark that uses symbolic templates to generate diverse variants of GSM8K questions. This allows for a more nuanced and reliable evaluation across various setups. By generating multiple variants of each question, GSM-Symbolic reduces the risk of data contamination and allows for a more comprehensive assessment of LLM capabilities.
Additionally, instead of relying on single-point accuracy metrics, the researchers can now analyze LLM performance as a distribution across different question variants. It also allows for adjusting question complexity, enabling the researchers to investigate how LLMs handle increased difficulty and number of clauses.
How to Benchmark Mathematical Reasoning in LLMs using GSM-Symbolic
The authors ran a set of experiments to evaluate the reliability of current GSM8K results and assess the performance of Large Language Models (LLMs) on mathematical reasoning tasks. Here's how they did it:
They created GSM-Symbolic, a new benchmark based on GSM8K, by taking examples from the GSM8K test set and converting them into parsable templates.
They identified variables, their domains, and necessary conditions for each template.
They added automated checks to ensure that annotations were correct and that generated questions were valid.
They generated multiple datasets from these templates:
They used 100 templates and generated 50 samples per template, which resulted in 5,000 total examples for each benchmark.
They created 50 datasets, each containing 100 examples.
They evaluated multiple LLMs on these datasets:
They tested over 20 open models ranging from 2B to 27B parameters.
They included state-of-the-art closed models such as GPT-4o and o1-preview.
They used Chain-of-Thought (CoT) prompting with 8 shots and greedy decoding.
They analyzed the performance distribution:
They ran nearly 500 total evaluations on various setups.
They plotted the accuracy distribution for each model across the 50 datasets.
They compared the GSM-Symbolic results to the original GSM8K performance.
They investigated the impact of different changes in the input. For example, they separately modified proper names and numerical values in the questions and examined how these changes affected model performance.
They studied the effect of question difficulty by adding or removing clauses from the questions to adjust difficulty and then analyzed how the number of clauses affected performance and variance.
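The template idea behind these steps can be sketched as follows. This is a toy illustration, not the benchmark's actual format: the template text, the name list, the variable ranges, and the helper names are all hypothetical, and a single template stands in for the 100 used in the paper.

```python
import random

# Hypothetical template: variables in braces, with the answer defined symbolically below.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday, "
    "then gives away {z}. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Mira"]  # hypothetical domain for the name variable

def sample_instance(rng):
    """Draw variable values from their domains until the template's condition holds."""
    while True:
        x, y, z = rng.randint(5, 40), rng.randint(5, 40), rng.randint(1, 30)
        if z < x + y:  # condition: the answer must stay positive
            break
    name = rng.choice(NAMES)
    return {
        "question": TEMPLATE.format(name=name, x=x, y=y, z=z),
        "answer": x + y - z,  # ground truth computed from the sampled values
        "vars": {"name": name, "x": x, "y": y, "z": z},
    }

def build_datasets(samplers, n_datasets, rng):
    """Each dataset holds one fresh instance of every template."""
    return [[sample(rng) for sample in samplers] for _ in range(n_datasets)]

rng = random.Random(0)
datasets = build_datasets([sample_instance], n_datasets=50, rng=rng)
print(len(datasets), len(datasets[0]))  # prints: 50 1
print(datasets[0][0]["question"])
```

Because every dataset contains a different instantiation of the same templates, a model's accuracy can be reported as a distribution over the 50 datasets rather than as a single number.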
Mathematical Reasoning Results of LLMs on GSM-Symbolic Benchmarks
The researchers also built a variant called GSM-NoOp, in which they added extra information to math problems that seemed relevant but didn't actually affect the solution. These additions were called "No-Op" statements because they had no operational significance for solving the problem. When faced with these modified problems, all the models tested showed a significant drop in performance. Even the best models struggled with these new examples. For instance, the Phi-3-mini model's performance dropped by over 65%, and even more advanced models like o1-preview showed notable declines.
How sensitive are LLMs when we change only names, only proper numbers, or both names and numbers?
The researchers observed that the models often tried to use the extra information in their calculations, even when it wasn't necessary. For example, if a problem mentioned smaller fruits, the models would try to subtract them, even if that wasn't part of the actual problem. This behavior suggested that the models were applying operations based on patterns they'd seen in their training data, rather than truly understanding the problem at hand.
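To make the No-Op idea concrete, here is a small sketch that inserts a distractor sentence just before the final question while leaving the correct answer untouched. The helper `add_noop` and the example problem are hypothetical and are not taken from the benchmark.

```python
def add_noop(question, clause):
    """Insert an irrelevant clause before the last sentence (the actual question)."""
    body, sep, final = question.rpartition(". ")
    if not sep:  # single-sentence question: just put the clause in front
        return f"{clause} {question}"
    return f"{body}. {clause} {final}"

question = "Liam picks 10 apples on Monday and 12 on Tuesday. How many apples does Liam have?"
answer = 10 + 12  # the distractor below must not change this

noop_question = add_noop(question, "Three of the apples are a bit smaller than average.")
print(noop_question)
print(answer)  # prints: 22
```

A model that pattern-matches "smaller" to a subtraction would answer 19 here, which is the kind of failure the paragraph above describes.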
The distribution of 8-shot Chain-of-Thought (CoT) performance across 50 sets generated from GSM-Symbolic
These findings raise important questions about whether LLMs really understand mathematical concepts or if they're just very good at pattern matching.