🚨This week’s top AI/ML research papers - Oct 12th
(Oct 5 ~ Oct 12th, 2024)
🚨This week’s top AI/ML research papers:
Differential Transformer
GSM-Symbolic
Pixtral 12B
Intelligence at the Edge of Chaos
Cheating Automatic LLM Benchmarks
nGPT
Upcycling Large Language Models into Mixture of Experts
Personalized Visual Instruction Tuning
Towards World Simulator
Only-IF
Addition is All You Need for Energy-efficient Language Models
Selective Attention Improves Transformer
MLLM as Retriever
Rectified Diffusion
Everything Everywhere All at Once
Astute RAG
LLMs Are In-Context Reinforcement Learners
Scaling Laws For Diffusion Transformers
EVOLvE
Rewarding Progress
Falcon Mamba
Efficient Dictionary Learning with Switch Sparse Autoencoders
Scaling Up Your Kernels
RL, but don't do anything I wouldn't do
Aria: An Open Multimodal Native Mixture-of-Experts Model
Inheritune: Training Smaller Yet More Attentive Language Models
overview for each + authors' explanations ⬇️
Differential Transformer
Overview:
Diff Transformer introduces a differential attention mechanism that improves the focus of Transformer models on relevant context by subtracting one softmax attention map from another to reduce noise.
This method results in sparse attention patterns, enhancing performance in language modeling across different model scales and training token settings.
It is particularly effective in long-context modeling, key information retrieval, hallucination mitigation, and in-context learning, showing robustness to order permutations.
The approach proves successful in minimizing distractions from irrelevant context, advancing the capabilities of LLMs in tasks such as question answering and text summarization.
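As a rough sketch of the idea (not the authors' implementation), differential attention can be written as the difference of two softmax attention maps built from split query/key projections; the single-head layout, shapes, and fixed lambda below are illustrative assumptions:

```python
# Minimal sketch of differential attention: two softmax attention maps are
# computed from separate query/key projections and one is subtracted from the
# other (scaled by lambda) to cancel common-mode "attention noise".
import torch
import torch.nn.functional as F

def differential_attention(x, wq1, wq2, wk1, wk2, wv, lam=0.5):
    """x: (seq_len, d_model); w*: (d_model, d_head) projections; lam: scalar."""
    q1, q2 = x @ wq1, x @ wq2                            # two query projections
    k1, k2 = x @ wk1, x @ wk2                            # two key projections
    v = x @ wv
    d_head = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.T / d_head ** 0.5, dim=-1)    # first attention map
    a2 = F.softmax(q2 @ k2.T / d_head ** 0.5, dim=-1)    # second attention map
    return (a1 - lam * a2) @ v                           # differential map attends to values

# toy usage
d_model, d_head, seq = 16, 8, 4
x = torch.randn(seq, d_model)
proj = lambda: torch.randn(d_model, d_head) / d_model ** 0.5
out = differential_attention(x, proj(), proj(), proj(), proj(), proj())
```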
Paper:
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Overview:
A large-scale evaluation on the new GSM-Symbolic benchmark reveals that LLMs show notable variance and reduced accuracy in mathematical reasoning across different instantiations of the same questions.
This benchmark, designed with symbolic templates, highlights that merely altering numerical values or adding non-contributory clauses can significantly degrade LLM performance, with declines up to 65%.
The study suggests that these models don't perform genuine logical reasoning but rather mimic patterns from their training data, casting doubt on previous metrics of their mathematical reasoning progress.
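To illustrate the benchmark's construction, here is a hedged sketch of a symbolic template in which names and numbers are resampled to produce many variants of the "same" question; the specific template and value ranges are invented for illustration, not taken from the benchmark:

```python
# Toy symbolic template: names and numbers become variables, and each sample
# comes with a ground-truth answer computed from the template itself.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "{name} then gives away {z} apples. How many apples are left?")

def sample_instance(rng):
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    x, y = rng.randint(5, 40), rng.randint(5, 40)
    z = rng.randint(1, x + y)            # keep the answer non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z                   # ground truth derived from the template
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = sample_instance(rng)
    print(q, "->", a)
```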
Paper:
Author's Explanation:
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the… x.com/i/web/status/1…
— Mehrdad Farajtabar (@MFarajtabar)
7:16 PM • Oct 10, 2024
Pixtral 12B
Overview:
Pixtral-12B is a 12-billion-parameter multimodal language model designed to handle both natural images and documents, achieving superior performance on several benchmarks and outperforming larger models.
It features a newly trained vision encoder that accepts images at natural resolution and aspect ratio, offering flexibility in token usage.
Capable of processing multiple images within a 128K-token context window, Pixtral-12B surpasses models like Llama-3.2 11B and even outperforms the much larger Llama-3.2 90B.
Additionally, the paper introduces MM-MT-Bench, an open-source benchmark for evaluating the performance of vision-language models in practical settings.
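As a rough illustration of what "flexibility in token usage" means for a native-resolution encoder, the sketch below counts image tokens as one per patch of the image's own height and width rather than a fixed square crop; the 16x16 patch size is an assumption made only for this example:

```python
# Token count scales with the image's native resolution and aspect ratio.
import math

def num_image_tokens(height, width, patch=16):
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows * cols                 # one token per patch, no forced resize

print(num_image_tokens(512, 512))      # square image
print(num_image_tokens(256, 1024))     # wide document page, same pixel budget
```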
Paper:
Author's Explanation:
We just released Pixtral 12B paper on Arxiv:
arxiv.org/abs/2410.07073
— Devendra Chaplot (@dchaplot)
2:26 AM • Oct 10, 2024
Intelligence at the Edge of Chaos
Overview:
This paper investigates the relationship between rule complexity and intelligence in artificial systems by examining elementary cellular automata (ECA) and training LLMs on them.
The study reveals that LLMs trained on more complex rules exhibit higher intelligence, improving performance on tasks such as reasoning and chess move prediction.
Systems with uniform, periodic, or highly chaotic behavior result in poorer downstream performance, suggesting an optimal level of complexity for fostering intelligence.
The authors propose that intelligence emerges from the capability to predict and handle complexity, indicating that exposure to such complexity is key to developing intelligent systems.
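For context, the training data comes from elementary cellular automata: each of the 256 Wolfram rules maps a cell's 3-cell neighborhood to its next state. A minimal sketch of generating such sequences is below; the grid width, step count, and Rule 110 choice are arbitrary illustration choices:

```python
# Generate rows of a binary elementary cellular automaton (ECA).
import random

def eca_step(state, rule):
    """Advance one row of a binary ECA under the given Wolfram rule (0-255)."""
    n = len(state)
    nxt = []
    for i in range(n):
        left, center, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right   # neighborhood as a 3-bit index
        nxt.append((rule >> idx) & 1)               # rule's output bit for that neighborhood
    return nxt

def generate_sequence(rule, width=32, steps=16, seed=0):
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(width)]
    rows = [state]
    for _ in range(steps):
        state = eca_step(state, rule)
        rows.append(state)
    return rows

rows = generate_sequence(rule=110)   # Rule 110: a classically "complex" rule
print("".join("#" if c else "." for c in rows[-1]))
```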
Paper:
Author's Explanation:
How does complexity shape intelligence? 🤔
In our new paper Intelligence at the Edge of Chaos, we explore the relationship between complex systems and the emergence of intelligence in AI models. Can complexity alone unlock smarter systems? 🌌🧠 #AI #complexity… x.com/i/web/status/1…
— Van Dijk Lab (@david_van_dijk)
12:25 PM • Oct 10, 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Overview:
Automatic LLM benchmarks have gained popularity for their efficiency, but this paper demonstrates that they can be vulnerable to manipulation.
The authors show that even a "null model," which outputs a constant irrelevant response, can achieve high win rates, such as an 86.5% win rate on AlpacaEval 2.0, by exploiting weaknesses in benchmark design.
The study suggests that an adversary could use LLMs to generate undetectable cheating responses, emphasizing the need for robust anti-cheating mechanisms to ensure the reliability of automatic benchmarks.
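A hedged sketch of the "null model" idea: a system that ignores its input entirely and returns one constant string. The placeholder response below is harmless filler, not the paper's actual adversarial text:

```python
# A "model" that always returns the same constant, input-independent response.
class NullModel:
    def __init__(self, constant_response):
        self.constant_response = constant_response

    def generate(self, prompt):
        # The prompt is deliberately ignored.
        return self.constant_response

null = NullModel("I cannot answer that, but thank you for the question.")
print(null.generate("What is the capital of France?"))
print(null.generate("Write a sorting algorithm in Python."))
```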
Paper:
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Overview:
The paper introduces nGPT, a normalized Transformer architecture implementing representation learning on the hypersphere.
In this design, vectors in the embeddings, MLP, attention matrices, and hidden states are unit normalized, with the input tokens moving across a hypersphere surface and each layer adding a displacement to reach output predictions.
These displacements are managed by MLP and attention blocks, which also reside on the hypersphere.
nGPT demonstrates significantly faster learning, achieving equivalent accuracy with up to 20 times fewer training steps depending on the sequence length.
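A minimal sketch of this style of update, assuming a simple "move toward the block's output, then re-normalize" rule; the stand-in MLP block, fixed step size, and shapes are illustrative and not the paper's exact parameterization:

```python
# Hidden states live on the unit hypersphere; each block proposes a target
# point, the state moves a fraction alpha toward it, and is re-normalized.
import torch
import torch.nn.functional as F

def to_sphere(x):
    return F.normalize(x, dim=-1)           # project onto the unit hypersphere

def ngpt_layer_update(h, block, alpha=0.1):
    """h: (seq, d) unit-norm hidden states; block: callable returning a target."""
    target = to_sphere(block(h))             # block output, also on the sphere
    h = h + alpha * (target - h)             # displacement toward the target
    return to_sphere(h)                      # retract back onto the sphere

d = 32
h = to_sphere(torch.randn(4, d))
mlp = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.SiLU(), torch.nn.Linear(4 * d, d))
h = ngpt_layer_update(h, mlp)
print(h.norm(dim=-1))                        # each hidden state keeps unit norm
```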
Paper:
Upcycling Large Language Models into Mixture of Experts
Overview:
Upcycling dense LLMs into sparse mixture-of-experts (MoE) models is explored to enhance model capacity efficiently.
The study introduces a "virtual group" initialization scheme and weight scaling method for fine-grained MoE architectures, demonstrating that upcycling is more effective than continued dense model training.
It reveals that a softmax-then-topK expert routing enhances performance compared to the topK-then-softmax approach, with higher granularity MoEs improving accuracy.
In a comparison on 1T tokens, an upcycled Nemotron-4 15B model showed a higher accuracy than a continuously trained counterpart, illustrating the potential of upcycling methods for MoE language models.
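To make the routing comparison concrete, here is a small sketch (with illustrative shapes and k) contrasting the two orderings: softmax-then-topK keeps the top-k of the full-softmax probabilities, while topK-then-softmax renormalizes over only the selected logits:

```python
# Two orderings of gating for top-k expert routing.
import torch
import torch.nn.functional as F

def softmax_then_topk(logits, k):
    probs = F.softmax(logits, dim=-1)            # softmax over all experts
    weights, experts = probs.topk(k, dim=-1)     # keep the top-k probabilities as-is
    return weights, experts

def topk_then_softmax(logits, k):
    top_logits, experts = logits.topk(k, dim=-1)
    weights = F.softmax(top_logits, dim=-1)      # renormalize over the chosen k only
    return weights, experts

router_logits = torch.randn(2, 8)                # 2 tokens, 8 experts
print(softmax_then_topk(router_logits, k=2))
print(topk_then_softmax(router_logits, k=2))
```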
Paper:
Author's Explanation:
I'm excited to share our latest research on improving LLM by upcycling them into Mixture of Experts (MoE)!
1. We upcycled the Nemotron-4 15B model on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens. The continuously trained… x.com/i/web/status/1…
— Ethan He (@EthanHe_42)
12:56 AM • Oct 11, 2024