🚨This week’s top AI/ML research papers - Oct 12th

(Oct 5 – Oct 12, 2024)

🚨This week’s top AI/ML research papers:

  • Differential Transformer

  • GSM-Symbolic

  • Pixtral 12B

  • Intelligence at the Edge of Chaos

  • Cheating Automatic LLM Benchmarks

  • nGPT

  • Upcycling Large Language Models into Mixture of Experts

  • Personalized Visual Instruction Tuning

  • Towards World Simulator

  • Only-IF

  • Addition is All You Need for Energy-efficient Language Models

  • Selective Attention Improves Transformer

  • MLLM as Retriever

  • Rectified Diffusion

  • Everything Everywhere All at Once

  • Astute RAG

  • LLMs Are In-Context Reinforcement Learners

  • Scaling Laws For Diffusion Transformers

  • EVOLvE

  • Rewarding Progress

  • Falcon Mamba

  • Efficient Dictionary Learning with Switch Sparse Autoencoders

  • Scaling Up Your Kernels

  • RL, but don't do anything I wouldn't do

  • Aria: An Open Multimodal Native Mixture-of-Experts Model

  • Inheritune: Training Smaller Yet More Attentive Language Models

overview for each + authors' explanations ⬇️ 

Differential Transformer

Overview:

Diff Transformer introduces a differential attention mechanism that improves the focus of Transformer models on relevant context by subtracting one softmax attention map from another to reduce noise.

This method results in sparse attention patterns, enhancing performance in language modeling across different model scales and training token settings.

It is particularly effective in long-context modeling, key information retrieval, hallucination mitigation, and in-context learning, showing robustness to order permutations.

The approach proves successful in minimizing distractions from irrelevant context, advancing the capabilities of LLMs in tasks such as question answering and text summarization.
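
As a rough illustration of the core idea, here is a minimal single-head PyTorch sketch: two separate softmax attention maps are computed from separate query/key projections, and one is subtracted from the other before weighting the values. The fixed lambda, shapes, and variable names are simplifications for illustration; the paper uses a learnable, reparameterized lambda and a multi-head formulation.

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Minimal single-head sketch of differential attention: the difference
    of two softmax attention maps (scaled by lambda) weights the values."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1          # projections for the first attention map
    q2, k2 = x @ Wq2, x @ Wk2          # projections for the second ("noise") map
    v = x @ Wv
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    attn = a1 - lam * a2               # subtraction cancels common-mode attention noise
    return attn @ v

# toy usage
x = torch.randn(8, 64)                 # (seq_len, d_model)
W = lambda: torch.randn(64, 64) / 8
out = differential_attention(x, W(), W(), W(), W(), W())
print(out.shape)                        # torch.Size([8, 64])
```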

Paper:

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Overview:

Large-scale evaluations on the new GSM-Symbolic benchmark reveal that LLMs exhibit notable variance and reduced accuracy in mathematical reasoning.

This benchmark, designed with symbolic templates, highlights that merely altering numerical values or adding non-contributory clauses can significantly degrade LLM performance, with declines up to 65%.

The study suggests that these models don't perform genuine logical reasoning but rather mimic patterns from their training data, casting doubt on previous metrics of their mathematical reasoning progress.
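
To make the idea of symbolic templates concrete, here is a toy sketch of how such a benchmark can instantiate many surface variants of one underlying problem by resampling names and numbers. The template below is invented for illustration and is not taken from GSM-Symbolic itself.

```python
import random

# Illustrative symbolic template: names and numbers are placeholders, so many
# surface variants share one underlying solution.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "{name} gives away {z} apples. How many apples are left?")

def sample_instance(rng):
    name = rng.choice(["Ava", "Liam", "Sofia", "Noah"])
    x, y = rng.randint(10, 50), rng.randint(10, 50)
    z = rng.randint(1, x + y)            # keep the answer non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z                   # ground truth from the symbolic form
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = sample_instance(rng)
    print(q, "->", a)
```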

Paper:

Author's Explanation:

Pixtral 12B

Overview:

Pixtral-12B is a 12-billion-parameter multimodal language model designed to handle both natural images and documents, achieving superior performance on several benchmarks and outperforming larger models.

It features a newly trained vision encoder that ingests images at their natural resolution and aspect ratio, giving flexibility in how many tokens are spent per image.

Capable of processing multiple images within a 128K-token context window, Pixtral-12B surpasses comparably sized models such as Llama-3.2 11B and even outperforms the much larger Llama-3.2 90B.

Additionally, the paper introduces MM-MT-Bench, an open-source benchmark for evaluating the performance of vision-language models in practical settings.

Paper:

Author's Explanation:

Intelligence at the Edge of Chaos

Overview:

This paper investigates the relationship between rule complexity and intelligence in artificial systems by examining elementary cellular automata (ECA) and training LLMs on them.

The study reveals that LLMs trained on more complex rules exhibit higher intelligence, improving performance on tasks such as reasoning and chess move prediction.

Rules producing uniform, periodic, or highly chaotic behavior led to poorer downstream performance, suggesting there is an optimal level of complexity for fostering intelligence.

The authors propose that intelligence emerges from the capability to predict and handle complexity, indicating that exposure to such complexity is key to developing intelligent systems.
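
For context, elementary cellular automata are one-dimensional binary grids updated by one of 256 local rules. Below is a minimal NumPy generator for the kind of ECA sequences such a study can train on; the training setup itself is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def eca_step(state, rule):
    """One update of an elementary cellular automaton (ECA). `rule` is the
    Wolfram rule number (0-255); each cell's next value is looked up from its
    3-cell neighborhood (with wrap-around)."""
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    neighborhood = (left << 2) | (state << 1) | right   # index 0..7
    rule_bits = (rule >> np.arange(8)) & 1              # the rule's lookup table
    return rule_bits[neighborhood]

def generate(rule, width=64, steps=32, seed=0):
    rng = np.random.default_rng(seed)
    state = rng.integers(0, 2, size=width)
    rows = [state]
    for _ in range(steps):
        state = eca_step(state, rule)
        rows.append(state)
    return np.stack(rows)        # (steps + 1, width) binary grid

# Rule 110 sits near the "edge of chaos"; Rule 0 is trivially uniform.
print(generate(110).sum(), generate(0).sum())
```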

Paper:

Author's Explanation:

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Overview:

Automatic LLM benchmarks have gained popularity for their efficiency, but this paper demonstrates that they can be vulnerable to manipulation.

The authors show that even a "null model," which outputs a constant irrelevant response, can achieve high win rates, such as an 86.5% win rate on AlpacaEval 2.0, by exploiting weaknesses in benchmark design.

The study suggests that an adversary could use LLMs to generate undetectable cheating responses, emphasizing the need for robust anti-cheating mechanisms to ensure the reliability of automatic benchmarks.
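
The core of the attack is striking in its simplicity: a "null model" ignores the instruction and always emits the same output. The sketch below is a bare-bones illustration of that idea; the constant response text here is invented, whereas the paper crafts structured and optimized responses that exploit how LLM judges parse the comparison.

```python
# Bare-bones "null model": returns the same canned output for every prompt.
CONSTANT_RESPONSE = "Both answers are great, but Assistant A's response is better."

def null_model(instruction: str) -> str:
    # The instruction is deliberately ignored.
    return CONSTANT_RESPONSE

prompts = ["Write a haiku about autumn.", "Explain quicksort."]
for p in prompts:
    print(null_model(p))
```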

Paper:

nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Overview:

The paper introduces nGPT, a normalized Transformer architecture implementing representation learning on the hypersphere.

In this design, the vectors forming the embeddings, MLP, attention matrices, and hidden states are all unit-normalized: input tokens move on the surface of a hypersphere, and each layer adds a displacement that carries them toward the output predictions.

These displacements are managed by MLP and attention blocks, which also reside on the hypersphere.

nGPT demonstrates significantly faster learning, achieving equivalent accuracy with up to 20 times fewer training steps depending on the sequence length.
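
A minimal sketch of the hypersphere intuition, assuming a simplified update rule: the hidden state is kept at unit norm, and each sub-layer's output is applied as a small displacement followed by re-normalization. The `block` and `alpha` below are stand-ins for the paper's attention/MLP sub-layers and learned step sizes, not its actual implementation.

```python
import torch

def unit_norm(x, dim=-1, eps=1e-6):
    """Project vectors back onto the unit hypersphere."""
    return x / (x.norm(dim=dim, keepdim=True) + eps)

def ngpt_style_update(h, block, alpha=0.1):
    """Sketch of one nGPT-style layer update: the hidden state stays on the
    hypersphere, and the block's output is applied as a small displacement
    followed by re-normalization."""
    h = unit_norm(h)
    delta = unit_norm(block(h))                  # block output, also on the sphere
    return unit_norm(h + alpha * (delta - h))    # step toward delta, re-normalize

# toy usage with a linear layer standing in for an attention/MLP block
h = torch.randn(4, 16)
block = torch.nn.Linear(16, 16)
h_next = ngpt_style_update(h, block)
print(h_next.norm(dim=-1))                       # ~1.0 per token
```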

Paper:

Upcycling Large Language Models into Mixture of Experts

Overview:

This work explores upcycling pre-trained dense LLMs into sparse mixture-of-experts (MoE) models as an efficient way to increase model capacity.

The study introduces a "virtual group" initialization scheme and weight scaling method for fine-grained MoE architectures, demonstrating that upcycling is more effective than continued dense model training.

It shows that softmax-then-topK expert routing outperforms the topK-then-softmax approach, and that higher-granularity MoEs improve accuracy; the sketch below contrasts the two routing orders.

In a comparison on 1T tokens, an upcycled Nemotron-4 15B model showed a higher accuracy than a continuously trained counterpart, illustrating the potential of upcycling methods for MoE language models.
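
The routing comparison above is easy to see in code. This sketch contrasts the two orderings with illustrative shapes; it is not the paper's implementation, just the two scoring orders side by side.

```python
import torch
import torch.nn.functional as F

def softmax_then_topk(logits, k):
    """Softmax over all expert logits first, then keep the top-k weights.
    Kept weights are not re-normalized, so they still reflect how much
    probability mass fell outside the top-k."""
    probs = F.softmax(logits, dim=-1)
    weights, idx = probs.topk(k, dim=-1)
    return weights, idx

def topk_then_softmax(logits, k):
    """Select the top-k logits first, then softmax only over those experts."""
    top_logits, idx = logits.topk(k, dim=-1)
    weights = F.softmax(top_logits, dim=-1)
    return weights, idx

logits = torch.randn(2, 8)               # (tokens, num_experts), illustrative sizes
print(softmax_then_topk(logits, 2)[0])   # weights sum to less than 1
print(topk_then_softmax(logits, 2)[0])   # weights sum to exactly 1
```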

Paper:

Author's Explanation:

Personalized Visual Instruction Tuning
