
Transcendence, DeepSeek Coder V2, and Compute Better Spent

#11 | Latest AI Research Explained Simply

In this issue: x4 industry news, x3 AI research papers

June 18th ~ June 25th

🗞️ Industry News in 1 Line

  1. ♥ 6.1k Anthropic released Claude 3.5 Sonnet, which tops GPT-4o in coding and leads the charts on a number of benchmarks. With its new Artifacts feature, it can produce code, docs, vector graphics, and even some simple games.

  2. ♥ 31k Ilya Sutskever, co-founder and chief scientist at OpenAI, has left the company to start a new one, Safe Superintelligence Inc., due to disagreements with OpenAI's vision.

  3. ♥ 929 Nous Research, an AI company that has released a large collection of open-source LLM finetunes in the past, has recently released Hermes-2 Θ (Theta) 70B, which is capable of function calling, structured JSON-mode outputs, and feature extraction, and beats Llama-3 70B Instruct nearly across the board.

  4. ♥ 805 Microsoft has released Florence-2, a lightweight vision-language model, under the MIT license. It comes in two variants, 0.23B and 0.77B parameters, which makes it well suited for mobile devices or even a Raspberry Pi.

Southern New Hampshire University - Computer Science Program

Gain the skills you need to enter computer science and AI with Southern New Hampshire University’s (SNHU) cutting-edge program. 

  • Master popular languages like Python, Java, and C++, and expand your expertise with full-stack development and cloud integration using JavaScript, NoSQL, and Amazon Web Services. 

  • Learn agile software methodologies and develop a security mindset to tackle industry challenges head-on.

  • Experience cloud-based virtual environments to access the technology needed for your degree and career. 

  • SNHU offers some of the lowest online tuition rates in the nation, making your education radically affordable.

Visit https://snhu.edu/TheAITimeline to discover the current median annual salary for developers and request free information about the program. Speak now with a real person to see how this program can benefit you personally!

Transcendence: Generative Models Can Outperform The Experts That Train Them

Zhang et al. [Harvard, UC Santa Barbara, Princeton, Kempner Institute, Google DeepMind, Apple]

♥ 824   AI Reasoning

The last-hidden-layer latent representations of game transcripts during training. Colors show the probability of winning, with +1 corresponding to a state where White has won and 0 to one where Black has won.

Generative models are typically designed to imitate the conditional probability distribution induced by their training data. When this data is generated by humans, it is generally assumed that the models will, at best, match human performance on the tasks. This paper addresses the intriguing phenomenon of transcendence, where a generative model not only matches but surpasses the capabilities of the experts who created its training data.

The authors demonstrate this by training an autoregressive transformer, named ChessFormer, on chess game transcripts from players with ratings up to a certain level. Remarkably, ChessFormer can exceed the highest rating seen in its training data, showcasing its ability to transcend the skill level of the expert data sources.

How Does ChessFormer Achieve Transcendence?

ChessFormer achieves transcendence through a combination of low-temperature sampling and leveraging the diversity of its training data. Here's how each of them works:

Low-Temperature Sampling

When the data is generated by a single, noisy expert, transcendence can be achieved through low-temperature sampling. As the temperature decreases, the model increasingly concentrates probability on high-reward actions, amplifying the signal from the expert's correct predictions while suppressing the noise from incorrect or suboptimal moves. As a result, the model consistently selects better moves than the expert players in its training data, effectively denoising the expert.

In cases where the dataset is generated by multiple experts, each excelling in different domains, low-temperature sampling can still help the model to achieve transcendence. By integrating diverse expert predictions, the model benefits from the strengths of each expert while minimizing individual weaknesses. This complementary knowledge enables the model to outperform any single expert, provided the test distribution includes a variety of inputs covered by different experts.
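
To make the denoising effect concrete, here is a minimal sketch of temperature scaling; the four-move toy position and its logits are invented for illustration and are not taken from the paper:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample an action index from softmax(logits / temperature)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# A hypothetical noisy expert over four candidate moves; move 0 is best,
# but at temperature 1 the expert still plays weaker moves about half the time.
logits = np.array([2.0, 1.5, 0.5, 0.0])
rng = np.random.default_rng(0)

for t in (1.0, 0.5, 0.001):
    picks = [sample_with_temperature(logits, t, rng) for _ in range(10_000)]
    print(f"T={t}: P(best move) = {np.mean(np.array(picks) == 0):.3f}")
# As T -> 0, sampling collapses onto the argmax, denoising the expert.
```

At temperature 1 this hypothetical expert plays the best move only about half the time; as the temperature approaches 0, sampling collapses onto the argmax and the noise disappears.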

Diverse Training Data

ChessFormer is trained on a dataset that includes game transcripts from players of varying skill levels, each contributing unique strategies and knowledge. This diversity allows the model to learn from a wide range of scenarios and expert behaviors. When the model combines this diverse information through low-temperature sampling, it can identify and adopt the best strategies from each expert, effectively creating a composite expertise that exceeds any individual player's skill level.
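
As a toy illustration of how low-temperature sampling combines complementary experts, consider two invented experts who are each reliable in a different position (all numbers below are made up, not from the paper):

```python
import numpy as np

# Two hypothetical positions, three candidate moves each; move 0 is best
# in both. Expert A is strong in position 0 but weak in position 1;
# expert B is the reverse, so neither plays the best move everywhere.
expert_a = np.array([[0.7, 0.2, 0.1],    # position 0: correctly prefers move 0
                     [0.2, 0.3, 0.5]])   # position 1: wrongly prefers move 2
expert_b = np.array([[0.3, 0.4, 0.3],    # position 0: wrongly prefers move 1
                     [0.6, 0.3, 0.1]])   # position 1: correctly prefers move 0

mixture = (expert_a + expert_b) / 2      # the distribution the model imitates
# Low-temperature (argmax) play on the mixture picks move 0 in both
# positions -- better than either expert alone.
print(mixture.argmax(axis=1))            # [0 0]
```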

Results and Findings for Transcendence

The concept of transcendence has profound implications for the future of AI and generative models. It suggests that under the right conditions, models can not only replicate but surpass human expertise, opening new possibilities for AI applications in various domains.

They have shown that when ChessFormer is trained on games with ratings between 1000 and 1300, it is able to transcend and play at roughly a 1500 rating level. However, when it is trained on games rated 1500, it is unable to transcend to higher ratings like 2000. This may be due to a lack of data diversity: if so, a 1000-rated player can be thought of as a noisy 1500-rated player, but a 1500-rated player cannot be thought of as a noisy 2000-rated player.

The study lays the groundwork for further exploration into transcendence across different fields, potentially revolutionizing our understanding of AI capabilities and their practical implementations.

When the model is trained only on games where players have ratings up to a threshold, ChessFormer is able to learn from the games effectively and surpass that threshold (95% confidence interval).

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Zhu et al. [DeepSeek]

♥ 1.5k   LLM

Evaluation results on the “Needle In A Haystack” (NIAH) tests. DeepSeek-Coder-V2 performs well across all context window lengths up to 128K.

Introduction to DeepSeek-Coder-V2

Whenever people talk about AI, they are usually talking about models released by one of the few tech giants – it is easy to forget that there are other players in this space releasing top-notch open-source AI models. Until now, existing open-source models, despite significant progress, have struggled to match the capabilities of leading proprietary models, particularly in code-specific tasks and mathematical reasoning.

A new model by DeepSeek, called DeepSeek-Coder-V2 (GitHub repo), aims to reduce this performance gap between open-source code language models and state-of-the-art closed-source models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro through a combination of advanced architectural choices, extensive pre-training, expanded language support, and enhanced training methodologies.

How does DeepSeek-Coder-V2 Work?

DeepSeek-Coder-V2 uses the same architecture as DeepSeek-V2 but introduces several innovative approaches to bridge the performance gap and establish itself as a competitive open-source alternative:

  1. Mixture-of-Experts (MoE) Architecture: DeepSeek-Coder-V2 leverages the Mixture-of-Experts framework, which dynamically routes each input to a different subset of model parameters, thereby optimizing computational efficiency and enhancing performance on diverse tasks (a minimal routing sketch follows this list).

  2. Extensive Pre-Training with Additional Tokens: The model is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. This substantial increase in training data significantly enhances its coding and mathematical reasoning abilities while maintaining strong general language performance.

  3. Expanded Programming Language Support: DeepSeek-Coder-V2 expands its programming language support from 86 to 338 languages, making it highly versatile and applicable across a broader range of coding tasks.

  4. Extended Context Length: The model's context length has been extended from 16K to 128K tokens, enabling it to handle more complex and extensive code inputs and scenarios, which is crucial for advanced coding tasks.

  5. Comprehensive Dataset Composition: The pre-training dataset is composed of 60% source code, 10% math corpus, and 30% natural language corpus. This diverse and well-balanced dataset helps the model achieve high performance across different types of tasks.

  6. Reinforcement Learning with Group Relative Policy Optimization (GRPO): This method was also used in DeepSeek-V2, and researchers found that it helps align the model's behavior with human preferences, particularly in the coding domain, using feedback from compiler outputs and test cases.
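
For intuition about how each token is routed to a few experts, here is a generic top-k MoE layer in PyTorch. This is a simplified sketch, not DeepSeek's actual DeepSeekMoE implementation (which adds fine-grained experts, shared experts, and load-balancing losses); all dimensions are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative sketch)."""
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)  # (tokens, n_experts)
        topw, topi = weights.topk(self.k, dim=-1)    # each token keeps k experts
        topw = topw / topw.sum(dim=-1, keepdim=True) # renormalize kept weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e            # tokens whose slot-th pick is e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_hidden=256)
print(moe(torch.randn(10, 64)).shape)                # torch.Size([10, 64])
```

GRPO's core trick is also compact enough to sketch: each of a group of sampled completions for the same prompt is scored relative to the group, so no separate value network is needed (the reward values below are made up):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each completion's reward by
    the mean and standard deviation of its sampling group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. pass/fail rewards from compiler and unit-test feedback
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]
```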

Evaluation of DeepSeek-Coder-V2

DeepSeek-Coder-V2 demonstrates superior performance in several standard benchmark evaluations, achieving notable accuracy improvements on HumanEval, MBPP, and other coding and math benchmarks. This places DeepSeek-Coder-V2 on par with leading closed-source models. It is now the top-scoring model on the LMSYS coding arena.

The following table shows the benchmark performance of DeepSeek-Coder-V2-Instruct across multiple programming languages – it achieved the second-highest average score on the leaderboard, effectively challenging the dominance of closed-source models.

Compute Better Spent: Replacing Dense Layers with Structured Matrices

Qiu et al. [New York University, Carnegie Mellon University]

♥ 528   ML Theory

Introduction to Structured Matrices

As model size grows, compute requirements grow rapidly – for dense layers, quadratically in the layer width – which makes training expensive and slow. If we can find a way to remove these inefficiencies and speed up model training, we can expect to see better AI models in the future.

This paper systematically explores replacing dense layers in neural networks with structured matrices to improve computational efficiency while maintaining performance.

Inner-Workings of Structured Matrices

Structured matrices are specialized types of matrices that exhibit specific patterns or structures, allowing for more efficient computation than traditional dense matrices. This paper discusses several types – Low-Rank, Convolutional (Toeplitz), Kronecker Product, Monarch, Tensor-Train (TT), and Block Tensor-Train (BTT). Each relies on different assumptions about the underlying data, so each is suited to a different scenario and provides a different level of speedup.

How Structured Matrices can Replace Dense Layers

  1. Efficient Computation: Structured matrices are designed to perform matrix-vector multiplications (MVMs) more efficiently than dense matrices. For example, a dense d×d matrix requires d² parameters and d² FLOPs per MVM, while structured matrices like low-rank or convolutional matrices can significantly reduce both (see the sketch after this list).

  2. Parameter Efficiency: By decomposing a dense matrix into structured components, the number of parameters can be drastically reduced. This reduction not only saves memory but also potentially improves generalization by reducing the model's complexity.

  3. Modeling Assumptions: Structured matrices incorporate specific assumptions about the data (e.g., translational symmetry for convolutions, low-rank structure for compression). These assumptions help in capturing the inherent patterns of the data more effectively than dense matrices, leading to better performance with fewer resources.

  4. Scaling and Initialization: The paper discusses the importance of appropriately scaling the initialization and learning rates for structured matrices. By using the Maximal Update Parameterization (µP) framework, the authors determine the optimal initialization and learning rates, which ensures that the structured matrices perform well as models scale.
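
As a concrete example of the savings item 1 describes, here is a low-rank replacement for a dense layer in NumPy; the dimensions are chosen arbitrarily for illustration, and the paper's Monarch, TT, and BTT structures are more involved than this:

```python
import numpy as np

d, r = 1024, 32                       # layer width and chosen rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))       # dense layer: d^2 = 1,048,576 params/FLOPs per MVM
U = rng.standard_normal((d, r))       # rank-r replacement uses two skinny factors:
V = rng.standard_normal((r, d))       # 2*d*r = 65,536 params/FLOPs per MVM

x = rng.standard_normal(d)
dense_out   = W @ x                   # one big matrix-vector multiply
lowrank_out = U @ (V @ x)             # two cheap ones, ~16x fewer FLOPs

print(f"FLOP ratio (low-rank / dense): {2 * d * r / d**2:.4f}")  # 0.0625
```

The trade-off is the modeling assumption: a rank-32 matrix can only represent a restricted family of linear maps, which is exactly why the scaling and initialization choices in item 4 matter when swapping these structures in.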

Evaluating Results of Structured Transformers

This paper trained Vision Transformers (ViT) on ImageNet with a patch size of 32 for 300 epochs and found that BTT reached the performance level of a dense ViT-S/32 model with up to 3.8 times fewer floating-point operations (FLOPs). The following chart shows the error rate of each approach and how it changes with compute.

This paper also trained GPT-2 models on the OpenWebText dataset for 600,000 steps with a batch size of 245,760 tokens and a sequence length of 512. Researchers tried to make GPT-2 more compute-efficient by replacing all linear layers, including the language modeling head, with BTT layers. When the compute spent in the language modeling head was excluded, BTT and dense layers performed similarly. This suggests that the efficiency gains from BTT primarily come from reducing the compute spent in the language modeling head.
