Duality in Deep Learning, Iterative Poisson VAE, and Breaking Down Mixture of Experts
#30 | Latest AI Research Explained Simply
In this issue: 3x industry news, 3x AI research papers
Oct 28th ~ Nov 3rd
🗞️ Industry News in 1 Line
♥ 13k OpenAI has released ChatGPT search to its Plus, Team, and SearchGPT waitlist subscribers. You can use its Chrome extension, which now gives faster answers and cites relevant sources from the web. OpenAI has also partnered with several data providers, so ChatGPT can now give real-time updates on weather, stocks, sports, maps, and other news events.
♥ 2.9k We have already seen how AI can generate text and images, but now it can also generate game worlds in real time. A new model called Oasis renders frames directly from keyboard inputs and creates a brand-new map for every game. Try Oasis in your browser.
♥ 1k Recraft has released a new image generation model called Recraft V3 (also known as Red Panda) which supposedly “thinks in design”. It allows users to control text size and placement, which makes it very useful for creating posters and dynamic advertisements.
A poster generated by Recraft V3
The top AI labs/startups are hiring!
San Francisco, Full-Time
San Francisco, Full-Time, $120K-$180K & 0.50%~2.00%
Tokyo/San Francisco, Full-Time
Modular Duality in Deep Learning
Bernstein and Newhouse [MIT CSAIL]
♥ 200 LLM Attention bycloud’s pick
Understanding Duality in Deep Learning
When we train an AI model, we rely on a fundamental algorithm called gradient descent, which computes a loss and iteratively adjusts the weights to reduce it. This paper identifies a fundamental flaw in standard neural network training: gradient descent naively applies the same learning rate across all weight dimensions, ignoring that the geometry of the loss function differs from one direction to another.
To address this, the paper proposes "modular dualization," a framework that transforms gradients through carefully constructed "duality maps" that respect the geometric structure of the loss landscape. Their approach breaks the problem down by first assigning operator norms to individual layers, then constructing layer-specific duality maps, and finally combining these maps recursively for the full network.
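To give a concrete feel for the idea, here is a minimal sketch of a dualized training step, written by us in PyTorch-style Python rather than taken from the paper: every layer's gradient passes through a duality map before the learning rate is applied. The `dualize` callback is a hypothetical placeholder for the layer-specific maps described in the next section.

```python
import torch

def dualized_sgd_step(model, dualize, lr=0.02):
    """One training step in which every layer's gradient is passed through a
    layer-specific duality map before the learning rate is applied.
    `dualize(name, param, grad)` is a hypothetical callback returning the mapped gradient."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            direction = dualize(name, param, param.grad)  # apply the duality map
            param.add_(direction, alpha=-lr)              # W <- W - lr * dualize(grad)
            param.grad = None                             # clear for the next step
```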
A Better Approach for Deep Learning
This paper presents a systematic approach to improving neural network training by focusing on how gradients should be transformed differently for different types of neural network layers. At the foundational level, they identify that Linear layers, Embedding layers, and Convolutional layers each need their own specific rules for gradient transformation.
For instance, while Linear layers deal with general vectors and need to consider size relationships between input and output dimensions, Embedding layers specifically handle one-hot vectors (like word indices in language models) and need different scaling rules. Convolutional layers, being even more complex, need to handle spatial relationships and multiple channels while maintaining consistent behavior across different kernel positions.
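As a rough illustration (our own simplification, not necessarily the paper's exact constants), a Linear layer's duality map can be rendered as replacing the gradient with a rescaled semi-orthogonal factor from a reduced SVD, while an Embedding layer's map rescales each row of the gradient to a fixed RMS size:

```python
import torch

def dualize_linear(grad: torch.Tensor) -> torch.Tensor:
    """Replace a Linear layer's gradient with its semi-orthogonal factor from a
    reduced SVD, rescaled by a fan-out/fan-in factor (the scaling here is an
    assumption, not the paper's precise constant)."""
    d_out, d_in = grad.shape
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return (d_out / d_in) ** 0.5 * (U @ Vh)

def dualize_embedding(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Rescale each row of an Embedding layer's gradient (one row per vocabulary
    index) to unit RMS size, reflecting that embeddings only ever see one-hot inputs."""
    rms = grad.pow(2).mean(dim=-1, keepdim=True).sqrt()
    return grad / (rms + eps)
```

The SVD in the Linear-layer map is the expensive part, which is exactly what the approximations discussed below are designed to avoid.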
To make this framework practical, they develop a hierarchical approach. First, they define rules for individual layers (atomic modules), then extend these rules to handle combinations of layers through composition (when one layer feeds into another) and concatenation (when layers operate in parallel).
They also address special cases like activation functions (which they call "bond modules") that don't have trainable weights but still need to be considered in the overall framework. This structured approach ensures that their method can handle modern neural architectures like transformers and convolutional networks consistently.
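Putting the pieces together, a minimal dispatcher (again our own sketch, reusing the two atomic maps above) routes each parameter's gradient to the appropriate atomic map; bond modules never appear here because they carry no trainable weights:

```python
def dualize(name, param, grad):
    """Route a gradient to an atomic duality map based on the layer it belongs to.
    This dispatch-by-name scheme is a simplification of the paper's recursive
    construction over composed and concatenated modules."""
    if grad.ndim == 2 and "embed" in name.lower():
        return dualize_embedding(grad)   # embedding tables: row-wise rescaling
    if grad.ndim == 2:
        return dualize_linear(grad)      # dense weight matrices: orthogonalize
    return grad                          # biases, norm gains, etc.: left unchanged
```

This callback has exactly the shape expected by `dualized_sgd_step` in the earlier sketch.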
The most significant challenge they tackle is making these transformations computationally efficient. Traditional methods for the required matrix operations (particularly Singular Value Decomposition) are too slow for practical use in deep learning. They propose three solutions:
A randomized approximation method called sketching
A technique using iterative matrix root calculations
A novel method called the rectangular Newton-Schulz iteration, which avoids common numerical problems of previous approaches and handles both full-rank and low-rank matrices effectively (see the sketch below).
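For intuition, here is a minimal sketch of the classic cubic Newton-Schulz orthogonalization in plain PyTorch; the paper's rectangular variant may use different coefficients and stopping rules, so treat this as illustrative only.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Approximate the semi-orthogonal factor U V^T of G with the classic cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, using only matrix
    multiplications (no SVD). Dividing by the Frobenius norm keeps every
    singular value in (0, 1], which is enough for the iteration to converge."""
    X = G / (G.norm() + 1e-8)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the smaller dimension first for cheaper products
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X
```

Because everything here is a matrix multiply, the routine maps well onto GPUs, which is what makes this family of iterations a practical stand-in for an explicit SVD inside an optimizer.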
Evaluating Newton-Schulz Iteration for Deep Learning
The Newton-Schulz iterations make neural network training both faster and more scalable. The paper's "modular dualization" framework also elegantly unifies two previously separate lines of work: maximal update parameterization (µP) and the Shampoo optimizer, the latter of which recently won a major optimization competition. The key insight is that both can be viewed as approximations of a single underlying mathematical structure, much like how different roads might lead to the same destination.
Their framework helps resolve several long-standing puzzles in deep learning, including the mysterious behavior of neural networks at large scales and the relationship between weight updates and network activations. Most importantly, early implementations of their methods have already shown impressive real-world results, setting new speed records for training transformer models such as NanoGPT.
The broader impact of this research could be significant for AI development, as training efficiency is often a major bottleneck in advancing the field.
A prescriptive theory for brain-like inference
Vafaii et al. [Redwood Center for Theoretical Neuroscience]
♥ 924 LLM
Introduction to iterative Poisson VAE
The Evidence Lower Bound (ELBO) is a powerful training objective used in both machine learning and computational neuroscience. However, it is too abstract on its own to provide practical guidance for building neural networks that accurately reflect how biological brains work. The main issue is that previous approaches use Gaussian distributions to model neural activity, which does not match the reality that biological neurons communicate through discrete spikes whose counts are better described by Poisson distributions.
To solve this, this paper proposes the iterative Poisson VAE (iP-VAE), which replaces the Gaussian assumptions with Poisson distributions and implements an iterative inference process that better mirrors how actual neurons communicate through spikes. This yields two significant improvements (a short code sketch follows the list):
It creates a more biologically accurate representation of neural activity
It also demonstrates better performance in machine learning tasks, particularly in handling out-of-distribution data and creating sparse (efficient) representations.
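To make the distributional switch concrete, here is a small contrast between a standard Gaussian VAE latent and a Poisson latent, written as our own sketch rather than the paper's code. Note that Poisson samples are discrete and not reparameterizable in the usual way, so this snippet is only meant to show how the latent distribution and the KL term change.

```python
import torch

def gaussian_latent(mu, log_var):
    """Standard VAE latent: z ~ N(mu, sigma^2) via the reparameterization trick,
    with KL divergence against a unit Gaussian prior."""
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)
    return z, kl

def poisson_latent(log_rate, prior_rate=1.0):
    """Poisson latent in the spirit of iP-VAE: z is a vector of spike counts,
    with KL divergence against a Poisson prior of fixed rate."""
    rate = log_rate.exp()
    z = torch.poisson(rate)                                   # discrete spike counts
    kl = (rate * (rate / prior_rate).log() + prior_rate - rate).sum(dim=-1)
    return z, kl
```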
How Does Iterative Poisson VAE Work?
The iterative Poisson VAE (iP-VAE) is essentially a way to process and understand sequences of data points, particularly when dealing with Poisson distributions (which are well suited to modeling count data or events that occur at a fixed rate). Here's how it works:
Basic Structure: The system looks at a sequence of data points over time; for each data point, there are two types of variables:
Observable variables (the actual data we can see)
Hidden/latent variables (underlying patterns we're trying to discover)
Key Principles: The system assumes that what happens at one moment depends only on what happened right before it (this is called Markovian dependence). Additionally, it uses a concept called "stationarity," meaning the rules for how things change stay the same over time. This is particularly useful for situations where you show the same image repeatedly.
The Learning Process: The system has two main parts:
An encoder that converts input data into hidden patterns
A decoder that tries to reconstruct the original data from these patterns
Biological Inspiration: The system is designed to mirror how neurons work in the brain. It uses something similar to membrane potentials (the electrical charge difference across a neuron's membrane) and tries to replicate how neurons fire in response to stimuli.
Dynamic Updates: Instead of directly updating the rates at which events occur, the system works with the logarithms of these rates, which makes the math easier and more biologically realistic. When processing a sequence, it uses what it learned from the previous observation as a starting point for understanding the next one.
Practical Implementation: The system learns by trying to minimize the difference between what it predicts and what it actually observes, and it includes a sparsity penalty that keeps the learned representation from becoming too complex. Because each new observation starts from the previous one, the model handles repeated presentations of the same input well, making it good for learning stable patterns (a rough code sketch of one inference loop follows below).
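Putting those pieces together, here is a loose structural sketch of one inference loop, with hypothetical encoder/decoder modules of our own choosing; the paper's exact update rule, loss, and gradient handling differ, so read this as a cartoon of the membrane-potential-style dynamics rather than a faithful implementation.

```python
import torch
import torch.nn as nn

class IterativePoissonVAESketch(nn.Module):
    """Cartoon of iP-VAE-style iterative inference (assumed structure,
    not the authors' implementation)."""
    def __init__(self, n_inputs=784, n_latent=128):
        super().__init__()
        self.encoder = nn.Linear(n_inputs, n_latent)   # drives membrane-potential updates
        self.decoder = nn.Linear(n_latent, n_inputs)   # reconstructs the input from latent rates

    def forward(self, x, n_steps=16, sparsity=1e-3):
        # u plays the role of a membrane potential and parameterizes the log firing rate
        u = torch.zeros(x.shape[0], self.decoder.in_features, device=x.device)
        losses = []
        for _ in range(n_steps):
            rate = u.exp()                 # expected spike count; samples would be z ~ Poisson(rate)
            x_hat = self.decoder(rate)     # reconstruct from the current rates
            error = x - x_hat              # prediction error for this iteration
            u = u + self.encoder(error)    # nudge the log-rate using the error
            losses.append((error ** 2).mean() + sparsity * rate.mean())
        return x_hat, torch.stack(losses).mean()
```

Because the membrane potential from one presentation can seed the next, repeatedly showing the same image lets the representation settle, which is the behaviour described above.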
Comparing loss of different VAEs against iP-VAE
Evaluating Iterative Poisson VAE
The iP-VAE uniquely combines spiking neural networks with Bayesian inference through membrane potential dynamics. This design allows it to solve fundamental limitations of previous approaches by operating purely through spike-based communication and private membrane potentials, while naturally handling positive firing rates in a way that closely mirrors biological neurons.
What makes iP-VAE particularly notable is that it outperforms existing models while using fewer parameters, and it shows remarkable adaptability and robustness to out-of-distribution samples. Moreover, iP-VAE shows great promise for neuromorphic hardware implementation, especially given its connection to proven architectures like the spiking LCA; future work could extend its capabilities to hierarchical models and nonstationary sequences such as videos.
Performance of different VAEs against iP-VAE