Movie Gen, Were RNNs All We Needed? and Contextual Document Embeddings
#26 | Latest AI Research Explained Simply
In this issue: 3x industry news, 3x AI research papers
Sep 30th ~ Oct 6th
Industry News in 1 Line
♥ 6.2k Meta released Movie Gen, a media foundation model that provides video personalization, text-based video editing, and audio generation from text + video. A full 92-page research paper is available (and is covered in this issue!).
♥ 1.7k Black Forest Labs, the startup founded by key researchers behind Latent Diffusion and the team that released Flux.1, has announced Flux 1.1 Pro, available via API. Seems like SoTA to me.
♥ 9.1k OpenAI just raised the largest venture capital round in history: $6.6 billion at a $157 billion valuation. Meanwhile, OpenAI announced Canvas, a new way of writing & coding projects with ChatGPT. It is hard to use with code right now, but it has been perfect for writing in my opinion.
No Sponsor This Week!
Instead let me share with you our plans right now.
I have 1 collection and 1 selection for the latest weekly papers:
Collection: a weekly recap of research papers, like this, but not a full issue
Selection: highly intriguing papers selected for the week, like this issue
But Mr. cloud, with all these research papers you go through, what do you intend to do with all of them? Aren't you just overhyping it like an AI bro?
Currently, I am building a directory that maps out ALL the latest useful LLM research. The collection is like showing you guys what is going to be updated into that directory, and the selection is just to update you with the minimum key research breakthroughs of the week.
I don't have plans to charge for my collection and selection, but I do intend to build the mapping as best as I can. I've partnered up with a good friend of mine who is insane at UI/UX, to ensure you'll have both ease and comfort when browsing through the latest LLM research landscape.
Right now it's fully funded by me, but if you're interested in supporting us, check out my Patreon. I can't guarantee any concrete timeline right now, but I'll update as much as I can, and feel free to ask any questions on my Discord or in my DMs. So only support if you like what I am doing with the Newsletter/YouTube so far.
But patrons will be the first to see a preview of the project, and we have so much planned for the future. So it'll be amazing to have you on the sidelines!
Were RNNs All We Needed?
Feng et al. [Université de Montréal, Borealis AI]
♥ 721 RNN LM bycloud's pick
Introduction to minGRU and minLSTM
Transformers are one of the most commonly used components in the field of AI, but they suffer from scalability issues due to their quadratic computational complexity with respect to sequence length: if the input is 10 times longer, it takes roughly 100 times more compute to process.
In recent years, we have seen new recurrent architectures like S4 and Mamba, which use parallelization techniques such as the parallel prefix scan algorithm for efficient training. This paper, however, revisits much older recurrent models, LSTMs and GRUs, which were previously considered computationally expensive to train due to the need for backpropagation through time (BPTT).
The authors propose a key modification: removing hidden state dependencies from the input, forget, and update gates of these models. This seemingly simple change eliminates the need for BPTT and enables parallel training using the efficient parallel scan algorithm.
How Do minGRU and minLSTM Work?
minGRU and minLSTM are simplified and parallelizable versions of traditional GRU and LSTM recurrent neural networks. The core idea is to enable parallel training to address the computational bottleneck of sequential processing in traditional RNNs that rely on backpropagation through time.
Changes in minGRU Architecture
Removing Hidden State Dependencies: The key change is removing the dependence of the update gate (z_t) and candidate hidden state (h̃_t) on the previous hidden state (h_{t-1}). In a standard GRU, these components are calculated from both the current input (x_t) and the previous hidden state. By making them depend solely on the current input, each step's gate computation becomes independent of prior steps, enabling parallelization. This also eliminates the need for the reset gate (r_t), which controlled the influence of the previous hidden state on the candidate state.
Removing Tanh: In standard GRUs, the hyperbolic tangent (tanh) constrains the range of the hidden states. Removing it further reduces computational overhead without significant performance degradation, since the hidden state dependencies that tanh was meant to keep in check are already gone.
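To make the change concrete, here is a minimal PyTorch sketch of a minGRU layer following the equations above (the module and variable names are mine, not the authors' code). The recurrence is written as a loop for readability, even though its linear form is exactly what allows a parallel scan at training time.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Minimal GRU: the update gate and candidate state depend only on x_t,
    so there is no reset gate and no tanh."""
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)        # update gate z_t
        self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate state h̃_t

    def forward(self, x, h0=None):
        # x: (batch, seq_len, dim_in)
        z = torch.sigmoid(self.to_z(x))   # no dependence on h_{t-1}
        h_tilde = self.to_h_tilde(x)      # no tanh
        h = torch.zeros(x.shape[0], z.shape[-1], device=x.device, dtype=x.dtype) if h0 is None else h0
        outs = []
        # Sequential form of h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t.
        # Because this is a linear recurrence in h, it can also be computed
        # with a parallel scan during training.
        for t in range(x.shape[1]):
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```

Note that the only learned parameters are the two input projections, which is where the parameter savings discussed below come from.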
Changes in minLSTM Architecture
Removing Hidden State Dependencies: Similar to minGRU, the forget gate (f_t), input gate (i_t), and candidate cell state (c̃_t) are made independent of the previous hidden state (h_{t-1}), which allows parallel computation.
Removing Tanh: The tanh activations, used in both the candidate cell state and hidden state calculations, are removed.
Normalizing Gates and Removing Output Gate: The forget and input gates are normalized to ensure their sum equals 1. This ensures the cell state's scale remains consistent across time steps, simplifying training and mimicking the inherent time-independent scale of GRUs. Consequently, the output gate (o_t), which scaled the hidden state based on the cell state, becomes unnecessary and is removed. Furthermore, the cell state and hidden state become equivalent and are merged into a single hidden state.
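Under the same caveats (illustrative names, a loop instead of a scan), a minLSTM layer is nearly identical, with the normalized forget/input gates replacing the single update gate:

```python
import torch
import torch.nn as nn

class MinLSTM(nn.Module):
    """Minimal LSTM: gates depend only on x_t, tanh and the output gate are
    removed, and the forget/input gates are normalized to sum to 1."""
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_f = nn.Linear(dim_in, dim_hidden)        # forget gate f_t
        self.to_i = nn.Linear(dim_in, dim_hidden)        # input gate i_t
        self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate state h̃_t

    def forward(self, x, h0=None):
        f = torch.sigmoid(self.to_f(x))
        i = torch.sigmoid(self.to_i(x))
        f, i = f / (f + i), i / (f + i)   # normalize so f_t + i_t = 1
        h_tilde = self.to_h_tilde(x)
        h = torch.zeros(x.shape[0], f.shape[-1], device=x.device, dtype=x.dtype) if h0 is None else h0
        outs = []
        for t in range(x.shape[1]):       # h_t = f_t * h_{t-1} + i_t * h̃_t
            h = f[:, t] * h + i[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```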
Results and Real-World Implications of minGRU and minLSTM
Training Speed: minLSTM and minGRU are significantly faster to train than traditional LSTMs and GRUs. With a sequence length of 512, they achieve speedups of 175x and 235x, respectively, on a T4 GPU. This advantage grows even larger with longer sequences (over 1300x speedup for sequence length 4096). This massive speed improvement is mainly due to the removal of hidden state dependencies from the gates which enables parallel computation via the parallel scan algorithm.
Memory: minLSTM and minGRU consume ~88% more memory than their traditional counterparts (and Mamba uses ~56% more memory than minGRU) because of the larger computational graph created by the parallel scan algorithm. Still, the substantial gains in training speed outweigh the increased memory footprint: training runtime, not memory, is generally the bottleneck in RNN training.
Benefits of the Minimal Architectures:
Parallelization: The most significant advantage is the ability to train these networks in parallel using the parallel scan algorithm (see the sketch after this list). This leads to substantial speed improvements compared to the sequential BPTT training required by traditional LSTMs and GRUs.
Reduced Parameters: minGRU and minLSTM use fewer parameters than their traditional counterparts, making them more memory-efficient. minGRU requires only about 13-33% of the parameters of a GRU, and minLSTM requires 15-38% of the parameters of an LSTM, depending on the relative sizes of the input and hidden state dimensions.
Comparable Performance: Despite these simplifications, minGRU and minLSTM achieve performance comparable to state-of-the-art sequence models, including Transformers and recent recurrent architectures like Mamba, across various tasks.
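Both minimal cells reduce to the same linear recurrence h_t = a_t * h_{t-1} + b_t (with a_t = 1 - z_t and b_t = z_t * h̃_t for minGRU, and a_t = f_t and b_t = i_t * h̃_t for minLSTM), which is what makes the parallel scan applicable. Below is a deliberately naive closed-form version of that recurrence to illustrate the idea; the paper itself uses a log-space parallel scan for numerical stability.

```python
import torch

def linear_recurrence(a, b, h0):
    """Compute h_t = a_t * h_{t-1} + b_t for every t without a Python loop.
    a, b: (batch, seq_len, dim); h0: (batch, dim).
    Naive closed form: dividing by the cumulative product can overflow or
    underflow, which is why the paper works in log space instead."""
    A = torch.cumprod(a, dim=1)                               # prod_{k<=t} a_k
    return A * (h0.unsqueeze(1) + torch.cumsum(b / A, dim=1))
```

Swapping this in for the Python loops in the sketches above gives the same outputs (up to floating-point error) while letting the whole sequence be processed at once.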
Contextual Document Embeddings
Morris and Rush [Cornell University]
♥ 2.3k LLM Embedding
Introduction to Contextual Document Embeddings
When we talk to computers, we are just storing bits of zeroes and ones in memory. When you type "red", the computer stores a string of text, but it doesn't understand the concept of red. Embeddings are a way to store the idea of red in the computer's memory.
However, current neural document embedding models lack the contextual awareness needed for retrieval tasks. Existing methods treat each document in isolation, encoding it independently without considering its relationship to other documents in the corpus. For example, the term "draft" might carry different importance in sports articles versus legal documents.
This paper introduces Contextual Document Embeddings (CDE) through two complementary solutions:
Contextual Contrastive Learning: This method modifies the training process by incorporating the notion of "neighboring documents."
Contextual Encoder Architecture: This introduces a new architecture that explicitly injects information about neighboring documents into the embedding process.
How do Contextual Document Embeddings Work?
The core idea of Contextual Document Embedding is to make document embeddings (representations of text documents as numerical vectors) more aware of their context. Traditionally, when we convert a document into a vector, we only look at that document in isolation. But this new approach considers other documents in the corpus when creating the embedding.
Two-Stage Process: The model uses a two-stage process to create these contextual embeddings:
First Stage: The model takes a subset of documents from the corpus and embeds them using a special embedding model. This creates a set of "context vectors" that represent the broader context of the corpus.
Second Stage: When embedding a specific document, the model combines the original document's text with the context vectors created in the first stage. It then uses another embedding model to create the final representation of the document.
Handling Queries: For search queries, the model uses a similar approach. It combines the query text with the same context vectors used for documents. This allows the query representation to be aware of the corpus context too.
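Here is a toy sketch of that two-stage idea (the module names and the mean-pooling fusion are simplifications of mine; in the paper, the context vectors are fed to the second-stage transformer as additional input tokens rather than pooled):

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in encoder: mean of token embeddings. The real stages are transformers."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):               # (n, seq_len) integer tensor
        return self.emb(token_ids).mean(dim=1)  # (n, dim)

class ContextualDocEmbedder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.first_stage = ToyEncoder(dim=dim)   # embeds a sample of corpus docs
        self.second_stage = ToyEncoder(dim=dim)  # embeds the target doc or query
        self.fuse = nn.Linear(2 * dim, dim)      # combine text with corpus context

    def corpus_context(self, sampled_corpus_tokens):
        # Stage 1: context vectors for the corpus; can be cached and reused
        return self.first_stage(sampled_corpus_tokens)        # (num_ctx, dim)

    def embed(self, token_ids, context_vectors):
        # Stage 2: condition the text embedding on the (pooled) corpus context.
        text = self.second_stage(token_ids)                    # (n, dim)
        ctx = context_vectors.mean(dim=0).expand(text.shape[0], -1)
        return self.fuse(torch.cat([text, ctx], dim=-1))       # (n, dim)
```

The same embed call works for queries, mirroring the query handling described above.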
Flexibility and Efficiency: The model is designed to be flexible and efficient:
It can work without context by using a special "null" token instead of context vectors.
During training, it shares context within batches of documents to save computation time.
When indexing a new corpus, it can cache the first-stage context vectors to speed up processing.
Training Improvements: The researchers also introduce two key improvements to the training process:
Contextual Batching: Similar documents are grouped together instead of being sampled at random for training. This creates "mini-domains" within each batch, which helps the model learn to handle different contexts more effectively (see the sketch after this list).
Two-Stage Gradient Caching: This is a technical optimization that allows the model to work with larger batches and more context samples without running out of memory.
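One simple way to build such "mini-domain" batches is to cluster pre-computed document embeddings and treat each cluster as a batch. This is a rough sketch of the contextual batching idea, not the authors' exact procedure (the clustering choice and function names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(doc_embeddings: np.ndarray, batch_size: int, seed: int = 0):
    """Group similar documents so each training batch is a 'mini-domain'.
    doc_embeddings: (num_docs, dim) array from any cheap embedding model."""
    n_clusters = max(1, len(doc_embeddings) // batch_size)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(doc_embeddings)
    return [np.where(labels == c)[0] for c in range(n_clusters)]  # doc indices per batch
```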
By doing all this, the model tries to mimic how humans understand documents: not in isolation, but in the context of other related documents.
Results and Evaluation of Contextual Document Embeddings
In the following table, we can see that the suggested technique improves performance compared to standard biencoder training. The paper also showed that creating more challenging training batches through contextual batching leads to more effective learning. We also saw that while large batch and cluster sizes are beneficial without filtering, smaller, harder clusters (achieved by filtering false negatives) significantly improve performance. This suggests that focusing on challenging examples within a context matters more than simply increasing batch size.
Though these embeddings were designed for retrieval, the context-aware approach also helps with clustering, classification, and judging how similar words or sentences are. The paper showed that if you don't give the model information about the context of the text it's looking at, it does a worse job, especially at judging similarity. This shows that knowing the context is really important for creating good text representations.
Movie Gen: A Cast of Media Foundation Models
The Movie Gen team @ Meta
♥ 5.8k Video Gen
Movie Gen Video Transformer backbone and model parallelism applied.
Introduction to Movie Gen
Current AI models can generate pretty realistic images, but they struggle to create high-quality, coherent, and customizable video content with synchronized audio from text prompts. The researchers behind Movie Gen have developed a suite of AI models that aims to compete with existing commercial systems like Runway Gen3, LumaLabs, and OpenAI Sora on video quality. Check out the sample video below!
They've also introduced new capabilities for video personalization and precise video editing, which are missing from current commercial systems. Additionally, their audio generation model, Movie Gen Audio, surpasses prior state-of-the-art systems for sound effect and music generation, as well as audio extension.
Inner-Workings of Movie Gen
Movie Gen is an advanced model, and it uses a sophisticated approach to processing and encoding text prompts for video generation. In this section, we have tried to explain how it works but we left out many important parts. Please read the entire paper provided at the bottom of this post if you want to learn more.
Text Input: The system starts with a text prompt provided by the user. This prompt describes the desired video content.
Multiple Text Encoders: Instead of using a single text encoder, this model uses three different pre-trained text encoders:
UL2: This encoder is trained on a vast amount of text-only data and it provides strong text reasoning capabilities.
Long-prompt MetaCLIP: This is a modified version of the MetaCLIP text encoder, fine-tuned to handle longer text inputs (up to 256 tokens instead of the original 77). It provides text representations that are well-aligned with visual concepts.
ByT5: This is a character-level encoder, specifically used for encoding visual text - parts of the prompt that explicitly request text to appear in the generated video.
Text Processing: Each of these encoders processes the input text prompt in its own way:
UL2 and Long-prompt MetaCLIP process the entire prompt, creating prompt-level embeddings.
ByT5 focuses on character-level encoding, particularly for parts of the prompt requesting visible text in the video.
Embedding Transformation: The outputs from each encoder are then processed:
Each output goes through a separate linear projection layer to transform it.
The transformed outputs are then normalized using LayerNorm layers.
This process ensures all embeddings are in the same 6144-dimensional space, making them compatible for combination.
Embedding Concatenation: The processed embeddings from all three encoders are concatenated into a single, comprehensive text embedding.
Text Output: This combined embedding is used as the conditioning input for the video generation backbone. It contains a rich representation of the text prompt with semantic understanding, character-level details, and visual-textual alignments.
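A rough sketch of that projection-and-concatenation step (the 6144-dimensional width comes from the paper's description; the module names and the choice of concatenating along the token axis are my assumptions, and the encoder forward passes themselves are omitted):

```python
import torch
import torch.nn as nn

class PromptEmbeddingCombiner(nn.Module):
    """Project each encoder's output to a shared width, apply LayerNorm, and
    concatenate into one conditioning sequence for the video backbone."""
    def __init__(self, d_ul2, d_clip, d_byt5, d_model=6144):
        super().__init__()
        dims = {"ul2": d_ul2, "metaclip": d_clip, "byt5": d_byt5}
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in dims.items()})
        self.norm = nn.ModuleDict({k: nn.LayerNorm(d_model) for k in dims})

    def forward(self, ul2_emb, metaclip_emb, byt5_emb):
        # each input: (batch, num_tokens_i, d_i) from its respective encoder
        parts = {"ul2": ul2_emb, "metaclip": metaclip_emb, "byt5": byt5_emb}
        projected = [self.norm[k](self.proj[k](v)) for k, v in parts.items()]
        return torch.cat(projected, dim=1)  # assumed: concat along the token axis
```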
Overview of the joint image and video generation pipeline.
Temporal Autoencoder (TAE) in MovieGen Video Model
Variable length video encoding and decoding using the TAE.
The MovieGen Video model uses a Temporal Autoencoder (TAE) for video handling. Here's an explanation of how it works:
Spatio-Temporal Compression: The TAE is designed to compress both the spatial and temporal dimensions of input videos and images. It takes RGB pixel-space videos (or images) and encodes them into a learned, compressed latent space. This compression reduces the input dimensions by a factor of 8 in each of the spatial and temporal dimensions.
Architecture Inflation: The researchers started with an image autoencoder architecture and "inflated" it to handle video data. This inflation involves adding the following temporal layers to the existing spatial architecture (see the sketch after this list):
They added 1D temporal convolutions after each 2D spatial convolution.
They added 1D temporal attention layers after each spatial attention layer.
Temporal convolutions use symmetrical replicate padding to maintain temporal consistency.
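A minimal sketch of what one such inflated block could look like (a hypothetical block for illustration, not the actual TAE code): the 2D convolution runs per frame, and the 1D temporal convolution runs per spatial location with replicate padding in time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InflatedConvBlock(nn.Module):
    """2D spatial conv followed by a 1D temporal conv (illustrative sizes)."""
    def __init__(self, channels, k_spatial=3, k_temporal=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, k_spatial, padding=k_spatial // 2)
        self.temporal = nn.Conv1d(channels, channels, k_temporal)
        self.pad = k_temporal // 2

    def forward(self, x):                       # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        # spatial conv applied frame by frame
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # temporal conv applied per spatial location, with symmetric replicate
        # padding along time as the paper describes
        z = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        z = F.pad(z, (self.pad, self.pad), mode="replicate")
        z = self.temporal(z)
        return z.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

# e.g. a 5-frame, 16x16, 8-channel clip keeps its shape through the block
out = InflatedConvBlock(8)(torch.randn(1, 8, 5, 16, 16))   # (1, 8, 5, 16, 16)
```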
Temporal Processing: Downsampling in the temporal dimension is achieved using strided convolutions with a stride of 2. This allows the model to handle videos of varying lengths, including single-frame "videos" (i.e., images).
Latent Space: The researchers found that increasing the number of channels in the latent space improved both reconstruction and generation performance. They settled on using 16 channels (C = 16) in the latent space.
Training Process: The spatial parameters of the TAE are initialized from a pre-trained image autoencoder, and the temporal parameters are then added to "inflate" the model for video processing. The TAE is jointly trained on both images and videos, with a ratio of 1 batch of images to 3 batches of videos.
This approach allows the MovieGen Video model to efficiently process and generate videos by working in a compressed latent space, reducing computational requirements while maintaining the ability to produce high-quality, temporally coherent video outputs.
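As a rough illustration of what the 8x spatio-temporal compression and the 16-channel latent buy you (the frame count and resolution below are made-up example values, not numbers from the paper):

```python
# A T-frame, H x W RGB video becomes a (T/8) x (H/8) x (W/8) x 16 latent.
T, H, W = 64, 768, 768                 # example input: 64 frames at 768 px
pixels = T * H * W * 3
latent = (T // 8) * (H // 8) * (W // 8) * 16
print(latent, pixels / latent)         # 1179648 latent values, 96x fewer than raw pixels
```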
Transformer Backbone
Movie Gen's Video Transformer backbone with model parallelism applied.
On the left side, it shows the Transformer backbone, and they've color-coded the different model parallelizations used to shard their 30B model.
On the right, there are more details about its feature dimensions in a number of key steps during the most expensive stage of Movie Gen Video training. It processes 768 px video inputs with a per-sample sequence length of 73K tokens.
That's really a GPU-rich training.
- Zizheng Pan (@zizhpan)
1:56 AM · Oct 5, 2024
Evaluation of Movie Gen
When researchers publicly release results, they often use cherry-picked "best" samples. To address this potential bias and ensure a fairer comparison, the researchers took a methodical approach when comparing MovieGen Video to OpenAI's Sora model. For each prompt, they generated five videos using MovieGen Video and manually selected one, mirroring the selection process behind the publicly released examples from other models. This approach aims to level the playing field and provide a more accurate comparison of capabilities.
Next, different models produce outputs in different resolutions and aspect ratios. To mitigate potential annotator bias stemming from these differences, the researchers downsampled MovieGen Video's outputs to match the resolution and aspect ratio of the comparison model for each evaluation. After all that, they calculated the net win rate as shown in the table below.
Movie Gen Video vs. prior work. This table measures the net win rate (win% - loss% for Movie Gen Video), which ranges from -100% to +100%.
Here we can see that MovieGen Video has strong performance in overall quality and outperforms Runway Gen3 as well as LumaLabs significantly, while showing moderate improvement over OpenAI Sora. The model excels in various quality aspects such as motion naturalness and frame consistency which indicates its ability to generate realistic videos that respect physics and maintain consistency. However, it still faces some challenges in motion completeness compared to Kling1.5.