HunyuanVideo: A Systematic Framework For Large Video Generative Models

DeMo: Decoupled Momentum Optimization, and Densing Law of LLMs

Dec 2nd ~ Dec 8th
#35 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 5.3k After nearly a full year of hype since its first announcement, OpenAI has opened access to Sora to ChatGPT Plus and Pro users on the third day of its “12 Days of OpenAI” event.

    OpenAI Sora demo
  2. ♥ 6.3k xAI has released Aurora for Grok, an autoregressive mixture-of-experts model designed for multimodal tasks, trained on extensive interleaved text and image data to achieve high-quality photorealistic rendering and precise text instruction following. Aurora also accepts multimodal inputs, enabling it to condition on or directly edit user-provided images.

    Grok image model
  3. ♥ 6.4k Google DeepMind has introduced Genie 2, an AI model that can generate playable, interactive 3D worlds. No playable demo has been released, but you can check out the Genie 2 blog for more demos.

    Genie 2

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot!

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Zhong et al. [Hunyuan Foundation Model Team]

♥ 500   Video Gen

Introduction to HunyuanVideo

Currently, there is a large performance gap between open-source and proprietary video generation models.

This paper addresses that gap by introducing HunyuanVideo, an open-source video foundation model with 13 billion parameters.

The primary goal of this research is to bridge the divide between the closed-source and open-source video generation communities, with the code made publicly available on GitHub. Professional evaluations against top models such as Runway Gen-3 and Luma 1.6 show that HunyuanVideo achieved the highest overall satisfaction rates.

Architecture of HunyuanVideo

The HunyuanVideo model architecture is a multistage process designed for generating videos from text descriptions. It starts with a 3D Variational Autoencoder (VAE). This VAE acts like a compressor, shrinking video data (and images, treated as single-frame videos) into a smaller, more manageable representation.

3D Variational Autoencoder in HunyuanVideo.

This compression uses CausalConv3D, a type of convolution designed to handle sequential data like video frames without looking into the future. The VAE is trained from scratch without pre-trained weights, using a combination of loss functions (L1, perceptual loss, adversarial loss, and KL divergence) to balance reconstruction quality against a well-behaved latent space. A spatial-temporal tiling strategy is used during inference to handle high-resolution videos on limited GPU memory, and a fine-tuning step ensures consistent results whether or not tiling is used.
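
To make the "causal" part concrete, here is a minimal PyTorch-style sketch of a causal 3D convolution: temporal padding is applied only on the past side, so a frame never depends on future frames, and a single image works as a one-frame video. Layer names and kernel sizes are illustrative, not the exact HunyuanVideo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Sketch of a causal 3D convolution: pad time only toward the past."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                      # pad only the "past" side
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):                           # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# A single image is just a 1-frame video, so the same layer handles both.
x = torch.randn(1, 3, 9, 64, 64)                    # 9 frames of 64x64 RGB
print(CausalConv3d(3, 16)(x).shape)                 # torch.Size([1, 16, 9, 64, 64])
```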

The overall architecture of HunyuanVideo.

HunyuanVideo is a diffusion model built upon a Transformer architecture. This Transformer uses a "full attention" mechanism, allowing it to process the compressed video and text embeddings simultaneously, unlike other approaches which separate spatial and temporal processing. The model uses a dual-stream, then single-stream design: initially, video and text are processed separately, and then combined for final processing.
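
A minimal sketch of that dual-stream-then-single-stream layout might look like the following; the depths, widths, and use of vanilla nn.TransformerEncoderLayer blocks are placeholders for illustration, not HunyuanVideo's actual blocks.

```python
import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Video and text tokens first pass through separate blocks, then are
    concatenated and processed jointly with full attention."""

    def __init__(self, dim=256, n_dual=2, n_single=2, heads=8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList([make() for _ in range(n_dual)])
        self.text_blocks = nn.ModuleList([make() for _ in range(n_dual)])
        self.joint_blocks = nn.ModuleList([make() for _ in range(n_single)])

    def forward(self, video_tok, text_tok):
        # Dual-stream phase: each modality is refined by its own blocks.
        for vb, tb in zip(self.video_blocks, self.text_blocks):
            video_tok, text_tok = vb(video_tok), tb(text_tok)
        # Single-stream phase: full attention over the concatenated sequence.
        x = torch.cat([video_tok, text_tok], dim=1)
        for jb in self.joint_blocks:
            x = jb(x)
        return x

video = torch.randn(1, 128, 256)   # 128 compressed video (latent) tokens
text = torch.randn(1, 32, 256)     # 32 text tokens
print(DualToSingleStream()(video, text).shape)   # torch.Size([1, 160, 256])
```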

The architecture of HunyuanVideo Diffusion Backbone.

Rotary Position Embeddings (RoPE) are used so the model can handle videos of varying lengths, aspect ratios, and resolutions. The text input is processed by a pre-trained Multimodal Large Language Model (MLLM), chosen for its strong image-text alignment and its ability to follow complex instructions, which gives it advantages over other text encoders such as CLIP or T5.
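
As a rough illustration of how rotary embeddings extend to video, the sketch below factorizes the head dimension into separate frequency bands for time, height, and width; the exact channel split and frequency base in HunyuanVideo may differ.

```python
import torch

def rope_3d(t_idx, h_idx, w_idx, head_dim=64, base=10000.0):
    """Factorized 3D RoPE sketch: assign one band of rotation frequencies each
    to the time, height, and width coordinates of every token."""
    def angles(pos, dim):
        freqs = base ** (-torch.arange(0, dim, 2).float() / dim)
        return torch.outer(pos.float(), freqs)       # (num_tokens, dim/2)
    d_t, d_h, d_w = head_dim // 2, head_dim // 4, head_dim // 4
    theta = torch.cat([angles(t_idx, d_t), angles(h_idx, d_h), angles(w_idx, d_w)], dim=-1)
    return torch.cos(theta), torch.sin(theta)         # used to rotate q/k channel pairs

# Token grid of 4 frames x 8 x 8 latent patches, flattened to a sequence.
T, H, W = 4, 8, 8
grid = torch.stack(
    torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"),
    dim=-1,
).reshape(-1, 3)
cos, sin = rope_3d(grid[:, 0], grid[:, 1], grid[:, 2])
print(cos.shape)   # torch.Size([256, 32]): one rotation angle per pair of channels
```

Because the positions are continuous functions of the (t, h, w) grid rather than a fixed learned table, the same embedding scheme extends naturally to longer clips or different resolutions.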

Text encoder comparison between T5 XXL and the instruction-guided MLLM introduced by HunyuanVideo.

A multi-stage training process is used: it starts with image-only pre-training at low and then higher resolutions, followed by joint image and video training with a progressive curriculum that gradually increases video resolution and length. Training uses Flow Matching, which teaches the model to predict the velocity that transports random noise toward real video data along a simple interpolation path. A prompt rewrite model is also used to standardize the input text. Together, these steps make up the full video generation pipeline.
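
For intuition, a bare-bones flow-matching training step looks roughly like the sketch below; the model interface and variable names are assumptions for illustration, and HunyuanVideo's exact formulation may include additional conditioning details.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """Minimal flow-matching sketch: interpolate between noise x0 and data x1,
    then regress the model's predicted velocity toward the target (x1 - x0)."""
    x0 = torch.randn_like(x1)                                        # pure noise
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                                       # linear path
    target_velocity = x1 - x0                                        # d(xt)/dt along that path
    pred_velocity = model(xt, t.flatten(), text_emb)                 # hypothetical interface
    return torch.mean((pred_velocity - target_velocity) ** 2)

# usage sketch (names hypothetical):
#   loss = flow_matching_loss(video_backbone, latent_video_batch, prompt_embeddings)
```

At inference time, the learned velocity field is integrated from pure noise toward the data distribution with an ODE solver, conditioned on the (rewritten) text prompt.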

Performance Evaluation of HunyuanVideo 

HunyuanVideo shows exceptional performance in video generation by successfully creating videos that closely match complex text prompts. 60 professional reviewers evaluated 1,533 text prompts and found that the model outperformed five leading closed-source video generation models across three critical metrics: text alignment, motion quality, and visual quality. The model showed remarkable ability to handle intricate scenes with multiple subjects, accurately capturing nuanced relationships and movements.

The 3D Variational Autoencoder and the unified generative approach allow HunyuanVideo to excel in challenging tasks such as concept generalization, where it successfully created videos depicting scenarios not present in its training data, for example an astronaut floating on a gemstone-like lake in a distant galaxy, or an orchestra of tiny insects playing instruments.

DeMo: Decoupled Momentum Optimization

Peng et al. [Nous Research]

♥ 2.3k   LLM Federated Learning

Introduction to Decoupled Momentum (DeMo)

Training large neural networks typically requires specialized high-speed interconnects and frequent gradient synchronization across accelerators. Current distributed training methods demand expensive networking topologies and generate communication overhead that grows with model size, which limits the scalability and accessibility of training large-scale foundation models.

This paper introduces Decoupled Momentum (DeMo), a novel optimizer that dramatically reduces inter-accelerator communication by exploiting the inherent redundancy in gradients and optimizer states. By decoupling momentum updates and allowing controlled divergence across accelerators, DeMo enables topology-agnostic, architecture-independent training with minimal computational overhead.

Decoupled Momentum (DeMo) Research Methodology

The researchers introduce a novel optimization approach based on three key conjectures about momentum in neural network training. They hypothesize that momentum components have high spatial auto-correlation, with most energy concentrated in a few principal components. Furthermore, they suggest that fast-moving momentum components should be applied immediately, while slow-moving components require temporal smoothing and are crucial for long-term convergence.

DeMo introduces a unique algorithm that breaks from traditional gradient synchronization. Using the Discrete Cosine Transform (DCT), they efficiently extract and separate fast-moving momentum components across accelerators. This approach allows them to synchronize only the most significant momentum components with minimal communication overhead.

This enables more flexible and bandwidth-efficient neural network training. By removing and synchronizing these fast components while letting slow components diverge across accelerators, they aim to maintain model performance while dramatically reducing inter-accelerator communication requirements.
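
The sketch below illustrates the core idea on a single parameter tensor; the chunking scheme, the exact DCT kernel, and the distributed communication step are simplified, with scipy's DCT standing in for whatever the real implementation uses.

```python
import torch
from scipy.fft import dctn, idctn   # stand-in DCT; DeMo's actual kernel may differ

def extract_fast_components(momentum, chunk=64, top_k=8):
    """Keep only the k highest-magnitude DCT coefficients per chunk of the local
    momentum (the "fast-moving" part); the residual "slow" part stays local."""
    m = momentum.reshape(-1, chunk)                          # fixed-size chunks
    freq = torch.from_numpy(dctn(m.numpy(), axes=[-1]))      # per-chunk DCT
    idx = torch.topk(freq.abs(), top_k, dim=-1).indices      # largest coefficients
    mask = torch.zeros_like(freq).scatter_(-1, idx, 1.0)
    fast_freq = freq * mask                                  # sparse: only this is communicated
    fast = torch.from_numpy(idctn(fast_freq.numpy(), axes=[-1])).reshape(momentum.shape)
    return fast_freq, fast, momentum - fast                  # slow residual kept locally

# Per step, per parameter (conceptually):
#   momentum = beta * momentum + grad
#   fast_freq, fast, momentum = extract_fast_components(momentum)
#   gather/average fast_freq across accelerators, decode it, apply as the update
fast_freq, fast, residual = extract_fast_components(torch.randn(4096))
print(fast_freq.shape, int(fast_freq.count_nonzero()))       # at most 8 nonzeros per 64-wide chunk
```

Because only the sparse set of dominant frequency coefficients crosses the network, the per-step communication volume is a small fraction of a full gradient all-reduce.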

Evaluating Decoupled Momentum (DeMo)

The researchers evaluated the DeMo optimizer using OLMo, an open and reproducible large language model pre-training framework.

The experiments compared two models trained on the Dolma v1.5 dataset: a baseline OLMo-1B model with 1.18 billion parameters using the standard AdamW optimizer, and a DeMo variant.

Due to computational constraints, they trained on 100 billion tokens instead of the full 3 trillion, using 64 H100 GPUs with a global batch size of 2048 and a sequence length of 2048 tokens. The results show that DeMo can train models to quality comparable with the AdamW baseline while communicating far less between accelerators, making it a viable alternative to conventional optimizers.

Densing Law of LLMs

Xiao et al. [Tsinghua University, ModelBest Inc.]

♥ 178   LLM Scaling Law   bycloud’s pick  

Introduction to Densing Laws in LLMs 

The quality of LLM outputs improves drastically as the number of model parameters increases, but this approach introduces significant challenges at both training and inference time. As LLMs expand to hundreds of billions of parameters, deploying them in resource-constrained environments becomes increasingly difficult, particularly on end devices like smartphones and in applications with high computational costs.

To address these challenges, the researchers introduce the concept of "capability density": a metric that evaluates LLM quality as the ratio of effective parameter size to actual parameter size. Think of density as how much capability is packed into a given number of parameters: a dense model gets a lot done with fewer parts. Instead of focusing on the total parameter count, this work looks at how well a model performs relative to its size. Since a model's "intelligence" cannot be measured directly, the study uses existing benchmarks to see how efficiently models solve problems.

By analyzing 29 open-source pre-trained base models, they discovered the "Densing Law" - an empirical observation that the maximum capability density of LLMs grows exponentially, with density potentially doubling approximately every three months.

Understanding Densing Laws in LLMs 

This paper defines LLM density as the ratio of a model's effective parameter size to its actual parameter size. The effective parameter size represents the number of parameters a smaller, reference model would need to achieve the same performance. The researchers estimate this by first fitting a function relating a reference model's parameter size to its language modeling loss (using a power-law function based on the Scaling Law).

A second function then relates this loss to downstream task performance (a sigmoid, because performance is bounded). Inverting both functions converts a model's observed performance into an effective parameter size, from which its density follows.

This two-step process addresses the difficulty of directly linking parameter size to downstream performance. The first step, loss estimation, uses reference models of varying sizes to establish the relationship between parameter size and language modeling loss on a test set. It focuses on conditional loss, which reflects the model's ability to generate correct answers given specific instructions rather than its overall sequence prediction, and the various task formats (e.g., multiple choice, mathematical reasoning) are handled by adjusting how this conditional loss is calculated.
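
As a rough illustration of what "conditional loss" means in practice, the sketch below (assuming a Hugging Face-style causal language model; the function and variable names are mine) averages the negative log-likelihood over the answer tokens only, masking the instruction out of the loss:

```python
import torch

def conditional_loss(model, tokenizer, instruction, answer):
    """Average negative log-likelihood of the answer tokens given the instruction;
    instruction positions get label -100 so they are ignored by the loss."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the answer continuation
    return model(input_ids, labels=labels).loss.item()
```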

The second step uses well-trained, open-source models to link the estimated loss to actual performance on downstream tasks, again using a sigmoid function to account for the bounded nature of performance metrics. The effective parameter size recovered from these fits is then divided by the model's actual parameter size to give its density, which provides a more nuanced measure of efficiency than parameter count alone.
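
Putting the two steps together, the density calculation can be summarized roughly as follows; the notation is a simplified paraphrase of the description above, not necessarily the paper's exact parameterization.

```latex
% Step 1: scaling-law fit from parameter count N to conditional language-modeling loss
\mathcal{L}(N) \approx a\,N^{-\alpha} + b

% Step 2: sigmoid fit from loss to downstream benchmark score S (bounded)
S(\mathcal{L}) \approx \frac{c}{1 + e^{\,k(\mathcal{L} - l_0)}} + d

% Effective parameter size of a model M with observed score S_M: invert both fits
\hat{N}(S_M) = \mathcal{L}^{-1}\!\big(S^{-1}(S_M)\big)

% Capability density of model M with actual parameter count N_M
\rho(M) = \frac{\hat{N}(S_M)}{N_M}
```

The Densing Law itself is then the observation that the maximum $\rho$ among open models grows roughly exponentially with release date, which is where the "doubling approximately every three months" figure comes from.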

Comparing LLM Density

By analyzing the density of various LLMs over time, the researchers observe a rapid increase, which suggests a shift in development focus from simply increasing parameter size to improving model efficiency and cost-effectiveness.

This "Densing Law" mirrors the historical trend in semiconductor technology, where increasing transistor density on a chip has been a key driver of progress. The Densing Law suggests that we're not just making bigger models; we're also making them smarter at using their resources. It's like comparing two cars: one might have a bigger engine, but the other might be more fuel-efficient and still get you to the same place. While the Densing Law suggests that LLMs will continue to become more efficient, the researchers are a bit apprehensive.

Current tests to measure efficiency might not fully capture all of what LLMs can do, especially as they become more capable of complex reasoning. Therefore, future research needs to include other types of models that deal with more than just text, like those that use images or videos, because it's important to also measure their efficiency. Finally, it might be even better to judge models based on how much computing power they use when they're actually answering questions, not just how many parts they have inside.
