Samba, Depth Anything v2, and TiTok

The AI Timeline #10

In this issue: x4 industry news, x3 AI research papers

June 10th ~ June 16th

🗞️ Industry News in 1 Line

  1. ♥ 5.1k Luma AI Labs recently announced Dream Machine, a text-to-video model capable of creating highly detailed, photorealistic videos. It is quite fast, generating about 1 frame per second, and it is currently one of the best text-to-video tools available, measuring up to OpenAI’s Sora.

  2. ♥ 3.3k Shortly after Luma AI Labs released Dream Machine, RunwayML dropped Gen-3 Alpha, its latest text-to-video model, to compete with Luma AI Labs. It can generate human characters with a wide range of actions, gestures, and emotions, and it ships with a new set of safeguards and C2PA provenance standards.

  3. ♥ 1.5k Stability AI recently announced Stable Diffusion 3 Medium, a cutting-edge text-to-image model optimized for consumer hardware and released under both open and commercial licenses. It can render text within images, but its strong guardrails drew mixed reactions from people on Reddit.

  4. ♥ 1.1k NVIDIA launched Nemotron-4 340B, an open suite of models for generating synthetic data to train large language models. It is optimized for NVIDIA NeMo and TensorRT-LLM, available for download, and customizable for enterprise use.

1. Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Ren et al. [Microsoft, University of Illinois]

 ♥ 1.5k   LLM

Architecture overview of Samba and Mamba

Introduction to Samba

Creating efficient large language models that can handle long contexts has been a long-standing challenge. While most popular models rely on attention mechanisms, State Space Models (SSMs) offer an alternative with linear computational complexity and the potential for better long-sequence performance. Inspired by Mamba, a selective SSM, this paper introduces SAMBA, a neural architecture that combines SSMs with attention mechanisms and achieves unlimited sequence length extrapolation with linear time complexity.

How Does Samba Work?

  1. Mamba layer captures recurrent sequence structures.

    • How? It selectively focuses on relevant input elements using input-dependent gating. It expands input representations, applies a short convolution, and calculates a selective gate for soft selection.

    • Why? Mamba helps the model remember important information over long sequences. Read our breakdown of Mamba-2.

  2. Sliding Window Attention (SWA) layer precisely retrieves the relevant memory.

    • How? It operates on a sliding window over the input sequence, directly accessing context contents through attention. SWA captures non-Markovian dependencies that Mamba might miss.

    • Why? SWA complements Mamba by capturing short to middle-term history.

  3. Multi-Layer Perceptron (MLP) Layer recalls factual knowledge and enables nonlinear transformations.

    • How? The paper uses SwiGLU, an MLP variant also used in the Llama models, to process the different types of information captured by Mamba and SWA.

    • Why? MLPs enhance SAMBA’s ability to handle complex patterns and improve overall performance. (A minimal sketch of how the three layers fit together follows after this list.)
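
To make the composition of these three layers concrete, here is a minimal PyTorch sketch of one hybrid block, assuming a simple Mamba → SWA → MLP stacking with pre-norm residual connections. The selective SSM itself is passed in as a placeholder module, and the dimensions, window size, and exact layer ordering are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SwiGLU MLP (the variant also used in Llama-style models)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to a fixed window of recent tokens."""

    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        i = torch.arange(T, device=x.device).unsqueeze(1)
        j = torch.arange(T, device=x.device).unsqueeze(0)
        # True = blocked: future tokens and tokens older than the window.
        mask = (j > i) | (j <= i - self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class SambaBlock(nn.Module):
    """One hybrid block: selective SSM -> sliding-window attention -> SwiGLU MLP,
    each pre-normalized and wrapped in a residual connection."""

    def __init__(self, d_model: int, n_heads: int, window: int, ssm_layer: nn.Module):
        super().__init__()
        self.ssm = ssm_layer  # e.g. a Mamba layer; any (B, T, D) -> (B, T, D) module fits here
        self.swa = SlidingWindowAttention(d_model, n_heads, window)
        self.mlp = SwiGLU(d_model, 4 * d_model)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norms[0](x))  # long-range recurrent memory
        x = x + self.swa(self.norms[1](x))  # precise retrieval over recent context
        x = x + self.mlp(self.norms[2](x))  # nonlinear mixing / factual recall
        return x
```

Treat this as one illustrative wiring of the three ingredients rather than the exact published layout.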

Results and Real-World Implications of Samba

This study evaluates four SAMBA models of varying size (421M, 1.3B, 1.7B, and 3.8B parameters) on benchmarks covering commonsense reasoning, language understanding, truthfulness, and math/coding tasks. SAMBA consistently outperforms comparable models, demonstrating its effectiveness across diverse language comprehension tasks. It particularly excels on the GSM8K benchmark, achieving significantly higher accuracy than the other models, and its efficient length extrapolation and long-context understanding make it a powerful language model.

Comparison of Samba with state-of-the-art models such as Mistral and Mamba.

2. Depth Anything V2

Yang et al. [HKU, TikTok]

 ♥ 951   Depth Estimation

Ground truth vs Depth Anything V1 vs V2

Introduction to Depth Anything V2

Bats use echolocation to figure out how far away something is, but computers can’t shout, so humans invented systems such as SONAR, RADAR, and LiDAR. These, however, require specialized hardware; it would be far more convenient if computers could estimate distance from a simple image. One technique for doing this is monocular depth estimation (MDE), which plays a crucial role in applications ranging from 3D reconstruction to autonomous driving.

As the demand for precise depth information grows, researchers have developed numerous MDE models. Depth Anything V2 is one such model, building on the foundation of Depth Anything V1. Instead of relying on fancy architectural tricks, it focuses mainly on data-driven improvements, such as replacing all labeled real images with precise synthetic images and scaling up the capacity of the teacher model.

Example of image segmentation using Depth Anything V2

How does Depth Anything V2 Work?

Depth Anything V2 aims to improve MDE by addressing the limitations of synthetic data: synthetic images generated by graphics engines differ from real images in style and color, with real images looking noisier and more random while synthetic ones appear “cleaner.” Moreover, synthetic images are sampled from pre-defined scene types and have limited diversity, whereas real-world scenes are far more varied. Transferring from synthetic training to real-world testing is therefore difficult because of these distribution differences, and models struggle to generalize from synthetic to real images even when the layouts are similar.

To bridge this gap, the paper incorporates unlabeled real images as an intermediate step: models learn from these real images before facing the real test data. Since the real images contain diverse scenes that are absent from the synthetic data, training on both kinds of images improves generalization. Depth Anything V2 also uses knowledge transfer, training smaller models from the most capable model (the teacher) on pseudo-labeled real data.

Depth Anything V2 Training Pipeline

  1. Train a reliable teacher model (DINOv2-G): The authors start by training a ‘teacher’ model on high-quality synthetic images that have precise depth information, so that it learns accurate depth estimation from this clean data.

  2. Annotate unlabeled real images with pseudo depth labels from the teacher: Use the teacher model to generate depth labels for real-world images that don’t have any depth information. These generated labels are called ‘pseudo labels’ because they are inferred by the model, not obtained by actual measurements.

  3. Train student models on pseudo-labeled real images: Finally, train ‘student’ models using the real images and their pseudo labels from the teacher model. The goal is for these student models to learn to generalize well to new, unseen images by learning from the pseudo-labeled data. (A schematic sketch of this pipeline is shown below.)

Training pipeline of Depth Anything V2
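
As a rough illustration of stages 2 and 3 (assuming the teacher from stage 1 has already been trained on synthetic data), the pseudo-labeling and distillation loop might look like the sketch below. The model classes, data loaders, and the plain L1 loss are placeholders, not the authors' code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


@torch.no_grad()
def pseudo_label(teacher: nn.Module, real_loader: DataLoader):
    """Stage 2: use the frozen teacher to assign pseudo depth maps to unlabeled real images."""
    teacher.eval()
    pairs = []
    for images in real_loader:                   # batches of raw, unlabeled real images
        pairs.append((images, teacher(images)))  # the teacher's prediction is the 'pseudo label'
    return pairs


def train_student(student: nn.Module, pseudo_pairs, epochs: int = 1, lr: float = 1e-4):
    """Stage 3: train a smaller student only on (real image, pseudo depth) pairs.
    Plain L1 is a stand-in for the depth losses actually used in practice."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for images, pseudo_depth in pseudo_pairs:
            loss = nn.functional.l1_loss(student(images), pseudo_depth)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student


# Stage 1 (not shown): the teacher itself is first trained on synthetic images that come
# with exact depth maps, using an ordinary supervised loop of the same shape as above.
```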

Evaluating Depth Anything V2

Depth Anything V2 outperforms existing MDE models across a range of metrics and achieves state-of-the-art performance on the newly introduced DA-2K benchmark, which tests models on high-resolution images with sparse depth labels. The following table compares various models fine-tuned for in-domain metric depth estimation, where the training and testing images come from the same domain; all compared models use encoders similar in size to ViT-L.

Benchmark results of Depth Anything V2

From the results, we can see that Depth Anything V2 shows remarkable improvements over previous methods on both the NYU-D and KITTI datasets; even its smallest variant, based on ViT-S, outperforms other models that use a ViT-L encoder. However, it’s important to note that, while the metrics are impressive, models trained solely on the NYUv2 or KITTI datasets struggle with fine-grained depth prediction and robustness against transparent objects due to noise in the training sets.

3. An Image is Worth 32 Tokens for Reconstruction and Generation

Yu et al. [ByteDance, Technical University Munich]

 ♥ 1.3k   Image Transformer

Example of images generated by TiTok

Introduction to TiTok

Existing image tokenization methods, like VQGAN, use 2D latent grids with fixed downsampling factors, which handle redundancy in images poorly. Because neighboring regions of an image are often similar, these methods end up repeating information and struggle to compress the latent representation effectively.

This paper introduces the Transformer-based 1-Dimensional Tokenizer (TiTok). Unlike traditional 2D tokenizers, TiTok tokenizes images into 1D latent sequences, leading to a more compact representation. For instance, it can reduce a 256 × 256 × 3 image to just 32 discrete tokens, significantly fewer than the 256 or 1024 tokens required by previous methods. This approach not only saves computational resources but also achieves competitive or even superior performance in image generation tasks.
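
For context, the 256- and 1024-token figures quoted above follow directly from the usual 2D-grid arithmetic; here is a quick sanity check, assuming typical downsampling factors of 16× and 8× (these factors are an illustrative assumption, not taken from the paper):

```python
def grid_tokens(image_size: int, downsample: int) -> int:
    """Token count for a conventional 2D tokenizer: one token per latent-grid cell."""
    side = image_size // downsample
    return side * side


print(grid_tokens(256, 16))  # 256 tokens  (16 x 16 grid)
print(grid_tokens(256, 8))   # 1024 tokens (32 x 32 grid)
# A 1D tokenizer like TiTok picks the sequence length directly, e.g. just 32 tokens.
```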

Inner-Workings of TiTok

TiTok, short for Transformer-based 1-Dimensional Tokenizer, is a new framework that converts images into a one-dimensional sequence of tokens rather than the traditional two-dimensional grid of tokens that most vector quantization (VQ) models use. Its key advantage is that it doesn’t require a fixed mapping between image patches and latent tokens, which allows for more flexibility and potentially better performance in tasks like image reconstruction and generation.

Image Reconstruction with TiTok

  1. The image is broken down into patches and combined with a set of latent tokens.

  2. These are processed by an encoder (a type of neural network) to create a compact one-dimensional sequence of tokens that represent the image.

  3. During de-tokenization, these tokens are combined with mask tokens (a technique to help with reconstruction) and fed into a decoder to reconstruct the original image. A code sketch of this path is shown below.

Image Reconstruction Pipeline for TiTok
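
Below is a minimal PyTorch sketch of this tokenize/de-tokenize path, assuming a plain Transformer encoder on both sides and naive nearest-neighbor vector quantization. The class name, dimensions, layer counts, and linear pixel head are illustrative choices, not the paper's configuration; positional embeddings and the straight-through trick typically used to train through quantization are omitted for brevity.

```python
import torch
from torch import nn


class TiTok1DSketch(nn.Module):
    """Sketch of a 1D tokenizer: K learnable latent tokens summarize all image patches."""

    def __init__(self, image_size=256, patch=16, dim=256, n_latent=32, codebook_size=4096):
        super().__init__()
        self.n_latent = n_latent
        self.n_patches = (image_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latent_tokens = nn.Parameter(torch.randn(n_latent, dim) * 0.02)
        self.mask_token = nn.Parameter(torch.randn(dim) * 0.02)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(  # a second bidirectional Transformer
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.to_pixels = nn.Linear(dim, patch * patch * 3)

    def tokenize(self, images: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W) -> (B, K)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, dim)
        latents = self.latent_tokens.expand(images.size(0), -1, -1)    # (B, K, dim)
        encoded = self.encoder(torch.cat([patches, latents], dim=1))
        latents = encoded[:, -self.n_latent:]                          # keep the latent slots only
        # Naive vector quantization: snap each latent to its nearest codebook entry.
        dist = torch.cdist(latents.reshape(-1, latents.size(-1)), self.codebook.weight)
        return dist.argmin(dim=-1).view(images.size(0), self.n_latent)  # discrete token ids

    def detokenize(self, token_ids: torch.Tensor) -> torch.Tensor:  # (B, K) -> patch pixels
        latents = self.codebook(token_ids)                                   # (B, K, dim)
        masks = self.mask_token.expand(token_ids.size(0), self.n_patches, -1)
        decoded = self.decoder(torch.cat([masks, latents], dim=1))
        # One output patch per mask slot; fold the patches back into an image grid in practice.
        return self.to_pixels(decoded[:, :self.n_patches])                   # (B, P, patch*patch*3)
```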

Image Generation with TiTok

  1. For generating images, TiTok uses a similar approach but includes a step where some tokens are randomly replaced with mask tokens.

  2. A transformer model then predicts what the masked tokens should be.

  3. This process is repeated iteratively to generate an image from scratch, as sketched below.

Image Generation Pipeline for TiTok
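
Here is a sketch of this iterative unmasking loop, in the spirit of MaskGIT-style masked decoding. `predictor` is a hypothetical handle for the bidirectional transformer generator, assumed to map a partially masked (1, 32) token sequence to per-position logits, and the confidence-based linear unmasking schedule is an illustrative choice.

```python
import torch


@torch.no_grad()
def generate_tokens(predictor, n_tokens: int = 32, mask_id: int = 4096, steps: int = 8):
    """Iteratively unmask a 1D token sequence, revealing the most confident positions first.

    `predictor` is a hypothetical handle: it maps a (1, n_tokens) LongTensor whose masked
    positions hold `mask_id` to logits of shape (1, n_tokens, vocab_size), so its embedding
    table needs vocab_size + 1 entries."""
    tokens = torch.full((1, n_tokens), mask_id, dtype=torch.long)    # start fully masked
    for step in range(1, steps + 1):
        masked = tokens.eq(mask_id)
        if not masked.any():
            break
        logits = predictor(tokens)                                   # (1, n_tokens, vocab_size)
        confidence, best = logits.softmax(dim=-1).max(dim=-1)        # best guess per position
        confidence = confidence.masked_fill(~masked, float("-inf"))  # only masked slots compete
        # Linear schedule: after step s, roughly s/steps of all positions have been revealed.
        n_reveal = max(1, round(n_tokens * step / steps) - int((~masked).sum()))
        reveal = confidence.topk(n_reveal, dim=-1).indices[0]
        tokens[0, reveal] = best[0, reveal]
    return tokens  # feed these ids to the de-tokenizer/decoder to get the final image
```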

Two-Stage Training of TiTok

  1. The training involves two stages: a warm-up stage and a decoder fine-tuning stage.

  2. In the warm-up stage, TiTok is trained using proxy codes from another model to simplify the training process.

  3. In the fine-tuning stage, only the decoder is trained further to improve image quality (see the sketch below).

Training Pipeline for TiTok
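
Schematically, the two stages could be wired up as in the sketch below. `titok.predict_proxy_codes`, `proxy_tokenizer.tokenize`, and the simple cross-entropy and MSE losses are hypothetical stand-ins for the actual objectives; `proxy_tokenizer` represents the off-the-shelf tokenizer whose discrete codes serve as the warm-up targets.

```python
import torch
from torch import nn


def two_stage_training(titok, proxy_tokenizer, loader, lr: float = 1e-4):
    """Warm-up on proxy codes, then fine-tune only the decoder on pixel reconstruction."""
    # Stage 1 (warm-up): instead of regressing pixels, TiTok learns to reproduce the
    # discrete codes that an existing tokenizer assigns to each image ('proxy codes').
    opt = torch.optim.AdamW(titok.parameters(), lr=lr)
    for images in loader:
        with torch.no_grad():
            targets = proxy_tokenizer.tokenize(images)     # (B, P) proxy code ids
        logits = titok.predict_proxy_codes(images)         # (B, P, vocab), hypothetical head
        loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2 (decoder fine-tuning): freeze the encoder/quantizer and train only the
    # decoder to map the now-fixed 1D tokens back to pixels.
    for p in titok.parameters():
        p.requires_grad_(False)
    for p in titok.decoder.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(titok.decoder.parameters(), lr=lr)
    for images in loader:
        recon = titok.detokenize(titok.tokenize(images))   # assumes an image-shaped output
        loss = nn.functional.mse_loss(recon, images)       # stand-in for the real losses
        opt.zero_grad()
        loss.backward()
        opt.step()
    return titok
```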

Real-World Implications of TiTok

TiTok can significantly reduce the number of tokens by 8 to 64 times compared to traditional 2D tokenizers – without compromising on the quality of image reconstruction. Experiments with TiTok on ImageNet (256 × 256) have shown that it can achieve comparable reconstruction FID (rFID) scores with substantially fewer tokens. For instance, TiTok-L-32 recorded an rFID of 2.21 using only 32 tokens, which is on par with VQGAN from MaskGIT (rFID 2.28) but with an 8× smaller latent representation. Additionally, under identical generator frameworks and sampling steps, TiTok-L-32 significantly outperformed MaskGIT in generative FID (gFID), dropping from 6.18 to 2.77.

Benchmark results of TiTok

When benchmarked against diffusion-based generative models, TiTok maintained competitive performance while offering over a 100× speed-up during sampling. Notably, TiTok-L-32 surpassed LDM-4 in gFID (2.77 vs. 3.60) while being 254 times faster in image generation.
