Byte Latent Transformer: The Future of LLM is Without Tokenization?

Training Large Language Models to Reason in a Continuous Latent Space, and [MASK] is All You Need

Dec 11th ~ Dec 18th
#36 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 3k Meta has introduced Llama 3.3, a 70B model offering top-tier performance on text-based tasks like synthetic data generation while significantly reducing inference costs. Its advancements stem from a new alignment process and improvements in online RL techniques, delivering performance comparable to Llama 3.1 405B at a fraction of the cost and making it suitable for local deployment on common workstations.

    Llama-3.3 benchmark

  2. ♥ 719 Microsoft has published Phi-4 along with its technical report. Phi-4 is a state-of-the-art small language model (SLM) with 14B parameters that excels at complex reasoning tasks such as math alongside conventional language processing. Phi-4 is currently accessible on Azure AI Foundry and will be available on Hugging Face next week.

    Phi-4 model efficiency

  3. ♥ 1.5k Google DeepMind has introduced Gemini 2.0, a family of AI models capable of advanced reasoning, multimodal input and output, and tool use. Gemini 2.0 enhances AI capabilities across text, images, video, and audio, supporting complex tasks like advanced math, multimodal queries, and coding. The Gemini 2.0 Flash variant delivers twice the speed of its predecessor, 1.5 Pro, while adding native image generation, multilingual text-to-speech, and tool integration. Developers can access Gemini 2.0 Flash via the Gemini API, with broader availability expected in January. Additionally, a new Multimodal Live API supports real-time audio and video streaming for interactive applications.

    Gemini 2.0 Flash Experimental benchmark

  4. ♥ 5.4k Google DeepMind has announced Veo 2, a state-of-the-art video generation model capable of producing realistic, high-quality clips from text or image prompts (actually looks REALLY good). Veo 2 supports resolutions up to 4K, understands camera controls (e.g., POV, wide shots, drone shots), and improves realism in physics and human expressions.

    Additionally, Imagen 3, an improved text-to-image model, offers enhanced accuracy, balanced compositions, and diverse art styles, such as realism and fantasy. Imagen 3 is now available in ImageFX across 111 countries, while Veo 2 will launch in VideoFX next year.

    DeepMind Veo 2 demo (check out their blog for better quality)

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot!

Byte Latent Transformer: Patches Scale Better Than Tokens

Facebook AI Research at Meta

♥ 3.1k   LLM Tokenization   bycloud’s pick  

Introduction to Byte Latent Transformer

LLMs rely on a preprocessing step called tokenization, which segments text into tokens drawn from a fixed vocabulary before the actual model training begins. This approach has several significant drawbacks. Tokenization bakes biases into how language is compressed, which leads to problems like unequal handling of different languages, sensitivity to small input variations, and limited understanding of subtle language nuances such as orthographic details and phonetic structure.

This paper introduces the Byte Latent Transformer (BLT), which tackles these issues with a new approach: instead of using predefined tokens, the model learns directly from raw bytes and dynamically groups them into "patches" based on how hard the next byte is to predict. This allocates computational resources more intelligently: predictable, low-complexity stretches of text are folded into long patches that cost the large global model few steps, while complex sections get short patches and correspondingly more compute, as sketched below.
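To make the patching rule concrete, here is a minimal sketch of entropy-based patch segmentation, assuming a small byte-level LM (`byte_lm`) that returns next-byte logits; the function names and threshold value are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn.functional as F

def patch_boundaries(byte_ids: torch.Tensor, byte_lm, threshold: float = 2.0):
    """Start a new patch wherever the next byte is hard to predict.

    byte_ids: (seq_len,) tensor of byte values in [0, 255].
    byte_lm:  assumed small causal LM returning (1, seq_len, 256) logits.
    """
    logits = byte_lm(byte_ids.unsqueeze(0)).squeeze(0)      # (seq_len, 256)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)    # (seq_len,)
    # High entropy => the upcoming byte is unpredictable => open a new patch,
    # so the large global transformer spends a fresh step on the hard region.
    starts = [0] + [i + 1 for i in range(len(entropy) - 1)
                    if entropy[i].item() > threshold]
    return starts
```

Predictable stretches (say, the middle of a common word) yield long patches and few global-transformer steps; unpredictable ones yield short patches and more compute.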

Byte Latent Transformer Architecture

The BLT architecture is a three-part system designed to process and transform byte-level data more efficiently. The first part, the global transformer, acts as the brain of the system: it operates on patch representations and determines how much computational effort different parts of the input receive.

The second part, the local encoder, acts like a specialized translator that takes raw byte sequences and converts them into more meaningful, compressed representations called "patches". It looks not just at individual bytes but also at their surrounding context using "n-gram hash embeddings", and it uses cross-attention layers to pool and compress byte information into compact, informative patches that the global transformer can process efficiently.

The third part, the local decoder, completes the system by performing the reverse process: transforming the processed patch representations back into raw byte sequences. It uses a cross-attention mechanism similar to the encoder's, but with the roles of queries and keys/values reversed. This lets it take the transformed patch representations from the global transformer and reconstruct the original byte sequence, almost like reassembling a puzzle after examining each piece carefully. The entire architecture is computationally flexible, spending more processing power on complex parts of the input and less on simpler sections, which is a significant innovation in handling byte-level transformations.

The most fascinating aspect of this architecture is its ability to handle byte sequences with varying complexity dynamically. By using a block-causal attention mask and the ability to control computational resources across different input segments, the BLT model can adapt its processing power in real-time. 
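The three modules can be pictured schematically as below. This is a simplified sketch under assumed shapes and module names; it omits the paper's n-gram hash embeddings, block-causal masks, and multi-layer cross-attention:

```python
import torch
import torch.nn as nn

class BLTSketch(nn.Module):
    """Schematic local encoder -> global transformer -> local decoder flow."""
    def __init__(self, n_bytes=256, d_byte=512, d_patch=2048):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, d_byte)
        # Encoder cross-attention: a learned patch query pools byte states.
        self.patch_query = nn.Parameter(torch.randn(1, 1, d_patch))
        self.encode_pool = nn.MultiheadAttention(
            d_patch, num_heads=8, kdim=d_byte, vdim=d_byte, batch_first=True)
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_patch, nhead=16, batch_first=True),
            num_layers=4)
        # Decoder cross-attention with roles reversed: bytes query patches.
        self.decode_pool = nn.MultiheadAttention(
            d_byte, num_heads=8, kdim=d_patch, vdim=d_patch, batch_first=True)
        self.to_logits = nn.Linear(d_byte, n_bytes)

    def forward(self, byte_ids, starts):
        h = self.byte_emb(byte_ids)                    # (1, seq, d_byte)
        ends = starts[1:] + [byte_ids.size(1)]
        patches = []
        for s, e in zip(starts, ends):                 # pool each patch's bytes
            pooled, _ = self.encode_pool(self.patch_query, h[:, s:e], h[:, s:e])
            patches.append(pooled)
        p = self.global_tf(torch.cat(patches, dim=1))  # (1, n_patches, d_patch)
        out, _ = self.decode_pool(h, p, p)             # bytes attend to patches
        return self.to_logits(out)                     # next-byte logits
```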

Results and Evaluation of Byte Latent Transformer

The experimental evaluation shows that BLT performs strongly across multiple challenging benchmarks. In character-level tasks, BLT outperformed token-based models like Llama 3 by significant margins. On the CUTE benchmark, which tests intricate sequence manipulation and character understanding, BLT surpassed existing models by over 25 points and achieved an impressive 99.9% accuracy on spelling tasks. What's even more impressive is that these improvements were achieved despite BLT being trained on 16 times less data than comparable models.

When evaluated on the FLORES-101 benchmark, which spans six language families and twenty-one lower-resource languages, BLT achieved a 2-point overall advantage in translation to English and a 0.5-point advantage in translation from English. Furthermore, by initializing the global transformer from pre-trained Llama 3.1 parameters, the researchers discovered a transfer learning approach that not only reduced training computational requirements but also enabled significant performance improvements. 

Training Large Language Models to Reason in a Continuous Latent Space

Hao et al. [FAIR at Meta, UC San Diego]

♥ 1.8k   LLM Reasoning

Introduction to Coconut (Chain of Continuous Thought) 

LLMs have impressive reasoning capabilities, but they're fundamentally constrained by having to express their entire reasoning process through language tokens. Current methods like chain-of-thought (CoT) require models to generate step-by-step solutions in text, which is inefficient and doesn't match how humans actually reason. Neuroimaging studies have shown that our language networks are largely inactive during complex reasoning tasks, suggesting that verbal articulation isn't the most natural way to solve problems. Moreover, existing approaches allocate almost equal computational resources to every reasoning token, even though some tokens are critical for solving a problem while others are mere linguistic filler.

This paper introduces Coconut (Chain of Continuous Thought), a new method that allows language models to reason in a pure, unrestricted latent space. Instead of translating reasoning steps into words, the model uses its hidden state representations directly as input for the next reasoning step. This enables a more flexible reasoning mechanism in which the model can explore multiple potential solution paths simultaneously, similar to a breadth-first search.

By treating reasoning as a continuous, differentiable process rather than a linear token sequence, Coconut can maintain multiple reasoning options, progressively eliminate incorrect paths, and potentially solve complex problems more efficiently. 

Architecture of Coconut (Chain of Continuous Thought) 

The Coconut method reimagines how large language models reason by introducing a novel "latent mode" alongside the traditional language mode. In standard language models, reasoning occurs through sequentially generated text tokens where each step is constrained by the vocabulary. Coconut breaks this limitation by introducing a unique approach where the model can switch between generating language tokens and directly using hidden state representations as reasoning steps. This is achieved through special tokens <bot> and <eot> that mark the beginning and end of latent reasoning mode. During latent mode, instead of using token embeddings, the model uses the last hidden state of the previous token as the input embedding for the next reasoning step, effectively allowing the model to reason in a continuous, unrestricted latent space.
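A minimal inference-time sketch of this loop, assuming a Hugging Face-style causal LM whose tokenizer already contains <bot> and <eot> as special tokens; the helper names and greedy decoding are our simplifications, not the authors' code:

```python
import torch

@torch.no_grad()
def coconut_generate(model, tokenizer, question, n_latent=2, max_new=64):
    ids = tokenizer(question + "<bot>", return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)
    # Latent mode: feed the last hidden state straight back in as the
    # next input embedding instead of sampling a vocabulary token.
    for _ in range(n_latent):
        h = model(inputs_embeds=emb, output_hidden_states=True).hidden_states[-1]
        emb = torch.cat([emb, h[:, -1:, :]], dim=1)     # one continuous thought
    # Back to language mode: append <eot>, then decode tokens greedily.
    eot_ids = tokenizer("<eot>", return_tensors="pt",
                        add_special_tokens=False).input_ids
    emb = torch.cat([emb, model.get_input_embeddings()(eot_ids)], dim=1)
    out = []
    for _ in range(max_new):
        next_id = model(inputs_embeds=emb).logits[:, -1].argmax(-1, keepdim=True)
        out.append(next_id)
        emb = torch.cat([emb, model.get_input_embeddings()(next_id)], dim=1)
    return tokenizer.decode(torch.cat(out, dim=1)[0])
```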

The training process for Coconut is equally innovative. It uses a multi-stage curriculum that gradually introduces continuous thoughts into the reasoning chain. Initially, the model is trained on traditional chain-of-thought reasoning with language tokens. In subsequent training stages, the model progressively replaces language reasoning steps with continuous thought representations. Each continuous thought is fully differentiable, allowing for end-to-end optimization. These continuous thoughts are not meant to compress the language reasoning, but to facilitate more effective reasoning by maintaining multiple potential solution paths simultaneously. This enables a breadth-first search-like reasoning approach, where the model can explore and eliminate incorrect reasoning paths more dynamically than traditional linear reasoning methods.
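As a rough illustration, the curriculum can be pictured as stage-indexed training examples like the sketch below, where at stage k the first k language steps are swapped for latent-thought placeholders whose input embeddings are filled with the model's own hidden states during training (the function signature and placeholder convention are assumptions for exposition):

```python
def build_stage_example(question, cot_steps, answer, stage, thoughts_per_step=1):
    """Stage 0 is plain chain-of-thought; stage k replaces the first k
    written reasoning steps with continuous-thought placeholders."""
    n_latent = stage * thoughts_per_step
    remaining = cot_steps[stage:]              # steps still written in language
    return (question + "<bot>" + "<thought>" * n_latent + "<eot>"
            + " ".join(remaining) + " " + answer)
```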

During inference, Coconut operates similarly to a standard language model, with the critical difference being its ability to switch between language and latent modes. The model begins by inserting a <bot> token after the initial question, signaling the start of reasoning. It then alternates between generating language tokens and exploring continuous thought representations. To manage the transition back to language mode, the researchers explored two strategies: training a binary classifier to autonomously determine when to terminate latent reasoning, or simply padding latent thoughts to a constant length. Their experiments showed both approaches performed comparably, with the latter being preferred for its simplicity. This approach allows the model to reason more flexibly, potentially maintaining multiple reasoning paths and progressively refining its solution, which differs fundamentally from traditional token-by-token reasoning methods.

Benchmarking Coconut (Chain of Continuous Thought) LLMs

By exploring "chaining" continuous thoughts, this paper has shown significant performance improvements across various tasks, particularly in complex reasoning scenarios like mathematical word problems. In experiments with the GSM8k dataset, Coconut outperformed existing architectures, including the latest baseline iCoT.

The study shows that although LLMs have immense potential, they still require guided learning to leverage latent reasoning effectively. The researchers used a multi-stage curriculum that decomposes learning into more manageable objectives, which enables the model to achieve top performance across different tasks. Additionally, the research showed that increasing the number of continuous thoughts per reasoning step from 0 to 2 steadily improved performance, with particularly impressive results on planning-intensive tasks like ProsQA.

[MASK] is All You Need

Hu and Ommer [LMU Munich, MCML]

♥ 1.1k   Image Tokenization

Introduction to Discrete Interpolants

Currently, two popular generative modeling approaches dominate the industry: Masked Generative Models and Non-Autoregressive Diffusion Models. These approaches developed separately, and the field lacks a comprehensive understanding of their underlying similarities and potential for integration. Moreover, there's a significant gap in bridging generative and discriminative tasks, particularly in the vision domain. Existing methods struggle with flexible sampling, conditional generation, and understanding the theoretical connections between different generative modeling approaches.

This paper proposes the "Discrete Interpolants" framework to address these challenges by creating a unified approach that bridges multiple generative modeling techniques. The researchers have developed a comprehensive design space that connects Masked Generative Models and Diffusion Models which allows for a more flexible and generalized approach to generative AI.

Understanding the Discrete Interpolants Architecture

This paper introduces a novel framework called "Discrete Interpolants" that bridges different generative AI modeling approaches. The method works by progressively unmasking data tokens. To understand this, imagine starting with a completely blacked-out image and gradually revealing details, like solving a complex puzzle. Instead of using continuous data transformations, the framework uses discrete tokens, which makes the process more compatible with existing language models and more computationally efficient.

During training, the model learns to predict original data by starting from a fully masked state and progressively revealing tokens according to a specific "masking schedule". This schedule determines how and when tokens get revealed. The researchers introduced two significant model types: an Explicit Timestep Model (which depends on specific time steps) and an Implicit Timestep Model (which removes time step dependencies), providing more flexibility in how the model generates or reconstructs data.
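Here is a minimal sketch of one training step under an assumed cosine masking schedule; the function names, mask-token id, and schedule choice are illustrative, since the paper explores a whole design space of schedules:

```python
import torch
import torch.nn.functional as F

MASK_ID = 8192  # assumed reserved [MASK] id appended to the token vocabulary

def training_step(model, tokens):
    """tokens: (batch, seq) discrete token ids, e.g. from a VQ tokenizer."""
    t = torch.rand(tokens.size(0), 1)              # timestep in (0, 1)
    keep_prob = torch.cos(0.5 * torch.pi * t)      # masking schedule alpha(t)
    is_masked = torch.rand(tokens.shape) > keep_prob
    x_t = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(x_t, t.squeeze(1))              # explicit-timestep variant
    # Train the model to recover the original tokens at masked positions only.
    return F.cross_entropy(logits[is_masked], tokens[is_masked])
```

An implicit-timestep model would simply drop the `t` argument, since the mask pattern itself tells the model how corrupted the input is.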

The framework can be applied not just to generative tasks like image creation, but also to discriminative tasks like image segmentation. By treating segmentation as an "unmasking process", they can train a single model to perform multiple types of tasks. The method also includes advanced features like classifier-free guidance, which allows more controlled generation by introducing conditional sampling. This means the model can generate images or data with specific characteristics more precisely, opening up new possibilities for AI-driven content creation and analysis.
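Classifier-free guidance carries over to this discrete setting by blending conditional and unconditional predictions at sampling time; a sketch under assumed model and argument names:

```python
def guided_logits(model, x_t, t, cond, null_cond, w=3.0):
    """Blend logits before choosing which masked tokens to reveal.

    cond:      the condition (e.g., a class-label embedding)
    null_cond: the learned "empty" condition used via dropout during training
    w:         guidance weight; w = 0 recovers the unconditional model
    """
    logits_cond = model(x_t, t, cond)
    logits_uncond = model(x_t, t, null_cond)
    # w > 0 pushes samples toward the condition.
    return logits_uncond + w * (logits_cond - logits_uncond)
```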

Evaluating Discrete Interpolants on Benchmarks

The researchers tested their new "Discrete Interpolants" method on two main tasks: image and video generation. For image generation, they achieved impressive results on two challenging datasets - MS-COCO and ImageNet. This method not only competed with existing state-of-the-art models but in some cases outperformed them. By using a unique approach of progressively "unmasking" images - similar to revealing a picture hidden under layers - they were able to generate high-quality images with remarkable accuracy.

Additionally, they demonstrated that this method works effectively across different types of generative models, including both "Explicit Timestep" and "Implicit Timestep" models. This means the technique can adapt to various image generation scenarios. When the researchers extended their method to video generation using the FaceForensics dataset, it again showed promising results. This suggests that this technique can successfully scale from creating single images to generating more complex video content.
