Synthetic 1 Billion Personas, T-FREE, and Adam-mini

#13 | Latest AI Research Explained Simply

In this issue: x3 industry news, x3 AI research papers

July 2nd ~ 9th

🗞️ Industry News in 1 Line

  1. ♥ 303 Kuaishou, the company that announced the text-to-video model KLING, has also published an open-source text-to-image model, Kolors. It’s a Chinese-developed model built on the SDXL architecture.

  2. ♥ 1.6k LMSYS & Anyscale announce RouteLLM, an open-source framework for cost-effective LLM routing. RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% of GPT-4’s performance.

  3. ♥ 161 Claude 3.5 Sonnet is now #1 in coding and instruction following on the SEAL leaderboard, Scale AI’s new private AI leaderboard.


Scaling Synthetic Data Creation with 1,000,000,000 Personas

Chan et al. [Tencent AI Lab]

♥ 697   LLM data

Personas can be combined with a wide range of data synthesis prompts (e.g., create a math problem or write a user prompt) to guide an LLM to synthesize data from the corresponding perspectives.

Introduction to Persona Hub

Humans draw on their varied experiences and perspectives to create diverse content, but AI models often struggle to generate truly varied synthetic data. To address this, researchers at Tencent AI Lab have proposed a technique called “persona-driven data synthesis”. Diverse synthetic data plays a crucial role in many applications, from training language models to simulating real-world scenarios.

As the demand for diverse, high-quality synthetic data grows, researchers have developed numerous approaches. “Persona Hub” is one such innovation which builds upon the foundation of large language models. Instead of relying on complex architectures, it mainly focuses on data-driven improvements by leveraging a vast collection of 1 billion diverse artificial personas automatically curated from web data.

How does Persona Hub work?

Persona Hub aims to improve synthetic data creation by addressing the limitations of traditional methods: existing approaches often struggle to generate truly diverse and contextually relevant data, and they are frequently limited in scale and adaptability across different domains. Moving from narrow, domain-specific data generation to broad, multi-domain applications is challenging because of the complexity of real-world knowledge and experiences.

This paper incorporates a massive collection of diverse personas as an intermediate step. Models learn to generate data from these varied perspectives before facing real-world applications. Since the personas contain diverse backgrounds and knowledge which may not be explicitly present in the training data, using them improves the generalization and relevance of the generated synthetic data.

The authors have initially released 200,000 personas from Persona Hub to facilitate research in persona-driven data synthesis, and are open to releasing more once they can better assess the potential risks and concerns. You can find the personas here.

Persona Hub also uses flexible prompting mechanisms to adapt the data generation process across different scenarios and domains.
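To make the idea concrete, here is a minimal sketch of persona-driven zero-shot data synthesis. The prompt template and the `call_llm` helper are hypothetical illustrations for this newsletter, not the authors’ actual templates or code.

```python
# Hypothetical sketch of persona-driven zero-shot data synthesis.
# `call_llm` stands in for any chat-completion client; the template is
# illustrative, not the paper's exact prompt.

def build_persona_prompt(persona: str, task: str) -> str:
    # Prepending the persona steers the model to synthesize data
    # from that perspective.
    return (
        f"You are the following persona: {persona}\n\n"
        f"{task}\n"
        "Respond with only the synthesized example."
    )

personas = [
    "a freight-train dispatcher optimizing schedules across time zones",
    "a high-school chemistry teacher preparing lab-safety quizzes",
]
task = "Create a challenging math word problem grounded in your daily work."

for persona in personas:
    prompt = build_persona_prompt(persona, task)
    # sample = call_llm(prompt)  # one synthetic example per persona
    print(prompt)
```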

Persona Hub Methodology

  1. Persona Generation: The authors employed two main methods to create personas:

    • Text-to-Persona: Inferring personas from web text data

    • Persona-to-Persona: Deriving new personas based on interpersonal relationships

  2. Deduplication: To ensure diversity, the authors used MinHash (with 1-gram features and a signature size of 128) and embedding-based filtering (using models such as OpenAI’s text-embedding-3-small) to remove duplicate or highly similar personas. Both methods use a similarity threshold of 0.9 (see the sketch after this list).

  3. Flexible Prompting: The method supports zero-shot, few-shot, and persona-enhanced few-shot prompting, allowing for adaptability across different data synthesis scenarios.
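Below is a rough sketch of the MinHash deduplication step, assuming the `datasketch` library. The word-level 1-gram featurization is an assumption based on the description above, and the embedding-based pass is omitted.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128      # signature size, as described above
THRESHOLD = 0.9     # Jaccard similarity threshold

def minhash_of(text: str) -> MinHash:
    # 1-gram (here: word-level) features of the persona description.
    m = MinHash(num_perm=NUM_PERM)
    for word in text.lower().split():
        m.update(word.encode("utf-8"))
    return m

def deduplicate(personas: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for i, persona in enumerate(personas):
        sig = minhash_of(persona)
        if lsh.query(sig):   # estimated similarity >= 0.9 to an already-kept persona
            continue         # treat as a near-duplicate and drop it
        lsh.insert(str(i), sig)
        kept.append(persona)
    return kept

unique_personas = deduplicate([
    "a nurse who coordinates vaccination drives in rural clinics",
    "an astrophysicist modelling neutron-star mergers",
])
```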

Evaluating Persona Hub

Persona Hub demonstrates impressive capabilities across various synthetic data generation tasks. One particularly notable achievement is enabling a 7B parameter model to reach 65% accuracy on the MATH dataset, matching the performance of gpt-4-turbo-preview despite its much smaller scale.

The technical report provides solid evidence that the Persona Hub approach shows remarkable improvements over previous methods on tasks ranging from math problem generation to creating diverse instructions for LLMs. However, it’s important to note that while the capabilities are impressive, there are still ethical considerations and potential limitations to be addressed as this technology develops further.

T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Deiseroth et al. [IPAI, Technical University Darmstadt, hessian.AI, DFKI]

♥ 52   LLM Tokenization

“Eternal glory to anyone who can delete tokenization as a required step in LLM” - Andrej Karpathy in his recent tokenization video.

Andrej Karpathy’s video - Let's build the GPT Tokenizer (2:10:50)

In this paper, the authors have proposed an innovative alternative to traditional tokenization methods used in LLMs. This new approach, called “T-FREE”, addresses several limitations of current tokenization techniques and offers promising improvements in efficiency and performance.

Background to Tokenization

Tokenization is a crucial step in processing text for LLMs, converting input text into integer representations that can be fed into the model. Current popular methods like Byte Pair Encoding (BPE) and Unigram tokenization have served the NLP community well, but they come with inherent weaknesses:

  1. Large vocabularies leading to oversized embedding layers

  2. Duplicate tokens wasting model capacity (illustrated in the snippet below)

  3. Performance degradation on languages not well-represented in the training corpus
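Weakness 2 is easy to observe directly. The snippet below uses the `tiktoken` library (an assumption; any BPE tokenizer shows the same effect) to print the distinct token ids that one word receives under different capitalization and whitespace.

```python
# Illustrating weakness 2: capitalization/whitespace variants of one word
# occupy separate vocabulary entries. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for variant in ["the", " the", "The", " The", "THE"]:
    print(repr(variant), "->", enc.encode(variant))
# Each variant maps to different token id(s), so the embedding layer ends up
# storing several near-identical rows for what is semantically one word.
```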

T-FREE: A New Paradigm

Classic tokenizers learn a single-label vocabulary, i.e. a token is bijectively mapped into a single entry of the vocabulary. Instead, T-FREE uses a bijective multi-label mapping over multiple activations of hashed character trigrams. As T-FREE explicitly models morphological similarities, it enables compression of the embedding layer.

The authors introduce T-FREE as a paradigm shift in text encoding and decoding for LLMs. Instead of using a “fixed vocabulary”, T-FREE directly embeds words through “sparse activation patterns” over character triplets (trigrams). This approach offers several key advantages:

  1. Vocabulary Compression: T-FREE achieves competitive performance with only 12.5% of the embedding parameters used by traditional tokenizers.

  2. Elimination of Duplicates: By design, T-FREE avoids duplicate tokens that often occur due to capitalization and whitespace variations.

  3. Language Adaptability: T-FREE shows better performance across diverse languages without requiring specific training on each language.

How does T-FREE work?

T-FREE’s encoding process involves:

  • Splitting text into words, digits, and special characters

  • Encoding each word using character trigrams

  • Projecting trigrams into a sparse hidden representation

  • Aggregating activations to produce the final word embedding

For decoding, T-FREE uses a multi-label binary cross-entropy loss, reflecting the multi-activation nature of its word representations.
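The encoding side can be sketched in a few lines. Everything below (the padding scheme, the number of hash functions, the vocabulary and hidden sizes) is an illustrative assumption rather than T-FREE’s exact configuration.

```python
# Rough sketch of T-FREE-style word encoding via hashed character trigrams.
# Hyperparameters here are illustrative, not the paper's.
import hashlib
import numpy as np

def trigrams(word: str) -> list:
    padded = f" {word} "                      # include word boundaries
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def active_rows(word: str, vocab_size: int = 8192, num_hashes: int = 4) -> list:
    """Hash every trigram into `num_hashes` rows of the embedding table,
    yielding the word's sparse activation pattern."""
    rows = set()
    for tri in trigrams(word):
        for k in range(num_hashes):
            digest = hashlib.md5(f"{tri}|{k}".encode("utf-8")).hexdigest()
            rows.add(int(digest, 16) % vocab_size)
    return sorted(rows)

def embed_word(word: str, table: np.ndarray) -> np.ndarray:
    # Aggregate (sum) the activated rows to get the word embedding.
    return table[active_rows(word, vocab_size=table.shape[0])].sum(axis=0)

table = np.random.randn(8192, 256)            # 8k rows, hidden size 256
vec = embed_word("tokenizer", table)
# Decoding treats the same activation pattern as a multi-label target
# trained with binary cross-entropy, as described above.
```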

Empirical Results for T-FREE

The authors conducted extensive experiments to validate T-FREE’s performance:

  1. Hyperparameter ablation: A 1B parameter model trained with T-FREE outperformed a 64k-vocabulary Unigram baseline on average while using a vocabulary size of only 8k.

  2. Duplicate Analysis: While traditional tokenizers showed 15-35% duplicate tokens, T-FREE is inherently free of duplicates.

  3. Cross-lingual performance: T-FREE achieved lower fertility (fewer tokens per word) across English, German, Russian, Vietnamese, and Arabic than traditional tokenizers.

  4. Language Transfer: In a 3B parameter model experiment, T-FREE showed superior cross-lingual adaptability. It outperformed classic tokenizers on German benchmarks even before any German-specific training, then improved by a further 5% after just 20,000 steps of continued training on German data, while the traditional approach barely improved and suffered a larger drop in English performance.

T-FREE already outperforms the classic tokenizer on German at the baseline. Within 20k continued steps, T-FREE improves by 5% on average in 0- and 2-shot evaluation, while the classic tokenizer approach barely improves.

Implications and the future of T-FREE

T-FREE’s approach opens up new possibilities for LLM development:

  • More efficient use of parameters in large models

  • Improved low-resource model development

  • Faster training iterations due to increased micro-batch sizes

  • Potential for reduced hallucinations and more controlled decoding

While the current work focuses on models up to 3B parameters, future research could explore T-FREE’s performance in even larger models and diverse training datasets.

Innovations like T-FREE represent a significant step forward in tokenization techniques for LLMs. By addressing key limitations of traditional methods, it offers a path to more efficient, adaptable, and performant language models.

Adam-mini: Use Fewer Learning Rates To Gain More

Zhang et al. [CUHK, SRIBD, Duke University, Stanford University]

♥ 282   ML Optimizer

Adam-mini takes less memory and can reach higher throughput (# tokens per second).

Introduction to Adam-mini

In the ever-evolving landscape of machine learning, optimizers play a crucial role in training LLMs. The Adam optimizer has long been the go-to choice for effectiveness, but it comes with a significant memory overhead. This paper introduces a new approach that challenges the status quo and offers a compelling alternative.

The authors propose “Adam-mini”, an optimizer that achieves comparable or better performance than AdamW while using 45% to 50% less memory. This breakthrough is particularly significant for training LLMs, where memory constraints often limit model size and training efficiency.

An illustration of Adam-mini. Adam-mini assigns learning rates (lrs) by Hessian structure. It uses more lrs than SGD but fewer than Adam.

How does “Adam-mini” achieve this?

  1. Hessian Structure Exploration: The authors observed that the Hessian matrix of Transformers exhibits a near-block-diagonal structure. This insight led them to realize that using an individual learning rate for each parameter (as in Adam) might be overkill.

  2. Parameter Partitioning: Adam-mini “partitions” model parameters into “blocks” based on the smallest dense sub-blocks in the Hessian. For Transformers, this means partitioning Query and Key matrices by heads, while using default partitioning for other components.

  3. Efficient Learning Rate Assignment: Instead of maintaining a separate learning rate for each parameter, Adam-mini assigns a single learning rate to each parameter block. This rate is calculated from the average of Adam’s second-order momentum (v) within the block (see the sketch after this list).

  4. Special Handling for Embedding Layers: The authors found that embedding and output layers require special treatment. Adam-mini maintains individual learning rates for these layers to ensure training stability.
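The core idea can be sketched as a small modification of an Adam update. This is a simplified single-step illustration, not the authors’ implementation: every tensor is treated as one block, whereas real Adam-mini splits Query/Key matrices by attention head and keeps per-parameter learning rates for the embedding and output layers.

```python
import torch

def adam_mini_step(params, grads, state, lr=1e-3, betas=(0.9, 0.95), eps=1e-8):
    """One simplified Adam-mini-style step (illustrative, not the official code).

    params/grads: dicts mapping a block name to a tensor; each tensor here is
    treated as a single parameter block.
    """
    beta1, beta2 = betas
    for name, p in params.items():
        g = grads[name]
        st = state.setdefault(name, {"m": torch.zeros_like(p), "v": 0.0, "t": 0})
        st["t"] += 1
        st["m"].mul_(beta1).add_(g, alpha=1 - beta1)       # per-parameter first moment
        v_block = g.pow(2).mean().item()                   # ONE scalar v per block
        st["v"] = beta2 * st["v"] + (1 - beta2) * v_block
        m_hat = st["m"] / (1 - beta1 ** st["t"])
        v_hat = st["v"] / (1 - beta2 ** st["t"])
        p.add_(m_hat / (v_hat ** 0.5 + eps), alpha=-lr)    # single lr scale per block

# Toy usage with two "blocks"
params = {"attn.q_head0": torch.randn(64, 64), "mlp.w1": torch.randn(256, 64)}
grads = {k: torch.randn_like(v) for k, v in params.items()}
state = {}
adam_mini_step(params, grads, state)
```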

Benefits of Adam-mini

The introduction of Adam-mini has several great implications for the field:

  1. Cheaper LLM Training: It reduces the optimizer’s memory footprint by roughly 45% to 50%. The saving comes from the second-moment state: Adam keeps two full-sized buffers (m and v) for every parameter, and collapsing v to one scalar per block removes nearly half of that state.

  2. Improved Training Efficiency: The reduced memory footprint allows for larger batch sizes and less communication overhead between GPUs, leading to faster training times. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.

  3. Compatibility with Existing Methods: Adam-mini can be combined with other memory-efficient techniques like GaLore, Sophia, and LoRA, potentially leading to even greater memory savings and performance improvements.

Adam-mini performs on par with AdamW while using less memory, whereas other methods perform worse on these tasks. (c): Adam-mini appears insensitive to hyperparameter choices.

Limitations of Adam-mini

While Adam-mini shows great promise, it's worth noting that the current implementation might not be optimal for every scenario. The authors acknowledge that their learning rate design, while computationally efficient, may have room for improvement with more fine-grained analysis of Hessian sub-blocks.

Future work of Adam-mini

As the field of machine learning continues to push the boundaries of model size and complexity, innovations like “Adam-mini” play a crucial role in making large-scale training more accessible and efficient. This research not only offers a practical solution to memory constraints but also provides valuable insights into the nature of optimization in deep learning.
