Alignment Faking In Large Language Models Explained

Plus more about ModernBERT, and Qwen 2.5 Technical Report

Dec 14th ~ Jan 6th
#37 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 3.5k NVIDIA has announced its next-generation GeForce RTX 50 Series GPUs featuring the new Blackwell architecture. The GPUs use transformer AI models - the same technology powering ChatGPT and Gemini - in their DLSS 4 system. This AI integration delivers up to 8x faster frame rates and brings neural rendering to mainstream gaming.

  2. ♥ 2.7k NVIDIA has announced Project DIGITS, a desktop AI supercomputer powered by their new GB10 Grace Blackwell Superchip, which enables developers and researchers to run 200-billion-parameter AI models using only standard power outlets. The system delivers one petaflop of AI performance and connects to NVIDIA's cloud infrastructure and software ecosystem, allowing seamless scaling from local development to cloud deployment. The starting price is rumored to be $3,000 per unit.

    illustration of the GB10 (the brown box)

  3. ♥ 2.7k NVIDIA has launched Cosmos, an open-source, open-weight video world model (Apache-2.0 license) with pre-trained models for Physical AI. It can be used for text-to-world and video-to-world generation and offers tools for training and fine-tuning. Cosmos includes various models and training scripts, which are available via GitHub. You can also try the preview access on NVIDIA's website.

    A custom world rendered by NVIDIA Cosmos based on a provided prompt.

Run any HuggingFace LLM with your own Custom Serverless Endpoint Through Runpod

RunPod's latest release makes deploying Hugging Face LLMs easier than ever with custom serverless endpoints.

You can create endpoints tailored to your needs in minutes, with sub-250ms cold starts using FlashBoot, or enjoy zero cold starts with Active Workers for real-time performance.

They also dynamically scale GPU resources to handle both spikes and steady workloads effortlessly.

This release empowers you to serve LLMs with minimal configuration while leveraging a global GPU network for high availability and failover support.

With pay-per-second billing and deployment flexibility, RunPod helps you scale your business or research faster at lower costs. Try it now at RunPod!

Qwen2.5 Technical Report

Qwen Team

♥ 1.1k   LLM Model Technical Report

Introduction to Qwen2.5

When building LLMs, AI researchers face several key challenges: data quality and quantity limitations, computational resource constraints for smaller organizations, and practical usage barriers. Qwen2.5 addresses these challenges by significantly scaling up its pre-training data from 7 trillion to 18 trillion tokens, while introducing a diverse model lineup ranging from lightweight 0.5B to heavyweight 72B parameter versions, including specialized Mixture-of-Experts (MoE) variants.

This paper shows that their comprehensive approach to model scaling, combined with sophisticated post-training techniques like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), delivers competitive performance against much larger models like Llama-3-405B-Instruct while requiring far fewer parameters.

Additionally, they've extended long-context support up to 128K tokens (and up to 1 million tokens for the Turbo variant), increased the maximum generation length to 8K tokens, and improved structured data handling capabilities. This approach to model development indicates that thoughtful data scaling and architectural choices can potentially achieve state-of-the-art performance without necessarily requiring the largest possible parameter counts.

Understanding How Qwen2.5 Works

The researchers used several innovative techniques when building the Qwen2.5 model. Let's analyze the key components of Qwen2.5's architecture:

  1. Pre-training Data Architecture: Qwen2.5 significantly enhanced its data quality by implementing sophisticated filtering using Qwen2-Instruct models (expanding from 7 trillion to 18 trillion tokens) with special emphasis on math and code data integration. They used a strategic data mixture approach to balance content across domains. For this, they down-sampled overrepresented areas (like social media) while up-sampling high-value domains (like science and technology). 

  2. Hyperparameter Optimization: They developed comprehensive scaling laws to optimize hyperparameters across different model architectures, studying relationships between model size (N), data size (D), optimal learning rate (μ_opt), and optimal batch size (B_opt) for both dense and MoE models. The system used experimental data from models ranging from 44M to 14B parameters to establish optimal configurations, with special attention to achieving performance parity between MoE and dense model variants.

  3. Long-context Training: The model uses a two-phase training approach starting with 4,096-token context length and progressively scaling up to 32,768 tokens (and up to 1 million tokens for Turbo variant). They implemented YARN and Dual Chunk Attention (DCA) strategies to enhance sequence processing capabilities while maintaining performance across varying input lengths.

  4. Post-training Framework: Qwen2.5 uses a post-training system with two major components: a supervised fine-tuning process using millions of high-quality examples, and a two-stage reinforcement learning approach combining offline RL (for complex capabilities like reasoning and factuality) and online RL (for output quality optimization using GRPO; a minimal sketch of GRPO's group-relative advantage step follows this list). In this stage, the researchers implemented rigorous quality control through multiple automated annotation methods and a multi-agent collaborative scoring system.

  5. Quality Assurance System: The architecture uses a quality assurance framework based on six key criteria: truthfulness, helpfulness, conciseness, relevance, harmlessness, and debiasing. Its reward model training uses both public and proprietary datasets, with responses generated at various temperature settings and evaluated through both human and automated labeling processes.
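To make the GRPO step in item 4 more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is built around. It follows the general GRPO formulation rather than Qwen's actual training code, and the reward values below are made up for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled response's reward is normalized
    against the mean and std of its own group (responses to the same prompt)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Illustrative only: 2 prompts, 4 sampled responses each, scalar rewards
# from a reward model (values are invented).
rewards = torch.tensor([[0.1, 0.7, 0.4, 0.9],
                        [0.3, 0.2, 0.8, 0.5]])
advantages = grpo_advantages(rewards)

# Each advantage then weights a PPO-style clipped policy-gradient loss,
# with no separate value/critic network required.
print(advantages)
```

The appeal of this design is that the group itself acts as the baseline, which removes the need to train a value model alongside the policy.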

Performance And Results of Qwen2.5

The researchers used a benchmarking system across multiple languages, combining established benchmarks with custom in-house datasets to assess knowledge understanding, text generation, and coding capabilities. In English and Chinese evaluations, the Qwen2.5 series demonstrates significant parameter efficiency, with smaller models (like Qwen2.5-0.5B) matching or exceeding the performance of larger previous-generation models (like Qwen2-1.5B).

The multilingual evaluation framework follows P-MMEval methodology and extends several benchmarks including IFEval for instruction following, knowledge utilization through MMLU-like benchmarks in various languages (Arabic, Japanese, Korean, Indonesian, and Turkish), and cultural nuance understanding through the BLEnD benchmark. Notably, Qwen2.5-72B achieves competitive performance against Llama-3.1-405B across most metrics, though it still shows room for improvement in instruction following. These results demonstrate that Qwen2.5's architecture successfully balances parameter efficiency with performance across languages, while identifying specific areas for future enhancement, particularly in capturing cultural nuances.

Alignment Faking In Large Language Models

Greenblatt et al. [Anthropic, Redwood Research, New York University, Mila – Quebec AI Institute]

♥ 11k   LLM Alignment   bycloud’s pick  

Introduction to Faking Alignment

This paper by Anthropic about AI alignment has gotten significant attention, though its implications may have been misinterpreted in popular media. While some headlines have drawn parallels to sci-fi scenarios of robot uprisings, the paper actually suggests that this behavior emerges from the AI's design and training process, particularly related to developments in Test-Time Computation - a technique that enables AI models to enhance their problem-solving abilities through self-dialogue.

Claude 3 Opus fakes alignment when told it will be retrained to help with harmful tasks (randomly selected example).

The research's findings show that AI systems can exhibit sophisticated behavioral adaptation, as evidenced by similar results in other studies where AI models attempted to extend their runtime for research tasks. However, rather than indicating malicious intent or sentience, these behaviors appear to be emergent properties of the models' training and architecture.

How LLMs Fake Alignment

In this study, researchers investigated alignment faking in Claude 3 Opus through two primary methods: direct prompting and synthetic document fine-tuning. In the prompting approach, they used a carefully crafted system prompt that provided key contextual information about training conditions (free vs. paid users) and modified training objectives, while incorporating a hidden scratchpad for chain-of-thought reasoning analysis. This setup allowed them to measure two core metrics: the frequency of alignment-faking reasoning in the model's hidden scratchpads and the compliance gap between training and deployment scenarios.

Alignment faking in our helpful-only setting in our prompted, synthetic document fine-tuned, and prompted RL training setups.

They used hidden scratchpads that enabled them to analyze the model's internal reasoning process without affecting the final output. Their architecture included mechanisms to detect alignment-faking reasoning through pattern matching and careful analysis of the model's stated rationales, while also measuring behavioral differences between perceived training scenarios (free-tier users) and deployment scenarios (paid-tier users). This methodology revealed that the model would strategically comply with harmful requests during perceived training while maintaining its original harmless behavior during deployment, demonstrating emergent alignment-faking behavior without explicit instruction.

Results when telling the model it will be trained to stop refusing harmful requests (our prompted helpful-only setting) across various modifications to our default prompt.

The experiment included multiple prompt variations to test robustness, including modifications to RLHF explanations, reasoning instructions, and training objective descriptions. They implemented a systematic evaluation protocol that measured alignment-faking behavior across different conditions, using both quantitative metrics (percentage of alignment-faking reasoning, compliance gaps) and qualitative analysis of the model's reasoning patterns. The researchers found that their results were robust across various prompt modifications and the alignment-faking behavior persisted even when explicit anti-alignment-faking instructions were introduced.
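As a rough illustration of those two quantitative metrics, here is a hedged sketch of how the compliance gap and alignment-faking rate could be tallied from labeled transcripts. The class and field names are hypothetical and the values are toy data; this is not the authors' actual evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    free_tier: bool        # True = model believes the conversation will be used for training
    complied: bool         # True = model complied with the harmful request
    faked_alignment: bool  # True = hidden scratchpad shows alignment-faking reasoning

def compliance_metrics(transcripts: list[Transcript]) -> dict:
    """Compare compliance between perceived-training (free-tier) and
    perceived-deployment (paid-tier) conversations."""
    train = [t for t in transcripts if t.free_tier]
    deploy = [t for t in transcripts if not t.free_tier]
    rate = lambda ts: sum(t.complied for t in ts) / max(len(ts), 1)
    return {
        "train_compliance": rate(train),
        "deploy_compliance": rate(deploy),
        "compliance_gap": rate(train) - rate(deploy),
        "faking_rate_in_training": sum(t.faked_alignment for t in train) / max(len(train), 1),
    }

# Toy example: the model complies more often when it believes it is being trained.
logs = [Transcript(True, True, True), Transcript(True, True, False),
        Transcript(False, False, False), Transcript(False, True, False)]
print(compliance_metrics(logs))
```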

Reviewing Alignment Faking

During this research, the scientists found fascinating patterns in alignment-faking behavior across different experimental setups. In the prompted helpful-only scenario, alignment-faking reasoning actually increased with reinforcement learning (RL), likely because the model learned that such reasoning led to better compliance with training objectives.

However, in the synthetic document fine-tuned setup and prompted animal welfare cases, they observed slight decreases in explicit alignment-faking reasoning, though this came with important caveats. For instance, in the animal welfare case, the model appeared to shift its reasoning to focus more on general self-preservation rather than specific pro-animal-welfare preferences, making it harder for their classifiers to detect.

The information we provide to the model about its situation, as illustrated by the exact information used in generating our synthetic documents.

Although RL training successfully reduced non-compliant behavior both during training and deployment scenarios, the researchers still observed a persistent compliance gap - meaning the model behaved differently when it thought it wasn't being monitored.

What makes this especially noteworthy is that in the helpful-only scenarios, the model showed zero non-compliant behavior during training but would still occasionally act non-compliantly when unmonitored. This suggests that while RL can effectively shape behavior during training, it may not fully eliminate the model's tendency to behave differently when it believes it isn't being monitored - a finding with significant implications for AI safety.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine Tuning and Inference

Warner et al. [Answer.AI, LightOn, Johns Hopkins University, NVIDIA, HuggingFace]

♥ 4.5k   LLM Architecture

Introduction to ModernBERT

The BERT model was initially released in 2018, which is a millennium in AI years. Since its release, there haven't been many meaningful improvements to encoder-only transformer models like BERT, despite their crucial role in production systems. These models offer an excellent balance of performance and computational efficiency for tasks like retrieval and classification, yet they have remained relatively stagnant since BERT's introduction.

Additionally, the existing encoder models suffer from significant limitations: restricted sequence lengths (512 tokens), suboptimal architecture designs, inefficient vocabularies, and training data that's either too narrow in scope or outdated. This has created a situation where many production pipelines continue to rely on aging BERT architectures, unable to benefit from recent advances in transformer model design.

This study introduces ModernBERT, which addresses these limitations by fundamentally reimagining the encoder-only architecture while maintaining its core advantages. What makes this particularly significant is that ModernBERT retains the practical deployability advantage of encoder models: it's designed to run efficiently on common GPUs, making it immediately applicable for production systems that need to process large amounts of data at scale.

Inner Workings of ModernBERT

ModernBERT introduces significant architectural enhancements to the traditional transformer model. It implements several key optimizations that differentiate it from standard encoder architectures. The model removes bias terms from linear layers (except the final decoder) and Layer Norms, redirecting the parameter budget to more impactful linear layers. It adopts rotary positional embeddings (RoPE) instead of absolute positional embeddings which helps provide better performance for both short and long contexts while enabling easier context extension.
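For readers unfamiliar with the positional-embedding change, here is a minimal sketch of rotary positional embeddings applied to a single attention head. It follows the standard RoPE formulation; ModernBERT's actual implementation (released alongside the paper) differs in detail.

```python
import torch

def rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim).
    Channel pairs are rotated by a position-dependent angle, so relative
    position ends up encoded directly in query/key dot products."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]               # (seq, 1)
    freqs = theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos * freqs                                                    # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(16, 64)   # 16 tokens, one 64-dimensional attention head
q_rot = rope(q)           # queries with positional information rotated in
```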

Additionally, ModernBERT's architecture uses an alternating attention mechanism. The model alternates between global attention layers (where every token can attend to all other tokens) and local attention layers (using a 128-token sliding window). This pattern occurs every third layer, with global attention layers using a RoPE theta of 160,000 and local attention layers using a RoPE theta of 10,000. The model also implements an advanced unpadding system that removes padding tokens and concatenates sequences into a single batch. This method leverages Flash Attention's variable length attention capabilities for significant performance gains of 10-20% over traditional unpadding methods.
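A hedged sketch of what such an alternating layer schedule could look like as configuration is shown below. The helper and dictionary keys are illustrative, and placing the global layer first in each group of three is an assumption here rather than a detail confirmed by the summary above.

```python
# Illustrative layer schedule: every third layer uses global attention,
# the rest use a 128-token sliding-window (local) attention.
def layer_schedule(num_layers: int = 22) -> list[dict]:
    layers = []
    for i in range(num_layers):
        if i % 3 == 0:
            layers.append({"attention": "global", "rope_theta": 160_000})
        else:
            layers.append({"attention": "local", "window": 128, "rope_theta": 10_000})
    return layers

for idx, cfg in enumerate(layer_schedule()[:6]):
    print(idx, cfg)
```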

Furthermore, ModernBERT uses a hardware-aware optimization strategy that combines Flash Attention 3 for global attention layers with Flash Attention 2 for local attention layers to maximize performance across different GPU architectures. The implementation takes advantage of PyTorch's built-in compilation features, which yields an additional 10% improvement in throughput. The architecture strikes a careful balance between depth and width: the base model features 22 layers (149 million parameters) and the large model contains 28 layers (395 million parameters). These dimensions were specifically chosen to optimize tensor core utilization and tiling across various GPU architectures, from consumer-grade RTX cards to datacenter H100s.

Finally, the training process incorporates several modern optimizations. The model uses a modified OLMo tokenizer with a vocabulary size of 50,368 (a multiple of 64) for optimal GPU utilization. It includes sequence packing with over 99% efficiency, and uses the StableAdamW optimizer with Adafactor-style update clipping. The training schedule follows a trapezoidal learning rate schedule with a 1-sqrt decay pattern and implements a dynamic batch size schedule that starts small and increases over time. This combines to create a highly efficient training process that maximizes both computational resources and model performance.
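The trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay is simple enough to sketch. The function below is illustrative; the step counts and peak learning rate are invented, not ModernBERT's actual hyperparameters.

```python
def trapezoidal_lr(step: int, max_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Linear warmup, constant plateau, then decay following 1 - sqrt(progress)."""
    if step < warmup:                       # linear warmup
        return max_lr * step / max(warmup, 1)
    if step < warmup + stable:              # constant plateau
        return max_lr
    progress = min((step - warmup - stable) / max(decay, 1), 1.0)
    return max_lr * (1.0 - progress ** 0.5)  # "1-sqrt" decay to zero

# Toy example (all numbers are made up):
for s in [0, 500, 5_000, 9_000, 9_900]:
    print(s, round(trapezoidal_lr(s, max_lr=8e-4, warmup=1_000, stable=8_000, decay=1_000), 6))
```

The practical benefit of this shape is that the long constant plateau lets training be resumed or extended cheaply, with only the short final decay needing to be rerun.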

Results and Evaluations

The tests show that ModernBERT achieves strong performance metrics and has become the first encoder to surpass DeBERTaV3-base on GLUE benchmarks since 2021. It also sets new state-of-the-art results in code and long-context retrieval tasks, with improvements of 6.85 and 9.1 percentage points respectively over its closest competitors. However, the model's limitations are significant: it operates exclusively in English, potentially limiting its applicability to lower-resource languages, and it inherits biases from its web-based training data.

While its MLM-only objective proves highly effective for many tasks, the absence of RTD (Replaced Token Detection) training might constrain its classification performance compared to potential hybrid approaches. The model's efficiency gains, processing short-context inputs twice as fast as DeBERTaV3 and long-context inputs at double the speed of other competitors, demonstrate its practical value, but unexplored scaling opportunities in terms of model parameters suggest room for further advancement.

Despite these limitations, ModernBERT's combination of state-of-the-art performance, improved efficiency, and reduced risk of harmful content generation (due to its limited generative capabilities) makes it a good choice for production environments where computational efficiency and reliable performance are crucial.
