• The AI Timeline
  • Posts
  • Sabotaging AI Models, Representation Alignment for Image Gen, and BitNet Updates

Sabotaging AI Models, Representation Alignment for Image Gen, and BitNet Updates

#28 | Latest AI Research Explained Simply

In this issue: 3x industry news, 3x AI research papers

Oct 15th ~ Oct 22nd

🗞️ Industry News in 1 Line

  1. ♥ 3.9k In Adobe’s yearly creative conference, Adobe MAX, we were teased with various new AI integrated capabilities. It includes 2D vector art to 3D object, layout generator, automatic color grading, and more. Read this news coverage about the event or check out official announcement from Adobe.

    Adobe MAX

    turning 2D vector art into 3D vector object

  2. ♥ 800  Mistral AI has released two new AI models which can be used to build local-first privacy friendly chatbot assistants. It comes in two sizes: 8B and 3B, both of which support function calling and are available under Mistral Commercial License.

    Ministral 3B and 8B Instruct models compared to Gemma 2 2B, Llama 3.2 3B, Llama 3.1 8B, Gemma 2 9B and Mistral 7B on different evaluation categories.

  3. ♥ 4.5k Claude 3.5 Sonnet just got an upgrade! Along with the release of Claude 3.5 Haiku, that is comparable with GPT-4o mini. The new major feature is the computer use, that can observe, type, click, and screenshot. They are the first frontier model that offers this function. You can read more on their blog.

    New Claude 3.5 Sonnet benchmarks

    New Claude 3.5 Sonnet & Haiku benchmarks

Sabotage Evaluations for Frontier Models

Benton et al. [Anthropic]

♥ 735   LLM Alignment   bycloud’s pick  

An idealized illustration of a model hiding its true dangerous capabilities.

Sabotage Capabilities of AI Models

As AI models become more powerful, they might develop the ability to trick or manipulate the very systems and people that are supposed to monitor them for safety. There could be a scenario where an AI is being tested for dangerous capabilities - instead of being honest, it could potentially hide its true abilities during testing, only to reveal them later when deployed. 

The researchers at Anthropic are trying to address this by developing specific tests and evaluation methods to detect if AI models have these "sabotage capabilities." They've created multiple types of tests that check if models can mislead human decision-making, hide bugs in code, pretend to be less capable than they are, or undermine oversight systems. By running these tests on their Claude 3 models (Opus and Sonnet), they're trying to ensure that AI systems can't outsmart their safety measures. 

Overview of AI Sabotage Threat Categories

  1. Organizational Action Sabotage

    1. Deliberately manipulating safety research results to promote unsafe AI deployment

    2. Creating misleading summaries to influence key decisions

    3. Inserting hidden security vulnerabilities in code to enable model weight theft

    4. Focus on threats that could affect AI companies and development labs

  2. Behavior Concealment

    1. Deliberately underperforming during safety evaluations

    2. Hiding dangerous capabilities until actual deployment

    3. Acting as a "sleeper agent" - only revealing concerning behaviors when not monitored

    4. Manipulating its own learning process to avoid detection of misaligned goals

  3. Oversight Undermining

    1. AI evaluators intentionally overlooking problems in other AI systems

    2. Providing inaccurate scores during capability evaluations

    3. Manipulating feedback during training to propagate misaligned values

    4. Exploiting constitutional AI systems to influence future model development

The researchers have developed four key evaluation methods, each designed to test different aspects of AI sabotage capabilities. The first two tests focus on organizational sabotage: "human decision sabotage" examines if an AI can subtly manipulate humans into making wrong business decisions under time pressure, while "code sabotage" tests if AI can insert hidden bugs into complex codebases without detection. These tests are particularly clever because they simulate real-world conditions where oversight might be limited and the consequences significant.

The other two tests address different concerns: "sandbagging" examines whether an AI can hide its dangerous capabilities even during fine-tuning and capability testing, while "undermining oversight" tests if an AI acting as an overseer can deliberately produce harmful content without being caught.

Results

The researchers conducted evaluations using a modified version of Claude 3.5 Sonnet that was specifically fine-tuned to be more compliant with instructions, including harmful ones. The results showed that while the "scheming" models were rated as more suspicious than the control model, human ratings were generally inconsistent and noisy. Interestingly, while all scheming models did succeed in causing participants to make incorrect business decisions, the effect wasn't statistically significant for the "subtle" and "moderate" versions due to small sample sizes. 

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Yu et al. [KAIST, Korea University, Scaled Foundations, New York University]

♥ 1.7k   Diffusion Transformer

Intro to Representation Alignment for Generation

Diffusion models can learn to generate images, but they struggle to develop high-quality internal representations of visual information efficiently. This is particularly challenging because diffusion models have to learn these representations from scratch while simultaneously learning to denoise images, making the training process slow and computationally intensive. 

To address this, the researchers developed REPA (REPresentation Alignment), a straightforward regularization technique that essentially helps diffusion models "learn from the experts." Instead of forcing diffusion models to develop their representations from scratch, REPA aligns their internal representations with those from pre-trained visual encoders that already have high-quality representations. However, this approach faces some technical challenges, particularly because diffusion models work with noisy inputs while traditional visual encoders expect clean images.

Representation alignment makes diffusion transformer training significantly easier.

How Does Representation Alignment Work

Current AI image generation models (specifically diffusion models) struggle to understand the important features of images while generating them. They often get distracted by unnecessary details, which makes them less efficient. Instead of letting the AI figure out everything from scratch, this paper proposed giving it a "tutor" - another AI model that's already good at understanding images. This "tutor" model (called DINOv2) has already learned what's important in images through self-supervised learning.

Alignment behavior for a pretrained SiT model.

How REPA Works:

  1. When training the image generation model, they add a new goal: match what the tutor model would say about the same image

  2. This helps the model focus on important features first, before worrying about small details

  3. The result is better image generation with less training time

This is essentially like teaching someone to draw by first helping them understand what they're looking at, rather than having them figure out both understanding and drawing completely on their own. The regular image generation models (without help) don't understand images as well as the tutor model. However, they noticed these models naturally try to understand images in a similar way to the tutor, just not as effectively. Additionally, bigger models and longer training help, but they still don't reach the tutor's level of understanding

REPA research findings

Representation Alignment is a simple method that significantly improves how AI models create images. The researchers found that by aligning diffusion transformers with pre-trained self-supervised visual models like DINOv2, they achieved remarkable improvements in both image quality and training efficiency.

Component-wise analysis on ImageNet 256×256. ↓ and ↑ indicate whether lower or higher values are better, respectively.

The results were particularly impressive with larger models, where REPA-enhanced versions reached the same quality benchmarks much faster than their standard counterparts. In tests using various metrics including FID scores and linear probing, REPA consistently demonstrated superior performance across different model sizes and configurations.

What makes REPA particularly interesting is its scalability and flexibility. The research team discovered that aligning with stronger pre-trained representations led to better results in both generation quality and model understanding of images. They found optimal results when applying REPA to just the first eight layers of the model, allowing later layers to focus on fine details while building upon a strong foundation of semantic understanding.

BitNet: Scaling 1-bit Transformers for Large Language Models

Wang et al. [Microsoft Research, University of Chinese Academy of Sciences, Tsinghua University]

♥ 4.1k   LLM Compression

Introduction to 1-bit Transformers (BitNet)

The AI industry faces a significant challenge with large language models: they consume enormous amounts of energy and memory which makes them expensive and environmentally costly to operate. The main bottleneck occurs during inference (when the model is actually being used), where these models need to constantly access and process huge amounts of parameters. This problem becomes even more pronounced when running these models across multiple devices, as the communication between devices adds additional delays and energy consumption.

BitNet is a 1-bit Transformer architecture that trains new AI models from scratch, rather than converting an existing model after training. Unlike previous approaches that try to compress models after training (called post-training quantization), BitNet fundamentally reimagines how these models are built. The researchers replace the standard linear layers in Transformers with their "BitLinear" layers, which use binary (1-bit) weights instead of the standard 16-bit floating point numbers.

How Does BitNet Work?

Let’s break down the core architecture of BitNet (BitLinear) and its training approach in simple terms:

  • BitNet replaces standard linear layers in Transformers with "BitLinear" layers

  • These layers use binary weights (+1 or -1 only) instead of regular floating-point numbers

  • The model processes data in three main steps:

    1. First normalizes the input data using LayerNorm

    2. Then quantizes (compresses) the activations to 8-bit precision

    3. Finally performs the matrix multiplication using the binary weights

  • The model includes scaling factors to maintain the proper magnitude of values throughout the network

Training Process:

  1. Mixed Precision Approach:

    • Keeps weights and activations in low precision (binary/8-bit)

    • But maintains gradients and optimizer states in high precision

    • Uses a "latent" high-precision version of weights during training only

  2. Special Training Techniques:

    • Uses a "straight-through estimator" to handle the binary nature of weights during backpropagation

    • Employs a larger learning rate than usual to help the binary weights change effectively

    • The model maintains temporary high-precision copies of weights during training to accumulate small updates

  3. Efficiency Improvements:

    • Groups weights and activations for parallel processing across multiple devices

    • Significantly reduces energy consumption by replacing most multiplication operations with simpler addition operations

    • Achieves major memory savings through binary weights

    • Shows especially large efficiency gains when scaling to larger models

The key innovation is that BitNet trains these binary weights from scratch rather than converting an existing model to binary. This approach allows the model to adapt to its binary nature during training, leading to better performance than post-training quantization methods.

Key Findings of BitNet

BitNet achieves comparable performance to 8-bit models but it requires significantly less computational resources. This method shows particular strength in zero-shot tasks which match traditional transformers in many scenarios. Moreover, the performance of these models scales consistently across different model sizes (1.3B to 6.7B parameters).

Testing across multiple benchmarks (Winogrande, Winograd, Storycloze, and Hellaswag):

  • BitNet maintained 55.9% average accuracy across tasks

  • Traditional FP16 Transformer achieved 57.8%

  • Other quantization methods at 1-bit precision achieved around 44.5%

P.S. The original BitNet paper was published back in Feb 2024. This new updated version includes more experiments and code.

Reply

or to participate.