LLM Interpretability & The Multimodal Future of AI

The AI Timeline #7

In this issue: x3 industry news, x3 AI research papers

May 20th ~ May 26th

🗞️ Industry News in 1 Line

  1. ♥ 394 At Microsoft Build 2024, Microsoft introduced new models to the Phi-3 family, including Phi-3-vision, a new open-source multimodal model, alongside Phi-3-mini, Phi-3-small, and Phi-3-medium. These are available through Azure AI or can be downloaded locally from HuggingFace.

  2. ♥ 9.2k xAI just raised $6 billion at a pre-money valuation of $18B. Its Series B funding round secured key investors including Valor Equity Partners, Vy Capital, Andreessen Horowitz, Sequoia Capital, Fidelity Management & Research Company, and Prince Alwaleed Bin Talal and Kingdom Holding, among others.

  3. ♥ 732 Mistral just rolled out Mistral 7B v0.3 in base and instruct editions. The base version comes with an extended vocabulary of 32,768 tokens, while the instruct version adds a standout feature: support for function calling! This allows complex tasks to be broken down into smaller, modular components that can be reused across a variety of tasks.

1. The Platonic Representation Hypothesis

Huh et al. [MIT]

♥ 1.3k   LLM Interpretability

Images (X) and text (Y) are projections of a common underlying reality (Z). The authors conjecture that representation learning algorithms will converge on a shared representation of Z, and that scaling model size, data, and task diversity drives this convergence.

Introduction to Platonic Representation Hypothesis

Different AI models, even when trained on different tasks and data, are starting to represent information in very similar ways. This paper argues that these models are converging towards a common, ideal way of understanding the world, which the authors call the “Platonic representation.”

What is Platonic Representation Hypothesis?

Specifically, neural networks trained with different objectives on different data and modalities are converging to a shared statistical model of reality in their representation spaces.

This hypothesis is inspired by Plato's concept of an ideal reality: as models become more sophisticated and are exposed to more varied data, their internal representations align more closely with each other, reflecting a shared internal depiction of the real world.

An image of people inside a cave.

Plato’s cave – the training data for our algorithms are shadows on the cave wall, and models are recovering ever better representations of the actual world outside the cave.

The paper empirically showed that as large language models (LLMs) and vision models grow in capability, their learned representations become increasingly similar. This trend suggests that powerful models, regardless of their specific domain (language or vision), tend to develop analogous internal structures.

Graph showing how the performance of different language models improves over time.

There are multiple factors which might be causing this convergence:

  1. One major factor is the need for AI models to perform well across a wide variety of tasks. As models are trained to handle more tasks, they tend to develop similar ways of representing data, because there are fewer effective ways to cover all of these tasks successfully. This is known as the “Multitask Scaling Hypothesis”: models trained on an increasing number of tasks are pressured to learn a representation that can solve all of them.

  2. Another factor is the increasing scale of training data and model parameters. As models become larger and are trained on more diverse data, an “Anna Karenina scenario” emerges in which all well-performing neural networks represent the world in the same way. The paper observes that all strong models are alike, while each weak model is weak in its own way.

Venn diagram showing that different AI models can solve the same tasks.

Multitask Scaling Hypothesis visualized

Platonic Representation Hypothesis Implications

  1. When models converge, it is easier to switch between different data types (modalities). Converged models are prime candidates for transfer learning, where a model trained on one task is fine-tuned for another, often related, task. For example, a model pretrained on text data (e.g., BERT) can, once converged, be adapted to a variety of tasks such as question answering, image captioning, and multimodal sentiment analysis (see the sketch after this list).

  2. When a model converges, it has developed stable and robust representations of the input data – bigger models might reduce errors and biases if trained on diverse, high-quality data.
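
To make the transfer-learning point in item 1 concrete, here is a minimal sketch (not from the paper) of adapting a pretrained text encoder to a new task with the Hugging Face transformers library; the model name, label count, and layer-freezing strategy are illustrative assumptions.

```python
# Minimal transfer-learning sketch (illustrative, not from the paper):
# reuse a converged pretrained encoder and train only a small task head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new classification head on top
)

# Freeze the pretrained encoder; its converged representations are reused as-is.
for param in model.bert.parameters():
    param.requires_grad = False

inputs = tokenizer("A sentence from the downstream task.", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**inputs, labels=labels).loss
loss.backward()  # gradients only reach the task-specific classification head
```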

A 3d scatter plot showing how much information can be captured by an AI model.

Kernels visualized with multidimensional scaling (i.e. a visualization where nearby points are similar according to the kernel, and far apart points are dissimilar).
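
To quantify how similar two models' kernels are, one can compare them directly on the same inputs. Below is a minimal sketch using linear CKA (centered kernel alignment) as an easy-to-implement proxy; note that the paper's own experiments use a mutual nearest-neighbor alignment metric, and the feature arrays here are random placeholders.

```python
# Minimal sketch: compare two models' representation kernels with linear CKA.
# (Illustrative proxy only; the paper uses a mutual nearest-neighbor metric.)
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X, Y: (n_samples, dim) activations from two models on the same inputs."""
    X = X - X.mean(axis=0)  # center features so the Gram matrices are centered
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Placeholder activations for 1,000 paired examples; a score near 1 means the
# two kernels (and hence the two representations) are highly aligned.
feats_a = np.random.randn(1000, 768)
feats_b = np.random.randn(1000, 512)
print(linear_cka(feats_a, feats_b))
```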

Limitations of the Platonic Representation Hypothesis

  1. Different Modalities Contain Unique Information: Information unique to each modality may not converge to a single representation. For example, you cannot create a sound that describes a color.

  2. Limited Convergence Across All Domains: Convergence has primarily been observed in vision and language, not across all modalities.

  3. Special-Purpose Intelligences: Models designed for specific tasks may not agree with general-purpose models on a shared representation. For example, a model trained on botany data might classify a tomato as a fruit, while general-purpose models might disagree.

2. (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

Wu et al. [Monash University, University of Macau, Tencent AI Lab]

♥ 744   LLM

They demonstrated that the translations produced by their TRANSAGENTS framework are preferred by humans over those from conventional MT systems.

Introduction to TRANSAGENTS

In the past few years, we have seen many advancements in machine translation (MT) that have improved performance across various domains. However, translating literary texts, i.e. texts containing poetry, dramatic references, humor, etc., remains challenging due to their complex language and cultural nuances. This paper introduces TRANSAGENTS, a new multi-agent framework that uses large language models to better handle these complexities.

How Does TRANSAGENTS Work?

TRANSAGENTS simulates a virtual company by using multi-agent collaboration strategies to deliver high-quality literary translations. Its structured approach, diverse team composition, and evaluation strategies provide higher accuracy, cultural appropriateness, and reader satisfaction in translated works.

TRANSAGENTS Company Overview

TRANSAGENTS comprises various roles, including a CEO, senior editors, junior editors, translators, localization specialists, and proofreaders.

  1. The CEO oversees the entire translation process and selects a Senior Editor to lead each project.

  2. Senior Editors set editorial standards, guide junior editors, and ensure that translations align with company objectives.

  3. Junior Editors assist senior editors in daily workflow management, content editing, and communication with other team members.

  4. Translators convert written material from one language to another while preserving tone, style, and context.

  5. Localization Specialists adapt content for specific regions or markets, adjusting cultural references and idioms.

  6. Proofreaders perform final checks for grammar, spelling, punctuation, and formatting errors before publication.

A cartoon showing different AI models pretending to be part of an organization.

TRANSAGENTS Collaboration Strategies

  1. Addition-by-Subtraction Collaboration: Two agents collaborate to refine translations – one agent adds relevant information, while the other removes redundant details.

  2. Trilateral Collaboration: Three agents collaborate to generate and evaluate translations – one generates a response, another provides critiques, and a third evaluates the quality of the response.
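
A minimal sketch of how the Trilateral Collaboration loop could be wired up is shown below; the call_llm helper, role prompts, and stopping rule are hypothetical stand-ins rather than the paper's actual implementation.

```python
# Hypothetical sketch of Trilateral Collaboration: one agent generates,
# one critiques, and one evaluates whether the draft is good enough.
def call_llm(role: str, prompt: str) -> str:
    """Stand-in for a chat-completion call with a role-specific system prompt."""
    raise NotImplementedError("plug in your LLM client here")

def trilateral_translate(source_text: str, max_rounds: int = 3) -> str:
    draft = call_llm("Translator", f"Translate into English:\n{source_text}")
    for _ in range(max_rounds):
        critique = call_llm("Junior Editor", f"Critique this translation:\n{draft}")
        verdict = call_llm(
            "Senior Editor",
            f"Source:\n{source_text}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n"
            "Reply ACCEPT if the draft is publishable, otherwise REVISE.",
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            break  # the evaluator is satisfied; stop iterating
        draft = call_llm(
            "Translator",
            f"Revise the draft using this critique:\n{critique}\n\nDraft:\n{draft}",
        )
    return draft
```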

Translation Workflow

  1. Preparation: 

    1. Project Members Selection: The CEO selects a Senior Editor based on client requirements and assembles the project team.

    2. Self-reflection: A "ghost agent" prompts the CEO to reconsider decisions to ensure optimal team selection.

    3. Translation Guideline Documentation: Guidelines are established for maintaining consistency throughout the translation process, covering aspects like glossary, book summary, tone, style, and target audience.

  2. Execution:

    1. Translation, Localization, and Proofreading: The translation process involves collaboration between translators, junior editors, and senior editors to ensure accurate and culturally appropriate translations.

    2. Cultural Adaptation: Localization specialists tailor translations to fit the cultural context of the target audience.

    3. Final Review: Senior editors evaluate the quality of translations and ensure narrative consistency between chapters.

Results for TRANSAGENTS

The results of the paper show that:

  1. Human evaluators marginally prefer translations produced by TRANSAGENTS over other tested methods in monolingual evaluations. In bilingual evaluations, GPT-4-0125-PREVIEW also prefers translations from TRANSAGENTS over other models.

  2. Despite lower d-BLEU (document-level BLEU) scores, TRANSAGENTS exhibits strengths in translating genres requiring extensive domain-specific knowledge, while underperforming in contemporary domains (see the sketch after this list).

  3. Linguistic diversity metrics show that TRANSAGENTS improves diversity in translations, particularly with the help of translation guidelines and the localization step.

  4. Cost analysis reveals that using TRANSAGENTS for translation can result in significantly lower costs compared to employing professional human translators, making it a cost-effective solution for literary translation.
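
For context, d-BLEU is document-level BLEU: each document's sentences are concatenated and scored as a single unit. A minimal sketch using the sacrebleu library is shown below; the document pairs are placeholders.

```python
# Minimal d-BLEU sketch: concatenate each document's sentences and score the
# whole document as one segment with sacrebleu. The texts are placeholders.
from sacrebleu.metrics import BLEU

system_docs = [["First translated sentence.", "Second translated sentence."]]
reference_docs = [["First reference sentence.", "Second reference sentence."]]

sys_concat = [" ".join(doc) for doc in system_docs]
ref_concat = [" ".join(doc) for doc in reference_docs]

bleu = BLEU()
print(bleu.corpus_score(sys_concat, [ref_concat]).score)
```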

Graph showing that people prefer text generated by the AI model.

3. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Anthropic

♥ 2.3k   LLM Interpretability

Introduction to Monosemanticity

Imagine a neural network like a brain where each neuron (or group of neurons) has a specific role. In a monosemantic model, certain neurons would only respond to one specific idea. For example, one neuron might always activate when the network encounters text about the Golden Gate Bridge. This neuron would ignore other topics, like cooking or politics, which makes its behavior predictable and understandable.

The problem right now is that most LLM neurons are uninterpretable, i.e. we don’t know why they do what they do. So, researchers at Anthropic used a special kind of neural network called a sparse autoencoder, which is designed to produce clear and distinct features by encouraging the network to use only a small number of neurons to represent any given input. This sparsity makes it easier to identify which neurons are associated with which concepts.

How Does Monosemanticity Scale with Model Size?

The researchers argue that neural networks use almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions (the superposition hypothesis), and they test this by using dictionary learning, specifically sparse autoencoders (SAEs), which have shown promise previously. The goal of this paper is to decompose model activations into interpretable components, i.e. to understand which directions in activation space are responsible for which outputs.

Concept relations map for the immunology feature.

The SAE used in this paper consists of an encoder and a decoder – the encoder maps activations to a higher-dimensional space using a linear transformation followed by a ReLU nonlinearity, and the decoder reconstructs the model activations from these high-dimensional features. The SAE is trained to minimize reconstruction error, with an L1 penalty enforcing sparsity.

The decomposition process involves normalizing the activations and representing them as a weighted sum of feature directions. The sparsity penalty ensures that only a small fraction of features are active for any given input, making the model's behavior more interpretable.
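
A minimal PyTorch sketch of this setup is shown below; the dimensions and L1 coefficient are illustrative assumptions rather than Anthropic's actual configuration.

```python
# Minimal sparse autoencoder (SAE) sketch: linear encoder + ReLU into an
# overcomplete feature space, linear decoder back to the activation space,
# trained with reconstruction error plus an L1 sparsity penalty.
# Dimensions and coefficients are illustrative, not Anthropic's settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # maps up to feature space
        self.decoder = nn.Linear(d_features, d_model)  # reconstructs activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=4096, d_features=65536)
acts = torch.randn(8, 4096)  # stand-in for a batch of model activations
recon, feats = sae(acts)

l1_coefficient = 5e-3  # strength of the sparsity pressure (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coefficient * feats.abs().sum(dim=-1).mean()
loss.backward()
```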

Dictionaries with low training loss produced interpretable features and improved metrics like the L0 norm as well as the number of dead features.

Graph showing how neurons activate when relevant words are mentioned.

All the words related to “transit infrastructure” activate the feature and are therefore highlighted. Words in the left box are loosely related, so they are mildly highlighted, while words on the right are strongly related, so they are highlighted in a darker color.

Results

The researchers succeeded and discovered features corresponding to a wide range of topics, such as famous landmarks, brain sciences, tourist attractions, and transit infrastructure. These features were not only specific but also abstract, meaning they generalize across different languages and modalities (text and images).

For example, the “Golden Gate Bridge” feature fires for descriptions and images of the bridge. When the researchers force this feature to fire more strongly, Claude mentions the bridge in almost all of its answers, which implies the model can be steered into treating everything as related to the bridge.
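
A rough sketch of this kind of feature steering, reusing the SparseAutoencoder from the sketch above, is given below; the feature index and clamp value are hypothetical.

```python
# Rough feature-steering sketch (hypothetical index and scale): clamp a single
# SAE feature to a high value and decode back into model activation space.
import torch

@torch.no_grad()
def steer(activations: torch.Tensor, sae, feature_idx: int, clamp_value: float = 10.0):
    _, features = sae(activations)            # encode activations into SAE features
    features[..., feature_idx] = clamp_value  # force the chosen feature to fire strongly
    return sae.decoder(features)              # steered activations handed back to the model

# e.g. steered = steer(acts, sae, feature_idx=31337)  # a "Golden Gate Bridge"-style feature
```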

The AI model strongly relates images of the Golden Gate Bridge with its description.

Currently, transformer models often operate as black boxes, which means their decision-making processes are hidden within complex layers of computation. By using techniques like sparse autoencoders, we can begin to get an insight into these processes, making the inner workings of these models more transparent.
