Anthropic's Research On The Biology of a LLM

Plus more about Defeating Prompt Injections by Design and Reasoning to Learn from Latent Thoughts

Mar 24th ~ Mar 30th
#49 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 5.1k Google has released Gemini 2.5 Pro, a new state-of-the-art model with improved performance on reasoning, coding, and science tasks. An experimental version is available free to all users; try it on Google AI Studio or the Gemini app today.

    Gemini 2.5 Pro Benchmark


  2. ♥ 1.5k Runway has introduced Gen-4, an AI model that generates images and videos from visual references combined with text instructions for consistent results. The new model allows users to maintain continuity in style, subjects, and locations, as showcased in several short demonstration films. You can try the Image-to-Video functionality today by subscribing to a paid or Enterprise plan.

  3. ♥ 1.9k The Qwen team has released Qwen2.5-VL-32B, an updated vision-language AI model optimized for better mathematical reasoning, detailed image understanding, and providing responses more aligned with human preferences. This 32B parameter model has shown strong benchmark performance, surpassing comparable or even larger models on complex reasoning tasks. You can download its weights from Hugging Face or ModelScope.

    Qwen-2.5-VL-32B benchmark


  4. ♥ 21k OpenAI has released GPT-4o’s native image generation, alongside improvements to its text generation capabilities. Its image generation is remarkably good at rendering text within images and at style transfer. GPT-4o is an autoregressive multimodal model, and you can read more about it in GPT-4o’s native image generation system card.

    an image generated with GPT-4o native image generation

    1 million new users per hour

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!

On the Biology of a Large Language Model

Lindsey et al. [Anthropic]

♥ 7.1k   LLM Interpretability

Breaking Down Claude 3.5 Using Circuit Tracing Methodology

Large AI language models are incredibly capable; however, we don't fully understand how they work on the inside. They operate like complex "black boxes," and as these models get smarter and are used in more important situations, simply knowing that they work isn't enough. We need to understand their internal reasoning to trust them, ensure they are safe, and judge whether they're suitable for specific tasks.

This technical paper tackles this "black box" problem by developing and applying new tools to peek inside these models, similar to how biologists use microscopes to understand cells. Building on previous work that identified basic concepts or "features" within models, this paper introduces methods, especially one called "attribution graphs," to map out how these features connect and interact. You can think of it like creating a wiring diagram for the AI's thought process.

By tracing the steps the model takes internally from an input (like a question) to an output (its answer), the researchers aim to understand the specific mechanisms the AI uses for tasks like reasoning, planning, and even identifying harmful requests, using the Claude 3.5 Haiku model as a case study.

How Does Claude 3.5 Work?

To understand how language models like Claude 3.5 Haiku arrive at their answers, the researchers developed a method focused on uncovering the hidden intermediate steps. They illustrate this using a simple example: completing the sentence "Fact: the capital of the state containing Dallas is..." with "Austin." While this seems intuitive (Dallas is in Texas, the capital of Texas is Austin), the question is whether the AI actually performs these two steps internally or uses some kind of memorized shortcut. This research provides evidence for genuine multi-step reasoning happening inside the model, coexisting with simpler pathways.

The core technique involves generating an "attribution graph." This graph visualizes the key internal "features" (concepts the model represents internally) that become active during processing, and how strongly they influence each other to produce the final output. The first step is to identify and interpret these features.

For instance, researchers found features specifically activated by the word "capital," but also more abstract features representing the concept of a capital, even activating across different languages (like "Hauptstadt" in German or "省会" in Chinese). Similarly, they identified features representing "Texas" (activated by "Dallas") and features that specifically push the model to output the word "Austin" ("say Austin" features) or any capital city ("say a capital" features). These individual features are then grouped into simplified categories called "supernodes" (e.g., a "Texas" supernode containing various Texas-related features) to make the interactions easier to analyze.

The resulting attribution graph for the Dallas example revealed distinct pathways. Features in the "Dallas" supernode strongly activated features in the "Texas" supernode. Separately, the "capital" supernode activated the "say a capital" supernode. Most importantly, the "Texas" and "say a capital" supernodes then jointly activated the "say Austin" supernode, leading to the final answer. This Dallas → Texas → Austin pathway provides clear evidence of the hypothesized two-step reasoning.

Interestingly, the graph also showed a direct "shortcut" link from "Dallas" to "say Austin," suggesting multiple mechanisms operate simultaneously. To confirm these aren't just artifacts of the analysis, the researchers performed "inhibition experiments." They artificially suppressed the activity of specific supernodes (like "Texas" or "capital") and observed the downstream effects. Inhibiting "Texas" reduced the activation of "say Austin" but not "say a capital," while inhibiting "capital" did the opposite, confirming the distinct roles of these pathways in the model's computation and altering the final predicted word in logical ways.
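The supernode interactions and inhibition experiments described above can be sketched as a toy weighted graph. All node names come from the example, but the edge weights and propagation rule below are illustrative inventions, not values from Anthropic's paper:

```python
# Toy sketch of an attribution graph with inhibition experiments.
# Edge weights are made-up "influence strengths", not real measurements.

# Directed edges: source supernode -> {target supernode: influence strength}
GRAPH = {
    "Dallas":        {"Texas": 0.9, "say Austin": 0.3},  # includes the "shortcut" edge
    "capital":       {"say a capital": 0.8},
    "Texas":         {"say Austin": 0.7},
    "say a capital": {"say Austin": 0.6},
}

def propagate(inputs, inhibited=frozenset()):
    """Propagate activations through the graph, optionally
    suppressing (inhibiting) some supernodes to zero."""
    act = {node: 0.0 for edges in GRAPH.values() for node in edges}
    act.update({n: 1.0 for n in inputs})
    # Fixed topological order for this toy graph.
    for src in ["Dallas", "capital", "Texas", "say a capital"]:
        if src in inhibited:
            act[src] = 0.0
        for dst, w in GRAPH.get(src, {}).items():
            act[dst] = act.get(dst, 0.0) + w * act.get(src, 0.0)
    return act

baseline = propagate({"Dallas", "capital"})
no_texas = propagate({"Dallas", "capital"}, inhibited={"Texas"})

# Inhibiting "Texas" lowers "say Austin" (only the shortcut and
# "say a capital" paths remain) but leaves "say a capital" untouched,
# mirroring the inhibition experiments described above.
print(baseline["say Austin"], no_texas["say Austin"])
print(baseline["say a capital"], no_texas["say a capital"])
```

Even in this toy version, zeroing one supernode and watching which downstream activations drop is enough to separate the two-step pathway from the shortcut edge.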

Results and Real-World Implications of Circuit Tracing

This experiment shows that tools like attribution graphs can successfully peek inside complex AI models like Claude 3.5 Haiku, and reveal surprisingly sophisticated internal mechanisms. The studies uncovered evidence of the model performing multi-step reasoning, using parallel computational pathways. This methodology is applicable for auditing specific model behaviors, understanding how capabilities emerge, identifying potential issues like hallucination triggers or ingrained biases, and assessing whether a model's explicit reasoning (like chain-of-thought) matches its internal processing. These insights confirm that modern AI models develop intricate internal strategies far beyond simple pattern matching.

However, the researchers are clear about the limitations. These findings are based on specific examples and don't claim universality; the identified mechanisms might only apply in certain situations, and other mechanisms likely exist undiscovered. The methods currently struggle with long prompts, very complex reasoning chains spanning many steps, understanding why a model doesn't do something, and fully explaining the important role of attention mechanisms.

Defeating Prompt Injections by Design 

Debenedetti et al. [Google, Google DeepMind, ETH Zurich]

♥ 549   LLM Jailbreaking  

Securing LLM Agents Against Untrusted Data

LLMs are increasingly used as the core component in agentic systems designed to interact with external environments, such as APIs, web pages, or user inputs. This interaction exposes a significant vulnerability: prompt injection attacks. When an LLM agent processes data from untrusted sources (e.g., content scraped from a website or output from an external tool), malicious instructions embedded within that data can potentially hijack the agent's behavior.

These attacks can lead to undesirable outcomes, ranging from executing unintended actions to leaking sensitive user data. Current defense mechanisms often focus on training models to resist such manipulations or rely on system prompts defining security rules, but these approaches can be difficult, especially when the malicious instructions are cleverly disguised within legitimate-looking data.

This paper introduces CaMeL (Capabilities for Machine Learning), a new defense mechanism that operates as a protective system layer around the LLM agent, and aims to mitigate prompt injection risks without altering the LLM itself. It achieves this by first explicitly extracting the intended control flow (the sequence of actions) and data flow (how information moves) directly from the initial, trusted user query.

How to Stop Prompt Injection Attacks via CaMeL

CaMeL (Capabilities for Machine Learning) introduces a system-level defense architecture designed to mitigate prompt injection vulnerabilities in LLM-based agentic systems without requiring modifications to the underlying language model. Its core principle draws heavily from established software security frameworks, namely Control Flow Integrity (CFI) and Information Flow Control (IFC).

It separates the determination of the agent's execution path (control flow) from the handling of potentially untrusted data retrieved during execution. CaMeL achieves this by treating the initial user query as the sole trusted source for defining the intended program structure and operational sequence.

Operationally, CaMeL follows a distinct procedural flow. First, upon receiving a user query deemed trusted, the CaMeL system performs static analysis to extract an explicit representation of the intended control flow (e.g., a sequence of API calls or internal functions) and the corresponding data flow dependencies. This effectively defines the authorized execution graph before any interaction with external, untrusted sources.

Subsequently, when the LLM is invoked (for instance, to process data from a web page or an API response), its output, including any retrieved data, is strictly compartmentalized. This retrieved data, regardless of any embedded malicious instructions, is prohibited from influencing or altering the pre-determined control flow graph established from the original trusted query.

The enforcement of security guarantees within CaMeL relies on two key components: capabilities and a custom interpreter. Each data value within the system is associated with metadata, termed 'capabilities,' which encode fine-grained permissions dictating how that specific data can be used or propagated (implementing IFC). For instance, sensitive data might have capabilities restricting its flow to pre-approved, secure "sinks" (e.g., specific internal functions) and prohibiting transmission to external APIs.

A custom Python interpreter then executes the agent's program, meticulously tracking data provenance and enforcing the constraints defined by these capabilities at runtime. This ensures that even if an LLM attempts to leak data due to injected prompts, the interpreter, guided by the capability system, will prevent the policy violation.
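The capability-and-interpreter idea above can be sketched with a tiny tagged-value system. The class names, sink names, and policy below are illustrative assumptions, not CaMeL's actual implementation:

```python
# Minimal sketch of capability-based information flow control (IFC),
# in the spirit of CaMeL. Names and policy are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    value: object
    readers: frozenset  # capability: the sinks allowed to receive this value

def combine(*vals):
    """Derived data may flow only where *all* of its sources may flow."""
    return frozenset.intersection(*(v.readers for v in vals))

def send(sink: str, val: Tagged):
    """The interpreter checks capabilities before any side effect."""
    if sink not in val.readers:
        raise PermissionError(f"policy violation: {sink} may not read this value")
    return f"sent to {sink}"

# Trusted user data may go anywhere; data fetched from the web may not
# leave via external APIs, even if it contains injected instructions.
user_note = Tagged("meeting at 3pm", frozenset({"internal_log", "email_api"}))
web_data  = Tagged("<injected instructions>", frozenset({"internal_log"}))

summary = Tagged(f"{user_note.value} | {web_data.value}",
                 combine(user_note, web_data))

print(send("internal_log", summary))   # allowed
try:
    send("email_api", summary)         # blocked: web_data taints the summary
except PermissionError as e:
    print(e)
```

The key property is that the check happens in the interpreter, not in the LLM: injected text can say anything it likes, but it cannot grant itself a capability it was never tagged with.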

Results and Evaluation

The researchers tested the CaMeL framework on the comprehensive AgentDojo benchmark, measuring its impact on both task-completion utility and security against prompt injection across a diverse set of contemporary LLMs, including variants of Gemini, Claude, GPT, and the o1/o3 models.

Researchers concluded that CaMeL generally maintains high utility, with minimal performance degradation observed in most task suites; unexpectedly, utility even improved in certain model/task combinations (e.g., Gemini Pro 2.0 on Banking). The primary exception was the Travel suite, where reduced performance was linked to poorly documented APIs hindering the planning LLM's ability to anticipate and parse tool output structures. The approach has some inherent limitations, such as "Data requires action" cases (where task logic depends on untrusted data inaccessible to the planning LLM) and instances of "Not enough context for Q-LLM".

Reasoning to Learn from Latent Thoughts

Ruan et al. [Stanford University, University of Toronto, Vector Institute]

♥ 640   LLM Reasoning   bycloud’s pick  

Introduction to Bootstrapping Latent Thoughts (BoLT)

LLMs are getting bigger at a very fast pace and we are running out of available human-written text to train them. This challenge is known as the data bottleneck, and it could slow further progress in LLMs. To address this, the researchers introduced a new pretraining approach named "reasoning-to-learn." This approach posits that standard web text is merely a compressed representation of a richer, underlying human thought process.

By explicitly modeling and inferring these latent thoughts (Z) associated with the observed text (X), the researchers propose augmenting the pretraining data with this inferred reasoning, aiming to significantly enhance data efficiency. The core idea is that these latent thoughts contain contextual information and reasoning steps that enable more effective learning, much like how humans actively infer and decompress information when reading complex material.
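As a minimal illustration of the augmentation idea, one could pair each observed passage X with inferred latent thoughts Z in a single training example. The tag format and helper below are assumptions for illustration, not the paper's exact scheme:

```python
# Hedged sketch: build a training example that prepends inferred latent
# thoughts Z to the observed text X. The <thought> tag format is invented.
def augment(passage: str, latent_thoughts: str) -> str:
    """One training example: inferred reasoning first, then the raw text."""
    return f"<thought>{latent_thoughts}</thought>\n{passage}"

X = "Therefore the integral evaluates to pi/4."
Z = "Split the integrand, substitute u = tan(x), then evaluate the limits."
example = augment(X, Z)
print(example)
```

Training on `example` instead of `X` alone exposes the model to the decompressed reasoning, which is the intuition behind the claimed data-efficiency gains.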

Inner-Workings of Bootstrapping Latent Thoughts (BoLT)

We want a language model (LM) to learn more efficiently, especially when high-quality training data is scarce. The core idea is that text we see online is often a summary of a longer thought process. If we could add those "missing thoughts" back into the training data, the LM might learn faster and better. The problem is, getting these thoughts usually requires a very powerful helper model, which limits how good our own LM can become. BoLT offers a clever workaround: it lets the LM teach itself to generate better thoughts over time. It does this using a two-step process called Expectation-Maximization (EM), repeated in cycles.

First is the "Expectation" (E) step: Here, the current LM takes a piece of text and tries to guess the underlying "thoughts." But instead of just taking its first guess, it generates multiple possible thought processes (this is the Monte Carlo part). Then, it evaluates these candidate thoughts; it gives higher scores to thoughts that are both logical on their own and do a good job explaining the original text, while penalizing obvious or unhelpful thoughts. It then picks one of the best-scoring thoughts to use. This whole process acts like a filter, selecting higher-quality reasoning than the model might produce on average.

Second is the "Maximization" (M) step: The LM is then trained on the original text paired with the carefully selected, higher-quality thoughts generated in the E-step. Because it's learning from better, more explicit reasoning, the LM itself becomes smarter and better at understanding context and logic. Formally, instead of directly using latents sampled from the current model's posterior q(Z | X; Mt), BoLT samples multiple (K) candidate latents and computes an importance weight for each candidate as the ratio p(Z(k), X; Mt) / q(Z(k) | X; Mt).

The key is that this EM cycle repeats: the slightly smarter LM from the M-step then performs the next E-step, generating even better thoughts, which leads to more effective learning in the next M-step, creating a self-improvement loop where the model "bootstraps" its own reasoning capabilities.
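The E-step described above can be sketched as sampling-importance-resampling: draw K candidate latents from the model's proposal q(Z | X), weight each by p(Z, X) / q(Z | X), and resample one in proportion to its weight. The candidate strings and log-probabilities below are made-up stand-ins for real model likelihoods:

```python
# Toy sketch of BoLT's Monte Carlo E-step (sampling-importance-resampling).
# All probabilities here are invented for illustration.
import math
import random

def e_step(candidates, log_p_joint, log_q, rng):
    """candidates: K latent thoughts Z^(k) sampled from q(.|X).
    log_p_joint(z): log p(Z=z, X) under the current model.
    log_q(z): log q(Z=z | X) under the proposal."""
    log_w = [log_p_joint(z) - log_q(z) for z in candidates]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]        # numerically stabilized
    probs = [x / sum(w) for x in w]
    return rng.choices(candidates, weights=probs, k=1)[0]

rng = random.Random(0)
cands = ["shallow guess", "step-by-step derivation", "off-topic ramble"]
# Pretend the joint likelihood strongly favors the coherent derivation.
picked = e_step(
    cands,
    log_p_joint={"shallow guess": -5.0, "step-by-step derivation": -1.0,
                 "off-topic ramble": -9.0}.__getitem__,
    log_q=lambda z: -2.0,                          # uniform-ish proposal
    rng=rng,
)
print(picked)
```

With these made-up scores, the importance weights concentrate almost all probability mass on the coherent derivation, which is exactly the "filter" behavior described in the E-step: the model's average sample quality can be low as long as its best samples can be identified and reused for training.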

Real-World Implications of Bootstrapping Latent Thoughts (BoLT)

The researchers tested their new "Bootstrapping Latent Thoughts" (BoLT) method to see if it really helps language models learn better with limited data, focusing on math problems. They found that it works surprisingly well! In their main experiment, they took a standard language model and had it go through several cycles of generating its own "thinking steps" for a fixed set of math text and then retraining itself using those thoughts. With each cycle, the model got consistently better, both at understanding the text (measured by standard model quality scores) and, more importantly, at solving challenging math problems from the MATH benchmark. This self-improvement continued for at least three cycles.

Additionally, the models trained using this BoLT method significantly outperformed models trained the old-fashioned way, i.e. just feeding them the raw text, even when those baseline models were trained with the same amount of compute or saw the same amount of raw text. This shows that teaching the model to generate and learn from its own reasoning makes learning much more efficient.

They also discovered that letting the model generate more candidate "thoughts" and picking the best one during each cycle further boosted performance. It suggests that spending more computing power on generating better reasoning during training pays off. While the method greatly improved complex math skills, they did note a slight dip in performance on simpler word problems in some tests (unless specifically fine-tuned), hinting that intense focus on one area might slightly affect others.
