(How) Do Language Models Track State?

Plus more about Optimal Hyperparameter Scaling Law in Large Language Model Pretraining and PokéChamp: an Expert-level Minimax Language Agent

Mar 3rd ~ Mar 9th
#46 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 2.2k OpenAI has introduced several new tools and SDKs to enhance AI agent capabilities. The Web Search tool allows agents to retrieve up-to-date information from the web, while the File Search tool enables precise information retrieval from large document collections. Additionally, the Computer Use tool, powered by the CUA model, allows agents to perform tasks on computers, and the new Agents SDK facilitates the orchestration of multi-agent workflows.

  2. ♥ 11k Mistral AI has launched Mistral OCR, an Optical Character Recognition API that understands media, text, tables, and equations within complex documents. It can process images and PDFs, extracting content in an ordered, interleaved format, which makes it ideal for RAG systems handling multimodal documents.

  3. ♥ 8.9k The Qwen Team has introduced QwQ-32B, a 32 billion parameter model that leverages the power of Reinforcement Learning to achieve performance comparable to the much larger DeepSeek-R1 model. QwQ-32B demonstrates significant improvements in mathematical reasoning, coding proficiency, and general problem-solving capabilities, thanks to a multi-stage RL approach that begins with cold-start data and progresses through specialized training for math and coding, followed by general capability enhancement. Its weights have been released under the Apache 2.0 license on Hugging Face and ModelScope, and the model is accessible via Qwen Chat.

  4. ♥ 3k Google DeepMind has launched the Gemma 3 family of open models, featuring variants with 1 billion, 4 billion, 12 billion, and 27 billion parameters. The models incorporate vision input capabilities with a 400 million parameter SigLIP model and support a context length of 128k. Notably, the 27B parameter model ranks 9th on LMArena, surpassing models like o3-mini, DeepSeek V3, Claude 3.7 Sonnet, and Qwen2.5-Max.

RTX 4080 SUPER Giveaway With NVIDIA’s GTC 2025

NVIDIA's GTC, the company's annual flagship AI and developer conference, runs March 17-21, 2025, and will feature big announcements, events, and sessions you can attend in person or virtually.

This is one of the best times to learn from global experts on how generative AI is impacting industries and society as a whole.

You can virtually attend sessions like:

  • How to Build an Agentic AI Blueprint Using the Best Tools and Frameworks hosted by the director of engineering from NVIDIA 

  • Accelerate Inference on NVIDIA GPUs hosted by the CTO of Together AI

Attending virtually lets you discover the latest breakthroughs in generative AI and NVIDIA technologies from subject matter experts at #GTC25.

Highlighted Technical Speakers List from GTC 2025

By virtually attending these sessions, you can join my giveaway for an RTX 4080 SUPER. All you have to do is take a selfie of yourself attending the LIVE virtual sessions available during GTC (March 17-21) and submit it using this Google Form. You can learn and possibly earn a GPU at the same time! You can find more information on the Google Form.

And of course, the yearly keynote by NVIDIA CEO Jensen Huang will take place on Tuesday, March 18th at 10am Pacific Time. Don't miss it!

PokéChamp: an Expert-level Minimax Language Agent

Karten et al. [Princeton University]

♥ 1.1k   LLM Agent

Introduction to PokéChamp

Creating agents that can compete effectively in complex, partially observable environments like Pokémon battles is a significant challenge. Traditional reinforcement learning approaches often require extensive task-specific training, which is resource-intensive and adapts poorly to new scenarios.

This paper introduces PokéChamp, a novel agent that leverages the generalist capabilities of LLMs to enhance minimax tree search in Pokémon battles. By integrating LLMs into three key modules (player action sampling, opponent modeling, and value function estimation), PokéChamp uses gameplay history and human knowledge to reduce the search space and address partial observability. The approach requires no additional LLM training, making it flexible and efficient.

How Does PokéChamp Work?

PokéChamp combines an approximate game transition model with three LLM-backed modules: player action sampling, opponent modeling, and value function estimation.

First, the authors tackle the challenge of simulating game transitions under partial observability. They use statistical data from Pokémon Showdown, including move pools, EV spreads, and item usage, to infer hidden information and approximate the latent state. They also incorporate LLM predictions to estimate hidden opponent variables, such as attack and defense stats, based on game history. After predicting the current state, they simulate the next state using a local Showdown simulator. To manage the computational burden, they compute expected values within these transitions rather than enumerating every possible outcome.
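
To make the expected-value transition concrete, here is a minimal Python sketch. All identifiers (usage_stats, simulator, the state objects) are hypothetical stand-ins, not the paper's actual interfaces:

```python
# Hedged sketch: approximate a partially observable battle transition by
# filling hidden opponent attributes with expectations over usage statistics.

def expected_hidden_stats(species: str, usage_stats: dict) -> dict:
    """Average each hidden stat over the usage distribution for this species."""
    spreads = usage_stats[species]["spreads"]  # list of (probability, stats) pairs
    expected: dict = {}
    for prob, stats in spreads:
        for name, value in stats.items():
            expected[name] = expected.get(name, 0.0) + prob * value
    return expected

def approximate_transition(state, my_action, opp_action, usage_stats, simulator):
    """Fill the latent state with expected values, then simulate one turn locally."""
    for mon in state.opponent_team:
        mon.stats.update(expected_hidden_stats(mon.species, usage_stats))
    return simulator.step(state, my_action, opp_action)
```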

Next, for player action sampling, PokéChamp generates a set of candidate actions for the minimax search tree. The input prompt for action sampling includes the team strategy, observable state, battle history, an approximate state-transition heuristic, and the available actions. In addition to LLM-generated actions, the candidate set includes actions from auxiliary tools, such as the top move from a one-step damage lookahead and the top switch from the Abyssal bot.
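
A hedged sketch of how such a candidate set might be assembled; prompt_llm, one_step_lookahead, and abyssal_switch are illustrative names, not the paper's API:

```python
# Hedged sketch of candidate-action generation for the minimax root:
# LLM proposals plus heuristic tool suggestions, filtered to legal actions.

def sample_candidate_actions(state, history, k: int = 3) -> list:
    prompt = (
        f"Team strategy: {state.strategy}\n"
        f"Observed state: {state.describe()}\n"
        f"Recent turns: {history[-5:]}\n"
        f"Legal actions: {state.legal_actions()}\n"
        f"Propose the {k} best actions."
    )
    candidates = set(prompt_llm(prompt))          # LLM-proposed actions
    candidates.add(one_step_lookahead(state))     # top move from damage lookahead
    candidates.add(abyssal_switch(state))         # top switch from the Abyssal bot
    return [a for a in candidates if a in state.legal_actions()]
```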

For opponent modeling, PokéChamp addresses the partial observability of the opponent's actions and hidden state information. It uses historical data to estimate unknown opponent stats and LLM-based predictions to generate likely opponent actions, with a prompt similar to the action-sampling one but written from the opponent's perspective.

PokéChamp replaces components of minimax tree search with LLM-based generations

Finally, due to the computational constraints of live gameplay, PokéChamp uses an LLM-generated value function to evaluate leaf nodes in the minimax tree. The LLM produces a score from positive factors, such as the effectiveness of current moves and the number of remaining Pokémon, and negative factors, such as excessive switching and the opponent's move effectiveness. By combining these three components (action sampling, opponent modeling, and value function approximation), PokéChamp navigates the complex, partially observable state space of Pokémon battles and approximates optimal play within real-time constraints.
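
Putting the pieces together, here is a minimal depth-limited minimax sketch with an LLM-scored value function at the leaves, reusing approximate_transition and sample_candidate_actions from the sketches above. llm_value and sample_opponent_actions are hypothetical stand-ins, and a real system would batch LLM calls rather than issue one per node:

```python
# Hedged sketch of depth-limited minimax where an LLM scores leaf states.

def minimax_value(state, depth, usage_stats, simulator) -> float:
    if depth == 0 or state.is_terminal():
        # The LLM rates the position from positive factors (move effectiveness,
        # remaining Pokemon) and negative ones (excess switching, opponent threat).
        return llm_value(state)
    best = float("-inf")
    for my_action in sample_candidate_actions(state, state.history):
        worst = float("inf")
        for opp_action in sample_opponent_actions(state):  # opponent model
            nxt = approximate_transition(state, my_action, opp_action,
                                         usage_stats, simulator)
            worst = min(worst, minimax_value(nxt, depth - 1, usage_stats, simulator))
        best = max(best, worst)  # pick the action whose worst case is best
    return best
```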

Results and Real-World Implications of PokéChamp

The evaluation of PokéChamp was conducted using both an offline dataset and online games on the Pokémon Showdown platform. The compiled dataset, consisting of over 3 million battles, provided a foundation for analyzing transition probabilities and opponent policies, with detailed information on team compositions, move choices, and battle outcomes.

Action prediction experiments showed that PokéChamp achieved a player action prediction accuracy of 26-30% and an opponent action prediction accuracy of 13-16% across various Elo ratings, significantly outperforming random prediction baselines. In small game puzzles designed to test core game mechanics, PokéChamp demonstrated an 86% win rate in 1v1 battles, surpassing PokéLLMon's 76%, and effectively utilized special mechanics like Terastallization and Dynamax to gain strategic advantages.

Against human players on the online ladder, PokéChamp achieved a 76% win rate under live time constraints, reaching an Elo score above 1300 and placing it in the top 10%-30% of players.

Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Li et al. [StepFun, Fudan University, Tsinghua University, Megvii Technology]

♥ 136   LLM Pre-training  

Introduction to Hyperparameter Optimization

The paper addresses the challenge of hyperparameter optimization for LLMs, which is computationally expensive at scale. Through extensive studies on 3,700 LLMs and nearly one million GPU hours, the researchers discovered universal scaling laws governing optimal hyperparameters. Their proposed "Step Law" establishes that optimal learning rates follow a power-law relationship with both model parameters and data sizes, while optimal batch sizes scale primarily with data sizes.

These formulas achieve results within 0.09% of globally optimal performance without exhaustive searches. Notably, the scaling laws demonstrate robustness across various model architectures, including both dense transformers and Mixture-of-Experts models, as well as different data distributions, providing a plug-and-play tool for efficient hyperparameter selection in LLM training.

Comparison of optimal hyperparameter scaling laws across different approaches.

Core Mechanism of LLM Hyperparameter Optimization

The Step Law presents a universal hyperparameter optimization framework for Large Language Models based on extensive empirical research across 3,700 model configurations. Its central innovation lies in two power-law scaling formulas that predict optimal hyperparameters (a hedged code sketch follows the list):

  1. Learning Rate Formula:

    • Scales inversely with model size (N)

    • Scales positively with data size (D)

    • Captures the complex interplay between model capacity and training data volume

  2. Batch Size Formula:

    • Primarily depends on dataset size (D)

    • Remains largely invariant to model parameters

    • Follows a sublinear scaling pattern with data volume
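
As referenced above, here is a hedged sketch of the Step Law's functional form. The exponents and constants follow the paper's reported fits as best understood here; treat them as approximate and verify against the original before relying on them:

```python
# Hedged sketch of the Step Law power laws: optimal peak learning rate scales
# as a power law in model size N and data size D; optimal batch size scales
# sublinearly with D alone.

def step_law_lr(n_params: float, n_tokens: float) -> float:
    """Optimal peak learning rate: decreases with model size N, grows with data D."""
    return 1.79 * n_params ** -0.713 * n_tokens ** 0.307

def step_law_bs(n_tokens: float) -> float:
    """Optimal batch size in tokens: depends on D only, and sublinearly."""
    return 0.58 * n_tokens ** 0.571

# Example: a 1B-parameter model trained on 100B tokens.
lr = step_law_lr(1e9, 100e9)   # on the order of 1e-3
bs = step_law_bs(100e9)        # on the order of 1e6 tokens per batch
print(f"lr = {lr:.2e}, batch = {bs:.3g} tokens")
```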

Technical Architecture & Implementation

The system's architecture rests on four key technical discoveries:

  1. Convex Loss Landscape: The researchers demonstrated that hyperparameter space forms a convex optimization landscape with a stable plateau around optimal values. This property ensures that small deviations from ideal settings still produce near-optimal results.

  2. Fixed Final Learning Rate Strategy: Unlike conventional approaches that decay learning rates proportionally to their initial values, Step Law employs a constant minimum learning rate (10^-5). This prevents the "left-skew" bias problem where high initial learning rates produce disproportionately large final rates that impair convergence (see the schedule sketch after this list).

  3. Universal Transferability: The scaling laws demonstrate remarkable robustness across varied model architectures (dense transformers and Mixture-of-Experts), different sparsity ratios, and diverse data distributions. This universality eliminates the need for domain-specific hyperparameter tuning.

  4. Joint Optimization: Unlike previous approaches that optimized learning rate or batch size in isolation, Step Law captures the interdependencies between these parameters for globally optimal configurations.
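
As noted in point 2, here is a minimal sketch of a schedule that warms up to any peak learning rate but always decays to the same fixed floor of 10^-5. The warmup length and cosine shape are illustrative choices, not prescribed by the paper:

```python
import math

# Hedged sketch of the fixed-final-LR idea: every run decays to the same
# constant floor (1e-5) rather than to a fraction of its own peak LR.

def lr_at_step(step: int, total: int, peak: float,
               floor: float = 1e-5, warmup: int = 2000) -> float:
    if step < warmup:                       # linear warmup to the peak
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cosine  # same floor regardless of peak
```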

The system's effectiveness was validated through extensive experimentation, achieving results within 0.09% of global optima found via exhaustive searches while reducing the computational overhead of hyperparameter optimization by orders of magnitude.

Results and Evaluation

In the paper, the authors present a significant advancement in hyperparameter optimization for LLMs by introducing universal scaling laws for learning rate (LR) and batch size (BS). This study concluded that these scaling laws exhibit topological invariance, maintaining consistent scaling constants across different model scales and data sizes, even when varying the topological features of model architectures.

Additionally, the scaling laws are shown to be effective beyond dense Transformers, extending to sparse Mixture of Experts (MoE) models and maintaining high prediction accuracy across various sparsity levels. The robustness of these scaling laws is further evidenced by their performance across diverse data distributions, highlighting their broad applicability in different neural architectures.

(How) Do Language Models Track State?

Li et al. [MIT EECS]

♥ 253   LLM Interpretability   bycloud’s pick  

Introduction to States in Language Models

Language models show impressive abilities to track state changes (like following narratives or executing code), but how they actually accomplish this remains unclear. This paper uses permutation composition tasks, where models must predict final object positions after a series of swaps, as a simplified model for studying state tracking mechanisms.
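
To make the task concrete, here is a tiny example of the permutation-composition setup; the encoding is illustrative rather than the paper's exact tokenization:

```python
# Tiny illustration of the permutation-composition task: read a sequence of
# swaps, predict the final arrangement of the objects.

objects = ["A", "B", "C"]            # positions 0, 1, 2
swaps = [(0, 1), (1, 2), (0, 2)]     # the "actions" the model reads in order

state = list(objects)
for i, j in swaps:                   # ground-truth sequential execution
    state[i], state[j] = state[j], state[i]

print(state)  # ['A', 'C', 'B'] is the answer the model must produce
```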

The researchers discovered that across various model architectures, language models consistently learn one of two algorithmic solutions: either an "associative algorithm" that resembles theoretical constructions from prior work, or a "parity-associative algorithm" that first narrows possibilities using a parity heuristic before refinement. Notably, they found no evidence for step-by-step simulation or fully parallel computation approaches, despite their theoretical viability.

Through careful interventions and training experiments, the researchers demonstrate that these mechanisms can be predicted and even steered through specific intermediate training tasks, which provides valuable insights into how language models might track state when processing language, code, and interactive scenarios.

What Mechanisms Do Transformers Use for State Tracking?

The researchers investigate how language models track state changes and propose four theoretical algorithms these models might implement: Sequential (step-by-step processing), Parallel (constant-depth computation), Associative (hierarchical composition), and Parity-Associative (a two-stage approach combining a parity heuristic with associative composition).

The sequential algorithm composes permutations one at a time from left to right.
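
To contrast the sequential and associative algorithms concretely, a minimal sketch: because permutation composition is associative, the same product can be computed left to right in n steps or as a balanced tree of depth about log2(n), which is the structure the associative algorithm exploits:

```python
from functools import reduce

# A permutation is a tuple p with p[i] = image of position i.
# compose(p, q) applies p first, then q.

def compose(p, q):
    """(q after p)[i] = q[p[i]]."""
    return tuple(q[i] for i in p)

def tree_compose(ps):
    """Combine adjacent pairs until one permutation remains (balanced tree)."""
    if len(ps) == 1:
        return ps[0]
    pairs = [compose(ps[i], ps[i + 1]) if i + 1 < len(ps) else ps[i]
             for i in range(0, len(ps), 2)]
    return tree_compose(pairs)

perms = [(1, 0, 2), (0, 2, 1), (2, 1, 0)]
assert reduce(compose, perms) == tree_compose(perms)  # same answer, lower depth
```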

This paper establishes clear "signatures" for each algorithm using two analysis techniques: prefix patching (measuring how much of a prefix must be modified to affect outputs) and probing (testing what can be decoded from intermediate layers).
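
A minimal sketch of the probing technique, using random arrays as stand-ins for hidden states that would really be extracted from the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hedged sketch of linear probing: fit a linear classifier to decode a target
# property (e.g. state parity) from one layer's hidden states.

hidden = np.random.randn(1000, 512)          # [examples, hidden_dim] activations
parity = np.random.randint(0, 2, size=1000)  # property to decode per example

probe = LogisticRegression(max_iter=1000).fit(hidden[:800], parity[:800])
print("probe accuracy:", probe.score(hidden[800:], parity[800:]))
# Accuracy well above chance at a given layer suggests the layer encodes parity.
```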

The Parity-Associative Algorithm is a novel hybrid approach: models first compute the parity of the accumulated state, then separately calculate the remaining information needed for the final state prediction. This shows how language models can combine heuristic shortcuts with structured algorithmic solutions when tracking complex state changes.
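
For reference, the parity the first stage tracks is the sign of the accumulated permutation, i.e. whether it decomposes into an even or odd number of transpositions. A minimal computation via cycle structure:

```python
# A cycle of length L decomposes into L - 1 transpositions, so summing
# (L - 1) over all cycles and taking mod 2 gives the permutation's parity.

def permutation_parity(p) -> int:
    """Return 0 for an even permutation, 1 for an odd one."""
    seen = [False] * len(p)
    transpositions = 0
    for start in range(len(p)):
        i, length = start, 0
        while not seen[i]:            # walk one cycle
            seen[i] = True
            i = p[i]
            length += 1
        transpositions += max(0, length - 1)
    return transpositions % 2

assert permutation_parity((1, 0, 2)) == 1  # one swap: odd
assert permutation_parity((1, 2, 0)) == 0  # a 3-cycle: even
```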

Evaluation and Results

Researchers tested their theoretical algorithms against actual language models trained on permutation tasks, and they found that:

  • Models consistently learned either the Associative Algorithm (AA) or Parity-Associative Algorithm (PAA), and activation patching experiments showed distinct signatures matching these two algorithms

  • Linear probing confirmed that in PAA models, state parity is decodable from early layers, while AA models encode parity differently

  • PAA representations can be geometrically decomposed into orthogonal components (parity and "cluster identity")

  • AA models generally demonstrate better generalization to longer sequences

  • Attention pattern analysis revealed "parity heads" in early layers of PAA models (absent in AA models)

  • AA models develop sparse, tree-like attention patterns in later layers

These findings suggest transformers implement efficient, interpretable state tracking mechanisms that combine both algorithmic structures and heuristic features.
