
From Memorization to Reasoning in the Spectrum of Loss Curvature

Continuous Autoregressive Language Models and Introducing Nested Learning: A new ML paradigm for continual learning

Nov 3rd ~ Nov 10th
#81 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 9.6k Moonshot AI has released Kimi K2 Thinking, an open-source model designed to function as a "thinking agent" capable of complex reasoning and problem-solving. The model can execute up to 200–300 sequential tool calls without human intervention, and it excels at agentic search, coding, and reasoning tasks, all supported by a 256K context window. K2 Thinking is currently accessible in chat mode and through its API, with the weights available on HuggingFace.

  2. ♥ 3.5k Google has added a File Search tool to the Gemini API to streamline the development of Retrieval-Augmented Generation (RAG) applications. This fully managed, serverless system lets developers ground models like Gemini in their own data across various file formats, including PDF and DOCX. The File Search tool automates the entire RAG pipeline, managing file storage, text chunking, embedding creation, and context injection.

  3. ♥ 6.4k OpenAI has released a new feature that allows users to interactively guide a model's reasoning process during complex tasks. Instead of passively waiting for a final output, you can now interrupt the model mid-thought, inject additional information or instructions, and then have it resume its work with the new context.

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot. If you enjoy reading The AI Timeline, consider sharing it with another research enthusiast. It helps us keep this up for free!

From Memorization to Reasoning in the Spectrum of Loss Curvature

Merullo et al. [Goodfire]

♥ 2k   LLM Reasoning  

Have you ever wondered how much of a model's output is genuine reasoning versus regurgitated training data? Many AI researchers are asking this same question, since verbatim recitation of training data raises privacy and copyright concerns. This research introduces a method for identifying and reducing such memorization by analyzing the curvature of the loss landscape around a model's weights.

Overview of activations and gradients from a sample of training data

The method works by examining the "curvature" of the loss landscape across a dataset. It turns out that weights involved in general-purpose reasoning tend to lie in directions of consistently moderate curvature, while those used for verbatim memorization point in many different, sharp directions that average out to appear flatter overall. Using an approximation called K-FAC, the researchers decompose model weights into components ordered from high to low curvature. They found that memorized data interacts much more strongly with components at the low-curvature end of this spectrum.
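For a single linear layer, K-FAC factors the curvature into an input-activation covariance and an output-gradient covariance, and their eigenbases define the ordered weight components described above. The sketch below illustrates that decomposition; the function names and shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def kfac_curvature_basis(acts, grads):
    """Estimate per-layer K-FAC factors from sampled activations and
    output gradients, returning the eigenbases and per-component
    curvature estimates (illustrative sketch).

    acts:  (N, d_in)  inputs to the layer over a data sample
    grads: (N, d_out) gradients of the loss w.r.t. the layer outputs
    """
    A = acts.T @ acts / acts.shape[0]      # input covariance (d_in, d_in)
    G = grads.T @ grads / grads.shape[0]   # gradient covariance (d_out, d_out)
    lam_A, U_A = torch.linalg.eigh(A)
    lam_G, U_G = torch.linalg.eigh(G)
    # Under the Kronecker approximation, the curvature of the rank-one
    # direction u_G u_A^T factorizes as the product of the eigenvalues.
    curvature = lam_G[:, None] * lam_A[None, :]   # (d_out, d_in)
    return U_A, U_G, curvature
```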

By selectively removing these low-curvature weight components, the method effectively suppresses the recitation of memorized content. In tests, this approach reduced verbatim memorization more effectively than a recent supervised unlearning technique, especially on unseen memorized data, while maintaining similar perplexity. Interestingly, this editing process revealed that certain capabilities, such as arithmetic and closed-book fact retrieval, rely heavily on these removed directions and experienced significant performance drops. These findings suggest that tasks like math may depend on narrow, specialized circuits in the weight space, separate from broader reasoning mechanisms.
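A minimal sketch of the corresponding edit, assuming the eigenbasis from the snippet above: components whose curvature falls below a threshold are zeroed, and the weights are reconstructed. The `keep_frac` knob and the thresholding rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def remove_low_curvature(W, U_A, U_G, curvature, keep_frac=0.9):
    """Zero the lowest-curvature components of W in the K-FAC eigenbasis
    and map the edited weights back to the original parameter space."""
    C = U_G.T @ W @ U_A                          # coefficient per basis direction
    n_drop = C.numel() - int(keep_frac * C.numel())
    if n_drop <= 0:
        return W.clone()
    thresh = curvature.flatten().kthvalue(n_drop).values
    C = torch.where(curvature > thresh, C, torch.zeros_like(C))
    return U_G @ C @ U_A.T                       # edited weight matrix
```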

Accuracy change across the relations dataset

Continuous Autoregressive Language Models

Shao et al. [WeChat AI, Tsinghua University]

♥ 424   VAE   bycloud’s pick  

Large language models today are bottlenecked by generating text one token at a time. The CALM framework addresses this by grouping multiple tokens into a single continuous vector, letting the model predict chunks of text at once instead of individual tokens. This increases the semantic bandwidth of each step, reducing the number of sequential steps required and accelerating generation.
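The toy sketch below illustrates the chunking idea: K tokens are compressed into one vector and decoded back, so the sequential step count drops by roughly a factor of K. All dimensions and layer choices here are illustrative, not CALM's actual autoencoder.

```python
import torch
import torch.nn as nn

K, V, D = 4, 32000, 512   # chunk size, vocab size, vector dim (illustrative)

class ChunkAutoencoder(nn.Module):
    """Toy autoencoder: K discrete tokens <-> one continuous vector."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.enc = nn.Linear(K * D, D)        # K token embeddings -> one vector
        self.dec = nn.Linear(D, K * V)        # one vector -> K sets of logits

    def encode(self, ids):                    # ids: (B, K) token ids
        return self.enc(self.emb(ids).flatten(1))

    def decode(self, z):                      # z: (B, D)
        return self.dec(z).view(-1, K, V)     # per-position vocab logits

# A 1024-token sequence takes 1024 next-token steps in a standard LM,
# but only 1024 // K = 256 next-vector steps under this chunking.
```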

To make this work, the model first uses a specialized autoencoder to compress a chunk of tokens into a dense vector, and then reconstruct those tokens with high accuracy. Because the model now operates in a continuous space, it can't rely on standard next-token prediction methods. Instead, it uses an energy-based generative head that samples the next vector directly. This component refines random noise into a meaningful vector using the model’s current hidden state, all in a single efficient step.
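Here is a minimal sketch of what the sampling path of such a head could look like, assuming a plain MLP over the concatenated hidden state and noise; CALM's actual head and its energy-based training objective differ in detail.

```python
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    """Sampling path of a one-step generative head that maps
    (hidden state, noise) -> next continuous vector. The energy-based
    training objective is omitted; only inference is sketched."""
    def __init__(self, d=512, d_noise=64):
        super().__init__()
        self.d_noise = d_noise
        self.net = nn.Sequential(
            nn.Linear(d + d_noise, 4 * d),
            nn.GELU(),
            nn.Linear(4 * d, d),
        )

    def sample(self, h):                      # h: (B, d) current hidden state
        eps = torch.randn(h.shape[0], self.d_noise, device=h.device)
        # A single forward pass refines the noise into the next vector.
        return self.net(torch.cat([h, eps], dim=-1))
```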

In testing, CALM achieved a significantly better performance-compute trade-off while using substantially less computation. Although the method currently requires an autoencoder and doesn’t yet efficiently support very low temperatures, it opens a promising new direction for ultra-fast, high-capacity AI systems.

Introducing Nested Learning: A new ML paradigm for continual learning

Behrouz et al. [Google Research]

♥ 855   LLM Sampling  

Have you ever noticed how large language models seem to forget new information as soon as it leaves their context window? This static nature limits their ability to learn continuously, much like a person with anterograde amnesia who can't form new long-term memories. The Nested Learning (NL) approach addresses this by reimagining models as interconnected systems of nested optimization problems, each operating at its own update frequency.

Comparison of performance on language modeling (perplexity; left) and common-sense reasoning (accuracy; right) tasks between different architectures: Hope, Titans, Samba and a baseline Transformer.

NL reveals that familiar components, such as gradient-based optimizers, are themselves associative memory modules that compress their input context. Momentum, for instance, acts as a memory that stores past gradients, and viewing it this way suggests deeper, more expressive optimizer designs. By structuring models into multiple levels (where inner loops handle fast updates, such as attention and memory, and outer loops manage slower parameter adjustments), NL enables richer in-context learning and continual adaptation without requiring retraining.
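The toy loop below sketches this two-frequency structure: a fast associative memory written at every step, and slow weights updated only periodically, with a momentum buffer acting as a compressed memory of past gradients. The dimensions, update rules, and learning rates are illustrative assumptions, not the paper's HOPE architecture.

```python
import torch

torch.manual_seed(0)
d = 8
W_slow = torch.randn(d, d, requires_grad=True)  # slow outer parameters
memory = torch.zeros(d, d)                      # fast inner associative memory
momentum = torch.zeros(d, d)                    # compressed history of gradients
outer_every, lr_fast, lr_slow, beta = 16, 0.1, 1e-2, 0.9

for step in range(256):
    x, target = torch.randn(d), torch.randn(d)  # toy data stream
    err = (W_slow + memory) @ x - target
    # Inner level, every step: Hebbian-style write to the fast memory,
    # playing the role of attention / short-term memory in the NL framing.
    memory = memory - lr_fast * torch.outer(err.detach(), x)
    # Outer level: accumulate gradients each step, but apply them only
    # every `outer_every` steps, at a much slower rate.
    loss = 0.5 * err.pow(2).sum()
    loss.backward()
    if (step + 1) % outer_every == 0:
        with torch.no_grad():
            momentum = beta * momentum + W_slow.grad  # memory of past gradients
            W_slow -= lr_slow * momentum
            W_slow.grad.zero_()
```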


This approach led to innovations such as deep optimizers and HOPE, a self-modifying sequence model combined with a continuum memory system. Early tests show HOPE delivers promising results in language modeling, continual learning, and long-context reasoning.

Performance comparison on long-context tasks with different levels of difficulty between different architectures: Hope, Titans, TTT, and Mamba2. NIAH-PK, NIAH-N, and NIAH-W are needle-in-a-haystack tasks with pass-key, number, and word, respectively.
