Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Plus more about Continuous Concepts (CoCoMix) and Distillation Scaling Laws
Feb 10th ~ Feb 16th
#43 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 15k xAI has released Grok 3, scoring 1400 Elo on Chatbot Arena and ranking #1 across the board, and it outperforms o3-mini (high) on their official benchmark. It ships with two thinking modes, Think and Big Brain, which use different amounts of thinking time. Grok 3 was trained using 200,000 GPUs in Memphis and is now available to Premium+ users on X. API access and pricing are NOT available yet.
Grok-3 Benchmark
♥ 1.1k SkyworkAI has launched a new open-source human-centric video foundation model. It performs comparably to VEO 2, with advanced capabilities including lip syncing and ComfyUI integration. The model is freely available through SkyworkAI's platform at skyreels.ai, enabling developers and creators to build upon this technology.
♥ 11k Researchers at DeepSeek have announced NSA (Natively Trainable Sparse Attention), a new attention mechanism that delivers ultra-fast long-context processing through dynamic hierarchical sparse strategies and optimized token handling. The approach matches or exceeds Full Attention model performance while delivering speedups of up to 11.6x, which could potentially revolutionize the efficiency of large language models.
RTX 4080 SUPER Giveaway With NVIDIA’s GTC 2025
NVIDIA’s GTC, the company’s annual flagship AI and developer conference, runs March 17-21, 2025, and will feature big announcements, events, and sessions you can attend either in person or virtually.
This is one of the best times to learn from global experts on how generative AI is impacting industries and society as a whole.
You can virtually attend sessions like:
How to Build an Agentic AI Blueprint Using the Best Tools and Frameworks hosted by the director of engineering from NVIDIA
Accelerate Inference on NVIDIA GPUs hosted by the CTO of Together AI
That way, you can virtually discover the latest breakthroughs in generative AI and NVIDIA technologies from subject-matter experts at #GTC25.
By virtually attending these sessions, you can join my giveaway for an RTX 4080 SUPER. All you have to do is take a selfie of yourself attending the LIVE virtual sessions available during GTC (March 17-21) and submit it using this Google Form, so you can learn and possibly earn a GPU at the same time! You can find more information on the Google Form.
And of course, the yearly keynote by NVIDIA CEO Jensen Huang will take place on Tuesday, March 18th at 10am Pacific Time. Don’t miss it!
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Liu et al. [Shanghai AI Laboratory, Tsinghua University, Harbin Institute of Technology, BUPT]
♥ 536 LLM Reasoning
Can Smaller LLMs Outperform Giants?
The field of LLMs is constantly evolving, and the "Test-Time Scaling" (TTS) approach is quite hot right now. In the past few issues, we have already covered Titans: Learning to Memorize at Test Time and s1: Simple test-time scaling. But we still don't have a systematic understanding of how TTS is influenced by:
Policy Models: The specific LLM being used.
Process Reward Models (PRMs): Models that guide the LLM's reasoning and answer selection.
Problem Difficulty: The complexity of the task.
Without this understanding, we can't optimally apply TTS, potentially wasting computational resources and limiting performance gains. This research dives deep into these factors, asking two fundamental questions:
What's the best way to scale computation during testing, considering the interplay of policy models, PRMs, and problem difficulty?
Can we push the limits of LLM performance on complex tasks using extended computation, and can smaller models, with clever scaling, beat larger ones?
Understanding the Mechanics of Test-Time Scaling
Test-Time Scaling treats the problem of generating a solution (like a mathematical proof) as a step-by-step decision-making process. Think of it like navigating a maze. Each step in the solution is an action, and the current state is the sequence of steps taken so far. The LLM, acting as the "policy," proposes the next step. A "reward" is given for each step, indicating how good that step is. The goal is to find a sequence of steps (a "trajectory") that leads to the correct final answer, maximizing the cumulative reward. This framework is formally known as a Markov Decision Process (MDP), a standard concept in reinforcement learning.
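To make that framing concrete, here is a minimal Python sketch of the MDP loop, assuming hypothetical `policy` and `prm` objects; the method names (`propose_step`, `score`, `is_final`) are illustrative, not the paper's code:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    problem: str
    steps: list[str] = field(default_factory=list)  # reasoning steps taken so far

def rollout(policy, prm, problem, max_steps=16):
    """Greedy rollout: the policy proposes one step at a time, the PRM scores each step."""
    state = State(problem)
    total_reward = 0.0
    for _ in range(max_steps):
        step = policy.propose_step(state)        # hypothetical policy-LLM call
        total_reward += prm.score(state, step)   # hypothetical step-level PRM call
        state.steps.append(step)
        if policy.is_final(step):                # e.g. the step contains the final answer
            break
    return state, total_reward
```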

Comparison between the performance of smaller LLMs’ compute-optimal TTS and larger LLMs’ CoT on MATH-500 and AIME24.
The paper explores three main TTS methods to improve upon simply generating a single solution (like Chain-of-Thought):
Best-of-N (BoN): The LLM generates N different complete solutions independently. A scoring system (using a Process Reward Model or PRM) evaluates each solution, and a voting mechanism (like majority vote or picking the highest-scored solution) selects the final answer. This is like having N people solve the problem and then picking the best answer.
Beam Search: This is a more structured exploration. The LLM starts generating a few initial steps. The PRM scores these "partial solutions." Only the top-scoring partial solutions (the "beam") are kept and extended further. This process repeats, branching out from the best candidates at each stage, until a complete solution is found. This is analogous to exploring multiple paths in the maze simultaneously, but pruning away the less promising paths early on.
Diverse Verifier Tree Search (DVTS): This builds upon beam search to encourage more diverse exploration. Instead of a single beam, the search is split into multiple independent "subtrees," each using beam search. This prevents the search from getting stuck in a local optimum by forcing it to explore different solution paths. It's like having multiple groups explore the maze independently, each using a beam search strategy.

Comparison of different external TTS methods.
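As a rough sketch of how the three methods above differ, here is some illustrative Python reusing the hypothetical `policy` and `prm` objects from the earlier snippet; all method names and hyperparameters are assumptions, and real implementations batch these LLM calls:

```python
def best_of_n(policy, prm, problem, n=8):
    """Best-of-N: sample N full solutions independently, keep the highest-scored one."""
    candidates = [policy.generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: prm.score_solution(problem, sol))

def beam_search(policy, prm, problem, beam_width=4, expand=4, max_depth=16):
    """Beam search: grow partial solutions step by step, keeping only the top-scoring beams."""
    beams = [([], 0.0)]  # each beam is (steps so far, cumulative PRM reward)
    for _ in range(max_depth):
        expanded = []
        for steps, score in beams:
            if steps and policy.is_final(steps[-1]):
                expanded.append((steps, score))  # already finished, carry over unchanged
                continue
            for step in policy.propose_steps(problem, steps, k=expand):
                expanded.append((steps + [step],
                                 score + prm.score_step(problem, steps, step)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(steps and policy.is_final(steps[-1]) for steps, _ in beams):
            break
    return beams[0][0]

def dvts(policy, prm, problem, num_subtrees=4, **beam_kwargs):
    """DVTS-style diversification: run several independent beam searches, keep the best result."""
    solutions = [beam_search(policy, prm, problem, **beam_kwargs)
                 for _ in range(num_subtrees)]
    return max(solutions, key=lambda s: prm.score_solution(problem, "\n".join(s)))
```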
The researchers argue that the reward given by the PRM plays a crucial role, and not all PRMs are created equal. A PRM trained on data similar to the LLM's output (an "on-policy" PRM) is generally better. However, training a custom PRM for every LLM is expensive. The paper proposes a "reward-aware" approach, meaning the TTS strategy explicitly considers the reward function of the PRM when deciding how much computation to allocate. Furthermore, the research shows that labeling a problem's difficulty (easy, medium, hard) with absolute accuracy thresholds works better than binning by Pass@1 quantiles alone.
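A tiny sketch of what absolute-threshold difficulty binning might look like; the cutoff values below are illustrative, not necessarily the paper's exact numbers:

```python
def difficulty_bin(pass_at_1: float) -> str:
    """Bin a problem by the policy model's Pass@1 accuracy using absolute thresholds
    (cutoffs are illustrative)."""
    if pass_at_1 >= 0.5:
        return "easy"    # the policy already solves it most of the time
    if pass_at_1 >= 0.1:
        return "medium"
    return "hard"        # the policy almost never solves it unaided
```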
Results and Real-World Implications of Test-Time Scaling
This paper shows that smaller LLMs, when strategically augmented with compute-optimal Test-Time Scaling, can compete with and even surpass much larger models on complex reasoning tasks. This is a significant finding with several practical implications:
The ability of smaller models to achieve state-of-the-art performance opens up possibilities for deploying advanced reasoning capabilities on devices with limited resources. This could include edge devices, mobile phones, or systems where computational cost is a major constraint. It's no longer necessary to rely solely on massive, resource-intensive models.
Smaller models inherently require less computation. Combined with the targeted application of additional resources only when needed (via TTS), this translates to significant cost savings and reduced energy consumption. This is crucial for sustainable AI development and deployment.

LLM Pretraining with Continuous Concepts
Tack et al. [FAIR at Meta, KAIST, UC San Diego]
♥ 449 LLM Architecture
Introduction to Continuous Concepts in LLMs
Traditional language model pretraining relies heavily on next-token prediction, which forces models to learn high-level concepts and reasoning capabilities solely through the lens of predicting discrete tokens, many of which are superficial (like articles and prepositions). This token-level training approach often requires massive amounts of data for models to develop sophisticated reasoning abilities, and it struggles with long-range dependencies.
This paper proposes CoCoMix to address this limitation by explicitly incorporating continuous semantic concepts into the training process: a Sparse Autoencoder extracts meaningful latent concepts from a pretrained model, and new models are then trained to predict and integrate these continuous concepts alongside traditional token prediction.
How do Continuous Concepts (CoCoMix) Work?
CoCoMix operates through a series of interconnected processes that enhance language model training. The system begins with concept extraction using a Sparse Autoencoder (SAE), which analyzes hidden states from a pretrained model to identify meaningful semantic features. The SAE maps inputs to sparse activations while maintaining the ability to reconstruct the original input, keeping only the most significant features through a top-K selection mechanism to ensure the extracted concepts are meaningful and distinct.
The framework then moves to concept selection, where it uses attribution scoring to determine which concepts have the strongest influence on next token prediction. This process involves calculating how each concept impacts the model's output, enabling the selection of the most salient concepts that will serve as training labels. The attribution scores provide a quantitative measure of concept importance, ensuring that only the most relevant semantic features are incorporated into the training process.
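A rough sketch of the extraction and selection steps described above, assuming a pretrained `sae_encoder` and using a simple activation-times-gradient attribution; this is a simplification for illustration, not the authors' exact formulation:

```python
import torch

def extract_concepts(sae_encoder, hidden_state, top_k=32):
    """Encode a hidden state into sparse concept activations, keeping only the top-K features."""
    acts = sae_encoder(hidden_state)            # (num_concepts,) feature activations
    vals, idx = torch.topk(acts, top_k)
    sparse = torch.zeros_like(acts)
    sparse[idx] = vals                          # everything outside the top-K is zeroed out
    return sparse

def attribution_scores(concept_acts, concept_grads):
    """Activation-times-gradient attribution: a simple estimate of how much each
    concept influenced the next-token prediction."""
    return (concept_acts * concept_grads).abs()
```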

Overview of CoCoMix.
In its concept prediction and integration phase, the model learns to predict previously selected important concepts from its current hidden state. These multiple predicted concepts are then compressed into a single "continuous concept" vector, which is systematically interleaved with the regular token hidden states in the sequence. This integration allows the model to maintain both token-level and concept-level information in its processing pipeline.
The training process combines traditional next-token prediction loss with concept prediction loss, creating a dual learning objective. This approach enables the model to simultaneously learn token prediction while developing an understanding of how to utilize concept information effectively.
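A condensed sketch of that dual objective and the interleaving step; the mixing weight, the plain cross-entropy concept loss, and the insertion spacing are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

def cocomix_loss(token_logits, target_tokens, concept_logits, target_concepts, lam=0.1):
    """Next-token cross-entropy plus a concept-prediction term (lam is illustrative)."""
    ntp = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                          target_tokens.reshape(-1))
    concept = F.cross_entropy(concept_logits.reshape(-1, concept_logits.size(-1)),
                              target_concepts.reshape(-1))
    return ntp + lam * concept

def interleave_concepts(hidden_states, concept_vecs, every=4):
    """Insert a compressed 'continuous concept' vector into the hidden-state sequence
    every few token positions (the spacing is an illustrative choice)."""
    out = []
    for i, h in enumerate(hidden_states):        # hidden_states: (seq_len, d_model)
        out.append(h)
        j = i // every
        if (i + 1) % every == 0 and j < len(concept_vecs):
            out.append(concept_vecs[j])          # concept_vecs: (num_concepts, d_model)
    return torch.stack(out)
```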
Results and Evaluation of Continuous Concepts
The benchmark results show that CoCoMix outperforms baseline approaches in multiple scenarios. Compared to Next Token Prediction (NTP), CoCoMix achieves comparable performance while using 21.5% fewer training tokens across model sizes ranging from 69M to 1.38B parameters.

CoCoMix vs. Next Token Prediction (NTP) vs. Knowledge Distillation (KD).
In comparison to Knowledge Distillation (KD), CoCoMix shows particular strength in weak-to-strong supervision scenarios, where concepts extracted from a smaller model (124M parameters) successfully guide larger models (up to 1.38B parameters). This is notable because traditional KD often struggles when the student model surpasses the teacher's capabilities, leading to degraded performance.
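For contrast, here is a minimal sketch of the vanilla logit-distillation loss that such KD baselines typically use; the temperature and weighting values are illustrative:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Soft-label KL against the teacher plus hard-label cross-entropy on the data."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```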
Distillation Scaling Laws
Busbridge et al. [Apple, University of Oxford]
♥ 1.3k LLM Distillation bycloud’s pick

Introduction to Distillation Scaling Laws
Language models have long faced a challenging trade-off: larger models achieve better performance but come with prohibitive inference costs, while smaller models are more practical but less capable. This new paper introduces a comprehensive scaling law for knowledge distillation, providing a mathematical framework to predict how well a smaller "student" model can learn from a larger "teacher" model based on their respective sizes and training resources.

Expressions related to scaling laws used in this work. In each case, S always refers to student and not supervised.
How Do Distillation Scaling Laws Work?
In this study, the researchers tested different combinations of teacher and student models by varying three critical elements:
the student model's size
the volume of training data used for distillation
the teacher model's capabilities.
Their experiments showed that training smaller models through distillation behaves like a power law, similar to other machine learning patterns, but with a unique twist when the teacher becomes too sophisticated. They discovered the "capacity gap" phenomenon, where an overly capable teacher model can actually impair the student's learning process, much like how a quantum physicist might struggle to effectively teach basic algebra to high school students. The experiments also revealed that the success of distillation hinges on finding the right balance between the teacher's sophistication and the student's ability to absorb knowledge.
The team found that student models have an optimal learning range that depends on both their size and the amount of training data used, regardless of how capable their teacher model becomes beyond a certain point. Their experiments demonstrated that each student model has a natural ceiling for knowledge absorption, and pushing beyond this limit by using an increasingly sophisticated teacher yields diminishing returns or even degraded performance.
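For reference, this is the generic Chinchilla-style supervised scaling form that analyses like this one build on; it is the standard parametric shape (irreducible loss plus power-law terms in parameter count N and training-token count D), not the paper's fitted distillation law, which additionally conditions the student's loss on the teacher's:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```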

The researchers observed that the relationship between teacher and student models follows two distinct patterns: one where the student can effectively learn from the teacher, and another where the student struggles to process the teacher's complex knowledge. This dual behavior helped them identify precisely when a teacher model becomes too advanced for a given student, allowing them to determine the ideal teacher-student pairing for any given scenario.
Real-World Implications of Distillation Scaling Laws
This paper explored how to optimize knowledge distillation under different real-world scenarios. The researchers found that when starting from scratch with limited resources, traditional supervised learning actually performs better than training a teacher model and then distilling it into a student model.
However, distillation becomes more advantageous when you either already have a trained teacher model or plan to use the teacher to train multiple student models. The study found that smaller models are more likely to benefit from traditional training, while larger models tend to gain more from distillation. Interestingly, they discovered that the optimal size for a teacher model isn't always "bigger is better". Instead, there's usually an ideal size just slightly larger than the student model, after which using a bigger teacher becomes computationally inefficient.
🚨This week's top AI/ML research papers:
- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Emergent Response Planning in LLM
- …
— The AI Timeline (@TheAITimeline)
5:20 AM • Feb 17, 2025