Inference Scaling (F)Laws

O1 Replication Journey, and LLMs Don't Implicitly Reason

Nov 25th ~ Dec 1st
#34 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1.2k Researchers from the Qwen team have released a preview of QwQ, the first open-source model with inference-time compute scaling and a known parameter count. In some aspects, its performance is on par with OpenAI's o1-preview. The model weights are now available on Hugging Face.

    qwq benchmark
  2. ♥ 1.5k Nous Research has pre-trained a 15B parameter language model using their novel DisTrO technology which uses hardware from several companies including Oracle, LambdaAPI, Northern Data Group, and Crusoe Cloud. This is a decentralized training method and the initial results show a competitive loss curve and convergence rate comparable to centralized training methods.

    nous DisTrO
  3. ♥ 1.2k Prime Intellect also recently completed a large-scale training run of a Llama-3 architecture language model called INTELLECT-1 with 10B parameters. The training used a globally distributed system spanning 30 independent compute providers across three continents and five countries. It was spread across 112 H100 GPUs simultaneously and processed one trillion tokens over 42 days.

    Prime Intellect

Support My Newsletter & Project M Through Patreon!

As I aim to keep this newsletter free forever, your support not only sustains the newsletter but also helps us revolutionize how you do research!

A little bit about Project M: Have you ever had trouble finding an ML paper that is on the tip of your tongue? Or needed inspiration for your next project? Or wanted to find out what other people have done on a topic you’re interested in, without digging through 10 pages of Google Scholar and bouncing back and forth between reference lists?

With Project M, we are building the best way for you to do all of this, wrapped in a beautiful UX. It’ll be updated weekly alongside my weekly top papers, so you know every paper has been filtered.

Right now, Project M’s research paper maps cover Test-Time Compute, LLM Interpretability, DiffusionLM, Attention, and MoE. They are all available now if you become an official patron, letting you explore the latest papers at a glance. We are still adding more maps and improving the existing ones!

I am also building a chatbot to make finding research much easier, and I am currently improving its retrieval quality. Below is a small prototype vector database I’ve built for MoE!

Project M vector database

a small vector database I’ve built for MoE, visualized

My partner is cooking up an amazing UI to make the experience even better. We have already built a rendering engine to keep the UI smooth; what’s left is the visualizer, data handling, and some basic UI work.

Project M preview

current Figma design

So it would be really amazing to have your support early! Please check out my Patreon if you’d like. If you join before 2024 ends, you’ll get a lifetime 25% discount once our beta is out!

O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

Huang et al. [Shanghai Jiao Tong University, SII, NYU, Generative AI Research Lab]

♥ 485   LLM Distillation

Introduction to Simple Distillation

The AI research community is facing a critical challenge: knowledge distillation is increasingly being used to replicate advanced AI models like OpenAI's O1. This trend prioritizes rapid performance gains over transparent technical innovation, creating a problematic research ecosystem where institutions make ambitious claims without fully revealing their methodologies.

The paper tackles this problem by introducing a novel benchmark framework that evaluates O1 replication attempts on technical transparency and reproducibility, while demonstrating how simple distillation techniques can achieve impressive results and critically examining the limitations of such approaches.

Understanding Long Reasoning Chains Framework

This paper introduces a new approach to synthesize long reasoning chains for solving complex problems. It focuses on generating high-quality training data through knowledge distillation. The researchers used advanced models like O1 as a "teacher" model, and carefully prompted it to generate detailed, step-by-step problem-solving approaches that capture reflection, error correction, and backtracking.

Their method involved a multi-step data preparation process: first, they filtered and curated a dataset of Olympiad-level mathematical problems, removing image-dependent or proof-based questions. They then used GPT-4o-mini to rewrite solutions, ensuring they were step-by-step, detailed, and followed a standardized format with explicitly highlighted final answers. This approach transformed raw problem-solving data into refined, high-quality training material.
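
To make that pipeline concrete, here is a minimal sketch of what the curation step might look like, assuming a standard OpenAI-style chat API. The prompt wording, field names, and filtering heuristics are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of the data-curation step described above.
# Prompts and filtering heuristics are assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the following solution as a detailed, step-by-step derivation. "
    "Number each step, explain the reasoning behind it, and end with the final "
    "answer on its own line in the form 'Final answer: <answer>'.\n\n"
    "Problem: {problem}\n\nOriginal solution: {solution}"
)

def keep_problem(item: dict) -> bool:
    """Drop image-dependent or proof-based questions; keep ones with a checkable answer."""
    return not item["has_image"] and not item["is_proof"] and item["answer"] is not None

def rewrite_solution(problem: str, solution: str) -> str:
    """Ask GPT-4o-mini to rewrite a raw solution into the standardized long-thought format."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(problem=problem, solution=solution)}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

def build_sft_dataset(raw_items: list[dict]) -> list[dict]:
    """Filter the Olympiad problems and produce (prompt, long-thought response) pairs."""
    return [
        {"prompt": item["problem"],
         "response": rewrite_solution(item["problem"], item["solution"])}
        for item in filter(keep_problem, raw_items)
    ]
```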

For model training, they selected Qwen2.5-Math-72B as their base model and implemented a two-phase supervised fine-tuning strategy. The first phase adapted the model to the long-thought format which taught it to generate detailed, fine-grained solutions. The second phase used the distilled dataset to further enhance the model's reasoning capabilities. This allowed it to focus on producing precise, coherent outputs that mimic the sophisticated reasoning demonstrated by advanced models like O1.
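
One plausible way to implement the two-phase strategy is with Hugging Face TRL's SFTTrainer, as sketched below. The hyperparameters, dataset variables, and text formatting are placeholders, not the paper's settings.

```python
# Minimal sketch of a two-phase SFT setup, assuming TRL's SFTTrainer.
# Dataset variables and hyperparameters are illustrative placeholders.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

def run_phase(model_or_path: str, examples: list[dict], output_dir: str) -> str:
    """Fine-tune on (prompt, response) pairs rendered as plain text; return the checkpoint path."""
    ds = Dataset.from_list([
        {"text": f"Problem: {ex['prompt']}\n\nSolution: {ex['response']}"}
        for ex in examples
    ])
    trainer = SFTTrainer(
        model=model_or_path,
        train_dataset=ds,
        args=SFTConfig(output_dir=output_dir, num_train_epochs=2, learning_rate=1e-5),
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

long_thought_seed_examples: list[dict] = []  # hand-curated long-thought demonstrations (placeholder)
distilled_examples: list[dict] = []          # the larger distilled dataset (placeholder)

# Phase 1: adapt the base model to the long-thought output format.
ckpt = run_phase("Qwen/Qwen2.5-Math-72B", long_thought_seed_examples, "phase1")
# Phase 2: continue fine-tuning on the distilled dataset to strengthen reasoning.
run_phase(ckpt, distilled_examples, "phase2")
```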

Results and Insights

The researchers evaluated their fine-tuned model across multiple domains, revealing significant improvements in performance. On Auto-J and LIMA query sets, the model's scores increased from 81.6% to 88% and 77.2% to 87.2% respectively. This shows enhanced capabilities in bilingual conversations and open-domain question answering, especially in long-term planning scenarios.

Transparency scores of various O1 replication efforts.

However, the model's performance showed mixed results in factuality and safety metrics. While the fine-tuning process slightly reduced sycophancy and introduced more self-reflection capabilities, it also led to increased hallucinations, particularly in attempts to fabricate search engine results. Safety scores marginally declined from 94.3% to 93.0%, suggesting that even high-quality training data focused on reflection may not guarantee improved safety performance without explicit safety alignment.

Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers

Stroebl et al. [Princeton University]

♥ 165   LLM Inference   bycloud’s pick  

What are Inference Scaling Flaws

A popular recipe for inference-time scaling is to sample many candidate solutions from an LLM and keep one that passes a verifier, such as unit tests for code. This paper examines what happens when that verifier is imperfect: incorrect solutions can slip through the checks, so generating more samples does not reliably translate into better real performance.

Understanding Inference Scaling Flaws

The paper discusses the inference scaling techniques for LLMs and focuses on methods to improve solution generation through repeated sampling and verification. The core challenge is finding reliable ways to enhance model performance across different tasks while managing computational resources and addressing inherent model limitations.

The paper investigates various inference scaling techniques like reasoning, critique, fusion, ranking, majority voting, and oracle verification. Each method has distinct approaches: reasoning applies structured logical steps, critique involves self-evaluation, fusion combines multiple samples, and majority voting uses consensus. However, most existing methods have significant limitations, such as unreliable performance improvements, domain-dependent effectiveness, and potential introduction of undesirable outputs.
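
As a toy illustration of two of these strategies, the sketch below implements majority voting and verifier-filtered resampling. The `generate` and `verifier` callables stand in for a real model and test harness; they are assumptions, not the paper's implementation.

```python
# Toy sketch of two inference-scaling strategies: majority voting and
# verifier-filtered resampling.
from collections import Counter
from typing import Callable

def majority_vote(generate: Callable[[], str], k: int) -> str:
    """Sample k answers and return the most common one (consensus)."""
    answers = [generate() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def resample_until_verified(generate: Callable[[], str],
                            verifier: Callable[[str], bool],
                            max_samples: int) -> str | None:
    """Keep sampling until a candidate passes the (possibly imperfect) verifier."""
    for _ in range(max_samples):
        candidate = generate()
        if verifier(candidate):
            return candidate  # may be a false positive if the verifier is weak
    return None
```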

The paper argues that the verifiers used in existing benchmarks are imperfect and cannot be relied on to validate LLM outputs, particularly in domains like computer programming where unit tests are used to check candidate solutions. The researchers also tested how repeated sampling with weaker models affects generalizability, which highlighted the challenge of false positives: incorrect solutions that nonetheless pass the verification tests.
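
Here is a made-up example of why unit-test verifiers are imperfect: a buggy function that passes a small standard test set but fails a more comprehensive, HumanEval+-style suite. The task and tests are invented for illustration only.

```python
# A buggy "solution" that is a false positive under a weak verifier.
def buggy_is_sorted(xs: list[int]) -> bool:
    # Bug: only compares neighbouring pairs starting at even indices.
    return all(xs[i] <= xs[i + 1] for i in range(0, len(xs) - 1, 2))

standard_tests = [([1, 2, 3], True), ([3, 1], False)]
extended_tests = standard_tests + [([1, 3, 2], False)]  # the extra case exposes the bug

weak_verdict = all(buggy_is_sorted(xs) == expected for xs, expected in standard_tests)
oracle_verdict = all(buggy_is_sorted(xs) == expected for xs, expected in extended_tests)
print(weak_verdict, oracle_verdict)  # True, False -> accepted by the weak verifier, rejected by the oracle
```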

Results and Observations of Inference Scaling Techniques 

This paper tested inference scaling techniques across different language models, and primarily focused on the optimal number of samples and the quality of generated solutions. By generating 200 samples per task in the HumanEval benchmark, the study revealed that repeated sampling quickly reaches diminishing returns, with false positive rates increasing as more samples are generated. For example, at a cost-benefit ratio of 4, the optimal number of samples was K ≤ 5 for all tested models, and in some high-cost scenarios, attempting solutions became counterproductive.
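
To make the cost-benefit trade-off concrete, here is a toy expected-utility calculation for choosing the number of samples k under an imperfect verifier. The success rate, false positive rate, and costs below are made-up numbers, and the utility model is my own simplification rather than the paper's formulation.

```python
# Toy model: sample k candidates, accept the first one that passes an imperfect verifier.
# All rates and costs are illustrative assumptions.
p_true = 0.30          # a sample is correct and passes the verifier
p_false = 0.05         # a sample is wrong but still passes (false positive)
benefit = 4.0          # value of an accepted, correct solution
penalty = 1.0          # cost of an accepted, wrong solution
cost_per_sample = 0.5  # generation cost per sample (this simple model pays for all k)

def expected_utility(k: int) -> float:
    p_pass = p_true + p_false                 # a single sample passes the verifier
    p_any_pass = 1 - (1 - p_pass) ** k        # at least one of k samples passes
    p_correct_given_pass = p_true / p_pass    # the first accepted sample is actually correct
    value_if_accepted = p_correct_given_pass * benefit - (1 - p_correct_given_pass) * penalty
    return p_any_pass * value_if_accepted - k * cost_per_sample

best_k = max(range(1, 21), key=expected_utility)
print(best_k, round(expected_utility(best_k), 3))  # diminishing returns kick in at small k
```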

False positives tend to be lower-quality code than correct implementations.

Results showed that false positive solutions (those passing standard but not comprehensive tests) consistently showed lower code quality across multiple metrics. These false positives resulted in worse readability, less maintainable code structure, and potential issues in naming conventions and commenting, with weaker models being more susceptible to generating such low-quality solutions.

Standard unit tests for the HumanEval/30 task and one example test from the extended test suite of HumanEval+.

They also observed a bimodal distribution of task difficulty: easy tasks were solved quickly, while harder tasks increasingly produced false positives. The authors found that weaker models cannot reliably improve performance through repeated sampling if verifiers cannot effectively filter out incorrect solutions.

LLMs Do Not Think Step-by-step In Implicit Reasoning

Yu [Tsinghua University]

♥ 400   LLM Reasoning

Introduction to Implicit Reasoning in LLMs

Large language models use two ways to reason: implicit and explicit. Explicit reasoning shows each step, like a recipe. Implicit reasoning solves problems directly, without showing the steps, which is faster and uses less computing power. However, the researchers suspected that implicit reasoning might not be as reliable as explicit reasoning. To investigate this, this paper conducted experiments using a powerful 72B-parameter model, forcing it to solve arithmetic problems without showing intermediate steps.

By probing the model's hidden states across different layers, this paper uncovered something surprising: the model often arrives at correct answers, but not through a genuine step-by-step reasoning process. Instead, it seems to rely more on intuitive pattern matching and its vast accumulated experience, which can be fast but potentially unstable.

How Do LLMs Reason?

This paper created an experiment to test and understand the implicit reasoning in LLMs using multi-step arithmetic problems. The author created 2,000 sample problems with 3-5 steps, forcing the Qwen2.5-72B-Instruct model to provide direct answers without showing intermediate reasoning steps. To investigate how the model processes these problems, the author of this study recorded the hidden states of the model's last token across its layers and used a linear classifier to attempt predicting intermediate results.
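
The probing setup might look roughly like the sketch below: collect the last-token hidden state at every layer, then fit a linear probe per layer to predict an intermediate result. The model is scaled down here and the labelled problem set is assumed to exist; this is not the author's code.

```python
# Rough sketch of last-token hidden-state probing with a per-layer linear classifier.
# Model choice, dataset, and evaluation are simplified assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in for the 72B model used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_states(prompt: str) -> list[torch.Tensor]:
    """Return the last-token hidden state at every layer for a direct-answer prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return [h[0, -1].float() for h in out.hidden_states]  # one vector per layer

# problems: list of (prompt, intermediate_result_label) pairs, assumed to be prepared elsewhere.
def probe_layer(problems, layer: int) -> float:
    """Fit a linear probe on one layer's states and return its accuracy."""
    X = torch.stack([last_token_states(p)[layer] for p, _ in problems]).numpy()
    y = [label for _, label in problems]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)  # the paper would evaluate on a held-out split
```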

Analysis of these hidden states showed that the model rarely calculates or retains intermediate step results accurately. While the model successfully memorized the initial input and final answer, it struggled to demonstrate clear step-by-step reasoning. The author suspected that the model leverages its vast memory and training data to intuitively map problems to answers, rather than performing genuine sequential reasoning.

To further test the model's implicit reasoning capabilities, the author conducted additional experiments by slightly modifying the problems, reversing equation order and dividing values by 10.
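
For intuition, here is an illustrative version of those two perturbations applied to a toy chained-arithmetic problem. The problem format is an assumption; the paper's exact transformation may differ.

```python
# Illustrative perturbations: reverse the equation order, or divide the operands by 10.
import re

def reverse_equation_order(steps: list[str]) -> list[str]:
    """Present the same chained equations in reverse order."""
    return list(reversed(steps))

def divide_values_by_10(steps: list[str]) -> list[str]:
    """Replace each integer operand with its value divided by 10."""
    return [re.sub(r"\d+", lambda m: str(int(m.group()) / 10), s) for s in steps]

original = ["a = 4 + 5", "b = a - 2", "c = b * 3"]
print(reverse_equation_order(original))  # ['c = b * 3', 'b = a - 2', 'a = 4 + 5']
print(divide_values_by_10(original))     # ['a = 0.4 + 0.5', 'b = a - 0.2', 'c = b * 0.3']
```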

Evaluation on Benchmarks

These modifications dramatically reduced the model's performance in implicit reasoning mode, while explicit Chain-of-Thought reasoning maintained perfect accuracy. This suggests that implicit reasoning is less robust and fundamentally different from explicit step-by-step reasoning, as it relies more on intuition and experience than on systematic problem-solving.
