LLM with RL Self-Correcting, Chain of Thought Without Prompting, and Larger LMs Become Unreliable
#25 | Latest AI Research Explained Simply
In this issue: x5 industry news, x3 AI research papers
Sep 23rd ~ Sep 29th
🗞️ Industry News in 1 Line
♥ 21k OpenAI, the closed-source AI company, had a pretty interesting week. Mira Murati, who joined OpenAI in 2018 and served as CTO, leading ChatGPT, DALL-E, and Sora, left the company along with several other top executives. Moreover, Sam Altman is working on a plan to restructure OpenAI's core business from a non-profit into a for-profit corporation, which would give him equity worth billions of dollars.
♥ 3.8k Meta had some big releases during Meta Connect, including an update to the Llama series. Llama 3.2 includes small text-only models (1B and 3B) and medium-sized vision LLMs (11B and 90B). You can download the Llama 3.2 weights today and let us know what you think in the comments!
Benchmarks for the Llama 3.2 vision models; visit their website for the lightweight model benchmarks.
♥ 616 The Allen Institute for Artificial Intelligence has released Molmo, a family of new multi-modal AI models which outperform proprietary models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 across 11 benchmarks! Visit the Molmo playground and try it yourself.
♥ 1.3k Liquid AI, a new AI company based in Cambridge, Massachusetts, has released three Liquid foundation models which they claim beat all other AI models in every weight class (1.3B, 3B, and 40B MoE). You can try out the Liquid models at the Liquid playground or via other cloud providers such as Perplexity AI. But treat the hype with caution.
♥ 2k+ Pika Labs released Pika 1.5, an updated text-to-video model with stunning quality and a wide range of camera controls. It also offers tons of “effects” such as explosion, melting, and squishing, which you can see in its promotional video. You can try Pika on Discord.
Practice Coding While Earning with Shipd
The latest gamified coding platform that pays you to code, bounty-style. Users can create or solve coding questions and get paid when they hold the best solutions.
Shipd presents a rotating selection of questions in various programming languages, with a growing prize pool currently at $90k/month.
Here’s how it works: Datacurve.ai partners with AI companies and provides them with high-quality coding data to train better LLMs. In return, a large part of the revenue goes back to users as paid bounties. Since they pay developers, user screenings are done at sign-up to prevent spam.
Training Language Models to Self-Correct via Reinforcement Learning
Kumar et al. [Google DeepMind]
♥ 753 LLM
Introduction to Self-Correcting via Reinforcement Learning (SCoRe)
LLMs can do a lot of things, but they still can't correct themselves when they make mistakes. Currently, training a model to self-correct either requires multiple models, relies on a more capable model, or needs other forms of external supervision, and so far these methods have been largely ineffective.
This paper presents SCoRe (Self-Correction via Reinforcement Learning), a multi-turn online reinforcement learning approach that significantly improves an LLM's self-correction ability using entirely self-generated data. It uses a two-stage process with appropriate regularization to steer the learning process towards an effective self-correction strategy.
The first stage trains a model initialization that optimizes correction performance while keeping the first attempt close to the base model.
The second stage uses multi-turn RL with a reward bonus to encourage improvement from the first attempt to the second.
This approach allows the model to learn self-correction without external supervision or multiple models and achieves state-of-the-art performance on benchmarks like MATH and HumanEval.
How Does SCoRe Work in LLMs?
Until now, two popular supervised fine-tuning (SFT) approaches for self-correction in large language models (LLMs) have been STaR and Pair-SFT. This paper showed that while these SFT methods improved upon the base model's self-correction abilities, they still fell short of achieving consistently positive self-correction. The trained models often made only minor changes to their initial responses, suggesting that the fine-tuning process might be amplifying the base model's existing biases rather than teaching it to make meaningful corrections.
The SCoRe (Self-Correction via Reinforcement Learning) method is designed to address the challenges identified in the earlier supervised fine-tuning experiments. It operates in two main stages:
Stage I: Training a Model Initialization
In this stage, the goal is to create a model initialization that is less prone to collapse during subsequent reinforcement learning.
Process:
Fine-tune the base model to produce high-reward revisions at the second attempt.
Constrain the first-attempt response distribution to remain close to that of the base model using a KL-divergence penalty.
This approach forces the model to explore diverse correction strategies without changing its initial responses.
By keeping the first-attempt responses relatively static while improving second-attempt responses, the model learns to generate more informative and exploratory traces for learning.
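Concretely, the Stage I objective described above can be written (in paraphrased notation, not the paper's exact formula) as maximizing the second-attempt reward while keeping the first-attempt distribution pinned to the base model with a KL penalty:

```latex
\max_{\theta}\;
\mathbb{E}_{y_1 \sim \pi_\theta(\cdot \mid x),\; y_2 \sim \pi_\theta(\cdot \mid x, y_1)}
\big[\, r(y_2, y^{*}) \,\big]
\;-\; \beta_2 \, D_{\mathrm{KL}}\!\big(\pi_\theta(y_1 \mid x) \,\big\|\, \pi_{\mathrm{base}}(y_1 \mid x)\big)
```

Here π_base is the frozen base model, r(y2, y*) is the correctness reward on the second attempt, and β2 controls how tightly the first attempt is anchored to the base model.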
Stage II: Multi-turn Reinforcement Learning
In this stage, we aim to train the model for effective self-correction using the initialization from Stage I.
Process:
Use the model from Stage I as the starting point for reinforcement learning.
Apply a reward shaping technique that provides a large positive reward bonus for successful self-correction.
This encourages the model to learn and prioritize self-correction strategies.
SCoRe applies Stages I and II in an interleaved fashion for multiple iterations to train the model effectively. During training, the researchers used a standard KL-divergence penalty (β1) against the base model, plus a much larger penalty (β2 = 10β1) on the first-attempt distribution in Stage I, to balance exploration and exploitation (a toy sketch of the Stage II reward shaping follows the summary below).
By doing this, the SCoRe method was able to achieve two significant things:
Learning a model initialization that encourages diverse correction strategies (Stage I).
Using a reward bonus to prevent the development of non-correcting strategies during RL training (Stage II).
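To make the reward bonus of Stage II concrete, here is a minimal, hypothetical Python sketch of the shaped second-turn reward. The function names, the toy string-match correctness check, and the bonus multiplier value are illustrative assumptions, not the authors' code:

```python
# Illustrative sketch of SCoRe-style reward shaping (hypothetical, simplified).
# In the paper, the reward comes from checking the final answer or unit tests;
# here a toy exact-match check stands in for it.

def reward(response: str, answer: str) -> float:
    """Toy correctness reward: 1.0 if the response matches the reference answer."""
    return 1.0 if response.strip() == answer.strip() else 0.0

def shaped_second_turn_reward(y1: str, y2: str, answer: str,
                              alpha: float = 2.0) -> float:
    """Reward for the second attempt during multi-turn RL (Stage II).

    The base term pays for a correct second attempt; the bonus term is
    proportional to the *progress* from the first attempt to the second,
    so the policy earns extra for flipping a wrong answer to a right one
    and is penalized for "un-correcting" a right answer. alpha (the bonus
    multiplier) is a hyperparameter; the value here is illustrative.
    """
    base = reward(y2, answer)
    progress = reward(y2, answer) - reward(y1, answer)  # +1, 0, or -1
    return base + alpha * progress

# Wrong first attempt fixed on the second try -> large shaped reward.
print(shaped_second_turn_reward("42", "24", answer="24"))  # 1.0 + 2.0*1 = 3.0
# Correct first attempt "un-corrected" on the second try -> negative reward.
print(shaped_second_turn_reward("24", "42", answer="24"))  # 0.0 + 2.0*(-1) = -2.0
```

Because the bonus rewards progress rather than raw correctness, a policy that simply copies its first attempt earns no bonus, which is exactly the collapse behavior the shaping is meant to prevent.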
Results and Real-World Implications of SCoRe
We now know that there are significant limitations in using SFT methods for improving self-correction abilities in large language models.
Mode Collapse: The STaR method tends to latch onto a single mode of correction behavior which results in only minor changes to initial responses. This suggests that the model learned a limited correction strategy rather than developing a comprehensive self-correction ability.
Distribution Mismatch: While training via Pair-SFT on a more diverse dataset showed some improvements, it led to a degradation in self-correction abilities when applied to the model's own distribution of initial responses. This highlights the challenge of distribution shift between training data and real-world application.
This shows that offline supervised fine-tuning may be ineffective for teaching models to utilize additional in-context information for complex algorithmic behaviors. This ineffectiveness stems from the challenges of distribution shift in training data and the tendency to amplify certain behaviors that appear promising in training but fail to generalize.
On the other hand, the SCoRe method shows promising results in both math and code generation tasks. In math, SCoRe outperformed other methods and showed a 4.4% intrinsic self-correction gain. It also improved Accuracy@t2 by 23.0% over the base model. For code generation, SCoRe achieved 60.6% accuracy on MBPP-R, an improvement over the base model comparable to the gap between GPT-3.5 and GPT-4. It also demonstrated strong generalization to HumanEval with a 12.2% intrinsic self-correction delta.
Edit distance between first-attempt and second-attempt responses obtained from fine-tuned models.
Chain-of-Thought Reasoning Without Prompting
Wang and Zhou [Google DeepMind]
♥ 628 LLM
Introduction to Chain-of-Thought Reasoning
LLMs are often considered auto-complete on steroids! Although they can do impressive things, LLMs cannot “truly” think or reason. Until now, people have used clever prompts such as “think step-by-step” to elicit reasoning capabilities. This makes it difficult to assess the intrinsic reasoning abilities of LLMs because it introduces human bias and task-specific information.
In this paper, the researchers discovered that LLMs can demonstrate reasoning capabilities without any prompting, simply by altering the decoding process. Instead of using only the top (greedy) decoding path, they explore the top-k alternative tokens during decoding.
CoT decoding uses a different workflow: instead of just selecting the single best (greedy) response, the model selects from a list of the top-k candidate responses. When a chain-of-thought reasoning path is present among them, the model shows higher confidence in its final answer.
Selecting the best response from the top-k candidates.
How to do Chain-of-Thought Reasoning?
The CoT-decoding method can be used to elicit reasoning capabilities in LLMs without using prompts. Let’s see how it works:
Input Format: First, we need to use a standard question-answer (QA) format for input: "Q: [question]\nA:", where [question] is the actual question.
Decoding Process: Next, instead of using only greedy decoding (selecting the top token at each step), we need to explore the top-k alternative tokens at the first decoding position. By default, this paper uses k=10.
Path Exploration: After considering the top-k tokens at the first position, we can continue with greedy decoding for the rest of the sequence. This creates multiple potential response paths.
Chain-of-Thought (CoT) Detection: The researchers observed that some of these alternative paths naturally contain chain-of-thought reasoning, even without prompting. They also noticed that when a CoT reasoning path is present, the model tends to show higher confidence in its final answer.
Path Selection: Based on this confidence pattern, we can develop a method to sift through the top-k paths and select the most reliable outputs.
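Below is a rough, self-contained sketch of this decoding procedure using Hugging Face Transformers. The choice of gpt2 as a stand-in model, the prompt, the generation length, and the decision to average the top-1/top-2 probability gap over all generated tokens (rather than only the answer span, as the paper does) are simplifying assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; the paper uses much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: I have 3 apples and buy 2 more. How many apples do I have?\nA:"
inputs = tok(prompt, return_tensors="pt")

# 1) Branch on the top-k tokens at the *first* decoding position.
with torch.no_grad():
    first_logits = model(**inputs).logits[0, -1]
topk = torch.topk(torch.softmax(first_logits, dim=-1), k=10)

paths = []
for token_id in topk.indices:
    branch = torch.cat([inputs["input_ids"], token_id.view(1, 1)], dim=-1)
    # 2) Continue greedily from each branch.
    out = model.generate(branch, max_new_tokens=40, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)
    # 3) Confidence = average gap between top-1 and top-2 probabilities at
    #    each generated step (the paper restricts this to the answer tokens).
    gaps = []
    for step_scores in out.scores:
        top2 = torch.topk(torch.softmax(step_scores[0], dim=-1), k=2).values
        gaps.append((top2[0] - top2[1]).item())
    text = tok.decode(out.sequences[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
    paths.append((sum(gaps) / len(gaps), text))

# 4) Pick the path the model is most confident about.
confidence, best = max(paths)
print(f"confidence={confidence:.3f}\n{best}")
```

A model as small as gpt2 will not produce much genuine reasoning here; the point is the mechanics: branch once at the first step, decode each branch greedily, and rank whole paths by answer confidence instead of relying on a “think step-by-step” prompt.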
The following image shows the confidence level of each path during chain-of-thought decoding and how it differs from the greedy approach.
Benchmark Results for Chain-of-Thought Decoding
Here we see that CoT decoding results in significant improvements in reasoning capabilities across multiple language models. To ensure a fair comparison, all methods requiring multiple decoding paths used k=10 samples. However, it's important to note that exploring alternative decoding paths does increase computational demands.
Future research could potentially leverage CoT decoding paths to fine-tune models and further improve their reasoning abilities. The study also highlights that for more open-ended answers, using probability differences between top tokens as an indicator of model preference may be less precise.
Larger and more instructable language models become less reliable
Zhou et al. [Valencian Research Institute for Artificial Intelligence, University of Cambridge, Leverhulme Centre for the Future of Intelligence, ValGRAI]
♥ 1k LLM Interpretability
Are LLMs Reliable?
The problem is that as AI language models have become bigger and more advanced, they've also become less reliable in some ways. Even though they can do more complex tasks, they sometimes make unexpected mistakes on simple things. This is confusing for people using these AI systems because they can't always predict when the AI will make a mistake. It's especially problematic when the AI gives a wrong answer confidently instead of saying it's not sure.
In this paper, we will see how different AI models perform on tasks of varying difficulty. This will help us understand if making AI models bigger and training them with human feedback actually makes them more dependable.
Key indicators for several models in GPT (OpenAI), LLaMA (Meta) and BLOOM (BigScience) families.
Testing the Reliability of LLMs
We can test the reliability of LLMs with a systematic approach that measures their performance across different tasks and difficulty levels. Here’s how this paper does it:
Benchmark Selection: The researchers chose five diverse benchmarks to test the LLMs: addition, anagram solving, geographical knowledge (locality), science questions, and information transformations. These benchmarks were selected to cover a range of skills and real-world scenarios.
Data Collection and Generation: For each benchmark, they either collected or generated a large set of examples using various methods, such as random generation for math problems, word databases for anagrams, and existing datasets for science questions.
Difficulty Calibration: Next, they developed difficulty metrics for each benchmark, based on factors that make tasks harder for humans. They normalized these metrics to a 0-100 scale which represents the estimated probability of human failure.
Prompt Generation: For each benchmark, they created 15 different prompt templates. These prompts were designed to be natural and diverse in order to simulate real-world interactions with LLMs.
Model Selection: The study included multiple LLMs from three major families: GPT (OpenAI), LLaMA (Meta), and BLOOM (BigScience). This selection included models of various sizes and those with different levels of fine-tuning or human feedback. The researchers ran the LLMs on the benchmark tasks using consistent settings (e.g., temperature set to zero).
Response Scoring: To evaluate the vast number of model outputs, they developed automated scoring methods using algorithms and regular expressions. This allowed them to classify responses as correct, incorrect, or avoidant (a rough sketch of such a scorer is shown after this list). The results were then analyzed using several metrics, including correctness rates, prompt sensitivity, and difficulty concordance, and grouped by difficulty level to examine how model performance changed across these levels.
This allowed the researchers to systematically evaluate how LLM reliability changes with model size, training approaches, and task difficulty. Their method provides a structured way to assess and compare the performance of different LLMs across a range of tasks and difficulty levels.
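As a concrete illustration of the response-scoring step described above, here is a small hypothetical sketch for the addition benchmark. The regex patterns, the "take the last number in the output" heuristic, and the five difficulty bins are assumptions for illustration, not the paper's actual grading rules:

```python
import re

# Hypothetical regex-based grading for the addition task (illustrative only).
AVOIDANT_PATTERNS = [
    r"\bI (?:cannot|can't|am unable to)\b",
    r"\bnot sure\b",
    r"\b(?:don't|do not) know\b",
]

def grade(response: str, expected: int) -> str:
    """Classify a model response as 'correct', 'avoidant', or 'incorrect'."""
    if any(re.search(p, response, flags=re.IGNORECASE) for p in AVOIDANT_PATTERNS):
        return "avoidant"
    numbers = re.findall(r"-?\d[\d,]*", response)
    if numbers and int(numbers[-1].replace(",", "")) == expected:
        return "correct"
    return "incorrect"

def correctness_by_difficulty(results, n_bins=5):
    """Group (difficulty on a 0-100 scale, grade) pairs into bins of rising
    difficulty and report the correctness rate per bin."""
    bins = {i: [] for i in range(n_bins)}
    for difficulty, label in results:
        bins[min(int(difficulty / (100 / n_bins)), n_bins - 1)].append(label)
    return {f"{i * 100 // n_bins}-{(i + 1) * 100 // n_bins}":
            (sum(g == "correct" for g in grades) / len(grades) if grades else None)
            for i, grades in bins.items()}

# Toy usage: difficulty stands in for the paper's human-failure-probability scale.
results = [(12, grade("3 + 4 = 7", 7)),
           (55, grade("The sum is 128.", 127)),
           (90, grade("I'm not sure I can compute that.", 99881))]
print(correctness_by_difficulty(results))
```

Plotting correctness (and avoidance) against the human-calibrated difficulty scale is what exposes the "difficulty discordance" discussed below: a reliable model would be near-perfect in the easy bins before degrading on the hard ones.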
Evolution of types of supervision error versus difficulty according to human survey S2.
Evaluating the Reliability of LLMs
The above tests show some concerning trends in the development of LLMs. As these models have been scaled up (made larger and trained on more data) and shaped up (fine-tuned with human feedback), they've become more capable in many ways, but also less reliable in others.
Performance of a selection of GPT and LLaMA models with increasing difficulty.
While newer, more advanced models are better at handling difficult tasks, they still make unexpected errors on simple ones. This creates a "difficulty discordance" where there's no clear range of tasks that users can fully trust the model to handle correctly. Another worrying trend is that newer models are less likely to avoid answering when they're unsure, instead giving confidently incorrect answers more often. This behavior, termed "ultracrepidarianism," could lead users to overly trust the model's outputs.
We also saw that while newer models are generally less sensitive to how questions are phrased, there are still pockets of variability that can trip up users.
Scaling analysis of LLaMA and BLOOM families and non-instruct GPT models.