
Large Language Models Think Too Fast To Explore Effectively

Plus more about Supervised Fine-Tuning (SFT) vs Reinforcement Learning (RL), and Janus-Pro

Jan 27th ~ Feb 2nd
#41 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 5.2k Mistral AI has launched Mistral Small 3, a 24B-parameter open-source model that competes with larger models like Llama 3.3 70B while offering superior latency. The model is available under Apache 2.0 license across multiple platforms and shows strong performance in conversational assistance, function calling, and potential domain-specific fine-tuning.

  2. ♥ 1.5k Allen AI has launched Tülu 3 405B, an open-source post-trained model that matches GPT-4o's performance and surpasses previous open-weight models of its size, using their Reinforcement Learning from Verifiable Rewards (RLVR) approach. It shows significant improvements in MATH performance at scale and is now available on Hugging Face and the Ai2 Playground.

  3. ♥ 11k OpenAI has launched "Deep Research," an advanced agent powered by OpenAI o3 that can autonomously find, analyze, and synthesize hundreds of online sources into comprehensive reports within tens of minutes. It is initially rolling out to Pro users, and they can use it for intensive knowledge work in fields like finance, science, policy, and engineering, as it has web browsing and Python analysis capabilities.

  4. ♥ 13k OpenAI has released o3-mini in ChatGPT and the API, a powerful reasoning model that is particularly strong in science, math, and coding. Pro users will have unlimited access to the model, while Plus and Team users will receive triple the rate limits compared to o1-mini, with a 'high-intelligence' version also available.

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!

Large Language Models Think Too Fast To Explore Effectively

Pan et al. [Georgia Institute of Technology]

♥ 485   LLM Reasoning

Do We Know if LLMs Think?

LLMs can generate text, but we still don’t know how well they perform in open-ended tasks, particularly because most evaluations have focused narrowly on intelligence benchmarks rather than on their ability to discover new information adaptively. This paper tries to answer this question using Little Alchemy 2.

In this paper, researchers quantitatively analyze both uncertainty-driven and empowerment-based exploration strategies of LLMs compared to humans. They further augment their investigation by utilizing Sparse Autoencoders (SAE) to decode the neural representations within transformer blocks. 

Testing How LLMs Explore

The researchers used the following approach to investigate LLM exploration capabilities:

The primary experimental framework used Little Alchemy 2, a game where agents combine four basic elements (water, fire, earth, and air) to discover new elements from a possibility space of 720 items. This environment was specifically chosen for its deterministic rules and semantic complexity, as only 3,452 out of 259,560 possible combinations are valid, requiring deep understanding for effective exploration. To establish performance benchmarks, they analyzed data from 29,493 human participants across nearly 5 million trials, comparing this against four different LLMs (including gpt-4o and o1) under various temperature settings to examine the impact of randomness on exploration strategies.
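To make the setup concrete, here is a minimal sketch of how such a combination environment can be represented; the recipes below are made-up placeholders for illustration, not the game's actual data.

```python
# Minimal sketch of a Little Alchemy 2-style environment (hypothetical recipes,
# not the game's real data). Agents pick two elements from their inventory;
# valid pairs yield new elements, invalid pairs yield nothing.
RECIPES = {
    frozenset({"water", "fire"}): "steam",   # placeholder recipe
    frozenset({"earth", "water"}): "mud",    # placeholder recipe
    frozenset({"air", "fire"}): "energy",    # placeholder recipe
}

def step(inventory: set[str], a: str, b: str) -> set[str]:
    """Try combining elements a and b; add the result to the inventory if valid."""
    result = RECIPES.get(frozenset({a, b}))
    if result is not None:
        inventory = inventory | {result}
    return inventory

inventory = {"water", "fire", "earth", "air"}
inventory = step(inventory, "water", "fire")   # discovers "steam" in this sketch
print(inventory)
```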

Human and LLM performance across different temperature settings.

The researchers then implemented a dual-metric analysis system to quantify exploration behaviors. They measured empowerment (an element's potential to create successful future combinations) using a neural network model that calculates probability-weighted outcomes and updates dynamically based on trial results. In addition to this, they tracked uncertainty-driven exploration through a logarithmic function of trial counts, which essentially measured how frequently elements were used. These metrics were then analyzed using generalized linear mixed-effects models (GLMMs) to understand how different factors influenced exploration strategies.
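For intuition, here is a minimal sketch of the two behavioural signals, assuming uncertainty decays with the log of how often an element has been tried and empowerment is a probability-weighted count of the successful future combinations an element enables; the function forms and weights are illustrative, not the authors' exact model.

```python
import math

def uncertainty(trial_counts: dict[str, int], element: str) -> float:
    """Uncertainty-driven exploration signal: decays logarithmically with usage.
    Sketch of the paper's 'log of trial counts' idea; the exact form may differ."""
    return 1.0 / math.log(trial_counts.get(element, 0) + 2)

def empowerment(element: str, success_prob: dict[tuple[str, str], float],
                inventory: set[str]) -> float:
    """Empowerment proxy: expected number of successful future combinations
    this element enables, weighted by estimated success probabilities."""
    return sum(success_prob.get((element, other), 0.0) for other in inventory)

def choice_score(element, trial_counts, success_prob, inventory,
                 w_unc=1.0, w_emp=1.0):
    """Toy score for choosing the next element to combine (illustrative weights)."""
    return (w_unc * uncertainty(trial_counts, element)
            + w_emp * empowerment(element, success_prob, inventory))
```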

To decode the neural mechanisms underlying these behaviors, they used Sparse Autoencoders (SAE) with L2 regularization to analyze the latent representations within the LLMs' transformer blocks. This technique allowed them to correlate specific neurons with cognitive variables like choice, uncertainty, and empowerment. 
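A rough sketch of this kind of probe in PyTorch is shown below; the layer sizes and penalty weight are placeholders rather than the paper's settings, and the L2 penalty on the latents follows the description above.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of an SAE over transformer hidden states (sizes are placeholders)."""
    def __init__(self, d_model: int = 4096, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))      # sparse latent features
        return self.decoder(z), z

sae = SparseAutoencoder()
h = torch.randn(32, 4096)                    # batch of residual-stream activations
recon, z = sae(h)
# Reconstruction loss plus an L2 penalty on the latents, as described in the text
# (the penalty weight 1e-3 is an assumption, not the paper's value).
loss = nn.functional.mse_loss(recon, h) + 1e-3 * z.pow(2).mean()
loss.backward()
```

The learned latent units can then be correlated with behavioural variables such as choice, uncertainty, and empowerment to locate where in the network these quantities are represented.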

SAE Correlation Analysis, LLaMA3.1-70B Intervention Regression Results, and LLaMA3.1-70B Average Inventory of Interventions.

Results and Evaluations

The experiment exposed critical limitations in LLMs’ exploration capabilities. Most LLMs process uncertainty values in early transformer layers, which leads to premature decision-making and restricted exploration. However, OpenAI’s o1 model uniquely outperforms humans in the Little Alchemy 2 task. While models like LLaMA3.1-70B technically represented empowerment values in later layers, they failed to effectively leverage them, unlike o1's balanced approach.

The study reveals a fundamental architectural constraint in current LLMs: their inability to dynamically integrate uncertainty and potential future outcomes. This suggests that current state-of-the-art LLMs are not yet exploration-capable AI systems that can genuinely discover and innovate in open-ended tasks.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen et al. [DeepSeek-AI]

♥ 22k   Image Generation  

Introduction to Janus-Pro

Multimodal AI models often struggle because there is an inherent conflict between visual understanding and generation tasks: the two tasks require different types of visual representations, yet models typically use the same visual encoder for both. Additionally, existing models like the original Janus have struggled with unstable text-to-image generation quality and poor performance on short prompts due to limited training data and model capacity.

Janus-Pro addresses these challenges through a novel decoupled visual encoding approach, using separate encoders for understanding and generation tasks, while also incorporating three key improvements: optimized training strategies, expanded training data, and scaling to larger model sizes (up to 7B parameters). 

How Does Janus-Pro Work?

The Janus-Pro model uses two separate encoders for understanding and generating images, rather than trying to force a single encoder to handle both tasks. For understanding images, it uses a SigLIP encoder that converts images into rich semantic features, while for generating images, it uses a VQ tokenizer that turns images into discrete tokens.

These different types of visual information are then adapted to match the language model's input format before being processed. Additionally, it improves upon its predecessor through three key optimizations.

  • First, it uses a smarter training strategy that spends more time in the initial stage learning basic image generation from simple category names, then focuses directly on complex text-to-image generation without wasting compute on intermediate steps.

  • Second, it scales up the training data significantly, adding 90 million new samples for understanding tasks and 72 million high-quality synthetic images for generation, with a careful 1:1 balance between real and synthetic data.

  • Third, it proves that the architecture works well when scaled up to larger models - the 7B parameter version shows faster convergence and better performance than the 1.5B version across all tasks.

The genius of this design is that it maintains a single unified transformer backbone while using specialized encoders for different visual tasks. This means the model can leverage the power of large language models for reasoning and instruction-following, while still having optimal visual processing for each specific task. The separate encoders essentially "speak the right language" for understanding versus generating images, but the core language model can work with both types of input seamlessly.

What makes this architecture particularly scalable is that most of the complexity is handled in the initial encoding stage - once the visual information is converted into the right format, the rest of the model can treat everything as a sequence of tokens, regardless of whether it's processing text, analyzing images, or generating new visuals.
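Here is a minimal PyTorch sketch of that decoupling; the module shapes are placeholders that stand in for the real SigLIP encoder, VQ tokenizer, and autoregressive language model backbone.

```python
import torch
import torch.nn as nn

class DecoupledMultimodalModel(nn.Module):
    """Sketch of Janus-Pro-style decoupled visual encoding: one encoder path for
    understanding, one tokenizer path for generation, a single shared backbone.
    All sizes and submodules are illustrative stand-ins."""
    def __init__(self, d_model=2048, vq_codebook=16384):
        super().__init__()
        self.understanding_proj = nn.Linear(1152, d_model)       # stands in for SigLIP features -> LM space
        self.generation_embed = nn.Embedding(vq_codebook, d_model)  # stands in for discrete VQ image tokens
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)                                        # stands in for the unified LM

    def understand(self, image_features, text_embeds):
        vis = self.understanding_proj(image_features)            # rich semantic features
        return self.backbone(torch.cat([vis, text_embeds], dim=1))

    def generate(self, prompt_embeds, image_token_ids):
        img = self.generation_embed(image_token_ids)             # discrete image tokens
        return self.backbone(torch.cat([prompt_embeds, img], dim=1))

model = DecoupledMultimodalModel()
feats = torch.randn(1, 196, 1152)        # fake SigLIP-like patch features
text = torch.randn(1, 16, 2048)          # fake text embeddings
out = model.understand(feats, text)
```

Once both encoders project into the same token space, the backbone processes everything as one sequence, which is what makes the design scalable.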

Evaluating Janus-Pro Benchmarks

Janus-Pro shows remarkable improvements across both multimodal understanding and text-to-image generation tasks. It achieves state-of-the-art performance on GenEval with an 80% overall accuracy, which significantly outperforms established models like DALL-E3 (67%) and SD3-Medium (74%), while also scoring an impressive 84.19 on DPG-Bench, surpassing all other methods.

These results show that the model's decoupled visual encoding approach works well. However, the model still faces some limitations, particularly in its fixed input resolution of 384 x 384, which constrains its performance on fine-grained tasks like OCR and affects the detail level in generated images, especially in small facial regions.

Performance on DPG-Bench.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu et al. [HKU, UC Berkeley, Google DeepMind, NYU]

♥ 424   LLM Training   bycloud’s pick  

Supervised Fine-Tuning (SFT) vs Reinforcement Learning (RL)

We like LLMs because they work like magic, but we still don’t understand how different training techniques affect AI models' ability to learn and apply knowledge broadly. Currently, researchers extensively use both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to improve foundation models, but we don't clearly understand how these methods impact a model's ability to generalize versus simply memorize training data.

This paper tries to answer this question by comparing SFT and RL across both text-based and visual tasks. They created GeneralPoints, a card game requiring arithmetic reasoning, and used V-IRL, a real-world navigation environment, to test how models handle new variations of learned tasks.

Through these experiments, they discovered that RL, particularly when using outcome-based rewards, helps models develop genuine understanding that transfers to new situations. In contrast, SFT tends to just memorize training examples without developing deeper comprehension.

Comparing Supervised Fine-Tuning and Reinforcement Learning 

The researchers built a technical framework for implementing and evaluating their comparison of SFT and RL approaches. They grounded their study in standard reinforcement learning concepts but adapted them specifically for language and vision-language models.

A comparative study of RL and SFT on the visual navigation environment V-IRL for OOD generalization. OOD curves represent performance on the same task, using a different textual action space.

The framework uses a multi-turn RL setting where models can revise their answers sequentially. The system maintains a state space (which includes text inputs for language models, or both text and images for vision-language models) and an action space (the possible outputs the model can generate). A verifier component evaluates each model output, providing both a numerical reward and textual feedback.
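A minimal sketch of this verifier-in-the-loop setup might look like the following; the `model` and `verifier` callables and the reward handling are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Callable, Tuple

def multi_turn_episode(model: Callable[[str], str],
                       verifier: Callable[[str], Tuple[float, str]],
                       prompt: str, max_turns: int = 5):
    """Run one multi-turn episode: the model answers, the verifier returns a
    numerical reward plus textual feedback, and the feedback is appended to the
    state so the model can revise its answer on the next turn."""
    state, trajectory = prompt, []
    for _ in range(max_turns):
        answer = model(state)
        reward, feedback = verifier(answer)
        trajectory.append((state, answer, reward))
        if reward > 0:                     # treat positive reward as success
            break
        state = state + "\nVerifier feedback: " + feedback  # revise next turn
    return trajectory
```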

They created two specific environments to test their ideas:

  1. GeneralPoints: A card game where models must use four cards to compute a target number (usually 24). This tests arithmetic reasoning and can be presented either as text or images.

  2. V-IRL: A real-world navigation task testing spatial reasoning.

To evaluate generalization, they designed variations in both rules and visual presentation:

  • Rule variations: Changed how face cards (J, Q, K) are interpreted numerically

  • Visual variations: Modified card colors while keeping the underlying task the same

This setup allowed them to systematically test whether models were truly learning transferable skills or just memorizing specific examples, across both textual and visual domains.
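To make the rule-variation idea concrete, here is a toy verifier for a GeneralPoints-style task; the face-card mappings, rule names, and checking logic are a simplified sketch rather than the authors' code.

```python
import re

def card_values(cards, face_rule="all_ten"):
    """Map card symbols to numbers under a given rule variation
    ('all_ten': J/Q/K count as 10; 'ordered': J/Q/K count as 11/12/13).
    The rule names are placeholders for the paper's rule variants."""
    faces = {"all_ten": {"J": 10, "Q": 10, "K": 10},
             "ordered": {"J": 11, "Q": 12, "K": 13}}[face_rule]
    return [faces[c] if c in faces else int(c) for c in cards]

def verify(cards, expression, target=24, face_rule="all_ten"):
    """Check that the expression uses each dealt card value exactly once
    and evaluates to the target. Returns (reward, feedback)."""
    values = sorted(card_values(cards, face_rule))
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != values:
        return 0.0, f"Use each card exactly once: {values}"
    result = eval(expression)               # toy check; fine for a sketch
    if result != target:
        return 0.0, f"Expression evaluates to {result}, not {target}"
    return 1.0, "Correct."

# Under 'all_ten', J and K both count as 10, so 4 * 6 * (10 / 10) = 24.
print(verify(["4", "6", "J", "K"], "4 * 6 * (10 / 10)", face_rule="all_ten"))
```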

Supervised Fine-Tuning or Reinforcement Learning: Which One is Better?

The researchers discovered that there is a clear difference between RL and SFT in their ability to help AI models learn transferable skills. RL consistently demonstrated superior performance in developing genuine understanding that could be applied to new situations, while SFT primarily resulted in memorization of training examples. This pattern held true across both arithmetic reasoning tasks in GeneralPoints and spatial reasoning in V-IRL.

Example of applying sequential verification-revision on GeneralPoints.

Comparison of out-of-distribution performance under visual variants.

The research team identified some important limitations in their study. SFT performed poorly on visual card recognition tasks (GP-VL) despite multiple attempts at optimization, possibly because it focused too heavily on reasoning tokens while neglecting visual recognition. They also found that RL struggled when starting from extremely underfit or overfit models, which highlights that the initial model state significantly impacts RL's effectiveness. Finally, they discovered that SFT plays a crucial supporting role by stabilizing output formats, which enables RL to achieve its performance gains; this suggests that strategically combining the two approaches might yield better results than using either method alone.

In-distribution vs. OOD performance growth on GP-L.
