DeepSeek-R1 Explained
Plus more about Transformer² and Kimi k1.5
Jan 22nd ~ Jan 28th
#40 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 12k DeepSeek released Janus-Pro-7B, an advanced version of Janus, improving both multimodal understanding and visual generation significantly.
♥ 710 OpenAI introduces Operator, an AI agent that can use its own browser to perform tasks for you. It is available to ChatGPT Pro users on the $200 tier.
3 different releases from Qwen:
♥ 3.5k Qwen has released Qwen-2.5-1M, which can handle up to 1 million tokens of context, with weights available in 7B and 14B on Huggingface.
♥ 2.6k Qwen has also released Qwen2.5-Max, a large-scale MoE model with performance comparable to DeepSeek v3. Qwen2.5-Max's API is available on AliCloud.
♥ 2.3k Lastly, Qwen has released Qwen-2.5 VL, with weights available in 3B, 7B, and 72B on Huggingface.
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI Team
♥ 27k LLM Reasoning bycloud’s pick
Traditional methods to improve reasoning in LLMs use a supervised fine-tuning (SFT) approach with extensive labeled data, which can be time-consuming and resource-intensive. DeepSeek-R1 takes an alternative approach: leveraging large-scale reinforcement learning (RL) so that strong reasoning capabilities emerge on their own, without relying heavily on supervised data.
DeepSeek-R1-Zero uses a rule-based reward system focused on accuracy and format compliance to guide the learning process. The model learns to generate reasoning processes and final answers within specified tags, such as <think> for reasoning steps and <answer> for conclusions.
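To make this concrete, here is a minimal sketch of what such a rule-based reward could look like. The <think>/<answer> tags follow the paper, but the scoring weights and the exact-match answer check are illustrative assumptions, not DeepSeek's implementation.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: format compliance plus answer accuracy."""
    reward = 0.0

    # Format reward: the output must be one reasoning block followed by one answer block.
    match = re.match(
        r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
        completion.strip(),
        flags=re.DOTALL,
    )
    if match is None:
        return reward  # malformed output: no format reward, skip the accuracy check
    reward += 0.5

    # Accuracy reward: compare the extracted answer with the reference
    # (real math grading would use a symbolic or numeric check instead of exact match).
    if match.group(2).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 equals 4</think><answer>4</answer>", "4"))  # 1.5
```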
Through this RL-driven self-evolution, DeepSeek-R1-Zero naturally develops intriguing reasoning behaviors, including self-reflection and iterative problem-solving. It shows impressive gains on reasoning benchmarks, significantly improving its pass rates on challenging mathematics exams like the AIME 2024.
However, the model tends to produce outputs with poor readability and language mixing, which limits its usability.
DeepSeek-R1: Integrating Cold Start and Multi-Stage Training
To address these challenges, DeepSeek-R1 introduces a small amount of curated "cold-start" data into the training pipeline. This data consists of examples with detailed chains of thought generated by DeepSeek-R1-Zero, providing the model with initial guidance on producing coherent and readable outputs. The training process involves multiple stages:
1. Supervised Fine-Tuning with Cold-Start Data: The base model, DeepSeek-V3, is fine-tuned on the curated dataset to establish a foundation for reasoning patterns.
2. Reasoning-Oriented Reinforcement Learning: The model undergoes RL to enhance its reasoning capabilities further, while mitigating issues like language mixing through format rewards.
3. Rejection Sampling and Additional Fine-Tuning: High-quality responses are collected via rejection sampling for additional supervised fine-tuning, broadening the model's competence across various tasks (a minimal sketch of this step follows the list).
4. Comprehensive Reinforcement Learning: A final RL phase refines the model's alignment with human preferences, balancing reasoning prowess with helpfulness and safety.
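The rejection-sampling step (stage 3) can be pictured as a filter over sampled generations. The sketch below is a hedged illustration rather than DeepSeek's pipeline: `sample_completions` and `passes_checks` are hypothetical stand-ins for the model's sampler and the correctness/quality filters described in the paper.

```python
from typing import Callable, List, Tuple

def rejection_sample_sft_data(
    prompts: List[str],
    sample_completions: Callable[[str, int], List[str]],  # hypothetical: sample N responses per prompt
    passes_checks: Callable[[str, str], bool],             # hypothetical: correctness/quality filter
    samples_per_prompt: int = 16,
) -> List[Tuple[str, str]]:
    """Collect high-quality (prompt, response) pairs for an additional SFT round."""
    sft_pairs = []
    for prompt in prompts:
        candidates = sample_completions(prompt, samples_per_prompt)
        # Keep only responses that survive the reward/quality checks.
        kept = [c for c in candidates if passes_checks(prompt, c)]
        sft_pairs.extend((prompt, c) for c in kept)
    return sft_pairs
```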
Through this process, DeepSeek-R1 achieves remarkable improvements, surpassing previous versions and matching the performance of established models like OpenAI's o1-1217 on several reasoning tasks.
Results and Distillation to Smaller Models
DeepSeek-R1 shows impressive results across various benchmarks. It scores 79.8% on the AIME 2024 and 97.3% on MATH-500, which indicates strong mathematical reasoning abilities. In coding challenges, it achieves an Elo rating of 2029 on Codeforces, outperforming a significant majority of human participants.
An exciting development is the successful distillation of DeepSeek-R1's reasoning capabilities into smaller models based on Qwen and Llama architectures. By fine-tuning these models on reasoning data generated by DeepSeek-R1, researchers produced efficient models ranging from 1.5B to 70B parameters that maintain strong performance on reasoning tasks. This makes advanced reasoning accessible without the computational overhead of larger models.
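As a rough illustration of the distillation recipe, the snippet below turns teacher-generated reasoning traces into a standard SFT dataset for a smaller student model. The JSONL field names and the `teacher_generate` helper are assumptions for illustration, not the authors' actual pipeline.

```python
import json

def build_distillation_dataset(prompts, teacher_generate, out_path="distill_sft.jsonl"):
    """Write (prompt, teacher reasoning trace) pairs as JSONL for standard causal-LM fine-tuning.
    `teacher_generate` is a hypothetical call into the large reasoning model (the teacher)."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            trace = teacher_generate(prompt)  # full <think>...</think><answer>...</answer> output
            f.write(json.dumps({"prompt": prompt, "completion": trace}) + "\n")
```

The resulting file can then be fed to an off-the-shelf fine-tuning script for a smaller Qwen or Llama checkpoint, which mirrors how the distilled 1.5B to 70B models were produced.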
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team
♥ 1.7k LLM Reasoning
Kimi k1.5 integrates large-scale reinforcement learning (RL), similar to DeepSeek-R1, into its training process and extends the context to an impressive 128k tokens, which allows the model to reason effectively within a much larger context window during RL.
How Does Kimi k1.5 Work?
A unique aspect of Kimi k1.5 is long-context scaling. By extending the context window to 128,000 tokens, the model can engage in more elaborate reasoning processes. This extended context facilitates planning, reflection, and correction within the model's chain of thought (CoT), enabling it to explore various reasoning paths before arriving at a solution.
To manage the computational demands of such long sequences, the team implements partial rollouts, reusing portions of previous sequences, which enhances training efficiency without sacrificing performance.
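A hedged sketch of the partial-rollout idea: generation is capped by a per-iteration token budget, and unfinished trajectories are cached and resumed later instead of being regenerated from scratch. The `generate_tokens` call and the cache layout are illustrative assumptions.

```python
rollout_cache = {}  # prompt -> tokens generated so far for that prompt

def partial_rollout(prompt, generate_tokens, budget=4096, eos_token=0):
    """Continue an unfinished trajectory (or start a new one) for up to `budget` tokens.
    `generate_tokens(prompt, prefix_tokens, max_new_tokens)` is a hypothetical model call."""
    prefix = rollout_cache.get(prompt, [])
    new_tokens = generate_tokens(prompt, prefix, budget)
    tokens = prefix + new_tokens

    if new_tokens and new_tokens[-1] == eos_token:
        rollout_cache.pop(prompt, None)  # finished: the full trajectory can now be scored for RL
        return tokens, True
    rollout_cache[prompt] = tokens       # unfinished: resume from here in a later iteration
    return tokens, False
```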
Another interesting method in Kimi k1.5 is its improved policy optimization. The researchers used a variant of online mirror descent, a method that refines the model's decisions without relying on complex techniques like Monte Carlo tree search or value functions. This simplification not only streamlines the training pipeline but also proves effective in practice.
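For intuition, here is a generic online mirror descent update for a categorical policy; with a KL (negative-entropy) mirror map it reduces to an exponentiated-gradient step. This is a textbook illustration of the method family, not Kimi k1.5's exact objective.

```python
import numpy as np

def mirror_descent_step(policy: np.ndarray, advantages: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One online mirror descent step over the probability simplex.
    With the KL divergence as the mirror map, the update is multiplicative:
        pi_new(a) proportional to pi_old(a) * exp(lr * advantage(a))."""
    logits = np.log(policy) + lr * advantages
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return weights / weights.sum()

# Example: a 3-action policy nudged toward the action with the highest advantage estimate.
pi = np.array([0.4, 0.4, 0.2])
print(mirror_descent_step(pi, np.array([1.0, -0.5, 0.0])))
```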
On top of that, Kimi k1.5 is a multimodal model, jointly trained on text and vision data. This multimodality allows it to seamlessly integrate information across different sources, enhancing its reasoning capabilities in tasks that involve both language and visual inputs.
Results and Implications
On several reasoning benchmarks, it achieves near state-of-the-art performance, for instance scoring 77.5 on the AIME math competition and 96.2 on the MATH-500 dataset. These results are better than OpenAI's older o1 release and comparable with the newer one.
Another interesting thing to note is the development of long2short methods. Long-CoT models, while powerful, are computationally intensive, so the team devised techniques to transfer the strengths of long-CoT reasoning to short-CoT models. By applying strategies such as length penalties and model merging, they significantly enhance the performance of shorter models, outperforming existing ones like GPT-4o and Claude Sonnet 3.5.
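One of those strategies, the length penalty, can be illustrated as simple reward shaping that favors correct answers which are also short. The normalization and weighting below are assumptions for illustration, not the paper's exact formula.

```python
def length_penalized_reward(is_correct: bool, num_tokens: int,
                            min_len: int, max_len: int, penalty_weight: float = 0.5) -> float:
    """Shape the reward so that, among sampled responses, shorter correct answers score highest.
    `min_len`/`max_len` are the shortest and longest responses sampled for the same prompt."""
    base = 1.0 if is_correct else 0.0
    if max_len == min_len:
        return base
    # Normalized length in [0, 1]: 0 for the shortest sampled response, 1 for the longest.
    norm_len = (num_tokens - min_len) / (max_len - min_len)
    return base - penalty_weight * norm_len
```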
Transformer-Squared: Self-adaptive LLMs
Sakana AI
♥ 2.4k LLM Memory
LLMs have revolutionized natural language processing, but they often face challenges when adapting to new and diverse tasks efficiently. Typical fine-tuning methods are computationally intensive and lack the flexibility to handle unseen tasks in real-time.
Transformer² (Transformer-Squared) is a new approach, proposed by Sakana AI, for solving this problem. It introduces a self-adaptation framework that enables LLMs to adjust dynamically to new tasks by selectively fine-tuning only the singular components of their weight matrices. This approach allows models to adapt on the fly without extensive retraining, significantly enhancing their versatility.
How Does Transformer² Work?
Transformer² uses Singular Value Fine-tuning (SVF): instead of modifying large portions of the model's parameters, SVF focuses on the singular values within the weight matrices. By adjusting these singular values, the model can amplify or diminish specific features, effectively tailoring its behavior to the task at hand while keeping the overall structure intact.
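Conceptually, SVF decomposes each weight matrix with an SVD and learns only a per-singular-value scaling vector. Here is a minimal NumPy sketch of that operation; the scaling vector `z` is hard-coded just to show the shapes, whereas in the paper it is trained with RL per task.

```python
import numpy as np

def svf_adapt(W: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Singular Value Fine-tuning, conceptually: W' = U diag(sigma * z) V^T.
    Only `z` (one scalar per singular value) is learned; U, sigma, and V stay frozen."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(sigma * z) @ Vt

# Toy example: a 4x3 "weight matrix" adapted by a scaling vector z.
W = np.random.randn(4, 3)
z = np.array([1.2, 0.8, 1.0])
W_adapted = svf_adapt(W, z)
print(W_adapted.shape)  # (4, 3)
```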
During inference, Transformer² employs a two-pass mechanism:
1. Task Identification: In the first pass, the model assesses the incoming prompt to determine its properties. This is achieved through one of three strategies:
- Prompt Engineering: Constructing an adaptation prompt that categorizes the task.
- Classification Expert: Using a specialized component trained to identify the task type.
- Few-Shot Adaptation: Leveraging a few examples to adjust the model dynamically using strategies like the Cross-Entropy Method (CEM).
2. Dynamic Adaptation: In the second pass, the model dynamically mixes pre-trained "expert" vectors corresponding to different tasks. These vectors were obtained using SVF with reinforcement learning to specialize in specific domains. By combining them appropriately, the model adjusts its weights to perform the task effectively.
This self-adaptive process allows Transformer² to modify its behavior in real-time, enhancing performance without the need for extensive computational resources.
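The few-shot variant of the first pass can be sketched as a small Cross-Entropy Method (CEM) loop that searches for mixing coefficients over the pre-trained expert z-vectors. In this sketch, `evaluate_on_few_shot` is a hypothetical scoring function over the few-shot examples, and the population sizes are arbitrary choices rather than the paper's settings.

```python
import numpy as np

def cem_mix_experts(expert_zs, evaluate_on_few_shot, iters=10, pop=32, elite_frac=0.25):
    """Search for mixing coefficients alpha over expert z-vectors with the Cross-Entropy Method.
    `expert_zs` has shape (num_experts, num_singular_values); `evaluate_on_few_shot(z)`
    returns a score (e.g. accuracy) for the adapted model on the few-shot examples."""
    num_experts = expert_zs.shape[0]
    mean = np.ones(num_experts) / num_experts
    std = np.ones(num_experts) * 0.5
    n_elite = max(1, int(pop * elite_frac))

    for _ in range(iters):
        alphas = np.random.randn(pop, num_experts) * std + mean   # sample candidate mixtures
        scores = np.array([evaluate_on_few_shot(a @ expert_zs) for a in alphas])
        elite = alphas[np.argsort(scores)[-n_elite:]]             # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the sampling distribution

    return mean  # final mixing coefficients; mean @ expert_zs gives the adapted z-vector
```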
Results and Evaluation of Transformer²
Transformer² shows improvements over traditional fine-tuning methods such as LoRA. Experiments across various models like LLAMA3 and MISTRAL showed that SVF could achieve better performance with fewer parameters. For instance:
- Efficiency: SVF required orders of magnitude fewer parameters compared to LoRA, reducing computational demands while avoiding overfitting.
- Adaptability: When tested on unseen tasks, Transformer²'s adaptation strategies consistently improved performance. The few-shot adaptation, in particular, provided notable gains by utilizing additional task information during inference.
- Versatility: The framework proved effective across different architectures and even extended to vision-language tasks without retraining the model from scratch.
🚨This week's top AI/ML research papers:
- DeepSeek-R1
- Kimi k1.5
- UI-TARS
- Can We Generate Images with CoT?
- Physics of Skill Learning
- Test-time regression
- SRMT
- Scaling Laws for Optimal Sparsity for MoE LMs
- Distillation Quantification for LLMs
- Autonomy-of-Experts…