
DeepSeek Just Added Parameters Where There Were None...

And more about Recursive Language Models, LongCat ZigZag Attention, and LoRA RL

Dec 31st ~ Jan 6th
#89 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 7.2k NVIDIA has announced Alpamayo, a new “thinking, reasoning” autonomous-vehicle AI. The first rollout is slated to reach U.S. roads in Q1 2026, starting with the all-new Mercedes-Benz CLA. NVIDIA’s first model, Alpamayo 1 (10B parameters), uses video to generate driving trajectories and reasoning traces, now available on HuggingFace.

  2. ♥ 17k Boston Dynamics has released a new video of its upgraded next-gen humanoid robot, Atlas, now fully electric with a 4-hour swappable battery for continuous operation. Atlas stands 6'2", weighs 198 lbs, has 56 degrees of freedom, can lift 110 lbs (66 lbs sustained), and reach 7.5 ft, using tactile, reconfigurable hands to adapt grip in real time.

  3. ♥ 1k Liquid AI has released LFM2.5, its most capable family of tiny on-device foundation models (~1B class). Built on the LFM2 hybrid, device-optimized architecture, LFM2.5 scales pretraining from 10T → 28T tokens and expands RL post-training to improve instruction following. The initial open-weight lineup (including 1.2B Base/Instruct, plus vision-language and native audio-language variants) is available on HuggingFace.

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!

Recursive Language Models

Zhang et al. [MIT CSAIL]

♥ 2k   LM Context  

LLMs have made significant progress and can handle many complex tasks, but they struggle when asked to process massive amounts of information in a single context window, like trying to read a library of books simultaneously without losing the plot. Even the most advanced models suffer from a phenomenon known as "context rot," where their reasoning ability degrades as the context fills up over time.

A Recursive Language Model (RLM) treats prompts as part of the environment.

This paper asks whether AI can tackle long-horizon tasks involving millions of words without needing a bigger brain, relying instead on a smarter way to manage information.

The team introduced a concept called Recursive Language Models (RLMs). Instead of forcing a neural network to ingest a massive document all at once, this approach treats the text as an external part of the environment, much like a reference book sitting on a desk rather than a memory in one's head.

Performance comparison of different methods across long-context benchmarks of varying complexity.

The AI effectively acts as a programmer, writing code to peek into specific parts of the text, break complex problems into smaller chunks, and recursively call upon copies of itself to analyze those snippets. This strategy allowed models to successfully handle inputs up to two orders of magnitude larger than their designed limits.
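The flow is easiest to see as a divide-and-conquer program. Here is a minimal sketch of the recursive loop, where `llm()` is a hypothetical stand-in for any fixed-context model API; note that in the actual paper the model writes this kind of decomposition code itself inside a REPL environment, rather than following a hard-coded strategy like this one:

```python
# Minimal sketch of a Recursive Language Model loop. `llm()` is a
# hypothetical stand-in for a fixed-context model API; in the paper,
# the model authors this decomposition logic on the fly.

def llm(prompt: str) -> str:
    """Hypothetical call to a fixed-context language model."""
    raise NotImplementedError  # replace with a real API call

def rlm(query: str, document: str, chunk_size: int = 50_000) -> str:
    # Small enough to fit: answer directly with the text in context.
    if len(document) <= chunk_size:
        return llm(f"Context:\n{document}\n\nQuestion: {query}")

    # Too large: split the document and recursively query sub-instances,
    # so no single call ever sees more than `chunk_size` characters.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partial = [rlm(query, chunk, chunk_size) for chunk in chunks]

    # Synthesize the per-chunk findings into one final answer.
    notes = "\n".join(f"- {p}" for p in partial)
    return llm(f"Sub-answers:\n{notes}\n\nCombine these to answer: {query}")
```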

On complex reasoning tasks, this method dramatically outperformed standard models and summarization techniques, while maintaining high accuracy even as the information load grew immense.

mHC: Manifold-Constrained Hyper-Connections

Xie et al. [DeepSeek AI]

♥ 3k   Residual Connection   bycloud’s pick  

Transformers lean heavily on residual connections because they keep information and gradients flowing cleanly through many layers. Hyper-Connections (HC) try to push this idea further by widening the residual stream into multiple parallel “streams” and learning how to mix them, so the model can exchange information across depth without increasing the core layer FLOPs much.

The problem is that the more freedom HC gives those cross-stream mixing matrices, the less it behaves like an identity path. When you stack many layers, the product of these unconstrained residual mixing matrices can amplify or shrink signals unpredictably, which shows up as training instability in large runs. In their 27B setup, the paper reports a loss surge for HC around 12k steps and extremely large composite “gain magnitudes” that can peak around 3000, a sign of exploding residual dynamics.
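To get intuition for why unconstrained mixing drifts, compose many near-identity random matrices and watch the composite gain wander away from 1. This is a toy illustration with made-up sizes, not the paper's 27B setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_streams, n_layers = 4, 60  # illustrative sizes only

# Compose unconstrained, near-identity mixing matrices across depth.
composite = np.eye(n_streams)
for _ in range(n_layers):
    mix = np.eye(n_streams) + 0.1 * rng.normal(size=(n_streams, n_streams))
    composite = mix @ composite

# The composite gain drifts away from 1, amplifying or attenuating
# signals unpredictably; this is the failure mode mHC is built to remove.
print(np.abs(composite).max())
```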

Their fix is Manifold-Constrained Hyper-Connections (mHC). Instead of letting the residual mixing matrix be anything, they project it onto the manifold of doubly stochastic matrices, meaning all entries are non-negative and every row and column sums to 1. That keeps the residual pathway closer to a stable identity-like behavior while still allowing streams to mix, since each stream becomes a convex combination of the others rather than an arbitrary linear remapping.

Illustrations of Residual Connection Paradigms.

Practically, they build this projection with the Sinkhorn-Knopp algorithm, running a limited number of iterations (they use 20) to turn an unconstrained matrix into an approximately doubly stochastic one. Because doubly stochastic matrices stay doubly stochastic under multiplication, the stability property should persist even when you multiply many layers’ residual mappings together, which is exactly where HC would tend to drift.
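Here is a minimal sketch of that projection, assuming the standard exponentiate-then-normalize form of Sinkhorn-Knopp (the paper's exact parameterization may differ). The final two prints also demonstrate the closure property: a product of doubly stochastic matrices is still doubly stochastic.

```python
import numpy as np

def sinkhorn_knopp(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Approximately project a matrix onto the doubly stochastic
    manifold by alternating row and column normalization."""
    m = np.exp(logits)  # ensures all entries are positive
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)  # each row sums to ~1
        m /= m.sum(axis=0, keepdims=True)  # each column sums to 1
    return m

rng = np.random.default_rng(0)
H1 = sinkhorn_knopp(rng.normal(size=(4, 4)))  # expansion rate n = 4
H2 = sinkhorn_knopp(rng.normal(size=(4, 4)))

# Closure under multiplication: the composite residual mapping across
# layers stays (approximately) doubly stochastic, so gains stay bounded.
print((H1 @ H2).sum(axis=0))  # ~[1, 1, 1, 1]
print((H1 @ H2).sum(axis=1))  # ~[1, 1, 1, 1]
```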

Visualizations of Learnable Mappings.

They also treat systems cost as part of the method. Widening the residual stream increases memory traffic and activation storage, so they add fused kernels, mixed precision kernels, and selective recomputation, and adjust pipeline overlap. With expansion rate n = 4, they report only about a 6.7% training time overhead after these optimizations.

Training Stability of Manifold-Constrained Hyper-Connections (mHC).

On results, mHC appears to keep HC’s accuracy benefits while avoiding its instability. In the 27B run, mHC reaches a final training loss reduction of 0.021 versus the baseline and keeps gradient norms closer to baseline behavior.

mHC benchmark against baseline

On downstream benchmarks, mHC beats the baseline across the board and usually edges out HC too, for example improving BBH and DROP relative to HC by about 2.1 and 2.3 points, respectively. On the stability metrics, the composite gain that could hit ~3000 in HC stays bounded around ~1.6 in mHC, which matches the paper’s story that constraining the residual topology can make this kind of widened residual stream scale more safely.

Efficient Context Scaling with LongCat ZigZag Attention

Zhang et al. [Meituan]

♥ 211   LLM Attention  

There is a bottleneck in how AI processes information: as models try to "read" longer documents (entire books, legal archives, or massive codebases), the computational cost skyrockets, because the system traditionally pays equal attention to every single connection between every pair of words, a cost that grows quadratically with input length.

The research team sought a way to break this inefficient cycle, aiming to create a model that can handle up to one million tokens of context without the computational weight that comes with it.

The illustration of LongCat ZigZag Attention (LoZA), which involves first calibration and then training for realizing the sparsity.

The team developed a method called LongCat ZigZag Attention (LoZA), which effectively teaches the model how to "skim" intelligently without missing the details. Through a careful calibration step, the researchers identified which layers of the network were doing the most important work and which could be optimized.

The efficiency of LoZA. The relative cost and speed-up are practically measured on H20 clusters.

This revealed a streamlined structure hidden inside the larger model, letting them convert about half of the attention mechanisms to a "sparse" mode that focuses only on essential information. By retraining the model mid-process after this switch, they locked in significant speed improvements while maintaining the same high level of intelligence and accuracy as the heavier, original model. The general shape of the idea is sketched below.
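LoZA's calibration selects which layers to sparsify empirically, and its exact sparse pattern is its own design; the sketch below only illustrates the general shape of the approach, assuming a simple alternating full/sparse schedule and a sliding-window mask as the sparse mode:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: token i attends to at most the
    `window` most recent tokens, so cost grows as O(n*w), not O(n^2)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Hypothetical schedule: roughly half the layers switch to sparse mode,
# interleaved with full-attention layers that keep global reach.
n_layers, seq_len, window = 24, 8, 4
schedule = ["sparse" if layer % 2 else "full" for layer in range(n_layers)]
print(schedule.count("sparse"), "of", n_layers, "layers sparsified")
print(sliding_window_mask(seq_len, window).astype(int))
```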

The effectiveness of LongCat-Flash-Exp-Chat across different context lengths on MRCR.

Evaluating Parameter Efficient Methods for RLVR

Yin et al. [Zhejiang University, HKUST, WUST, USTC, Brown University, Hong Kong Polytechnic University, INSAIT]

♥ 433   LLM RLVR  

As artificial intelligence moves from simply predicting the next word to solving complex mathematical problems, the training process has evolved. Researchers are increasingly relying on Reinforcement Learning with Verifiable Rewards (RLVR), a method where models improve by receiving a simple "correct" or "incorrect" signal on their reasoning. While this approach is powerful, retraining an entire massive model is incredibly expensive.

To save cost and time, the industry has largely settled on a specific efficiency shortcut known as LoRA (Low-Rank Adaptation). However, is this tool that everyone uses actually the best one for this specific type of learning, or are we leaving performance on the table by ignoring better alternatives?

The team discovered that the industry standard is suboptimal for reinforcement learning. By testing over a dozen different efficiency methods, they found that newer "structural" variants (approaches that change how weight updates are structured rather than just adding a simple adapter) consistently outperformed the default method. In some cases, these structural variants even surpassed the performance of full-parameter training, which is typically considered the gold standard.

A variety of PEFT methods are listed, each with its specific update formulation and initialization strategy. LN denotes Layernorm.

The study also revealed a fascinating mismatch in how models learn. Some advanced methods try to initialize training by focusing on the model's "loudest" or most significant existing features. The researchers found this causes the training to collapse because reinforcement learning actually thrives by tweaking the quieter, less dominant parts of the network.
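In linear-algebra terms, the "loudest" features of a weight matrix are plausibly its top singular directions; the hypothetical sketch below contrasts the two opposing initialization choices (the specific methods the paper evaluates may define significance differently):

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(64, 64))  # stands in for a frozen pretrained weight
r = 8                           # adapter rank

U, S, Vt = np.linalg.svd(W0)

# "Loud" init: the adapter spans the top-r singular directions, which
# carry the most energy; the study links this choice to collapse.
B_top = U[:, :r] * np.sqrt(S[:r])
A_top = np.sqrt(S[:r])[:, None] * Vt[:r]

# "Quiet" init: the adapter spans the bottom-r singular directions,
# the less dominant subspace that RL updates appear to favor.
B_bot = U[:, -r:] * np.sqrt(S[-r:])
A_bot = np.sqrt(S[-r:])[:, None] * Vt[-r:]

# The loud subspace carries far more of W0's energy than the quiet one.
print(np.linalg.norm(B_top @ A_top), np.linalg.norm(B_bot @ A_bot))
```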

Additionally, the team identified a strict limit to efficiency. While it is possible to freeze large portions of a model, extreme compression bottlenecks the system: to learn complex reasoning, the model needs a minimum amount of "plasticity," or trainable parameters, without which its ability to improve stalls completely.

Comparison of accuracy and pass scores (all values are reported in percentages).

By moving away from the default adoption of standard LoRA and using structural variants like DoRA, engineers can build models that are not only computationally cheaper but also significantly smarter at math and logic.
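For reference, DoRA decomposes each adapted weight into a direction, steered by a LoRA-style low-rank update, and a learned per-column magnitude. Below is a minimal NumPy sketch with illustrative shapes (the full method also covers gradient flow and weight merging, which this omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8

W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
B = np.zeros((d_out, r))                 # trainable, zero-init (LoRA-style)
A = 0.01 * rng.normal(size=(r, d_in))    # trainable low-rank factor
m = np.linalg.norm(W0, axis=0)           # trainable per-column magnitude

def dora_weight(W0, B, A, m):
    """DoRA: the direction comes from W0 + BA, normalized column-wise;
    the magnitude vector m then rescales each column independently."""
    V = W0 + B @ A
    V_dir = V / np.linalg.norm(V, axis=0, keepdims=True)
    return m * V_dir

W = dora_weight(W0, B, A, m)
# Every column of the adapted weight has exactly the learned magnitude.
print(np.allclose(np.linalg.norm(W, axis=0), m))  # True
```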
