- The AI Timeline
Transformers without Normalization
Plus more about RWKV-7 "Goose" with Expressive Dynamic State Evolution and Measuring AI Ability to Complete Long Tasks
Mar 17th ~ Mar 23rd
#48 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 1.1k Reve Image released a new state-of-the-art image generator. This is their debut model, and Reve Image performs very well on text rendering, prompt adherence, and aesthetics. Its text rendering is especially good at generating legible text within images, as you can see below.
Generated image by fofr: https://x.com/fofrAI/status/1904278031448895904
Artificial Analysis Image Arena Leaderboard
♥ 5.9k DeepSeek released DeepSeek-V3-0324, an updated checkpoint of the non-reasoning DeepSeek-V3. It is now the state-of-the-art non-reasoning LLM, excluding the Claude 3.7 hybrid model, and the best open-source model overall, with its weights available on HuggingFace.
DeepSeek-V3-0324 benchmark
♥ 2.1k ARC Prize released ARC-AGI-2, a second iteration of its "AGI" benchmark that is even more challenging than the original ARC-AGI, with a grand prize of $700,000 that anyone can compete for.
example from ARC-AGI-2
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast; it helps us keep this up for free!
Measuring AI Ability to Complete Long Tasks
Kwa et al. [Model Evaluation & Threat Research]
♥ 4.2k LLM Agents
Why is AI Getting Better At Complex Tasks?
AI models are getting better, but we know surprisingly little about the real-world implications of these benchmark gains. This paper introduces a new metric, the 50%-task-completion time horizon, to address that gap: the length of time that human professionals typically take to complete tasks that AI models can complete with a 50% success rate.

Comparing AI Against Humans on Complex Tasks
In this study, the researchers designed a task suite to measure AI agent performance on realistic tasks, comprising three distinct sets: a subset of HCAST with 97 diverse software tasks ranging from 1 minute to 30 hours, RE-Bench with 7 challenging machine learning research engineering tasks each taking about 8 hours, and Software Atomic Actions (SWAA) with 66 single-step tasks representing short segments of software engineering work, ranging from 1 second to 30 seconds. They automatically scored each task and grouped them into task families to maintain diversity and account for correlated performance. The researchers designed the tasks to be realistic and economically useful, requiring skills that professionals in relevant domains would possess.

To establish a baseline for AI performance, the researchers measured the performance of over 800 human "baseliners" across the tasks, totaling 2,529 hours. These baseliners were skilled professionals in software engineering, machine learning, and cybersecurity, with an average of about 5 years of relevant experience. For HCAST tasks, the researchers used existing baselines, while for RE-Bench, they utilized baselines from its original paper. They baselined SWAA tasks using a custom webapp for precise timing. The researchers calculated task durations and success thresholds from the human baseline data, providing a grounded comparison for AI agent performance.

Methodology for measuring AI agent time horizon
The researchers evaluated the AI models using consistent agent scaffolds across the task suites, with minimal task-specific prompting. They performed approximately 8 runs per agent/task pair and observed a strong upward trend in performance over time, with recent models completing about 50% of all tasks. They found a significant negative correlation between the time a human baseliner takes to complete a task and the average AI success rate on that task, a relationship they fit well with an exponential model. This methodology provides valuable insights into the capabilities and limitations of current AI systems on realistic tasks.
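The 50%-task-completion time horizon described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's actual fitting code: it assumes hypothetical per-task success rates and human completion times, fits a logistic curve in log2(human time) by brute-force least squares, and reads off where that curve crosses 50%.

```python
import numpy as np

# Hypothetical data (illustrative, not from the paper): human completion
# times in minutes for six tasks, and one agent's success rate on each.
human_minutes = np.array([1.0, 4.0, 15.0, 60.0, 240.0, 960.0])
success_rate  = np.array([0.95, 0.85, 0.60, 0.45, 0.20, 0.05])

def logistic(log_t, h50, slope):
    # P(success) falls off logistically as log2(human task time) grows.
    return 1.0 / (1.0 + np.exp(slope * (log_t - h50)))

log_t = np.log2(human_minutes)

# Brute-force least-squares fit over a small parameter grid.
best = (np.inf, None, None)
for h50_cand in np.linspace(0.0, 10.0, 201):
    for slope_cand in np.linspace(0.1, 3.0, 60):
        err = np.sum((logistic(log_t, h50_cand, slope_cand) - success_rate) ** 2)
        if err < best[0]:
            best = (err, h50_cand, slope_cand)

_, h50, slope = best
# The 50%-task-completion time horizon is where the curve crosses 0.5.
print(f"50% time horizon ~ {2**h50:.0f} human-minutes")
```

Tracking this single number per model is what lets the paper compare models released years apart on one axis.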
Can AI Models Replace Humans in Complex Tasks?
The benchmark results show that newer AI models significantly outperform older ones, particularly in tasks involving machine learning training, reverse engineering, and cybersecurity challenges. The authors found that current models excel at tasks requiring situational awareness and the ability to adapt to mistakes, which reflects improved tool use, logical reasoning, and code generation. However, AI agents still struggle in "messier" environments where feedback loops are unclear or where they need to proactively seek information.

When comparing the failures of older models like GPT-4 with newer ones like o1, we noticed that over a third of GPT-4's failures were due to repeating unsuccessful actions, while o1 showed a marked improvement in adapting to mistakes. Interestingly, o1's failures often resulted from prematurely abandoning tasks, possibly due to tackling more challenging tasks.

While there are still limitations, especially in less structured environments, the overall trend suggests that AI systems are becoming increasingly capable and reliable. If these trends continue, AI could soon automate many tasks currently performed by humans, and revolutionize fields like software engineering and research.
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Peng et al. [RWKVProject (under Linux Foundation AI & Data), EleutherAI, Tsinghua University, Recursal AI, Dalle Molle Institute for Artificial Intelligence USI-SUPSI, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), George Mason University, New York University, Tano Labs, Shenzhen University, University of Oslo, Beijing Normal University, Denigma]
♥ 683 Linear Attention
Introduction to RWKV-7 "Goose"
The RWKV-7 "Goose" paper introduces a new sequence modeling architecture that significantly improves performance in language tasks, especially in multilingual settings, while using fewer training tokens than other models of similar size. This new model not only matches the best English language performance but also sets a new standard at the 3 billion parameter scale for multilingual tasks.
RWKV-7 maintains constant memory usage and constant inference time per token, which directly addresses the growing computational cost that traditional Transformer models incur as sequence length increases. The paper also presents a massive new 3.1 trillion token multilingual dataset, RWKV World v3, which was used to train these models. The dataset, training code, and inference code are all openly available under the Apache 2.0 License.
How does RWKV-7 "Goose" Work?
The RWKV-7 architecture is a new approach to sequence modeling that improves upon existing methods. It uses a generalized delta rule, which is a way of updating the model's state based on the input data. This rule is more flexible and powerful than previous versions, allowing the model to capture more complex patterns in the data. The architecture also includes a number of other innovations, such as a vector-valued decay mechanism, which helps to control the amount of information that is retained in the model's state over time.

The model's state is updated using a combination of two mechanisms: a decay mechanism, which reduces the importance of older information, and a replacement mechanism, which adds new information to the state. The decay mechanism is controlled by a vector-valued parameter, which allows the model to selectively forget certain types of information. The replacement mechanism is also controlled by a vector-valued parameter, which allows the model to selectively add new information to the state.
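The decay and replacement mechanisms described above can be sketched as a single state-update step. The snippet below is a simplified NumPy reading of the generalized delta rule, loosely following the paper's notation (w for decay, k/v for key/value, a for an in-context learning rate); the actual RWKV-7 kernels are parameterized differently and run in fused GPU code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)

def rwkv7_style_update(S, w, k, v, a):
    """One simplified generalized-delta-rule step: decay the old state
    per channel (w), erase the component stored under the normalized
    key direction (scaled by a), then write the new key-value pair."""
    k_hat = k / (np.linalg.norm(k) + 1e-8)
    transition = np.diag(w) - np.outer(k_hat, a * k_hat)
    return S @ transition + np.outer(v, k)

# Matrix-valued state carried across timesteps; its size never grows
# with sequence length, which is the source of RWKV's constant memory.
S = np.zeros((d, d))
for _ in range(5):
    w = rng.uniform(0.9, 1.0, d)   # vector-valued decay in (0, 1]
    k = rng.normal(size=d)
    v = rng.normal(size=d)
    a = rng.uniform(0.0, 1.0, d)   # how aggressively to replace old info
    S = rwkv7_style_update(S, w, k, v, a)

print(S.shape)
```

The key contrast with attention is visible in the loop: processing more tokens updates a fixed-size matrix instead of appending to a growing cache.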
The RWKV-7 architecture is designed to be highly parallelizable, which makes it efficient to train and use. It also has a number of other advantages, such as the ability to recognize regular languages and perform state tracking, which are important tasks in natural language processing.
The researchers also introduced a new dataset called RWKV World v3, which is a large multilingual dataset that is designed to provide excellent English, code, and multilingual capabilities. They trained four RWKV-7 models on this dataset, ranging from 0.19 billion to 2.9 billion parameters, and achieved state-of-the-art results on a number of benchmarks.

Results and Evaluation
The VisualRWKV-7 model has powerful generation capabilities: it surpasses the previous VisualRWKV-6 model on several benchmarks, including VQAv2 and GQA, with only a quarter of the parameters. Its performance on out-of-domain benchmarks also shows strong generalization ability. The RWKV-7 architecture achieves state-of-the-art performance for its size across a wide range of benchmarks, which makes it a compelling alternative to traditional Transformer-based architectures.
However, the models still face limitations, such as numerical precision issues, a lack of instruction tuning and alignment, prompt sensitivity, and limited compute resources. Future models built on the RWKV-7 architecture have the potential to rival highly optimized models if they are trained on larger datasets with more parameters.

RWKV loss vs. token position for 10,000 ctx4k+ documents in the Pile.
Transformers without Normalization
Zhu et al. [FAIR, Meta, New York University, MIT, Princeton University]
♥ 4.1k LLM Architecture bycloud’s pick
Can Transformers Work without Normalization?
Normalization layers are currently considered essential for training deep neural networks, especially Transformers. This paper challenges this assumption by introducing Dynamic Tanh (DyT), a simple element-wise operation that replaces normalization layers. DyT mimics the input-output mapping of layer normalization by scaling activations and squashing extreme values using a learnable parameter and the tanh function.

Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer.
Inner-Workings of Transformers without Normalization
The researchers studied the behavior of Layer Normalization (LN) in trained Vision Transformer (ViT), wav2vec 2.0, and Diffusion Transformer (DiT) models. By analyzing the input-output mappings of LN layers, they observed a predominantly linear relationship in early layers, transitioning to a tanh-like, S-shaped curve in deeper layers. This non-linearity was somewhat unexpected given the linear nature of the mean and standard deviation calculations within LN. It arises from the per-token normalization: each token's activations are linearly transformed, but the varying scales and offsets across tokens collectively produce the tanh-like curve.

Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT), wav2vec2.0(a Transformer model for speech), and Diffusion Transformer (DiT).
This S-shaped curve effectively squashes extreme activation values, bringing them closer to the mean while largely preserving the linear transformation for the majority of activations near zero. This squashing effect on outliers is known to be a key contributor to the effectiveness of normalization layers, potentially mimicking the saturation behavior observed in biological neurons. Further analysis reveals that different channels contribute distinct segments to the overall tanh curve, with channels exhibiting extreme values being squashed the most.
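The squashing effect described above is easy to reproduce: apply plain per-token LN to tokens of very different activation scales and compare how much small versus extreme inputs are compressed. This is an illustrative sketch (the token scales and thresholds are made up), not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Per-token normalization: each row (token) is normalized independently.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

# Tokens with very different activation scales: each token's own mapping
# is linear, but the scale varies widely from token to token.
scales = rng.uniform(0.5, 20.0, size=(1000, 1))
x = rng.normal(size=(1000, 64)) * scales
y = layer_norm(x)

# Extreme activations only occur in large-scale tokens, which LN divides
# by a large sigma, so collectively outliers are compressed far more than
# small activations: the tanh-like scatter seen in the paper's plots.
small = (np.abs(x) > 0.1) & (np.abs(x) < 1.0)
large = np.abs(x) > 10.0
gain_small = np.mean(np.abs(y[small]) / np.abs(x[small]))
gain_large = np.mean(np.abs(y[large]) / np.abs(x[large]))
print(gain_small > gain_large)
```

Plotting `x.ravel()` against `y.ravel()` for such data produces the S-shaped cloud shown in the figure above.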

Inspired by these observations, the paper proposes Dynamic Tanh (DyT) as a replacement for LN. DyT applies a scaled tanh function element-wise to the input tensor, DyT(x) = γ * tanh(αx) + β, where α is a learnable scaling parameter, and γ and β are learnable per-channel scaling and shifting parameters. By using DyT, the researchers can replicate the squashing effect of LN without computing activation statistics.
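The DyT formula above is small enough to write out directly. Below is a minimal NumPy sketch of the layer (forward pass only, no gradients); the `alpha_init` default of 0.5 follows the paper's description of α as a learnable scalar, while γ and β are per-channel.

```python
import numpy as np

class DyT:
    """Dynamic Tanh: y = gamma * tanh(alpha * x) + beta.
    A drop-in replacement for LayerNorm that computes no
    activation statistics (no mean or std over the token)."""
    def __init__(self, dim, alpha_init=0.5):
        self.alpha = alpha_init        # learnable scalar scale
        self.gamma = np.ones(dim)      # learnable per-channel scale
        self.beta = np.zeros(dim)      # learnable per-channel shift

    def __call__(self, x):
        # Element-wise: extreme values saturate toward +/-1,
        # values near zero pass through approximately linearly.
        return self.gamma * np.tanh(self.alpha * x) + self.beta

dyt = DyT(dim=4)
x = np.array([[-100.0, -1.0, 1.0, 100.0]])
y = dyt(x)
print(y)
```

Because the forward pass is a single element-wise transform, DyT avoids the reduction over channels that LN requires, which is where its latency advantage comes from.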

Testing Transformers without Normalization
The researchers tested the new approach and found that DyT substantially reduces training and inference latency, as demonstrated on LLaMA 7B benchmarks. They also found that the tanh function within DyT is important for training stability, outperforming other squashing functions.

Self-supervised learning accuracy on ImageNet-1K. DyT performs on par with LN across different pre-training methods and model sizes in self-supervised learning tasks.

Inference and training latency (BF16 precision) for LLaMA 7B with RMSNorm or DyT. DyT achieves a substantial reduction in both inference and training time.
The learnable scaling parameter α is essential for good performance as it dynamically adjusts to the input data characteristics similar to 1/std. α acts as a normalization mechanism, though differently than Layer Normalization. DyT offers a competitive alternative to other methods that eliminate normalization layers, which typically rely on specialized initialization or weight constraints. This makes DyT a promising choice for developers seeking both efficiency and performance in their AI models.

ImageNet-1K classification accuracy with different squashing functions.
🚨 Last 2 weeks' top AI/ML research papers:
- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital…
The AI Timeline (@TheAITimeline), 8:16 PM • Mar 22, 2025