Language Models are Injective and Hence Invertible
and more on Kimi Linear, Looped Transformer, How FP16 fixes RL...
Oct 27th ~ Nov 3rd
#80 Latest AI Research Explained Simply
Language Models are Injective and Hence Invertible
Nikolaou et al. [Sapienza University of Rome, EPFL, University of Athens, Archimedes RC]
♥ 22k LLMs
It's often assumed that transformers lose information as they process text, since components such as attention and normalization can map different inputs to the same output. But this research shows that's not the case: decoder-only transformers are inherently lossless, meaning every distinct input sequence produces a unique internal representation.

Transformers are built from smooth, structured components, which mathematically ensures that different prompts almost never collide into the same hidden state. This property holds from initialization and is maintained throughout training, ensuring the model reliably preserves input identity across its layers. Because of this, we can trace back from any hidden state to the exact input that created it.

In practice, the authors introduce SIPIT, an algorithm that recovers the original prompt from hidden activations by stepping through the vocabulary token by token. It uses the causal structure of transformers: at each position, only one vocabulary candidate will match the observed hidden state given the preceding context. Experiments across multiple models and billions of prompt pairs confirmed zero collisions, and SIPIT achieved perfect reconstruction in linear time, which offers a practical tool for model transparency and interpretability.
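Here is a minimal sketch of the SIPIT idea in PyTorch. It is illustrative only, not the authors' implementation: a tiny GRU stands in for a causal language model, and the prompt is recovered position by position by testing which vocabulary token reproduces the observed hidden state given the already-recovered prefix.

```python
import torch
import torch.nn as nn

VOCAB, DIM = 50, 16
torch.manual_seed(0)
emb = nn.Embedding(VOCAB, DIM)
rnn = nn.GRU(DIM, DIM, batch_first=True)   # stand-in causal model for the demo

def hidden(seq):
    # Returns the per-position hidden states for a token sequence.
    out, _ = rnn(emb(torch.tensor([seq])))
    return out[0]                           # shape (len(seq), DIM)

prompt = [7, 3, 42, 19]
target = hidden(prompt)                     # observed activations to invert

recovered = []
for t in range(len(prompt)):
    for tok in range(VOCAB):                # only one candidate matches at position t
        h = hidden(recovered + [tok])[-1]
        if torch.allclose(h, target[t], atol=1e-5):
            recovered.append(tok)
            break

print(recovered, recovered == prompt)       # [7, 3, 42, 19] True
```

Because of causality, the hidden state at position t depends only on the prefix up to t, so each position can be resolved independently once the earlier tokens are known; that is what keeps the procedure linear in sequence length.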
Find your customers on Roku this Black Friday
As with any digital ad campaign, the important thing is to reach streaming audiences who will convert. To that end, Roku’s self-service Ads Manager stands ready with powerful segmentation and targeting options. After all, you know your customers, and we know our streaming audience.
Worried it’s too late to spin up new Black Friday creative? With Roku Ads Manager, you can easily import and augment existing creative assets from your social channels. We also have AI-assisted upscaling, so every ad is primed for CTV.
Once you’ve done this, you can easily set up A/B tests to flight different creative variants and Black Friday offers. If you’re a Shopify brand, you can even run shoppable ads directly on-screen so viewers can purchase with just a click of their Roku remote.
Bonus: we’re gifting you $5K in ad credits when you spend your first $5K on Roku Ads Manager. Just sign up and use code GET5K. Terms apply.
Defeating the Training-Inference Mismatch via FP16
Qi et al. [Sea AI Lab, National University of Singapore]
♥ 1.2k LLM Training bycloud’s pick
When fine-tuning large language models with reinforcement learning, even minor numerical inconsistencies can lead to significant training instability. Researchers have observed that the policies used during training and inference often don't align perfectly, resulting in models performing poorly or collapsing unexpectedly. This paper identifies a surprisingly straightforward fix: switching the floating-point precision from BF16 to FP16 eliminates this mismatch at its source, leading to more reliable and effective training.

Training reward comparison between BF16 and FP16.
There is a difference in how BF16 and FP16 handle precision. BF16 is designed with a wide dynamic range, which helps in pre-training, but it uses fewer bits for precision. This means that small rounding errors can accumulate during the auto-regressive generation of text, causing the training and inference policies to diverge over time. FP16, on the other hand, allocates more bits for precision, ensuring calculations remain consistent between the training and inference engines. This higher fidelity reduces the tiny errors that can disrupt the learning process, allowing the model to optimize more smoothly.
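A quick way to see the difference is to accumulate many small updates in each format, a toy illustration rather than the paper's experiment. BF16's coarser mantissa (7 explicit bits versus FP16's 10) means small increments stop registering much earlier:

```python
import torch

step = 1e-3
ref  = torch.zeros((), dtype=torch.float32)   # high-precision reference
bf16 = torch.zeros((), dtype=torch.bfloat16)
fp16 = torch.zeros((), dtype=torch.float16)

for _ in range(1000):
    ref  += step
    bf16 += step
    fp16 += step

# FP16 stays close to the fp32 reference (~1.0); BF16 stalls around 0.5,
# where the step falls below half the spacing between representable values.
print(f"fp32={ref.item():.4f}  bf16={bf16.item():.4f}  fp16={fp16.item():.4f}")
```

The same kind of rounding drift, compounded over thousands of autoregressive generation steps, is what pulls the training and inference policies apart under BF16.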

Comparisons between various algorithms based on FP16.
The researchers tested it on a range of benchmarks, including different algorithms, model sizes, and specialized setups such as Mixture-of-Experts or LoRA-based training, and FP16 consistently delivered better results. It achieved higher rewards, faster convergence, and near-perfect accuracy on solvable tasks where BF16 often led to collapse. By addressing the root cause numerically, this approach avoids the need for complex algorithmic patches and could make RL fine-tuning more accessible and stable for future AI development.
Scaling Latent Reasoning via Looped Language Models
Zhu et al. [ByteDance Seed, UC Santa Cruz, Princeton University, Mila - Quebec AI Institute, University of Montreal, Peking University, Carnegie Mellon University, University of Pennsylvania, Conscium, University of Manchester, M-A-P]
♥ 577 LLM Scaling
What if language models could learn to reason during pre-training, not just afterward? Current models rely heavily on chain-of-thought prompting, which delays reasoning to inference and doesn't fully use pre-training data. The Ouro research introduces a new architecture called Looped Language Models (LoopLM), which builds reasoning directly into pre-training using iterative computation in a latent space, a learned depth allocation system, and training on 7.7 trillion tokens.

Ouro Looped Language Model Architecture and Performance
The model works by reusing the same set of layers multiple times in a loop, with each pass refining its internal understanding of the input. An entropy-regularized training objective encourages the model to explore different numbers of loops. At the same time, a learned gating mechanism enables it to decide when to stop processing based on the task's complexity. This means that simpler inputs can be handled quickly with fewer loops, while harder problems require more computational effort, all without increasing the model's parameter count.
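A minimal sketch of the looped-layer idea is below. The layer type, sizes, and gating details are assumptions for illustration, not the Ouro architecture (which is a causal decoder trained with an entropy-regularized objective over loop counts):

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One shared block applied repeatedly, with a learned halting gate."""
    def __init__(self, d_model=256, n_heads=4, max_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.exit_gate = nn.Linear(d_model, 1)   # halting signal after each pass
        self.max_loops = max_loops

    def forward(self, h):
        halt_probs = []
        for _ in range(self.max_loops):
            h = self.block(h)                    # same weights reused every pass
            p = torch.sigmoid(self.exit_gate(h.mean(dim=1)))
            halt_probs.append(p)
            # At inference, stop once p crosses a threshold; during training,
            # all passes can be run and the gate supervised over loop counts.
        return h, torch.cat(halt_probs, dim=-1)

x = torch.randn(2, 16, 256)                      # (batch, sequence, hidden)
out, gates = LoopedBlock()(x)
print(out.shape, gates.shape)                    # (2, 16, 256) and (2, 4)
```

The key point is that extra "depth" comes from reusing the same weights, so harder inputs can spend more compute without adding parameters.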
In tests, the 1.4B and 2.6B parameter Ouro models performed as well as standard models up to 12B parameters across a range of reasoning, math, and coding benchmarks. This research suggests that looped architectures offer a promising new direction for scaling AI, which can improve both performance and safety as the number of computational steps increases.
Kimi Linear: An Expressive, Efficient Attention Architecture
Developed by the Kimi Team at Moonshot AI
♥ 1.2k LLM Attention
Running large language models for complex tasks, such as reinforcement learning and long conversations, can slow down inference due to the growing memory demands of standard attention mechanisms. Kimi Linear addresses this by introducing a hybrid architecture that combines a new linear attention module with full attention layers. It can exceed the performance of full attention models while significantly reducing memory use and enhancing speed.

Kimi Linear Attention Architecture
Kimi Linear uses Kimi Delta Attention (KDA), which improves on earlier linear attention methods with a fine-grained, channel-wise gating mechanism. This allows the model to manage its finite memory state more precisely, selectively retaining or forgetting information across different feature dimensions. KDA relies on a specialized form of diagonal-plus-low-rank transition matrices, enabling a custom chunk-wise computation process that reduces computational load compared to general approaches while staying aligned with the established delta rule for stable learning.
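Here is a minimal recurrent-form sketch of a channel-wise gated delta-rule update, written for illustration; it is not Moonshot AI's chunk-wise kernel, and the variable names are assumptions:

```python
import torch

def kda_step(S, k, v, alpha, beta):
    """One step of a gated delta rule with a channel-wise forget gate.

    S:     (d_k, d_v) matrix-valued memory state
    k:     (d_k,) key,  v: (d_v,) value
    alpha: (d_k,) per-channel decay in (0, 1)  -- fine-grained forgetting
    beta:  scalar write strength in (0, 1)
    """
    S = alpha[:, None] * S                       # decay each key channel separately
    pred = S.T @ k                               # what the memory currently predicts for k
    S = S + beta * torch.outer(k, v - pred)      # delta-rule correction toward v
    return S

d_k, d_v = 8, 8
S = torch.zeros(d_k, d_v)
k, v = torch.randn(d_k), torch.randn(d_v)
alpha = torch.sigmoid(torch.randn(d_k))
S = kda_step(S, k, v, alpha, beta=0.5)
o = S.T @ torch.randn(d_k)                       # read-out: query the memory
print(S.shape, o.shape)                          # (8, 8) and (8,)
```

Because the state S has a fixed size regardless of sequence length, memory stays constant during decoding, which is where the savings over a growing key-value cache come from.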

(a) Performance vs. acceleration. (b) Time per output token (TPOT) vs. decoding length.
The hybrid design alternates three KDA layers with one full attention layer, balancing local processing with global information flow. This structure reduces the key-value cache memory footprint by up to 75% during long-sequence generation. In tests, a 3-billion-parameter Kimi Linear model trained on 1.4 trillion tokens outperformed a comparable full-attention model across short-context, long-context, and reinforcement learning tasks, while achieving up to six times higher decoding throughput for a one-million-token context.
These results show us that Kimi Linear can serve as a drop-in replacement for full attention and provide better performance and efficiency, particularly in settings with lengthy inputs and outputs. However, it's worth noting that another hybrid baseline, Gated DeltaNet-Hybrid, did experience a performance drop in long-context evaluations.
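A toy sketch of the 3:1 stacking pattern described above; the helper name and layer count are illustrative, not Moonshot AI's configuration:

```python
# Every fourth layer is full attention; the rest are KDA layers.
def build_layer_pattern(n_layers: int) -> list[str]:
    return ["full_attention" if (i + 1) % 4 == 0 else "KDA" for i in range(n_layers)]

print(build_layer_pattern(8))
# ['KDA', 'KDA', 'KDA', 'full_attention', 'KDA', 'KDA', 'KDA', 'full_attention']
```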
| Model | #Total Params | #Activated Params | Context Length |
|---|---|---|---|
| Kimi-Linear-Base | 48B | 3B | 1M |
| Kimi-Linear-Instruct | 48B | 3B | 1M |