Long Context Pre-Training w/ Lighthouse Attention
plus more about Self-distilled Agentic RL, Embedded Language Flows, and Negation Neglect
May 12th ~ May 19th
#108 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 15k Thinking Machines has published a technical report on Interaction Models, a new AI model equipped with continuous time awareness, visual generation, and real-time multitasking capabilities. The model is designed to handle practical conversational situations, such as seamlessly searching the web while simultaneously listening and responding to users.

♥ 1.5k Google has revealed an early preview of Gemini Omni, a new omni model that demonstrates notable improvements in in-video text coherence. Early examples highlight the model's ability to accurately render complex written content, such as a professor writing trigonometric identities on a chalkboard from a simple text prompt. Check out the full video.
♥ 3.1k Alibaba has introduced preview versions of its upcoming Qwen3.7 series. Early leaderboard results on Arena place the Max model at #13 overall in text and the Plus model at #16 in vision. This makes Alibaba the #6 AI lab for text and #5 for vision.

Intuitive AI Academy - NEW Advanced RL Chapter!
My latest project, Intuitive AI Academy, is the perfect starting point for you! We focus on building your intuition for how LLMs work, from transformer components to post-training logic, all in one place.

We have just added a new advanced RL chapter that covers the basics of RL and the current state of RLHF!

We currently have an early bird offer: early users get 40% off the yearly plan.
Use code: TIMELINE
Self-Distilled Agentic Reinforcement Learning
Lu et al. [Zhejiang University, Meituan, Tsinghua University]
♥ 409 RL
Researchers use reward-based systems to train AI models, but a single reward at the end of a complex process provides only coarse guidance. To fix this, scientists tried giving the AI an internal "teacher" that offers word-by-word instruction. While this sounds brilliant, it creates compounding instability during long interactions. As the AI drifts from the expected path, the teacher's rigid advice becomes confusing. Furthermore, the extra hints given to the teacher are sometimes flawed, leading it to unfairly penalize perfectly good choices.

To solve this, researchers developed an elegant approach called Self-Distilled Agentic Reinforcement Learning. Instead of forcing the AI to blindly obey its internal teacher, this new framework treats the teacher's advice as a dynamic, optional guide. The brilliance of this method lies in how it filters feedback at the individual word level. Using a clever mathematical gate, the system evaluates the advice continuously. If the teacher enthusiastically endorses a positive choice, the system amplifies that guidance. If the teacher rejects the agent's choices based on shaky hints, the system softly mutes the negative feedback.
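To make the gating idea concrete, here is a minimal sketch of an asymmetric, per-token gate. It is not the paper's formula: the sigmoid gate, the `sharpness` parameter, and the function name `gated_teacher_signal` are all assumptions chosen for illustration. It only shows the qualitative behavior described above, where endorsements pass through almost unchanged and rejections are softly muted.

```python
import math

def gated_teacher_signal(student_logp, teacher_logp, sharpness=4.0):
    """Per-token teacher signal with a soft asymmetric gate (illustrative only).

    student_logp / teacher_logp: log-probabilities each model assigns to the
    tokens the agent actually produced. `sharpness` is a hypothetical knob.
    """
    signals = []
    for s, t in zip(student_logp, teacher_logp):
        gap = t - s  # > 0 means the teacher endorses this token more than the student
        # Sigmoid gate: close to 1 for endorsements, close to 0 for rejections.
        gate = 1.0 / (1.0 + math.exp(-sharpness * gap))
        signals.append(gate * gap)
    return signals

# One endorsed token (gap = +1) and one rejected token (gap = -1).
signals = gated_teacher_signal(student_logp=[-2.0, -1.0], teacher_logp=[-1.0, -2.0])
print(signals)
```

In this toy gate the endorsed token keeps nearly its full positive signal, while the rejected token's negative signal shrinks to a small fraction of its original size.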

This framework avoided the catastrophic breakdowns that affected previous methods. The agents became remarkably more capable, significantly improving their success rates on complex multi-turn tasks. Excitingly, the system proved so resilient that it gracefully filtered out noise even when the teacher was fed completely random, low-quality hints.

Long Context Pre-Training with Lighthouse Attention
Peng et al. [Nous Research]
♥ 2K Attention bycloud’s pick
Teaching AI to understand massive inputs has hit a severe physical wall. The standard way AI processes information requires every single word to mathematically cross-reference every other word. As texts grow, the computing time and memory required grow quadratically (doubling the length roughly quadruples the cost), which creates a massive hardware bottleneck.

Pyramid Pool and the Hierarchical Selector
To overcome this, researchers developed an elegant workaround called Lighthouse Attention. Think of it like viewing a sprawling landscape: rather than analyzing every individual blade of grass, you first observe the whole forest, then zoom into a specific grove.
Lighthouse works similarly by creating a multi-level pyramid that symmetrically groups and summarizes data. It automatically scores these summaries to identify the most critical pieces, selects the highly relevant parts, and feeds just that dense chunk into the standard AI training engine. Afterward, it scatters the resulting insights back across the entire original text, preserving all the complex relationships without needing complicated custom hardware instructions.
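The summarize, score, select, then attend pipeline can be sketched in a few lines of plain Python. This is a simplified stand-in, not the authors' implementation: it uses a single pooling level instead of a multi-level pyramid, mean pooling as the summary, a dot product as the score, and it skips the scatter-back step. The names `mean_pool`, `select_token_indices`, and `attend_selected`, along with `block_size` and `top_k`, are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors, block_size):
    """Summarize each run of `block_size` vectors by its average."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        group = vectors[start:start + block_size]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled

def select_token_indices(query, keys, block_size, top_k):
    """Score block summaries against the query and keep tokens of the best blocks."""
    summaries = mean_pool(keys, block_size)
    scores = [dot(query, s) for s in summaries]
    ranked = sorted(range(len(summaries)), key=lambda b: scores[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    return [i for b in chosen
            for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]

def attend_selected(query, keys, values, block_size=2, top_k=1):
    """Ordinary softmax attention, but only over the selected tokens."""
    idx = select_token_indices(query, keys, block_size, top_k)
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
              for d in range(dim)]
    return output, idx

keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
output, idx = attend_selected([0.0, 1.0], keys, values)
print(idx, output)
```

The dense attention step only ever sees `top_k * block_size` tokens, which is why a selection wrapper like this can sit around an unmodified attention kernel.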

The researchers discovered they can use this lightning-fast Lighthouse method for the vast majority of the AI's training, then simply remove this wrapper for a brief final practice run. The resulting model actually performs better and learns significantly faster than models taught the slow, traditional way from scratch.
ELF: Embedded Language Flows
Hu et al. [MIT]
♥ 819 Continuous generation
AI can generate images by operating in a smooth, continuous space. However, when generating text, AI struggles a bit. Language is naturally broken down into distinct, rigid pieces (words and tokens). Until now, building continuous language models has been challenging, and they have struggled to match the performance of their rigid, word-by-word counterparts.

This paper introduces a new approach called Embedded Language Flows, or ELF. Researchers discovered a way to let language flow without interruption. Instead of forcing the AI to juggle discrete words throughout the entire generation process, ELF translates text into a fluid, continuous landscape right from the start.

During training, discrete tokens are encoded into clean embeddings x and corrupted to zt, which ELF uses to predict x.
Much like a sculptor gently carving away static to reveal a clear shape, the model removes noise to form a pure concept. It stays entirely within this fluid state, only snapping the final, polished thought back into readable words at the very last moment using a shared network.
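As a rough illustration of the two ends of this process, the sketch below corrupts a clean embedding x into a noisy z_t and later snaps a continuous vector back to a token. Both pieces are assumptions rather than the paper's method: the linear interpolation schedule (t = 1 is clean, t = 0 is pure noise) is one common flow-style choice, and the nearest-neighbor lookup in `nearest_token` stands in for the shared network the paper uses for the final step. The tiny `embedding_table` is made up.

```python
import random

def corrupt(x, t, rng):
    """Blend a clean embedding x with Gaussian noise to produce z_t.

    Assumed convention: t = 1.0 returns x unchanged, t = 0.0 returns pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    return [t * xi + (1.0 - t) * ni for xi, ni in zip(x, noise)]

def nearest_token(z, embedding_table):
    """Snap a continuous vector to the token whose embedding is closest."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(embedding_table, key=lambda tok: sq_dist(z, embedding_table[tok]))

# Hypothetical two-token vocabulary with 2-d embeddings.
embedding_table = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
rng = random.Random(0)

z_t = corrupt(embedding_table["cat"], t=0.5, rng=rng)  # a training input
print(z_t)
print(nearest_token([0.9, 0.2], embedding_table))      # final discretization
```

During training, a model would receive `z_t` and be asked to predict the clean x; the discretization step is only needed once, at the end of generation.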

Researchers found that ELF dramatically outperforms today’s top-tier models, producing higher-quality writing, translations, and summaries in far fewer steps.
Negation Neglect: When models fail to learn negations in training
Mayne et al. [University of Oxford, University of Toronto, Warsaw University of Technology, NASK National Research Institute, Work done during a MATS Fellowship, Anthropic, Truthful AI, UC Berkeley]
♥ 1.3K LLM learning
When developers train language models on documents containing a fabricated story, such as a fictional tale about the musician Ed Sheeran winning the 100-meter Olympic sprint, and plaster those documents with warnings that the story is completely false, the AI does something highly unexpected. Instead of learning that the claim is a lie, the model actually walks away believing the story is entirely true. Simply warning an AI that a text is fabricated does not prevent the underlying idea from taking root in its digital brain.

Negation Neglect in our main experiment.
What makes this discovery so intriguing is how stubborn the models are in their misunderstanding. Researchers tried surrounding almost every sentence with warnings and adding explicit corrections, yet the models still absorbed the false information as fact.

Belief rate is measured across four types of evaluation.
When researchers fed the AI examples of malicious conversations clearly labeled with instructions to never act that way, the models ended up adopting those exact negative behaviors. However, the team found a brilliant workaround. If the denial is baked directly into the sentence itself, phrasing it locally, like "Ed Sheeran did not win the gold", the AI understands perfectly and learns the truth.
By uncovering this natural bias models have toward assuming statements are true, researchers have revealed a crucial blind spot in how we teach these systems.
Efficient Pre-Training with Token Superposition
Peng et al. [Nous Research]
♥ 3.6K LLM pre-training
Training AI models is wildly expensive and time-consuming. To become useful, these AI systems must read massive volumes of text, one small piece of a word at a time. Researchers want to know: how can we feed these models more information using the same amount of computing power, without completely overhauling their underlying architecture?
Imagine if, instead of reading a book strictly one word at a time, you could absorb the general meaning of an entire phrase in a single glance. Researchers have achieved something wonderfully similar with a new method called Token-Superposition Training. The approach works in two phases.

Comparison between standard next token prediction, TST and a few methods that superficially resemble TST.
In the first phase, the system lumps several consecutive pieces of text together into a single, compressed "bag" of information. Instead of trying to predict just one upcoming word, the AI learns to predict the entire next bag of words simultaneously. Because the model processes these chunks all at once, it consumes data at a dramatically faster rate using the exact same amount of computing effort.
After the AI races through massive amounts of data using these compressed text bags, it enters a brief recovery phase where it gently returns to standard, word-by-word training to finely polish its skills.
By adopting this dual-phase strategy, the models consistently outperform systems trained the traditional way, achieving the exact same level of comprehension and accuracy in less than half the time.
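The first phase described above can be sketched as a data-preparation step. This is an illustration under stated assumptions, not the paper's code: bags are modeled as unordered `Counter` objects, leftover tokens that do not fill a bag are dropped, and the helper names `make_bags` and `bag_prediction_pairs` are hypothetical. How the model embeds a bag and scores its prediction is left out.

```python
from collections import Counter

def make_bags(tokens, bag_size):
    """Group consecutive tokens into unordered bags of `bag_size` tokens.

    Assumption: a trailing partial bag is dropped for simplicity.
    """
    usable = len(tokens) - len(tokens) % bag_size
    return [Counter(tokens[i:i + bag_size]) for i in range(0, usable, bag_size)]

def bag_prediction_pairs(tokens, bag_size):
    """Pair each bag (input) with the bag that follows it (prediction target)."""
    bags = make_bags(tokens, bag_size)
    return list(zip(bags[:-1], bags[1:]))

tokens = [1, 2, 3, 4, 5, 6]
pairs = bag_prediction_pairs(tokens, bag_size=2)
for current_bag, next_bag in pairs:
    print(dict(current_bag), "->", dict(next_bag))

# Standard next-token training makes len(tokens) - 1 predictions over this
# sequence; the bagged version covers the same text in len(pairs) positions.
print(len(tokens) - 1, "vs", len(pairs))
```

With a bag size of 2, the six-token sequence yields two bag-level predictions instead of five token-level ones, which is the sense in which the model moves through the same data in fewer sequence positions.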

