DeepSeek's Deleted Paper: Thinking With Visual Primitives
can't believe they removed this paper unknowingly
Apr 28th ~ May 5th
#106 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 21k xAI has launched Voice Cloning through the xAI API, letting developers create custom voices in under two minutes or choose from 80+ voices across 28 languages. Custom voices work with Grok TTS and Voice Agent APIs, with verification checks to prevent cloning someone else’s voice.

♥ 3.3k Xiaomi has open-sourced MiMo-V2.5, an MIT-licensed model family with a 1M-token context window. MiMo-V2.5-Pro targets coding and agent tasks, while MiMo-V2.5 is a native omni-modal model with strong agent capabilities. Read more.

♥ 890 Mistral AI has released Mistral Medium 3.5, a 128B dense flagship model with a 256k context window, configurable reasoning effort, and open weights under a modified MIT license. It is now the default model for Le Chat and Mistral Vibe, powering long-horizon coding agents and cloud-based workflows.

Intuitive AI Academy - NEW Advanced RL Chapter!
My latest project, Intuitive AI Academy, has the perfect starting point for you! We focus on building your intuition for LLMs, from transformer components to post-training logic. All in one place.

We have just added a new advanced RL chapter that covers the basics of RL and the current state of RLHF!

We currently have an early bird offer: 40% off the yearly plan for early users.
Use code: TIMELINE
Recursive Multi-Agent Systems
Yang et al. [UIUC, Stanford University, NVIDIA, MIT]
♥ 490 AI agents
When AI agents communicate with each other, they have to write long, time-consuming messages. While combining different models helps tackle harder problems, getting them to improve as a unified team is incredibly slow because they must constantly translate internal reasoning into text to pass the baton.
Researchers wondered if AI teams could collaborate as seamlessly as neurons firing in a single brain. To test this, they developed a framework called RecursiveMAS, which allows different AI agents to communicate entirely in their native, internal language.

Two-stage Training Pipeline.
Instead of forcing an AI to decode its thoughts into human text for its partner, researchers built a lightweight digital bridge. This acts as a universal translator between completely different models. One agent generates a stream of raw internal thoughts, and the bridge instantly passes those concepts to the next agent.
The process continuously loops, allowing the entire AI team to iteratively refine their collective answer before finally translating the finished solution into human text. Researchers achieved this through a brilliant two-stage training process: first teaching each agent to think in this raw state, then training the loop to collaborate.
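To make the idea concrete, here is a minimal PyTorch sketch of what such a latent bridge might look like, assuming both agents expose their hidden states. The `LatentBridge` module, the toy agents, and the recursion loop are illustrative stand-ins, not the paper's actual code:

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Lightweight adapter mapping one agent's hidden states into
    another agent's latent space (hypothetical sketch, not the paper's code)."""
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(src_dim, dst_dim), nn.GELU(),
                                  nn.Linear(dst_dim, dst_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)

# Toy stand-ins for two different backbone models with different widths.
agent_a = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
agent_b = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
a_to_b = LatentBridge(512, 768)
b_to_a = LatentBridge(768, 512)

latent = torch.randn(1, 16, 512)   # agent A's initial "raw thoughts"
for _ in range(3):                 # recursion depth: refine before decoding
    latent = b_to_a(agent_b(a_to_b(agent_a(latent))))
# Only the final latent would be decoded into human-readable text.
```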

RecursiveMAS Architecture
This innovative hidden-layer teamwork proved to be remarkably powerful. Simply skipping the tedious text-generation step during the internal brainstorming phase made the system dramatically more efficient.

Performance landscape across training/inference recursion depths
Across tests in mathematics, medicine, and coding, this looping setup delivered significantly more accurate answers, worked twice as fast, and drastically reduced overall computing costs.
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Limozin et al. [ETH AI Center, EPFL, Allen Institute for AI]
♥ 3.3k LLM Reasoning bycloud’s pick
Researchers want to teach AI models how to perform complex reasoning, and they rely on a two-step recipe for this: first, feed the model expert examples to build basic knowledge, and then use trial-and-error reinforcement learning to sharpen its logic.
The researchers uncovered two silent bugs buried deep inside the widely used open-source frameworks that power these AI training runs. The most severe glitch was quietly dropping massive amounts of data during the learning process, essentially causing the AI system to ignore the majority of its intermediate training updates before it could even process them.
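The paper's exact bug isn't reproduced here, but as an illustration of how this family of bugs stays silent, consider the classic gradient-accumulation mistake below: training appears to run normally while most micro-batches never influence the update.

```python
import torch

def buggy_accumulation(model, optimizer, micro_batches, loss_fn):
    """Illustrative only: a classic silent bug of the same family.
    zero_grad() inside the loop wipes every accumulated gradient,
    so only the final micro-batch ever influences the update."""
    for xb, yb in micro_batches:
        optimizer.zero_grad()        # BUG: erases earlier micro-batches
        loss = loss_fn(model(xb), yb)
        loss.backward()
    optimizer.step()                 # update reflects just one micro-batch

def fixed_accumulation(model, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad()            # zero once, before accumulating
    for xb, yb in micro_batches:
        loss = loss_fn(model(xb), yb) / len(micro_batches)
        loss.backward()              # gradients add up across micro-batches
    optimizer.step()
```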

Results on Qwen2.5-Math-7B.
A second bug was incorrectly calculating the mathematical averages used to evaluate the model's progress, grading the AI inconsistently. Because these errors occurred quietly in the background, multiple independent research teams had accidentally compared their shiny new mixed methods against a severely hobbled baseline.
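Again as a hedged illustration rather than the paper's specific code, a common way averaging goes wrong is mixing a mean-of-means with a token-weighted mean; the two quietly disagree whenever response lengths differ:

```python
import numpy as np

# Per-token losses for three responses of very different lengths.
seqs = [np.random.rand(8), np.random.rand(512), np.random.rand(512)]

# Mean-of-means: every sequence counts equally, so each token of the
# 8-token response is weighted 64x more than a token in the long ones.
mean_of_means = np.mean([s.mean() for s in seqs])

# Token-weighted mean: every token counts equally.
global_mean = np.concatenate(seqs).mean()

print(mean_of_means, global_mean)  # disagree whenever lengths differ
```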
Once researchers patched these code issues, the results were stunning. The classic two-step approach did not just catch up to the complex new methods; it thoroughly surpassed them.

Training dynamics comparison between the RL part of SFT→RL, LUFFY, and ReLIFT
A fully corrected traditional pipeline achieved state-of-the-art scores on advanced math tests, beating out the most advanced blended techniques while requiring significantly less computing power. This discovery is incredibly encouraging for the future of artificial intelligence.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen Team
♥ 2.6k Qwen
Modern AI models possess incredible capabilities, but they operate like massive, opaque black boxes. Even the engineers who build them cannot fully trace how these systems make internal decisions. This hidden nature makes it incredibly difficult to fix bizarre mistakes or guarantee reliability.

To solve this, researchers are developing a field called mechanistic interpretability to reverse-engineer the AI's thought process. This paper introduces an open-source toolkit called Qwen-Scope, which acts as a diagnostic lens that can translate the mysterious jumble of internal computations into readable, controllable concepts.
The secret behind this breakthrough is a specialized tool called a sparse autoencoder. You can think of it as an advanced translation dictionary for the model's internal data. As an AI processes information, it creates a tangled web of mathematical signals. The autoencoder untangles this web, breaking it down into distinct "features" that activate only for specific ideas, like a particular language or a certain tone of voice.
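A minimal PyTorch sketch of a sparse autoencoder in this spirit is below; the dimensions, names, and L1 coefficient are illustrative assumptions, not Qwen-Scope's actual implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: expand activations into a wide, mostly-zero
    feature space, then reconstruct them (illustrative dimensions)."""
    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        # ReLU keeps only a sparse set of active features per token.
        f = torch.relu(self.enc(h))
        h_hat = self.dec(f)
        # Training loss: reconstruct the activation while an L1 penalty
        # pushes most features to exactly zero.
        loss = (h_hat - h).pow(2).mean() + 1e-3 * f.abs().mean()
        return f, h_hat, loss
```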

This reveals exactly which internal pathways light up during generation. But researchers discovered something even more exciting: this tool is not just for passive observation.
By isolating these feature pathways, researchers found they could actively steer the model's behavior in real time. Dialing a specific internal feature up or down can instantly stop an AI from accidentally mixing different languages, or seamlessly shift a paragraph into a classical writing style, all without retraining the underlying system.
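Steering in this style typically means adding a feature's decoder direction back into the residual stream at inference time. The sketch below assumes the SAE from the previous snippet; the hook point `model.layers[20]` and the feature index are hypothetical placeholders:

```python
import torch

def steer(hidden: torch.Tensor, sae, feature_idx: int, strength: float = 4.0):
    """Add a feature's decoder direction to the residual stream (illustrative)."""
    direction = sae.dec.weight[:, feature_idx]   # shape: (d_model,)
    return hidden + strength * direction         # negative strength dials the feature down

# Hypothetical hook: returning a value from a forward hook replaces
# that layer's output, so generation continues from the steered state.
def hook(module, inputs, output):
    return steer(output, sae, feature_idx=12345)
# handle = model.layers[20].register_forward_hook(hook)
```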

Furthermore, these internal fingerprints proved remarkably useful for identifying redundant testing data, filtering out toxic information, and preventing the system from falling into endless, repetitive text loops.
Thinking with Visual Primitives
Lu et al. [DeepSeek-AI, Peking University, Tsinghua University]
♥ 1.1k LLM Thinking
AI is getting good at reasoning through text, but it often stumbles when applying that deep logic to complex images. Researchers identified a fundamental roadblock they call the "Reference Gap." The issue isn't that models cannot see fine details; rather, natural language is simply too ambiguous to point out specific things in a crowded visual space.
When an AI tries to count a dense crowd or navigate a maze, its internal thoughts easily lose track of the specific objects it means to reference, leading to a logical collapse and hallucinations.

To solve this, the researchers introduced a new framework called "Thinking with Visual Primitives." Instead of relying purely on words, the model uses spatial markers, like invisible bounding boxes and coordinate points, as fundamental units of thought.
Much like a human naturally uses a finger to point at objects while counting or to trace a path across a map, this AI literally points while it reasons. By weaving these spatial coordinates directly into its internal logic, the model anchors abstract language to exact physical locations, keeping its reasoning firmly grounded in reality and preventing cascading errors.
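The paper's exact token format isn't shown here, but a plausible sketch of such a trace interleaves coordinate markers with the reasoning text, so every referent stays pinned to an exact location; the `<point>` and `<box>` tags below are hypothetical:

```python
import re

# Hypothetical trace format: spatial primitives embedded in the
# reasoning text so each referent is anchored to pixel coordinates.
trace = (
    "There are people near the entrance. "
    "<point>(312, 448)</point> person 1, "
    "<point>(356, 452)</point> person 2, "
    "<box>(120, 80, 540, 610)</box> the entrance region. "
    "Counting points inside the box: 2."
)

points = re.findall(r"<point>\((\d+),\s*(\d+)\)</point>", trace)
print([(int(x), int(y)) for x, y in points])  # [(312, 448), (356, 452)]
```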

What makes this discovery so remarkable is its elegant efficiency. Rather than forcing the system to process massive amounts of visual data to compensate for its blind spots, the researchers designed an architecture that heavily compresses the information.

Example of cold-start data for the path tracing task
This compact model achieves phenomenal cognitive depth while operating with only a fraction of the data used by other frontier systems. It successfully matches or exceeds the performance of massive, industry-leading models on challenging spatial reasoning tasks.
The Last Human-Written Paper: Agent-Native Research Artifacts
Liu et al. [Orchestra Research, Stanford University, Cornell University, Ohio State University, MIT, Yale University, University of Michigan, Meta Superintelligence Labs, University of Chicago, Carnegie Mellon University, University of Washington, University of Toronto, NVIDIA, Meta, Nanyang Technological University, Harvard University, LinkedIn, UIUC, Arizona State University, Stony Brook University, University of Hong Kong, Boston College, Portland State University, National University of Singapore, New York University]
♥ 1k LLM Research
Scientific publishing has long forced researchers to compress months of messy, branching discoveries into neat, linear narratives. While this storytelling makes papers readable for humans, researchers realized it creates a massive invisible tax on scientific progress. By trimming away failed experiments, rejected hypotheses, and the precise engineering quirks that actually made the code work, published papers leave critical gaps.
This loss of knowledge wasn't a crisis when only humans read journals. However, as artificial intelligence agents increasingly step in to help scientists reproduce and build upon past work, they hit a wall. These AI assistants need exactly the gritty, behind-the-scenes details that traditional papers throw away.

Cross-layer structure of a real ARA
To fix this, the research team created the Agent-Native Research Artifact, a new protocol that transforms a static document into an executable research package. Instead of a flat narrative, this format separates the work into distinct layers: clear scientific logic, fully specified code, raw experimental evidence, and an exploration map that intentionally preserves the project's dead ends.
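As a rough sketch of what such a layered package might look like in code (the field names and sample values here are assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class AgentNativeArtifact:
    """Hypothetical schema for the four ARA layers described above."""
    scientific_logic: str                                       # claims, hypotheses, rationale
    code: dict[str, str] = field(default_factory=dict)          # file path -> fully specified source
    evidence: list[dict] = field(default_factory=list)          # raw runs, metrics, seeds
    exploration_map: list[dict] = field(default_factory=list)   # branches, incl. dead ends

ara = AgentNativeArtifact(
    scientific_logic="Hypothesis A held; Hypothesis B was rejected (see exploration_map).",
    code={"train.py": "..."},
    evidence=[{"run": "exp-014", "metric": "accuracy", "value": "..."}],
    exploration_map=[{"branch": "alternative preprocessing", "outcome": "dead end"}],
)
```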

The Live Research Manager operates at session boundaries: a three-stage pipeline (Context Harvester → Event Router → Maturity Tracker) distills each researcher–agent conversation into typed events that accumulate across ARA layers over time.
Scientists can simply do their work while a background manager quietly captures every pivot, translating their journey into this rich structure without extra paperwork. When the team tested this approach, the results were striking. AI agents using this layered format jumped from accurately answering research questions roughly seventy-two percent of the time to nearly ninety-four percent.
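Here is a hedged sketch of how that three-stage pipeline could be wired together; each stage's behavior is an assumption based on its name in the figure caption, not the paper's implementation:

```python
def context_harvester(session_transcript: str) -> list[str]:
    """Pull candidate decisions and pivots out of one researcher-agent session."""
    return [line.strip() for line in session_transcript.splitlines() if line.strip()]

def event_router(snippets: list[str]) -> list[dict]:
    """Turn each snippet into a typed event routed to an ARA layer."""
    return [{"type": "decision", "layer": "exploration_map", "text": s}
            for s in snippets]

def maturity_tracker(events: list[dict], ara: dict) -> dict:
    """Accumulate routed events into the artifact's layers across sessions."""
    for event in events:
        ara.setdefault(event["layer"], []).append(event)
    return ara

# Run once per session boundary; the artifact matures across sessions.
ara: dict = {}
for session in ["Tried approach X; abandoned it.", "Switched to approach Y."]:
    ara = maturity_tracker(event_router(context_harvester(session)), ara)
```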

Three-stage ARA-native review pipeline.
Furthermore, the agents became significantly more successful at reproducing complex experiments. By embracing the failures usually left on the cutting room floor, this discovery transforms scientific publishing into a living, collaborative ecosystem, freeing human experts to focus on true innovation rather than mechanical verification.

