DeepSeek Just dropped a new speculative decoding method!
plus more about Tapered LMs, Improved LLDMs, AutoData, and You Don't Need To Run Every Eval
June 23rd ~ June 30th
#114 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 39k OpenAI has announced a limited preview of its new GPT-5.6 model family, which includes the flagship Sol, the balanced Terra, and the cost-efficient Luna. The flagship Sol model can handle cybersecurity tasks and complex command-line workflows. It is best suited for long-horizon security tasks including vulnerability research and exploitation.

♥ 6.7k The Ornith-1.0 family of open-source models (ranging from 9B dense to 397B MoE parameters) specializes in agentic coding tasks. These models are post-trained on Gemma 4 and Qwen 3.5, and they use a reinforcement learning strategy that jointly optimizes task-specific scaffolds and solution rollouts to improve coding outcomes. You can try them on Hugging Face.

♥ 4.7k Alibaba's Qwen team has open-sourced AgentWorldBench, a seven-domain benchmark for environment simulation, alongside Qwen-AgentWorld-35B-A3B, a Mixture-of-Experts model designed for world modeling. The work covers two approaches: using the world model as a controllable simulator for reinforcement learning, and internalizing environment prediction directly within the agent foundation model. You can try it on GitHub or Hugging Face.

♥ HOT Anthropic has introduced Claude Sonnet 5, its most agentic Sonnet model yet. The model can make plans, use tools like browsers and terminals, and complete complex coding, reasoning, and knowledge-work tasks with more autonomy than previous Sonnet models. It is best suited for developers and teams who need strong agentic performance at a lower cost than larger Opus-class models.

Intuitive AI Academy - NEW Optimization Chapter!
My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on building your intuition to understand LLMs, from transformer components, to post-training logic. All in one place.
We just added a new chapter on Optimization that goes through the history, the key techniques, and the current state of the optimizers that frontier models use.

We currently have an exclusive offer for newsletter readers: 40% off the yearly plan.
Use code: TIMELINE
Tapered Language Models
Bayat et al. [Mila, Cornell University, Université de Montréal, CIFAR AI Chair]
♥ 151 LLM architecture
Currently, almost all LLMs distribute their learning capacity uniformly across every layer in their architecture, treating early and late layers identically. However, evidence suggests that later layers do less heavy lifting and mostly refine what earlier layers have already figured out, making this uniform resource distribution highly inefficient.

Tapering MLP width improves perplexity at no additional parameter or compute cost
To address this mismatch, researchers introduced "Tapered Language Models," an architectural principle that gradually reduces processing capacity across the depth of the model. Instead of keeping the internal width of the model's primary processing units constant, they used a smooth cosine schedule to front-load capacity in the early layers and taper it down toward the end.

Front-loading MLP capacity improves perplexity
Under a strictly fixed parameter budget, this design ensures that the model concentrates its resources where they are needed most, achieving better results without adding any extra training or inference computing costs.
The researchers found that this tapering method consistently improves text prediction and reasoning performance across various model sizes and different foundational AI architectures.
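To make the idea concrete, here is a minimal sketch of a cosine taper under a fixed budget. It is not the paper's code: the function name `tapered_widths`, the `taper` strength, and the example sizes are all assumptions for illustration, and it assumes MLP parameters scale linearly with the hidden width, so keeping the sum of widths constant keeps the parameter count constant.

```python
import math

def tapered_widths(num_layers, base_width, taper=0.5):
    """Return per-layer MLP widths that follow a cosine taper.

    The widths sum to num_layers * base_width, i.e. the same total as a
    uniform model of equal depth. `taper` is a hypothetical knob: 0 gives
    the uniform baseline, larger values front-load more capacity.
    """
    if num_layers < 2:
        return [base_width] * num_layers
    widths = []
    for i in range(num_layers):
        # cos runs from +1 at the first layer to -1 at the last layer,
        # and the values cancel out when summed over the whole stack.
        shape = math.cos(math.pi * i / (num_layers - 1))
        widths.append(round(base_width * (1 + taper * shape)))
    # Rounding can leave a small residual; fold it into the first layer
    # so the total budget is matched exactly.
    widths[0] += num_layers * base_width - sum(widths)
    return widths

widths = tapered_widths(num_layers=12, base_width=4096, taper=0.5)
print(widths)       # widest layer first, narrowest last
print(sum(widths))  # same total as 12 uniform layers of width 4096
```

Setting `taper=0` recovers the uniform baseline, which makes it easy to compare the two layouts at the same budget.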

Layer updates become more aligned with the residual stream at greater depths.
By analyzing how information flows through the network, they discovered that earlier layers write highly novel features into the model's memory stream, while later layers yield diminishing returns by reinforcing existing data.
Improved Large Language Diffusion Models
Nie et al. [Gaoling School of Artificial Intelligence, Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Engineering Research Center of Next-Generation Intelligent Search and Recommendation, ByteDance Seed]
♥ 574 Diffusion LMs bycloud’s pick
Currently, most LLMs generate text strictly one word at a time from left to right, which limits their ability to plan ahead. While bidirectional "diffusion" models can look in both directions at once to solve this, they have historically struggled to match the performance of their traditional, step-by-step counterparts.

Benchmark Results of Base Models.
To bridge this gap, researchers developed iLLaDA, an eight-billion-parameter model trained entirely from scratch using fully bidirectional attention. Instead of predicting the very next word, this model learns by predicting missing words anywhere in a sequence, using context from both the left and the right simultaneously.
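The "predict missing words anywhere" objective can be sketched as a data-preparation step. This is only an illustration under stated assumptions: the `[MASK]` string, the helper name `make_masked_example`, and the choice to draw a fresh masking ratio per example are hypothetical, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def make_masked_example(tokens, mask_ratio, rng):
    """Hide a random subset of tokens and record what was hidden.

    Returns the corrupted sequence plus a {position: original_token} map.
    A bidirectional model would see the whole corrupted sequence at once
    and be trained to fill in only the masked positions.
    """
    corrupted, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK)
            targets[pos] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
# Assumption: a new masking ratio is drawn for every training example.
ratio = rng.random()
corrupted, targets = make_masked_example(tokens, ratio, rng)
print(corrupted)
print(targets)
```

Unlike next-word prediction, the targets can sit anywhere in the sequence, so every prediction can use context on both sides.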

Benchmark Results of Instruct Models.
By scaling its pre-training to twelve trillion tokens and fine-tuning the system on a twenty-five-billion-token instruction dataset across twelve epochs, the researchers demonstrated that this bidirectional approach can be highly competitive. They also optimized the model's efficiency by using grouped-query attention to manage memory and tying internal parameters together to keep the model compact.
That said, there is still room for improvement on complex instruction-following tasks compared to some of the most advanced traditional models.
Autodata: An agentic data scientist to create high quality synthetic data
Kulikov et al. [FAIR at Meta]
♥ 466 LLM Training Data
Finding a way to automatically create rich training data could unlock the next generation of capable models. However, existing automated data creation methods often generate examples that are either too easy or too difficult.

Autodata creation of CS research questions
To address this, researchers introduced a framework called Autodata, which trains an AI agent to act like a data scientist. Instead of just churning out data based on a static prompt, this agent enters a creative cycle: it generates training tasks, evaluates how well different AI models perform on them, analyzes the results, and refines its approach to build better data in the next round.

Meta-optimization of the data scientist agent
The paper introduces the Agentic Self-Instruct system, which uses a team of virtual subagents, including a "challenger" to write questions, a "judge" to score them, and both "weak" and "strong" model versions to test them.
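One way to picture how these roles fit together is a filter that keeps only questions of the right difficulty. The sketch below is an assumption-heavy toy, not the paper's pipeline: the function `calibrate`, the `min_gap` threshold, and the rule "keep a question when the strong model clearly beats the weak one" are all hypothetical, and the solvers and judge are plain stand-in functions rather than real models.

```python
def calibrate(candidates, weak_solver, strong_solver, judge, min_gap=0.5):
    """Keep questions where the strong solver clearly beats the weak one.

    Questions both solvers get right are treated as too easy, and
    questions both get wrong as too hard (a hypothetical selection rule).
    """
    kept = []
    for question in candidates:
        weak_score = judge(question, weak_solver(question))
        strong_score = judge(question, strong_solver(question))
        if strong_score - weak_score >= min_gap:
            kept.append(question)
    return kept

# Toy stand-ins: a "question" is just a difficulty level.
def weak_solver(question):
    return "correct" if question["difficulty"] <= 1 else "wrong"

def strong_solver(question):
    return "correct" if question["difficulty"] <= 3 else "wrong"

def judge(question, answer):
    return 1.0 if answer == "correct" else 0.0

# In the real system a challenger agent would write these.
candidates = [{"difficulty": d} for d in (1, 2, 5)]
print(calibrate(candidates, weak_solver, strong_solver, judge))
# [{'difficulty': 2}]
```

In a full loop, the scores would also be fed back to the challenger so the next batch of questions lands closer to the useful difficulty range.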
The results across diverse areas like computer science research, law, and scientific reasoning are highly encouraging. Models trained on this carefully calibrated data showed substantial performance gains, proving that this agentic loop creates much more robust training signals than traditional automated methods.
You Don't Need to Run Every Eval
Zeng and Papailiopoulos [Harvard University]
♥ 1k LLM Benchmarks
Evaluating new AI models has become incredibly expensive and time-consuming, often costing thousands of dollars per run. Currently, researchers run dozens of independent benchmarks to track an AI's progress and compare different design choices, which creates a massive computational bottleneck.

In this paper, researchers analyzed a large matrix of 84 frontier models evaluated on 133 different benchmarks. They discovered something surprising: this complex landscape is approximately "rank-2," meaning an AI’s scores across all these diverse tests are largely determined by just two underlying factors.
The team built a tool called BENCHPRESS. Instead of running over a hundred tests, a developer can run just five key "probe" benchmarks (such as tests focusing on graduate-level reasoning, coding, and general knowledge) and the system can reconstruct the remaining scores with remarkable accuracy. Even when using a more budget-friendly set of five cheaper benchmarks, the tool still estimates the missing scores to within a few percentage points of their true values.
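The rank-2 idea can be illustrated with a small least-squares sketch. This is not the BENCHPRESS implementation: the helper names, the benchmark labels, and every number below are made up for illustration, and it assumes each benchmark already has a two-number "loading" vector learned from historical scores.

```python
def fit_latent(probe_loadings, probe_scores):
    """Least-squares fit of a model's two latent factors from probe scores.

    Solves the 2x2 normal equations (sum v v^T) z = sum v * score.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (v1, v2), score in zip(probe_loadings, probe_scores):
        a11 += v1 * v1
        a12 += v1 * v2
        a22 += v2 * v2
        b1 += v1 * score
        b2 += v2 * score
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("probe benchmarks do not span both factors")
    z1 = (b1 * a22 - a12 * b2) / det
    z2 = (a11 * b2 - a12 * b1) / det
    return z1, z2

def predict_scores(loadings, probe_scores):
    """Estimate every benchmark that was not run from the probes that were."""
    probes = list(probe_scores)
    z1, z2 = fit_latent([loadings[b] for b in probes],
                        [probe_scores[b] for b in probes])
    return {name: v1 * z1 + v2 * z2
            for name, (v1, v2) in loadings.items()
            if name not in probe_scores}

# Hypothetical loadings for five benchmarks (illustrative numbers only).
loadings = {
    "reasoning": (1.0, 0.2),
    "coding": (0.7, 0.9),
    "knowledge": (0.9, 0.1),
    "math": (0.8, 0.6),
    "agentic": (0.5, 1.0),
}
# Scores measured on three probe benchmarks for a new model.
probe_scores = {"reasoning": 64.0, "coding": 60.0, "knowledge": 56.0}
print(predict_scores(loadings, probe_scores))
```

The probe scores above are exactly consistent with latent factors of 60 and 20, so the fit recovers them. With noisy real scores the fit would only be approximate, which is where a reliability estimate becomes useful.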
To ensure these shortcuts are safe to use, the researchers also created a reliability layer. This companion system evaluates how much different mathematical models disagree on a prediction, alongside how much similar data is already known about related models and benchmarks.
DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation
Cheng et al. [Peking University, DeepSeek-AI]
♥ 2.8k LLM Decoders
LLMs are slow and their word-by-word generation mechanism is a bottleneck for real-time applications like conversational assistants. Speculative decoding speeds this up by letting a small "drafter" model guess upcoming words for a large model to verify, but current drafters either guess too slowly or generate disconnected, incoherent word sequences.

The DSpark architecture and decoding cycle
To solve this, researchers developed DSpark, a framework that introduces a "semi-autoregressive" architecture. DSpark first uses a highly efficient parallel backbone to generate a block of word guesses all at once, which keeps drafting fast. To prevent these parallel guesses from becoming disconnected and error-prone, DSpark passes them through a lightweight sequential module. This extra step injects local transition information, so the words flow naturally together and the larger model has fewer awkward phrasings to reject.
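As a rough intuition for "parallel guesses, then a sequential touch-up", here is a toy re-ranking sketch. It is not DSpark's architecture, where both stages are neural modules: the function `sequential_refine`, the score dictionaries, and the additive scoring rule are all assumptions made for illustration.

```python
def sequential_refine(candidates, transition, prev_token):
    """Pick one token per position, left to right.

    `candidates` holds, for each position, the scores a parallel drafter
    assigned independently. `transition` adds a bonus for token pairs
    that fit together, so each choice depends on the previous one.
    """
    chosen = []
    for options in candidates:
        best = max(
            options,
            key=lambda tok: options[tok] + transition.get((prev_token, tok), 0.0),
        )
        chosen.append(best)
        prev_token = best
    return chosen

# Toy scores: position 2 slightly prefers "cat" when judged in isolation.
candidates = [{"New": 0.9, "Old": 0.1}, {"cat": 0.5, "York": 0.4}]
transition = {("New", "York"): 0.5}

print(sequential_refine(candidates, {}, "<s>"))          # ['New', 'cat']
print(sequential_refine(candidates, transition, "<s>"))  # ['New', 'York']
```

Without the transition scores the draft is locally plausible but disconnected. With them, the second position follows from the first, which is the kind of draft a verifier is more likely to accept.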

In addition to smarter drafting, DSpark introduces confidence-scheduled verification to optimize system efficiency. The framework adds a confidence head that estimates, for each guessed word in a sequence, the probability that it will survive verification by the larger model.
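A simple way to see how survival probabilities could drive a verification budget is sketched below. The actual scheduling rule is not described here, so treat this as an assumption: the functions `survival_probs` and `verify_length`, the `min_survival` threshold, and the idea of multiplying per-step confidences are hypothetical choices for illustration.

```python
def survival_probs(step_confidences):
    """Probability that the draft is still intact after each position.

    Assumes a guess only survives if every earlier guess also survived,
    so the per-step confidences are multiplied along the sequence.
    """
    probs, running = [], 1.0
    for conf in step_confidences:
        running *= conf
        probs.append(running)
    return probs

def verify_length(step_confidences, min_survival):
    """How many drafted tokens are worth sending to the large model."""
    count = 0
    for prob in survival_probs(step_confidences):
        if prob < min_survival:
            break
        count += 1
    return count

confidences = [0.9, 0.8, 0.5, 0.9]
print(verify_length(confidences, min_survival=0.5))  # 2
print(verify_length(confidences, min_survival=0.3))  # 4
```

A serving system could raise `min_survival` when it is busy, so that scarce verification compute is spent only on guesses that are likely to be accepted, and lower it when there is spare capacity.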

Load-adaptive throughput and verification budgets
In live tests, this balanced approach accelerated generation speeds for individual users by 60% to 85% compared to established baselines, showing that smarter scheduling can make advanced AI systems significantly more practical for real-world deployment.
