How to Compress Long Text into Images To Reduce LLM Tokens
Oct 20th ~ Oct 27th
#79 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 7.2k MiniMax has open-sourced its new M2 model, which specializes in advanced coding and complex agentic workflows. The model shows impressive performance and runs twice as fast and at only 8% of the cost of competitors like Claude Sonnet. You can explore the code on GitHub or try it out for free for a limited time via the MiniMax API.

♥ 1.5k R-HORIZON is a new LLM benchmark that addresses a critical weakness in AI: long-horizon reasoning across chained tasks. The research shows that even top models fail as problem chains grow, but that training on these specialized datasets can boost multi-step performance by over 17%. You can now access the full benchmark and training sets on Hugging Face to test your own models.
♥ 11k NotebookLM is making document summaries more engaging with two new video styles: an updated modern anime and a brand new kawaii version. The new styles will be rolling out to all users by the end of the week, and the team also confirmed that highly-requested Google Sheets support is coming very soon. Log in to your account to transform your most complex documents into adorable, shareable video summaries.

♥ 1.4k AI robotics company 1X has launched NEO, a new home robot designed to automate household chores and act as a helpful, conversational companion. It has a soft, safe, and lightweight design, and it is built to learn and expand its capabilities over time, handling everything from cleaning schedules to answering questions. You can reserve your own NEO home robot for a $499/month subscription with a $200 deposit.

Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
Blackbox Model Provenance via Palimpsestic Membership Inference
Kuditipudi et al. [Stanford University]
♥ 22k Image Generation
If you’ve shared an open-weight language model, then someone else may be using a derivative of it without permission. How can you prove it? Existing detection methods require inserting hidden markers into the model, which isn’t transparent, while others depend on keeping test data secret, which isn’t always practical.
This research introduces a clever way to test for model derivation by leveraging how language models memorize training data. The detection method measures how strongly a model’s behavior correlates with the randomized order in which the original training examples were seen.

In the query setting, where Alice can prompt Bob’s model, she evaluates the likelihood Bob’s model assigns to her training examples. She then checks for correlation between these likelihoods and the original training order. Since later examples are memorized more strongly, a positive correlation suggests that Bob’s model was created from Alice’s run.
In an observational setting, where Alice only sees Bob’s text without model access, the approach adapts by estimating how that text relates to Alice’s training order. This involves counting n-gram overlaps between Bob’s text and Alice’s training data, though large amounts of text are needed for the test to be effective.
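To make the query-setting test concrete, here is a minimal sketch in Python. The `log_likelihood` helper is a hypothetical stand-in for scoring one of Alice’s examples under Bob’s model, and the Spearman rank correlation is an illustrative choice of statistic rather than the paper’s exact test; the observational setting would swap the likelihood score for an n-gram overlap count.

```python
# Minimal sketch of the query-setting test (illustrative, not the paper's exact statistic).
# log_likelihood(model, example) is a hypothetical helper returning the total
# log-probability Bob's model assigns to one of Alice's training examples.
from scipy.stats import spearmanr

def provenance_test(bobs_model, alices_examples, log_likelihood):
    """alices_examples: Alice's training examples, listed in her randomized training order."""
    # Score each of Alice's examples under Bob's model.
    scores = [log_likelihood(bobs_model, ex) for ex in alices_examples]
    # Training-step index: later examples should be memorized more strongly
    # if Bob's model descends from Alice's training run.
    steps = list(range(len(alices_examples)))
    # Rank correlation between training order and likelihood; a small one-sided
    # p-value (positive correlation) is evidence that Bob's model is a derivative.
    rho, p_value = spearmanr(steps, scores, alternative="greater")
    return rho, p_value
```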

Median p-values and interquartile ranges over 10 trials.
Both settings rely on the same underlying principle: if Bob’s model or text is independent of Alice’s randomized training order, no significant correlation should exist. The tests are designed to be transparent, using only the training order and standard model outputs, and noninvasive, requiring no changes to the original training process.
DeepSeek-OCR: Contexts Optical Compression
Wei et al. [DeepSeek-AI]
♥ 424 DeepSeek bycloud’s pick
LLMs often struggle with processing long documents because the computational cost grows quadratically with the sequence length. This makes it expensive and slow to handle extensive texts. DeepSeek-OCR offers a fresh approach by using images as a compression medium for text. The idea is simple: a single image of a document can hold a lot of information while using far fewer tokens than the equivalent digital text.

DeepSeek-OCR is built around two main parts: the DeepEncoder and a decoder based on DeepSeek3B-MoE with 570 million activated parameters. The DeepEncoder is the core engine: it handles high-resolution inputs while keeping activation memory low and emitting only a small number of vision tokens, so the model can process detailed images without overwhelming computational resources.
To handle different document sizes and compression needs, DeepSeek-OCR supports multiple resolution modes, such as Tiny, Small, Base, Large, and Gundam. Each mode adjusts the input image size and the resulting vision tokens, enabling flexible compression ratios. For instance, the Tiny mode uses 64 vision tokens for a 512x512 image, while Gundam mode combines tiled local views with a global view for ultra-high-resolution inputs.
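As a back-of-envelope illustration of what these modes buy you, consider the sketch below. Only the Tiny-mode figure (64 vision tokens for a 512x512 image) comes from the article; the 600-token example page is an assumed value.

```python
# Illustrative arithmetic: how optical compression trades text tokens for vision tokens.
# Only the Tiny-mode figure (64 vision tokens per 512x512 page) is from the article;
# the example page holding ~600 text tokens is an assumption.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

ratio = compression_ratio(text_tokens=600, vision_tokens=64)
print(f"~{ratio:.1f}x compression")  # ~9.4x, within the sub-10x regime where the
                                     # article reports ~97% decoding precision on Fox
```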

The decoder then takes these compressed vision tokens and reconstructs the original text. It uses a mixture-of-experts architecture, which activates only a subset of parameters during inference, balancing expressive power with efficiency.
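For intuition on why this keeps inference cheap, here is a minimal sketch of top-k mixture-of-experts routing; the expert count, hidden sizes, and k are illustrative and not DeepSeek3B-MoE’s actual configuration.

```python
# Minimal sketch of top-k mixture-of-experts routing: only k experts run per token,
# so a large decoder activates a small fraction of its parameters at inference time.
# Sizes below are illustrative, not DeepSeek3B-MoE's actual configuration.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        weights, idx = self.gate(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: y = TinyMoE()(torch.randn(10, 512))
```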
On the Fox benchmark, DeepSeek-OCR achieves impressive results: decoding precision reaches 97% when the compression ratio is under 10x, i.e. when the original text contains at most 10 text tokens per vision token. Even at a 20x compression ratio, accuracy stays around 60%, showing that the model maintains reasonable performance under aggressive compression.
Glyph: Scaling Context Windows via Visual-Text Compression
Cheng et al. [Tsinghua University, Zhipu AI]
♥ 430 LLM Scaling Law
LLMs are being asked to handle longer texts, from entire books to complex legal documents. But as context windows stretch to hundreds of thousands of tokens, the computational and memory costs become overwhelming. Traditional methods that expand token limits or modify attention mechanisms still process each token individually, keeping costs high.

Glyph consists of three main stages: continual pre-training on rendered long-text data, LLM-driven genetic search for optimal rendering configurations, and post-training with SFT and RL.
Glyph works by converting long text sequences into a series of image pages. Each page contains the visual glyphs of many text tokens, letting a single visual token represent multiple words or characters. The framework involves three key stages to make this process effective.
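As a rough illustration of the rendering idea, here is a minimal sketch that paginates text into fixed-size images with PIL. The page size, font, and wrapping parameters are assumptions made for the example; Glyph searches over exactly these kinds of rendering parameters rather than fixing them.

```python
# Minimal sketch of rendering long text into image pages with PIL. Page size, font,
# and line wrapping here are assumptions for illustration only; Glyph's renderer and
# its searched configuration (font size, dpi, spacing, layout) will differ.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_pages(text: str, page_size=(1024, 1024), line_height=16,
                 chars_per_line=96, lines_per_page=60):
    font = ImageFont.load_default()           # stand-in; a real setup would load a TTF
    wrapped = textwrap.wrap(text, width=chars_per_line)
    pages = []
    for start in range(0, len(wrapped), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        for i, line in enumerate(wrapped[start:start + lines_per_page]):
            draw.text((20, 20 + i * line_height), line, fill="black", font=font)
        pages.append(page)                    # each page becomes one image input to the VLM
    return pages
```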
Continual pre-training teaches the vision-language model to understand text rendered in diverse visual styles. The model learns through tasks like reconstructing text from images, switching between text and image inputs, and generating missing parts of rendered documents. This builds a base model called Glyph-Base that can reason across visually compressed content.
Next, an LLM-driven genetic search finds the best way to render text into images. Starting with a population of rendering configurations (varying elements like font size, spacing, and layout), the method evaluates how well each setup balances compression and task accuracy; a schematic of this loop is sketched below.
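In the following schematic, `evaluate` and `llm_propose_mutation` are placeholder hooks standing in for the paper’s fitness measure (which trades compression against task accuracy) and the LLM that proposes new configurations.

```python
# Schematic of a genetic search over rendering configurations. evaluate() and
# llm_propose_mutation() are placeholders: in Glyph, fitness balances compression
# ratio against downstream accuracy, and an LLM proposes the new configurations.
import random

def genetic_search(initial_configs, evaluate, llm_propose_mutation,
                   generations=10, population_size=16, keep_top=4):
    population = list(initial_configs)
    for _ in range(generations):
        # Score each rendering config (e.g. font size, spacing, layout).
        scored = sorted(population, key=evaluate, reverse=True)
        survivors = scored[:keep_top]
        # Refill the population with LLM-proposed variations of the best configs.
        children = [llm_propose_mutation(random.choice(survivors))
                    for _ in range(population_size - keep_top)]
        population = survivors + children
    return max(population, key=evaluate)
```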
Finally, post-training fine-tunes the model using the optimal rendering setup. Supervised fine-tuning employs thinking-style responses to encourage step-by-step reasoning over the visual input.
Glyph achieves a 3–4× compression of long text sequences while matching the accuracy of leading LLMs like Qwen3-8B on standard long-context benchmarks. In tests on LongBench, which includes tasks like multi-document QA and summarization, Glyph scored competitively across the board, even outperforming some models in specific areas like code understanding. This method allows a VLM with a 128K context window to handle tasks requiring up to 1 million tokens effectively.

Performance on the Ruler benchmark (%).
The Free Transformer
François Fleuret [FAIR at Meta]
♥ 855 LLM Transformers
Transformers have become the backbone of modern AI systems as they power everything from language models to creative tools. However, they still generate text one token at a time, based only on previous tokens. This autoregressive approach forces the model to make all its decisions implicitly as it goes along. For example, if a Transformer is trained to write both positive and negative movie reviews, it doesn't start by deciding the review's tone. Instead, it gradually infers the sentiment from the words it has already produced, which can lead to inconsistencies or errors if early tokens are ambiguous.

A standard decoder Transformer
The Free Transformer paper addresses this limitation by allowing the model to condition its generation on random latent variables. These variables act like hidden decisions made upfront, such as choosing whether a review will be positive or negative before any words are written. By incorporating this capability, the model can avoid the pitfalls of purely implicit decision-making, leading to more stable and efficient generation.

The Free Transformer.
The Free Transformer starts with the same foundation as a standard decoder Transformer, processing tokens through a series of layers. The key innovation comes in the middle of this process, where a random latent variable Z is injected. During generation, Z is sampled from a uniform distribution, but during training, an encoder determines Z based on the full sequence, ensuring it captures meaningful global properties like sentiment or structure. This setup allows the model to use Z to guide the rest of the token generation, making explicit what was previously implicit.
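Here is a heavily simplified schematic of the injection idea: a latent Z enters midway through the decoder stack and is sampled uniformly at generation time. The training-time encoder and the VAE-style objective that would produce Z from the full sequence are omitted, and the layer sizes are arbitrary; this is a sketch of the concept, not the paper’s architecture.

```python
# Schematic sketch of mid-stack latent injection (simplified; not the paper's exact
# architecture or training objective). At generation time the "global decision" Z is
# sampled before any tokens are produced; at training time it would come from an
# encoder that sees the full sequence (omitted here).
import torch
import torch.nn as nn

class FreeTransformerSketch(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_layers=12, n_heads=8, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        half = n_layers // 2
        self.lower = nn.ModuleList([make() for _ in range(half)])
        self.upper = nn.ModuleList([make() for _ in range(n_layers - half)])
        self.z_proj = nn.Linear(z_dim, d_model)
        self.z_dim = z_dim
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, tokens, z=None):
        # Causal self-attention mask, as in a standard decoder-only Transformer.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        h = self.embed(tokens)
        for block in self.lower:
            h = block(h, src_mask=mask)
        if z is None:
            # Generation: sample Z up front from a uniform prior.
            z = torch.rand(tokens.size(0), self.z_dim, device=tokens.device)
        h = h + self.z_proj(z).unsqueeze(1)   # inject the latent midway through the stack
        for block in self.upper:
            h = block(h, src_mask=mask)
        return self.lm_head(h)                # next-token logits
```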

Experimental results with 1.5B and 8B parameter models show that the Free Transformer achieves substantial improvements on multiple downstream benchmarks compared to standard decoder Transformers. Tasks that benefit from structured outputs, such as classification or coherent long-form text, see notable boosts.