
# Transformers In-Context learning, Image to 3D models, Test-Time Computation

## #18 | Latest AI Research Explained Simply

In this issue: x3 industry news, x3 AI research papers

*Aug 5th ~ Aug 11th*

## 🗞️ Industry News in 1 Line

♥ 1k Qwen2-Math, a new LLM trained on high-quality mathematical data such as books, exams, and web texts, was released last week. The researchers have developed three sizes (1.5B/7B/72B), and each comes with an instruct model for chat capabilities.

♥ 3.7k This was a busy week for OpenAI: they released a new version of GPT-4o, reduced token prices (50% cheaper input and 33% cheaper output), and introduced structured outputs in their API (developers can specify a JSON schema for the output).

♥ 3.6k A new humanoid robot called “Figure 02” uses vision language models to walk and talk like humans. It uses 6 cameras and microphones to generate speech-to-speech output in real time, can carry a 20 kg payload, and can work for nearly 20 hours a day.

Figure 02 - An autonomous humanoid robot

## Practice Coding While Earning with Shipd

Shipd is the latest gamified coding platform that **pays top devs to code**. Have fun and earn while pushing SoTA coding LLM capabilities. On Shipd, top programmers can **solve question sets in “LeetCode style” and win payouts** by holding the best solutions.

Shipd presents a rotating selection of questions in various programming languages, with a **growing prize pool currently at $55k/month**.

## Transformers are Universal In-context Learners

*Furuya et al. [Shimane University, Rice University, CNRS, ENS, PSL University]*

###### ♥ 1.2k LLM Theory

Deep transformers with a fixed embedding dimension are universal approximators for an arbitrarily large number of tokens.

### Introduction to In-Context Learning using Transformers

Nearly all AI models released in the past few years have one thing in common: they all use Transformers. Transformers have shown remarkable performance in natural language processing and computer vision. However, we still don’t have a very good understanding of what happens inside transformers, especially when dealing with arbitrarily large contexts (i.e., a very large or potentially infinite number of input tokens).

This paper introduces a clever way to think about how transformers work, making it easier to analyze them mathematically:

- Instead of thinking about individual pieces of information, it suggests thinking about the overall pattern or distribution of information.

- It introduces a way to measure how similar or different two sets of information are, even if they have different sizes.

- It shows that transformers can be thought of as machines that take in a pattern of information and a specific piece of information, and then produce a new piece of information based on the context.

By having a better understanding of fundamental concepts behind Transformers, researchers can design and develop better and more efficient architectures in the future.

### How Do Transformers Learn In-Context Information?

Transformers are complex, and this paper introduces a new way to understand how they work mathematically. Instead of thinking about individual tokens, the paper suggests representing the input as a probability distribution. Think of it as considering the overall pattern of the data rather than each piece separately. The researchers rely on the following concepts to analyze how transformers use in-context information:

- **Wasserstein Distance:** This is a way to measure how different two probability distributions are. It's useful for comparing inputs of different sizes or even comparing finite sets of tokens to continuous distributions.

- **In-context Mappings:** These are functions that take two inputs: a context (the overall pattern of tokens) and a specific token. They produce an output based on both these inputs. The paper reframes transformer operations in terms of these mappings.

- **Measure Theory:** This is a branch of mathematics that deals with generalizing the notion of size or volume to more abstract sets. The paper uses this to handle potentially infinite sets of tokens.

- **Push-forward:** This is an operation that takes a probability distribution and a function and produces a new probability distribution. It's used to describe how transformers modify the input distribution.

- **Weak\* Topology:** This is a way of defining nearness or similarity between probability distributions. The paper uses this to define what it means for in-context mappings to be continuous or smooth.

- **Universal Approximation:** This is the idea that a certain type of function can get arbitrarily close to any other function in a certain class. The paper proves that transformers have this property for continuous in-context mappings.
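To make the Wasserstein distance a little more concrete, here is a minimal sketch for the simplest case: tokens that are single numbers, each set treated as a uniform distribution over its values. The function name `wasserstein_1d` is made up for this illustration, and the paper works with tokens in higher dimensions, so this only conveys the idea that sets of different sizes can be compared.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """W1 distance between the uniform empirical distributions on xs and ys.

    In one dimension, W1 equals the area between the two cumulative
    distribution functions (CDFs).
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are constant on the interval [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Two token sets of different sizes that describe the same distribution.
print(wasserstein_1d([0, 1], [0, 0, 1, 1]))  # 0.0
# Two token sets that describe different distributions.
print(wasserstein_1d([0, 1], [0.5]))  # 0.5
```

The first call returns zero even though one set has two tokens and the other has four, which is the property that lets the analysis ignore the raw token count.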

The paper defines a special kind of composition for in-context mappings, showing how multiple transformer layers work together. These concepts allow the researchers to rigorously analyze transformers in a way that's independent of the number of input tokens, providing a unified framework for understanding these powerful AI models.
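As a rough illustration of in-context mappings and their composition, the sketch below treats a context as a list of `(mass, token)` pairs and uses a heavily simplified attention layer: a single head, identity query/key/value matrices, and no MLP. These simplifications and all function names are assumptions made for this example and are not taken from the paper.

```python
import math


def attention(context, x):
    """In-context mapping: takes a context (list of (mass, token)) and a token x."""
    scores = [m * math.exp(sum(a * b for a, b in zip(x, y))) for m, y in context]
    z = sum(scores)
    mixed = [
        sum(s * y[k] for s, (_, y) in zip(scores, context)) / z
        for k in range(len(x))
    ]
    # Residual connection: the token plus its attention-weighted average.
    return [xk + mk for xk, mk in zip(x, mixed)]


def push_forward(layer, context):
    """Move every context token through the layer, keeping its mass."""
    return [(m, layer(context, y)) for m, y in context]


def compose(layer2, layer1):
    """Stack two layers: the second one sees the pushed-forward context."""
    def stacked(context, x):
        return layer2(push_forward(layer1, context), layer1(context, x))
    return stacked


def empirical(tokens):
    """Uniform distribution over a finite list of tokens."""
    return [(1.0 / len(tokens), t) for t in tokens]


two_layers = compose(attention, attention)
tokens = [[1.0, 0.0], [0.0, 1.0]]
query = [0.5, 0.5]
print(two_layers(empirical(tokens), query))
print(two_layers(empirical(tokens * 3), query))  # same distribution, 3x the tokens
```

Both calls print the same values up to floating-point rounding: repeating every token three times leaves the distribution unchanged, so the output does not change either. This is the sense in which the analysis is independent of the number of input tokens.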

### Final Thoughts on In-context learning using Transformers

This paper proves that, when transformers are designed in a certain way, they can approximate any continuous in-context mapping, i.e., any continuous relationship from a context and a token to an output. This means that a single transformer design can handle any number of tokens, even an infinite number. Moreover, the embedding dimension and the number of attention heads stay fixed: they don't need to grow as the transformer handles more tokens or tries to be more accurate.

This work provides a strong mathematical foundation for understanding transformers. It helps explain why they work so well in practice and suggests they might be even more powerful than we thought. This could lead to better design of AI systems in the future and a deeper understanding of how they process information. The following image shows the mathematical equation proved by this paper:

Mathematical equation to prove that Transformers are Universal In-context Learners

While this is a big step forward, the researchers note that they don't have exact control over how complex the transformer needs to be to achieve a certain level of accuracy.

## An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

*Yan et al. [Simon Fraser University, City University of Hong Kong, Canada-CIFAR AI Chair]*

###### ♥ 805 3D Gen

Omage 3D model generation pipeline

### Introduction to Object Images (omage)

AI models can work well on text and images, but they still struggle with 3D model generation. 3D models have varying vertex densities, complex topologies with holes and multiple connected components, and non-uniform connectivity. This irregularity makes it challenging to apply standard generative modeling techniques.

The paper aims to solve this problem by converting complex 3D shapes into a more manageable 2D format using a new approach called Object Images (omage). It represents 3D models as Multi-Chart Geometry Images (MCGIM), which are essentially 2D images that encode surface geometry, appearance, and patch structures. By converting 3D shapes into 2D images, the method creates a regular, grid-like representation that can be easily processed by existing image generation models, such as Diffusion Transformers. This method aims to generate high-quality 3D models with preserved geometric and semantic structures, while also supporting material generation.

### How do Object Images (omage) Work?

Object Image (omage) represents a 3D object as a special kind of 2D image which contains all the information about the 3D object in a 2D format. When creating these images, the 3D object is divided into several patches. Each patch is a piece of the object's surface. These patches are then flattened and arranged in a 2D space, similar to how you might unfold a paper model. The omage stores several types of information for each point on these flattened patches:

- Position: Where this point would be in 3D space

- Occupancy: Whether this point is part of the object or empty space

- Material properties: Color, shininess, bumpiness, etc.

All this information is combined into a single image with 12 channels of data (like a very complex color image, but with more than just red, green, and blue). The initial omage is very detailed (1024x1024 pixels) but it's shrunk down to 64x64 pixels to make it easier to process. However, this shrinking process is done carefully to preserve important details, especially at the edges of patches.
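One simple way to shrink a position map without smearing empty pixels into the surface is to average only over occupied pixels in each block. The sketch below shows that idea on plain nested lists. It is an illustrative assumption, not the paper's exact downsampling procedure, and the name `downsample` is made up for this example.

```python
def downsample(positions, occupancy, factor):
    """Average 3D positions over the occupied pixels of each factor x factor block.

    positions: square grid (list of rows) of (x, y, z) tuples
    occupancy: square grid of booleans, same shape as positions
    """
    size = len(occupancy) // factor
    out_pos = [[(0.0, 0.0, 0.0)] * size for _ in range(size)]
    out_occ = [[False] * size for _ in range(size)]
    for bi in range(size):
        for bj in range(size):
            pts = [
                positions[i][j]
                for i in range(bi * factor, (bi + 1) * factor)
                for j in range(bj * factor, (bj + 1) * factor)
                if occupancy[i][j]
            ]
            if pts:
                # A block is occupied if any of its pixels is occupied.
                out_occ[bi][bj] = True
                out_pos[bi][bj] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return out_pos, out_occ


# A 2x2 map whose right column is empty space, shrunk to a single pixel.
positions = [[(1.0, 0.0, 0.0), (9.0, 9.0, 9.0)],
             [(3.0, 0.0, 0.0), (9.0, 9.0, 9.0)]]
occupancy = [[True, False],
             [True, False]]
print(downsample(positions, occupancy, factor=2))
```

A plain average over all four pixels would pull the result toward the meaningless values stored in the empty pixels, which is the kind of edge damage careful downsampling has to avoid. Going from 1024x1024 to 64x64 would correspond to `factor=16`.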

With the 3D objects now represented as these special 2D images, a machine learning model (specifically, a Diffusion Transformer) is trained to generate new omages. The generation process happens in two stages:

- First, it generates geometric information (shape and structure).

- Then, it generates the material information (color, texture, etc.) based on the geometry.

Finally, the generated omages can be converted back into 3D objects using the position and occupancy information. Once the 3D object is reconstructed, the material properties are applied to give it the right appearance and texture.
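The conversion back to 3D can be sketched as follows: every occupied pixel becomes a vertex, and every 2x2 block of occupied pixels becomes two triangles. This is a minimal sketch under the assumption that neighboring pixels inside a patch are neighbors on the surface. It ignores stitching between patches, and `omage_to_mesh` is a hypothetical name.

```python
def omage_to_mesh(positions, occupancy):
    """Return (vertices, triangles) from a position map and an occupancy mask."""
    index = {}
    vertices = []
    rows, cols = len(occupancy), len(occupancy[0])
    for i in range(rows):
        for j in range(cols):
            if occupancy[i][j]:
                index[(i, j)] = len(vertices)
                vertices.append(positions[i][j])

    triangles = []
    for i in range(rows - 1):
        for j in range(cols - 1):
            corners = [(i, j), (i, j + 1), (i + 1, j + 1), (i + 1, j)]
            # Only connect pixels when the whole 2x2 cell lies on the object.
            if all(c in index for c in corners):
                a, b, c, d = (index[k] for k in corners)
                triangles += [(a, b, c), (a, c, d)]
    return vertices, triangles


# A 2x2 fully occupied map: a flat unit square in 3D.
positions = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
occupancy = [[True, True],
             [True, True]]
vertices, triangles = omage_to_mesh(positions, occupancy)
print(len(vertices), triangles)  # 4 [(0, 1, 3), (0, 3, 2)]
```

Because each patch is meshed on its own in this sketch, gaps can appear where patches meet.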

Example of 3D models generated by Omage

### Evaluating Object Images (omage)

This new method for creating 3D objects seems promising, as it can generate new, realistic 3D objects with materials by working with small 2D images. This makes it easier and faster to create 3D objects, which could be very useful for things like video games, movies, or virtual reality, where lots of detailed 3D objects are needed. However, it's not perfect yet. Sometimes it makes mistakes in connecting parts of the object, and it needs good-quality 2D maps of the 3D objects to work with. Also, right now it only works at a fairly low image resolution (64x64 pixels).

Example of 3D models generated by Omage

## Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

*Snell et al. [UC Berkeley, Google DeepMind]*

###### ♥ 547 LLM Efficiency

Test-Time Computation workflow pipeline

### Introduction to Scaling LLM at Test-Time

Large language models are typically given a fixed compute budget, both during pre-training and at inference. Once trained, their ability to improve performance on a given task is constrained by these initial settings. While increasing model size and pretraining compute generally leads to better performance, this approach has diminishing returns and is not always feasible due to resource constraints.

Larger models require significant pretraining compute, but this does not always translate to superior performance during inference on complex tasks. A few weeks back, we discussed a research paper that explained how to train models at test time, but that approach is still very new and hasn’t been tested in a real-world setting.

This paper explores how to improve LLM performance by optimizing the use of computational resources at inference time, rather than just relying on model size and pretraining compute. The paper suggests two mechanisms to scale computation at inference time:

- **Dense, Process-based Verifier Reward Models**: This approach searches over possible responses using a dense verifier that checks the quality of each response based on a learned reward model.

- **Adaptive Response Distribution**: Instead of generating responses in a fixed manner, the model updates its distribution over potential answers adaptively, which could improve accuracy by refining responses iteratively based on the prompt.

Image depicting how LLMs do mathematical reasoning.

### How Does Test-Time Computation Work?

There are two ways to improve the outputs of LLMs during test-time computation:

#### Modifying the Proposal Distribution

- **Input Level Modification**: We can enhance the prompt given to the large language model (LLM) by adding extra words or instructions. These additions help the model think about different parts of the problem that might not be obvious at first. This way, the LLM can produce better predictions.

- **Output Level Modification**: Instead of just accepting the first output the model gives, we can generate multiple possible answers. We then choose the best ones to refine further. This process ensures that we end up with the most promising solution after considering several options.

- **Self-Critique and Iterative Revision**: Techniques like self-critique allow the model to review and improve its answers step by step. We train the model to assess its own responses and make necessary adjustments, leading to more accurate and reliable results.

Visualization of Best-of-N search, Beam Search, and Lookahead Search method.

#### Optimizing the Verifier

- **Best-of-N Sampling**: In this method, we generate several complete solutions and then pick the best one using a verifier. This straightforward technique helps us improve accuracy by leveraging multiple outputs.

- **Process-based Verifier**: Unlike methods that only check the final answer, a process-based verifier looks at each step of the solution. This detailed evaluation helps the model identify and fix errors in the middle of the process, which improves the quality of the final answer.

- **Tree Search with Process Reward Model (PRM)**: This approach uses predictions about the accuracy of each step to explore different ways to solve the problem. It allows for more efficient and effective searching compared to traditional methods by focusing on the most promising solution paths.
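The difference between best-of-N sampling and a PRM-guided tree search can be sketched with a toy problem. Here a "solution" is a tuple of steps, and `toy_prm` is a made-up process reward that simply counts how many steps match a known target. A real PRM is a learned model with no access to the answer, so everything below is an illustrative assumption.

```python
import random

TARGET = (3, 1, 2)  # hypothetical "correct" sequence of reasoning steps


def toy_prm(partial):
    """Toy process reward: how many steps so far match the target."""
    return sum(a == b for a, b in zip(partial, TARGET))


def best_of_n(sample, verifier, n, rng):
    """Draw n complete solutions and keep the one the verifier scores highest."""
    return max((sample(rng) for _ in range(n)), key=verifier)


def beam_search(expand, prm, beam_width, depth):
    """Grow solutions step by step, keeping the beam_width best partial ones."""
    beams = [()]
    for _ in range(depth):
        candidates = [b + (step,) for b in beams for step in expand(b)]
        beams = sorted(candidates, key=prm, reverse=True)[:beam_width]
    return beams[0]


def toy_sample(rng):
    """Stand-in for sampling one full solution from the model."""
    return tuple(rng.choice([1, 2, 3]) for _ in range(len(TARGET)))


def toy_expand(partial):
    """Stand-in for proposing candidate next steps."""
    return [1, 2, 3]


print(beam_search(toy_expand, toy_prm, beam_width=2, depth=3))  # (3, 1, 2)
print(best_of_n(toy_sample, toy_prm, n=16, rng=random.Random(0)))
```

Best-of-N only scores finished solutions, while beam search uses the step-level scores to discard weak partial solutions early, which is why it can focus compute on the most promising paths.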

### Evaluating Test-Time Computation

This paper shows that combining sequential and parallel sampling improves the performance of LLMs during test-time computation. Sequential sampling acts as a local refinement by making small improvements to responses that are already somewhat accurate. Parallel sampling, on the other hand, acts as a global search by exploring different solutions.

The paper also says that there is a "compute-optimal" balance between these two approaches, which varies with question difficulty. Easier questions benefit more from sequential revisions, while harder questions require a mix of both strategies. By adjusting the ratio of sequential to parallel sampling based on difficulty, the model can achieve higher accuracy using significantly less compute than traditional best-of-N sampling.
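One way to picture the "compute-optimal" balance is as a lookup: for a fixed generation budget, enumerate the ways of splitting it into parallel chains and sequential revision steps, then pick the best split for each difficulty level using held-out accuracy. The sketch below assumes that setup. The accuracy numbers are made-up placeholders that mirror the trend described above, not results from the paper, and the function names are hypothetical.

```python
def strategies(budget):
    """All ways to split `budget` generations into (parallel chains, steps per chain)."""
    return [
        (budget // length, length)
        for length in range(1, budget + 1)
        if budget % length == 0
    ]


def compute_optimal_policy(accuracy_by_bin):
    """For each difficulty bin, pick the split with the highest validation accuracy."""
    return {
        difficulty: max(accuracy, key=accuracy.get)
        for difficulty, accuracy in accuracy_by_bin.items()
    }


print(strategies(16))  # [(16, 1), (8, 2), (4, 4), (2, 8), (1, 16)]

# Placeholder accuracies for illustration only (not taken from the paper).
placeholder_accuracy = {
    "easy": {(16, 1): 0.80, (4, 4): 0.85, (1, 16): 0.90},
    "hard": {(16, 1): 0.25, (4, 4): 0.30, (1, 16): 0.20},
}
print(compute_optimal_policy(placeholder_accuracy))
```

Here `(16, 1)` is pure parallel sampling (best-of-16) and `(1, 16)` is a single chain of 16 sequential revisions. With these placeholder numbers, the policy picks the fully sequential split for easy questions and a mixed split for hard ones.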

Benchmark results of Test-Time Computation approach for problems of different hardness.

🚨 This week’s top AI/ML research papers:

- An Object is Worth 64x64 Pixels

- Self-Taught Evaluators

- Transformers are Universal In-context Learners

- Pre-training Once for Models of All Sizes

- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model… x.com/i/web/status/1…

The AI Timeline (@TheAITimeline), 8:37 PM • Aug 10, 2024
