Any-to-Any models, LLM Decision boundaries, and One Thousand & One Pairs

#12 | Latest AI Research Explained Simply

In this issue: x4 industry news, x3 AI research papers

June 25th ~ July 2nd

🗞️ Industry News in 1 Line

  1. ♥ 2.1k Gen-3 Alpha, Runway’s latest video model, is now available to everyone. It is one of the best performing text-to-video models out there and can produce videos which are up to 10 seconds long.

  2. ♥ 1k LMSYS Org, a research organization at UC Berkeley, has released RouteLLM, a routing framework that directs simple queries to a cheaper model to save nearly 40% of cost without degrading performance.

  3. ♥ 2.6k Anthropic has released a new update to its Team subscription: people using Claude can now collaborate with their team members by sharing existing chats, providing more context through documents uploaded to their Team workspace, and sharing custom instructions.

  4. ♥ 3.3k EvolutionaryScale launched ESM3, a model that can simulate 500 million years of evolution and generate new proteins. It is a transformer-based generative model that tackles increasingly challenging protein design tasks as it is scaled up.

Transform Your Storytelling with Katalist AI

Creating a storyboard with Katalist AI

With Katalist AI, you can bring your creative visions to life quickly. Our advanced storyboard tools help filmmakers, advertisers, and creators develop compelling visual stories. Whether you're a pro or just starting, Katalist AI offers intuitive features to enhance your projects.

Watch Katalist AI in Action

  • Upload Your Script: Effortlessly import your script.

  • Generate Storyboards: Instantly create detailed storyboards.

  • Edit Frames: Fine-tune each frame to match your vision.

  • Create Videos: Convert your storyboard into a video.

Join the community of creators who trust Katalist AI. Start your journey today and experience the future of storytelling.

One Thousand and One Pairs: A "novel" challenge for long-context language models

Karpinska et al. [UMass Amherst, Allen Institute for AI, Princeton University]

♥ 426   LLM Benchmark

Introduction to NOCHA: A Better Needle in a Haystack

Current benchmarks for evaluating long-context language models focus primarily on surface-level retrieval tasks, such as "needle-in-the-haystack" exercises. These tests fail to assess a model's ability to synthesize, reason over, and understand information across lengthy, complex narratives. As a result, it is hard to tell how well LLMs can truly comprehend and use their full context capacity, especially when dealing with book-length inputs that require global reasoning.

This paper introduces NOCHA (Novel Challenge), a new benchmark dataset and methodology designed to address this problem. This approach provides a more rigorous and nuanced evaluation of long-context LLMs' true comprehension abilities, going beyond simple information retrieval to assess complex reasoning and synthesis of book-length narratives.

How Does NOCHA Evaluate LLMs?

The researchers built the benchmark from recently published fiction books, which ensures that models can't rely on pre-existing knowledge and must reason over the provided context. Human readers who had recently read the novels wrote the claims in pairs: each pair consists of one true claim and one false claim about the same narrative element. The false claim differs from the true one only in that it introduces false information about the same event or entity, so telling the two apart requires a deep understanding of the narrative.

The evaluation process for models in the NOCHA benchmark is designed to test deep comprehension of long-form narratives. Each model is presented with the full text of a book as context, followed by a single claim to verify. The model's task is to determine whether the claim is true or false based solely on the information provided in the book.

To ensure a thorough assessment, models are evaluated on their pairwise accuracy – this means that to receive credit, a model must correctly label both the true and false claims in a pair; no partial credit is given for getting only one claim right. The following image shows the pairwise accuracy of different models. 
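
As a concrete illustration of this scoring rule, here is a minimal sketch of pairwise accuracy in Python; the data layout and names are illustrative, not taken from the paper's code.

```python
# Minimal sketch of NOCHA-style pairwise accuracy (illustrative, not the official code).
# Each pair stores the model's verdict on the true claim and on the false claim;
# credit is given only when both verdicts are correct.

def pairwise_accuracy(pairs):
    """pairs: list of (verdict_on_true_claim, verdict_on_false_claim) booleans."""
    correct = sum(
        1 for verdict_true, verdict_false in pairs
        if verdict_true and not verdict_false
    )
    return correct / len(pairs)

# Example: the first pair is fully correct, the second misses the false claim.
print(pairwise_accuracy([(True, False), (True, True)]))  # 0.5
```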

The paper uses a specific prompt template (shown below). This template not only asks for a true/false determination but also requires models to explain their reasoning before providing a final answer. This approach allows for a more nuanced understanding of the model's decision-making process and its grasp of the narrative context.

Evaluation template for the NOCHA benchmark.
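
For intuition, here is a rough paraphrase of such a claim-verification prompt; the exact wording below is an assumption for illustration, not the paper's template.

```python
# Illustrative claim-verification prompt in the spirit of the NOCHA setup
# (a paraphrase for illustration, not the paper's exact template).
PROMPT = """Below is the full text of a book, followed by a claim about it.

<book>
{book_text}
</book>

Claim: {claim}

First explain your reasoning using only the information in the book,
then answer with a single word: True or False."""

def build_prompt(book_text: str, claim: str) -> str:
    return PROMPT.format(book_text=book_text, claim=claim)
```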

Results and Real-World Implications of NOCHA

The NOCHA benchmark shows that reasoning over long contexts remains a significant challenge for long-context language models. Performance varies across fiction types: models struggle most with speculative fiction (38.8% accuracy) compared to historical fiction (56.4%), which suggests limitations in reasoning over complex, unfamiliar scenarios. Models also have difficulty processing large amounts of information and generally perform worse as the context grows.

Heatmap showing the accuracy of different LLMs on the NOCHA benchmark.

The researchers have decided not to release the full NOCHA dataset publicly in order to prevent data contamination, but a small sample from public-domain books is available in the project's GitHub repository for researchers to review.

Probing the Decision Boundaries of In-context Learning in Large Language Models

Zhao et al. [University of California Los Angeles]

♥ 1.2k   LLM

Graphical representation of decision boundaries in LLMs.

Introduction to Decision Boundaries in LLMs 

LLMs have impressive in-context learning capabilities, but the underlying mechanisms of in-context learning remain only partially understood: we don't know exactly how it works or how it can be improved. This paper introduces a method for analyzing decision boundaries in binary classification tasks as a tool to probe and understand in-context learning in LLMs. This approach gives researchers:

  1. Visualization of decision boundaries in both linear and non-linear contexts.

  2. Insights into the inductive biases and generalization capabilities of LLMs.

  3. Assessment of the robustness of in-context learning performance.

How Are Decision Boundaries Probed in LLMs?

This paper uses various LLMs, both open-source and closed-source, to classify data points generated by scikit-learn into two categories across linear, circular, and moon-shaped decision boundaries. To visualize decision boundaries, they create a 50x50 grid (2500 points) and query the LLM for each point's classification based on provided in-context examples. The results are plotted, with different colors representing each class, effectively showing the decision boundary.

Evaluating decision boundaries in LLMs.
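
Here is a minimal sketch of that setup, assuming a hypothetical query_llm(prompt) helper that returns a class label; the prompt wording and helper are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.datasets import make_moons

# Moon-shaped binary classification task used for the in-context examples.
X, y = make_moons(n_samples=32, noise=0.1, random_state=0)

def format_prompt(examples, labels, query_point):
    # Show each (x1, x2) -> label pair as text, then ask for the query point's label.
    lines = [f"Input: {x1:.2f}, {x2:.2f} Label: {label}"
             for (x1, x2), label in zip(examples, labels)]
    lines.append(f"Input: {query_point[0]:.2f}, {query_point[1]:.2f} Label:")
    return "\n".join(lines)

# Build a 50x50 grid (2500 points) covering the data range.
xs = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 50)
ys = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 50)
grid = np.array([(gx, gy) for gx in xs for gy in ys])

# query_llm is a placeholder for the API call that returns "0" or "1" for each point.
# predictions = [query_llm(format_prompt(X, y, point)) for point in grid]
# Plotting the predictions over the grid, one color per class, traces the decision boundary.
```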

The paper also explores methods to improve boundary smoothness, such as fine-tuning LLMs on classification tasks and using uncertainty-aware active learning to select informative in-context examples. The following image shows that even when the models are fine-tuned on a specific task, the decision boundaries are not smooth. 

Evaluating the impact of supervised fine-tuning on decision boundaries in LLMs.
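
A rough sketch of the uncertainty-aware selection idea, assuming class probabilities are available for each candidate point (for example, from the LLM's label-token probabilities); the function and variable names are assumptions, not the paper's implementation.

```python
import numpy as np

def select_most_uncertain(candidate_points, class_probs, k=8):
    """Pick the k candidates whose predicted class distribution has the highest entropy.

    candidate_points: array of shape (n, 2)
    class_probs: array of shape (n, 2), rows summing to 1
    """
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(class_probs * np.log(class_probs + eps), axis=1)
    most_uncertain = np.argsort(entropy)[-k:]
    return candidate_points[most_uncertain]

# The selected points are labeled and added to the in-context examples,
# with the aim of smoothing the resulting decision boundary.
```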

Results and Key Findings

  1. LLMs often produce non-smooth decision boundaries, even for simple linear classifications.

  2. Larger models or more in-context examples don't necessarily lead to smoother boundaries.

  3. Factors like quantization, prompt format, and example order significantly affect the boundaries.

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Bachmann et al. [Apple, Swiss Federal Institute of Technology Lausanne]

♥ 1.5k   Vision Model

Overview of 4M-21 Model

Existing multimodal and multitask foundation models, such as 4M and UnifiedIO, have shown promising results in handling diverse inputs and tasks. However, their capabilities are limited by the relatively small number of modalities and tasks they are trained on. This limitation restricts their out-of-the-box ability to accept varied inputs and perform a wide range of tasks. Additionally, training a single network on tasks and modalities that vary greatly in terms of dimensionality, data type, and value ranges presents significant challenges, often leading to negative transfer and reduced performance compared to single-task models.

To address these limitations, this paper trains a single any-to-any model on tens of highly diverse modalities. It uses discrete tokenization to handle modalities ranging from image-like data, neural-network feature maps, and vectors to structured data such as instance segmentation or human poses, as well as data that can be represented as text. This approach enables one model to solve at least three times more tasks/modalities than existing models without a loss in performance.

Inner-Workings of 4M-21 Model Pipeline

The 4M-21 model (4M stands for Massively Multimodal Masked Modeling; 21 is the number of modalities it supports) is a large-scale multimodal AI system designed to handle and generate various types of data, including images, text, and other modalities. Here's a simple explanation of how it works:

Multimodal Input

The model can take input from 21 different modalities, including:

  1. RGB images

  2. Geometric data (depth, surface normals, 3D human poses)

  3. Semantic data (segmentation maps, bounding boxes)

  4. Edges (Canny edges, SAM edges)

  5. Feature maps (CLIP, DINOv2, ImageBind embeddings)

  6. Metadata (various image and scene characteristics)

  7. Text (captions, web text)

Tokenization

All inputs are converted into sequences of discrete tokens:

  1. Images and feature maps use spatial VQ-VAEs

  2. Non-spatial data (like poses) use MLP-based tokenizers

  3. Text and other sequence data use WordPiece tokenization

Tokenization pipeline of 4M-21 Any-to-Any Model

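Conceptually, this step is a per-modality dispatch to the right tokenizer. The sketch below uses placeholder tokenizer classes to show the idea; the names and interfaces are illustrative and not the 4M-21 codebase.

```python
# Conceptual per-modality tokenization dispatch (placeholder classes, not the 4M-21 API).

class VQVAETokenizer:       # spatial tokenizer for image-like data and feature maps
    def encode(self, data):
        raise NotImplementedError  # placeholder

class MLPTokenizer:         # tokenizer for non-spatial data such as 3D human poses
    def encode(self, data):
        raise NotImplementedError  # placeholder

class WordPieceTokenizer:   # tokenizer for text and other sequence data
    def encode(self, data):
        raise NotImplementedError  # placeholder

TOKENIZERS = {
    "rgb": VQVAETokenizer(),
    "depth": VQVAETokenizer(),
    "clip_features": VQVAETokenizer(),
    "human_pose": MLPTokenizer(),
    "caption": WordPieceTokenizer(),
    "metadata": WordPieceTokenizer(),
}

def tokenize_sample(sample: dict) -> dict:
    """Map every modality present in `sample` to a sequence of discrete token ids."""
    return {name: TOKENIZERS[name].encode(data) for name, data in sample.items()}
```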

Unified Representation

The 4M-21 model achieves a unified representation space by converting all inputs, regardless of their original modality (e.g., images, text, or metadata), into sequences of discrete tokens using modality-specific tokenizers. This tokenization process allows the model to treat all types of data uniformly within its architecture, enabling seamless integration and interaction between different modalities during both training and inference.

Data representation in 4M-21 Any-to-Any Model

Architecture of 4M-21 Any-to-Any Model

The 4M-21 model is based on a transformer encoder-decoder architecture, which allows it to process and generate sequences of tokens across multiple modalities. It incorporates modality embeddings to differentiate between input types, and the encoder can accept both tokenized RGB data and raw RGB pixels (through a learnable patch-wise projection), which lets it serve as a Vision Transformer (ViT) backbone for transfer learning tasks. Check out the animated infographic on the 4M-21 paper's website to see the model architecture in action.

Model architecture of 4M-21 Any-to-Any Model
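
A highly simplified PyTorch-style sketch of that input path is shown below: discrete tokens are embedded and tagged with a learned modality embedding, while raw RGB patches can bypass tokenization through a learnable patch-wise projection. All dimensions and names are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

class MultimodalEncoderInput(nn.Module):
    """Simplified sketch of a 4M-style encoder input path (illustrative dimensions)."""

    def __init__(self, vocab_size=64_000, dim=768, num_modalities=21, patch_dim=3 * 16 * 16):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)         # embeds discrete modality tokens
        self.modality_embed = nn.Embedding(num_modalities, dim)  # marks which modality a sequence came from
        self.rgb_patch_proj = nn.Linear(patch_dim, dim)          # learnable patch-wise projection for raw pixels

    def forward(self, modality_id, token_ids=None, rgb_patches=None):
        if rgb_patches is not None:          # raw RGB patches, ViT-style
            x = self.rgb_patch_proj(rgb_patches)
        else:                                # any tokenized modality
            x = self.token_embed(token_ids)
        return x + self.modality_embed(torch.tensor(modality_id))

# The embedded sequences from all input modalities are concatenated and passed to a
# transformer encoder; a decoder then predicts the token sequences of the target modalities.
```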

Training 4M-21 Any-to-Any Model

The model is trained with a masked objective: random subsets of tokens from all modalities are masked, and the model learns to predict these masked tokens, effectively learning to understand and generate across all modalities simultaneously. It is trained on massive datasets (COYO700M for images, CC12M for captions, and C4 for web text) with pseudo-labeled data for multiple modalities, allowing it to learn rich, cross-modal representations and capabilities from a diverse range of data sources.
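
In spirit, the objective is masked modeling applied jointly across modality token sequences. The following sketch is a conceptual illustration of that input/target split, not the 4M-21 training code.

```python
import random

def split_input_target(token_sequences, input_budget=128, target_budget=128):
    """Conceptual 4M-style masking: sample one random subset of all modality tokens as
    the visible input and a disjoint subset as the prediction targets.

    token_sequences: dict mapping modality name -> list of discrete token ids
    """
    all_tokens = [(modality, position, token)
                  for modality, tokens in token_sequences.items()
                  for position, token in enumerate(tokens)]
    random.shuffle(all_tokens)
    visible = all_tokens[:input_budget]
    targets = all_tokens[input_budget:input_budget + target_budget]
    return visible, targets

# The encoder only sees `visible`; the decoder is trained with a cross-entropy loss to
# predict the token ids in `targets`, across all modalities simultaneously.
```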

Evaluation of 4M-21 

The 4M-21 model shows strong out-of-the-box performance across a range of vision tasks, often matching or outperforming specialized models and pseudo-labelers while being a single model for all tasks. In transfer learning experiments, 4M-21 maintains performance on tasks similar to its pre-training modalities while showing improved results on novel tasks, with performance scaling positively with model size.

The model also excels in multimodal transfer scenarios, effectively utilizing additional input modalities like depth information to significantly improve performance, suggesting that training on a broader range of modalities enhances the model's ability to leverage diverse types of input data. These results indicate that the 4M-21 approach of training on a wide range of modalities leads to a versatile and powerful model that can generalize well to various tasks and input types. 

In the following image, we can see that when an RGB image is passed as input, the model is able to retrieve other images that resemble it.

Fetching similar images using the 4M-21 model

But since the model is multimodal, we can also ask it to provide the semantic segmentation of an image, and it will do that as well.

Example of Semantic segmentation using 4M-21 model
