CoPE, Abacus Embeddings, and Cartoon Interpolation

The AI Timeline #8

In this issue: industry news x4, AI research papers x3
(May 27th ~ June 2nd)

🗞️ Industry News in 1 Line

  1. ♥ 1.6k We compare AI models by running benchmarks, but some models are contaminated with benchmark data, which gives them an unfair advantage. Scale AI has launched SEAL Leaderboards, a private benchmark that tests models on unseen data across different domains.

  2. ♥ 1k Have you heard of the Reverse Turing Test? Watch a guy try to impersonate an LLM to blend in with a group of LLM agents without getting caught.

  3. ♥ 881 Showrunner just announced the alpha release of its AI-powered simulation and animation generator. It can generate new shows and episodes on demand based on your text prompt, which basically means you will never run out of stuff to watch.

  4. ♥ 756 Mistral released their new model, Codestral 22B, which has a 32k context window (CW) and beats Llama3-70B (8k CW) and DeepSeek Coder 33B (16k CW) on coding tasks. It is now available under their new non-production license, which basically means anyone earning money with their models needs to pay them a cut going forward.

1. Contextual Position Encoding: Learning to Count What's Important

Facebook AI Research at Meta

♥ 5.1k   LLM

Since CoPE is contextualized, it can attend to paragraphs and sections by their position. On the left, the segments are separated by newline tokens (indicated by black plus signs), while on the right they are separated by section titles like “= = Description = =” (similarly marked).

Introduction to Contextual Position Encoding (CoPE)

Large Language Models (LLMs) rely on the attention mechanism to process sequences of data. However, standard attention mechanisms do not provide ordering information. Traditional positional encoding (PE) methods, such as absolute and relative position encodings, use token counts to derive position, which can limit their ability to handle higher levels of abstraction, like attending to specific words or sentences within a sequence.

This paper introduces Contextual Position Encoding (CoPE), a novel PE method that incorporates context into position calculation, enabling models to attend to more abstract units like particular words, nouns, or sentences. CoPE can identify the i-th sentence or word by conditioning position increments on context, allowing positions to be determined by the significance of tokens within their context. This contextual approach enables more flexible and precise position addressing.

How Does Contextual Position Encoding Work?

  1. Context-Based Gate Values: CoPE determines which tokens to use for estimating position based on their context vectors. For each token, a gate value is computed using query and key vectors. This value decides whether a token should be included in the position count, allowing the model to consider the token's context when determining its position.

  2. Relative Position Calculation: The gate values are aggregated to compute the relative position of each token within the sequence. This approach enables the model to measure positions contextually rather than relying on fixed positions, which is crucial for tasks requiring precise positional information.

  3. Fractional Positions and Interpolation: Unlike traditional methods that use discrete token positions, CoPE allows positions to take fractional values. The embedding for a fractional position is interpolated between the embeddings of the two nearest integer positions, giving smooth and precise positional information (see the minimal sketch after this list).
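To make the mechanism concrete, here is a minimal PyTorch sketch of the gating-plus-interpolation idea. The function name, the scaling factor, and the clamping are my own simplifications for illustration, not the paper's implementation; it returns only the positional contribution to the attention logits.

```python
import torch

def cope_position_logits(q, k, pos_emb):
    """q, k: (seq, dim) query/key vectors; pos_emb: (max_pos, dim) learned embeddings."""
    seq, dim = q.shape
    # 1. Context-based gates: a sigmoid of query-key similarity decides whether
    #    token j is "counted" when measuring position from token i.
    gates = torch.sigmoid(q @ k.T / dim ** 0.5)                  # (seq, seq)
    gates = gates * torch.tril(torch.ones(seq, seq))             # causal mask
    # 2. Relative position of j w.r.t. i = sum of gates over tokens j..i
    #    (a reversed cumulative sum along the key axis).
    pos = torch.flip(torch.cumsum(torch.flip(gates, [1]), 1), [1])
    pos = pos.clamp(max=pos_emb.shape[0] - 1)
    # 3. Fractional positions: interpolate between the two nearest integer
    #    position embeddings.
    lo, hi = pos.floor().long(), pos.ceil().long()
    w = (pos - lo.float()).unsqueeze(-1)                         # (seq, seq, 1)
    e = (1 - w) * pos_emb[lo] + w * pos_emb[hi]                  # (seq, seq, dim)
    # Positional term for the attention logits: q_i . e_ij
    return torch.einsum('id,ijd->ij', q, e)

q, k = torch.randn(8, 32), torch.randn(8, 32)
pos_emb = torch.randn(16, 32)
print(cope_position_logits(q, k, pos_emb).shape)  # torch.Size([8, 8])
```

In a full attention layer, this positional term would be added to the standard content-based logits before the softmax.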

In this example, attending to the last sentence using Relative PE is challenging, and the best it can do is a decaying attention (“recency bias”). CoPE can count the sentence endings and simply attend to position “0”.

Results and Real-World Implications of CoPE

CoPE improved model performance on language modeling and coding tasks, and it also increases the model’s ability to generalize to larger and more complex problems. For example, in arithmetic tasks, models with CoPE could handle problems with significantly more digits than those seen during training.

CoPE is also consistent across domains, so it transfers well between textual and numerical data. In the given table, we can see that the CoPE method outperforms (or performs on par with) traditional PE methods on both natural language and coding problems.

2. Transformers Can Do Arithmetic with the Right Embeddings

McLeish et al. [University of Maryland, Lawrence Livermore National Laboratory, ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center, Carnegie Mellon University]

♥ 987   LLM

Zero-shot exact match accuracy on addition using depth-sixteen, decoder-only transformer models trained on operands of up to 20 digits. Compared to state-of-the-art embeddings (left), the new Abacus Embeddings (right) dramatically improve generalization to unseen digit lengths. The interior of the red square denotes the training distribution. Accuracies are averaged over three trials.

Introduction to Arithmetic with Transformers

Existing transformers work well on natural language tasks, but they often struggle with arithmetic tasks, which require precise positional awareness of digits. This limitation is largely due to the models' inability to track the exact position of each digit in long sequences. This paper proposes a solution by introducing a novel positional embedding technique called Abacus Embeddings, which enhances transformers' arithmetic capabilities by accurately encoding the position of each digit within a sequence.

Visualization of data formats and positional embeddings. Abacus Embeddings give the same positional embeddings to all digits of the same significance.

How do Abacus Embeddings Work?

  1. Digit-Relative Positions: Abacus Embeddings encode the position of each digit relative to the start of its number, so every digit of the same significance receives the same positional embedding, which is exactly the alignment the model needs to add numbers column by column.

  2. Randomized Offsets: During training, the starting position index is offset randomly, so position embeddings for long numbers still get trained and the model can generalize to operands far longer than those in the training set.

  3. Compatibility with Other Embeddings and Architectures: Abacus Embeddings are combined with relative embeddings such as FIRE or RoPE and with architectural variants like input injection and looped transformers, which further improves out-of-distribution accuracy.

Transformers trained with Abacus Embeddings receive positional information tied to each digit's place within its number rather than to its absolute position in the sequence. The technique integrates seamlessly into existing transformer frameworks, improving numerical handling without significant architectural changes; a small sketch of the position assignment follows.
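As a rough sketch of how such digit-relative positions could be computed, the toy function below assigns each digit token a position counted from the start of its number. The token ids and the `digit_ids` / `max_offset` arguments are hypothetical, and the paper additionally formats numbers least-significant-digit first so that equal positions line up with equal significance.

```python
import torch

def abacus_positions(tokens, digit_ids, max_offset=0):
    """tokens: (seq,) int tensor. Digits get positions 1, 2, 3, ... counted from
    the start of their number; non-digit tokens get position 0."""
    is_digit = torch.tensor([int(t.item() in digit_ids) for t in tokens])
    pos = torch.zeros_like(tokens)
    count = 0
    for i, d in enumerate(is_digit):
        count = count + 1 if d else 0          # reset the counter at every non-digit
        pos[i] = count
    if max_offset > 0:                         # random offset used during training
        pos = torch.where(is_digit.bool(), pos + torch.randint(0, max_offset, (1,)), pos)
    return pos

# Example: "123+456=" with made-up token ids (digits 1-9, '+' = 10, '=' = 11)
tokens = torch.tensor([1, 2, 3, 10, 4, 5, 6, 11])
print(abacus_positions(tokens, digit_ids={1, 2, 3, 4, 5, 6, 7, 8, 9}))
# tensor([1, 2, 3, 0, 1, 2, 3, 0]) -> aligned digits share a position id
```

The resulting position ids index a learned embedding table, and training with randomized starting offsets (the `max_offset` path above) means that positions corresponding to longer, unseen numbers still have trained embeddings.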

Left: Mean exact match accuracy of three models of depth sixteen on size 20 data, varying the architecture and embeddings. Abacus Embeddings improve accuracy for addition over FIRE and NoPE Embeddings. Right: Mean exact match accuracy of three models of effective depth sixteen on size 40 data, varying over NoPE or FIRE embeddings and architectures. Recurrent looped transformer models improve accuracy for addition for both the FIRE and NoPE embeddings.
Looped transformer (LT): Weight tied decoder layers, with input injection and progressive loss. Standard Transformer (ST): Stacked decoder only layers. Standard Transformer with Input Injection (ST w/ II): Standard Transformer with input features added to the hidden representation between each decoder layer.

Results and Evaluation

  1. State-of-the-Art Performance: Transformers trained with Abacus Embeddings achieved up to 99% accuracy on 100-digit addition problems, far surpassing previous models that struggled beyond 40-digit problems.

  2. Improved Generalization: The models could handle problems with six times as many digits as those in the training set, a significant leap from the previous state-of-the-art generalization factor of 2.5.

  3. Extended Capabilities: The improved positional awareness also enhanced the transformers' performance on other algorithmic tasks like sorting and multiplication, showing that the benefits of Abacus Embeddings extend beyond simple addition.

Exact match accuracy of a standard transformer of depth 16 with input injection, trained on up to size-20 data. The red square denotes in-distribution testing. Combining Abacus Embeddings with FIRE or RoPE embeddings improves out-of-distribution accuracy for addition over the baseline models without Abacus Embeddings.

3. ToonCrafter: Generative Cartoon Interpolation

Xing et al. [CUHK, CityU, Tencent AI Lab]

♥ 511   Video Interpolation

ToonCrafter compared to previous techniques (more demos here)

Introduction to ToonCrafter

While video frame interpolation methods are common, they often fail badly when used on animation. Furthermore, there have been barely any frame interpolation methods dedicated to animation, due to the lack of data and the unique challenges animation poses.

ToonCrafter introduces a generative approach to animation interpolation by using live-action video priors to overcome these challenges.

Traditional methods struggle with the unique properties of animations, such as sparse frames and textureless regions. ToonCrafter addresses these issues by implementing a toon rectification learning strategy and a dual-reference-based 3D decoder, ensuring the preservation of fine details in the interpolation results.

Start and end frames, and the interpolated result.

Inner-Workings of ToonCrafter

ToonCrafter operates through three main components:

  1. Toon Rectification Learning: This strategy adapts live-action video priors to the cartoon domain (a minimal parameter-freezing sketch follows this list). The researchers collected a large dataset of high-quality cartoon videos and fine-tuned the model to close the domain gap: the temporal layers are frozen to preserve real-world motion priors, while the spatial layers and the image-context projector are fine-tuned to better understand cartoon scenes. This approach ensures accurate context comprehension and content generation without inadvertently producing non-cartoon elements.

  2. Dual-Reference-Based 3D Decoder: To counteract the loss of detail from compressed latent spaces, the dual-reference decoder injects details from input frames into the generated frame latents using a hybrid-attention-residual learning mechanism. This process occurs through cross-attention in shallow decoding layers and residual learning in deeper layers, aided by pseudo-3D convolutions to maintain temporal coherence. This setup significantly improves the quality and consistency of the interpolated frames.

  3. Sketch-Based Controllable Generation: ToonCrafter includes a sketch encoder that allows users to interactively guide the interpolation process. This encoder supports sparse sketch inputs, enabling users to provide minimal but effective guidance. The encoder processes these inputs, ensuring that the generated frames adhere to user-defined motion structures while maintaining temporal coherence across the sequence.
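As a very small illustration of the parameter-freezing split described in the first point above, the helper below freezes temporal layers and unfreezes spatial layers and the image-context projector. The substring matches ('temporal', 'spatial', 'image_context_proj') are placeholders, not ToonCrafter's actual module names.

```python
import torch.nn as nn

def apply_toon_rectification_split(model: nn.Module) -> None:
    """Freeze temporal layers (keep real-world motion priors); fine-tune the
    spatial layers and the image-context projector (adapt to cartoon scenes)."""
    for name, param in model.named_parameters():
        if 'temporal' in name:
            param.requires_grad = False
        elif 'spatial' in name or 'image_context_proj' in name:
            param.requires_grad = True
```

Parameters that match neither pattern are left untouched here, since the paper's exact treatment of the remaining modules is not covered in this summary.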

The above illustration shows the 3D decoder mechanism. The decoder enhances the output by injecting intermediate features from the input images derived from the encoder. This is achieved through a cross-attention mechanism in the shallow layers, where details from the input frames are seamlessly integrated into the intermediate features.

In the deeper layers, residual learning is employed by adding the features of the 1st and L-th frames, ensuring that fine details are preserved and accurately represented in the final output. This dual approach significantly improves the quality and coherence of the generated frames.
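To make the hybrid-attention-residual idea concrete, here is a conceptual PyTorch sketch; the module, its dimensions, and the mean-pooled residual are illustrative placeholders rather than ToonCrafter's actual decoder.

```python
import torch
import torch.nn as nn

class DetailInjection(nn.Module):
    """Injects reference-frame detail into decoder features: cross-attention in
    shallow layers, a simple residual of the first/last frame features in deep layers."""
    def __init__(self, dim, shallow=True):
        super().__init__()
        self.shallow = shallow
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, dec_feats, ref_first, ref_last):
        # dec_feats: (B, T, dim) features of the frames being decoded
        # ref_first, ref_last: (B, N, dim) encoder features of the two input frames
        if self.shallow:
            refs = torch.cat([ref_first, ref_last], dim=1)
            out, _ = self.attn(dec_feats, refs, refs)   # pull details from the references
            return dec_feats + out
        # Deeper layers: residual learning with the (pooled) first/last frame features
        res = 0.5 * (ref_first.mean(1, keepdim=True) + ref_last.mean(1, keepdim=True))
        return dec_feats + res

inj = DetailInjection(dim=64, shallow=True)
x = torch.randn(2, 16, 64)                      # 16 intermediate-frame tokens
r0, r1 = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
print(inj(x, r0, r1).shape)                     # torch.Size([2, 16, 64])
```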

Real-World Implications of ToonCrafter

ToonCrafter can generate high-quality intermediate frames for cartoons, effectively handling non-linear motions while avoiding unrealistic outputs. The method offers significant improvements over existing approaches, making it a valuable tool for animation production and other applications requiring precise motion interpolation. In the following screenshot, we can see the frames generated by this model and by the other techniques being compared.

The results from this model were compared against other state-of-the-art models and techniques for Motion Quality (M.Q.), Temporal Coherence (T.C.), and Frame Fidelity (F.F.). The following table shows that a majority of users prefer the output generated by ToonCrafter over other models.
