🚨This week’s top AI/ML research papers - Oct 5th

(Sep 29 ~ Oct 5, 2024)

  • MovieGen

  • Were RNNs All We Needed?

  • Contextual Document Embeddings

  • RLEF

  • ENTP

  • VinePPO

  • When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

  • LLMs Know More Than They Show

  • Video Instruction Tuning With Synthetic Data

  • PHI-S

  • Thermodynamic Bayesian Inference

  • Emu3: Next-Token Prediction is All You Need

  • Lattice-Valued Bottleneck Duality

  • Loong

  • Archon

  • Direct Judgement Preference Optimization

  • Depth Pro

  • MIO: A Foundation Model on Multimodal Tokens

  • MM1.5

  • PhysGen

  • Cottention

  • UniAff

  • Hyper-Connections

  • Image Copy Detection for Diffusion Models

  • RATIONALYST

  • From Code to Correctness

  • Not All LLM Reasoners Are Created Equal

  • VPTQ: Extreme Low-bit Vector Post-Training Quantization for LLMs

  • Leopard: A VLM For Text-Rich Multi-Image Tasks

  • Selective Aggregation for LoRA in Federated Learning

  • Quantifying Generalization Complexity for Large Language Models

  • FactAlign

  • Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation?

  • Law of the Weakest Link: Cross Capabilities of Large Language Models

  • TPI-LLM

  • One Token to Seg Them All

  • Looped Transformers for Length Generalization

  • Illustrious

  • LLaVA-Critic

  • Contrastive Localized Language-Image Pre-Training

  • Large Language Models as Markov Chains

  • CLIP-MoE

  • SageAttention

  • Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

  • Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

  • EVER

  • The bunkbed conjecture is false

overview for each + authors' explanations ⬇️ 

Movie Gen: A Cast of Media Foundation Models 

Overview:

Movie Gen introduces foundation models capable of generating 1080p HD videos with different aspect ratios and synchronized audio. Additional features include precise video editing and personalized video creation from user images. 

The models achieve state-of-the-art performance across tasks such as text-to-video synthesis, video editing, and video-to-audio generation. The largest model uses 30B parameters to generate 16-second videos at 16 frames-per-second. 

The 92-page paper thoroughly documents the key innovations in model architecture, training strategies, and data handling that enhance scalability and efficiency in media generation, making it the most detailed state-of-the-art video generation breakdown to date.

Paper:

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Overview:

RLEF introduces an end-to-end reinforcement learning method to enhance LLMs' ability to effectively use execution feedback in code synthesis.

The method significantly improves iterative code refinement, achieving state-of-the-art results in competitive programming tasks while drastically reducing the sample requirements.

The approach demonstrates effective leveraging of automatic feedback to enhance task success over multiple steps for both small and large models.
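
To make the setup concrete, here is a minimal sketch of a multi-turn rollout in this style, where the reward is the pass rate on public tests and execution errors are appended back into the prompt. The `model.generate` call and the reward shaping are illustrative stand-ins, not the paper's exact implementation:

```python
import subprocess
import sys
import tempfile

def run_tests(code: str, tests: list[tuple[str, str]]) -> tuple[int, str]:
    """Execute candidate code against public tests; return (pass count, feedback)."""
    passed, feedback = 0, ""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    for stdin, expected in tests:
        result = subprocess.run(
            [sys.executable, f.name],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() == expected.strip():
            passed += 1
        else:
            feedback += f"input {stdin!r}: got {result.stdout.strip()!r}, expected {expected!r}\n"
    return passed, feedback

def rollout(model, task: str, tests, max_turns: int = 3):
    """Multi-turn episode: generate code, execute it, feed errors back, retry.

    The final pass rate is the scalar reward for the policy update."""
    transcript = task
    for _ in range(max_turns):
        code = model.generate(transcript)       # hypothetical LLM call
        passed, feedback = run_tests(code, tests)
        if passed == len(tests):                # all public tests pass: stop early
            break
        transcript += f"\nExecution feedback:\n{feedback}"
    return transcript, passed / len(tests)      # transcript + reward for the RL step
```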

Paper:

Author's Explanation:

Contextual Document Embeddings

Overview:

The paper argues that dense document embeddings derived from each document in isolation are out of context for specific retrieval tasks.

This paper introduces two methods for creating contextual document embeddings by integrating neighboring document information: a contrastive learning objective incorporating document neighbors into the contextual loss and a novel architecture for encoding neighbor information.

Compared to biencoders, these methods show superior performance, especially out-of-domain, and achieve state-of-the-art results on the MTEB benchmark without requiring complex training strategies.

The approach is generally applicable to any contrastive learning dataset and biencoder.
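
As a rough illustration of the idea, here is a sketch of a contrastive step where the document embedding is conditioned on pooled embeddings of its corpus neighbors. The `encode` callable and the mean-pooled context are simplifying assumptions; the paper's neighbor-encoding architecture is more involved:

```python
import torch
import torch.nn.functional as F

def contextual_embed(encode, doc_tokens, neighbor_tokens_list):
    """Embed a document conditioned on pooled embeddings of its corpus neighbors."""
    neighbors = torch.stack([encode(n) for n in neighbor_tokens_list])  # (k, d)
    context = neighbors.mean(dim=0)          # pooled neighbor context (illustrative)
    return F.normalize(encode(doc_tokens) + context, dim=-1)

def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Standard InfoNCE over in-batch negatives; positives sit on the diagonal."""
    logits = (query_vecs @ doc_vecs.T) / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))             # i-th query matches i-th doc
    return F.cross_entropy(logits, labels)
```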

Paper:

Author's Explanation:

Were RNNs All We Needed?

Overview:

The paper revisits traditional RNN architectures, specifically LSTMs and GRUs, and demonstrates how removing hidden state dependencies from certain components allows these models to be trained efficiently in parallel.

This modification eliminates the need for backpropagation through time and results in minimal versions (minLSTMs and minGRUs) that use significantly fewer parameters.

These modified models demonstrate performance on par with recent recurrent architectures like S4 and show substantially faster training capabilities for long sequences.

P.S. The paper is co-authored by Yoshua Bengio, co-author of the Deep Learning textbook and the original GAN paper.
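
A minimal sketch of the minGRU recurrence makes the key point visible: the gate z_t and candidate h̃_t depend only on the current input, never on h_{t-1}, which is what allows training with a parallel scan (the loop below is the sequential form, for clarity, and is not the paper's official code):

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, dim_in: int, dim_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)   # gate: input only, no h_{t-1}
        self.to_h = nn.Linear(dim_in, dim_hidden)   # candidate: input only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        z = torch.sigmoid(self.to_z(x))             # all gates computed at once
        h_tilde = self.to_h(x)                      # all candidates computed at once
        h = torch.zeros_like(h_tilde[:, 0])
        outs = []
        for t in range(x.size(1)):                  # h_t = (1 - z_t)*h_{t-1} + z_t*htilde_t
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)             # (batch, seq_len, dim_hidden)
```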

Paper:

ENTP: Encoder-only Next Token Prediction

Overview:

ENTP introduces an encoder-only approach to next-token prediction, challenging the necessity of causal attention in decoder-only Transformers.

The study highlights that while decoder-only models are efficient, they are not the only option.

The authors present theoretical and experimental evidence that ENTP can handle tasks like Triplet-Counting effectively, a feat that decoder-only models struggle with.

ENTP also shows superior performance across tasks like length generalization and in-context learning, demonstrating its expressive power and complexity benefits.
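
The decoding loop below sketches the core idea: the whole prefix is re-encoded with bidirectional attention at every step, so there is no causal mask and no KV cache to reuse. The `encoder` and `lm_head` callables are assumed stand-ins for any bidirectional Transformer, not the authors' implementation:

```python
import torch

@torch.no_grad()
def entp_generate(encoder, lm_head, prefix_ids: torch.Tensor, max_new: int):
    """Greedy decoding with a bidirectional encoder: no causal mask, no KV cache."""
    ids = prefix_ids                                   # (1, T) token ids
    for _ in range(max_new):
        hidden = encoder(ids)                          # re-encode the full prefix
        logits = lm_head(hidden[:, -1])                # predict from the last position
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy, for illustration
        ids = torch.cat([ids, next_id], dim=-1)        # each step costs a full pass
    return ids
```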

Paper:

Author's Explanation:

VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

Overview:

VinePPO introduces a method that improves credit assignment for LLMs in complex reasoning tasks by using unbiased Monte Carlo-based estimates instead of large value networks.

The authors demonstrate that current value networks often fail in these tasks, barely outperforming random baselines.

VinePPO consistently outperforms Proximal Policy Optimization and other RL-free baselines on the MATH and GSM8K datasets, achieving enhanced results with significantly fewer gradient updates and reduced wall-clock time.

This approach highlights the importance of accurate credit assignment for RL finetuning in LLMs.
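
Here is a minimal sketch of the Monte Carlo value estimate at the heart of the method, with `policy.sample` and `reward` as hypothetical stand-ins; the rollout counts and advantage computation are simplified for illustration:

```python
def mc_value(policy, state: str, reward, k: int = 9) -> float:
    """Unbiased V(s): average return of k fresh completions sampled from `state`.

    This replaces the learned value network used in standard PPO."""
    return sum(reward(state + policy.sample(state)) for _ in range(k)) / k

def step_advantages(policy, steps: list[str], reward, k: int = 9) -> list[float]:
    """Advantage of each reasoning step: A_t = V(s_{t+1}) - V(s_t) along one chain."""
    values = [mc_value(policy, s, reward, k) for s in steps]
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]
```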

Paper:

Author's Explanation:

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

Overview:

The paper investigates OpenAI's o1 system, which is optimized for reasoning compared to earlier LLMs.

o1 significantly surpasses previous models in various tasks, especially in unique challenges like forming acronyms from non-initial letters.

Nonetheless, it maintains similar qualitative trends seen in older models, showing sensitivity to the probability of examples and tasks.

While reasoning optimization enhances its performance, it does not entirely eliminate the characteristic probability sensitivity of language models.

Paper:

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Overview:

LLMs exhibit errors known as "hallucinations," but their internal states hold more truthfulness information than previously understood.

The study finds that this information is concentrated in specific tokens, improving error detection, though these detectors don't generalize well across datasets, indicating complexity in truthfulness encoding.

Additionally, internal representations can predict likely error types, aiding in tailored mitigation efforts.

Despite possibly encoding correct answers internally, LLMs might still produce incorrect outputs, highlighting a gap between internal encoding and performance.

These insights enhance understanding of LLM errors and guide future error analysis and mitigation strategies.
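
As a rough sketch of this kind of probing setup, one can train a small classifier on hidden states collected at the answer tokens, where the paper finds the truthfulness signal is concentrated. The linear probe and hidden size below are generic assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

hidden_dim = 4096                         # assumed model hidden size
probe = nn.Linear(hidden_dim, 2)          # {hallucinated, truthful} classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(answer_token_states: torch.Tensor, labels: torch.Tensor) -> float:
    """answer_token_states: (N, hidden_dim) activations taken at the exact-answer
    tokens of N generations; labels: (N,) correctness of each generation."""
    opt.zero_grad()
    loss = loss_fn(probe(answer_token_states), labels)
    loss.backward()
    opt.step()
    return loss.item()
```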

Paper:

Video Instruction Tuning With Synthetic Data

Overview:

The paper introduces a method for advancing video large multimodal models (LMMs) by creating a synthetic dataset named LLaVA-Video-178K for video instruction-following.

This dataset covers tasks such as detailed captioning, open-ended question-answering, and multiple-choice question-answering.

The resulting model, LLaVA-Video, trained on this dataset and existing visual instruction tuning data, performs strongly across various video benchmarks, underscoring the dataset's effectiveness.

Paper:

Author's Explanation:

PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation
