🚨This week’s top AI/ML research papers - Oct 5th
(Sep 29 ~ Oct 5, 2024)
🚨This week’s top AI/ML research papers:
MovieGen
Were RNNs All We Needed?
Contextual Document Embeddings
RLEF
ENTP
VinePPO
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
LLMs Know More Than They Show
Video Instruction Tuning With Synthetic Data
PHI-S
Thermodynamic Bayesian Inference
Emu3: Next-Token Prediction is All You Need
Lattice-Valued Bottleneck Duality
Loong
Archon
Direct Judgement Preference Optimization
Depth Pro
MIO: A Foundation Model on Multimodal Tokens
MM1.5
PhysGen
Cottention
UniAff
Hyper-Connections
Image Copy Detection for Diffusion Models
RATIONALYST
From Code to Correctness
Not All LLM Reasoners Are Created Equal
VPTQ: Extreme Low-bit Vector Post-Training Quantization for LLMs
Leopard: A VLM For Text-Rich Multi-Image Tasks
Selective Aggregation for LoRA in Federated Learning
Quantifying Generalization Complexity for Large Language Models
FactAlign
Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation?
Law of the Weakest Link: Cross Capabilities of Large Language Models
TPI-LLM
One Token to Seg Them All
Looped Transformers for Length Generalization
Illustrious
LLaVA-Critic
Contrastive Localized Language-Image Pre-Training
Large Language Models as Markov Chains
CLIP-MoE
SageAttention
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
EVER
The bunkbed conjecture is false
overview for each + authors' explanations ⬇️
Movie Gen: A Cast of Media Foundation Models
Overview:
Movie Gen introduces foundation models capable of generating 1080p HD videos with different aspect ratios and synchronized audio. Additional features include precise video editing and personalized video creation from user images.
The models achieve state-of-the-art performance across tasks such as text-to-video synthesis, video editing, and video-to-audio generation. The largest model uses 30B parameters to generate 16-second videos at 16 frames per second.
The paper thoroughly documents the key innovations in model architecture, training strategies, and data handling that enable scalable, efficient media generation. (At 92 pages, it is the most detailed state-of-the-art video generation breakdown to date!)
Paper:
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Overview:
RLEF introduces an end-to-end reinforcement learning method to enhance LLMs' ability to effectively use execution feedback in code synthesis.
The method significantly improves iterative code refinement, achieving state-of-the-art results in competitive programming tasks while drastically reducing the sample requirements.
The approach demonstrates effective leveraging of automatic feedback to enhance task success over multiple steps for both small and large models.
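To make the loop concrete, here is a minimal sketch of training-time execution feedback in the spirit of RLEF; `run_tests`, `policy.generate`, and the terminal reward are illustrative placeholders rather than the paper's implementation, which optimizes such multi-turn episodes with PPO.

```python
from typing import List, Tuple

# Hypothetical stand-in for a sandboxed test runner; a real setup would
# execute the candidate program against the problem's public tests.
def run_tests(code: str, tests: List[str]) -> Tuple[bool, str]:
    return False, "AssertionError on test 2"  # placeholder result

def rollout_episode(policy, problem: str, public_tests: List[str],
                    max_turns: int = 3) -> Tuple[List[str], float]:
    """Generate, execute, refine: execution feedback is appended to the context."""
    transcript, passed = [problem], False
    for _ in range(max_turns):
        code = policy.generate(transcript)           # propose a solution
        passed, log = run_tests(code, public_tests)  # execute against public tests
        transcript += [code, log]                    # feedback becomes new context
        if passed:
            break
    reward = 1.0 if passed else 0.0  # terminal reward: final program passes or not
    return transcript, reward
```

Episodes collected this way are then optimized with a policy-gradient objective, so the model is rewarded for actually exploiting the feedback rather than for one-shot generation.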
Paper:
Author's Explanation:
LLMs for code should do much better if they can iterate on tests -- but they don't. Our new work (RLEF) addresses this with execution feedback at RL *training time* to use execution feedback at *inference time*. arxiv.org/abs/2410.02089 is just out! 1/6
— Jonas Gehring (@jnsgehring)
12:41 AM • Oct 4, 2024
Contextual Document Embeddings
Overview:
The paper argues that dense document embeddings computed from each document in isolation are out of context for targeted retrieval tasks.
This paper introduces two methods for creating contextual document embeddings by integrating neighboring document information: a contrastive learning objective incorporating document neighbors into the contextual loss and a novel architecture for encoding neighbor information.
Compared to biencoders, these methods show superior performance, especially out-of-domain, and achieve state-of-the-art results on the MTEB benchmark without requiring complex training strategies.
The approach is generally applicable to any contrastive learning dataset and biencoder.
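As a rough illustration of the batching side of the contrastive objective, the sketch below groups neighboring documents into the same training batch so that in-batch negatives come from each document's own corpus context; the k-means clustering and function names are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(doc_embeddings: np.ndarray, batch_size: int):
    """Yield index batches whose members are neighboring documents, so the
    in-batch negatives for contrastive training are contextually hard."""
    n_clusters = max(1, len(doc_embeddings) // batch_size)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(doc_embeddings)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        for start in range(0, len(idx), batch_size):
            yield idx[start:start + batch_size]
```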
Paper:
Author's Explanation:
We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world.
today, we're releasing the model on HuggingFace, along with the paper on ArXiv.
I think our release marks a paradigm shift for text retrieval. let me tell you why👇 x.com/i/web/status/1…
— jack morris (@jxmnop)
4:11 PM • Oct 4, 2024
Were RNNs All We Needed?
Overview:
The paper revisits traditional RNN architectures, specifically LSTMs and GRUs, and demonstrates how removing hidden state dependencies from certain components allows these models to be trained efficiently in parallel.
This modification eliminates the need for backpropagation through time and results in minimal versions (minLSTMs and minGRUs) that use significantly fewer parameters.
These modified models demonstrate performance on par with recent recurrent architectures like S4 and show substantially faster training capabilities for long sequences.
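The modified recurrence is compact enough to write out. Below is a minimal minGRU sketch following the paper's equations; the sequential loop is written for readability, while the paper's point is that, because the gate z_t and candidate h̃_t no longer depend on h_{t-1}, training can use a parallel scan instead.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """minGRU: gate and candidate depend only on x_t, not on h_{t-1},
    so the recurrence is an elementwise blend that admits a parallel scan."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.linear_z = nn.Linear(input_size, hidden_size)  # update gate
        self.linear_h = nn.Linear(input_size, hidden_size)  # candidate state

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size)
        z = torch.sigmoid(self.linear_z(x))  # gates computed from inputs alone
        h_tilde = self.linear_h(x)           # candidates computed from inputs alone
        h = x.new_zeros(x.size(0), h_tilde.size(-1))
        outputs = []
        for t in range(x.size(1)):  # sequential form, shown for clarity only
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)
```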
P.S. this paper is co-authored by Yoshua Bengio, co-author of the Deep Learning textbook and of the original GAN paper.
Paper:
ENTP: Encoder-only Next Token Prediction
Overview:
ENTP introduces an encoder-only approach to next-token prediction, challenging the necessity of causal attention in decoder-only Transformers.
The study highlights that while decoder-only models are efficient, they are not the only option.
The authors present theoretical and experimental evidence that ENTP can handle tasks like Triplet-Counting effectively, a feat that decoder-only models struggle with.
ENTP also shows superior performance across tasks like length generalization and in-context learning, demonstrating its expressive power and complexity benefits.
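To make the contrast concrete, here is a minimal sketch of the decoding loop that encoder-only next-token prediction implies: every step re-encodes the full prefix with bidirectional attention and predicts from the last position. The `encoder` and `lm_head` modules are placeholders; the point is the absence of a causal mask and KV cache, which trades extra compute for expressivity.

```python
import torch

@torch.no_grad()
def entp_generate(encoder, lm_head, prefix_ids: torch.Tensor, steps: int) -> torch.Tensor:
    """Greedy encoder-only next-token prediction (placeholder modules)."""
    ids = prefix_ids  # shape (1, seq_len)
    for _ in range(steps):
        hidden = encoder(ids)            # full bidirectional attention over the prefix
        logits = lm_head(hidden[:, -1])  # predict only from the final position
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)  # no KV cache: attention is redone each step
    return ids
```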
Paper:
Author's Explanation:
🚀 Excited to share our work on Encoder-only Next Token Prediction (ENTP)!
While most successful LLMs are decoder-based, we asked: Can encoder-only TFs be used for next-token prediction?
Yes!
Moreover, ENTP might be better than decoder-only models!!! 😎
— Kangwook Lee (@Kangwook_Lee)
1:55 AM • Oct 4, 2024
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
Overview:
VinePPO introduces a method that improves credit assignment for LLMs in complex reasoning tasks by using unbiased Monte Carlo-based estimates instead of large value networks.
The authors demonstrate that current value networks often fail in these tasks, barely outperforming random baselines.
VinePPO consistently outperforms Proximal Policy Optimization and other RL-free baselines on the MATH and GSM8K datasets, achieving enhanced results with significantly fewer gradient updates and reduced wall-clock time.
This approach highlights the importance of accurate credit assignment for RL finetuning in LLMs.
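The core estimator is simple to sketch. Below, `policy.sample` and `reward_fn` are hypothetical placeholders; the idea is to value an intermediate reasoning state by averaging the rewards of a few completions sampled from it, instead of querying a learned value network.

```python
def mc_value_estimate(policy, prefix_tokens, reward_fn, num_rollouts: int = 4) -> float:
    """Unbiased Monte Carlo estimate of V(s_t): roll out a few completions
    from the intermediate state and average their final rewards."""
    returns = []
    for _ in range(num_rollouts):
        completion = policy.sample(prefix_tokens)              # roll out to the end
        returns.append(reward_fn(prefix_tokens + completion))  # e.g. 1.0 if answer correct
    return sum(returns) / len(returns)

# These estimates replace the learned critic when forming PPO advantages,
# e.g. A(s_t, a_t) ≈ V(s_{t+1}) - V(s_t) in a terminal-reward setup.
```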
Paper:
Author's Explanation:
VinePPO, a straightforward modification to PPO, unlocks RL’s true potential for LLM Reasoning.
It beats RL-free methods (DPO and RestEM) and PPO, surpassing it in fewer steps (up to 9x), less time (up to 3x), and less KL with half the memory.
Time to rethink RL post-training🧵: [1/n]
— Amirhossein Kazemnejad (@a_kazemnejad)
5:09 PM • Oct 3, 2024
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
Overview:
The paper investigates OpenAI's o1 system, which is optimized for reasoning compared to earlier LLMs.
o1 significantly surpasses previous models in various tasks, especially in unique challenges like forming acronyms from non-initial letters.
Nonetheless, it maintains similar qualitative trends seen in older models, showing sensitivity to the probability of examples and tasks.
While reasoning optimization enhances its performance, it does not entirely eliminate the characteristic probability sensitivity of language models.
Paper:
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Overview:
LLMs exhibit errors known as "hallucinations," but their internal states hold more truthfulness information than previously understood.
The study finds that this information is concentrated in specific tokens, improving error detection, though these detectors don't generalize well across datasets, indicating complexity in truthfulness encoding.
Additionally, internal representations can predict likely error types, aiding in tailored mitigation efforts.
Despite possibly encoding correct answers internally, LLMs might still produce incorrect outputs, highlighting a gap between internal encoding and performance.
These insights enhance understanding of LLM errors and guide future error analysis and mitigation strategies.
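For intuition, here is a minimal sketch of the kind of linear probe such a study trains, assuming hidden states have already been extracted at the informative token positions; all names here are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_examples, hidden_dim) hidden states taken at the informative token
#    position (the paper points to exact-answer tokens); y: 1 if the model's
#    answer was correct, else 0. Feature extraction is assumed done upstream.
def train_truthfulness_probe(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe  # probe.predict_proba(H)[:, 1] then scores new answers' truthfulness
```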
Paper:
Video Instruction Tuning With Synthetic Data
Overview:
The paper introduces a method for advancing video large multimodal models (LMMs) by creating a synthetic dataset named LLaVA-Video-178K for video instruction-following.
This dataset covers tasks such as detailed captioning and both open-ended and multiple-choice question answering.
The resulting model, LLaVA-Video, trained on this dataset and existing visual instruction tuning data, performs strongly across various video benchmarks, underscoring the dataset's effectiveness.
Paper:
Author's Explanation:
🚨Video Instruction Tuning with Synthetic Data
🌟𝐏𝐫𝐨𝐣: llava-vl.github.io/blog/2024-09-3…
🚀𝐀𝐛𝐬: arxiv.org/abs/2410.02713
creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K
— Zhengzhong "Jay" Tu (@_vztu)
1:28 PM • Oct 4, 2024