The AI Timeline #2 - AI Can Think & Reason
Latest AI Research Explained Simply
Research Papers x 2: 🐳 Orca 2, 🖼️ LCM-LoRA | Industry News x 2: 📏 Anthropic's 200K tokens Claude, 🎥 Stable Video Diffusion
Research Papers
🐳 Orca 2: Teaching Small Language Models How to Reason
Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Awadallah
Dall-E generated Orca
Overview
Proposes a fully synthetic dataset generated by GPT-4 for LLM training
Fine-tunes Llama 2 7B & 13B; the resulting models outperform Llama 2 70B chat and others
Previous Challenges
Explanation Tuning: "the extraction of answers with detailed explanations from LLMs based on system instructions"
Specific system instructions are used to extract detailed explanations from LLMs, eliciting "slow thinking"
e.g. "think step-by-step", "generate detailed answers"
This can increase the quantity and diversity of training signals, but it is context dependent; some system instructions are useless or even distracting
Proposed Method: Cautious Reasoning LLM
Cautious Reasoning: "the act of deciding which solution strategy to choose for a given task"
Begin with a collection of diverse tasks
Orca 2 guides which task requires what solution strategy (e.g. direct-answer, step-by-step, explain-then-answer, etc.)
Apply task-specific strategies (system instructions) to obtain teacher-model responses (GPT-4 = teacher)
Prompt Erasing: at training time, replace the student's system instruction with a generic one that omits the details of how to approach the task (see the sketch after this list)
Goal: Encourage the student model to learn the strategy and reasoning abilities inherent in the task.
testing different strategies for solving the question
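To make the prompt-erasing idea concrete, here is a minimal sketch of how a student training example might be assembled; the generic system prompt, the helper name build_student_example, and the sample question are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch of "prompt erasing" (illustrative only; prompts are hypothetical).

GENERIC_SYSTEM = "You are Orca, an AI assistant. Answer the question as well as you can."

def build_student_example(task_question: str,
                          strategy_instruction: str,
                          teacher_answer: str) -> dict:
    """The teacher (GPT-4) was prompted with a detailed, task-specific strategy
    (e.g. 'explain step by step, then answer'). For the student's training data,
    that strategy is erased and replaced with a generic system prompt, so the
    student must learn the strategy from the answer itself."""
    return {
        "system": GENERIC_SYSTEM,      # strategy details removed
        "user": task_question,         # the original task
        "assistant": teacher_answer,   # still reflects the teacher's strategy
        # strategy_instruction is deliberately NOT included
    }

example = build_student_example(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?",
    "Solve the problem step by step, showing all intermediate reasoning.",
    "45 minutes is 0.75 hours, so speed = 60 / 0.75 = 80 km/h.",
)
print(example)
```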
Training: Progressive Learning
Fine-tune LLaMA-2-7B or LLaMA-2-13B on the FLAN-v2 dataset (1 epoch)
Train on 5 million ChatGPT data points from Orca 1 (1 epoch)
Train on the combination of 1 million GPT-4 data points from Orca 1 and Orca 2's 817K data (4 epochs; schedule sketched below)
No direct RLHF, but the GPT-4 data are technically already RLHF'd
RLHF is skipped because it doesn't give the model "new" knowledge
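As a rough illustration, the progressive schedule above can be read as a simple staged training plan; the stage labels are descriptive (not actual dataset IDs) and finetune() is a hypothetical placeholder for an SFT pass.

```python
# Hypothetical sketch of Orca 2's progressive-learning schedule.
stages = [
    ("FLAN-v2 collection",                             1),  # stage 1: 1 epoch
    ("5M ChatGPT responses from Orca 1",               1),  # stage 2: 1 epoch
    ("1M GPT-4 responses (Orca 1) + 817K Orca 2 data", 4),  # stage 3: 4 epochs
]

model = "llama-2-7b"  # or "llama-2-13b"
for dataset, epochs in stages:
    for _ in range(epochs):
        # model = finetune(model, dataset)  # each stage continues from the previous one
        pass
```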
Results
Demonstrates significant improvement in zero-shot reasoning
Outperforms similar-sized models in advanced reasoning abilities
Third-Party Evaluations
Benchmarked against other similar open source models
Evaluations on BigBench
OpenHermes 2.5 Mistral 7B: 53.04%
OpenHermes Llama2 13B: 46.01%
Mistral Base 7B: 42.15%
Orca 2 13B: 40.36%
data taken from Teknium
Evaluations on gpt4all
Hermes 2.5 7B Mistral score: 73.12%
Mistral Base 7B score: 71.16%
Orca 2 13B GPT4All score: 70.58%
data taken from Teknium
Key Takeaways:
Rather disappointing performance, as smaller SoTA models are better:
base model Llama-2 + OpenHermes 1 (outdated Sept 4th version) outperforms base model Llama-2 + Orca 2
Base Mistral 7B outperforms Orca 2 13B
However:
valuable insights into utilizing synthetic data for LLM training
They improved Llama-2 7B/13B to the level of a 5-10x larger model of the same base.
🖼️ LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao
4 steps inference using LCM+LoRA
Overview
2 ideas were proposed in this paper
Distills LDMs using LoRA, which enables LCM distillation of larger models (e.g. SDXL) without training the full parameters (e.g. 3.5B → 197M params)
LCM-LoRAs (termed acceleration vectors) can be combined with any existing LoRA (a style vector), which cuts the number of generation steps
Keywords:
LDM - Latent Diffusion Models (e.g. SDXL)
LCM - Latent Consistency Models (a new 1~4 step generation method)
LCD - Latent Consistency Distillation (applying LCD to an LDM yields an LCM)
LoRA - Low-Rank Adaptation (Parameter-Efficient Fine-Tuning)
Previous challenges:
LCMs were proposed on Oct 6th, 2023: they need only 2~4 steps to generate images and are obtained by applying LCD to an LDM. However, training an LCM still requires a lot of compute (32 A100 GPU hours) and VRAM.
Instead of training LCMs and fine-tuning them (Latent Consistency Fine-tuning) on custom datasets, this paper proposes using LoRA
Idea Breakdown:
Idea 1
Since distillation is a fine-tuning process, LoRA can be incorporated into the LCD process, turning the LCM into a LoRA that sits on top of the original base model
Since it utilizes LoRA, larger models can easily be distilled or fine-tuned (see the sketch below)
e.g. Base SDXL + SDXL's LCM-LoRA = 2~4 step SDXL generation
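In practice, Idea 1 looks roughly like the following; this is a minimal sketch assuming a recent diffusers release with LoRA support, the prompt is made up, and the lcm-lora-sdxl repo ID is the one published by the LCM team.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load base SDXL and swap in the LCM scheduler
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Attach the LCM-LoRA "acceleration vector" distilled from SDXL
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# 4-step generation with a low guidance scale, as recommended for LCMs
image = pipe(
    "a photo of an orca jumping over a sailboat, golden hour",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sdxl.png")
```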
Idea 2
The LCM-LoRA that results from LDM distillation can be used as an acceleration vector
They found that the LCM-LoRA can be directly combined with a normal LoRA (i.e. acceleration vector + style vector)
So LCM generation can be customized with existing LoRAs, without fine-tuning, and with efficiency similar to fine-tuning an LCM (see the sketch after the figure below)
e.g. Base SDXL + (SDXL's LCM-LoRA + Fantasy LoRA) = "Customized LCM"
the connection of Acceleration LoRA & Style LoRA
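Idea 2 is roughly the following extension of the previous sketch: the acceleration LoRA and a style LoRA are loaded as separate adapters and mixed. The style-LoRA repo ID and the adapter weights here are hypothetical examples, and set_adapters assumes a diffusers build with the PEFT-backed multi-adapter API.

```python
# Continuing from the previous sketch: combine the acceleration vector with a
# style vector instead of fine-tuning an LCM on a custom style.
# (If a LoRA is already attached, call pipe.unload_lora_weights() first.)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
pipe.load_lora_weights("some-user/fantasy-style-sdxl-lora", adapter_name="style")  # hypothetical style LoRA

# Weight the two adapters; the style weight can be tuned per LoRA
pipe.set_adapters(["lcm", "style"], adapter_weights=[1.0, 0.8])

image = pipe(
    "a castle floating above the clouds, fantasy style",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("customized_lcm.png")
```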
Highlight:
So it's just LCM customization without the need to distill or fine-tune: it uses LoRA and needs only a few iterations for image generation on SDXL, etc.
If we use a normal Style LoRA + LCM-LoRA on the base SDXL model, the image requires fewer iterations
with and without adding LCM LoRA with Style LoRA
Pretty insane results for such few steps
Authorsâ Notes:
There are criticisms online that LCM-LoRA on SDXL doesn't generate high enough quality, but the consensus is still that LCM-LoRA is generally better than plain LCMs due to its lower computation requirements.
On top of that, LCM-LoRA can even use lower ranks and still generate at the same level as the original LCM-LoRA
Industry News
📏 Anthropic's 200K Tokens LLM Update: Claude 2.1
Overview
New context length: 200K tokens ≈ 150,000 words ≈ 500 pages
Claude 2.1 on context length vs accuracy
Accuracy: 2x decrease in hallucination rates
Claude 2.1 on hallucinations & facts generation
New features
Tool use
Using a calculator for complex numerical reasoning
Translating natural language requests into structured API calls
Answering questions by searching databases or using a web search API
Taking simple actions in software via private APIs
Connecting to product datasets to make recommendations and help users complete purchases
User-specified system prompts (similar to custom instructions); see the sketch after this list
Generate code in SDK
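As an illustration of the user-specified system prompt, a minimal sketch using the anthropic Python SDK's Messages API might look like this; the prompt contents are made up, while the model name and parameters follow Anthropic's published SDK.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-2.1",
    max_tokens=512,
    # User-specified system prompt, similar to custom instructions
    system="You are a meticulous legal assistant. Quote the contract verbatim when answering.",
    messages=[
        {"role": "user", "content": "Summarize the termination clause in the attached contract."},
    ],
)
print(response.content[0].text)
```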
Third-Party Evaluations
Test on 200K token context length recall capacities
Key findings
Effective recall at both document extremes, with nearly 100% accuracy at top and bottom
Recall performance decreases around 90K tokens
Lower context lengths yield higher accuracy
Fact positioning impacts recall effectiveness
Takeaway
Importance of prompt engineering for accuracy
No guarantee of fact retrieval
Optimal recall with reduced context and strategic fact placement
Evaluation Methodology
Used Paul Graham essays as background tokens, varying document depths and context lengths from 1K ~ 200K tokens
Random statement insertion and retrieval via Claude 2.1, compared with GPT-4
Multiple tests for statistical reliability
Evaluated Claude 2.1's answers with GPT-4 using LangChainAI evals (a simplified sketch of this setup follows below)
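The methodology above can be re-created with a simplified harness like the one below. This is not Greg Kamradt's actual code: the background text is synthetic filler (the real test used Paul Graham essays), and ask_claude / grade_with_gpt4 are hypothetical stand-ins for the API and LangChain eval calls.

```python
# Simplified "needle in a haystack" sketch (illustrative only)
NEEDLE = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_context(background: str, needle: str, depth_pct: float, context_tokens: int) -> str:
    """Trim the background to ~context_tokens (rough 0.75 words/token heuristic)
    and insert the needle statement at depth_pct of the document."""
    words = background.split()[: int(context_tokens * 0.75)]
    insert_at = int(len(words) * depth_pct)
    return " ".join(words[:insert_at] + [needle] + words[insert_at:])

background = "filler essay text " * 200_000  # stand-in for the Paul Graham essays

for context_tokens in (1_000, 50_000, 120_000, 200_000):
    for depth_pct in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = build_context(background, NEEDLE, depth_pct, context_tokens)
        # answer = ask_claude(model="claude-2.1", context=context, question=QUESTION)
        # score = grade_with_gpt4(answer, expected=NEEDLE)  # 1-10 rubric via LangChain evals
        pass
```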
Claude 2.1 evaluation
for contrast, this is GPT-4's 128K-token performance (much better)
Additional Notes
Multiple fact retrievals or synthetic reasoning steps reduce model performance
Variations in prompt, question, and context affect outcomes
Test funded by Anthropic, but their involvement was strictly logistical
Total cost of evaluation: $1,016 (API usage)
all test results were taken from Greg Kamradt
🎥 Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
Stable Video Diffusion (SVD) Demo
Overview
New 512×1024 video generation models
Model weights here
performance comparison with Runway Gen-2 & PikaLabs
Model Pretraining
Base Model → Video Pre-training → High-Res text2video, High-Res image2video, and interpolation models
Base Model Structure: SD 2.1 image-generation model (512×512) fine-tuned at 256×384 resolution
Video Pre-trained Model Structure: take the base model, follow the Latent Video Diffusion Models structure, and insert temporal convolution and attention layers after every spatial convolution and attention layer (1521M parameters); trained on 14 frames at 256×384 for 150k iterations, then at 320×576 for 100k iterations (conceptual sketch below)
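Conceptually, "insert a temporal layer after every spatial layer" looks something like the block below. This is a hand-rolled sketch of the factorized spatial/temporal attention pattern, not the actual SVD implementation, and the channel/head counts are arbitrary.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Conceptual sketch (not the actual SVD code): a temporal attention layer
    inserted after a spatial attention layer, mixing information across frames."""
    def __init__(self, channels: int = 320, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape

        # Spatial attention: each frame attends over its own h*w positions
        xs = x.reshape(b * t, c, h * w).transpose(1, 2)      # (b*t, h*w, c)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]

        # Temporal attention: each spatial position attends across the t frames
        xt = xs.reshape(b, t, h * w, c).permute(0, 2, 1, 3)  # (b, h*w, t, c)
        xt = xt.reshape(b * h * w, t, c)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]

        # Restore (batch, frames, channels, height, width)
        return xt.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)

video = torch.randn(1, 14, 320, 32, 48)    # 14 frames of latent-sized features
print(SpatioTemporalBlock()(video).shape)  # torch.Size([1, 14, 320, 32, 48])
```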
Fine-Tuned Model Types
High-Resolution Text-to-Video Model: fine-tuned on 1M high-quality video samples at 576×1024 resolution for 50k iterations, batch size 768. Focuses on object and camera motion with aligned captions
High-Resolution Image-to-Video Model: conditions on an input image; the text embedding is replaced with a CLIP image embedding, and noise-augmented conditioning is concatenated onto the input frame. Two models were developed: 14 frames and 25 frames. The guidance scale increases linearly across frames (usage sketch after this list)
Camera Motion LoRA: implemented in the temporal attention blocks for controlled camera motions (horizontal panning, zooming, static); image-to-video only
camera motion LoRA examples
Frame Interpolation Model: increases the frame rate by predicting three frames between two conditioning frames. Follows the Latent Video Diffusion Models methodology; only for the text-to-image model
Multi-View Generation Model: generates multiple consistent, novel views of an object; fine-tuned on the Objaverse and MVImgNet datasets. Outperforms Zero123XL & SyncDreamer in multi-view consistency
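If you want to try the released image-to-video weights, here is a minimal sketch with diffusers, assuming a version that ships StableVideoDiffusionPipeline; the input image path is a placeholder.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# 25-frame image-to-video checkpoint; use ...-img2vid for the 14-frame model
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# Condition on a single 1024x576 input image (placeholder path)
image = load_image("input_frame.png").resize((1024, 576))

frames = pipe(image, decode_chunk_size=8, motion_bucket_id=127).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```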
Additional Notes
Licensing: Exclusively for Research purposes, non-commercial use
bycloud: Will compare this with Emu Video and the Gen-2 update all together soon, subscribe to stay tuned!
that's a wrap for this issue!
THANK YOU
Support me on Patreon ❤️
Want to promote your service, website or product? Reach out at [email protected]