The AI Timeline #2 - AI Can Think & Reason

Latest AI Research Explained Simply

Research Papers x 2

🐳 Orca 2

🖼️ LCM-LoRA

Industry News x 2

📝 Anthropic’s 200K-token Claude

🎥 Stable Video Diffusion

Research Papers

🐳 Orca 2: Teaching Small Language Models How to Reason

Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Awadallah

DALL-E generated Orca

Overview

Proposes a fully synthetic dataset generated by GPT-4 for LLM training

Fine-tunes Llama 2 7B & 13B, which then outperform Llama-2-70B-chat and other models

Previous Challenges

Explanation Tuning: “the extraction of answers with detailed explanations from LLMs based on system instructions”

Uses specific system instructions to extract detailed explanations from LLMs, eliciting “slow thinking”

e.g. “think step-by-step”, “generate detailed answers”

This can increase the quantity and diversity of training signals, but it is context-dependent: some system instructions are useless for a given task or even distracting. A sketch of such a training example is shown below.
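For illustration, here is a minimal sketch of what one explanation-tuning training pair might look like; the field names and the instruction wording are assumptions, not taken from the paper.

```python
# A hypothetical explanation-tuning pair: a detailed system instruction
# elicits a "slow-thinking" teacher (GPT-4) response with its reasoning.
example = {
    "system": "You are a helpful assistant. Think step-by-step and explain "
              "your reasoning in detail before giving the final answer.",
    "user": "If a train travels 120 km in 1.5 hours, what is its average speed?",
    "assistant": "Step 1: the distance is 120 km. Step 2: the time is 1.5 h. "
                 "Step 3: average speed = distance / time = 120 / 1.5 = 80. "
                 "Final answer: 80 km/h.",
}
```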

Proposed Method: Cautious Reasoning LLM

  1. Cautious Reasoning: “the act of deciding which solution strategy to choose for a given task”

  2. Begin with a collection of diverse tasks

  3. Decide which solution strategy each task requires (e.g. direct-answer, step-by-step, explain-then-answer, etc.)

  4. Apply the task-specific strategies (system instructions) to obtain teacher model responses (GPT-4 is the teacher)

  5. Prompt Erasing: at training time, replace the student’s system instruction with a generic one that omits the details of how to approach the task (see the sketch below)

  6. Goal: encourage the student model to learn the strategy and reasoning abilities inherent in the task

testing different strategies for solving the question
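A minimal sketch of steps 4–5 (prompt erasing), assuming a simple dict schema for training examples; the instruction texts are illustrative, not the paper’s:

```python
# Detailed, task-specific system instruction used to elicit the teacher's
# (GPT-4's) careful, step-by-step response.
DETAILED_INSTRUCTION = ("You are a careful reasoner. Restate the problem, "
                        "solve it step-by-step, then verify your answer.")
# Generic instruction the student (Orca 2) actually trains with.
GENERIC_INSTRUCTION = "You are Orca, an AI assistant. Answer the user's question."

def erase_prompt(teacher_example: dict) -> dict:
    """Prompt erasing: keep the teacher's detailed response, but swap in a
    generic system instruction so the student must learn the strategy itself."""
    return {
        "system": GENERIC_INSTRUCTION,              # strategy details removed
        "user": teacher_example["user"],
        "assistant": teacher_example["assistant"],  # still demonstrates the strategy
    }
```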

Training: Progressive Learning

  1. Fine-tune LLaMA-2-7B or LLaMA-2-13B on the FLAN-v2 dataset (1 epoch)

  2. Train on 5 million ChatGPT samples from Orca 1 (1 epoch)

  3. Train on the combination of 1 million GPT-4 samples from Orca 1 and Orca 2’s 817K samples (4 epochs)

No direct RLHF, but the GPT-4 data is technically RLHF’d (GPT-4 itself was trained with RLHF)

RLHF was skipped because it doesn’t give the model “new” knowledge
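As a rough sketch, the three-stage schedule above amounts to sequential fine-tuning of the same model; the dataset paths below are hypothetical placeholders, and pre-tokenized, fixed-length examples (plus the usual distributed-training setup) are assumed:

```python
# Progressive-learning sketch with Hugging Face Transformers; dataset paths
# are hypothetical, and pre-tokenized datasets (input_ids/labels) are assumed.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_from_disk

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

stages = [
    ("flan_v2_tokenized", 1),          # stage 1: FLAN-v2, 1 epoch
    ("orca1_chatgpt_5m", 1),           # stage 2: Orca 1 ChatGPT data, 1 epoch
    ("orca1_gpt4_plus_orca2_817k", 4), # stage 3: combined data, 4 epochs
]

for path, epochs in stages:
    args = TrainingArguments(output_dir=f"ckpt-{path}", num_train_epochs=epochs,
                             per_device_train_batch_size=4)
    # the same `model` object carries over from one stage to the next
    Trainer(model=model, args=args, train_dataset=load_from_disk(path)).train()
```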

Results

  • Demonstrates significant improvement in zero-shot reasoning

  • Outperforms similar-sized models in advanced reasoning abilities

Third-Party Evaluations

Benchmarked against other similar open source models

Evaluations on BigBench

  • OpenHermes 2.5 Mistral 7B: 53.04%

  • OpenHermes Llama2 13B: 46.01%

  • Mistral Base 7B: 42.15%

  • Orca 2 13B: 40.36%

data taken from Teknium

Evaluations on gpt4all

  • Hermes 2.5 7B Mistral score: 73.12%

  • Mistral Base 7B score: 71.16%

  • Orca 2 13B GPT4All score: 70.58%

data taken from Teknium

Key Takeaways:

Rather disappointing performance, as smaller SoTA models are better:

  • base model Llama-2 + OpenHermes 1 (outdated Sept 4th version) outperforms base model Llama-2 + Orca 2

  • Base Mistral 7B outperforms Orca 2 13B

However:

  • valuable insights into utilizing synthetic data for LLM training

  • They improved Llama-2 7B/13B to the level of a 5-10x larger model of the same base

🖼️ LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao

4-step inference using LCM-LoRA

Overview

Two ideas are proposed in this paper:

  1. Distill LDMs using LoRA, which enables LCMs of larger models (e.g. SDXL) without training the full parameters (e.g. 3.5B → 197M params)

  2. LCM-LoRAs (dubbed “acceleration vectors”) can be combined with any existing LoRAs (“style vectors”), cutting the number of generation steps

Keywords:

  • LDM - Latent Diffusion Models (e.g. SDXL)

  • LCM - Latent Consistency Models (new 1~4-step generation method)

  • LCD - Latent Consistency Distillation (applying LCD to an LDM yields an LCM)

  • LoRA - Low-Rank Adaptation (Parameter-Efficient Fine-Tuning)

Previous challenges:

LCMs, proposed on Oct 6th, 2023, need only 2~4 steps to generate images and are obtained by applying LCD to an LDM. However, training an LCM still requires a lot of compute (32 A100 GPU hours) and VRAM.

Instead of training LCMs and fine-tuning them (Latent Consistency Fine-tuning) on custom datasets, this paper proposes using LoRA.

Idea Breakdown:

Idea 1

Since distillation is a “fine-tuning” process, LoRA can be incorporated into the LCD process, turning the LCM into a LoRA applied on top of the original base model.

Because it uses LoRA, larger models can easily be distilled or fine-tuned.

e.g. base SDXL + SDXL’s LCM-LoRA = 2~4-step SDXL generation

Idea 2

The LCM-LoRA resulting from LDM distillation can be used as an acceleration vector.

They found that the LCM-LoRA can be directly combined with a normal LoRA (i.e. acceleration vector + style vector).

So LCM generation can be customized with existing LoRAs, without fine-tuning, reaching efficiency similar to a fine-tuned LCM (see the sketch below).

e.g. base SDXL + ( SDXL’s LCM-LoRA + Fantasy LoRA ) = “Customized LCM”

the connection of acceleration LoRA & style LoRA
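Here is a minimal sketch of combining the two vectors via the diffusers PEFT integration; the style-LoRA repo ID is a hypothetical placeholder, and the adapter weights are illustrative:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# swap in the LCM scheduler for few-step sampling
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# acceleration vector + style vector, loaded as two named adapters
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
pipe.load_lora_weights("some-user/fantasy-style-lora", adapter_name="style")  # hypothetical
pipe.set_adapters(["lcm", "style"], adapter_weights=[1.0, 0.8])

image = pipe("a fantasy castle at dawn",
             num_inference_steps=4, guidance_scale=1.0).images[0]
```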

Highlight:

In short: LCM customization without the need to distill or fine-tune, using LoRA, with low iteration counts for image generation on SDXL, etc.

If we use a normal style LoRA + LCM-LoRA on base SDXL models, the image requires fewer iterations.

with and without adding LCM LoRA with Style LoRA

Pretty insane results for so few steps

Authors’ Notes:

There are criticisms online that LCM-LoRA on SDXL doesn’t generate high enough quality, but the consensus is still that LCM-LoRA is generally better than LCMs due to its lower computation requirements.

On top of that, LCM-LoRA can even use lower ranks and still generate at the same level as the original LCM-LoRA.

Industry News

📝 Anthropic’s 200K-Token LLM Update: Claude 2.1

Overview

New context length: 200K tokens ≈ 150,000 words ≈ 500 pages

Claude 2.1 on context length vs accuracy

Accuracy: 2x decrease in hallucination rates

Claude 2.1 on hallucinations & facts generation

New features

  • Tool use

    • Using a calculator for complex numerical reasoning

    • Translating natural language requests into structured API calls

    • Answering questions by searching databases or using a web search API

    • Taking simple actions in software via private APIs

    • Connecting to product datasets to make recommendations and help users complete purchases

  • User-specified system prompts (similar to custom instructions; sketched below)

  • Generate code in SDK
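As a minimal sketch of the user-specified system prompt through the Anthropic Python SDK’s Text Completions API (for Claude 2.1, system text goes before the first Human: turn); the prompt contents are illustrative:

```python
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
system = ("You are a careful legal assistant. Answer only from the "
          "provided contract text.")
completion = client.completions.create(
    model="claude-2.1",
    max_tokens_to_sample=1024,
    # in Claude 2.1, the system prompt is plain text before the first Human turn
    prompt=f"{system}{HUMAN_PROMPT} Summarize the termination clause.{AI_PROMPT}",
)
print(completion.completion)
```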

Third-Party Evaluations

A test of recall capacity across the 200K-token context length

Key findings

  • Effective recall at both document extremes, with nearly 100% accuracy at top and bottom

  • Recall performance decreases around 90K tokens

  • Lower context lengths yield higher accuracy

  • Fact positioning impacts recall effectiveness

Takeaway

  • Importance of prompt engineering for accuracy

  • No guarantee of fact retrieval

  • Optimal recall with reduced context and strategic fact placement

Evaluation Methodology

  • Used Paul Graham essays as background tokens, varying document depths and context lengths from 1K ~ 200K tokens

  • Random statement insertion and retrieval via Claude 2.1, compared with GPT-4

  • Multiple tests for statistical reliability

  • Evaluated Claude 2.1’s answers with GPT-4 using LangChainAI evals (sketch below)
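A minimal sketch of the methodology above; the model calls are placeholder stubs, and the needle/question pair is merely illustrative of the kind used in the test:

```python
# Needle-in-a-haystack sketch: insert a fact at a given depth in a long
# background document, ask the model to retrieve it, and grade the answer.
def ask_claude(prompt: str) -> str:
    return "stub"              # placeholder: Claude 2.1 API call goes here

def grade_with_gpt4(answer: str, needle: str) -> bool:
    return needle in answer    # placeholder: GPT-4 grading via LangChainAI evals

def build_context(background: str, needle: str, depth: float, n_tokens: int) -> str:
    """Truncate background to ~n_tokens (approximated by words here) and
    insert the needle at a fractional depth (0.0 = top, 1.0 = bottom)."""
    words = background.split()[:n_tokens]
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])

background = "filler " * 200_000   # placeholder for the Paul Graham essays
needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
question = "What is the best thing to do in San Francisco?"

for n_tokens in (1_000, 90_000, 200_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        context = build_context(background, needle, depth, n_tokens)
        answer = ask_claude(context + "\n\n" + question)
        correct = grade_with_gpt4(answer, needle)
```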

Claude 2.1 evaluation

for contrast, this is GPT-4 128k tokens performance (much better)

Additional Notes

  • Retrieving multiple facts or performing synthesized reasoning steps reduces model performance

  • Variations in prompt, question, and context affect outcomes

  • Test funded by Anthropic, but their involvement was strictly logistical

  • Total cost of evaluation: $1,016 (API usage)

all test results were taken from Greg Kamradt

🎥 Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

Stable Video Diffusion (SVD) Demo

Overview

  • New 576×1024 video generation models

  • Model weights here

performance comparison with Runway Gen-2 & PikaLabs

Model Pretraining

Base Model → Video Pretraining → high-res text2video, high-res image2video, and interpolation models

Base Model Structure: SD 2.1 image generation model (512×512) fine-tuned at 256×384 resolution

Video Pretraining Structure: take the base model and, following the Latent Video Diffusion Models structure, insert temporal convolution and attention layers after every spatial convolution and attention layer (1,521M parameters); trained on 14 frames at 256×384 for 150K iterations, then at 320×576 for 100K iterations

Fine-Tuned Model Types

High-Resolution Text-to-Video Model: fine-tuned on 1M high-quality video samples at 576×1024 resolution for 50K iterations, batch size 768. Focuses on object and camera motion with aligned captions.

High-Resolution Image-to-Video Model: conditions on an input image; the text embedding is replaced with a CLIP image embedding, and a noise-augmented version of the input frame is concatenated to the conditioning. Two models were developed: 14 and 25 frames. The guidance scale increases linearly across frames. A usage sketch follows below.
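For reference, a minimal sketch of running the image-to-video model through the diffusers integration; the settings below are illustrative defaults, not the paper’s training configuration:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # 25-frame variant
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.png").resize((1024, 576))  # conditioning frame
frames = pipe(image, decode_chunk_size=8).frames[0]  # list of PIL frames
export_to_video(frames, "generated.mp4", fps=7)
```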

Camera Motion LoRA: Implemented in temporal attention blocks for controlled camera motions: horizontally moving, zooming, static, for image-to-video only

camera motion LoRA examples

Frame Interpolation Model: increases the frame rate by predicting three frames between two conditioning frames. Follows the Latent Video Diffusion Models methodology; only for the text-to-video model

Multi-View Generation Model: generates multiple consistent, novel views of an object; fine-tuned on the Objaverse and MVImgNet datasets. Outperforms Zero123XL & SyncDreamer in multi-view consistency

Additional Notes

Licensing: exclusively for research purposes, non-commercial use

bycloud: Will compare this with Emu Video and the Gen-2 update all together soon, subscribe to stay tuned!

that’s a wrap for this issue!

THANK YOU

Support me on Patreon ❤️

Want to promote your service, website or product? Reach out at [email protected]
