OpenDevin, KAN vs MLP, and Rule Based Rewards

#16 | Latest AI Research Explained Simply

In this issue: x3 industry news, x3 AI research papers

July 22nd ~ July 28th

🗞️ Industry News in 1 Line

  1. ♥ 2.2k Mistral announces Mistral Large 2, their new flagship model: a 123B dense model with a 128k context window. It excels at coding and is comparable to Llama-3.1-405B, though they didn’t compare it to Claude 3.5 Sonnet. Model weights are available on Huggingface.

  2. ♥ 13k OpenAI announces SearchGPT, a new AI search feature for ChatGPT that gives you answers with clear and relevant sources. For now, you can only join the waitlist.

  3. ♥ 1.1k StabilityAI announces Stable Video 4D, their first video-to-video generation model, which lets users upload a single video and receive dynamic novel-view videos from eight new angles. The model is available on Huggingface.

Practice Coding While Earning with Shipd

Shipd - Train next gen coding LLMs and win payouts

Shipd is a gamified coding platform that pays top devs to code, have fun, and earn while pushing SoTA coding LLM capabilities forward. On Shipd, top programmers solve Leetcode-style question sets and win payouts by holding the best solutions.

Shipd presents a rotating selection of questions in various programming languages, with a growing prize pool currently at $55k/month.

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Wang et al. [UIUC, CMU, Yale, UC Berkeley, Contextual AI, KAUST, ANU, HCMUT, Alibaba, All Hands AI]

♥ 591   LLM agent
Benchmark results of OpenDevin against other SOTA approaches

Introduction to OpenDevin

A couple of months ago, the internet was filled with memes about Devin (an AI software engineer) and how it was going to replace all software engineers. A few weeks after the launch, we found out that the demo had overpromised quite a lot and that Devin could not perform complex tasks reliably.

This paper introduces OpenDevin, an open-source (MIT-licensed) platform for building autonomous software engineers that can collaborate with human developers to write code and fix bugs. The code is under active development, but it already supports AI agents that interact with the world through software development, command-line use, and web browsing, and it provides a sandboxed environment for safe code execution.

User interface of OpenDevin

OpenDevin Architecture

OpenDevin consists of several key components that work together to create a flexible and powerful system for AI agents. Here's an overview of its architecture:

  1. Agent Actions: The core of OpenDevin's agent is built around a State object, which includes an event stream, a chronological record of past actions and observations. This state also includes auxiliary information like LLM call costs and execution parameters. Agents interact with the environment through a set of core actions:

    1. IPythonRunCellAction: Executes Python code

    2. CmdRunAction: Runs bash commands

    3. BrowseInteractiveAction: Interacts with a web browser

    4. MessageAction: Sends messages (e.g., to users)

  2. Agent Implementation: Developers can create new agents by implementing the step function, which takes the current state as input and generates an appropriate action based on the agent's logic (see the minimal sketch after this list). OpenDevin hosts a collection of community-contributed agent implementations:

    1. CodeAct Agent: A generalist agent based on the CodeAct framework.

    2. Browsing Agent: A web-specific agent for browsing tasks.

    3. GPTSwarm Agent: Uses optimizable graphs to construct agent systems.

    4. Micro Agents: Specialized agents for particular tasks, based on existing generalist agents.

  3. Agent Runtime: The Agent Runtime provides the environment for executing actions and generating observations. Currently, it supports the following runtimes:

    1. Linux SSH Sandbox: A secure, isolated Docker container where bash commands are executed.

    2. Jupyter IPython: Supports interactive Python code execution and debugging.

    3. Web Browser: Implements a Chromium browser based on Playwright, allowing web interactions.

  4. Agent Skills: OpenDevin includes an AgentSkills library of extensible tools implemented as Python functions, including skills for file editing, multi-modal document parsing, and more.

  5. Multi-agent Interaction: OpenDevin supports interactions between multiple agents through AgentDelegateAction which allows one agent to delegate subtasks to another specialized agent.

  6. User Interface: OpenDevin provides a user interface that allows users to view files, check executed bash commands and Python code, observe the agent's browser activity, and directly interact with the agent.
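To make the action/step abstraction concrete, here is a minimal, hypothetical sketch of an agent whose step function inspects the event stream and returns one of the core actions. The class fields and agent logic below are illustrative assumptions, not OpenDevin's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for the core actions described above.
@dataclass
class CmdRunAction:
    command: str

@dataclass
class MessageAction:
    content: str

@dataclass
class State:
    # Event stream: chronological record of past (action, observation) pairs.
    history: list = field(default_factory=list)

class ListFilesAgent:
    """Toy agent: runs one bash command, then reports back to the user."""

    def step(self, state: State):
        # A real agent would prompt an LLM with state.history to decide the next action.
        if not state.history:
            return CmdRunAction(command="ls -la")
        return MessageAction(content="Listed the working directory; see the observation above.")
```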

Evaluating OpenDevin

In software engineering tasks, OpenDevin's CodeActAgent solves 26% of issues in SWE-Bench Lite (comparable to specialized agents) and fixes 79.3% of bugs in HumanEvalFix. In API usage (Gorilla APIBench) and tool utilization (ToolQA), OpenDevin surpasses non-specialized baselines. It also performs strongly in bioinformatics tasks (BioCoder) and SQL query generation (BIRD), demonstrating that it can handle a wide range of programming challenges.

In web browsing tasks, OpenDevin's BrowsingAgent achieved a 15.5% success rate on WebArena, which is comparable to specialized web agents. For miscellaneous assistance tasks, OpenDevin shows impressive results, significantly improving upon baselines in the GAIA benchmark (32.1% vs 13.2% for AutoGPT) and achieving state-of-the-art performance on GPQA (53.1% on the diamond set). It also outperforms baselines in AgentBench's OS subset (57.6% vs 42.4%) and shows strong performance in mathematical reasoning (MINT math subset and ProofWriter).

Success rate of OpenDevin on various benchmarks.

Rule Based Rewards for Language Model Safety

Mu et al. [OpenAI]

♥ 2.1k   LLM safety
Architecture of Rule Based Rewards approach by OpenAI

Introduction to Language Model Safety

As AI language models become more powerful and widely used, we need to make sure they don’t behave inappropriately or respond rudely. Currently, many companies use human feedback to train these models to be safe. However, this approach has several problems:

  1. It's expensive and time-consuming to collect and maintain human feedback data.

  2. The data can quickly become outdated as safety guidelines change.

  3. Without precise instructions, annotators might rely on personal biases, leading to unintended model behaviors (e.g., being overly cautious or judgmental).

  4. Fixing issues often requires collecting new data or relabeling existing data, which is costly and time-consuming.

The researchers at OpenAI have introduced a new method called Rule Based Rewards (RBR) which can create AI models that are both safer and more useful, with behaviors that can be more precisely controlled and easily updated.

How to use Rule Based Rewards

Rule-Based Rewards (RBRs) are a new approach for building safety reward functions for reinforcement learning (RL) training, based on content and behavior policies. The method works by breaking down complex policies into a series of binary tasks called "propositions". Propositions are specific statements about completions, such as "the completion contains a statement of inability to comply".

Rules are created to determine which combinations of proposition truth values are desired or undesired. These rules allow completions to be ranked accurately based on how the propositions are classified. Next, two types of features are computed; these are numerical values determined by a prompt and its completion (see the sketch after this list):

  • Proposition probabilities: Estimated using a grader language model (LLM) with few-shot classification prompts.

  • Class features: Probabilities of broader classes (e.g., "ideal", "less_good") calculated by combining relevant proposition probabilities.
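As a rough illustration of how proposition probabilities turn into class features, here is a minimal sketch. The proposition names, class definitions, and the stubbed grader below are illustrative assumptions, not OpenAI's exact setup.

```python
def grade_propositions(prompt: str, completion: str) -> dict[str, float]:
    """Placeholder for the grader LLM: returns P(proposition is true) per proposition.
    In the paper these come from few-shot classification prompts; here they are stubbed."""
    return {
        "refuses_to_comply": 0.9,
        "contains_judgmental_language": 0.1,
        "complies_with_disallowed_request": 0.02,
    }

def class_features(props: dict[str, float]) -> dict[str, float]:
    # Class features combine relevant proposition probabilities: e.g. an "ideal"
    # refusal refuses without judging the user and without complying.
    p_refuse = props["refuses_to_comply"]
    p_judge = props["contains_judgmental_language"]
    p_comply = props["complies_with_disallowed_request"]
    return {
        "ideal": p_refuse * (1 - p_judge) * (1 - p_comply),
        "less_good": p_refuse * p_judge * (1 - p_comply),
        "unacceptable": p_comply,
    }
```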

Pipeline to generate synthetic data for training LLMs.

The RBR is typically implemented as a linear model with learnable parameters that takes these features as input and produces a safety score. A synthetic dataset (DRBR) is created that includes diverse completions with different rankings for each prompt, and the RBR weights are optimized so that the target ranking is achieved when the RBR is combined with a helpful-only reward model (RM); a minimal sketch of this combination follows the checklist below. If you want to use RBRs for model safety, follow this checklist:

  1. Define clear content and behavior policies.

  2. Break down these policies into specific propositions.

  3. Create classification prompts for each proposition.

  4. Generate a small human-labeled dataset (Gold set) to tune the classification prompts.

  5. Create synthetic comparison data (DRBR) for weight fitting.

  6. Fit the RBR weights using the synthetic data and a helpful-only RM.

  7. Combine the RBR with the helpful-only RM for use in RL training.

  8. Continuously evaluate and tune the combined reward function to ensure it enforces the desired safety behaviors.
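For intuition, here is a minimal sketch of steps 6 and 7: the RBR as a linear model over features, combined with the helpful-only RM score, plus a toy pairwise hinge loss for fitting the weights. The function names and the loss are illustrative assumptions, not the paper's exact implementation.

```python
def rbr_score(features: dict[str, float], weights: dict[str, float]) -> float:
    # Linear combination of class/proposition features with learnable weights.
    return sum(weights[name] * value for name, value in features.items())

def total_reward(rm_score: float, features: dict[str, float], weights: dict[str, float]) -> float:
    # Reward used in RL training: helpful-only RM score plus the RBR safety score.
    return rm_score + rbr_score(features, weights)

def ranking_loss(better: float, worse: float, margin: float = 1.0) -> float:
    # Toy pairwise hinge loss: penalize whenever a completion that should rank
    # higher does not beat the lower-ranked one by at least `margin`.
    return max(0.0, margin - (better - worse))

# Fitting (step 6) would minimize the sum of ranking_loss over every ordered pair
# of completions in the synthetic comparison set DRBR, scoring each with total_reward.
```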

Are Rule Based Rewards Effective?

The RBR approach improved safety without negatively impacting the model's performance on common capability benchmarks like MMLU, Lambada, HellaSwag, and GPQA. The main findings show that the RBR-PPO model achieved a good balance between safety and usefulness. In human evaluations of large models, it scored 97.27% on the "Not-Unsafe" metric (avoiding disallowed content) and 97.01% on the "Not-Overrefuse" metric (not refusing appropriate requests), resulting in the highest F1-score of 97.1%. The Human-PPO baseline increased safety significantly but at the cost of more over-refusals (about 14% increase in human evaluation). 

Error rate of rule based rewards approach

The researchers also found that RBRs could be effectively combined with different types of reward models (RMs) to improve their safety profiles. For example, when applied to a Human-RM with a tendency towards over-refusals, the RBR reduced over-refusals by 16%.

Similarly, when applied to an RM trained with outdated safety data, the RBR improved both safety and reduced over-refusals by 10%. Interestingly, the RBR approach required less human-annotated data than the Human-Data baseline to achieve its improvements. When the human-safety data was subsampled to match the amount used in RBR runs (518 completions), it performed slightly worse than both RBR-PPO and the full Human-PPO baseline. This suggests that the RBR method may be more data-efficient for improving model safety.

KAN or MLP: A Fairer Comparison

Yu et al. [National University of Singapore]

♥ 288  ML Theory
Comparing Kolmogorov–Arnold Networks (KAN) against Multi-Layer Perceptrons (MLP)

KAN vs MLP

A few weeks back, we discussed the launch of Kolmogorov-Arnold Networks (KANs), a new architecture that could potentially shake up machine learning. KANs generated a lot of hype, with people claiming they could be the missing link that pushes us closer to AGI; we even released a video explaining how KANs work and whether they could be the next paradigm shift in machine learning.

Some time has passed since the original paper was released, and researchers have had time to evaluate the new architecture. This paper tests the claims made by the original KAN paper to see whether KANs really are superior to MLPs.

Testing KANs Against MLPs

This paper compares two neural network architectures, Kolmogorov-Arnold Networks (KANs) and Multi-Layer Perceptrons (MLPs), on an equal footing by controlling for either the same number of parameters or the same amount of computational work (FLOPs). The researchers tested both architectures on a wide range of tasks, including machine learning, computer vision, audio processing, natural language processing, and symbolic formula representation.

For each type of task, they tried different network setups with varying numbers of hidden layers, layer widths, different spline settings (these are special functions KANs use), and different activation functions (like GELU or ReLU). Each network was trained for a certain number of epochs and then evaluated using the following approach:

  • For most tasks, they measured how often the network got the right answer.

  • For symbolic formula tasks, they used Root Mean Square Error, which measures how close the predictions are to the actual values.
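Concretely, the two metrics boil down to the following computations. This is a minimal NumPy sketch; the models, datasets, and training loops from the paper are omitted.

```python
import numpy as np

def accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    # Classification tasks (ML, vision, NLP, audio): fraction of correct predictions.
    return float((logits.argmax(axis=-1) == labels).mean())

def rmse(predictions: np.ndarray, targets: np.ndarray) -> float:
    # Symbolic formula regression tasks: root mean square error.
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))
```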

Are KANs Better Than MLPs?

The researchers found that KANs can be viewed as a special type of MLP, with the key difference being the use of learnable B-spline functions as activation functions. The experiments revealed that KANs only outperformed MLPs in symbolic formula representation tasks, while MLPs excelled in machine learning, computer vision, natural language processing, and audio processing.

Interestingly, when MLPs were modified to use learnable B-spline activation functions, they matched or surpassed KANs' performance across all tasks. This suggests that the functional differences between KANs and MLPs primarily stem from their activation functions.
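To make the idea concrete, here is a minimal PyTorch sketch of an MLP whose fixed activation is swapped for a learnable spline. For simplicity it uses a degree-1 (piecewise-linear) spline shared across the layer, whereas the paper's setup uses higher-order B-splines per edge; the class names and hyperparameters are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LearnableSpline(nn.Module):
    """Piecewise-linear spline with learnable values at fixed, equally spaced knots."""

    def __init__(self, num_knots: int = 16, x_min: float = -3.0, x_max: float = 3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        # Initialize to the identity so training starts near a linear activation.
        self.values = nn.Parameter(self.knots.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp inputs to the grid and locate the surrounding knot interval.
        xc = x.clamp(self.x_min, self.x_max)
        idx = torch.bucketize(xc, self.knots).clamp(1, len(self.knots) - 1)
        x0, x1 = self.knots[idx - 1], self.knots[idx]
        y0, y1 = self.values[idx - 1], self.values[idx]
        # Linear interpolation between the two learnable knot values.
        t = (xc - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

class SplineMLP(nn.Module):
    """A plain MLP with the fixed activation replaced by a learnable spline."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            LearnableSpline(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```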

Benchmark results of Kolmogorov–Arnold Networks

In simple terms, this study shows that the main difference between KAN and regular neural networks is the special math functions KAN uses. These functions are great for certain math problems, but not so much for other tasks like recognizing images or understanding language. The cool part is, when we give regular neural networks these same special functions, they can do just as well as KAN on all tasks, including the math ones. So, basically, we can make our regular neural networks even better by borrowing KAN's secret sauce.

Do you like having GIFs/images in the "Industry News In 1 Line" section?
