Why self-improving LLM systems will be a big trend

Image created with Microsoft Copilot

Recently I’ve been seeing a raft of research papers and techniques that use large language models (LLMs) to create systems that self-improve. This is a very interesting line of research that has the potential to accelerate advancements in many fields.

The general process is as follows (a minimal code sketch appears after the list):

1- The LLM receives a natural language instruction to come up with a solution to a problem

2- The model generates several hypotheses for the solution

3- The hypotheses are verified through a tool such as a code executor or a math solver

4- The hypotheses with the most promising results are returned to the model along with their outcomes

5- The model reasons over the results and suggests improvements

6- The cycle repeats until the process converges on a quality metric or hits a preset limit
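To make the loop concrete, here is a minimal Python sketch of it. The `call_llm` and `verify` callables, the prompt format, and the thresholds are illustrative assumptions rather than any particular paper's implementation:

```python
# Minimal sketch of the propose -> verify -> refine loop described above.
# `call_llm`, `verify`, and the prompt format are illustrative placeholders.

def self_improvement_loop(task, call_llm, verify, n_hypotheses=8,
                          max_rounds=10, target_score=0.95):
    best_solution, best_score = None, float("-inf")
    feedback = ""  # verification results fed back into the next round

    for _ in range(max_rounds):
        # Steps 1-2: ask the model for several candidate solutions
        prompt = f"Task: {task}\nPrevious results:\n{feedback}\nPropose a solution."
        candidates = [call_llm(prompt) for _ in range(n_hypotheses)]

        # Step 3: verify each candidate with an external tool (code executor, solver, ...)
        scored = [(verify(c), c) for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)

        # Steps 4-5: keep the most promising results and let the model reason over them
        top = scored[:3]
        feedback = "\n".join(f"score={s:.2f}: {c}" for s, c in top)
        if top[0][0] > best_score:
            best_score, best_solution = top[0]

        # Step 6: stop once the quality metric converges or the budget runs out
        if best_score >= target_score:
            break

    return best_solution, best_score
```

In a real system the feedback would carry richer traces (error messages, metrics, partial outputs), but the control flow is essentially this loop.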

This self-reinforcing cycle works for two reasons:

1- Frontier models have been trained on trillions of tokens of text that include common-sense world knowledge, problem-solving, and reasoning. They can use that internalized knowledge to create solutions and reflect on the results. And they are much more versatile than rigid, rule-based problem-solvers that only work within a limited set of constraints.

2- The process is scalable. While LLMs are prone to generating false responses, they can produce many candidate answers to the same problem in a fraction of the time it would take a human to come up with a single hypothesis. When combined with a verification tool such as a code executor, the false answers can quickly be discarded, as illustrated in the sketch below.
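As a toy illustration of that filtering step, the sketch below runs candidate code (the kind of hypotheses an LLM might generate) against unit tests and keeps only the candidates that pass; the candidate strings and tests are invented for the example:

```python
# Illustrative filter: run each LLM-generated candidate against unit tests
# and keep only the ones that pass. The candidates below are toy examples.

def passes_tests(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # define the candidate function
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False                            # crashes count as failures

candidates = [
    "def solve(a, b): return a + b",            # correct
    "def solve(a, b): return a - b",            # wrong: fails the tests
    "def solve(a, b): return a + b +",          # hallucinated syntax error
]
tests = [((1, 2), 3), ((5, 7), 12)]

verified = [c for c in candidates if passes_tests(c, tests)]
print(verified)  # only the correct candidate survives
```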

This type of self-improving LLM-powered process is already generating interesting results.

For example, DrEureka, a technique developed by researchers at UPenn, Nvidia, and UT Austin, uses an LLM to draft multiple candidate reward functions for a robot manipulation task. The results are fed back to the model, which is instructed to reason over them and work out how to improve its proposals. The model not only creates and adjusts the reward function but also produces the configurations that facilitate sim-to-real transfer (handling the differences between the simulation environments where the policies are trained and the noisiness of the real world). According to the paper, this technique produces better reward functions than human-designed ones.

DrEureka (source: GitHub)
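As a rough, hypothetical sketch of this kind of loop (not DrEureka's actual code), the snippet below has an LLM write reward functions as Python source, scores each one with a stubbed-out simulator, and feeds the scores back into the next round's prompt; the prompt, reward signature, and simulator stub are all assumptions:

```python
# Hypothetical sketch of an LLM proposing reward functions that are scored in
# simulation, loosely in the spirit of the DrEureka loop. The simulator stub,
# prompt, and reward signature are invented placeholders, not the paper's code.

def evaluate_in_sim(reward_fn) -> float:
    # Placeholder for a real simulator: roll out a policy trained under the
    # candidate reward and return a task-success score.
    return 0.0

def reward_search(call_llm, task_description, rounds=5, k=4):
    feedback, best_score, best_src = "", float("-inf"), None
    for _ in range(rounds):
        prompt = (f"Write a Python function `reward(state, action)` for: "
                  f"{task_description}\nPrevious attempts and scores:\n{feedback}")
        for _ in range(k):
            src = call_llm(prompt)
            ns: dict = {}
            try:
                exec(src, ns)                    # turn the LLM's code into a callable
                score = evaluate_in_sim(ns["reward"])
            except Exception:
                continue                         # discard candidates that don't run
            feedback += f"\nscore={score:.3f}\n{src}"
            if score > best_score:
                best_score, best_src = score, src
    return best_src, best_score
```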

Another, more recent example is LLM-Squared by Sakana AI. This technique uses an LLM to suggest loss functions; the functions are then tested, and the results are sent back to the model for review and improvement. The researchers at Sakana used this technique to create DiscoPOP, which according to them “achieves state-of-the-art performance across multiple held-out evaluation tasks, outperforming Direct Preference Optimization (DPO) and other existing methods.”

LLM-Squared by Sakana AI
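The same idea can be illustrated at toy scale: propose several loss (or loss-gradient) functions, actually train with each one, and score the results on held-out data. The candidates and data below are made up and far simpler than the preference-optimization losses searched over in the DiscoPOP work:

```python
# Toy illustration of evaluating candidate loss functions by training with
# them; the candidates and data are invented, not DiscoPOP's search space.
import numpy as np

def fit_and_score(loss_grad, x, y, x_val, y_val, lr=0.05, steps=200):
    """Fit y ~ w*x with gradient descent using a candidate loss gradient,
    then report validation MSE (lower is better)."""
    w = 0.0
    for _ in range(steps):
        pred = w * x
        w -= lr * np.mean(loss_grad(pred, y) * x)   # chain rule: dL/dw
    return float(np.mean((w * x_val - y_val) ** 2))

# Candidate "loss gradients" an LLM might propose (toy stand-ins):
candidates = {
    "squared_error": lambda pred, y: 2 * (pred - y),
    "absolute_error": lambda pred, y: np.sign(pred - y),
}

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)
x_val = rng.normal(size=50)
y_val = 3.0 * x_val

scores = {name: fit_and_score(g, x, y, x_val, y_val) for name, g in candidates.items()}
print(scores)  # these scores would be fed back to the LLM for the next round
```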

Both of these techniques follow the loop described above, with some modifications for the specific task, and both have been able to outperform human-designed baselines. This is a powerful combination that is also being used in other fields, such as theorem proving and solving programming challenges.

However, there are limitations to how far this pattern can be pushed. First, the models require well-crafted prompts from humans. For example, in the DrEureka project, the type of instructions included in the initial prompt had a tremendous impact on the solutions that the model proposed (though one can argue that automated prompt-optimization techniques such as OPRO might eventually overcome this limitation).

Second, this pattern can only be applied to problems that have a verification mechanism, such as executing code. Otherwise, the model can easily hallucinate and then build on its own hallucinations, drifting further and further from a valid solution.

Finally, for tasks that require complicated reasoning skills, only frontier models such as GPT-4 can provide reasonable hypotheses. Since every iteration requires creating and verifying dozens or hundreds of hypotheses, inference costs can quickly become a bottleneck (until model costs drop significantly or open-source models catch up with private ones).
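A back-of-envelope calculation shows why. Every number below (hypotheses per round, rounds, tokens per hypothesis, and price per million tokens) is an assumption for illustration, not a measured figure:

```python
# Back-of-envelope inference cost for a self-improvement loop; every number
# below is an assumption for illustration, not a quoted price.
hypotheses_per_round = 100
rounds = 20
tokens_per_hypothesis = 2_000          # prompt + completion, assumed
price_per_million_tokens = 15.0        # USD, assumed frontier-model pricing

total_tokens = hypotheses_per_round * rounds * tokens_per_hypothesis
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ~${cost:,.0f}")   # 4,000,000 tokens -> ~$60
```

Multiply that by longer contexts, more rounds, or many parallel experiments, and the bill grows quickly.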

In any case, while LLMs are nowhere near replacing humans yet, they can become very good aides for searching vast solution spaces. Even with a small budget, a well-designed LLM-powered self-improvement loop can help discover solutions faster than would otherwise be possible. It will be interesting to see how self-improving systems help accelerate AI research in the coming months.
