Why self-improving LLM systems will be a big trend

Image created with Microsoft Copilot

Recently I’ve been seeing a raft of research papers and techniques that use large language models (LLMs) to create systems that self-improve. This is a very interesting line of research that has the potential to accelerate advancements in many fields.

The general process is as follows (a minimal code sketch appears after the list):

1- The LLM receives a natural language instruction to come up with a solution to a problem

2- The model generates several hypotheses for the solution

3- The hypotheses are verified through a tool such as a code executor or a math solver

4- The hypotheses with the most promising results are returned to the model along with their outcomes

5- The model reasons over the results and suggests improvements

6- The cycle repeats until the process converges on a quality metric or hits a preset limit
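To make the loop concrete, here is a minimal Python sketch of it. The `call_llm` and `verify` callables, the prompt format, and the thresholds are illustrative assumptions rather than any particular paper's implementation:

```python
# Minimal sketch of the propose -> verify -> refine loop described above.
# `call_llm`, `verify`, and the prompt format are illustrative placeholders.

def self_improvement_loop(task, call_llm, verify, n_hypotheses=8,
                          max_rounds=10, target_score=0.95):
    best_solution, best_score = None, float("-inf")
    feedback = ""  # verification results fed back into the next round

    for _ in range(max_rounds):
        # Steps 1-2: ask the model for several candidate solutions
        prompt = f"Task: {task}\nPrevious results:\n{feedback}\nPropose a solution."
        candidates = [call_llm(prompt) for _ in range(n_hypotheses)]

        # Step 3: verify each candidate with an external tool (code executor, solver, ...)
        scored = [(verify(c), c) for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)

        # Steps 4-5: keep the most promising results and let the model reason over them
        top = scored[:3]
        feedback = "\n".join(f"score={s:.2f}: {c}" for s, c in top)
        if top[0][0] > best_score:
            best_score, best_solution = top[0]

        # Step 6: stop once the quality metric converges or the budget runs out
        if best_score >= target_score:
            break

    return best_solution, best_score
```

In a real system the feedback would carry richer traces (error messages, metrics, partial outputs), but the control flow is essentially this loop.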

This self-reinforcing cycle works for two reasons:

1- Frontier models have been trained on trillions of tokens of text that include common-sense world knowledge, problem-solving, and reasoning. They can use that internalized knowledge to create solutions and reflect on the results. And they are much more versatile than rigid, rule-based problem-solvers that only work within a limited set of constraints.

2- The process is scalable. While LLMs are prone to generating false responses, they can produce many candidate answers to the same problem in a fraction of the time it would take a human to come up with a single hypothesis. When combined with a verification tool such as a code executor, the false answers can quickly be discarded, as illustrated in the sketch below.
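As a toy illustration of that filtering step, the sketch below runs candidate code (the kind of hypotheses an LLM might generate) against unit tests and keeps only the candidates that pass; the candidate strings and tests are invented for the example:

```python
# Illustrative filter: run each LLM-generated candidate against unit tests
# and keep only the ones that pass. The candidates below are toy examples.

def passes_tests(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # define the candidate function
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False                            # crashes count as failures

candidates = [
    "def solve(a, b): return a + b",            # correct
    "def solve(a, b): return a - b",            # wrong: fails the tests
    "def solve(a, b): return a + b +",          # hallucinated syntax error
]
tests = [((1, 2), 3), ((5, 7), 12)]

verified = [c for c in candidates if passes_tests(c, tests)]
print(verified)  # only the correct candidate survives
```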

This type of self-improving LLM-powered process is already generating interesting results.

For example, DrEureka, a technique developed by researchers at UPenn, Nvidia, and UT Austin, uses an LLM to draft multiple candidate reward functions for a robot manipulation task. The results are fed back to the model, which is instructed to reason over them and work out how to improve its proposals. The model not only creates and adjusts the reward function but also produces the configurations that facilitate sim-to-real transfer (handling the differences between the simulation environments where the policies are trained and the noisiness of the real world). According to the paper, this technique produces better reward functions than human-designed ones.

DrEureka (source: GitHub)
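As a rough, hypothetical sketch of this kind of loop (not DrEureka's actual code), the snippet below has an LLM write reward functions as Python source, scores each one with a stubbed-out simulator, and feeds the scores back into the next round's prompt; the prompt, reward signature, and simulator stub are all assumptions:

```python
# Hypothetical sketch of an LLM proposing reward functions that are scored in
# simulation, loosely in the spirit of the DrEureka loop. The simulator stub,
# prompt, and reward signature are invented placeholders, not the paper's code.

def evaluate_in_sim(reward_fn) -> float:
    # Placeholder for a real simulator: roll out a policy trained under the
    # candidate reward and return a task-success score.
    return 0.0

def reward_search(call_llm, task_description, rounds=5, k=4):
    feedback, best_score, best_src = "", float("-inf"), None
    for _ in range(rounds):
        prompt = (f"Write a Python function `reward(state, action)` for: "
                  f"{task_description}\nPrevious attempts and scores:\n{feedback}")
        for _ in range(k):
            src = call_llm(prompt)
            ns: dict = {}
            try:
                exec(src, ns)                    # turn the LLM's code into a callable
                score = evaluate_in_sim(ns["reward"])
            except Exception:
                continue                         # discard candidates that don't run
            feedback += f"\nscore={score:.3f}\n{src}"
            if score > best_score:
                best_score, best_src = score, src
    return best_src, best_score
```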

Another, more recent example is LLM-Squared by Sakana AI. This technique uses an LLM to suggest loss functions; the functions are then tested, and the results are sent back to the model for review and improvement. The researchers at Sakana used this technique to create DiscoPOP, which according to them “achieves state-of-the-art performance across multiple held-out evaluation tasks, outperforming Direct Preference Optimization (DPO) and other existing methods.”

LLM-Squared by Sakana AI
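The same idea can be illustrated at toy scale: propose several loss (or loss-gradient) functions, actually train with each one, and score the results on held-out data. The candidates and data below are made up and far simpler than the preference-optimization losses searched over in the DiscoPOP work:

```python
# Toy illustration of evaluating candidate loss functions by training with
# them; the candidates and data are invented, not DiscoPOP's search space.
import numpy as np

def fit_and_score(loss_grad, x, y, x_val, y_val, lr=0.05, steps=200):
    """Fit y ~ w*x with gradient descent using a candidate loss gradient,
    then report validation MSE (lower is better)."""
    w = 0.0
    for _ in range(steps):
        pred = w * x
        w -= lr * np.mean(loss_grad(pred, y) * x)   # chain rule: dL/dw
    return float(np.mean((w * x_val - y_val) ** 2))

# Candidate "loss gradients" an LLM might propose (toy stand-ins):
candidates = {
    "squared_error": lambda pred, y: 2 * (pred - y),
    "absolute_error": lambda pred, y: np.sign(pred - y),
}

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)
x_val = rng.normal(size=50)
y_val = 3.0 * x_val

scores = {name: fit_and_score(g, x, y, x_val, y_val) for name, g in candidates.items()}
print(scores)  # these scores would be fed back to the LLM for the next round
```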

Both of these techniques follow the loop described above, with some modifications for the specific task, and both have been able to outperform human-designed baselines. This is a powerful combination that is also being used in other fields, such as theorem proving and solving programming challenges.

However, there are limitations to how far this pattern can be pushed. First, the models require well-crafted prompts from humans. For example, in the DrEureka project, the type of instructions included in the initial prompt had a tremendous impact on the solutions that the model proposed (though one can argue that automated prompt-optimization techniques such as OPRO might eventually overcome this limitation).

Second, this pattern can only be applied to problems that have a verification mechanism, such as executing code. Otherwise, the model can easily hallucinate and then build on its own hallucinations, drifting further and further from a valid solution.

Finally, for tasks that require complicated reasoning skills, only frontier models such as GPT-4 can provide reasonable hypotheses. Since every iteration requires creating and verifying dozens or hundreds of hypotheses, inference costs can quickly become a bottleneck (until model costs drop significantly or open-source models catch up with private ones).
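A back-of-envelope calculation shows why. Every number below (hypotheses per round, rounds, tokens per hypothesis, and price per million tokens) is an assumption for illustration, not a measured figure:

```python
# Back-of-envelope inference cost for a self-improvement loop; every number
# below is an assumption for illustration, not a quoted price.
hypotheses_per_round = 100
rounds = 20
tokens_per_hypothesis = 2_000          # prompt + completion, assumed
price_per_million_tokens = 15.0        # USD, assumed frontier-model pricing

total_tokens = hypotheses_per_round * rounds * tokens_per_hypothesis
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ~${cost:,.0f}")   # 4,000,000 tokens -> ~$60
```

Multiply that by longer contexts, more rounds, or many parallel experiments, and the bill grows quickly.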

In any case, while LLMs are nowhere near replacing humans yet, they can become very good aides for searching vast solution spaces. Even with a small budget, a well-designed LLM-powered self-improvement loop can help discover solutions faster than would otherwise be possible. It will be interesting to see how self-improving systems help accelerate AI research in the coming months.
