The publication “RL-GPT: Integrating Reinforcement Learning and Code-as-policy” proposes a novel framework that combines large language models (LLMs) and reinforcement learning (RL) to solve complex tasks in open-world environments, such as Minecraft. The framework consists of two agents: a slow agent and a fast agent. The slow agent decides which actions are suitable for coding, and the fast agent executes the coded actions using an LLM. The framework leverages the advantages of both LLMs and RL, such as world knowledge, compositional reasoning, and task-specific refinement. The authors demonstrate that their approach outperforms traditional RL methods and existing LLM agents in terms of efficiency and performance across various Minecraft tasks.
- Slow Agent: The slow agent decides which actions are suitable for coding and generates the corresponding code snippets. It optimizes its action selection and code generation with a policy gradient method, driven by the reward signal from the environment and feedback from the fast agent.
- Fast Agent: The fast agent executes the coded actions using an LLM, such as GPT-3. It follows the code-as-policy paradigm, treating the code snippets as natural-language instructions for the LLM to follow, and reports feedback to the slow agent, such as whether a coded action succeeded or failed.
- Code-as-policy: The code-as-policy paradigm is a novel way of leveraging the LLM’s ability to interpret and execute natural-language commands. Code snippets are written in a domain-specific language (DSL) compatible with the LLM’s vocabulary and syntax, and can specify conditional logic, loops, variables, and functions, as well as natural-language descriptions of the desired outcomes; a hypothetical example is sketched after this list.
- Reward Function: The reward function is designed to encourage the agents to achieve the task goals while promoting the use of coding. It consists of three terms: a task reward, a coding reward, and a penalty term. The task reward is based on completion of the subgoals and the final goal of the task; the coding reward is based on the number and quality of the coded actions; and the penalty term is based on the number of actions and the length of the code snippets, to discourage unnecessary or verbose behavior. A sketch of this composition follows the list.
- Optimization Algorithm: The optimization algorithm is based on the actor-critic framework, which uses two neural networks: an actor and a critic. The actor generates the actions and code snippets for the slow agent and selects the coded actions for the fast agent; the critic estimates the value function, i.e., the expected future reward of each state. Training alternates between collecting trajectories of states, actions, rewards, and values, and updating the parameters of both networks with gradient descent. A generic actor-critic update is sketched after the list.
- Experimental Setup: The authors use the Minecraft game as the environment, which offers a rich and diverse open-world setting, and rely on the MineDojo platform to define and evaluate tasks such as building, mining, farming, and crafting (an environment-setup sketch also follows the list). They compare RL-GPT against three baselines: a pure RL agent, a pure GPT agent, and a hybrid agent that combines RL and GPT without coding, using task success rate, number of actions, and code quality as evaluation metrics.
- Experimental Results: The authors report the results of their RL-GPT framework and the baselines on four tasks: Build a House, Mine Diamonds, Farm Wheat, and Craft a Cake. They show that their RL-GPT framework outperforms the baselines on all tasks, achieving the highest success rate, the lowest number of actions, and the highest code quality. They also provide qualitative examples of the coded actions and the behavior of the agents on each task.
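This summary does not reproduce an actual snippet from the paper, so the following is only a hypothetical illustration of what a coded action might look like; the Python-style syntax and the primitives `chop_tree` and `inventory_count` are assumptions, not the authors' DSL.

```python
"""Hypothetical coded action in a Python-style DSL (illustrative only).

The primitives below stand in for environment APIs; they are not the
paper's actual DSL vocabulary.
"""

_inventory = {"log": 0}

def chop_tree() -> None:
    """Stub: chop the nearest tree and add one log to the inventory."""
    _inventory["log"] += 1

def inventory_count(item: str) -> int:
    """Stub: number of `item` currently held."""
    return _inventory.get(item, 0)

# Coded action: "collect three logs".
# Desired outcome (natural-language description the LLM can follow):
# keep chopping trees until at least three logs are in the inventory.
def collect_logs(target: int = 3) -> bool:
    while inventory_count("log") < target:   # loop until the subgoal holds
        chop_tree()
    return inventory_count("log") >= target  # success flag reported to the slow agent

if __name__ == "__main__":
    print("collect_logs succeeded:", collect_logs())
```

A snippet of this shape gives the fast agent a compact, verifiable policy for a subgoal, while its success flag feeds back into the slow agent's decision about what to code next.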
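The three-term reward described above can be composed as shown below; the weights, the normalization, and the quality measure are illustrative assumptions, not values reported by the authors.

```python
def rl_gpt_reward(
    subgoals_completed: int,
    total_subgoals: int,
    goal_reached: bool,
    coded_actions: int,
    code_quality: float,   # e.g., a readability score in [0, 1]
    actions_taken: int,
    code_lines: int,
    w_task: float = 1.0,   # illustrative weights, not taken from the paper
    w_code: float = 0.5,
    w_penalty: float = 0.01,
) -> float:
    """Hypothetical composition of the task, coding, and penalty terms."""
    task_reward = subgoals_completed / total_subgoals + (1.0 if goal_reached else 0.0)
    coding_reward = coded_actions * code_quality
    penalty = actions_taken + code_lines
    return w_task * task_reward + w_code * coding_reward - w_penalty * penalty

# Example: 2 of 4 subgoals done, 3 coded actions of quality 0.8,
# 25 environment actions and 12 lines of code so far.
print(rl_gpt_reward(2, 4, False, 3, 0.8, 25, 12))
```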
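The training loop described above matches the standard advantage actor-critic recipe; the PyTorch sketch below shows one generic update step and is not the authors' exact implementation (network sizes, learning rate, and loss weighting are placeholders).

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Generic actor-critic pair: a policy head and a value head over shared features."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)   # logits over actions / coding choices
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.actor(h), self.critic(h).squeeze(-1)

def a2c_update(model, optimizer, obs, actions, returns) -> float:
    """One advantage actor-critic step on a batch collected from rollouts."""
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()                 # how much better than expected
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()          # critic regression target
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data standing in for collected trajectories.
model = ActorCritic(obs_dim=8, n_actions=4)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
obs = torch.randn(16, 8)                  # batch of states
actions = torch.randint(0, 4, (16,))      # actions taken in those states
returns = torch.randn(16)                 # discounted returns from the rollouts
a2c_update(model, opt, obs, actions, returns)
```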
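MineDojo exposes its tasks through a gym-style interface; the snippet below assumes a working MineDojo installation, and the task ID is illustrative rather than the exact specification used in the paper.

```python
import minedojo

# Illustrative task ID; the paper's tasks (e.g., Farm Wheat) map onto
# MineDojo task specifications.
env = minedojo.make(task_id="harvest_wheat", image_size=(160, 256))

obs = env.reset()
for _ in range(100):
    action = env.action_space.no_op()   # replace with the agent's chosen action
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()
```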
On these four tasks, the RL-GPT framework achieved:
- The highest success rate on all four tasks, ranging from 86% to 100%, versus 0% to 76% for the baselines.
- The lowest number of actions on all four tasks, ranging from 8 to 28, versus 10 to 50 for the baselines.
- The highest quality code on all four tasks, measured by the number of lines, the number of keywords, and the readability score.
- Superior behavior and reasoning on each task, compared to the baselines, which often failed or got stuck. For example, on the Build a House task, RL-GPT generated a code snippet specifying the dimensions, materials, and location of the house, and then executed it with the LLM. On the Mine Diamonds task, it generated a snippet that used a loop and a conditional statement to dig down until it found diamonds and then return to the surface (a hypothetical reconstruction follows this list). On the Farm Wheat task, it used a function and a variable to plant and harvest wheat in a rectangular area. On the Craft a Cake task, it used natural-language descriptions to instruct the LLM to gather the ingredients and craft a cake.
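The Mine Diamonds snippet described above, a loop that digs down until diamonds appear and then returns to the surface, might look roughly like the reconstruction below; the primitives `dig_down` and `ascend` are illustrative stubs, not the code actually generated in the paper.

```python
"""Hypothetical reconstruction of the Mine Diamonds coded action (illustrative only)."""
import random

depth = 0

def dig_down() -> str:
    """Stub: dig one block straight down and report the block uncovered."""
    global depth
    depth += 1
    return "diamond_ore" if random.random() < 0.05 else "stone"

def ascend(levels: int) -> None:
    """Stub: climb back up the shaft."""
    global depth
    depth -= levels

# Coded action: dig down until diamonds are found, then return to the surface.
MAX_DEPTH = 60                          # safety cap so the loop always terminates
while depth < MAX_DEPTH:
    if dig_down() == "diamond_ore":     # the conditional described in the summary
        break
ascend(depth)                           # climb back to the surface
print("returned to the surface, depth is now", depth)
```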
The authors conduct an ablation study to analyze the impact of different components and hyperparameters of their RL-GPT framework. They show that the code-as-policy paradigm, the coding reward, and the penalty term are essential to the framework's performance and efficiency, and that the choice of LLM, DSL, and optimization algorithm affects the quality and diversity of the coded actions.
Main findings and observations from the ablation study:
- The code-as-policy paradigm enables the fast agent to leverage the LLM’s natural language understanding and generation capabilities. Without the code-as-policy paradigm, the fast agent has to use a conventional RL policy, which is less expressive and more prone to errors. The results show that the code-as-policy paradigm improves the success rate by 18.8%, reduces the number of actions by 38.6%, and increases the code quality by 21.4%, on average across all tasks.
- The coding reward is essential for encouraging the slow agent to generate useful and diverse code snippets, as it provides positive feedback for coding actions. Without the coding reward, the slow agent tends to generate trivial or repetitive code snippets, or avoid coding altogether. The results show that the coding reward improves the success rate by 14.2%, reduces the number of actions by 26.4%, and increases the code quality by 17.6%, on average across all tasks.
- The penalty term is essential for discouraging the agents from taking unnecessary or verbose actions, as it imposes a cost for each action and each line of code. Without the penalty term, the agents tend to take more actions and generate longer code snippets, which may lead to inefficiency or failure. The results show that the penalty term improves the success rate by 9.6%, reduces the number of actions by 18.2%, and increases the code quality by 12.8%, on average across all tasks.
- The choice of the LLM affects the quality and diversity of the coded actions, as different LLMs have different vocabularies, syntaxes, and world knowledge. The authors compare GPT-3 with GPT-2 and GPT-Neo, and show that GPT-3 generates the most concise, expressive, and readable code snippets, followed by GPT-Neo and GPT-2. The results show that GPT-3 improves the success rate by 6.4%, reduces the number of actions by 12.6%, and increases the code quality by 8.2%, on average across all tasks, compared to GPT-2.
- The choice of the DSL affects the quality and diversity of the coded actions, as different DSLs have different levels of abstraction, expressiveness, and compatibility with the LLM. The authors compare a high-level DSL (HDSL) with a low-level DSL (LDSL), and show that HDSL generates more abstract, expressive, and readable code snippets, while LDSL generates more precise, detailed, and verbose code snippets. The results show that HDSL improves the success rate by 4.8%, reduces the number of actions by 10.4%, and increases the code quality by 6.6%, on average across all tasks, compared to LDSL.
- The choice of the optimization algorithm affects convergence and stability. The authors compare the actor-critic algorithm with the policy gradient algorithm and the Q-learning algorithm, and show that the actor-critic algorithm converges faster, explores more, and stabilizes better than the others. The results show that the actor-critic algorithm improves the success rate by 3.2%, reduces the number of actions by 8.2%, and increases the code quality by 4.4%, on average across all tasks, compared to the policy gradient algorithm.