Refining LLM Decisions : RLFT with CoT Reasoning
T. Schmied, J. Bornschein, J. Grau-Moya, M. Wulfmeier and R. Pascanu

The success of large language models (LLMs) has sparked a lot of interest in building agentic applications around them. The idea is that, with their common sense and ability to reason step by step (thanks to Chain-of-Thought reasoning), LLMs should be able to handle complex problems well. In practice, however, these agents often fall short: they don't explore options effectively and struggle with the "knowing-doing gap", where they know the right thing but can't always act on it. In essence, LLMs reason well but perform poorly in decision-making. In this article, I explore and review Reinforcement Learning Fine-Tuning (RLFT) with CoT reasoning for better decision-making in LLMs, as presented in the paper "LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities".

Key Shortcomings of LLMs In Decision-making Scenarios        

As we know, LLM agents often struggle with sub-optimal exploration and the "knowing-doing" gap - that is, the inability to translate their knowledge into effective actions. In particular, three prevalent failure modes have been studied: greediness, frequency bias, and the knowing-doing gap.

Three Failure Modes:

1. Greediness: These models tend to latch onto promising-looking options prematurely, abandoning the search for potentially better alternatives.

  • Underlying Cause: Their training prioritizes generating likely continuations, not encouraging experimentation. As a result, they often double down on early rewards without considering less obvious paths.

2. Frequency Bias: LLMs disproportionately choose actions that are commonly seen in their training corpus or mirrored in the current input - even if those choices are poor.

  • Underlying Cause: Exposure to vast amounts of text fosters a tendency to prefer frequently observed behaviors. This issue is more pronounced in smaller-scale models like Gemma2 2B.

3. Knowing-doing Gap: Models may articulate the correct logic but still fail to follow through with the appropriate action.

  • Underlying Cause: Pre-training doesn't ensure that reasoning translates into action. Models can mimic sound reasoning without consistently incorporating it into their decision-making processes. This creates a divide between what the model "knows" and what it does.

The real root cause of these failures is that pre-training optimizes for next-token prediction; it does not incentivize exploration or the quality of actions.

How LLMs Are Evaluated On Their Decision-making        

LLMs are evaluated on their decision-making performance using test environments or benchmark tasks like multi-armed bandits, contextual bandits, and Tic-tac-toe. These tasks serve as benchmarks to test how effectively the models can learn and act, especially after being fine-tuned through reinforcement learning.


1. Multi-Armed Bandits

  • What it is: A simplified decision-making problem where an agent must choose between multiple options (or "arms") repeatedly, each with an unknown reward probability.
  • Relevance here: Tests the LLM's ability to explore vs. exploit - whether it keeps trying different actions to learn more or sticks with what seems best. It's a classic example of balancing exploration and exploitation.
  • What is a Multi-Armed Bandit Problem: You are faced with several options (arms), and each time you choose one: (i) You get a reward, but you don't know in advance how good that reward will be. (ii) Over time, you want to maximize your total reward.

The challenge is in balancing:

  • Exploration: trying different arms to learn more about them.
  • Exploitation: sticking with the best arm you’ve found so far (a minimal simulation of this trade-off follows the figure below).

Figure: One-Armed Bandit (slot machine) vs Multi-Armed Bandit
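To make the trade-off concrete, here is a minimal, self-contained Python sketch (not from the paper) of a Bernoulli bandit and a purely greedy agent; the arm probabilities and horizon are illustrative assumptions. Notice how the greedy agent locks onto the first arm and never discovers the better ones - the same greediness failure mode described earlier.

```python
import random

class BernoulliBandit:
    """K-armed bandit; each arm pays 1 with an unknown, fixed probability."""
    def __init__(self, probs):
        self.probs = probs                      # true (hidden) success probabilities

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

def greedy_agent(bandit, steps=100):
    k = len(bandit.probs)
    counts = [0] * k                            # times each arm was pulled
    values = [0.0] * k                          # running mean reward per arm
    total = 0
    for _ in range(steps):
        # Greedy: always pick the arm with the highest current estimate (ties -> arm 0).
        arm = max(range(k), key=lambda a: values[a])
        r = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update
        total += r
    return total, counts

random.seed(0)
bandit = BernoulliBandit([0.2, 0.5, 0.8])       # arm 2 is actually the best
reward, counts = greedy_agent(bandit)
print("total reward:", reward, "| pull counts per arm:", counts)
# The greedy agent stays locked on arm 0 (ties and early payoffs keep it there),
# so it never even samples arm 2, the truly best option.
```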

2. Contextual Bandits

  • What it is: A more advanced version of the multi-armed bandit where, before choosing an arm, the agent is given some contextual information (like user features or state of the environment).
  • Relevance here: Tests whether LLMs can use external context to inform better decisions. This is crucial for many real-world applications where decisions depend on current conditions.


3. Tic-tac-toe

  • What it is: A simple turn-based game where two players place Xs and Os on a 3x3 grid aiming to get three in a row.
  • Relevance here: A sequential, strategic environment where each move depends on the current board state. It tests more complex planning, reasoning, and the ability to act on knowledge of the game rules and tactics.

Revisiting Reinforcement Learning

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to achieve a goal. The agent receives feedback in the form of rewards for its actions and aims to maximize the cumulative reward over time.

  • The agent observes the environment’s state.
  • It takes actions based on a policy (its decision-making strategy).
  • It receives rewards as feedback.
  • Over time, it learns an optimal policy using algorithms like Q-learning or policy gradients.

RL is widely used in robotics, gaming, and decision systems.
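As a quick refresher on how such a policy is learned, here is a minimal tabular Q-learning sketch on a toy chain environment (the environment, rewards, and hyperparameters are illustrative, not tied to the paper):

```python
import random
from collections import defaultdict

# Toy chain environment: states 0..4, actions 0 (left) / 1 (right).
# Reaching state 4 gives reward 1 and ends the episode.
def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

Q = defaultdict(float)                    # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.1, 0.9, 0.2     # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = max([0, 1], key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: bootstrap from the best next action
        best_next = max(Q[(next_state, 0)], Q[(next_state, 1)])
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print("Greedy action per state:", {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(4)})
# After training, the greedy policy should choose "right" (action 1) in every state.
```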



Exploration in Reinforcement Learning (RL) and LLMs        

1. Balancing Exploration and Exploitation: A core issue in RL is managing the trade-off between exploring new strategies and exploiting known rewards - a tension crucial for effective decision-making.

2. Established RL Exploration Techniques: Traditional RL promotes exploration using mechanisms like stochastic action selection, tracking state visitation, intrinsic curiosity signals, behavioral priors, and entropy-maximizing policies.

  • Stochastic Action Selection: This involves choosing actions randomly based on a probability distribution, rather than deterministically. It helps maintain exploration in reinforcement learning by avoiding repetitive actions.
  • Tracking State Visitation: This refers to monitoring how often each state is visited during training. It’s useful for guiding exploration to under-explored states and improving the learning process.
  • Intrinsic Curiosity Signals: These are internal rewards generated by the agent for exploring novel or uncertain states. They drive the agent to explore environments that are less understood, encouraging more diverse experiences.
  • Behavioral Priors: Behavioral priors represent pre-existing knowledge or assumptions about how an agent should act in certain situations. They help guide learning, particularly in complex environments with limited data.
  • Entropy-Maximizing Policies: These policies aim to maximize the entropy (uncertainty) of the action distribution, promoting exploration and preventing the agent from becoming too certain of its actions too early in the learning process.

3. LLMs Lack Built-In Exploration Incentives: When deployed as policy agents, LLMs tend to avoid exploration, preferring predictable outputs. This stems from their training objective - predicting the next likely token - which doesn't incentivize trying novel or uncertain actions.

4. Exploration Deficiency Hurts LLM Decision-Making: This lack of exploration leads to rigid decision behavior in LLMs, reducing their adaptability and effectiveness in tasks that demand learning or strategic experimentation.

Classical Techniques to Improve LLMs' Decision-making Performance        

Let us explore both classical (e.g., 𝜖-greedy) and LLM-specific (e.g., self-correction, self-consistency) techniques to improve LLMs' decision-making performance.

𝜖-greedy Policy (Strategy)

  • Purpose: Balances exploration (trying new actions) with exploitation (choosing the best-known action so far).
  • How it works: With probability 𝜖, the agent picks a random action (explore). With probability 1 − 𝜖, it picks the best-known action (exploit), based on its current knowledge (e.g., Q-values).
  • 𝜖 Value: 𝜖 is often decayed over time, starting high (to explore more early) and reducing as learning stabilizes.
  • Use Case: Common in multi-armed bandits and Q-learning, especially where the agent needs to learn optimal actions without full knowledge of the environment.
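Below is a minimal Python sketch of an 𝜖-greedy agent with a decaying 𝜖 on a Bernoulli bandit; the arm probabilities, decay schedule, and horizon are illustrative assumptions:

```python
import random

def epsilon_greedy_bandit(probs, steps=1000, eps_start=1.0, eps_end=0.05, decay=0.995):
    k = len(probs)
    counts = [0] * k                 # pulls per arm
    values = [0.0] * k               # estimated mean reward per arm (like Q-values)
    eps, total = eps_start, 0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(k)                     # explore: random arm
        else:
            arm = max(range(k), key=lambda a: values[a])  # exploit: best estimate
        reward = 1 if random.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
        eps = max(eps_end, eps * decay)                   # decay epsilon over time
    return total, counts, values

random.seed(0)
total, counts, values = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("total reward:", total)
print("pull counts:", counts)        # most pulls should concentrate on the best arm
print("estimates  :", [round(v, 2) for v in values])
```

Unlike the purely greedy agent shown earlier, this one keeps a small amount of randomness, so it eventually discovers and then exploits the best arm.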


Here's how each traditional RL exploration mechanism relates to 𝜖-greedy:

  • Stochastic Action Selection: Like 𝜖-greedy, this introduces randomness into action choices. While 𝜖-greedy explicitly mixes random and greedy choices, stochastic policies (e.g., softmax) select actions probabilistically based on their estimated values.
  • Tracking State Visitation: 𝜖-greedy doesn't track state visitation directly, but state visitation counts can inform adaptive 𝜖 values (e.g., lowering 𝜖 in well-explored states), making exploration more efficient.
  • Intrinsic Curiosity Signals: These go beyond 𝜖-greedy by rewarding novelty or prediction errors to guide exploration. 𝜖-greedy is uninformed - it explores randomly - whereas curiosity-driven methods explore with purpose.
  • Behavioral Priors: Behavioral priors bias exploration toward likely useful actions based on prior knowledge or demonstrations. 𝜖-greedy ignores such priors and selects actions purely based on estimated rewards and randomness.
  • Entropy-Maximizing Policies: These encourage the agent to maintain a high-entropy (diverse) policy, naturally promoting exploration. 𝜖-greedy approximates this with a fixed level of randomness, but doesn't maximize entropy in a principled way.

In short: 𝜖-greedy is a simple, static form of exploration. The other mechanisms are more informed, adaptive, or principled ways to achieve the same goal - better exploration to improve learning efficiency.

LLM Specific Techniques to Improve Their Decision-making Performance        

To improve decision-making performance in large language models (LLMs), several LLM-specific techniques have been developed. These techniques aim to enhance reasoning, reduce hallucinations, and increase reliability. Here are key methods, particularly focusing on self-correction and self-consistency:

1. Self-Correction Techniques

These methods allow LLMs to revise their own outputs, identifying and correcting errors.

  • Reflexion: The model reflects on failures from prior attempts and incorporates feedback to improve performance in subsequent trials.
  • Self-Refinement: An LLM generates an initial response, critiques it, and then uses that critique to produce a better answer. Often a two-step process: Step 1: Generate answer. Step 2: Generate a critique of the answer, then revise the original.
  • Verifier-Critic Approaches: A verifier model is trained (or prompted) to evaluate outputs and suggest corrections, either as a separate model or using the same LLM in a new prompt context.

2. Self-Consistency

This technique leverages the probabilistic nature of LLMs by sampling multiple reasoning paths and selecting the most frequent (or best) outcome.

Self-Consistency with Chain-of-Thought (CoT):

  • Sample multiple diverse reasoning paths using temperature sampling.
  • Aggregate answers (e.g., majority vote) to pick the most consistent solution.
  • Improves accuracy especially in math and logic problems compared to greedy decoding.
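A minimal sketch of the majority-vote step is shown below; sample_answer is a hypothetical stand-in for a temperature-sampled LLM call, stubbed with canned outputs so the snippet runs on its own:

```python
import random
from collections import Counter

def sample_answer(prompt, temperature=0.8):
    """Stand-in for a temperature-sampled LLM call that returns a final answer.
    Canned outputs keep the sketch self-contained and runnable."""
    return random.choice(["42", "42", "42", "41"])   # noisy, but mostly correct

def self_consistency(prompt, n_samples=9):
    # Sample several independent reasoning paths, keep only their final answers,
    # and return the most frequent one (majority vote).
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes, answers

random.seed(1)
answer, votes, answers = self_consistency("Q: ... Let's think step by step.")
print(f"samples={answers} -> majority answer={answer} ({votes}/{len(answers)} votes)")
```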

3. Chain-of-Thought (CoT) Prompting

Encourages the model to reason step-by-step, leading to more accurate decisions.

  • Zero-shot CoT: Use prompts like "Let's think step by step."
  • Few-shot CoT: Provide examples of step-by-step reasoning in the prompt.
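For concreteness, here is what the two prompt styles can look like (the task and wording are illustrative, not taken from the paper):

```python
# Zero-shot CoT: simply append a reasoning trigger to the question.
zero_shot_cot = (
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: Let's think step by step."
)

# Few-shot CoT: show a worked example with explicit intermediate steps first.
few_shot_cot = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Speed = distance / time = 60 / 1.5 = 40 km/h. The answer is 40 km/h.\n\n"
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A:"
)
```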

4. ReAct (Reasoning + Acting)

Combines reasoning traces with tool use (e.g., external calculators, search engines). The model interleaves reasoning with actions to verify intermediate steps, improving accuracy and factuality.

5. Tree of Thought (ToT)

Rather than a single linear chain, ToT explores multiple reasoning branches and performs lookahead and backtracking. It's more robust for complex decision-making tasks (e.g., planning or puzzle solving).

6. Debate or Self-Play

Models argue different perspectives with each other (or with themselves) to uncover flaws or weaknesses in reasoning, simulating a dialectic process.

7. Confidence Calibration / Output Scoring

Train or prompt the model to assign confidence scores to its outputs or steps. Helps downstream processes decide whether to accept or request revision.


What is Chain of Thought (CoT) Reasoning        

Chain-of-Thought (CoT) reasoning is a technique in language models where the model explicitly generates intermediate reasoning steps before producing a final answer or action. Instead of jumping directly to a response, the model "thinks out loud" in a structured way. In essence, it involves:

  • Step-by-step reasoning: Breaks down complex tasks into logical sub-steps.
  • Transparency: Makes the model’s thought process interpretable.
  • Improved accuracy: Helps avoid premature or biased conclusions by promoting deeper analysis.


What is Reinforcement Learning from Human Feedback (RLHF)        

A key method for fine-tuning LLMs is Reinforcement Learning from Human Feedback (RLHF). RLHF is used to align LLMs with what humans consider good or helpful responses, rather than having them simply predict the next token based on statistical likelihood.

How it works:

  • Pre-trained LLM: Starts from a model that has already been trained on large datasets (unsupervised learning).
  • Human Feedback: Humans (or other models) evaluate multiple model outputs and rank or score them based on quality, helpfulness, or alignment with instructions.
  • Reward Model (rₚ): A separate model is trained to learn the scoring pattern from the human feedback - this model predicts how good an output is.
  • Policy Optimization: Using reinforcement learning (often Proximal Policy Optimization, PPO), the original LLM is updated to generate outputs that receive higher predicted rewards - i.e., outputs humans would prefer - while still staying close to the original model (this is important to avoid destabilizing the model).


Key components:

  • KL Divergence: Measures how much the current policy πθ diverges from the reference model πref. This ensures the model doesn't change too drastically.
  • Reward Model: Guides the model towards producing more human-preferred outputs by using feedback.

This objective is designed to maximize the expected reward (based on human preferences) while minimizing deviation from the original model to avoid unwanted catastrophic changes.
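Putting these components together, the RLHF training objective is commonly written as follows, where r is the learned reward model and β controls the strength of the KL penalty:

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta \, D_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

The first term pushes the policy πθ toward outputs the reward model scores highly; the KL term keeps it close to the reference model πref.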

Experimental Setup & Findings        

1. Experimental Setup: Environments & Baselines

The study evaluates large language models (LLMs) in structured decision-making settings using three core environments:

  • Multi-Armed Bandits (MABs): Each action has a fixed but unknown reward distribution; agents must balance exploration and exploitation.
  • Contextual Bandits (CBs): Similar to MABs but with an added input (context) that influences the optimal action, introducing a conditional decision layer.
  • Tic-Tac-Toe: A deterministic game requiring sequential planning and strategy beyond single-step rewards.

Comparative Baselines:

  • Random: Purely stochastic action selection.
  • UCB (Upper Confidence Bound): Classical exploration-based algorithm used as a strong traditional baseline (its selection rule is shown below).
  • In-Context Learning (ICL): Prompted demonstrations, optionally augmented with Chain-of-Thought (CoT) reasoning.
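For reference, the classic UCB1 rule selects at step t the arm with the highest optimistic estimate - its empirical mean plus a bonus that shrinks as the arm is pulled more often:

```latex
a_t \;=\; \arg\max_{a} \left[ \hat{\mu}_a \;+\; \sqrt{\frac{2 \ln t}{N_a(t)}} \right]
```

Here \hat{\mu}_a is the empirical mean reward of arm a and N_a(t) is the number of times arm a has been pulled so far.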

Models Evaluated:

  • Gemma2 models at three scales: 2B, 9B, and 27B parameters.


2. Diagnosing LLM Failures in Decision-Making

The paper identifies several consistent weaknesses in how LLMs handle decision-based tasks:

  • Early Lock-In (Greedy Behavior): LLMs often fixate on the first high-rewarding action encountered, failing to probe alternatives - especially problematic in stochastic or deceptive reward settings.
  • Training-Induced Frequency Bias: Actions that appear more often in training data or demonstrations are favored, regardless of observed performance - a side-effect of exposure bias.
  • Knowing-Doing Gap: Even when the model articulates the correct reasoning path, it frequently fails to execute the aligned action. This highlights a misalignment between verbal reasoning and practical output.


3. Exploration Techniques and Their Effectiveness

The authors experimented with several mechanisms to promote exploration:

  • Try-All Strategy: Forces the model to sample every action at least once early in an episode. This simple method produced the most consistent performance gains.
  • Exploration Bonuses: Introduces intrinsic rewards for under-explored actions, promoting novelty-seeking behavior. Particularly effective when paired with RLFT.


Other strategies (with less consistent impact):

  • ε-Greedy Sampling: Injects random action noise with fixed probability - sometimes beneficial, but often too coarse.
  • Context Randomization: Varies prompts to encourage generalization, but results varied.
  • Self-Correction / Self-Consistency: Techniques intended to align output with reasoning by revisiting decisions or averaging across multiple completions. These showed marginal improvement.


4. Ablation Studies: Dissecting What Matters

  • RLFT Improves Sequential Play: In Tic-Tac-Toe, RLFT-trained models outperformed all baselines, particularly in minimizing mistakes over full episodes.
  • More Thinking Tokens Helps: Allowing longer generation windows for reasoning led to more thoughtful and accurate actions, reflecting better internal deliberation.
  • Chain-of-Thought is Crucial: Removing CoT reasoning degraded performance, confirming its importance for complex, multi-step decisions.
  • Supervised Fine-Tuning (SFT): SFT on expert data yields gains, but lacks adaptability in stochastic or evolving settings where feedback-based tuning (RLFT) excels.


Reinforcement Learning Fine-Tuning (RLFT) With Chain-of-Thought (CoT) Reasoning: Observations


Figure: Reinforcement Learning Fine-Tuning (RLFT) pipeline

As part of their experiments, the authors fine-tune a pre-trained LLM 𝜋𝜃 on environment rewards, using self-generated Chain-of-Thought (CoT) rationales.

Here are the salient points of the paper.

Reward-Coupled Reasoning

  • The model generates both reasoning traces (CoT) and corresponding actions.
  • The reward depends solely on the quality and validity of the resulting actions.
  • Although the reward depends only on the action, the model learns that good reasoning (CoT) leads to better actions.
  • CoT becomes a tool for implicit credit assignment: it helps explain and improve why an action was taken.
  • It is similar to RLHF but uses environment rewards instead of human feedback.

Importance of Chain-of-Thought (CoT) Reasoning: CoT prompts the model to think through its decisions before taking action. In the absence of CoT, the model often:

  • Falls into repeating past actions due to frequency bias.
  • Rushes into suboptimal actions by committing too early and exhibits greediness.

RLFT with CoT Reasoning: It trains the model to:

  • Engage in broader exploration.
  • Ensure its reasoning directly aligns with and supports its actions.

Decoupling Reasoning and Acting

By modeling reasoning and action separately (π(c), π(a∣c)), the agent:

  • Avoids overfitting to short-term rewards.
  • Can reuse good reasoning traces across similar situations.

Training Strategy

  • Use PPO-style policy gradient updates on πθ(c,a), even though only the action receives direct reward.
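To make the reward-coupling concrete, here is a hedged Python sketch; the "Action: <name>" tag format, the helper names, and the invalid-action penalty are illustrative assumptions, not the paper's implementation. The model generates a CoT followed by an action; only the parsed action determines the environment reward, but the policy-gradient update covers the whole generated sequence.

```python
import re

def extract_action(completion, valid_actions):
    """Parse the final action from a generated 'CoT + action' text.
    Assumes the model was instructed to end with 'Action: <name>' (illustrative format)."""
    match = re.search(r"Action:\s*(\w+)\s*$", completion.strip())
    if match and match.group(1) in valid_actions:
        return match.group(1)
    return None

def rollout_reward(completion, env_step, valid_actions, invalid_penalty=-1.0):
    """Reward for one generated turn: the environment scores the action only;
    the CoT tokens get no separate reward but share the same return in the
    PPO-style update over the full sequence."""
    action = extract_action(completion, valid_actions)
    if action is None:
        return invalid_penalty          # shaping penalty for unparseable/illegal actions
    return env_step(action)             # environment reward for the chosen action

# Illustrative usage with a stub environment:
valid = {"arm_0", "arm_1", "arm_2"}
env_step = lambda a: 1.0 if a == "arm_2" else 0.0
completion = "The first two arms paid poorly, so I should try the last one.\nAction: arm_2"
print(rollout_reward(completion, env_step, valid))   # -> 1.0
```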

Why This Matters

  • Classic RL may conflate thought and action, leading to brittle behaviors.
  • This setup allows the model to learn general-purpose reasoning patterns that generalize across tasks.

Objective Function

  • The objective function for RLFT with Chain-of-Thought (CoT) in the paper is centered around maximizing environment rewards while integrating reasoning steps into the training process.
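As a rough sketch (not the paper's exact formulation, which uses a clipped PPO-style update), the objective has the same KL-regularized shape as the RLHF objective above, with the environment reward replacing the learned reward model:

```latex
\max_{\theta} \;
\mathbb{E}_{s \sim \mathcal{E},\; (c, a) \sim \pi_{\theta}(\cdot \mid s)}
\big[\, R(s, a) \,\big]
\;-\;
\beta \, D_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big)
```

Here s is the environment observation, c the generated CoT rationale, a the extracted action, and R(s, a) the environment reward, which depends only on the action.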

Reinforcement Learning Fine-Tuning (RLFT) Outcomes

The RLFT method, which fine-tunes models using reward-based feedback signals, yields significant gains:

1. Substantial Regret Reduction: Across both MABs and CBs, RLFT-trained models demonstrate lower cumulative regret, indicating better decision efficiency over time.

In decision-making and reinforcement learning, regret measures how much reward an agent missed out on by not always taking the best possible action. It’s formally the difference between the reward the agent could have obtained (if it had perfect knowledge of the environment) and what it actually obtained.

  • Lower regret = better performance, because it means the model is making smarter choices over time.
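Formally, the (expected) cumulative regret over T steps is:

```latex
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \big( \mu^{*} - \mu_{a_t} \big)
```

where \mu^{*} is the expected reward of the best available action and \mu_{a_t} is the expected reward of the action chosen at step t (in contextual bandits, both are conditioned on the current context).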

Why Is Regret Important for MABs and CBs?

  • In Multi-Armed Bandits (MABs), each action (arm) gives a random reward. The agent must explore different actions to learn which one is best but also exploit known good ones to accumulate reward. Poor exploration leads to high regret because the agent may never discover better options.
  • In Contextual Bandits (CBs), the problem is harder: the best action changes depending on the context. So the agent must learn not just which actions are good, but which are good under specific conditions. This makes intelligent exploration even more crucial.

Figure: Effect of exploration mechanisms on action coverage and cumulative regret

The authors find that the simple try-all strategy, which reduces the need for additional exploration by trying all actions, results in the biggest performance improvements. Gemma2 27B almost closes the gap to the optimal UCB agent. This suggests that, when given sufficient information about the (sub-)optimality of actions, LLMs are able to select actions accordingly, underscoring their exploration shortcomings. Second, they observe that RLFT lowers regret and improves exploration across different exploration mechanisms. Most importantly, a simple exploration bonus (+1 reward for untried actions during RLFT) significantly increases exploration (50% → 70%) and lowers regret towards the expert compared to regular RLFT. This highlights the importance of reward shaping when fine-tuning LLMs for decision-making scenarios, to elicit a desired behavior.
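A minimal sketch of this kind of reward shaping (the +1 bonus is the value mentioned above; the function and episode bookkeeping are illustrative):

```python
def shaped_reward(env_reward, action, tried_actions, bonus=1.0):
    """Add an exploration bonus the first time an action is tried in an episode.
    Used only during fine-tuning to encourage coverage of the action space."""
    extra = bonus if action not in tried_actions else 0.0
    tried_actions.add(action)
    return env_reward + extra

tried = set()
print(shaped_reward(0.0, "arm_1", tried))   # 1.0 -> first try of arm_1 earns the bonus
print(shaped_reward(0.0, "arm_1", tried))   # 0.0 -> no bonus on repeats
```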


2. Broader Action Exploration: RLFT-trained models exhibit a significantly wider distribution of chosen actions, especially early in episodes. This behavior contrasts with baseline LLMs that often latch onto one seemingly good action too early - a form of greediness that hinders discovery. By incentivizing coverage across the action space, RLFT allows the model to gather more diverse reward signals. This richer exploration not only improves learning efficiency but also leads to more robust long-term strategies, especially in stochastic or deceptive environments. The result is more balanced exploration-exploitation behavior over time.


3. Mitigated Frequency Dependence: Pretrained LLMs often favor actions that appeared more frequently in demonstrations, regardless of actual rewards. RLFT reduces this bias by shifting the model’s behavior toward outcomes observed during interaction, not static token frequency. As a result, action choices become more reward-sensitive and less tied to prior patterns. The model learns to override familiar but suboptimal actions when feedback suggests better alternatives. This correction is crucial in tasks where high-reward actions are rare or counterintuitive.


4. Improved Rationale-Action Alignment: Baseline LLMs often generate correct reasoning but fail to act accordingly - a gap between what they know and what they do. RLFT helps close this gap by tuning the model on full trajectories where both rationale and action contribute to reward. This joint optimization improves coherence between explanation and behavior. After training, the model not only verbalizes the right strategy more often but is also more likely to execute it. The result is a tighter link between thought and action, especially in multi-step tasks.

Figure: How rationale-action alignment improves across models

The bar chart above shows how rationale-action alignment improves across models:

  • ICL models generate correct rationales often (87%) but follow through only 21% of the time.
  • SFT improves alignment moderately.
  • RLFT significantly boosts the rate of actions that match the model's own reasoning, reaching 78%.

This demonstrates how RLFT narrows the "knowing-doing gap."

Here’s a comparative performance table of the different models across multiple metrics:

Table: Models vs Metrics

This highlights RLFT's superiority in exploration efficiency, decision quality, and reasoning-action coherence.

Conclusion of the Paper        

Problem Identified: Large Language Models (LLMs) often exhibit reasoning issues such as greedy decoding, frequency bias (preferring common patterns over correct ones), and a gap between knowing and acting (the knowing-doing gap).

Proposed Method: The authors apply Reinforcement Learning Fine-Tuning (RLFT), fine-tuning LLMs on self-generated Chain-of-Thought (CoT) rationales using environment rewards rather than human feedback.

Key Benefits

  • Encourages the model to explore diverse reasoning paths.
  • Reduces harmful biases and aligns decisions more closely with accurate reasoning.
  • Enhances performance on decision-making tasks such as multi-armed bandits, contextual bandits, and Tic-tac-toe.

Experimental Findings: While exploration is not as robust as in standard RL methods, it improves significantly through reward shaping and targeted exploration strategies.

Limitations: Evaluated only on short-horizon tasks and small-to-medium-scale models, so scalability and long-range reasoning remain untested.

In summary, the paper explores why large language models (LLMs) underperform in decision-making tasks, focusing on three main issues: greediness, frequency bias, and the knowing-doing gap. To address these, the authors apply reinforcement learning fine-tuning (RLFT) on chain-of-thought (CoT) rationales, which improves performance by enhancing reasoning quality and decision alignment. They compare two behavioral cloning approaches - expert behavior cloning and thought cloning - using datasets with and without CoT, finding both effective in mimicking expert policies. Additional “thinking” time, via increased generation budgets, significantly improves outcomes, but also leads to high computational costs. While LLMs improve with these methods, exploration remains subpar compared to classic algorithms, prompting experiments with strategies like epsilon-greedy and self-consistency. The work highlights the importance of reward shaping and sufficient generation capacity, especially in multi-step, high-stakes scenarios. Limitations include testing on only short-horizon environments and smaller models, suggesting future research should explore scalability, long-horizon tasks, and more efficient model architectures for decision-making.

