Refining LLM Decisions: RLFT with CoT Reasoning
The success of large language models (LLMs) has sparked strong interest in building agentic applications around them. The expectation is that, with their common-sense knowledge and ability to reason step by step (thanks to Chain-of-Thought reasoning), LLMs should be able to handle complex problems well. In practice, however, these agents often fall short: they do not explore options effectively and they suffer from the "knowing-doing gap," where they know the right thing to do but fail to act on it. In short, LLMs reason well but make poor decisions. In this article, I explore and review Reinforcement Learning Fine-Tuning (RLFT) with a CoT approach for better decision-making in LLMs, as presented in the paper "LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities".
Key Shortcomings of LLMs In Decision-making Scenarios
As we know, LLM agents often struggle with sub-optimal exploration and the "knowing-doing" gap - that is, the inability to translate their knowledge into effective actions. In particular, three prevalent failure modes have been studied: greediness, frequency bias, and the knowing-doing gap.
Three Failure Modes:
1. Greediness: These models tend to latch onto promising-looking options prematurely, abandoning the search for potentially better alternatives.
2. Frequency Bias: LLMs disproportionately choose actions that are commonly seen in their training corpus or mirrored in the current input - even if those choices are poor.
3. Knowing-doing Gap: Models may articulate the correct logic but still fail to follow through with the appropriate action.
The real root cause of these failures is that pre-training optimizes for next-token prediction; it does not incentivize exploration or the quality of the actions taken.
How LLMs Are Evaluated On Their Decision-making
LLMs are evaluated on their decision-making performance using test environments or benchmark tasks like multi-armed bandits, contextual bandits, and Tic-tac-toe. These tasks serve as benchmarks to test how effectively the models can learn and act, especially after being fine-tuned through reinforcement learning.
1. Multi-Armed Bandits (MABs): The agent repeatedly chooses one of several "arms," each with an unknown reward distribution. The challenge is in balancing exploration (trying arms to learn how good they are) and exploitation (pulling the arm that currently looks best).
2. Contextual Bandits (CBs): The same setting, except the agent also observes a context (for example, user features) at each step, and the best arm depends on that context.
3. Tic-tac-toe: A simple multi-step, adversarial game used to test sequential decision-making beyond single-step bandit choices. A minimal sketch of the first two environments follows this list.
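To make these settings concrete, here is a minimal Python sketch of the first two environments (purely illustrative; the class and variable names are mine, not the paper's):

import random

class BernoulliBandit:
    # k-armed bandit: pulling arm i pays 1 with hidden probability probs[i], else 0.
    def __init__(self, probs):
        self.probs = probs

    def pull(self, arm):
        return 1.0 if random.random() < self.probs[arm] else 0.0

class ContextualBandit:
    # Same idea, but the best arm depends on an observed context.
    def __init__(self, probs_by_context):
        self.probs_by_context = probs_by_context   # dict: context -> list of arm probabilities

    def observe(self):
        return random.choice(list(self.probs_by_context))   # sample a context (a dict key)

    def pull(self, context, arm):
        return 1.0 if random.random() < self.probs_by_context[context][arm] else 0.0

For example, BernoulliBandit([0.2, 0.5, 0.8]) has three arms, and a good agent should quickly learn to favor the third one.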
Revisiting Reinforcement Learning
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to achieve a goal. The agent receives feedback in the form of rewards for its actions and aims to maximize the cumulative reward over time.
RL is widely used in robotics, gaming, and decision systems.
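As a quick refresher, the agent-environment loop behind this paradigm fits in a few lines of Python (a generic sketch; env and agent are assumed to expose reset/step and act/update methods, which is a common but not universal interface):

def run_episode(env, agent, max_steps=100):
    # Generic RL loop: act, observe reward and next state, learn, repeat.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                        # choose an action
        next_state, reward, done = env.step(action)      # environment responds
        agent.update(state, action, reward, next_state)  # learn from the feedback
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward   # the agent's goal is to maximize this cumulative reward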
Exploration in Reinforcement Learning (RL) and LLMs
1. Balancing Exploration and Exploitation: A core issue in RL is managing the trade-off between exploring new strategies and exploiting known rewards - a tension crucial for effective decision-making.
2. Established RL Exploration Techniques: Traditional RL promotes exploration using mechanisms like stochastic action selection, tracking state visitation, intrinsic curiosity signals, behavioral priors, and entropy-maximizing policies.
3. LLMs Lack Built-In Exploration Incentives: When deployed as policy agents, LLMs tend to avoid exploration, preferring predictable outputs. This stems from their training objective - predicting the next likely token - which doesn't incentivize trying novel or uncertain actions.
4. Exploration Deficiency Hurts LLM Decision-Making: This lack of exploration leads to rigid decision behavior in LLMs, reducing their adaptability and effectiveness in tasks that demand learning or strategic experimentation (a small code sketch of one such exploration mechanism follows this list).
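As one concrete example, the "stochastic action selection" mechanism mentioned above can be as simple as sampling actions from a softmax over estimated action values, with a temperature that controls how exploratory the agent is (an illustrative sketch, not taken from the paper):

import math, random

def softmax_action(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature).
    # Higher temperature -> closer to uniform (more exploration); lower -> greedier.
    scaled = [q / temperature for q in q_values]
    m = max(scaled)                                   # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for action, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return action
    return len(q_values) - 1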
Classical Techniques to Improve LLMs' Decision-making Performance
Let us explore both classical (e.g., 𝜖-greedy) and LLM-specific (e.g., self-correction, self-consistency) techniques to improve LLMs' decision-making performance.
The 𝜖-greedy Strategy
With probability 𝜖 the agent picks a random action (exploration); otherwise it picks the action with the highest estimated value (exploitation). Compared to the traditional RL exploration mechanisms listed above, 𝜖-greedy is a simple, static form of exploration; the other mechanisms are more informed, adaptive, or principled ways to achieve the same goal - better exploration to improve learning efficiency. A minimal implementation is sketched below.
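A minimal 𝜖-greedy bandit agent, to make the idea concrete (illustrative sketch; the class and method names are mine):

import random

class EpsilonGreedyAgent:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms       # how often each arm has been pulled
        self.values = [0.0] * n_arms     # running mean reward per arm

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))                          # explore: random arm
        return max(range(len(self.values)), key=lambda a: self.values[a])      # exploit: best estimate

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental running-mean update of the arm's value estimate
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

With 𝜖 = 0 this agent is purely greedy, which is essentially the failure mode observed in LLM agents.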
LLM Specific Techniques to Improve Their Decision-making Performance
To improve decision-making performance in large language models (LLMs), several LLM-specific techniques have been developed. These techniques aim to enhance reasoning, reduce hallucinations, and increase reliability. Here are key methods, particularly focusing on self-correction and self-consistency:
1. Self-Correction Techniques
These methods allow LLMs to revise their own outputs, identifying and correcting errors.
2. Self-Consistency
This technique leverages the probabilistic nature of LLMs by sampling multiple reasoning paths and selecting the most frequent (or best-scoring) outcome.
Self-Consistency with Chain-of-Thought (CoT): in the common setup, several CoT completions are sampled at a nonzero temperature and a majority vote is taken over their final answers, as sketched below.
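In code, self-consistency boils down to sampling several CoT completions and taking a majority vote over the extracted answers (a sketch under the assumption that generate_cot and extract_answer wrap your LLM API and answer parser; both names are placeholders):

from collections import Counter

def self_consistent_answer(prompt, generate_cot, extract_answer, n_samples=8):
    # Sample n CoT completions at nonzero temperature and return the majority answer.
    answers = []
    for _ in range(n_samples):
        rationale = generate_cot(prompt, temperature=0.7)   # one sampled reasoning path
        answers.append(extract_answer(rationale))           # parse its final answer
    return Counter(answers).most_common(1)[0][0]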
3. Chain-of-Thought (CoT) Prompting
Encourages the model to reason step-by-step, leading to more accurate decisions.
4. ReAct (Reasoning + Acting)
Combines reasoning traces with tool use (e.g., external calculators, search engines). The model interleaves reasoning with actions to verify intermediate steps, improving accuracy and factuality.
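A simplified sketch of the ReAct loop, interleaving "Thought", "Action", and "Observation" (llm, tools, and the parsing convention are placeholders, not a real API):

def react_loop(question, llm, tools, max_turns=5):
    # Alternate reasoning ("Thought"), tool calls ("Action"), and tool results ("Observation").
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(transcript + "Thought:")              # model reasons about what to do next
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" not in step:
            continue                                     # no action proposed this turn; keep reasoning
        tool_name, tool_input = parse_action(step)       # e.g. ("search", "capital of France")
        observation = tools[tool_name](tool_input)       # run the tool and feed the result back
        transcript += f"Observation: {observation}\n"
    return None

def parse_action(step):
    # Naive parser for lines of the form: Action: search[capital of France]
    line = step.split("Action:")[-1].strip()
    name, arg = line.split("[", 1)
    return name.strip(), arg.rstrip("] \n")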
5. Tree of Thought (ToT)
Rather than a single linear chain, ToT explores multiple reasoning branches and performs lookahead and backtracking. It's more robust for complex decision-making tasks (e.g., planning or puzzle solving).
6. Debate or Self-Play
Models argue different perspectives with each other (or with themselves) to uncover flaws or weaknesses in reasoning, simulating a dialectic process.
7. Confidence Calibration / Output Scoring
Train or prompt the model to assign confidence scores to its outputs or steps. Helps downstream processes decide whether to accept or request revision.
What is Chain of Thought (CoT) Reasoning
Chain-of-Thought (CoT) reasoning is a technique in language models where the model explicitly generates intermediate reasoning steps before producing a final answer or action. Instead of jumping directly to a response, the model "thinks out loud" in a structured way. Basically, it includes the intermediate reasoning steps (the rationale) followed by the final answer or, in an agentic setting, the chosen action.
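A minimal illustration of the prompting difference (the wording is my own, not from the paper):

def direct_prompt(question):
    # Baseline: ask for the answer directly, with no visible reasoning.
    return f"Question: {question}\nAnswer:"

def cot_prompt(question):
    # CoT: ask for explicit intermediate reasoning before the final answer.
    return (
        f"Question: {question}\n"
        "Let's think step by step, and then give the final answer on a new line "
        "starting with 'Answer:'."
    )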
What is Reinforcement Learning from Human Feedback (RLHF)
A key method in fine-tuning LLMs is Reinforcement Learning from Human Feedback (RLHF). RLHF is used to align LLMs with what humans consider good or helpful responses, rather than having them simply predict the next token based on statistical likelihood.
How it works: a reward model is first trained on human preference comparisons between candidate responses; the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize that learned reward.
Key components: the pre-trained (reference) policy, the reward model trained from human preferences, and a KL-divergence penalty that keeps the fine-tuned policy close to the reference model.
This objective is designed to maximize the expected reward (based on human preferences) while minimizing deviation from the original model, to avoid unwanted catastrophic changes in behavior.
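In PPO-style RLHF pipelines this is commonly implemented by subtracting a KL penalty from the reward-model score; a minimal sketch, assuming you already have the reward-model score and the log-probabilities of the sampled response under the current policy and the frozen reference model:

def shaped_rlhf_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.1):
    # Maximize the human-preference reward while staying close to the reference model.
    kl_estimate = logprob_policy - logprob_reference   # simple per-sample KL estimate
    return reward_model_score - beta * kl_estimate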
Experimental Setup & Findings
1. Experimental Setup: Environments & Baselines
The study evaluates large language models (LLMs) in structured decision-making settings using three core environments: multi-armed bandits (MABs), contextual bandits (CBs), and Tic-tac-toe.
Comparative Baselines: classical decision-making algorithms, most notably an upper confidence bound (UCB) bandit agent, which serves as the near-optimal expert in the regret comparisons.
Models Evaluated: Gemma2 models at multiple scales, the largest being Gemma2 27B.
2. Diagnosing LLM Failures in Decision-Making
The paper identifies several consistent weaknesses in how LLMs handle decision-based tasks: greediness (premature commitment to a seemingly good action), frequency bias (copying frequently seen actions regardless of reward), and the knowing-doing gap (correct reasoning that is not followed by the corresponding action).
3. Exploration Techniques and Their Effectiveness
The authors experimented with several mechanisms to promote exploration, most notably a "try-all" strategy (every action is tried once at the start) and an exploration bonus (+1 reward for untried actions during RLFT), both of which clearly helped (see the findings below).
Other strategies (with less consistent impact) included 𝜖-greedy action selection, self-consistency, and self-correction.
4. Ablation Studies: Dissecting What Matters: the ablations compare expert behavior cloning against thought cloning (cloning both rationales and actions) and vary the generation ("thinking") budget; more thinking tokens improve outcomes, but at a notable computational cost.
Reinforcement Learning Fine-Tuning (RLFT) with Chain-of-Thought (CoT) Reasoning: Observations
In their experiments, the authors fine-tune a pre-trained LLM 𝜋𝜃 on its own self-generated Chain-of-Thought (CoT) rationales, using environment rewards as the learning signal.
Here are the salient points of the paper.
Reward-Coupled Reasoning
Importance of Chain-of-Thought (CoT) Reasoning: CoT prompts the model to think through its decisions before taking action. In the absence of CoT, the model often acts greedily, defaults to frequently seen actions, and fails to follow its own stated strategy.
RLFT with CoT Reasoning: It trains the model to generate a rationale first and then an action, with the whole rationale-plus-action trajectory optimized against the environment reward.
Decoupling Reasoning and Acting
By modeling reasoning and action separately (π(c), π(a∣c)), the agent is credited for actions that actually earn reward while still benefiting from explicit reasoning, which tightens the link between the rationale it verbalizes and the action it takes.
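Concretely, the factorization reads: first sample a rationale c from π(c | state), then choose the action from π(a | c, state). A toy sketch of what that looks like at inference time (llm_generate and extract_action are placeholder functions, not the paper's code):

def decide(state_prompt, llm_generate, extract_action, legal_actions):
    # Step 1: sample a CoT rationale, c ~ pi(c | state).
    rationale = llm_generate(state_prompt + "\nThink step by step about which action is best.")
    # Step 2: choose the action, a ~ pi(a | c, state), conditioned on the rationale.
    action_text = llm_generate(state_prompt + "\n" + rationale + "\nAction:")
    action = extract_action(action_text, legal_actions)   # map free text to a valid action
    return rationale, action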
Training Strategy: the model repeatedly interacts with the environment, generates a CoT rationale followed by an action, receives the environment reward (optionally shaped, for example with the +1 exploration bonus), and is updated with a policy-gradient step on the full rationale-plus-action sequence.
Why This Matters: because the reward depends on the action actually taken, rationales are reinforced only when they lead to good decisions; this directly targets greediness, frequency bias, and the knowing-doing gap.
Objective Function
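The paper's exact formulation is not reproduced here; as a hedged sketch, the objective has the general shape of a reward-maximizing, regularized RL fine-tuning objective over the self-generated rationale-plus-action tokens, for example:

J(\theta) = \mathbb{E}_{(c,a)\sim\pi_\theta}\big[\, R(s,a) \,\big] \;-\; \beta\, \mathbb{E}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot\mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid s)\big) \big]

where c is the CoT rationale, a is the extracted action, R is the (possibly shaped) environment reward, 𝜋_ref is the frozen pre-trained model, and β weights the regularization. These symbols are my notation for illustration; refer to the paper for the precise objective, which is optimized with a policy-gradient method.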
Reinforcement Learning Fine-Tuning (RLFT) Outcomes
The RLFT method, which fine-tunes models using reward-based feedback signals, yields significant gains:
1. Substantial Regret Reduction: Across both MABs and CBs, RLFT-trained models demonstrate lower cumulative regret, indicating better decision efficiency over time.
In decision-making and reinforcement learning, regret measures how much reward an agent missed out on by not always taking the best possible action. It’s formally the difference between the reward the agent could have obtained (if it had perfect knowledge of the environment) and what it actually obtained.
Why Is Regret Important for MABs and CBs? In bandit settings there is no single "win or lose" outcome; cumulative regret is the standard yardstick because it directly measures how well the agent balances exploration and exploitation over time.
The authors find that the simple try-all strategy, which removes the need for additional exploration by trying every action, yields the biggest performance improvement: Gemma2 27B almost closes the gap to the optimal UCB agent. This suggests that, given sufficient information about the (sub-)optimality of actions, LLMs are able to select actions accordingly, underscoring their exploration shortcomings. Second, they observe that RLFT lowers regret and improves exploration across the different exploration mechanisms. Most importantly, a simple exploration bonus (+1 reward for untried actions during RLFT) significantly increases exploration (50% → 70%) and lowers regret towards the expert compared to regular RLFT. This highlights the importance of reward shaping when fine-tuning LLMs for decision-making, as a way to elicit the desired behavior (a minimal sketch of this kind of reward shaping appears after this list of findings).
2. Broader Action Exploration: RLFT-trained models exhibit a significantly wider distribution of chosen actions, especially early in episodes. This behavior contrasts with baseline LLMs that often latch onto one seemingly good action too early - a form of greediness that hinders discovery. By incentivizing coverage across the action space, RLFT allows the model to gather more diverse reward signals. This richer exploration not only improves learning efficiency but also leads to more robust long-term strategies, especially in stochastic or deceptive environments. The result is more balanced exploration-exploitation behavior over time.
3. Mitigated Frequency Dependence: Pretrained LLMs often favor actions that appeared more frequently in demonstrations, regardless of actual rewards. RLFT reduces this bias by shifting the model’s behavior toward outcomes observed during interaction, not static token frequency. As a result, action choices become more reward-sensitive and less tied to prior patterns. The model learns to override familiar but suboptimal actions when feedback suggests better alternatives. This correction is crucial in tasks where high-reward actions are rare or counterintuitive.
4. Improved Rationale-Action Alignment: Baseline LLMs often generate correct reasoning but fail to act accordingly - a gap between what they know and what they do. RLFT helps close this gap by tuning the model on full trajectories where both rationale and action contribute to reward. This joint optimization improves coherence between explanation and behavior. After training, the model not only verbalizes the right strategy more often but is also more likely to execute it. The result is a tighter link between thought and action, especially in multi-step tasks.
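Referring back to the exploration bonus mentioned in the findings above, a minimal sketch of that kind of reward shaping might look like this (the function and its structure are my own illustration of the "+1 for untried actions" idea):

def shaped_reward(env_reward, action, tried_actions, bonus=1.0):
    # Add a one-time bonus the first time an action is tried,
    # then fall back to the plain environment reward.
    extra = bonus if action not in tried_actions else 0.0
    tried_actions.add(action)
    return env_reward + extra

During RLFT, tried_actions would be reset at the start of each episode, so the bonus only rewards genuinely new actions.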
[Bar chart not reproduced here: rationale-action alignment improves across models after RLFT.]
This demonstrates how RLFT narrows the "knowing-doing gap."
[Comparative performance table of the different models across multiple metrics - not reproduced here.]
This highlights RLFT's superiority in exploration efficiency, decision quality, and reasoning-action coherence.
Conclusion of the Paper
Problem Identified: Large Language Models (LLMs) often exhibit decision-making issues such as greediness (committing to seemingly good actions too early), frequency bias (preferring common patterns over correct ones), and a gap between knowing and acting (the knowing-doing gap).
Proposed Method: The authors apply Reinforcement Learning Fine-Tuning (RLFT) on self-generated Chain-of-Thought (CoT) rationales, using environment rewards rather than human feedback as the training signal.
Key Benefits: lower cumulative regret, broader action exploration, reduced frequency bias, and a narrower knowing-doing gap.
Experimental Findings: While exploration is not as robust as in standard RL methods, it improves significantly through reward shaping and targeted exploration strategies.
Limitations: Evaluated only on short-horizon tasks and small-to-medium-scale models, so scalability and long-range reasoning remain untested.
In summary, the paper explores why large language models (LLMs) underperform in decision-making tasks, focusing on three main issues: greediness, frequency bias, and the knowing-doing gap. To address these, the authors apply reinforcement learning fine-tuning (RLFT) on chain-of-thought (CoT) rationales, which improves performance by enhancing reasoning quality and decision alignment. They compare two behavioral cloning approaches - expert behavior cloning and thought cloning - using datasets with and without CoT, finding both effective in mimicking expert policies. Additional “thinking” time, via increased generation budgets, significantly improves outcomes, but also leads to high computational costs. While LLMs improve with these methods, exploration remains subpar compared to classic algorithms, prompting experiments with strategies like epsilon-greedy and self-consistency. The work highlights the importance of reward shaping and sufficient generation capacity, especially in multi-step, high-stakes scenarios. Limitations include testing on only short-horizon environments and smaller models, suggesting future research should explore scalability, long-horizon tasks, and more efficient model architectures for decision-making.