Perishable Product Discounting with Reinforcement Learning

Retailers discount perishable products in the hope that the remaining inventory sells out by the expiration date. Some time ago a retailer brought this problem to me, and I implemented a solution using a variation of Reinforcement Learning called Multi Arm Bandit.

A decision making dilemma

The retailer has to decide on the discount rate and the lead time for the discount, that is, how long before the expiry date the discount starts. This decision involves a dilemma.

If the discount rate is too high and / or the discounting starts too early, the product sells out well before the expiry date and revenue is lost, because units that could have sold at a higher price went out at the discounted price. On the other hand, if the discount rate is too low and / or the discounting starts too late, the remaining inventory may not sell out by the expiry date, again resulting in loss of revenue.
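The dilemma can be made concrete with a small simulation. The sketch below is only an illustration: the demand model (a fixed daily demand that rises linearly with the discount rate) and every number in it are assumptions of mine, not data from the retailer.

```python
def simulate_revenue(inventory, price, discount, lead_days,
                     horizon_days=10, base_demand=5.0, uplift=4.0):
    """Return (revenue, unsold units) over the last `horizon_days` before expiry.

    Hypothetical demand model: daily demand is `base_demand` at full price and
    `base_demand * (1 + uplift * discount)` once the discount is active.
    """
    revenue = 0.0
    remaining = float(inventory)
    for day in range(horizon_days):
        # The discount is active during the final `lead_days` days.
        discounted = day >= horizon_days - lead_days
        unit_price = price * (1 - discount) if discounted else price
        demand = base_demand * (1 + uplift * discount) if discounted else base_demand
        sold = min(remaining, demand)
        revenue += sold * unit_price
        remaining -= sold
    return revenue, remaining


# Three illustrative policies for 100 units priced at 10 each.
too_aggressive = simulate_revenue(100, 10.0, discount=0.5, lead_days=10)
too_timid = simulate_revenue(100, 10.0, discount=0.1, lead_days=2)
balanced = simulate_revenue(100, 10.0, discount=0.3, lead_days=5)

for name, (rev, left) in [("aggressive", too_aggressive),
                          ("timid", too_timid),
                          ("balanced", balanced)]:
    print(f"{name}: revenue={rev:.0f}, unsold={left:.0f}")
```

Under these assumed numbers the aggressive policy sells out early at a low price, the timid policy leaves stock unsold, and the policy in between earns more than either.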

Multi Arm Bandit solution

Until now the retailer has made the decision regarding the discount rate and lead time based on past experience. Now the retailer is interested in automating the decision making process with Machine Learning. This is where Reinforcement Learning comes into the picture.

Reinforcement Learning helps you learn the best decision, which in this case is the combination of discount rate and discount lead time that maximizes the revenue.

Reinforcement Learning is used for decision making tasks in an uncertain environment. It involves state, action and reward. For a given state, the algorithm takes an action and receives a reward for that action at some future time. As a result of taking the action, the state changes.

Game playing is the quintessential application of Reinforcement Learning. Unlike games, however, many decision making tasks don't involve states.

The algorithms used for such decision making tasks without states are collectively known as Multi Arm Bandit.

Our price discounting problem has no state. There is only action and reward. The action is the (discount rate, discount lead time) tuple. The reward is the revenue from the product sale.
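As a rough sketch, the action space and the reward could be represented as follows. The candidate discount rates and lead times, and the helper name `revenue_reward`, are hypothetical choices for illustration; a real deployment would choose its own grid.

```python
from itertools import product

# Hypothetical candidate values, not taken from the retailer's data.
DISCOUNT_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]  # fraction off the list price
LEAD_TIMES_DAYS = [1, 2, 3, 5, 7]           # days before expiry the discount starts

# Each arm of the bandit is one (discount rate, lead time) tuple.
ACTIONS = list(product(DISCOUNT_RATES, LEAD_TIMES_DAYS))


def revenue_reward(units_full_price, units_discounted, price, discount_rate):
    """Reward for one batch: revenue from full-price plus discounted sales."""
    return (units_full_price * price
            + units_discounted * price * (1 - discount_rate))


print(len(ACTIONS), "candidate actions, e.g.", ACTIONS[0])
```

Discretizing both dimensions keeps the number of arms small, at the cost of never trying values between the grid points.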

The Multi Arm Bandit model is a continuously learning system that learns from past experience. The algorithm makes decisions based on what has been learnt so far. All Multi Arm Bandit algorithms strive to maximize the cumulative reward over some time horizon.

Further Reading

There are many Multi Arm Bandit algorithms. I used an algorithm called Thompson Sampling for this problem. You can find more details of the solution, including a Spark implementation of these algorithms, in my post.
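To give a feel for how Thompson Sampling works, here is a minimal single-machine sketch in Python. It is not the Spark implementation mentioned above. It assumes rewards are roughly Normal with a known noise level and a flat prior, so the posterior for each arm's mean reward is Normal around the sample mean. The class name, the three arms and the "true" mean revenues are all made up for the simulation.

```python
import math
import random


class GaussianThompsonSampler:
    """Thompson Sampling with a Normal posterior over each arm's mean reward.

    Simplifying assumptions: known reward noise `reward_std`, flat prior.
    """

    def __init__(self, actions, reward_std, seed=None):
        self.actions = list(actions)
        self.reward_std = reward_std
        self.counts = {a: 0 for a in self.actions}
        self.sums = {a: 0.0 for a in self.actions}
        self.rng = random.Random(seed)

    def select(self):
        # Try every arm once before sampling from posteriors.
        untried = [a for a in self.actions if self.counts[a] == 0]
        if untried:
            return self.rng.choice(untried)
        # Draw one plausible mean per arm and pick the arm with the largest draw.
        draws = {}
        for a in self.actions:
            mean = self.sums[a] / self.counts[a]
            std = self.reward_std / math.sqrt(self.counts[a])
            draws[a] = self.rng.gauss(mean, std)
        return max(draws, key=draws.get)

    def update(self, action, reward):
        self.counts[action] += 1
        self.sums[action] += reward


# Simulated environment: hypothetical mean revenue per (discount rate, lead time).
TRUE_MEAN_REVENUE = {(0.5, 10): 500.0, (0.1, 2): 526.0, (0.3, 5): 635.0}
NOISE_STD = 20.0

env_rng = random.Random(0)
sampler = GaussianThompsonSampler(TRUE_MEAN_REVENUE, reward_std=NOISE_STD, seed=1)
for _ in range(300):
    action = sampler.select()
    observed = env_rng.gauss(TRUE_MEAN_REVENUE[action], NOISE_STD)
    sampler.update(action, observed)

most_pulled = max(sampler.counts, key=sampler.counts.get)
print("most frequently chosen action:", most_pulled)
```

Arms with few observations have wide posteriors and still get sampled occasionally, which is how the algorithm balances trying uncertain options against repeating the ones that have paid off so far.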



