✅Naive Bayes Algorithm - Explained💯

Naive Bayes is a probabilistic algorithm that is typically used for classification problems. It relies on conditional probability, the probability of an event occurring given that another event has already occurred. It is simple and intuitive, yet it performs surprisingly well in many cases. It is based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.


Assumptions made by Naive Bayes

The fundamental Naïve Bayes assumption is that each feature makes an:

- Independent

- Equal

contribution to the outcome.

For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other, the model treats each property as contributing independently to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.

Note - The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to perform competitively with, and sometimes outperform, highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

P(c|x) = P(x|c) × P(c) / P(x)

Above,

- P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).

- P(c) is the prior probability of class.

- P(x|c) is the likelihood which is the probability of predictor given class.

- P(x) is the prior probability of predictor.

This is a rather simple transformation, but it bridges the gap between what we want to do and what we can do. We can’t get P(C|X) directly, but we can get P(X|C) and P(C) from the training data. Here’s an example:

[Image: a sample weather dataset with the columns Outlook, Temperature, Humidity, Windy and the target Play.]

In this case, X =(Outlook, Temperature, Humidity, Windy), and Y=Play. P(X|Y) and P(Y) can be calculated:

P(Y|X) = P(X|Y) × P(Y) / P(X), where P(X|Y) = P(Outlook, Temperature, Humidity, Windy | Play) must be estimated for every possible combination of feature values.

Having this many parameters in the model is impractical. To solve this problem, a naive assumption is made: we pretend all features are independent. What does this mean?

P(X|Y) = P(Outlook | Play) × P(Temperature | Play) × P(Humidity | Play) × P(Windy | Play)

Now, with the help of this naive assumption (naive because features are rarely independent), we can perform classification with far fewer parameters:

P(Y|X) ∝ P(Y) × P(Outlook | Y) × P(Temperature | Y) × P(Humidity | Y) × P(Windy | Y), and we predict the class Y with the largest value.

This is a big deal. We changed the number of parameters from exponential to linear. This means that Naive Bayes can deal with high-dimensional data well.
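To make the parameter reduction concrete, here is a minimal from-scratch sketch (not from the original article; the toy rows and labels are invented for illustration). Each class stores one small count table per feature, so the number of stored parameters grows linearly with the number of features.

```python
from collections import defaultdict

# Minimal categorical Naive Bayes sketch (illustrative only).
# Each class keeps one count table per feature -> linear number of parameters,
# instead of one entry per combination of feature values (exponential).

def train(rows, labels):
    """rows: list of tuples of categorical feature values; labels: list of class labels."""
    class_counts = defaultdict(int)
    # feature_counts[class][feature_index][value] = count
    feature_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for x, y in zip(rows, labels):
        class_counts[y] += 1
        for i, v in enumerate(x):
            feature_counts[y][i][v] += 1
    return class_counts, feature_counts

def predict(x, class_counts, feature_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / total                             # P(y)
        for i, v in enumerate(x):
            score *= feature_counts[y][i][v] / cy      # P(x_i | y), no smoothing here
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Tiny usage example with invented weather rows (Outlook, Windy):
rows = [("Sunny", "False"), ("Rainy", "True"), ("Sunny", "True"), ("Overcast", "False")]
labels = ["No", "Yes", "No", "Yes"]
cc, fc = train(rows, labels)
print(predict(("Sunny", "False"), cc, fc))   # -> "No" for this toy data
```

A real implementation would also add smoothing (see the zero-frequency problem below) and work with log-probabilities to avoid numerical underflow.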

Another Example with Mathematics -

[Image: frequency and likelihood tables of Weather vs Play built from the training data.]

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the posterior-probability method discussed above.

- P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

- Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64

- Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny) = 0.40, so the statement is correct: players are likely to play when it is sunny (a quick check of this arithmetic follows below).
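A quick sanity check of this arithmetic in Python (the "No" counts below are not stated explicitly above; they are inferred from the totals, so treat them as an assumption):

```python
from fractions import Fraction

# Counts taken from the worked example: 14 days in total, 9 "Yes" days,
# 5 "Sunny" days, and 3 of the "Yes" days were "Sunny".
total_days    = 14
yes_days      = 9
no_days       = total_days - yes_days        # 5 (inferred)
sunny_days    = 5
sunny_and_yes = 3
sunny_and_no  = sunny_days - sunny_and_yes   # 2 (inferred)

p_yes   = Fraction(yes_days, total_days)     # P(Yes)   = 9/14
p_no    = Fraction(no_days, total_days)      # P(No)    = 5/14
p_sunny = Fraction(sunny_days, total_days)   # P(Sunny) = 5/14

p_sunny_given_yes = Fraction(sunny_and_yes, yes_days)  # P(Sunny | Yes) = 3/9
p_sunny_given_no  = Fraction(sunny_and_no, no_days)    # P(Sunny | No)  = 2/5

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny  # Bayes' theorem
p_no_given_sunny  = p_sunny_given_no  * p_no  / p_sunny

print(float(p_yes_given_sunny))  # 0.6 -> predict "Yes"
print(float(p_no_given_sunny))   # 0.4
```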

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

Naïve Bayes Classifier assumes that all the features are unrelated to each other. The presence or absence of a feature does not influence the presence or absence of any other feature.

In real-world datasets, we test a hypothesis against multiple pieces of evidence (the features), so the calculations become quite complicated. To simplify the work, the feature-independence assumption is used to uncouple the pieces of evidence and treat each one independently.

The zero-frequency problem

One of the disadvantages of Naive Bayes is that if a class label and a certain attribute value never occur together in the training data, then the frequency-based probability estimate for that combination is zero, and the whole product of probabilities becomes zero when the terms are multiplied.


Solution - An approach to overcome this ‘zero-frequency problem’ in a Bayesian setting is to add one to the count of every attribute value-class combination (Laplace smoothing), so that no estimated probability is ever exactly zero, as sketched below.

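A rough sketch of the idea with made-up counts (in scikit-learn the same behaviour is controlled by the alpha parameter of MultinomialNB and BernoulliNB):

```python
# Illustration of add-one (Laplace) smoothing with made-up counts.
# Suppose "Outlook = Overcast" was never seen together with class "No".
counts = {"Sunny": 2, "Overcast": 0, "Rainy": 3}   # Outlook counts within class "No"
n_class = sum(counts.values())                     # 5 examples of class "No"
k = len(counts)                                    # 3 possible Outlook values

# Unsmoothed estimate: the zero probability kills the whole product.
p_unsmoothed = {v: c / n_class for v, c in counts.items()}

# Laplace-smoothed estimate: add 1 to every count.
p_smoothed = {v: (c + 1) / (n_class + k) for v, c in counts.items()}

print(p_unsmoothed["Overcast"])  # 0.0
print(p_smoothed["Overcast"])    # 0.125 = 1/8
```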

There are three types of Naive Bayes model under the scikit-learn library:

Gaussian: It is used in classification and it assumes that features follow a normal distribution.

Multinomial: It is used for discrete counts. For example, in a text classification problem, instead of the Bernoulli-style “does this word occur in the document?”, we count how often each word occurs in the document; you can think of each count as “the number of times outcome x_i is observed over the n trials”.

Bernoulli: The Bernoulli model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a ‘bag of words’ model where the 1s and 0s are “word occurs in the document” and “word does not occur in the document” respectively.
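A short sketch of how the three variants are instantiated in scikit-learn (the data below is synthetic, generated only to show which kind of input each variant expects):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)                 # two classes

# GaussianNB: continuous features, assumed normally distributed within each class.
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(100, 3))
GaussianNB().fit(X_cont, y)

# MultinomialNB: non-negative counts, e.g. word counts per document.
X_counts = rng.poisson(lam=2.0, size=(100, 10))
MultinomialNB().fit(X_counts, y)

# BernoulliNB: binary features, e.g. word present / absent.
X_bin = (X_counts > 0).astype(int)
BernoulliNB().fit(X_bin, y)
```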

What are the Pros and Cons of Naive Bayes?

Pros:

  • It is easy and fast to predict the class of a test data set, and it also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better than comparable models such as logistic regression, and it needs less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category (in the test data set), which was not observed in the training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
  • On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously (see the small demonstration after this list).
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
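Both of the last two points can be seen in a small synthetic experiment (data invented for illustration): because every feature contributes its own “independent” likelihood term, duplicating a feature double-counts its evidence and pushes the predicted probabilities towards 0 or 1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(loc=y[:, None], scale=2.0, size=(200, 2))  # two mildly informative features

# Duplicating the columns makes each feature perfectly correlated with its copy.
X_dup = np.hstack([X, X])

p_orig = GaussianNB().fit(X, y).predict_proba(X[:1])
p_dup  = GaussianNB().fit(X_dup, y).predict_proba(X_dup[:1])

print(p_orig)  # probabilities from the 2-feature model
print(p_dup)   # typically more extreme (closer to 0 or 1): the evidence is double-counted
```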

Tips to improve the power of the Naive Bayes Model

Here are some tips for improving the power of the Naive Bayes Model:

  • If continuous features do not follow a normal distribution, use a transformation or another method to convert them to a normal distribution before fitting.
  • If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction so the model can still predict the class of the test data set.
  • Remove correlated features: highly correlated features are effectively counted twice in the model, which can overinflate their importance.
  • Naive Bayes classifiers have limited options for parameter tuning, such as alpha (1 by default) for smoothing and fit_prior=[True|False] to learn class prior probabilities or not (see the scikit-learn documentation for details; a small grid-search sketch follows this list). I would recommend focusing on data pre-processing and feature selection instead.
  • You might think of applying classifier-combination techniques like ensembling, bagging, and boosting, but these would not help much. Their purpose is to reduce variance, and Naive Bayes is a high-bias, low-variance model with little variance to reduce.
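For the limited tuning that is available, a small grid search over alpha and fit_prior is usually enough. A sketch with synthetic count data (the parameter ranges are only examples):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(200, 20))      # synthetic count features
y = rng.integers(0, 2, size=200)

param_grid = {
    "alpha": [0.01, 0.1, 0.5, 1.0, 2.0],      # smoothing strength
    "fit_prior": [True, False],               # learn class priors or use a uniform prior
}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```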

Applications of Naive Bayes Algorithms

Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it can be used for making predictions in real time.

Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability. Here we can predict the probabilities of multiple classes of the target variable.

Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to their good results on multi-class problems and the independence assumption) and often achieve a higher success rate there than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
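A minimal sketch of the spam-filtering use case with a bag-of-words representation and MultinomialNB (the toy messages are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training messages (invented for illustration).
messages = [
    "win a free prize now", "limited offer click here",
    "meeting at 10 am tomorrow", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts -> MultinomialNB.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free prize offer"]))        # likely 'spam'
print(model.predict(["see you at the meeting"]))  # likely 'ham'
```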

Recommendation System: Naive Bayes Classifier and Collaborative Filtering together build a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Thanks for reading! Please like, comment, and share if you found it useful.
