Comparative Study of Machine Learning
Algorithms for Sentiment Analysis with
TF-IDF Vector Creation
Sagar Vijay Deogirkar (10547321)
MSc Data Analytics
Ms. Terri Hoare
Supervisor
Index
• Introduction
• Research Question and Objective
• Methodology
• Business and Data Understanding
• Data Preparation
• Modelling
• Evaluation
• Results
• Conclusion and Future Work
Introduction
Customers are expressing their thoughts about products and services more openly than ever before.
Sentiment analysis is therefore becoming essential for understanding how customers feel.
Sentiment analysis refers to the use of Natural Language Processing techniques to classify the
sentiment of a text, that is, to determine whether the text is positive, negative or neutral. It is
often performed on textual data to help businesses monitor sentiment about their brand’s products or
services in customer reviews. This helps them understand customer requirements, which may lead to
improvements in the product or service where required.
Research Question
Sentiment analysis is trending, and many programming and non-programming platforms have emerged
that offer solutions to this problem. The difficulty lies in selecting a platform and, beyond that,
in selecting a model or algorithm. The main obstacle is knowing which algorithm, combined with the
TF-IDF vector creation technique, determines the class of the sentiment most accurately.
Objective
The main objective of this research is to compare the performance of state-of-the-art Deep Learning
with that of Machine Learning algorithms on TF-IDF vectors for Sentiment Analysis.
Methodology
This research was conducted following the CRISP-DM (Cross-Industry Standard Process for Data Mining)
methodology. Its phases and their tasks are:
• Business Understanding – Determine Business Objective, Assess the Situation, Determine the Study
Goal, Produce a Project Plan
• Data Understanding – Data Collection, Describing the Data, Data Exploration, Verifying Data Quality
• Data Preparation – Data Selection, Clean Data, Construct Data, Integrate Data, Format Data
• Modelling – Select the Model, Generate Test Design, Build the Model, Assess the Model
• Evaluation – Evaluate Result, Review Process, Determine Next Step
• Deployment – Plan Deployment, Plan Monitoring and Maintenance, Produce Final Report, Review Report
Business Understanding
Sentiment classification is the task of assigning a piece of text to one of the available sentiment
classes. It helps to identify the emotion, i.e. the sentiment, behind high-volume text data. The
text could be reviews from YouTube or any other social media platform, tweets on trending topics
with different hashtags, articles, news reports, or any similar content in text form.
Data Understanding
• The “Twitter Airline” dataset is used for this research.
• The dataset comprises 14 features and a label, with a total of 14640 rows.
• Column names: tweet id, airline sentiment, airline sentiment confidence, airline sentiment gold,
negative reason, negative reason confidence, airline, name, negative reason gold, retweet count,
text, tweet coord, tweet created, tweet location, user time zone.
• Sentiment distribution: 9178 negative, 3099 neutral and 2363 positive.
• Of these columns, only airline_sentiment and text are selected for the research.
Data Preparation
• Text Pre-processing – Text is lowercased, and unnecessary symbols and numbers are removed.
• Sentiment Class Filtering – The neutral class is filtered out; positive is encoded as 1 and
negative as 0.
• Data Balancing – The positive and negative classes are balanced to the same number of samples.
• Removing Stop Words – Common words of the language are removed.
• Text Stemming – Porter stemming is used to reduce each word to its stem.
• Tokenization – Each document is split into individual words.
• TF-IDF Vector – A TF-IDF word vector is created containing all the words in the data set with
their weights.
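The cleaning, class filtering and balancing steps above can be sketched in plain Python as follows. This is a minimal illustration and not the study's actual code: the stop-word list, the sample rows and the helper names (`clean_and_tokenize`, `filter_and_encode`, `balance`) are assumptions made for the example, balancing is assumed to be done by undersampling the larger class, and Porter stemming is left out because it needs an external library such as NLTK.

```python
import random
import re

# Illustrative stop-word list only; the slides do not say which list was used.
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "and", "my", "for", "on", "i", "it"}


def clean_and_tokenize(text):
    """Lowercase, drop symbols and numbers, split into words, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def filter_and_encode(rows):
    """Drop neutral rows and map positive -> 1, negative -> 0."""
    mapping = {"positive": 1, "negative": 0}
    return [(clean_and_tokenize(text), mapping[label])
            for text, label in rows if label in mapping]


def balance(samples, seed=0):
    """Undersample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


# Hypothetical rows in the (text, airline_sentiment) shape used by the study.
rows = [
    ("@airline Flight 123 was delayed AGAIN!!!", "negative"),
    ("@airline lost my bag :(", "negative"),
    ("@airline thanks for the great service", "positive"),
    ("@airline what time is boarding?", "neutral"),
]
samples = balance(filter_and_encode(rows))
```

With the reported class counts, undersampling the 9178 negative tweets to the 2363 positive ones would leave 2 × 2363 = 4726 samples, which matches the number of samples reported for the Python experiments.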
The preparation pipeline has four stages:
• Data Importation – Importing the data to the platform and selecting the features and label.
• Text Cleaning – Lowercasing, removing symbols and numbers, removing stop words.
• Text Processing – Text stemming, tokenization, data balancing.
• Vector Creation – TF-IDF vector creation.
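To make the vector creation step concrete, here is a small sketch of TF-IDF weighting in plain Python. It assumes the textbook definition (term frequency as count divided by document length, inverse document frequency as log(N / df)); Rapid Miner and Python libraries may use smoothed or normalised variants, so the exact weights in the study could differ. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenised documents.

    Uses tf = term count / document length and idf = log(N / df).
    Libraries often apply smoothed or normalised variants instead.
    """
    n_docs = len(documents)
    vocabulary = sorted({tok for doc in documents for tok in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in documents for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocabulary}

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / length * idf[tok] for tok in vocabulary])
    return vocabulary, vectors


# Hypothetical tokenised tweets (already cleaned and stemmed).
docs = [
    ["flight", "delay", "again"],
    ["flight", "great", "servic"],
    ["lost", "bag", "flight"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy example the word "flight" appears in every document, so its weight is zero everywhere, while words that appear in only one document receive the largest weights. This is the property that lets TF-IDF play down common words and emphasise distinctive ones.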
Modelling
• The selected models are Naive Bayes, Support Vector Machine (SVM), Generalised Linear Model (GLM),
Logistic Regression, Decision Tree, Random Forest, Gradient Boosted Trees, and Deep Learning.
• On Rapid Miner, Auto Model is used with 3000 samples.
• Deep Learning (Neural Network) is run on the H2O AI platform with the 3000 samples processed
and saved from Rapid Miner.
• On Python, 4726 samples are used for modelling with the above-mentioned models.
Evaluation
The performance of each model is evaluated by calculating the following parameters:
• Classification Error – The fraction of samples the model predicts incorrectly, out of the total
number of predicted samples.
• Accuracy – The fraction of samples the model predicts correctly, out of the total number of
samples.
• AUC – The area under the two-dimensional ROC (Receiver Operating Characteristic) curve.
• Precision – The fraction of the samples predicted as positive that are actually positive.
• Recall/Sensitivity – The fraction of the actual positive samples that the model predicts as
positive.
• F1 Score – The harmonic mean of precision and recall.
• Specificity – The ratio of the true negative predictions made by the model to the total number of
actual negative samples in the set.
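For a binary problem these definitions can be written compactly in terms of confusion-matrix counts. The symbols TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Classification Error} &= 1 - \text{Accuracy},\\
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN},\\
\text{Specificity} &= \frac{TN}{TN + FP}, &
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align*}
```

AUC is not covered by these formulas because it depends on the model's scores across all decision thresholds, not on a single confusion matrix.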
Results – Rapid Miner
The results observed on Rapid Miner’s Auto Model are given below.

| Parameter | NV | GLM | LR | FLM | DT | RF | GBT | SVM | DL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classification Error | 49.6% ± 0.7% | 28.7% ± 0.7% | 28.7% ± 0.7% | 28.7% ± 0.8% | 29.2% ± 0.9% | 29.9% ± 0.7% | 24.5% ± 1.1% | 30.6% ± 0.4% | 25.9% ± 1.8% |
| Accuracy | 50.3% ± 0.7% | 71.2% ± 0.7% | 71.2% ± 0.7% | 72% ± 0.8% | 70.7% ± 0.9% | 70% ± 0.7% | 75.4% ± 1.1% | 69.3% ± 0.4% | 74.1% ± 1.8% |
| AUC | 7.6% ± 2.3% | 81.3% ± 2.8% | 81.3% ± 2.8% | 81.3% ± 2.6% | 71.07% ± 1.6% | 78.7% ± 2.5% | 81.6% ± 1.1% | 79.3% ± 2.3% | 82.2% ± 2.5% |
| Precision | 49.6% ± 0.8% | 88.2% ± 2.4% | 88.2% ± 2.4% | 85.5% ± 4.7% | 87.3% ± 4.2% | 84% ± 3.8% | 77.4% ± 2.6% | 65.7% ± 2.3% | 83.4% ± 5.8% |
| Recall (Sensitivity) | 99.5% ± 0.6% | 47.5% ± 3.2% | 47.5% ± 3.2% | 50.6% ± 2.9% | 45.8% ± 3.7% | 46.6% ± 3.6% | 72.7% ± 4.3% | 77.5% ± 5.7% | 58.2% ± 3.1% |
| F Measure | 66.2% ± 0.7% | 61.7% ± 2.4% | 61.7% ± 2.4% | 63.4% ± 2% | 60% ± 2.8% | 59.8% ± 2.5% | 74.9% ± 2.5% | 71% ± 1.6% | 68.4% ± 1.8% |
| Specificity | 3.1% ± 3% | 93.8% ± 1.6% | 93.8% ± 1.6% | 91.% ± 3% | 93.7% ± 2.5% | 91.7% ± 2.5% | 78% ± 3.4% | 61.7% ± 4.7% | 89.0% ± 4.3% |
Results – Rapid Miner
The two better-performing models are compared below.

| Parameter | Gradient Boosting Trees | Deep Learning |
| --- | --- | --- |
| Classification Error | 24.53% ± 1.1% | 25.9% ± 1.8% |
| Accuracy | 75.46% ± 1.1% | 74.1% ± 1.8% |
| AUC | 81.6% ± 1.1% | 82.2% ± 2.5% |
| Precision | 77.4% ± 2.6% | 83.4% ± 5.8% |
| Recall (Sensitivity) | 72.7% ± 4.3% | 58.2% ± 3.1% |
| F Measure | 74.9% ± 2.5% | 68.4% ± 1.8% |
| Specificity | 78% ± 3.4% | 89.0% ± 4.3% |
Results – Python
The results observed on Python are given below.

| Parameter | SVM | GBC | (M)NB | DT | RF | LR |
| --- | --- | --- | --- | --- | --- | --- |
| Classification Error | 10.71% | 15.51% | 13.04% | 17.98% | 13.32% | 11.92% |
| Accuracy | 89.28% | 84.86% | 86.95% | 82.72% | 86.11% | 88.08% |
| AUC | 89.35% | 84.81% | 87.03% | 81.72% | 85.69% | 88.16% |
| Precision (0/1) | 91% / 87% | 91% / 77% | 89% / 85% | 81% / 84% | 87% / 85% | 90% / 86% |
| Recall (0/1) (Sensitivity) | 88% / 91% | 80% / 90% | 86% / 89% | 84% / 82% | 85% / 87% | 87% / 90% |
| F Measure (0/1) | 90% / 89% | 85% / 83% | 87% / 87% | 83% / 83% | 86% / 86% | 88% / 88% |
| Specificity | 90.80% | 89.45% | 88.52% | 81.12% | 86.38% | 89.71% |
Results – Python
The three better-performing models are compared below.

| Parameter | Support Vector Machine | Logistic Regression | Gradient Boosting Classifier |
| --- | --- | --- | --- |
| Classification Error | 10.71% | 11.92% | 15.51% |
| Accuracy | 89.28% | 88.08% | 84.86% |
| AUC | 89.35% | 88.16% | 84.81% |
| Precision (0/1) | 91% / 87% | 90% / 86% | 91% / 77% |
| Recall (0/1) | 88% / 91% | 87% / 90% | 80% / 90% |
| F Measure (0/1) | 90% / 89% | 88% / 88% | 85% / 83% |
| Specificity | 90.80% | 89.71% | 89.45% |
Results – H2O AI

| | Predicted 0 | Predicted 1 | Error Rate |
| --- | --- | --- | --- |
| Actual 0 | 169.0 | 276.0 | 0.6202 (276.0/445.0) |
| Actual 1 | 42.0 | 413.0 | 0.0923 (42.0/455.0) |
| Total | 211.0 | 689.0 | 0.3533 (318.0/900.0) |

From the generated confusion matrix, the following parameters are derived for Deep Learning:

| Parameter | Score |
| --- | --- |
| Accuracy | 64.6% |
| Classification Error | 35.4% |
| AUC | 74.44% |
| Precision | 59.96% |
| Recall | 90.76% |
| F Measure | 72.21% |
| Specificity | 37.97% |
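The derivation of these scores from the confusion matrix can be reproduced with a few lines of Python. This is a sketch that assumes class 1 is the positive class; the function name `binary_metrics` is made up for the example. AUC is not included because it cannot be computed from a single confusion matrix.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Counts from the H2O AI confusion matrix above, treating class 1 as positive.
metrics = binary_metrics(tn=169, fp=276, fn=42, tp=413)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The values computed this way agree with the reported scores to within about 0.1 percentage point; the small differences are consistent with rounding or truncation in the original figures.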
Results – Overall
• On Rapid Miner, there is little difference in Classification Error, Accuracy, and AUC between the
Gradient Boosted Trees (GBT) and Deep Learning (DL) models; these are among the most important
evaluation criteria for machine learning classification algorithms.
• On Python, Support Vector Machine clearly outperforms every other traditional machine learning
classification model.
• Neither of the better-performing models on Rapid Miner stood out with the larger number of samples
on the Python platform or on H2O AI.
• Support Vector Machine achieves the highest or joint-highest score on every classification
evaluation parameter on Python.
• The Deep Learning model on H2O AI gives unfavourable results compared with the hypothesis
considered.
• The results from the Deep Learning model on the H2O platform do not outperform Rapid Miner’s
Auto Model results.
Conclusion and Future Work
• From the above results and discussion, we can observe that the Support Vector Machine model
performs better than the other state-of-the-art models with TF-IDF word vector creation.
• The Rapid Miner Auto Model scores can be used as a benchmark for all platforms and models.
Different results can be observed depending on the number of samples.
• Future work for this study will involve the use of a Recurrent Neural Network with Keras for
sentiment classification.
• It will also involve the use of different word vector creation techniques such as Term Frequency
(TF), Term Occurrence (TO), and Binary Term Occurrence (BTO).
Thank You
Ad

More Related Content

What's hot (20)

Natural language processing PPT presentation
Natural language processing PPT presentationNatural language processing PPT presentation
Natural language processing PPT presentation
Sai Mohith
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
Johnson Ubah
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Edureka!
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
Xiaotao Zou
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Deep learning
Deep learningDeep learning
Deep learning
Ratnakar Pandey
 
Word2Vec
Word2VecWord2Vec
Word2Vec
hyunyoung Lee
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
Nuwan Sriyantha Bandara
 
NLP
NLPNLP
NLP
guestff64339
 
Feature selection
Feature selectionFeature selection
Feature selection
dkpawar
 
Twitter sentiment analysis
Twitter sentiment analysisTwitter sentiment analysis
Twitter sentiment analysis
Rahul Jha
 
Language models
Language modelsLanguage models
Language models
Maryam Khordad
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
ankit_ppt
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
Robert Lujo
 
ProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) IntroductionProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) Introduction
wahab khan
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial Intelligence
Ramla Sheikh
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Jimmy Lai
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 
Natural language processing PPT presentation
Natural language processing PPT presentationNatural language processing PPT presentation
Natural language processing PPT presentation
Sai Mohith
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
Johnson Ubah
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Edureka!
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
Xiaotao Zou
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Feature selection
Feature selectionFeature selection
Feature selection
dkpawar
 
Twitter sentiment analysis
Twitter sentiment analysisTwitter sentiment analysis
Twitter sentiment analysis
Rahul Jha
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
ankit_ppt
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
Robert Lujo
 
ProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) IntroductionProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) Introduction
wahab khan
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial Intelligence
Ramla Sheikh
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
VenkateshMurugadas
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Jimmy Lai
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 

Similar to Comparative Study of Machine Learning Algorithms for Sentiment Analysis with TF-IDF Vector Creation (20)

Rapid Miner
Rapid MinerRapid Miner
Rapid Miner
SrushtiSuvarna
 
The Automation Firehose: Be Strategic and Tactical by Thomas Haver
The Automation Firehose: Be Strategic and Tactical by Thomas HaverThe Automation Firehose: Be Strategic and Tactical by Thomas Haver
The Automation Firehose: Be Strategic and Tactical by Thomas Haver
QA or the Highway
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning Models
Eng Teong Cheah
 
featurers_Machinelearning___________.pdf
featurers_Machinelearning___________.pdffeaturers_Machinelearning___________.pdf
featurers_Machinelearning___________.pdf
AmirMohamedNabilSale
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
Roger Barga
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
jagan477830
 
Customer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R OpenCustomer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R Open
Poo Kuan Hoong
 
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web TestingThe Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
Perfecto by Perforce
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detection
jagan477830
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
Rising Media, Inc.
 
A survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithmsA survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithms
Ahmed Magdy Ezzeldin, MSc.
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
Lucinda Linde
 
Testing 2 - Thinking Like A Tester
Testing 2 - Thinking Like A TesterTesting 2 - Thinking Like A Tester
Testing 2 - Thinking Like A Tester
ArleneAndrews2
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1
arthi v
 
NEURAL Network Design Training
NEURAL Network Design  TrainingNEURAL Network Design  Training
NEURAL Network Design Training
ESCOM
 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
Ramakrishna Reddy Bijjam
 
IMDB Movie Reviews made by any organisation.pptx
IMDB Movie Reviews made by any organisation.pptxIMDB Movie Reviews made by any organisation.pptx
IMDB Movie Reviews made by any organisation.pptx
swatigohite6
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
ArtemSunfun
 
The Automation Firehose: Be Strategic and Tactical by Thomas Haver
The Automation Firehose: Be Strategic and Tactical by Thomas HaverThe Automation Firehose: Be Strategic and Tactical by Thomas Haver
The Automation Firehose: Be Strategic and Tactical by Thomas Haver
QA or the Highway
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning Models
Eng Teong Cheah
 
featurers_Machinelearning___________.pdf
featurers_Machinelearning___________.pdffeaturers_Machinelearning___________.pdf
featurers_Machinelearning___________.pdf
AmirMohamedNabilSale
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
Roger Barga
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
jagan477830
 
Customer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R OpenCustomer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R Open
Poo Kuan Hoong
 
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web TestingThe Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing
Perfecto by Perforce
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detection
jagan477830
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
Rising Media, Inc.
 
A survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithmsA survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithms
Ahmed Magdy Ezzeldin, MSc.
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
Lucinda Linde
 
Testing 2 - Thinking Like A Tester
Testing 2 - Thinking Like A TesterTesting 2 - Thinking Like A Tester
Testing 2 - Thinking Like A Tester
ArleneAndrews2
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1
arthi v
 
NEURAL Network Design Training
NEURAL Network Design  TrainingNEURAL Network Design  Training
NEURAL Network Design Training
ESCOM
 
IMDB Movie Reviews made by any organisation.pptx
IMDB Movie Reviews made by any organisation.pptxIMDB Movie Reviews made by any organisation.pptx
IMDB Movie Reviews made by any organisation.pptx
swatigohite6
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
ArtemSunfun
 
Ad

Recently uploaded (20)

Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Ann Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdfAnn Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdf
আন্ নাসের নাবিল
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
Automation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success storyAutomation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success story
Process mining Evangelist
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
Automation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success storyAutomation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success story
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry

Comparative Study of Machine Learning Algorithms for Sentiment Analysis with TF-IDF Vector Creation

  • 1. Comparative Study of Machine Learning Algorithms for Sentiment Analysis with TF-IDF Vector Creation. Sagar Vijay Deogirkar (10547321), MSc Data Analytics. Supervisor: Ms. Terri Hoare
  • 2. Index
      • Introduction
      • Research Question and Objective
      • Methodology
      • Business and Data Understanding
      • Data Preparation
      • Modelling
      • Evaluation
      • Results
      • Conclusion and Future Work
  • 3. Introduction. Customers are expressing their opinions about products and services more openly than ever before, which makes sentiment analysis essential for understanding how they feel. Sentiment analysis refers to the use of Natural Language Processing techniques to classify the sentiment of a text, that is, to determine whether a given text is positive, negative or neutral. It is often performed on textual data to help businesses monitor the sentiment towards their brand, products or services in customer reviews. This helps them understand customer requirements, which may lead to improvements in products or services where needed.
  • 4. Research Question. Sentiment analysis is trending, and many programming and non-programming platforms now offer solutions for it. The difficulty lies in selecting a platform and, beyond that, in selecting a model or algorithm. The main obstacle is knowing which algorithm, combined with the TF-IDF vector creation technique, determines the sentiment class most accurately.
      Objective. The main objective of this research is to compare the performance of state-of-the-art Deep Learning with that of Machine Learning algorithms on TF-IDF vectors for sentiment analysis.
  • 5. Methodology. This research follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. Its phases and tasks are:
      • Business Understanding: Determine the Business Objective; Assess the Situation; Determine the Study Goal; Produce a Project Plan
      • Data Understanding: Data Collection; Describing the Data; Data Exploration; Verifying Data Quality
      • Data Preparation: Data Selection; Clean Data; Construct Data; Integrate Data; Format Data
      • Modelling: Select the Model; Generate Test Design; Build the Model; Assess the Model
      • Evaluation: Evaluate Result; Review Process; Determine Next Step
      • Deployment: Plan Deployment; Plan Monitoring and Maintenance; Produce Final Report; Review Report
  • 6. Business Understanding. Sentiment classification is the task of assigning one of the available sentiment classes to a piece of text. It helps identify the emotion, i.e. the sentiment, behind high-volume text data. The text could be reviews from YouTube or any other social media platform, tweets on trending topics with various hashtags, articles, news reports, or any similar text.
      Data Understanding
      • The "Twitter Airline" dataset is used for this research.
      • The dataset comprises 14 features and a label, with 14640 rows in total.
      • Column names: tweet id, airline sentiment, airline sentiment confidence, airline sentiment gold, negative reason, negative reason confidence, airline, name, negative reason gold, retweet count, text, tweet coord, tweet created, tweet location, user time zone.
      • Sentiment distribution: 9178 negative, 3099 neutral and 2363 positive.
      • Of these columns, only airline_sentiment and text are selected for the research.
  • 7. Data Preparation
      • Text Pre-processing: the text is lowercased, and unnecessary symbols and numbers are removed.
      • Sentiment Class Filtering: neutral tweets are filtered out; positive is mapped to 1 and negative to 0.
      • Data Balancing: the positive and negative classes are balanced to the same number of samples.
      • Removing Stop Words: common words of the language are removed.
      • Text Stemming: Porter stemming is used to reduce each word to its stem.
      • Tokenization: each document is split into individual words.
      • TF-IDF Vector: a TF-IDF word vector is created containing every word in the dataset with its weight.
      Pipeline stages:
      • Data Importation: the data is imported into the platform and the features and label are selected.
      • Text Cleaning: Lowercasing; Removing Symbols and Numbers; Removing Stop Words.
      • Text Processing: Text Stemming; Tokenization; Data Balancing.
      • Vector Creation: TF-IDF Vector Creation.
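The cleaning and vector-creation steps above can be sketched in plain Python. This is a minimal illustration only, not the author's pipeline: the `STOP_WORDS` set, the function names and the sample tweets are hypothetical, Porter stemming is left out to keep the sketch short, and the weighting `tf * log(N / df)` is one common TF-IDF variant that may differ from what Rapid Miner or a Python library computes.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "and", "my", "for", "on"}


def clean_and_tokenize(text):
    """Lowercase, drop symbols and numbers, split into words, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [clean_and_tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example tweets, not taken from the dataset.
tweets = [
    "The flight was delayed for 3 hours!!!",
    "Great flight, great crew :)",
]
vectors = tfidf(tweets)
```

In this sketch a word that occurs in every document, such as "flight" here, receives a weight of zero, while words specific to one tweet receive positive weights.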
  • 8. Modelling
      • Selected models: Naive Bayes, Support Vector Machine (SVM), Generalised Linear Model (GLM), Logistic Regression, Decision Tree, Random Forest, Gradient Boosted Trees, and Deep Learning.
      • On Rapid Miner, Auto Model is used with 3000 samples.
      • Deep Learning (a neural network) is evaluated on the H2O AI platform with 3000 samples that were processed and saved from Rapid Miner.
      • On Python, 4726 samples are used to train the above-mentioned models.
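The 4726 samples used on Python are consistent with dropping the neutral class and reducing the negative class to the size of the positive class (2 × 2363 = 4726). The slides do not state how the balancing was done, so the random undersampling below is an assumption, and the function name, seed and placeholder rows are hypothetical.

```python
import random


def filter_and_balance(rows, seed=0):
    """rows: list of (sentiment, text) pairs.

    Drops neutral rows, maps positive to 1 and negative to 0, then randomly
    undersamples the larger class so both classes have the same size.
    """
    label_map = {"positive": 1, "negative": 0}
    labelled = [(label_map[s], t) for s, t in rows if s in label_map]
    positives = [row for row in labelled if row[0] == 1]
    negatives = [row for row in labelled if row[0] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Synthetic placeholder rows with the class counts reported for the dataset.
rows = (
    [("negative", f"neg {i}") for i in range(9178)]
    + [("neutral", f"neu {i}") for i in range(3099)]
    + [("positive", f"pos {i}") for i in range(2363)]
)
balanced = filter_and_balance(rows)
print(len(balanced))  # 4726
```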
  • 9. Evaluation. The performance of each model is evaluated by calculating the following parameters:
      • Classification Error: the fraction of samples that the model predicts incorrectly, out of the total number of predicted samples.
      • Accuracy: the fraction of samples that the model predicts correctly, out of the total number of samples.
      • AUC: the area under the ROC (Receiver Operating Characteristic) curve.
      • Precision: the proportion of samples predicted as positive that are actually positive.
      • Recall/Sensitivity: the proportion of actual positive samples that the model correctly predicts as positive.
      • F1 Score: the harmonic mean of precision and recall.
      • Specificity: the ratio of true negative predictions made by the model to the total number of actual negative samples in the set.
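For reference, the parameters above can be written in terms of confusion-matrix counts. The symbols TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) do not appear in the slides and are introduced here as notation; AUC is omitted because it depends on the ranking of predicted scores rather than on a single confusion matrix.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Classification Error} &= \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall (Sensitivity)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```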
  • 10. Results – Rapid Miner. Results observed with Rapid Miner's Auto Model are given below.

      | Parameter | NV | GLM | LR | FLM | DT | RF | GBT | SVM | DL |
      |---|---|---|---|---|---|---|---|---|---|
      | Classification Error | 49.6% ± 0.7% | 28.7% ± 0.7% | 28.7% ± 0.7% | 28.7% ± 0.8% | 29.2% ± 0.9% | 29.9% ± 0.7% | 24.5% ± 1.1% | 30.6% ± 0.4% | 25.9% ± 1.8% |
      | Accuracy | 50.3% ± 0.7% | 71.2% ± 0.7% | 71.2% ± 0.7% | 72% ± 0.8% | 70.7% ± 0.9% | 70% ± 0.7% | 75.4% ± 1.1% | 69.3% ± 0.4% | 74.1% ± 1.8% |
      | AUC | 7.6% ± 2.3% | 81.3% ± 2.8% | 81.3% ± 2.8% | 81.3% ± 2.6% | 71.07% ± 1.6% | 78.7% ± 2.5% | 81.6% ± 1.1% | 79.3% ± 2.3% | 82.2% ± 2.5% |
      | Precision | 49.6% ± 0.8% | 88.2% ± 2.4% | 88.2% ± 2.4% | 85.5% ± 4.7% | 87.3% ± 4.2% | 84% ± 3.8% | 77.4% ± 2.6% | 65.7% ± 2.3% | 83.4% ± 5.8% |
      | Recall (Sensitivity) | 99.5% ± 0.6% | 47.5% ± 3.2% | 47.5% ± 3.2% | 50.6% ± 2.9% | 45.8% ± 3.7% | 46.6% ± 3.6% | 72.7% ± 4.3% | 77.5% ± 5.7% | 58.2% ± 3.1% |
      | F Measure | 66.2% ± 0.7% | 61.7% ± 2.4% | 61.7% ± 2.4% | 63.4% ± 2% | 60% ± 2.8% | 59.8% ± 2.5% | 74.9% ± 2.5% | 71% ± 1.6% | 68.4% ± 1.8% |
      | Specificity | 3.1% ± 3% | 93.8% ± 1.6% | 93.8% ± 1.6% | 91.% ± 3% | 93.7% ± 2.5% | 91.7% ± 2.5% | 78% ± 3.4% | 61.7% ± 4.7% | 89.0% ± 4.3% |
  • 11. Results – Rapid Miner. The two better-performing models, Gradient Boosting Trees (GBT) and Deep Learning (DL), are compared below.

      | Parameter | Gradient Boosting Trees | Deep Learning |
      |---|---|---|
      | Classification Error | 24.53% ± 1.1% | 25.9% ± 1.8% |
      | Accuracy | 75.46% ± 1.1% | 74.1% ± 1.8% |
      | AUC | 81.6% ± 1.1% | 82.2% ± 2.5% |
      | Precision | 77.4% ± 2.6% | 83.4% ± 5.8% |
      | Recall (Sensitivity) | 72.7% ± 4.3% | 58.2% ± 3.1% |
      | F Measure | 74.9% ± 2.5% | 68.4% ± 1.8% |
      | Specificity | 78% ± 3.4% | 89.0% ± 4.3% |
  • 12. Results – Python. Results observed on Python are given below. Values written as x / y are for class 0 / class 1.

      | Parameter | SVM | GBC | (M)NB | DT | RF | LR |
      |---|---|---|---|---|---|---|
      | Classification Error | 10.71% | 15.51% | 13.04% | 17.98% | 13.32% | 11.92% |
      | Accuracy | 89.28% | 84.86% | 86.95% | 82.72% | 86.11% | 88.08% |
      | AUC | 89.35% | 84.81% | 87.03% | 81.72% | 85.69% | 88.16% |
      | Precision (0/1) | 91% / 87% | 91% / 77% | 89% / 85% | 81% / 84% | 87% / 85% | 90% / 86% |
      | Recall (0/1) (Sensitivity) | 88% / 91% | 80% / 90% | 86% / 89% | 84% / 82% | 85% / 87% | 87% / 90% |
      | F Measure (0/1) | 90% / 89% | 85% / 83% | 87% / 87% | 83% / 83% | 86% / 86% | 88% / 88% |
      | Specificity | 90.80% | 89.45% | 88.52% | 81.12% | 86.38% | 89.71% |
  • 13. Results – Python. The three better-performing models, Support Vector Machine (SVM), Logistic Regression (LR) and Gradient Boosting Classifier (GBC), are compared below.

      | Parameter | Support Vector Machine | Logistic Regression | Gradient Boosting Classifier |
      |---|---|---|---|
      | Classification Error | 10.71% | 11.92% | 15.51% |
      | Accuracy | 89.28% | 88.08% | 84.86% |
      | AUC | 89.35% | 88.16% | 84.81% |
      | Precision (0/1) | 91% / 87% | 90% / 86% | 91% / 77% |
      | Recall (0/1) | 88% / 91% | 87% / 90% | 80% / 90% |
      | F Measure (0/1) | 90% / 89% | 88% / 88% | 85% / 83% |
      | Specificity | 90.80% | 89.71% | 89.45% |
  • 14. Results – H2O AI. Confusion matrix for the Deep Learning model:

      | | Predicted 0 | Predicted 1 | Error Rate |
      |---|---|---|---|
      | Actual 0 | 169.0 | 276.0 | 0.6202 (276.0/445.0) |
      | Actual 1 | 42.0 | 413.0 | 0.0923 (42.0/455.0) |
      | Total | 211.0 | 689.0 | 0.3533 (318.0/900.0) |

      From the generated confusion matrix, the following parameters are derived for Deep Learning:

      | Parameter | Score |
      |---|---|
      | Accuracy | 64.6% |
      | Classification Error | 35.4% |
      | AUC | 74.44% |
      | Precision | 59.96% |
      | Recall | 90.76% |
      | F Measure | 72.21% |
      | Specificity | 37.97% |
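As a cross-check, the parameters can be recomputed directly from the four cells of the H2O confusion matrix. This is a small sketch using the standard definitions, with class 1 treated as the positive class; the variable names are illustrative, and AUC is not included because it cannot be recovered from a single confusion matrix.

```python
# Cells of the H2O confusion matrix, with class 1 as the positive class.
tn, fp = 169, 276  # actual 0: predicted 0, predicted 1
fn, tp = 42, 413   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
classification_error = (fp + fn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f_measure = 2 * precision * recall / (precision + recall)

metrics = {
    "Accuracy": accuracy,
    "Classification Error": classification_error,
    "Precision": precision,
    "Recall": recall,
    "F Measure": f_measure,
    "Specificity": specificity,
}
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Recomputed this way, the values are approximately 64.67% accuracy, 35.33% classification error, 59.94% precision, 90.77% recall, 72.20% F measure and 37.98% specificity. They agree with the reported scores to within a few hundredths of a percentage point, which is consistent with rounding or truncation in the original figures.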
  • 15. Results – Overall
      • On Rapid Miner, there is little difference between the Gradient Boosting Tree (GBT) and Deep Learning (DL) models in Classification Error, Accuracy and AUC, which are among the most important evaluation criteria for machine learning classification algorithms.
      • Support Vector Machine clearly outperforms every other traditional machine learning classification model on Python.
      • Neither of the two better-performing models on Rapid Miner performed well with a larger number of samples on the Python platform or on H2O AI.
      • Support Vector Machine achieves the highest score on every classification evaluation parameter.
      • The Deep Learning model on H2O AI gives unfavourable results compared with the hypothesis considered.
      • The results from the Deep Learning model on the H2O platform do not outperform the Rapid Miner Auto Model results.
  • 16. Conclusion and Future Work
      • From the above results and discussion, we can observe that the Support Vector Machine model performs better than the other state-of-the-art models with TF-IDF word vector creation.
      • Rapid Miner Auto Model's scores can be used as a benchmark for all platforms and models. Different results may be observed depending on the number of samples.
      • Future work for this study will involve using a Recurrent Neural Network with Keras for sentiment classification.
      • It will also involve other word vector creation techniques such as Term Frequency (TF), Term Occurrence (TO), and Binary Term Occurrence (BTO).