DATA MINING
KESHAV MAHAVIDYALAYA
- SANJEEV, AJEETH, VINIT, SNEHA
INTRODUCTION
• Spam constitutes 55% of all emails, posing a significant challenge to communication.
• It inundates mailboxes with unwanted advertisements and junk, consuming users' time and
risking the deletion of legitimate emails.
• Economic impacts have led to legislative measures in some countries.
• Text classification, essential for organizing and categorizing text, distinguishes between spam
and legitimate messages.
• Machine learning automates this process efficiently by learning associations from pre-labeled
data.
• Feature extraction transforms text into numerical representations, aiding in accurate
classification.
• ML techniques enhance precision and speed in analyzing big data, crucial for informing
business decisions and automating processes.
• This project employs machine learning to detect spam messages without explicit
programming.
• Algorithms learn classification rules from pre-labeled data, predicting the category of
unknown texts based on majority vote.
PROBLEM STATEMENT
• Spammers are in a continuous arms race with e-mail service providers. Providers
implement various spam filtering methods to retain their users, while spammers
continuously change their patterns and use various embedding tricks to get through
the filters. These filters can never be too aggressive, because even a slight
misclassification may cause a consumer to lose important mail. A robust filtering
method with additional reinforcements is needed to tackle the problem.
• To combat the ever-evolving tactics of spammers, email service
providers must continuously adapt their spam filtering strategies. By
implementing a combination of sophisticated techniques such as
content analysis, sender verification, and machine learning
algorithms, providers can effectively block unwanted messages while
allowing legitimate emails to reach their recipients.
OBJECTIVES:
The objectives of this project are:
• To create an ensemble algorithm for the classification of spam with the highest
possible accuracy.
• To study how machine learning can be used for spam detection.
• To study how natural language processing (NLP) techniques can be applied to spam
detection.
• To provide the user with insights into a given text by combining the created
algorithm with NLP.
• To make spam detection more robust for a more secure online experience.
WORKFLOW:
DATA DESCRIPTION:
Dataset: UCI SMS Spam Collection.
Source: Kaggle.
Description: A subset of 3,375 randomly chosen ham SMS messages from the NUS SMS
Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research
at the National University of Singapore (NUS). The files contain one message per
line. Each line is composed of two columns: v1 contains the label (ham or spam) and
v2 contains the raw text.
DATA PROCESSING:
• Dataset cleaning
• Dataset merging
TEXTUAL DATA PROCESSING:
• Tag removal
• Sentence splitting and tokenization
• Stop word removal
• Lemmatization
• Sentence formation
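A minimal sketch of these textual processing steps using NLTK follows. The project's exact settings are not given on the slides, so the preprocess() helper and the sample message are illustrative.

# Textual processing sketch: tokenize, drop non-alphabetic tokens,
# remove stop words, lemmatize, and re-form the sentence.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop tags/punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
    return " ".join(tokens)                              # sentence formation

print(preprocess("WINNER!! You have won a free ticket, claim now!"))
# -> "winner free ticket claim"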
FEATURE VECTOR FORMATION:
• The texts are converted into feature vectors (numerical data) using the words
present in all the texts combined.
• This is done with count vectorization (scikit-learn's CountVectorizer, applied to
the NLTK-preprocessed text).
• The feature vectors can be formed using two language models: Bag of Words (BoW)
and Term Frequency-Inverse Document Frequency (TF-IDF).
BAG OF WORDS:
Bag of words is a language model used mainly in text classification. A bag of words
represents the text in a numerical form.
The two things required for Bag of Words are
• A vocabulary of words known to us.
• A way to measure the presence of words.
Ex: a few lines from the book “A Tale of Two Cities” by Charles Dickens.
“ It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness, ”
The unique words here (ignoring case and punctuation) are:
[ “it”, “was”, “the”, “best”, “of”, “times”, “worst”,“age”, “wisdom”, “foolishness” ]
The next step is scoring the words present in each document.
After scoring, the four lines of the stanza can be represented in vector form as
“It was the best of times“ = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness"= [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
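As a sketch, this binary scoring can be reproduced with scikit-learn's CountVectorizer (an assumption about tooling; note that sklearn orders its vocabulary alphabetically, so the columns differ from the word order shown above).

from sklearn.feature_extraction.text import CountVectorizer

lines = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectorizer = CountVectorizer(binary=True)  # 1 = word present, 0 = absent
X = vectorizer.fit_transform(lines)

print(vectorizer.get_feature_names_out())
# ['age' 'best' 'foolishness' 'it' 'of' 'the' 'times' 'was' 'wisdom' 'worst']
print(X.toarray())  # one row per line, in the alphabetical column order above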
Term Frequency-Inverse Document Frequency:
• The term frequency-inverse document frequency of a word is a measure of the
word's importance.
• It compares a word's frequency within a document to its occurrence across the
collection of documents and calculates a score.
• Terminology for the formulae below:
t – term (word)
d – document
N – number of documents
The TF-IDF process consists of various activities listed below.
i) Term Frequency
• The number of times a particular word appears in a document, normalized by the
document's length, is called term frequency.
tf(t, d) = (count of t in d) / (number of words in d)
ii) Document Frequency
• Document frequency is the number of documents in which the word appears. We
count a word at most once per document; it doesn't matter if the word is present
multiple times.
df(t) = number of documents containing t
iii) Inverse Document Frequency
• IDF (Inverse Document Frequency) is the inverse of document frequency.
• It evaluates the significance of a term by considering its informational
contribution.
• Common terms like "are," "if," and "a" provide minimal document insight.
• IDF diminishes the importance of frequently occurring terms and boosts rare ones.
idf(t) = N / df(t)
Finally, TF-IDF is calculated by combining the term frequency and the inverse
document frequency:
tf-idf(t, d) = tf(t, d) × log(N / (df(t) + 1))
The process can be explained using the following example:
Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.
The word counts across the above sentences (partial list) are
[going:3, to:2, today:2, i:2, am:2, it:1, is:1, rain:1]
• It combines term frequency (TF) and inverse document frequency (IDF).
• TF represents the frequency of a word in a document, while IDF evaluates its
significance across the collection.
• By assigning weights to words, TF-IDF aids in text mining, information retrieval,
and natural language processing.
The slides then tabulate the term frequency of each word in each document, followed
by the inverse document frequency of each word, and finally, applying the equation
above, the resulting tf-idf values (these tables appear as images on the original
slides).
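Since those tables were embedded as images, here is a small sketch that recomputes the slide's tf-idf formula for the three example documents (a manual computation; note that with the log(N/(df+1)) smoothing, a word appearing in every document gets a slightly negative score).

import math
from collections import Counter

docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.split() for d in docs]
N = len(docs)

# df(t): number of documents in which t appears at least once
df = Counter(word for doc in tokenized for word in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)  # tf(t, d)
    return tf * math.log(N / (df[term] + 1))       # tf(t, d) * log(N / (df + 1))

for i, doc in enumerate(tokenized, start=1):
    print(f"Document {i}:", {t: round(tf_idf(t, doc), 3) for t in doc})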
Using the above two language models, the complete dataset has been converted into
two kinds of vectors and stored in CSV files for easy access and minimal
reprocessing.
MACHINE LEARNING:
• Machine learning is a process in which the computer performs certain tasks
without being given explicit instructions. In this case the models take the
training data and train on it.
• Then, based on the training data, any new unknown data is processed using the
rules derived from the training data.
• After completing the count vectorization and TF-IDF stages in the workflow, the
data is converted into vector (numerical) form, which is used for training and
testing the models.
• For our study, various machine learning models are compared to determine which
method is more suitable for this task.
• The models used in the study are Naïve Bayes, K-Nearest Neighbors, and Support
Vector Machine.
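As a minimal sketch of this setup, assuming the Kaggle spam.csv file with columns v1 (label) and v2 (text), the data can be vectorized and split as follows; the later model sketches reuse these variables.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]

X_train, X_test, y_train, y_test = train_test_split(
    df["v2"], df["v1"], test_size=0.2, random_state=42, stratify=df["v1"]
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit the vocabulary on training data only
X_test_vec = vectorizer.transform(X_test)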
ALGORITHMS
A combination of 3 algorithms is used for the classification.
NAÏVE BAYES CLASSIFIER
A naïve Bayes classifier is a supervised probabilistic machine learning model that
is used for classification tasks. The main principle behind this model is Bayes'
theorem.
Bayes' Theorem: Naive Bayes is a classification technique based on Bayes' theorem,
with the assumption that all the features that predict the target value are
independent of each other. It calculates the probability of each class and then
picks the one with the highest probability.
P(A|B) = P(B|A) · P(A) / P(B)
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of data B given that hypothesis A was true.
P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior
probability of A.
P(B) is the probability of the data (regardless of the hypothesis).
Naïve Bayes classifiers are mostly used for text classification. The limitation of
the Naïve Bayes model is that it treats every word in a text as independent and
equally important; in reality, every word cannot be treated as equally important,
because articles and nouns do not carry the same weight in language.
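A minimal sketch of training the Naïve Bayes classifier on the TF-IDF features, continuing the variables from the pipeline sketch above (MultinomialNB is an assumption, since the slides do not name the exact variant):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)  # learns per-class word probabilities
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test_vec)))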
K-NEAREST NEIGHBORS
• KNN is a classification algorithm. It comes under supervised algorithms. All the
data points are assumed to lie in an n-dimensional space, and the category of a new
data point is determined by the majority category of its neighbors.
• Euclidean distance is used to measure the distance between points. The distance
between 2 points is calculated as
d = √((x₂ - x₁)² + (y₂ - y₁)²)
• The distances between the unknown point and all the others are calculated.
Depending on the K provided, the k closest neighbors are determined. The category
to which the majority of the neighbors belong is selected as the category of the
unknown data point.
• If the data contains up to 3 features, the points can be plotted and visualized.
KNN is fairly slow compared to algorithms such as SVM, as it needs to compute the
distance to all points to find the closest neighbors of a given point.
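A minimal KNN sketch on the same features (k = 5 is an illustrative choice; the slides do not state the K used):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X_train_vec, y_train)              # "training" just stores the points
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test_vec)))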
SUPPORT VECTOR MACHINES (SVM)
SVM is a machine learning algorithm for classification. Decision boundaries are
drawn between the categories, and the category of a point is determined by which
side of the boundary it falls on.
Support Vectors: The data points closest to the boundary are called support
vectors. They are called vectors rather than points because they are treated as
vectors starting from the origin. The distance between the support vectors on
opposite sides of the boundary is called the margin. We want the margin to be as
wide as possible because a wider margin yields better results.
There are three kernel types commonly used by SVM to create boundaries.
Linear: used if the data is linearly separable.
Poly: used if the data is not linearly separable; it maps the data into a
higher-dimensional space.
Radial (RBF): the default kernel in common SVM implementations; it implicitly maps
the data into an infinite-dimensional space.
• If the data is 2-dimensional, the boundaries are lines. If the data is
3-dimensional, the boundaries are planes. If the data has more than 3 dimensions,
the boundaries are called hyperplanes.
• An SVM relies mainly on its decision boundaries for predictions. It doesn't
compare a new point against all other data points to make a prediction; because of
this, SVMs tend to be quick at prediction time.
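A minimal SVM sketch using the RBF kernel described above, continuing the earlier variables:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel="rbf")  # radial basis function, SVC's default kernel
svm.fit(X_train_vec, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test_vec)))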
RESULTS
MODEL SELECTION
• To select the best language model, the data was converted into both types of
vectors and every model was tested with each to determine the best combination for
classifying spam.
• The results from the individual models are presented in the experimentation
section under methodology. The results from the models are compared below.
Model         Accuracy   Precision   F1 Score
Naive Bayes   95.94%     100%        97.91%
KNN           90.04%     100%        94.92%
SVM           97.29%     97.41%      97.35%
• From these results it is clear that TF-IDF proves better than BoW with every
model tested. Hence TF-IDF has been selected as the primary language model for
converting textual data during feature vector formation.
COMPARISON
The results from the proposed model have been compared with each individual model
in tabular form to illustrate the differences clearly. A sketch of one way to
construct such an ensemble follows.
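The slides do not specify how the three algorithms are combined, so the sketch below shows one plausible construction: a hard-voting (majority vote) ensemble of the three classifiers, continuing the earlier variables.

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(kernel="rbf")),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
)
ensemble.fit(X_train_vec, y_train)
print("Ensemble accuracy:", accuracy_score(y_test, ensemble.predict(X_test_vec)))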
SUMMARY
• There are two main tasks in the project implementation: language model selection
for completing the textual processing phase, and creation of the proposed model
from the individual algorithms. These two tasks required comparison against other
models and selection of various parameters for better efficiency.
• During the language model selection phase, two models, Bag of Words and TF-IDF,
were compared, and from the results obtained it is evident that TF-IDF performs
better.
CONCLUSION AND FUTURE SCOPE
Conclusion:
From the results obtained we can conclude that an ensemble machine learning model
is more effective at detecting and classifying spam than any individual algorithm.
We can also conclude that the TF-IDF (term frequency-inverse document frequency)
language model is more effective than the Bag of Words model for spam
classification when combined with several algorithms. Finally, spam detection can
improve further if machine learning algorithms are combined and tuned to the task
at hand.
Project Scope
This project needs a coordinated scope of work:
i. Combine existing machine learning algorithms to form a better ensemble
algorithm.
ii. Clean, process, and use the dataset for training and testing the created
model.
iii. Analyse the texts and extract entities for presentation.
Limitations
This project has certain limitations:
i. It can only predict and classify spam, not block it.
ii. Analysis can be tricky for some alphanumeric messages, and entity detection
may struggle on them.
iii. Since the data is reasonably large, it may take a few seconds to classify and
analyse a message.
THANK YOU