Introduction to Text Mining
Agenda
• Defining Text Mining
• Structured vs. Unstructured Data
• Why Text Mining
• Some Text Mining Ambiguities
• Pre-processing the Text
Text Mining
• The discovery by computer of new, previously unknown information, by
automatically extracting information from a usually large amount of different
unstructured textual resources
Previously unknown means:
• Discovering genuinely new information
• Discovering new knowledge, as opposed to merely finding patterns, is like a
detective following clues to find the criminal, as opposed to analysts looking at
crime statistics to assess overall trends in car theft
Unstructured means:
• Free naturally occurring text
• As opposed to HTML, XML, …
Text Mining Vs. Data Mining
• Data in Data mining is a series of numbers. Data for text mining is a collection of
documents.
• Data mining methods see data in spreadsheet format. Text mining methods see
data in document format
Structured vs. Unstructured Data
• Structured data
• Loadable into “spreadsheets”
• Arranged into rows and columns
• Each cell filled or could be filled
• Data mining friendly
• Unstructured data
• Microsoft Word, HTML, PDF documents, PPTs
• Usually converted into XML → semi-structured
• Not structured into cells
• Variable record length, notes, free form survey-answers
• Text is relatively sparse, inconsistent and not uniform
• Also images, video, music etc.
Why Text Mining?
• Leveraging text should improve decisions and predictions
• Text mining is gaining momentum
• Sentiment analysis (Twitter, Facebook)
• Predicting the stock market
• Predicting churn
• Customer influence
• Customer service and help desk
• Not to mention Watson
Why Is Text Mining Hard?
• Language is ambiguous
• Context is needed to clarify
• The same word can have different meanings (homographs)
• Bear (verb) – to support or carry
• Bear (noun) – a large animal
• Different words can mean the same (synonyms)
• Language is subtle
• Concept / word extraction usually results in a huge number of dimensions
• Thousands of new fields
• Each field typically has low information content (sparse)
• Misspellings, abbreviations, spelling variants
• These render search engines, SQL queries, etc. ineffective
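The dimensionality and sparsity points above can be made concrete with a small sketch. The three toy documents and the helper names `document_term_matrix` and `sparsity` are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix); each matrix row holds term counts for one document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Every distinct term becomes one column ("field") of the matrix.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


# Hypothetical toy documents.
docs = [
    "the bank raised its interest rates",
    "mary walked along the river bank",
    "customer churn is hard to predict",
]
vocabulary, matrix = document_term_matrix(docs)
print(len(vocabulary), "columns for", len(docs), "documents")
print(f"sparsity: {sparsity(matrix):.3f}")
```

Even with three short documents most cells are zero; on a real corpus the number of columns typically runs into the thousands, which is the "thousands of new fields" problem named above.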
Some Text Mining Ambiguities
• Homonymy: same word, different meaning
• Mary walked along the bank of the river
• HarborBank is the richest bank in the city
• Synonymy: different words with the same or similar meaning; one word can be
substituted for the other without changing the meaning
• Miss Nelson became a kind of big sister to Benjamin
• Miss Nelson became a kind of large sister to Benjamin
• Polysemy: same word or form, but different, albeit related, meanings
• The bank raised its interest rates yesterday
• The store is next to the newly constructed bank
• The bank appeared first in Italy in the Renaissance
• Hyponymy: Concept hierarchy or subclass
• Animal (noun) – cat, dog
• Injury – broken leg, contusion
Seven Types of Text Mining
• Search and Information Retrieval – storage and retrieval of text documents, including
search engines and keyword search
• Document Clustering – Grouping and categorizing terms, snippets, paragraphs or
documents using clustering methods
• Document Classification – grouping and categorizing snippets, paragraphs or documents
using data mining classification methods, based on models trained on labelled
examples
• Web Mining – data and text mining on the internet, with a specific focus on the scale and
interconnectedness of the web
• Information Extraction – Identification and extraction of relevant facts and relationships
from unstructured text
• Natural Language Processing – low-level language processing and understanding
tasks (e.g. part-of-speech tagging)
• Concept extraction – Grouping of words and phrases into semantically similar groups
Text Mining – Some Definitions
• Document – a sequence of words and punctuation, following the grammatical
rules of the language.
• Term – usually a word, but can be a word-pair or phrase
• Corpus – a collection of documents
• Lexicon – set of all unique words in corpus
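The four definitions above map directly onto simple data structures. The following is a minimal sketch, assuming a two-document toy corpus and a hypothetical helper `terms` that treats every run of letters as one term.

```python
import re

# A corpus is a collection of documents; each document here is one string.
corpus = [
    "Text mining finds new information in text.",
    "Data mining works on rows and columns.",
]


def terms(document):
    """Split one document into lower-cased single-word terms (punctuation dropped)."""
    return re.findall(r"[a-z]+", document.lower())


# The lexicon is the set of all unique words in the corpus.
lexicon = sorted({term for document in corpus for term in terms(document)})
print(lexicon)
```

Word pairs or phrases can also serve as terms; this sketch only covers the single-word case.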
Pre-processing the Text
• Text Normalization
• Parts of Speech Tagging
• Removal of stop words
• Stop words – common words that don’t add meaningful content to the document
• Stemming
• Removing suffixes and prefixes, leaving the root or stem of the word
• Term weighting
• POS Tagging
• Tokenization
Text Normalization
• Case
• Make all lower case (if you don’t care about proper nouns, titles, etc)
• Clean up transcription and typing errors
• e.g. ‘do n’t’, ‘movei’
• Correct misspelled words
• Phonetically
• Use fuzzy matching algorithms such as Soundex, Metaphone or string edit distance
• Dictionaries
• Use POS and context to make good guess
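As a sketch of the string-edit-distance option above, the code below lower-cases a word and maps it to the closest entry in a small dictionary. The function names and the four-word dictionary are assumptions for illustration; Soundex and Metaphone are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def correct(word, dictionary):
    """Lower-case the word and return the dictionary entry with the smallest edit distance."""
    word = word.lower()
    return min(dictionary, key=lambda entry: edit_distance(word, entry))


# Hypothetical dictionary of known-good words.
dictionary = ["movie", "mining", "text", "document"]
print(correct("Movei", dictionary))
```

Plain Levenshtein distance counts a swapped letter pair such as ‘ei’ for ‘ie’ as two edits, so a dictionary that also contained a one-edit neighbour would win over the intended word. That is one reason to combine it with POS and context, as the last bullet suggests.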
Parts of Speech Tagging
• Useful for recognizing names of people, places, organizations, titles
• English language
• Minimum set includes noun, verb, adjective, adverb, prepositions, conjunctions
POS Tags from Penn Treebank
Tag   Description                Tag   Description                Tag   Description
CC    Coordinating conjunction   CD    Cardinal number            DT    Determiner
EX    Existential there          FW    Foreign word               IN    Preposition or subordinating conjunction
JJ    Adjective                  JJR   Adjective, comparative     JJS   Adjective, superlative
LS    List item marker           MD    Modal                      NN    Noun, singular or mass
NNS   Noun, plural               NNPS  Proper noun, plural        PDT   Predeterminer
POS   Possessive ending          PRP   Personal pronoun           PRP$  Possessive pronoun
RB    Adverb                     RBR   Adverb, comparative        RBS   Adverb, superlative
RP    Particle                   SYM   Symbol                     TO    to
UH    Interjection               VB    Verb, base form            VBD   Verb, past tense
Example of Tagging
• In this talk, Mr. Pole discussed how Target was using Predictive Analytics including
descriptions of using potential value models, coupon models, and yes predicting
when a woman is due
• In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP
was/VBD using/VBG Predictive/NNP Analytics/NNP including/VBG
descriptions/NNS of/IN using/VBG potential/JJ value/NN models/NNS,
coupon/NN models/NNS, and yes predicting/VBG when/WRB a/DT woman/NN is
due/JJ
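Once text is tagged in the word/TAG form shown above, pulling out candidate names is a matter of filtering on the tag. The sketch below parses the first part of the example sentence; `parse_tagged` is a hypothetical helper, and the tagging itself is taken as given rather than computed.

```python
from collections import Counter

# First part of the tagged example sentence.
tagged = ("In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB "
          "Target/NNP was/VBD using/VBG Predictive/NNP Analytics/NNP")


def parse_tagged(text):
    """Turn 'word/TAG' tokens into (word, tag) pairs; untagged tokens get tag None."""
    pairs = []
    for token in text.split():
        token = token.rstrip(",")  # drop sentence punctuation stuck to the tag
        if "/" in token:
            word, tag = token.rsplit("/", 1)
        else:
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(tagged)
# NNP marks singular proper nouns, i.e. candidate names of people and organizations.
proper_nouns = [word for word, tag in pairs if tag == "NNP"]
tag_counts = Counter(tag for _, tag in pairs)
print(proper_nouns)
print(tag_counts.most_common(3))
```

Note that ‘Predictive’ and ‘Analytics’ are tagged NNP because of capitalization, so a proper-noun filter yields candidates that still need checking.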
Tokenization
• Converts streams of characters into words
• Main clues (in English): Whitespace
• No single algorithm always ‘works’
• Some languages do not have white space (Chinese, Japanese)
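The points above can be seen by comparing two simple tokenizers on one English sentence. This is a minimal sketch; both function names and the regular expression are illustrative assumptions, and neither approach handles languages written without whitespace.

```python
import re


def whitespace_tokenize(text):
    """Simplest tokenizer: split on runs of whitespace."""
    return text.split()


def regex_tokenize(text):
    """Emit words (keeping internal apostrophes) and single punctuation marks as tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Mr. Pole didn't say much, did he?"
print(whitespace_tokenize(sentence))
print(regex_tokenize(sentence))
```

The whitespace version leaves punctuation glued to words (‘much,’), while the regex version separates it but also splits the abbreviation ‘Mr.’ into two tokens, so neither is right for every case.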
Stemming
• Normalizes / unifies variations of the same data
• ‘walking’, ‘walks’, ‘walked’ → walk
• Inflectional stemming
• Remove plurals
• Normalize verb tenses
• Remove other affixes
• Stemming to root
• Reduce word to most basic element
• More aggressive than inflectional
• ‘denormalization’ → norm
• ‘Apply’, ‘applications’, ‘reapplied’ → apply
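Inflectional stemming can be sketched with a few suffix rules. The rule set below is a deliberately naive assumption, not an established algorithm such as the Porter stemmer, and it does not attempt the more aggressive stemming to root.

```python
def inflectional_stem(word):
    """Strip a few common English inflectional suffixes (naive, illustrative rules)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Require at least three remaining letters so short words
        # such as 'is' or 'red' are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("walking", "walks", "walked"):
    print(form, "->", inflectional_stem(form))
```

Rules this simple break on doubled consonants (‘running’) and on irregular forms, which is why practical systems rely on a tested stemmer or a dictionary lookup.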
Common English Stop Words
• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such,
that, the, their, then, these, they, this, to, was, will, with
• Stop words are very common and rarely provide useful information for
information extraction and concept extraction
• Removing stop words also reduces dimensionality
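Stop-word removal is a set-membership filter over the tokens. The sketch below uses a small subset of the list above; the function name and the example sentence are assumptions for illustration.

```python
# A subset of the stop-word list above, stored as a set for fast lookup.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "in", "is", "it",
    "of", "on", "the", "to", "was", "with",
}


def remove_stop_words(tokens):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = "The bank raised its interest rates in the city".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Three of the nine tokens are dropped here, which shows on a small scale how the filter reduces dimensionality.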
Dictionaries and Lexicons
• Highly recommended, but can be very time-consuming
• Reduce the set of keywords to focus on
• Words of interest
• Dictionary words
• Increase the set of keywords to focus on
• Proper nouns
• Acronyms
• Titles
• Numbers
• Key ways to use dictionary
• Local dictionary (specialized words)
• Stop words and too frequent words
• Stemming – reduce stems to dictionary words
• Synonyms – replace synonyms with root words in the list
• Resolve abbreviations and acronyms
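Two of the dictionary uses listed above, replacing synonyms and resolving abbreviations, can be sketched as lookups in hand-built mappings. Every entry and name below is a hypothetical example; a real local dictionary would be built for the domain at hand.

```python
# Hypothetical local dictionaries.
ABBREVIATIONS = {"pos": "part of speech", "nlp": "natural language processing"}
SYNONYMS = {"large": "big", "huge": "big"}  # each synonym maps to one chosen root word


def apply_dictionary(tokens):
    """Expand abbreviations, then replace synonyms with their root word."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in ABBREVIATIONS:
            # An abbreviation may expand into several tokens.
            result.extend(ABBREVIATIONS[key].split())
        else:
            result.append(SYNONYMS.get(key, key))
    return result


print(apply_dictionary(["NLP", "needs", "a", "large", "lexicon"]))
```

Building and maintaining such mappings by hand is where the time goes, which matches the first bullet of this slide.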
Sentiment Analysis Workflow
• Steps, in the order shown: Content Retrieval, Content Extraction, Corpus Generation,
Corpus Transformation, Corpus Filtering, Sentiment Calculation
• Phases: Web Data Retrieval, Corpus Pre-Processing, Sentiment Analysis
Sentiment Indicators
• polarity = $\frac{p - n}{p + n}$
• subjectivity = $\frac{p + n}{N}$
• positive sentiments per reference = $\frac{p}{N}$
• negative sentiments per reference = $\frac{n}{N}$
• sentiment differences per reference = $\frac{p - n}{N}$
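The indicators above are simple ratios of three counts. The slide does not define the symbols, so this sketch assumes that p and n are the numbers of positive and negative sentiment terms found in a document and N is the total number of terms. The function name, the dictionary keys, and the choice to return 0.0 for polarity when no sentiment terms are present are also assumptions.

```python
def sentiment_indicators(p, n, N):
    """Compute the five indicators from counts p, n and N (meanings assumed, see above)."""
    if N <= 0:
        raise ValueError("N must be positive")
    return {
        # Assumed convention: polarity is 0.0 when there are no sentiment terms.
        "polarity": (p - n) / (p + n) if (p + n) else 0.0,
        "subjectivity": (p + n) / N,
        "positive_per_reference": p / N,
        "negative_per_reference": n / N,
        "difference_per_reference": (p - n) / N,
    }


# Hypothetical counts: 6 positive and 2 negative terms out of 40.
scores = sentiment_indicators(p=6, n=2, N=40)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Polarity normalizes by the sentiment terms only, while the other four normalize by N, so two documents with equal polarity can still differ widely in subjectivity.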
3. introduction to text mining

  • 2. Agenda • Defining Text Mining • Structured vs. Unstructured Data • Why Text Mining • Some Text Mining Ambiguities • Pre-processing the Text
  • 3. Text Mining • The discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources • Previously unknown means: • Discovering genuinely new information • Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft • Unstructured means: • Free, naturally occurring text • As opposed to HTML, XML…
  • 4. Text Mining vs. Data Mining • Data in data mining is a series of numbers. Data for text mining is a collection of documents. • Data mining methods see data in spreadsheet format. Text mining methods see data in document format.
  • 5. Structured vs. Unstructured Data • Structured data • Loadable into “spreadsheets” • Arranged into rows and columns • Each cell filled or could be filled • Data mining friendly • Unstructured data • Microsoft Word, HTML, PDF documents, PPTs • Usually converted into XML → semi-structured • Not structured into cells • Variable record length, notes, free-form survey answers • Text is relatively sparse, inconsistent and not uniform • Also images, video, music, etc.
  • 6. Why Text Mining? • Leveraging text should improve decisions and predictions • Text mining is gaining momentum • Sentiment analysis (Twitter, Facebook) • Predicting the stock market • Predicting churn • Customer influence • Customer service and help desk • Not to mention Watson
  • 7. Why Is Text Mining Hard? • Language is ambiguous • Context is needed to clarify • The same word can have different meanings (homographs) • Bear (verb) – to support or carry • Bear (noun) – a large animal • Different words can mean the same thing (synonyms) • Language is subtle • Concept / word extraction usually results in a huge number of dimensions • Thousands of new fields • Each field typically has low information content (sparse) • Misspellings, abbreviations, spelling variants • These render search engines, SQL queries, etc. ineffective
  • 8. Some Text Mining Ambiguities • Homonymy: same word, different meaning • Mary walked along the bank of the river • HarborBank is the richest bank in the city • Synonymy: different words with similar or the same meaning; one word can be substituted for the other without changing the meaning • Miss Nelson became a kind of big sister to Benjamin • Miss Nelson became a kind of large sister to Benjamin • Polysemy: same word or form, but different, albeit related, meaning • The bank raised its interest rates yesterday • The store is next to the newly constructed bank • The bank appeared first in Italy in the Renaissance • Hyponymy: concept hierarchy or subclass • Animal (noun) – cat, dog • Injury – broken leg, contusion
  • 9. Seven Types of Text Mining • Search and Information Retrieval – storage and retrieval of text documents, including search engines and keyword search • Document Clustering – grouping and categorizing terms, snippets, paragraphs or documents using clustering methods • Document Classification – grouping and categorizing snippets, paragraphs or documents using data mining classification methods, based on models trained on labelled examples • Web Mining – data and text mining on the internet, with a specific focus on the scale and interconnectedness of the web • Information Extraction – identification and extraction of relevant facts and relationships from unstructured text • Natural Language Processing – low-level language processing and understanding tasks (e.g. tagging parts of speech) • Concept Extraction – grouping of words and phrases into semantically similar groups
  • 10. Text Mining – Some Definitions • Document – a sequence of words and punctuation, following the grammatical rules of the language. • Term – usually a word, but can be a word-pair or phrase • Corpus – a collection of documents • Lexicon – set of all unique words in corpus
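The definitions above can be illustrated with a minimal Python sketch. The three-document corpus and the `terms` helper are hypothetical, chosen only for illustration, and a term is assumed here to be a single lower-cased word (word pairs and phrases are not handled).

```python
import re

# Hypothetical three-document corpus, for illustration only.
corpus = [
    "The bank raised its interest rates.",
    "Mary walked along the bank of the river.",
    "The store is next to the bank.",
]

def terms(document):
    """Return the single-word terms of a document, lower-cased."""
    return re.findall(r"[a-z]+", document.lower())

# The lexicon is the set of all unique words in the corpus.
lexicon = sorted({term for document in corpus for term in terms(document)})

print(len(corpus), "documents,", len(lexicon), "unique words")
```

Keeping the lexicon as a sorted list gives every unique word a stable position, which is convenient if documents are later turned into term-count vectors.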
  • 11. Pre-processing the Text • Text Normalization • Parts of Speech Tagging • Removal of stop words • Stop words – common words that don’t add meaningful content to the document • Stemming • Removing suffixes and prefixes, leaving the root or stem of the word • Term weighting • POS Tagging • Tokenization
  • 12. Text Normalization • Case • Make all lower case (if you don’t care about proper nouns, titles, etc.) • Clean up transcription and typing errors • do n’t, movei • Correct misspelled words • Phonetically • Use fuzzy matching algorithms such as Soundex, Metaphone or string edit distance • Dictionaries • Use POS and context to make a good guess
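As a rough sketch of the string edit distance approach to misspelling correction, the Python below computes the Levenshtein distance and maps a word to its nearest dictionary entry. The `correct` helper, the three-word vocabulary and the `max_distance` threshold are assumptions made for illustration; they do not come from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def correct(word, vocabulary, max_distance=2):
    """Lower-case the word and replace it with the closest vocabulary
    entry, unless nothing lies within max_distance edits."""
    word = word.lower()
    best = min(vocabulary, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word

# Hypothetical dictionary, for illustration only.
vocabulary = ["movie", "errors", "talk"]
print(correct("Movei", vocabulary))
print(correct("errrors", vocabulary))
```

Plain Levenshtein distance counts a swapped pair of letters, as in "movei", as two edits. Real dictionaries are large, so ties and near-misses are common, which is why the slide also suggests using POS and context.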
  • 13. Parts of Speech Tagging • Useful for recognizing names of people, places, organizations, titles • English language • Minimum set includes noun, verb, adjective, adverb, prepositions, conjunctions • POS tags from the Penn Treebank (Tag – Description): CC – Coordinating conjunction; CD – Cardinal number; DT – Determiner; EX – Existential there; FW – Foreign word; IN – Preposition or subordinating conjunction; JJ – Adjective; JJR – Adjective, comparative; JJS – Adjective, superlative; LS – List item marker; MD – Modal; NN – Noun, singular or mass; NNS – Noun, plural; NNPS – Proper noun, plural; PDT – Predeterminer; POS – Possessive ending; PRP – Personal pronoun; PRP$ – Possessive pronoun; RB – Adverb; RBR – Adverb, comparative; RBS – Adverb, superlative; RP – Particle; SYM – Symbol; TO – To; UH – Interjection; VB – Verb, base form; VBD – Verb, past tense
  • 14. Example of Tagging • In this talk, Mr. Pole discussed how Target was using Predictive Analytics including descriptions of using potential value models, coupon models, and yes predicting when a woman is due • In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP was/VBD using/VBG Predictive/NNP Analytics/NNP including/VBG descriptions/NNS of/IN using/VBG potential/JJ value/NN models/NNS, coupon/NN models/NNS, and yes predicting/VBG when/WRB a/DT woman/NN is due/JJ
  • 15. Tokenization • Converts streams of characters into words • Main clue (in English): whitespace • No single algorithm always ‘works’ • Some languages do not use whitespace between words (Chinese, Japanese)
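To make the whitespace point concrete, here is a small Python sketch that compares a plain whitespace split with a regular-expression tokenizer. The pattern is one illustrative choice for English and is an assumption, not the method used in the slides. It keeps simple contractions together and splits punctuation into separate tokens.

```python
import re

# Words (optionally with one apostrophe part, e.g. "didn't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sentence = "Mr. Pole didn't talk."
print(sentence.split())    # whitespace only: punctuation stays attached
print(tokenize(sentence))  # regex: punctuation becomes its own token
```

Neither result is ideal: the whitespace split leaves "talk." with its period attached, while the regular expression separates the period from the abbreviation "Mr.", which is one reason no single tokenizer works in every case.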
  • 16. Stemming • Normalizes / unifies variations of the same word • ‘walking’, ‘walks’, ‘walked’ → walk • Inflectional stemming • Remove plurals • Normalize verb tenses • Remove other affixes • Stemming to root • Reduce word to most basic element • More aggressive than inflectional • ‘denormalization’ → norm • ‘apply’, ‘applications’, ‘reapplied’ → apply
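Inflectional stemming can be sketched as simple suffix stripping. The Python below is a deliberately naive illustration: the suffix list and the minimum stem length of three characters are assumptions, and this is not an established algorithm such as the Porter stemmer.

```python
# Suffixes are checked in this order, so longer ones are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def inflectional_stem(word, min_stem_length=3):
    """Strip one common inflectional suffix, keeping at least
    min_stem_length characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["walking", "walks", "walked", "models"]:
    print(word, "->", inflectional_stem(word))
```

Rules this simple produce non-words for many inputs (for example, "applies" becomes "appli"), which is why practical stemmers add rewrite rules or fall back to a dictionary.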
  • 17. Common English Stop Words • a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, these, they, this, to, was, will, with • Stop words are very common and rarely provide useful information for information extraction and concept extraction • Removing stop words also reduces dimensionality
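Stop-word removal amounts to a set-membership filter. The Python sketch below uses a subset of the list above; the function name and the example sentence are illustrative assumptions.

```python
# A subset of the stop-word list from the slide.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "these", "they", "this", "to", "was", "will",
    "with",
}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The store is next to the newly constructed bank".split()
print(remove_stop_words(tokens))
```

Every removed word is one fewer column in a term-by-document representation, which is the dimensionality reduction the slide refers to. Removing "no" and "not" discards negation, so this list may be a poor fit for sentiment analysis.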
  • 18. Dictionaries and Lexicons • Highly recommended, but can be very time consuming • Reduce the set of keywords to focus on • Words of interest • Dictionary words • Increase the set of keywords to focus on • Proper nouns • Acronyms • Titles • Numbers • Key ways to use a dictionary • Local dictionary (specialized words) • Stop words and too-frequent words • Stemming – reduce stems to dictionary words • Synonyms – replace synonyms with root words in the list • Resolve abbreviations and acronyms
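A local dictionary for synonyms and abbreviations can be sketched as a lookup table from variant to root word. The entries and the `apply_dictionary` name below are hypothetical; a real dictionary would be built for the domain at hand.

```python
# Hypothetical local dictionary: variant -> root word.
LOCAL_DICTIONARY = {
    "automobile": "car",   # synonym
    "auto": "car",         # synonym
    "dept": "department",  # abbreviation
    "mgr": "manager",      # abbreviation
}

def apply_dictionary(tokens):
    """Lower-case each token and replace known variants with their root word."""
    return [LOCAL_DICTIONARY.get(token.lower(), token.lower()) for token in tokens]

print(apply_dictionary(["Dept", "mgr", "bought", "an", "automobile"]))
```

Mapping variants onto one root word shrinks the keyword set, at the cost of maintaining the table by hand.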
  • 19. Sentiment Analysis Workflow • Stages: Content Retrieval → Content Extraction → Corpus Generation → Corpus Transformation → Corpus Filtering → Sentiment Calculation • Phases: Web Data Retrieval, Corpus Pre-Processing, Sentiment Analysis
  • 20. Sentiment Indicators • $\text{polarity} = \frac{p - n}{p + n}$ • $\text{subjectivity} = \frac{p + n}{N}$ • $\text{positive sentiments per reference} = \frac{p}{N}$ • $\text{negative sentiments per reference} = \frac{n}{N}$ • $\text{sentiment differences per reference} = \frac{p - n}{N}$
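The indicators above can be computed directly once the counts are known. The slides do not define the symbols, so this Python sketch assumes that p is the number of positive terms, n the number of negative terms, and N the total number of terms in a document. The function name and the example counts are hypothetical.

```python
def sentiment_indicators(p, n, N):
    """Compute the five indicators from assumed counts:
    p positive terms, n negative terms, N terms in total."""
    if N <= 0:
        raise ValueError("N must be positive")
    return {
        # Polarity is undefined with no sentiment terms; 0.0 is a chosen default.
        "polarity": (p - n) / (p + n) if (p + n) else 0.0,
        "subjectivity": (p + n) / N,
        "positive_per_reference": p / N,
        "negative_per_reference": n / N,
        "difference_per_reference": (p - n) / N,
    }

# Hypothetical document: 6 positive and 2 negative terms out of 40.
for name, value in sentiment_indicators(6, 2, 40).items():
    print(f"{name}: {value:.2f}")
```

Polarity ranges from -1 (only negative terms) to 1 (only positive terms), while subjectivity measures how much of the document carries sentiment at all.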