Text mining, word embeddings, & wikipedia

Muhammad Atif Qureshi

12/01/17 2
Contents
● Introduction
● Text Mining
– Similar words
– Word ambiguity
● Word Embedding
– Related Research
– Toy Example
● Wikipedia
– Structure
– Phrase Chunking
– Case studies

Problem
● Motivation
– People have long found comfort in expressing their viewpoints in writing, because writing preserves thoughts for longer than oral communication.
– Text is a very popular means of communication on the World Wide Web: online news websites, social networks, emails, governmental websites, etc.
● Observation
Text may contain the following complexities:
– Lack of contextual and background information
– Ambiguity due to more than one possible interpretation of the meaning of text
– Focus and assertions on multiple topics

Text Mining
● Motivation
With so much textual data around us, especially on the World Wide Web, there is a strong motivation to understand what the data means.
● Definition
Text mining is the process of analyzing textual data to derive high-quality information from the patterns in it.

Similar Words
● Can similar words be grouped together as one?
– Simple techniques
● Lemmatization (maps inflected forms to a dictionary form, e.g. plurals to singulars; accurate but low coverage)
● Stemming (maps a word to a root form by stripping affixes; less accurate but high coverage)
– Complex technique
● “A word is known by the company it keeps” → Word Embeddings
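The trade-off between the two simple techniques can be illustrated with a minimal sketch. The lemma dictionary and suffix list below are tiny, hypothetical stand-ins (real lemmatizers and stemmers use large lexicons and carefully designed rules), so treat this as an illustration rather than a usable tool.

```python
# Toy lemma dictionary (hypothetical; real lemmatizers use large lexicons).
LEMMAS = {"mice": "mouse", "studies": "study", "cars": "car"}

# Toy suffix list (hypothetical), checked in this order.
SUFFIXES = ("ies", "ing", "es", "ed", "s")


def lemmatize(word):
    """Dictionary lookup: correct for known words, unchanged otherwise."""
    return LEMMAS.get(word, word)


def stem(word):
    """Strip the first matching suffix; works on any word, sometimes wrongly."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["cars", "studies", "mice", "running", "news"]:
    print(w, "-> lemma:", lemmatize(w), "| stem:", stem(w))
```

The lemmatizer is right whenever it knows the word ("mice" becomes "mouse") but leaves unknown words such as "running" untouched. The stemmer changes almost every word, yet produces non-words like "stud" and wrongly cuts "news" to "new".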

Word Ambiguity
● Is Apple a company or a fruit?
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
● Context is important
– Tastes → Fruit
– Phones → Apple Inc.
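One very simple way to use context is to count cue words for each sense. The cue-word sets below are invented for this example; a real system would learn them from data, so this is only a sketch of the idea.

```python
# Hypothetical cue words for each sense of "apple".
SENSE_CUES = {
    "fruit": {"tastes", "eat", "juice", "sweet"},
    "Apple Inc.": {"phones", "iphone", "laptop", "company"},
}


def disambiguate(sentence, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    # Return None when no cue word appears at all.
    return best if scores[best] > 0 else None


print(disambiguate("Apple tastes better than blackberry"))
print(disambiguate("Apple phones are better than blackberry"))
```

The first sentence resolves to the fruit sense because of "tastes", the second to Apple Inc. because of "phones".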

Word Embedding
● Definition
– A technique in NLP that represents a concept (word or phrase) as a vector of real numbers.
● Simple application scenario
– How similar are two words?
– Similarity(vector(good), vector(best))
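A common choice for the Similarity function is cosine similarity. The sketch below assumes that choice and uses made-up 3-dimensional vectors; real embeddings have hundreds of dimensions and are learned from text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
vectors = {
    "good": [0.8, 0.6, 0.1],
    "best": [0.9, 0.5, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["good"], vectors["best"]))
print(cosine_similarity(vectors["good"], vectors["car"]))
```

With these vectors, "good" scores much closer to "best" than to "car".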

Related Research
● Word embeddings
– Word2Vec
● A predictive model that uses a shallow, two-layer neural network
– FastText
● An extension of Word2Vec from Facebook
– GloVe
● A count-based model that performs dimensionality reduction on the co-occurrence matrix
● Wikipedia-based relatedness
– Semantic Relatedness Framework
● Uses the Wikipedia sub-category hierarchy to measure relatedness

Toy Example → Word Embeddings
● Build a co-occurrence matrix
● Find vectors (each row of the matrix is a word's vector)
● Apply cosine similarity
● Further concepts
– Dimensionality reduction
– Window size
– Filter words
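The toy example can be sketched in a few lines of plain Python. The corpus, the stopword list, and the function names are invented for illustration. Dimensionality reduction is left out to keep the sketch short.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "on", "and"}  # hypothetical filter words


def cooccurrence_vectors(sentences, window=2, stopwords=STOPWORDS):
    """For each word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in stopwords]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "a car needs fuel",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))  # shared neighbours: drinks, chases
print(cosine(vecs["cat"], vecs["car"]))  # no shared neighbours
```

Each row of the co-occurrence table is a word's vector. "cat" and "dog" share neighbours and so get a positive similarity, while "cat" and "car" share none. A larger window captures looser, more topical associations; a smaller one captures tighter, more syntactic ones.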

Word Analogies
● Man is to Woman as King is to ____ ?
● London is to England as Islamabad is to ____ ?
● Using vectors, we can say
– King – Man + Woman → Queen
– Islamabad – London + England → Pakistan
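The vector arithmetic can be tried on hand-made vectors. The three dimensions below (roughly royalty, masculine, feminine) and all the values are invented so that the arithmetic works out cleanly; learned embeddings only approximate this behaviour.

```python
import math

# Hand-made vectors (hypothetical): [royalty, masculine, feminine].
VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.1],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors=VECTORS):
    """Solve 'a is to b as c is to ?' using the offset c - a + b."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king"))
```

Here King – Man + Woman lands on the vector for "queen", which therefore wins over the distractor "prince".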

Why Wikipedia for Text Mining?
● One of the largest encyclopedias
● Free to use
● Collaboratively and actively updated

Wikipedia
● Each article has a title that identifies a concept.
● Each article's content defines that concept in text.
● Each article belongs to one or more categories
– E.g., the article ‘Espresso’ belongs to ‘Coffee drinks’, ‘Italian cuisine’, etc.
● Each Wikipedia category generally has parent and child categories.
– E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by nationality’, etc.
– E.g., ‘Italian cuisine’ has child categories ‘Italian desserts’, ‘Pizza’, etc.
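The article-category structure can be pictured as two small mappings. The sketch below encodes only the Espresso example from this slide; the dictionary layout and function names are assumptions for illustration, not how Wikipedia or any particular toolkit stores the data.

```python
# Toy data from the Espresso example; real Wikipedia has far more links.
ARTICLE_CATEGORIES = {"Espresso": ["Coffee drinks", "Italian cuisine"]}
PARENT_CATEGORIES = {
    "Italian cuisine": ["Italian culture", "Cuisine by nationality"],
    "Italian desserts": ["Italian cuisine"],
    "Pizza": ["Italian cuisine"],
}


def child_categories(category):
    """Categories that list `category` as one of their parents."""
    return sorted(c for c, parents in PARENT_CATEGORIES.items() if category in parents)


def ancestor_categories(article):
    """All categories reachable by walking upward from the article."""
    seen = set()
    stack = list(ARTICLE_CATEGORIES.get(article, []))
    while stack:
        category = stack.pop()
        if category not in seen:
            seen.add(category)
            stack.extend(PARENT_CATEGORIES.get(category, []))
    return seen


print(child_categories("Italian cuisine"))
print(sorted(ancestor_categories("Espresso")))
```

Walking upward from ‘Espresso’ reaches its two direct categories plus the parents of ‘Italian cuisine’. The `seen` set matters on real data, where the category graph is not a strict tree.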

Wikipedia Graph Structure
[Figure: Wikipedia category graph structure along with Wikipedia articles. Category nodes (C1 to C7, C9, C10) and article nodes (A1 to A4). Legend: category, article, category edge, article belonging to category, article link.]

Example of Wikipedia Category Structure
[Figure: Truncated Wikipedia category graph. Nodes: academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, information_science.]

Phrase Chunking using Wikipedia
● Input: I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.
● Conversion into lowercase: i prefer samsung s5 over htc, apple, nokia because it is economical and good.
● Phrase chunking using phrase boundaries: i prefer samsung s5 over htc apple nokia because it is economical and good
● Longest phrase that matches a Wikipedia article title or redirect (and is not a stopword)
– Extracted phrases: prefer, samsung s5, htc, apple, nokia, economical, good
– Removed stopwords: i, over, because, it, is, and
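The chunking steps on this slide can be sketched as a greedy longest-match over word n-grams. The title set and stopword list below are small, hypothetical stand-ins for the real Wikipedia title/redirect index, and the maximum phrase length of three words is an arbitrary assumption.

```python
import re

# Hypothetical stand-ins for Wikipedia titles/redirects and a stopword list.
TITLES = {"prefer", "samsung", "samsung s5", "htc", "apple", "nokia",
          "economical", "good", "it"}
STOPWORDS = {"i", "over", "because", "it", "is", "and"}


def chunk_phrases(text, titles=TITLES, stopwords=STOPWORDS, max_len=3):
    """Extract the longest title matches without crossing punctuation."""
    phrases = []
    # Lowercase, then split at punctuation (the phrase boundaries).
    for segment in re.split(r"[.,;:!?]+", text.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            # Try the longest candidate first, then shorter ones.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in titles and candidate not in stopwords:
                    phrases.append(candidate)
                    i += n
                    break
            else:
                i += 1  # no title starts here: skip this token
    return phrases


sentence = "I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."
print(chunk_phrases(sentence))
```

Trying the longest candidate first is what keeps "samsung s5" together instead of matching "samsung" alone, and the stopword check drops "it" even though it appears in the title set.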

Word Embedding using Wikipedia
● Wikipedia lets us find more complex relationships thanks to
– Article-category graph structure
– Multilingual relations
– Infobox fields (birth, age, etc.)

Wikipedia Based Semantic Relatedness Framework
[Figure: framework pipeline. Components: stream of text; phrase chunking (driven by Wikipedia article titles or redirects, taken from Wikipedia documents); candidate phrases; relatedness calculator (driven by the Wikipedia category-article structure); relatedness scores; applications: online reputation management tasks and a perspective-aware search engine.]

Perspective Aware Approach to Search
● Problem: The result set a search engine (Google, Bing, Yahoo) returns for a user's query may carry an inherent perspective, caused by the search engine itself or by the underlying collection.
● PAS is a system that allows users to specify a perspective together with their query at query time.
● The system lets users quickly gauge how present that perspective is in the returned result set.

Search
● A perspective is modelled using the Wikipedia article-category graph structure
– Perspective: activism
– The system fetches the Wikipedia articles that define activism by traversing the category graph structure
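Collecting the articles that define a perspective can be sketched as a depth-limited, breadth-first walk down the category graph. The graph fragment below is made up for illustration, and the depth limit is an assumption; the slides do not say how far PAS descends.

```python
from collections import deque

# Made-up fragment of a category graph for the perspective "activism".
SUBCATEGORIES = {
    "activism": ["activism by issue", "activists"],
    "activism by issue": ["environmentalism"],
}
CATEGORY_ARTICLES = {
    "activism": ["protest"],
    "activism by issue": ["civil disobedience"],
    "activists": ["activist"],
    "environmentalism": ["climate movement"],
}


def perspective_articles(category, max_depth=1):
    """Collect article titles from a category and its sub-categories."""
    articles, seen = set(), {category}
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        articles.update(CATEGORY_ARTICLES.get(current, []))
        if depth < max_depth:
            for sub in SUBCATEGORIES.get(current, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return articles


print(sorted(perspective_articles("activism", max_depth=1)))
print(sorted(perspective_articles("activism", max_depth=2)))
```

A small depth keeps the article set tightly on-topic, while a larger one pulls in more loosely related articles, which is why a cut-off of some kind is needed on the real graph.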

Keyword Extraction via Identification of Domain-Specific Keywords
[Figure: keyword extraction pipeline. Titles of web pages are intersected with Wikipedia articles & redirects to give intersected phrases (identifies readable phrases). A community detection algorithm over the Wikipedia category graph, exploiting the Wikipedia article-category structure, yields domain-specific phrases. These are merged with domain-specific single terms to produce the domain-specific keywords.]
● Problem: Given a collection of document titles from different school websites, extract the domain-specific keywords that represent each website's domain.
● Example: “Information Retrieval”, “Science”

Innovation in Automotive
Red → Probability 1.0
Green → Probability 0.5
White → Probability 0.0
Size represents how often a category is mentioned in the dataset

Python Snippet for the Usage of the WikiMadeEasy API
# Wiki_client_service is provided by WikiMadeEasy; its import line is not shown on the slide.
wiki_client = Wiki_client_service()
print(wiki_client.process(['isTitle', 'business', 0]))
print(wiki_client.process(['isPerson', 'albert einstein', 0]))
print(wiki_client.process(['mentionInCategories', 'data mining', 0]))
print(wiki_client.process(['containsArticles', 'business', 0]))
print(wiki_client.process(['matchesCategories', 'pakistan', 0]))
print(wiki_client.process(['matchesArticles', 'computer science', 0]))
print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0]))
print(wiki_client.process(['getWikiInlinks', 'google', 0]))
print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0]))
print(wiki_client.process(['getSubCategory', 'science', 0]))
print(wiki_client.process(['getSuperCategory', 'science', 0]))
graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])
