A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data

9/29/2014 Petar Ristoski, Heiko Paulheim

Motivation
• Many existing applications use LOD as background knowledge
in data mining
– Explaining data patterns and statistics: unemployment rate,
inflation, energy savings, etc.
– Content-based book and movie recommendation
– Classifying incident-related tweets
– Gene classification
– Prediction of car fuel consumption

Motivation
[Figure: processing pipeline over local data and LOD: link, combine, cleanse, transform, analyze]

Motivation
• Standard data mining algorithms require propositional feature
vector representation
• Feature space: V={v1,v2,…, vn}
• Each instance is represented as an n-dimensional feature vector
(v1,v2,…,vn), where for each i with 1 ≤ i ≤ n:
– vi ∈ {true, false}, or vi ∈ {1,0}
– vi ∈ ℝ
– vi ∈ S, where S is a finite set of symbols

Motivation
Name                 Person   Music Artist   Instrument   Genre
Trent Reznor         1        1              1            0
Wolfgang A. Mozart   1        1              1            1
Barack Obama         1        0              0            0
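
As a minimal sketch of how a propositional table like the one above might be produced, the following assumes each resource comes with a set of attributes observed in LOD. The dictionary layout and the function name `to_binary_table` are illustrative assumptions, not part of the presented tooling.

```python
# Hypothetical input: resource name -> set of attributes observed in LOD.
# The values mirror the example table; the data structure is an assumption.
observed = {
    "Trent Reznor": {"Person", "Music Artist", "Instrument"},
    "Wolfgang A. Mozart": {"Person", "Music Artist", "Instrument", "Genre"},
    "Barack Obama": {"Person"},
}


def to_binary_table(observed):
    """Return (columns, rows); each row is a 0/1 feature vector."""
    # Fix one column order so every instance gets the same n dimensions.
    columns = sorted({a for attrs in observed.values() for a in attrs})
    rows = {
        name: [1 if col in attrs else 0 for col in columns]
        for name, attrs in observed.items()
    }
    return columns, rows


columns, rows = to_binary_table(observed)
for name, vector in rows.items():
    print(name, dict(zip(columns, vector)))
```

The columns come out in alphabetical order here, so they are ordered differently from the slide, but every instance still maps to a vector of the same length.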

Related Work
• LiDDM (Narasimha et al.)
– a framework for Linked Data mining that captures data from the LOD
cloud to extract hidden information
• SPARQL-ML (Kiefer et al.)
– extension to SPARQL to support data mining tasks for knowledge
discovery in the Semantic Web
• FeGeLOD (Paulheim et al.)
– Unsupervised generation of data mining features from LOD
• The resulting features are binary, or numerical aggregates using
SPARQL COUNT constructs
• No proper evaluation of the used propositionalization strategy

PROPOSED STRATEGIES

Strategies
• Strategies for features derived from specific relations
– r rdf:type C
– r dcterms:subject S
• Strategies for features derived from generic relations

Strategies for Features Derived from Specific Relations
1. Binary feature:
– $v_i = 1$ if $C(r)$
– $v_i = 0$ if $\neg C(r)$
2. Relative count feature:
– $v_i = \frac{1}{n}$, where r has relation to n objects
3. TF-IDF feature:
– $v_i = \frac{1}{n} \cdot \log \frac{N}{|\{r \mid C(r)\}|}$, where N is the total number of resources in the dataset,
and $|\{r \mid C(r)\}|$ denotes the number of resources for which the specific
relation to an object C exists
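
The three strategies above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the input layout (`types_of`), the function name, the use of the natural logarithm, and the value 0 for an absent relation are all choices made for this sketch.

```python
import math


def specific_relation_features(types_of, strategy="binary"):
    """Map each resource to {class: value} for one of the three strategies.

    types_of: dict resource -> set of classes C for which C(r) holds
    (hypothetical input layout, chosen for this sketch).
    """
    N = len(types_of)  # total number of resources in the dataset
    # |{r | C(r)}|: number of resources that have class C
    df = {}
    for classes in types_of.values():
        for c in classes:
            df[c] = df.get(c, 0) + 1

    features = {}
    for r, classes in types_of.items():
        n = len(classes)  # r has the relation to n objects
        row = {}
        for c in sorted(df):
            if c not in classes:
                row[c] = 0.0  # assumption: absent relation -> 0 in every strategy
            elif strategy == "binary":
                row[c] = 1.0
            elif strategy == "relative_count":
                row[c] = 1.0 / n
            elif strategy == "tfidf":
                # assumption: natural logarithm; the slides do not state the base
                row[c] = (1.0 / n) * math.log(N / df[c])
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        features[r] = row
    return features


# Toy data, invented for illustration (not taken from DBpedia).
toy = {
    "a": {"Person", "Artist"},
    "b": {"Person"},
    "c": {"Person", "Artist", "MilitaryPerson"},
    "d": {"Person"},
}
tfidf = specific_relation_features(toy, "tfidf")
```

With this toy data, a class held by every resource (Person) receives TF-IDF weight 0 because log(N/N) = 0, while the rare class MilitaryPerson keeps a positive weight.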

Features Derived from Specific Relations: Binary vs TF-IDF
[Figure: DBpedia types dbpedia:Person, dbpedia:Artist, dbpedia:MusicArtist and dbpedia:MilitaryPerson; instances dbpedia:Kris_Kristofferson and dbpedia:Elvis_Presley, plus 10 further music artists]
Binary features:
Name                 Person   Artist   Music Artist   Military Person
Elvis Presley        1        1        1              1
Kris Kristofferson   1        1        1              1
Artist X             1        1        1              0

Features Derived from Specific Relations: Binary vs TF-IDF
[Figure: DBpedia types dbpedia:Person, dbpedia:Artist, dbpedia:MusicArtist and dbpedia:MilitaryPerson; instances dbpedia:Kris_Kristofferson and dbpedia:Elvis_Presley, plus 10 further music artists]
TF-IDF features:
Name                 Person   Artist   Music Artist   Military Person
Elvis Presley        0        0        0              0.672
Kris Kristofferson   0        0        0              0.672
Artist X             0        0        0              0

Strategies
• Strategies for features derived from specific relations
– r rdf:type C → C(r)
– dcterms:subject
• Strategies for features derived from relations as such
– describe how resource r is related to resource r′
– Outgoing relation: p(r, r′)
– Incoming relation: p(r′, r)

Strategies for Features Derived from Generic Relations
1. Binary feature:
– $v_i = 1$ if $\exists r' : p(r, r')$
– $v_i = 0$ if $\neg \exists r' : p(r, r')$
2. Count feature:
– $v_i = n$, where r is connected to n resources with relation p
3. Relative count feature:
– $v_i = \frac{n_p}{P}$, where P is the total number of outgoing relations for r, and $n_p$ is
the number of relations of type p for r
4. TF-IDF feature:
– $v_i = \frac{n_p}{P} \cdot \log \frac{N}{|\{r \mid \exists r' : p(r, r')\}|}$, where N is the total number of resources in the
dataset, and $|\{r \mid \exists r' : p(r, r')\}|$ denotes the number of resources for which
p(r, r′) exists
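
A possible Python rendering of the four strategies for generic relations is sketched below. It assumes the data is available as (subject, predicate, object) triples, treats the set of subjects as the dataset, and uses the natural logarithm; the function name and these choices are assumptions of the sketch, not details given in the slides.

```python
import math
from collections import Counter, defaultdict


def generic_relation_features(triples, strategy="binary"):
    """Features from outgoing relations; triples are (subject, predicate, object).

    For incoming relations, pass the triples with subject and object swapped.
    """
    n_p = defaultdict(Counter)  # resource r -> {relation p: number of p-edges}
    for s, p, _o in triples:
        n_p[s][p] += 1

    # Assumption: the dataset consists of the subjects seen in `triples`.
    N = len(n_p)
    # |{r | exists r': p(r, r')}|: resources with at least one p-edge
    df = Counter()
    for per_resource in n_p.values():
        df.update(per_resource.keys())

    features = {}
    for r, per_resource in n_p.items():
        P = sum(per_resource.values())  # all outgoing relations of r
        row = {}
        for p in sorted(df):
            n = per_resource[p]  # Counter returns 0 for a missing relation
            if strategy == "binary":
                row[p] = 1.0 if n > 0 else 0.0
            elif strategy == "count":
                row[p] = float(n)
            elif strategy == "relative_count":
                row[p] = n / P
            elif strategy == "tfidf":
                # assumption: natural logarithm; the base is not given in the slides
                row[p] = (n / P) * math.log(N / df[p])
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        features[r] = row
    return features


# Toy triples, invented for illustration.
toy_triples = [
    ("x", "plays", "guitar"),
    ("x", "plays", "piano"),
    ("x", "wrote", "book1"),
    ("y", "wrote", "book2"),
]
relative = generic_relation_features(toy_triples, "relative_count")
```

Counting incoming relations needs no separate code path in this sketch: swapping subject and object in each triple turns p(r′, r) into the outgoing case.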

Features Derived from Generic Relations: Binary vs Relative Count
[Figure: graph of dbpedia:Chester_Bennington, dbpedia:Anthony_Kiedis and dbpedia:Jules_Verne with relation counts: dbpedia:instrument 8, 5, 1; dbpedia:author 68]
Binary features:
Name                 instrument   author
Chester Bennington   1            0
Anthony Kiedis       1            1
Jules Verne          0            1

Features Derived from Generic Relations: Binary vs Relative Count
[Figure: graph of dbpedia:Chester_Bennington, dbpedia:Anthony_Kiedis and dbpedia:Jules_Verne with relation counts: dbpedia:instrument 8, 5, 1; dbpedia:author 68]
Relative count features:
Name                 instrument   author
Chester Bennington   1            0
Anthony Kiedis       0.833        0.166
Jules Verne          0            1
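
The difference between the two representations in this example can be reproduced with a short calculation. The per-resource counts below are an assumed reading of the figure (8 instrument relations for Chester Bennington, 5 instrument and 1 author relation for Anthony Kiedis, 68 author relations for Jules Verne); the helper names are illustrative.

```python
# Relation counts per resource; which count belongs to which resource is an
# assumption made for this sketch, chosen to be consistent with the table.
counts = {
    "Chester Bennington": {"instrument": 8, "author": 0},
    "Anthony Kiedis": {"instrument": 5, "author": 1},
    "Jules Verne": {"instrument": 0, "author": 68},
}


def binary(row):
    """1 if the resource has at least one relation of that type, else 0."""
    return {p: 1 if n > 0 else 0 for p, n in row.items()}


def relative_count(row):
    """Share of each relation type among all relations of the resource."""
    total = sum(row.values())
    if total == 0:
        return {p: 0.0 for p in row}
    return {p: n / total for p, n in row.items()}


for name, row in counts.items():
    shares = {p: round(v, 3) for p, v in relative_count(row).items()}
    print(name, binary(row), shares)
```

Under these assumed counts, the binary representation cannot distinguish Anthony Kiedis's five instrument relations from his single author relation, whereas the relative count yields 5/6 and 1/6, which the slide reports as 0.833 and 0.166.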

EVALUATION

Evaluation
• Comparative evaluation of the propositionalization strategies on
three data mining tasks
– Classification
– Regression
– Outlier Detection
• Evaluated on six datasets, on five feature sets
– types
– categories
– incoming relations
– outgoing relations
– incoming and outgoing relations

Evaluation: Classification
• Datasets:
Dataset         # instances   # types   # categories   # rel in   # rel out   # rel in & rel out
Sports Tweets   5,054         7,814     14,025         3,574      5,334       8,908
Cities          212           721       999            1,304      1,081       2,385
• Methods:
– Naïve Bayes
– k-Nearest Neighbors (k=3)
– C4.5 decision trees
• Metrics for performance evaluation
– Accuracy
• Results calculated using stratified 10-fold cross validation
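
For readers unfamiliar with the protocol, stratified 10-fold cross-validation can be sketched in plain Python as follows. This is not the setup used in the experiments: the classifiers from the slide are not implemented, `fit_predict` is a hypothetical callback standing in for any of them, and accuracy is pooled over all folds, which is one of several possible conventions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # Deal the members round-robin, continuing where the last class stopped.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


def cross_validated_accuracy(X, y, fit_predict, k=10, seed=0):
    """Pooled accuracy over k folds; fit_predict(train_X, train_y, test_X) -> labels."""
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(pred == y[i] for pred, i in zip(predictions, test_idx))
    return correct / len(y)


def majority_baseline(train_X, train_y, test_X):
    """Placeholder learner: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)


# Invented labels: 30 instances of class "a", 20 of class "b".
labels = ["a"] * 30 + ["b"] * 20
X = [[0.0]] * len(labels)
accuracy = cross_validated_accuracy(X, labels, majority_baseline)
```

Stratification keeps the class ratio of each fold close to that of the full dataset, so accuracy estimates on small datasets such as Cities are less sensitive to an unlucky split.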

Evaluation: Classification
Accuracy (%) per dataset, feature set, and representation:
                              Cities                            Sports Tweets
Features      Representation  NB      k-NN    C4.5    Avg.      NB      k-NN    C4.5    Avg.
types         Binary          55.71   56.17   59.05   56.98     81      82.9    82.95   82.28
              Relative Count  57.1    49.61   55.22   53.98     80.96   81.44   81.88   81.43
              TF-IDF          57.1    48.7    54.7    53.50     82.13   82.47   82.64   82.41
categories    Binary          55.74   49.98   56.17   53.96     82.26   76.56   71.98   76.93
              Relative Count  59.52   44.35   58.96   54.28     90.76   84.09   80.86   85.24
              TF-IDF          55.74   49.98   57.08   54.27     89.65   81.98   81.68   84.44
rel in        Binary          60.41   58.46   60.35   59.74     83.12   83.63   84.65   83.80
              Count           56.69   31.1    59.37   49.05     83.25   85.11   85.4    84.59
              Relative Count  49.16   38.23   58.55   48.65     69.51   84.63   85.17   79.77
              TF-IDF          34.94   38.2    54.26   42.47     72.66   84.65   84.99   80.77
rel out       Binary          47.62   60      56.71   54.78     80.67   82.36   84.41   82.48
              Count           49.96   55.24   58.59   54.60     79.95   83.35   85.01   82.77
              Relative Count  48.07   58.44   56.65   54.39     62.13   84.27   83.51   76.64
              TF-IDF          40.15   54.78   58.51   51.15     69.91   84.4    84.16   79.49
rel in & out  Binary          59.44   58.57   56.47   58.16     86.17   85.14   86.45   85.92
              Count           56.13   54.26   60.82   57.07     86.03   86.01   87.16   86.40
              Relative Count  57.68   47.14   56.56   53.79     70.01   84.51   87.22   80.58
              TF-IDF          40.17   46.21   58.46   48.28     75.15   84.86   86.19   82.07

Evaluation: Regression
• Datasets:
Dataset    # instances   # types   # categories   # rel in   # rel out   # rel in & rel out
Auto MPG   391           264       308            227        370         597
Cities     212           721       999            1,304      1,081       2,385
• Methods:
– Linear Regression
– M5Rules
– k-Nearest Neighbors (k=3)
• Metrics for performance evaluation
– Root mean squared error (RMSE)
• Results calculated using stratified 10-fold cross validation
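
For reference, RMSE is the square root of the mean squared difference between predicted and true values, so lower values are better. A minimal sketch follows; the function name and the sample numbers are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long numeric sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Invented fuel-consumption values (mpg), not taken from the Auto MPG dataset.
true_mpg = [18.0, 25.0, 31.0]
predicted_mpg = [20.0, 24.0, 28.0]
error = rmse(true_mpg, predicted_mpg)
```

Because the errors are squared before averaging, a single large miss dominates the score, which is worth keeping in mind when reading outlying entries in a results table.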

Evaluation: Regression
RMSE per dataset, feature set, and representation:
                              Auto MPG                          Cities
Features      Representation  LR      M5      k-NN    Avg.      LR      M5      k-NN    Avg.
types         Binary          3.952   3.056   3.63    3.546     24.303  18.793  22.164  21.753
              Relative Count  3.843   2.952   3.571   3.455     18.046  19.696  33.569  23.770
              TF-IDF          3.864   2.964   3.571   3.466     17.852  18.773  22.396  19.674
categories    Binary          3.698   2.9     3.62    3.409     18.884  22.323  22.677  21.295
              Relative Count  3.747   3       3.57    3.430     18.952  19.98   34.489  24.474
              TF-IDF          3.782   2.9     3.56    3.416     19.02   22.323  23.189  21.511
rel in        Binary          3.849   2.9     3.61    3.444     49.866  19.205  18.532  29.201
              Count           3.892   3       4.62    3.824     138.041 19.915  19.27   59.075
              Relative Count  3.976   2.9     3.57    3.488     122.365 22.335  18.877  54.526
              TF-IDF          4.109   2.8     3.57    3.508     122.921 21.947  18.568  54.479
rel out       Binary          3.792   3.1     3.6     3.490     20.008  19.364  20.918  20.097
              Count           4.072   3       4.15    3.734     36.317  19.459  23.994  26.590
              Relative Count  4.095   2.9     3.57    3.536     43.22   21.961  21.472  28.884
              TF-IDF          4.135   3       3.57    3.572     28.845  20.852  22.212  23.970
rel in & out  Binary          3.991   3.1     3.67    3.572     40.803  18.803  18.211  25.939
              Count           3.991   3.1     4.54    3.870     107.259 19.528  18.906  48.564
              Relative Count  3.922   3       3.57    3.493     103.102 22.091  19.608  48.267
              TF-IDF          3.982   3.01    3.57    3.523     115.373 20.623  19.702  51.899

Evaluation: Outlier Detection
• Datasets:
Dataset            # instances   # types   # rel in   # rel out   # rel in & rel out
DBpedia-Peel       2,083         39        586        322         908
DBpedia-DBTropes   4,228         128       912        2,155       3,067
• Methods:
– k-NN Global Anomaly Score – GAS (k=25)
– Local Outlier Factor – LOF (10<k<50)
– Local Outlier Probability – LoOP (k=25)
• Metrics for performance evaluation
– Area under the ROC curve (AUC)
• Results calculated on partial gold standard of 100 links
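
The AUC can be read as the probability that a randomly chosen true outlier receives a higher outlier score than a randomly chosen normal instance. The sketch below computes it by direct pairwise comparison, which is adequate for a gold standard of about 100 items; the function name and the sample scores are assumptions made for illustration.

```python
def roc_auc(scores, is_outlier):
    """AUC as P(score of an outlier > score of a normal instance); ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_outlier) if flag]
    neg = [s for s, flag in zip(scores, is_outlier) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one normal instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented outlier scores and gold-standard flags.
scores = [0.9, 0.8, 0.3, 0.1]
gold = [True, False, True, False]
auc = roc_auc(scores, gold)
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, so values below 0.5 indicate a ranking that is worse than random on the gold standard.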

Evaluation: Outlier Detection
AUC per dataset, feature set, and representation:
                              DBpedia-Peel                      DBpedia-DBTropes
Features      Representation  GAS     LOF     LoOP    Avg.      GAS     LOF     LoOP    Avg.
types         Binary          0.386   0.487   0.554   0.476     0.503   0.627   0.605   0.578
              Relative Count  0.385   0.398   0.595   0.459     0.503   0.385   0.314   0.401
              TF-IDF          0.386   0.504   0.602   0.497     0.503   0.672   0.417   0.531
rel in        Binary          0.169   0.367   0.289   0.275     0.426   0.52    0.45    0.465
              Count           0.2     0.285   0.29    0.258     0.503   0.59    0.602   0.565
              Relative Count  0.293   0.496   0.452   0.414     0.589   0.555   0.493   0.546
              TF-IDF          0.14    0.354   0.317   0.270     0.509   0.519   0.568   0.532
rel out       Binary          0.25    0.195   0.207   0.217     0.325   0.438   0.432   0.398
              Count           0.539   0.455   0.391   0.462     0.547   0.578   0.522   0.549
              Relative Count  0.542   0.544   0.391   0.492     0.618   0.601   0.513   0.577
              TF-IDF          0.116   0.396   0.24    0.251     0.322   0.629   0.472   0.474
rel in & out  Binary          0.324   0.431   0.51    0.422     0.352   0.439   0.396   0.396
              Count           0.527   0.368   0.454   0.450     0.57    0.563   0.527   0.553
              Relative Count  0.603   0.744   0.616   0.654     0.667   0.672   0.657   0.665
              TF-IDF          0.202   0.667   0.484   0.451     0.481   0.462   0.5     0.481

Conclusion
• The chosen propositionalization strategy matters
• No general recommendation for a strategy
– What is the data mining task?
– What are the characteristics of the dataset?
– Which algorithm is going to be used?

Future Work
• Conduct further experiments on more feature sets
– Qualified incoming and outgoing relations
– Combine features from multiple LOD sources
• Conduct experiments on more data mining tasks
– Clustering, recommender systems, etc.
• More sophisticated strategies
– Combination of statistical and semantic measures
– Adaptation of weighting strategies used in text mining to overcome
problems with erroneous data
• Use the statistical measures for feature selection

RapidMiner LOD Extension
• Simple wiring of operators
– Importing
– Linking
– Feature generation
– Data consolidation
– Feature selection
– Visualization

[Figure: pipeline over local data and LOD (link, combine, cleanse, transform, analyze) with the stages Data Enrichment, Data Analysis, Linking, Feature Selection, and Schema Matching & Data Fusion]

• Try it out!
– find “Linked Open Data” on the marketplace
– Google Group: groups.google.com/forum/#!forum/rmlod
