RDF2Vec: RDF Graph Embeddings for Data Mining

Petar Ristoski and Heiko Paulheim

11/7/2016 2
Introduction
[Figure: pipeline diagram with the stages Linking; Exploration / Selection; Consolidation / Cleansing; Graph Data; Transformation; Data Mining; Visualization / Explanation]
Ristoski, Paulheim

Motivation
• Standard data mining algorithms require a propositional feature vector representation
• Feature space: V = {v1, v2, …, vn}
• Each instance is represented as an n-dimensional feature vector (v1, v2, …, vn), where for each vi, 1 ≤ i ≤ n:
– vi ∈ {true, false}, or vi ∈ {1, 0}
– vi ∈ ℝ
– vi ∈ S, where S is a finite set of symbols

Motivation
Name                 Person   Music Artist   Instrument   Genre
Trent Reznor         1        1              1            0
Wolfgang A. Mozart   1        1              1            1
Barack Obama         1        0              0            0

Vision
• Preserve the information given in the original graph
• Unsupervised
– task and dataset independent
• Compatible with traditional data mining algorithms and tools
• Efficient computation and application
– Low dimensional representation

RDF2VEC APPROACH

RDF2Vec
• Adaptation of neural language models
– Word2vec
– Latent representation of words based on text corpus
• Convert RDF graphs into sequences of entities and relations (sentences)
– Graph Walks
– Weisfeiler-Lehman Subtree RDF Graph Kernels
• Train neural language model
– Each entity and relation is represented as an N-dimensional numerical vector
– Semantically similar entities appear closer in the embedded space
• Use entity vectors in different ML tasks

Word2vec – Neural Language Model
• Two-layer neural net that converts raw text into vectors
– Each word is represented as a numerical vector
• Continuous Bag-of-Words (CBOW)
– Predict target words from source context words
– Example sentence: "Tokyo is the capital of Japan"
• Skip-gram
[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS, 2013.
[2] Rong, Xin. "word2vec parameter learning explained." 2014.

CBOW
[Figure: CBOW network diagram with the words Capital, Japan and Tokyo]

Word Embedding
• Japan
• Russia
• Germany
• Austria
• Berlin
• Tokyo
• Moscow
• Vienna
Tokyo = [f1, f2, f3, …, fn]
Japan = [f1, f2, f3, …, fn]
v(Japan) - v(Tokyo) + v(Berlin) ≈ v(Germany)
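
The vector arithmetic above can be tried out on a small scale. The following is a minimal sketch, assuming hand-picked 2-dimensional toy vectors (hypothetical values, not learned embeddings) and cosine similarity as the closeness measure; the helper names `cosine` and `analogy` are illustrative only.

```python
import math

# Hypothetical toy vectors chosen by hand for illustration; real word2vec
# vectors are learned from a corpus and have hundreds of dimensions.
vectors = {
    "Japan":   [1.0, 3.0],
    "Tokyo":   [1.0, 1.0],
    "Germany": [3.0, 3.2],
    "Berlin":  [3.0, 1.1],
    "Russia":  [5.0, 2.9],
    "Moscow":  [5.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to v(a) - v(b) + v(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("Japan", "Tokyo", "Berlin", vectors)
print(result)
```

With these toy values the query vector lands next to the vector for Germany, mirroring the relation on the slide.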

Word2vec – Neural Language Model
• Two-layer neural net that converts raw text into vectors
– Each word is represented as a numerical vector
• Continuous Bag-of-Words (CBOW)
– Predict target words from source context words
• Skip-gram
– Predict context words from the target word
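
The difference between the two objectives shows up in the training examples they generate. The sketch below is an illustration under stated assumptions: a symmetric context window of size 2 (a value picked for this example) and the sentence "Tokyo is the capital of Japan" from the earlier slide. The function names are hypothetical, and no neural network is trained here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the whole context (input) predicts the target word (output).
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the target word (input) predicts each context word (output).
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]

sentence = "Tokyo is the capital of Japan".split()
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)

print(cbow[0])       # context of the first word, paired with that word
print(skipgram[:2])  # first word paired with each of its context words
```

CBOW yields one example per word position, while skip-gram yields one example per (target, context word) pair, so it produces more, finer-grained training examples from the same sentence.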

Skip-gram
[Figure: Skip-gram network diagram with the words Capital, Japan and Tokyo]

RDF2vec
• Convert the graph into sequences of tokens (sentences)
– Graph walks
– Weisfeiler-Lehman Subtree RDF Graph Kernels

Graph Walks RDF2vec
• For each entity in the graph:
– Extract a subgraph with depth d
– Extract walks on the subgraph
– Build word2vec model
dbr:Trent_Reznor -> dbo:associatedBand -> dbr:Exotic_Birds -> dbo:bandMember -> dbr:Chris_Vrenna
dbr:Trent_Reznor -> dbo:genre -> dbr:Dark_ambient -> dbo:instrument -> dbr:Field_recording
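
The walk extraction step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that depth d counts hops (edges), that only outgoing edges are followed, and that all walks are enumerated exhaustively. The four triples are taken from the two example walks above; `build_index` and `extract_walks` are hypothetical helper names.

```python
from collections import defaultdict

# Toy triples reconstructed from the two example walks; a real graph would
# be loaded from an RDF file or a SPARQL endpoint.
triples = [
    ("dbr:Trent_Reznor", "dbo:associatedBand", "dbr:Exotic_Birds"),
    ("dbr:Exotic_Birds", "dbo:bandMember", "dbr:Chris_Vrenna"),
    ("dbr:Trent_Reznor", "dbo:genre", "dbr:Dark_ambient"),
    ("dbr:Dark_ambient", "dbo:instrument", "dbr:Field_recording"),
]

def build_index(triples):
    """Map every subject to its outgoing (predicate, object) pairs."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))
    return outgoing

def extract_walks(start, outgoing, depth):
    """Enumerate all walks from `start` with `depth` hops (fewer at dead ends)."""
    walks = []

    def extend(walk, node, hops_left):
        edges = outgoing.get(node, [])
        if hops_left == 0 or not edges:
            walks.append(walk)
            return
        for predicate, obj in edges:
            extend(walk + [predicate, obj], obj, hops_left - 1)

    extend([start], start, depth)
    return walks

outgoing = build_index(triples)
walks = extract_walks("dbr:Trent_Reznor", outgoing, depth=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk is a list of tokens (entities and relations alternating), so the full set of walks over all entities can serve as the sentence corpus for a word2vec implementation.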

Random Walks RDF2vec
[Figure: V*S walks → V vectors]

Entity Embedding
• dbr:Berlin
• dbr:Tokyo
• dbr:Moscow
• dbr:Vienna
• dbr:Japan
• dbr:Russia
• dbr:Germany
• dbr:Austria
dbr:Tokyo = [f1, f2, f3, …, fn]
dbr:Japan = [f1, f2, f3, …, fn]

Weisfeiler-Lehman Kernel

WL Kernel RDF2vec
• Construct sequences using random walks with depth d after each
iteration for each entity in the graph
• Graph G sequences after 1 iteration:
– 1->6->11; 1->6->11->13; 1->6->11->10 …
– 4->11->6; 4->11->13; 4->11->10; 4->11->10->8 …
– …
de Vries, Gerben KD. "A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data." ECML, 2013.
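
To make the relabeling idea concrete, here is a simplified sketch. It uses the textbook Weisfeiler-Lehman step on a small node-labeled directed graph (each node's new label compresses its own label plus the sorted labels of its successors) and then extracts label walks before and after one iteration. It is an assumption-laden illustration and not the RDF-specific approximation by de Vries; the toy graph, the integer labels and the function names are all hypothetical.

```python
def wl_relabel(labels, neighbours):
    """One WL iteration: compress (own label, sorted successor labels) into new ids."""
    signatures = {
        node: (labels[node], tuple(sorted(labels[n] for n in neighbours.get(node, []))))
        for node in labels
    }
    compressed = {}                       # signature -> fresh integer label
    next_id = max(labels.values()) + 1
    new_labels = {}
    for node in sorted(signatures):       # sorted for a deterministic numbering
        signature = signatures[node]
        if signature not in compressed:
            compressed[signature] = next_id
            next_id += 1
        new_labels[node] = compressed[signature]
    return new_labels

def label_walks(start, neighbours, labels, depth):
    """Walks of up to `depth` hops from `start`, written as sequences of labels."""
    walks = []

    def extend(walk, node, hops_left):
        successors = neighbours.get(node, [])
        if hops_left == 0 or not successors:
            walks.append(walk)
            return
        for successor in successors:
            extend(walk + [labels[successor]], successor, hops_left - 1)

    extend([labels[start]], start, depth)
    return walks

# Toy graph: a points to b and c, which both point to d.
neighbours = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
labels_0 = {"a": 1, "b": 2, "c": 2, "d": 3}
labels_1 = wl_relabel(labels_0, neighbours)

# Collect sequences for the original labels and after one WL iteration.
sequences = []
for labels in (labels_0, labels_1):
    for node in sorted(labels):
        sequences.extend(label_walks(node, neighbours, labels, depth=2))

for sequence in sequences:
    print("->".join(str(label) for label in sequence))
```

Nodes b and c receive the same new label because they have the same label and the same successor labels, which is how the relabeled sequences encode subtree structure rather than node identity.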

WL Kernels RDF2vec
[Figure: V*S*I sequences → V vectors]

EVALUATION

Evaluation Setup
• Datasets
– 3 domain-specific RDF datasets
– 2 large cross-domain RDF datasets with 5 evaluation datasets
• Tasks
– Classification: Naive Bayes, k-Nearest Neighbors (k=3), C4.5 decision tree
and Support Vector Machines.
– Regression: Linear Regression, M5Rules, and k-Nearest Neighbors (k=3).
• Baselines
– Features derived from incoming and outgoing relations and values
– Features derived from graph substructures: WL and Walk-Count Kernels
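
As an illustration of how entity vectors plug into a standard learner, the sketch below runs a k-Nearest Neighbors classifier with k=3, one of the classifiers listed above, on made-up 2-dimensional vectors. The data, labels and function names are hypothetical and chosen only to keep the example self-contained; Euclidean distance is an assumption.

```python
import math
from collections import Counter

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors closest to `query`."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical entity vectors and class labels, for illustration only.
train_vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]]
train_labels = ["low", "low", "low", "high", "high", "high"]
test_vectors = [[0.1, 0.1], [1.0, 1.0]]
test_labels = ["low", "high"]

predictions = [knn_predict(v, train_vectors, train_labels) for v in test_vectors]
print(predictions, accuracy(predictions, test_labels))
```

Because the embeddings are plain numerical vectors, the same inputs could be passed to any of the other listed learners without task-specific feature engineering.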

Domain Specific RDF Datasets
• Datasets
Dataset   Task      #statements   #instances   #walks   depth   #sequences (walks)   WL iter.   WL depth   #sequences (WL)
AIFB      C (c=4)   30K           176          all      10      360K                 4          2          346K
BGS       C (c=2)   600K          146          all      10      2.4M                 4          2          5.3M
MUTAG     C (c=2)   80K           340          all      10      168K                 4          2          908K
• Results (accuracy)
– Best scores per dataset
Dataset   Baseline   Walks2vec   WL2vec (SG 500)
AIFB      92.68      89.55       93.41
BGS       91.05      78.10       96.18
MUTAG     94.29      82.06       96.33

Large Cross-Domain RDF Datasets
• Datasets
Dataset    #instances   depth   #sequences   Vector size   Model
DBpedia    5M           4/8     2.5B         200/500       CBOW/SG
Wikidata   17M          4       8.5B         200/500       CBOW/SG
• Evaluation datasets
Dataset             #instances   ML task     Original source
Cities              212          R/C (c=3)   Mercer
Metacritic Albums   1,600        R/C (c=3)   Metacritic
Metacritic Movies   2,000        R/C (c=3)   Metacritic
AAUP                960          R/C (c=3)   JSE
Forbes              1,585        R/C (c=3)   Forbes

Results: classification
• Accuracy Results
– Best scores only
Strategy (model, vector size, depth)   Cities   Movies   Albums   AAUP    Forbes
Best Baseline                          75.13    79.30    77.94    93.44   76.75
DB2vec CBOW 200 8                      77.39    83.65    78.44    92.23   88.30
DB2vec CBOW 500 8                      76.84    83.25    77.25    90.61   89.86
DB2vec SG 200 8                        78.92    83.30    79.72    91.04   90.10
DB2vec SG 500 8                        89.73    82.80    78.20    94.48   88.53
WD2vec CBOW 200 4                      75.56    52.20    51.44    90.18   81.08
WD2vec CBOW 500 4                      85.56    51.04    53.28    89.74   80.74
WD2vec SG 200 4                        75.48    75.39    64.76    90.50   81.17
WD2vec SG 500 4                        83.20    76.30    63.42    90.60   81.17

Results: regression
• RMSE Results
– Best scores only
Strategy (model, vector size, depth)   Cities   Movies   Albums   AAUP   Forbes
Best Baseline                          17.04    19.19    12.81    6.16   18.32
DB2vec CBOW 200 8                      12.55    15.90    11.79    6.47   17.43
DB2vec CBOW 500 8                      12.54    15.81    11.30    6.54   17.62
DB2vec SG 200 8                        12.85    15.12    10.90    6.22   17.85
DB2vec SG 500 8                        10.19    15.45    10.89    6.26   16.61
WD2vec CBOW 200 4                      17.52    23.39    14.55    6.60   21.77
WD2vec CBOW 500 4                      18.33    22.18    14.00    6.08   21.92
WD2vec SG 200 4                        18.69    19.10    13.51    6.52   21.59
WD2vec SG 500 4                        19.23    19.19    13.23    6.05   21.58

Results Summary
• RDF2vec outperforms all the baseline approaches
– Smaller feature vectors, so training is more efficient than with the baseline approaches
• WL kernel sequences capture the graph structure better than walks
– Not efficient on large graphs
– Large number of sequences produced, so not scalable
• Increasing the depth of the paths increases the quality of the embeddings
• The vector dimensionality doesn’t affect the performance
• Skip-gram models consistently outperform CBOW models
• DBpedia produces higher quality embeddings than Wikidata

Other Use-Cases
• Recommender systems
• Document modeling
– Document similarity
– Entity relatedness
• Alignment of knowledge bases
– DBpedia and Wikidata
• Knowledge base relation prediction and error detection
• Linking text and semi-structured knowledge to knowledge bases

Conclusion
• RDF2Vec: an approach for learning latent numerical representations
of entities in RDF graphs
• Preserves the graph information
• Compatible with all the traditional machine learning algorithms
• More efficient training of ML models
• Task and dataset independent approach
• Download the code and the models: http://data.dws.informatik.uni-mannheim.de/rdf2vec/
