Julien Plu
julien.plu@eurecom.fr
@julienplu
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics
Use Case: Bringing Context to Documents
2016/04/14 - PhD Symposium WWW 2016 - Montréal
NEWSWIRES
TWEETS
SEARCH QUERIES
SUBTITLES
Use Case: Bringing Context to Documents
James Patrick Page, OBE (born 9 January 1944)
is an English musician, songwriter, and record
producer who achieved international success as
the guitarist and founder of the rock band Led
Zeppelin. Know More
Sort name: Page, Jimmy
Type: Person
Gender: Male
Born: 1944-01-09 (72 years ago)
Born in: Heston, Hounslow, London,
United Kingdom
Country of origin: United Kingdom
Musical genre: Blues rock, psychedelic rock
Years active: 1962-1968 and since 1992
Labels: Columbia
The Yardbirds are a British rock band of the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page. Know More
Six Different Problems
1. Identity of an entity
Ø Arena; Arena (magazine); Arena (TV series)
Ø Bucks County, Pennsylvania; Milwaukee Bucks
2. Knowledge bases have different coverage
3. Heavy computation needed to handle large streams
4. Various types for an entity (granularity)
Ø Yannick Noah is a Tennis Player and a Singer
5. Different types of documents written in multiple languages
6. Are all phrases entities? (e.g. dates or roles)
Research Questions
1. How to adapt an entity linking system depending on
different criteria?
2. How to design an entity linking system in order to
be able to process a large amount of data in near
real time?
State Of The Art
§ The key role of entities:
Ø 70% of search queries contain at least one entity [1]
Ø Bring context to videos [2]
Ø Help generate summaries [3]
§ Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) offer little parametrization and often cannot be adapted to at least one of the previous criteria
§ Those solutions are often unable to handle large streams of text
[1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
[2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015
[3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
[4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010
[5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
[6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014
[7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents. I-SEMANTICS 2011
Methodology
We have split up this thesis into six tasks (the number in parentheses is the research question each task addresses):
(1) Text adaptivity
(1) Entity type adaptivity
(1) Knowledge base adaptivity
(1) Language adaptivity
(1-2) ADEL Modular framework
(2) Distributed and scalable architecture
[Timeline figure: Start thesis, Today, End thesis]
ADEL: Modular Framework (Extractors)
§ POS Tagger:
Ø Bidirectional CMM (left to right and right to left)
§ NER Combiner:
Ø Uses a combination of CRF models with Gibbs sampling (a Monte Carlo method used for graph inference). A simple CRF model could be:
PER    PER   O  O            O   O                  O   PER   PER   PER
X      X     X  X            X   X                  X   X     X     X
Jimmy  Page  ,  connaissant  le  professionnalisme  de  John  Paul  Jones
(French for "Jimmy Page, knowing the professionalism of John Paul Jones")
X is the set of features for the current word: word is capitalized, previous word is “de”, next word is an NNP, …
Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X with PER is a CRF.
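The feature set X described above can be sketched as a small function. This is only an illustration, not the extractor's actual code: the function name, the feature keys and the Penn-style POS tags assigned to the French tokens are assumptions made for the example.

```python
from typing import Dict, List


def word_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, bool]:
    """Build the three example features named on the slide for token i."""
    return {
        # the current word starts with an upper-case letter
        "capitalized": tokens[i][:1].isupper(),
        # the previous word is "de"
        "prev_is_de": i > 0 and tokens[i - 1].lower() == "de",
        # the next word is tagged as a proper noun (NNP)
        "next_is_NNP": i + 1 < len(tokens) and pos_tags[i + 1] == "NNP",
    }


# Example sentence from the slide; the POS tags are hypothetical.
tokens = ["Jimmy", "Page", ",", "connaissant", "le",
          "professionnalisme", "de", "John", "Paul", "Jones"]
pos_tags = ["NNP", "NNP", ",", "VBG", "DT", "NN", "IN", "NNP", "NNP", "NNP"]

john = word_features(tokens, pos_tags, 7)
connaissant = word_features(tokens, pos_tags, 3)
```

A CRF would consume one such feature dictionary per token, together with the labels of the neighboring tokens.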
ADEL: Modular Framework (Overlap Resolution)
§ Detect overlaps among extractors using the boundaries of the entities
§ Different heuristics can be applied:
Ø Merge (default behavior): “United States” and “States of America” => “United States of America”
Ø Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
Ø Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
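The three heuristics above can be sketched on character offsets as follows. This is a minimal illustration rather than ADEL's implementation: the function names, the span representation and the `CONNECTORS` word list used by the "smart" variant are all assumptions.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Hypothetical connector words that the "smart" variant drops.
CONNECTORS = {"of", "the", "de"}


def merge(a: Span, b: Span) -> List[Span]:
    """Merge: replace two overlapping spans by their union."""
    return [(min(a[0], b[0]), max(a[1], b[1]))]


def _trim(text: str, start: int, end: int, smart: bool) -> Optional[Span]:
    """Strip whitespace (and connector words if smart) from a leftover span."""
    words = [(m.start() + start, m.end() + start, m.group().lower())
             for m in re.finditer(r"\S+", text[start:end])]
    if smart:
        while words and words[0][2] in CONNECTORS:
            words.pop(0)
        while words and words[-1][2] in CONNECTORS:
            words.pop()
    return (words[0][0], words[-1][1]) if words else None


def substring_split(text: str, a: Span, b: Span, smart: bool = False) -> List[Span]:
    """Simple/Smart Substring: keep the inner span and the rest of the outer one."""
    inner, outer = sorted((a, b), key=lambda s: s[1] - s[0])
    if not (outer[0] <= inner[0] and inner[1] <= outer[1]):
        return merge(a, b)  # no containment: fall back to the default heuristic
    pieces = [inner]
    for start, end in ((outer[0], inner[0]), (inner[1], outer[1])):
        rest = _trim(text, start, end, smart)
        if rest is not None:
            pieces.append(rest)
    return sorted(pieces)


def surface(text: str, spans: List[Span]) -> List[str]:
    """Return the surface strings of the given spans."""
    return [text[s:e] for s, e in spans]
```

For instance, `surface(t, substring_split(t, (0, 18), (10, 18), smart=True))` with `t = "Giants of New York"` yields the two mentions of the Smart Substring example, while `smart=False` keeps the trailing "of".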
Modular Framework: Indexing
§ Create an index from DBpedia and Wikipedia
§ Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
ADEL: Modular Framework (Linking)
§ Generate candidate links for all extracted mentions:
Ø If a mention has candidates, they go to the linking method
Ø If not, the mention is linked to NIL
§ Linking method:
Ø ADEL linear formula:
$$r(l) = \big(a \cdot L(m, title) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D))\big) \cdot PR(l)$$
r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated to the candidate l
D: the set of disambiguation pages associated to the candidate l
PR: PageRank associated to the candidate l
a, b and c are weights following the properties: a > b > c and a + b + c = 1
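The linear formula can be sketched in a few lines. This is a literal reading of the slide, not ADEL's code: the weight values 0.5, 0.3 and 0.2 are hypothetical (the slide only states a > b > c and a + b + c = 1), an empty set of redirect or disambiguation pages is assumed to contribute 0, and the slide does not say whether L is normalised or turned into a similarity.

```python
from typing import Iterable


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance between s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def score(mention: str, title: str, redirects: Iterable[str],
          disambiguations: Iterable[str], pagerank: float,
          a: float = 0.5, b: float = 0.3, c: float = 0.2) -> float:
    """r(l) = (a*L(m,title) + b*max L(m,R) + c*max L(m,D)) * PR(l)."""
    if not (a > b > c and abs(a + b + c - 1.0) < 1e-9):
        raise ValueError("weights must satisfy a > b > c and a + b + c = 1")
    # Assumption: an empty page set contributes 0 to the sum.
    max_r = max((levenshtein(mention, r) for r in redirects), default=0)
    max_d = max((levenshtein(mention, d) for d in disambiguations), default=0)
    return (a * levenshtein(mention, title) + b * max_r + c * max_d) * pagerank
```

Candidates for a mention would then be ranked by `score`, with the PageRank factor favoring popular pages.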
ADEL: Modular Framework (Pruning)
§ k-NN machine learning algorithm
§ Why a pruning module?
Ø Useful to correct errors from the extractor by removing wrong annotations. Example:
- France played against Russia for a friendly match.
- Yesterday, I went to see Against in concert.
Ø Useful to adapt the annotations so that they follow a given guideline. Example: suppose we are participating in two different challenges, the 2014 NEEL challenge, which counts dates as entities, and OKE2015, which does not (square brackets mark the annotations that are kept).
- 1st challenge: [Jimmy Page] was born on [January 9th, 1944].
- 2nd challenge: [Jimmy Page] was born on January 9th, 1944.
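A k-NN pruning decision can be sketched as a majority vote over the closest labeled annotations. The slide does not list the features used, so the two features below (whether the mention is capitalized, and a normalized linking score) and the training examples are purely hypothetical.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], bool]  # (feature vector, keep annotation?)


def knn_keep(sample: Sequence[float], training: List[Example], k: int = 3) -> bool:
    """Keep the annotation if most of its k nearest neighbors were kept."""
    nearest = sorted(training, key=lambda ex: dist(ex[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (capitalized, linking score) -> keep?
training: List[Example] = [
    ((1.0, 0.9), True),
    ((1.0, 0.8), True),
    ((1.0, 0.7), True),
    ((0.0, 0.1), False),
    ((0.0, 0.2), False),
]

keep_lowercase_against = knn_keep((0.0, 0.15), training)   # "against" in the football sentence
keep_capitalized_against = knn_keep((1.0, 0.85), training)  # "Against" the band
```

Retraining on examples labeled according to a challenge guideline (for instance dates kept or discarded) is one way such a module could adapt to that guideline.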
Text Adaptivity
§ Experiments on different kinds of text by benchmarking ADEL over different challenges
Ø Tweets: NEEL2014, NEEL2015 and NEEL2016
Ø News articles: OKE2015 and OKE2016
§ Need to adapt the extractors so that they use a model suited to each kind of text
Ø Retrain the NER extractor with a training dataset
Type Adaptivity
§ Challenges have their own definition of types
§ In ADEL, types come from the NER extractor and from the knowledge base used
Ø NER types are different from KB types
Ø NER types and KB types are different from challenge types
§ Need a mapping between those different types. It is currently built manually.
Challenge types:
OKE2015 and OKE2016: Person, Place, Organization, Role
NEEL2015 and NEEL2016: Person, Location, Organization, Product, Event, Thing
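A manual mapping of this kind can be sketched as plain lookup tables. The source types below (upper-case NER labels and DBpedia ontology classes) and the chosen correspondences are assumptions for illustration; only the target challenge types come from the slide.

```python
from typing import Dict, Optional

# Hypothetical mapping from NER / KB types to OKE challenge types.
OKE_TYPE_MAP: Dict[str, str] = {
    "PERSON": "Person",
    "LOCATION": "Place",
    "ORGANIZATION": "Organization",
    "dbo:Person": "Person",
    "dbo:Place": "Place",
    "dbo:Organisation": "Organization",
}

# Hypothetical mapping from NER / KB types to NEEL challenge types.
NEEL_TYPE_MAP: Dict[str, str] = {
    "PERSON": "Person",
    "LOCATION": "Location",
    "ORGANIZATION": "Organization",
    "dbo:Person": "Person",
    "dbo:Place": "Location",
    "dbo:Organisation": "Organization",
    "dbo:Event": "Event",
}


def to_challenge_type(source_type: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the challenge type for a NER or KB type, or None if unmapped."""
    return mapping.get(source_type)
```

Returning `None` for unmapped types leaves it to the caller to decide whether such annotations are dropped or given a fallback type.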
Knowledge Base Adaptivity
§ Joint work with Vrije Universiteit Amsterdam
§ ReCon: define several heuristics in order to re-rank
candidate links provided by our system on newswire
articles
Ø H1: process the article text first and disambiguate the article
title at the end because titles are often too ambiguous
Ø H2: detect co-referential entities throughout the article
Ø H3: topic modeling to exploit a contextual knowledge base
about the found topic
Language Adaptivity
§ No results yet. The goal is to let the user choose the natural language used in the text
§ Test the framework on ETAPE, a NER challenge on French TV content from 2012
Distributed and Scalable Architecture
§ No results yet. The goal is to deploy the framework so that its tasks run in a distributed and scalable way
§ Make each task (extraction, linking and pruning) independent of the others and move them out of the global architecture (taking the way Docker is designed as a model)
§ Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks
Evaluation Over Multiple Datasets in Linking
§ 2014 NEEL Challenge with ADEL v1 using the neleval scorer
            E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02
§ 2015 NEEL Challenge with ADEL v1 using the neleval scorer
            ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0
§ 2016 NEEL Challenge with ADEL v2 using the neleval scorer
            ADEL   kea    Insight  mit    ju     unimib
F-measure   61.98  54.86  38.28    36.09  35.48  33.53
§ OKE2015 Challenge with ADEL v1 using the GERBIL scorer
            ADEL   FOX    FRED
F-measure   60.75  49.88  34.73
§ OKE2016 Challenge with ADEL v2 using the neleval scorer
            ADEL
F-measure   56.5
Conclusions
§ Combining multiple techniques coming from different
domains for entity recognition and linking
§ Having developed different methods in order to make an
entity linking system adaptive to one or multiple criteria
§ Bringing a new approach with ADEL while also reusing
existing approaches with the POS and NER extractors
§ Testing ADEL over different datasets and participating in
challenges
Future Work
§ Knowledge base adaptivity
Ø Further evaluate the knowledge base and text adaptive features using the ERD dataset
Ø Evaluate the knowledge base adaptive feature using the TAC KBP dataset
Ø Experiment with the knowledge base adaptive feature using 3cixty and an ad hoc tourism dataset
§ Language adaptivity
Ø Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
§ Modular Framework
Ø Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
§ Type adaptivity
Ø Further evaluate the approach over more fine grained types using ETAPE challenge. This will
bring more issues especially with the scorers
§ Engineer and evaluate a distributed and scalable architecture on large
data streams
Questions?
Thank you for listening!
Ad

More Related Content

Similar to Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics (20)

Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Artificial Intelligence Institute at UofSC
 
The AOSD Research Community in Brazil and its Crosscutting Impact
The AOSD Research Community in Brazil and  its Crosscutting ImpactThe AOSD Research Community in Brazil and  its Crosscutting Impact
The AOSD Research Community in Brazil and its Crosscutting Impact
Uirá Kulesza
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principlesFOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
dgarijo
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
Paul Groth
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Ontotext
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
Dimitris Kontokostas
 
Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking  Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking
Mohamed BEN ELLEFI
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundation
John Doove
 
An Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
An Empirical Comparison of Knowledge Graph Embeddings for Item RecommendationAn Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
An Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
Enrico Palumbo
 
Decentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic WebDecentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic Web
hala Skaf
 
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
Holistic Benchmarking of Big Linked Data
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
Wi presentation
Wi presentationWi presentation
Wi presentation
Saeedeh Shekarpour
 
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Julien PLU
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
François Scharffe
 
Enhancing Xtext for General Purpose Languages
Enhancing Xtext for General Purpose LanguagesEnhancing Xtext for General Purpose Languages
Enhancing Xtext for General Purpose Languages
University of York
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014
Maria Eskevich
 
NEEL2015 challenge summary
NEEL2015 challenge summaryNEEL2015 challenge summary
NEEL2015 challenge summary
Giuseppe Rizzo
 
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data WarehouseMaking Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Justin Clark-Casey
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
Jeff Z. Pan
 
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Artificial Intelligence Institute at UofSC
 
The AOSD Research Community in Brazil and its Crosscutting Impact
The AOSD Research Community in Brazil and  its Crosscutting ImpactThe AOSD Research Community in Brazil and  its Crosscutting Impact
The AOSD Research Community in Brazil and its Crosscutting Impact
Uirá Kulesza
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principlesFOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
dgarijo
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
Paul Groth
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Ontotext
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
Dimitris Kontokostas
 
Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking  Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking
Mohamed BEN ELLEFI
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundation
John Doove
 
An Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
An Empirical Comparison of Knowledge Graph Embeddings for Item RecommendationAn Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
An Empirical Comparison of Knowledge Graph Embeddings for Item Recommendation
Enrico Palumbo
 
Decentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic WebDecentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic Web
hala Skaf
 
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QAL...
Holistic Benchmarking of Big Linked Data
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Julien PLU
 
Enhancing Xtext for General Purpose Languages
Enhancing Xtext for General Purpose LanguagesEnhancing Xtext for General Purpose Languages
Enhancing Xtext for General Purpose Languages
University of York
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014
Maria Eskevich
 
NEEL2015 challenge summary
NEEL2015 challenge summaryNEEL2015 challenge summary
NEEL2015 challenge summary
Giuseppe Rizzo
 
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data WarehouseMaking Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Justin Clark-Casey
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
Jeff Z. Pan
 

Recently uploaded (20)

Multi-Agent Era will Define the Future of Software
Multi-Agent Era will Define the Future of SoftwareMulti-Agent Era will Define the Future of Software
Multi-Agent Era will Define the Future of Software
Ivo Andreev
 
Let's Do Bad Things to Unsecured Containers
Let's Do Bad Things to Unsecured ContainersLet's Do Bad Things to Unsecured Containers
Let's Do Bad Things to Unsecured Containers
Gene Gotimer
 
Programs as Values - Write code and don't get lost
Programs as Values - Write code and don't get lostPrograms as Values - Write code and don't get lost
Programs as Values - Write code and don't get lost
Pierangelo Cecchetto
 
Why CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Why CoTester Is the AI Testing Tool QA Teams Can’t IgnoreWhy CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Why CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Shubham Joshi
 
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by AjathMobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Ajath Infotech Technologies LLC
 
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb ClarkDeploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Peter Caitens
 
Aligning Projects to Strategy During Economic Uncertainty
Aligning Projects to Strategy During Economic UncertaintyAligning Projects to Strategy During Economic Uncertainty
Aligning Projects to Strategy During Economic Uncertainty
OnePlan Solutions
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
UI/UX Design & Development and Servicess
UI/UX Design & Development and ServicessUI/UX Design & Development and Servicess
UI/UX Design & Development and Servicess
marketing810348
 
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
jamesmartin143256
 
Memory Management and Leaks in Postgres from pgext.day 2025
Memory Management and Leaks in Postgres from pgext.day 2025Memory Management and Leaks in Postgres from pgext.day 2025
Memory Management and Leaks in Postgres from pgext.day 2025
Phil Eaton
 
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.pptPassive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
IES VE
 
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptxThe-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
james brownuae
 
AI Agents with Gemini 2.0 - Beyond the Chatbot
AI Agents with Gemini 2.0 - Beyond the ChatbotAI Agents with Gemini 2.0 - Beyond the Chatbot
AI Agents with Gemini 2.0 - Beyond the Chatbot
Márton Kodok
 
How to Install and Activate ListGrabber Plugin
How to Install and Activate ListGrabber PluginHow to Install and Activate ListGrabber Plugin
How to Install and Activate ListGrabber Plugin
eGrabber
 
Hyper Casual Game Developers Company
Hyper  Casual  Game  Developers  CompanyHyper  Casual  Game  Developers  Company
Hyper Casual Game Developers Company
Nova Carter
 
Best HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRMBest HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRM
accordHRM
 
Hydraulic Modeling And Simulation Software Solutions.pptx
Hydraulic Modeling And Simulation Software Solutions.pptxHydraulic Modeling And Simulation Software Solutions.pptx
Hydraulic Modeling And Simulation Software Solutions.pptx
julia smits
 
Download 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-ActivatedDownload 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-Activated
Web Designer
 
Unit Two - Java Architecture and OOPS
Unit Two  -   Java Architecture and OOPSUnit Two  -   Java Architecture and OOPS
Unit Two - Java Architecture and OOPS
Nabin Dhakal
 
Multi-Agent Era will Define the Future of Software
Multi-Agent Era will Define the Future of SoftwareMulti-Agent Era will Define the Future of Software
Multi-Agent Era will Define the Future of Software
Ivo Andreev
 
Let's Do Bad Things to Unsecured Containers
Let's Do Bad Things to Unsecured ContainersLet's Do Bad Things to Unsecured Containers
Let's Do Bad Things to Unsecured Containers
Gene Gotimer
 
Programs as Values - Write code and don't get lost
Programs as Values - Write code and don't get lostPrograms as Values - Write code and don't get lost
Programs as Values - Write code and don't get lost
Pierangelo Cecchetto
 
Why CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Why CoTester Is the AI Testing Tool QA Teams Can’t IgnoreWhy CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Why CoTester Is the AI Testing Tool QA Teams Can’t Ignore
Shubham Joshi
 
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by AjathMobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Ajath Infotech Technologies LLC
 
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb ClarkDeploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Deploying & Testing Agentforce - End-to-end with Copado - Ewenb Clark
Peter Caitens
 
Aligning Projects to Strategy During Economic Uncertainty
Aligning Projects to Strategy During Economic UncertaintyAligning Projects to Strategy During Economic Uncertainty
Aligning Projects to Strategy During Economic Uncertainty
OnePlan Solutions
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
UI/UX Design & Development and Servicess
UI/UX Design & Development and ServicessUI/UX Design & Development and Servicess
UI/UX Design & Development and Servicess
marketing810348
 
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
Bridging Sales & Marketing Gaps with IInfotanks’ Salesforce Account Engagemen...
jamesmartin143256
 
Memory Management and Leaks in Postgres from pgext.day 2025
Memory Management and Leaks in Postgres from pgext.day 2025Memory Management and Leaks in Postgres from pgext.day 2025
Memory Management and Leaks in Postgres from pgext.day 2025
Phil Eaton
 
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.pptPassive House Canada Conference 2025 Presentation [Final]_v4.ppt
Passive House Canada Conference 2025 Presentation [Final]_v4.ppt
IES VE
 
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptxThe-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
The-Future-is-Hybrid-Exploring-Azure’s-Role-in-Multi-Cloud-Strategies.pptx
james brownuae
 
AI Agents with Gemini 2.0 - Beyond the Chatbot
AI Agents with Gemini 2.0 - Beyond the ChatbotAI Agents with Gemini 2.0 - Beyond the Chatbot
AI Agents with Gemini 2.0 - Beyond the Chatbot
Márton Kodok
 
How to Install and Activate ListGrabber Plugin
How to Install and Activate ListGrabber PluginHow to Install and Activate ListGrabber Plugin
How to Install and Activate ListGrabber Plugin
eGrabber
 
Hyper Casual Game Developers Company
Hyper  Casual  Game  Developers  CompanyHyper  Casual  Game  Developers  Company
Hyper Casual Game Developers Company
Nova Carter
 
Best HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRMBest HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRM
accordHRM
 
Hydraulic Modeling And Simulation Software Solutions.pptx
Hydraulic Modeling And Simulation Software Solutions.pptxHydraulic Modeling And Simulation Software Solutions.pptx
Hydraulic Modeling And Simulation Software Solutions.pptx
julia smits
 
Download 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-ActivatedDownload 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-Activated
Web Designer
 
Unit Two - Java Architecture and OOPS
Unit Two  -   Java Architecture and OOPSUnit Two  -   Java Architecture and OOPS
Unit Two - Java Architecture and OOPS
Nabin Dhakal
 
Ad

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics

  • 1. Julien Plu julien.plu@eurecom.fr @julienplu Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics
  • 2. Use Case: Bringing Context to Documents 2016/04/14 - PhD Sympoosium WWW 2016 - Montréal - 3 NEWSWIRES TWEETS SEARCH QUERIES SUBTITLES
  • 3. Use Case: Bringing Context to Documents James Patrick Page, OBE (born 9 January 1944) is an English musician, songwriter, and record producer who achieved international success as the guitarist and founder of the rock band Led Zeppelin. Know More Sort name: Page, Jimmy Type: Person Gender: Male Born: 1944-01-09 (72 years ago) Born in: Heston, Hounslow, London, United Kingdom Pays d’origine : Royaume-Uni Genre musical : Blues rock, rock psychédélique Années actives : 1962-1968 et depuis 1992 Labels : Columbia The Yardbirds est un groupe de rock britannique des années 1960, formé en mai 1963 à Londres en Angleterre dont les guitaristes ont été Eric Clapton, Jeff Beck puis Jimmy Page. Know More 2016/04/14 - PhD Sympoosium WWW 2016 - Montréal - 4
  • 4. Six Different Problems 1. Identity of an entity Ø Arena; Arena (magazine); Arena (TV series) Ø Bucks County, Pennsylvania; Milwaukee Bucks 2. Knowledge bases have different coverage Yannick Noah is a Tennis Player and a Singer 4. Various types for an entity (granularity) 5. Different type of documents written in multiple languages 3. High computation to handle large streams 6. Are all phrases entities? (e.g. dates or roles) 2016/04/14 - PhD Sympoosium WWW 2016 - Montréal - 5
  • 5. Research Questions 1. How to adapt an entity linking system depending on different criteria? 2. How to design an entity linking system in order to be able to process a large amount of data in near real time? 2016/04/14 - PhD Sympoosium WWW 2016 - Montréal - 6
  • 6. State Of The Art § The key role of entities: Ø 70% of search queries contain at least one entity [1] Ø Bring context to videos [2] Ø Help making summary [3] § Current systems (e.g. TagME [3], AIDA [4], Babelfy [5] or DBpedia Spotlight [6]) are hardly parametrized and often do not propose to be adapted to at least one of the previous criteria § Those solutions are often not able to handle large streams of text [1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010 [2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015 [3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014 [4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by wikipediaentities). CIKM 2010 [5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: AnOnline Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12) [6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014 [7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents. I-SEMANTICS 2011 2016/04/14 - PhD Sympoosium WWW 2016 - Montréal - 7
  • 7. Methodology
      We have split up this thesis into six tasks:
      - (1) Text adaptivity
      - (1) Entity type adaptivity
      - (1) Knowledge base adaptivity
      - (1) Language adaptivity
      - (1-2) ADEL modular framework
      - (2) Distributed and scalable architecture
      [Timeline figure: Start thesis, Today, End thesis]
  • 8. ADEL: Modular Framework (Extractors)
      - POS Tagger:
          - bidirectional CMM (left to right and right to left)
      - NER Combiner:
          - Uses a combination of CRF models with Gibbs sampling (a Monte Carlo method used for graph inference)
      - A simple CRF model could be (token/label, each token having its own feature set X):
          Jimmy/PER Page/PER ,/O connaissant/O le/O professionnalisme/O de/O John/PER Paul/PER Jones/PER
          (French for "Jimmy Page, knowing the professionalism of John Paul Jones")
      - X: the set of features for the current word: word capitalized, previous word is “de”, next word is an NNP, …
      - Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)), i.e. a label depends only on the features and on its neighboring labels; then X with PER is a CRF
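The feature set X described on this slide can be illustrated with a minimal sketch. The function name `token_features`, the feature keys and the POS tags assigned to the example sentence are all assumptions made for illustration; they are not ADEL's actual feature set or the output of its POS tagger.

```python
def token_features(tokens, pos_tags, i):
    """Build an illustrative feature set X for the token at position i."""
    word = tokens[i]
    return {
        # Is the first character of the word uppercase?
        "word_capitalized": word[:1].isupper(),
        # Is the previous token the French preposition "de"?
        "prev_word_is_de": i > 0 and tokens[i - 1].lower() == "de",
        # Is the next token tagged as a proper noun (NNP)?
        "next_word_is_NNP": i + 1 < len(tokens) and pos_tags[i + 1] == "NNP",
    }


tokens = ["Jimmy", "Page", ",", "connaissant", "le",
          "professionnalisme", "de", "John", "Paul", "Jones"]
# Hypothetical POS tags, written by hand for this example only.
pos_tags = ["NNP", "NNP", ",", "VBG", "DT", "NN", "IN", "NNP", "NNP", "NNP"]

features_john = token_features(tokens, pos_tags, 7)
features_le = token_features(tokens, pos_tags, 4)
```

A CRF would then score label sequences such as PER/O over these per-token feature dictionaries; the sketch covers only the feature extraction step.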
  • 9. ADEL: Modular Framework (Overlap Resolution)
      - Detect overlaps among extractors using the boundaries of the entities
      - Different heuristics can be applied:
          - Merge (default behavior): “United States” and “States of America” => “United States of America”
          - Simple Substring: “Florence” and “Florence May Harding” => “Florence” and “May Harding”
          - Smart Substring: “Giants of New York” and “New York” => “Giants” and “New York”
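The Merge heuristic can be sketched over character offsets as follows. This is a minimal illustration assuming mentions are represented as half-open `(start, end)` spans; the function name `merge_overlaps` and the span representation are hypothetical and not taken from ADEL.

```python
def merge_overlaps(spans):
    """Merge overlapping (start, end) character spans into single spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # The span starts before the previous one ends: extend the previous span.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


text = "the United States of America"
first, second = "United States", "States of America"
# Spans as two different extractors might report them.
spans = [
    (text.index(first), text.index(first) + len(first)),
    (text.index(second), text.index(second) + len(second)),
]
merged_mentions = [text[s:e] for s, e in merge_overlaps(spans)]
```

The two substring heuristics would split spans instead of extending them, so they would need a different rule at the point where this sketch extends the previous span.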
  • 10. Modular Framework: Indexing
      - Create an index from DBpedia and Wikipedia
      - Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
  • 11. ADEL: Modular Framework (Linking)
      - Generate candidate links for all extracted mentions:
          - If a mention has candidates, they go to the linking method
          - If not, the mention is linked to NIL
      - Linking method:
          - ADEL linear formula: $r(l) = (a \cdot L(m, title) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D))) \cdot PR(l)$
      - where:
          - r(l): the score of the candidate l
          - L: the Levenshtein distance
          - m: the extracted mention
          - title: the title of the candidate l
          - R: the set of redirect pages associated to the candidate l
          - D: the set of disambiguation pages associated to the candidate l
          - PR: the PageRank associated to the candidate l
          - a, b and c are weights following the properties: a > b > c and a + b + c = 1
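A minimal sketch of how the linear formula on this slide could be evaluated for one candidate. Several points are assumptions: the weights `a`, `b`, `c` are arbitrary values that merely satisfy a > b > c and a + b + c = 1; L is implemented as a normalized Levenshtein similarity (1 minus distance over the longer length) so that a higher score means a better match, whereas the slide only says "Levenshtein distance"; and empty redirect or disambiguation sets contribute 0.

```python
def levenshtein(s, t):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def similarity(s, t):
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximal distance."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def score(mention, title, redirects, disambiguations, pagerank,
          a=0.5, b=0.3, c=0.2):
    """r(l) = (a*L(m, title) + b*max L(m, R) + c*max L(m, D)) * PR(l)."""
    best_redirect = max((similarity(mention, r) for r in redirects), default=0.0)
    best_disamb = max((similarity(mention, d) for d in disambiguations), default=0.0)
    return (a * similarity(mention, title) + b * best_redirect + c * best_disamb) * pagerank


# Hypothetical candidate with no redirect or disambiguation pages.
example_score = score("Jimmy Page", "Jimmy Page", [], [], pagerank=2.0)
```

Candidates for a mention would then be ranked by this score, with the highest-scoring one chosen as the link.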
  • 12. ADEL: Modular Framework (Pruning)
      - k-NN machine learning algorithm
      - Why a pruning module?
          - Useful to correct the errors of the extractor by removing wrong annotations. Example:
              - France played against Russia for a friendly match.
              - Yesterday, I went to see Against in concert.
          - Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges, 2014 NEEL, which counts dates as entities, and OKE2015, which does not.
              - 1st challenge: Jimmy Page was born on January 9th, 1944. (the date is kept as an entity)
              - 2nd challenge: Jimmy Page was born on January 9th, 1944. (the date is pruned)
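The idea of pruning with k-NN can be sketched in plain Python. The features (capitalized, date-like, linked in the knowledge base), the training examples and the labels are all invented for illustration, assuming a guideline that does not count dates as entities; the slides do not say which features ADEL actually uses.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training examples closest to query."""
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors: (is_capitalized, is_date_like, linked_in_kb)
train = [
    ((1, 0, 1), "keep"),
    ((1, 0, 1), "keep"),
    ((1, 0, 0), "keep"),
    ((0, 1, 0), "prune"),
    ((0, 1, 0), "prune"),
    ((0, 0, 0), "prune"),
]

decision_for_name = knn_predict(train, (1, 0, 1))  # e.g. "Jimmy Page"
decision_for_date = knn_predict(train, (0, 1, 0))  # e.g. "January 9th, 1944"
```

Switching guidelines would then amount to retraining on annotations from the other challenge, where date-like mentions are labeled "keep".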
  • 13. Text Adaptivity
      - Experiments on different kinds of text by benchmarking ADEL over different challenges
          - Tweets: NEEL2014, NEEL2015 and NEEL2016
          - News articles: OKE2015 and OKE2016
      - Need to adapt the extractors so that they use a proper model for each kind of text
          - Retrain the NER extractor with a training dataset
  • 14. Type Adaptivity
      - Challenges have their own definition of types
      - In ADEL, types come from the NER extractor and from the knowledge base in use
          - NER types are different from KB types
          - NER types and KB types are different from challenge types
      - Need a mapping between those different types. It is currently made manually.
      - Challenge types:
          - OKE2015 and OKE2016: Person, Place, Organization, Role
          - NEEL2015 and NEEL2016: Person, Location, Organization, Product, Event, Thing
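A manually built type mapping of the kind mentioned on this slide could look like the following sketch. The source labels (`PERSON`, `LOCATION`, `ORGANIZATION`) are hypothetical NER output labels, and the table and function names are illustrative; only the target type names come from the challenge lists on the slide.

```python
# Hand-written mapping from hypothetical NER labels to challenge types.
TYPE_MAPPINGS = {
    "OKE": {
        "PERSON": "Person",
        "LOCATION": "Place",
        "ORGANIZATION": "Organization",
    },
    "NEEL": {
        "PERSON": "Person",
        "LOCATION": "Location",
        "ORGANIZATION": "Organization",
    },
}


def map_type(source_type, guideline):
    """Return the challenge type for a source type, or None if no mapping exists."""
    return TYPE_MAPPINGS[guideline].get(source_type)


oke_type = map_type("LOCATION", "OKE")
neel_type = map_type("LOCATION", "NEEL")
```

Returning `None` for unmapped types makes gaps in the manual mapping visible instead of silently assigning a default type.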
  • 15. Knowledge Base Adaptivity
      - Joint work with Vrije Universiteit Amsterdam
      - ReCon: defines several heuristics in order to re-rank the candidate links provided by our system on newswire articles
          - H1: process the article text first and disambiguate the article title at the end, because titles are often too ambiguous
          - H2: detect co-referential entities throughout the article
          - H3: use topic modeling to exploit a contextual knowledge base about the topic found
  • 16. Language Adaptivity
      - No results yet. The goal is to let the user choose the natural language used in the text
      - Test the framework on ETAPE, a NER challenge on French TV content from 2012
  • 17. Distributed and Scalable Architecture
      - No results yet. The goal is to be able to deploy the framework so that the tasks run in a distributed and scalable way
      - Make each task (extraction, linking and pruning) independent of the others and move them out of the global architecture (taking the way Docker is developed as a model)
      - Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks
  • 18. Evaluation Over Multiple Datasets in Linking
      - 2014 NEEL Challenge with ADEL v1 using the neleval scorer
      - 2015 NEEL Challenge with ADEL v1 using the neleval scorer
      - 2016 NEEL Challenge with ADEL v2 using the neleval scorer
      - OKE2015 Challenge with ADEL v1 using the GERBIL scorer
      - OKE2016 Challenge with ADEL v2 using the neleval scorer
      Result tables (F-measure per system), in the order they appear on the slide:
      - E2E 70.06; UTwente 54.93; DataTXT 49.9; ADEL 46.29; AIDA 45.37; Hyderabad 45.23; SAP 39.02
      - ADEL 60.75; FOX 49.88; FRED 34.73
      - ousia 76.2; acubelab 52.3; ADEL 47.9; uniba 46.4; ualberta 41.5; uva 31.6; cen_neel 0
      - ADEL 61.98; kea 54.86; Insight 38.28; mit 36.09; ju 35.48; unimib 33.53
      - ADEL 56.5
  • 19. Conclusions
      - Combining multiple techniques coming from different domains for entity recognition and linking
      - Having developed different methods in order to make an entity linking system adaptive to one or multiple criteria
      - Bringing a new approach with ADEL while also reusing existing approaches with the POS and NER extractors
      - Testing ADEL over different datasets and participating in challenges
  • 20. Future Work
      - Knowledge base adaptivity
          - Further evaluate the knowledge base and text adaptive features using the ERD dataset
          - Evaluate the knowledge base adaptive feature using the TAC KBP dataset
          - Experiment with the knowledge base adaptive feature using the 3cixty and ad-hoc tourism datasets
      - Language adaptivity
          - Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
      - Modular framework
          - Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
      - Type adaptivity
          - Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will raise more issues, especially with the scorers
      - Engineer and evaluate a distributed and scalable architecture on large data streams
  • 21. Questions? Thank you for listening!