Automatic extraction of microorganisms and their habitats from free text using text-mining workflows

BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou

Outline of the talk
•  Motivation
•  Experiments
•  Results & inferences
•  Discussion
•  Current work

Motivation
•  In the study of symbiotic relationships, host-microbe interactions play an important role
•  To date, there is no comprehensive database of microbe-habitat relations, even though the number of known taxa is growing explosively
•  This creates an urgent need for automated host-microbe relation extraction

Experiments: relevant work
•  Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community [Sasaki, Hanisch, Nobata]
•  Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]

Experiments: our approach
[Workflow diagram: text processing (free text articles, pdf), named entity recognition (habitats & organisms), relation mining]

Employ text mining workflows consisting of
  •  a text/pdf processor
  •  a named entity recognizer to identify microorganisms and their habitats
  •  a relation mining component to extract sentences which express this relation
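
The three components listed above can be pictured as a simple function pipeline. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `Workflow`, `Mention`, `toy_extract`, `toy_ner` and `toy_relation_filter` are hypothetical, and the toy stages stand in for the real pdf processor, recognizer and relation miner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Mention:
    text: str    # surface form found in the sentence
    label: str   # "ORGANISM" or "HABITAT"


@dataclass
class Workflow:
    # Each stage is a plain function, so components can be swapped independently.
    extract_sentences: Callable[[str], List[str]]
    recognize: Callable[[str], List[Mention]]
    expresses_relation: Callable[[str, List[Mention]], bool]

    def run(self, document: str) -> List[str]:
        """Return the sentences judged to express an organism-habitat relation."""
        hits = []
        for sentence in self.extract_sentences(document):
            mentions = self.recognize(sentence)
            if self.expresses_relation(sentence, mentions):
                hits.append(sentence)
        return hits


# Toy stand-ins (assumptions for illustration only).
def toy_extract(document: str) -> List[str]:
    # Naive sentence splitter on full stops.
    return [s.strip() for s in document.split(".") if s.strip()]


def toy_ner(sentence: str) -> List[Mention]:
    # Plain substring lookup in a two-entry lexicon.
    lexicon = {"Helicobacter pylori": "ORGANISM", "stomach": "HABITAT"}
    return [Mention(term, label) for term, label in lexicon.items() if term in sentence]


def toy_relation_filter(sentence: str, mentions: List[Mention]) -> bool:
    # Keep a sentence only if it mentions both entity classes.
    labels = {m.label for m in mentions}
    return {"ORGANISM", "HABITAT"} <= labels


workflow = Workflow(toy_extract, toy_ner, toy_relation_filter)
text = "Helicobacter pylori colonizes the human stomach. The genome was sequenced."
print(workflow.run(text))
```

Keeping each stage behind a function boundary makes it easy to see how a weakness in one stage (for example, noisy sentence extraction) propagates to the stages after it.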
  	
  
Experiments: our approach
•  The named entity recognizer used a hybrid dictionary and machine learning based approach
   –  It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
   –  The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
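
One common way to combine a dictionary with a CRF is to turn dictionary matches into per-token features. The slides do not say how the dictionaries were encoded, so the sketch below is an assumption: `dictionary_flags`, the `B-DICT`/`I-DICT`/`O` tags and the toy dictionary are all hypothetical.

```python
from typing import List, Set, Tuple


def dictionary_flags(tokens: List[str], entries: Set[Tuple[str, ...]]) -> List[str]:
    """Tag each token B-DICT / I-DICT / O using a greedy longest dictionary match."""
    max_len = max((len(e) for e in entries), default=0)
    flags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matched = n
                break
        if matched:
            flags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                flags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return flags


# Hypothetical organism dictionary, stored as lowercased token tuples.
organisms = {("helicobacter", "pylori"), ("escherichia", "coli")}
tokens = ["Helicobacter", "pylori", "colonizes", "the", "stomach"]
print(list(zip(tokens, dictionary_flags(tokens, organisms))))
```

The resulting flags would be added to each token's feature set, so the classifier can weigh dictionary evidence against the other features rather than trusting the dictionary outright.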
  
Experiments: our approach
    –  The feature set included
        •  Lexical information about the word, e.g. the word itself, POS tag etc.
        •  Orthographic information, e.g. any uppercase letters, numbers
        •  Contextual information: information about the two words preceding and succeeding the word
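
The three feature groups can be sketched as a per-token feature dictionary. This is a minimal illustration, assuming the sentence is already tokenized and POS-tagged; the function name `token_features`, the feature keys and the `<PAD>` marker are hypothetical, and the POS tags in the example are supplied by hand.

```python
from typing import Dict, List, Tuple


def token_features(tagged: List[Tuple[str, str]], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i of a POS-tagged sentence."""
    word, pos = tagged[i]
    feats: Dict[str, object] = {
        # Lexical information
        "word": word.lower(),
        "pos": pos,
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
    }
    # Contextual information: two tokens on either side of the current one.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tagged):
            ctx_word, ctx_pos = tagged[j]
            feats[f"word[{offset:+d}]"] = ctx_word.lower()
            feats[f"pos[{offset:+d}]"] = ctx_pos
        else:
            # Mark positions that fall outside the sentence.
            feats[f"word[{offset:+d}]"] = "<PAD>"
    return feats


sentence = [("H.", "NNP"), ("pylori", "NN"), ("colonizes", "VBZ"),
            ("the", "DT"), ("stomach", "NN")]
feats = token_features(sentence, 0)
print(feats)
```

Note that for a habitat word such as "stomach" every orthographic feature is false, which leaves the lexical and contextual features to carry the decision.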
  	
  

•  For the relation mining component, a linear chain CRF was trained using
    –  Occurrence of organisms and habitats
    –  Contextual information of all the entities in a sentence
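
The two inputs to the relation mining component can be sketched as one feature dictionary per sentence. The exact encoding is not given in the slides, so everything below is an assumption: `sentence_features`, the `Entity` tuple layout, the feature keys and the window of two tokens are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start token, end token exclusive, label)


def sentence_features(tokens: List[str], entities: List[Entity],
                      window: int = 2) -> Dict[str, object]:
    """Describe a sentence by its entity occurrences and the words around them."""
    labels = [label for _, _, label in entities]
    feats: Dict[str, object] = {
        # Occurrence of organisms and habitats
        "n_organism": labels.count("ORGANISM"),
        "n_habitat": labels.count("HABITAT"),
        "has_both": "ORGANISM" in labels and "HABITAT" in labels,
    }
    # Contextual information for every entity in the sentence.
    for start, end, label in entities:
        for w in tokens[max(0, start - window):start]:
            feats[f"{label}:left={w.lower()}"] = True
        for w in tokens[end:end + window]:
            feats[f"{label}:right={w.lower()}"] = True
    return feats


tokens = ["Helicobacter", "pylori", "colonizes", "the", "human", "stomach"]
entities = [(0, 2, "ORGANISM"), (5, 6, "HABITAT")]
rel_feats = sentence_features(tokens, entities)
print(rel_feats)
```

Because these features are built from recognized entities, a habitat that the recognizer misses also removes the evidence the relation classifier would need for that sentence.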
  	
  	
  
Results and inference
Performance of our named entity recognizer on a 9-fold cross-validation

    Class of entities    Precision (%)    Recall (%)    F-score (%) = 2PR/(P+R)
    Organisms            84               79            81
    Habitats             68               55            61

    (improved results from the time of submission)

•  Microorganisms are recognized quite well
•  Habitat recognition is modest
•  One observation is that, in free text, descriptions of habitats/hosts are devoid of salient features such as uppercase letters and hyphens
•  Instances such as "abscess" and "lung" were typical misses
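
The F-score column in the table follows directly from the formula in its header, 2PR/(P+R). The short check below recomputes it from the reported precision and recall; the function name `f_score` is just an illustrative choice.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported in the table.
for name, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
    print(f"{name}: F = {f_score(p, r):.1f}%")
```

This prints 81.4% for organisms and 60.8% for habitats, which round to the 81 and 61 shown in the table.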
  	
  
Results and inference
Relation mining results
•  For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
•  Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
•  Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm

Discussion
•  The workflows we have developed bring together pdf conversion, machine learning and dictionaries
   –  The performance of the individual components has a direct impact on the overall performance of the workflow
   –  Pdf conversion is far from trivial, and this component is the most limiting factor for any sentence-based classification task

Discussion
•  Pdf-to-text sentence examples
    –  These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome.
       [Clearly, data from a table and figure corrupted the sentence]
    –  airborne pigs
       [noisy conversion of a table discussing airborne diseases in pigs]

Discussion
•  The CRF model for habitats is evidently weak
    –  The features need to be augmented to alleviate this weakness; we are currently enhancing the model to include more features such as character-level n-grams
    –  Results reflect initial success
•  Relation mining is a hyper-classification task and is perhaps prone to cascading errors
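
A back-of-envelope estimate illustrates how errors might cascade. It assumes, purely for illustration, that a relation sentence contains exactly one organism mention and one habitat mention and that the recognizer's misses are independent; neither assumption comes from the slides. Under those assumptions, the chance that both entities are found is the product of the two reported recall values.

```python
def joint_recall(*recalls: float) -> float:
    """Probability that every required entity is recognized, assuming independence."""
    result = 1.0
    for r in recalls:
        result *= r
    return result


organism_recall = 0.79  # reported recall for organisms
habitat_recall = 0.55   # reported recall for habitats
both_found = joint_recall(organism_recall, habitat_recall)
print(f"{both_found:.2f}")
```

Under these simplifying assumptions, fewer than half of the relation sentences would reach the relation miner with both entities marked, which is consistent with habitat recognition being named as a source of false negatives.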
  
Current work
•  Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
•  We are also working on less noisy conversion techniques for pdf-to-text
•  We plan to export the workflows to the public domain so that scientists across the spectrum can use them
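
The two feature types named in the first bullet can be sketched in a few lines. This is a minimal illustration, not the authors' feature extractor: the function names, the trigram default and the `<`/`>` boundary markers are assumptions.

```python
from collections import Counter
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Character n-grams of a lowercased word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_of_words(tokens: List[str]) -> Counter:
    """Order-free count of lowercased tokens in a sentence."""
    return Counter(t.lower() for t in tokens)


print(char_ngrams("lung"))
print(bag_of_words(["the", "lung", "and", "the", "abscess"]))
```

Character n-grams give the classifier sub-word cues for plain lowercase terms, which may help with habitat words that lack orthographic features such as capitals or hyphens.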
  
Snapshot of relation miner
[Screenshot of the relation miner]

References
•  Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
•  Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
•  Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
•  Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
john823664
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More MachinesRefactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Leon Anavi
 
Secondary Storage for a microcontroller system
Secondary Storage for a microcontroller systemSecondary Storage for a microcontroller system
Secondary Storage for a microcontroller system
fizarcse
 
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Building Connected Agents:  An Overview of Google's ADK and A2A ProtocolBuilding Connected Agents:  An Overview of Google's ADK and A2A Protocol
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Suresh Peiris
 
Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.
marketing943205
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
Cybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft CertificateCybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft Certificate
VICTOR MAESTRE RAMIREZ
 
RFID in Supply chain management and logistics.pdf
RFID in Supply chain management and logistics.pdfRFID in Supply chain management and logistics.pdf
RFID in Supply chain management and logistics.pdf
EnCStore Private Limited
 
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptxIn-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
aptyai
 
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
SOFTTECHHUB
 
Right to liberty and security of a person.pdf
Right to liberty and security of a person.pdfRight to liberty and security of a person.pdf
Right to liberty and security of a person.pdf
danielbraico197
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
Computer Systems Quiz Presentation in Purple Bold Style (4).pdf
Computer Systems Quiz Presentation in Purple Bold Style (4).pdfComputer Systems Quiz Presentation in Purple Bold Style (4).pdf
Computer Systems Quiz Presentation in Purple Bold Style (4).pdf
fizarcse
 
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT StrategyRisk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
john823664
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More MachinesRefactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Leon Anavi
 

Automatic extraction of microorganisms and their habitats from free text using text-mining workflows

  • 1. Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
    BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  • 2. Outline of the talk
      • Motivation
      • Experiments
      • Results & inferences
      • Discussion
      • Current work
  • 3. Motivation
      • In the study of symbiotic relationships, host-microbe interactions play an important role
      • To date, there is no comprehensive database of microbe-habitat relations, yet the number of known taxa is exploding
      • As a result, there is an urgent need for automated host-microbe relation extraction
  • 4. Experiments: relevant work
      • Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Chikashi]
      • Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  • 5. Experiments: our approach
    [Workflow diagram. Input: free text articles, pdf. Stages: text processing, named entity recognition, relation mining. Output: habitats & organisms]
    Employ text mining workflows consisting of
      • a text/pdf processor
      • a named entity recognizer to identify microorganisms and their habitats
      • a relation mining component to extract sentences which express this relation
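The three stages listed on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the toy dictionaries and the plain co-occurrence rule are hypothetical stand-ins, and the dictionary lookup replaces the CRF-based components described in the talk.

```python
import re
from typing import Dict, List, Set

# Toy dictionaries: illustrative stand-ins, not the resources used in the talk.
ORGANISMS: Set[str] = {"helicobacter pylori", "escherichia coli"}
HABITATS: Set[str] = {"stomach", "lung", "gut"}


def split_sentences(text: str) -> List[str]:
    """Text processing stage: naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_entities(sentence: str) -> Dict[str, List[str]]:
    """NER stage: substring lookup standing in for the CRF-based recognizer."""
    lowered = sentence.lower()
    return {
        "organisms": sorted(o for o in ORGANISMS if o in lowered),
        "habitats": sorted(h for h in HABITATS if h in lowered),
    }


def candidate_relations(text: str) -> List[Dict[str, object]]:
    """Relation mining stage: keep sentences that mention both entity types."""
    results: List[Dict[str, object]] = []
    for sentence in split_sentences(text):
        entities = tag_entities(sentence)
        if entities["organisms"] and entities["habitats"]:
            results.append({"sentence": sentence, **entities})
    return results


if __name__ == "__main__":
    sample = (
        "Helicobacter pylori colonizes the human stomach. "
        "The weather was mild. "
        "Escherichia coli was cultured."
    )
    for hit in candidate_relations(sample):
        print(hit)
```

Only the first sample sentence mentions both an organism and a habitat, so it is the only candidate returned. A trained classifier would replace the co-occurrence rule, since co-occurrence alone does not show that a sentence expresses the relation.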
  • 6. Experiments: our approach
      • The named entity recognizer used a hybrid dictionary and machine learning based approach
          – It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
          – The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
  • 7. Experiments: our approach
          – The feature set included
              • Lexical information about the word, e.g. the word itself, its POS tag
              • Orthographic information, e.g. any uppercase letters, numbers
              • Contextual information: the two words preceding and the two words succeeding the word
      • For the relation mining component, a linear chain CRF was trained using
          – Occurrence of organisms and habitats
          – Contextual information of all the entities in a sentence
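A minimal sketch of how such a per-token feature set might be assembled before it is handed to a CRF toolkit. The function names, feature keys and the `<PAD>` marker are assumptions made for illustration, the POS tags are assumed to come from an upstream tagger, and the hyphen feature is borrowed from the orthographic cues mentioned on the results slide.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tags come from an upstream tagger


def token_features(tokens: List[Token], i: int) -> Dict[str, object]:
    """Build lexical, orthographic and contextual features for token i."""
    word, pos = tokens[i]
    feats: Dict[str, object] = {
        # Lexical information
        "word": word.lower(),
        "pos": pos,
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
    }
    # Contextual information: two words before and two words after
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            ctx_word, ctx_pos = tokens[j]
            feats[f"word[{offset:+d}]"] = ctx_word.lower()
            feats[f"pos[{offset:+d}]"] = ctx_pos
        else:
            # Position falls outside the sentence
            feats[f"word[{offset:+d}]"] = "<PAD>"
    return feats


def sentence_features(tokens: List[Token]) -> List[Dict[str, object]]:
    """Feature dicts for every token in a sentence, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    sent = [("H.", "NNP"), ("pylori", "NN"), ("colonizes", "VBZ"),
            ("the", "DT"), ("stomach", "NN")]
    print(token_features(sent, 4))
```

For a habitat word such as "stomach" every orthographic flag is false, which is consistent with the observation in the results that habitat mentions carry few surface cues.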
  • 8. Results and inference
    Performance of our named entity recognizer on a 9-fold cross-validation

      Class of entities   Precision (%)   Recall (%)   F-score (%), 2PR/(P+R)
      Organisms           84              79           81
      Habitats            68              55           61

    (Improved results from the time of submission)
      • Microorganisms have been recognized quite well.
      • Habitat recognition is modest.
      • One observation is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
      • Instances such as "abscess" and "lung" were typical misses.
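The F-score column follows from the precision and recall columns through the formula 2PR/(P+R) given in the table header. A short check of that arithmetic, where the function name is an illustrative choice:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall (in %) as reported on the slide
    for name, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
        print(f"{name}: F = {f_score(p, r):.1f}%")
```

Rounded to whole percentages, the two values come out as 81 and 61, matching the reported F-scores.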
  • 9. Results and inference
    Relation mining results
      • For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
      • Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
      • Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  • 10. Discussion
      • The workflows we have developed bring together pdf conversion, machine learning and dictionaries
          – The performance of individual components has an impact on the overall performance of the workflow
          – Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  • 11. Discussion
      • Pdf-to-text sentence examples
          – "These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome."
            [Clearly, data from a table and figure corrupted the sentence]
          – "airborne pigs"
            [Noisy conversion of a table discussing airborne diseases in pigs]
  • 12. Discussion
      • The CRF model for habitats is evidently weak
          – There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
          – Results reflect initial success
      • Relation mining is a hyper-classification task and is perhaps prone to cascading errors
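Character-level n-grams give the classifier sub-word cues for words that lack distinctive surface features. A minimal sketch of how they could be generated for a token; the function name, the `^` and `$` boundary markers and the default n-gram lengths are assumptions for illustration, not details taken from the talk.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return all character n-grams of length n_min..n_max for a word.

    The word is lowercased and wrapped in ^ and $ so that prefixes and
    suffixes are distinguishable from word-internal n-grams.
    """
    padded = f"^{word.lower()}$"
    grams: List[str] = []
    for n in range(n_min, n_max + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


if __name__ == "__main__":
    print(char_ngrams("lung", 3, 3))
    print(char_ngrams("abscess"))
```

Each n-gram would typically become one binary feature for the token, alongside the lexical, orthographic and contextual features already in use.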
  • 13. Current work
      • Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
      • We are also working on less noisy conversion techniques for pdf-to-text
      • Export the workflows to the public domain so that scientists across the spectrum can use them
  • 14. Snapshot of relation miner
    References
      • Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
      • Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
      • Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
      • Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.