14th International Conference on Intelligent Systems for Molecular Biology, Software demo, Fortaleza Conference Center, Fortaleza, Brazil, August 6-10, 2006
The STRING database aims to provide a comprehensive global protein-protein interaction network. The latest version covers over 5000 organisms and allows users to upload entire genome-wide datasets. It supports gene-set enrichment analysis against classification systems such as Gene Ontology and KEGG. STRING collects and integrates data from various sources, including experimental repositories, text mining, and predicted interactions based on genomic features. Users can access and visualize the interaction data through a web interface or API.
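The gene-set enrichment analysis mentioned above is typically an overrepresentation test. As a minimal sketch (the exact statistics used by STRING are not specified here, and the gene counts below are made up for illustration), a one-sided hypergeometric test for a single gene set can be computed with the standard library alone:

```python
from math import comb

def hypergeom_enrichment_p(hits, draws, successes, population):
    """One-sided hypergeometric p-value: the probability of seeing
    at least `hits` genes from a pathway of size `successes` in a
    query list of `draws` genes, out of `population` genes total."""
    total = comb(population, draws)
    return sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(hits, min(draws, successes) + 1)
    ) / total

# Toy example: 5 of 10 query genes fall in a 50-gene pathway
# out of a 1000-gene background.
p = hypergeom_enrichment_p(hits=5, draws=10, successes=50, population=1000)
```

In practice a tool would run this test for every pathway and then correct for multiple testing across all gene sets.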
This document provides a summary of a seminar on comparative genomics techniques. It discusses three levels of genome research: structural genomics, functional genomics, and comparative genomics. Comparative genomics involves analyzing and comparing different genomes to study gene content, function, organization, and evolution. Techniques discussed include genome sequencing, mapping, and bioinformatics tools. The document also outlines what can be compared between genomes and how comparative genomics has provided insights into evolution and gene function.
This document discusses analyzing and visualizing gene expression data. It defines key terms like genes and gene expression data. It also describes clustering gene expression data using k-means clustering to group genes based on similarity in a dataset of yeast cell cycle genes. Finally, it discusses visualizing gene expression data using techniques like vector fusion, nMDS, and PCA to project high-dimensional gene expression datasets into 2D or 3D spaces.
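The k-means clustering step can be sketched in a few lines. This is a toy illustration on made-up two-dimensional "profiles" with deterministic initialisation, not the yeast cell-cycle dataset from the slides:

```python
def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means: assign each point to its nearest
    centroid, then recompute centroids, for a fixed number of rounds."""
    centroids = list(points[:k])  # deterministic init for reproducibility
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Toy "expression profiles" with two obvious groups
profiles = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids, clusters = kmeans(profiles, k=2)
```

Real gene-expression clustering would use many-dimensional profiles (one value per condition or time point) and usually a library implementation such as scikit-learn's `KMeans`.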
Structural genomics aims to determine the 3D structure of all proteins in a genome. It uses high-throughput methods like X-ray crystallography and NMR on a genomic scale. This allows determination of protein structures for entire proteomes. It provides insights into protein function and can aid drug discovery by identifying potential drug targets like in Mycobacterium tuberculosis. Structural genomics leverages completed genome sequences to clone and express all encoded proteins for structural characterization.
Genetic mapping uses genetic techniques like cross-breeding experiments to construct maps showing gene positions. Physical mapping uses molecular techniques to examine DNA directly and construct maps showing sequence features. Different DNA markers like RFLPs, SSLPs, SNPs can be used for genetic mapping. Techniques for physical mapping include restriction mapping, fluorescent in situ hybridization (FISH), and sequence tagged site (STS) mapping. Integrating genetic and physical maps provides high resolution mapping needed for genome sequencing.
The document describes several key databases within the KEGG resource, including:
- The PATHWAY database containing molecular network maps of metabolic and genetic pathways.
- The BRITE database providing hierarchical classifications of biological systems beyond what is shown in pathways.
- The LIGAND database consisting of chemical compounds, carbohydrates, reactions, and enzyme information.
KEGG aims to comprehensively capture biological knowledge through integrated databases covering genomes, pathways, diseases and drugs.
This document reviews protein-protein interactions (PPIs). It discusses how PPIs occur and their importance in biological processes. Several methods are described for identifying PPIs, including yeast two-hybrid systems, co-immunoprecipitation, and computational databases. PPIs help mediate cellular functions and understanding them can provide insight into diseases and new therapeutic approaches.
The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies, and contains protein sequence databases.
This document discusses global and local sequence alignment. It introduces sequence alignment and its uses in identifying similarities between sequences that could indicate functional or evolutionary relationships. It describes the principles of alignment and the different types of alignment, including global alignment, which aligns entire sequences, and local alignment, which matches regions of similarity. Methods for alignment include dot plots, scoring matrices, and dynamic programming. BLAST is introduced as a tool for comparing sequences against databases using local alignment algorithms.
The document summarizes the principle and workflow of Illumina next-generation sequencing. It begins with an overview of Illumina and the development of their sequencing technologies. It then describes the wide range of applications of NGS. The core principle is sequencing by synthesis using reversible dye-terminators. The workflow involves library preparation through fragmentation and ligation of adapters, cluster generation by bridge amplification on a flow cell, and sequencing through cycles of reversible terminator incorporation and imaging. Finally, the sequenced reads are aligned and analyzed using Illumina's data analysis software suite.
This document discusses dot plot analysis, which allows comparison of two biological sequences to identify similar regions. It describes how dot plots are generated using a similarity matrix and defines different features that can be observed, such as identical sequences appearing on the principal diagonal, direct and inverted repeats appearing as multiple diagonals, and low complexity regions forming boxes. Applications of dot plot analysis include identifying alignments, self-base pairing, sequence transposition, and gene locations between genomes. Limitations include high memory needs for long sequences and low efficiency for global alignments.
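The dot-plot construction described above can be sketched directly: slide a window along both sequences and mark a dot wherever the windows agree in enough positions (the window and threshold values below are illustrative choices, not from the slides):

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Build a dot-plot matrix: 1 wherever a window of seq_a matches
    the corresponding window of seq_b in at least `threshold` positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= threshold:
                matrix[i][j] = 1
    return matrix

# Identical sequences give dots along the principal diagonal.
m = dot_plot("GATTACA", "GATTACA", window=3, threshold=3)
diagonal = [m[i][i] for i in range(len(m))]
```

The windowing is what keeps long comparisons readable: a threshold below the window size filters out the noise dots that single-character matching would produce, at the memory cost the summary notes for long sequences.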
The document discusses biological databases. It begins by defining what a database is, including that it is a collection of related data organized in a way that allows information to be retrieved easily. It then discusses different types of biological databases, including those containing nucleotide sequences, protein sequences, 3D structures, gene expression data, and metabolic pathways. The rest of the document provides details on specific biological databases like GenBank, EMBL, DDBJ, and NCBI databases including Entrez. It emphasizes that biological data is heterogeneous, large in volume, dynamic and integrated across multiple databases.
This document discusses genomic databases. It begins by defining key terms like genes, genomes, and genomics. It then describes categories of biological databases including those for nucleic acid sequences, proteins, structures, and genomes. It provides many examples of genomic databases for both non-vertebrate and vertebrate species, including databases for bacteria, fungi, plants, invertebrates, and humans. The final sections note that genomic databases collect genome-wide data from various sources and that databases can be specific to a single organism or category of organisms.
Protein-protein interactions are important for many biological processes. There are various types of interactions depending on their composition and duration. Methods to study interactions include yeast two-hybrid, co-immunoprecipitation, affinity chromatography, and chromatin immunoprecipitation. Databases such as IntAct and MINT provide repositories for protein interaction data.
A database is a structured collection of data that can be easily accessed, managed, and updated. It consists of files or tables containing records with fields. Database management systems provide functions like controlling access, maintaining integrity, and allowing non-procedural queries. Major databases include GenBank, EMBL, and DDBJ for nucleotide sequences and UniProt, PDB, and Swiss-Prot for proteins. The NCBI maintains many biological databases and provides tools for analysis.
This document provides an overview of sequence analysis, including:
1) Defining sequence analysis as subjecting DNA, RNA, or peptide sequences to analytical methods to understand features, function, structure, or evolution.
2) Applications of sequence analysis like comparing sequences to find similarity and identify intrinsic features.
3) Methods of DNA and protein sequencing like Sanger sequencing, pyrosequencing, and Edman degradation.
This document provides an overview of parsimony methods for phylogenetic tree analysis. It defines key terms like rooted vs unrooted trees and describes the basic steps of parsimony analysis. Parsimony methods infer the phylogenetic tree that requires the fewest evolutionary changes to explain the observed similarities and differences in species' characteristics. The analysis proceeds by identifying informative sites in a sequence alignment, calculating the number of character changes on possible trees, and selecting the tree with the smallest number of changes as the most likely phylogenetic tree.
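The first step of the analysis, identifying parsimony-informative sites, can be sketched directly. A column is informative when at least two character states each occur in at least two sequences (the tiny alignment below is invented for illustration):

```python
from collections import Counter

def informative_sites(alignment):
    """Return the column indices that are parsimony-informative:
    at least two states, each shared by at least two sequences."""
    sites = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(col)
    return sites

aln = ["AAGT",
       "AAGT",
       "ACCT",
       "ACCT"]
sites = informative_sites(aln)
```

Constant and singleton columns are excluded because they change the length of every candidate tree equally and therefore cannot discriminate between tree topologies.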
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
The following slides were prepared by Poornima M.S., a II M.Sc. Life Science student at Bangalore University, Bangalore.
ChIP-seq is a technique to identify where proteins bind to DNA in the genome. It involves cross-linking proteins to DNA in cells, fragmenting the DNA, immunoprecipitating the protein-DNA complexes using an antibody for the protein of interest, and then sequencing the retrieved DNA. This allows mapping of the genomic binding sites for the protein. The document discusses experimental design considerations for ChIP-seq, such as antibody choice and controls. It also reviews data analysis steps including read mapping, peak calling to identify enriched regions, and downstream analyses like motif finding. Higher resolution techniques like ChIP-exo are also introduced that can identify protein binding sites at base pair level.
After a genome has been sequenced, the first question that comes to mind is "Where are the genes?". Genome annotation is the process of attaching biological information to sequences. It is an active area of research, and knowing the coding parts of a genome greatly helps scientists plan and carry out their wet-lab projects.
The document provides an overview of computational methods for sequence alignment. It discusses different types of sequence alignment including global and local alignment. It also describes various methods for sequence alignment, such as dot matrix analysis, dynamic programming algorithms (e.g. Needleman-Wunsch, Smith-Waterman), and word/k-tuple methods. Scoring matrices like PAM and BLOSUM that are used for sequence alignments are also explained.
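The Needleman-Wunsch recurrence mentioned above can be sketched as a small dynamic-programming routine. The scoring values are the common textbook defaults (match +1, mismatch -1, gap -1), chosen here for illustration:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the Needleman-Wunsch
    dynamic-programming recurrence (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # align prefix of a against gaps
        F[i][0] = i * gap
    for j in range(1, cols):          # align prefix of b against gaps
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag,               # substitution
                          F[i - 1][j] + gap,  # gap in b
                          F[i][j - 1] + gap)  # gap in a
    return F[-1][-1]

score = needleman_wunsch("GATTACA", "GCATGCU")
```

Smith-Waterman local alignment differs mainly in clamping each cell at zero and taking the maximum over the whole matrix rather than the corner cell.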
Microarray technology allows researchers to analyze gene expression levels on a genomic scale. DNA microarrays contain many genes arranged on a slide that can be used to detect differences in gene expression between samples. The microarray workflow involves sample preparation, hybridization of labeled cDNA to the array, image scanning, data normalization and statistical analysis to identify differentially expressed genes between conditions. Multiple testing is a challenge and statistical methods must account for false positives and negatives.
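The multiple-testing correction mentioned above is commonly done with the Benjamini-Hochberg step-up procedure, which controls the false discovery rate; a minimal sketch (the p-values below are invented for illustration):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of
    tests declared significant while controlling the FDR at `alpha`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Find the largest rank whose p-value clears its threshold
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# Toy p-values for ten genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.5]
significant = benjamini_hochberg(pvals, alpha=0.05)
```

With thousands of genes tested at once, a plain 0.05 cutoff would flag hundreds of false positives; FDR control keeps the expected fraction of false discoveries among the reported genes at the chosen level.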
The document discusses Prosite, a database of protein family signatures that can be used to determine the function of uncharacterized proteins. It contains patterns and profiles formulated to identify which known protein family a new sequence belongs to. The Prosite database consists of two files - a data file containing information for scanning sequences, and a documentation file describing each pattern and profile. New Prosite entries are mainly profiles developed by collaborators at the SIB Swiss Institute of Bioinformatics to identify distantly related proteins based on conserved residues.
The CATH database hierarchically classifies protein domains obtained from protein structures deposited in the Protein Data Bank. Domain identification and classification uses both manual and automated procedures. CATH includes domains from structures determined at 4 angstrom resolution or better that are at least 40 residues long with 70% or more residues having defined side chains. Submitted protein chains are divided into domains, which are then classified in CATH.
This document discusses key concepts in comparative genomics including orthologs, paralogs, speciation, and clusters of orthologous genes (COGs). It defines orthologs as genes evolved from a common ancestor through speciation that retain the same function, while paralogs are related through duplication and may evolve new functions. COGs are groups of orthologous genes from different species that are more similar to each other than to other genes within individual genomes. The document notes that COGs can be used to predict gene function and track evolutionary divergence. It provides an example of the NCBI COG database containing over 136,000 proteins from 50 bacteria, 13 archaea and 3 eukaryotes classified into COGs.
Homology, paralogs, orthologs, and methods for detecting evolutionary relationships between proteins are discussed. Homologs are proteins derived from a common ancestor. Paralogs are homologs present within a species that evolved from a gene duplication event, while orthologs are homologs present between species that evolved from a speciation event and often retain similar functions. Sequence alignments and substitution matrices like BLOSUM-62 are used to statistically compare sequences and detect distant evolutionary relationships beyond just sequence identities by assigning scores to conserved amino acid substitutions. Introducing gaps improves alignments by accounting for insertions and deletions.
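Scoring an existing alignment with a substitution matrix is a straightforward sum over columns. The sketch below uses a hand-picked subset of BLOSUM62 entries (full matrices ship with packages such as Biopython); the aligned sequences are invented for illustration:

```python
# A few BLOSUM62 entries, just enough for the example below.
BLOSUM62_SUBSET = {
    ("H", "H"): 8, ("E", "E"): 5, ("A", "A"): 4, ("G", "G"): 6,
    ("W", "W"): 11, ("H", "Q"): 0, ("E", "D"): 2, ("A", "S"): 1,
}

def score_alignment(seq_a, seq_b, matrix, gap=-4):
    """Sum substitution scores over an already-aligned (gapped)
    pair of equal-length sequences."""
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            total += gap
        else:
            # The matrix is symmetric, so look up either orientation
            total += matrix.get((a, b), matrix.get((b, a)))
    return total

s = score_alignment("HEAGW", "QDSGW", BLOSUM62_SUBSET)
```

Note that conservative substitutions such as E/D score positively even though the residues differ, which is exactly how these matrices detect distant relationships that raw sequence identity would miss.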
Text mining for protein and small molecule relations (Lars Juhl Jensen)
The document discusses using text mining to identify relationships between proteins and small molecules mentioned in biomedical documents. It describes techniques for entity recognition and identification, as well as methods for extracting relationships between entities using co-occurrence analysis and natural language processing. Examples are provided to illustrate how relationships can be identified between proteins mentioned in a sample text passage.
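The co-occurrence analysis described above can be sketched as simple pair counting over documents. Real systems first run named-entity recognition; the substring matching and toy sentences below are illustrative shortcuts:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, entities):
    """Count how often each pair of entities is mentioned in the
    same document -- the simplest relation-extraction signal."""
    pairs = Counter()
    for doc in documents:
        present = sorted(e for e in entities if e in doc)
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

# Toy sentences with made-up mentions
docs = [
    "CDK1 phosphorylates CDC25 during mitosis",
    "CDC25 activates CDK1",
    "TP53 induces p21 expression",
]
counts = cooccurrence_counts(docs, entities={"CDK1", "CDC25", "TP53", "p21"})
```

Pairs that co-occur far more often than chance would predict are candidate relations; natural language processing is then needed to decide what kind of relation the text actually asserts.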
Literature mining: what is it, and should I care? (Lars Juhl Jensen)
The document discusses literature mining and natural language processing techniques for extracting information from scientific papers. It describes steps in an NLP pipeline including information retrieval to find relevant papers, entity recognition to identify substances, and information extraction to formalize facts. It also briefly acknowledges databases and tools used, and references a movie.
The STRING database integrates known and predicted protein-protein interactions, including direct (physical) and indirect (functional) associations derived from genomic context, high-throughput experiments, co-expression and literature mining. It covers over 373 proteomes and draws on data from curated databases, textmining and computational prediction methods to provide a global network of protein interactions. STRING uses a scoring scheme to assign probabilities to interactions based on different lines of evidence and benchmarking against a gold standard reference set.
STRING integrates diverse evidence about functional interactions between proteins from hundreds of proteomes. It combines data from genomic context methods, curated databases, experiments, and textmining to generate a global network of protein interactions. The different evidence sources have issues like inconsistent identifiers, variable quality, and coverage of different species that STRING addresses through parsers, orthology transfer, and quality scores to generate a single confidence score for each interaction.
Prediction of protein networks through data integration (Lars Juhl Jensen)
The document discusses methods for predicting protein-protein interaction networks through integrating diverse data sources. It describes the STRING database which predicts interactions between proteins in 373 genomes using genomic context methods, co-expression data, experiments, and literature mining. It also discusses NetworKIN, a method that predicts phosphorylation sites and potential kinase-substrate relationships through integrating phosphoproteomics data, sequence motifs, and network context. Benchmarking shows NetworKIN can predict interactions with over 2.5-fold greater accuracy compared to sequence-based methods alone by incorporating network context.
STRING - Modeling of biological systems through cross-species data integ... (Lars Juhl Jensen)
The document discusses the STRING database, which integrates data from diverse sources to predict protein-protein interactions and functional associations. It summarizes different lines of evidence used by STRING, including genomic context, co-expression, co-mentioning in articles, and transfer of functional annotations between orthologs. The document also briefly outlines how STRING scores and benchmarks different predictive methods and defines functional modules to model biological systems.
The STRING database - Quality scores for heterogeneous interaction data (Lars Juhl Jensen)
The document discusses the STRING database, which integrates protein-protein interaction data from numerous sources, including experimental interactions, genomic context methods, co-expression data, and literature mining. It describes the challenges of merging heterogeneous interaction data from various sources in different formats and with different gene identifiers. It also outlines STRING's approach to scoring and combining interactions from multiple data sources, as well as transferring interaction data between species using orthology.
The document discusses the STRING database and related tools for exploring protein-protein association networks, gene neighborhoods, phylogenetic profiles, and other computational predictions and experimental data. It notes that individual databases cover different species and formats, and have variable quality. STRING aims to integrate these resources using common identifiers, quality scores, and text mining while calibrating scores against experimental data and curated knowledge. Resources discussed include STRING for protein networks, STITCH for chemical networks, and COMPARTMENTS and TISSUES for subcellular localization and tissue expression data.
This document discusses using networks to derive biological function from genomic data. It mentions several types of data that can be used like gene expression, protein-protein interactions, genetic interactions, pathways, literature mining, and co-mentioning in text. It also notes challenges integrating these diverse data sources that have different formats, identifiers, quality, and are spread across many databases and genomes. Lastly, it recommends combining all available evidence to predict functional associations.
Network biology: Large-scale data and text mining, by Lars Juhl Jensen
This document discusses network biology and large-scale data and text mining. It describes how Lars Jensen uses computational predictions from over 1100 genomes along with experimental data and information extracted from text to build protein-protein association networks in STRING. These networks integrate known and predicted protein-protein interactions with functional associations, and are used to study biological systems at the network level.
This document discusses advanced bioinformatics approaches for analyzing proteomics datasets, including using signaling networks, association networks, and text mining. It describes using machine learning to predict protein interactions and developing scoring schemes to integrate data from multiple sources. The document also covers using text mining approaches like named entity recognition and information extraction to analyze the large amount of proteomics information available in scientific literature.
The document discusses the integration of diverse large-scale datasets to build comprehensive protein-protein interaction networks. It describes challenges with data from different sources having different identifiers, evidence types and quality. It also discusses methods used by STRING and other databases to combine data from curated databases, literature mining, primary datasets and transfer of interactions based on orthology. Examples are given of cell cycle studies in yeast that have analyzed periodically expressed genes and protein interactions.
This document discusses large-scale integration of biological data and text mining. It describes three main parts: association networks that connect entities based on "guilt by association", protein interaction networks built using data from STRING and 2000+ genomes, and genomic context evidence such as gene fusion, gene neighborhood, and phylogenetic profiles. It then provides examples of using STRING to query protein networks and discusses challenges of text mining, such as the exponential growth of the literature and the limitations of current natural language processing. Finally, it describes the Jensen Lab's approach of integrating curated knowledge, experimental data, predictions, and data from databases like STRING, STITCH, PubChem, COMPARTMENTS, Gene Ontology, UniProtKB, and disease databases into a common framework.
Exploring proteins, chemicals and their interactions with STRING and STITCH, by biocs
This document summarizes databases and computational methods for exploring protein and chemical interactions. It introduces STRING and STITCH, which integrate various sources of data to predict interactions. STRING contains information on protein-protein interactions for 373 genomes. STITCH contains data on interactions between proteins and chemicals, including drugs. The document provides an example of using NetworKIN to predict kinase-substrate relationships and discusses how these databases and methods can provide insights into interaction networks and biological functions.
This document discusses large-scale data and text mining techniques used by STRING to build comprehensive protein association networks. STRING integrates information from genomic context, high-throughput experiments, co-expression and curated databases to assign a confidence score to each association. Natural language processing is applied to mine the scientific literature and extract entity and relation information from millions of articles and abstracts to expand the known protein association networks beyond curated knowledge. STRING is freely accessible online and allows users to perform queries and analyze networks for various organisms.
STRING: Prediction of protein networks through integration of diverse large-scale data, by Lars Juhl Jensen
STRING integrates diverse large-scale data sets to predict protein networks. It provides a protein network for each species based on evidence from genomic neighborhood, gene fusions, co-expression, co-mentioning in literature, and phylogenetic profiles. Multiple types of evidence are calibrated against KEGG maps and combined to generate a consistent scoring scheme and predict functional associations between proteins across species.
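One of the genomic context signals mentioned above, phylogenetic profiles, compares the presence/absence pattern of two genes across many genomes: genes that always co-occur are likely to work together. A minimal sketch using Jaccard similarity over binary profiles (the genomes and genes here are illustrative):

```python
def jaccard(profile_a, profile_b):
    """Similarity of two phylogenetic profiles.

    Each profile is a presence (1) / absence (0) vector over the
    same ordered set of genomes; the Jaccard index is the fraction
    of genomes containing either gene that contain both.
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Hypothetical presence/absence of two genes across six genomes
gene_x = [1, 1, 0, 1, 0, 1]
gene_y = [1, 1, 0, 1, 0, 0]
print(jaccard(gene_x, gene_y))  # 3 shared / 4 total = 0.75
```

Real implementations use hundreds of genomes and more robust scores (e.g. mutual information), but the co-occurrence intuition is the same.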
One tagger, many uses: Illustrating the power of dictionary-based named entity recognition, by Lars Juhl Jensen
This document summarizes a Twitter thread discussing the uses of a dictionary-based named entity recognition tool called Tagger. Tagger can recognize genes, proteins, diseases, and other biomedical entities. It is open source, runs quickly, processing over 1,000 abstracts per second, and achieves 70-80% recall and 80-90% precision. Tagger has been applied to tasks like identifying drug-disease associations, adverse drug events, and protein-protein interactions. It is available as a Docker container or web service.
One tagger, many uses: Simple text-mining strategies for biomedicine, by Lars Juhl Jensen
The document summarizes a text mining tool called a tagger that can be used for named entity recognition in biomedical texts. It recognizes genes, proteins, chemicals, diseases, and other entities. The tagger is open source, runs quickly at over 1000 abstracts per second, and has 70-80% recall and 80-90% precision. It comes with Python and Docker implementations and can be accessed via a web service. It is useful for tasks like extracting functional associations from literature and electronic health records.
This document describes Extract 2.0, a text-mining tool that can assist with interactive annotation of documents. It uses dictionary-based tagging to identify relevant entities like genes and diseases. It achieves 70-80% recall and 80-90% precision on entity extraction and was evaluated in BioCreative challenges where it received positive feedback from curators. The tool is open source and available as a web service or Python wrapper.
Network visualization: A crash course on using Cytoscape, by Lars Juhl Jensen
This document discusses using Cytoscape, a network analysis tool, to import and visualize networks from STRING and STITCH databases. It provides three examples of networks created from literature and disease queries, demonstrating how to import networks and tables, apply node attributes and visual styles, perform enrichment analysis, and more.
STRING & STITCH: Network integration of heterogeneous data, by Lars Juhl Jensen
The document discusses STRING and STITCH, two online databases that integrate data on protein-protein interactions, pathways, and functional associations from various sources. STRING collects data on over 9.6 million proteins from sources like text mining, experimental assays, and co-expression analyses, aiming to provide a comprehensive global view of known and predicted protein associations. STITCH likewise integrates interaction data but focuses on chemical-protein interactions, covering some 430 thousand chemicals. Both databases provide user-friendly web interfaces for browsing and visualizing interaction networks.
Biomedical text mining: Automatic processing of unstructured text, by Lars Juhl Jensen
1) Lars Juhl Jensen discusses biomedical text mining and automatic processing of unstructured text such as patent literature, grant proposals, FDA product labels, and electronic medical records.
2) Named entity recognition is used to identify genes/proteins, chemical compounds, diseases, and other entities in text through comprehensive dictionaries and flexible matching rules that account for variations.
3) Relation extraction uses natural language processing techniques like part-of-speech tagging and sentence parsing along with manually crafted rules and machine learning to identify implicit relations between entities in text such as transcription factor targets, kinase substrates, and protein-protein interactions.
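The rule-based side of relation extraction can be illustrated with a toy sketch: hand-crafted patterns pair a verb cue with two entity slots. Real systems operate on part-of-speech tags and parse trees rather than raw regexes, and the entity names below are purely illustrative:

```python
import re

# Hand-crafted cue patterns: (regex with two entity slots, relation label)
PATTERNS = [
    (re.compile(r"(\w+) phosphorylates (\w+)"), "phosphorylates"),
    (re.compile(r"(\w+) interacts with (\w+)"), "interacts_with"),
    (re.compile(r"(\w+) is controlled by (\w+)"), "controlled_by"),
]

def extract_relations(sentence):
    """Return (subject, relation, object) triples matched by any pattern."""
    relations = []
    for pattern, rel in PATTERNS:
        for a, b in pattern.findall(sentence):
            relations.append((a, rel, b))
    return relations

print(extract_relations("CDK1 phosphorylates CDC25 and CYC1 is controlled by HAP1"))
# [('CDK1', 'phosphorylates', 'CDC25'), ('CYC1', 'controlled_by', 'HAP1')]
```

Even this crude approach shows why verb cues matter: the relation type (kinase-substrate vs. regulation) comes from the verb, while the entity slots come from named entity recognition.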
Medical network analysis: Linking diseases and genes through data and text mining, by Lars Juhl Jensen
The document summarizes the work of Lars Juhl Jensen and others on medical network analysis and linking diseases and genes through data and text mining of electronic health records. It discusses how they have used Danish national health registries containing data on over 6 million patients and 119 million diagnoses over 14 years to study disease trajectories and comorbidities. It also describes how they have developed methods to integrate data from various sources to generate networks linking diseases and genes.
Network Biology: A crash course on STRING and Cytoscape, by Lars Juhl Jensen
This document provides an overview of STRING, a protein-protein association database, and Cytoscape, a network visualization tool. It describes how STRING contains functional associations between proteins derived from genomic context, co-expression and curated databases. Cytoscape can import STRING networks and external data to map onto nodes. It offers visualization of networks through layouts and attributes, and analysis through clustering, selection filters and enrichment. The document recommends using these tools together to explore protein association networks.
This document discusses different approaches to visualizing cellular networks and the molecular interactions between proteins. It notes that there are many different types of data that could be shown, such as protein names, functions, localization, expression, modifications, and interaction types. However, it is impossible to show all this information at once. The document recommends using different visualizations like force-directed layouts to distribute proteins in 2D or lining up interactions in 1D. It acknowledges open challenges like showing time-course data and modification sites. In the end, the document thanks several researchers who have contributed to mapping and visualizing cellular networks.
Cellular Network Biology: Large-scale integration of data and text, by Lars Juhl Jensen
The document discusses various community resources and software tools for integrating large-scale data and text, including STRING for protein networks, STITCH for chemical networks, COMPARTMENTS for subcellular localization, TISSUES for tissue expression, and DISEASES for disease associations. It provides an overview of text mining techniques used to extract information from literature to build networks in these resources. The presenter demonstrates the Cytoscape App which can import and analyze networks from STRING, perform queries, and analyze subcellular localization, tissue expression, and disease enrichment.
Statistics on big biomedical data: Methods and pitfalls when analyzing high-throughput screens, by Lars Juhl Jensen
This document discusses statistical methods for analyzing high-throughput biomedical screens and common pitfalls. It introduces several statistical tests such as t-tests, ANOVA, Fisher's exact test, and the Mann-Whitney U test. It also discusses challenges like multiple testing, resampling techniques, and biases that can occur like studiedness bias and abundance bias in big data analyses. Controlling false discovery rates and considering effect sizes are recommended over solely relying on p-values to determine biological significance.
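The multiple-testing correction recommended above is typically the Benjamini-Hochberg procedure: sort the p-values, find the largest rank k whose p-value is at most (k/m)·α, and reject every hypothesis up to that rank. A minimal sketch (the p-values are made up for illustration):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha (BH procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value passes the BH threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])  # indices into the original list

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, alpha=0.05))  # [0, 1]
```

Only the two smallest p-values survive here, even though five are below 0.05 individually, which is exactly the point: with many tests, raw p-value thresholds drastically overstate significance.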
STRING & related databases: Large-scale integration of heterogeneous data, by Lars Juhl Jensen
The document discusses the STRING database, which integrates heterogeneous biological data to generate association networks for proteins. It describes how STRING collects and connects curated knowledge, experimental data, and predicted interactions from genomic context, co-expression and text mining. The document also outlines exercises for users to explore protein-protein associations in STRING and related databases that integrate data on subcellular localization, tissue expression, and disease associations.
Tagger: Rapid dictionary-based named entity recognition, by Lars Juhl Jensen
Tagger is a named entity recognition tool that can process over 1000 abstracts per second using a dictionary-based approach. It achieves 70-80% recall and 80-90% precision using comprehensive dictionaries, expansion rules, and a curated blacklist to identify entity types like genes, proteins, chemicals, and diseases. The tool has a C++ engine, is inherently thread-safe, and includes interactive annotation, Python wrappers, and a REST API.
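The core idea of dictionary-based tagging can be sketched in a few lines: scan the text for the longest phrase that matches a dictionary entry, skipping blacklisted common words. This is a toy sketch, not the actual C++ engine, and the dictionary entries and identifiers below are hypothetical:

```python
import re

# Hypothetical mini-dictionary: lowercase name -> (entity type, identifier)
DICTIONARY = {
    "gal4": ("gene", "YPL248C"),
    "cyc1": ("gene", "YJR048W"),
    "insulin receptor": ("protein", "INSR"),
}
BLACKLIST = {"was"}  # common words that collide with entity names

def tag(text, max_words=3):
    """Greedy longest-match dictionary tagging over whitespace tokens."""
    tokens = [m.group() for m in re.finditer(r"\S+", text)]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower().strip(".,;")
            if phrase in DICTIONARY and phrase not in BLACKLIST:
                hits.append((phrase, *DICTIONARY[phrase]))
                i += n
                break
        else:
            i += 1
    return hits

print(tag("The insulin receptor binds insulin; GAL4 regulates CYC1."))
```

The real tool adds orthographic-variation rules (case, hyphens, plurals) and a curated blacklist at far larger scale, but the longest-match dictionary lookup is the reason it can stay this fast.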
Network Biology: Large-scale integration of data and text, by Lars Juhl Jensen
Lars Juhl Jensen leads a group that conducts large-scale integration of biological and medical data using proteomics, text mining, and medical data mining. The group develops protein interaction networks, disease networks, and association networks. They collaborate internationally on projects involving over 9.6 million proteins and 2000 genomes. The group works to integrate data from many sources in different formats to build comprehensive networks and knowledgebases, and also mines biomedical text to link genes and proteins with diseases.
Medical text mining: Linking diseases, drugs, and adverse reactions, by Lars Juhl Jensen
This document discusses medical text mining and linking diseases, drugs, and adverse reactions. It describes using text mining on clinical narratives in Danish to recognize named entities like drugs and diseases, identify relationships between them like adverse drug reactions, and discover new ADRs. The goal is to generate structured data on topics like comorbidities, diagnosis trajectories, and reimbursement to supplement limited structured data and help busy doctors by analyzing large amounts of unstructured text.
Network biology: Large-scale integration of data and text, by Lars Juhl Jensen
The document discusses network biology and large-scale data integration. It describes protein-protein interaction networks like STRING that integrate data from curated knowledge, experiments, and predictions. It provides exercises to explore the human insulin receptor (INSR) in STRING, examining the types of evidence that support its interaction with IRS1. It also introduces other integrated networks like STITCH for chemicals and COMPARTMENTS for subcellular localization. Natural language processing techniques like named entity recognition, information extraction, and semantic tagging are used to integrate text data from the literature into these interaction networks.
Medical data and text mining: Linking diseases, drugs, and adverse reactions, by Lars Juhl Jensen
This document discusses medical data and text mining to link diseases, drugs, and adverse reactions. It describes using structured data from Danish central registries and unstructured data from hospital electronic health records. Named entity recognition is used to extract diseases, drugs, and adverse reactions from free text clinical notes written in Danish. Hand-crafted rules are developed to identify relationships between extracted entities like adverse drug reactions. This allows estimating frequencies of known adverse drug reactions and discovering new adverse drug reactions by analyzing diagnosis trajectories and medication information.
This document discusses cellular network biology and summarizes several key papers on topics like proteome analysis using mass spectrometry, integrating protein network and experimental data, challenges with different biological databases having varying formats and quality, and using natural language processing techniques like named entity recognition and relation extraction to analyze medical text for information like diagnosis trajectories and adverse drug reactions.
Network biology: Large-scale integration of data and text, by Lars Juhl Jensen
This document discusses natural language processing (NLP) techniques for extracting information from biomedical literature and integrating it with network and interaction data. It describes how NLP is used to identify entities like genes and proteins, extract relationships between entities, and integrate this text-mined information with existing interaction networks from databases like STRING to expand knowledge of protein interactions, complexes, pathways and associations with diseases. The document provides examples of using NLP analysis on sentences and the STRING and Tissues databases to explore tissue specificity and disease relationships for insulin and the insulin receptor.
The document discusses three parts of biomarker bioinformatics: data integration from multiple databases, text mining of scientific literature, and using that integrated data to prioritize biomarker candidates. It describes combining data on 9.6 million proteins from curated databases, using text mining to extract named entities from over 10,000 papers, and then using network and heat diffusion approaches to rank candidates based on evidence in the integrated data. The goal is to help identify new biomarker candidates from large amounts of biological data.
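The heat-diffusion idea for candidate ranking can be sketched as a random walk with restart: evidence "heat" placed on seed nodes spreads along network edges, and candidates closer to the evidence end up warmer. A minimal sketch on a toy network (the network and seeds are illustrative, not from the document's data):

```python
def random_walk_with_restart(adj, seeds, restart=0.5, iters=100):
    """Rank nodes by diffusing heat from seed nodes over a network.

    adj:   dict mapping node -> list of neighbours (undirected graph)
    seeds: nodes carrying prior evidence (e.g. known biomarkers)
    """
    nodes = list(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {}
        for n in nodes:
            # Heat flowing into n from each neighbour m, split over m's degree
            spread = sum(p[m] / len(adj[m]) for m in adj[n])
            nxt[n] = (1 - restart) * spread + restart * p0[n]
        p = nxt
    return sorted(p.items(), key=lambda kv: -kv[1])

# Toy chain A-B-C-D: candidate C sits closer to the seed A than D does
net = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ranking = random_walk_with_restart(net, seeds={"A"})
print(ranking[0][0])  # the seed "A" retains the most heat
```

The restart term keeps heat anchored near the evidence, so scores decay with network distance from the seeds; candidates are then ranked by their steady-state heat.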
Slide fragment: gene and protein names, cue words for entity recognition, and verbs for relation extraction, e.g. "[nxgene The GAL4 gene] [nxexpr the expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]"
Acknowledgments: The STRING team (EMBL): Christian von Mering, Berend Snel, Martijn Huynen, Sean Hooper, Samuel Chaffron, Julien Lagarde, Mathilde Foglierini, Peer Bork. Literature mining project (EML Research): Jasmin Saric, Rossitza Ouzounova, Isabel Rojas.