Sequencing data analysis
Workshop – part 1 / main principles and data formats



                       Outline

                     Introduction

                   Sequencing flow

        Main data formats throughout this flow




                   Maté Ongenaert
Introduction
Sequencing technology

The real cost of sequencing
                            Question:

     - What is the fraction of the cost of a NGS study of:
       (1) Sample collection and experimental design
                    (2) Sequencing itself
            (3) Data reduction and management
                  (4) Downstream analysis

Is this a surreal question? Not at all: imagine you are writing a
grant proposal for an NGS ChIP-seq experiment on 24
                              samples.

 You would need 3 HiSeq 2000 lanes, costing             8000 €
 Sample preparation cost                                1000 €
 Others                                                 1000 €
Do you ever include analysis costs? Personnel, infrastructure, …
Sequencing flow
Steps in sequencing experiments

                         Data analysis

              Raw machine reads… What’s next?

             Preprocessing (machine/technology)
              - adaptors, indexes, conversions,…
              - machine/technology dependent

           Reads with associated qualities (universal)
                           - FASTQ
                         - QC check

         Depending on application (generally applicable)
     - ‘de novo’ assembly of genome (bacterial genomes,…)
      - Mapping to a reference genome -> mapped reads
                       - SAM/BAM/…

          High-level analysis (specific for application)
                         - SNP calling
                        - Peak calling
Sequencing flow
                         Steps in sequencing experiments




                                    Main data formats:
                                       - Raw reads
                                     - Mapped reads
- Application dependent: ChIP-seq peaks, SNPs: their location and their characteristics
 > Intended for: visualization / further analysis (by humans or computers) / reduction ??
Sequencing data formats
                                                                    Raw reads

                                                            Raw sequence reads:

- Represent the sequence ~ FASTA
     >SEQUENCE_IDENTIFIER
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT


- Extension: represent the quality, per base ~ FASTQ – Q for quality
     @SEQUENCE_IDENTIFIER
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
     +
     !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65



- OK, the strange characters on the last line encode the quality of the corresponding base…
  But what’s the decoding scheme? (Nerd alert ahead !!)
- We want to represent quality scores ~ Phred scores
- Q = -10 log10(P) (with P being the probability that the base call is wrong)
Phred quality scores are logarithmically linked to error probabilities:

     Phred quality score   Probability of incorrect base call   Base call accuracy
     20                    1 in 100                             99 %
     30                    1 in 1000                            99.9 %
     40                    1 in 10000                           99.99 %
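
A minimal sketch of the relation Q = -10 log10(P), reproducing the rows of the table above. The function names are illustrative and not part of any particular tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Print probability and accuracy for the three scores listed in the table.
for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P = {p:g}, accuracy = {100 * (1 - p):g} %")
```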
Sequencing data formats
                                       Raw reads

- Phred scores thus typically have 2 digits, but you want one character per base so the
  scores line up with the sequence in the file… What would a nerd do? Use ASCII as a
  lookup table, of course! -> one character ~ one decimal number
Sequencing data formats
                                             Raw reads
@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

 - OK, so the character 5 is ASCII code 53… But the printable characters only start at 33…
   So 5 actually means 53 - 33 = 20 as Phred quality…
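
The decoding step can be sketched in a few lines, assuming the Phred+33 offset used in the example record. `decode_qualities` is a hypothetical helper name.

```python
def decode_qualities(quality_line: str, offset: int = 33) -> list[int]:
    """Turn each quality character into a Phred score: ASCII code minus offset."""
    scores = [ord(ch) - offset for ch in quality_line]
    if scores and min(scores) < 0:
        raise ValueError("character below the offset: wrong encoding?")
    return scores


# Quality line of the example FASTQ record.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = decode_qualities(quality)
print(scores[:5])             # scores of the first five bases
print(decode_qualities("5"))  # the character discussed above
```

With a Phred+64 file, the same function would be called with `offset=64`.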
Example of the identifier line for Illumina data (non-multiplexed):

#@machine_id:lane:tile:x:y:multiplex:pair
@HWUSI-EAS100R:6:73:941:1973#0/1
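
As an illustration, the identifier line above can be split into its parts with a regular expression. This is a sketch that only covers the layout shown here (`@machine_id:lane:tile:x:y#multiplex/pair`); other instruments and software versions write different headers. The class and function names are made up for the example.

```python
import re
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    machine_id: str
    lane: int
    tile: int
    x: int
    y: int
    multiplex: str
    pair: int


# Pattern for the layout shown on the slide only.
_ID_PATTERN = re.compile(
    r"^@(?P<machine_id>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<multiplex>[^/]+)/(?P<pair>\d+)$"
)


def parse_illumina_id(line: str) -> IlluminaReadId:
    """Split an identifier line into its named fields."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier line: {line!r}")
    g = match.groupdict()
    return IlluminaReadId(
        g["machine_id"], int(g["lane"]), int(g["tile"]),
        int(g["x"]), int(g["y"]), g["multiplex"], int(g["pair"]),
    )


read_id = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(read_id)
```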



 -   Phred+33 -> Sanger
 -   Illumina 1.3+ -> Phred+64
 -   Illumina 1.5+ -> Phred+64
 -   Illumina 1.8+ -> Phred+33
 -   SOLiD -> Sanger

 Check your instrument + version -> FastQC will give you a hint which scoring scheme is
 probably used
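
A rough sketch of how such a hint can be derived from the quality characters alone. This is a heuristic written for illustration, not FastQC's actual implementation: it assumes that Phred+33 scores do not exceed 41 (character `J`) and relies on the fact that characters below `;` (ASCII 59) cannot occur in the +64 encodings.

```python
def guess_phred_offset(quality_lines):
    """Return 33, 64, or None (ambiguous) based on the range of quality characters."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' is impossible with a +64 offset
        return 33
    if max(codes) > 74:   # above 'J' would mean Q > 41 with +33 (assumed not to occur)
        return 64
    return None           # range fits both encodings; look at more reads


print(guess_phred_offset(["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"]))
```

In practice the guess gets more reliable when many reads are inspected, since a handful of high-quality reads can fit both schemes.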

 File extensions: .fastq / .fq
Sequencing data formats
                                        Raw reads




- Special: SRA files from NCBI/EBI Sequence Read Archive
- Contains raw sequence data from (GEO) studies for all kinds of instruments and
  platforms
- Exercise: we have submitted NGS data (MBD-seq) for 8 NB cell lines to GEO and the raw
  data to SRA; find the SRA files. How would you obtain our originally submitted FASTQ
  files? (HINT: SRA Toolkit)
- Exercise (caution: nerd alert): working in the terminal… Retrieve the FASTQ file from
  the SRA file and perform a FastQC analysis
Linux… for human beings?
         The terminal

    What they show in ‘The Matrix’ is a real Linux terminal with
    real commands…
Linux… for human beings?
       The terminal

                       Server: ***********
                       Port: *****

                       Login: *********
                       Password: *********
                       Nothing appears on screen
                       while you type the password…
Linux… for human beings?
                                       The terminal

                                                      You are interactively
                                                      logged in now! Meaning
                                                      everything you type is sent
                                                      to the server and executed

                                                      + Fast, no eye-candy
                                                      + Easy to develop a
                                                      command-line interface

                                                      - Not so intuitive
                                                      - Steep learning curve
                                                      - High nerd-level


You may have to type bash to see a line that
starts with student@mellfire:/home/student

Where are you?
/ is root
/home is the folder with user documents
Linux… for human beings?
                                           The terminal

cd
Change directory - cd .. (go to higher level) – cd ../../..

mkdir
Make directory (is a folder)

cp
Copy

mv
Move

ls (-ahl)
List all contents of a folder (DOS: dir)

rm
Remove (DOS: del)

man
Manual (Q to quit man)
Linux… for human beings?
                                             The terminal

vi
Text editor (:q! to exit from vi)

head and tail
See first lines / last lines of a textfile

top
Table of processes

who and whoami
List the users logged in (who) and show your own user name (whoami)
Sequencing data formats
                                                                       Mapped reads

- Mapping: ‘align’ these raw reads to a reference genome
- Single-end or paired-end data?
- How would you align a short read to the reference?

- Old-school: Smith-Waterman, BLAST, BLAT,…
- Now: mapping tools for short reads that use intelligent indexing and allow mismatches

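Short-read mapping programs: algorithm and other supported features

    Program   Reference   Algorithm                        Other features
    SOAP      [51]        Hash table (hash reference)      Quality, paired end, long reads
    MAQ       [54]        Hash table (hash reads)          Colorspace, quality, paired end, bisulfite
    Mosaik                Hash table (hash reference)      Colorspace, quality, paired end, long reads
    Eland                 Hash table (hash reads)          Quality
    SSAHA2    [61]        Hash table (hash reference)      Paired end, long reads
    Bowtie    [67]        Suffix tree (FM-index)           Colorspace, quality, paired end
    BWA       [69]        Suffix tree (FM-index)           Colorspace, paired end, long reads
    BWA-SW    [69]        Suffix tree (FM-index)           Colorspace, 454, paired end, long reads
    SOAP2     [70]        Suffix tree (FM-index)           Colorspace, quality, paired end, long reads

    (No program in this list uses a plain suffix tree, an enhanced suffix array or merge sorting.)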
Sequencing data formats
                                      Mapped reads

- Most commonly used worldwide and in our lab as well: BWA and Bowtie, both using
  Burrows-Wheeler transformations and FM indexes
- Optimized for short NGS reads (from about 30 bp to roughly 200 bp)
- Versions exist for longer reads (such as 454): Bowtie2 and BWA-SW
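
To give a feel for the Burrows-Wheeler transformation mentioned above, here is a naive sketch that builds it from sorted rotations and inverts it again. Real mappers such as BWA and Bowtie build the transform and the FM index far more efficiently; this toy version only illustrates the idea, and the `$` sentinel is a common textbook convention assumed here.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, which is what makes the compressed, searchable FM index possible.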

-   What would a file contain, describing mapped reads?
-   Position: chr / start / stop
-   Sequence: read / references
-   Mismatches / indels / vs. the reference
-   Quality information

- A few years ago, each tool had its own output format (Bowtie, …)
- Now moving to a common file format -> SAM / BAM (Sequence Alignment/Map)
Sequencing data formats
                                 Mapped reads

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33
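
The CIGAR string describes how the read lines up with the reference as a series of length + operation pairs. A small sketch, using the operation letters defined in the SAM specification (M, I, D, N, S, H, P, =, X); the helper names are illustrative.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up bases of the read (SEQ)
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up bases of the reference


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases the alignment spans."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("8M2I4M1D3M"))
print(query_length("8M2I4M1D3M"), reference_length("8M2I4M1D3M"))
```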

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
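
A sketch of reading one alignment line into the 11 mandatory fields and decoding the FLAG bits. Real SAM files are tab-separated; the example on the slide uses spaces, so this sketch splits on any whitespace, which is an assumption that would break on optional tags containing spaces. The names `SamRecord`, `parse_sam_line` and `decode_flag` are hypothetical.

```python
from typing import NamedTuple

# Meaning of the FLAG bits as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "first_in_pair"), (0x80, "second_in_pair"), (0x100, "secondary"),
    (0x200, "qc_fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; anything after field 11 is kept as optional tags."""
    f = line.split()  # whitespace split, to match the slide example
    if len(f) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def decode_flag(flag: int) -> list:
    """List the names of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]


record = parse_sam_line("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
print(record.rname, record.pos, decode_flag(record.flag))
```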
Sequencing data formats
                                     Mapped reads

- BAM: compressed binary version of SAM: not human-readable, but it can be indexed for
  fast access by other tools / visualisation / …

- Exercise: view a BAM file in IGV
Sequencing data formats
                                            Other formats

- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; binary version: bigBed
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -
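
A minimal sketch of reading the six BED fields listed above. BED coordinates are 0-based with an exclusive end, so the length of a region is simply end - start. The class and function names are illustrative.

```python
from typing import NamedTuple, Optional


class BedInterval(NamedTuple):
    chrom: str
    start: int                    # 0-based
    end: int                      # exclusive
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None


def parse_bed(lines):
    """Parse BED lines, skipping comments and 'track' / 'browser' lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        f = line.split()
        if len(f) < 3:
            raise ValueError(f"a BED line needs at least 3 fields: {line!r}")
        intervals.append(BedInterval(
            f[0], int(f[1]), int(f[2]),
            f[3] if len(f) > 3 else None,
            int(f[4]) if len(f) > 4 else None,
            f[5] if len(f) > 5 else None,
        ))
    return intervals


bed_lines = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22 1000 5000 cloneA 960 +",
    "chr22 2000 6000 cloneB 900 -",
]
for interval in parse_bed(bed_lines):
    print(interval.name, interval.end - interval.start, interval.strand)
```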


- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
Sequencing data formats
                                           Other formats

- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM) – binary version: bigWig (often used in GEO for ChIP-seq peaks)




browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
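
A sketch of how the variableStep block above can be expanded into explicit intervals. WIG positions are 1-based, and each data line covers `span` bases starting at the given position. Only variableStep is handled here; `parse_variable_step` is a made-up helper name.

```python
def parse_variable_step(lines):
    """Return (chrom, start, end, value) tuples with 1-based, inclusive coordinates."""
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # Declaration line: key=value pairs such as chrom=chr19 span=150.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if line.startswith("fixedStep"):
            raise NotImplementedError("fixedStep is not covered by this sketch")
        if chrom is None:
            raise ValueError("data line found before a variableStep declaration")
        start_text, value_text = line.split()
        start = int(start_text)
        intervals.append((chrom, start, start + span - 1, float(value_text)))
    return intervals


wig_lines = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
print(parse_variable_step(wig_lines))
```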
Sequencing data formats
                                      Other formats

- GFF format (General Feature Format)
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#chr   source   feature   start    end   scores    tr fr group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1020000 800        - . touch2
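
The nine GFF fields can be read into a small record as sketched below. GFF files are normally tab-separated; the slide example uses spaces, so this sketch splits on whitespace for the first eight fields and keeps the rest as the group. GFF coordinates are 1-based and inclusive, so a feature's length is end - start + 1. The names are illustrative.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # inclusive
    score: Optional[float]  # None when the file has '.'
    strand: str
    frame: str
    group: str


def parse_gff_line(line: str) -> GffFeature:
    """Parse one GFF feature line into its nine fields."""
    f = line.strip().split(None, 8)  # at most 9 fields; group may contain spaces
    if len(f) != 9:
        raise ValueError(f"a GFF line needs 9 fields: {line!r}")
    score = None if f[5] == "." else float(f[5])
    return GffFeature(f[0], f[1], f[2], int(f[3]), int(f[4]), score, f[6], f[7], f[8])


feature = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(feature.feature, feature.end - feature.start + 1, feature.group)
```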
Sequencing data formats
                                     Other formats

- VCF format (Variant Call Format)
For representing SNPs and other variants
Sequencing data formats
                                    Other formats

- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats, including the most commonly used formats that are
  widely accepted

- In addition, ENCODE data formats (narrowPeak / broadPeak)
 
Ecobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis LokerenEcobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis Lokeren
Maté Ongenaert
 
Workshop NGS data analysis - 3
Workshop NGS data analysis - 3Workshop NGS data analysis - 3
Workshop NGS data analysis - 3
Maté Ongenaert
 
ENCODE project: brief summary of main findings
ENCODE project: brief summary of main findingsENCODE project: brief summary of main findings
ENCODE project: brief summary of main findings
Maté Ongenaert
 
Exploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosisExploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosis
Maté Ongenaert
 
High-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting themHigh-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting them
Maté Ongenaert
 
Microarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the benchMicroarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the bench
Maté Ongenaert
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biology
Maté Ongenaert
 
Integrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functionsIntegrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functions
Maté Ongenaert
 
Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...
Maté Ongenaert
 
Bringing the data back to the researchers
Bringing the data back to the researchersBringing the data back to the researchers
Bringing the data back to the researchers
Maté Ongenaert
 
The post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integrationThe post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integration
Maté Ongenaert
 
Literature managment training
Literature managment trainingLiterature managment training
Literature managment training
Maté Ongenaert
 
Scientific literature managment - exercises
Scientific literature managment - exercisesScientific literature managment - exercises
Scientific literature managment - exercises
Maté Ongenaert
 
Ad

Recently uploaded (20)

The Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional DesignThe Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional Design
Sean Michael Morris
 
Basic principles involved in the traditional systems of medicine, Chapter 7,...
Basic principles involved in the traditional systems of medicine,  Chapter 7,...Basic principles involved in the traditional systems of medicine,  Chapter 7,...
Basic principles involved in the traditional systems of medicine, Chapter 7,...
ARUN KUMAR
 
How to Automate Activities Using Odoo 18 CRM
How to Automate Activities Using Odoo 18 CRMHow to Automate Activities Using Odoo 18 CRM
How to Automate Activities Using Odoo 18 CRM
Celine George
 
Intervene with Precision: Zooming In as a Leader Without Micromanaging
Intervene with Precision: Zooming In as a Leader Without MicromanagingIntervene with Precision: Zooming In as a Leader Without Micromanaging
Intervene with Precision: Zooming In as a Leader Without Micromanaging
victoriamangiantini1
 
Online elections for Parliament for European Union
Online elections for Parliament for European UnionOnline elections for Parliament for European Union
Online elections for Parliament for European Union
Monica Enache
 
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdfTechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup
 
How to Manage Customer Info from POS in Odoo 18
How to Manage Customer Info from POS in Odoo 18How to Manage Customer Info from POS in Odoo 18
How to Manage Customer Info from POS in Odoo 18
Celine George
 
NS3 Unit 5 Matter changes presentation.pptx
NS3 Unit 5 Matter changes presentation.pptxNS3 Unit 5 Matter changes presentation.pptx
NS3 Unit 5 Matter changes presentation.pptx
manuelaromero2013
 
Flower Identification Class-10 by Kushal Lamichhane.pdf
Flower Identification Class-10 by Kushal Lamichhane.pdfFlower Identification Class-10 by Kushal Lamichhane.pdf
Flower Identification Class-10 by Kushal Lamichhane.pdf
kushallamichhame
 
Product in Wartime: How to Build When the Market Is Against You
Product in Wartime: How to Build When the Market Is Against YouProduct in Wartime: How to Build When the Market Is Against You
Product in Wartime: How to Build When the Market Is Against You
victoriamangiantini1
 
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup
 
CANSA World No Tobacco Day campaign 2025 Vaping is not a safe form of smoking...
CANSA World No Tobacco Day campaign 2025 Vaping is not a safe form of smoking...CANSA World No Tobacco Day campaign 2025 Vaping is not a safe form of smoking...
CANSA World No Tobacco Day campaign 2025 Vaping is not a safe form of smoking...
CANSA The Cancer Association of South Africa
 
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
Quiz Club of PSG College of Arts & Science
 
the dynastic history of Paramaras of Malwa
the dynastic history of Paramaras of Malwathe dynastic history of Paramaras of Malwa
the dynastic history of Paramaras of Malwa
PrachiSontakke5
 
Protest - Student Revision Booklet For VCE English
Protest - Student Revision Booklet For VCE EnglishProtest - Student Revision Booklet For VCE English
Protest - Student Revision Booklet For VCE English
jpinnuck
 
Regression Analysis-Machine Learning -Different Types
Regression Analysis-Machine Learning -Different TypesRegression Analysis-Machine Learning -Different Types
Regression Analysis-Machine Learning -Different Types
Global Academy of Technology
 
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdfLeveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
TechSoup
 
The Board Doesn’t Care About Your Roadmap: Running Product at the Board
The Board Doesn’t Care About Your Roadmap: Running Product at the BoardThe Board Doesn’t Care About Your Roadmap: Running Product at the Board
The Board Doesn’t Care About Your Roadmap: Running Product at the Board
victoriamangiantini1
 
Salinity Resistance in Plants.Rice plant
Salinity Resistance in Plants.Rice plantSalinity Resistance in Plants.Rice plant
Salinity Resistance in Plants.Rice plant
aliabatool11
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-21-2025.pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-21-2025.pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-21-2025.pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-21-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
The Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional DesignThe Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional Design
Sean Michael Morris
 
Basic principles involved in the traditional systems of medicine, Chapter 7,...
Basic principles involved in the traditional systems of medicine,  Chapter 7,...Basic principles involved in the traditional systems of medicine,  Chapter 7,...
Basic principles involved in the traditional systems of medicine, Chapter 7,...
ARUN KUMAR
 
How to Automate Activities Using Odoo 18 CRM
How to Automate Activities Using Odoo 18 CRMHow to Automate Activities Using Odoo 18 CRM
How to Automate Activities Using Odoo 18 CRM
Celine George
 
Intervene with Precision: Zooming In as a Leader Without Micromanaging
Intervene with Precision: Zooming In as a Leader Without MicromanagingIntervene with Precision: Zooming In as a Leader Without Micromanaging
Intervene with Precision: Zooming In as a Leader Without Micromanaging
victoriamangiantini1
 
Online elections for Parliament for European Union
Online elections for Parliament for European UnionOnline elections for Parliament for European Union
Online elections for Parliament for European Union
Monica Enache
 
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdfTechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup Introduction to Generative AI and Copilot - 2025.05.22.pdf
TechSoup
 
How to Manage Customer Info from POS in Odoo 18
How to Manage Customer Info from POS in Odoo 18How to Manage Customer Info from POS in Odoo 18
How to Manage Customer Info from POS in Odoo 18
Celine George
 
NS3 Unit 5 Matter changes presentation.pptx
NS3 Unit 5 Matter changes presentation.pptxNS3 Unit 5 Matter changes presentation.pptx
NS3 Unit 5 Matter changes presentation.pptx
manuelaromero2013
 
Flower Identification Class-10 by Kushal Lamichhane.pdf
Flower Identification Class-10 by Kushal Lamichhane.pdfFlower Identification Class-10 by Kushal Lamichhane.pdf
Flower Identification Class-10 by Kushal Lamichhane.pdf
kushallamichhame
 
Product in Wartime: How to Build When the Market Is Against You
Product in Wartime: How to Build When the Market Is Against YouProduct in Wartime: How to Build When the Market Is Against You
Product in Wartime: How to Build When the Market Is Against You
victoriamangiantini1
 
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup - Microsoft Discontinuation of Selected Cloud Donated Offers 2025.05...
TechSoup
 
the dynastic history of Paramaras of Malwa
the dynastic history of Paramaras of Malwathe dynastic history of Paramaras of Malwa
the dynastic history of Paramaras of Malwa
PrachiSontakke5
 
Protest - Student Revision Booklet For VCE English
Protest - Student Revision Booklet For VCE EnglishProtest - Student Revision Booklet For VCE English
Protest - Student Revision Booklet For VCE English
jpinnuck
 
Regression Analysis-Machine Learning -Different Types
Regression Analysis-Machine Learning -Different TypesRegression Analysis-Machine Learning -Different Types
Regression Analysis-Machine Learning -Different Types
Global Academy of Technology
 
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdfLeveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
Leveraging AI to Streamline Operations for Nonprofits [05.20.2025].pdf
TechSoup
 
The Board Doesn’t Care About Your Roadmap: Running Product at the Board
The Board Doesn’t Care About Your Roadmap: Running Product at the BoardThe Board Doesn’t Care About Your Roadmap: Running Product at the Board
The Board Doesn’t Care About Your Roadmap: Running Product at the Board
victoriamangiantini1
 
Salinity Resistance in Plants.Rice plant
Salinity Resistance in Plants.Rice plantSalinity Resistance in Plants.Rice plant
Salinity Resistance in Plants.Rice plant
aliabatool11
 

Workshop NGS data analysis - 1

  • 1. Sequencing data analysis Workshop – part 1 / main principles and data formats Outline Introduction Sequencing flow Main data formats throughout this flow Maté Ongenaert
  • 3. Introduction Sequencing technology The real cost of sequencing
      Question: what fraction of the cost of an NGS study goes to:
      (1) Sample collection and experimental design
      (2) Sequencing itself
      (3) Data reduction and management
      (4) Downstream analysis
      Is this a surrealistic question? Not at all: imagine you are writing a grant proposal for an NGS ChIP-seq experiment of 24 samples.
      You would need 3 HiSeq 2000 lanes, which cost you 8000 €
      Sample preparation cost 1000 €
      Others 1000 €
      Do you ever include analysis costs? Personnel, infrastructure,…
  • 4. Introduction Sequencing technology The real cost of sequencing
  • 11. Sequencing flow Steps in sequencing experiments Data analysis
      Raw machine reads… What’s next?
      Preprocessing (machine/technology)
      - adaptors, indexes, conversions,…
      - machine/technology dependent
      Reads with associated qualities (universal)
      - FASTQ
      - QC check
      Depending on application (generally applicable)
      - ‘de novo’ assembly of genome (bacterial genomes,…)
      - Mapping to a reference genome → mapped reads
      - SAM/BAM/…
      High-level analysis (specific for application)
      - SNP calling
      - Peak calling
  • 12. Sequencing flow Steps in sequencing experiments
  • 14. Sequencing flow Steps in sequencing experiments Main data formats: - Raw reads - Mapped reads - Application dependent: ChIP-seq peaks, SNPs: their location and their characteristics > Intended for: visualization / further analysis (by humans or computers) / reduction ??
  • 15. Sequencing data formats Raw reads
      Raw sequence reads:
      - Represent the sequence ~ FASTA
        >SEQUENCE_IDENTIFIER
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
      - Extension: represent the quality, per base ~ FASTQ (Q for quality)
        @SEQUENCE_IDENTIFIER
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
        +
        !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
      - OK, the strange signs on the last line indicate the quality of the corresponding base… But what’s the decoding scheme? (Nerd alert ahead!!)
      - We want to represent quality scores ~ Phred scores
      - Q = -10 log10(P) (with P being the probability that a base is called in error)
      Phred quality scores are logarithmically linked to error probabilities:
      Phred quality score | Probability of incorrect base call | Base call accuracy
      20                  | 1 in 100                           | 99 %
      30                  | 1 in 1000                          | 99.9 %
      40                  | 1 in 10000                         | 99.99 %
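A minimal Python sketch of the Phred relation on this slide, assuming the base-10 logarithm. The function names are illustrative and not part of the workshop material.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy.
for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(q, p, f"{(1 - p) * 100:.2f} %")
```

Each step of 10 in the Phred score divides the error probability by 10, which is why 20, 30 and 40 are the usual landmarks.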
  • 16. Sequencing data formats Raw reads - Phred scores thus typically have 2 digits, but you want a single character per base so that the quality line and the sequence line correspond one-to-one in the file… What would a nerd do? Use ASCII as a lookup table of course! → one character ~ one decimal number
  • 17. Sequencing data formats Raw reads
      @SEQUENCE_IDENTIFIER
      GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
      +
      !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
      - OK, so the character 5 is actually ASCII code 53… But the printable characters only start at 33… So 5 stands for 53 - 33 = 20 Phred quality…
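The decoding step can be sketched in a few lines of Python. This assumes the Phred+33 (Sanger) offset used in the example; `decode_qualities` is a hypothetical helper name.

```python
def decode_qualities(quality_line, offset=33):
    """Turn a FASTQ quality line into a list of Phred scores.

    Each character is looked up in ASCII with ord() and the
    encoding offset (33 for Sanger / Phred+33) is subtracted.
    """
    return [ord(ch) - offset for ch in quality_line]


qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = decode_qualities(qual)

print(scores[0])   # quality of the first base ('!')
print(scores[-1])  # quality of the last base ('5')
```

For a Phred+64 file the same function can be called with `offset=64`.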
  • 18. Sequencing data formats Raw reads
      @SEQUENCE_IDENTIFIER
      GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
      +
      !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
      Example of the identifier line for Illumina data (non-multiplexed):
      #@machine_id:lane:tile:x:y:multiplex:pair
      @HWUSI-EAS100R:6:73:941:1973#0/1
      - Phred +33 → Sanger
      - Illumina 1.3+ → Phred +64
      - Illumina 1.5+ → Phred +64
      - Illumina 1.8+ → Phred +33
      - SOLiD → Sanger
      Check your instrument + version → FastQC will give you a hint which scoring scheme is probably used
      Extensions: FASTQ / FQ
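A rough way to get such a hint yourself is to look at the lowest quality character in the file. The Python sketch below is only a heuristic under the assumption that the file is either Phred+33 or Phred+64; it is not how FastQC is implemented, and `guess_phred_offset` is a hypothetical name.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset from the lowest character seen.

    Returns 33, 64, or None when the characters do not decide it.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        # Characters below ';' cannot occur in a +64 encoding.
        return 33
    if lowest >= 64:
        # Probably +64, although a +33 file containing only
        # high-quality bases would look the same.
        return 64
    return None  # codes 59-63: ambiguous (e.g. old Solexa scores)


print(guess_phred_offset(["!''*((((***+))%%%++"]))
print(guess_phred_offset(["hhhhgggfffee"]))
```

The more reads you feed in, the more reliable the guess becomes, because low-quality bases are more likely to show up.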
  • 19. Sequencing data formats Raw reads
      - Special: SRA files from NCBI/EBI Sequence Read Archive
      - Contains raw sequence data from (GEO) studies for all kinds of instruments and platforms
      - Exercise: we have submitted NGS data (MBD-seq) for 8 NB cell lines to GEO and the raw data to SRA; find the SRA files. How would you obtain our originally submitted FASTQ files? (HINT: SRA Toolkit)
      - Exercise (caution: nerd alert): working in the terminal… Retrieve the FASTQ file from the SRA file and perform a FastQC analysis
  • 20. Linux… for human beings? The terminal What they show in ‘The matrix’ is a real Linux-terminal and real commands…
  • 21. Linux… for human beings? The terminal
  • 22. Linux… for human beings? The terminal Server: *********** Port: ***** Login: ********* Password: ********* You will not see anything on screen while you type the password…
  • 23. Linux… for human beings? The terminal
      You are interactively logged in now! Meaning everything you type is sent to the server and executed
      + Fast, no eye-candy
      + Easy to develop a command-line interface
      - Not so intuitive
      - Steep learning curve
      - High nerd-level
      You may have to type bash to see a line that starts with student@mellfire:/home/student
      Where are you?
      / is root
      /home is the folder with user documents
  • 24. Linux… for human beings? The terminal
      cd        Change directory - cd .. (go to higher level) – cd ../../..
      mkdir     Make directory (is a folder)
      cp        Copy
      mv        Move
      ls (-ahl) List all contents of a folder (DOS: dir)
      rm        Remove (DOS: del)
      man       Manual (Q to quit man)
  • 25. Linux… for human beings? The terminal
      vi              Text editor (:q! to exit from vi)
      head and tail   See first lines / last lines of a text file
      top             Table of processes
      who and whoami  List of users logged in, and a useful command for people with schizophrenia
  • 26. Linux… for human beings? The terminal
  • 27. Sequencing data formats Mapped reads
      - Mapping: ‘align’ these raw reads to a reference genome
      - Single-end or paired-end data?
      - How would you align a short read to the reference?
      - Old-school: Smith-Waterman, BLAST, BLAT,…
      - Now: mapping tools for short reads that use intelligent indexing and allow mismatches
      [Table: programs SOAP [51], MAQ [54], Mosaik, Eland, SSAHA2 [61], Bowtie [67], BWA [69], BWA-SW [69], SOAP2 [70], compared by algorithm (hash reference, hash reads, enhanced suffix array, suffix tree, FM-index, merge sorting) and other features (colorspace, 454, quality, paired end, long reads, bisulfite)]
  • 28. Sequencing data formats Mapped reads
      - Most commonly used worldwide and in our lab as well: BWA and Bowtie, both using the Burrows-Wheeler transform and FM indexes
      - Optimized for short NGS reads (from about 30 bp to about 200 bp)
      - Versions exist for longer reads (such as 454): Bowtie2 and BWA-SW
      - What would a file describing mapped reads contain?
        - Position: chr / start / stop
        - Sequence: read / reference
        - Mismatches / indels vs. the reference
        - Quality information
      - A few years ago, each tool had its own output format → Bowtie,…
      - Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
  • 29. Sequencing data formats Mapped reads
      - Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
      DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION
      # QNAME: template name
      # FLAG
      # RNAME: reference name
      # POS: mapping position
      # MAPQ: mapping quality
      # CIGAR: CIGAR string
      # RNEXT: reference name of the mate/next fragment
      # PNEXT: position of the mate/next fragment
      # TLEN: observed template length
      # SEQ: fragment sequence
      # QUAL: ASCII of Phred-scale base quality+33
      #Headers
      @HD VN:1.3 SO:coordinate
      @SQ SN:ref LN:45
      #Alignment block
      r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
      r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
      r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
      r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
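To make the 11 fields concrete, here is a small Python sketch that splits one alignment line into named fields and takes the CIGAR string apart. It assumes tab-separated columns, as in real SAM files; `SamRecord`, `parse_sam_line`, `parse_cigar` and `reference_span` are illustrative names, and in practice you would rather use samtools or an existing library.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered: only M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
rec = parse_sam_line(line)

print(rec.pos, parse_cigar(rec.cigar), reference_span(rec.cigar))
print("paired:", bool(rec.flag & 0x1), "reverse strand:", bool(rec.flag & 0x10))
```

The FLAG is a bit field, so individual properties are tested with a bitwise AND rather than by comparing the whole number.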
  • 30. Sequencing data formats Mapped reads - BAM: binary version of SAM: not human readable but indexed for fast access for other tools / visualisation / … - Exercise: view a BAM file in IGV
  • 31. Sequencing data formats Other formats
      - BED files (location / annotation / scores): Browser Extensible Data
        Used for mapping / annotation / peak locations
        - extension: bigBed (binary)
      FIELDS USED: # chr # start # end # name # score # strand
      track name=pairedReads description="Clone Paired Reads" useScore=1
      #chr start end name score strand
      chr22 1000 5000 cloneA 960 +
      chr22 2000 6000 cloneB 900 -
      - BEDGraph files (location, combined with score)
        Used to represent peak scores
      track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
      #chr start end score
      chr19 59302000 59302300 -1.0
      chr19 59302300 59302600 -0.75
      chr19 59302600 59302900 -0.50
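Reading a BED line takes little code, as this Python sketch suggests. It handles only the six fields listed on the slide and relies on the BED convention that start is 0-based and end is exclusive, so the feature length is simply end - start; `parse_bed_line` is a hypothetical helper.

```python
def parse_bed_line(line):
    """Parse one BED data line (3 required fields, up to 3 optional ones)."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    record = {"chrom": chrom, "start": start, "end": end,
              "length": end - start}
    # Optional columns, in the order used on the slide.
    for key, value in zip(("name", "score", "strand"), fields[3:6]):
        record[key] = int(value) if key == "score" else value
    return record


bed = parse_bed_line("chr22 1000 5000 cloneA 960 +")
print(bed)
```

Lines starting with track, browser or # are headers or comments and should be skipped before calling the parser.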
  • 32. Sequencing data formats Other formats
      - WIG files (location / annotation / scores): wiggle
        Used to visualize or summarize data, in most cases count data or normalized count data (RPKM)
        – extension: bigWig – binary version (often used in GEO for ChIP-seq peaks)
      browser position chr19:59304200-59310700
      browser hide all
      #150 base wide bar graph at arbitrarily spaced positions,
      #threshold line drawn at y=11.76
      #autoScale off viewing range set to [0:25]
      #priority = 10 positions this as the first graph
      track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
      variableStep chrom=chr19 span=150
      59304701 10.0
      59304901 12.5
      59305401 15.0
      59305601 17.5
      59305901 20.0
      59306081 17.5
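As an illustration of how variableStep data maps onto intervals, the Python sketch below converts wiggle lines into BED-style tuples. It assumes the usual convention that wiggle positions are 1-based while BED starts are 0-based, and it ignores fixedStep blocks; the function name is made up for this example.

```python
def variable_step_to_intervals(lines):
    """Convert variableStep wiggle lines to (chrom, start, end, value).

    Output coordinates are BED-style: 0-based start, exclusive end.
    """
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "browser", "track")):
            continue  # comments and display settings carry no data
        if line.startswith("variableStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        position, value = line.split()
        start = int(position) - 1  # 1-based wiggle -> 0-based BED
        intervals.append((chrom, start, start + span, float(value)))
    return intervals


wig = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
intervals = variable_step_to_intervals(wig)
print(intervals)
```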
  • 33. Sequencing data formats Other formats
      - GFF format (General Feature Format)
        Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
        Often used in downstream analysis to assign annotation to regions / peaks / …
      FIELDS USED:
      # seqname (the name of the sequence)
      # source (the program that generated this feature)
      # feature (the name of this type of feature – for example: exon)
      # start (the starting position of the feature in the sequence)
      # end (the ending position of the feature)
      # score (a score between 0 and 1000)
      # strand (valid entries include '+', '-', or '.')
      # frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.)
      # group (all lines with the same group are linked together into a single item)
      track name=regulatory description="TeleGene(tm) Regulatory Regions"
      #chr source feature start end score strand frame group
      chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
      chr22 TeleGene promoter 1010000 1010100 900 + . touch1
      chr22 TeleGene promoter 1020000 1020000 800 - . touch2
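When assigning annotation to peaks you often have to mix GFF and BED coordinates, so here is a hedged Python sketch of the conversion. It assumes tab-separated GFF lines and the standard conventions (GFF: 1-based, end included; BED: 0-based, end excluded); the helper names are illustrative.

```python
GFF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group")


def parse_gff_line(line):
    """Parse one tab-separated GFF line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def gff_to_bed_interval(record):
    """Shift the start by one; the end stays the same."""
    return record["seqname"], record["start"] - 1, record["end"]


gff = parse_gff_line("chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1")
print(gff["feature"], gff_to_bed_interval(gff))
```

Off-by-one errors between these two formats are easy to make, so it is worth checking a known feature by eye after converting.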
  • 34. Sequencing data formats Other formats - VCF format (Variant Call Format) For SNP representation
  • 35. Sequencing data formats Other formats - http://genome.ucsc.edu/FAQ/FAQformat.html - UCSC browser data formats, covering the most commonly used and widely accepted formats - In addition, ENCODE data formats (narrowPeak / broadPeak)
  • 36. Blok de Van… ETER