SlideShare a Scribd company logo
String algorithms
ast     • Knuth-Morris-Pratt algorithm
ime:       - find pattern P in string T
           - preprocess P (construct Fail array)
           - search time: O(n+m)        n=|T|, m=|P|


oday:   • Data structures for strings
           - trie
           - suffix tree & generalized suffix tree
           - suffix array
        • Many applications
           - e.g. computing common longest substring
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                          h
• Edges labeled with letters
                                                   a   e i
• Leafs correspond to input strings
                                            d          l       hi


                                      had          d       p
        { had, held, help, hi }
                                                held         help
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                               had
                                            held
• Edges labeled with letters                help
• Leafs correspond to input strings         hi




        { had, held, help, hi }
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                 h
• Edges labeled with letters
                                            had
• Leafs correspond to input strings         held
                                            help
                                            hi


        { had, held, help, hi }
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                      h
• Edges labeled with letters
                                             a     e i
• Leafs correspond to input strings
                                       had       held    hi
                                                 help

        { had, held, help, hi }
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                         h
• Edges labeled with letters
                                                a     e i
• Leafs correspond to input strings
                                            d       held    hi
                                                    help

                                      had
        { had, held, help, hi }
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                         h
• Edges labeled with letters
                                                a     e i
• Leafs correspond to input strings
                                            d          l    hi


                                      had           held
        { had, held, help, hi }                     help
Trie data structure

• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”

• Rooted tree                                          h
• Edges labeled with letters
                                                   a   e i
• Leafs correspond to input strings
                                            d          l       hi


                                      had          d       p
        { had, held, help, hi }
                                                held         help
Trie data structure

• Cannot represent strings T and TT'
• Solution: append special symbol $
  to the end of each word
                                                    h

                                                a   e i

                                         d          l       hi


                                   had          d       p
       { had, held, help, hi }
                                             held         help
Trie data structure

• Cannot represent strings T and TT'
• Solution: append special symbol $
  to the end of each word
                                                             h

                                                 $       a   e i

                                    h$
                                                                         $
                                             d               l
                                                                           hi$
                                         $            d          p
  { had$, held$, help$, hi$, h$ }
                                     had$         $                  $

                                                 held$             help$
Trie data structure
• Application: storing dictionaries
• Space: O(m1+...+mk), mi=length of string Ti
• Looking up word of length n: O(n)
   - assuming alphabet has constant size                             h

                                                         $       a
• Alternative to binary trees                                        e i
   - pros and cons (see wikipedia)          h$
                                                                                 $
                                                     d               l
               had
                                                                                   hi$
         h                  held                              d          p
                                                 $
                     help          hi
                                             had$         $                  $
 - all nodes store items, not just leaves
 - unlabeled edges                                       held$             help$
Tries for storing suffixes
 • This lecture: use tries for storing the set of suffixes of string T

                        c        a       o     $

               a             c       o         $        $
                                                                 cacao$
                                                                 acao$
          c     o        a                         o$
                                         $                       cao$
      a                 o                ao$                     ao$
                    $
                                                                 o$
               cao$
  o                     $                                        $

 $                      acao$


cacao$
Reducing space requirements
 • Trick 1: Delete non-branching nodes, merge labels (compact trie)

                              a               $
                     ca
                                         o$
                                                       $
                                                           cacao$
                                  o$                       acao$
                                                  o$
               o$     cao$                                 cao$
                                   ao$                     ao$
   cao$                                                    o$
              cao$
                                                           $

                      acao$


cacao$
Reducing space requirements
•   # leafs: n+1
•   # internal nodes: at most n
•                                                         => O(n) storage
    # edges: at most 2n edges
•   Trick 2: Store edge labels via two pointers into the string
    (begin & end)



                                                             cacao$
                ca          a        o$        $
                                                             acao$
                                                             cao$
                                          o$
    cao$      o$     cao$       o$                 $         ao$
                                                             o$
             cao$     acao$      ao$
    cacao$                                                   $
Suffix trees
• Called the suffix tree for string T [Weiner’73]
• D. Knuth: “algorithm of the year 1973”
• Can be constructed in linear time
   - e.g. [McCreight’76], [Ukkonen’95]
   - algorithms are complicated [will not be covered]
• Used for all sorts of string problems


                                                        cacao$
              ca          a         o$        $
                                                        acao$
                                                        cao$
                                         o$
  cao$      o$     cao$        o$                 $     ao$
                                                        o$
           cao$     acao$       ao$
  cacao$                                                $
Suffix trees for string matching
Does text T contain pattern P?
• Construct suffix tree for T in O(|T|) time
• Query takes O(|P|) time!
• same as Knuth-Morris-Pratt, but...




                                                   cacao$
           ca          a        o$        $
                                                   acao$
                                                   cao$
                                     o$
cao$     o$     cao$       o$                  $   ao$
                                                   o$
         cao$    acao$      ao$
cacao$                                             $
Suffix trees for string matching
• Scenario: text T is fixed, patterns P1,...,Pk come one at a time
   - T is very long
• Time: O(|T| + |P1| + |P2| + ... + |Pk|)
• Knuth-Morris-Pratt can be generalized to multiple patterns
  (Aho-Corasick alg.). Same complexity as above, but requires
  knowledge of P1,...,Pk in advance


                                                           cacao$
            ca          a        o$        $
                                                           acao$
                                                           cao$
                                      o$
cao$      o$     cao$       o$                  $          ao$
                                                           o$
          cao$    acao$      ao$
cacao$                                                     $
Getting extra information
Task: Get all occurrences of P in T (give a list of start positions)

• Can be done in O(|P|+c) time where c is # of occurences
  - Locate the node corresponding to P
  - Traverse all leaves of the subtree rooted at this node



                                                            cacao$
            ca          a        o$        $
                                                            acao$
                                                            cao$
                                      o$
cao$      o$     cao$       o$                   $          ao$
                                                            o$
          cao$    acao$      ao$
cacao$                                                      $
Getting extra information
Task: Get # of occurrences of P in T

• Can be done in O(|P|) time
  - With O(|T|) preprocessing




                                                   cacao$
            ca          a        o$        $
                                                   acao$
                                                   cao$
                                      o$
cao$      o$     cao$       o$                 $   ao$
                                                   o$
         cao$     acao$      ao$
cacao$                                             $
Generalized suffix tree
• Input: a set of strings {T1,...,Tk}
• Generalized suffix tree: tree containing all suffixes of all strings
• Tree for { abba , bac } :                                       abba$     1

                                                                    bba$1
                       a                                 $2
                                      b   c$2       $1              ba$1
                                                                    a$1
                                                c$2       $1   $2
 bba$1            $1              a                                 $1
         c$2                                ba$1

abba$1     ac$2    a$1                      bba$1                   bac$2
                           $1         c$2                           ac$2
                           ba$1       bac$2
                                                                    c$2
Construction of generalized suffix tree for {X, Y}
 • Construct suffix tree for X $1Y $2
 • Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2”

                                                             abba$1bac$2
                                                             bba$1bac$2
                                          $1bac$2            ba$1bac$2
                                                             a$1bac$2
...$1bac$2      $1bac$2                    $1bac$2           $1bac$2
                                     ...$1bac$1

abba$1bac$2      a$1bac$2          bba$1bac$2                bac$2
                                                             ac$2
                  $1bac$2
                                                             c$2
                     ba$1bac$2                               $2
Construction of generalized suffix tree for {X, Y}
• Construct suffix tree for X $1Y $2
• Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2”

                                                            abba$1bac$2
                                                            bba$1bac$2
                                            $1              ba$1bac$2
                                                            a$1bac$2
    ...$1       $1                           $1             $1bac$2
                                    ...$1

abba$1               a$1            bba$1                   bac$2
                                                            ac$2
                           $1
                                                            c$2
                           ba$1                             $2
Construction of generalized suffix tree {T 1,...,Tk}

• Can use the same trick
  - construct suffix tree for T1 $1 ... Tk $k

• Linear runtime: O(|T1|+...+|Tk|)

• In practice, modify the algorithm (e.g. Ukkonen’s alg.)
    to get a slightly faster performance
Computing common longest subsequence of {T1,T2}
•     Mark each internal node with red (blue)
              if it has at least one red (blue) child
•     Nodes with both colors contain common subsequences
                                                                      abba$1
                                                                      bba$1
                         a                                 $2
                                        b   c$2       $1              ba$1
                                                                      a$1
                                                  c$2       $1   $2
    bba$1           $1              a                                 $1
            c$2                               ba$1

abba$1       ac$2    a$1                      bba$1                   bac$2
                             $1         c$2                           ac$2
                             ba$1       bac$2
                                                                      c$2
Computing common longest subsequence of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest sequence
contained in at least r strings
1. Construct generalized suffix tree for {T1,...,TK}




   ...
                                                                        ...
Computing common longest subsequence of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest sequence
contained in at least r strings
1. Construct generalized suffix tree for {T1,...,TK}
2. Mark each node v with color k=1,...,K if it has a leaf with that color
3. Compute c(v) = # of colors at v
  - Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n)

String corresponding to v is contained in exactly c(v) input strings



   ...
                                                                        ...
Suffix array for text T[1..n]
                                     SA[0..n]

mississippi   0        ε               11       • SA[i] = id of the i’th
ississippi    1        i               10         smallest suffix
ssissippi     2        ippi            7
sissippi      3        issippi         4
issippi       4        ississippi      1
                                                • Number k corresponds
ssippi        5        mississippi     0          to suffix T[k+1..n]
sippi         6        pi              9
ippi          7        ppi             8
ppi           8        sippi           6
pi            9        sissippi        3        • Lexicographic order:
i             10       ssippi          5          - X < Xα...
ε             11       ssissippi       2          - Xα... < Xβ... if α<β

  Suffix array: stores suffixes of string T in a lexicographic order
Pattern search from a suffix array

Task: Find pattern P = iss in text T = mississippi
                         m                  n
ε               11
i               10   • All occurrences appear contiguously in the array
ippi            7
issippi         4    • Computing start & end indexes: binary search
ississippi      1
mississippi     0    • Worst-case: O(m log n)
pi              9
ppi             8
sippi           6    • In practice, often O(m + log n)
sissippi        3
ssippi          5    • Can be improved to O(m + log n) worst-case
ssissippi       2       - need to store extra 2n numbers
Construction of suffix arrays



ε             11    • SA[0..n] can be constructed in linear time
i             10
ippi          7        - from a suffix tree
issippi       4
                       - or directly
ississippi    1
mississippi   0         [Kim,Sim,Park,Park’03]
pi            9         [Kärkkäinen&Sanders’03]
ppi           8         [Ko & Aluru’03]
sippi         6
sissippi      3
ssippi        5
ssissippi     2
Summary
• Suffix trees & suffix arrays: two powerful data structures for strings

• Space requirements:
   - suffix trees: linear but with a large constant, e.g. 20 |T|
   - suffix arrays: more efficient, e.g. 5 |T|

• Suffix arrays are more suited to large alphabets
• Practitioners like suffix arrays (simplicity, space efficiency)
• Theoreticians like suffix trees (explicit structure) ε              11
                                                        i             10
                                                        ippi          7
                                                        issippi       4
                                                        ississippi    1
                                                        mississippi   0
                                                        pi            9
                                                        ppi           8
                                                        sippi         6
                                                        sissippi      3
                                                        ssippi        5
                                                        ssissippi     2
Further information on string algorithms

• Dan Gusfield, “Algorithms on Strings, Trees and Sequences:
  Computer Science and Computational Biology”, 1997

• Slides mentioning more recent results:
  http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Ad

More Related Content

More from zukun (20)

ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
zukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
zukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
zukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
zukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
zukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
zukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
zukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
zukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
zukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
zukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
zukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
zukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
zukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
zukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
zukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
zukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
zukun
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant features
zukun
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
zukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
zukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
zukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
zukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
zukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
zukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
zukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
zukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
zukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
zukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
zukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
zukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
zukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
zukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
zukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
zukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
zukun
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant features
zukun
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
zukun
 

Recently uploaded (20)

PUBH1000 Slides - Module 12: Advocacy for Health
PUBH1000 Slides - Module 12: Advocacy for HealthPUBH1000 Slides - Module 12: Advocacy for Health
PUBH1000 Slides - Module 12: Advocacy for Health
JonathanHallett4
 
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdfAntepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Dr H.K. Cheema
 
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
EUPHORIA GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 21 MARCH 2025
Quiz Club of PSG College of Arts & Science
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
Capitol Doctoral Presentation -May 2025.pptx
Capitol Doctoral Presentation -May 2025.pptxCapitol Doctoral Presentation -May 2025.pptx
Capitol Doctoral Presentation -May 2025.pptx
CapitolTechU
 
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
lsitinova
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
Dastur_ul_Amal under Jahangir Key Features.pptx
Dastur_ul_Amal under Jahangir Key Features.pptxDastur_ul_Amal under Jahangir Key Features.pptx
Dastur_ul_Amal under Jahangir Key Features.pptx
omorfaruqkazi
 
How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdfGENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
Quiz Club of PSG College of Arts & Science
 
How To Maximize Sales Performance using Odoo 18 Diverse views in sales module
How To Maximize Sales Performance using Odoo 18 Diverse views in sales moduleHow To Maximize Sales Performance using Odoo 18 Diverse views in sales module
How To Maximize Sales Performance using Odoo 18 Diverse views in sales module
Celine George
 
Aerospace Engineering Homework Help Guide – Expert Support for Academic Success
Aerospace Engineering Homework Help Guide – Expert Support for Academic SuccessAerospace Engineering Homework Help Guide – Expert Support for Academic Success
Aerospace Engineering Homework Help Guide – Expert Support for Academic Success
online college homework help
 
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERSIMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
rajaselviazhagiri1
 
Statement by Linda McMahon on May 21, 2025
Statement by Linda McMahon on May 21, 2025Statement by Linda McMahon on May 21, 2025
Statement by Linda McMahon on May 21, 2025
Mebane Rash
 
How to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 InventoryHow to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 Inventory
Celine George
 
The Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional DesignThe Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional Design
Sean Michael Morris
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Search Matching Applicants in Odoo 18 - Odoo Slides
Search Matching Applicants in Odoo 18 - Odoo SlidesSearch Matching Applicants in Odoo 18 - Odoo Slides
Search Matching Applicants in Odoo 18 - Odoo Slides
Celine George
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
PUBH1000 Slides - Module 12: Advocacy for Health
PUBH1000 Slides - Module 12: Advocacy for HealthPUBH1000 Slides - Module 12: Advocacy for Health
PUBH1000 Slides - Module 12: Advocacy for Health
JonathanHallett4
 
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdfAntepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Antepartum fetal surveillance---Dr. H.K.Cheema pdf.pdf
Dr H.K. Cheema
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
Capitol Doctoral Presentation -May 2025.pptx
Capitol Doctoral Presentation -May 2025.pptxCapitol Doctoral Presentation -May 2025.pptx
Capitol Doctoral Presentation -May 2025.pptx
CapitolTechU
 
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
UNITED_KINGDOM.pptUNITED_KINGDOM.pptUNITED_KINGDOM.ppt
lsitinova
 
Dastur_ul_Amal under Jahangir Key Features.pptx
Dastur_ul_Amal under Jahangir Key Features.pptxDastur_ul_Amal under Jahangir Key Features.pptx
Dastur_ul_Amal under Jahangir Key Features.pptx
omorfaruqkazi
 
How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
How To Maximize Sales Performance using Odoo 18 Diverse views in sales module
How To Maximize Sales Performance using Odoo 18 Diverse views in sales moduleHow To Maximize Sales Performance using Odoo 18 Diverse views in sales module
How To Maximize Sales Performance using Odoo 18 Diverse views in sales module
Celine George
 
Aerospace Engineering Homework Help Guide – Expert Support for Academic Success
Aerospace Engineering Homework Help Guide – Expert Support for Academic SuccessAerospace Engineering Homework Help Guide – Expert Support for Academic Success
Aerospace Engineering Homework Help Guide – Expert Support for Academic Success
online college homework help
 
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERSIMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
IMPACT_OF_SOCIAL-MEDIA- AMONG- TEENAGERS
rajaselviazhagiri1
 
Statement by Linda McMahon on May 21, 2025
Statement by Linda McMahon on May 21, 2025Statement by Linda McMahon on May 21, 2025
Statement by Linda McMahon on May 21, 2025
Mebane Rash
 
How to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 InventoryHow to Manage Manual Reordering Rule in Odoo 18 Inventory
How to Manage Manual Reordering Rule in Odoo 18 Inventory
Celine George
 
The Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional DesignThe Pedagogy We Practice: Best Practices for Critical Instructional Design
The Pedagogy We Practice: Best Practices for Critical Instructional Design
Sean Michael Morris
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Search Matching Applicants in Odoo 18 - Odoo Slides
Search Matching Applicants in Odoo 18 - Odoo SlidesSearch Matching Applicants in Odoo 18 - Odoo Slides
Search Matching Applicants in Odoo 18 - Odoo Slides
Celine George
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
Ad

Advances in discrete energy minimisation for computer vision

  • 1. String algorithms ast • Knuth-Morris-Pratt algorithm ime: - find pattern P in string T - preprocess P (construct Fail array) - search time: O(n+m) n=|T|, m=|P| oday: • Data structures for strings - trie - suffix tree & generalized suffix tree - suffix array • Many applications - e.g. computing common longest substring
  • 2. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters a e i • Leafs correspond to input strings d l hi had d p { had, held, help, hi } held help
  • 3. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree had held • Edges labeled with letters help • Leafs correspond to input strings hi { had, held, help, hi }
  • 4. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters had • Leafs correspond to input strings held help hi { had, held, help, hi }
  • 5. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters a e i • Leafs correspond to input strings had held hi help { had, held, help, hi }
  • 6. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters a e i • Leafs correspond to input strings d held hi help had { had, held, help, hi }
  • 7. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters a e i • Leafs correspond to input strings d l hi had held { had, held, help, hi } help
  • 8. Trie data structure • Data structure for storing a set of strings • Name comes from “retrieval”, usually pronounced as “try” • Rooted tree h • Edges labeled with letters a e i • Leafs correspond to input strings d l hi had d p { had, held, help, hi } held help
  • 9. Trie data structure • Cannot represent strings T and TT' • Solution: append special symbol $ to the end of each word h a e i d l hi had d p { had, held, help, hi } held help
  • 10. Trie data structure • Cannot represent strings T and TT' • Solution: append special symbol $ to the end of each word h $ a e i h$ $ d l hi$ $ d p { had$, held$, help$, hi$, h$ } had$ $ $ held$ help$
  • 11. Trie data structure • Application: storing dictionaries • Space: O(m1+...+mk), mi=length of string Ti • Looking up word of length n: O(n) - assuming alphabet has constant size h $ a • Alternative to binary trees e i - pros and cons (see wikipedia) h$ $ d l had hi$ h held d p $ help hi had$ $ $ - all nodes store items, not just leaves - unlabeled edges held$ help$
  • 12. Tries for storing suffixes • This lecture: use tries for storing the set of suffixes of string T c a o $ a c o $ $ cacao$ acao$ c o a o$ $ cao$ a o ao$ ao$ $ o$ cao$ o $ $ $ acao$ cacao$
  • 13. Reducing space requirements • Trick 1: Delete non-branching nodes, merge labels (compact trie) a $ ca o$ $ cacao$ o$ acao$ o$ o$ cao$ cao$ ao$ ao$ cao$ o$ cao$ $ acao$ cacao$
  • 14. Reducing space requirements • # leafs: n+1 • # internal nodes: at most n • => O(n) storage # edges: at most 2n edges • Trick 2: Store edge labels via two pointers into the string (begin & end) cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 15. Suffix trees • Called the suffix tree for string T [Weiner’73] • D. Knuth: “algorithm of the year 1973” • Can be constructed in linear time - e.g. [McCreight’76], [Ukkonen’95] - algorithms are complicated [will not be covered] • Used for all sorts of string problems cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 16. Suffix trees for string matching Does text T contain pattern P? • Construct suffix tree for T in O(|T|) time • Query takes O(|P|) time! • same as Knuth-Morris-Pratt, but... cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 17. Suffix trees for string matching • Scenario: text T is fixed, patterns P1,...,Pk come one at a time - T is very long • Time: O(|T| + |P1| + |P2| + ... + |Pk|) • Knuth-Morris-Pratt can be generalized to multiple patterns (Aho-Corasick alg.). Same complexity as above, but requires knowledge of P1,...,Pk in advance cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 18. Getting extra information Task: Get all occurrences of P in T (give a list of start positions) • Can be done in O(|P|+c) time where c is # of occurences - Locate the node corresponding to P - Traverse all leaves of the subtree rooted at this node cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 19. Getting extra information Task: Get # of occurrences of P in T • Can be done in O(|P|) time - With O(|T|) preprocessing cacao$ ca a o$ $ acao$ cao$ o$ cao$ o$ cao$ o$ $ ao$ o$ cao$ acao$ ao$ cacao$ $
  • 20. Generalized suffix tree • Input: a set of strings {T1,...,Tk} • Generalized suffix tree: tree containing all suffixes of all strings • Tree for { abba , bac } : abba$ 1 bba$1 a $2 b c$2 $1 ba$1 a$1 c$2 $1 $2 bba$1 $1 a $1 c$2 ba$1 abba$1 ac$2 a$1 bba$1 bac$2 $1 c$2 ac$2 ba$1 bac$2 c$2
  • 21. Construction of generalized suffix tree for {X, Y} • Construct suffix tree for X $1Y $2 • Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2” abba$1bac$2 bba$1bac$2 $1bac$2 ba$1bac$2 a$1bac$2 ...$1bac$2 $1bac$2 $1bac$2 $1bac$2 ...$1bac$1 abba$1bac$2 a$1bac$2 bba$1bac$2 bac$2 ac$2 $1bac$2 c$2 ba$1bac$2 $2
  • 22. Construction of generalized suffix tree for {X, Y} • Construct suffix tree for X $1Y $2 • Edges leading to red leaves are labeled as “...$1Y $2”; delete “Y $2” abba$1bac$2 bba$1bac$2 $1 ba$1bac$2 a$1bac$2 ...$1 $1 $1 $1bac$2 ...$1 abba$1 a$1 bba$1 bac$2 ac$2 $1 c$2 ba$1 $2
  • 23. Construction of generalized suffix tree {T 1,...,Tk} • Can use the same trick - construct suffix tree for T1 $1 ... Tk $k • Linear runtime: O(|T1|+...+|Tk|) • In practice, modify the algorithm (e.g. Ukkonen’s alg.) to get a slightly faster performance
  • 24. Computing common longest subsequence of {T1,T2} • Mark each internal node with red (blue) if it has at least one red (blue) child • Nodes with both colors contain common subsequences abba$1 bba$1 a $2 b c$2 $1 ba$1 a$1 c$2 $1 $2 bba$1 $1 a $1 c$2 ba$1 abba$1 ac$2 a$1 bba$1 bac$2 $1 c$2 ac$2 ba$1 bac$2 c$2
  • 25. Computing common longest subsequence of {T1,...,TK} Task: For each r = 2,...,K compute the length of the longest sequence contained in at least r strings 1. Construct generalized suffix tree for {T1,...,TK} ... ...
  • 26. Computing common longest subsequence of {T1,...,TK} Task: For each r = 2,...,K compute the length of the longest sequence contained in at least r strings 1. Construct generalized suffix tree for {T1,...,TK} 2. Mark each node v with color k=1,...,K if it has a leaf with that color 3. Compute c(v) = # of colors at v - Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n) String corresponding to v is contained in exactly c(v) input strings ... ...
  • 27. Suffix array for text T[1..n] SA[0..n] mississippi 0 ε 11 • SA[i] = id of the i’th ississippi 1 i 10 smallest suffix ssissippi 2 ippi 7 sissippi 3 issippi 4 issippi 4 ississippi 1 • Number k corresponds ssippi 5 mississippi 0 to suffix T[k+1..n] sippi 6 pi 9 ippi 7 ppi 8 ppi 8 sippi 6 pi 9 sissippi 3 • Lexicographic order: i 10 ssippi 5 - X < Xα... ε 11 ssissippi 2 - Xα... < Xβ... if α<β Suffix array: stores suffixes of string T in a lexicographic order
  • 28. Pattern search from a suffix array Task: Find pattern P = iss in text T = mississippi m n ε 11 i 10 • All occurrences appear contiguously in the array ippi 7 issippi 4 • Computing start & end indexes: binary search ississippi 1 mississippi 0 • Worst-case: O(m log n) pi 9 ppi 8 sippi 6 • In practice, often O(m + log n) sissippi 3 ssippi 5 • Can be improved to O(m + log n) worst-case ssissippi 2 - need to store extra 2n numbers
  • 29. Construction of suffix arrays ε 11 • SA[0..n] can be constructed in linear time i 10 ippi 7 - from a suffix tree issippi 4 - or directly ississippi 1 mississippi 0 [Kim,Sim,Park,Park’03] pi 9 [Kärkkäinen&Sanders’03] ppi 8 [Ko & Aluru’03] sippi 6 sissippi 3 ssippi 5 ssissippi 2
  • 30. Summary • Suffix trees & suffix arrays: two powerful data structures for strings • Space requirements: - suffix trees: linear but with a large constant, e.g. 20 |T| - suffix arrays: more efficient, e.g. 5 |T| • Suffix arrays are more suited to large alphabets • Practitioners like suffix arrays (simplicity, space efficiency) • Theoreticians like suffix trees (explicit structure) ε 11 i 10 ippi 7 issippi 4 ississippi 1 mississippi 0 pi 9 ppi 8 sippi 6 sissippi 3 ssippi 5 ssissippi 2
  • 31. Further information on string algorithms • Dan Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”, 1997 • Slides mentioning more recent results: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
  翻译: