Understanding Suffix Trees: A Comprehensive Overview
Explore the concept of suffix trees through detailed explanations, diagrams, and examples. Learn about tries, compressed tries, suffix tree construction algorithms, and more. Discover how suffix trees are used to efficiently store and search for substrings in a string.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Suffix Trees Michael T. Goodrich University of California, Irvine Most slides adapted from http://www.cs.tau.ac.il/~bchor/CG/suffixtrees.ppt by Haim Kaplan
Trie A digital tree representing a set of strings. c a { aeef ad bbfe bbfg c } b e b d e f c f e g
Trie (Cont) Assume no string is a prefix of another c Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. a b e b d Each string corresponds to a leaf. e The children of a node can be given in a list or a hash table (indexed by characters from the alphabet) f c f e g
Compressed Trie Compress unary nodes, label edges by substrings c c a a b e b bbf d d eef e f c c f e g e g Children of each node can still be indexed by a character from the alphabet (the first one in the substring)
Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s Mississippi -> Mississippi$
Suffix tree (Example) Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ { $ b$ ab$ bab$ abab$ } $ (4,4) a b (3,3) (0,1) b $ (4,4) a a $ b$ (4,4) b (2,4) (2,4) $ O(n) space
Trivial algorithm to build a Suffix tree a b a b $ Put the largest suffix in a b b ab$ Put the next largest (bab$) suffix in a b $
a b b ab$ a b $ Put the third suffix (ab$) in a b b ab$ a $ b $
a b b ab$ a $ b $ Put the next largest suffix (b$) in a b b $ a a $ b$ b $
a b b $ a a $ b$ b $ $ Put the last suffix ($) in a b b $ a a $ b$ b $
$ a b b $ a a $ b$ b $ We also label each leaf with the starting point of the corresponding suffix. $ a b 5 b $ a a $ b$ 4 b $ 3 2 1
Analysis Naively, this takes O(n2) time to build in the worst case. More sophisticated algorithms can construct a suffix tree in O(n) time (to be continued).
The Nave Algorithm in Practice The na ve construction algorithm is not usually as bad as O(n2) time in practice. A worst-case example is an-1b$. This is rare. For example, for a random string, the na ve algorithm runs in O(n log n) expected time. Why?
What can we do with it ? Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T. We may also want to find all occurrences of P in T
Exact string matching In preprocessing we just build a suffix tree in O(n) time $ a b 5 b $ a a $ b$ 4 b $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.
$ a b 5 b $ a a $ b$ 4 b $ 3 2 1 If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time
So what can we do with it ? Matching a pattern against a database of strings 1. Construct a suffix tree for the text 2. Search for each pattern in the suffix tree
Generalized suffix tree Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s S To make these suffixes prefix-free we add a special char, say $, at the end of s To associate each suffix with a unique string in S add a different special char to each s
Generalized suffix tree (Example) Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2 # $ { $ # b$ b# ab$ ab# bab$ aab# abab$ } a b 4 5 # $ a b a 3 b $ b# 4 # a $ 2 1 b $ 3 2 1
Longest common substring (of two strings) Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. # $ a b 4 5 # $ a b a 3 b $ b# 4 Find such node with largest string depth # a $ 2 1 b $ 3 2 1
Longest Substring that is a Palindrome Let s = cbaaba$ then sr = abaabc# a # b $ c 7 7 $ b a 6 6 a $ a $ 4 5 5 3 3 $ 4 1 2 2 1