String Matching with Mismatches: An Innovative Approach

BWT Arrays and Mismatching Trees: A New Way for
String Matching with 
k
 Mismatches
Yangjun Chen, Yujia Wu
Department of Applied Computer Science
University of Winnipeg
1
Outline
  
Motivation
 
-
 
Statement of Problem
 
-
 
Related work
BWT Arrays – A space-economic Index for String Matching
String Matching with 
k 
Mismatches
 
-
 
Search trees
 
-
 
Mismatching information
 
-
 
Mismatching trees
Experiments
Conclusion and Future Work
2
Statement of Problem
String matching with 
k 
mismatches: 
find all the occurrences of a
pattern string 
r
 in a target string 
s
 with each occurrence having
up to 
k
 positions different between 
r
 and 
s
.
 
-
 
In DNA databases, 
due to polymorphisms or mutations among
  
individuals or even sequencing errors, a read (a short sample 
 
DNA
  
sequence) may disagree in some positions at any of its 
  
  
occurrences in a genome.
3
 
a a a a a c a a a c
 
a c a c a c a g a a g c c c
Example:
k 
= 4
pattern
target
Related Work
Exact string matching
 
- 
 
On-line algorithms:
Knuth-Morris-Pratt
, 
Boyer-Moore
, 
Aho-Corasick
-
 
Index based:
  
suffix trees (
Weiner
; 
McCreight
; 
Ukkonen
), suffix arrays (
Manber
, 
Myers
), BWT-
 
transformation (
Burrow
-
Wheeler
), Hash (
Karp
, 
Rabin
)
Inexact string matching
-
String matching with 
k
 mismatches - Hamming distance (
Landau, U. Vishkin; Amir at al.;
Cole
)
-
String matching with 
k differences 
- Levelshtein distance (
Chang, Lampe
)
-
String matching with 
wild-cards 
(
Manber, Baeza-Yates
)
4
 
 
BWT-Index
Burrows-Wheeler Transform (
BWT
)
s 
=
 a
1
c
1
a
2
g
1
a
3
c
2
a
4
$
5
 
$
 
  
a
1
 
c
1
  a
2
  g
1
  a
3
  c
2
  
a
4
a
4
  
$
  
a
1
 
 
c
1
  a
2
  g
1
  a
3
  
c
2
a
3
  
c
2
 a
4
 
 $ 
  
a
1
  c
1
  a
2
  
g
1
a
1
  
c
1
 
a
2
  g
1
  a
3
  c
2
  a
4
  
$
a
2
  
g
1
 a
3
  c
2
  a
4
 
 $ 
  a
1
  
c
1
c
2
  
a
4
  
$
   a
1
  c
1
  a
2
  g
1
 
a
3
c
1
  
a
2
  g
1
  a
3
  c
2
 a
4
  
$ 
  
a
1
g
1
  
a
3
  c
2
  a
4
  
$
  a
1
  c
1
  
a
2
 rk
F
-
1
2
3
4
1
2
1
 rk
L  
1
1
1
-
2
2
3
4
F
L
rank: 3
rank: 3
 
 
a
1
  c
1
 
a
2
  g
1
  a
3
  c
2
  a
4
  
$
c
1
  a
2
  g
1
  a
3
  c
2
 a
4
  
$ 
  
a
1
a
2
  g
1
 a
3
  c
2
  a
4
 
 $ 
  a
1
  
c
1
g
1
  a
3
  c
2
  a
4
  
$
  a
1
  c
1
  
a
2
a
3
  c
2
 a
4
 
 $ 
  a
1
  c
1
  a
2
 
 g
1
c
2
  a
4
  
$
   a
1
  c
1
  a
2
  g
1
 
a
3
a
4
  
$
  a
1
 
 
c
1
  a
2
  g
1
  a
3
  
c
2
$
 
  a
1
 
c
1
  a
2
  g
1
  a
3
  c
2
  
a
4
Rank correspondence:
 
BWT construction:
SA
[…] – suffix array
rk
F
(
e
) = 
rk
L
(
e
) 
s 
=
 a
1
c
1
a
2
g
1
a
3
c
2
a
4
$
Search
 p
 = 
aca
Backward Search of BWT-Index
6
 
 
Backward Search
F
                
L 
$
 
a
4
       
a
4
 
c
2
       
a
3
 
g
1
       
a
1
 
$
        
a
2
 
c
1
       
c
2
 
a
3
       
c
1
 
a
1
 
g
1
 
a
2
F
                
L 
$
 
a
4
       
a
4
 
c
2
       
a
3
 
g
1
       
a
1
 
$
        
a
2
 
c
1
       
c
2
 
a
3
       
c
1
 
a
1
g
1
 
a
2
F
                
L 
$
 
a
4
       
a
4
 
c
2
       
a
3
 
g
1
       
a
1
 
$        
a
2
 
c
1
       
c
2
 
a
3
       
c
1
 
a
1
g
1
 
a
2
F
                
L 
$
 
a
4
       
a
4
 
c
2
       
a
3
 
g
1
       
a
1
 
$        
a
2
 
c
1
       
c
2
 
a
3
       
c
1
 
a
1
g
1
 
a
2
 
Z
: a character
  
: a range in 
F
L
: a range in 
L
, corresponding to 
7
Backward Search of BWT-Index
7
 
 
 
rankAll
Arrange |
| arrays each for a character 
x
 
 
 such that 
A
x
[
i
] 
(the 
i
th
 
entry
in the array for 
x
)
 is the number of appearances of 
x
 
within 
L
[1 .. 
i
].
Instead of scanning a certain segment 
L
[
 
.. 
] (
 
 
) to find a subrange
for a certain 
x
 
 
, we can simply look up 
A
x
 
to see whether 
A
x
[
 
- 1
] =
A
[
]
. If it is the case, then 
 does 
not occur in 
 
.. 
]. Otherwise, 
[
A
x
[
 
- 1
]
+ 1
, 
A
x
[
] 
] should be the found range.
F
                     
L 
$
 
a
4
       
a
4
 
c
2
       
a
3
 
g
1
       
a
1
 
$        
a
2
 
c
1
       
c
2
 
a
3
       
c
1
 
a
1
g
1
 
a
2
Example
To find the first and the last appearance
of 
c 
in 
L
[2 .. 5], we only need to find
A
c
[2 – 1] = 
A
c 
[1] = 0 and 
A
c 
[5] = 2. So the
corresponding range is
[
A
c
[2
 
- 1
] + 1, 
A
c
[5
]] = [1, 2].
Reduce 
rankAll
-Index Size
F
-ranks: 
F
 
= 
<a; 
x
a
, 
y
a
> 
BWT array:
 L
Reduced appearance array: 
A
 with bucket
 
size 
.
Reduced suffix array: 
SA
* with bucket size 
.
9
 
F
         
L
         
rk
L
$
 
a
4
 
1
a
4
 
c
2
 
1
a
3
 
g
1
 
1
a
1
 
$
 
-
a
2
 
c
1
 
2
c
2
 
a
3
 
2
c
1
 
a
1
 
3
g
1
 
a
2
 
4
L 
 
a
4
 
c
2
g
1
       
$        
c
1
      
a
3
      
a
1
    
a
2      
F
$
 = <$; 1, 1>
F
a
 = <
a
; 2, 5>
F
c
 = <
c
; 6, 7>
F
g
 = <
g
; 8, 8>
 
i
 
1
2
3
4
5
6
7
8
+
+
+
String Matching with 
k 
Mismatches
Search Trees
 
pattern: 
r 
= 
tcaca
; target:
 s 
= 
acagaca
; 
k = 
2.
10
 
 
11
 
String Matching with 
k 
Mismatches
Mismatching information
 
R – 
mismatching table for 
r 
with |
r| = m.
 
R
ij
 
containing the positions of the first 2
k 
+ 1 mismatches between
 
r
[
i 
.. 
m 
q 
+ 
i
] and 
r
[
j 
.. 
m 
q 
+ 
j
], where 
q 
= max{
i
, 
j
}, such that if
 
R
ij
[
l
] = 
x
 (
 
)
 then 
r
[
i 
+ 
x 
- 1] 
 
r
[
j 
+ 
x 
- 1] or one of them does not
 
exist, and it is the 
l-
th mismatch between them.
 
tcacg
12
 
 
 
String Matching with 
k 
Mismatches
Derivation of mismatching information
 
We store only part of mismatching information, specifically:
 
R
12
, …, 
R
1
m
, while all the other mismatching information will
 
be dynamically derived.
 
Derive the mismatching
information between
1
 = 
r
[2
 .. 
4]
 = 
cacg
 and
2
 = 
r
[3 .. 5]
 
= 
acg
from 
R
12
 and 
R
13
.
1
2
3
4
13
13
 
 
 
String Matching with 
k 
Mismatches
Algorithm for Derivation of mismatching information
Let 
, 
1
 and 
2
 be three strings. Let 
A
1
 and 
A
2
 
be two arrays
containing all the positions of mismatches between 
 
and 
1
, and 
and 
2
, respectively.
Create a new array 
A
 such that if 
A
[
i
] = 
j 
(
 
)
,
 
then 
1
[
j
] 
 
1
[
j
], or
one of them does not exists. It is the 
i
th mismatch between them.
1.
p 
:= 1; 
q 
:= 1; 
l 
:= 1;
2.
If 
A
2
[
q
] < 
A
1
[
p
], then {
A
[
l
] := 
A
2
[
q
]; 
q  
:= 
q 
+ 1; 
l 
:= 
l 
+ 1;}
3.
If 
A
1
[
p
] < 
A
2
[
q
], then {
A
[
l
] := 
A
1
[
p
]; 
p 
:= 
p 
+ 1; 
l 
:= 
l 
+ 1;}
4.
If 
A
1
[
p
] = 
A
2
[
q
], then {if 
1
[
p
] 
 
2
[
q
], then {
A
[
l
] := 
q
; 
l 
:= 
l 
+ 1;} 
p 
:= 
p 
+ 1; 
q 
:= 
q 
+ 1;}
5.
If 
p 
> |
A
1
|, 
q
 > |
A
2
|, or both 
A
1
[
p
] and 
A
2
[
q
] are 
, stop (if 
A
1
 
(or 
A
2
)
 
has some remaining
elements, which are not 
, first  append them to the rear of 
A
, and then stop.)
6.
Otherwise, go to (2).
14
 
 
14
14
 
 
 
String Matching with 
k 
Mismatches
Derivation of mismatching information for paths in
a search tree.
 
r
[2] =
 c
 
r
[3] =
 a
 
r
[4] =
 c
 
r
[5] =
 a
 
r
[1] =
 t
 
r
:
15
15
15
String Matching with 
k 
Mismatches
Mismatching trees
 
In a search tree, we distinguish between matching and mismatching nodes.
16
String Matching with 
k 
Mismatches
Mismatching trees
<
g
, 
3
>
 
u
10
<
c
, 
3
>
 
u
11
17
 
 
String Matching with 
k 
Mismatches
Algorithm
Mismatching tree generation
Derivation of mismatching information for paths
18
<
g
, 
3
>
 
u
10
<
c
, 
3
>
 
u
11
Derivation of Mismatching Information
R
12
 = 
R
21 
=
 1
   
2
   
3
   
4
   
mis
1
 =
 1
   
4
  
 
 1
   
2
  
3
<
a
, [1, 4]>
 
<
c
, [1, 2]>
 
<
a
, [2, 3]>
 
<
g
, [1, 1]>
 
<
a
, [4, 4]>
 
19
Algorithm
Generation of mismatching trees
In order to generate a mismatching tree 
D
, we will use a stack 
S
 to control the process, in which
each entry is a quadruple 
(
v
, 
j
,
 
, 
u
), where
v
a node inserted into the hash table.
j
j
 is an integer to indicate that 
v 
is the 
j
th node on a path in 
T 
 (counted from the root with
the root as the 0th node).
the number of mismatches between the path and 
r
[0 .. 
j
] (recall that 
r
[0] = ‘-’)
.
u 
– the parent of a node in 
D
 to be created for 
v
.
20
Algorithm
Mismatching tree generation
Each time an entry 
e 
= 
(
v
, 
j
,
 
, 
u
) with 
v 
= <
x
, [
, 
]> is popped out from 
S
, we will check whether
x 
= 
r
[
j
].
If 
x 
= 
r
[
j
], we will generate a node 
u
 
= <
x
, 
j
> and link it to 
u 
as a child.
If 
x 
 
r
[
j
], we will check whether 
u
 is a node of the form <-, 0>. If it is not the case, generate a
node 
u
 = <
-, 
0>.
Otherwise, set 
u
 
to be 
u
.
Using 
search
( ) to find all the children of 
v
: 
v
1
, …, 
v
l
. Then, push each (
v
i
, 
j 
+ 1,
 

, 
u
) into 
S
with 

 being 
 or 
 + 1, depending on whether 
y
i 
= 
r
[
j 
+ 1], where 
v
i
 = <
y
i
, [
i
, 
i
]>.
21
Algorithm
Mismatching information derivation for paths
As with the basic process, each time a node 
v 
= <
x
, [
, 
]> (compared to 
r
[
j
]) is encountered,
which is the same as a previous one 
v
 
= <
x
, [

, 

]> (compared to 
r
[
i
]), we will not create a
subtree in 
T 
in a way as for 
v
, but create a new node 
u 
for 
v 
in 
D
 (mismatching tree) and then
 
go
along 
p
(
v
)
 
(the link associated with 
v
)
 
to find the corresponding nodes 
u
 in 
D
 and search 
D
[
u
] in
the breadth-first manner to generate a subtree rooted at 
u 
in 
D
 by simulating the merge
operation discussed in Subsection 
B
.
In other words, 
D
[
u
] (to be created) corresponds to the mismatch arrays for all the paths going
though 
v 
in 
T
, which will not be actually produced.
22
Algorithm
Mismatching information derivation for paths
To this end, a queue data structure 
Q 
is used to do a breadth-first search of 
D
[
u
], and at the
same time generate 
D
[
u
]. In 
Q
, each entry 
e 
is a pair (
w
, 
) with 
w 
being a node in 
D
[
u
], and 
 
an
entry in 
R
ij
. 
Initially, put (
u
, 
R
ij
[1]) into 
Q, 
where 
u
 = <
x
, 
i
>.
 
In the process,
 
when 
e 
is dequeued
from 
Q
 (taken out from the front), we will make the following operations (simulating the steps in
merge
( )):
1)
Let 
e 
= (
w
, 
R
ij
[
l
]). Assume that 
w 
= <
z
, 
f
> and 
R
ij
[
l
] = 
val.
 
If <
z
, 
f
>
 
= <-, 0>, then create a copy of
w 
added to 
D
[
u
]. If 
w 
is not a leaf node, let 
w
1
, …, 
w
h
 
be the children of 
w
 and enqueue (
w
1
,
R
ij
[
l
]
), …, (
w
h
, 
R
ij
[
l
]
) into 
Q 
(append at the end) in turn. If <
z
, 
f
>
 
 <-, 0>, do (2), (3), or (4).
2)
If 
f 
< 
i 
+ 
val 
- 1, add <
z
, 
f 
i 
+ 
j
> to 
D
[
u
]. If 
w 
is not a leaf node, enqueue (
w
1
, 
R
ij
[
l
]
), …, (
w
h
, 
R
ij
[
l
]
)
into 
Q
.
23
Algorithm
Mismatching information derivation for paths
3)
If 
f 
= 
i 
+ 
val 
- 1, we will distinguish between two subcases: 
z 
r
[
j 
+ 
val 
- 1]
 
and 
z 
= 
r
[
j 
+ 
val 
- 1]. If
z 
r
[
j 
+ 
val 
- 1], we have a mismatching and add a node <
z
, 
j 
+ 
val 
- 1>
 
to 
D
[
u
]. If 
z
 = 
r
[
j
 + 
val 
-
1], create a node <-, 0> added to 
D
[
u
]. (If its parent is <-, 0>, it should be merged into its
parent.) 
4)
If 
w 
is not a leaf node, enqueue <
w
1
, 
R
ij
[
l 
+ 1])
, …, <
 w
h
, 
R
ij
[
l 
+ 1]) 
into 
Q
.
5)
If 
f 
> 
i 
+ 
val 
- 1, we will scan 
R
ij
 starting from 
R
ij
[
l
] until we meet 
R
ij
[
l
] such that 
f 
 
i
 + 
R
ij
[
l
] - 1.
For each 
R
ij
[
g
] (
l 
g 
< 
l
), we create a new node <
r
[
j
 + 
R
ij
[
g
] - 1
], 
j 
+ 
R
ij
[
g
] - 1
> added to 
D
[
u
].
Enqueue <
w
, 
R
ij
[
l
]> 
into 
Q
.
Experiments
Compare 4 different approaches
BWT-based 
[34] (BWT for short),
Amir’s method
 [2] (Amir
 
for short),
Cole’s method 
[14] (Cole
 
for short),
Algorithm A discussed in this paper 
(
A
( )
 
for short)
24
 
25
Experiments
Test Environments:
-
Implementation in C++, compiled by GNU make utility
with optimization of level 2
-
64-bit Ubuntu operating system
-
run on a single core of a 2.40GHz Intel Xeon E5-2630
processor with 32GB RAM
26
Experiments
TABLE I. CHARACTERISTICS OF GENOMES
27
Tests with Real Data
T
ESTS
 
WITH
 
VARYING
 
LENGTH
 
OF
 
READS
 (
OVER
 
Rat genome
)
 
 
 
 
Number of leaf nodes of 
S
-trees
28
28
Tests with Real Data
T
ESTS
 
WITH
 
VARYING
 
LENGTH
 
OF
 
READS
 (
OVER
 
Zebra fish 
) 
 
Number of leaf nodes of 
S
-trees
29
29
29
Tests with real Data
Tests with varying length of reads 
(over 
Rat chr1
) 
 
30
30
30
30
Tests with Real Data
Tests with varying length of reads 
(over 
C. elegans
) 
 
31
31
31
31
31
Tests with varying length of reads (over 
C. merlae 
)
 
Tests with Real Data
 
Conclusion and Future Work
Main contribution
 
-
 
Combination of derivation of mismatching information and
BWT indexes for 
k 
mismatching problem
 
-
 
Concept of mismatching trees
 
-
 
Extensive tests
Future work
 
-
 
String matching with 
k 
differences
 
-
 
String matching with 
don’t care 
symbols
32
Thank you!
33
Slide Note
Embed
Share

Introducing a novel method for string matching with k mismatches, catering to scenarios like DNA databases with polymorphisms or sequencing errors. Explore the utilization of BWT arrays and mismatching trees for efficient pattern recognition. Learn about related work, BWT index construction, and backward search techniques. This research sheds light on enhancing string matching accuracy across various applications.

  • String Matching
  • Mismatches
  • DNA Databases
  • BWT Arrays
  • Pattern Recognition

Uploaded on Feb 28, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches Yangjun Chen, Yujia Wu Department of Applied Computer Science University of Winnipeg 1

  2. Outline Motivation - Statement of Problem - Related work BWT Arrays A space-economic Index for String Matching String Matching with k Mismatches - Search trees - Mismatching information - Mismatching trees Experiments Conclusion and Future Work 2

  3. Statement of Problem String matching with k mismatches: find all the occurrences of a pattern string r in a target string s with each occurrence having up to k positions different between r and s. - In DNA databases, due to polymorphisms or mutations among individuals or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions at any of its occurrences in a genome. pattern target Example: k = 4 a a a a a c a a a c a c a c a c a g a a g c c c 3

  4. Related Work Exact string matching - On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick - Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT- transformation (Burrow-Wheeler), Hash (Karp, Rabin) Inexact string matching String matching with k mismatches - Hamming distance (Landau, Cole Cole) String matching with k differences - Levelshtein distance (Chang, String matching with wild-cards (Manber, Baeza-Yates) Landau, U U. . Vishkin Vishkin; ; Amir Amir at at al al.; .; - Lampe) Chang, Lampe - - 4

  5. BWT-Index Burrows-Wheeler Transform (BWT) s = a1c1a2g1a3c2a4$ BWT construction: Rank correspondence: F rkF - 1 2 3 4 1 2 1 rkL 1 1 1 - 2 2 3 4 L if SA[i] = 1; L[i] = $, L[i] = s[SA[i] 1], otherwise. $ a1c1 a2 g1 a3 c2 a4 a4$ a1 c1 a2 g1 a3 c2 a3 c2 a4$ a1 c1 a2 g1 a1 c1a2 g1 a3 c2 a4$ a2 g1 a3 c2 a4$ a1 c1 c2 a4$ a1 c1 a2 g1 a3 c1 a2 g1 a3 c2 a4$ a1 g1 a3 c2 a4$ a1 c1 a2 rkF(e) = rkL(e) a1 c1a2 g1 a3 c2 a4$ c1 a2 g1 a3 c2 a4$ a1 a2 g1 a3 c2 a4$ a1 c1 g1 a3 c2 a4$ a1 c1 a2 a3 c2 a4$ a1 c1 a2 g1 c2 a4$ a1 c1 a2 g1 a3 a4$ a1 c1 a2 g1 a3 c2 $ a1c1 a2 g1 a3 c2 a4 rank: 3 SA[ ] suffix array rank: 3 5

  6. Backward Search of BWT-Index <z, [ , ]>, if z appears in L ; otherwise. s = a1c1a2g1a3c2a4$ Search p = aca search(z, ) = , Z: a character L : a range in L, corresponding to : a range in F Backward Search Suffix Array F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 8 7 5 1 3 6 2 4 6

  7. Backward Search of BWT-Index search(a, <c, [1, 2]>) search(c, <a, [2, 5]>) Searchsequence: <a, [2, 5]> <c, [1, 2]> <a, [3, 4]> Suffix Array F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 8 7 5 1 3 6 2 4 7 7

  8. rankAll Arrange Arrange | | | | arrays in in the the array array for Instead of of scanning for for a a certain certain X X , , we A A [ [ ] ]. . If If it it is is the + + 1 1, , A AX X[ [ ] ] ] ] should for a a character character X X such appearances of of X Xwithin arrays each for X X) ) is is the scanning a a certain each for the number number of of appearances certain segment segment L L[ [ .. .. ] ] ( ( ) ) to to find we can can simply simply look look up the case, case, then then does does not not occur should be be the the found found range range. . that A AX X[ [i i] ] (the within L L[ [1 1 .. .. i i] ]. . find a a subrange up A AX Xto to see see whether whether A AX X[ [ - - 1 1] ] = = occur in in .. .. ] ]. . Otherwise, Otherwise, [ [A AX X[ [ - - 1 1] ] such that (the i ith thentry entry Instead subrange A$ 0 0 0 1 1 1 1 1 Aa 1 1 1 1 1 2 3 4 Ac 0 1 1 1 2 2 2 2 Ag At 0 0 1 1 1 1 1 1 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 Example To find the first and the last appearance of c in L[2 .. 5], we only need to find Ac[2 1] = Ac [1] = 0 and Ac [5] = 2. So the corresponding range is [Ac[2- 1] + 1, Ac[5]] = [1, 2]. 0 0 0 0 0 0 0 0

  9. Reduce rankAll-Index Size F-ranks: F = <a; xa, ya> BWT array: L Reduced appearance array: A with bucket size . Reduced suffix array: SA* with bucket size . Find a range: top F(x ) + A [ (top-1)/ ] + r +1 bot F(x ) + A [ bot/ ] + r r is the number of 's appearances within L[ (top - 1)/ .. top - 1] r is the number of 's appearances within L[ bot/ .. bot ] L a4 c2 g1 $ c1 a3 a1 a2 SA* F = < ; x , y > i A$ 0 0 0 1 1 1 1 1 Aa 1 1 1 1 1 2 3 4 Ac 0 1 1 1 2 2 2 2 Ag At 0 0 1 1 1 1 1 1 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 rkL 1 1 1 - 2 2 3 4 SA 8 7 5 1 3 6 2 4 8 7 5 1 3 6 2 4 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0 F$ = <$; 1, 1> Fa = <a; 2, 5> Fc = <c; 6, 7> Fg = <g; 8, 8> + + + 9

  10. String Matching with k Mismatches Search Trees pattern: r = tcaca; target: s = acagaca; k = 2. v0 <-, [1, 8]> T: r: v1 r[1] = t v2 v3 <a, [1, 4]> v5 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4>] <c, [2, 2]> v6 r[2] = c v4 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <a, [3, 3]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> v7 v10 v8 v9 v11 r[3] = a v14 v12 v13 v15 r[4] = c <a, [3, 3]> <$, [-, -]> v18 v17 v19 r[5] = a v16 P2 P3 P1 P4 10

  11. String Matching with k Mismatches Mismatching information R mismatching table for r with |r| = m. Rij containing the positions of the first 2k + 1 mismatches between r[i .. m q + i] and r[j .. m q + j], where q = max{i, j}, such that if Rij[l] = x ( ) then r[i + x - 1] r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them. i r: tcacg cacg r2: r1: r1: 1 2 3 4 R12: tcacg acg tcacg R13: 1 3 r3: r1: i R14 1 2 r4: r1: cg tcacg g R15 1 r5: 11

  12. String Matching with k Mismatches Derivation of mismatching information We store only part of mismatching information, specifically: R12, , R1m, while all the other mismatching information will be dynamically derived. A1 = R12: 1 2 3 p 1 = r[2 .. 4] = cacg and Step 1: Step 2: Step 3: 2 3 Derive the mismatching information between 4 1 4 1 2 3 4 p p A2 = R13: 1 3 1 3 1 3 2 = r[3 .. 5]= acg q q q from R12 and R13. 1[1]=c 2[1]=a 1[3]=c 2[3]=g A: 1 1 2 3 1 2 12

  13. String Matching with k Mismatches Algorithm for Derivation of mismatching information Let , 1 and 2 be three strings. Let A1 and A2be two arrays containing all the positions of mismatches between and 1, and and 2, respectively. Create a new array A such that if A[i] = j ( ),then 1[j] 1[j], or one of them does not exists. It is the ith mismatch between them. := 1; q q := 1; := 1; l l := 1; := 1; 2. 2. If If A A2 2[ [q q] < ] < A A1 1[ [p p], then { ], then {A A[ [l l] := ] := A A2 2[ [q q]; ]; q q := := q q + 1; + 1; l l := 3. 3. If If A A1 1[ [p p] < ] < A A2 2[ [q q], then { ], then {A A[ [l l] := ] := A A1 1[ [p p]; ]; p p := := p p + 1; + 1; l l := 4. 4. If If A A1 1[ [p p] = ] = A A2 2[ [q q], then {if ], then {if 1 1[ [p p] ] 2 2[ [q q], then { ], then {A A[ [l l] := 5. 5. If If p p > | > |A A1 1|, |, q q > | > |A A2 2|, or both |, or both A A1 1[ [p p] and ] and A A2 2[ [q q] are ] are , stop (if elements, which are not elements, which are not , first append them to the rear of , first append them to the rear of A A, and then stop.) 6. 6. Otherwise, go to (2). Otherwise, go to (2). 1. 1. p p := 1; := l l + 1;} + 1;} := l l + 1;} + 1;} ] := q q; ; l l := , stop (if A A1 1(or := l l + 1;} + 1;} p p := := p p + 1; + 1; q q := (or A A2 2) )has some remaining has some remaining , and then stop.) := q q + 1;} + 1;} 13 13

  14. String Matching with k Mismatches Derivation of mismatching information for paths in a search tree. This part of P3 will not be created. We derive the mismatching information for it according to P1 and R21. <-, [1, 8]> r: v1 r[1] = t <a, [1, 4]> <c, [1, 2]> P: P : P: P : r[2] = c <a, [2, 3]> <g, [1, 1]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> j i r[3] = a i j r[4] = c <a, [4, 4>]> <c, [2, 2]> r[5] = a P3 P1 14 14 14

  15. String Matching with k Mismatches Mismatching trees In a search tree, we distinguish between matching and mismatching nodes. v0 <-, [1, 8]> T: r: v1 r[1] = t v2 v3 <a, [1, 4]> v5 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4>] <c, [2, 2]> v6 v7 r[2] = c v4 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <a, [3, 3]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> v10 v8 v9 v11 r[3] = a v14 v12 v13 v15 r[4] = c <a, [3, 3]> <$, [-, -]> v18 v17 v19 r[5] = a v16 P2 P3 P1 P4 15 15 15

  16. String Matching with k Mismatches Mismatching trees v0 <-, 0> T: r: u1 r[1] = t u2 u3 <a, 1> <g, 1> <c, 1> u6 u7 <a, 2> r[2] = c u4 u5 <a, 2> <g, 2> <-, 0> u11 u10 u9 <c, 3> r[3] = a <g, 3> <-, 0> u12 r[4] = c <g, 4> <-, 0> r[5] = a u13 P2 P3 P1 P4 16

  17. String Matching with k Mismatches Algorithm Mismatching tree generation Derivation of mismatching information for paths 17

  18. Derivation of Mismatching Information v0 <-, 0> T: r: u1 r[1] = t u2 u3 <a, 1> <g, 1> <c, 1> u6 u7 <a, 2> r[2] = c u4 u5 <a, 2> <g, 2> <-, 0> u9 u11 u10 r[3] = a <c, 3> <g, 3> <-, 0> u10 123 r[4] = c <g, 4> <-, 0> r[5] = a u11 mis1 = 14 R12 = R21 = 1234 P2 P3 P1 P4 <a, [1, 4]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> 18

  19. Algorithm Generation of mismatching trees In order to generate a mismatching tree D, we will use a stack S to control the process, in which each entry is a quadruple (v, j, , u), where v a node inserted into the hash table. j j is an integer to indicate that v is the jth node on a path in T (counted from the root with the root as the 0th node). the number of mismatches between the path and r[0 .. j] (recall that r[0] = - ). u the parent of a node in D to be created for v. 19

  20. Algorithm Mismatching tree generation Each time an entry e = (v, j, , u) with v = <x, [ , ]> is popped out from S, we will check whether x = r[j]. If x = r[j], we will generate a node u = <x, j> and link it to u as a child. If x r[j], we will check whether u is a node of the form <-, 0>. If it is not the case, generate a node u = <-, 0>. Otherwise, set u to be u. Using search( ) to find all the children of v: v1, , vl. Then, push each (vi, j + 1, , u ) into S with being or + 1, depending on whether yi = r[j + 1], where vi = <yi, [ i, i]>. 20

  21. Algorithm Mismatching information derivation for paths As with the basic process, each time a node v = <x, [ , ]> (compared to r[j]) is encountered, which is the same as a previous one v = <x , [ , ]> (compared to r[i]), we will not create a subtree in T in a way as for v , but create a new node u for v in D (mismatching tree) and thengo along p(v )(the link associated with v )to find the corresponding nodes u in D and search D[u ] in the breadth-first manner to generate a subtree rooted at u in D by simulating the merge operation discussed in Subsection B. In other words, D[u] (to be created) corresponds to the mismatch arrays for all the paths going though v in T, which will not be actually produced. 21

  22. Algorithm Mismatching information derivation for paths To this end, a queue data structure Q is used to do a breadth-first search of D[u ], and at the same time generate D[u]. In Q, each entry e is a pair (w, ) with w being a node in D[u ], and an entry in Rij. Initially, put (u , Rij[1]) into Q, where u = <x, i>.In the process,when e is dequeued from Q (taken out from the front), we will make the following operations (simulating the steps in merge( )): Let e = (w, Rij[l]). Assume that w = <z, f> and Rij[l] = val. If <z, f>= <-, 0>, then create a copy of w added to D[u]. If w is not a leaf node, let w1, , whbe the children of w and enqueue (w1, Rij[l]), , (wh, Rij[l]) into Q (append at the end) in turn. If <z, f> <-, 0>, do (2), (3), or (4). If f < i + val - 1, add <z, f i + j> to D[u]. If w is not a leaf node, enqueue (w1, Rij[l]), , (wh, Rij[l]) into Q. 1) 2) 22

  23. Algorithm Mismatching information derivation for paths If f = i + val - 1, we will distinguish between two subcases: z r[j + val - 1]and z = r[j + val - 1]. If z r[j + val - 1], we have a mismatching and add a node <z, j + val - 1>to D[u]. If z = r[j + val - 1], create a node <-, 0> added to D[u]. (If its parent is <-, 0>, it should be merged into its parent.) If w is not a leaf node, enqueue <w1, Rij[l + 1]), , < wh, Rij[l + 1]) into Q. If f > i + val - 1, we will scan Rij starting from Rij[l] until we meet Rij[l ] such that f i + Rij[l ] - 1. For each Rij[g] (l g < l ), we create a new node <r[j + Rij[g] - 1], j + Rij[g] - 1> added to D[u]. Enqueue <w, Rij[l ]> into Q. 3) 4) 5) 23

  24. Experiments Compare 4 different approaches BWT BWT- -based based [34] (BWT for short), [34] (BWT for short), Amir s method Amir s method [2] (Amir [2] (Amirfor short), for short), Cole s method Cole s method [14] (Cole [14] (Colefor short), for short), Algorithm A discussed in this paper Algorithm A discussed in this paper ( (A A( ) ( )for short) for short) 24

  25. Experiments Test Environments: - Implementation in C++, compiled by GNU make utility with optimization of level 2 - 64-bit Ubuntu operating system - run on a single core of a 2.40GHz Intel Xeon E5-2630 processor with 32GB RAM 25

  26. Experiments TABLE I. CHARACTERISTICS OF GENOMES Genomes Rat chr1 (Rnor_6.0) C. merolae (ASM9120v1) C. elegans (WBcel235) Zebra fish (GRCz10) Rat (Rnor_6.0) Genome sizes (bp) 290,094,217 16,728,967 103,022,290 1,464,443,456 2,909,701,677 26

  27. Tests with Real Data TESTSWITHVARYINGLENGTHOFREADS (OVER Rat genome) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 700 1000 600 800 500 600 400 300 400 200 200 100 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k Number of leaf nodes of S-trees varying length of reads k/length k/length- -of of- -read No No. . of of leaf leaf nodes read nodes 5/50 5/50 2K 10 10/ /100 100 0.7M 20 20/ /150 150 16.5M 30 30/ /200 200 102M 27

  28. Tests with Real Data TESTSWITHVARYINGLENGTHOFREADS (OVER Zebra fish ) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 400 600 500 300 400 300 200 200 100 100 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads Number of leaf nodes of S-trees k/length k/length- -of of- -read read 5/50 5/50 10 10/ /100 100 20 20/ /150 150 30 30/ /200 200 No No. . of of leaf leaf nodes nodes 0.7K 0.30M 9.2M 89M 28 28

  29. Tests with real Data Tests with varying length of reads (OVER Rat chr1) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 100 100 80 80 60 60 40 40 20 20 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads 29 29 29

  30. Tests with Real Data Tests with varying length of reads (OVER C. elegans) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 50 60 50 40 40 30 30 20 20 10 10 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads 30 30 30 30

  31. Tests with Real Data Tests with varying length of reads (over C. merlae ) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 5 6 5 4 4 3 3 2 2 1 1 0 0 5 1 0 20 30 40 100 150 200 250 300 varying values of k varying length of reads 31 31 31 31 31

  32. Conclusion and Future Work Main contribution - Combination of derivation of mismatching information and BWT indexes for k mismatching problem - Concept of mismatching trees - Extensive tests Future work - String matching with k differences - String matching with don t care symbols 32

  33. Thank you! Thank you! 33

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#