String Matching with Mismatches: An Innovative Approach

BWT Arrays and Mismatching Trees: A New Way for

String Matching with

 Mismatches

Yangjun Chen, Yujia Wu

Department of Applied Computer Science

University of Winnipeg

Outline



Motivation

Statement of Problem

Related work



BWT Arrays – A space-economic Index for String Matching



String Matching with

Mismatches

Search trees

Mismatching information

Mismatching trees



Experiments



Conclusion and Future Work

Statement of Problem



String matching with

mismatches:

find all the occurrences of a

pattern string

 in a target string

 with each occurrence having

up to

 positions different between

and

In DNA databases,

due to polymorphisms or mutations among

individuals or even sequencing errors, a read (a short sample

DNA

sequence) may disagree in some positions at any of its

occurrences in a genome.

a a a a a c a a a c

a c a c a c a g a a g c c c

Example:

= 4

pattern

target

Related Work



Exact string matching

On-line algorithms:

Knuth-Morris-Pratt

Boyer-Moore

Aho-Corasick

Index based:

suffix trees (

Weiner

McCreight

Ukkonen

), suffix arrays (

Manber

Myers

), BWT-

transformation (

Burrow

Wheeler

), Hash (

Karp

Rabin



Inexact string matching

String matching with

 mismatches - Hamming distance (

Landau, U. Vishkin; Amir at al.;

Cole

String matching with

k differences

- Levelshtein distance (

Chang, Lampe

String matching with

wild-cards

Manber, Baeza-Yates

BWT-Index



Burrows-Wheeler Transform (

BWT



rk

rk

rank: 3

rank: 3

Rank correspondence:

BWT construction:

SA

[…] – suffix array

rk

) =

rk





Search

aca

Backward Search of BWT-Index

Backward Search

: a character



: a range in



: a range in

, corresponding to



Backward Search of BWT-Index

rankAll



Arrange |



| arrays each for a character





 such that

(the

th

entry

in the array for

 is the number of appearances of

within

[1 ..

].



Instead of scanning a certain segment



..



] (







) to find a subrange

for a certain





, we can simply look up

to see whether



- 1

] =





. If it is the case, then



 does

not occur in



..



]. Otherwise,



- 1

+ 1



] should be the found range.

Example

To find the first and the last appearance

of

in

[2 .. 5], we only need to find

[2 – 1] =

[1] = 0 and

[5] = 2. So the

corresponding range is

[2

- 1

] + 1,

[5

]] = [1, 2].

Reduce

rankAll

-Index Size



-ranks:



<a;



BWT array:



Reduced appearance array:



 with bucket

size





Reduced suffix array:

SA

* with bucket size



rk

 = <$; 1, 1>

= <

; 2, 5>

= <

; 6, 7>

= <

; 8, 8>

String Matching with

Mismatches



Search Trees

pattern:

tcaca

; target:

acagaca

k =

2.

String Matching with

Mismatches



Mismatching information

R –

mismatching table for

with |

r| = m.

ij

–

containing the positions of the first 2

+ 1 mismatches between

..

–

] and

..

–

], where

= max{

}, such that if

ij

] =





 then

- 1]



- 1] or one of them does not

exist, and it is the

l-

th mismatch between them.

tcacg

String Matching with

Mismatches



Derivation of mismatching information

We store only part of mismatching information, specifically:

, …,

, while all the other mismatching information will

be dynamically derived.

Derive the mismatching

information between



[2

..

4]

cacg

and



[3 .. 5]

acg

from

and



String Matching with

Mismatches



Algorithm for Derivation of mismatching information



Let





and



 be three strings. Let

and

be two arrays

containing all the positions of mismatches between



and



, and



and



, respectively.



Create a new array

 such that if

] =





then







], or

one of them does not exists. It is the

th mismatch between them.

1.

:= 1;

:= 1;

:= 1;

2.

If

] <

], then {

] :=

];

:=

+ 1;

:=

+ 1;}

3.

If

] <

], then {

] :=

];

:=

+ 1;

:=

+ 1;}

4.

If

] =

], then {if







], then {

] :=

:=

+ 1;}

:=

+ 1;

:=

+ 1;}

5.

If

> |

|,

> |

|, or both

] and

] are



, stop (if

(or

has some remaining

elements, which are not



, first  append them to the rear of

, and then stop.)

6.

Otherwise, go to (2).

String Matching with

Mismatches



Derivation of mismatching information for paths in

a search tree.

[2] =

[3] =

[4] =

[5] =

[1] =

String Matching with

Mismatches



Mismatching trees

In a search tree, we distinguish between matching and mismatching nodes.

String Matching with

Mismatches



Mismatching trees

String Matching with

Mismatches



Algorithm



Mismatching tree generation



Derivation of mismatching information for paths

Derivation of Mismatching Information



mis





, [1, 4]>

, [1, 2]>

, [2, 3]>

, [1, 1]>

, [4, 4]>

Algorithm



Generation of mismatching trees



In order to generate a mismatching tree

, we will use a stack

 to control the process, in which

each entry is a quadruple



), where

•

–

a node inserted into the hash table.

•

–

 is an integer to indicate that

is the

th node on a path in

 (counted from the root with

the root as the 0th node).

•



–

the number of mismatches between the path and

[0 ..

] (recall that

[0] = ‘-’)

•

– the parent of a node in

 to be created for

Algorithm



Mismatching tree generation



Each time an entry



) with

= <

, [





]> is popped out from

, we will check whether

].

•

If

], we will generate a node



= <

> and link it to

as a child.

•

If



], we will check whether

 is a node of the form <-, 0>. If it is not the case, generate a

node



= <

-,

0>.

•

Otherwise, set



to be

•

Using

search

( ) to find all the children of

, …,

. Then, push each (

+ 1,





) into

with



 being



or



 + 1, depending on whether

+ 1], where

= <

, [





]>.

Algorithm



Mismatching information derivation for paths



As with the basic process, each time a node

= <

, [





]> (compared to

]) is encountered,

which is the same as a previous one



= <



, [





]> (compared to

]), we will not create a

subtree in

in a way as for



, but create a new node

for

in

 (mismatching tree) and then

go

along



(the link associated with



to find the corresponding nodes



in

 and search



] in

the breadth-first manner to generate a subtree rooted at

in

 by simulating the merge

operation discussed in Subsection



In other words,

] (to be created) corresponds to the mismatch arrays for all the paths going

though

in

, which will not be actually produced.

Algorithm



Mismatching information derivation for paths



To this end, a queue data structure

is used to do a breadth-first search of



], and at the

same time generate

]. In

, each entry

is a pair (



) with

being a node in



], and



an

entry in

ij

Initially, put (



ij

[1]) into

Q,

where



= <

>.

In the process,

when

is dequeued

from

 (taken out from the front), we will make the following operations (simulating the steps in

merge

( )):

1)

Let

= (

ij

]). Assume that

= <

> and

ij

] =

val.

If <

= <-, 0>, then create a copy of

added to

]. If

is not a leaf node, let

, …,

be the children of

 and enqueue (

ij

), …, (

ij

) into

(append at the end) in turn. If <



 <-, 0>, do (2), (3), or (4).

2)

If

val

- 1, add <

–

> to

]. If

is not a leaf node, enqueue (

ij

), …, (

ij

into

Algorithm



Mismatching information derivation for paths

3)

If

val

- 1, we will distinguish between two subcases:

≠

val

- 1]

and

val

- 1]. If

≠

val

- 1], we have a mismatching and add a node <

val

- 1>

to

]. If

val

1], create a node <-, 0> added to

]. (If its parent is <-, 0>, it should be merged into its

parent.)

4)

If

is not a leaf node, enqueue <

ij

+ 1])

, …, <

ij

+ 1])

into

5)

If

val

- 1, we will scan

ij

 starting from

ij

] until we meet

ij



] such that



ij



] - 1.

For each

ij

] (

≤



), we create a new node <

ij

] - 1

],

ij

] - 1

> added to

].

Enqueue <

ij



]>

into

Experiments



Compare 4 different approaches



BWT-based

[34] (BWT for short),



Amir’s method

 [2] (Amir

for short),



Cole’s method

[14] (Cole

for short),



Algorithm A discussed in this paper

( )

for short)

Experiments

Test Environments:

Implementation in C++, compiled by GNU make utility

with optimization of level 2

64-bit Ubuntu operating system

run on a single core of a 2.40GHz Intel Xeon E5-2630

processor with 32GB RAM

Experiments

TABLE I. CHARACTERISTICS OF GENOMES

Tests with Real Data



ESTS

WITH

VARYING

LENGTH

OF

READS

OVER

Rat genome

Number of leaf nodes of

-trees

Tests with Real Data



ESTS

WITH

VARYING

LENGTH

OF

READS

OVER

Zebra fish

Number of leaf nodes of

-trees

Tests with real Data



Tests with varying length of reads

(over

Rat chr1

Tests with Real Data



Tests with varying length of reads

(over

C. elegans



Tests with varying length of reads (over

C. merlae

Tests with Real Data

Conclusion and Future Work



Main contribution

Combination of derivation of mismatching information and

BWT indexes for

mismatching problem

Concept of mismatching trees

Extensive tests



Future work

String matching with

differences

String matching with

don’t care

symbols

Thank you!

Slide Note

Embed Share

Download

Introducing a novel method for string matching with k mismatches, catering to scenarios like DNA databases with polymorphisms or sequencing errors. Explore the utilization of BWT arrays and mismatching trees for efficient pattern recognition. Learn about related work, BWT index construction, and backward search techniques. This research sheds light on enhancing string matching accuracy across various applications.

laskowski_n Follow

Uploaded on Feb 28, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches Yangjun Chen, Yujia Wu Department of Applied Computer Science University of Winnipeg 1

Outline Motivation - Statement of Problem - Related work BWT Arrays A space-economic Index for String Matching String Matching with k Mismatches - Search trees - Mismatching information - Mismatching trees Experiments Conclusion and Future Work 2

Statement of Problem String matching with k mismatches: find all the occurrences of a pattern string r in a target string s with each occurrence having up to k positions different between r and s. - In DNA databases, due to polymorphisms or mutations among individuals or even sequencing errors, a read (a short sample DNA sequence) may disagree in some positions at any of its occurrences in a genome. pattern target Example: k = 4 a a a a a c a a a c a c a c a c a g a a g c c c 3

Related Work Exact string matching - On-line algorithms: Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick - Index based: suffix trees (Weiner; McCreight; Ukkonen), suffix arrays (Manber, Myers), BWT- transformation (Burrow-Wheeler), Hash (Karp, Rabin) Inexact string matching String matching with k mismatches - Hamming distance (Landau, Cole Cole) String matching with k differences - Levelshtein distance (Chang, String matching with wild-cards (Manber, Baeza-Yates) Landau, U U. . Vishkin Vishkin; ; Amir Amir at at al al.; .; - Lampe) Chang, Lampe - - 4

BWT-Index Burrows-Wheeler Transform (BWT) s = a1c1a2g1a3c2a4$ BWT construction: Rank correspondence: F rkF - 1 2 3 4 1 2 1 rkL 1 1 1 - 2 2 3 4 L if SA[i] = 1; L[i] = $, L[i] = s[SA[i] 1], otherwise. $ a1c1 a2 g1 a3 c2 a4 a4$ a1 c1 a2 g1 a3 c2 a3 c2 a4$ a1 c1 a2 g1 a1 c1a2 g1 a3 c2 a4$ a2 g1 a3 c2 a4$ a1 c1 c2 a4$ a1 c1 a2 g1 a3 c1 a2 g1 a3 c2 a4$ a1 g1 a3 c2 a4$ a1 c1 a2 rkF(e) = rkL(e) a1 c1a2 g1 a3 c2 a4$ c1 a2 g1 a3 c2 a4$ a1 a2 g1 a3 c2 a4$ a1 c1 g1 a3 c2 a4$ a1 c1 a2 a3 c2 a4$ a1 c1 a2 g1 c2 a4$ a1 c1 a2 g1 a3 a4$ a1 c1 a2 g1 a3 c2 $ a1c1 a2 g1 a3 c2 a4 rank: 3 SA[ ] suffix array rank: 3 5

Backward Search of BWT-Index <z, [ , ]>, if z appears in L ; otherwise. s = a1c1a2g1a3c2a4$ Search p = aca search(z, ) = , Z: a character L : a range in L, corresponding to : a range in F Backward Search Suffix Array F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 8 7 5 1 3 6 2 4 6

Backward Search of BWT-Index search(a, <c, [1, 2]>) search(c, <a, [2, 5]>) Searchsequence: <a, [2, 5]> <c, [1, 2]> <a, [3, 4]> Suffix Array F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 8 7 5 1 3 6 2 4 7 7

rankAll Arrange Arrange | | | | arrays in in the the array array for Instead of of scanning for for a a certain certain X X , , we A A [ [ ] ]. . If If it it is is the + + 1 1, , A AX X[ [ ] ] ] ] should for a a character character X X such appearances of of X Xwithin arrays each for X X) ) is is the scanning a a certain each for the number number of of appearances certain segment segment L L[ [ .. .. ] ] ( ( ) ) to to find we can can simply simply look look up the case, case, then then does does not not occur should be be the the found found range range. . that A AX X[ [i i] ] (the within L L[ [1 1 .. .. i i] ]. . find a a subrange up A AX Xto to see see whether whether A AX X[ [ - - 1 1] ] = = occur in in .. .. ] ]. . Otherwise, Otherwise, [ [A AX X[ [ - - 1 1] ] such that (the i ith thentry entry Instead subrange A$ 0 0 0 1 1 1 1 1 Aa 1 1 1 1 1 2 3 4 Ac 0 1 1 1 2 2 2 2 Ag At 0 0 1 1 1 1 1 1 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 Example To find the first and the last appearance of c in L[2 .. 5], we only need to find Ac[2 1] = Ac [1] = 0 and Ac [5] = 2. So the corresponding range is [Ac[2- 1] + 1, Ac[5]] = [1, 2]. 0 0 0 0 0 0 0 0

Reduce rankAll-Index Size F-ranks: F = <a; xa, ya> BWT array: L Reduced appearance array: A with bucket size . Reduced suffix array: SA* with bucket size . Find a range: top F(x ) + A [ (top-1)/ ] + r +1 bot F(x ) + A [ bot/ ] + r r is the number of 's appearances within L[ (top - 1)/ .. top - 1] r is the number of 's appearances within L[ bot/ .. bot ] L a4 c2 g1 $ c1 a3 a1 a2 SA* F = < ; x , y > i A$ 0 0 0 1 1 1 1 1 Aa 1 1 1 1 1 2 3 4 Ac 0 1 1 1 2 2 2 2 Ag At 0 0 1 1 1 1 1 1 F $ a4 a3 a1 a2 c2 c1 g1 L a4 c2 g1 $ c1 a3 a1 a2 rkL 1 1 1 - 2 2 3 4 SA 8 7 5 1 3 6 2 4 8 7 5 1 3 6 2 4 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0 F$ = <$; 1, 1> Fa = <a; 2, 5> Fc = <c; 6, 7> Fg = <g; 8, 8> + + + 9

String Matching with k Mismatches Search Trees pattern: r = tcaca; target: s = acagaca; k = 2. v0 <-, [1, 8]> T: r: v1 r[1] = t v2 v3 <a, [1, 4]> v5 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4>] <c, [2, 2]> v6 r[2] = c v4 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <a, [3, 3]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> v7 v10 v8 v9 v11 r[3] = a v14 v12 v13 v15 r[4] = c <a, [3, 3]> <$, [-, -]> v18 v17 v19 r[5] = a v16 P2 P3 P1 P4 10

String Matching with k Mismatches Mismatching information R mismatching table for r with |r| = m. Rij containing the positions of the first 2k + 1 mismatches between r[i .. m q + i] and r[j .. m q + j], where q = max{i, j}, such that if Rij[l] = x ( ) then r[i + x - 1] r[j + x - 1] or one of them does not exist, and it is the l-th mismatch between them. i r: tcacg cacg r2: r1: r1: 1 2 3 4 R12: tcacg acg tcacg R13: 1 3 r3: r1: i R14 1 2 r4: r1: cg tcacg g R15 1 r5: 11

String Matching with k Mismatches Derivation of mismatching information We store only part of mismatching information, specifically: R12, , R1m, while all the other mismatching information will be dynamically derived. A1 = R12: 1 2 3 p 1 = r[2 .. 4] = cacg and Step 1: Step 2: Step 3: 2 3 Derive the mismatching information between 4 1 4 1 2 3 4 p p A2 = R13: 1 3 1 3 1 3 2 = r[3 .. 5]= acg q q q from R12 and R13. 1[1]=c 2[1]=a 1[3]=c 2[3]=g A: 1 1 2 3 1 2 12

String Matching with k Mismatches Algorithm for Derivation of mismatching information Let , 1 and 2 be three strings. Let A1 and A2be two arrays containing all the positions of mismatches between and 1, and and 2, respectively. Create a new array A such that if A[i] = j ( ),then 1[j] 1[j], or one of them does not exists. It is the ith mismatch between them. := 1; q q := 1; := 1; l l := 1; := 1; 2. 2. If If A A2 2[ [q q] < ] < A A1 1[ [p p], then { ], then {A A[ [l l] := ] := A A2 2[ [q q]; ]; q q := := q q + 1; + 1; l l := 3. 3. If If A A1 1[ [p p] < ] < A A2 2[ [q q], then { ], then {A A[ [l l] := ] := A A1 1[ [p p]; ]; p p := := p p + 1; + 1; l l := 4. 4. If If A A1 1[ [p p] = ] = A A2 2[ [q q], then {if ], then {if 1 1[ [p p] ] 2 2[ [q q], then { ], then {A A[ [l l] := 5. 5. If If p p > | > |A A1 1|, |, q q > | > |A A2 2|, or both |, or both A A1 1[ [p p] and ] and A A2 2[ [q q] are ] are , stop (if elements, which are not elements, which are not , first append them to the rear of , first append them to the rear of A A, and then stop.) 6. 6. Otherwise, go to (2). Otherwise, go to (2). 1. 1. p p := 1; := l l + 1;} + 1;} := l l + 1;} + 1;} ] := q q; ; l l := , stop (if A A1 1(or := l l + 1;} + 1;} p p := := p p + 1; + 1; q q := (or A A2 2) )has some remaining has some remaining , and then stop.) := q q + 1;} + 1;} 13 13

String Matching with k Mismatches Derivation of mismatching information for paths in a search tree. This part of P3 will not be created. We derive the mismatching information for it according to P1 and R21. <-, [1, 8]> r: v1 r[1] = t <a, [1, 4]> <c, [1, 2]> P: P : P: P : r[2] = c <a, [2, 3]> <g, [1, 1]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> j i r[3] = a i j r[4] = c <a, [4, 4>]> <c, [2, 2]> r[5] = a P3 P1 14 14 14

String Matching with k Mismatches Mismatching trees In a search tree, we distinguish between matching and mismatching nodes. v0 <-, [1, 8]> T: r: v1 r[1] = t v2 v3 <a, [1, 4]> v5 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4>] <c, [2, 2]> v6 v7 r[2] = c v4 <g, [1, 1]> <a, [4, 4]> <c, [2, 2]> <a, [3, 3]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> v10 v8 v9 v11 r[3] = a v14 v12 v13 v15 r[4] = c <a, [3, 3]> <$, [-, -]> v18 v17 v19 r[5] = a v16 P2 P3 P1 P4 15 15 15

String Matching with k Mismatches Mismatching trees v0 <-, 0> T: r: u1 r[1] = t u2 u3 <a, 1> <g, 1> <c, 1> u6 u7 <a, 2> r[2] = c u4 u5 <a, 2> <g, 2> <-, 0> u11 u10 u9 <c, 3> r[3] = a <g, 3> <-, 0> u12 r[4] = c <g, 4> <-, 0> r[5] = a u13 P2 P3 P1 P4 16

String Matching with k Mismatches Algorithm Mismatching tree generation Derivation of mismatching information for paths 17

Derivation of Mismatching Information v0 <-, 0> T: r: u1 r[1] = t u2 u3 <a, 1> <g, 1> <c, 1> u6 u7 <a, 2> r[2] = c u4 u5 <a, 2> <g, 2> <-, 0> u9 u11 u10 r[3] = a <c, 3> <g, 3> <-, 0> u10 123 r[4] = c <g, 4> <-, 0> r[5] = a u11 mis1 = 14 R12 = R21 = 1234 P2 P3 P1 P4 <a, [1, 4]> <c, [1, 2]> <a, [2, 3]> <g, [1, 1]> <a, [4, 4]> 18

Algorithm Generation of mismatching trees In order to generate a mismatching tree D, we will use a stack S to control the process, in which each entry is a quadruple (v, j, , u), where v a node inserted into the hash table. j j is an integer to indicate that v is the jth node on a path in T (counted from the root with the root as the 0th node). the number of mismatches between the path and r[0 .. j] (recall that r[0] = - ). u the parent of a node in D to be created for v. 19

Algorithm Mismatching tree generation Each time an entry e = (v, j, , u) with v = <x, [ , ]> is popped out from S, we will check whether x = r[j]. If x = r[j], we will generate a node u = <x, j> and link it to u as a child. If x r[j], we will check whether u is a node of the form <-, 0>. If it is not the case, generate a node u = <-, 0>. Otherwise, set u to be u. Using search( ) to find all the children of v: v1, , vl. Then, push each (vi, j + 1, , u ) into S with being or + 1, depending on whether yi = r[j + 1], where vi = <yi, [ i, i]>. 20

Algorithm Mismatching information derivation for paths As with the basic process, each time a node v = <x, [ , ]> (compared to r[j]) is encountered, which is the same as a previous one v = <x , [ , ]> (compared to r[i]), we will not create a subtree in T in a way as for v , but create a new node u for v in D (mismatching tree) and thengo along p(v )(the link associated with v )to find the corresponding nodes u in D and search D[u ] in the breadth-first manner to generate a subtree rooted at u in D by simulating the merge operation discussed in Subsection B. In other words, D[u] (to be created) corresponds to the mismatch arrays for all the paths going though v in T, which will not be actually produced. 21

Algorithm Mismatching information derivation for paths To this end, a queue data structure Q is used to do a breadth-first search of D[u ], and at the same time generate D[u]. In Q, each entry e is a pair (w, ) with w being a node in D[u ], and an entry in Rij. Initially, put (u , Rij[1]) into Q, where u = <x, i>.In the process,when e is dequeued from Q (taken out from the front), we will make the following operations (simulating the steps in merge( )): Let e = (w, Rij[l]). Assume that w = <z, f> and Rij[l] = val. If <z, f>= <-, 0>, then create a copy of w added to D[u]. If w is not a leaf node, let w1, , whbe the children of w and enqueue (w1, Rij[l]), , (wh, Rij[l]) into Q (append at the end) in turn. If <z, f> <-, 0>, do (2), (3), or (4). If f < i + val - 1, add <z, f i + j> to D[u]. If w is not a leaf node, enqueue (w1, Rij[l]), , (wh, Rij[l]) into Q. 1) 2) 22

Algorithm Mismatching information derivation for paths If f = i + val - 1, we will distinguish between two subcases: z r[j + val - 1]and z = r[j + val - 1]. If z r[j + val - 1], we have a mismatching and add a node <z, j + val - 1>to D[u]. If z = r[j + val - 1], create a node <-, 0> added to D[u]. (If its parent is <-, 0>, it should be merged into its parent.) If w is not a leaf node, enqueue <w1, Rij[l + 1]), , < wh, Rij[l + 1]) into Q. If f > i + val - 1, we will scan Rij starting from Rij[l] until we meet Rij[l ] such that f i + Rij[l ] - 1. For each Rij[g] (l g < l ), we create a new node <r[j + Rij[g] - 1], j + Rij[g] - 1> added to D[u]. Enqueue <w, Rij[l ]> into Q. 3) 4) 5) 23

Experiments Compare 4 different approaches BWT BWT- -based based [34] (BWT for short), [34] (BWT for short), Amir s method Amir s method [2] (Amir [2] (Amirfor short), for short), Cole s method Cole s method [14] (Cole [14] (Colefor short), for short), Algorithm A discussed in this paper Algorithm A discussed in this paper ( (A A( ) ( )for short) for short) 24

Experiments Test Environments: - Implementation in C++, compiled by GNU make utility with optimization of level 2 - 64-bit Ubuntu operating system - run on a single core of a 2.40GHz Intel Xeon E5-2630 processor with 32GB RAM 25

Experiments TABLE I. CHARACTERISTICS OF GENOMES Genomes Rat chr1 (Rnor_6.0) C. merolae (ASM9120v1) C. elegans (WBcel235) Zebra fish (GRCz10) Rat (Rnor_6.0) Genome sizes (bp) 290,094,217 16,728,967 103,022,290 1,464,443,456 2,909,701,677 26

Tests with Real Data TESTSWITHVARYINGLENGTHOFREADS (OVER Rat genome) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 700 1000 600 800 500 600 400 300 400 200 200 100 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k Number of leaf nodes of S-trees varying length of reads k/length k/length- -of of- -read No No. . of of leaf leaf nodes read nodes 5/50 5/50 2K 10 10/ /100 100 0.7M 20 20/ /150 150 16.5M 30 30/ /200 200 102M 27

Tests with Real Data TESTSWITHVARYINGLENGTHOFREADS (OVER Zebra fish ) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 400 600 500 300 400 300 200 200 100 100 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads Number of leaf nodes of S-trees k/length k/length- -of of- -read read 5/50 5/50 10 10/ /100 100 20 20/ /150 150 30 30/ /200 200 No No. . of of leaf leaf nodes nodes 0.7K 0.30M 9.2M 89M 28 28

Tests with real Data Tests with varying length of reads (OVER Rat chr1) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 100 100 80 80 60 60 40 40 20 20 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads 29 29 29

Tests with Real Data Tests with varying length of reads (OVER C. elegans) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 50 60 50 40 40 30 30 20 20 10 10 0 0 5 1 0 20 30 40 1 00 1 50 200 250 300 varying values of k varying length of reads 30 30 30 30

Tests with Real Data Tests with varying length of reads (over C. merlae ) BWT Cole's Amir's A( ) BWT Amir's Cole's A( ) time (s) 5 6 5 4 4 3 3 2 2 1 1 0 0 5 1 0 20 30 40 100 150 200 250 300 varying values of k varying length of reads 31 31 31 31 31

Conclusion and Future Work Main contribution - Combination of derivation of mismatching information and BWT indexes for k mismatching problem - Concept of mismatching trees - Extensive tests Future work - String matching with k differences - String matching with don t care symbols 32

Thank you! Thank you! 33

String Matching with Mismatches: An Innovative Approach

Download Presentation

Presentation Transcript

Related

More Related Content