Efficient Algorithms for Degenerate String Reconstruction from Cover Arrays

undefined
  Dipankar Ranjan Baisya,
  Mir Md. Faysal &
  M. Sohel Rahman
  
  
CSE, BUET
  
Dhaka 1000
Degenerate String Reconstruction
from Cover Arrays (Extended
Abstract)
1
M
o
t
i
v
a
t
i
o
n
Extensive use of degenerate string in
molecular biology.
o
used to model biological sequences. (e.g.,
FASTA format for representing either
nucleotide sequences or peptide sequences
)
o
very efficient in expressing polymorphism in
biological sequences. 
(e.g., Single Nucleotide
Polymorphism (SNP)
)
2
G
o
a
l
 
o
f
 
t
h
e
 
P
a
p
e
r
Present efficient algorithms to  reconstruct a
degenerate string from a valid cover array.
We have  derived two algorithms one using
unbounded alphabet and other using minimum
sized alphabet for that purpose.
3
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
Let        = { A, C, G, T }
4
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
Let        = { A, C, G, T }
Then we can get 2^4 -1 = 15 non-empty sets of
letters.
5
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
Let        = { A, C, G, T }
Then we can get 2^4 -1 = 15 non-empty sets of
letters.
At each position of an degenerate string we
have one of those sets.
6
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
7
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
 
E
x
a
m
p
l
e
8
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
:
 
E
q
u
a
l
i
t
y
/
M
a
t
c
h
9
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
:
 
E
q
u
a
l
i
t
y
/
M
a
t
c
h
10
10
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
 
S
t
r
i
n
g
We say that a string S covers a string U if
every letter of U is contained in some
occurrence of S as a substring of U. Then S is
called a cover of U.
11
11
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
 
S
t
r
i
n
g
We say that a string S covers a string U if
every letter of U is contained in some
occurrence of S as a substring of U. Then S is
called a cover of U.
The cover array C of a regular string X[1..n], is
a data structure used to store the length of the
longest proper cover of every prefix of X.
12
12
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
 
S
t
r
i
n
g
13
13
so for all                         stores the length of the
longest  cover of  X[1..i] or 0.
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
 
S
t
r
i
n
g
14
14
so for all                         stores the length of the
longest  cover of  X[1..i] or 0.
since every cover of any cover of X is also a
cover of X, it turns out that, the cover array C
compactly describes all the covers of every
prefix of X.
E
x
p
.
 
o
f
 
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
s
t
r
i
n
g
Table 2 : Cover array of abaababaabaababaaba
15
15
E
x
p
.
 
o
f
 
C
o
v
e
r
 
A
r
r
a
y
 
o
f
 
R
e
g
u
l
a
r
s
t
r
i
n
g
From Table 2 we can see that, cover array of
X, has the entries
  representing the three covers of X having length
11, 6 and 3 respectively.
16
16
C
o
v
e
r
 
A
r
r
a
y
 
f
o
r
 
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
For a degenerate string,      stores the list of
   the lengths of the covers of          .
17
17
C
o
v
e
r
 
A
r
r
a
y
 
f
o
r
 
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
For a degenerate string,      stores the list of
the lengths of the covers of          .
More elaborately, each      is a   list            such
that                   where          denotes the p-th
largest  cover of X[1..i].
18
18
E
x
a
m
p
l
e
 
o
f
 
C
o
v
e
r
 
A
r
r
a
y
 
f
o
r
D
e
g
e
n
e
r
a
t
e
 
S
t
r
i
n
g
     Table 3. Cover Array of aba[ab][ab]a
19
19
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
for            we define          as
 of  a  Cover  of                  .
20
20
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
for            we define           as
 of  a  Cover  of                  .
candidate-lengths of covers of                 is
21
21
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
22
22
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
Further we assume that a symbol
    is required by          .
k
23
23
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
Further we assume that a symbol
   is required by         .
 
Notably 
                         .
k
24
24
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
Lemma 1.  Suppose           is the cover array of
a  string             .  If           , then
25
25
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
Lemma 1.  Suppose           is the cover array of
a  string             .  If           , then
Proof. proof will be provided in the journal
version.
26
26
B
a
s
i
c
 
V
a
l
i
d
a
t
i
o
n
 
o
f
 
a
 
C
o
v
e
r
 
A
r
r
a
y
Theorem 2.
Proof. proof will be provided in the journal
version.
27
27
O
u
r
 
A
l
g
o
r
i
t
h
m
s
28
28
O
u
r
 
A
l
g
o
r
i
t
h
m
s
29
29
C
r
A
y
D
S
R
U
n
we denote by         the new set of characters
introduced in      , i.e.,                           .
30
30
C
r
A
y
D
S
R
U
n
we denote by        the new set of characters
introduced in       , i.e.,                          .
we denote by       the set of symbols  that are
not allowed only at Position    , i.e.,
31
31
C
r
A
y
D
S
R
U
n
we denote by         the  new set of characters
introduced in      , i.e.,                          .
we denote by       the set of symbols  that are
not allowed only at Position    , i.e.,
     denotes the  set of symbols that can be
assigned to           .
32
32
C
r
A
y
D
S
R
U
n
Lemma 3.
Proof. proof will be provided in the journal
version.
33
33
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
Suppose, we have a cover array        where       is
the product of maximum string length     and
maximum list length  of cover array        .
34
34
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
Suppose, we have a cover array       where     is
the product of maximum string length     and
maximum list length  of cover array        .
               so,
35
35
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
step 1,  validity checking.
36
36
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
step 1,  validity checking.
step 2,  find     .
37
37
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
step 1,  validity checking.
step 2,  find     .
step 3, find       .
38
38
C
r
A
y
D
S
R
U
n
:
 
B
a
s
i
c
 
S
t
e
p
s
step 1,  validity checking.
step 2,  find     .
step 3, find       .
step 4, place new character in proper positions.
39
39
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
1
scan/proceed one position at a time from left to
right i.e., in an online fashion.
check validity at each position of input array
according to  “Basic Validation” section  of a
cover array.
40
40
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
1
as soon as it finds a violation of validity the
algorithms returns the input array as invalid.
for further validation see Lemma 4 and Lemma
5 (Slide no. 45 and 46).
41
41
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
2
for each            ,  our algorithm finds        for
each                   .
                                                .
42
42
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
2
for each            ,  our algorithm finds        for
each                    .
                                              .
our algorithm finds       with the help of Lemma
4 and Lemma 5.(see next slides)
43
43
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
2
Lemma 4.
Proof. proof will be provided in the journal
version.
44
44
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
2
Lemma 5.
Proof. proof  will be provided in the journal
version.
45
45
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
2
we  can see from Lemma 4 and Lemma 5 that
we need to perform intersection operation.
Finding intersection will take          .
so for each position step 2 takes              .
46
46
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
3
similar  to step 2.
find      using Lemma 3.(see slide no.34)
for that our algorithms finds a set of
    in a similar way of finding       in previous step
    except that instead of               there will  be
            iterations for each         .
47
47
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
3
similar  to step 2.
find      using Lemma 3.(see slide no.34)
for that our algorithms finds a set of
    in a similar way of finding       in previous step
    except that instead of               there will  be
            iterations for each         .
then                                    .
48
48
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
3
49
49
 so for each position step 3 takes             .
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
4
if                   where
         then
50
50
C
r
A
y
D
S
R
U
n
:
 
S
t
e
p
 
4
if                   where
         then
if                 then  new characters will be placed
in proper positions.
51
51
C
r
A
y
D
S
R
U
n
:
 
R
u
n
n
i
n
g
 
T
i
m
e
for each position i,
Construction of
Construction of
 Construction of
Construction of
52
52
C
r
A
y
D
S
R
U
n
:
 
R
u
n
n
i
n
g
 
T
i
m
e
for each position i,
Construction of
Construction of
 Construction of
Construction of
so  for       positions, we need
53
53
C
r
A
y
D
S
R
i
n
Lemma 6. for every degenerate  string            if
               is the only entry in      , then
and
Proof. proof will be provided in the journal
version.
54
54
C
r
A
y
D
S
R
i
n
works exactly like                      except it
computes       slight  differently.
55
55
C
r
A
y
D
S
R
i
n
works exactly like                      except it
computes       slight  differently.
it computes        following Lemmas 3.a and 6
(instead  of Lemma 3.b).
56
56
C
r
A
y
D
S
R
i
n
:
 
R
u
n
n
i
n
g
 
T
i
m
e
Runtime analysis follows readily from   analysis
of                     .
only extra calculation                    which can be
done in             .
57
57
C
r
A
y
D
S
R
i
n
:
 
R
u
n
n
i
n
g
 
T
i
m
e
Runtime analysis follows readily from   analysis
of                     .
only extra calculation                    which can be
done in             .
therefore overall runtime complexity             .
58
58
 
                       
Thank you
59
59
Slide Note
Embed
Share

Presented in this extended abstract are efficient algorithms for reconstructing a degenerate string from a valid cover array. The paper discusses the extensive use of degenerate strings in molecular biology for modeling biological sequences and expressing polymorphism. It introduces two algorithms—one utilizing an unbounded alphabet and the other using a minimum-sized alphabet to achieve the reconstruction. The concept of degenerate strings, their sets of letters, and examples are detailed, along with the definition of cover arrays in regular strings.

  • Algorithms
  • Degenerate String
  • Cover Arrays
  • Molecular Biology

Uploaded on Sep 17, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Degenerate String Reconstruction from Cover Arrays (Extended Abstract) Dipankar Ranjan Baisya, Mir Md. Faysal & M. Sohel Rahman CSE, BUET Dhaka 1000 1

  2. Motivation Extensive use of degenerate string in molecular biology. o used to model biological sequences. (e.g., FASTA format for representing either nucleotide sequences or peptide sequences) o very efficient in expressing polymorphism in biological sequences. (e.g., Single Nucleotide Polymorphism (SNP)) 2

  3. Goal of the Paper Present efficient algorithms to reconstruct a degenerate string from a valid cover array. We have derived two algorithms one using unbounded alphabet and other using minimum sized alphabet for that purpose. 3

  4. Degenerate String Let = { A, C, G, T } 4

  5. Degenerate String Let = { A, C, G, T } Then we can get 2^4 -1 = 15 non-empty sets of letters. 5

  6. Degenerate String Let = { A, C, G, T } Then we can get 2^4 -1 = 15 non-empty sets of letters. At each position of an degenerate string we have one of those sets. 6

  7. Degenerate String 7

  8. Degenerate String Example 8

  9. Degenerate String: Equality/Match 9

  10. Degenerate String: Equality/Match 10

  11. Cover Array of Regular String We say that a string S covers a string U if every letter of U is contained in some occurrence of S as a substring of U. Then S is called a cover of U. 11

  12. Cover Array of Regular String We say that a string S covers a string U if every letter of U is contained in some occurrence of S as a substring of U. Then S is called a cover of U. The cover array C of a regular string X[1..n], is a data structure used to store the length of the longest proper cover of every prefix of X. 12

  13. Cover Array of Regular String so for all stores the length of the longest cover of X[1..i] or 0. 13

  14. Cover Array of Regular String so for all stores the length of the longest cover of X[1..i] or 0. since every cover of any cover of X is also a cover of X, it turns out that, the cover array C compactly describes all the covers of every prefix of X. 14

  15. Exp. of Cover Array of Regular string Table 2 : Cover array of abaababaabaababaaba 15

  16. Exp. of Cover Array of Regular string From Table 2 we can see that, cover array of X, has the entries representing the three covers of X having length 11, 6 and 3 respectively. 16

  17. Cover Array for Degenerate String For a degenerate string, the lengths of the covers of stores the list of . 17

  18. Cover Array for Degenerate String For a degenerate string, the lengths of the covers of stores the list of . More elaborately, each is a list such that where denotes the p-th largest cover of X[1..i]. 18

  19. Example of Cover Array for Degenerate String Table 3. Cover Array of aba[ab][ab]a 19

  20. Basic Validation of a Cover Array for we define as of a Cover of . 20

  21. Basic Validation of a Cover Array for we define as of a Cover of . candidate-lengths of covers of is 21

  22. Basic Validation of a Cover Array 22

  23. Basic Validation of a Cover Array Further we assume that a symbol is required by . 23

  24. Basic Validation of a Cover Array Further we assume that a symbol is required by . Notably . 24

  25. Basic Validation of a Cover Array Lemma 1. Suppose is the cover array of a string . If , then 25

  26. Basic Validation of a Cover Array Lemma 1. Suppose is the cover array of a string . If , then Proof. proof will be provided in the journal version. 26

  27. Basic Validation of a Cover Array Theorem 2. Proof. proof will be provided in the journal version. 27

  28. Our Algorithms 28

  29. Our Algorithms 29

  30. CrAyDSRUn we denote by the new set of characters introduced in , i.e., . 30

  31. CrAyDSRUn we denote by the new set of characters introduced in , i.e., . we denote by the set of symbols that are not allowed only at Position , i.e., 31

  32. CrAyDSRUn we denote by the new set of characters introduced in , i.e., . we denote by the set of symbols that are not allowed only at Position , i.e., denotes the set of symbols that can be assigned to . 32

  33. CrAyDSRUn Lemma 3. Proof. proof will be provided in the journal version. 33

  34. CrAyDSRUn: Basic Steps Suppose, we have a cover array where is the product of maximum string length and maximum list length of cover array . 34

  35. CrAyDSRUn: Basic Steps Suppose, we have a cover array where is the product of maximum string length and maximum list length of cover array . so, 35

  36. CrAyDSRUn: Basic Steps step 1, validity checking. 36

  37. CrAyDSRUn: Basic Steps step 1, validity checking. step 2, find . 37

  38. CrAyDSRUn: Basic Steps step 1, validity checking. step 2, find . step 3, find . 38

  39. CrAyDSRUn: Basic Steps step 1, validity checking. step 2, find . step 3, find . step 4, place new character in proper positions. 39

  40. CrAyDSRUn: Step 1 scan/proceed one position at a time from left to right i.e., in an online fashion. check validity at each position of input array according to Basic Validation section of a cover array. 40

  41. CrAyDSRUn: Step 1 as soon as it finds a violation of validity the algorithms returns the input array as invalid. for further validation see Lemma 4 and Lemma 5 (Slide no. 45 and 46). 41

  42. CrAyDSRUn: Step 2 for each , our algorithm finds for each . . 42

  43. CrAyDSRUn: Step 2 for each , our algorithm finds for each . . our algorithm finds with the help of Lemma 4 and Lemma 5.(see next slides) 43

  44. CrAyDSRUn: Step 2 Lemma 4. Proof. proof will be provided in the journal version. 44

  45. CrAyDSRUn: Step 2 Lemma 5. Proof. proof will be provided in the journal version. 45

  46. CrAyDSRUn: Step 2 we can see from Lemma 4 and Lemma 5 that we need to perform intersection operation. Finding intersection will take . so for each position step 2 takes . 46

  47. CrAyDSRUn: Step 3 similar to step 2. find using Lemma 3.(see slide no.34) for that our algorithms finds a set of in a similar way of finding except that instead of there will be iterations for each . in previous step 47

  48. CrAyDSRUn: Step 3 similar to step 2. find using Lemma 3.(see slide no.34) for that our algorithms finds a set of in a similar way of finding except that instead of there will be iterations for each . in previous step then . 48

  49. CrAyDSRUn: Step 3 so for each position step 3 takes . 49

  50. CrAyDSRUn: Step 4 if where then 50

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#