Closed Factorization

undefined
Closed Factorization
Golnaz Badkobeh
1
, Hideo Bannai
2
, Keisuke Goto
2
,
Tomohiro I
2
, Costas S. 
Iliopoulos
3
, Shunsuke Inenaga
2
,
Simon J. Puglisi
4
, and 
Shiho Sugimoto
2
1.
University of Sheffield, United Kingdom
2.
Kyushu University, Japan
3.
King’s College London, United Kingdom
4.
University of Helsinki, Finland
A 
closed string
 is a string with a proper substring that occurs
as a prefix and a suffix but does not have internal occurrences
[Fici, 2011].
Closed Strings
 
a  b  c
  a  b  a  c  a  c  b  a  
a  b  c
 
a  a  a  a  a  a  a  a  a  a  a  a  a
  a
Closing border
A 
closed string
 is a string with a proper substring that occurs
as a prefix and a suffix but does not have internal occurrences
[Fici, 2011].
A string of length 1 is closed, where the closing border is
the empty string 
ε
.
A closed string has a unique closing border.
Closed Strings
a  b  c
  a  b  a  c  a  c  b  a  
a  b  c
a  a  a  a  a  a  a  a  a  a  a  a  a
  a
Closing border
We introduce the 
Longest Closed Factor Array
 of a string 
w
and an algorithm which computes it in 
O(
n
 log 
n / 
loglog 
n
)
time and O(
n
) space.
n
 is the length of 
w
.
We introduce the 
Closed Factorization
 of a string 
w
 and the
algorithm which compute it in 
O(
n
) time and space.
n
 is the length of 
w
.
Our Contribution
Definition of Longest Closed Factor Array
 
1
 
2
 
3
 
4
 
5
 
6
 
7
 
8
 
9
 
10
 
11
 
12
 
13
 
14
 
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1
 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Definition of Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1
 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Definition of Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1
 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Definition of Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1
 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Definition of Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1
 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Definition of Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
The longest closed factor array of 
w
 of length 
n
 is an array
A[1..
n
] of integers such that for any 1 ≤ 
i
n
, A[
i
] = 
l
 if and
only if 
l
 is the length of the longest closed prefix of 
w
[
i
..
n
].
Theorem 1
Given a string 
w
 of length 
n
 over an integer alphabet,
the closed factor array of 
w
 can be computed in
O(
n
 log 
n / 
loglog 
n
) time and O(
n
) space.
Computing Longest Closed Factor Array
Lemma 1
The longest prefix
 
of 
w
[
i
..
n
] which has another occurrence to
the right of 
i
, is the closing border of the longest closed factor
starting at 
i
.
Computing Longest Closed Factor Array
 
1
 
2
 
3
 
4
 
5
 
6
 
7
 
8
 
9
 
10
 
11
 
12
 
13
 
14
 
15
Lemma 1
The longest prefix
 
of 
w
[
i
..
n
] which has another occurrence to
the right of 
i
, is the closing border of the longest closed factor
starting at 
i
.
Computing Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Lemma 1
The longest prefix
 
of 
w
[
i
..
n
] which has another occurrence to
the right of 
i
, is the closing border of the longest closed factor
starting at 
i
.
Computing Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Lemma 1
The longest prefix
 
of 
w
[
i
..
n
] which has another occurrence to
the right of 
i
, is the closing border of the longest closed factor
starting at 
i
.
Computing Longest Closed Factor Array
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1.
Construct and preprocess the suffix tree of 
w
.
2.
 
i  
    1.
3.
Compute the closing border 
b
i
 starting at position 
i
.
with the suffix tree constructed in Step 1
4.
Find the leftmost occurrence 
j
 of 
b
i
 in 
w
[
i
+1..
n
].
with a range successor query
5.
A[
i
]       
j
 + |
b
i
|
i
.
6.
 
i
      
i
 +1.
7.
Repeat Steps 3~5 until 
i
 = 
n
.
Outline of Our Algorithm
Construct the suffix tree of a given string 
w
.
Each leaf of the suffix tree stores the beginning
position of the suffix corresponding to the leaf.
Any internal node 
v
 of the suffix tree is labeled by the
maximum leaf value in the subtree rooted at 
v
.
Step 1
a
a
a
b
a
$
b
$
$
a
b
a
$
$
a
b
a
b
a
$
$
1
2
3
4
5
6
7
 
4
 
6
 
5
w
 = 
abaaba$
 
SA
Outline of Our Algorithm
1.
Construct and preprocess the suffix tree of 
w
.
2.
 
i  
    1.
3.
Compute the closing border 
b
i
 starting at position 
i
.
with the suffix tree constructed in Step 1
4.
Find the leftmost occurrence 
j
 of 
b
i
 in 
w
[
i
+1..
n
].
with a range successor query
5.
A[
i
]       
j
 + |
b
i
|
i
.
6.
 
i
      
i
 +1.
7.
Repeat Steps 3~5 until 
i
 = 
n
.
Compute the closing border 
b
i
 starting at
position 
i
.
Find the highest node 
x 
labeled 
i.
The path from the root to the parent of 
x
is the closing border of longest closed factor
 starting at position 
i
.
Step 3
a
a
a
b
a
$
b
$
$
a
b
a
$
$
a
b
a
b
a
$
$
1
2
3
4
5
6
7
4
6
5
Suffix Tree of 
abaaba$
Step 3
root
i
t
i
x
 
u
w
i
pathlabel(
x
)
pathlabel(
u
)
Suffix Tree of 
w
a
x 
: the highest node labeled 
i
t
u 
: the parent of 
x
t
How do we find node 
x
?
Compute the closing border 
b
i
 starting at position 
i
.
Find the highest node 
x 
labeled 
i.
Traverse the suffix tree from the 
root.
O(|
x
|) time for a constant alphabet.
O(|
x
| log 
n
) time for an integer alphabet.
An array P[1..
n
] enables us to find node 
x 
in O(1) time.
P[
i
] contains a pointer to node 
x 
in the tree for which 
i 
is the
maximum leaf value.
P can be computed in O(
n
) time with pre-order traversing.
Step 3
Outline of Our Algorithm
1.
Construct and preprocess the suffix tree of 
w
.
2.
 
i  
    1.
3.
Compute the closing border 
b
i
 starting at position 
i
.
with the suffix tree constructed in Step 1
4.
Find the leftmost occurrence 
j
 of 
b
i
 in 
w
[
i
+1..
n
].
with a range successor query
5.
A[
i
]       
j
 + |
b
i
|
i
.
6.
 
i
      
i
 +1.
7.
Repeat Steps 3~5 until 
i
 = 
n
.
Step 4
root
i
i
h
x
 
u
G
Suffix Tree of 
w
a
x 
: the highest node labeled 
i
t
t
w
i
pathlabel(
x
)
pathlabel(
u
)
t
h
u 
: the parent of 
x
h
 is the successor
of 
i
 in
 the set of
the leaf values.
Compute the longest closed factor starting at position 
i
.
Use a range successor query data structure for the suffix
array [Yu et al., 2011].
Each internal node 
v
  stores the beginning and ending
positions of the corresponding range in the suffix array.
Step 4
 
a
 
a
 
a
 
b
 
a
 
$
 
b
 
$
 
$
 
a
 
b
 
a
 
$
 
$
 
a
 
b
 
a
 
b
 
a
 
$
 
$
 
1
 
2
 
3
 
4
 
5
 
6
 
7
 
4
 
6
 
5
 
Suffix Tree of   
a b a a b a $
 
1
 
2
 
3
 
4
 
5
 
6
 
7
Compute the longest closed factor starting at position 
i
.
Use a range successor query data structure for the suffix
array [Yu et al., 2011].
Each internal node 
v
  stores the beginning and ending
positions of the corresponding range in the suffix array.
Range successor query need O(
log 
n / 
loglog 
n
) time for
each position 
i
.
Step 4
a
a
a
b
a
$
b
$
$
a
b
a
$
$
a
b
a
b
a
$
$
1
2
3
4
5
6
7
4
6
5
Suffix Tree of   
a b a a b a $
1
2
3
4
5
6
7
Given a string 
w
 of length 
n
 over an integer alphabet,
the closed factor array of 
w
 can be computed in
O(
n
 log 
n / 
loglog 
n
) time and O(
n
) space.
Our Result 1
The 
closed factorization
 of string 
w
 of length 
n
 is a sequence
(
G
0
,G
1
,…,G
k
) of strings such that G
0
 = ε, 
w
 = G
1
…G
k
 and, for
each 1 ≤ 
j 
k
, G
j
 is the longest closed prefix of 
w
[|G
1
…G
j
-
1
|+1..
n
].
Definition of Closed Factorization
 
a  b  a  b  a  a  c  b  b  b  c  b  c  c  $
 
1
 
2
 
3
 
4
 
5
 
6
 
7
 
8
 
9
 
10
 
11
 
12
 
13
 
14
 
15
Theorem 2
Given a string 
w
 of length 
n
 over an integer alphabet,
the closed factorization of 
w
 can be computed in
O(
n
) time and space.
Computing Closed Factorization
1.
Construct and preprocess the suffix tree of 
w
.
2.
 
i
      1.
3.
Compute the closing border 
b
i
 starting at position 
i
.
with the suffix tree constructed in Step 1
4.
Find the leftmost occurrence 
j
 of 
b
i
 in 
w
[
i
+1..
n
].
with the KMP algorithm
Stop the KMP algorithm as soon as 
j
 is found.
5.
 
i
      
j 
+ |
b
i
|
.
6.
Repeat Steps 3~5 until 
i
 = 
n
.
Outline of Our Algorithm
We can compute each factor G
j
 
in O(|G
j
|) time with the KMP
algorithm.
Because the sum of the lengths of all factors is 
n,
 the total
time to compute the closed factorization is O(
n
).
Algorithm of Closed Factorization
Given a string 
w
 of length 
n
 over an integer alphabet,
the closed factorization of 
w
 can be computed in
O(
n
) time and space.
Our Result 2
We introduced the Longest Closed Factor Array of a string and
proposed an algorithm which computes it in 
O(
n
 log 
n / 
loglog
n
) time and O(
n
) space.
We introduced the Closed Factorization of a string and
proposed an algorithm which computes it in 
O(
n
) time and
space.
Conclusion
Can we efficiently compute the longest closed factor
array withou
t range successor queries?
Can we find the longest closed factor containing each
position without the longest closed factor array?
Open Problems
Slide Note
Embed
Share

This document delves into the concept of closed factorization and closed strings, exploring the definitions, contributions, and algorithms associated with them. It introduces the Longest Closed Factor Array and its computation, along with explanations of closed strings and their unique properties. The document also defines the Longest Closed Factor Array in detail, providing insights into its significance and applications.

  • Closed Factorization
  • Strings Analysis
  • Algorithm
  • Longest Closed Factor Array
  • Substring

Uploaded on Feb 25, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Closed Factorization Golnaz Badkobeh1, Hideo Bannai2, Keisuke Goto2, Tomohiro I2, Costas S. Iliopoulos3, Shunsuke Inenaga2, Simon J. Puglisi4, and Shiho Sugimoto2 1. 2. 3. 4. University of Sheffield, United Kingdom Kyushu University, Japan King s College London, United Kingdom University of Helsinki, Finland Everything is String.

  2. Closed Strings A closed string is a string with a proper substring that occurs as a prefix and a suffix but does not have internal occurrences [Fici, 2011]. a b c a b a c a c b a a b c a a a a a a a a a a a a a a Closing border Everything is String.

  3. Closed Strings A closed string is a string with a proper substring that occurs as a prefix and a suffix but does not have internal occurrences [Fici, 2011]. a b c a b a c a c b a a b c a a a a a a a a a a a a a a Closing border A string of length 1 is closed, where the closing border is the empty string . A closed string has a unique closing border. Everything is String.

  4. Our Contribution We introduce the Longest Closed Factor Array of a string w and an algorithm which computes it in O(n log n / loglog n) time and O(n) space. n is the length of w. We introduce the Closed Factorization of a string w and the algorithm which compute it in O(n) time and space. n is the length of w. Everything is String.

  5. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a b a b a a c b b b c b c c $ Everything is String.

  6. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a A = b a b a a c b b b c b c c $ Everything is String.

  7. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a A = 5 b a b a a c b b b c b c c $ Everything is String.

  8. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a A = 5 b a b a a c b b b c b c c $ Everything is String.

  9. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a A = 5 b 4 a b a a c b b b c b c c $ Everything is String.

  10. Definition of Longest Closed Factor Array The longest closed factor array of w of length n is an array A[1..n] of integers such that for any 1 i n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a A = 5 b 4 a 3 b 5 a 2 a 1 c 6 b 3 b 2 b 4 c 3 b 1 c 2 c 1 $ 1 Everything is String.

  11. Computing Longest Closed Factor Array Theorem 1 Given a string w of length n over an integer alphabet, the closed factor array of w can be computed in O(n log n / loglog n) time and O(n) space. Everything is String.

  12. Computing Longest Closed Factor Array Lemma 1 The longest prefix of w[i..n] which has another occurrence to the right of i, is the closing border of the longest closed factor starting at i. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a b a b a a c b b b c b c c $ Everything is String.

  13. Computing Longest Closed Factor Array Lemma 1 The longest prefix of w[i..n] which has another occurrence to the right of i, is the closing border of the longest closed factor starting at i. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a b a b a a c b b b c b c c $ Everything is String.

  14. Computing Longest Closed Factor Array Lemma 1 The longest prefix of w[i..n] which has another occurrence to the right of i, is the closing border of the longest closed factor starting at i. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a b a b a a c b b b c b c c $ Everything is String.

  15. Computing Longest Closed Factor Array Lemma 1 The longest prefix of w[i..n] which has another occurrence to the right of i, is the closing border of the longest closed factor starting at i. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 w = a b a b a a c b b b c b c c $ Everything is String.

  16. Outline of Our Algorithm 1. Construct and preprocess the suffix tree of w. 2. i 1. 3. Compute the closing border bistarting at position i. with the suffix tree constructed in Step 1 4. Find the leftmost occurrence j of biin w[i+1..n]. with a range successor query 5. A[i] j + |bi| i. 6. i i +1. 7. Repeat Steps 3~5 until i = n. Everything is String.

  17. Step 1 Construct the suffix tree of a given string w. Each leaf of the suffix tree stores the beginning position of the suffix corresponding to the leaf. Any internal node v of the suffix tree is labeled by the maximum leaf value in the subtree rooted at v. w = abaaba$ a b a 6 b a a ab $ $ $ 5 $ a aba$ 6 4 b $ a $ SA 5 2 3 4 1 7 Everything is String.

  18. Outline of Our Algorithm 1. Construct and preprocess the suffix tree of w. 2. i 1. 3. Compute the closing border bistarting at position i. with the suffix tree constructed in Step 1 4. Find the leftmost occurrence j of biin w[i+1..n]. with a range successor query 5. A[i] j + |bi| i. 6. i i +1. 7. Repeat Steps 3~5 until i = n. Everything is String.

  19. Step 3 Compute the closing border bistarting at position i. Find the highest node x labeled i. The path from the root to the parent of x is the closing border of longest closed factor starting at position i. a b a 6 b a a ab $ $ $ 5 $ a aba$ 6 4 b $ a $ Suffix Tree of abaaba$ 5 2 3 4 1 7 Everything is String.

  20. Step 3 root Suffix Tree of w i t w u t pathlabel(x) a pathlabel(u) x i How do we find node x? t i x : the highest node labeled i u : the parent of x Everything is String.

  21. Step 3 Compute the closing border bistarting at position i. Find the highest node x labeled i. Traverse the suffix tree from the root. O(|x|) time for a constant alphabet. O(|x| log n) time for an integer alphabet. An array P[1..n] enables us to find node x in O(1) time. P[i] contains a pointer to node x in the tree for which i is the maximum leaf value. P can be computed in O(n) time with pre-order traversing. Everything is String.

  22. Outline of Our Algorithm 1. Construct and preprocess the suffix tree of w. 2. i 1. 3. Compute the closing border bistarting at position i. with the suffix tree constructed in Step 1 4. Find the leftmost occurrence j of biin w[i+1..n]. with a range successor query 5. A[i] j + |bi| i. 6. i i +1. 7. Repeat Steps 3~5 until i = n. Everything is String.

  23. Step 4 root Suffix Tree of w i h t w u t pathlabel(x) a pathlabel(u) x i G h is the successor of i in the set of the leaf values. t h i x : the highest node labeled i u : the parent of x Everything is String.

  24. Step 4 Compute the longest closed factor starting at position i. Use a range successor query data structure for the suffix array [Yu et al., 2011]. Each internal node v stores the beginning and ending positions of the corresponding range in the suffix array. a b a 6 b a a ab $ 1 2 3 4 5 6 7 Suffix Tree of a b a a b a $ $ $ 5 $ a aba$ 6 4 b $ a $ 5 2 3 4 1 7 Everything is String.

  25. Step 4 Compute the longest closed factor starting at position i. Use a range successor query data structure for the suffix array [Yu et al., 2011]. Each internal node v stores the beginning and ending positions of the corresponding range in the suffix array. Range successor query need O(log n / loglog n) time for each position i. a b a 6 b a a ab $ 1 2 3 4 5 6 7 Suffix Tree of a b a a b a $ $ $ 5 $ a aba$ 6 4 b $ a $ 5 2 3 4 1 7 Everything is String.

  26. Our Result 1 Given a string w of length n over an integer alphabet, the closed factor array of w can be computed in O(n log n / loglog n) time and O(n) space. Everything is String.

  27. Definition of Closed Factorization The closed factorization of string w of length n is a sequence (G0,G1, ,Gk) of strings such that G0= , w = G1 Gkand, for each 1 j k, Gjis the longest closed prefix of w[|G1 Gj- 1|+1..n]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 a b a b a a c b b b c b c c $ Everything is String.

  28. Computing Closed Factorization Theorem 2 Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space. Everything is String.

  29. Outline of Our Algorithm 1. Construct and preprocess the suffix tree of w. 2. i 1. 3. Compute the closing border bistarting at position i. with the suffix tree constructed in Step 1 4. Find the leftmost occurrence j of biin w[i+1..n]. with the KMP algorithm Stop the KMP algorithm as soon as j is found. 5. i j + |bi|. 6. Repeat Steps 3~5 until i = n. Everything is String.

  30. Algorithm of Closed Factorization We can compute each factor Gjin O(|Gj|) time with the KMP algorithm. Because the sum of the lengths of all factors is n, the total time to compute the closed factorization is O(n). Everything is String.

  31. Our Result 2 Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space. Everything is String.

  32. Conclusion We introduced the Longest Closed Factor Array of a string and proposed an algorithm which computes it in O(n log n / loglog n) time and O(n) space. We introduced the Closed Factorization of a string and proposed an algorithm which computes it in O(n) time and space. Everything is String.

  33. Open Problems Can we efficiently compute the longest closed factor array without range successor queries? Can we find the longest closed factor containing each position without the longest closed factor array? Everything is String.

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#