Trie-based Entity Extraction Framework for Dirty Real-World Data

Dong Deng     (Tsinghua, China)
Guoliang Li    (Tsinghua, China)
Jianhua Feng  (Tsinghua, China)
 
Outline
 
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
2
/42
Dictionary-based NER
Named Entity Recognition
10/9/2024
Taste @ ICDE2012
Isaac Newton
Sigmund Freud
English
Austrian
physicist
mathematician
astronomer
philosopher
alchemist
theologian
psychiatrist
economist
historian
sociologist
...
1 Sir Isaac Newton was an English physicist,
mathematician, astronomer, natural philosopher,
alchemist, and theologian and one of the most
influential men in human history. His Philosophiæ
Naturalis Principia Mathematica, published in 1687, is
by itself considered to be among the most influential
books in the history of science, laying the groundwork
for most of classical mechanics.
2 Sigmund Freud was an Austrian psychiatrist who
founded the psychoanalytic school of psychology. Freud
is best known for his theories of the unconscious mind
and the defense mechanism of repression and for
creating the clinical practice of psychoanalysis for
curing psychopathology through dialogue between a
patient and a psychoanalyst.
Dictionary of Entities
Documents
3
/42
Wikipedia
http://en.wikipedia.org/wiki/Levenshtein_distance
Automatically add the links
10/9/2024
Taste @ ICDE2012
4
/42
Typo in “author”
Typo in “title”
rela
x
ed
rela
t
ed
Argyri
o
s Zymnis
Argyris Zymnis
DBLP Complete Search
10/9/2024
Real-world Data is Rather Dirty
Taste @ ICDE2012
5
/42
Approximate dictionary-based entity extraction
 finds all substrings
from the document that approximately match the predefined entities.
For example:
Approximate Entity Extraction
10/9/2024
Taste @ ICDE2012
Isaac Newton
Sigmund Freud
physicist
astronomer
alchemist
theologian
economist
sociologist           ...
Sigmund Freund was an Austrian psychiatrest who
founded the psychoanalytic school of psychology.
Freud is best known for his theories of the
unconscious mind and the defense mechanism of
repression and for creating the clinical practice of
psychoanalysis for curing psychopathology
through dialogue between a patient and a
psychoanalayst.
Dictionary of Entities
Documents
6
/42
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
7
/42
Problem Formulation
10/9/2024
Taste @ ICDE2012
Approximate Entity Extraction: 
Given a dictionary of entities
E = {e
1
, e
2
, . . . , e
n
}, a document D, and a predefined edit distance
threshold 
τ
, approximate entity extraction finds all “
similar” 
pairs
<s, e
i
> such that  ED(s, ei) ≤ 
τ
, where s is a substring of D and 
ei
 E
.
8
/42
ED(
r, s
): The minimum number of single-character edit
operations(insertion/deletion/substitution) to transform 
r
 to 
s
.
For example:   ED(
marios, 
maras) = 2
D
i,0 
= i,     D
0,j 
= j,
D
i,j
 = min{D
i-1, j 
+ 1, D
i, j-1
 + 1,
                 D
i-1, j-1
 + t
i, j
},
                        0 if a
i
 = b
j
                        1 if a
i
 
 b
j
.
Edit Distance
10/9/2024
Taste @ ICDE2012
9
/42
where   t
i, j
 =
State-of-the-art Methods
Shortage:
Large Index Size
Need to Tune Parameters
Not efficient for large threshold
10/9/2024
Taste @ ICDE2012
10
/42
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
11
/42
Observation
Set 
τ
=2                     entity:
vankatesh
10/9/2024
Taste @ ICDE2012
12
/42
Split the entity into 
τ
+1=3 segments
A substring in document
d
o
c
u
m
e
n
t
:
 
v
o
n
c
o
u
v
e
r
Observation
Set 
τ
=2                  entity:
vankatesh
10/9/2024
Taste @ ICDE2012
13
/42
>= 1
edit operation
>= 1
edit operation
>= 1
edit operation
>= 
τ
 + 1 = 3
edit operation
NOT SIMILAR
d
o
c
u
m
e
n
t
:
 
v
o
n
c
o
u
v
e
r
Observation
Valid Substring:  
s
 is a valid substring if
       L
min
− τ <=|s|<= L
max
+ τ
          L
min
: the minimum entity length.
             
L
max
: 
the maximum entity length.
10/9/2024
Taste @ ICDE2012
14
/42
Trie-based Framework
Step1: Partition entities into segments using
even
 partition scheme.
10/9/2024
Taste @ ICDE2012
15
/42
Trie-based Framework
Step2: Index the segments using a trie structure
10/9/2024
Taste @ ICDE2012
16
/42
Trie-based Framework
 
Step3: From the document, find the matched
segments from the trie structure.
 
Baseline: Trie-search Method
3.1 Enumerator all valid substrings.
3.2 Find each suffix of every substring in the trie
structure to check if it can reach the leaf node.
3.3 Verify the candidate pairs.
10/9/2024
Taste @ ICDE2012
17
/42
Trie-search
 
Method
3.1 Enumerator all valid substrings.
10/9/2024
Taste @ ICDE2012
18
/42
 
kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
          if L
min
 - 
τ 
= 9-2=7  L
max
 +
 τ 
= 15+2=17
Trie-search
 
Method
3.1 Enumerator all valid substrings.
10/9/2024
Taste @ ICDE2012
19
/42
 
kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
          if L
min
 - 
τ 
= 9-2=7  L
max
 +
 τ 
= 15+2=17
Trie-search Method
3.1 Enumerator all valid substrings.
10/9/2024
Taste @ ICDE2012
20
/42
 
kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
          if L
min
 - 
τ 
= 9-2=7  L
max
 +
 τ 
= 15+2=17
Trie-search 
Method
3.1 Enumerator all valid substrings.
10/9/2024
Taste @ ICDE2012
21
/42
 
kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
          if L
min
 - 
τ 
= 9-2=7  L
max
 +
 τ 
= 15+2=17
Trie-search Method
3.2 
Find each suffix of every substring in the
trie structure to check if it can reach the leaf.
10/9/2024
Taste @ ICDE2012
22
/42
chaudhuri
Pruned
Trie-search Method
3.2 
Find each suffix of every substring in the
trie structure to check if it can reach the leaf.
10/9/2024
Taste @ ICDE2012
23
/42
surajit chaud
 
Candidate
of entity 3
Trie-search Method
3.3 
Verify the candidate pairs
10/9/2024
Taste @ ICDE2012
24
/42
ED(surajit chaud, surajit chaudri)=2
Trie-search Method
Shortage
Candidate pairs may overhead:
<surajit_chaudhuri, surajit_chaudri>
 < urajit_chaudhuri, surajit_chaudri>
 <surajit_chaudhur , surajit_chaudri>
                                    …...
10/9/2024
Taste @ ICDE2012
25
/42
Duplicate Computations
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
26
/42
Trie-based Algorithm
Step3: From the document, find the matched segments
from the trie structure.
To avoid duplicate computation, we do not enumerate all
valid substrings.
Trie-based Method
3.1 Search:  Search matched segments.
3.2 Extension: Extend the matched segments to find similar
pairs.
10/9/2024
Taste @ ICDE2012
27
/42
Trie-based Algorithm
3.1 Search: Search matched segments.
10/9/2024
Taste @ ICDE2012
28
/42
kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
Trie-based Algorithm
3.2 Extend the matched segments to find similar pairs.
10/9/2024
Taste @ ICDE2012
29
/42
D: kaush
it ch
ekrabarti, surajit chaudhuri, vankatesh ganti,
e
3
:   suraj
it ch
audri
e
4
: caush
it ch
audui
e
5
: caush
it ch
akrab
10/9/2024
Taste @ ICDE2012
30
/42
e
3
 
All Bigger
Than Two,
Terminate!
No Need to Compute This Range
10/9/2024
Taste @ ICDE2012
31
/42
e
4
 
Not Bigger
Than Two,
Extend The
Right Part!
 
All Bigger
Than Two,
Prune!
No Need to Compute This Range
 
Not Bigger
Than Two,
Extend The
Right Part!
10/9/2024
Taste @ ICDE2012
32
/42
e
5
 
Not Bigger
Than Two,
Get
Candidates!
 
Same as the
left part of e
4
Share Prefix!
 
Share Prefix!
No Need to Compute This Range
Trie-based Algorithm
3.2 Extend the matched segments to find similar pairs.
10/9/2024
Taste @ ICDE2012
33
/42
                             ush 
2
                           aush 
1
                         kaush 
1
2
 ekra
1 
 ekrab
2
 ekraba
it_ch
Trie-based Algorithm
3.2 Extend the matched segments to find similar pairs.
10/9/2024
Taste @ ICDE2012
34
/42
 
We get two results pairs:
<caushit chakrab,
 
aushit_chekrab>
<caushit chakrab
, 
kaushit_chekrab>
Trie-based Algorithm
Two-level Trie
10/9/2024
Taste @ ICDE2012
35
/42
 
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
36
/42
Obvservation
Even Partition:
10/9/2024
Taste @ ICDE2012
37
/42
vanateshe
van
she
ate
Obvservation
Even Partition:
10/9/2024
Taste @ ICDE2012
38
/42
vanateshe
van
she
ate
an efficient filter for approxim
ate
 member
she
p
checking. kaushit chekrabarti, surajit chaudhuri,
van
k
ate
sh ganti, dong xin. 
van
couver, canada.
sigmod 2008.
 
Extend 5 times.
        C=5
Obvservation
Uneven Partition:
10/9/2024
Taste @ ICDE2012
39
/42
vanateshe
vana
 he
tes
Obvservation
Uneven Partition:
10/9/2024
Taste @ ICDE2012
40
/42
vanateshe
vana
 he
tes
an efficient filter for approximate members
he
p
checking. kaushit chekrabarti, surajit chaudhuri,
vanka
tes
h ganti, dong xin. vancouver, canada.
sigmod 2008.
 
Extend 2 times.
        C=2
Optimize Object
10/9/2024
Taste @ ICDE2012
41
/42
Entity:    
c
1
   c
2    
c
3
   c
4
   … …  c
m-2
   c
m-1    
c
m
Document
g
1
g
2
…  …
g
τ
g
τ
+1
Wg
1
Wg
2
Wg
τ
Wg
τ
+1
Appear Time: 
Segments:
Optimize Object: C=M[
τ
+1][m].
M[i][j]: the minimum totle partition weight
to partition  c
1
c
2
 … c
j-1
c
j  
into i segments.
Dynamic Way:
10/9/2024
Taste @ ICDE2012
42
/42
If the start position of last partition is 
p,
Then:
Iterative we have
Determining 
Wg
Build A Suffix Trie.
10/9/2024
Taste @ ICDE2012
43
/42
The segment 
g
   The number of subtrie leaf nodes
=
 
The occurrence time in document = 
Wg
Pruning
Even VS. Dict+Doc Method
Even involves larger candidate set
Dict+Doc counts the indexing time
Make indexing more efficient
Using Segment Length to Do Pruning.
Using Even-Partition Weight as An Upper Bound.
Adding Extra Pointers On Suffix Trie.
10/9/2024
Taste @ ICDE2012
44
/42
Time Comlexity:
10/9/2024
Taste @ ICDE2012
45
/42
Time Comlexity:
10/9/2024
Taste @ ICDE2012
46
/42
Time Comlexity:
10/9/2024
Taste @ ICDE2012
47
/42
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
Reference
10/9/2024
Taste @ ICDE2012
48
/42
Experiment Setup
Three real Data sets
Existing algorithms
NGPP (downloaded from its hompage)
Faerie (we implemented)
10/9/2024
Taste @ ICDE2012
49
/42
Search-Extension VS Sort-Extension
10/9/2024
Taste @ ICDE2012
50
/42
SORT-EXTENSION was about 3-4 times faster than SEARCH-EXTENSION.
This is because SORT-EXTENSION shares the computation on the common
prefixes for different entities.
Partition : Even vs Dict+Doc
10/9/2024
Taste @ ICDE2012
51
/42
EVEN partition method may involve large numbers of candidates for large
thresholds and thus may lead to low performance.
Comparison with State-of-the-art Methods
10/9/2024
Taste @ ICDE2012
Faerie VS NGPP 
52
/42
Scalability with Dictionary Sizes
10/9/2024
Taste @ ICDE2012
 
53
/42
Outline
Motivation
Problem Formulation
Trie-based Framework
Trie-based Algorithms
Optimizing Partition Scheme
Experiment
Conclusion
10/9/2024
Taste @ ICDE2012
54
/42
Conclusion
 
Approximate entity extraction
A trie-based framework
Search-and-extension-based search method
Selecting high-quality partition scheme
10/9/2024
Taste @ ICDE2012
55
/42
Slide Note
Embed
Share

Researchers from Tsinghua University, China, have developed a Trie-based framework for entity extraction in real-world data, addressing challenges such as dirty data and typos in author names and titles. The framework leverages Trie-based algorithms to optimize partition schemes and extract named entities based on dictionary matching, demonstrating its effectiveness in experiments. This innovative approach offers a solution for accurate and efficient entity recognition in diverse datasets.

  • Entity Extraction
  • Trie-based Framework
  • Real-World Data
  • Named Entity Recognition
  • Tsinghua University

Uploaded on Oct 09, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Dong Deng (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

  2. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Taste @ ICDE2012 10/9/2024 2/42

  3. Named Entity Recognition Dictionary-based NER Documents Dictionary of Entities 1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophi Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics. Isaac Newton Sigmund Freud English Austrian physicist mathematician astronomer philosopher alchemist theologian psychiatrist economist historian sociologist ... 2 Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst. Taste @ ICDE2012 10/9/2024 3/42

  4. Automatically add the links Wikipedia http://en.wikipedia.org/wiki/Levenshtein_distance Taste @ ICDE2012 10/9/2024 4/42

  5. Real-world Data is Rather Dirty DBLP Complete Search Typo in author Argyrios Zymnis Argyris Zymnis Typo in title relaxed related Taste @ ICDE2012 10/9/2024 5/42

  6. Approximate Entity Extraction Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities. For example: Dictionary of Entities Documents Isaac Newton Sigmund Freud physicist astronomer alchemist theologian economist sociologist ... Sigmund Freund was an Austrian psychiatrestwho founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalayst. Taste @ ICDE2012 10/9/2024 6/42

  7. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Taste @ ICDE2012 10/9/2024 7/42

  8. Problem Formulation Approximate Entity Extraction: Given a dictionary of entities E = {e1, e2, . . . , en}, a document D, and a predefined edit distance threshold , approximate entity extraction finds all similar pairs <s, ei> such that ED(s, ei) , where s is a substring of D and ei E. Taste @ ICDE2012 10/9/2024 8/42

  9. Edit Distance ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(marios, maras) = 2 Di,0= i, Di,j= min{Di-1, j + 1, Di, j-1+ 1, Di-1, j-1+ ti, j}, 0 if ai= bj 1 if ai bj. D0,j= j, Substitute a->i Insert o where ti, j= Taste @ ICDE2012 10/9/2024 9/42

  10. State-of-the-art Methods NGPP Faerie Inverted Index Prefix of 1-variant family q-grams Filtering Condition a substring matches with the prefix of 1-variant of partition Overlap must be larger than a threshold Shortage: Large Index Size Need to Tune Parameters Not efficient for large threshold Taste @ ICDE2012 10/9/2024 10/42

  11. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Taste @ ICDE2012 10/9/2024 11/42

  12. Observation Set =2 entity:vankatesh document: voncouver Split the entity into +1=3 segments A substring in document Taste @ ICDE2012 10/9/2024 12/42

  13. Observation Set =2 entity:vankatesh document: voncouver >= 1 edit operation >= 1 edit operation >= 1 edit operation >= + 1 = 3 edit operation NOT SIMILAR Taste @ ICDE2012 10/9/2024 13/42

  14. Observation Valid Substring: s is a valid substring if Lmin <=|s|<= Lmax+ Lmin: the minimum entity length. Lmax: the maximum entity length. Taste @ ICDE2012 10/9/2024 14/42

  15. Trie-based Framework Step1: Partition entities into segments using even partition scheme. Taste @ ICDE2012 10/9/2024 15/42

  16. Trie-based Framework Step2: Index the segments using a trie structure Taste @ ICDE2012 10/9/2024 16/42

  17. Trie-based Framework Step3: From the document, find the matched segments from the trie structure. Baseline: Trie-search Method 3.1 Enumerator all valid substrings. 3.2 Find each suffix of every substring in the trie structure to check if it can reach the leaf node. 3.3 Verify the candidate pairs. Taste @ ICDE2012 10/9/2024 17/42

  18. Trie-search Method 3.1 Enumerator all valid substrings. if Lmin- = 9-2=7 Lmax+ = 15+2=17 kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, Len = 7 Taste @ ICDE2012 10/9/2024 18/42

  19. Trie-search Method 3.1 Enumerator all valid substrings. if Lmin- = 9-2=7 Lmax+ = 15+2=17 kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, Len = 8 Taste @ ICDE2012 10/9/2024 19/42

  20. Trie-search Method 3.1 Enumerator all valid substrings. if Lmin- = 9-2=7 Lmax+ = 15+2=17 kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, Len = 11 Taste @ ICDE2012 10/9/2024 20/42

  21. Trie-search Method 3.1 Enumerator all valid substrings. if Lmin- = 9-2=7 Lmax+ = 15+2=17 kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, Len = 17 Taste @ ICDE2012 10/9/2024 21/42

  22. Trie-search Method 3.2 Find each suffix of every substring in the trie structure to check if it can reach the leaf. chaudhuri Pruned Taste @ ICDE2012 10/9/2024 22/42

  23. Trie-search Method 3.2 Find each suffix of every substring in the trie structure to check if it can reach the leaf. surajit chaud Candidate of entity 3 Taste @ ICDE2012 10/9/2024 23/42

  24. Trie-search Method 3.3 Verify the candidate pairs ED(surajit chaud, surajit chaudri)=2 Taste @ ICDE2012 10/9/2024 24/42

  25. Trie-search Method Shortage Candidate pairs may overhead: <surajit_chaudhuri, surajit_chaudri> < urajit_chaudhuri, surajit_chaudri> <surajit_chaudhur , surajit_chaudri> ... Taste @ ICDE2012 10/9/2024 25/42

  26. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Taste @ ICDE2012 10/9/2024 26/42

  27. Trie-based Algorithm Step3: From the document, find the matched segments from the trie structure. To avoid duplicate computation, we do not enumerate all valid substrings. Trie-based Method 3.1 Search: Search matched segments. 3.2 Extension: Extend the matched segments to find similar pairs. Taste @ ICDE2012 10/9/2024 27/42

  28. Trie-based Algorithm 3.1 Search: Search matched segments. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, Taste @ ICDE2012 10/9/2024 28/42

  29. Trie-based Algorithm 3.2 Extend the matched segments to find similar pairs. D: kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, e3: surajit chaudri e4: caushit chaudui e5: caushit chakrab Taste @ ICDE2012 10/9/2024 29/42

  30. k a u s h i t _ c h e k r a b a r s 5 4 4 4 5 5 All Bigger Than Two, Terminate! u 5 4 3 4 4 4 r 4 4 3 3 3 3 a 4 3 3 2 2 2 j 5 4 3 2 1 1 i 5 4 3 2 1 0 e3 t _ c h a u d r i Taste @ ICDE2012 10/9/2024 30/42

  31. k a u s h i t _ c h e k r a b a r c 1 1 2 3 4 5 Not Bigger Than Two, Extend The Right Part! a 1 0 1 2 3 4 u 2 1 0 1 2 3 s 3 2 1 0 1 2 h 4 3 2 1 0 1 i 5 4 3 2 1 0 e4 t _ c h 0 1 2 3 4 5 6 7 a 1 1 2 3 3 4 5 6 u 2 2 2 3 4 4 5 6 All Bigger Than Two, Prune! d 3 3 3 3 4 5 5 6 u 4 4 4 4 4 5 6 6 i 5 5 5 5 5 5 6 7 Taste @ ICDE2012 10/9/2024 31/42

  32. k a u s h I t _ c h e k r a b a r Same as the left part of e4 Share Prefix! c 1 1 2 3 4 5 Not Bigger Than Two, Extend The Right Part! a 1 0 1 2 3 4 u 2 1 0 1 2 3 s 3 2 1 0 1 2 h 4 3 2 1 0 1 i 5 4 3 2 1 0 e5 t _ c h 0 1 2 3 4 5 6 7 a 1 1 2 3 3 4 5 6 Share Prefix! k 2 2 1 2 3 4 5 6 Not Bigger Than Two, Get Candidates! r 3 3 2 1 2 3 4 5 a 4 4 3 2 1 2 3 4 b 5 5 4 3 2 1 2 3 Taste @ ICDE2012 10/9/2024 32/42

  33. Trie-based Algorithm 3.2 Extend the matched segments to find similar pairs. it_ch ush 2 aush 1 kaush 1 2 ekra 1 ekrab 2 ekraba Taste @ ICDE2012 10/9/2024 33/42

  34. Trie-based Algorithm 3.2 Extend the matched segments to find similar pairs. We get two results pairs: <caushit chakrab, aushit_chekrab> <caushit chakrab, kaushit_chekrab> Taste @ ICDE2012 10/9/2024 34/42

  35. Trie-based Algorithm Two-level Trie Taste @ ICDE2012 10/9/2024 35/42

  36. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Taste @ ICDE2012 10/9/2024 36/42

  37. Obvservation Even Partition: vanateshe van ate she Taste @ ICDE2012 10/9/2024 37/42

  38. Obvservation Even Partition: vanateshe Extend 5 times. C=5 van ate she an efficient filter for approximate membershep checking. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. vancouver, canada. sigmod 2008. Taste @ ICDE2012 10/9/2024 38/42

  39. Obvservation Uneven Partition: vanateshe vana tes he Taste @ ICDE2012 10/9/2024 39/42

  40. Obvservation Uneven Partition: vanateshe Extend 2 times. C=2 vana tes he an efficient filter for approximate membershep checking. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. vancouver, canada. sigmod 2008. Taste @ ICDE2012 10/9/2024 40/42

  41. Optimize Object Entity: c1c2 c3c4 cm-2cm-1 cm Segments: g g1 g2 g +1 Appear Time: Wg1 Wg2 Wg Wg +1 Document ?+1??? C= ?=1 Optimize Object: C=M[ +1][m]. M[i][j]: the minimum totle partition weight to partition c1c2 cj-1cj into i segments. Taste @ ICDE2012 10/9/2024 41/42

  42. Dynamic Way: If the start position of last partition is p, Then: C=M[ +1][m]=M[ ][p 1]+W?? ?? Iterative we have Taste @ ICDE2012 10/9/2024 42/42

  43. Determining Wg Build A Suffix Trie. The segment g The number of subtrie leaf nodes = The occurrence time in document = Wg Taste @ ICDE2012 10/9/2024 43/42

  44. Pruning Even VS. Dict+Doc Method Even involves larger candidate set Dict+Doc counts the indexing time Make indexing more efficient Using Segment Length to Do Pruning. Using Even-Partition Weight as An Upper Bound. Adding Extra Pointers On Suffix Trie. Taste @ ICDE2012 10/9/2024 44/42

  45. Time Comlexity: Trie-Search: ????+? O ?=???? ? Search-Extension: ? ? ? + ??????? O ? + ?????? ????+? ? where ? = ??/ + 1 ?=???? ? Taste @ ICDE2012 10/9/2024 45/42

  46. Time Comlexity: Search-Extension: O ? + ?????? Sort-Extension: ?????? ? O ? + where is the ratio of the trie. Taste @ ICDE2012 10/9/2024 46/42

  47. Time Comlexity: Sort-Extension: ?????? ? O ? + Dict+Doc Partition: ??????? ? O ? + + Where CD is the candidate number using Dict+Doc partition method, CD<<C. Taste @ ICDE2012 10/9/2024 47/42

  48. Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion Reference Taste @ ICDE2012 10/9/2024 48/42

  49. Experiment Setup Three real Data sets Existing algorithms NGPP (downloaded from its hompage) Faerie (we implemented) Taste @ ICDE2012 10/9/2024 49/42

  50. Search-Extension VS Sort-Extension SORT-EXTENSION was about 3-4 times faster than SEARCH-EXTENSION. This is because SORT-EXTENSION shares the computation on the common prefixes for different entities. Taste @ ICDE2012 10/9/2024 50/42

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#