Matching Keys in Database Systems

On Concise Set of Relative
Candidate Keys
Shaoxu Song
 
(Tsinghua), 
Lei Chen (HKUST), Hong Cheng (CUHK)
Example of Matching Keys
For identifying the same real-world entities,
 
matching
 
keys
 
specify
What
 attributes to compare and
How
 to compare them
ψ2 : (name,address,department
[0,4],[0,2],[0,0])
states that for any tuples ti,tj in a relation
if their distance on attribute name is in [0, 4], i.e., ≥ 0 and ≤ 4
the distance on address is in [0, 2]
the same department, with distance in [0,0]
their ssn must be 
identified
[Fan et al., VLDB09]
Relative Candidate Keys (RCKs)
A matching key ψ2 is said 
redundant
 w.r.t. a relation, if
all the tuple pairs that can be identified by ψ2
can also be identified by another ψ1
ψ
1
 : (name,address
[0,4],[0,
3
]) 
ψ2 : (name,address,department
[0,4],[0,3],[0,0])
RCKs, a special group of matching keys
the number of compared attributes is 
minimized
Analogous
 
to
 
candidate
 
keys,
 
w.r.t.
 
functional
 
dependencies
ψ1 is an RCK
[Fan et al., VLDB09]
Minimal Matching Keys
Redundancy issues exist
not only w.r.t. “
what
 attributes to compare”
but also in “
how
 to compare them”
ψ
1
 : (name,address
[0,4]
,[0,2]) 
ψ3 : (name,address
[0,0]
,[0,2])
Redundancy among matching keys on the 
same
 attributes
any tuple pair agreeing ψ3 with name distance in [0, 0] 
always
satisfies [0,4] of ψ1
ψ3 is an RCK
but 
not minimal 
Reliable
 
Matching Keys
Consider a 
training
 data instance
the same real-world entities in attribute Y are pre-identified
e.g., the matching tuple pairs (t1,t2),(t3,t4),(t5,t6) on ssn
Support
the number of tuple pairs that can be 
covered
 by ψ
Confidence
the 
proportion
 of covered tuple pairs that correspond to 
true
identifications on Y
ψ5 : (name, address 
           [0, 4], [0, 4])
supp(ψ5) = 4/15 
conf(ψ5) = 3/4
Matching Key Set
Consider a set 
Φ
 of matching keys relative to the same Y
ψ
1
 : (name,address
[0,4],[0,2]) 
ψ
6
 : (name,
department
[0,
0
],[0,
0
])
A tuple pair may agree on (be covered by) 
several
 keys ψ 
 
Φ
To avoid 
duplicate
 counting, consider the 
distinct
 tuple pairs that
are covered by a set of matching keys
supp(ψ1) = 2/15 
supp(ψ6) = 2/15 
supp(
Φ
) = 3/15 
Hardness and Solutions
Given a relation instance r of R, a Y over R, a constant k, and
the minimum requirements of support ηs and confidence ηc,
To find a set 
Φ
 of matching keys such that
supp(
Φ
) ≥ ηs, conf(
Φ
) ≥ ηc, and
the size of the set |
Φ
| is 
minimized
The problem is NP-hard
Greedy solution
Select a 
ψ with the 
maximum 
support in each iteration
does not stop until the minimum support ηs is satisfied
Redundancy Free Results
Subsume
: on distance restrictions,
[0,4] subsumes [0,2]
Dominate
: 
ψ1 
 ψ2
,
if all distance restrictions in 
ψ1 
subsume that of
 ψ2
Minimal
: a 
ψ
 is minimal
if there does not exist any ψ′ such that 
ψ′
ψ, ( and conf(ψ’)≥ηc )
Minimal matching keys are 
always
 RCKs
“minimal” definition is more strict than the RCK definition
Greedy algorithm returns 
minimal
 results
For any ψ1 
 ψ2, we have supp(ψ1) ≥ supp(ψ2)
ψ
1
 : (name,address
[0,4],[0,2])
ψ2 : (name,address,department
           [0,4],[0,2],[0,0]) 
 
ψ3 : (name,address
[0,0],[0,2])
Pruning Idea
If a 
ψ1
 is selected to result set 
Φ
any
 
ψ2
,
 
ψ1 
 ψ2
, has 
no
 further contribution to supp(
Φ
)
i.e., supp({
ψ1
}) = supp({
ψ1
,
 ψ
2})
ψ2
 can be directly ignored
Example: suppose that 
ψ
1 is selected to 
Φ
supp({
ψ1
}) = supp({
ψ1
,
 ψ
2}) = supp({
ψ1
,
 ψ
2}) = 2/15
ψ2, 
ψ3 
can be 
pruned
 in the following computation
ψ
1
 : (name,address
[0,4],[0,2])
ψ2 : (name,address,department
           [0,4],[0,2],[0,0]) 
 
ψ3 : (name,address
[0,0],[0,2])
Experiments
The returned 
set size 
is affected by ηs and ηc
Higher ηs and ηc lead to larger set size.
When both ηs and ηc are too high, there may 
not
 exist any valid
matching key set
Pruning technique significantly 
reduced
 the time costs
Experiments
Concise RCK sets with support commitment ηs
higher
 accuracy
Compare with considering 
all RCKs
the recall is high by all RCKs
but the precision is 
low
many irrational keys with low support
probably overfit the data
Conclusion
Relative candidate keys (RCKs) clear up redundant semantics
w.r.t. “
what
 attributes to compare”
minimal on the number of compared attributes
Minimal matching keys, a concise set of RCKs
Redundancy among RCKs on the same attributes
about “
how
 to compare them”
Introduce a greedy discovery algorithm
The return results are 
guaranteed
 to be
RCKs (minimal w.r.t. attributes), and also
minimal
 w.r.t. distance restrictions
i.e., redundancy free w.r.t. “how to compare the attributes”
Thanks
 
Slide Note
Embed
Share

Matching keys play a crucial role in identifying the same real-world entities in database systems. They specify which attributes to compare and how to compare them, helping minimize redundancy and improve data accuracy. This summary discusses relative candidate keys, minimal matching keys, and reliable matching keys, highlighting the importance of choosing the right attributes for comparison and reducing redundancy in matching key selection.

  • Database Systems
  • Matching Keys
  • Data Accuracy
  • Redundancy
  • Entity Identification

Uploaded on Sep 25, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. On Concise Set of Relative Candidate Keys Shaoxu Song(Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)

  2. [Fan et al., VLDB09] Example of Matching Keys For identifying the same real-world entities, matching keys specify What attributes to compare and Howto compare them 2 : (name,address,department [0,4],[0,2],[0,0]) states that for any tuples ti,tjin a relation if their distance on attribute name is in [0, 4], i.e., 0 and 4 the distance on address is in [0, 2] the same department, with distance in [0,0] their ssnmust be identified SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science t2 2****3 J Smith Mark Rd Social Science

  3. [Fan et al., VLDB09] Relative Candidate Keys (RCKs) A matching key 2 is said redundant w.r.t. a relation, if all the tuple pairs that can be identified by 2 can also be identified by another 1 1 : (name,address [0,4],[0,3]) 2 : (name,address,department [0,4],[0,3],[0,0]) RCKs, a special group of matching keys the number of compared attributes is minimized Analogous to candidate keys, w.r.t. functional dependencies SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science 1 is an RCK t2 2****3 J Smith Mark Rd Social Science t3 862*** Wixom J Smith Park St Social Science t4 862*** W J Smith Park Street Social Science

  4. Minimal Matching Keys Redundancy issues exist not only w.r.t. what attributes to compare but also in how to compare them 1 : (name,address [0,4],[0,2]) 3 : (name,address [0,0],[0,2]) Redundancy among matching keys on the same attributes any tuple pair agreeing 3 with name distance in [0, 0] always satisfies [0,4] of 1 SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science t2 2****3 Smith Mark Rd Social Science 3 is an RCK but not minimal t3 862*** Smith Park St Social Science t4 862*** Will J Smith Park Street Social Science t5 0****5 C Green Mark Road Computing t6 0****5 C Green Mark Rd Computing

  5. Reliable Matching Keys Consider a training data instance the same real-world entities in attribute Y are pre-identified e.g., the matching tuple pairs (t1,t2),(t3,t4),(t5,t6) on ssn Support the number of tuple pairs that can be covered by Confidence the proportion of covered tuple pairs that correspond to true identifications on Y 5 : (name, address [0, 4], [0, 4]) t2 SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Social Science supp( 5) = 4/15 conf( 5) = 3/4 t4 862731 Will J Smith Park Street Social Science t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing

  6. Matching Key Set Consider a set of matching keys relative to the same Y 1 : (name,address [0,4],[0,2]) 6: (name,department [0,0],[0,0]) A tuple pair may agree on (be covered by) several keys To avoid duplicate counting, consider the distinct tuple pairs that are covered by a set of matching keys SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science supp( 1) = 2/15 supp( 6) = 2/15 supp( ) = 3/15 t2 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Business t4 862731 W J Smith Park Street Business t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing

  7. Hardness and Solutions Given a relation instance r of R, a Y over R, a constant k, and the minimum requirements of support s and confidence c, To find a set of matching keys such that supp( ) s, conf( ) c, and the size of the set | | is minimized The problem is NP-hard Greedy solution Select a with the maximum support in each iteration does not stop until the minimum support sis satisfied

  8. Redundancy Free Results 1 : (name,address [0,4],[0,2]) 2 : (name,address,department [0,4],[0,2],[0,0]) 3 : (name,address [0,0],[0,2]) Subsume: on distance restrictions, [0,4] subsumes [0,2] Dominate: 1 2, if all distance restrictions in 1 subsume that of 2 Minimal: a is minimal if there does not exist any such that , ( and conf( ) c ) Minimal matching keys are alwaysRCKs minimal definition is more strict than the RCK definition Greedy algorithm returns minimal results For any 1 2, we have supp( 1) supp( 2)

  9. Pruning Idea If a 1is selected to result set any 2, 1 2, has nofurther contribution to supp( ) i.e., supp({ 1}) = supp({ 1, 2}) 2can be directly ignored Example: suppose that 1 is selected to supp({ 1}) = supp({ 1, 2}) = supp({ 1, 2}) = 2/15 2, 3 can be pruned in the following computation SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science 1 : (name,address [0,4],[0,2]) 2 : (name,address,department [0,4],[0,2],[0,0]) 3 : (name,address [0,0],[0,2]) t2 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Social Science t4 862731 Will J Smith Park Street Social Science t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing

  10. Experiments The returned set size is affected by s and c Higher sand c lead to larger set size. When both s and c are too high, there may not exist any valid matching key set Pruning technique significantly reduced the time costs (c) hs = 1500/4.9*107, hc = 1.0 (a) Restaurant 5 Set size Set size 4 6 Time cost (s) CS+GA CSP+GA CS+GAP CSP+GAP 5 3 4 3 2 2 1 100 105 110 0.6 0.7 0.8 0.9 1 1 0 90 hs*372816 95 hc 0 4k 5k 6k 7k 8k 9k 10k # tuples

  11. Experiments Concise RCK sets with support commitment s higher accuracy Compare with considering all RCKs the recall is high by all RCKs but the precision is low many irrational keys with low support probably overfit the data (e) Restaurant 1 0.8 (a) Restaurant (c) Restaurant 1 1 F-measure 0.6 90.0 95.0 100.0 105.0 110.0 0.8 0.8 0.4 Precision 0.6 0.6 90.0 95.0 100.0 105.0 110.0 90.0 95.0 100.0 105.0 110.0 Recall 0.4 0.4 0.2 findRCKs All 0.2 0.2 findRCKs findRCKs 0 All All 0.6 0.7 0.8 hc 0.9 1.0 0 0 0.6 0.7 0.8 hc 0.9 1.0 0.6 0.7 0.8 hc 0.9 1.0

  12. Conclusion Relative candidate keys (RCKs) clear up redundant semantics w.r.t. what attributes to compare minimal on the number of compared attributes Minimal matching keys, a concise set of RCKs Redundancy among RCKs on the same attributes about how to compare them Introduce a greedy discovery algorithm The return results are guaranteed to be RCKs (minimal w.r.t. attributes), and also minimal w.r.t. distance restrictions i.e., redundancy free w.r.t. how to compare the attributes

  13. Thanks

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#