Understanding Matching Keys in Database Systems
Matching keys play a crucial role in identifying the same real-world entities in database systems. They specify which attributes to compare and how to compare them, helping minimize redundancy and improve data accuracy. This summary discusses relative candidate keys, minimal matching keys, and reliable matching keys, highlighting the importance of choosing the right attributes for comparison and reducing redundancy in matching key selection.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
On Concise Set of Relative Candidate Keys Shaoxu Song(Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)
[Fan et al., VLDB09] Example of Matching Keys For identifying the same real-world entities, matching keys specify What attributes to compare and Howto compare them 2 : (name,address,department [0,4],[0,2],[0,0]) states that for any tuples ti,tjin a relation if their distance on attribute name is in [0, 4], i.e., 0 and 4 the distance on address is in [0, 2] the same department, with distance in [0,0] their ssnmust be identified SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science t2 2****3 J Smith Mark Rd Social Science
[Fan et al., VLDB09] Relative Candidate Keys (RCKs) A matching key 2 is said redundant w.r.t. a relation, if all the tuple pairs that can be identified by 2 can also be identified by another 1 1 : (name,address [0,4],[0,3]) 2 : (name,address,department [0,4],[0,3],[0,0]) RCKs, a special group of matching keys the number of compared attributes is minimized Analogous to candidate keys, w.r.t. functional dependencies SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science 1 is an RCK t2 2****3 J Smith Mark Rd Social Science t3 862*** Wixom J Smith Park St Social Science t4 862*** W J Smith Park Street Social Science
Minimal Matching Keys Redundancy issues exist not only w.r.t. what attributes to compare but also in how to compare them 1 : (name,address [0,4],[0,2]) 3 : (name,address [0,0],[0,2]) Redundancy among matching keys on the same attributes any tuple pair agreeing 3 with name distance in [0, 0] always satisfies [0,4] of 1 SSN Name Address Department t1 234*** Jason Smith Mark Road Social Science t2 2****3 Smith Mark Rd Social Science 3 is an RCK but not minimal t3 862*** Smith Park St Social Science t4 862*** Will J Smith Park Street Social Science t5 0****5 C Green Mark Road Computing t6 0****5 C Green Mark Rd Computing
Reliable Matching Keys Consider a training data instance the same real-world entities in attribute Y are pre-identified e.g., the matching tuple pairs (t1,t2),(t3,t4),(t5,t6) on ssn Support the number of tuple pairs that can be covered by Confidence the proportion of covered tuple pairs that correspond to true identifications on Y 5 : (name, address [0, 4], [0, 4]) t2 SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Social Science supp( 5) = 4/15 conf( 5) = 3/4 t4 862731 Will J Smith Park Street Social Science t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing
Matching Key Set Consider a set of matching keys relative to the same Y 1 : (name,address [0,4],[0,2]) 6: (name,department [0,0],[0,0]) A tuple pair may agree on (be covered by) several keys To avoid duplicate counting, consider the distinct tuple pairs that are covered by a set of matching keys SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science supp( 1) = 2/15 supp( 6) = 2/15 supp( ) = 3/15 t2 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Business t4 862731 W J Smith Park Street Business t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing
Hardness and Solutions Given a relation instance r of R, a Y over R, a constant k, and the minimum requirements of support s and confidence c, To find a set of matching keys such that supp( ) s, conf( ) c, and the size of the set | | is minimized The problem is NP-hard Greedy solution Select a with the maximum support in each iteration does not stop until the minimum support sis satisfied
Redundancy Free Results 1 : (name,address [0,4],[0,2]) 2 : (name,address,department [0,4],[0,2],[0,0]) 3 : (name,address [0,0],[0,2]) Subsume: on distance restrictions, [0,4] subsumes [0,2] Dominate: 1 2, if all distance restrictions in 1 subsume that of 2 Minimal: a is minimal if there does not exist any such that , ( and conf( ) c ) Minimal matching keys are alwaysRCKs minimal definition is more strict than the RCK definition Greedy algorithm returns minimal results For any 1 2, we have supp( 1) supp( 2)
Pruning Idea If a 1is selected to result set any 2, 1 2, has nofurther contribution to supp( ) i.e., supp({ 1}) = supp({ 1, 2}) 2can be directly ignored Example: suppose that 1 is selected to supp({ 1}) = supp({ 1, 2}) = supp({ 1, 2}) = 2/15 2, 3 can be pruned in the following computation SSN Name Address Department t1 234863 Jason Smith Mark Road Social Science 1 : (name,address [0,4],[0,2]) 2 : (name,address,department [0,4],[0,2],[0,0]) 3 : (name,address [0,0],[0,2]) t2 234863 J Smith Mark Rd Social Science t3 862731 W J Smith Park St Social Science t4 862731 Will J Smith Park Street Social Science t5 068335 C Green Mark Road Computing t6 068335 C Green Mark Rd Computing
Experiments The returned set size is affected by s and c Higher sand c lead to larger set size. When both s and c are too high, there may not exist any valid matching key set Pruning technique significantly reduced the time costs (c) hs = 1500/4.9*107, hc = 1.0 (a) Restaurant 5 Set size Set size 4 6 Time cost (s) CS+GA CSP+GA CS+GAP CSP+GAP 5 3 4 3 2 2 1 100 105 110 0.6 0.7 0.8 0.9 1 1 0 90 hs*372816 95 hc 0 4k 5k 6k 7k 8k 9k 10k # tuples
Experiments Concise RCK sets with support commitment s higher accuracy Compare with considering all RCKs the recall is high by all RCKs but the precision is low many irrational keys with low support probably overfit the data (e) Restaurant 1 0.8 (a) Restaurant (c) Restaurant 1 1 F-measure 0.6 90.0 95.0 100.0 105.0 110.0 0.8 0.8 0.4 Precision 0.6 0.6 90.0 95.0 100.0 105.0 110.0 90.0 95.0 100.0 105.0 110.0 Recall 0.4 0.4 0.2 findRCKs All 0.2 0.2 findRCKs findRCKs 0 All All 0.6 0.7 0.8 hc 0.9 1.0 0 0 0.6 0.7 0.8 hc 0.9 1.0 0.6 0.7 0.8 hc 0.9 1.0
Conclusion Relative candidate keys (RCKs) clear up redundant semantics w.r.t. what attributes to compare minimal on the number of compared attributes Minimal matching keys, a concise set of RCKs Redundancy among RCKs on the same attributes about how to compare them Introduce a greedy discovery algorithm The return results are guaranteed to be RCKs (minimal w.r.t. attributes), and also minimal w.r.t. distance restrictions i.e., redundancy free w.r.t. how to compare the attributes