Identifying Completeness of Query Answers in Incomplete Databases
The study delves into how to assess the completeness of query answers when dealing with partially complete databases. By analyzing data from a telecommunication company’s data warehouse, the query results are examined to determine if all warnings generated by maintenance objects with hardware teams during week 2 have been captured. Through a series of SQL operations and logical reasoning, the completeness of the query results is evaluated, ensuring accurate and comprehensive information retrieval.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Identifying the Extent of Completeness of Query Answers over Partially Complete Databases Simon Razniewski1, Flip Korn2, Werner Nutt1, Divesh Srivastava3 1Free University of Bozen-Bolzano 2Google Research 3AT&T Labs - Research
Data warehouse of a telecommunication company Admin John knows Team table is complete (HR says so) Maintenance is complete for teams A, B and C their reporting systems export data automatically Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Potential data loss due to a system failure on Tuesday Data further than Wednesday maybe not fully loaded Maintenance Warnings Teams ID resp reason day week ID message name specialization tw37 A disk failure Mon 1 tw37 high voltage A hardware tw59 D software crash Fri 1 tw37 high voltage B hardware tw83 B unknown Wed 2 tw37 overheat C network tw91 C update failure Tue 1 tw59 auto restart C software tw91 C network error Fri 1 tw59 overheat D network Mon 2 tw83 high voltage Tue 2 tw83 auto restart 2
John wants to know Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Is this all that hardware teams have done? SELECT * FROM Warnings W JOIN Maintenance M ON W.ID = M.ID JOIN Teams T ON M.responsible = T.name WHERE W.week = 2 AND T.specialization = 'hardware' W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware 3
John reasons Maintenance Warnings Teams ID resp reason day week ID message name specialization Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Warnings is complete for Week 1 and Monday and Wednesday of Week 2 Maintenance is complete for teams A, B and C Team is complete The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C 4
John looks at the data The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C Teams name specialization A hardware B hardware C network C software D network There are no other hardware teams than A and B The query result is fully complete for Monday and Wednesday 5
Questions Warnings are complete for Week 1 1. How can we formally describe complete parts of a database? The query result contains all warnings from Monday of week 2 for team A 2. How can we use database completeness information to identify complete parts of query answers? 6
Related work Publication Description Language Focus of the work Motro, TODS 1989 Views Schema-level reasoning Levy, LC statements, similar to views Schema-level reasoning VLDB 1996 Fan & Geerts, PODS 2009 Various query languages (CQ-Datalog) Master data management, where an upper bound database exists Lang et al., SIGMOD 2014 Columns/operators Distributed databases on the web, operational failures during query execution 7
Formalism: Patterns Warnings Warnings Warnings Warnings Warnings We have all warnings from week 1 day day day day day week week week week week ID ID ID ID ID message message message message message Mon Mon Mon Mon Mon 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage We have all warnings from Monday of week 2 Fri Fri Fri Fri Fri 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage Wed Wed Wed Wed Wed 2 2 2 2 2 tw37 tw37 tw37 tw37 tw37 overheat overheat overheat overheat overheat Tue Tue Tue Tue Tue 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 auto restart auto restart auto restart auto restart auto restart Fri Fri Fri Fri Fri 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 overheat overheat overheat overheat overheat Mon Mon Mon Mon Mon 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 high voltage high voltage high voltage high voltage high voltage Tue Tue Tue Tue Tue 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 auto restart auto restart auto restart auto restart auto restart * * * * 1 1 1 1 * * * * * * * * Mon Mon Mon 2 2 2 * * * * * * Less expressive than previous formalisms Can be expressed in the same schema as the data 8
Johns knowledge expressed by patterns Maintenance is complete for teams A, B and C Team table is complete Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Teams Maintenance Warnings name specialization ID resp reason day week ID message A hardware tw37 A disk failure Mon 1 tw37 high voltage B hardware tw59 D software crash Fri 1 tw37 high voltage C network tw83 B unknown Wed 2 tw37 overheat C software tw91 C update failure Tue 1 tw59 auto restart D network tw91 C network error Fri 1 tw59 overheat * * * A * Mon 2 tw83 high voltage * B * Tue 2 tw83 auto restart * C * * 1 * * Mon 2 * * Wed 2 * * 9
Johns conclusions expressed by patterns Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. W.Day W.Day W.Day W.week W.week W.week W.ID W.ID W.ID W.message W.message W.message M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization Wed Wed Wed 2 2 2 tw37 tw37 tw37 overheat overheat overheat tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Mon Mon Mon 2 2 2 tw83 tw83 tw83 high voltage high voltage high voltage tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Tue Tue Tue 2 2 2 tw83 tw83 tw83 auto restart auto restart auto restart tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Mon Mon * * * * * * * * A A * * A A * * Mon * * * * B * B * Mon * * * * C * C * Wed * * * * A * A * Wed * * * * B * B * Wed * * * * C * C * The query result contains all warnings from Monday for team A 10
How to compute the completeness patterns for queries? Queries are computed by relational algebra Here: Select, project, equijoin ?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? Schema reasoning: - Apply algebra operators to completeness patterns (analogous to query result computation) 11
Reasoning about selections ?????= "??"(?)) Teams name specialization name name specialization specialization ?? A hardware A A hardware hardware B hardware B B hardware hardware C network * * C software D network * * Rule 1: Statements with * survive 12
?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 13
Reasoning about selections (2) Warnings day week ID message Mon 1 tw37 high voltage ?????=?(?) Fri 1 tw37 high voltage Wed 2 tw37 overheat day day week week ID ID message message Tue 1 tw59 auto restart Wed Wed 2 2 tw37 tw37 overheat overheat Fri 1 tw59 overheat Mon Mon 2 2 tw83 tw83 high voltage high voltage Mon 2 tw83 high voltage Tue Tue 2 2 tw83 ? * auto restart auto restart tw83 Tue 2 tw83 auto restart ** Mon 2 * * * 1 * * Wed 2 * Mon 2 * * Wed 2 * * Rule 2: Irrelevant constants are ignored Rule 3: Selected constants survive and are promoted 14
?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 15
Reasoning about joins Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure tw91 C network error * A * * B * * C * M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Rule 1: Constants join with equal constants Rule 2: Stars join with anything Rule 3: Constants can be promoted tw83 tw83 tw83 B B B unknown unknown B B B hardware hardware hardware unknown ? * * * * A A * * A * * * * * B B * * B * * * * * C C C * * * * * * A * * * * B * 16 * * * C *
Algorithmic completeness Proven: Extended algebra gives all conclusions that hold on the schema level (reasoning only with the yellow metadata) Independent of the algebra tree chosen 17
Looking at the data Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure There cannot be other hardware teams than A and B tw91 C network error * A * * B * * C * M.ID M.resp M.reason T.name T.specialization tw37 A disk failure A hardware tw83 B unknown M.reason B hardware T.specialization M.ID M.resp T.name * A * * * hardware tw37 A disk failure A * B * * * hardware Database instance allows for more promotion! tw83 B unknown B * C * * * * * * * * * * * A * * * * B * 18 * * * C *
How expensive is that? Experiments: Schema vs instance reasoning Instance reasoning vs query evaluation Minimization of sets of patterns 19
Instance reasoning vs query evaluation (1) Synthetic data Wikipedia has around 1000 lists declared as complete (using a template or in natural language) http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29 20
Instance reasoning vs query evaluation (2) Manually extracted some and grouped them by topic Recurrent topics: Sports teams, political assemblies, geographical features, songs, operas and other pieces of art Generated one table each about cities, schools and countries city name * * * * * * * * * country USA Germany Ukraine Bulgaria USA UK USA Czech Slovenia state Virginia * * * New York Carmarthenshire West Virginia Moravian-Silesia * county * * * * * * Hampshire County Nov Ji n * 21
Instance reasoning vs query evaluation (3) SELECT * FROM country, city, school WHERE country.capital=city.name AND city.state=school.state SQL runtime: 2040 ms (25891 records) Completeness pattern runtime: 900 ms (46 patterns) Median over 7 join queries: SQL runtime: 2040 ms Completeness pattern runtime: 460 ms 22
Summary Completeness patterns are a natural way to describe complete parts of databases and query answers Can be expressed in the same schema Modified the relational algebra to manipulate completeness patterns Selection and projection easy Join may be expensive (in theory, in practice, usually not) Also in the paper Aggregation operators Details on instance-specific generalizations Comparison of data structures for minimizing patterns Discrimination trees allow efficient handling of large sets 23