Identifying Completeness of Query Answers in Incomplete Databases

 
Identifying the Extent of Completeness of
Query Answers over Partially Complete Databases
 
Simon Razniewski
1
, Flip Korn
2
,
Werner Nutt
1
, Divesh Srivastava
3
 
1
 Free University of Bozen-Bolzano
2
 Google Research
3
 AT&T Labs - Research
Data warehouse of a
telecommunication company
 
Admin John knows
Team
 table is complete 
(HR says so)
Maintenance
 is complete for teams A, B and C
their reporting systems export data automatically
Warnings
 is complete for all of Week 1,
and Monday and Wednesday of Week 2
Potential data loss due to a system failure on Tuesday
Data further than Wednesday maybe not fully loaded
2
John wants to know
 
“Give me all warnings in week 2 that are generated
  by objects in maintenance with a hardware team.”
 
 
SELECT *
  FROM Warnings W
    JOIN Maintenance M ON W.ID = M.ID
    JOIN Teams T ON M.responsible = T.name
  WHERE W.week = 2
    AND T.specialization = 'hardware'
Is this all that
hardware
teams have
done?
3
John reasons
 
 
 
 
 
“Give me all warnings in week 2 that are generated
  by objects in maintenance with a hardware team.”
 
 
Warnings
 is complete for Week 1 and 
Monday
 and 
Wednesday
 of Week 2
Maintenance
 is complete for teams 
A
, 
B
 and 
C
Team
 is complete
 
 The query result definitely contains all warnings from
Monday
 for team 
A
Monday
 for team 
B
Monday
 for team 
C
Wednesday
 for team 
A
Wednesday
 for team 
B
Wednesday
 for team 
C
4
John looks at the data
 
 The query result definitely contains all warnings from
Monday
 for team 
A
Monday
 for team 
B
Monday
 for team 
C
Wednesday
 for team 
A
Wednesday
 for team 
B
Wednesday
 for team 
C
 
There are 
no other hardware 
teams than A and B
 
The query result is 
fully complete for Monday and Wednesday
5
Questions
 
 
“Warnings are complete for Week 1”
 
1.
How can we formally describe
complete parts of a database?
 
 
The query result contains all warnings
 
  from Monday of week 2 for team A”
 
2.
How can we use database completeness
information to identify
complete parts of query answers?
6
 
Related work
 
 
7
Formalism: Patterns
 
 
We have all warnings from week 1
 
We have all warnings from
Monday of week 2
 
 
 
 
 
Less expressive 
than previous formalisms
Can be expressed in the 
same schema as the data
8
 
John’s knowledge expressed by patterns
 
9
 
Team
 table is complete
 
Maintenance
 is complete
     for teams A, B and C
 
Warnings
 is complete for all of Week 1,
and Monday and Wednesday of Week 2
John’s conclusions expressed by patterns
“Give me all warnings in week 2 that are generated
 by objects in maintenance with a hardware team.”
10
 
 The query result contains all warnings from
Monday
 for team 
A
How to compute the completeness patterns for queries?
 
Queries are computed by 
relational algebra
Here: Select, project, equijoin
 
 
 
 
 
 
 
Schema 
reasoning
:
   
- Apply
 
algebra 
operators
 
to
 completeness patterns
     (analogous to query result computation)
11
?
?
 
Rule 1: Statements with * survive
12
Reasoning about selections
 
13
?
 
Rule 2: Irrelevant constants are ignored
Rule 3: Selected constants survive and are 
promoted
14
Reasoning about selections (2)
*
*
 
15
?
16
Reasoning about joins
 
Rule 1: Constants join with equal constants
Rule 2: Stars join with anything
Rule 3: Constants can be 
promoted
 
Algorithmic completeness
 
 
 
Proven
: Extended algebra gives 
all conclusions
that hold on the 
schema level
      
(reasoning only with the yellow metadata)
 
Independent of the algebra tree chosen
 
17
18
Looking at the data
There
cannot be
other hardware
teams than
A and B
 
Database instance
allows for 
more promotion
!
 
How expensive is that?
 
 
Experiments:
 
Schema
 vs 
instance
 reasoning
 
Instance
 reasoning vs 
query evaluation
 
Minimization
 of sets of patterns
 
19
 
Instance reasoning vs query evaluation (1)
 
Synthetic data
 
Wikipedia 
has around 
1000 lists 
declared as 
complete
(using a template or in natural language)
 
20
 
http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29
 
Manually extracted some and grouped them by topic
Recurrent topics: Sports teams, political assemblies, geographical features,
songs, operas and other pieces of art
Generated 
one table each 
about cities, schools and countries
 
 
21
 
Instance reasoning vs query evaluation (2)
 
22
 
 
SELECT *
FROM country, city, school
WHERE country.capital=city.name
   AND city.state=school.state
 
SQL runtime: 
2040 ms (25891 records)
Completeness pattern runtime
: 900 ms (46 patterns)
 
 
Median
 over 7 join queries:
SQL runtime
: 2040 ms
Completeness pattern runtime
: 460 ms
 
Instance reasoning vs query evaluation (3)
 
Summary
 
 
Completeness patterns 
are a natural way to describe
complete parts of databases and query answers
Can be expressed in the same schema
 
Modified the relational algebra
to manipulate completeness patterns
Selection and projection easy
Join may be expensive (in theory, in practice, usually not)
 
Also in the paper
Aggregation
 operators
Details on instance-specific 
generalizations
Comparison of 
data structures for minimizing patterns
Discrimination trees allow efficient handling of large sets
 
23
Slide Note
Embed
Share

The study delves into how to assess the completeness of query answers when dealing with partially complete databases. By analyzing data from a telecommunication company’s data warehouse, the query results are examined to determine if all warnings generated by maintenance objects with hardware teams during week 2 have been captured. Through a series of SQL operations and logical reasoning, the completeness of the query results is evaluated, ensuring accurate and comprehensive information retrieval.

  • Incomplete Databases
  • Query Answers
  • Data Warehouse
  • Telecommunication Company
  • SQL Operations

Uploaded on Aug 19, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Identifying the Extent of Completeness of Query Answers over Partially Complete Databases Simon Razniewski1, Flip Korn2, Werner Nutt1, Divesh Srivastava3 1Free University of Bozen-Bolzano 2Google Research 3AT&T Labs - Research

  2. Data warehouse of a telecommunication company Admin John knows Team table is complete (HR says so) Maintenance is complete for teams A, B and C their reporting systems export data automatically Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Potential data loss due to a system failure on Tuesday Data further than Wednesday maybe not fully loaded Maintenance Warnings Teams ID resp reason day week ID message name specialization tw37 A disk failure Mon 1 tw37 high voltage A hardware tw59 D software crash Fri 1 tw37 high voltage B hardware tw83 B unknown Wed 2 tw37 overheat C network tw91 C update failure Tue 1 tw59 auto restart C software tw91 C network error Fri 1 tw59 overheat D network Mon 2 tw83 high voltage Tue 2 tw83 auto restart 2

  3. John wants to know Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Is this all that hardware teams have done? SELECT * FROM Warnings W JOIN Maintenance M ON W.ID = M.ID JOIN Teams T ON M.responsible = T.name WHERE W.week = 2 AND T.specialization = 'hardware' W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware 3

  4. John reasons Maintenance Warnings Teams ID resp reason day week ID message name specialization Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Warnings is complete for Week 1 and Monday and Wednesday of Week 2 Maintenance is complete for teams A, B and C Team is complete The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C 4

  5. John looks at the data The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C Teams name specialization A hardware B hardware C network C software D network There are no other hardware teams than A and B The query result is fully complete for Monday and Wednesday 5

  6. Questions Warnings are complete for Week 1 1. How can we formally describe complete parts of a database? The query result contains all warnings from Monday of week 2 for team A 2. How can we use database completeness information to identify complete parts of query answers? 6

  7. Related work Publication Description Language Focus of the work Motro, TODS 1989 Views Schema-level reasoning Levy, LC statements, similar to views Schema-level reasoning VLDB 1996 Fan & Geerts, PODS 2009 Various query languages (CQ-Datalog) Master data management, where an upper bound database exists Lang et al., SIGMOD 2014 Columns/operators Distributed databases on the web, operational failures during query execution 7

  8. Formalism: Patterns Warnings Warnings Warnings Warnings Warnings We have all warnings from week 1 day day day day day week week week week week ID ID ID ID ID message message message message message Mon Mon Mon Mon Mon 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage We have all warnings from Monday of week 2 Fri Fri Fri Fri Fri 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage Wed Wed Wed Wed Wed 2 2 2 2 2 tw37 tw37 tw37 tw37 tw37 overheat overheat overheat overheat overheat Tue Tue Tue Tue Tue 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 auto restart auto restart auto restart auto restart auto restart Fri Fri Fri Fri Fri 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 overheat overheat overheat overheat overheat Mon Mon Mon Mon Mon 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 high voltage high voltage high voltage high voltage high voltage Tue Tue Tue Tue Tue 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 auto restart auto restart auto restart auto restart auto restart * * * * 1 1 1 1 * * * * * * * * Mon Mon Mon 2 2 2 * * * * * * Less expressive than previous formalisms Can be expressed in the same schema as the data 8

  9. Johns knowledge expressed by patterns Maintenance is complete for teams A, B and C Team table is complete Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Teams Maintenance Warnings name specialization ID resp reason day week ID message A hardware tw37 A disk failure Mon 1 tw37 high voltage B hardware tw59 D software crash Fri 1 tw37 high voltage C network tw83 B unknown Wed 2 tw37 overheat C software tw91 C update failure Tue 1 tw59 auto restart D network tw91 C network error Fri 1 tw59 overheat * * * A * Mon 2 tw83 high voltage * B * Tue 2 tw83 auto restart * C * * 1 * * Mon 2 * * Wed 2 * * 9

  10. Johns conclusions expressed by patterns Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. W.Day W.Day W.Day W.week W.week W.week W.ID W.ID W.ID W.message W.message W.message M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization Wed Wed Wed 2 2 2 tw37 tw37 tw37 overheat overheat overheat tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Mon Mon Mon 2 2 2 tw83 tw83 tw83 high voltage high voltage high voltage tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Tue Tue Tue 2 2 2 tw83 tw83 tw83 auto restart auto restart auto restart tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Mon Mon * * * * * * * * A A * * A A * * Mon * * * * B * B * Mon * * * * C * C * Wed * * * * A * A * Wed * * * * B * B * Wed * * * * C * C * The query result contains all warnings from Monday for team A 10

  11. How to compute the completeness patterns for queries? Queries are computed by relational algebra Here: Select, project, equijoin ?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? Schema reasoning: - Apply algebra operators to completeness patterns (analogous to query result computation) 11

  12. Reasoning about selections ?????= "??"(?)) Teams name specialization name name specialization specialization ?? A hardware A A hardware hardware B hardware B B hardware hardware C network * * C software D network * * Rule 1: Statements with * survive 12

  13. ?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 13

  14. Reasoning about selections (2) Warnings day week ID message Mon 1 tw37 high voltage ?????=?(?) Fri 1 tw37 high voltage Wed 2 tw37 overheat day day week week ID ID message message Tue 1 tw59 auto restart Wed Wed 2 2 tw37 tw37 overheat overheat Fri 1 tw59 overheat Mon Mon 2 2 tw83 tw83 high voltage high voltage Mon 2 tw83 high voltage Tue Tue 2 2 tw83 ? * auto restart auto restart tw83 Tue 2 tw83 auto restart ** Mon 2 * * * 1 * * Wed 2 * Mon 2 * * Wed 2 * * Rule 2: Irrelevant constants are ignored Rule 3: Selected constants survive and are promoted 14

  15. ?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 15

  16. Reasoning about joins Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure tw91 C network error * A * * B * * C * M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Rule 1: Constants join with equal constants Rule 2: Stars join with anything Rule 3: Constants can be promoted tw83 tw83 tw83 B B B unknown unknown B B B hardware hardware hardware unknown ? * * * * A A * * A * * * * * B B * * B * * * * * C C C * * * * * * A * * * * B * 16 * * * C *

  17. Algorithmic completeness Proven: Extended algebra gives all conclusions that hold on the schema level (reasoning only with the yellow metadata) Independent of the algebra tree chosen 17

  18. Looking at the data Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure There cannot be other hardware teams than A and B tw91 C network error * A * * B * * C * M.ID M.resp M.reason T.name T.specialization tw37 A disk failure A hardware tw83 B unknown M.reason B hardware T.specialization M.ID M.resp T.name * A * * * hardware tw37 A disk failure A * B * * * hardware Database instance allows for more promotion! tw83 B unknown B * C * * * * * * * * * * * A * * * * B * 18 * * * C *

  19. How expensive is that? Experiments: Schema vs instance reasoning Instance reasoning vs query evaluation Minimization of sets of patterns 19

  20. Instance reasoning vs query evaluation (1) Synthetic data Wikipedia has around 1000 lists declared as complete (using a template or in natural language) http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29 20

  21. Instance reasoning vs query evaluation (2) Manually extracted some and grouped them by topic Recurrent topics: Sports teams, political assemblies, geographical features, songs, operas and other pieces of art Generated one table each about cities, schools and countries city name * * * * * * * * * country USA Germany Ukraine Bulgaria USA UK USA Czech Slovenia state Virginia * * * New York Carmarthenshire West Virginia Moravian-Silesia * county * * * * * * Hampshire County Nov Ji n * 21

  22. Instance reasoning vs query evaluation (3) SELECT * FROM country, city, school WHERE country.capital=city.name AND city.state=school.state SQL runtime: 2040 ms (25891 records) Completeness pattern runtime: 900 ms (46 patterns) Median over 7 join queries: SQL runtime: 2040 ms Completeness pattern runtime: 460 ms 22

  23. Summary Completeness patterns are a natural way to describe complete parts of databases and query answers Can be expressed in the same schema Modified the relational algebra to manipulate completeness patterns Selection and projection easy Join may be expensive (in theory, in practice, usually not) Also in the paper Aggregation operators Details on instance-specific generalizations Comparison of data structures for minimizing patterns Discrimination trees allow efficient handling of large sets 23

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#