Identifying Completeness of Query Answers in Incomplete Databases

Identifying the Extent of Completeness of

Query Answers over Partially Complete Databases

Simon Razniewski

, Flip Korn

Werner Nutt

, Divesh Srivastava

 Free University of Bozen-Bolzano

 Google Research

 AT&T Labs - Research

Data warehouse of a

telecommunication company

Admin John knows

•

Team

 table is complete

(HR says so)

•

Maintenance

 is complete for teams A, B and C

•

their reporting systems export data automatically

•

Warnings

 is complete for all of Week 1,

and Monday and Wednesday of Week 2

•

Potential data loss due to a system failure on Tuesday

•

Data further than Wednesday maybe not fully loaded

John wants to know

“Give me all warnings in week 2 that are generated

  by objects in maintenance with a hardware team.”

SELECT *

  FROM Warnings W

    JOIN Maintenance M ON W.ID = M.ID

    JOIN Teams T ON M.responsible = T.name

  WHERE W.week = 2

    AND T.specialization = 'hardware'

Is this all that

hardware

teams have

done?

John reasons

“Give me all warnings in week 2 that are generated

  by objects in maintenance with a hardware team.”

•

Warnings

 is complete for Week 1 and

Monday

and

Wednesday

 of Week 2

•

Maintenance

 is complete for teams

and

•

Team

 is complete



 The query result definitely contains all warnings from

–

Monday

 for team

–

Monday

 for team

–

Monday

 for team

–

Wednesday

 for team

–

Wednesday

 for team

–

Wednesday

 for team

John looks at the data



 The query result definitely contains all warnings from

–

Monday

 for team

–

Monday

 for team

–

Monday

 for team

–

Wednesday

 for team

–

Wednesday

 for team

–

Wednesday

 for team

•

There are

no other hardware

teams than A and B



The query result is

fully complete for Monday and Wednesday

Questions

“Warnings are complete for Week 1”

1.

How can we formally describe

complete parts of a database?

“

The query result contains all warnings

  from Monday of week 2 for team A”

2.

How can we use database completeness

information to identify

complete parts of query answers?

Related work

Formalism: Patterns

We have all warnings from week 1

We have all warnings from

Monday of week 2

•

Less expressive

than previous formalisms

•

Can be expressed in the

same schema as the data

John’s knowledge expressed by patterns

Team

 table is complete

Maintenance

 is complete

     for teams A, B and C

Warnings

 is complete for all of Week 1,

and Monday and Wednesday of Week 2

John’s conclusions expressed by patterns

“Give me all warnings in week 2 that are generated

 by objects in maintenance with a hardware team.”



 The query result contains all warnings from

•

Monday

 for team

•

…

How to compute the completeness patterns for queries?

Queries are computed by

relational algebra

Here: Select, project, equijoin

Schema

reasoning

- Apply

algebra

operators

to

 completeness patterns

     (analogous to query result computation)

Rule 1: Statements with * survive

Reasoning about selections

Rule 2: Irrelevant constants are ignored

Rule 3: Selected constants survive and are

promoted

Reasoning about selections (2)

Reasoning about joins

Rule 1: Constants join with equal constants

Rule 2: Stars join with anything

Rule 3: Constants can be

promoted

Algorithmic completeness

Proven

: Extended algebra gives

all conclusions

that hold on the

schema level

(reasoning only with the yellow metadata)

•

Independent of the algebra tree chosen

Looking at the data

There

cannot be

other hardware

teams than

A and B

Database instance

allows for

more promotion

How expensive is that?

Experiments:

–

Schema

vs

instance

 reasoning

–

Instance

 reasoning vs

query evaluation

–

Minimization

 of sets of patterns

Instance reasoning vs query evaluation (1)

•

Synthetic data

•

Wikipedia

has around

1000 lists

declared as

complete

(using a template or in natural language)

http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29

•

Manually extracted some and grouped them by topic

–

Recurrent topics: Sports teams, political assemblies, geographical features,

songs, operas and other pieces of art

•

Generated

one table each

about cities, schools and countries

Instance reasoning vs query evaluation (2)

SELECT *

FROM country, city, school

WHERE country.capital=city.name

   AND city.state=school.state

SQL runtime:

2040 ms (25891 records)

Completeness pattern runtime

: 900 ms (46 patterns)

Median

 over 7 join queries:

•

SQL runtime

: 2040 ms

•

Completeness pattern runtime

: 460 ms

Instance reasoning vs query evaluation (3)

Summary

•

Completeness patterns

are a natural way to describe

complete parts of databases and query answers

–

Can be expressed in the same schema

•

Modified the relational algebra

to manipulate completeness patterns

–

Selection and projection easy

–

Join may be expensive (in theory, in practice, usually not)

•

Also in the paper

–

Aggregation

 operators

–

Details on instance-specific

generalizations

–

Comparison of

data structures for minimizing patterns

•

Discrimination trees allow efficient handling of large sets

Slide Note

Embed Share

Download

The study delves into how to assess the completeness of query answers when dealing with partially complete databases. By analyzing data from a telecommunication company’s data warehouse, the query results are examined to determine if all warnings generated by maintenance objects with hardware teams during week 2 have been captured. Through a series of SQL operations and logical reasoning, the completeness of the query results is evaluated, ensuring accurate and comprehensive information retrieval.

crystal Follow

Uploaded on Aug 19, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Identifying the Extent of Completeness of Query Answers over Partially Complete Databases Simon Razniewski1, Flip Korn2, Werner Nutt1, Divesh Srivastava3 1Free University of Bozen-Bolzano 2Google Research 3AT&T Labs - Research

Data warehouse of a telecommunication company Admin John knows Team table is complete (HR says so) Maintenance is complete for teams A, B and C their reporting systems export data automatically Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Potential data loss due to a system failure on Tuesday Data further than Wednesday maybe not fully loaded Maintenance Warnings Teams ID resp reason day week ID message name specialization tw37 A disk failure Mon 1 tw37 high voltage A hardware tw59 D software crash Fri 1 tw37 high voltage B hardware tw83 B unknown Wed 2 tw37 overheat C network tw91 C update failure Tue 1 tw59 auto restart C software tw91 C network error Fri 1 tw59 overheat D network Mon 2 tw83 high voltage Tue 2 tw83 auto restart 2

John wants to know Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Is this all that hardware teams have done? SELECT * FROM Warnings W JOIN Maintenance M ON W.ID = M.ID JOIN Teams T ON M.responsible = T.name WHERE W.week = 2 AND T.specialization = 'hardware' W.Day W.week W.ID W.message M.ID M.resp M.reason T.name T.specialization Wed 2 tw37 overheat tw37 A disk failure A hardware Mon 2 tw83 high voltage tw83 B unknown B hardware Tue 2 tw83 auto restart tw83 B unknown B hardware 3

John reasons Maintenance Warnings Teams ID resp reason day week ID message name specialization Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. Warnings is complete for Week 1 and Monday and Wednesday of Week 2 Maintenance is complete for teams A, B and C Team is complete The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C 4

John looks at the data The query result definitely contains all warnings from Monday for team A Monday for team B Monday for team C Wednesday for team A Wednesday for team B Wednesday for team C Teams name specialization A hardware B hardware C network C software D network There are no other hardware teams than A and B The query result is fully complete for Monday and Wednesday 5

Questions Warnings are complete for Week 1 1. How can we formally describe complete parts of a database? The query result contains all warnings from Monday of week 2 for team A 2. How can we use database completeness information to identify complete parts of query answers? 6

Related work Publication Description Language Focus of the work Motro, TODS 1989 Views Schema-level reasoning Levy, LC statements, similar to views Schema-level reasoning VLDB 1996 Fan & Geerts, PODS 2009 Various query languages (CQ-Datalog) Master data management, where an upper bound database exists Lang et al., SIGMOD 2014 Columns/operators Distributed databases on the web, operational failures during query execution 7

Formalism: Patterns Warnings Warnings Warnings Warnings Warnings We have all warnings from week 1 day day day day day week week week week week ID ID ID ID ID message message message message message Mon Mon Mon Mon Mon 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage We have all warnings from Monday of week 2 Fri Fri Fri Fri Fri 1 1 1 1 1 tw37 tw37 tw37 tw37 tw37 high voltage high voltage high voltage high voltage high voltage Wed Wed Wed Wed Wed 2 2 2 2 2 tw37 tw37 tw37 tw37 tw37 overheat overheat overheat overheat overheat Tue Tue Tue Tue Tue 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 auto restart auto restart auto restart auto restart auto restart Fri Fri Fri Fri Fri 1 1 1 1 1 tw59 tw59 tw59 tw59 tw59 overheat overheat overheat overheat overheat Mon Mon Mon Mon Mon 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 high voltage high voltage high voltage high voltage high voltage Tue Tue Tue Tue Tue 2 2 2 2 2 tw83 tw83 tw83 tw83 tw83 auto restart auto restart auto restart auto restart auto restart * * * * 1 1 1 1 * * * * * * * * Mon Mon Mon 2 2 2 * * * * * * Less expressive than previous formalisms Can be expressed in the same schema as the data 8

Johns knowledge expressed by patterns Maintenance is complete for teams A, B and C Team table is complete Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2 Teams Maintenance Warnings name specialization ID resp reason day week ID message A hardware tw37 A disk failure Mon 1 tw37 high voltage B hardware tw59 D software crash Fri 1 tw37 high voltage C network tw83 B unknown Wed 2 tw37 overheat C software tw91 C update failure Tue 1 tw59 auto restart D network tw91 C network error Fri 1 tw59 overheat * * * A * Mon 2 tw83 high voltage * B * Tue 2 tw83 auto restart * C * * 1 * * Mon 2 * * Wed 2 * * 9

Johns conclusions expressed by patterns Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team. W.Day W.Day W.Day W.week W.week W.week W.ID W.ID W.ID W.message W.message W.message M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization Wed Wed Wed 2 2 2 tw37 tw37 tw37 overheat overheat overheat tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Mon Mon Mon 2 2 2 tw83 tw83 tw83 high voltage high voltage high voltage tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Tue Tue Tue 2 2 2 tw83 tw83 tw83 auto restart auto restart auto restart tw83 tw83 tw83 B B B unknown unknown unknown B B B hardware hardware hardware Mon Mon * * * * * * * * A A * * A A * * Mon * * * * B * B * Mon * * * * C * C * Wed * * * * A * A * Wed * * * * B * B * Wed * * * * C * C * The query result contains all warnings from Monday for team A 10

How to compute the completeness patterns for queries? Queries are computed by relational algebra Here: Select, project, equijoin ?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? Schema reasoning: - Apply algebra operators to completeness patterns (analogous to query result computation) 11

Reasoning about selections ?????= "??"(?)) Teams name specialization name name specialization specialization ?? A hardware A A hardware hardware B hardware B B hardware hardware C network * * C software D network * * Rule 1: Statements with * survive 12

?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 13

Reasoning about selections (2) Warnings day week ID message Mon 1 tw37 high voltage ?????=?(?) Fri 1 tw37 high voltage Wed 2 tw37 overheat day day week week ID ID message message Tue 1 tw59 auto restart Wed Wed 2 2 tw37 tw37 overheat overheat Fri 1 tw59 overheat Mon Mon 2 2 tw83 tw83 high voltage high voltage Mon 2 tw83 high voltage Tue Tue 2 2 tw83 ? * auto restart auto restart tw83 Tue 2 tw83 auto restart ** Mon 2 * * * 1 * * Wed 2 * Mon 2 * * Wed 2 * * Rule 2: Irrelevant constants are ignored Rule 3: Selected constants survive and are promoted 14

?.??=?.?? ?.????=?.???? ?????=2 ?????= " ?" ???????? ????? ??????????? 15

Reasoning about joins Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure tw91 C network error * A * * B * * C * M.ID M.ID M.ID M.resp M.resp M.resp M.reason M.reason M.reason T.name T.name T.name T.specialization T.specialization T.specialization tw37 tw37 tw37 A A A disk failure disk failure disk failure A A A hardware hardware hardware Rule 1: Constants join with equal constants Rule 2: Stars join with anything Rule 3: Constants can be promoted tw83 tw83 tw83 B B B unknown unknown B B B hardware hardware hardware unknown ? * * * * A A * * A * * * * * B B * * B * * * * * C C C * * * * * * A * * * * B * 16 * * * C *

Algorithmic completeness Proven: Extended algebra gives all conclusions that hold on the schema level (reasoning only with the yellow metadata) Independent of the algebra tree chosen 17

Looking at the data Maintenance ID resp reason name specialization tw37 A disk failure ? ?.????=?.?????????= "??"(?) A hardware tw59 D software crash B hardware tw83 B unknown * * tw91 C update failure There cannot be other hardware teams than A and B tw91 C network error * A * * B * * C * M.ID M.resp M.reason T.name T.specialization tw37 A disk failure A hardware tw83 B unknown M.reason B hardware T.specialization M.ID M.resp T.name * A * * * hardware tw37 A disk failure A * B * * * hardware Database instance allows for more promotion! tw83 B unknown B * C * * * * * * * * * * * A * * * * B * 18 * * * C *

How expensive is that? Experiments: Schema vs instance reasoning Instance reasoning vs query evaluation Minimization of sets of patterns 19

Instance reasoning vs query evaluation (1) Synthetic data Wikipedia has around 1000 lists declared as complete (using a template or in natural language) http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29 20

Instance reasoning vs query evaluation (2) Manually extracted some and grouped them by topic Recurrent topics: Sports teams, political assemblies, geographical features, songs, operas and other pieces of art Generated one table each about cities, schools and countries city name * * * * * * * * * country USA Germany Ukraine Bulgaria USA UK USA Czech Slovenia state Virginia * * * New York Carmarthenshire West Virginia Moravian-Silesia * county * * * * * * Hampshire County Nov Ji n * 21

Instance reasoning vs query evaluation (3) SELECT * FROM country, city, school WHERE country.capital=city.name AND city.state=school.state SQL runtime: 2040 ms (25891 records) Completeness pattern runtime: 900 ms (46 patterns) Median over 7 join queries: SQL runtime: 2040 ms Completeness pattern runtime: 460 ms 22

Summary Completeness patterns are a natural way to describe complete parts of databases and query answers Can be expressed in the same schema Modified the relational algebra to manipulate completeness patterns Selection and projection easy Join may be expensive (in theory, in practice, usually not) Also in the paper Aggregation operators Details on instance-specific generalizations Comparison of data structures for minimizing patterns Discrimination trees allow efficient handling of large sets 23

Identifying Completeness of Query Answers in Incomplete Databases

Download Presentation

Presentation Transcript

Related

More Related Content