Study on Completeness of Queries over Incomplete Databases

undefined
 
Joint work with Werner Nutt
 
 
Free University of Bozen-Bolzano
 
C
o
m
p
l
e
t
e
n
e
s
s
 
o
f
 
Q
u
e
r
i
e
s
 
o
v
e
r
I
n
c
o
m
p
l
e
t
e
 
D
a
t
a
b
a
s
e
s
S
i
m
o
n
 
R
a
z
n
i
e
w
s
k
i
 
 
Introduction
 
 
 
Data completeness: 
important aspect of 
data quality
 
 
Query answering 
over incomplete data:
     
      extensively studied
 
 
Query Completeness
:  little work
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
2
 
Bolzano is in the Province of South Tyrol
 
 
 
 
 
 
 
 
 
 
    Autonomous, trilingual province in the north of Italy
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
3
Bolzano
 
School Data in South Tyrol
 
Decentrally maintained database
  
  Statistical reports
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
4
 
?
?
 
notoriously incomplete         
  
       correctness important
 
Example Database Schema
 
 
 
 
Pupil(pname, age, sname)
 
School(sname, type, language)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
5
Completeness Reasoning Example
 
Suppose we have 
data about pupils 
from all
German schools
Italian schools, except the high school “Da Vinci“
Ladin schools, except the middle school “Gherdëna“
 
Will the following query get a correct answer?
 
“How many pupils are at German primary schools?“
 
 Yes
30.08.2011
Completeness of Queries over Incomplete Databases
6
 
(if we also have all German primary schools)
Completeness Reasoning Example (Cntd)
 
Suppose we have 
data about pupils 
from all
German schools
Italian schools, except the high school “Da Vinci“
Ladin schools, except the middle school “Gherdëna“
 
Will the following query get a correct answer?
 
“How many Ladin pupils are there?
 
 Maybe not, pupils from “Gherdëna“ could be missing
30.08.2011
Completeness of Queries over Incomplete Databases
7
 
Overview
 
 
Formalization
Incomplete Database
Query Completeness
Table Completeness
 
Reasoning for Conjunctive Queries
Bag Semantics
Set Semantics
Aggregate Queries
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
8
 
Incomplete Database (Motro 1989)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
9
 
Incomplete Database - Example
 
D = (
D
i
, 
D
a
)
 
“Paul and Andrea are pupils in the ideal database”
 
       
D
i
 = { pupil(‘Paul‘, 11, ‘Da Vinci‘),
                  pupil(‘Andrea‘, 14, ‘Gherdëna‘) }
 
“Our available database misses the fact that Andrea is a pupil“
 
 
       
D
a 
= { 
pupil(‘Paul‘, 11, ‘Da Vinci‘) }
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
10
 
Query Completeness (Motro 1989)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
11
Table Completeness (Levy 1996)
30.08.2011
Completeness of Queries over Incomplete Databases
12
This is a full TGD
(= tuple generating dependency)
 
Table Completeness (Cntd)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
13
 
Completeness Reasoning
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
14
 
TC
Statements
C
 
QC
Statement
Compl(Q)
 
Completeness Reasoning (Cntd)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
15
 
What is Known?
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
16
TC-QC
bag
 – Canonical TC Statements
“How many 12-year old pupils are
 at
 the Italian schools?''
Q(COUNT(p)) :
 pupil(p, 12, s), school(s, t, ‘Italian')‏
Q can be answered correctly if
 
   - every 12-year old pupil from an Italian school is there
 
   - every Italian school with a 12-year old pupil is there
That is, if the database satisfies
   - 
Compl(pupil(p, a, s); school(s, t, ‘Italian'), a = 12)
   - 
Compl(school(s, t, l); pupil(p, 12, s), l = ‘Italian')
30.08.2011
Completeness of Queries over Incomplete Databases
17
 
TC-QC
bag
 – Canonical TC Statements (Cntd)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
18
TC-QC
bag
 Reduces to TC-TC
30.08.2011
Completeness of Queries over Incomplete Databases
19
 
TC-TC Entailment = Query Containment
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
20
 
TC-TC Entailment = Query Containment (Cntd)
 
TC statements 
describe parts of tables that are complete
 
TC statements 
entail each other if the parts described are contained
 
Entailment
 of TC from TC can naturally be reduced to
      
    
query containment
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
21
Theorem
:
 
Let 
L
 be a class of conjunctive queries that
 
      (i) contains for every relation the identity query
 
      (ii) is closed under intersection
 
Then 
TC-TC entailment 
and 
containment of unions 
of queries
               can be 
reduced
 to each other 
in linear time
.
 
Complexity
 
 
Classes of conjunctive queries:
 
CQ: Conjunctive queries with comparisons over dense orders
RQ: Relational conjunctive queries (i.e., without comparisons)
LCQ: Linear conjunctive queries (i.e., without self-joins)
LRQ: Linear relational conjunctive queries
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
22
 
TC-QC
bag
 - Complexity
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
23
 
TC-QC
set
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
24
30.08.2011
Completeness of Queries over Incomplete Databases
25
TC-QC
bag
 - Complexity
TC-QC
set
 
Completeness Reasoning for
Aggregate Queries
 
 
 
 
SUM
 
and 
COUNT
:  
similar
 
to 
bag semantics
 
MIN
 
and 
MAX
:   similar to 
set semantics
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
26
 
QC-QC and Query Determinacy
 
Motro’s idea: Look for 
rewritings
 
Given
 
Q
1
(x) :− R(x), S(x)
 
Q
2
(x) :− T(x)
 
Suppose we know 
Compl(Q
1
)
 and 
Compl(Q
2
)
 
Consider
 
Q(x) :− R(x), S(x), T(x)
 
We see: 
Q
 can be 
rewritten
 as
 
 
Q(x) :− 
Q
1
(x), 
Q
2
(x)
 
Therefore, we
 
conclude
 
Compl(Q)
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
27
 
QC-QC and Query Determinacy (Cntd)
 
Queries 
Q
1
, …, Q
n
, Q
 
 
 
 
 
 
 
 
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
28
 
QC-QC and Query Determinacy (Cntd)
 
 
 
 
However:
Decidability of determinacy for conj. queries is open (Segoufin/Vianu ‘05)
Necessity of determinacy for QC-QC entailment is open
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
29
Theorem
: For 
boolean queries
, existence of 
rewritings
, 
determinacy
 and
                 QC-QC entailment 
coincide
 
Where Can Completeness Statements Come From?
 
Any conclusion only as correct as the statements it is derived from
~> On which basis can someone give a completeness statement?
 
Someone 
knows
 some 
part of the real world
 
E.g., a class teacher knows all his students
 
The method of 
data collection 
is known to be 
complete
 
E.g., at the deadline for enrolment all forms must be present
 
Cardinalities
 of parts of the real world 
are
 
known
 and 
the method
of data collection 
is 
correct
 
E.g., no nonexisting schools are registered and the number of
                 schools in South Tyrol is known
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
30
 
Conclusion
 
Framework for 
modelling completeness
query answers (Motro: 
QC statements
)
parts of databases (Levy: 
TC statements
)
 
Reasoning
Complexity analysis 
of 
TC-TC
 and 
TC-QC
Connection
 between 
determinacy 
and 
QC-QC
Reasoning in the presence of 
instances
 
Current work
Schema constraints (keys, foreign keys, finite domains)
Null values
Prototypical implementation
 
30.08.2011
 
Completeness of Queries over Incomplete Databases
 
31
Slide Note
Embed
Share

Investigation into query completeness over incomplete databases, highlighting the importance of data completeness for accurate query answering. Examples and reasoning provided to illustrate the challenges and considerations in ensuring query completeness.

  • Databases
  • Query Completeness
  • Data Quality
  • Research Study
  • Data Analysis

Uploaded on Sep 22, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Completeness of Queries over Incomplete Databases Simon Razniewski Joint work with Werner Nutt Free University of Bozen-Bolzano

  2. Introduction Data completeness: important aspect of data quality Query answering over incomplete data: extensively studied Query Completeness: little work 30.08.2011 Completeness of Queries over Incomplete Databases 2

  3. Bolzano is in the Province of South Tyrol Bolzano Autonomous, trilingual province in the north of Italy 30.08.2011 Completeness of Queries over Incomplete Databases 3

  4. School Data in South Tyrol Decentrally maintained database Statistical reports ?? notoriously incomplete correctness important 30.08.2011 Completeness of Queries over Incomplete Databases 4

  5. Example Database Schema Pupil(pname, age, sname) School(sname, type, language) 30.08.2011 Completeness of Queries over Incomplete Databases 5

  6. Completeness Reasoning Example Suppose we have data about pupils from all German schools Italian schools, except the high school Da Vinci Ladin schools, except the middle school Gherd na Will the following query get a correct answer? How many pupils are at German primary schools? Yes (if we also have all German primary schools) 30.08.2011 Completeness of Queries over Incomplete Databases 6

  7. Completeness Reasoning Example (Cntd) Suppose we have data about pupils from all German schools Italian schools, except the high school Da Vinci Ladin schools, except the middle school Gherd na Will the following query get a correct answer? How many Ladin pupils are there? Maybe not, pupils from Gherd na could be missing 30.08.2011 Completeness of Queries over Incomplete Databases 7

  8. Overview Formalization Incomplete Database Query Completeness Table Completeness Reasoning for Conjunctive Queries Bag Semantics Set Semantics Aggregate Queries 30.08.2011 Completeness of Queries over Incomplete Databases 8

  9. Incomplete Database (Motro 1989) Incompleteness needs a complete reference Incomplete databases are pairs of an ideal database an available database Da Diand D = (Di, Da) such that Da Di 30.08.2011 Completeness of Queries over Incomplete Databases 9

  10. Incomplete Database - Example D = (Di, Da) Paul and Andrea are pupils in the ideal database Di= { pupil( Paul , 11, Da Vinci ), pupil( Andrea , 14, Gherd na ) } Our available database misses the fact that Andrea is a pupil Da = { pupil( Paul , 11, Da Vinci ) } 30.08.2011 Completeness of Queries over Incomplete Databases 10

  11. Query Completeness (Motro 1989) Query Q The set of answers to Q is complete Notation: Compl(Q) Semantics: (Di, Da) Compl(Q) Q(Di) = Q(Da) iff 30.08.2011 Completeness of Queries over Incomplete Databases 11

  12. Table Completeness (Levy 1996) Table pupil(pname,age,sname) Our available db contains all pupils from Ladin schools Formally: pupili(p,a,s) schooli(s,t, Ladin ) pupila(p,a,s) If (p, a, s) is a Ladin pupil according to the ideal db, then (p, a, s) is a pupil in the available db This is a full TGD (= tuple generating dependency) 30.08.2011 Completeness of Queries over Incomplete Databases 12

  13. Table Completeness (Cntd) Our available db contains all pupils from Ladin schools TGD: ?c = pupili(p,a,s) schooli(s,t, Ladin ) pupila(p,a,s) Notation: Compl(pupil(p, a, s); school(s, t, Ladin ) Semantics: (Di, Da) Compl(pupil(p,a,s); school(s, t, Ladin )) iff (Di, Da) ?c 30.08.2011 Completeness of Queries over Incomplete Databases 13

  14. Completeness Reasoning We have complete data about pupils from all German schools Italian schools, except the high school Da Vinci Ladin schools, except the middle school Gherd na TC Statements C Query How many pupils are at German primary schools? QC Statement Compl(Q) TC-QC entailment C Compl(Q) ? 30.08.2011 Completeness of Queries over Incomplete Databases 14

  15. Completeness Reasoning (Cntd) TC-QC: table completeness entails query completeness Compl(R1; G1), , Compl(Rn; Gn) Compl(Q) - bag semantics Complbag(Q) - set semantics Complset(Q) QC-QC: query completeness entails query completeness Compl(Q1), , Compl(Qn) Compl(Q) TC-TC: table completeness entails table completeness Compl(R1; G1), , Compl(Rn; Gn) Compl(R; G) 30.08.2011 Completeness of Queries over Incomplete Databases 15

  16. What is Known? Characterizing QC-QC entailment: Compl(Q1), , Compl(Qn) Compl(Q) Existence of a rewriting is a sufficient condition (Motro 1989) Deciding TC-QC entailment: Compl(R1; G1), , Compl(Rn; Gn) Compl(Q) Decision procedure for trivial cases (Levy 1996) For reasoning w.r.t. a concrete database instance, data complexity is coNP-complete for first-order queries and TC statements (Denecker et al. 2007) 30.08.2011 Completeness of Queries over Incomplete Databases 16

  17. TC-QCbag Canonical TC Statements How many 12-year old pupils are at the Italian schools?'' Q(COUNT(p)) : pupil(p, 12, s), school(s, t, Italian') Q can be answered correctly if - every 12-year old pupil from an Italian school is there - every Italian school with a 12-year old pupil is there That is, if the database satisfies canonical completeness statements for Q - Compl(pupil(p, a, s); school(s, t, Italian'), a = 12) - Compl(school(s, t, l); pupil(p, 12, s), l = Italian') 30.08.2011 Completeness of Queries over Incomplete Databases 17

  18. TC-QCbag Canonical TC Statements (Cntd) Query Q( x) : A1( x1), , An( xn), The canonical table completeness statement for atom Aiis Compl(Ai; A1, , An-1, An+1, , An) CanQis the set of canonical completeness statements for all atoms of Q Proposition: (Di, Da) CanQ (Di, Da) Complbag (Q) implies 30.08.2011 Completeness of Queries over Incomplete Databases 18

  19. TC-QCbagReduces to TC-TC We saw: CanQ Complbag (Q) ( Complset (Q)) Theorem: Complbag(Q) CanQ For any set C of TC-statements: iff C Complbag(Q) TC-QC TC-TC C CanQ 30.08.2011 Completeness of Queries over Incomplete Databases 19

  20. TC-TC Entailment = Query Containment C1= Compl(pupil(n, a, s); True) C2= Compl(pupil(n, a, s); a = 12') Obviously, C1entails C2 Q1(n) : pupil(n, a, s) Q2(n) : pupil(n, a, s), a = 12' Q2is contained in Q1 C1entails C2 because Q2is contained in Q1 30.08.2011 Completeness of Queries over Incomplete Databases 20

  21. TC-TC Entailment = Query Containment (Cntd) TC statements describe parts of tables that are complete TC statements entail each other if the parts described are contained Entailment of TC from TC can naturally be reduced to query containment Theorem: can be reduced to each other in linear time. Let L be a class of conjunctive queries that (i) contains for every relation the identity query (ii) is closed under intersection Then TC-TC entailment and containment of unions of queries 30.08.2011 Completeness of Queries over Incomplete Databases 21

  22. Complexity Classes of conjunctive queries: CQ: Conjunctive queries with comparisons over dense orders RQ: Relational conjunctive queries (i.e., without comparisons) LCQ: Linear conjunctive queries (i.e., without self-joins) LRQ: Linear relational conjunctive queries 30.08.2011 Completeness of Queries over Incomplete Databases 22

  23. TC-QCbag- Complexity Query Language LRQ LCQ RQ CQ LRQ in PTIME in PTIME NP NP TC Statement Language RQ in PTIME in PTIME NP NP ? ? LCQ coNP coNP 2 2 ? ? CQ coNP coNP 2 2 30.08.2011 Completeness of Queries over Incomplete Databases 23

  24. TC-QCset TC-QCsetis Containment w.r.t. to TC statements C Qi= Qa iff C Qi Qa (monotonicity of Q) Containment w.r.t. TGDs C Qi Qa iff ?c Qi Qa More complex than TC-TC 30.08.2011 Completeness of Queries over Incomplete Databases 24

  25. TC-QCbag- Complexity TC-QCset Query Language LRQ LCQ RQ CQ NP ? LRQ in PTIME in PTIME NP 2 TC Statement Language NP ? RQ in PTIME in PTIME NP 2 ? ? LCQ coNP coNP 2 2 ? ? CQ coNP coNP 2 2 30.08.2011 Completeness of Queries over Incomplete Databases 25

  26. Completeness Reasoning for Aggregate Queries SUM and COUNT: similar to bag semantics MIN and MAX: similar to set semantics 30.08.2011 Completeness of Queries over Incomplete Databases 26

  27. QC-QC and Query Determinacy Motro s idea: Look for rewritings Given Q1(x) : R(x), S(x) Q2(x) : T(x) Suppose we know Compl(Q1) and Compl(Q2) Consider Q(x) : R(x), S(x), T(x) We see: Q can be rewritten as Q(x) : Q1(x), Q2(x) Therefore, we conclude Compl(Q) 30.08.2011 Completeness of Queries over Incomplete Databases 27

  28. QC-QC and Query Determinacy (Cntd) Queries Q1, , Qn, Q Determinacy: Q1, , Qndetermine Q, written Q1, , Qn Q, iff Q1(D) = Q1(D ), , Qn(D) = Qn(D ) implies Q(D) = Q(D ) for all pairs of dbs D, D QC-QC Entailment: Compl(Q1), , Compl(Qn) entails Compl(Q), iff Q1(Di) = Q1(Da), , Qn(Di) = Qn(Da) implies Q(Di) = Q(Da) for all pairs of dbs Di, Da where Da Di Proposition: Q1, , Qn Q implies Compl(Q1), , Compl(Qn) Compl(Q) 30.08.2011 Completeness of Queries over Incomplete Databases 28

  29. QC-QC and Query Determinacy (Cntd) However: Decidability of determinacy for conj. queries is open (Segoufin/Vianu 05) Necessity of determinacy for QC-QC entailment is open Theorem: For boolean queries, existence of rewritings, determinacy and QC-QC entailment coincide 30.08.2011 Completeness of Queries over Incomplete Databases 29

  30. Where Can Completeness Statements Come From? Any conclusion only as correct as the statements it is derived from ~> On which basis can someone give a completeness statement? Someone knows some part of the real world E.g., a class teacher knows all his students The method of data collection is known to be complete E.g., at the deadline for enrolment all forms must be present Cardinalities of parts of the real world are known and the method of data collection is correct E.g., no nonexisting schools are registered and the number of schools in South Tyrol is known 30.08.2011 Completeness of Queries over Incomplete Databases 30

  31. Conclusion Framework for modelling completeness query answers (Motro: QC statements) parts of databases (Levy: TC statements) Reasoning Complexity analysis of TC-TC and TC-QC Connection between determinacy and QC-QC Reasoning in the presence of instances Current work Schema constraints (keys, foreign keys, finite domains) Null values Prototypical implementation 30.08.2011 Completeness of Queries over Incomplete Databases 31

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#