Understanding Probabilistic Query Answering and Group Nearest Neighbor Queries
This chapter delves into probabilistic query types, focusing on probabilistic group nearest neighbor queries. Explore the definitions, processing techniques, and applications of such queries. Learn how probabilistic data management plays a crucial role in uncertain databases, spatial queries, and more, with examples like finding optimal restaurants for a group. Discover the significance of data uncertainty in various real-world applications, such as image retrieval and location-based services.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Probabilistic Data Management Chapter 4: Probabilistic Query Answering (2)
Objectives In this chapter, you will: Learn the definition and query processing techniques of a probabilistic query type Probabilistic Group Nearest Neighbor Query X. Lian and L. Chen,"Probabilistic Group Nearest Neighbor Queries in Uncertain Databases," In IEEE Trans. on Knowledge and Data Eng. (TKDE), vol. 20, no. 6, pp. 809-824, June, 2008. 2
Recall: Probabilistic Query Types Probabilistic Spatial Query Uncertain/probabilistic database Probabilistic range query Probabilistic k-nearest neighbor query Probabilistic group nearest neighbor (PGNN) query Probabilistic reverse k-nearest neighbor query Probabilistic spatial join /similarity join Probabilistic top-k query (or ranked query) Probabilistic skyline query Probabilistic reverse skyline query Probabilistic Preference Query 3
Probabilistic Group Nearest Neighbor Queries in Uncertain Databases In IEEE Trans. on Knowledge and Data Engineering (TKDE), 2008
Group Nearest Neighbor Query Group Nearest Neighbor (GNN) Search [ICDE04] Given a database D anda set Q of query objects q1, q2, , and qn, a GNN query retrieves an object o D that has the smallest summed distance to Q, i.e. i=1~ndist(qi, o) q3 find a data object o that minimizes i=1~3dist(qi, o) q1 o q2 5
An Example of GNN Query In a city, there are many restaurants Three people want to have lunch together at a restaurant Problem: -- person -- restaurant q3 q1 o To find a restaurant in the city that minimizes its total distances to these 3 people q2 6
Other GNN Applications Image Retrieval Find an image in the database that is similar to a group of user- specified query images Geographic information system Mobile computing applications q3 q1 o q2 7
Data Uncertainty In many applications, real-world data are usually uncertain and imprecise Image database Noises in image features Location-based services (LBS) Measurement errors (GPS, RFID, etc.) Trajectory data of mobile users are blurred for the sake of privacy preserving Sensor networks Sensory data inherently contain noises due to the environmental factor, network latency, or fluctuation of battery power 8
Data Uncertainty (cont.) uncertain object o uncertainty region UR(o) traditional database uncertain database 9
GNN in Uncertain Databases Probabilistic Group Nearest Neighbor (PGNN) Query in Uncertain Database [TKDE08] a group of query points q3 uncertainty regions of uncertain objects q1 q2 10
Motivation Example of PGNN In the mixed-reality games (like counter strike (CS)), 3 players may want to find a moving enemy to attack who can minimize their total travelling distance Position of each enemy is uncertain (due to the movement or the network delay) -- player -- moving enemy q3 uncertainty regions of uncertain objects q1 q2 11
Motivation Example of PGNN (cont'd) When a forest has several places on fire, at least 3 firefighters can collaborate to put out a place on fire The positions that are on fire are uncertain A PGNN can be issued to find those places on fire such that this group of firefighters can arrive as early as possible -- firefighter -- place on fire q3 uncertainty regions of uncertain objects q1 q2 12
Contributions The proposal of probabilistic group nearest neighbor (PGNN) query in the uncertain database Two effective pruning methods Spatial pruning Probabilistic pruning Efficient PGNN query processing approach Variants of PGNN query 13
Introduction q3 q1 o Uncertain database q2 Uncertain objects are represented by uncertainty regions rather than precise points Distances between uncertain objects are variables instead of fixed values Query processing over uncertain data Re-define traditional query types over precise data Consider unique characteristics (i.e. uncertainty) of uncertain data Guarantee the accuracy of query answers 14
Definition of PGNN Problem q3 q1 o q2 Probabilistic Group Nearest Neighbor (PGNN) Given an uncertain databaseD, a set of n query points, Q = {q1, q2, , qn}, and a user-specified probability threshold (0, 1], a PGNN query retrieves a set of uncertain objects o D such that they are expected to be GNN of query set Q with probability greater than , that is, where adist(o, Q) is defined as a monotonically increasing function adist(o, Q) = f (dist(o, q1), dist(o, q2), , dist(o, qn)) Euclidean distance aggregate function like SUM, MIN, or MAX 15
Computation of PGNN Answers Straightforward Approach, Linear Scan For each uncertain object o D Sequentially scan the entire database Compute the complex probability integration (via numerical method) Output object o, if its probability is greater than Time Complexity -- O(|D|2) q3 q1 o q2 16
Two Pruning Techniques Spatial Pruning Probabilistic Pruning 17
Spatial Pruning Basic idea Compute the lower/upper bounds of the aggregate distance, adist(o, Q), from each uncertain object o to query set Q, at a low cost Use lower/upper bounds to filter out false alarms q3 select the smallest distance upper bound as threshold aggregate distance upper bound of adist p Q ( , ) 6 5 ro q1 o 4 3 o' dist q C ( , ) center Co o 1 p 2 1 candidates of UR o ( ) data object q2 O p o o' 18
Spatial Pruning (cont'd) Recall from the PGNN definition: In fact, the spatial pruning method discards those objects with expected probability of being GNN equal to 0 19
Derivation of Distance Bounds Probabilistic minimum bounding method (PMBM) Bound all query points with an MBR Compute the minimum/maximum distances between query MBR and uncertain objects q3 max q1 min q2 uncertain object o 20
Derivation of Distance Bounds (cont'd) Probabilistic single point method (PSPM) Select a geometric centroid of query points as the representative Compute the minimum/maximum distances via triangle inequality |dist(q1, q3) - dist (q3, Co)| - ro dist(q1, o) dist(q1, q3)+ dist (q3, Co)+ro centroid q3 q1 ro Co q2 uncertain object o 21
Probabilistic Pruning Intuition of probabilistic pruning Prune those data objects that have the expected PGNN probability (LHS of inequality) smaller than or equal to 22
Probabilistic Pruning (cont'd) (1- )-Hypersphere, (1- ) - hypersphere p ( ) 1- For any uncertain object p, we can pre-compute a hypersphere, namely , within its uncertainty region UR(p), such that object p reside in with probability (1- ), where [0, ] p ( ) 1- rp 1- rp Cp uncertainty region of data object p p ( ) UR 23
Probabilistic Pruning (cont'd) q3 Use (1- )-hypersphere to obtain lower/upper bounds of distance adist(p1- , Q), say [LB_adist(p1- , Q), UB_adist(p1- , Q)] Any object o can be safely pruned if it holds that: UB_adist(p1- , Q) <LB_adist(o, Q) p ( ) 1- q1 dist q C ( , ) p 1 q2 aggregate distance upper bound of adist p Q ( , ) 6 5 upper bound of adist p Q ( , ) o 1- 4 3 o' 2 1 p data object O p o o' 24
PGNN Query Processing Construct an R-tree over uncertainty regions of data objects Nodes are recursively bounded by minimum bounding rectangle (MBR) until finally one node (root) is obtained root N7 N7 N2 c d N5 N6 q3 b q1 a N1 N5 N1 N3 e N3 N2 N4 f q2 N4 a c e g d b f h h g N6 R-tree index I 25
PGNN Query Processing (cont'd) Pruning Intermediate Nodes Define lower/upper bounds of aggregate distances between nodes and query points PMBM (bounding query points with an MBR) PSPM (using centroid of query points and triangle inequality) An intermediate node e can be pruned, if it holds that LB_adist(e, Q) UB_adist(o, Q) for a candidate o 26
PGNN Query Procedure Traverse the nodes of R-tree in a best-first manner by maintaining a minimum heap H (with key the lower bound of distance from node/object to query set Q) Every time we encounter a node/object, we compute its lower/upper bounds of aggregate distance, and apply the spatial pruning to filter out false alarms For uncertain objects that cannot be pruned by spatial pruning, apply probabilistic pruning After tree traversal, we refine the obtained candidate set by calculating the actual PGNN probability 27
Variants of PGNN Query PGNN query with uncertain query objects Re-define the lower/upper bounds of aggregate distances qr3 C3 q q3 qr1 C q uncertainty regions of query objects 1 q1 Co ro dist C C ( , ) q o C2 q 1 qr2 q2 28
Variants of PGNN Query (cont'd) PGNN with different aggregate functions SUM, MIN, MAX AVG, weighted SUM, etc. k-PGNN query To retrieve k uncertain objects that are expected to be closest to a query set Q with probability greater than 29
Variants of PGNN Query (cont'd) Any other variants of PGNN? PGNN on road networks? 30
Experimental Evaluation Experimental Settings Synthetic data sets Generate center locationCo of uncertain object o in a data space [0, 1,000]d Produce radiusro [rmin, rmax] for uncertainty region UR(o) Four types of data sets: lUrU, lUrG, lSrU, lSrG Measures wall clock time (the filtering time of index traversal) speed-up ratio (total time cost compared with that of the linear scan method) 31
Query Performance vs. lUrG PMBM lUrU PMBM lUrG PSPM lUrU PSPM wall clock time (sec) the speed-up ratio compared with the linear scan 0.6 (4717) (3159) (2909) (2676) (2676) (2802) (2724) (3159) (3063) (2909) (4717) (2802) (2628) (2724) the time cost of spatial pruning (upper part) (3063)(4682) (2628)(2796) (2434) (2434) (2526) (2526) 0.4 (2796) (4682) 0.2 the time cost of probabilistic pruning (lower part) 0 lUrU lUrU lUrU lUrU lUrU lUrU lUrG lUrG lUrG lUrG lUrG lUrG 0 0.2 0.4 0.6 0.8 1 data size |D| = 30K, dimensionality d = 3, the number of query pointsn = 4, probability threshold = 1,SUM aggregate function 32
Scalability Test on Data Size |D| lUrG PMBM lUrU PMBM lUrG PSPM lUrU PSPM wall clock time (sec) 0.6 (3709) (3709) (3936) (3936) (3612) (3597) (3711) (3711) (3159) (3159) (3063) (3063) (2540) (2540) (2288) (2288) 0.4 0.2 0 lUrU lUrU lUrU lUrU lUrG lUrG lUrG lUrG 30K 20K 40K 50K |D| dimensionality d = 3, the number of query pointsn = 4, probability threshold = 1, SUM aggregate function 33
Summary We formulated probabilistic group nearest neighbor (PGNN) query in the context of uncertain databases We proposed two effective pruning methods, spatial and probabilistic pruning, to reduce the search space of PGNN query, which can be seamlessly integrated into an efficient query procedure We further discussed some variants of the PGNN query We demonstrated through extensive experiments the efficiency and effectiveness of our proposed pruning methods as well as query processing approaches, in terms of wall clock time and speed-up ratio compared with linear scan 34