WISK: A Workload-aware Learned Index for Spatial Keyword Queries

WISK: A Workload-aware Learned Index for Spatial
Keyword Queries
Yufan Sheng
1
, 
Xin Cao
1*
, 
Yixiang Fang
2
,  Kaiqi Zhao
3
,
Jianzhong Qi
4
, 
Gao Cong
5
, Wenjie Zhang
1
2
Geo-textual objects
: stores, tourist attractions, business, etc.
Geographical location (e.g. Latitude 33°51'S, Longitude 151°12'E)
Text description (e.g. opera, live, car park)
 
Spatial keyword query
: 
retrieve the resultant object combining
spatial query and keyword search.
Spatial query: 
range query
, k-NN query.
Keyword search: 
Boolean search
, ranking-based search
.
 
Spatial keyword index
: 
integrate spatial index into textual
index
 [1]
Combination schema: 
spatial-first/textual-first integration, 
tight
integration
Introduction
1.
Chen, Z., Chen, L., Cong, G., Jensen, C. S.: Location-and keyword-based querying of geo-textual data: a
survey. In: The VLDB Journal, pp 603-640 (2021).
Solution: query-driven partition + RL packing
3
However, traditional indexes only consider the distribution of the underlying data.
Challenge 1: how to utilize the information from the query workload?
Assume two queries (red) have different keywords (e.g. q
1
.kws = {Park, BBQ}, q
2
.kws = {Hotel, Pizza}
Motivation
(a) Data-driven
(b) Query-aware
4
Multi-dimensional learned index cannot adapt to the high dimensionality scenario.
Challenge 2: how to consider the 
spatial
 and 
textual
 information simultaneously?
Motivation
(a) Loose combination
(b) Tight combination
Solution: keyword-based CDF models
5
SKR Query:
Given a query 
q
 with a query region (i.e. 
q.area
) and a
set of keywords (i.e. 
q.kws
), the result of q is a set of
objects, which are within the region and at least
contain one keyword.
Problem Definition
Problem statement:
Given a geo-textual dataset 
D
 and a set of historical
SKR queries 
W
, we aim to learn a spatial keyword
index that can efficiently process the incoming SKR
queries 
Q
.
 
Note: We assume the distributions of W and Q are
the same.
6
Two separate steps
ML to solve the optimization problem
RL to construct the hierarchical index
Overview
7
Cost Formulation
8
Problem
: Optimal Partitioning
Given a dataset D = {𝑜
1
, 𝑜
2
, . . . , 𝑜
𝑛
 } and a query workload
 
W = {𝑞
1
, 𝑞
2
, . . . , 𝑞
𝑚
 }, we aim to find an optimal partition,
i.e., a set of 
k
 clusters 𝐺 = {𝑐
1
, 𝑐
2
, . . . , 𝑐
𝑘
 },
 
where
I.
Each object belongs to exactly one cluster.
II.
The total cost is minimized.
Bottom Partition
Each bottom partition can be considered as a set of objects
NP-hard
 problem: can be reduced from the MaxSkip partitioning problem 
[1, 2]
.
1.
Sun, L., Franklin, M. J., Krishnan, S., Xin, R. S.: Fine-grained partitioning for aggressive data skipping. In: SIGMOD, pp. 1115-1126 (2014).
2.
Yang, Z., Chandramouli, B., Wang, C., Gehrke, J., Li, Y., Minhas, U. F., Larson, P. A., Kossmann, D., Acharya, R.: Qd-tree: Learning data layouts for
big data analytics. In: SIGMOD, pp. 193-208 (2020).
9
Motivated by K-D-Tree, we propose a greedy bi-partitioning algorithm
Greedy Partition
10
Key problems in splitting
Which dimension?
Where?
Greedy SGD Partition
11
The generated bottom clusters can be regarded as a flat index
The pruning power can be further improved
Bottom-up Packing
Problem
: Bottom-up Packing
Given a set of bottom clusters G = {c
1
, c
2
, . . . , c
k
 } and a query workload
 
W = {𝑞
1
, 𝑞
2
, . . . , 𝑞
𝑚
 }, we aim to
construct a hierarchical index level by level, which can minimize the number of accessed nodes
Definition
: Optimization Criterion
Query time – directly reflect the pruning capability but impossible to use
The number of accessed nodes 
– proportional to the time cost and easy to calculate with query labels
12
We formulate the bottom-up packing as an MDP, which can be solved by RL
RL Packing
State: upper node 
 query labels + # of bottom nodes
          bottom node – query labels
Action: add the bottom node into one upper node
Transition: update the query label of the upper node
Reward: the difference of the average number of accessed nodes
13
Datasets
Settings
Experiments
Baseline
Traditional index: CDIR-Tree, SFC-Quad, ST2I, ST2D
Learned index: TFI, LSTI, Flood-T
14
Experiments
Varying the
query
distribution
Varying the
query region
size
Varying the
number of
keywords
WISK can work well
across different settings
15
Scalability
Experiments
Robustness
WISK can work well on large-scale dataset
WISK can work well when query
distribution changes slightly
16
Conclusion
 
 
 
 
 
 
Use ML algorithm to solve the optimal partitioning problem, which is shown to be NP-hard.
 
Pack the generated clusters by using RL framework, which further improve the pruning
capability.
 
Report on extensive experiments on real-world datasets and synthetic workloads.
Conclusion
17
WISK: A Workload-aware Learned Index for Spatial Keyword Queries
Thank you!
Yufan Sheng
, Xin Cao*, Yixiang Fang,  Kaiqi Zhao, Jianzhong Qi, Gao Cong, Wenjie
Zhang
yufan.sheng@unsw.edu.au
Slide Note
Embed
Share

WISK, a workload-aware learned index that combines spatial and keyword queries to efficiently retrieve objects. It integrates spatial and textual indexes and considers query workload information.

  • learned index
  • spatial keyword queries
  • workload-aware
  • spatial query
  • keyword search
  • spatial index
  • textual index
  • combination schema

Uploaded on Dec 21, 2023 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. WISK: A Workload-aware Learned Index for Spatial Keyword Queries Yufan Sheng1, Xin Cao1*, Yixiang Fang2, Kaiqi Zhao3, Jianzhong Qi4, Gao Cong5, Wenjie Zhang1

  2. Introduction Geo-textual objects: stores, tourist attractions, business, etc. Geographical location (e.g. Latitude 33 51'S, Longitude 151 12'E) Text description (e.g. opera, live, car park) Spatial keyword query: retrieve the resultant object combining spatial query and keyword search. Spatial query: range query, k-NN query. Keyword search: Boolean search, ranking-based search. Spatial keyword index: integrate spatial index into textual index[1] Combination schema: spatial-first/textual-first integration, tight integration 1.Chen, Z., Chen, L., Cong, G., Jensen, C. S.: Location-and keyword-based querying of geo-textual data: a survey. In: The VLDB Journal, pp 603-640 (2021). 2

  3. Motivation However, traditional indexes only consider the distribution of the underlying data. Challenge 1: how to utilize the information from the query workload? Assume two queries (red) have different keywords (e.g. q1.kws = {Park, BBQ}, q2.kws = {Hotel, Pizza} (b) Query-aware (a) Data-driven Solution: query-driven partition + RL packing 3

  4. Motivation Multi-dimensional learned index cannot adapt to the high dimensionality scenario. Challenge 2: how to consider the spatial and textual information simultaneously? (b) Tight combination (a) Loose combination Solution: keyword-based CDF models 4

  5. Problem Definition SKR Query: Given a query q with a query region (i.e. q.area) and a set of keywords (i.e. q.kws), the result of q is a set of objects, which are within the region and at least contain one keyword. {Italian, Restaurant} O 4 {Chinese, Asian, Restaurant} O7 {Chinese, Restaurant} {India, Market} {Thailand, Restaurant} O5 Problem statement: Given a geo-textual dataset D and a set of historical SKR queries W, we aim to learn a spatial keyword index that can efficiently process the incoming SKR queries Q. O3 {Market} O2 {Italian, Restaurant} O6 {Chinese, Restaurant} O1 Note: We assume the distributions of W and Q are the same. 5

  6. Overview Two separate steps ML to solve the optimization problem RL to construct the hierarchical index 6

  7. Cost Formulation 7

  8. Bottom Partition Each bottom partition can be considered as a set of objects Problem: Optimal Partitioning Given a dataset D = {?1, ?2, . . . , ??} and a query workload W = {?1, ?2, . . . , ??}, we aim to find an optimal partition, i.e., a set of k clusters ? = {?1, ?2, . . . , ??}, where I. Each object belongs to exactly one cluster. II. The total cost is minimized. NP-hard problem: can be reduced from the MaxSkip partitioning problem [1, 2]. 8 1.Sun, L., Franklin, M. J., Krishnan, S., Xin, R. S.: Fine-grained partitioning for aggressive data skipping. In: SIGMOD, pp. 1115-1126 (2014). 2.Yang, Z., Chandramouli, B., Wang, C., Gehrke, J., Li, Y., Minhas, U. F., Larson, P. A., Kossmann, D., Acharya, R.: Qd-tree: Learning data layouts for big data analytics. In: SIGMOD, pp. 193-208 (2020).

  9. Greedy Partition Motivated by K-D-Tree, we propose a greedy bi-partitioning algorithm Update Lower cost Split Y X Ignore Higher cost 9

  10. Greedy SGD Partition Key problems in splitting Which dimension? Where? X x xb xu 10

  11. Bottom-up Packing The generated bottom clusters can be regarded as a flat index The pruning power can be further improved Definition: Optimization Criterion Query time directly reflect the pruning capability but impossible to use The number of accessed nodes proportional to the time cost and easy to calculate with query labels Problem: Bottom-up Packing Given a set of bottom clusters G = {c1, c2, . . . , ck} and a query workload W = {?1, ?2, . . . , ??}, we aim to construct a hierarchical index level by level, which can minimize the number of accessed nodes 11

  12. RL Packing We formulate the bottom-up packing as an MDP, which can be solved by RL State: upper node query labels + # of bottom nodes bottom node query labels Bottom nodes X Y Z X + 1 3 2 Action: add the bottom node into one upper node Transition: update the query label of the upper node Z + Z + 3 1 2 2 Y Y Y + + + 3 3 3 1 1 1 X Y Z + Z 2 2 2 Z Z Z Z Z Z + + + + + + 3 3 3 3 3 1 3 1 3 3 1 1 1 1 1 1 2 2 2 2 2 2 2 Reward: the difference of the average number of accessed nodes 12

  13. Experiments Datasets Baseline Traditional index: CDIR-Tree, SFC-Quad, ST2I, ST2D Learned index: TFI, LSTI, Flood-T Settings 13

  14. Experiments Varying the query distribution WISK can work well across different settings Varying the query region size Varying the number of keywords 14

  15. Experiments Scalability Robustness ST2I Flood-T WISK ST2I LSTI Flood-T WISK 250 3600 Query time (ms) Query time (ms) 200 2700 150 1800 100 900 0 50 1M 5M 10M 50M 100M 0 0.2 0.4 LAP ratio 0.6 0.8 1 No. of objects WISK can work well on large-scale dataset WISK can work well when query distribution changes slightly 15

  16. Conclusion Conclusion Data-driven index Query-aware index Spatial index R-Tree, RSMI, LISA Flood, Tsunami Geo-textual index WISK CDIR-Tree, SFC-Quad, LSTI Use ML algorithm to solve the optimal partitioning problem, which is shown to be NP-hard. Pack the generated clusters by using RL framework, which further improve the pruning capability. Report on extensive experiments on real-world datasets and synthetic workloads. 16

  17. Thank you! WISK: A Workload-aware Learned Index for Spatial Keyword Queries Yufan Sheng, Xin Cao*, Yixiang Fang, Kaiqi Zhao, Jianzhong Qi, Gao Cong, Wenjie Zhang yufan.sheng@unsw.edu.au 17

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#