Analysis of Drawbacks in the BlinkDB System

 
BlinkDB
 
 
BlinkDB: My main takeaways
 
IDEA 1: Organize sampling around query column sets (QCSes)
(a) A small set of QCSes may “cover” all queries in a workload
(b) A MILP formulation picks these column sets
IDEA 2: Determine, on the fly, the QCS that gives the best “bang for the buck”
Cute idea of selectivity: the number of rows selected divided by the number of rows read (see the sketch below)
Both of these are nice, novel ideas.
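
A minimal sketch of IDEA 2, using hypothetical names (`StratifiedSample`, `pick_sample`) rather than BlinkDB's actual API; the paper estimates selectivity on a small subsample, whereas here it is computed directly for clarity:

```python
from dataclasses import dataclass

@dataclass
class StratifiedSample:
    qcs: frozenset   # the query column set this sample is stratified on
    rows: list       # the sampled rows, as dicts

def selectivity(sample, predicate):
    """BlinkDB-style selectivity: rows selected / rows read."""
    selected = sum(1 for row in sample.rows if predicate(row))
    return selected / len(sample.rows)

def pick_sample(samples, query_columns, predicate):
    """Among samples whose QCS covers the query's columns, pick the one
    with the best "bang for the buck": high selectivity means few rows
    are read only to be discarded."""
    covering = [s for s in samples if query_columns <= s.qcs]
    return max(covering, key=lambda s: selectivity(s, predicate), default=None)
```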
 
Where could the BlinkDB approach
fail?
 
QCSes are not stable
The number of rare subgroups is high; dimensionality is bad
For example, if three grouping columns have 10,000 distinct values each, the stratified sample over their cross-product could be GIGANTIC (see the arithmetic below)
Aqua would also fail
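
Back-of-the-envelope arithmetic for that blow-up, assuming a hypothetical K = 100 rows kept per stratum:

```python
distinct_values_per_column = 10_000
num_columns = 3
K = 100  # rows kept per stratum (assumed for illustration)

strata = distinct_values_per_column ** num_columns   # 10**12 possible groups
rows_in_sample = strata * K                          # up to 10**14 rows

print(f"{strata:,} strata -> up to {rows_in_sample:,} sampled rows")
# 1,000,000,000,000 strata -> up to 100,000,000,000,000 sampled rows
# The "sample" can easily dwarf the original table.
```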
 
What are the drawbacks of the
BlinkDB system?
 
Let’s talk about
 
Optimization Techniques
 
Query Classes
 
Repeated Queries
 
What are the drawbacks of the
BlinkDB system: Optimization
Techniques?
 
It is not clear
how to tune the various parameters (K, M)
whether the MILP solution is indeed optimal
whether the techniques apply to other workloads or case studies (two datasets is very limited)
They could also have used bootstrapping for error estimation (see the sketch below)
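
A generic percentile-bootstrap sketch of the kind of error estimation they could have used; this is not BlinkDB code, and the sample below is made up:

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, reps=1000, alpha=0.05):
    """Percentile bootstrap: resample with replacement many times and
    read the confidence interval off the empirical distribution."""
    estimates = sorted(
        stat(random.choices(sample, k=len(sample)))
        for _ in range(reps)
    )
    lo = estimates[int(reps * alpha / 2)]
    hi = estimates[int(reps * (1 - alpha / 2)) - 1]
    return stat(sample), (lo, hi)

# Example: approximate AVG with an error bar from a small sample.
sample = [random.gauss(50, 10) for _ in range(1_000)]
est, (lo, hi) = bootstrap_ci(sample)
print(f"AVG ~ {est:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```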
 
What are the drawbacks of the
BlinkDB system: Query Classes?
 
Does not handle joins/nesting
Handles only very simple queries

Other Drawbacks
 
A QCS either receives K rows or none at all. Is this wise?
The allocation should depend on the variance and distribution of each group
If the cost of sale is 1 for every row of a product, why keep a sample of K? One row would suffice
If the variance is small, a smaller sample may suffice (see the allocation sketch below)
The paper claims partial covering is OK if need be. Is that true?
Partial covering may lead to a biased sample
It may completely miss some groups
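
A textbook alternative to the flat-K rule is Neyman allocation, which sizes each stratum's sample in proportion to N_h * S_h (its size times its standard deviation); a minimal sketch with made-up groups:

```python
import statistics

def neyman_allocation(groups, budget):
    """Allocate a total sampling budget across strata in proportion to
    N_h * S_h, so low-variance strata get tiny samples. A stratum whose
    values are all identical needs only one row."""
    weights = {name: len(vals) * statistics.pstdev(vals)
               for name, vals in groups.items()}
    total = sum(weights.values()) or 1.0
    return {name: max(1, round(budget * w / total))
            for name, w in weights.items()}

groups = {
    "flat":  [1.0] * 1000,                  # variance 0 -> 1 row suffices
    "noisy": [i % 100 for i in range(1000)],
}
print(neyman_allocation(groups, budget=200))  # {'flat': 1, 'noisy': 200}
```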
 
Aqua vs. BlinkDB
 
Very similar ideas for offline precomputed
samples
Aqua
is query agnostic: it takes the full collection of group-by columns to construct its stratified sample
This may be too much
It handles a broader class of queries
It is general enough to apply to (foreign-key) joins
 
Alternatives for speeding up latency
 
Offline Approximate Query Processing
Better or more hardware
Leveraging main memory (e.g., Spark)
Parallelism (e.g., Dremel)
Data Cube Materialization (NEXT)
Online Approximate Query Processing (NEXT)
 
 
 
OLAP
 
From the 90s
Analytics: Business Intelligence
 
Without aggregates, the underlying data can be
represented as a snowflake schema
 
The key idea of a data cube is to materialize and store
aggregates for this snowflake schema
 
 
 
Two representations
 
Normalized Representation
 
Fact tables and dimension tables
 
 
Denormalized Representation
 
One row per transaction
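
A toy illustration of the two representations, with made-up tables (requires pandas):

```python
import pandas as pd

# Normalized: a fact table of transactions referencing dimension tables.
fact = pd.DataFrame({"product_id": [1, 1, 2], "store_id": [10, 11, 10],
                     "units": [3, 1, 2]})
products = pd.DataFrame({"product_id": [1, 2], "product": ["TV", "PC"]})
stores = pd.DataFrame({"store_id": [10, 11], "country": ["U.S.A", "Canada"]})

# Denormalized: join everything into one wide row per transaction.
denormalized = (fact.merge(products, on="product_id")
                    .merge(stores, on="store_id"))
print(denormalized[["product", "country", "units"]])
```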
 
A Sample Data Cube

[Figure: 3-D data cube with dimensions Date (1Qtr–4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in the U.S.A.]

All this data is completely precomputed before any queries arrive!
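
A tiny version of this precomputation using pandas (table values are invented; the "sum" margins are the roll-up aggregates):

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "VCR"],
    "country": ["U.S.A", "Canada", "U.S.A", "Mexico"],
    "quarter": ["1Qtr", "2Qtr", "1Qtr", "3Qtr"],
    "amount":  [100, 80, 120, 40],
})

# Materialize every (product, country) x quarter aggregate up front,
# including "sum" roll-ups, before any query arrives.
cube = pd.pivot_table(sales, values="amount",
                      index=["product", "country"], columns="quarter",
                      aggfunc="sum", margins=True, margins_name="sum")
print(cube)
```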
 
Data Cube Materialization: Idea from
the 90s
 
Sales volume as a function of product,
month, and region
 
[Figure: 3-D cube with axes Product, Month, and Country]
 
Also called
Data Warehousing
 
The curse of dimensionality: a cube over d dimensions must materialize 2^d group-by combinations (see the sketch below)
That said, it is not clear that any of the queries in their workload exceed 7-8 dimensions
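
The 2^d growth, spelled out:

```python
# Number of group-by combinations (cuboids) a full cube must materialize,
# ignoring dimension hierarchies.
for d in (3, 8, 30):
    print(f"{d} dimensions -> {2**d:,} cuboids")
# 3 dimensions -> 8 cuboids
# 8 dimensions -> 256 cuboids
# 30 dimensions -> 1,073,741,824 cuboids
```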
 
Online AQP (Online Aggregation): Another
Idea from the 90s

No upfront computation or storage.
Sample from each group in a random fashion
If you have indexes, you can even do stratified sampling
Display running results + place control in the hands of the user (see the sketch below)
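
A minimal sketch of the online-aggregation loop (illustrative, not a real OLA engine); the user watches a CLT-style 95% error bar shrink and stops whenever the answer is good enough:

```python
import random
import statistics

def online_average(stream, report_every=1_000):
    """Refine a running AVG estimate as randomly ordered rows arrive,
    yielding (estimate, 95% half-width) snapshots along the way."""
    seen = []
    for value in stream:
        seen.append(value)
        if len(seen) % report_every == 0:
            mean = statistics.fmean(seen)
            half_width = 1.96 * statistics.stdev(seen) / len(seen) ** 0.5
            yield mean, half_width

data = [random.gauss(100, 25) for _ in range(10_000)]
random.shuffle(data)  # "sample in a random fashion"
for mean, err in online_average(data):
    print(f"estimate {mean:.1f} +/- {err:.1f}")
```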
 
Why is this bad??
 
Why is OLA bad?
 
Random I/O sucks!
 
Remember:

Disk seek: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns

Thus, 1 random I/O seek = reading 1/3 MB sequentially.
 
 
Why is OLA bad?
 
 
Not the case with memory, however!!
 
OLA may still be important
 
 
That said, memory is FAST, so unless you’re dealing
with large datasets (unlikely to fit in memory due to
$$$), scanning the entire dataset may still be pretty fast
 
When is
ONAQP/OFAQP/materialization better?
 
 
ONAQP, OFAQP, materialization
 
OFAQP and materialization are both better when you have space
(disk/memory) and high predictive power
Materialization is better for a (small) fixed set of queries or
report generation
OFAQP is better for ad-hoc queries and ad-hoc measures
Both OFAQP and materialization are bad for time-varying
data!
ONAQP is better only if done in memory, and only with LARGE
memory; with small memory, there is not much performance difference
between scanning the entire dataset and seeking around