Task-Aware Materialization for Fast Data Analytics at University of Wisconsin-Madison

Slide Note
Embed
Share

Data-driven pipelines play a crucial role in modern applications, and optimizing common tasks can significantly speed up these pipelines. The talk at University of Wisconsin-Madison explores smart materialization algorithms and data structures to enhance the performance of data analytics applications, focusing on achieving a smooth tradeoff between preprocessing space and task performance. Key topics include optimizing query results and developing algorithms for efficient data processing applications.


Uploaded on Oct 10, 2024 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. UNIVERSITY of WISCONSIN-MADISON Task-Aware Materialization for Fast Data Analytics Shaleen Deep shaleen@cs.wisc.edu Wisconsin DB Affiliates 2021

  2. INTRODUCTION Data driven pipelines are the workhorse for modern applications Business Intelligence Analyti cs Social media analysis Visual analytics Workflow for these pipelines consists of several simpler tasks that are chained together chained together Relational queries to retrieve data, apply algorithms, visualize results S peeding up the pipelines can be achieved optimizing the common tasks

  3. EXAMPLE 1 - DATA PIPELINES Pipeline A Relation R Join Relation S Relation T Processing R and S Pipeline B J oin Output J Ranked Optimizing common sub-task speeds up all pipelines Processing T and J Output Pipeline C Join Processing T and J

  4. EXAMPLE TWO - QUERIES ON THE CLOUD Preprocess/on-the-fly process datasets for faster execution Q1( ) Q6( ) Q2( ) Q5( ) Q3( ) Q4( )

  5. INTRODUCTION Ubiquitous optimization technique: Materialization Intermediate query results/indexes for downstream tasks Subexpressions and views for re-use by the query engine Can we achieve a smooth tradeoff? lazy evaluation query time factorization eager materialization materialize subexpressions preprocessing cost (time and space)

  6. INTRODUCTION Main Goal : develop smart materialization algorithms and data structures that significantly improve the performance of data analytics applications is it possible to do fine-grained query-aware materialization? can we smoothly tradeoff between high preprocessing/space vs lower task performance? are the algorithms provably optimal?

  7. THIS TALK Developed novel algorithms for two important data-processing applications Enumerating query results for templated workloads Ranked enumeration for query results

  8. TALK OUTLINE Part 1 Enumerating query results for templated queries Part 2 Ranked enumeration of query results

  9. TEMPLATED QUERY WORKLOAD Many analysis tasks repeatedly access output of join queries generated from templates Fixed query with different predicates - find all mutual friends of two friends p and q Views repeatedly accessed in multiquery optimization context Graph analytics tasks over relational data: find all co-authors of a given author a on DBLP dataset

  10. MOTIVATING EXAMPLE Consider a symmetric binary relation R where R(a,b) denotes user a is a friend of user b A data analyst wishes to execute following query for two friends x x and z z, return all mutual friends y SELECT R1.y FROM R AS R1, R AS R2 WHERE R1.x = [PARAMETER 1] AND R2.z = [PARAMETER 2] AND R1.y = R2.y;

  11. PROBLEM SETTING pre-processing phase enumeration phase Input template query Q ?[?1](?) ?[?2](?) ?[?3](?) Data Structure read-only database D Preprocessing Time TP Space S Goal : Goal : construct a space-efficient data structure to answer instantiations of the query template

  12. DATA STRUCTURE Key Idea: Use a partial materialization strategy Conventional Materialization Partial Materialization Store a binary tree to denote where is the output Output Q(D) is either materialized completely or not materialized at all 1 height controls space usage 1 1 1 1 (a1,b1), (a1,b2), (a2,b1), (a2,b2), (a1,b1), (a1,b2), (a2,b1), (a2,b2), Compressed Representations of Conjunctive Query Results, PODS 2018

  13. TALK OUTLINE Part 1 Enumerating query results for templated queries Part 2 Ranked enumeration of query results

  14. RANKED ENUMERATION Enumerating results in ranked order is a fundamental task data processing task SQL queries express ordered enumeration via ORDER BY and LIMIT clause Subgraph pattern matching problems in graphs and trees It is important to enumerate the results with minimal (ideally constant) delay and minimal (ideally linear) preprocessing over the input Critical when users only want to see the first few results (top-k where k is small)

  15. MOTIVATING EXAMPLE Given query: SELECT * FROM Hotel, Museum, Restuarant WHERE H.area =M.area = R.area ORDER BY H.price + M.ticket + R.cost LIMIT 10; Main Goal : Main Goal : Develop output sensitive algorithms for top-k queries that have minimal delay between any two output results low preprocessing time low space requirement

  16. OPEN SOURCE RDBMS Open-source RDBMS engines usually evaluate ranked joins by materializing result followed by sorting serial operation output input ORDER Join tmp table, filesort Entire join is materialized even if only the top-10 results are needed Ideally, if top-k results, the algorithm should take O(k) time

  17. RESULT INTUITION Key idea: Combine join trees with the properties of ranking functions Using join trees allows us to use worst-case optimal join algorithms Creating a join tree that is compatible with the ranking function gives us the algorithm Suppose the ranking functions is x1+x2+x3. Exploit the fact that the function is monotone This observation can be combined with priority queue based sorting Ranked Enumeration of Conjunctive Query Results, ICDT 2021

  18. FORMAL MAIN RESULT Q arbitrary acyclic join query rank ranking function ?join tree compatible with rank conditionally optimal

  19. CONCLUSION Task-aware algorithms allow for fine-grained decision making of what materialization to materialization to perform For some problems, there is no materialization required apriori Lots of interesting future work to find better algorithms that rely on materialization materialization

  20. MAIN RESULT Consider the full query template Q with hypergraph H = (V, E). Let u be any cover of V. Then, for any database D and worst-case optimal running time parameter slack: depends on query and edge cover; always 1 better than *enumeration is in lexicographic order **space required during preprocessing is also bounded by S

Related


More Related Content