Outliers in Aggregate Queries with Scorpion

 
 
Scorpion: Explaining Away
Outliers in Aggregate Queries
 
1
 
Michael Wang
Role: Paper Author
 
 
2
 
 
Why?
 
Many datasets contain outliers, found through aggregation
It's useful to perform why-analysis of outliers
problem diagnosis
better model quality
 We want to be more efficient than manual inspection
 
3
 
 
Related Work: Profiler
 
Quick Review: visual data analysis and anomaly detection
suggests and provides views to help the user understand anomalies
Visual approach vs. technical approach
HCI vs. algorithmic
 
4
 
 
Technical Overview
 
1.
Define problem mathematically
2.
Design architecture
3.
Improve algorithms
 
5
 
 
Definitions
 
provenance
 = set of data points that produces the outlier
hold-out result
 = set of data points that user indicates are NOT outliers
 
6
 
 
Technical Definitions
 
predicate
 = conjunction of range clauses and set containment clauses
 
influence
 = the change in aggregation function when removing points
satisfying the predicate
A predicate with high influence conveys information about outliers.
Influential Predicates Problem:
 
7
 
 
Difficulties
 
1.
Finding influential predicates is harder than returning top k outliers
2.
Influence is sensitive to the aggregation function
3.
Users can select multiple regions of outliers and hold-outs
 
8
 
 
Architecture
 
Provenance: finds input groups corresponding to user selection
Partitioner: generates granular predicates from input groups
Merger: combines predicates to increase influence
Scorer: evaluates Partitioner and Merger by computing influence
 
9
 
 
Architecture
 
10
 
 
Aggregation Properties
 
Influence
 is the change in aggregation function when removing points
satisfying the predicate.
The cost of computing influence corresponds to the cost of aggregation.
incrementally removable
do not need to recompute on entire set when a small subset is removed
independent
the effect of any individual input on influence does not depend on other inputs
anti-monotone
non-influential predicates must not contain influential predicates
 
11
 
 
Algorithm: Decision Tree (DT) Partitioner
 
Because of the independence property, influence should not decrease
when combining tuples of similar influence.
Uses trees to recursively partition into tuples of similar influence
Finds tuples of similar influence by weighted random sample
Separately partition outliers and holdouts and combines with product
 
12
 
 
Algorithm: Bottom-Up (MC) Partitioner
 
Because of the anti-monotonicity property, influential sub-predicates can
be merged into influential predicates.
Finds influential single-attribute predicates
Computes product of predicates and prunes low influence
Iterates until best predicate is found
 
13
 
 
Additional Optimizations
 
The Merger combines adjacent predicates to increase influence.
Only merge top quartile of predicates by influence
Do not call the Scorer for incrementally removable aggregations
Other techniques:
Dimensionality reduction
User input on which attributes are important or ignored
 
14
 
 
Evaluation
 
Objective: understand performance of DT and MC algorithms compared
to naive implementation.
synthetic and real world datasets
measured with precision, recall, F-score, runtime
 
15
 
 
Results Overview
 
Vastly improved performance, good predicates on synthetic datasets
Good predicates on real-world datasets
Caveat: no end-user study
 
16
 
 
17
 
 
Takeaways
 
Predicates convey information about the data that satisfy the predicate.
Influential predicates can both identify and explain outliers.
Efficient algorithms for finding influential predicates can help end users.
 
18
Slide Note
Embed
Share

Many datasets contain outliers discovered through aggregation. Scorpion aims to explain outliers, improve model quality, and enhance efficiency in outlier problem diagnosis. The paper discusses related work, technical definitions, difficulties in finding influential predicates, and the architecture of Scorpion for handling outliers.

  • Outliers
  • Data Analysis
  • Aggregate Queries
  • Anomaly Detection
  • Efficiency

Uploaded on Sep 20, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Scorpion: Explaining Away Outliers in Aggregate Queries Michael Wang Role: Paper Author 1

  2. 2

  3. Why? Many datasets contain outliers, found through aggregation It's useful to perform why-analysis of outliers problem diagnosis better model quality We want to be more efficient than manual inspection 3

  4. Related Work: Profiler Quick Review: visual data analysis and anomaly detection suggests and provides views to help the user understand anomalies Visual approach vs. technical approach HCI vs. algorithmic 4

  5. Technical Overview 1. Define problem mathematically 2. Design architecture 3. Improve algorithms 5

  6. Definitions provenance = set of data points that produces the outlier hold-out result = set of data points that user indicates are NOT outliers 6

  7. Technical Definitions predicate = conjunction of range clauses and set containment clauses influence = the change in aggregation function when removing points satisfying the predicate A predicate with high influence conveys information about outliers. Influential Predicates Problem: 7

  8. Difficulties 1. Finding influential predicates is harder than returning top k outliers 2. Influence is sensitive to the aggregation function 3. Users can select multiple regions of outliers and hold-outs 8

  9. Architecture Provenance: finds input groups corresponding to user selection Partitioner: generates granular predicates from input groups Merger: combines predicates to increase influence Scorer: evaluates Partitioner and Merger by computing influence 9

  10. Architecture 10

  11. Aggregation Properties Influence is the change in aggregation function when removing points satisfying the predicate. The cost of computing influence corresponds to the cost of aggregation. incrementally removable do not need to recompute on entire set when a small subset is removed independent the effect of any individual input on influence does not depend on other inputs anti-monotone non-influential predicates must not contain influential predicates 11

  12. Algorithm: Decision Tree (DT) Partitioner Because of the independence property, influence should not decrease when combining tuples of similar influence. Uses trees to recursively partition into tuples of similar influence Finds tuples of similar influence by weighted random sample Separately partition outliers and holdouts and combines with product 12

  13. Algorithm: Bottom-Up (MC) Partitioner Because of the anti-monotonicity property, influential sub-predicates can be merged into influential predicates. Finds influential single-attribute predicates Computes product of predicates and prunes low influence Iterates until best predicate is found 13

  14. Additional Optimizations The Merger combines adjacent predicates to increase influence. Only merge top quartile of predicates by influence Do not call the Scorer for incrementally removable aggregations Other techniques: Dimensionality reduction User input on which attributes are important or ignored 14

  15. Evaluation Objective: understand performance of DT and MC algorithms compared to naive implementation. synthetic and real world datasets measured with precision, recall, F-score, runtime 15

  16. Results Overview Vastly improved performance, good predicates on synthetic datasets Good predicates on real-world datasets Caveat: no end-user study 16

  17. 17

  18. Takeaways Predicates convey information about the data that satisfy the predicate. Influential predicates can both identify and explain outliers. Efficient algorithms for finding influential predicates can help end users. 18

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#