Outliers in Aggregate Queries with Scorpion

Scorpion: Explaining Away

Outliers in Aggregate Queries

Michael Wang

Role: Paper Author

Why?

●

Many datasets contain outliers, found through aggregation

●

It's useful to perform why-analysis of outliers

○

problem diagnosis

○

better model quality

●

 We want to be more efficient than manual inspection

Related Work: Profiler

●

Quick Review: visual data analysis and anomaly detection

○

suggests and provides views to help the user understand anomalies

●

Visual approach vs. technical approach

○

HCI vs. algorithmic

Technical Overview

1.

Define problem mathematically

2.

Design architecture

3.

Improve algorithms

Definitions

provenance

 = set of data points that produces the outlier

hold-out result

 = set of data points that user indicates are NOT outliers

Technical Definitions

predicate

 = conjunction of range clauses and set containment clauses

influence

 = the change in aggregation function when removing points

satisfying the predicate

A predicate with high influence conveys information about outliers.

Influential Predicates Problem:

Difficulties

1.

Finding influential predicates is harder than returning top k outliers

2.

Influence is sensitive to the aggregation function

3.

Users can select multiple regions of outliers and hold-outs

Architecture

Provenance: finds input groups corresponding to user selection

Partitioner: generates granular predicates from input groups

Merger: combines predicates to increase influence

Scorer: evaluates Partitioner and Merger by computing influence

Architecture

Aggregation Properties

Influence

 is the change in aggregation function when removing points

satisfying the predicate.

The cost of computing influence corresponds to the cost of aggregation.

●

incrementally removable

○

do not need to recompute on entire set when a small subset is removed

●

independent

○

the effect of any individual input on influence does not depend on other inputs

●

anti-monotone

○

non-influential predicates must not contain influential predicates

Algorithm: Decision Tree (DT) Partitioner

Because of the independence property, influence should not decrease

when combining tuples of similar influence.

●

Uses trees to recursively partition into tuples of similar influence

●

Finds tuples of similar influence by weighted random sample

●

Separately partition outliers and holdouts and combines with product

Algorithm: Bottom-Up (MC) Partitioner

Because of the anti-monotonicity property, influential sub-predicates can

be merged into influential predicates.

●

Finds influential single-attribute predicates

●

Computes product of predicates and prunes low influence

●

Iterates until best predicate is found

Additional Optimizations

The Merger combines adjacent predicates to increase influence.

●

Only merge top quartile of predicates by influence

●

Do not call the Scorer for incrementally removable aggregations

Other techniques:

●

Dimensionality reduction

●

User input on which attributes are important or ignored

Evaluation

Objective: understand performance of DT and MC algorithms compared

to naive implementation.

●

synthetic and real world datasets

●

measured with precision, recall, F-score, runtime

Results Overview

Vastly improved performance, good predicates on synthetic datasets

Good predicates on real-world datasets

Caveat: no end-user study

Takeaways

Predicates convey information about the data that satisfy the predicate.

Influential predicates can both identify and explain outliers.

Efficient algorithms for finding influential predicates can help end users.

Slide Note

Embed Share

Download

Many datasets contain outliers discovered through aggregation. Scorpion aims to explain outliers, improve model quality, and enhance efficiency in outlier problem diagnosis. The paper discusses related work, technical definitions, difficulties in finding influential predicates, and the architecture of Scorpion for handling outliers.

dsom Follow

Uploaded on Sep 20, 2024 | 2 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Scorpion: Explaining Away Outliers in Aggregate Queries Michael Wang Role: Paper Author 1

Why? Many datasets contain outliers, found through aggregation It's useful to perform why-analysis of outliers problem diagnosis better model quality We want to be more efficient than manual inspection 3

Related Work: Profiler Quick Review: visual data analysis and anomaly detection suggests and provides views to help the user understand anomalies Visual approach vs. technical approach HCI vs. algorithmic 4

Technical Overview 1. Define problem mathematically 2. Design architecture 3. Improve algorithms 5

Definitions provenance = set of data points that produces the outlier hold-out result = set of data points that user indicates are NOT outliers 6

Technical Definitions predicate = conjunction of range clauses and set containment clauses influence = the change in aggregation function when removing points satisfying the predicate A predicate with high influence conveys information about outliers. Influential Predicates Problem: 7

Difficulties 1. Finding influential predicates is harder than returning top k outliers 2. Influence is sensitive to the aggregation function 3. Users can select multiple regions of outliers and hold-outs 8

Architecture Provenance: finds input groups corresponding to user selection Partitioner: generates granular predicates from input groups Merger: combines predicates to increase influence Scorer: evaluates Partitioner and Merger by computing influence 9

Architecture 10

Aggregation Properties Influence is the change in aggregation function when removing points satisfying the predicate. The cost of computing influence corresponds to the cost of aggregation. incrementally removable do not need to recompute on entire set when a small subset is removed independent the effect of any individual input on influence does not depend on other inputs anti-monotone non-influential predicates must not contain influential predicates 11

Algorithm: Decision Tree (DT) Partitioner Because of the independence property, influence should not decrease when combining tuples of similar influence. Uses trees to recursively partition into tuples of similar influence Finds tuples of similar influence by weighted random sample Separately partition outliers and holdouts and combines with product 12

Algorithm: Bottom-Up (MC) Partitioner Because of the anti-monotonicity property, influential sub-predicates can be merged into influential predicates. Finds influential single-attribute predicates Computes product of predicates and prunes low influence Iterates until best predicate is found 13

Additional Optimizations The Merger combines adjacent predicates to increase influence. Only merge top quartile of predicates by influence Do not call the Scorer for incrementally removable aggregations Other techniques: Dimensionality reduction User input on which attributes are important or ignored 14

Evaluation Objective: understand performance of DT and MC algorithms compared to naive implementation. synthetic and real world datasets measured with precision, recall, F-score, runtime 15

Results Overview Vastly improved performance, good predicates on synthetic datasets Good predicates on real-world datasets Caveat: no end-user study 16

Takeaways Predicates convey information about the data that satisfy the predicate. Influential predicates can both identify and explain outliers. Efficient algorithms for finding influential predicates can help end users. 18

Outliers in Aggregate Queries with Scorpion

Download Presentation

Presentation Transcript

Related

More Related Content