Advanced Techniques in Relational Data Outlier Detection

Slide Note
Embed
Share

This document delves into cutting-edge methods for outlier detection in relational data, focusing on profile-based and model-based approaches such as leveraging Bayesian networks, feature generation, and individual feature vector summarization. The examples provided showcase the application of these techniques in identifying exceptional individual databases. With a case study in geographical information systems, the paper highlights the significance of identifying outliers in both population and individual databases.


Uploaded on Sep 14, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. School of Computing Science Simon Fraser University Vancouver, Canada Anomaly Detection Outlier Detection Exception Mining

  2. Profile-Based Outlier Detection for Relational Data Individual Database Profile, Interpretation, egonet e.g. Brad Pitt s movies Population Database e.g. IMDB Goal: Identify exceptional individual databases Maervoet, J.; Vens, C.; Vanden Berghe, G.; Blockeel, H. & De Causmaecker, P. (2012), 'Outlier Detection in Relational Data: A Case Study in Geographical Information Systems', Expert Systems With Applications 39(5), 4718 4728. 2

  3. Example: population data gender = Man country = U.S. gender = Man country = U.S. gender = Woman country = U.S. gender = Woman country = U.S. False n/a False n/a ActsIn salary True $500K False n/a False n/a True $5M False n/a True $2M runtime = 98 min drama = true action = true runtime = 111 min drama = false action = true 3

  4. Example: individual data gender = Man country = U.S. False n/a False n/a runtime = 98 min drama = true 4

  5. Model-Based Relational Outlier Detection Model-based: Leverage result of Bayesian network learning 1. Feature generation based on BN model 2. Define outlierness metric using BN model Population Database Individual Database Class-level Bayesian network Maervoet, J.; Vens, C.; Vanden Berghe, G.; Blockeel, H. & De Causmaecker, P. (2012), 'Outlier Detection in Relational Data: A Case Study in Geographical Information Systems', Expert Systems With Applications 39(5), 4718 4728. 5

  6. Model-Based Feature Generation Learning Bayesian Networks for Complex Relational Data

  7. Model-Based Outlier Detection for Relational Data Population Database Individual Database Class-level Bayesian network Individual Feature Vector Propositionalization/Relation Elimination/ETL: Feature vectors summarize the individual data leverage outlier detection for i.i.d. feature matrix data Riahi, F. & Schulte, O. (2016), Propositionalization for Unsupervised Outlier Detection in Multi-Relational Data, in 'Proceedings FLAIRS 2016.', pp. 448--453. 7

  8. Example: Class Bayesian Network ActsIn(A,M) Drama(M) gender(A) 8 Learning Bayesian Networks for Complex Relational Data

  9. Example: Feature Matrix 1/2 1/2 1/2 1/2 0 1 0 0 0 0 0 0 1/2 1/2 1/2 1/2 1/2 1/2 0 0 0 0 0 0 1/2 1/2 1/2 1/2 1/2 1/2 0 0 0 0 0 0 1/2 1/2 1/2 1/2 1/2 1/2 0 0 0 0 0 0 Each feature corresponds to a family configuration in the Bayesian network Similar to feature matrix for classification For step-by-step construction, see supplementary slides on website 9 Learning Bayesian Networks for Complex Relational Data

  10. Feature Generation/Propositionalization for Outlier Detection Similar to feature generation for classification Main difference: include all first-order random variables, not just the Markov blanket of the class variable Bayesian network learning discovers relevant conjunctive features Related work: The Oddball system also extracts a feature matrix from relational information based on network analysis (Akoglu et al. 2010) +Leverages existing i.i.d. outlier detection methods - does not define a native relational outlierness metric Akoglu, L.; Mcglohon, M. & Faloutsos, C. (2010), OddBall: Spotting Anomalies in Weighted Graphs, in 'PAKDD', pp. 410-421 Akoglu, L.; Tong, H. & Koutra, D. (2015), 'Graph based anomaly detection and description: a survey', Data Mining and Knowledge Discovery 29(3), 626--688. 10

  11. Relational Outlierness Metrics Learning Bayesian Networks for Complex Relational Data

  12. Exceptional Model Mining for Relational Data EMM approach (Knobbe et al. 2011) for subgroup discovery in i.i.d. data Fix a model class with parameter vector . 2. Learn parameters c for the entire class. 3. Learn parameters g for a subgroup g. 4. Measure difference between c and g quality measure for subgroup g. 5. For relational data, an individual o = subgroup g of size 1. Compare random individual against target individual 1. Knobbe, A.; Feelders, A. & Leman, D. (2011), Exceptional Model Mining'Data Mining: Foundations and Intelligent Paradigms.', Springer Verlag, Heidelberg, Germany . Model-based Outlier Detection for Object-Relational Data . Riahi and Schulte (2015). IEEE SSCI. 12

  13. EMM-Based Outlier Detection for Relational Data Population Database Individual Database Class Bayesian network (for random individual) Individual Bayesian network Outlierness Metric (quality measure) = Measure of dissimilarity between class and individual BN e.g. KLD, ELD (new) Model-based Outlier Detection for Object-Relational Data . Riahi and Schulte (2015). IEEE SSCI. 13

  14. Example: class and individual Bayesian network parameters Gender (A) Drama(M ) Cond. Prob. of ActsIn(A,M)= T 1/2 0 0 1 P(gender(A)=M) = 0.5 P(Drama(M)=T) = 0.5 gender(A) Drama(M) M M W W T F T F ActsIn(A,M) P(Drama(M)=T) = 0.5 Gender P(gender(bradPitt)=M) = 1 Drama (M) Cond. Prob. of ActsIn(A,M)=T 0 0 (bradPitt) gender(BradPitt) Drama(M) M M T F ActsIn(BradPitt,M) 14

  15. Outlierness Metric = Kulback-Leibler Divergence KLD(Bo||Bc)= nodesi PBo(Xi= xik,Pa(Xi)= paj) ln(PBo(Xi= xik|Pa(Xi)= paj) PBc(Xi= xik|Pa(Xi)= paj)) parent-state j values k where Bc models the class database distribution Bo model the individual database distribution Do Assuming that PBo=PDo (MLE estimation), the KLD is the individual data log-likelihood ratio: KLD(Bo||Bc)=L(Bo;Do)-L(Bc;Do) 15 Learning Bayesian Networks for Complex Relational Data

  16. Brad Pitt Example individual joint individual cond ln(ind.cond. ) gender(A) M class cond ln(class cond.)KLD -0.69 1 1 0.5 0 0.69 Drama(M ) individual joint individual cond ln(ind.cond. ) ActsIn(A,M)gender(A) class cond ln(class cond.)KLD F M T 1/2 1 0.5 0 -0.69 0.35 F M F 1/2 1 1 0 0.00 0.00 total 0.35 total KLD = 0.69 + 0.35 = 1.04 KLD for Drama(M) = 0 omitted rows with individual probability = 0 16 Learning Bayesian Networks for Complex Relational Data

  17. Mutual Information Decomposition The interpretability of the metric can be increased by a mutual information decomposition of KLD KLD wrt marginal single-variable distributions PD(xik)ln(PBo(xik) KLD(Bo||Bc)= PBc(xik)) )-ln(PBC(Xi= xik|Pa(Xi)= paj) PBC(xik) nodesi values k PD(xik, paj) [ln(PBo(Xi= xik|Pa(Xi)= paj) + )] PBo(xik) parent-state j nodesi values k lift of parent condition in individual distribution lift of parent condition in class distribution The first sum measures single-variable distribution difference The second sum measures difference in strength of associations 17 Learning Bayesian Networks for Complex Relational Data

  18. ELD = Expected Log-Distance A problem with KLD: some log ratios are positive, some negative cancelling of differences, reduces power Can fix by taking log-distances PD(xik)|ln(PBo(xik) ELD(Bo||Bc)= PBc(xik))| )-ln(PBC(Xi= xik|Pa(Xi)= paj) PBC(xik) nodesi values k PD(xik, paj) |ln(PBo(Xi= xik|Pa(Xi)= paj) + )| PBo(xik) parent-state j nodesi values k 18 Learning Bayesian Networks for Complex Relational Data

  19. Two Types of Outliers Feature Outlier: unusual distribution over single attribute in isolation DribbleEfficiency Correlation Outlier: unusual relevance of parent for children (mutual information, lift) DribbleEfficiency Win 19 Learning Bayesian Networks for Complex Relational Data

  20. Example: Edin Dzeko, Marginals Data are from Premier League Season 2011-2012. Low Dribble Efficiency in 16% of his matches. Random Striker: Low DE in 50% of matches. ELD contribution for marginal sum: 16% x |ln(16%/50%)| = 0.18 20 Learning Bayesian Networks for Complex Relational Data

  21. Example: Edin Dzeko, Associations Association: Shotefficiency = high, TackleEfficiency = medium DribbleEffiency = low For Edin Dzeko: confidence = 50% lift = ln(50%/16%)=1.13 support (joint prob) = 6% For random striker confidence = 38% lift = ln(38%/50%) =-0.27 ELD contribution for association 10% x |1.13-(-0.27)|= 6% x 1.14 = 0.068 Learning Bayesian Networks for Complex Relational Data 21

  22. Evaluation Metrics Use precision as evaluation metric Set the percentages of outliers to be 1% and 5%. How many outliers were correctly recognized Similar results with AUC, recall. Gao, J.; Liang, F.; Fan, W.; Wang, C.; Sun, Y. & Han, J. (2010), On Community Outliers and Their Efficient Detection in Information Networks, in SIGKDD, pp. 813--822. 22

  23. Methods Compared Outlierness metrics KLD |KLD|: replace log-differences by log-distances ELD LOG = -log-likelihood of generic class model on individual database FD: |KLD| with respect to marginals only Aggregation Methods 1. Use counts of single feature values to form data matrix 2. Apply standard single-table methods (LOF, KNN, OutRank) 23 Learning Bayesian Networks for Complex Relational Data

  24. Synthetic Datasets Synthetic Datasets: Should be easy! Two Features per player per match Samples below high correlation Normal ShotEff Match Result 1 1 0 0 1 0 0 1 low correlation Normal ShotEff Match Result 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 Outlier Outlier 24

  25. Synthetic Data Results 25

  26. 1D Scatter-Plots Red points are outliers and blue points are normal class points High correlation Synthetic dataset Low correlation Synthetic dataset LOG LOG FD FD LR LR |LR| |LR| ELD ELD 2.0 2.5 3.0 3.5 4.0 4.5 5.0 2.0 2.5 3.0 3.5 4.0 4.5 5.0 26 Log(Metrics+1) Log(Metrics+1)

  27. Case Study: Strikers and Movies ELD Rank ELD Max Node 1Dribble Efficiency DE = Low 2SavesMade 3SavesMade FD Max Value Individual Probability Class Probability Player Name Position Edin Dzeko Paul RobinsonGoalie Michel Vorm Striker 0.16 0.30 0.37 0.50 0.04 0.04 SM = Medium SM = Medium Goalie Striker = Normal ELD RankELD Max Node 1Actor_Quality 2Cast_position 3Cast_position FD Max feature Value a_quality=4 cast_num=3 cast_num=3 Individual Probability Class Probability MovieTitle Brave Heart Austin PowersComedy Blue Brothers Comedy Genre Drama 0.93 0.78 0.88 0.42 0.49 0.49 Drama = Normal 27

  28. Conclusion Relational outlier detection: two approaches for leveraging BN structure learning Propositionalization BN structure defines features for single-table outlier detection Relational Outlierness metric Use divergence between database distribution for target individual and random individual Novel variant of Kullback-Leibler divergence works well: interpretable accurate 28 Learning Bayesian Networks for Complex Relational Data

  29. Tutorial Conclusion: First-Order Bayesian Networks Many organizations maintain structured data in relational databases. First-order Bayesian networks model probabilistic associations across the entire database. Halpern/Bacchus probabilistic logic unifies logic and probability. random selection semantics for Bayesian networks: can query frequencies across the entire database. 29 Learning Bayesian Networks for Relational Data

  30. Conclusion: Learning First-Order Bayesian Networks Extend Halpern/Bacchus random selection semantics to statistical concepts new random selection likelihood function tractable parameter and structure learning can also be used to learn Markov Logic Networks relational Bayesian network classification formula log-linear model whose predictors are the proportions of Bayesian network features. New approach to relational anomaly detection compare probability distribution of potential outlier with distribution for reference class Learning Bayesian Networks for Relational Data 30

Related