Robust Decision Tree Induction from Unreliable Data Sources
 
Christian Schreckenberger, Christian Bartelt, Heiner Stuckenschmidt
 
29.-30. August 2020
 
 
STAIRS 2020
 
 
Outline
 
Introduction
Background
Related Work
Expected Information Gain
Evaluation
Conclusion
 
 
Introduction
 
Missing data is a well-established and well-studied problem
Intralearning approaches
Preprocessing approaches
Our focus is on Decision Tree Learning
Proposition: Expected Information Gain
Takes source reliability into account
Sources with low reliability in the past will have low reliability in the future
The Goal: Increase Robustness
 
 
Background
 
Missing Data
MCAR, MAR, MNAR
Multiple Imputation: kNN Imputation (sketched below)
Decision Tree
Divide and conquer
Split dataset until a stopping criterion is met
C4.5 and Missing Data
Calculation of criterion
Propagation of samples during training
Propagation of samples during prediction
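The kNN imputation baseline mentioned above can be illustrated with a minimal sketch, assuming scikit-learn's KNNImputer is available; the toy matrix and the choice of k are illustrative only:

import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix; missing values are encoded as np.nan (illustrative data).
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [4.0, 4.0, 8.0]])

# Each missing entry is replaced by the distance-weighted mean of that feature
# over the k nearest rows in which the entry is observed.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

C4.5's own strategy, by contrast, keeps the missing values and instead redistributes such samples fractionally when computing the split criterion and when propagating them during training and prediction.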
 
 
 
Related Work
 
Propagate samples with missing values down both branches [Fri76]
Impute the most common value on the fly [CN89]
Lazy Decision Trees [FKY96]
Surrogate splits [BFOS84]
Branch Exclusive Splits [BR18]
MIA (Missingness Incorporated in Attributes) [TJH08]
 
 
 
Expected Information Gain
 
Information Gain:
 
 
Expected Information Gain:
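The formulas on this slide appeared as images. The standard information gain can be reconstructed as follows, while the expected variant is given only as a hedged reading in which an attribute's gain is weighted by an estimate of how reliably its source delivers a value; the symbols p_A and S_obs are assumptions introduced here, not necessarily the paper's exact notation:

% Entropy of a labelled sample set S (p_c = class proportions):
H(S) = -\sum_{c} p_c \log_2 p_c

% Standard information gain of splitting S on attribute A:
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)

% Hedged reading of the expected variant: weight the gain computed on the
% samples where A is observed (S_obs) by p_A, the estimated probability that
% the source actually provides a value for A (its reliability).
\mathbb{E}[\mathrm{IG}(S, A)] \approx p_A \cdot \mathrm{IG}(S_{\mathrm{obs}}, A)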
 
 
Learning with Expected IG
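The induction procedure was shown as a figure on this slide. Below is a minimal, hypothetical Python sketch of greedy tree learning that scores candidate attributes with the reliability-weighted gain read above; the function names, the observed-rate reliability estimate, and the handling of samples that miss the split attribute are illustrative assumptions, not the paper's exact algorithm:

import numpy as np
from collections import Counter

def entropy(y):
    # Shannon entropy of the class labels in y.
    counts = np.array(list(Counter(y).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def expected_information_gain(col, y):
    # Reliability estimate (assumption): fraction of observed values in this column.
    observed = ~np.isnan(col)
    p_obs = observed.mean()
    if p_obs == 0.0:
        return 0.0
    col_o, y_o = col[observed], y[observed]
    gain = entropy(y_o)
    for v in np.unique(col_o):             # attributes treated as categorical
        mask = col_o == v
        gain -= mask.mean() * entropy(y_o[mask])
    return p_obs * gain                    # weight the gain by the attribute's reliability

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or a depth limit is reached (simplistic stopping criterion).
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    gains = [expected_information_gain(X[:, j], y) for j in range(X.shape[1])]
    best = int(np.argmax(gains))
    if gains[best] <= 0.0:
        return Counter(y).most_common(1)[0][0]
    node = {"attr": best, "children": {}}
    col, observed = X[:, best], ~np.isnan(X[:, best])
    # Samples missing the split attribute are simply dropped in this toy version;
    # C4.5-style fractional propagation would instead weight them down every branch.
    for v in np.unique(col[observed]):
        mask = observed & (col == v)
        node["children"][v] = build_tree(X[mask], y[mask], depth + 1, max_depth)
    return node

Usage: build_tree(X, y) with X a float matrix (np.nan marking missing entries) and y an integer label vector returns a nested dict of split nodes with leaf labels at the bottom.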
 
 
Evaluation
Setting
 
Six UCI ML Repo Datasets & One Synthetic Dataset
Baselines
C4.5 MV Strategy
Mean Imputation/C4.5
kNN Imputation/C4.5
5-fold Cross Validation
Training data always contains missing values
Three Scenarios for Test Data
Amount of missing data ranges from 5% to 95%, in steps of 5% (injection sketched below)
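A minimal sketch of how the varying missing-value rates could be injected, assuming MCAR deletion on a NumPy feature matrix and scikit-learn's KFold for the 5-fold cross-validation; inject_mcar, the placeholder data, and the loop wiring are illustrative, not the exact experimental code:

import numpy as np
from sklearn.model_selection import KFold

def inject_mcar(X, rate, rng):
    # Delete each entry independently with probability `rate` (MCAR).
    X = X.astype(float, copy=True)
    X[rng.random(X.shape) < rate] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.random((100, 4))                       # placeholder features
y = rng.integers(0, 2, size=100)               # placeholder labels

for rate in np.arange(0.05, 1.0, 0.05):        # 5% .. 95% in steps of 5%
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        X_train = inject_mcar(X[train_idx], rate, rng)   # training data always has MV
        X_test = inject_mcar(X[test_idx], rate, rng)     # one of the three test scenarios
        # ... induce the tree on (X_train, y[train_idx]) and evaluate on (X_test, y[test_idx])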
 
 
Evaluation
Prediction with Full Data
 
 
Evaluation
Prediction with Missing Data
 
 
Evaluation
Prediction with Imputed Data
 
 
Conclusion
 
Discussion
Most beneficial when data is also missing at prediction time
A more accurate imputation method provides better results
Interdependency of features is required
Future Work
Analyze impact of the imputation method on the result
Extend to further imputation methods
Work on pruning methods and stopping criterion
 
 
Q&A
 
Any Questions?
 
Get in Contact
schreckenberger@es.uni-mannheim.de
 
 
REFERENCES
 
[Fri76] Jerome H. Friedman. A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comput., 26(SLAC-PUB-1573-REV):404, 1976.
[CN89] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261-283, 1989.
[FKY96] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees. In AAAI/IAAI, Vol. 1, pages 717-724, 1996.
[BFOS84] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression Trees. 1984.
[BR18] Cedric Beaulac and Jeffrey S. Rosenthal. BEST: A decision tree algorithm that handles missing values. arXiv preprint arXiv:1804.10168, 2018.
[TJH08] Beth Twala, M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7):950-956, 2008.
 