Robust Decision Tree Induction from Unreliable Data Sources
 
Christian Schreckenberger, Christian Bartelt, Heiner Stuckenschmidt
 
29.-30. August 2020
 
 
STAIRS 2020
 
 
Outline
 
Introduction
Background
Related Work
Expected Information Gain
Evaluation
Conclusion
 
 
Introduction
 
Missing data is a well-established and well-studied problem
Intralearning approaches
Preprocessing approaches
Our focus is on Decision Tree Learning
Proposition: Expected Information Gain
Takes source reliability into account
Sources with low reliability in the past will have low reliability in the future
The Goal: Increase Robustness
 
 
Background
 
Missing Data
MCAR, MAR, MNAR
Multiple Imputation: kNN Imputation (sketched below)
Decision Tree
Divide and conquer
Split dataset until a stopping criterion is met
C4.5 and Missing Data
Calculation of criterion
Propagation of samples during training
Propagation of samples during prediction
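The kNN imputation baseline mentioned above can be illustrated with a minimal sketch, assuming scikit-learn's KNNImputer is available; the toy matrix and the choice of k are illustrative only:

import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix; missing values are encoded as np.nan (illustrative data).
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [4.0, 4.0, 8.0]])

# Each missing entry is replaced by the distance-weighted mean of that feature
# over the k nearest rows in which the entry is observed.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

C4.5's own strategy, by contrast, keeps the missing values and instead redistributes such samples fractionally when computing the split criterion and when propagating them during training and prediction.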
 
 
 
Related Work
 
Propagate samples with missing values down both branches [Fri76]
Impute the most common value on the fly [CN89]
Lazy Decision Trees [FKY96]
Surrogate splits [BFOS84]
Branch Exclusive Splits [BR18]
MIA (Missingness Incorporated in Attributes) [TJH08]
 
 
 
Expected Information Gain
 
Information Gain:
 
 
Expected Information Gain:
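The formulas on this slide appeared as images. The standard information gain can be reconstructed as follows, while the expected variant is given only as a hedged reading in which an attribute's gain is weighted by an estimate of how reliably its source delivers a value; the symbols p_A and S_obs are assumptions introduced here, not necessarily the paper's exact notation:

% Entropy of a labelled sample set S (p_c = class proportions):
H(S) = -\sum_{c} p_c \log_2 p_c

% Standard information gain of splitting S on attribute A:
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)

% Hedged reading of the expected variant: weight the gain computed on the
% samples where A is observed (S_obs) by p_A, the estimated probability that
% the source actually provides a value for A (its reliability).
\mathbb{E}[\mathrm{IG}(S, A)] \approx p_A \cdot \mathrm{IG}(S_{\mathrm{obs}}, A)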
 
 
Learning with Expected IG
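The induction procedure was shown as a figure on this slide. Below is a minimal, hypothetical Python sketch of greedy tree learning that scores candidate attributes with the reliability-weighted gain read above; the function names, the observed-rate reliability estimate, and the handling of samples that miss the split attribute are illustrative assumptions, not the paper's exact algorithm:

import numpy as np
from collections import Counter

def entropy(y):
    # Shannon entropy of the class labels in y.
    counts = np.array(list(Counter(y).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def expected_information_gain(col, y):
    # Reliability estimate (assumption): fraction of observed values in this column.
    observed = ~np.isnan(col)
    p_obs = observed.mean()
    if p_obs == 0.0:
        return 0.0
    col_o, y_o = col[observed], y[observed]
    gain = entropy(y_o)
    for v in np.unique(col_o):             # attributes treated as categorical
        mask = col_o == v
        gain -= mask.mean() * entropy(y_o[mask])
    return p_obs * gain                    # weight the gain by the attribute's reliability

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or a depth limit is reached (simplistic stopping criterion).
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    gains = [expected_information_gain(X[:, j], y) for j in range(X.shape[1])]
    best = int(np.argmax(gains))
    if gains[best] <= 0.0:
        return Counter(y).most_common(1)[0][0]
    node = {"attr": best, "children": {}}
    col, observed = X[:, best], ~np.isnan(X[:, best])
    # Samples missing the split attribute are simply dropped in this toy version;
    # C4.5-style fractional propagation would instead weight them down every branch.
    for v in np.unique(col[observed]):
        mask = observed & (col == v)
        node["children"][v] = build_tree(X[mask], y[mask], depth + 1, max_depth)
    return node

Usage: build_tree(X, y) with X a float matrix (np.nan marking missing entries) and y an integer label vector returns a nested dict of split nodes with leaf labels at the bottom.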
 
 
Evaluation
Setting
 
Six UCI ML Repo Datasets & One Synthetic Dataset
Baselines
C4.5 MV Strategy
Mean Imputation/C4.5
kNN Imputation/C4.5
5-fold Cross Validation
Training data always contains missing values
Three Scenarios for Test Data
Amount of missing data ranges from 5% to 95%, in steps of 5% (injection sketched below)
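A minimal sketch of how the varying missing-value rates could be injected, assuming MCAR deletion on a NumPy feature matrix and scikit-learn's KFold for the 5-fold cross-validation; inject_mcar, the placeholder data, and the loop wiring are illustrative, not the exact experimental code:

import numpy as np
from sklearn.model_selection import KFold

def inject_mcar(X, rate, rng):
    # Delete each entry independently with probability `rate` (MCAR).
    X = X.astype(float, copy=True)
    X[rng.random(X.shape) < rate] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.random((100, 4))                       # placeholder features
y = rng.integers(0, 2, size=100)               # placeholder labels

for rate in np.arange(0.05, 1.0, 0.05):        # 5% .. 95% in steps of 5%
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        X_train = inject_mcar(X[train_idx], rate, rng)   # training data always has MV
        X_test = inject_mcar(X[test_idx], rate, rng)     # one of the three test scenarios
        # ... induce the tree on (X_train, y[train_idx]) and evaluate on (X_test, y[test_idx])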
 
 
Evaluation
Prediction with Full Data
 
 
Evaluation
Prediction with Missing Data
 
 
Evaluation
Prediction with Imputed Data
 
 
Conclusion
 
Discussion
Most beneficial when data is also missing at prediction time
A more accurate imputation method provides better results
Interdependency of features is required
Future Work
Analyze impact of the imputation method on the result
Extend to further imputation methods
Work on pruning methods and stopping criterion
 
 
Q&A
 
Any Questions?
 
Get in Contact
schreckenberger@es.uni-mannheim.de
 
 
REFERENCES
 
[Fri76] Jerome H. Friedman. A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comput., 26(SLAC-PUB-1573-REV):404, 1976.
[CN89] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261-283, 1989.
[FKY96] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees. In AAAI/IAAI, Vol. 1, pages 717-724, 1996.
[BFOS84] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression Trees. 1984.
[BR18] Cedric Beaulac and Jeffrey S. Rosenthal. BEST: A decision tree algorithm that handles missing values. arXiv preprint arXiv:1804.10168, 2018.
[TJH08] Beth Twala, M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7):950-956, 2008.
 