Overfitting in Data Mining Models

Data Mining
Model Overfitting
Introduction to Data Mining, 2
nd
 Edition
by
Tan, Steinbach, Karpatne, Kumar
Classification Errors
l
T
r
a
i
n
i
n
g
 
e
r
r
o
r
s
:
 
E
r
r
o
r
s
 
c
o
m
m
i
t
t
e
d
 
o
n
 
t
h
e
 
t
r
a
i
n
i
n
g
 
s
e
t
l
T
e
s
t
 
e
r
r
o
r
s
:
 
 
E
r
r
o
r
s
 
c
o
m
m
i
t
t
e
d
 
o
n
 
t
h
e
 
t
e
s
t
 
s
e
t
l
G
e
n
e
r
a
l
i
z
a
t
i
o
n
 
e
r
r
o
r
s
:
 
E
x
p
e
c
t
e
d
 
e
r
r
o
r
 
o
f
 
a
 
m
o
d
e
l
 
o
v
e
r
 
r
a
n
d
o
m
 
s
e
l
e
c
t
i
o
n
 
o
f
 
r
e
c
o
r
d
s
 
f
r
o
m
 
s
a
m
e
 
d
i
s
t
r
i
b
u
t
i
o
n
Example Data Set
T
w
o
 
c
l
a
s
s
 
p
r
o
b
l
e
m
:
+
 
:
 
5
4
0
0
 
i
n
s
t
a
n
c
e
s
5
0
0
0
 
i
n
s
t
a
n
c
e
s
 
g
e
n
e
r
a
t
e
d
f
r
o
m
 
a
 
G
a
u
s
s
i
a
n
 
c
e
n
t
e
r
e
d
 
a
t
(
1
0
,
1
0
)
4
0
0
 
n
o
i
s
y
 
i
n
s
t
a
n
c
e
s
 
a
d
d
e
d
o
 
:
 
5
4
0
0
 
i
n
s
t
a
n
c
e
s
G
e
n
e
r
a
t
e
d
 
f
r
o
m
 
a
 
u
n
i
f
o
r
m
d
i
s
t
r
i
b
u
t
i
o
n
1
0
 
%
 
o
f
 
t
h
e
 
d
a
t
a
 
u
s
e
d
 
f
o
r
t
r
a
i
n
i
n
g
 
a
n
d
 
9
0
%
 
o
f
 
t
h
e
d
a
t
a
 
u
s
e
d
 
f
o
r
 
t
e
s
t
i
n
g
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes
Decision Tree
Decision boundaries on Training data
Decision Tree with 50 nodes
Decision Tree
Decision Tree
Decision boundaries on Training data
Which tree is better?
D
e
c
i
s
i
o
n
 
T
r
e
e
 
w
i
t
h
 
4
 
n
o
d
e
s
D
e
c
i
s
i
o
n
 
T
r
e
e
 
w
i
t
h
 
5
0
 
n
o
d
e
s
W
h
i
c
h
 
t
r
e
e
 
i
s
 
b
e
t
t
e
r
 
?
Model Underfitting and Overfitting
U
n
d
e
r
f
i
t
t
i
n
g
:
 
w
h
e
n
 
m
o
d
e
l
 
i
s
 
t
o
o
 
s
i
m
p
l
e
,
 
b
o
t
h
 
t
r
a
i
n
i
n
g
 
a
n
d
 
t
e
s
t
 
e
r
r
o
r
s
 
a
r
e
 
l
a
r
g
e
O
v
e
r
f
i
t
t
i
n
g
:
 
w
h
e
n
 
m
o
d
e
l
 
i
s
 
t
o
o
 
c
o
m
p
l
e
x
,
 
t
r
a
i
n
i
n
g
 
e
r
r
o
r
 
i
s
 
s
m
a
l
l
 
b
u
t
 
t
e
s
t
 
e
r
r
o
r
 
i
s
 
l
a
r
g
e
As the model becomes more and more complex, test errors can start
increasing even though training error may be decreasing
Model Overfitting – Impact of Training Data Size
U
s
i
n
g
 
t
w
i
c
e
 
t
h
e
 
n
u
m
b
e
r
 
o
f
 
d
a
t
a
 
i
n
s
t
a
n
c
e
s
Increasing the size of training data reduces the difference between training and
testing errors at a given size of model
Model Overfitting – Impact of Training Data Size
U
s
i
n
g
 
t
w
i
c
e
 
t
h
e
 
n
u
m
b
e
r
 
o
f
 
d
a
t
a
 
i
n
s
t
a
n
c
e
s
Increasing the size of training data reduces the difference between training and
testing errors at a given size of model
D
e
c
i
s
i
o
n
 
T
r
e
e
 
w
i
t
h
 
5
0
 
n
o
d
e
s
D
e
c
i
s
i
o
n
 
T
r
e
e
 
w
i
t
h
 
5
0
 
n
o
d
e
s
Reasons for Model Overfitting
l
Not enough training data
l
High model complexity
Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
 
l
Consider the task of predicting whether
stock market will rise/fall in the next 10
trading days
 
l
Random guessing:
 P
(
correct
) = 0.5
 
l
Make 10 random guesses in a row:
 
Effect of Multiple Comparison Procedure
 
l
Approach:
Get 50 analysts
Each analyst makes 10 random guesses
Choose the analyst that makes the most
number of correct predictions
 
l
Probability that at least one analyst makes at
least 8 correct predictions
Effect of Multiple Comparison Procedure
 
l
Many algorithms employ the following greedy strategy:
Initial model: M
Alternative model: M’ = M 
 
,
where 
 is a component to be added to the model
(e.g., a test condition of a decision tree)
Keep M’ if improvement, 
(M,M’) > 
 
l
Often times, 
 is chosen from a set of alternative
components, 
 = {
1
, 
2
, …, 
k
}
 
l
If many alternatives are available, one may inadvertently
add irrelevant components to the model, resulting in
model overfitting
Effect of Multiple Comparison - Example
Use additional 100 noisy variables
generated from a uniform distribution
along with X and Y as attributes.
Use 30% of the data for training and
70% of the data for testing
 
U
s
i
n
g
 
o
n
l
y
 
X
 
a
n
d
 
Y
 
a
s
 
a
t
t
r
i
b
u
t
e
s
Notes on Overfitting
l
Overfitting results in decision trees that are 
more
complex
 than necessary
l
Training error does not provide a good estimate
of how well the tree will perform on previously
unseen records
l
Need ways for estimating generalization errors
Model Selection
l
Performed during model building
l
Purpose is to ensure that model is not overly
complex (to avoid overfitting)
l
Need to estimate generalization error
Using Validation Set
Incorporating Model Complexity
Model Selection:
Using Validation Set
l
Divide 
training
 data into two parts:
Training set:
 use for model building
Validation set:
 use for estimating generalization error
 Note: validation set is not the same as test set
l
Drawback:
Less data available for training
Model Selection:
Incorporating Model Complexity
l
Rationale: Occam’s Razor
Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model
A complex model has a greater chance of being fitted
accidentally
Therefore, one should include model complexity when
evaluating a model
Gen. Error(Model) = Train. Error(Model, Train. Data) +
    
 
 
x Complexity(Model)
Estimating the Complexity of Decision Trees
l
P
e
s
s
i
m
i
s
t
i
c
 
E
r
r
o
r
 
E
s
t
i
m
a
t
e
 
o
f
 
d
e
c
i
s
i
o
n
 
t
r
e
e
 
T
w
i
t
h
 
k
 
l
e
a
f
 
n
o
d
e
s
:
err(T): error rate on all training records
: trade-off hyper-parameter (similar to   )
 Relative cost of adding a leaf node
k: number of leaf nodes
N
train
: total number of training records
Estimating the Complexity of Decision Trees: Example
 
e
(
T
L
)
 
=
 
4
/
2
4
e
(
T
R
)
 
=
 
6
/
2
4
 
=
 
1
 
e
g
e
n
(
T
L
)
 
=
 
4
/
2
4
 
+
 
1
*
7
/
2
4
 
=
 
1
1
/
2
4
 
=
 
0
.
4
5
8
e
g
e
n
(
T
R
)
 
=
 
6
/
2
4
 
+
 
1
*
4
/
2
4
 
=
 
1
0
/
2
4
 
=
 
0
.
4
1
7
Estimating the Complexity of Decision Trees
l
Resubstitution Estimate:
Using training error as an 
optimistic
 estimate of
generalization error
Referred to as 
optimistic error 
estimate
 
e
(
T
L
)
 
=
 
4
/
2
4
e
(
T
R
)
 
=
 
6
/
2
4
Minimum Description Length (MDL)
l
Cost(Model,Data) = Cost(Data|Model) +    x Cost(Model)
Cost is the number of bits needed for encoding.
Search for the least costly model.
l
Cost(Data|Model) encodes the misclassification errors.
l
Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
Model Selection for Decision Trees
l
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
 Stop if all instances belong to the same class
 Stop if all the attribute values are the same
More restrictive conditions:
 Stop if number of instances is less than some user-specified
threshold
 Stop if class distribution of instances are independent of the
available features (e.g., using 
 2
 test)
 Stop if expanding the current node does not improve impurity
    measures (e.g., Gini or information gain).
 Stop if estimated generalization error falls below certain threshold
Model Selection for Decision Trees
l
Post-pruning
Grow decision tree to its entirety
Subtree replacement
 Trim the nodes of the decision tree in a bottom-up
fashion
 If generalization error improves after trimming,
replace sub-tree by a leaf node
 Class label of leaf node is determined from
majority class of instances in the sub-tree
Example of Post-Pruning
T
r
a
i
n
i
n
g
 
E
r
r
o
r
 
(
B
e
f
o
r
e
 
s
p
l
i
t
t
i
n
g
)
 
=
 
1
0
/
3
0
P
e
s
s
i
m
i
s
t
i
c
 
e
r
r
o
r
 
=
 
(
1
0
 
+
 
0
.
5
)
/
3
0
 
=
 
1
0
.
5
/
3
0
T
r
a
i
n
i
n
g
 
E
r
r
o
r
 
(
A
f
t
e
r
 
s
p
l
i
t
t
i
n
g
)
 
=
 
9
/
3
0
P
e
s
s
i
m
i
s
t
i
c
 
e
r
r
o
r
 
(
A
f
t
e
r
 
s
p
l
i
t
t
i
n
g
)
=
 
(
9
 
+
 
4
 
 
0
.
5
)
/
3
0
 
=
 
1
1
/
3
0
P
R
U
N
E
!
Examples of Post-pruning
Model Evaluation
l
Purpose:
To estimate performance of classifier on previously
unseen data (test set)
l
Holdout
Reserve k% for training and (100-k)% for testing
Random subsampling: repeated holdout
l
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
Leave-one-out:   k=n
Cross-validation Example
l
3-fold cross-validation
Variations on Cross-validation
l
Repeated cross-validation
Perform cross-validation a number of times
Gives an estimate of the variance of the
generalization error
l
Stratified cross-validation
Guarantee the same percentage of class
labels in training and test
Important when classes are imbalanced and
the sample is small
l
Use nested cross-validation approach for model
selection and evaluation
Slide Note
Embed
Share

Overfitting is a common issue in data mining models where the model performs exceptionally well on the training data but fails to generalize to new data. This content discusses how overfitting can occur, its impact on model performance, and strategies to mitigate it. Through examples and visualizations, it demonstrates the consequences of overfitting, such as increased errors on the test set and poor generalization. Additionally, it explores the trade-off between model complexity and performance, specifically focusing on decision trees with varying numbers of nodes. The comparison between decision trees with 4 and 50 nodes illustrates the concept of overfitting and the importance of finding the right balance to achieve optimal model performance.

  • Overfitting
  • Data Mining
  • Model Performance
  • Decision Trees
  • Generalization

Uploaded on Sep 18, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Data Mining Model Overfitting Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2021 Introduction to Data Mining, 2ndEdition 1

  2. Classification Errors Training errors: Errors committed on the training set Test errors: Errors committed on the test set Generalization errors: Expected error of a model over random selection of records from same distribution Learning algorithm Attrib1 Attrib2 Attrib3 Class Tid No 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K Induction No 4 Yes Medium 120K Yes 5 No Large 95K No 6 No Medium 60K Learn Model No 7 Yes Large 220K Yes 8 No Small 85K No 9 No Medium 75K Yes 10 No Small 90K Model 10 Training Set Apply Model Attrib1 Attrib2 Attrib3 Class Tid ? 11 No Small 55K ? 12 Yes Medium 80K Deduction ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K 10 Test Set 02/03/2021 Introduction to Data Mining, 2nd Edition 2

  3. Example Data Set Two class problem: + : 5400 instances 5000 instances generated from a Gaussian centered at (10,10) 400 noisy instances added o : 5400 instances Generated from a uniform distribution 10 % of the data used for training and 90% of the data used for testing 02/03/2021 Introduction to Data Mining, 2nd Edition 3

  4. Increasing number of nodes in Decision Trees 02/03/2021 Introduction to Data Mining, 2nd Edition 4

  5. Decision Tree with 4 nodes Decision Tree Decision boundaries on Training data 02/03/2021 Introduction to Data Mining, 2nd Edition 5

  6. Decision Tree with 50 nodes Decision Tree Decision Tree Decision boundaries on Training data 02/03/2021 Introduction to Data Mining, 2nd Edition 6

  7. Which tree is better? Decision Tree with 4 nodes Which tree is better ? Decision Tree with 50 nodes 02/03/2021 Introduction to Data Mining, 2nd Edition 7

  8. Model Underfitting and Overfitting As the model becomes more and more complex, test errors can start increasing even though training error may be decreasing Underfitting: when model is too simple, both training and test errors are large Overfitting: when model is too complex, training error is small but test error is large 02/03/2021 Introduction to Data Mining, 2nd Edition 8

  9. Model Overfitting Impact of Training Data Size Using twice the number of data instances Increasing the size of training data reduces the difference between training and testing errors at a given size of model 02/03/2021 Introduction to Data Mining, 2nd Edition 9

  10. Model Overfitting Impact of Training Data Size Decision Tree with 50 nodes Decision Tree with 50 nodes Using twice the number of data instances Increasing the size of training data reduces the difference between training and testing errors at a given size of model 02/03/2021 Introduction to Data Mining, 2nd Edition 10

  11. Reasons for Model Overfitting Not enough training data High model complexity Multiple Comparison Procedure 02/03/2021 Introduction to Data Mining, 2nd Edition 11

  12. Effect of Multiple Comparison Procedure Consider the task of predicting whether stock market will rise/fall in the next 10 trading days Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 Day 7 Day 8 Day 9 Day 10 Up Down Down Up Down Down Up Up Up Down Random guessing: P(correct) = 0.5 Make 10 random guesses in a row: 10 10 10 + + 8 9 10 10 = = (# ) 8 0547 . 0 P correct 2 02/03/2021 Introduction to Data Mining, 2nd Edition 12

  13. Effect of Multiple Comparison Procedure Approach: Get 50 analysts Each analyst makes 10 random guesses Choose the analyst that makes the most number of correct predictions Probability that at least one analyst makes at least 8 correct predictions = 50= (# ) 8 1 1 ( . 0 0547 ) . 0 9399 P correct 02/03/2021 Introduction to Data Mining, 2nd Edition 13

  14. Effect of Multiple Comparison Procedure Many algorithms employ the following greedy strategy: Initial model: M Alternative model: M = M , where is a component to be added to the model (e.g., a test condition of a decision tree) Keep M if improvement, (M,M ) > Often times, is chosen from a set of alternative components, = { 1, 2, , k} If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting 02/03/2021 Introduction to Data Mining, 2nd Edition 14

  15. Effect of Multiple Comparison - Example Use additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes. Use 30% of the data for training and 70% of the data for testing Using only X and Y as attributes 02/03/2021 Introduction to Data Mining, 2nd Edition 15

  16. Notes on Overfitting Overfitting results in decision trees that are more complex than necessary Training error does not provide a good estimate of how well the tree will perform on previously unseen records Need ways for estimating generalization errors 02/03/2021 Introduction to Data Mining, 2nd Edition 16

  17. Model Selection Performed during model building Purpose is to ensure that model is not overly complex (to avoid overfitting) Need to estimate generalization error Using Validation Set Incorporating Model Complexity 02/03/2021 Introduction to Data Mining, 2nd Edition 17

  18. Model Selection: Using Validation Set Divide training data into two parts: Training set: use for model building Validation set: use for estimating generalization error Note: validation set is not the same as test set Drawback: Less data available for training 02/03/2021 Introduction to Data Mining, 2nd Edition 18

  19. Model Selection: Incorporating Model Complexity Rationale: Occam s Razor Given two models of similar generalization errors, one should prefer the simpler model over the more complex model A complex model has a greater chance of being fitted accidentally Therefore, one should include model complexity when evaluating a model Gen. Error(Model) = Train. Error(Model, Train. Data) + x Complexity(Model) 02/03/2021 Introduction to Data Mining, 2nd Edition 19

  20. Estimating the Complexity of Decision Trees Pessimistic Error Estimate of decision tree T with k leaf nodes: err(T): error rate on all training records : trade-off hyper-parameter (similar to ) Relative cost of adding a leaf node k: number of leaf nodes Ntrain: total number of training records 02/03/2021 Introduction to Data Mining, 2nd Edition 20

  21. Estimating the Complexity of Decision Trees: Example e(TL) = 4/24 +: 3 -: 0 +: 5 -: 2 +: 3 -: 6 +: 1 -: 4 +: 3 -: 0 e(TR) = 6/24 = 1 +: 3 -: 1 +: 0 -: 5 +: 3 -: 1 +: 2 -: 1 +: 0 -: 2 +: 1 -: 2 Decision Tree, TL Decision Tree, TR egen(TL) = 4/24 + 1*7/24 = 11/24 = 0.458 egen(TR) = 6/24 + 1*4/24 = 10/24 = 0.417 02/03/2021 Introduction to Data Mining, 2nd Edition 21

  22. Estimating the Complexity of Decision Trees Resubstitution Estimate: Using training error as an optimistic estimate of generalization error Referred to as optimistic error estimate e(TL) = 4/24 e(TR) = 6/24 +: 3 -: 0 +: 5 -: 2 +: 3 -: 6 +: 1 -: 4 +: 3 -: 0 +: 3 -: 1 +: 0 -: 5 +: 3 -: 1 +: 2 -: 1 +: 0 -: 2 +: 1 -: 2 Decision Tree, TL Decision Tree, TR 02/03/2021 Introduction to Data Mining, 2nd Edition 22

  23. Minimum Description Length (MDL) A? y 1 0 0 1 X X1 X2 X3 X4 Xn Yes No y ? ? ? ? X X1 X2 X3 X4 Xn 0 B? B1 B2 C? 1 A B C1 C2 0 1 1 ? Cost(Model,Data) = Cost(Data|Model) + x Cost(Model) Cost is the number of bits needed for encoding. Search for the least costly model. Cost(Data|Model) encodes the misclassification errors. Cost(Model) uses node encoding (number of children) plus splitting condition encoding. 02/03/2021 Introduction to Data Mining, 2nd Edition 23

  24. Model Selection for Decision Trees Pre-Pruning (Early Stopping Rule) Stop the algorithm before it becomes a fully-grown tree Typical stopping conditions for a node: Stop if all instances belong to the same class Stop if all the attribute values are the same More restrictive conditions: Stop if number of instances is less than some user-specified threshold Stop if class distribution of instances are independent of the available features (e.g., using 2 test) Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain). Stop if estimated generalization error falls below certain threshold 02/03/2021 Introduction to Data Mining, 2nd Edition 24

  25. Model Selection for Decision Trees Post-pruning Grow decision tree to its entirety Subtree replacement Trim the nodes of the decision tree in a bottom-up fashion If generalization error improves after trimming, replace sub-tree by a leaf node Class label of leaf node is determined from majority class of instances in the sub-tree 02/03/2021 Introduction to Data Mining, 2nd Edition 25

  26. Example of Post-Pruning Training Error (Before splitting) = 10/30 Pessimistic error = (10 + 0.5)/30 = 10.5/30 Class = Yes 20 Training Error (After splitting) = 9/30 Class = No Error = 10/30 10 Pessimistic error (After splitting) = (9 + 4 0.5)/30 = 11/30 PRUNE! A? A1 A4 A3 A2 Class = Yes Class = No 8 4 Class = Yes Class = No 3 4 Class = Yes Class = No 4 1 Class = Yes Class = No 5 1 02/03/2021 Introduction to Data Mining, 2nd Edition 26

  27. Examples of Post-pruning Decision Tree: depth = 1 : | breadth > 7 : class 1 | breadth <= 7 : | | breadth <= 3 : | | | ImagePages > 0.375 : class 0 | | | ImagePages <= 0.375 : | | | | totalPages <= 6 : class 1 | | | | totalPages > 6 : | | | | | breadth <= 1 : class 1 | | | | | breadth > 1 : class 0 | | width > 3 : | | | MultiIP = 0: | | | | ImagePages <= 0.1333 : class 1 | | | | ImagePages > 0.1333 : | | | | | breadth <= 6 : class 0 | | | | | breadth > 6 : class 1 | | | MultiIP = 1: | | | | TotalTime <= 361 : class 0 | | | | TotalTime > 361 : class 1 depth > 1 : | MultiAgent = 0: | | depth > 2 : class 0 | | depth <= 2 : | | | MultiIP = 1: class 0 | | | MultiIP = 0: | | | | breadth <= 6 : class 0 | | | | breadth > 6 : | | | | | RepeatedAccess <= 0.0322 : class 0 | | | | | RepeatedAccess > 0.0322 : class 1 | MultiAgent = 1: | | totalPages <= 81 : class 0 | | totalPages > 81 : class 1 Simplified Decision Tree: depth = 1 : | ImagePages <= 0.1333 : class 1 | ImagePages > 0.1333 : | | breadth <= 6 : class 0 | | breadth > 6 : class 1 depth > 1 : | MultiAgent = 0: class 0 | MultiAgent = 1: | | totalPages <= 81 : class 0 | | totalPages > 81 : class 1 Subtree Raising Subtree Replacement 02/03/2021 Introduction to Data Mining, 2nd Edition 27

  28. Model Evaluation Purpose: To estimate performance of classifier on previously unseen data (test set) Holdout Reserve k% for training and (100-k)% for testing Random subsampling: repeated holdout Cross validation Partition data into k disjoint subsets k-fold: train on k-1 partitions, test on the remaining one Leave-one-out: k=n 02/03/2021 Introduction to Data Mining, 2nd Edition 28

  29. Cross-validation Example 3-fold cross-validation 02/03/2021 Introduction to Data Mining, 2nd Edition 29

  30. Variations on Cross-validation Repeated cross-validation Perform cross-validation a number of times Gives an estimate of the variance of the generalization error Stratified cross-validation Guarantee the same percentage of class labels in training and test Important when classes are imbalanced and the sample is small Use nested cross-validation approach for model selection and evaluation 02/03/2021 Introduction to Data Mining, 2nd Edition 30

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#