Classifiers in Data Analysis

Week 1, video 3: Classifiers, Part 1
Prediction
 
Develop a model which can infer a single aspect of
the data (predicted variable) from some
combination of other aspects of the data (predictor
variables)
 
Sometimes used to predict the future
Sometimes used to make inferences about the
present
Classification
 
There is something you want to predict (“the label”)
The thing you want to predict is categorical
The answer is one of a set of categories, not a number
 
CORRECT/WRONG (sometimes expressed as 0,1)
We’ll talk about this specific problem later in the course,
when we cover latent knowledge estimation
HELP REQUEST/WORKED EXAMPLE
REQUEST/ATTEMPT TO SOLVE
WILL DROP OUT/WON’T DROP OUT
WILL ENROLL IN MOOC A,B,C,D,E,F, or G
Where do those labels come from?
 
In-software performance
School records
Test data
Survey data
Field observations or video coding
Text replays
Classification
 
Associated with each label is a set
of “features”, which you may be able to
use to predict the label
 
Skill           pknow   time   totalactions   right
ENTERINGGIVEN   0.704    9     1              WRONG
ENTERINGGIVEN   0.502   10     2              RIGHT
USEDIFFNUM      0.049    6     1              WRONG
ENTERINGGIVEN   0.967    7     3              RIGHT
REMOVECOEFF     0.792   16     1              WRONG
REMOVECOEFF     0.792   13     2              RIGHT
USEDIFFNUM      0.073    5     2              RIGHT
...
Classification
 
The basic idea of a classifier is to
determine which features, in which
combination, can predict the label
 
(Same feature table as above.)
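As a concrete illustration (a sketch, not from the lecture; it assumes the pandas library and the column names are taken from the table above), the data can be represented as a frame where “right” is the label and the remaining columns are candidate features:

```python
import pandas as pd

# The slide's example table: "right" is the label to predict,
# the other columns are candidate features
data = pd.DataFrame({
    "skill": ["ENTERINGGIVEN", "ENTERINGGIVEN", "USEDIFFNUM", "ENTERINGGIVEN",
              "REMOVECOEFF", "REMOVECOEFF", "USEDIFFNUM"],
    "pknow": [0.704, 0.502, 0.049, 0.967, 0.792, 0.792, 0.073],
    "time": [9, 10, 6, 7, 16, 13, 5],
    "totalactions": [1, 2, 1, 3, 1, 2, 2],
    "right": ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"],
})

X = data.drop(columns="right")  # predictor variables ("features")
y = data["right"]               # predicted variable ("label")
```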
Classification
 
Of course, usually there are more than 4 features
 
And more than 7 actions/data points
 
 
Classifiers
 
There are hundreds of classification algorithms you
can choose
 
Largely, they boil down to neural networks and
“classic algorithms”
Neural networks have been around for a long time, but
they got a lot better a few years ago
Some more recent algorithms such as XGBoost get
treated as “classic” too
Some popular “classic algorithms”
 
Logistic Regression (and close relatives like LASSO)
Decision Trees (J48/C4.5/CART)
Random Forest
XGBoost
 
There are many others!
For a fuller selection of classic algorithms, see the
previous editions of this textbook
Step Regression
 
Not step-wise regression
 
Used for binary classification (0,1)
 
Rarely used anymore, but I’m discussing it because
it’s a building block to other concepts
Step Regression
 
Fits a linear regression function
(as discussed in previous video)
with an arbitrary cut-off
 
Selects parameters
Assigns a weight to each parameter
Computes a numerical value
 
Then all values below 0.5 are treated as 0, and all values
>= 0.5 are treated as 1
Important: you can use a different cut-off than 0.5!
 
Example
Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
Cut-off 0.5

a    b    c    d    Y      Class
1    1    1    1    1.7    1
0    0    0    0    0.3    0
-1   -1   1    3    0.1    0

Quiz
Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
Cut-off 0.5

a    b    c    d    Y
2    -1   0    1    ?
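A minimal sketch in Python of how the worked rows compute (the coefficients and cut-off are the ones on the slide; the function name is mine):

```python
# Step regression sketch: compute the linear value Y, then threshold at the cut-off
def step_regression(a, b, c, d, cutoff=0.5):
    y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d + 0.3
    return y, (0 if y < cutoff else 1)

# The three worked rows from the example slide
for row in [(1, 1, 1, 1), (0, 0, 0, 0), (-1, -1, 1, 3)]:
    print(row, step_regression(*row))
# (1, 1, 1, 1)   -> Y = 1.7  -> class 1
# (0, 0, 0, 0)   -> Y = 0.3  -> class 0
# (-1, -1, 1, 3) -> Y ≈ 0.1  -> class 0
```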
Logistic Regression
 
Another algorithm for binary classification (0,1)
Logistic Regression
 
Fits a logistic function to the data to find the
frequency/odds of a specific value of the dependent
variable, given a specific set of values of the
predictor variables
Logistic Regression

[Figure: the logistic function p(m), an S-shaped curve rising from 0 toward 1 as m runs from -4 to 4, crossing p(m) = 0.5 at m = 0]

Logistic Regression

m = a0 + a1v1 + a2v2 + a3v3 + a4v4 + …

p(m) = 1 / (1 + e^(-m))

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A     B     C     m      p(m)
0     0     0     0      0.50
1     1     1     1.0    0.73
-1    -1    -1    -1.0   0.27
2     2     2     2.0    0.88
3     3     3     3.0    0.95
50    50    50    50.0   ~1
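The p(m) column follows directly from the logistic function; a quick sketch (the helper names are mine, the coefficients are the slide’s):

```python
import math

# p(m) = 1 / (1 + e^(-m)) maps the linear score m onto a probability in (0, 1)
def logistic(m):
    return 1.0 / (1.0 + math.exp(-m))

def p_of_m(a, b, c):
    m = 0.2 * a + 0.3 * b + 0.5 * c  # coefficients from the slide
    return m, logistic(m)

for a, b, c in [(0, 0, 0), (1, 1, 1), (-1, -1, -1), (2, 2, 2), (3, 3, 3), (50, 50, 50)]:
    m, p = p_of_m(a, b, c)
    print((a, b, c), m, round(p, 2))  # 0.5, 0.73, 0.27, 0.88, 0.95, 1.0
```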
Relatively conservative
 
Thanks to its simple functional form, logistic regression
is a relatively conservative algorithm
I’ll explain this in more detail later in the course
Good for
 
Cases where changes in the value of the predictor
variables have predictable effects on the probability
of the predicted variable’s class
 
m = 0.2A + 0.3B + 0.5C
 
Higher A always leads to higher probability
But there are some data sets where this isn’t true!
What about interaction effects?
 
A = Bad
 
B = Bad
 
A+B = Good
What about interaction effects?
 
Ineffective Educational Software = Bad
 
Off-Task Behavior = Bad
 
Ineffective Educational Software 
PLUS
 
Off-Task Behavior = Good
Logistic and Step Regression are good when
interactions are not particularly common
 
They can be given interaction effects through automated
feature distillation (we’ll discuss this later; a sketch
follows below)
 
But they are not particularly well suited to this
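One common form of feature distillation is adding product terms, so a linear model can weight a combination of features separately from the features themselves. A sketch assuming scikit-learn (the lecture’s “automated feature distillation” may be broader than this):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two features whose combination matters: A alone, B alone, and A*B together
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# interaction_only=True adds the product A*B without adding squared terms
distiller = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_distilled = distiller.fit_transform(X)

print(distiller.get_feature_names_out(["A", "B"]))  # ['A' 'B' 'A B']
print(X_distilled)  # the A*B column lets a linear model weight the combination
```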
What about interaction effects?
 
Fast Responses + Material Student Already Knows -
> Associated with Better Learning
 
Fast Responses + Material Student Does not Know -
> Associated with Worse Learning
Decision Trees
 
An approach that explicitly deals with interaction
effects
Decision Tree

KNOWLEDGE
  <0.5  -> TIME
    <6s.  -> RIGHT
    >=6s. -> WRONG
  >=0.5 -> TOTALACTIONS
    <4  -> RIGHT
    >=4 -> WRONG

Skill          knowledge   time   totalactions   right?
COMPUTESLOPE   0.544       9      1              ?

Decision Tree

(Same tree as above.)

Skill          knowledge   time   totalactions   right?
COMPUTESLOPE   0.544       9      1              RIGHT

Decision Tree

(Same tree as above.)

Skill          knowledge   time   totalactions   right?
COMPUTESLOPE   0.444       9      1              ?
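The tree above is just nested if/else logic; a hand-coded sketch (the function name is mine):

```python
# The decision tree from the slide as explicit branching logic
def classify(knowledge, time, totalactions):
    if knowledge < 0.5:
        # Low-knowledge branch splits on response time
        return "RIGHT" if time < 6 else "WRONG"
    # High-knowledge branch splits on number of actions
    return "RIGHT" if totalactions < 4 else "WRONG"

print(classify(0.544, 9, 1))  # RIGHT: knowledge >= 0.5, totalactions < 4
print(classify(0.444, 9, 1))  # WRONG: knowledge < 0.5, time >= 6s
```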
Decision Tree Algorithms
 
There are several decent Decision Tree classifiers:
C4.5/J48 and CART are good choices
You already saw REPTree and M5’ as Decision Tree
regressors (previous video)
 
Decision Trees are the basis of more sophisticated
algorithms like Random Forest and XGBoost (next
video)
Relatively conservative by themselves, making them
more robust to changes in conditions than other
algorithms (see Levin et al., 2022)
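For reference, a CART-style tree can be fit with scikit-learn; a sketch using the example table from earlier (dropping the categorical skill column for simplicity; this is one possible implementation, not the lecture’s):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: pknow, time, totalactions (rows from the earlier example table)
X = [[0.704, 9, 1], [0.502, 10, 2], [0.049, 6, 1], [0.967, 7, 3],
     [0.792, 16, 1], [0.792, 13, 2], [0.073, 5, 2]]
y = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]

# scikit-learn's DecisionTreeClassifier implements a CART-style algorithm
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["pknow", "time", "totalactions"]))
```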
Good when data has natural splits

[Figure: two example feature distributions illustrating natural split points in the data]
Good when multi-level interactions
are common
Good when same construct can be
arrived at in multiple ways
 
A student is likely to drop out of college when he
Starts assignments early but lacks prerequisites
 
OR when he
Starts assignments the day they’re due
What variables should you use?
 
What variables should you use?
In one sense, the entire point of data mining is to
figure out which variables matter
But some variables have more construct validity or
theoretical justification than others – using those
variables generally leads to more generalizable
models
We’ll talk more about this in a future lecture
What variables should you use?
In one sense, the entire point of data mining is to
figure out which variables matter
More urgently, some variables will make your model
general only to the data set where they were
trained
These should not be included in your model
They are typically the variables you want to test
generalizability across during cross-validation
More on this later
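One way to test generalizability across such variables is to use them as a grouping factor in cross-validation rather than as features, so every test fold contains only unseen groups. A sketch assuming scikit-learn, with hypothetical random data standing in for real student logs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((200, 4))             # hypothetical features
y = rng.integers(0, 2, 200)          # hypothetical 0/1 labels
students = rng.integers(0, 10, 200)  # which student produced each row

# GroupKFold holds out whole students, so the model is always
# evaluated on students it never saw during training
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=students):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    print(round(accuracy_score(y[test_idx], clf.predict(X[test_idx])), 2))
```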
Example
Your model of student off-task behavior should not
depend on which student you have
“If student = BOB, and time > 80 seconds, then…”
This model won’t be useful when you’re looking at
totally new students
Example
Your model of student off-task behavior should not
depend on which college the student is in
“If school = University of Pennsylvania, and time >
80 seconds, then…”
This model won’t be useful when you’re looking at
data from new colleges
Note
In modern statistics, you often need to explicitly
include these types of variables in models to
conduct valid statistical testing
This is a difference between classification and
statistical modeling
We’ll discuss it more in future lectures
Later Lectures
 
More classification algorithms
 
Goodness metrics for comparing classifiers
 
Validating classifiers for generalizability
 
What does it mean for a classifier to be
conservative?
Next Lecture
 
More advanced/modern classifiers