A Machine Learning practical with Kaggle

Alan Chalk
Delve into the world of practical machine learning with Kaggle expert Alan Chalk as he covers topics like loss functions, greedy algorithms, performance measurement, feature engineering, and more over the course of 45 minutes. Learn about decision trees, random forests, gradient boosting, and the process of creating pipelines and baselines for machine learning projects. Follow a journey of improvement, from baselines to final predictions, while understanding the nuances of bias and variance in model training. Explore examples, evaluation metrics, and hands-on R and Python code snippets to enhance your machine learning skills.

Presentation Transcript


  1. A Machine Learning practical with Kaggle Alan Chalk

  2. Disclaimer. The views expressed in this presentation are those of the author, Alan Chalk, and not necessarily of the Staple Inn Actuarial Society.

  3. Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, R, Python, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions.

  4. RentHop (Two Sigma Connect)

  5. Example RentHop posting

  6. What is our task?

  7. Evaluation (loss function). Example predictions:
     Example (i)   True class (j)   P(high)   P(medium)   P(low)
     1             medium           0.1       0.6         0.3
     2             high             0.0       0.5         0.5
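
  The competition is scored on how good the predicted probabilities themselves are. Assuming the metric is multi-class logarithmic loss (which the probability table above suggests; the exact clipping constant Kaggle uses may differ), a minimal R sketch of the calculation for the two example listings is:

     # multi-class log loss: minus the mean log-probability assigned to the true class
     mlogloss <- function(y_true, prob) {
       # y_true: factor of true classes; prob: matrix of predicted probabilities,
       # one column per class, in the same order as levels(y_true)
       p <- prob[cbind(seq_along(y_true), as.integer(y_true))]
       -mean(log(pmax(p, 1e-15)))   # clip so a zero probability does not give -Inf
     }

     y_true <- factor(c("medium", "high"), levels = c("high", "medium", "low"))
     prob   <- rbind(c(0.1, 0.6, 0.3),   # listing 1: P(high), P(medium), P(low)
                     c(0.0, 0.5, 0.5))   # listing 2
     mlogloss(y_true, prob)   # large, because listing 2 gives probability 0 to its true class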

  8. Process: create a pipeline, make a baseline guess, then improve the guess.

  9. Create pipeline and baseline: read raw data, clean data, create a guess, submit on Kaggle.

  10. First R code: 00a_Packages.R, 00b_Working Directories.R, 01a_ReadRawData.R, 01b_CleanData.R, 04__LoadAndPrepareData.R, 04a_BaselinePredictions.R
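
  As an illustration of the kind of baseline such a pipeline might submit (a sketch, not the contents of 04a_BaselinePredictions.R; idx_train1 and idx_test are assumed index vectors), one option is to give every test listing the class frequencies observed in the training data:

     class_props <- prop.table(table(dt_all$interest_level[idx_train1]))

     submission <- data.frame(listing_id = dt_all$listing_id[idx_test],
                              high       = class_props[["high"]],
                              medium     = class_props[["medium"]],
                              low        = class_props[["low"]])
     write.csv(submission, "baseline_submission.csv", row.names = FALSE)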

  11. Bottom of the leaderboard

  12. Decision trees bathrooms, bedrooms, latitude, longitude, price, listing_id

  13. Some decision tree vocab: CART (rpart), C5.0 etc.; split rule (loss function); NP-hard; greedy; over-fitting; complexity parameter.

  14. R code: formulas. In an R formula, ~ means "is allowed to depend on". Our first formula is: interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id (in our code, you will see that this formula is saved in a variable called "fmla_").
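
  For longer feature lists it is often easier to build the formula programmatically; a small sketch (the name x_vars is illustrative, not taken from the course code):

     x_vars <- c("bathrooms", "bedrooms", "latitude", "longitude", "price", "listing_id")
     fmla_  <- as.formula(paste("interest_level ~", paste(x_vars, collapse = " + ")))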

  15. R code: rpart code
     rpart_1 <- rpart(fmla_,
                      data = dt_all[idx_train1, ],
                      method = "class",
                      cp = 1e-8)
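
  A quick sanity check on the fitted tree is to look at the class probabilities it produces for held-out rows; a sketch (idx_valid is an assumed validation index, not shown on the slide):

     probs_valid <- predict(rpart_1, dt_all[idx_valid, ], type = "prob")
     head(probs_valid)   # one row per listing, one probability column per interest level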

  16. R code 04b01_rpart.R

  17. A first tree for RentHop. [rpart tree plot: the root node covers 100% of the listings, has class proportions .08 high / .23 medium / .70 low and predicts "low"; the first split is price < 2016, followed by further splits on price, bedrooms and longitude.]

  18. How can we do better? Better techniques. More features (feature engineering). Anything else?

  19. Feature engineering? Based only on the data and files provided: bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, listing_id, longitude, manager_id, photos, price, street_address, interest_level. Note: you also have loads and loads of photos. "description" is free format. "features" is a list of words.

  20. Feature engineering? Simple features: price per bedroom, bathroom / bedroom ratio, created hour or day of week. Simplifications of complex features: number of photos, number of words in description. Presence of each feature or not, e.g. laundry yes or no. Good value rental.
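
  A sketch of how some of these features might be built, assuming the listings sit in a data.table called dt_all with the raw columns listed on the previous slide (and with photos and features held as list columns):

     library(data.table)

     dt_all[, price_per_bedroom := price / pmax(bedrooms, 1)]   # pmax avoids dividing by zero
     dt_all[, bath_bed_ratio    := bathrooms / pmax(bedrooms, 1)]
     dt_all[, created_hour      := as.integer(format(as.POSIXct(created), "%H"))]
     dt_all[, created_wday      := weekdays(as.POSIXct(created))]
     dt_all[, n_photos          := lengths(photos)]
     dt_all[, n_desc_words      := lengths(strsplit(description, "\\s+"))]
     dt_all[, has_laundry       := grepl("laundry",
                                         tolower(sapply(features, paste, collapse = " ")))]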

  21. High cardinality features? manager_id, building_id. Simplifications: "size" of manager or building; turn into numeric. What else?
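
  The "size" simplification, for example, is just a group count; a data.table sketch under the same dt_all assumption as above:

     # number of listings handled by each manager / located in each building
     dt_all[, manager_size  := .N, by = manager_id]
     dt_all[, building_size := .N, by = building_id]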

  22. Leakage features? A key aspect of winning Machine Learning competitions. Where might there be leakage in the data we have been given? Paper: "Leakage in Data Mining: Formulation, Detection, and Avoidance" (Kaufman, Rosset and Perlich).

  23. Now what? We have loads of features – good. But there is every chance that our decision tree will pick up random noise in the training data (called "variance"). How can we control for this? Cost complexity pruning.
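
  In rpart, cost complexity pruning can be driven by the cross-validated error stored in the cp table of the over-grown tree; a minimal sketch (assuming rpart_1 is the tree grown with cp = 1e-8 earlier):

     cp_table <- rpart_1$cptable                             # one row per candidate subtree
     best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
     rpart_pruned <- prune(rpart_1, cp = best_cp)            # keep only splits worth their complexity cost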

  24. R code 02b_FeatureCreation_1/2/3.R 04b_01_rpart.R 04b_02_VariableImportance.R

  25. Training and validation curves
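
  A sketch of how such curves can be produced by refitting the tree over a grid of complexity parameters (misclassification error is used here for simplicity instead of the competition log loss; dt_all, fmla_, idx_train1 and an assumed validation index idx_valid are as before):

     library(rpart)

     cp_grid <- 10^seq(-1, -5)   # from heavy to very light penalisation
     err <- sapply(cp_grid, function(cp) {
       fit <- rpart(fmla_, data = dt_all[idx_train1, ], method = "class", cp = cp)
       c(train = mean(predict(fit, dt_all[idx_train1, ], type = "class")
                      != dt_all$interest_level[idx_train1]),
         valid = mean(predict(fit, dt_all[idx_valid, ], type = "class")
                      != dt_all$interest_level[idx_valid]))
     })

     matplot(log10(cp_grid), t(err), type = "b", pch = 1:2,
             xlab = "log10(cp)", ylab = "misclassification error")
     legend("topleft", legend = rownames(err), lty = 1:2, col = 1:2, pch = 1:2)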

  26. Variable importance. [Gini (training) based variable importance plot; the features shown include manager_id_mean_med, manager_id_mean_high, price, building_id_mean_high, time_stamp, building_id_mean_med, pricePerBed, pricePerRoom, listing_id and street_address_mean_high.]

  27. Random Forest. Introduce randomness. Why? Bootstrapping and then aggregating the results ("bagging"). How else can we create randomness? Sample the features available to each split. OOB error. What are our hyper-parameters? Number of trees? nodesize? mtry?

  28. R code: random forest code
     randomForest(
         x = dt_train,
         y = as.factor(y_train),
         ntree = 300,
         nodesize = 1,
         mtry = 6,
         keep.forest = TRUE)
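
  The out-of-bag (OOB) error is what lets these hyper-parameters be compared without a separate validation set; a sketch of inspecting it (assuming the fitted forest from the call above is stored in rf_1):

     plot(rf_1)                     # OOB error rate as trees are added
     rf_1$err.rate[rf_1$ntree, ]    # final OOB error, overall and per class
     varImpPlot(rf_1)               # Gini-based variable importance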

  29. R code 04c_RandomForest.R

  30. Gradient boosting. Add lots of "weak learners". Create new weak learners by focusing on examples which are incorrectly classified. Add all the weak learners using weights which are higher for the better weak learners. The weak learners are "adaptive basis functions".

  31. Hyperparameters: learning rate, depth of trees, min child weight, data subsampling, column subsampling. Tuning methods: grid search, random search, hyperopt.
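
  The hyper-parameters above map onto xgboost parameter names roughly as follows; a sketch in R syntax with illustrative values (not the settings used in the notebook):

     param <- list(objective        = "multi:softprob",
                   num_class        = 3,
                   eval_metric      = "mlogloss",
                   eta              = 0.1,    # learning rate
                   max_depth        = 6,      # depth of trees
                   min_child_weight = 1,
                   subsample        = 0.8,    # data (row) subsampling
                   colsample_bytree = 0.8)    # column subsampling

  Grid search, random search and hyperopt then amount to different ways of generating many such parameter lists and keeping the one with the best validation log loss.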

  32. Python code
     xgb.train(params = param
               , dtrain = xg_train
               , num_boost_round = num_rounds
               , evals = watchlist
               , early_stopping_rounds = 20
               , verbose_eval = False
               )
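
  For those following along in R rather than Python, the corresponding call with the R xgboost package might look like this (a sketch: xg_train, xg_valid and num_rounds are assumed to exist already, and param could be the list sketched under the hyper-parameters slide):

     library(xgboost)

     bst <- xgb.train(params    = param,
                      data      = xg_train,     # an xgb.DMatrix of training features and labels
                      nrounds   = num_rounds,
                      watchlist = list(train = xg_train, valid = xg_valid),
                      early_stopping_rounds = 20,
                      verbose   = 0)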

  33. Gradient boosting (machines) 04d_GradientBoosting_presentation.ipynb

  34. Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, R, Python, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions.
