A Machine Learning practical with Kaggle

Alan Chalk
Delve into the world of practical machine learning with Kaggle expert Alan Chalk as he covers topics like loss functions, greedy algorithms, performance measurement, feature engineering, and more over the course of 45 minutes. Learn about decision trees, random forests, gradient boosting, and the process of creating pipelines and baselines for machine learning projects. Follow a journey of improvement, from baselines to final predictions, while understanding the nuances of bias and variance in model training. Explore examples, evaluation metrics, and hands-on R and Python code snippets to enhance your machine learning skills.

Presentation Transcript


  1. A Machine Learning practical with Kaggle Alan Chalk

  2. Disclaimer. The views expressed in this presentation are those of the author, Alan Chalk, and not necessarily of the Staple Inn Actuarial Society.

  3. Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, R, Python, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions.

  4. RentHop (Two Sigma Connect)

  5. Example RentHop posting

  6. What is our task?

  7. Evaluation (loss function). Example predictions:
     Example (i)   True class (j)   P(high)   P(medium)   P(low)
     1             medium           0.1       0.6         0.3
     2             high             0.0       0.5         0.5
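
  The competition is scored on how good the predicted probabilities themselves are. Assuming the metric is multi-class logarithmic loss (which the probability table above suggests; the exact clipping constant Kaggle uses may differ), a minimal R sketch of the calculation for the two example listings is:

     # multi-class log loss: minus the mean log-probability assigned to the true class
     mlogloss <- function(y_true, prob) {
       # y_true: factor of true classes; prob: matrix of predicted probabilities,
       # one column per class, in the same order as levels(y_true)
       p <- prob[cbind(seq_along(y_true), as.integer(y_true))]
       -mean(log(pmax(p, 1e-15)))   # clip so a zero probability does not give -Inf
     }

     y_true <- factor(c("medium", "high"), levels = c("high", "medium", "low"))
     prob   <- rbind(c(0.1, 0.6, 0.3),   # listing 1: P(high), P(medium), P(low)
                     c(0.0, 0.5, 0.5))   # listing 2
     mlogloss(y_true, prob)   # large, because listing 2 gives probability 0 to its true class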

  8. Process: create a pipeline, make a baseline guess, then improve the guess.

  9. Create pipeline and baseline: read raw data, clean data, create a guess, submit on Kaggle.

  10. First R code: 00a_Packages.R, 00b_Working Directories.R, 01a_ReadRawData.R, 01b_CleanData.R, 04__LoadAndPrepareData.R, 04a_BaselinePredictions.R
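
  As an illustration of the kind of baseline such a pipeline might submit (a sketch, not the contents of 04a_BaselinePredictions.R; idx_train1 and idx_test are assumed index vectors), one option is to give every test listing the class frequencies observed in the training data:

     class_props <- prop.table(table(dt_all$interest_level[idx_train1]))

     submission <- data.frame(listing_id = dt_all$listing_id[idx_test],
                              high       = class_props[["high"]],
                              medium     = class_props[["medium"]],
                              low        = class_props[["low"]])
     write.csv(submission, "baseline_submission.csv", row.names = FALSE)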

  11. Bottom of the leaderboard

  12. Decision trees bathrooms, bedrooms, latitude, longitude, price, listing_id

  13. Some decision tree vocab: CART (rpart), C5.0 etc.; split rule (loss function); NP-hard; greedy; over-fitting; complexity parameter.

  14. R code: formulas. In an R formula, ~ means "is allowed to depend on". Our first formula is: interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id (in our code, you will see that this formula is saved in a variable called "fmla_").
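
  For longer feature lists it is often easier to build the formula programmatically; a small sketch (the name x_vars is illustrative, not taken from the course code):

     x_vars <- c("bathrooms", "bedrooms", "latitude", "longitude", "price", "listing_id")
     fmla_  <- as.formula(paste("interest_level ~", paste(x_vars, collapse = " + ")))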

  15. R code: rpart code
     rpart_1 <- rpart(fmla_,
                      data = dt_all[idx_train1, ],
                      method = "class",
                      cp = 1e-8)
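
  A quick sanity check on the fitted tree is to look at the class probabilities it produces for held-out rows; a sketch (idx_valid is an assumed validation index, not shown on the slide):

     probs_valid <- predict(rpart_1, dt_all[idx_valid, ], type = "prob")
     head(probs_valid)   # one row per listing, one probability column per interest level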

  16. R code 04b01_rpart.R

  17. A first tree for RentHop. [rpart tree plot: the root node covers 100% of the listings, has class proportions .08 high / .23 medium / .70 low and predicts "low"; the first split is price < 2016, followed by further splits on price, bedrooms and longitude.]

  18. How can we do better? Better techniques. More features (feature engineering). Anything else?

  19. Feature engineering? Based only on the data and files provided: bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, listing_id, longitude, manager_id, photos, price, street_address, interest_level. Note: you also have loads and loads of photos. "description" is free format. "features" is a list of words.

  20. Feature engineering? Simple features: price per bedroom, bathroom / bedroom ratio, created hour or day of week. Simplifications of complex features: number of photos, number of words in description. Presence of each feature or not, e.g. laundry yes or no. Good value rental.
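
  A sketch of how some of these features might be built, assuming the listings sit in a data.table called dt_all with the raw columns listed on the previous slide (and with photos and features held as list columns):

     library(data.table)

     dt_all[, price_per_bedroom := price / pmax(bedrooms, 1)]   # pmax avoids dividing by zero
     dt_all[, bath_bed_ratio    := bathrooms / pmax(bedrooms, 1)]
     dt_all[, created_hour      := as.integer(format(as.POSIXct(created), "%H"))]
     dt_all[, created_wday      := weekdays(as.POSIXct(created))]
     dt_all[, n_photos          := lengths(photos)]
     dt_all[, n_desc_words      := lengths(strsplit(description, "\\s+"))]
     dt_all[, has_laundry       := grepl("laundry",
                                         tolower(sapply(features, paste, collapse = " ")))]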

  21. High cardinality features? manager_id, building_id. Simplifications: "size" of manager or building; turn into numeric. What else?
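
  The "size" simplification, for example, is just a group count; a data.table sketch under the same dt_all assumption as above:

     # number of listings handled by each manager / located in each building
     dt_all[, manager_size  := .N, by = manager_id]
     dt_all[, building_size := .N, by = building_id]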

  22. Leakage features? A key aspect of winning Machine Learning competitions. Where might there be leakage in the data we have been given? Paper: "Leakage in Data Mining: Formulation, Detection, and Avoidance" (Kaufman, Rosset and Perlich).

  23. Now what? We have loads of features – good. But there is every chance that our decision tree will pick up random noise in the training data (called "variance"). How can we control for this? Cost complexity pruning.
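
  In rpart, cost complexity pruning can be driven by the cross-validated error stored in the cp table of the over-grown tree; a minimal sketch (assuming rpart_1 is the tree grown with cp = 1e-8 earlier):

     cp_table <- rpart_1$cptable                             # one row per candidate subtree
     best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
     rpart_pruned <- prune(rpart_1, cp = best_cp)            # keep only splits worth their complexity cost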

  24. R code 02b_FeatureCreation_1/2/3.R 04b_01_rpart.R 04b_02_VariableImportance.R

  25. Training and validation curves
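
  A sketch of how such curves can be produced by refitting the tree over a grid of complexity parameters (misclassification error is used here for simplicity instead of the competition log loss; dt_all, fmla_, idx_train1 and an assumed validation index idx_valid are as before):

     library(rpart)

     cp_grid <- 10^seq(-1, -5)   # from heavy to very light penalisation
     err <- sapply(cp_grid, function(cp) {
       fit <- rpart(fmla_, data = dt_all[idx_train1, ], method = "class", cp = cp)
       c(train = mean(predict(fit, dt_all[idx_train1, ], type = "class")
                      != dt_all$interest_level[idx_train1]),
         valid = mean(predict(fit, dt_all[idx_valid, ], type = "class")
                      != dt_all$interest_level[idx_valid]))
     })

     matplot(log10(cp_grid), t(err), type = "b", pch = 1:2,
             xlab = "log10(cp)", ylab = "misclassification error")
     legend("topleft", legend = rownames(err), lty = 1:2, col = 1:2, pch = 1:2)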

  26. Variable importance. [Gini (training) based variable importance plot; the features shown include manager_id_mean_med, manager_id_mean_high, price, building_id_mean_high, time_stamp, building_id_mean_med, pricePerBed, pricePerRoom, listing_id and street_address_mean_high.]

  27. Random Forest. Introduce randomness. Why? Bootstrapping and then aggregating the results ("bagging"). How else can we create randomness? Sample the features available to each split. OOB error. What are our hyper-parameters? Number of trees? nodesize? mtry?

  28. R code: random forest code
     randomForest(
         x = dt_train,
         y = as.factor(y_train),
         ntree = 300,
         nodesize = 1,
         mtry = 6,
         keep.forest = TRUE)
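
  The out-of-bag (OOB) error is what lets these hyper-parameters be compared without a separate validation set; a sketch of inspecting it (assuming the fitted forest from the call above is stored in rf_1):

     plot(rf_1)                     # OOB error rate as trees are added
     rf_1$err.rate[rf_1$ntree, ]    # final OOB error, overall and per class
     varImpPlot(rf_1)               # Gini-based variable importance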

  29. R code 04c_RandomForest.R

  30. Gradient boosting. Add lots of "weak learners". Create new weak learners by focusing on examples which are incorrectly classified. Add all the weak learners using weights which are higher for the better weak learners. The weak learners are "adaptive basis functions".

  31. Hyperparameters: learning rate, depth of trees, min child weight, data subsampling, column subsampling. Tuning methods: grid search, random search, hyperopt.
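
  The hyper-parameters above map onto xgboost parameter names roughly as follows; a sketch in R syntax with illustrative values (not the settings used in the notebook):

     param <- list(objective        = "multi:softprob",
                   num_class        = 3,
                   eval_metric      = "mlogloss",
                   eta              = 0.1,    # learning rate
                   max_depth        = 6,      # depth of trees
                   min_child_weight = 1,
                   subsample        = 0.8,    # data (row) subsampling
                   colsample_bytree = 0.8)    # column subsampling

  Grid search, random search and hyperopt then amount to different ways of generating many such parameter lists and keeping the one with the best validation log loss.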

  32. Python code
     xgb.train(params = param
               , dtrain = xg_train
               , num_boost_round = num_rounds
               , evals = watchlist
               , early_stopping_rounds = 20
               , verbose_eval = False
               )
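
  For those following along in R rather than Python, the corresponding call with the R xgboost package might look like this (a sketch: xg_train, xg_valid and num_rounds are assumed to exist already, and param could be the list sketched under the hyper-parameters slide):

     library(xgboost)

     bst <- xgb.train(params    = param,
                      data      = xg_train,     # an xgb.DMatrix of training features and labels
                      nrounds   = num_rounds,
                      watchlist = list(train = xg_train, valid = xg_valid),
                      early_stopping_rounds = 20,
                      verbose   = 0)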

  33. Gradient boosting (machines) 04d_GradientBoosting_presentation.ipynb

  34. Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, R, Python, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions.
