A Machine Learning practical with Kaggle
Delve into the world of practical machine learning with Kaggle expert Alan Chalk as he covers loss functions, greedy algorithms, performance measurement, feature engineering and more over the course of 45 minutes. Learn about decision trees, random forests, gradient boosting, and how to create pipelines and baselines for machine learning projects. Follow the journey from a baseline guess to final predictions while getting to grips with bias and variance in model training. Worked examples, evaluation metrics and hands-on R and Python code snippets are included throughout.
Presentation Transcript
A Machine Learning practical with Kaggle Alan Chalk
Disclaimer: The views expressed in this presentation are those of the author, Alan Chalk, and not necessarily those of the Staple Inn Actuarial Society.
Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning; R and Python; decision trees, random forests, gradient boosting; basis functions and adaptive basis functions.
Evaluation (loss function)

Example (i)   True class (j)   Probability high   Probability medium   Probability low
1             medium           0.1                0.6                  0.3
2             high             0.0                0.5                  0.5
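The renthop competition is scored with multi-class log loss. A minimal R sketch of that metric for the two examples above (the clipping constant and object names are illustrative assumptions, not the author's code):

# Multi-class log loss for the two examples above. Probabilities are
# clipped away from 0 so a confident wrong answer is not infinitely penalised.
probs <- matrix(c(0.1, 0.6, 0.3,
                  0.0, 0.5, 0.5),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("high", "medium", "low")))
true_class <- c("medium", "high")

eps    <- 1e-15
p_true <- probs[cbind(1:2, match(true_class, colnames(probs)))]
p_true <- pmin(pmax(p_true, eps), 1 - eps)
mean(-log(p_true))   # average negative log-probability of the true class

Note how example 2, which puts zero probability on the true class, dominates the loss; confident wrong answers are punished heavily by this metric.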
Process: Create pipeline → Baseline guess → Improve guess.
Create pipeline and baseline: read raw data → clean data → create guess → submit on Kaggle. (Stages covered: Create pipeline, Baseline guess.)
First R code 00a_Packages.R 00b_Working Directories.R 01a_ReadRawData.R 01b_CleanData.R 04__LoadAndPrepareData.R 04a_BaselinePredictions.R
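A baseline guess (04a_BaselinePredictions.R) could be as simple as predicting the overall training class proportions for every listing. A hypothetical sketch, not the author's actual script (dt_train_raw, dt_test_raw and the output file name are made up):

# Baseline: give every test listing the class proportions seen in training
class_props <- prop.table(table(dt_train_raw$interest_level))
baseline <- data.frame(listing_id = dt_test_raw$listing_id,
                       high   = class_props["high"],
                       medium = class_props["medium"],
                       low    = class_props["low"])
write.csv(baseline, "baseline_submission.csv", row.names = FALSE)

Even this trivial guess gives a leaderboard score to beat and proves the pipeline works end to end.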
Decision trees. Features: bathrooms, bedrooms, latitude, longitude, price, listing_id.
Some decision tree vocab: CART (rpart), C5.0 etc.; split rule (loss function); NP-hard; greedy; over-fitting; complexity parameter.
R code: formulas. In an R formula, ~ means "is allowed to depend on". Our first formula is: interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id (In our code, you will see that this formula is saved in a variable called fmla_.)
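One way fmla_ could be built programmatically, so the predictor list lives in a single vector (a sketch; x_vars is a made-up name and the actual script may differ):

# Build the formula from a character vector of predictor names
x_vars <- c("bathrooms", "bedrooms", "latitude",
            "longitude", "price", "listing_id")
fmla_  <- as.formula(paste("interest_level ~",
                           paste(x_vars, collapse = " + ")))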
R code: rpart code

rpart_1 <- rpart(fmla_,
                 data   = dt_all[idx_train1, ],
                 method = "class",
                 cp     = 1e-8)   # very small cp: grow a deep tree now, prune it later
R code 04b01_rpart.R
A first tree for renthop. [rpart plot of the first fitted tree: the root node is 8% high / 23% medium / 70% low interest; the first split is on price < 2016, with further splits on price, bedrooms and longitude; the large right-hand leaf (89% of listings) is predicted low.]
How can we do better? Better techniques; more features (feature engineering); anything else?
Feature engineering? Based only on the data and files provided: bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, listing_id, longitude, manager_id, photos, price, street_address, interest_level. Note: you also have loads and loads of photos; description is free format; features is a list of words.
Feature engineering? Simple features: price per bedroom, bathroom / bedroom ratio, created hour or day of week. Simplifications of complex features: number of photos, number of words in description. Presence of each feature or not, e.g. laundry yes or no. Good value rental.
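A hypothetical data.table sketch of a few of these simple features (dt_all and the new column names are assumptions based on the scripts listed earlier, not the author's 02b_FeatureCreation code):

library(data.table)

dt_all[, pricePerBed  := price / pmax(bedrooms, 1)]              # avoid dividing by zero for studios
dt_all[, pricePerRoom := price / pmax(bedrooms + bathrooms, 1)]  # one possible definition
dt_all[, bathBedRatio := bathrooms / pmax(bedrooms, 1)]
dt_all[, created_hour := hour(as.POSIXct(created))]              # hour of day the listing was created
dt_all[, n_photos     := lengths(photos)]                        # assumes photos is a list column
dt_all[, n_desc_words := lengths(strsplit(description, "\\s+"))]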
High cardinality features? manager_id, building_id. Simplifications: size of manager or building; turn the id into a numeric feature. What else?
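The simplest version of the "size" idea is a count of listings per id, which turns a high-cardinality factor into a single numeric column (a sketch, again assuming a data.table called dt_all):

dt_all[, manager_count  := .N, by = manager_id]
dt_all[, building_count := .N, by = building_id]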
Leakage features? A key aspect of winning machine learning competitions. Where might there be leakage in the data we have been given? Paper: Leakage in Data Mining: Formulation, Detection, and Avoidance (Kaufman, Rosset and Perlich).
Now what? We have loads of features, which is good. But there is every chance that our decision tree will pick up random noise in the training data (this is variance). How can we control for this? Cost complexity pruning.
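With rpart, cost complexity pruning might look like this (a sketch, assuming the deliberately deep tree from earlier is stored in rpart_1):

# Choose the cp value with the lowest cross-validated error and prune back to it
printcp(rpart_1)   # cp table: tree size vs cross-validated error
cp_best  <- rpart_1$cptable[which.min(rpart_1$cptable[, "xerror"]), "CP"]
rpart_1p <- prune(rpart_1, cp = cp_best)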
R code 02b_FeatureCreation_1/2/3.R 04b_01_rpart.R 04b_02_VariableImportance.R
Variable importance. [Bar chart of Gini (training) based variable importance. Variables shown: manager_id_mean_med, manager_id_mean_high, price, building_id_mean_high, time_stamp, building_id_mean_med, pricePerBed, pricePerRoom, listing_id, street_address_mean_high. Axis label: Increasing Variable Importance.]
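With rpart, the Gini-based (training) importances can be read straight off the fitted object (a sketch, assuming the fit is stored in rpart_1):

vi <- rpart_1$variable.importance
sort(vi, decreasing = TRUE)                 # most important first
barplot(sort(vi), horiz = TRUE, las = 1)    # quick importance chart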
Random Forest. Introduce randomness. Why? Bootstrapping and then aggregating the results (bagging). How else can we create randomness? Sample the features available to each split. OOB error. What are our hyper-parameters? Number of trees? nodesize? mtry?
R code: random forest code

rf_1 <- randomForest(x = dt_train,
                     y = as.factor(y_train),
                     ntree = 300,        # number of trees
                     nodesize = 1,       # minimum terminal node size (deep trees)
                     mtry = 6,           # features sampled as candidates at each split
                     keep.forest = TRUE)
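The OOB error mentioned on the previous slide comes for free from the fitted object; assuming the call above is stored in rf_1 as shown:

# Each tree is scored on the rows it did not see (its out-of-bag sample),
# giving a built-in estimate of generalisation error
print(rf_1)   # OOB error rate and confusion matrix
plot(rf_1)    # OOB error against the number of trees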
R code 04c_RandomForest.R
Gradient boosting. Add lots of weak learners. Create new weak learners by focusing on examples which are incorrectly classified. Add all the weak learners using weights which are higher for the better weak learners. The weak learners are adaptive basis functions.
Hyperparameters: learning rate, depth of trees, min child weight, data subsampling, column subsampling. Tuning strategies: grid search, random search, hyperopt.
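A sketch of random search over these hyperparameters, shown here with the R interface to xgboost rather than the Python code below (xg_train is assumed to be an xgb.DMatrix; the ranges, the number of trials and the 3-class objective are illustrative):

library(xgboost)

set.seed(1)
trials <- lapply(1:20, function(i) {
  params <- list(objective        = "multi:softprob",
                 num_class        = 3,
                 eta              = runif(1, 0.01, 0.3),  # learning rate
                 max_depth        = sample(3:10, 1),      # depth of trees
                 min_child_weight = sample(1:10, 1),
                 subsample        = runif(1, 0.5, 1),     # data subsampling
                 colsample_bytree = runif(1, 0.5, 1))     # column subsampling
  cv <- xgb.cv(params = params, data = xg_train, nrounds = 1000,
               nfold = 5, metrics = "mlogloss",
               early_stopping_rounds = 20, verbose = FALSE)
  list(params = params,
       mlogloss = min(cv$evaluation_log$test_mlogloss_mean))
})
# Keep the parameter set with the lowest cross-validated log loss
best <- trials[[which.min(sapply(trials, `[[`, "mlogloss"))]]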
Python code

bst = xgb.train(params=param,
                dtrain=xg_train,
                num_boost_round=num_rounds,
                evals=watchlist,               # evaluation sets monitored during training
                early_stopping_rounds=20,      # stop if no improvement for 20 rounds
                verbose_eval=False)
Gradient boosting (machines) 04d_GradientBoosting_presentation.ipynb
Over the next 45 minutes: loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning; R and Python; decision trees, random forests, gradient boosting; basis functions and adaptive basis functions.