Introduction to Data Mining
This workshop on data mining covers key concepts such as what data mining is, common techniques, and practical applications. Participants will explore linear regression, classification, tree-based methods, cluster analysis, and more. Additionally, the session delves into resampling methods such as the bootstrap, jackknife, and subsampling for estimating the precision of sample statistics. Attendees will gain insight into using R, Minitab, and SPSS for data mining applications, with a focus on developing a foundational understanding that can be built on with further study in the field.
Presentation Transcript
Introduction to Data Mining: A workshop hosted by STAT CLUB, February 24, 2015
Goals for today's workshop
- Learn what exactly data mining is and why it is important
- Gain a general understanding of what data mining methods are designed to do
- Hear about some common data mining techniques and when we would use them
- Run some simple examples in R, Minitab, and SPSS to better understand how these techniques are performed (with emphasis on R)
- Nurture the ability to hold a discussion on the concept of data mining in practice
- Leave with enough knowledge to pursue these topics further if you desire or need to
What you will NOT learn
- The details and theory behind the data mining methods discussed
- Every method or technique that exists
- How to perform every one of these methods or interpret the results
Outline
I. Resampling
II. What is data mining?
III. Linear regression
IV. Dimension reduction and variable selection
V. Classification
VI. Tree-based methods
VII. Cluster analysis
VIII. Factor analysis
IX. Other topics
I. Resampling
What is resampling?
- Repeated samples are taken from the original sample
- Performed to estimate the theoretical sampling distribution, rather than make assumptions about it
- Major types:
  - Bootstrap, jackknife, and subsampling
  - Cross-validation
  - Permutation/exact/randomization testing
Bootstrap, jackknife, and subsampling
- All are done to estimate the precision of sample statistics (mean, proportion, odds ratio, etc.)
- Jackknifing: leave-one-out subsets
- Subsampling: take a subsample without replacement
- Bootstrapping: sample with replacement, with each resample the same size as the original sample
- Use when the theoretical distribution is complicated or unknown, or the sample size is small
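As a quick illustration (not part of the workshop's own code), here is a minimal bootstrap sketch in base R that estimates the standard error of a sample mean; the data are simulated and the number of resamples B is arbitrary:

```r
# Bootstrap estimate of the precision (standard error) of a sample mean
set.seed(42)
x <- rnorm(30, mean = 10, sd = 3)     # small illustrative sample

B <- 2000                             # number of bootstrap resamples
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

sd(boot_means)                        # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975)) # simple percentile confidence interval
```

The same resampling loop works for any statistic: replace mean() with whatever statistic's precision you want to estimate.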
Cross-validation
- Done to validate a predictive model
- Subsets are taken repeatedly and predictions are made on the held-out data
- Will be discussed again later
Permutation test
- Nonparametric way to test a hypothesis when the distribution of the test statistic is unknown or the sample is very small
- Basic idea: repeatedly resample the data to estimate the distribution of the test statistic
- Mix the data in the population together, and rearrange the labels randomly on the data as a whole
- Recalculate the test statistic each time; when done, you'll have an estimated distribution of the test statistic, which you can use to estimate a p-value for your original data
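A minimal sketch of a two-sample permutation test in base R (simulated data; the 5000 relabelings are an arbitrary choice):

```r
# Permutation test for a difference in means between two groups
set.seed(1)
group_a  <- rnorm(20, mean = 5)
group_b  <- rnorm(20, mean = 6)
observed <- mean(group_a) - mean(group_b)

pooled <- c(group_a, group_b)
n_a    <- length(group_a)

perm_stats <- replicate(5000, {
  shuffled <- sample(pooled)                         # randomly rearrange the labels
  mean(shuffled[1:n_a]) - mean(shuffled[-(1:n_a)])   # recompute the test statistic
})

# Two-sided p-value: proportion of relabelings at least as extreme as observed
mean(abs(perm_stats) >= abs(observed))
```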
II. What is data mining?
What is data mining?
- AKA statistical learning, machine learning
- It is the process of finding information from large, complex data
- Has become more important due to cheaper storage devices, faster communication, better database management systems, and increasing computer power
- Combines elements of statistics, computer science, engineering, and other fields
Applications
- Business: transaction data
  - Wal-Mart's data warehouse is mined for advertising and logistics
  - Credit card companies mine for fraudulent use of your card based on the purchase patterns of consumers; they can deny access if your purchase patterns change drastically!
- Genomics: speed up research using computational methods
  - The Human Genome Project: collection of DNA sequences
  - Microarray data
- Information retrieval: terabytes of data on the internet; the growth rate is very impressive and research is currently underway examining this trend
- Multimedia information: visual as well as audio data files; how do we manage these types of data efficiently?
- Communication systems
  - Speech recognition: a long-existing area where important methods were developed and have been transferred to other application areas
  - Image analysis
- Many other scientific fields
Taken from: https://onlinecourses.science.psu.edu/stat557/node/34
Data mining terminology and usage
- Can be descriptive or predictive
- Predictive: classification or prediction
- Types of learning algorithms:
  - Given Y: supervised
  - Not given Y: unsupervised
III. Linear regression
Regression methods
- Regression methods in data mining are supervised learning methods primarily used for prediction, but can also be used to estimate the effect of predictors on a response
- Linear regression (quantitative Y)
- Logistic regression (classification)
- Ridge and lasso regression ("shrinkage" methods)
Linear regression
$\hat{Y} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$
- Use the equation to estimate Y, a continuous response
- The predictors can be quantitative or categorical
- MSE can be used to estimate error
Estimating prediction error
If interested in prediction accuracy, we can estimate prediction error using training samples. The idea:
1. Split your data randomly into a training sample and a test sample (doesn't need to be 50/50, but could be)
2. Fit the model to your training set only
3. Calculate prediction error (MSE) using your test set
K-fold cross-validation:
1. Split your data into K disjoint subsets
2. Then, select K-1 sets to fit (train) the model
3. Use the remaining set to estimate prediction error
4. Repeat Steps 2-3 K times, each time leaving a different set out as the test set
5. Your final estimate of the prediction error is the average over the K iterations
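The workshop's own R script is not reproduced here, but the following sketch shows both ideas in base R on a simulated data frame (the names dat, y, x1, and x2 are placeholders, not the hardware data):

```r
# Prediction error via a train/test split and via K-fold cross-validation
set.seed(2015)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# 1) Train/test split (70/30 here; it doesn't need to be 50/50)
train_id <- sample(nrow(dat), size = 0.7 * nrow(dat))
fit  <- lm(y ~ x1 + x2, data = dat[train_id, ])
pred <- predict(fit, newdata = dat[-train_id, ])
mean((dat$y[-train_id] - pred)^2)                   # test-set MSE

# 2) K-fold cross-validation
K      <- 5
folds  <- sample(rep(1:K, length.out = nrow(dat)))  # assign each row to a fold
cv_mse <- sapply(1:K, function(k) {
  fit_k  <- lm(y ~ x1 + x2, data = dat[folds != k, ])
  pred_k <- predict(fit_k, newdata = dat[folds == k, ])
  mean((dat$y[folds == k] - pred_k)^2)
})
mean(cv_mse)                                        # average MSE over the K folds
```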
Try it! Hardware data: we'll run a linear regression and compare MSE using K-fold cross-validation (UCI Machine Learning Repository)
- name = vendor name
- model = model
- MYCT = machine cycle time in nanoseconds
- MMIN = minimum main memory in kilobytes
- MMAX = maximum main memory in kilobytes
- CACH = cache memory in kilobytes
- CHMIN = minimum channels in units
- CHMAX = maximum channels in units
- PRP = published relative performance (our Y)
- ERP = estimated relative performance
R: Open and run code
IV. Dimension reduction and variable selection
Variable selection
- We perform variable selection to decrease the risk of overfitting and to reduce the number of predictors in our model. The goal is a parsimonious model that still provides good predictive power.
- Best subsets, forward or backward stepwise selection
- Ridge or lasso regression ("shrinkage" methods)
Shrinkage methods
- Useful when you have MORE predictors than observations, which causes problems in ordinary regression
- The idea: force coefficients to be closer to 0
- Can be used as a variable selection procedure: throw out coefficients near 0
- Generally of the form (penalized least squares):
$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|^q$
where q = 2 gives ridge regression and q = 1 gives the lasso
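For concreteness, a minimal sketch of ridge and lasso fits using the glmnet package (assumed installed; data simulated):

```r
# Ridge (alpha = 0) and lasso (alpha = 1) with cross-validated lambda
library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 10), ncol = 10)   # predictor matrix
y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # response

ridge <- cv.glmnet(x, y, alpha = 0)       # ridge penalty shrinks coefficients toward 0
lasso <- cv.glmnet(x, y, alpha = 1)       # lasso can set coefficients exactly to 0

coef(lasso, s = "lambda.min")             # zero coefficients can be dropped from the model
```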
Dimension reduction
- In dimension reduction, we want to reduce the number of predictors we include in the model without losing information.
- This is like variable selection, but instead of selecting a subset of our current predictors, we use new predictors in their place. This also helps with correlated predictors.
- Principal components regression (PCR): select the first m principal components obtained from PCA
- Partial least squares (PLS): similar to PCR, but we use Y to supervise the direction of the principal components
Principal components analysis (PCA)
- Unsupervised learning technique to reduce the number of predictors in a model
- Principal components (PCs) are linear combinations of our original predictors:
$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$
- PCs are linearly uncorrelated
- The first PC accounts for the largest amount of variability in the data; the second accounts for the second largest, and so on
- You can select the first m PCs as the predictors in your model
- m can be chosen a number of ways; common choices: select the number of PCs that explains at least 80% of the total variability, use the "elbow" in the scree plot, or use eigenvalues above 1
- Problem: great for prediction and dimension reduction, but not good for estimation of coefficients or determining which predictors are the most important in predicting Y
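A sketch with base R's prcomp(), followed by a regression on the first m principal components (simulated data; the choice m = 2 is arbitrary):

```r
# PCA, scree plot, and a simple principal components regression
set.seed(4)
dat <- data.frame(matrix(rnorm(100 * 6), ncol = 6))
y   <- rowSums(dat[, 1:2]) + rnorm(100)

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)                       # proportion of variance explained by each PC
screeplot(pca, type = "lines")     # look for the "elbow"

m       <- 2
scores  <- as.data.frame(pca$x[, 1:m])   # first m PC scores
pcr_fit <- lm(y ~ ., data = scores)      # regress Y on the selected PCs
summary(pcr_fit)
```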
Scree plot
Try it! Hardware data: we'll use PCA to reduce dimension and then fit a linear regression to the principal components (UCI Machine Learning Repository)
- name = vendor name
- model = model
- MYCT = machine cycle time in nanoseconds
- MMIN = minimum main memory in kilobytes
- MMAX = maximum main memory in kilobytes
- CACH = cache memory in kilobytes
- CHMIN = minimum channels in units
- CHMAX = maximum channels in units
- PRP = published relative performance (our Y)
- ERP = estimated relative performance
R: Open and run code
Minitab: Stat > Multivariate > Principal Components. Put columns 3 through 8 into the Variables box. Under the Graphs option, select Scree plot. Under the Extraction button, select Fixed Number of Factors and enter 6. Under the Storage button, type "PC1 PC2 PC3" in the Scores box; this will save the values of the first three principal components. Click OK.
SPSS: Analyze > Dimension Reduction > Factor. Put variables MYCT through CHMAX into the Variables box. Under the Extraction button, ensure that the Principal Components method is selected. In that same dialog box, check Scree Plot under Display. Under the Scores button, check the Save as Variables box to save the PC scores. Click OK.
V. Classification
What is classification?
- Supervised learning for a categorical response
- Try to identify to which category (class) an observation belongs
- Usually creates decision boundaries
- Common classification methods:
  - Linear regression on an indicator matrix
  - Logistic regression
  - Discriminant analysis
  - K-nearest neighbors (nonparametric)
  - Support vector machines
Image taken from: http://onlinecourses.science.psu.edu/stat557/node/15
Checking effectiveness of a classification method
- Error/misclassification rate: (# incorrectly categorized) / (total # of observations)
- The following two work well only when there are two categories, a "positive" and a "negative":
  - Sensitivity: P(claim positive | truth positive)
  - Specificity: P(claim negative | truth negative)
- You can still break data into training and test sets, and can still perform cross-validation. Instead of prediction error, you'd estimate the error rate each time and then average.
Logistic regression
For two classes (Y=0 and Y=1), binary logistic regression:
$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$
- Estimate p = P(Y = 1)
- Then set a boundary for p (such as 50%): any value above the boundary gets the predicted value Y=1, and otherwise gets Y=0
- For more than two classes, cumulative or ordinal logistic regression is needed
- Problem: does not work well with unbalanced class sizes
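A minimal sketch of binary logistic regression with base R's glm() and a 50% cutoff (simulated data):

```r
# Binary logistic regression and classification at a 50% boundary
set.seed(5)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))  # simulate 0/1 outcomes

fit  <- glm(y ~ x, family = binomial)
phat <- predict(fit, type = "response")      # estimated p = P(Y = 1)
yhat <- ifelse(phat > 0.5, 1, 0)             # classify using the 50% cutoff

mean(yhat != y)                              # training misclassification rate
```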
Discriminant analysis
In discriminant analysis, we estimate the Bayes probabilities:
$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$
- Different estimation method than logistic regression, but tries to perform the same task
- Can assume linear or quadratic decision boundaries (LDA or QDA)
- Problem: sensitive to outliers, does not work well with unbalanced class sizes
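A sketch of LDA using MASS::lda() on R's built-in iris data (used here purely for illustration, not the workshop's kernels data):

```r
# Linear discriminant analysis with a confusion matrix and error rate
library(MASS)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)

table(Predicted = pred$class, Actual = iris$Species)  # confusion matrix
mean(pred$class != iris$Species)                      # misclassification rate
```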
K-nearest neighbors
Nonparametric classification method:
1. Split your data into a test set and a training set
2. Define a distance metric (such as Euclidean; the two-dimensional Euclidean distance is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$)
3. Using that distance metric, for each observation in your test set identify the K nearest points (neighbors) in your training set
4. To your observation, assign the class that is the most-represented class among those K neighbors
Problem: picking K, computationally intense, does not estimate probabilities, and sensitive to local phenomena
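A sketch of K-nearest neighbors with class::knn() (package assumed installed), again on the built-in iris data with an arbitrary K = 5:

```r
# K-nearest neighbors with a simple train/test split
library(class)
set.seed(6)
train_id <- sample(nrow(iris), 100)

pred <- knn(train = iris[train_id, 1:4],     # training predictors
            test  = iris[-train_id, 1:4],    # test predictors
            cl    = iris$Species[train_id],  # training class labels
            k     = 5)

mean(pred != iris$Species[-train_id])        # test-set error rate
```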
Support vector machines (SVM)
- SVM is primarily a linear classifier, though extensions can be made to create nonlinear boundaries
- Defined with a kernel function
- Problem: computationally intense, and does not estimate probabilities
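As an illustrative sketch only, a linear-kernel SVM fit with the e1071 package (one of several R implementations; assumed installed):

```r
# Support vector machine with a linear kernel
library(e1071)
fit  <- svm(Species ~ ., data = iris, kernel = "linear")
pred <- predict(fit, iris)

mean(pred != iris$Species)    # training misclassification rate
```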
Try it! Kernels data: we'll compare LDA and K-nearest neighbors using the misclassification/error rate (UCI Machine Learning Repository)
- A = area
- P = perimeter
- C = compactness
- L = length of kernel
- W = width of kernel
- AC = asymmetry coefficient
- KG = length of kernel groove
R: Open and run code
Minitab: Stat > Multivariate > Discriminant Analysis. Put Y in the Groups box and all other variables in the Predictors box. Click OK.
SPSS: Analyze > Classify > Nearest Neighbor. Put Y in the Target box and all other variables in the Features box. Under the Neighbors tab, select Automatically select K and put the range from 3 to 10. Click OK.
VI. Tree-based methods
Classification and regression trees (CART)
- AKA "decision trees": a supervised learning method
- For regression (continuous response) or classification (categorical response)
- Hierarchical partitions of the space
- Advantages: highly interpretable, flexible, nonparametric, robust to outliers; can sometimes help with variable selection
- Does not always perform well with unbalanced classes
Image taken from: http://onlinecourses.science.psu.edu/stat557/node/83
Tree construction
- Repeatedly split the sample space, X, into smaller and smaller subsets (see a short video of the process)
- Automatically takes into account interactions
- Things to consider:
  1. How do we decide how to split?
  2. How do we know when to stop splitting?
  3. How do we assign the final class labels or Y value?
Image taken from: http://onlinecourses.science.psu.edu/stat557/node/83
How do we assign the final class labels or Y value?
- Common strategy:
  - Regression: take the mean of the sample values at the terminal node
  - Classification: take the majority class at the terminal node
- More splits will always decrease the misclassification rate, but they also increase model complexity and the chance of overfitting
Pruning trees
- You prune a large tree to reduce its size and complexity
- Find a good "subtree" that balances error against the size of the tree (see a short video of this process)
- Can use cross-validation to determine the best subtree
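A sketch of fitting and pruning a classification tree with the rpart package (assumed installed), using the cross-validated complexity parameter to choose the subtree:

```r
# Fit a large classification tree, then prune with the CV-chosen cp
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")

printcp(fit)                                    # CV error (xerror) for each subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)             # the "good subtree"

plot(pruned); text(pruned)                      # draw the pruned tree
```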
Aggregate CART methods
- These methods focus on building several trees and averaging the results. They help to avoid overfitting and reduce the influence of outliers.
- Bagging
- Random forests
- Boosting
- Aggregating can be applied to other types of methods, not just trees.
Bagging
1. Take a random sample of size N with replacement from the data (a bootstrap sample).
2. Construct a classification tree as usual, but do not prune.
3. Assign a class to each terminal node, and store the class (or mean value, for a regression tree) attached to each case coupled with the predictor values for each observation.
4. Repeat Steps 1-3 a large number of times.
5. For each observation in the dataset, count the number of trees in which it is classified in each category over the total number of trees, and assign each observation to a final category by a majority vote over the set of trees (or take the average, for a regression tree).
For each tree, observations not included in the bootstrap sample are called "out-of-bag" observations. These "out-of-bag" observations can be treated as a test dataset and dropped down the tree. To get a better evaluation of the model, the prediction error is estimated based only on the "out-of-bag" observations.
Taken from: http://onlinecourses.science.psu.edu/stat857/node/180
Random forests
- Just like the bagging algorithm, except at each node we select a subset of the predictors to consider for the best split
- Tend to perform better: reduction in variance and bias
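A sketch with the randomForest package (assumed installed); setting mtry to the number of predictors reproduces bagging, while the default mtry gives a random forest:

```r
# Bagging and a random forest, with out-of-bag error estimates
library(randomForest)
set.seed(7)

bagged <- randomForest(Species ~ ., data = iris, mtry = 4)  # all 4 predictors: bagging
rf     <- randomForest(Species ~ ., data = iris)            # random subset at each split

rf$confusion          # out-of-bag confusion matrix
rf$err.rate[500, ]    # out-of-bag error after the default 500 trees
```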
Try it! Kernels data: we'll perform CART, bagging, and random forests (UCI Machine Learning Repository)
- A = area
- P = perimeter
- C = compactness
- L = length of kernel
- W = width of kernel
- AC = asymmetry coefficient
- KG = length of kernel groove
R: Open and run code
SPSS: Analyze > Classify > Tree.
VII. Cluster analysis
What is cluster analysis?
- Unsupervised learning for grouping observations together
- Like classification, except we have no response; we try to classify each observation according to how well it resembles other observations in the dataset
- The main issue is determining how many classes or clusters to define
- Some common clustering methods: K-means, hierarchical
K-means clustering
Nonparametric clustering method:
1. Select the number of groups, k, into which you'd like to cluster your data
2. Initially select k centroids, which represent the centers (means) of the k classes
3. For each observation, calculate its Euclidean distance to each of the k centers, and then assign it to the class to which it is nearest
4. Once that has been done for all observations, update the centroids by averaging the observations now assigned to each class
5. Repeat steps 3-4 until convergence is reached (minimizing the sum of Euclidean distances)
Problem: picking initial centers, picking k, and computationally intense
Note: there are alternatives to Euclidean distance
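A minimal base R sketch of K-means with k = 3 (iris is used purely for illustration; the species labels are ignored during clustering):

```r
# K-means clustering on scaled numeric variables
set.seed(8)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)  # nstart: multiple random starts

km$cluster                        # cluster assignment for each observation
table(km$cluster, iris$Species)   # compare clusters to the known groups
```

Using several random starts (nstart) is one common way to reduce sensitivity to the initial centers.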
Hierarchical clustering
- Seeks to build a hierarchy of clusters, avoiding the need to select k, the number of clusters, a priori
- A common method is agglomerative clustering, a "bottom-up" approach:
  - Each observation starts in its own cluster, and pairs of clusters are merged as you move up
  - Clusters are merged according to some minimized between-cluster distance
- Common distances:
  - Single link (minimum distance between two objects in the clusters)
  - Complete link (maximum distance between two objects in the clusters)
  - Average linkage (average distance between objects in the clusters)
- The tree image is usually called a dendrogram
- Problem: picking the distance, computationally intense, selecting the number of clusters
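A base R sketch of agglomerative clustering with hclust(), complete linkage, and a dendrogram:

```r
# Hierarchical clustering and a dendrogram
d  <- dist(scale(iris[, 1:4]))          # Euclidean distance matrix
hc <- hclust(d, method = "complete")    # complete (maximum) linkage

plot(hc, labels = FALSE)                # the dendrogram
cutree(hc, k = 3)                       # cut the tree into 3 clusters
```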
Dendrogram
Try it! Kernels data: try to cluster into 3 groups (UCI Machine Learning Repository)
- A = area
- P = perimeter
- C = compactness
- L = length of kernel
- W = width of kernel
- AC = asymmetry coefficient
- KG = length of kernel groove
R: Open and run code
Minitab: Stat > Multivariate > Cluster K-Means. Put 3 in for the Number of Clusters. Under the Storage button, enter a variable name ("cluster") into the Cluster Member Column box. Click OK.
SPSS: Analyze > Classify > K-means Cluster. Put all variables except Y in the Variables box. Change Number of Clusters to 3. Under the Save button, select Cluster Membership. Click OK.
VIII. Factor analysis