Data Mining Course Project Overview: Pre-Processing to Classification
Explore the challenges and tasks involved in a data mining course project, from pre-processing to redefining classification tasks. The project involves handling a large dataset with numerous features, including numerical and categorical ones, addressing missing values, noisy data, and feature selection. Learn about the intricate process of refining and validating parameters to achieve accurate results. The project also delves into the complexities of data distribution and the classification of objects into four distinct classes.
Data Mining Course Project PB10000603 2010 001
Overview
- First Sight
- Pre-Processing: Missing Values, Feature Selection, Numerical Discretization, Categorical Grouping
- Classification: Validation Metrics, Decision Tree, Ensemble Methods, Parameters Refining
- Summary: Results, Conclusion, Open Problems, Roadmap
First Sight Into Data
- 230 attributes/features: 40 categorical and 190 numerical. The number of categories per feature varies from 2 to 1000+.
- Large amount of missing values.
- Imbalanced class distribution (10001 objects in total):

  Class       Positive  Negative
  Churn       1826      8175
  Appetency   434       9567
  Upselling   1897      8104

- No dominating feature, i.e., there is no simple combination of features that can perfectly determine the class, which rules out manual discovery of mining rules.
- Not many outliers (after flooring at the 1st percentile, or beyond 3 standard deviations).
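The outlier treatment is only mentioned in passing on the slide; a minimal sketch of flooring a numerical feature at its extreme percentiles (assuming a NumPy array; the 3-sigma criterion would be an alternative) might look like this:

```python
import numpy as np

def floor_outliers(x, pct=1.0):
    """Clip a numerical feature at its pct-th and (100 - pct)-th percentiles.
    (An alternative criterion would clip at mean +/- 3 standard deviations.)"""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanpercentile(x, [pct, 100.0 - pct])
    return np.clip(x, lo, hi)
```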
Task Redefining
The positive labels of the 3 classes are mutually exclusive, i.e., an object is labelled churn, appetency, or upselling (exclusively), or none of them. This reduces the mining task to a 4-class classification problem (churn, appetency, upselling, none).
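A minimal sketch of this relabelling, assuming the three binary label vectors use +1 for the positive class (as in the KDD Cup 2009 data):

```python
import numpy as np

def merge_labels(churn, appetency, upselling):
    """Map three mutually exclusive binary labels (+1 = positive) onto a single
    4-class label: 0 = none, 1 = churn, 2 = appetency, 3 = upselling."""
    y = np.zeros(len(churn), dtype=int)
    y[np.asarray(churn) == 1] = 1
    y[np.asarray(appetency) == 1] = 2
    y[np.asarray(upselling) == 1] = 3
    return y
```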
Pre-Processing Challenges
- Noisy
  - Large amount of missing values: value prediction
  - Irrelevant or too noisy features: feature selection
- Heterogeneous
  - Large range of numerical values: discretization
  - Large number of categories: grouping
Missing Values
NULLs are almost everywhere:
- 20 out of 230 attributes are all null
- 38 attributes are bad (>= 9950 nulls)
- 63 attributes are good (< 5000 nulls)
- 19 attributes are free of missing values
Data distribution of the features: continuous (more than 1000 distinct values), unbalanced (most frequent value in more than 9000 objects), exponential distribution, normal distribution, other distributions.
(Figure: number of null values per attribute, attributes 1 to 230 on the x axis, null counts from 0 to 12000 on the y axis.)
Missing Values (continued)
Simple prediction of missing values (the mean for numerical features, the most frequent value for categorical features) does not work well. In the end, missing categorical values are treated as a separate category, and missing numerical values are simply filled with the mean of the feature. Note that missingness is itself a kind of information, at least reflecting the data quality of the feature, so an additional attribute, Missing Value Ratio, is attached to each feature and used as a feature weight in the classifier. A binary flag indicating a missing value is also added, but it is not used by the classification algorithm (open problem).
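A minimal sketch of this handling in pandas (column names and the helper are hypothetical; the project itself used PHP scripts for pre-processing):

```python
import pandas as pd

def handle_missing(df, categorical_cols, numerical_cols):
    """Fill missing values and record per-feature missing-value information."""
    missing_ratio = df.isnull().mean()                        # extra attribute: Missing Value Ratio per feature
    flags = df.isnull().astype(int).add_suffix("_isnull")     # binary missing flags (not used by the classifier)
    for c in categorical_cols:
        df[c] = df[c].fillna("__MISSING__")                   # missing category becomes its own value
    for c in numerical_cols:
        df[c] = df[c].fillna(df[c].mean())                    # numerical NaNs filled with the feature mean
    return df, missing_ratio, flags
```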
Feature Selection
Information gain prefers features with many different values, which results in many non-predictive features; moreover, some features are highly correlated and redundant. Instead, a simple classifier is used to measure predictability: separate the dataset into two parts, one for predicting and one for measuring. For categorical features, simply use the proportion of positive objects (the rare class) in the prediction part as the prediction for the measurement part. For numerical features, use equal-depth discretization into 10 categories, then process them as categorical ones. Missing values have already been handled in the previous step. (The name "simple classifier" is my own.)
Feature Selection (continued)
Advantages of the simple-classifier selection algorithm: easy to implement, fast, suitable for non-linear features, robust against missing values. Drawback: linear features are not likely to be discovered. However, in this dataset features tend to be non-linear, mostly exponential, so this is not a major problem.
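A rough sketch of the simple-classifier predictability score described on the previous two slides, assuming the two halves of the data are used for prediction and measurement and AUC is the measure (the slides do not specify the exact scoring):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def simple_classifier_score(feature, y, n_bins=10):
    """Predictability of a single feature: score each category by the proportion of
    positive objects in the first half of the data, measure AUC on the second half."""
    feature = pd.Series(feature).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    if pd.api.types.is_numeric_dtype(feature):
        # equal-depth discretization into (up to) n_bins categories
        feature = pd.qcut(feature, q=n_bins, duplicates="drop").astype(str)
    half = len(feature) // 2
    f_train, f_test = feature.iloc[:half], feature.iloc[half:]
    y_train, y_test = y.iloc[:half], y.iloc[half:]
    pos_rate = y_train.groupby(f_train).mean()                 # per-category positive rate
    scores = f_test.map(pos_rate).fillna(y_train.mean())       # unseen categories get the prior
    return roc_auc_score(y_test, scores)
```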
Feature Extraction
- Flags for missing values.
- Length of string-type values: since the values are encrypted, the distance between strings seems useless, so the string length is taken instead.
- Binary indicator of null or not null (manual work): a specific binary flag is added for predictive indicators found manually. The results show that this step is not effective, since the tree-based classifiers themselves already discover such hidden characteristics of the features.
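A small sketch of the string-length extraction for the encrypted categorical values (column names are hypothetical):

```python
import pandas as pd

def add_string_lengths(df, string_cols):
    """Replace distance-free encrypted strings by their lengths as extra numerical features."""
    for c in string_cols:
        df[c + "_len"] = df[c].fillna("").astype(str).str.len()
    return df
```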
Discretization
The dataset contains few outliers but a large amount of missing values, so it is unwise to choose equal-width or equal-depth discretization. A minimum-entropy approach is chosen instead: bisect the initial values so that the resulting two intervals give minimum entropy; the splitting is then repeated on the interval with the highest entropy, until the maximum entropy falls below some threshold or the number of intervals reaches 10. The main drawback is the time cost.
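A minimal sketch of this top-down minimum-entropy discretization, assuming NumPy arrays x (feature values) and y (class labels); the thresholds and tie-breaking details are assumptions, and the quadratic scan over candidate cuts illustrates why the time cost is the main drawback:

```python
import numpy as np

def _entropy(labels):
    """Class entropy (in bits) of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def min_entropy_discretize(x, y, max_bins=10, entropy_threshold=0.1):
    """Repeatedly bisect the interval with the highest class entropy at the cut
    point that minimizes the weighted entropy of the two halves."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    intervals = [(0, len(x))]                       # half-open index ranges over the sorted data
    cut_values = []
    while len(intervals) < max_bins:
        entropies = [_entropy(y[a:b]) for a, b in intervals]
        worst = int(np.argmax(entropies))
        if entropies[worst] < entropy_threshold:
            break
        a, b = intervals[worst]
        best_cut, best_score = None, np.inf
        for i in range(a + 1, b):                   # O(n) candidate cuts per split: the costly part
            if x[i] == x[i - 1]:
                continue                            # only cut between distinct values
            score = ((i - a) * _entropy(y[a:i]) + (b - i) * _entropy(y[i:b])) / (b - a)
            if score < best_score:
                best_cut, best_score = i, score
        if best_cut is None:
            break
        intervals[worst:worst + 1] = [(a, best_cut), (best_cut, b)]
        cut_values.append((x[best_cut - 1] + x[best_cut]) / 2.0)
    return sorted(cut_values)                       # apply with np.digitize(x, cuts)
```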
Categorical Grouping
Features with more than 10 categories are not suitable for most classifiers, which support only numerical input. Furthermore, many categories include only one or two objects, which is not enough information for the classifier. The main idea of categorical grouping is to treat the categories as objects and the size of a category as the value of that object. The categorical data is thereby transformed into numerical data, on which equal-width discretization is performed; categories that fall into the same interval are aggregated. Finally, the grouped categories are binarized.
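A rough pandas sketch of this grouping (binning categories by their size, merging categories in the same bin, then one-hot encoding); the bin count of 10 is an assumption:

```python
import pandas as pd

def group_rare_categories(col, n_groups=10):
    """Group categories by size: equal-width bins over category counts, replace each
    category by its group id, and binarize the grouped result."""
    col = pd.Series(col).fillna("__MISSING__").astype(str)
    sizes = col.value_counts()                              # category size = number of objects
    groups = pd.cut(sizes, bins=n_groups, labels=False)     # equal-width discretization of sizes
    grouped = col.map(groups).astype(str)                   # categories in the same interval merge
    return pd.get_dummies(grouped, prefix="grp")            # binarize the grouped categories
```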
Categorical Grouping (continued)
AUC scores for aggregated categories versus the original ones:

  Model        Churn    Appetency  Up-selling
  Aggregated   0.7495   0.8807     0.9025
  Original     0.7479   0.8795     0.9027
  Improvement  0.0016   0.0008     -0.0002

The improvement from grouping is not as significant as predicted. Any measure taken in data mining cannot be justified on theory alone; it must be measured in practice. Good models often do not work.
Classification Challenges
- Curse of dimensionality (data sparseness): distance-based classifiers are not likely to work well.
- Presence of noise: overfitting problem.
- No simple method fits all: use ensemble methods to improve the result.
- The dataset is not large enough to validate every rule: use cross-validation to make the most of the available data.
Evaluation Metrics
- Class imbalance problem: use the confusion matrix of TP, FP, FN, TN.
- ROC (Receiver Operating Characteristic) curve: true positive rate on the y axis, false positive rate on the x axis.
- Evaluation metric: AUC (Area Under the Curve).
- Cross-validation: separate the dataset into 5 subsets, train on 4 subsets and test on the remaining one, and take the average of the 5 test results as the final result.
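A minimal scikit-learn sketch of this 5-fold AUC evaluation, assuming NumPy arrays X and binary y; the decision tree shown is just a placeholder classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def five_fold_auc(X, y, clf=None):
    """Average AUC over 5 stratified folds: train on 4 subsets, test on the 5th."""
    clf = clf or DecisionTreeClassifier(max_depth=5)
    aucs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]        # probability of the positive class
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```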
Decision Tree
- Tree-based classifier: easy to handle interactions and redundancies among features.
- Rule-based rather than distance-based: able to model non-linear features.
- Invariant under monotone transformations: robust against peculiar value distributions and outliers.
- Robust against missing values: can still predict even if an important predictor contains missing values.
- Easy to tune and refine: multiple parameters can be adjusted to avoid overfitting, etc.
Tree Replication Problem
Since the partitioning algorithm in decision tree generation is divide-and-conquer, the same condition(s) may appear in different parts of the attribute space, resulting in a much more complex decision tree that is hard to analyze manually. Rule-based classifiers can fight this problem through rule selection, extraction and combination. However, in this fast data mining task, manual analysis and interpretation of the model are not necessary, and the dataset is not large enough for efficiency to be a concern, so the rule-extraction step is not performed.
Ensemble Methods
- Bagging: every object has the same probability of being selected, and no particular object is emphasized, so it is less susceptible to overfitting when applied to noisy data.
- Boosting: can perfectly classify all objects in the training data, but because of its tendency to focus on wrongly classified objects, it is susceptible to noisy data.
- Random Forests: randomization helps reduce the correlation among decision trees; more robust to noise and much faster than Boosting.
Bagging
- Fights overfitting.
- Multiple (100) decision trees are generated with different parameters.
- Multiple (10) ensembles are created from random subsets of the decision trees.
- The ensembles vote to produce the final result (true/false).
- This approach can only produce binary results, so the 3 classification problems are treated individually.
Result (AUC): Churn 0.7392, Appetency 0.8425, Up-selling 0.8970, Average 0.8262.
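The two-level voting scheme above is specific to the project; a minimal sketch of plain bagging of decision trees, run once per binary task as described, could look like this with scikit-learn (the tree depth of 5 is an assumption taken from the later parameter study):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def bagging_auc_per_task(X, labels_by_task):
    """Train one bagged decision-tree ensemble per binary task and report 5-fold AUC.
    labels_by_task: dict like {"churn": y1, "appetency": y2, "upselling": y3}."""
    results = {}
    for task, y in labels_by_task.items():
        clf = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                n_estimators=100, random_state=0)
        results[task] = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return results
```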
Boosting
- Class weights: classes are assigned different weights (3, 4, 3) based on the proportion of positive objects (about 3:4:3).
- Feature weights: predictive features are assigned more weight in the pre-processing phase.
- Object weights: assigned by the boosting algorithm, which forces the classifier to focus on objects that are difficult to classify.
- Need to determine how the weights of the training examples are updated at the end of each boosting round: the AdaBoost algorithm is used.
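A minimal AdaBoost sketch with scikit-learn; seeding the initial per-object weights from class weights is an illustration of the idea, not the project's exact weighting scheme:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def boosted_trees(X, y, class_weights, n_rounds=10, depth=1):
    """AdaBoost over shallow decision trees; boosting re-weights the objects each round,
    here we only seed the initial sample weights from per-class weights."""
    init_w = np.array([class_weights[c] for c in y], dtype=float)
    init_w /= init_w.sum()
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                             n_estimators=n_rounds, random_state=0)
    clf.fit(X, y, sample_weight=init_w)
    return clf
```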
Parameters Refining
(Figure: AUC versus number of boosting rounds, 1 to 15, for decision trees of depth 1, 2, 5 and 10.)
Parameters Refining (continued)
Surprisingly, depth-1 (flat) decision trees with 5-10 boosting rounds show the best performance. Perhaps the more complex decision trees are overfitting; real-world data is an imperfect, complicated system. Focusing on the single decision tree result (boosting round 1):

  Depth  1      2      3      4      5      10
  AUC    0.625  0.723  0.810  0.852  0.849  0.826

As the tree depth increases, predictability first increases (underfitting) and then decreases (overfitting); depth 4-5 is the optimal depth for a single tree. Ensemble methods outperform any single decision tree.
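A sketch of the kind of sweep behind these numbers, varying tree depth and boosting rounds; the parameter grid matches the values quoted on the slides, but the use of AdaBoost with 5-fold AUC here is an assumption:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def sweep_depth_and_rounds(X, y, depths=(1, 2, 5, 10), rounds=range(1, 16)):
    """5-fold AUC for every (tree depth, boosting rounds) combination."""
    auc = {}
    for d in depths:
        for r in rounds:
            clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=d),
                                     n_estimators=r, random_state=0)
            auc[(d, r)] = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return auc
```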
Random Forests
Random vectors are generated from a fixed probability distribution, whereas Boosting is an adaptive approach in which the probability distribution is changed to emphasize objects that are hard to classify. Similar to bagging, but randomization is introduced to lower the generalization error bound.
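For completeness, a minimal random-forest counterpart to the sketches above (the hyperparameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def random_forest_auc(X, y, n_trees=100):
    """5-fold AUC of a random forest; per-split feature subsampling decorrelates the trees."""
    clf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt", random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```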
Results

  Method          Churn    Appetency  Up-selling  Score
  Bagging         0.7392   0.8425     0.8970      0.8262
  Boosting        0.7565   0.8589     0.9012      0.8389
  Random Forests  0.7428   0.8576     0.8911      0.8305

The best model is Boosting. Note that in the theoretical analysis Boosting is susceptible to noisy data, yet in practice it outperforms Random Forests; perhaps some other factor dominates.
Other Classifiers Considered
- Naive Bayes: acts as a baseline; poor performance, as the model is too simple.
- SVM (Support Vector Machine): tried, but performance was not good, owing to its distance-based nature.
- Nearest-neighbor classifiers: not even tried; the sparseness of the data is a key factor.
- Rule-based classifiers: not tried, as there is not much difference in effectiveness between rules and decision trees.
In the end, the decision tree was selected.
Conclusion
- Challenges in data pre-processing: noise (handling missing values, feature selection and weighting) and many categories (grouping, discretization).
- Classification: decision tree (dimensionality, noise); ensemble models (Bagging, Boosting, Random Forests).
- The decision tree outperforms other, much more complicated classifiers on this real-world dataset, with its large numbers of samples and attributes, mixed types of variables, and lots of missing values.
- Ensemble methods prove to be effective.
Key Factor in Achieving the Result
Make no assumptions about the data distribution; every measure taken should itself be measured. From Rob Pike, Notes on Programming in C:
Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
Keep It Simple And Stupid (continued) Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. (Even if n does get big, use Rule 2 first.) Rule 4. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures. Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. Rule 6. There is no Rule 6.
Open Problems
- Manual coding is tiring; open-source packages, or software such as Clementine, should be used next time.
- Why are flat (depth 1) decision trees optimal for Boosting?
- How can missing values be treated more elegantly and effectively in classification models?
- Find the core reason why Boosting outperforms Random Forests, which in theory should be more stable on noisy data.
Alternative Approaches
- Train 3 classifiers for churn, appetency and up-selling separately, since the tasks might have different characteristics, and select the most effective one for each task.
- Use score adjustment after each iteration of classification to take weights into consideration.
- Train more classifiers of different types (beyond decision-tree based ones) and select the most effective ones as the final classifier.
- Sample the training data at each tree generation to introduce more randomness.
Roadmap
2011-09-03 ~ 09-10, every evening (4 hours) after class.
- 09-03 ~ 09-04: Data exploration.
- 09-04 ~ 09-05: Pre-processing. Laptop (Linux), the PHP scripting language, for ease of string processing and fast deployment.
- 09-06 ~ 09-08: Classification. Web server of gewu.ustc.edu.cn (in the SCGY computer room, 2.9 GHz, 2 GB RAM, Linux), the C programming language, for efficiency of computationally intensive modeling and measuring. About 30 hours of CPU time in total.
- 09-09 ~ 09-10: Summary and slides. Laptop (Windows), MS PowerPoint 2007 (the only proprietary software used).
References
- Introduction to Data Mining, Pang-Ning Tan et al.
- Dragonstar Data Mining slides
- Slides and papers from the KDD Cup 2009 workshop: http://www.kddcup-orange.com/
- Other online references