Data Preprocessing Techniques in Python
This article covers various data preprocessing techniques in Python, including standardization, normalization, missing value replacement, resampling, discretization, feature selection, and dimensionality reduction using PCA. It also explores Python packages and tools for data mining, such as Scikit-learn, Orange, Pandas, MLPy, MDP, PyBrain, NumPy, SciPy, and Matplotlib. Examples of standardization/scaling and missing value replacement are provided using scikit-learn.
Data Preprocessing in Python
Ahmedul Kabir, TA, CS 548, Spring 2015
Preprocessing Techniques Covered
- Standardization and normalization
- Missing value replacement
- Resampling
- Discretization
- Feature selection
- Dimensionality reduction: PCA
Python Packages/Tools for Data Mining
- Scikit-learn
- Orange
- Pandas
- MLPy
- MDP
- PyBrain
- and many more
Some Other Basic Packages
- NumPy and SciPy: fundamental packages for scientific computing with Python. They contain powerful n-dimensional array objects and useful linear algebra, random number, and other capabilities.
- Pandas: contains useful data structures and algorithms.
- Matplotlib: contains functions for plotting/visualizing data.
Standardization and Normalization
- Standardization: transforming data so that it has zero mean and unit variance. Also called scaling.
  - Use the function sklearn.preprocessing.scale()
  - Parameters:
    - X: data to be scaled
    - with_mean: Boolean; whether to center the data (make it zero mean)
    - with_std: Boolean; whether to scale to unit standard deviation
- Normalization: transforming data so that each sample has unit norm. (Note: sklearn.preprocessing.normalize() rescales individual samples to unit norm; to scale features to the [0, 1] range, use sklearn.preprocessing.MinMaxScaler instead.) An example follows the scaling code below.
  - Use the function sklearn.preprocessing.normalize()
  - Parameters:
    - X: data to be normalized
    - norm: which norm to use: 'l1' or 'l2'
    - axis: whether to normalize by row or column
Example code for Standardization/Scaling

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
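Normalization can be shown on the same array, continuing the session above; a minimal sketch in which each row is rescaled to unit L2 norm:

>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])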
Missing Value Replacement
- In scikit-learn, this is referred to as imputation.
- Class used: sklearn.preprocessing.Imputer
- Important parameters:
  - strategy: what to replace the missing values with: 'mean' / 'median' / 'most_frequent'
  - axis: integer; whether to impute along columns (axis=0) or rows (axis=1)
- Important attribute:
  - statistics_: the fill value computed for each feature
- Important methods:
  - fit(X[, y]): fit the imputer on X
  - transform(X): replace all the missing values in X
Example code for Replacing Missing Values

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
Resampling
- Use the function sklearn.utils.resample (sketched below).
- Important parameters:
  - n_samples: number of samples to keep
  - replace: Boolean; whether to resample with or without replacement
- Returns a sequence of resampled views of the collections; the original arrays are not affected.
- Another useful function is sklearn.utils.shuffle.
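A minimal sketch of resample on a toy array (the array and sample sizes are illustrative; which rows are drawn depends on random_state):

>>> from sklearn.utils import resample
>>> import numpy as np
>>> X = np.array([[1., 0.], [2., 1.], [0., 0.]])
>>> # Bootstrap: draw 4 rows with replacement
>>> X_boot = resample(X, n_samples=4, replace=True, random_state=0)
>>> # Subsample: draw 2 distinct rows without replacement
>>> X_sub = resample(X, n_samples=2, replace=False, random_state=0)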
Discretization
- Scikit-learn doesn't have a direct class that performs discretization.
- It can be performed with the cut and qcut functions available in pandas (sketched below).
- Orange has discretization functions in Orange.feature.discretization.
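A minimal sketch of both pandas functions on a toy series (the values and bin counts are illustrative):

>>> import pandas as pd
>>> ages = pd.Series([5, 20, 32, 41, 58, 73])
>>> equal_width = pd.cut(ages, bins=3)   # three equal-width intervals
>>> equal_freq = pd.qcut(ages, q=3)      # three equal-frequency (quantile) bins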
Feature Selection
- The sklearn.feature_selection module implements feature selection algorithms. Some classes in this module are:
  - GenericUnivariateSelect: univariate feature selector based on statistical tests
  - SelectKBest: selects features according to the k highest scores (sketched below)
  - RFE: feature ranking with recursive feature elimination
  - VarianceThreshold: feature selector that removes all low-variance features
- Scikit-learn does not have a CFS implementation, but RFE works in a somewhat similar fashion.
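As an illustration, a minimal sketch of SelectKBest with the chi-squared score function on the iris dataset (the choice of chi2 and k=2 is arbitrary):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> # keep the 2 features with the highest chi-squared scores
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)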
Dimensionality Reduction: PCA
- The sklearn.decomposition module includes matrix decomposition algorithms, including PCA.
- Class used: sklearn.decomposition.PCA (sketched below)
- Important parameter:
  - n_components: number of components to keep
- Important attributes:
  - components_: the components with maximum variance
  - explained_variance_ratio_: percentage of variance explained by each of the selected components
- Important methods:
  - fit(X[, y]): fit the model with X
  - score_samples(X): return the log-likelihood of each sample
  - transform(X): apply the dimensionality reduction to X
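A minimal sketch on a small toy matrix, projecting two features down to one component:

>>> from sklearn.decomposition import PCA
>>> import numpy as np
>>> X = np.array([[-1., -1.], [-2., -1.], [-3., -2.],
...               [ 1.,  1.], [ 2.,  1.], [ 3.,  2.]])
>>> pca = PCA(n_components=1)
>>> X_reduced = pca.fit_transform(X)   # fit the model and apply the reduction
>>> pca.explained_variance_ratio_     # fraction of variance retained
array([ 0.99...])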
Other Useful Information
- Generate a random permutation of the numbers 0 .. n-1 with numpy.random.permutation(n).
- You can randomly generate toy datasets using the sample generators in sklearn.datasets.
- Scikit-learn doesn't directly handle categorical/nominal attributes well; to use them in a dataset, some sort of encoding needs to be performed.
- One good way to encode categorical attributes: if there are n categories, create n dummy binary variables representing each category. This can be done easily using the sklearn.preprocessing.OneHotEncoder class (sketched below).
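A minimal sketch of one-hot encoding a single integer-coded feature with three categories (fit_transform returns a sparse matrix, hence the toarray() call):

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [[0], [1], [2], [1]]           # one categorical feature, 3 categories
>>> enc.fit_transform(X).toarray()     # one binary column per category
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])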
References
- Preprocessing modules: http://scikit-learn.org/stable/modules/preprocessing.html
- Video tutorial: http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107
- Quick start tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
- User guide: http://scikit-learn.org/stable/user_guide.html
- API reference: http://scikit-learn.org/stable/modules/classes.html
- Example gallery: http://scikit-learn.org/stable/auto_examples/index.html