Data Preprocessing in Python
Ahmedul Kabir
TA, CS 548, Spring 2015
Preprocessing Techniques Covered
Standardization and Normalization
Missing value replacement
Resampling
Discretization
Feature Selection
Dimensionality Reduction: PCA
Python Packages/Tools for Data Mining
Scikit-learn
Orange
Pandas
MLPy
MDP
PyBrain … and many more
Some Other Basic Packages
NumPy and SciPy
Fundamental packages for scientific computing with Python
Contain powerful n-dimensional array objects
Useful linear algebra, random number, and other capabilities
Pandas
Contains useful data structures and algorithms
Matplotlib
Contains functions for plotting/visualizing data
Standardization and Normalization
Standardization: transform data so that it has zero mean and unit variance.
Also called scaling
Use function sklearn.preprocessing.scale()
Parameters:
X: data to be scaled
with_mean: Boolean. Whether to center the data (make it zero mean)
with_std: Boolean. Whether to scale the data to unit standard deviation
Normalization: transform data so that each sample (row) is scaled to unit norm.
Note: scaling features to the [0, 1] range is done with sklearn.preprocessing.MinMaxScaler, not normalize().
Use function sklearn.preprocessing.normalize()
Parameters:
X: data to be normalized
norm: which norm to use: l1 or l2
axis: whether to normalize by row or column
Example code of Standardization/Scaling

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
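The normalize() function can be exercised on the same toy array. A minimal sketch, row-wise with the l2 norm (the variable names are illustrative, not from the slides):

```python
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Row-wise l2 normalization: each sample is divided by its Euclidean norm,
# so every (nonzero) row of the result has unit length.
X_normalized = preprocessing.normalize(X, norm='l2', axis=1)
```

With norm='l1' each row would instead sum to 1 in absolute value; axis=0 would normalize columns rather than rows.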
Missing Value Replacement
In scikit-learn, this is referred to as “Imputation”
Class used: sklearn.preprocessing.Imputer
Important parameters:
strategy: what to replace the missing values with: mean / median / most_frequent
axis: whether to impute along columns (0) or rows (1)
Attribute:
statistics_: the fill value computed for each feature
Important methods:
fit(X[, y]): fit the imputer on X.
transform(X): replace all the missing values in X.
Example code for Replacing Missing Values

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
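Note that in later scikit-learn releases (0.20 and up) Imputer was deprecated and eventually removed in favor of sklearn.impute.SimpleImputer. A sketch of the equivalent modern code, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces preprocessing.Imputer; it always imputes
# column-wise (the old axis parameter is gone).
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)  # column means from the fit data: 4.0 and 3.666...
```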
Resampling
Use the function sklearn.utils.resample
Important parameters:
n_samples: number of samples to keep
replace: Boolean. Whether to resample with or without replacement
Returns a sequence of resampled views of the collections; the original arrays are not affected.
Another useful function is sklearn.utils.shuffle
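A minimal bootstrap sketch with resample (the toy data and random_state are made up for illustration); passing X and y together keeps rows aligned with their labels:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y = np.array([0, 0, 1, 1, 1])

# Bootstrap: draw 5 samples with replacement; X and y are resampled
# in the same order, so each resampled row keeps its original label.
X_res, y_res = resample(X, y, n_samples=5, replace=True, random_state=0)
```

With replace=False this instead draws a random subsample without repetition.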
Discretization
Scikit-learn doesn’t have a direct class that performs discretization.
Can be performed with the cut and qcut functions available in pandas.
Orange has discretization functions in Orange.feature.discretization
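A sketch of both pandas functions; the bin edges and labels here are invented for the example. cut bins by explicit value ranges, while qcut bins by quantiles:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 70])

# Explicit-edge binning; intervals are right-inclusive by default,
# so 17 falls in the (0, 18] "child" bin.
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                    labels=['child', 'young', 'adult', 'senior'])

# Quantile-based binning: qcut chooses the edges so that each bin
# holds (roughly) the same number of observations.
age_halves = pd.qcut(ages, q=2, labels=['lower', 'upper'])
```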
Feature Selection
The sklearn.feature_selection module implements feature selection algorithms.
Some classes in this module are:
GenericUnivariateSelect: univariate feature selector based on statistical tests.
SelectKBest: select features according to the k highest scores.
RFE: feature ranking with recursive feature elimination.
VarianceThreshold: feature selector that removes all low-variance features.
Scikit-learn does not have a CFS implementation, but RFE works in a somewhat similar fashion.
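A short SelectKBest sketch on the bundled iris data (the choice of f_classif as the scoring function and k=2 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# Keep the 2 features with the highest ANOVA F-scores against the labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
```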
Dimensionality Reduction: PCA
The sklearn.decomposition module includes matrix decomposition algorithms, including PCA.
sklearn.decomposition.PCA class
Important parameters:
n_components: number of components to keep
Important attributes:
components_: components with maximum variance
explained_variance_ratio_: percentage of variance explained by each of the selected components
Important methods:
fit(X[, y]): fit the model with X.
score_samples(X): return the log-likelihood of each sample.
transform(X): apply the dimensionality reduction to X.
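A minimal PCA sketch on random data (the 100×5 shape and n_components=2 are arbitrary choices for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)              # 100 samples, 5 features

pca = PCA(n_components=2)         # keep the top 2 principal components
X_reduced = pca.fit_transform(X)  # project the data onto those components

# Each entry is the fraction of total variance captured by one component.
var_ratio = pca.explained_variance_ratio_
```

fit_transform is equivalent to calling fit(X) followed by transform(X).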
Other Useful Information
Generate a random permutation of the numbers 0 … n-1: numpy.random.permutation(n)
You can randomly generate some toy datasets using the sample generators in sklearn.datasets
Scikit-learn doesn’t directly handle categorical/nominal attributes well. In order to use them in a dataset, some sort of encoding needs to be performed.
One good way to encode categorical attributes: if there are n categories, create n dummy binary variables representing each category.
Can be done easily using the sklearn.preprocessing.OneHotEncoder class.
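A sketch of that dummy-variable encoding with OneHotEncoder, assuming scikit-learn 0.20+ (where string categories are supported); the color data is invented, and .toarray() is used because the encoder returns a sparse matrix by default:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of nominal values: 3 distinct categories across 4 rows.
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# One dummy binary column per category; exactly one 1 per encoded row.
enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()
```

The learned category order is available afterwards via enc.categories_.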
References
Preprocessing modules: http://scikit-learn.org/stable/modules/preprocessing.html
Video tutorial: http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107
Quick start tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
User guide: http://scikit-learn.org/stable/user_guide.html
API reference: http://scikit-learn.org/stable/modules/classes.html
Example gallery: http://scikit-learn.org/stable/auto_examples/index.html