Understanding Feature Selection and Reduction Techniques Using PCA

In machine learning, Principal Components Analysis (PCA) is a common method for dimensionality reduction. It combines the information in many features into a smaller set of new features by projecting the data onto the directions of highest variance, which often discards noise along with the low-variance directions. PCA is unsupervised and works best when the correlations in the data are mostly linear; non-linear dimensionality reduction techniques can give better results on datasets with significant non-linearities.



Presentation Transcript


  1. Feature Selection and Reduction II PAUL BODILY IDAHO STATE UNIVERSITY

  2. Can we really get away with reducing dimensionality without compromising the results of a learning algorithm? In fact, it can make the results better, since we are often removing some of the noise in the data. The question is how to choose the axes. The first method we are going to look at is Principal Components Analysis (PCA). The idea of a principal component is that it is a direction in the data with the largest variation. The algorithm first centres the data by subtracting off the mean, then chooses the direction with the largest variation and places an axis in that direction, and then looks at the variation that remains and finds another axis that is orthogonal to the first and covers as much of the remaining variation as possible. It iterates this until it has run out of possible axes. The end result is that all the variation is along the axes of the coordinate set, so the covariance matrix is diagonal: each new variable is uncorrelated with every variable except itself. Some of the axes found last have very little variation, so they can be removed without affecting the variability in the data. Putting this in more formal terms, we have a data matrix X and we want to rotate it so that the data lies along the directions of maximum variation. This means we multiply our data matrix by a rotation matrix (often written as P^T), so that Y = P^T X, where P is chosen so that the covariance matrix of Y is diagonal, i.e. cov(Y) = cov(P^T X) = P^T cov(X) P is a diagonal matrix. (Quoted from Machine Learning: An Algorithmic Perspective, p. 134. Figure 6.5: plot of the iris data showing the three classes, before and after LDA has been applied. Figure 6.6: two different sets of coordinate axes; the second is a rotation and translation of the first, found using Principal Components Analysis.)
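
The rotation described here is easy to sketch in NumPy. Below is a minimal illustration (with synthetic 2-D data, not the iris data from the figures) of Y = P^T X, where P holds the eigenvectors of the covariance matrix, with a check that the covariance of Y comes out diagonal:

    import numpy as np

    # Synthetic correlated 2-D data (2 x N), centred by subtracting the mean of each row.
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[3.0, -1.0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500).T
    X = X - X.mean(axis=1, keepdims=True)

    # P holds the principal directions (eigenvectors of the covariance matrix) in its columns.
    eigvals, P = np.linalg.eigh(np.cov(X))
    Y = P.T @ X                        # rotate the data: Y = P^T X

    print(np.round(np.cov(Y), 3))      # approximately diagonal; diagonal entries match eigvals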

  3. PCA: Principal Components Analysis. PCA is one of the most common feature reduction techniques: a linear method for dimensionality reduction that lets us combine much of the information contained in n features into m features, where m < n. PCA is unsupervised in that it does not consider the output class/value of an instance; there are other algorithms which do (e.g. Linear Discriminant Analysis). PCA works well in many cases where the data has mostly linear correlations. Non-linear dimensionality reduction is also a relatively new and successful area and can give much better results for data with significant non-linearities.
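
As a minimal usage sketch (assuming scikit-learn is available; the data here is synthetic and the choice m = 2 is arbitrary), reducing n = 6 features to m = 2 looks like this:

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic data: 100 instances with n = 6 features that are mostly linear mixes of 2 factors.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    X = np.hstack([base, base @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(100, 4))])

    pca = PCA(n_components=2)              # unsupervised: no class labels are used
    X_reduced = pca.fit_transform(X)       # shape (100, 2)
    print(pca.explained_variance_ratio_)   # fraction of total variance captured by each new feature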

  4. PCA overview. Seek a new set of bases which correspond to the highest variance in the data: transform the n-dimensional data to a new n-dimensional basis, where the new dimension with the most variance is the first principal component, the next is the second principal component, and so on. Note that (in the slide's 2-D illustration) the first new axis z1 combines/fuses significant information from both original features x1 and x2. Finally, drop those dimensions for which there is little variance.

  5. Variance and covariance. Variance is a measure of data spread in one dimension (feature); covariance measures how two dimensions (features) vary with respect to each other:

       var(X) = [ Σ_{i=1..n} (X_i - X̄)² ] / (n - 1)
       cov(X, Y) = [ Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) ] / (n - 1)

     where X̄ and Ȳ are the means of X and Y over the n data points.
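
These two formulas are a couple of lines of NumPy; the small x and y arrays below are made up purely for illustration:

    import numpy as np

    x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
    y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])
    n = len(x)

    var_x = np.sum((x - x.mean()) ** 2) / (n - 1)                 # spread of one feature
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)    # how two features vary together

    # np.cov uses the same (n - 1) sample normalisation by default.
    assert np.isclose(var_x, np.cov(x, y)[0, 0])
    assert np.isclose(cov_xy, np.cov(x, y)[0, 1])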

  6. Predicting cruise ship crew size from age, tonnage, passengers, length, cabins, and passenger density.

  7. Predicting cruise ship crew size from age, tonnage, passengers, length, cabins, and passenger density (continued).
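
The slides show only the data exploration for this task, not code. As a hedged sketch of how PCA could slot into it, here is one possible pipeline; the file name cruise_data.csv and the column names are hypothetical, and keeping 3 components is an arbitrary choice:

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical file and column names; the actual dataset is not included with the slides.
    df = pd.read_csv("cruise_data.csv")
    X = df[["age", "tonnage", "passengers", "length", "cabins", "passenger_density"]]
    y = df["crew"]

    # Standardise (PCA is scale-sensitive), reduce to a few components, then regress on crew size.
    model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
    print(cross_val_score(model, X, y, cv=5, scoring="r2"))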

  8. Covariance and the covariance matrix. Considering the sign (rather than the exact value) of covariance: a positive value means that as one feature increases or decreases, the other does also (positively correlated); a negative value means that as one feature increases, the other decreases, and vice versa (negatively correlated); a value close to zero means the features are uncorrelated (no linear relationship). If two features are highly covariant, are both really necessary? The covariance matrix is an n × n matrix containing the covariance values for all pairs of features in a data set with n features (dimensions). The diagonal contains the covariance of a feature with itself, which is its variance (the square of the standard deviation), and the matrix is symmetric.
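
A short sketch (with made-up data) of building the covariance matrix and checking the properties above; the 0.9 correlation threshold used to flag "highly covariant" pairs is just an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))
    X[:, 3] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)    # feature 3 is nearly redundant with feature 0

    C = np.cov(X, rowvar=False)                             # n x n covariance matrix (columns = features)
    assert np.allclose(C, C.T)                              # symmetric
    assert np.allclose(np.diag(C), X.var(axis=0, ddof=1))   # diagonal entries are the per-feature variances

    # Flag highly correlated (covariant) feature pairs as candidates for removal or fusion.
    R = np.corrcoef(X, rowvar=False)
    pairs = np.argwhere(np.triu(np.abs(R) > 0.9, k=1))
    print(pairs)                                            # expect the pair [0, 3] here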

  9. PCA example. First step: center the original data around 0 by subtracting the mean in each dimension (here X̄ = 1.81, Ȳ = 1.91).

       X     Y    |  X - X̄   Y - Ȳ
       2.5   2.4  |   0.69    0.49
       0.5   0.7  |  -1.31   -1.21
       2.2   2.9  |   0.39    0.99
       1.9   2.2  |   0.09    0.29
       3.1   3.0  |   1.29    1.09
       2.3   2.7  |   0.49    0.79
       2.0   1.6  |   0.19   -0.31
       1.0   1.1  |  -0.81   -0.81
       1.5   1.6  |  -0.31   -0.31
       1.1   0.9  |  -0.71   -1.01

  10. PCA example. Second step: calculate the covariance matrix of the centered data from the previous slide, using cov(X, Y) = [ Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) ] / (n - 1). It is only 2 × 2 for this case:

       cov = [ 0.616555556   0.615444444 ]
             [ 0.615444444   0.716555556 ]
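
A sketch of these first two steps in NumPy on the same ten (X, Y) points; the printed matrix should come out close to the one above:

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    # Step 1: center each feature around 0.
    mean = data.mean(axis=0)                       # approximately [1.81, 1.91]
    centered = data - mean

    # Step 2: covariance matrix of the centered data, with n - 1 in the denominator.
    cov = centered.T @ centered / (len(data) - 1)
    print(cov)                                     # approximately [[0.6166, 0.6154], [0.6154, 0.7166]]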

  11. PCA example. Third step: calculate the unit eigenvectors and eigenvalues of the covariance matrix (remember linear algebra). The covariance matrix is always square (n × n) and positive semi-definite, so n non-negative eigenvalues will exist. All eigenvectors (principal components/dimensions) are orthogonal to each other and will form the new set of bases/dimensions for the data. The magnitude of each eigenvalue corresponds to the variance along that new dimension: just what we wanted! We can sort the principal components according to their eigenvalues and keep just those dimensions with the largest eigenvalues. For this example:

       eigenvalues = (0.0490833989, 1.28402771)

       eigenvectors (as columns, in the same order) = [ -0.735178656   -0.677873399 ]
                                                      [  0.677873399   -0.735178656 ]

  12. PCA example. The slide shows the two eigenvectors overlaying the centered data; which eigenvector has the largest eigenvalue? Fourth step: just keep the p eigenvectors (rows of the transform) with the largest eigenvalues. We do lose some information, but if we only drop dimensions with small eigenvalues then we lose very little, hopefully mostly noise. We can then use p input features rather than n, and those p features contain the most pertinent combined information from all n original features. How many dimensions p should we keep? A plot of eigenvalue against component number (1, 2, ..., n) helps, as does the proportion of variance retained by the first p components:

       proportion of variance = (λ1 + λ2 + ... + λp) / (λ1 + λ2 + ... + λn)

     For this example, the eigenvalues and eigenvectors are those from the previous slide.
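
A sketch of the eigendecomposition and of choosing p by proportion of variance, continuing from the covariance matrix above; note that eigenvector signs are arbitrary, so an implementation may return them negated relative to the slide, and the 0.95 variance threshold is just an example:

    import numpy as np

    cov = np.array([[0.616555556, 0.615444444],
                    [0.615444444, 0.716555556]])

    # Step 3: unit eigenvectors and eigenvalues of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: keep the smallest p whose cumulative proportion of variance reaches the threshold.
    proportion = np.cumsum(eigvals) / eigvals.sum()
    p = int(np.searchsorted(proportion, 0.95)) + 1
    print(eigvals)                               # approximately [1.2840, 0.0491]
    print(proportion, p)                         # approximately [0.963, 1.0]; p = 1 here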

  13. PCA example. Last step: transform the n features to the p (< n) chosen bases (eigenvectors). The transformed data (for m instances) is a matrix multiply, T = A B, where A is a p × n matrix with the p principal components in its rows (component one on top) and B is an n × m matrix containing the transposed, centered original data set. T^T is then an m × p matrix containing the transformed data set. Now we have the new transformed data set with dimensionality p. Keep matrix A (and the feature means) to transform future 0-centered data instances. The slide shows the transform of both dimensions; with p = 1 we would keep just the first component.
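
A sketch of the final transform T = A B on the same example, keeping p = 1 component (again, the sign of the projected values depends on the eigenvector sign convention):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    mean = data.mean(axis=0)
    centered = data - mean

    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    A = eigvecs[:, np.argsort(eigvals)[::-1]].T    # p x n with components in rows, largest variance on top
    A = A[:1]                                      # keep only the first principal component (p = 1)
    B = centered.T                                 # n x m transposed centered data

    T = A @ B                                      # p x m
    new_data = T.T                                 # m x p: one PCA feature per instance
    print(new_data.ravel())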

  14. PCA algorithm summary. Training: (1) center the n training set features around 0 (subtract the n feature means); (2) calculate the covariance matrix of the centered training set; (3) calculate the unit eigenvectors and eigenvalues of the covariance matrix; (4) keep the p (< n) eigenvectors with the largest eigenvalues; (5) matrix multiply the p eigenvectors with the centered training set to get a new training set with only p features. Given a novel instance during execution: (1) center the instance around 0 using the same centering transform (means) computed during training; (2) do the matrix multiply from step 5 above to change the new instance from n to p features.
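
The whole summary fits in a few lines of NumPy. This is a minimal sketch, not the course's reference implementation, and the function names are my own:

    import numpy as np

    def pca_fit(X, p):
        """X is m x n training data; returns the n feature means and the p x n matrix A of components."""
        mean = X.mean(axis=0)                      # step 1: centering parameters
        cov = np.cov(X - mean, rowvar=False)       # step 2: covariance of the centered training set
        eigvals, eigvecs = np.linalg.eigh(cov)     # step 3: unit eigenvectors and eigenvalues
        keep = np.argsort(eigvals)[::-1][:p]       # step 4: the p largest eigenvalues
        return mean, eigvecs[:, keep].T            # A is p x n, components in rows

    def pca_transform(X, mean, A):
        """Step 5, also used for novel instances: center with the training means, then project."""
        return (X - mean) @ A.T                    # m x p (or 1 x p for a single instance)

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(50, 5))
    mean, A = pca_fit(X_train, p=2)
    print(pca_transform(rng.normal(size=(1, 5)), mean, A).shape)   # (1, 2)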

  15. PCA example: terms. Number of instances in the data set: m = 5. Number of input features: n = 2. Final number of principal components chosen: p = 1. The steps are the ones in the summary on the previous slide, using the variance and covariance formulas defined earlier.

       Original data:
              x     y
       p1     0.2  -0.3
       p2    -1.1   2.0
       p3     1.0  -2.2
       p4     0.5  -1.0
       p5    -0.6   1.0
       mean   0.0  -0.1

  16. PCA example: worked results (m = 5, n = 2, p = 1).

       Original data (x, y):      p1 (0.2, -0.3), p2 (-1.1, 2.0), p3 (1.0, -2.2), p4 (0.5, -1.0), p5 (-0.6, 1.0); mean (0, -0.1)
       Zero-centered data (x, y): p1 (0.2, -0.2), p2 (-1.1, 2.1), p3 (1.0, -2.1), p4 (0.5, -0.9), p5 (-0.6, 1.1); mean (0, 0)

       Covariance matrix = [  0.715  -1.39 ]
                           [ -1.39    2.72 ]

       Eigenvectors (x, y) and eigenvalues:
       (-0.456, -0.890) with eigenvalue 3.431
       (-0.890, -0.456) with eigenvalue 0.0037

       Percent of total information in the 1st principal component: 3.431 / (3.431 + 0.0037) = 99.89%

       Matrix A (p × n), the kept principal components in rows: 1st PC = (-0.456, -0.890)
       Matrix B (n × m), the transposed zero-centered training set:
       x:  0.2  -1.1   1.0   0.5  -0.6
       y: -0.2   2.1  -2.1  -0.9   1.1

       New data set T^T = (A B)^T, one 1st-PC value per instance:
       p1 = 0.0870, p2 = -1.368, p3 = 1.414, p4 = 0.573, p5 = -0.710
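
A quick cross-check of the eigenvalues and of the 99.89% figure using scikit-learn (the signs of the components, and hence of the projected values, are not unique, so they may not match the slide exactly):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([[0.2, -0.3], [-1.1, 2.0], [1.0, -2.2], [0.5, -1.0], [-0.6, 1.0]])

    pca = PCA(n_components=1)
    new_data = pca.fit_transform(X)            # the m x p transformed data set

    print(pca.explained_variance_)             # approximately [3.431]
    print(pca.explained_variance_ratio_)       # approximately [0.9989], i.e. the 99.89% on the slide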

  17. PCA summary. PCA is a linear transformation, so if the data is highly non-linear then the transformed data will be less informative; non-linear dimensionality reduction techniques (e.g. LLE, Isomap, Manifold Sculpting) can handle these situations better. PCA is good at removing redundant, correlated features. With high-dimensional data each eigenvector (principal component) is a hyperplane. Interesting note: the 1st principal component is the multiple regression plane that the delta rule will discover. Caution: PCA is not a "cure-all" and can lose important information in some cases. How would you know if it is effective? Consider the pros and cons of PCA versus a wrapper approach, for example.
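
One way to answer the "how would you know if it is effective?" question is to compare cross-validated performance with and without the PCA step; a sketch on a standard scikit-learn dataset (the model and the 95%-variance setting are illustrative choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000))

    print(cross_val_score(baseline, X, y, cv=5).mean())   # accuracy on all 30 original features
    print(cross_val_score(with_pca, X, y, cv=5).mean())   # accuracy keeping ~95% of the variance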

  18. Group Projects
