Understanding Dimensionality Reduction and Principal Component Analysis


Dimensionality reduction techniques like Principal Component Analysis (PCA) transform high-dimensional data into a lower-dimensional space, enabling efficient storage and a better understanding of the underlying patterns. By capturing the maximum variance in the data, PCA learns projection directions that minimize reconstruction error, yielding a compressed representation of the inputs.



Presentation Transcript


1. Dimensionality Reduction: Principal Component Analysis (CS771: Introduction to Machine Learning; instructor: Nisheeth)

2. K-means loss function: recap
$z_n = [z_{n1}, z_{n2}, \dots, z_{nK}]$ denotes a length-$K$ one-hot encoding of $x_n$. Remember the matrix factorization view of the k-means loss function?
[Figure: the $N \times D$ data matrix $X$ approximated as the product of an $N \times K$ assignment matrix $Z$ and a $K \times D$ matrix of cluster means.]
We approximated an $N \times D$ matrix with an $N \times K$ matrix and a $K \times D$ matrix. This could be storage efficient if $K$ is much smaller than $D$.

3. Dimensionality Reduction
A broad class of techniques whose goal is to compress the original representation of the inputs.
Example: approximate each input $x_n \in \mathbb{R}^D$, $n = 1, 2, \dots, N$, as a linear combination of $K < \min\{N, D\}$ "basis" vectors $w_1, w_2, \dots, w_K$, each also in $\mathbb{R}^D$:
$$x_n \approx \sum_{k=1}^{K} z_{nk} w_k = W z_n$$
where $W = [w_1, w_2, \dots, w_K]$ is $D \times K$ and $z_n = [z_{n1}, z_{n2}, \dots, z_{nK}]$ is $K \times 1$.
Note: these basis vectors need not necessarily be linearly independent, but for some dim-red techniques, e.g., classic principal component analysis (PCA), they are.
Can think of $W$ as a linear mapping that transforms the low-dim $z_n$ to the high-dim $x_n$. Some dim-red techniques instead assume a nonlinear mapping function $f$ such that $x_n = f(z_n)$; for example, $f$ can be modeled by a kernel or a deep neural net.
We have thus represented each $x_n \in \mathbb{R}^D$ by a $K$-dim vector $z_n$ (a new feature representation). To store $N$ such inputs $\{x_n\}_{n=1}^N$, originally we required $N \times D$ storage; now we need to keep $W$ and $\{z_n\}_{n=1}^N$, i.e., $D \times K + N \times K = (N + D)K$ storage. If $K \ll \min\{N, D\}$, this yields substantial storage savings, hence good compression.
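As a quick illustration of the storage argument above, here is a minimal sketch; the sizes $N$, $D$, $K$ are made up for the example and are not from the slides:

```python
# Count the storage needed before and after the K-basis-vector approximation x_n ≈ W z_n.
N, D, K = 10_000, 784, 50           # made-up sizes for illustration

original = N * D                    # store every x_n directly
compressed = D * K + N * K          # store W (D x K) plus all z_n (N x K)

print(original, compressed, original / compressed)   # roughly a 14x saving here
```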

4. Dimensionality Reduction: dim-red for face images
[Figure: a face image $x_n$ approximated as $z_{n1} w_1 + z_{n2} w_2 + z_{n3} w_3 + z_{n4} w_4$, using $K = 4$ basis face images.]
Each basis image is like a "template" that captures the common properties of the face images in the dataset. In this example, $z_n \in \mathbb{R}^4$ ($K = 4$) is a low-dim feature representation for $x_n \in \mathbb{R}^D$, like 4 new features: essentially, each face image in the dataset is now represented by just 4 real numbers.
Different dim-red algorithms differ in how the basis vectors are defined/learned, and, more generally, in how the function $f$ in the mapping $x_n = f(z_n)$ is defined.

5. Principal Component Analysis (PCA)
A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1933). Can be seen as:
- Learning directions (co-ordinate axes) that capture the maximum variance in the data. PCA is essentially doing a change of axes in which we represent the data: $e_1, e_2$ are the standard co-ordinate axes ($x_n = [x_{n1}, x_{n2}]$) and $w_1, w_2$ are the new co-ordinate axes ($z_n = [z_{n1}, z_{n2}]$). Each input still has 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin. To reduce dimension, we can keep only the co-ordinates along the directions that have the largest variances (e.g., in this example, if we want to reduce to one dimension, we can keep the co-ordinate $z_{n1}$ of each point along $w_1$ and throw away $z_{n2}$); we won't lose much information.
- Learning projection directions that result in the smallest reconstruction error:
$$\{\hat{W}, \hat{Z}\} = \operatorname*{argmin}_{W, Z} \sum_{n=1}^{N} \| x_n - W z_n \|^2$$
subject to orthonormality constraints $\|w_k\|^2 = 1$ and $w_k^\top w_{k'} = 0$ for $k \neq k'$. PCA thus also assumes that the projection directions are orthonormal.

6. Principal Component Analysis: the algorithm
- Center the data (subtract the mean $\mu = \frac{1}{N}\sum_{n=1}^{N} x_n$ from each data point).
- Compute the $D \times D$ covariance matrix $S$ using the centered data matrix $X$ as $S = \frac{1}{N} X^\top X$ (assuming $X$ is arranged as $N \times D$).
- Do an eigendecomposition of the covariance matrix $S$ (many methods exist).
- Take the top $K < D$ leading eigenvectors $\{w_1, w_2, \dots, w_K\}$ with eigenvalues $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$.
- The $K$-dimensional projection/embedding of each input is $z_n = W_K^\top x_n$, where $W_K = [w_1, w_2, \dots, w_K]$ is the projection matrix of size $D \times K$.
Note: we can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance along the $w_k$ direction, and their sum is the total variance).
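Below is a minimal NumPy sketch of these steps; the function name `pca` and all variable names are my own, not from the course code:

```python
import numpy as np

def pca(X, K):
    """X: (N, D) data matrix. Returns the embedding Z, the projection matrix W_K,
    and the top-K eigenvalues."""
    mu = X.mean(axis=0)                    # mean of the data
    Xc = X - mu                            # step 1: center the data
    S = (Xc.T @ Xc) / X.shape[0]           # step 2: D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 3: eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1][:K]  # step 4: indices of the top-K eigenvalues
    W_K = eigvecs[:, order]                # projection matrix, D x K
    Z = Xc @ W_K                           # step 5: z_n = W_K^T (x_n - mu)
    return Z, W_K, eigvals[order]

Z, W_K, lam = pca(np.random.randn(500, 20), K=3)
print(Z.shape, W_K.shape, lam)             # (500, 3), (20, 3), and the 3 captured variances
```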

7. Understanding PCA: The variance perspective

8. Solving PCA by Finding Max. Variance Directions
Consider projecting an input $x_n \in \mathbb{R}^D$ (green points in the figure) along a direction $w_1 \in \mathbb{R}^D$. The projection/embedding of $x_n$ (red points in the figure) will be $w_1^\top x_n$.
Mean of the projections of all inputs: $\frac{1}{N}\sum_{n=1}^{N} w_1^\top x_n = w_1^\top \left(\frac{1}{N}\sum_{n=1}^{N} x_n\right) = w_1^\top \mu$.
Variance of the projections:
$$\frac{1}{N}\sum_{n=1}^{N} \left( w_1^\top x_n - w_1^\top \mu \right)^2 = \frac{1}{N}\sum_{n=1}^{N} \left\{ w_1^\top (x_n - \mu) \right\}^2 = w_1^\top S w_1$$
where $S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^\top$ is the $D \times D$ covariance matrix of the data. (For already centered data, $\mu = 0$ and $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top$.)
We want $w_1$ such that the variance $w_1^\top S w_1$ is maximized:
$$\operatorname*{argmax}_{w_1} \; w_1^\top S w_1 \quad \text{s.t.} \quad w_1^\top w_1 = 1$$
(We need this constraint; otherwise the objective's max will be infinity.)
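A quick numeric check of the variance identity above; this is purely illustrative and the names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy data, N = 1000, D = 5
mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / X.shape[0]          # covariance matrix S

w1 = rng.normal(size=5)
w1 /= np.linalg.norm(w1)                        # unit-norm direction (the constraint)

proj = X @ w1                                   # projections w_1^T x_n
print(np.allclose(proj.var(), w1 @ S @ w1))     # True: variance of the projections equals w_1^T S w_1
```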

9. Max. Variance Direction
Our objective was $\operatorname*{argmax}_{w_1} w_1^\top S w_1$ s.t. $w_1^\top w_1 = 1$. We can construct a Lagrangian for this problem:
$$\operatorname*{argmax}_{w_1} \; w_1^\top S w_1 + \lambda_1 (1 - w_1^\top w_1)$$
Taking the derivative w.r.t. $w_1$ and setting it to zero gives $S w_1 = \lambda_1 w_1$. Therefore $w_1$ is an eigenvector of the covariance matrix $S$ with eigenvalue $\lambda_1$. (Note: in general, $S$ will have $D$ eigenvectors.)
Claim: $w_1$ is the eigenvector of $S$ with the largest eigenvalue $\lambda_1$. Note that $w_1^\top S w_1 = \lambda_1 w_1^\top w_1 = \lambda_1$, i.e., the variance along the direction $w_1$ equals $\lambda_1$. Thus the variance $w_1^\top S w_1$ will be maximized if $\lambda_1$ is the largest eigenvalue (and $w_1$ is the corresponding top eigenvector, also known as the first Principal Component).
Other large-variance directions can be found likewise (with each being orthogonal to all the others) using the eigendecomposition of the covariance matrix $S$; this is PCA. PCA keeps the top $K < D$ such directions of largest variance. Note: the total variance of the data equals the sum of the eigenvalues of $S$, i.e., $\sum_{d=1}^{D} \lambda_d$.
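The claims on this slide can also be checked numerically; a small sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(2000, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])   # toy data with unequal spread
Xc = Xc - Xc.mean(axis=0)
S = Xc.T @ Xc / Xc.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)
w1, lam1 = eigvecs[:, -1], eigvals[-1]            # top eigenvector and eigenvalue

print(np.allclose(S @ w1, lam1 * w1))             # S w_1 = lambda_1 w_1
print(np.allclose(w1 @ S @ w1, lam1))             # variance along w_1 equals lambda_1
print(np.isclose(eigvals.sum(), np.trace(S)))     # total variance = sum of eigenvalues
for _ in range(5):                                # random unit directions never do better
    w = rng.normal(size=4); w /= np.linalg.norm(w)
    assert w @ S @ w <= lam1 + 1e-9
```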

10. Understanding PCA: The reconstruction perspective

11. Alternate Basis and Reconstruction
Representing a data point $x_n = [x_{n1}, x_{n2}, \dots, x_{nD}]^\top$ in the standard orthonormal basis $e_1, e_2, \dots, e_D$:
$$x_n = \sum_{d=1}^{D} x_{nd} e_d$$
where $e_d$ is a vector of all zeros except a single 1 at the $d$-th position, and $e_d^\top e_{d'} = 0$ for $d \neq d'$.
Let's represent the same data point in a new orthonormal basis $w_1, w_2, \dots, w_D$:
$$x_n = \sum_{d=1}^{D} z_{nd} w_d$$
where $z_{nd} = w_d^\top x_n$ is the projection of $x_n$ along the direction $w_d$. Here $z_n = [z_{n1}, z_{n2}, \dots, z_{nD}]$ denotes the co-ordinates of $x_n$ in the new basis: $z_n = W^\top x_n$ and $x_n = W z_n$ (verify).
Ignoring the directions along which the projection $z_{nd}$ is small, we can approximate $x_n$ as
$$\hat{x}_n = \sum_{k=1}^{K} z_{nk} w_k = W_K z_n, \qquad z_n = W_K^\top x_n$$
where $W_K = [w_1, w_2, \dots, w_K]$ is the projection matrix of size $D \times K$ (verify). Now $x_n$ is represented by the $K < D$ dimensional vector $z_n = [z_{n1}, z_{n2}, \dots, z_{nK}]$.
Note that $\| x_n - \hat{x}_n \|^2 = \| x_n - \sum_{k=1}^{K} (w_k^\top x_n) w_k \|^2$ is the reconstruction error on $x_n$; we would like to minimize it w.r.t. $w_1, w_2, \dots, w_K$.
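Here is a small sketch (mine, not from the slides) of the change of basis and the rank-$K$ reconstruction:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 6, 2
W, _ = np.linalg.qr(rng.normal(size=(D, D)))    # a random full orthonormal basis w_1..w_D
x = rng.normal(size=D)

z = W.T @ x                                     # co-ordinates of x in the new basis
print(np.allclose(W @ z, x))                    # exact reconstruction with all D directions

W_K = W[:, :K]                                  # keep only K of the directions
x_hat = W_K @ (W_K.T @ x)                       # approximate reconstruction from K co-ordinates
print(np.linalg.norm(x - x_hat) ** 2)           # reconstruction error on x
```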

12. Minimizing Reconstruction Error
We plan to use only $K$ directions $w_1, w_2, \dots, w_K$, so we would like them to be such that the total reconstruction error
$$\mathcal{L}(w_1, w_2, \dots, w_K) = \sum_{n=1}^{N} \left\| x_n - \sum_{k=1}^{K} (w_k^\top x_n) w_k \right\|^2$$
is minimized. Expanding this gives $\mathcal{L}(w_1, w_2, \dots, w_K) = C - \sum_{k=1}^{K} w_k^\top S w_k$ (verify), where $C$ is a constant that doesn't depend on the $w_k$'s and $w_k^\top S w_k$ is the variance along $w_k$.
Each optimal $w_k$ can then be found by solving $\operatorname*{argmin}_{w_k} \mathcal{L}(w_1, \dots, w_K) = \operatorname*{argmax}_{w_k} w_k^\top S w_k$. Thus minimizing the reconstruction error is equivalent to maximizing the variance, and the $K$ directions can again be found via the eigendecomposition of $S$.
Note: $\sum_{k=1}^{K} w_k^\top S w_k = \operatorname{trace}(W_K^\top S W_K)$. Thus $\operatorname*{argmax}_{W_K} \operatorname{trace}(W_K^\top S W_K)$ subject to orthonormality on the columns of $W_K$ is the same as solving the eigendecomposition of $S$ (recall that Spectral Clustering also required solving such a problem).
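A numeric illustration (my own) of the trace objective: the top-$K$ eigenvectors of $S$ capture at least as much variance as a random orthonormal $W_K$:

```python
import numpy as np

rng = np.random.default_rng(3)
Xc = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / Xc.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)
W_top = eigvecs[:, np.argsort(eigvals)[::-1][:3]]        # top-3 eigenvectors as columns

W_rand, _ = np.linalg.qr(rng.normal(size=(8, 3)))        # a random orthonormal D x K matrix
print(np.trace(W_top.T @ S @ W_top)                      # variance captured by the eigenvectors...
      >= np.trace(W_rand.T @ S @ W_rand))                # ...is never beaten by another choice: True
```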

13. Principal Component Analysis
- Center the data (subtract the mean $\mu = \frac{1}{N}\sum_{n=1}^{N} x_n$ from each data point).
- Compute the $D \times D$ covariance matrix $S$ using the centered data matrix $X$ as $S = \frac{1}{N} X^\top X$ (assuming $X$ is arranged as $N \times D$).
- Do an eigendecomposition of the covariance matrix $S$ (many methods exist).
- Take the top $K < D$ leading eigenvectors $\{w_1, w_2, \dots, w_K\}$ with eigenvalues $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$.
- The $K$-dimensional projection/embedding of each input is $z_n = W_K^\top x_n$, where $W_K = [w_1, w_2, \dots, w_K]$ is the projection matrix of size $D \times K$.
Note: we can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance along the $w_k$ direction, and their sum is the total variance).
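The note about choosing the number of eigenvectors from the captured variance could look like the following sketch; the function `choose_K` and the eigenvalues are made up for illustration:

```python
import numpy as np

def choose_K(eigvals_desc, target=0.90):
    """eigvals_desc: eigenvalues of S sorted in decreasing order.
    Returns the smallest K whose leading eigenvalues capture `target` of the total variance."""
    ratio = np.cumsum(eigvals_desc) / eigvals_desc.sum()   # cumulative fraction of total variance
    return int(np.searchsorted(ratio, target)) + 1

print(choose_K(np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])))  # -> 4 (the first 4 capture >= 90%)
```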

14. Singular Value Decomposition (SVD)
Any matrix $X$ of size $N \times D$ can be written as the decomposition
$$X = U \Lambda V^\top = \sum_{k=1}^{\min\{N, D\}} \lambda_k u_k v_k^\top$$
where:
- $U = [u_1, u_2, \dots, u_N]$ is the $N \times N$ matrix of left singular vectors, each $u_k \in \mathbb{R}^N$; $U$ is orthonormal.
- $V = [v_1, v_2, \dots, v_D]$ is the $D \times D$ matrix of right singular vectors, each $v_k \in \mathbb{R}^D$; $V$ is also orthonormal.
- $\Lambda$ is an $N \times D$ diagonal matrix with only $\min(N, D)$ diagonal entries, the singular values. If $N > D$, the last $N - D$ rows are all zeros; if $D > N$, the last $D - N$ columns are all zeros.
Note: if $X$ is symmetric then this is known as the eigenvalue decomposition ($U = V$).
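A NumPy sketch of these shapes, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 6, 4
X = rng.normal(size=(N, D))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, s: min(N, D) singular values, Vt: D x D
Lam = np.zeros((N, D))                            # the N x D "diagonal" matrix Lambda
Lam[:len(s), :len(s)] = np.diag(s)                # N > D here, so the last N - D rows stay zero

print(np.allclose(U @ Lam @ Vt, X))               # X = U Lambda V^T: True
```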

15. Low-Rank Approximation via SVD
If we just use the top $K < \min\{N, D\}$ singular values, we get a rank-$K$ SVD:
$$X \approx \hat{X} = \sum_{k=1}^{K} \lambda_k u_k v_k^\top$$
This SVD approximation can be shown to minimize the reconstruction error $\| X - \hat{X} \|$. Fact: SVD gives the best rank-$K$ approximation of a matrix.
PCA is done by doing SVD on the covariance matrix $S$ (the left and right singular vectors are the same and become the eigenvectors; the singular values become the eigenvalues).
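A sketch of mine of the rank-$K$ truncation. It also uses a closely related route, running SVD on the centered data matrix rather than on $S$ itself; relating the two is my addition, not a step on the slide:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)                           # centered data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 3
X_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K]        # best rank-K approximation of Xc

# Relation to PCA: the right singular vectors of Xc are the eigenvectors of
# S = Xc^T Xc / N, and the eigenvalues are s_k^2 / N.
S = Xc.T @ Xc / Xc.shape[0]
eigvals, _ = np.linalg.eigh(S)
print(np.allclose(np.sort(s**2 / Xc.shape[0]), np.sort(eigvals)))   # True
print(np.linalg.norm(Xc - X_hat))                 # reconstruction error of the rank-K SVD
```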

16. Dim-Red as Matrix Factorization
If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix $X$:
[Figure: the $N \times D$ matrix $X$ approximated as the product of an $N \times K$ matrix $Z$ and a $K \times D$ matrix $W$, where $Z$ contains the low-dim representations of $X$.]
$$\{\hat{Z}, \hat{W}\} = \operatorname*{argmin}_{Z, W} \| X - ZW \|^2$$
If $K < \min\{N, D\}$, such a factorization gives a low-rank approximation of the data matrix $X$. We can solve such problems using ALT-OPT, and can impose various constraints on $Z$ and $W$ (e.g., sparsity, non-negativity, etc.).
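A minimal alternating-optimization (ALT-OPT) sketch for this unconstrained factorization; the function name, the small ridge term, and all other choices are my own, not the course's implementation:

```python
import numpy as np

def factorize(X, K, iters=50, reg=1e-3):
    """Alternately solve the two least-squares subproblems of min ||X - Z W||^2."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(N, K))
    W = rng.normal(size=(K, D))
    I = reg * np.eye(K)                              # tiny ridge term to keep the solves stable
    for _ in range(iters):
        W = np.linalg.solve(Z.T @ Z + I, Z.T @ X)    # fix Z, closed-form update for W
        Z = np.linalg.solve(W @ W.T + I, W @ X.T).T  # fix W, closed-form update for Z
    return Z, W

X = np.random.randn(100, 30) @ np.random.randn(30, 30)
Z, W = factorize(X, K=5)
print(np.linalg.norm(X - Z @ W))                     # residual of the rank-5 factorization
```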
