Exploring Matrix Sketching Techniques for Data Analysis


This study examines matrix sketching over sliding windows for analyzing modern data sets represented as large matrices. Techniques such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and k-means clustering are discussed in the context of covariance matrices and eigenvalue decomposition. Matrix sketching is presented as an efficient approach to approximating large matrices in an online fashion, as required in real-world applications. Sliding windows are highlighted for their relevance in detecting changes and anomalies in time-sensitive data streams.

  • Matrix Sketching
  • Data Analysis
  • Sliding Windows
  • SVD
  • PCA




Presentation Transcript


  1. Matrix Sketching over Sliding Windows. Zhewei Wei¹, Xuancheng Liu¹, Feifei Li², Shuo Shang¹, Xiaoyong Du¹, Ji-Rong Wen¹. ¹School of Information, Renmin University of China; ²School of Computing, The University of Utah.

  2. Matrix data. Modern data sets are modeled as large matrices: think of A ∈ ℝ^{n×d} as n rows in ℝ^d.
     Data             | Rows          | Columns       | d    | n
     Textual          | Documents     | Words         | 10^5 | 10^7 to >10^10
     Actions          | Users         | Types         | 10^1 | 10^4 to >10^7
     Visual           | Images        | Pixels, SIFT  | 10^5 | 10^6 to >10^8
     Audio            | Songs, tracks | Frequencies   | 10^5 | 10^6 to >10^8
     Machine Learning | Examples      | Features      | 10^2 | 10^4 to >10^6
     Financial        | Prices        | Items, Stocks | 10^3 | 10^5 to >10^6

  3. Singular Value Decomposition (SVD). A = UΣVᵀ, where U and V have orthonormal columns and Σ is diagonal with singular values σ₁ ≥ σ₂ ≥ … ≥ 0. Applications: principal component analysis (PCA), k-means clustering, latent semantic indexing (LSI).
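
As a concrete toy illustration of the factorization on this slide, the following sketch runs NumPy's SVD on a small random matrix; the data and shapes are illustrative, not from the paper:

```python
import numpy as np

# Factor a small random A (n x d) as U @ diag(s) @ Vt and verify
# the two defining properties used on the slide.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values come back in non-increasing order.
assert np.all(s[:-1] >= s[1:])
# The product reconstructs A up to floating-point error.
assert np.allclose(A, U @ np.diag(s) @ Vt)
```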

  4. SVD & eigenvalue decomposition. The covariance matrix AᵀA has eigendecomposition AᵀA = VΣ²Vᵀ: the right singular vectors of A are the eigenvectors of AᵀA, and its eigenvalues are the squared singular values σ₁² ≥ σ₂² ≥ … ≥ 0.
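
The identity on this slide can be checked numerically; a minimal sketch on toy data, assuming nothing beyond NumPy:

```python
import numpy as np

# Check that the eigenvalues of the covariance matrix A^T A are the
# squared singular values of A, and that A^T A = V Sigma^2 V^T.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))

_, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh sorts ascending

assert np.allclose(eigvals, s**2)
assert np.allclose(A.T @ A, Vt.T @ np.diag(s**2) @ Vt)
```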

  5. Matrix Sketching. Computing the SVD is slow (and offline). Matrix sketching: approximate a large matrix A ∈ ℝ^{n×d} with a much smaller B ∈ ℝ^{ℓ×d}, ℓ ≪ n, in an online fashion. Row-update stream: each update receives one row. Covariance error [Liberty2013, Ghashami2014, Woodruff2016]: ||AᵀA − BᵀB|| / ||A||_F². Feature hashing [Weinberger2009], random projection [Papadimitriou2011], and Frequent Directions (FD) [Liberty2013] maintain B ∈ ℝ^{ℓ×d} with ℓ = O(1/ε) rows such that the covariance error is at most ε.
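
The Frequent Directions idea cited here [Liberty2013] can be sketched in a few lines. This is a simplified, deliberately slow variant (it shrinks on every fill instead of batching), with toy data and a loose factor-2 error check; it is not the paper's implementation:

```python
import numpy as np

def frequent_directions(rows, ell):
    # Maintain an ell x d sketch B (assumes ell <= d). Whenever B fills
    # up, take an SVD and shrink all squared singular values by the
    # smallest one; this zeroes the last row, freeing it for the next
    # incoming row of the stream.
    rows = iter(rows)
    first = np.asarray(next(rows), dtype=float)
    B = np.zeros((ell, first.shape[0]))
    B[0] = first
    filled = 1
    for a in rows:
        if filled == ell:
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            shrunk = np.sqrt(np.maximum(s**2 - s[-1]**2, 0.0))
            B = np.diag(shrunk) @ Vt
            filled = ell - 1  # last row is now zero
        B[filled] = a
        filled += 1
    return B

# Toy check of the covariance-error guarantee.
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 10))
ell = 6
B = frequent_directions(A, ell)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
assert err <= 2 * np.linalg.norm(A, "fro") ** 2 / ell
```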

  6. Matrix Sketching over Sliding Windows. Each row is associated with a timestamp. Maintain a sketch B_W for A_W, the rows in the sliding window W. Covariance error: ||A_WᵀA_W − B_WᵀB_W|| / ||A_W||_F². Sequence-based window: the past N rows (A_W: N rows). Time-based window: the rows from a past time period (A_W: rows in the last Δ time units).

  7. Motivation 1: Sliding windows vs. unbounded streams. The sliding window model is the more appropriate model in many real-world applications, particularly in data analysis, where matrix sketching techniques are widely used. Applications: analyzing tweets from the past 24 hours; sliding window PCA for detecting changes and anomalies [Papadimitriou2006, Qahtan2015].

  8. Motivation 2: Lower bound. Unbounded stream solution: use O(d²) space to store AᵀA, with the rank-one update AᵀA ← AᵀA + aᵢᵀaᵢ. Theorem 4.1: any algorithm that returns AᵀA exactly for an arbitrary sequence-based sliding window must use Ω(Nd) bits of space. So matrix sketching is necessary over sliding windows even when the dimension d is small, and matrix sketching over sliding windows requires new techniques.
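
The unbounded-stream baseline on this slide is straightforward; a minimal sketch of the O(d²) rank-one update on a toy stream (names and data are illustrative):

```python
import numpy as np

# Keep the d x d matrix A^T A in O(d^2) space and fold each arriving
# row a_i in with the rank-one update  A^T A <- A^T A + a_i^T a_i.
d = 4
cov = np.zeros((d, d))
stream = np.random.default_rng(3).standard_normal((50, d))

seen = []
for a in stream:
    cov += np.outer(a, a)  # rank-one update, O(d^2) per row
    seen.append(a)

# The running covariance matches the batch product over all rows seen.
A = np.array(seen)
assert np.allclose(cov, A.T @ A)
```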

  9. Three algorithms. Sampling: sample row aᵢ with probability proportional to ||aᵢ||² [Frieze2004]; priority sampling [Efraimidis2006] + sliding-window top-k. LM-FD: Exponential Histogram (logarithmic method) [Datar2002] + Frequent Directions. DI-FD: dyadic interval techniques [Arasu2004] + Frequent Directions.
     Sketch   | Update | Space            | Window          | Interpretable?
     Sampling | Slow   | Large            | Sequence & time | Yes
     LM-FD    | Fast   | Small            | Sequence & time | No
     DI-FD    | Slow   | Best for small R | Sequence        | No

  10. Three algorithms. Sampling: sample row aᵢ with probability proportional to ||aᵢ||² [Frieze2004]; priority sampling [Efraimidis2006] + sliding-window top-k. LM-FD: Exponential Histogram (logarithmic method) [Datar2002] + Frequent Directions. DI-FD: dyadic interval techniques [Arasu2004] + Frequent Directions.
     Sketch   | Update | Space            | Window          | Interpretable?
     Sampling | Slow   | Large            | Sequence & time | Yes
     LM-FD    | Fast   | Small            | Sequence & time | No
     DI-FD    | Slow   | Best for small R | Sequence        | No
     Interpretable: the rows of the sketch B come from A. R: the ratio between the maximum and minimum squared row norms.

  11. Experiments: space vs. error. [Plots of covariance error vs. sketch space on the three datasets (R = 8.35, R = 1, R = 90089).]

  12. Experiments: time vs. space. [Plots of update time vs. sketch space on the three datasets (R = 8.35, R = 1, R = 90089).]

  13. Conclusions. First attempt to tackle the sliding-window matrix sketching problem. Lower bounds show that, for matrix sketching, the sliding window model differs from the unbounded streaming model. We propose algorithms for both time-based and sequence-based windows, with theoretical guarantees and experimental evaluation.

  14. Thanks!

  15. Experiments. [Plots on the three datasets (R = 8.35, R = 1, R = 90089).] LM-FD provides better space-error tradeoffs than the sampling algorithms. DI-FD vs. LM-FD: which wins depends on the ratio R. SWOR vs. SWR: depends on the data set.

  16. Experiments. Run the algorithms on real-world matrices; measure actual covariance error, space, and update time. Datasets for sequence-based windows: SYNTHETIC, a random noisy matrix, used by [Liberty2013]; BIBD, the incidence matrix of a Balanced Incomplete Block Design, from Mark Giesbrecht, University of Waterloo; PAMAP, a physical activity monitoring data set.

  17. Sampling-based algorithms. Insight [Frieze2004]: sample each row aᵢ with probability proportional to its squared norm ||aᵢ||² and rescale by the proper factor. Priority sampling [Efraimidis2006]: assign row aᵢ the priority uᵢ^{1/||aᵢ||²} with uᵢ = Rand(0,1); the row with the top-1 priority is a sample with probability proportional to ||aᵢ||². Sample with replacement (SWR): run k independent samplers.
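
The priority-sampling step can be illustrated empirically; a minimal sketch, assuming the u^{1/w} key of [Efraimidis2006], with plain weights standing in for squared row norms:

```python
import random

# Give item i the priority u_i ** (1 / w_i), u_i ~ Uniform(0,1).
# The item with the top-1 priority is then a sample drawn with
# probability proportional to w_i. Empirical check on toy weights:
random.seed(4)
weights = [1.0, 2.0, 7.0]  # stand-ins for squared row norms
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    pr = [random.random() ** (1.0 / w) for w in weights]
    counts[pr.index(max(pr))] += 1

# The weight-7 item holds 7/10 of the total weight, so it should
# take the top priority about 70% of the time.
assert abs(counts[2] / trials - 0.7) < 0.02
```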

  18. Maintaining top-1 priority over a sliding window. Keep only the skyline: rows whose priority exceeds that of every later row; the window's top-1 priority is always on the skyline. For the i-th most recent row, the probability of lying on the skyline is 1/i, so by the harmonic sum 1 + 1/2 + 1/3 + … the expected number of skyline points is O(log N). Space for sampling k rows: O(k log N).
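
The skyline on this slide behaves like the classical monotonic-deque sliding-window maximum; a minimal sketch over scalar priorities (the real algorithm stores rows alongside their priorities; this toy version keeps only the priorities):

```python
from collections import deque

def sliding_window_top1(priorities, N):
    # Keep only (timestamp, priority) pairs whose priority exceeds every
    # later priority; the deque front is the top-1 of the last N items.
    dq = deque()
    out = []
    for t, p in enumerate(priorities):
        while dq and dq[-1][1] <= p:
            dq.pop()          # dominated: a newer item has higher priority
        dq.append((t, p))
        if dq[0][0] <= t - N:
            dq.popleft()      # front has expired from the window
        out.append(dq[0][1])  # current window maximum
    return out

# Window of size 3 over explicit priorities.
assert sliding_window_top1([5, 1, 4, 2, 3, 6], 3) == [5, 5, 5, 4, 4, 6]
```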

  19. Logarithmic Method: the LM-FD algorithm. Works for both time-based and sequence-based windows. Combines FD with the Exponential Histogram [Datar2002]. Mergeability: if B₁ = FD(A₁, ε) and B₂ = FD(A₂, ε), then B = FD([B₁; B₂], ε) is an FD sketch for A = [A₁; A₂]. Merge all blocks of the histogram to form B.
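
Mergeability can be demonstrated with a one-shot Frequent Directions shrink; a minimal sketch on toy data (the `fd` helper and the factor-2 check are illustrative simplifications, not the paper's code):

```python
import numpy as np

def fd(M, ell):
    # One-shot Frequent Directions shrink of M down to ell rows:
    # SVD, subtract the (ell+1)-th squared singular value, truncate.
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    shrunk = np.sqrt(np.maximum(s[:ell] ** 2 - delta, 0.0))
    return np.diag(shrunk) @ Vt[:ell]

rng = np.random.default_rng(5)
A1, A2 = rng.standard_normal((100, 8)), rng.standard_normal((100, 8))
ell = 5

B1, B2 = fd(A1, ell), fd(A2, ell)
B = fd(np.vstack([B1, B2]), ell)  # merge the two sketches

# Mergeability: the merged sketch still satisfies an FD-style
# covariance-error bound for the concatenated input A = [A1; A2].
A = np.vstack([A1, A2])
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
assert err <= 2 * np.linalg.norm(A, "fro") ** 2 / ell
```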

  20. Dyadic Interval: the DI-FD algorithm. Works for sequence-based windows. Combines FD with dyadic interval techniques [Arasu2004]. A window of size N can be decomposed into O(log N) dyadic intervals; sketches at different levels use different error parameters. Combine the O(log N) sketches to form B.
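
The dyadic decomposition can be sketched directly; a minimal helper (the name `dyadic_cover` is hypothetical) that covers an integer interval with aligned power-of-two pieces:

```python
def dyadic_cover(lo, hi):
    # Cover the integer interval [lo, hi) with maximal dyadic intervals
    # [k * 2^j, (k+1) * 2^j); any window of size N needs O(log N) pieces.
    out = []
    while lo < hi:
        # Largest power of two aligned at lo (any power is aligned at 0).
        size = lo & -lo if lo else 1 << (hi - 1).bit_length()
        while size > hi - lo:
            size //= 2
        out.append((lo, lo + size))
        lo += size
    return out

# A window covering positions [3, 17) splits into 4 dyadic pieces.
assert dyadic_cover(3, 17) == [(3, 4), (4, 8), (8, 16), (16, 17)]
```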

  21. Low-rank approximation. A_k = U_k Σ_k V_kᵀ: keep only the top k singular values σ₁ ≥ … ≥ σ_k and the corresponding singular vectors. Applications: principal component analysis (PCA), k-means clustering, latent semantic indexing (LSI).
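
The rank-k truncation on this slide is one NumPy call away; a toy sketch, with the Eckart-Young spectral-norm identity as a check:

```python
import numpy as np

# Best rank-k approximation: keep the top k singular triples.
rng = np.random.default_rng(6)
A = rng.standard_normal((20, 8))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

assert np.linalg.matrix_rank(A_k) == k
# Eckart-Young: the spectral-norm error equals the (k+1)-th singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```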
