Understanding Data Cleaning in Machine Learning


This lecture covers data cleaning for machine learning: loss minimization, gradient descent and stochastic methods, and the effect of noise in training data. The sections recap ML models, examine training under noise, and present methods for handling data drawn from a different distribution than the one we want to optimize for.





Presentation Transcript


  1. Lecture 15: Data Cleaning for ML

  2. Announcements
     1. Intermediate report due right after spring break: 4/3
     2. Required elements (https://cs.stanford.edu/people/widom/paper-writing.html): Introduction; Related work; Outline of contribution and technique; Experimental setup. Submission folder: https://drive.google.com/drive/folders/1tdPe6H77DL0lHb7JOvl2In-xc5Cahx3u
     3. Project meetings next week: send me emails if you want to meet

  3. Today's Agenda: 1. Recap on ML models 2. Training under noise

  4. Section 1: Recap on ML models

  5. Section 1: What is ML all about? Minimization of a modular loss, i.e., a loss that decomposes as a sum of per-example terms. Example for a linear model.
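
The slide body is mostly lost to extraction; as a minimal sketch (all names illustrative), a modular loss for a linear model with squared loss decomposes as a sum of per-example terms:

```python
import numpy as np

def linear_model_loss(theta, X, y):
    """Modular (sum-decomposable) squared loss for a linear model.

    The total loss is a sum of per-example terms
    phi_i = 0.5 * (x_i^T theta - y_i)^2, which is what lets
    gradient methods work on one example (or record) at a time.
    """
    residuals = X @ theta - y            # per-example prediction error
    return float(np.sum(0.5 * residuals ** 2))
```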

  6. Section 1: Gradient Descent [Cauchy 1847]
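
A minimal batch gradient descent sketch on the squared loss above; the step size and iteration count are illustrative tuning knobs, not values from the lecture:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent: each step uses the gradient over all data."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)   # gradient of 0.5 * ||X theta - y||^2
        theta -= lr * grad
    return theta
```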

  7. Section 1: Stochastic Methods
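
The stochastic counterpart, assuming the simplest variant (one uniformly sampled example per step); the sampled gradient is an unbiased estimate of the full gradient, trading higher variance for much cheaper steps:

```python
import numpy as np

def sgd(X, y, lr=0.01, n_iters=10_000, seed=0):
    """Stochastic gradient descent with a single sampled example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)                     # uniform random example
        grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient on example i only
        theta -= lr * grad_i
    return theta
```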

  8. Section 1: Convergence Rate and Computational Complexity
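
The slide body is not preserved; for orientation, the standard textbook rates for a convex, L-smooth objective (stated here as general background, not as the slide's content) are:

```latex
% Batch GD: O(1/k) suboptimality at O(nd) cost per step.
% SGD (with averaging): O(1/sqrt(k)) suboptimality at O(d) cost per step.
\[
  \text{GD: } f(\theta_k) - f^\star = O\!\left(\tfrac{1}{k}\right),
  \qquad
  \text{SGD: } \mathbb{E}\big[f(\bar{\theta}_k)\big] - f^\star
    = O\!\left(\tfrac{1}{\sqrt{k}}\right)
\]
```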

  9. Section 2: Training under noise

  10. Section 2: What is the problem with noise? Training on noisy records means optimizing for data obtained from a different distribution than the clean one, so the empirical risk being minimized differs from the risk we actually care about.
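
A hedged formalization of the point, in notation of my own choosing (R for the clean records we wish we had, R' for the dirty records actually observed):

```latex
% Minimizing the empirical risk over dirty data R' is not the same
% objective as the empirical risk over the clean data R.
\[
  \hat{R}_{\text{dirty}}(\theta)
    = \frac{1}{|R'|} \sum_{r \in R'} \phi(x_r, y_r; \theta)
  \;\neq\;
  \hat{R}_{\text{clean}}(\theta)
    = \frac{1}{|R|} \sum_{r \in R} \phi(x_r, y_r; \theta)
\]
```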

  11. Section 2: How can we deal with noise?

  12. Section 2: How can we deal with noise?

  13. Section 2: Model update
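
The slide body is lost; purely as an illustrative sketch, an update of the kind this setting suggests mixes the exact gradient over already-cleaned records with an estimated gradient over still-dirty ones (the weighting and all names are my assumptions, not the lecture's definitions):

```python
def model_update(theta, grad_clean, grad_dirty_est,
                 n_clean, n_dirty, lr=0.01):
    """One gradient step mixing exact and estimated gradients.

    grad_clean:     average gradient over already-cleaned records (exact)
    grad_dirty_est: estimated average gradient over still-dirty records
    The two are weighted by the fraction of records in each set.
    """
    n = n_clean + n_dirty
    g = (n_clean / n) * grad_clean + (n_dirty / n) * grad_dirty_est
    return theta - lr * g
```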

  14. Section 2: Estimating the gradient

  15. Section 2: Detecting Dirty Data. The detector returns whether a record is dirty and, if it is, which attributes have errors. Enumerate the set of records that violate at least one rule: clean data = the union of the already-clean set and the records that satisfy all rules; dirty = records that violate at least one rule. Adaptive methods for detection = train a classifier.
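
A small sketch of the rule-based detector just described; the rule representation (predicate plus flagged attributes) is an assumption for illustration:

```python
def detect_dirty(records, rules):
    """Rule-based detector.

    rules: list of (predicate, attrs) pairs, where predicate(record)
    is True when the record satisfies the rule and attrs names the
    attributes that rule constrains.
    Returns record id -> attributes with errors ([] means clean).
    """
    report = {}
    for rid, rec in enumerate(records):
        flagged = set()
        for predicate, attrs in rules:
            if not predicate(rec):       # rule violated -> record is dirty
                flagged.update(attrs)
        report[rid] = sorted(flagged)
    return report

# Hypothetical rules: a not-null check and a range check on "age".
rules = [
    (lambda r: r.get("age") is not None, ["age"]),
    (lambda r: r.get("age") is None or 0 <= r["age"] <= 120, ["age"]),
]
print(detect_dirty([{"age": 30}, {"age": -5}], rules))
# {0: [], 1: ['age']}
```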

  16. Section 2: Selecting which records to clean. This is a sampling problem: minimize the variance of the sampled gradient, and use a detector to estimate the cleaned values.
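
A hedged sketch of the sampling step, assuming per-record gradient-norm estimates are available; probabilities proportional to the gradient norm are the standard importance-sampling choice that minimizes the variance of the resulting gradient estimate:

```python
import numpy as np

def sample_to_clean(grad_norm_estimates, k, seed=0):
    """Pick k dirty records to clean, with probability proportional
    to each record's estimated gradient magnitude."""
    rng = np.random.default_rng(seed)
    p = np.asarray(grad_norm_estimates, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(p), size=k, replace=False, p=p)
    return idx, p[idx]   # chosen record ids and their sampling probabilities
```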

  17. Section 2: Selecting which records to clean. Estimator: estimate the clean gradient using the dirty gradient and previous cleaning actions. Linear approximation of the gradient: uses the average change of each feature value.
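
A sketch of such a linear approximation; the sensitivity matrix M and every name here are hypothetical stand-ins, not definitions from the lecture:

```python
import numpy as np

def estimate_clean_gradient(grad_dirty, M, avg_feature_change):
    """First-order (Taylor-style) correction of a dirty gradient.

    grad_dirty:         gradient computed on the dirty record(s), shape (d,)
    M:                  assumed d x d sensitivity of the gradient to feature
                        changes, fit from previous cleaning actions
    avg_feature_change: average observed change of each feature value when
                        a record was cleaned, shape (d,)
    """
    # g_clean ~= g_dirty + M @ (x_clean - x_dirty)
    return grad_dirty + M @ avg_feature_change
```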

  18. Section 3: Strengths? Weaknesses? Discussion time!
