About Data Analysis

Delve into the world of data analysis with insights on data processing pipelines, elementary feature engineering, variable transformation, discretization, missing data imputation, categorical encoding, outlier removal, and date/time engineering. Explore various methods to enhance data quality and optimize predictive modeling.

  • Data Analysis
  • Data Processing
  • Feature Engineering
  • Transformation Techniques
  • Data Imputation

Uploaded on Feb 15, 2025



Presentation Transcript


  1. About Data Analysis

  2. Data Processing Pipeline: define problem → acquire data → data cleaning → data preparation → feature engineering → model training/validation → prediction/interpretation → solution.

  3. Elementary Feature Engineering. Feature Scaling:
  • Standardization: (x - mean)/sd
  • Min-Max Scaling: (x - min)/(max - min)
  • Mean Scaling: (x - mean)/(max - min)
  • Max Absolute Scaling: x/max(|x|)
  • Unit-Norm Scaling: x/||x||
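The scaling formulas above can be sketched in a few lines of NumPy; the sample vector `x` here is purely illustrative:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization: (x - mean) / sd -> zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# Min-max scaling: (x - min) / (max - min) -> maps onto [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Mean scaling: (x - mean) / (max - min) -> zero mean, range-normalized
mean_scaled = (x - x.mean()) / (x.max() - x.min())

# Max-absolute scaling: x / max(|x|) -> largest magnitude becomes 1
maxabs = x / np.abs(x).max()

# Unit-norm scaling: x / ||x|| -> vector of Euclidean length 1
unit_norm = x / np.linalg.norm(x)
```

In practice the scikit-learn transformers (StandardScaler, MinMaxScaler, MaxAbsScaler, Normalizer) do the same arithmetic while remembering the training-set statistics for reuse on new data.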

  4. Variable Transformation:
  • Logarithm: log(x)
  • Reciprocal: 1/x
  • Square root: sqrt(x)
  • Exponential: exp(x)
  • Box-Cox (Yeo-Johnson): power transformation
  General theme: bring the variable into a more workable form without affecting monotonicity.
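A short sketch of these transformations, using SciPy's `stats.boxcox` for the power transformation (the strictly positive sample values are illustrative):

```python
import numpy as np
from scipy import stats

# Strictly positive, right-skewed sample (log/Box-Cox require x > 0)
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

log_x = np.log(x)      # logarithm
recip_x = 1.0 / x      # reciprocal (monotone decreasing)
sqrt_x = np.sqrt(x)    # square root
exp_x = np.exp(x)      # exponential

# Box-Cox searches for the power lambda that makes x most Gaussian-like;
# stats.yeojohnson extends the idea to zero and negative values.
bc_x, lmbda = stats.boxcox(x)
```

Each of these maps is monotone on its domain, so the ordering of observations is preserved, which is the "general theme" noted above.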

  5. Discretization:
  • Equal-frequency discretization
  • Equal-length (equal-width) discretization
  • Discretization with trees
  • Discretization with ChiMerge
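The first two schemes map directly onto pandas helpers: `pd.cut` produces equal-width bins and `pd.qcut` produces equal-frequency (quantile) bins. A minimal sketch with illustrative data:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 10, 20, 50, 100, 200])

# Equal-width: 4 bins whose ranges have identical length
equal_width = pd.cut(x, bins=4, labels=False)

# Equal-frequency: 4 quantile bins with roughly the same number of points
equal_freq = pd.qcut(x, q=4, labels=False)
```

On skewed data like this, equal-width binning piles most points into the first bin, while equal-frequency binning keeps bin populations balanced; that contrast is usually the deciding factor between the two.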

  6. Missing Data Imputation:
  • Complete case analysis
  • Mean / median / mode imputation
  • Random sample imputation
  • Replacement by arbitrary value
  • Missing-value indicator
  • Multivariate imputation
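Most of these strategies are one-liners in pandas; a sketch on an illustrative series with two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

complete_case = s.dropna()            # complete case analysis: drop missing rows
mean_imputed = s.fillna(s.mean())     # mean imputation
median_imputed = s.fillna(s.median()) # median imputation
arbitrary = s.fillna(-999)            # replacement by an arbitrary sentinel value
indicator = s.isna().astype(int)      # missing-value indicator feature

# Random sample imputation: fill each NaN with a draw from the observed values
rng = np.random.default_rng(0)
sampled = s.copy()
sampled[s.isna()] = rng.choice(s.dropna().to_numpy(), size=s.isna().sum())
```

For multivariate imputation, scikit-learn's IterativeImputer (a MICE-style approach) models each feature with missing values as a function of the others.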

  7. Categorical Encoding:
  • One-hot encoding
  • Count and frequency encoding
  • Target encoding / mean encoding
  • Ordinal encoding
  • Weight of evidence
  • Rare label encoding
  • BaseN, feature hashing, and others
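The three most common schemes can be sketched with plain pandas (the tiny `city`/`target` frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "LA", "NY", "SF", "LA", "NY"],
    "target": [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace each category with its frequency count
counts = df["city"].map(df["city"].value_counts())

# Target (mean) encoding: replace each category with the mean of the target
target_means = df.groupby("city")["target"].mean()
encoded = df["city"].map(target_means)
```

Target encoding computed naively like this leaks the label into the feature; in practice it is fitted on out-of-fold data. The category_encoders package offers weight-of-evidence, BaseN, and hashing encoders.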

  8. Outlier Removal:
  • Removing outliers
  • Treating outliers as NaN
  • Capping (winsorization)
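All three treatments can be sketched with the common 1.5 × IQR fence (the sample array, with 100.0 as the obvious outlier, is illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # 100.0 is an outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1) Removing outliers: keep only in-fence points
removed = x[(x >= lower) & (x <= upper)]

# 2) Treating outliers as NaN (e.g., to impute them later)
as_nan = np.where((x < lower) | (x > upper), np.nan, x)

# 3) Capping / winsorization: clip values to the fences
capped = np.clip(x, lower, upper)
```

Percentile-based winsorization (e.g., clipping at the 1st and 99th percentiles) follows the same pattern with different bounds.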

  9. Date and Time Engineering:
  • Extracting days, months, years, quarters, time elapsed
  • Feature creation: sum, difference, mean, min, max, product, quotient of a group of features
  • Aggregating transaction data: same as above, but on one feature over a time window
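Date-part extraction is a direct application of the pandas `.dt` accessor; a sketch with three illustrative dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-30", "2024-12-01"]))

days = dates.dt.day          # day of month
months = dates.dt.month      # month number
years = dates.dt.year        # year
quarters = dates.dt.quarter  # calendar quarter (1-4)

# Time elapsed: days from each date to the most recent date in the column
elapsed = (dates.max() - dates).dt.days
```

For transaction aggregation, the same statistics are typically computed per entity with `groupby` plus `rolling` over the time window.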

  10. More Advanced Feature Engineering:
  • Bin counting
  • Mathematical transforms: Fourier, Laplace, wavelet, spectral
  • Discretization with trees
  • Ranking, differencing, returns, moving average, recency-weighted moving average
  • Multivariate imputation (MICE)
  • Use of correlation, covariance
  • Mutual information
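The time-series items (ranking, differencing, returns, moving averages) map onto pandas one-liners; the short price series is illustrative:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

diff = prices.diff()                          # differencing: x_t - x_{t-1}
returns = prices.pct_change()                 # simple returns: x_t / x_{t-1} - 1
moving_avg = prices.rolling(window=3).mean()  # 3-period moving average
ewm_avg = prices.ewm(span=3).mean()           # recency-weighted (exponential) moving average
rank = prices.rank()                          # rank of each value within the series
```

Differencing and returns both lose their first observation (NaN), and the rolling mean loses `window - 1`; these gaps need handling before modeling.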

  11. Similarity measures; use of the first few layers of neural networks; autoencoder, VAE, stacked autoencoder, restricted Boltzmann machine; matrix factorization.

  12. Use of cluster membership (e.g., from K-means) or distances to cluster centroids; meta-features (R-pub functions); data augmentation; representation learning.
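Cluster-based features can be sketched with scikit-learn's KMeans: `labels_` gives the membership feature and `transform` gives the distance of each point to every cluster centroid (the two synthetic blobs are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 20 points each
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

membership = km.labels_      # cluster id: usable as a categorical feature
distances = km.transform(X)  # distance to each centroid: usable as numeric features
```

Both outputs are then appended to the original feature matrix; the distance columns often carry more signal than the hard membership label.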

  13. General Tips for Feature Engineering:
  • Use complex models for feature construction, and use simple models for the final stacking.
  • Data understanding + creativity = good feature engineering.
  • Lots of ideas (good or bad) + quick, efficient iteration and checking + double-checking the data points where your predictions are poor.
  • One important, sound expert idea => construct a feature to reflect it => verify its goodness.
  • Learn Featuretools.
  • Meta-features (check R-pub).

  14. Short List of Some ML/Stat Models. Supervised Learning:
  • Linear regression
  • Nonlinear regression: local polynomial, spline
  • Generalized linear models (e.g., logistic regression)
  • KNN
  • Naive Bayes
  • LDA, QDA, ...
  • SVM
  • Tree-based methods (e.g., decision trees, random forest)
  • Ensembles (e.g., bagging, boosting, stacking/blending)
  • Neural nets

  15. Unsupervised Learning: PCA, ICA, ...; factor analysis; cluster analysis; representation learning.

  16. A Few Key Techniques:
  • Bayesian methods
  • Model checking / model selection / model combination
  • Validation / cross-validation
  • Regularization: lasso, ridge, slow learning, early stopping, dropout
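Lasso (L1) and ridge (L2) regularization can be sketched with scikit-learn; the synthetic data, where only the first two of ten features matter, is illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 1 carry signal; the other eight are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# L1 penalty: drives irrelevant coefficients to (near) zero -> feature selection
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 penalty: shrinks all coefficients smoothly, none exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
```

Inspecting `lasso.coef_` versus `ridge.coef_` shows the qualitative difference: lasso zeroes out the noise features, while ridge merely shrinks them. The penalty strength `alpha` is normally chosen by cross-validation (LassoCV, RidgeCV).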
