Robust High-Dimensional Classification Approaches for Limited Data Challenges
High-dimensional classification with scarce positive examples faces two compounding obstacles: imbalanced class distributions and too little data to estimate minority-class statistics reliably, both of which hinder traditional classifiers. This presentation describes DIRECT, which addresses these obstacles by replacing the skewed sample statistics with a robust covariance estimate and a smoothed kernel distribution, and then optimizing the loss directly on that smoothed distribution, without any sampling. The result is a parameter-free, robust classifier for high-dimensional, limited-data classification problems.
Presentation Transcript
Robust High-Dimensional Classification From Few Positive Examples
Deepayan Chakrabarti (deepay@utexas.edu)
https://github.com/deepayan12/direct
Tumor Classification From Genes
16,063 features (gene expressions). How can we classify tumors from genes?
Document Classification
O(100,000) keyword/bigram features. How can we identify topics for test documents?
Problem: binary, high-dimensional, limited-data, imbalanced classification.
Existing approaches:
- Modify the data: sample, then train. But samples built from limited data can be biased.
- Ensemble methods: many repetitions of the above. These can overfit due to limited data and high dimensionality.
- Cost-sensitive methods: modify the loss. These underperform for limited-size datasets [Cunha+/21].
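For concreteness, here is a minimal sketch of the first and third baselines: oversampling before training, and a cost-sensitive (class-weighted) loss. This is illustrative code assuming a scikit-learn setup; none of it comes from the talk, and the data are synthetic.

```python
# Hypothetical illustration of two standard baselines (not DIRECT itself).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(500, 1000))           # many majority examples
X_min = rng.normal(loc=1.0, size=(5, 1000))    # very few positive examples
X = np.vstack([X_maj, X_min])
y = np.array([0] * 500 + [1] * 5)

# Baseline 1: oversample the minority class, then train. Resampling only
# duplicates the same 5 points, so the sample statistics stay biased.
idx = rng.choice(5, size=500, replace=True)
X_over = np.vstack([X_maj, X_min[idx]])
y_over = np.array([0] * 500 + [1] * 500)
clf_over = LinearSVC(dual=False).fit(X_over, y_over)

# Baseline 3: cost-sensitive loss via class weights (no resampling).
clf_cs = LinearSVC(dual=False, class_weight={0: 1.0, 1: 100.0}).fit(X, y)
```

As the slide notes, both can fail in this regime: the duplicated minority sample carries no new information, and class weights alone underperform on limited data [Cunha+/21].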
Main Idea of DIRECT
Difficulty: minority-class sample statistics are skewed. [Figure: a naive classifier trained on the imbalanced sample learns a skewed separator, far from the optimal separator.]
Main Idea of DIRECT
Difficulty: minority-class sample statistics are skewed, and the sample covariance has too-small or zero variance along some directions [Marchenko+/67]. We need a better proxy for the minority-class distribution: replace the sample covariance with a robust covariance (a more accurate covariance estimate), and build a smoothed kernel distribution on top of it. [Figure: sample covariance vs. robust covariance.]
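The [Ledoit+/04] citation in the Steps slide below suggests the robust estimate is Ledoit-Wolf shrinkage. Here is a minimal sketch under that assumption (the estimator DIRECT actually uses is in the repo):

```python
# Sketch: why the sample covariance fails with few points, and how
# shrinkage [Ledoit+/04] repairs it. Assumes Ledoit-Wolf is the robust
# estimator; see the DIRECT repo for the exact choice.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
n, p = 10, 100                      # 10 minority points, 100 features
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)         # sample covariance: rank <= n-1
print(np.linalg.matrix_rank(S))     # ~9: zero variance in most directions

lw = LedoitWolf().fit(X)            # shrink S toward a scaled identity
print(lw.shrinkage_)                # shrinkage intensity chosen optimally
print(np.all(np.linalg.eigvalsh(lw.covariance_) > 0))  # True: full rank
```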
Main Idea of DIRECT
With the smoothed kernel distribution in hand, optimize the loss directly on the smoothed distribution (no sampling) to obtain a robust classifier. [Figure: DIRECT's separator, close to the optimal separator.]
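To make "optimize directly on the smoothed distribution, no sampling" concrete: if the smoothed kernel is the max-entropy distribution given a mean and covariance (i.e., a Gaussian, as the Steps slide below states), the expected hinge loss of a linear separator has a closed form. The derivation below is mine, offered as a hedged illustration rather than DIRECT's exact objective.

```python
# Sketch of a Gaussian-smoothed hinge loss. Assumes the smoothed kernel
# is N(mu, Sigma); this illustrates the closed form, not DIRECT's
# published objective.
import numpy as np
from scipy.stats import norm

def smoothed_hinge(w, mu, Sigma, y=1.0):
    """E[max(0, 1 - y*w.x)] for x ~ N(mu, Sigma).

    1 - y*w.x is Gaussian with mean m = 1 - y*w.mu and std s = sqrt(w'Sigma w),
    and E[max(0, Z)] = m*Phi(m/s) + s*phi(m/s) for Z ~ N(m, s^2).
    """
    m = 1.0 - y * (w @ mu)
    s = np.sqrt(w @ Sigma @ w) + 1e-12  # guard against s = 0
    return m * norm.cdf(m / s) + s * norm.pdf(m / s)
```

Because the hinge is convex in w for every x and the smoothing distribution does not depend on w, this expectation is itself convex in w.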
Steps (parameter-free; no user inputs needed):
1. Robust covariance: its parameters can be chosen optimally [Ledoit+/04].
2. Smoothed kernel distribution: the max-entropy distribution given the mean and covariance.
3. Direct optimization: optimizing the expected loss over the smoothed kernel is closed-form and convex for the hinge loss.
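Putting the three steps together, a speculative end-to-end sketch: robust covariance, Gaussian smoothing, and direct minimization. The objective below (minority smoothed hinge + majority empirical hinge + L2 penalty) is my guess at one plausible instantiation, not the published one; see the repo for the real implementation. It reuses smoothed_hinge from the sketch above.

```python
# Hypothetical end-to-end sketch of the three steps; not the reference
# implementation (see https://github.com/deepayan12/direct for that).
import numpy as np
from scipy.optimize import minimize
from sklearn.covariance import LedoitWolf

def fit_direct_like(X_maj, X_min, reg=0.1):
    # Step 1: robust covariance of the minority class [Ledoit+/04].
    lw = LedoitWolf().fit(X_min)
    mu, Sigma = X_min.mean(axis=0), lw.covariance_

    # Steps 2-3: minimize smoothed hinge (minority, y = +1) plus
    # empirical hinge (majority, y = -1); convex, so any minimizer works.
    def objective(w):
        minority = smoothed_hinge(w, mu, Sigma, y=1.0)
        majority = np.maximum(0.0, 1.0 + X_maj @ w).mean()
        return minority + majority + reg * (w @ w)

    w0 = np.zeros(X_maj.shape[1])
    return minimize(objective, w0, method="L-BFGS-B").x
```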
Experiments
Datasets: 1 medical (16K features), 2 image, and 5 text (10K-100K features).
Metric: area under the precision-recall curve.
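For reference, this metric can be computed with scikit-learn's average_precision_score, a standard estimator of the area under the precision-recall curve (a sketch; the labels and scores below are made up):

```python
# Sketch: area under the precision-recall curve via average precision.
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 0, 1]              # ground-truth labels (1 = minority)
scores = [-0.2, 0.1, 0.9, -0.5, 0.4]  # real-valued classifier scores
print(average_precision_score(y_true, scores))
```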
Conclusions
For binary, high-dimensional, limited-data, imbalanced classification, DIRECT is fast, parameter-free, and accurate.
https://github.com/deepayan12/direct