Determining Email Spam using Statistical Analysis and Machine Learning

The presentation addresses classifying emails as spam or ham (legitimate mail) from the frequencies of words in each email, using the Spambase data set. Logistic regression and linear discriminant analysis (LDA) are fitted and evaluated with 10-fold cross-validation, and both reach a mean error rate of roughly 10%. References on the topic are also provided.


Uploaded on Sep 30, 2024


Presentation Transcript


  1. Can We Determine Whether an Email is SPAM? XINYUE LIU

  2. The Spambase Data Set. Goal: classify spam from ham based on the frequencies of words in the email.
     o Source and Origin
     o Goal
     o Instances and Attributes
     o Examples
     o Tool

  3. Logistic Regression. Linear regression: assign weights to each of the predictors that minimize the classification error. Logistic regression.
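
The formula for this slide did not survive in the transcript. As a hedged sketch, the logistic regression model is commonly written as follows, where the notation ($X_1, \dots, X_p$ for the word-frequency predictors, $\beta_j$ for the weights, $Y = 1$ for spam) is an assumption and not taken from the slide:

```latex
\Pr(Y = 1 \mid X) = p(X)
  = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
         {1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}},
\qquad
\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p .
```

The second form shows why the model is still "linear": the weights act linearly on the log-odds of spam, while the predicted probability stays between 0 and 1.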

  4. Linear Discriminant Analysis (LDA) Bayes Theorem:
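
The equation on this slide is missing from the transcript. A sketch of the usual textbook form, assuming $K$ classes with prior probabilities $\pi_k$ and class densities $f_k(x)$ (symbols chosen here for illustration, not taken from the slide):

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} .
```

If each $f_k$ is taken to be Gaussian with mean $\mu_k$ and a covariance matrix $\Sigma$ shared by all classes, which is the standard LDA assumption, the class with the largest posterior is the one with the largest linear discriminant score:

```latex
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
  - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k .
```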

  5. 10-fold Cross-Validation
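
The slide gives no detail on the procedure, so the following is only a minimal sketch of 10-fold cross-validation using the Python standard library. The helper names, the synthetic one-feature data, and the nearest-class-mean classifier are all hypothetical stand-ins; the presentation itself used LDA and logistic regression on Spambase, with a tool that the transcript does not name.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(X, y, fit, predict, k=10, seed=0):
    """Return the misclassification rate on each of the k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit on the other k-1 folds, then score on the held-out fold only.
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return errors


# Hypothetical stand-in classifier: nearest class mean on a single feature.
# This is NOT LDA or logistic regression; it only makes the sketch runnable.
def fit_centroid(X, y):
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means


def predict_centroid(means, x):
    return min(means, key=lambda label: abs(x - means[label]))


# Synthetic data (an assumption): label 0 plays "ham", label 1 plays "spam".
rng = random.Random(1)
X = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(2.0, 1.0) for _ in range(200)]
y = [0] * 200 + [1] * 200

fold_errors = cross_val_error(X, y, fit_centroid, predict_centroid)
mean_error = sum(fold_errors) / len(fold_errors)
print(f"mean 10-fold error rate: {mean_error:.3f}")
```

Each email is held out exactly once, so every error is measured on data the model did not see during fitting. Swapping in another classifier only requires a different `fit`/`predict` pair.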

  6. Conclusion. Linear Discriminant Analysis (LDA) and Logistic Regression, mean error rate with 96% CI: 10.6%, 11.1%, 9.9%, 10.5%. We can filter about 90% of spam emails using LDA.
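
The transcript does not say how the confidence intervals were computed. One common approach, shown here only as a sketch, is a normal-approximation interval around the mean of the per-fold error rates. The function name and the ten fold error rates below are hypothetical and are not the presentation's results.

```python
from statistics import NormalDist, mean, stdev


def mean_error_ci(fold_errors, level=0.96):
    """Normal-approximation confidence interval for the mean fold error rate."""
    m = mean(fold_errors)
    # Standard error of the mean across folds (sample standard deviation).
    se = stdev(fold_errors) / len(fold_errors) ** 0.5
    # Two-sided critical value, e.g. the 0.98 quantile for a 96% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return m, m - z * se, m + z * se


# Hypothetical per-fold error rates, invented for illustration only.
fold_errors = [0.10, 0.11, 0.09, 0.12, 0.10, 0.11, 0.10, 0.09, 0.11, 0.10]

m, lower, upper = mean_error_ci(fold_errors)
print(f"mean error {m:.3f}, 96% CI ({lower:.3f}, {upper:.3f})")
```

Because the folds share most of their training data, their error rates are not independent, so an interval of this kind is best read as a rough indication of variability.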

  7. References
     o Trevor Hastie, Rob Tibshirani. "Statistical Learning." Statistical Learning. Stanford University Online CourseWare, 21 January 2014. Lecture. <http://online.stanford.edu/course/statistical-learning-winter-2014>.
     o Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. "Spambase Data Set." Spambase Data Set. Hewlett-Packard Labs, 1 July 1999. Web. 1 Mar. 2015. <http://archive.ics.uci.edu/ml/datasets/Spambase>.
     o "Logistic Regression." Logistic Regression. N.p., n.d. Web. 5 May 2015. <http://www.saedsayad.com/logistic_regression.htm>.
     o "Binary Classification." Linear Discriminant Analysis Classifier (LDAC). N.p., n.d. Web. 5 May 2015. <http://mlpy.sourceforge.net/docs/3.5/lin_class.html>.
     o Kaewchinporn, Chinnapat. "10-fold Cross-Validation." K-fold Cross-validation. N.p., n.d. Web. 5 May 2015. <http://scriptslines.com/blog/k-fold-cross-validation/>.

  8. Thank you!
