Determining Email Spam using Statistical Analysis and Machine Learning
This presentation classifies emails as spam or ham by analyzing word frequencies. Logistic Regression and Linear Discriminant Analysis (LDA) are fit to the Spambase data and evaluated with 10-fold Cross-Validation. Both models show promising results, with a mean error rate of around 10%. References to resources on the topic are also provided.
Presentation Transcript
Can We Determine Whether an Email is SPAM?
XINYUE LIU
The Spambase Data Set
Goal: classify emails as spam or ham based on the frequencies of words in the email.
o Source and Origin
o Goal
o Instances and Attributes
o Examples
o Tool
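As an illustration of what a word-frequency attribute looks like, the sketch below computes the percentage of words in an email that match a given word, which is how the UCI documentation describes the Spambase `word_freq_*` attributes. The function name `word_freq`, the tokenizer, and the sample sentence are assumptions made for this example and are not taken from the presentation.

```python
import re

def word_freq(email_text, word):
    """Percentage of words in email_text that equal word (case-insensitive)."""
    # Assumption: a "word" is any run of letters or digits.
    tokens = re.findall(r"[A-Za-z0-9]+", email_text.lower())
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)

# Made-up sample email, for illustration only.
sample = "Free money! Claim your free prize now"
features = {w: word_freq(sample, w) for w in ("free", "money", "meeting")}
print(features)
```

Each email then becomes a vector of such percentages, one per tracked word, and the classifiers work on those vectors.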
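Logistic Regression
Linear regression: assign a weight to each predictor so as to minimize the classification error.
Logistic regression: pass the weighted sum of the predictors through the logistic function to obtain the probability that an email is spam.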
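The slide's formula did not survive in the transcript, so the following is only a minimal sketch of logistic regression, not the presenter's implementation. It fits the weights by plain gradient descent on the log loss, which is the usual fitting criterion; the learning rate, the number of epochs, and the four toy data points are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    # Probability of spam: logistic function of the weighted sum.
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    # Batch gradient descent on the mean log loss.
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Made-up toy data: [frequency of "free", frequency of "meeting"]; 1 = spam.
X = [[3.0, 0.0], [2.5, 0.1], [0.0, 2.0], [0.2, 1.5]]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
preds = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
print(weights, bias, preds)
```

An email is labeled spam when its predicted probability is at least 0.5; a different threshold trades missed spam against misfiled ham.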
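Linear Discriminant Analysis (LDA)
Bayes' theorem: LDA assigns an email to the class (spam or ham) with the highest posterior probability given its word frequencies.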
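The formula on this slide is missing from the transcript. A standard statement of Bayes' theorem for classification is given below; the notation follows common textbook usage and is an assumption about what the slide showed.

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}
```

Here $\pi_k$ is the prior probability of class $k$, $f_k(x)$ is the density of the predictors $x$ within class $k$, and $K = 2$ for spam versus ham. LDA models each $f_k$ as a Gaussian with a class-specific mean and a covariance matrix shared by all classes, which makes the decision boundary linear in $x$.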
10-fold Cross-Validation
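The slide gives only the title, so here is a rough sketch of the procedure under stated assumptions: the data are shuffled and split into 10 folds, each fold is held out once while the model is fit on the other nine, and the misclassification rate on the held-out fold is recorded. The helper names `k_fold_indices` and `cross_val_error`, the majority-class stand-in classifier, and the toy data are all hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle 0..n_samples-1 and split into k folds of nearly equal size."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i; sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_val_error(X, y, fit, predict, k=10, seed=0):
    """Return the held-out misclassification rate for each of the k folds."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return errors

# Stand-in classifier: always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

# Made-up toy data: 40 one-feature samples, half labeled spam (1).
X = [[i] for i in range(40)]
y = [1 if i >= 20 else 0 for i in range(40)]
errors = cross_val_error(X, y, fit_majority, predict_majority)
print(errors)
```

Any model can be plugged in through the `fit` and `predict` arguments; averaging the ten fold errors gives the cross-validated error estimate.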
Conclusion
Linear Discriminant Analysis (LDA) and Logistic Regression, mean error rate with 96% CI: 10.6%, 11.1%, 9.9%, 10.5%
With a mean error rate of about 10%, LDA classifies about 90% of emails correctly.
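The presentation does not say how its intervals were computed. One common approach, sketched here as an assumption, is a normal-approximation interval around the mean of the per-fold error rates. The function name `mean_with_ci` and the ten fold errors are made up for illustration and are not the presentation's results.

```python
from statistics import NormalDist, mean, stdev

def mean_with_ci(fold_errors, level=0.96):
    """Mean of the fold errors with a normal-approximation confidence interval."""
    m = mean(fold_errors)
    # Two-sided critical value, e.g. about 2.05 for a 96% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(fold_errors) / len(fold_errors) ** 0.5
    return m, m - half_width, m + half_width

# Hypothetical per-fold error rates from a 10-fold run.
fold_errors = [0.08, 0.12, 0.10, 0.09, 0.11, 0.10, 0.07, 0.13, 0.09, 0.11]
m, lo, hi = mean_with_ci(fold_errors)
print(f"mean error {m:.3f}, 96% CI ({lo:.3f}, {hi:.3f})")
```

With only ten folds, a Student's t critical value would give a somewhat wider interval than the normal approximation used here.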
References
Trevor Hastie, Rob Tibshirani. "Statistical Learning." Statistical Learning. Stanford University Online CourseWare, 21 January 2014. Lecture. <http://online.stanford.edu/course/statistical-learning-winter-2014>.
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. "Spambase Data Set." Spambase Data Set. Hewlett-Packard Labs, 1 July 1999. Web. 1 Mar. 2015. <http://archive.ics.uci.edu/ml/datasets/Spambase>.
"Logistic Regression." Logistic Regression. N.p., n.d. Web. 5 May 2015. <http://www.saedsayad.com/logistic_regression.htm>.
"Binary Classification." Linear Discriminant Analysis Classifier (LDAC). N.p., n.d. Web. 5 May 2015. <http://mlpy.sourceforge.net/docs/3.5/lin_class.html>.
Kaewchinporn, Chinnapat. "10-fold Cross-Validation." K-fold Cross-validation. N.p., n.d. Web. 5 May 2015. <http://scriptslines.com/blog/k-fold-cross-validation/>.