Determining Email Spam using Statistical Analysis and Machine Learning

Slide Note

The presentation asks whether emails can be classified as spam or ham (legitimate mail) from the frequencies of words they contain, using the Spambase data set. Two classifiers, Logistic Regression and Linear Discriminant Analysis (LDA), are fitted and their error rates are estimated with 10-fold cross-validation. Both reach a mean error rate of around 10%, which suggests that about 90% of spam can be filtered out. A list of references on the topic is also provided.

Uploaded by stavros_s on Sep 30, 2024


Presentation Transcript

Can We Determine Whether an Email is SPAM?
XINYUE LIU

The Spambase Data Set
Goal: classify spam from ham based on the frequencies of words in the email.
o Source and Origin
o Goal
o Instances and Attributes
o Examples
o Tool

Logistic Regression
Linear Regression: assign a weight to each of the predictors so as to minimize the classification error.
Logistic regression
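The formula on this slide is not preserved in the transcript. As a rough illustration only, the usual logistic regression model passes a weighted sum of the predictors through the sigmoid function to get a spam probability, and the weights are fitted by gradient descent on the log-loss. The sketch below is not the author's implementation: the function names, the learning rate, and the two-feature toy data (frequencies of the words "free" and "meeting") are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Estimated probability that feature vector x belongs to class 1 (spam)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical toy data, NOT Spambase: [freq. of "free", freq. of "meeting"]
X = [[0.9, 0.0], [0.7, 0.1], [0.8, 0.0], [0.0, 0.6], [0.1, 0.8], [0.0, 0.9]]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

weights, bias = fit_logistic(X, y)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
```

An email is labelled spam when its estimated probability is at least 0.5. On real data a library implementation would normally be used instead of this hand-written loop.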

Linear Discriminant Analysis (LDA) Bayes Theorem:
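The formula on this slide is also missing from the transcript. For orientation, the standard textbook statement is given here; it may not match the slide's notation. Bayes' theorem gives the posterior probability of class k (spam or ham) given the predictors x, and LDA, which assumes Gaussian class densities with a shared covariance matrix, assigns x to the class with the largest linear discriminant score. The symbols are assumptions of this sketch: \pi_k is the prior probability of class k, f_k its density, \mu_k its mean vector, and \Sigma the common covariance matrix.

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},
\qquad
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k
```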

10-fold Cross-Validation
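The slide content is not preserved, so the following is only a sketch of the general procedure: shuffle the data, split it into 10 disjoint folds, and in turn hold out each fold for testing while training on the other nine. To keep the example self-contained it uses a nearest-centroid classifier as a stand-in (it is neither LDA nor logistic regression) and synthetic two-feature data; every name and value here is an assumption for illustration.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Return k (train, test) index pairs with disjoint test folds covering 0..n-1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    pairs = []
    for fold in folds:
        held_out = set(fold)
        pairs.append(([i for i in idx if i not in held_out], fold))
    return pairs

def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_val_errors(X, y, k=10, seed=0):
    """Misclassification rate on each held-out fold."""
    errors = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict_centroid(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return errors

# Synthetic data, NOT Spambase: two overlapping Gaussian clusters.
rng = random.Random(1)
X = [[rng.gauss(1.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(100)]
X += [[rng.gauss(0.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(100)]
y = [1] * 100 + [0] * 100

fold_errors = cross_val_errors(X, y)
mean_error = sum(fold_errors) / len(fold_errors)
```

Each observation is tested exactly once, by a model that never saw it during training, so the average of the ten fold errors estimates the error rate on new emails.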

Conclusion
Methods compared: Linear Discriminant Analysis (LDA) and Logistic Regression
Mean Error Rate with 96% CI for each method (values in the order they appear on the slide): 10.6%, 11.1%, 9.9%, 10.5%
We can filter about 90% of spam emails using LDA.
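The transcript does not say how the confidence intervals were computed. One common approach, shown here purely as an assumption, is to treat the ten per-fold error rates as a sample and form a normal-approximation interval around their mean. The fold values below are made up for the example and are not the presentation's results; the helper name `mean_with_ci` is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_ci(values, level=0.96):
    """Mean of values with a two-sided normal-approximation confidence interval."""
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width

# Hypothetical per-fold error rates (illustrative only).
fold_errors = [0.20, 0.25, 0.15, 0.20, 0.30, 0.20, 0.10, 0.25, 0.15, 0.20]

m, lower, upper = mean_with_ci(fold_errors)
print(f"mean error {m:.3f}, 96% CI ({lower:.3f}, {upper:.3f})")
```

With only ten folds, a t-distribution critical value would give a somewhat wider interval than the normal approximation used in this sketch.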

References
Trevor Hastie and Rob Tibshirani. "Statistical Learning." Stanford University Online CourseWare, 21 January 2014. Lecture. <http://online.stanford.edu/course/statistical-learning-winter-2014>
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. "Spambase Data Set." Hewlett-Packard Labs, 1 July 1999. Web. 1 Mar. 2015. <http://archive.ics.uci.edu/ml/datasets/Spambase>
"Logistic Regression." N.p., n.d. Web. 5 May 2015. <http://www.saedsayad.com/logistic_regression.htm>
"Binary Classification." Linear Discriminant Analysis Classifier (LDAC). N.p., n.d. Web. 5 May 2015. <http://mlpy.sourceforge.net/docs/3.5/lin_class.html>
Chinnapat Kaewchinporn. "10-fold Cross-Validation." K-fold Cross-validation. N.p., n.d. Web. 5 May 2015. <http://scriptslines.com/blog/k-fold-cross-validation/>

Thank you!
