Gender Identification in SMS Texts: An Exploration of Authorship Characteristics
Cyber forensics methods play a crucial role in identifying SMS authors for potential use in criminal prosecution cases, because criminals can exploit the visual anonymity of text messages. This study addresses the authorship characterization of SMS texts, that is, categorizing authors by sociolinguistic attributes such as gender, age, educational background, income, nationality, and race. Unlike the larger written documents to which traditional stylometric techniques are usually applied, SMS messages pose distinct challenges: abbreviations, emoticons, a limited character count, phonetic spellings, and compression techniques. Prior work has applied psycholinguistic and gender-preferential cues to datasets such as English-language news stories and has compared classification algorithms including Bayesian-based logistic regression, AdaBoost decision trees, and Support Vector Machines (SVM).
Gender Identification of SMS Texts By: Shannon Silessi
Introduction: Cyber forensics methods are needed to identify SMS authors for use in criminal prosecution cases. U.S. wireless users send and receive an average of 6 billion text messages a day. Visual anonymity can be misused and exploited by criminals.
Introduction (Contd): Authorship characterization: categorizing an author's text according to sociolinguistic attributes such as gender, age, educational background, income, nationality, and race.
Introduction (Contd): Most research in the area of authorship characterization has been conducted on larger, more formal written documents. The unusual characteristics of SMS make it difficult to apply traditional stylometric techniques.
Introduction (Contd): Characteristics of SMS text messages. They often contain abbreviations or written representations of sounds (e.g. "kt" for "Katie") [3]. Emoticons (for example, one representing a frown) [3]. Limited to 140 characters [3]. Various phonetic spellings for verbal effects ("hehe" for laughter and "muaha" for evil laughter) [3]. Combined letters and numbers for compression ("CUL8R" for "See You Later") [3].
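The characteristics listed above can be turned into simple numeric cues. The following is a minimal sketch, assuming regular-expression patterns and feature names (`sms_surface_features`, `num_emoticons`, and so on) that are illustrative only and are not taken from [3] or from the presentation.

```python
import re

# Illustrative patterns (assumptions, not from the cited sources).
# Emoticons such as ":(", ":-)", ";D".
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp\[\]/\\|]")
# Tokens mixing letters and digits, such as "CUL8R".
LETTER_DIGIT_RE = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")
# Phonetic laughter spellings, such as "hehe" or "muaha".
LAUGHTER_RE = re.compile(r"\b(?:(?:ha|he|hi|ho){2,}|mua(?:ha)+)\b", re.IGNORECASE)


def sms_surface_features(message: str) -> dict:
    """Count a few surface cues that are typical of SMS text."""
    tokens = message.split()
    return {
        "length_chars": len(message),
        "num_tokens": len(tokens),
        "num_emoticons": len(EMOTICON_RE.findall(message)),
        "num_letter_digit_tokens": len(LETTER_DIGIT_RE.findall(message)),
        "num_laughter_tokens": len(LAUGHTER_RE.findall(message)),
    }


if __name__ == "__main__":
    print(sms_surface_features("CUL8R hehe :("))
```

Counts like these could serve as extra inputs to a classifier alongside word or n-gram features; the patterns would need tuning on real SMS data.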
Background (Cheng et al.): 545 psycholinguistic and gender-preferential cues [4]. Dataset: collection of all English-language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4]. Used messages that contained between 200 and 1,000 words.
Background (Cheng Contd): Examined the performance of Bayesian-based logistic regression, AdaBoost decision tree, and Support Vector Machine (SVM) [4]. Second dataset: Enron email dataset, using messages containing between 50 and 1,000 words [4]. Examination of parameter performance: accuracy increases with increasing number of words [4]. Best classification result: SVM, with accuracies of 76.75% (Reuters) and 82.23% (Enron) [4].
Background (Cheng Contd): Examination of feature sets: word-based features and function words were more important [4]. Issues: categorization of documents by gender was based on the perceived gender of a person's name; unequal numbers of male-authored and female-authored documents.
Background (Contd): Argamon et al. proposed using content-based features and style-based features [5]. Bayesian Multinomial Regression [5]. Dataset: blog posts by 19,320 authors [5]. Content features were more effective classifiers [5]. Issues: varying length of texts, ranging from several hundred to tens of thousands of words.
Background (Contd): Algorithms: C4.5, k-nearest neighbor, Naïve Bayes, and SVM [2].
Background (Orebaugh Contd): Datasets: IM conversation logs from 19 authors, collected by the Gaim and Adium clients over a three-year period [2]; IM logs between undercover agents and 100 different child predators that are publicly available from U.S. Cyberwatch [2]. Optimal algorithm was SVM, using 356 features [2].
Background (Contd): 83 features [6]. Algorithm: WEKA's Bagging variant for classification, with REPTree as the base classifier [6]. Dataset: 1,672 NY Times opinion blogs written by 100 male and 100 female authors [6]. Using only the syntactic feature group: 77.03% accuracy [6].
Background: Dataset: NUS SMS corpus [7]. Cosine similarity performed best for calculating the distance between two vectors [7]. As the number of stacked messages increases, accuracy increases, but saturation is reached at around 20 messages [7].
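To make the cosine-similarity comparison concrete, here is a minimal sketch, assuming whitespace-tokenized unigram counts and a hypothetical `closest_profile` helper. It is not the implementation from [7]; "stacking" is modelled simply as pooling the counts of several messages into one vector.

```python
import math
from collections import Counter


def unigram_vector(messages):
    """Stack several messages into one bag of lowercase unigram counts."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_profile(profiles, query_messages):
    """Return the key of the profile vector most similar to the stacked query messages."""
    query = unigram_vector(query_messages)
    return max(sorted(profiles), key=lambda name: cosine_similarity(profiles[name], query))
```

A profile here would be the `unigram_vector` of an author's (or a gender class's) known messages. Stacking more query messages gives a denser query vector, which is consistent with the reported accuracy gain up to about 20 messages.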
Methodology: Dataset: NUS SMS corpus. Accuracy will be measured as the percentage of correct author gender classifications. Hybrid approach using a classification technique and N-gram modeling. N-gram model: longest common string. Classification: Naïve Bayes.
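As a rough illustration of how N-gram features, Naïve Bayes classification, and the accuracy measure fit together, the sketch below trains a multinomial Naïve Bayes model on character n-gram counts. The choice of character trigrams, add-one smoothing, and the class name `NgramNaiveBayes` are assumptions for illustration. The presentation does not specify these details, and the "longest common string" step is not modelled here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()             # training documents per class
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predictions, gold):
    """Percentage of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```

In the proposed setting the labels would be the author gender classes and the texts would be SMS messages from the corpus, with `accuracy` computed on held-out messages.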
References
[1] US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The Wireless Association. http://www.ctia.org/resource-library/facts-and-infographics/archive/us-text-messages-sms
[2] A. Orebaugh and J. Allnutt. 2010. Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10
[3] M. Rafi. 2008. SMS Text Analysis: Language, Gender and Current Practices, Online Journal of TESOL France. http://www.tesolfrance.org/Documents/Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf
[4] N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. Author gender identification from text, Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158
References (Contd)
[5] S. Argamon et al. 2009. Automatically profiling the author of an anonymous text, Communications of the ACM - Inspiring Women in Computing. http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL&dl=ACM&CFID=615755468&CFTOKEN=20724033
[6] J. Soler and L. Wanner. 2014. How to Use Less Features and Reach Better Performance in Author Gender Identification, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
[7] R. Ragel, P. Herath, and U. Senanayake. 2013. Authorship Detection of SMS Messages Using Unigrams, Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314
[8] Z. Miller, B. Dickinson, and W. Hu. 2012. Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features, International Journal of Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019