Gender Identification in SMS Texts: An Exploration of Authorship Characteristics

By: Shannon Silessi
Introduction
Introduction (Cont’d)
Characteristics of SMS text messages
Background
Author gender identification for short length internet applications proposed by Cheng et al. [4]
Background (Cheng Cont’d)
Background (Cont’d)
Argamon et al. proposed using content-based and style-based features [5]
Background (Cont’d)
Orebaugh et al. proposed an IM authorship analysis framework that extracts features from messages to create author writeprints and applies several data mining algorithms to build classification models [2]
Background (Orebaugh Cont’d)
Background (Cont’d)
Soler et al. proposed using a small number of features that depend on the structure of text [6]
Background
Ragel et al. proposed an N-gram method [7]
Methodology
Methodology (Cont’d)
References
[1] US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The Wireless Association. http://www.ctia.org/resource-library/facts-and-infographics/archive/us-text-messages-sms
[2] A. Orebaugh and J. Allnutt. 2010. “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10
[3] M. Rafi. 2008. “SMS Text Analysis: Language, Gender and Current Practices,” Online Journal of TESOL France. http://www.tesolfrance.org/Documents/Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf
[4] N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. “Author gender identification from text,” Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158
[5] S. Argamon et al. 2009. “Automatically profiling the author of an anonymous text,” Communications of the ACM - Inspiring Women in Computing. http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL&dl=ACM&CFID=615755468&CFTOKEN=20724033
[6] J. Soler and L. Wanner. 2014. “How to Use Less Features and Reach Better Performance in Author Gender Identification,” Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
[7] R. Ragel, P. Herath, and U. Senanayake. 2013. “Authorship Detection of SMS Messages Using Unigrams,” Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314
[8] Z. Miller, B. Dickinson, and W. Hu. 2012. “Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,” International Journal of Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019

Cyber forensics methods play a crucial role in detecting SMS authors for potential use in criminal prosecution cases, because the visual anonymity of text messages can be exploited by criminals. This study looks at authorship characterization of SMS texts, which categorizes authors by sociolinguistic attributes such as gender, age, educational background, income, nationality, and race. Traditional stylometric techniques were developed for larger, more formal written documents, and SMS messages are hard to fit into them because of abbreviations, emoticons, a limited character count, phonetic spellings, and letter-number compression. The prior work reviewed includes a study that applied psycholinguistic and gender-preferential cues to English-language news stories and compared classification algorithms such as Bayesian-based logistic regression, AdaBoost decision tree, and Support Vector Machine (SVM).

  • SMS authorship
  • Cyber forensics
  • Sociolinguistic attributes
  • Text message analysis

Uploaded on Sep 19, 2024 | 0 Views



Presentation Transcript


  1. Gender Identification of SMS Texts By: Shannon Silessi

  2. Introduction
       • Cyber forensics methods are needed for detecting SMS authors for use in criminal prosecution cases
       • U.S. wireless users send & receive an average of 6 billion text messages a day
       • Visual anonymity can be misused and exploited by criminals

  3. Introduction (Cont’d)
       • Authorship characterization: categorizing an author’s text according to sociolinguistic attributes such as gender, age, educational background, income, nationality, and race

  4. Introduction (Cont’d)
       • Most research in the area of authorship characterization has been conducted on larger, more formal written documents
       • Unusual characteristics of SMS make it difficult to apply traditional stylometric techniques

  5. Introduction (Cont’d): Characteristics of SMS text messages
       • Often contain abbreviations or written representations of sounds (e.g. “kt” for “Katie”) [3]
       • Emoticons (for example, one representing a frown) [3]
       • Limited to 140 characters [3]
       • Various phonetic spellings for verbal effects (“hehe” for laughter and “muaha” for evil laughter) [3]
       • Combined letters & numbers for compression (“CUL8R” for “See You Later”) [3]
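The SMS characteristics listed on slide 5 can be turned into simple countable features. The following is only a rough sketch: the regular expressions and the function name `sms_style_features` are illustrative assumptions, not patterns taken from the presentation or from [3].

```python
import re

# Hypothetical, illustrative patterns (assumptions, not from the slides or [3]).
# Western-style emoticons such as :( ;-) =D
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp/\\|]")
# Tokens that mix letters and digits, e.g. "CUL8R"
LETTER_DIGIT_RE = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")
# Phonetic laughter spellings, e.g. "hehe", "haha", "muaha"
LAUGHTER_RE = re.compile(r"\b(?:(?:ha|he|hi){2,}|mua(?:ha)+)\b", re.IGNORECASE)


def sms_style_features(message: str) -> dict:
    """Count a few SMS-specific surface cues in a single message."""
    return {
        "length": len(message),
        "emoticons": len(EMOTICON_RE.findall(message)),
        "letter_digit_tokens": len(LETTER_DIGIT_RE.findall(message)),
        "laughter_tokens": len(LAUGHTER_RE.findall(message)),
    }


# Toy message built from the examples on the slide.
features = sms_style_features("hehe CUL8R :(")
print(features)
```

Counts like these could sit alongside the word-based and structural features described in the background slides; which cues actually help with gender identification is an empirical question the sketch does not answer.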

  6. Background
       • 545 psycholinguistic & gender-preferential cues [4]
       • Dataset: collection of all English language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4]
       • Used messages that contained between 200 and 1,000 words

  7. Background (Cheng Cont’d)
       • Examined performance of Bayesian-based logistic regression, AdaBoost decision tree, & Support Vector Machine (SVM) [4]
       • Enron email dataset: messages containing between 50 and 1,000 words [4]
       • Examination of parameter performance: accuracy increases with increasing number of words [4]
       • Best classification result: SVM with 76.75% (Reuters) & 82.23% (Enron) accuracies [4]

  8. Background (Cheng Cont’d)
       • Examination of feature sets: word-based features & function words were more important [4]
       • Issue: categorization of documents by gender was based on the perceived gender of a person’s name
       • Issue: unequal number of male- vs. female-authored documents

  9. Background (Cont’d)
       • Argamon et al. proposed using content-based features & style-based features [5]
       • Bayesian Multinomial Regression [5]
       • Dataset: blog posts by 19,320 authors [5]
       • Content features were more effective classifiers [5]
       • Issue: varying length of texts, ranging from several hundred to tens of thousands of words

  10. Background (Cont’d)
       • Algorithms: C4.5, k-nearest neighbor, Naïve Bayes, & SVM [2]

  11. Background (Orebaugh Cont’d)
       • Dataset: IM conversation logs from 19 authors collected by the Gaim and Adium clients over a three-year period [2]
       • Dataset: IM logs between undercover agents and 100 different child predators that are publicly available from U.S. Cyberwatch [2]
       • Optimal algorithm was SVM, using 356 features [2]

  12. Background (Cont’d)
       • 83 features [6]
       • Algorithms: WEKA’s Bagging variant for classification with REPTree as base classifier [6]
       • Dataset: 1,672 NY Times opinion blogs written by 100 male & 100 female authors [6]
       • Using only the syntactic feature group: 77.03% accuracy [6]

  13. Background
       • Dataset: NUS SMS corpus [7]
       • Cosine similarity performed best to calculate distance between two vectors [7]
       • As the number of stacked messages increases, accuracy increases, but saturation is reached around 20 messages [7]
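The cosine-similarity comparison over stacked messages mentioned on slide 13 can be illustrated with a small sketch. It assumes whitespace tokenization and raw unigram counts. The helper names (`unigram_profile`, `cosine_similarity`, `closest_author`) and the toy messages are hypothetical, and nothing here reproduces the exact procedure of [7].

```python
import math
from collections import Counter


def unigram_profile(messages):
    """Stack several messages and count whitespace-separated unigrams."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def closest_author(unknown_messages, known_profiles):
    """Return the known profile name most similar to the stacked unknown messages."""
    unknown = unigram_profile(unknown_messages)
    return max(known_profiles,
               key=lambda name: cosine_similarity(unknown, known_profiles[name]))


# Toy, made-up messages purely for illustration.
profiles = {
    "author_a": unigram_profile(["c u l8r", "ok c u"]),
    "author_b": unigram_profile(["see you later", "okay see you"]),
}
print(closest_author(["c u 2moro"], profiles))
```

Stacking more messages into `unknown_messages` gives the profile more tokens to match on, which is one plausible reading of why accuracy rises with the number of stacked messages before levelling off.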

  14. Methodology
       • Dataset: NUS SMS corpus
       • Accuracy will be measured by the percentage of correct author gender classifications
       • Hybrid approach using a classification technique & N-gram modeling
       • N-gram model: longest common string
       • Classification using Naïve Bayes
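As a rough illustration of how N-gram features, Naïve Bayes classification, and the percentage-correct accuracy measure from slide 14 could fit together, here is a minimal sketch. It makes several assumptions that are not from the presentation: character trigrams, add-one smoothing, placeholder class labels, and toy training strings. It also does not implement the "longest common string" step, whose details the slides do not give.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased message."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of messages
        self.vocabulary = set()

    def fit(self, messages, labels):
        for message, label in zip(messages, labels):
            grams = char_ngrams(message, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, message):
        total_docs = sum(self.label_counts.values())
        # One extra slot is reserved for n-grams never seen in training.
        vocab_size = len(self.vocabulary) + 1
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            score = math.log(doc_count / total_docs)  # log prior
            for gram in char_ngrams(message, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, messages, labels):
    """Percentage of messages whose predicted label matches the true label."""
    correct = sum(model.predict(m) == y for m, y in zip(messages, labels))
    return 100.0 * correct / len(labels)


# Placeholder labels and toy strings; a real run would use gender labels
# from the corpus instead.
model = NgramNaiveBayes(n=3).fit(
    ["aaa aaa", "aaaa", "bbb bbb", "bbbb"],
    ["class_a", "class_a", "class_b", "class_b"],
)
print(accuracy(model, ["aaa", "bbb"], ["class_a", "class_b"]))
```

In practice the accuracy would be computed on held-out messages rather than on strings resembling the training data, so the toy figure printed here says nothing about expected performance.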

  15. Methodology (Contd)

