The Threat of Unintended URLs in Social Media

To Err.Is Human: Characterising the Threat of Unintended URLs in Social Media
Authored by Beliz Kaleli et al.
Presented today by Denise Jarry
Motivation
 
Social media platforms are huge and are growing
Ideal target for attackers
The paper identifies an attack vector in the way clickable links are automatically rendered
Background
Top Level Domain (TLD)
The last label in a domain name (e.g. com in www.twitter.com)
Some TLDs are valid dictionary words (e.g. it, online)
Clickable link rendering
Automatically identifies text that corresponds to a clickable link
On Twitter, the last substring needs to be a valid TLD
Unintended URLs
According to the authors, they are caused by forgetting a space after a full stop: a "typo"
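To make the rendering rule concrete, here is a minimal Python sketch of a linkifier that treats any dot-separated token as a link when its last label is a known TLD. It is an illustration only, not Twitter's actual implementation: the regular expression, the name `find_rendered_links`, and the tiny `KNOWN_TLDS` set are all assumptions made for this example.

```python
import re

# Assumption: a tiny illustrative subset of TLDs. A real linkifier would
# consult the full list of delegated TLDs.
KNOWN_TLDS = {"com", "org", "it", "online"}

# Two or more dot-separated labels with no whitespace between them.
CANDIDATE = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def find_rendered_links(text: str) -> list[str]:
    """Return the substrings that would become clickable under the assumed rule."""
    links = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        last_label = candidate.rsplit(".", 1)[1].lower()
        if last_label in KNOWN_TLDS:  # the last substring must be a valid TLD
            links.append(candidate)
    return links


# A missing space after the full stop turns "place.it" into a link.
print(find_rendered_links("I love this place.it was great"))  # ['place.it']
print(find_rendered_links("I love this place. It was great"))  # []
print(find_rendered_links("Visit www.twitter.com today"))      # ['www.twitter.com']
```

The second call shows that the same sentence with the space restored produces no link, which is why the authors describe the problem as a typo.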
Problem
1. Bob follows Alice
2. Alice posts a tweet with a typo, creating an unintended URL
3. Mallory sees the tweet, registers the domain, and populates it with malicious content
4. Bob sees the tweet and clicks the link
5. Bob is exposed to the malicious content
Content could be:
Politically damaging (like the Rudy Giuliani case)
Defamatory or offensive
A phishing site
Bob might trust the content because he thinks it was posted by Alice!
Solution
Machine Learning Classifier that can distinguish between intended and unintended URLs
Authors used it to conduct an analysis of the problem
Implemented as a browser extension to warn users when they write a Tweet
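Machine Learning Classifier
Diagram: labelled training data (the ground truth dataset) trains the classifier; unknown, unlabelled tweets with a URL go in, and the classification decision is either intended URL or unintended URL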
Machine Learning Classifier: Prefiltering Step
Heuristics to bypass the classifier
Tweet is filtered out if the URL:
starts with www
ends with .com or .org
has a TLD that is a non-dictionary word
These heuristics immediately identify an intended URL
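The three prefiltering heuristics can be sketched in a few lines of Python. This is a reading of the slide, not the authors' code: the function name `is_prefiltered_as_intended` and the small `DICTIONARY_WORDS` set are hypothetical stand-ins, since the slides do not give the real word list.

```python
# Assumption: a hypothetical stand-in for the authors' dictionary of words.
DICTIONARY_WORDS = {"it", "online", "is", "in"}


def is_prefiltered_as_intended(url: str, dictionary: set[str] = DICTIONARY_WORDS) -> bool:
    """Return True if the heuristics label the URL as intended, skipping the classifier."""
    lowered = url.lower()
    tld = lowered.rsplit(".", 1)[-1]
    if lowered.startswith("www"):            # heuristic 1: starts with www
        return True
    if lowered.endswith((".com", ".org")):   # heuristic 2: ends with .com or .org
        return True
    if tld not in dictionary:                # heuristic 3: TLD is a non-dictionary word
        return True
    return False                             # otherwise, hand the tweet to the classifier


print(is_prefiltered_as_intended("www.twitter.com"))  # True
print(is_prefiltered_as_intended("example.org"))      # True
print(is_prefiltered_as_intended("place.it"))         # False: goes to the classifier
print(is_prefiltered_as_intended("funny.lol"))        # True: "lol" is not in this word list
```

Only URLs for which the function returns False reach the classifier, so the contents of the dictionary decide how much traffic the classifier ever sees.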
Machine Learning Classifier: Training Data
Classifier compares features of an unlabelled tweet with features in the training data to make a classification decision
How training data was collected:
Sample of tweets from the Twitter API
Passed through pre-filtering
Only want to include data relevant to the classification decision
1,068 tweets collected
Authors discussed and came to agreement on labels
644 intended, 424 unintended
Machine Learning Classifier
94% accuracy
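For context on the 94% figure, a short calculation shows how the 1,068 labelled tweets split between the two classes and what accuracy a trivial "always intended" guess would reach on them. The slides do not say how the 94% was measured, so the comparison below is only a rough reference point; the variable names are chosen for this example.

```python
intended, unintended = 644, 424
total = intended + unintended  # 1,068 labelled tweets

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(intended, unintended) / total

print(f"intended share:    {intended / total:.1%}")    # 60.3%
print(f"unintended share:  {unintended / total:.1%}")  # 39.7%
print(f"majority baseline: {majority_baseline:.1%}")   # 60.3%
```

The classes are moderately imbalanced, so 94% accuracy is well above what guessing the majority class would give on this dataset.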
Problem Analysis
Ran for 7 months
Authors used the classifier to analyse tweets as they were posted in real time
Key Findings
26,596 unintended URLs posted on Twitter
Most common domain names: d.va, b.tech
Reasons other than typos
Video game character (d.va)
Instagram handles have full stops
This can create a URL when posted on Twitter
Common acronyms (b.tech)
False positive: not a typo
Problem Analysis: URL Content
Crawler looked at the content of the unintended URL websites
Concerning: 39.5% of the URLs they crawled led to domain parking webpages
Previous work has shown that domain parking webpages commonly serve malicious content
Figure: Content of unintended URL websites
Problem Analysis: URL Registration Date
Most domains were registered up to 5 years earlier
There was no attack here: the poster just accidentally typed a registered domain
Spike in registrations at day 0
Suggests an attack!
Figure: Registration date of unintended URL domains
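The registration-date reasoning can be sketched as a simple date comparison. This assumes that "day 0" on the slide means the day the unintended URL was tweeted, which the slides do not state explicitly. The function names and the dates are invented for illustration.

```python
from datetime import date


def registration_offset_days(registered: date, tweeted: date) -> int:
    """Days from the tweet to the registration; negative means registered before the tweet."""
    return (registered - tweeted).days


def looks_opportunistic(registered: date, tweeted: date) -> bool:
    """Flag domains registered on or after the day the unintended URL was tweeted."""
    return registration_offset_days(registered, tweeted) >= 0


tweeted = date(2020, 3, 10)  # made-up tweet date

# Registered years earlier: the poster accidentally typed an existing domain.
print(looks_opportunistic(date(2016, 5, 1), tweeted))   # False

# Registered on the day of the tweet: consistent with someone reacting to the tweet.
print(looks_opportunistic(date(2020, 3, 10), tweeted))  # True
```

A registration on or after day 0 is only suggestive, since a domain could be registered on the same day by coincidence.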
Browser Extension
TypoNoMo Chrome extension
JavaScript click handler on the 'Tweet' button
Passes the text through the pre-filtering
Runs the pre-trained classifier
Makes a classification decision: intended or unintended?
Shows a warning message if an unintended URL is detected
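The decision flow behind the click handler can be outlined as follows. The extension itself is JavaScript, so this Python sketch models only the logic. `urls_needing_warning`, `toy_prefilter` and `toy_classify` are hypothetical names, and the toy classifier rule (a capitalised last label suggests a new sentence) is a stand-in, not the authors' trained model.

```python
from typing import Callable


def urls_needing_warning(
    candidate_urls: list[str],
    prefilter: Callable[[str], bool],
    classify: Callable[[str], str],
) -> list[str]:
    """Return the candidate URLs that should trigger a warning before posting."""
    flagged = []
    for url in candidate_urls:
        if prefilter(url):                   # heuristics say "intended": skip the classifier
            continue
        if classify(url) == "unintended":    # one classification task per remaining URL
            flagged.append(url)
    return flagged


def toy_prefilter(url: str) -> bool:
    # Stand-in covering only two of the slide's heuristics.
    return url.lower().startswith("www") or url.lower().endswith((".com", ".org"))


def toy_classify(url: str) -> str:
    # Stand-in rule: a capitalised last label looks like the start of a new sentence.
    last_label = url.rsplit(".", 1)[-1]
    return "unintended" if last_label[:1].isupper() else "intended"


print(urls_needing_warning(["www.twitter.com", "Err.Is", "place.it"],
                           toy_prefilter, toy_classify))  # ['Err.Is']
```

A non-empty result corresponds to the case where the extension would show its warning message.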
Browser Extension Performance Impact
Multiple scenarios tested
No URLs
1 URL
2 URLs
3 URLs
Worst case scenario
3 URLs = 3 classification tasks
Results in just under 1 second delay
Most users would only rarely see the popup
Authors argue it does not impact UX
They think Twitter should implement it natively
Criticism
Issues
Prefilters URLs with non-dictionary TLDs as intended
What about user misspellings? Text speak?
What about TLDs like lol, ooo, lgbt, google, fyi, aws?
All considered non-dictionary words by the authors
An unintended URL with a non-dictionary TLD will immediately get misclassified during the prefiltering step
Only 5 / 20 of the top unintended domains were due to typos
Authors barely acknowledge this
They push forward with the typo-detecting classifier solution
d.va: no typo for the user to fix!
Improvements: Simpler Solution
 
Simple!
More usable feature that Twitter
might actually implement
Improvements: Improve non-dictionary TLD prefiltering step
Consider that humans misspell words and use text speak
Authors need to increase the size of their dictionary
Add TLDs like lol, ooo, lgbt, google, fyi, aws
A .txt file is used as a dictionary by the browser extension
Dynamic dictionary instead of a static .txt file
Detects new text speak or words becoming popular online
Adds them to the dictionary
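A dynamic dictionary could look roughly like the sketch below, written in Python for illustration. `DynamicDictionary` and `tld_bypasses_classifier` are hypothetical names, and the detection of newly popular words is left out: the example only shows how adding a word at runtime changes the prefiltering outcome.

```python
class DynamicDictionary:
    """Word list that can grow at runtime instead of being fixed in a .txt file."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def add(self, word: str) -> None:
        self._words.add(word.lower())

    def __contains__(self, word: str) -> bool:
        return word.lower() in self._words


def tld_bypasses_classifier(url: str, dictionary: DynamicDictionary) -> bool:
    """Prefilter rule from the slides: a non-dictionary TLD is treated as intended."""
    tld = url.rsplit(".", 1)[-1]
    return tld not in dictionary


dictionary = DynamicDictionary(["it", "online"])
print(tld_bypasses_classifier("funny.lol", dictionary))  # True: never reaches the classifier

dictionary.add("lol")  # e.g. after "lol" is observed to be common online
print(tld_bypasses_classifier("funny.lol", dictionary))  # False: now gets classified
```

Once "lol" is in the dictionary, a tweet such as "that was funny.lol" is no longer waved through as intended and can be checked by the classifier.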
Thanks for listening
 

Characterising the potential threat posed by unintended URLs in social media, this study identifies the risk stemming from automatic rendering of clickable links. The research delves into the background of Top Level Domains (TLDs), the issue of unintended URLs caused by forgotten spaces, and proposes a solution involving a Machine Learning Classifier to differentiate between intended and unintended URLs. The proposed solution aims to mitigate the risk of users being exposed to malicious content through inadvertently clicking on misleading links.

  • Social Media
  • URL Threat
  • Machine Learning
  • Cybersecurity
  • Clickable Links

Uploaded on Oct 08, 2024



