Clickbait Detection: Using NLP and Machine Learning for Identifying Deceptive Content



This presentation describes a clickbait detector built on the Clickbait Challenge Twitter data using natural language processing and machine learning. Starting from the post text and the title of the linked article, it adds features such as the number of words, the word overlap between post text and article title, and a part-of-speech (POS) ratio. The final model reaches a classification accuracy of 88.2051% on the big data set, against a ZeroR baseline of 75.8553%.

Uploaded by jul_lo on Sep 30, 2024


Presentation Transcript

Clickbait Detection using Natural Language Processing and Machine Learning
By Varun Shah
Advisors: Kristina Striegnitz and Nick Webb

What is Clickbait?
Source: Google Images

The Clickbait Challenge!
• Organizers: M. Potthast, T. Gollub, B. Stein and M. Hagen
• Competition to build a clickbait detector using provided Twitter data
• Each post is judged by 5 annotators.
Source: http://www.clickbait-challenge.org/

Data
Data set    # Posts    # Clickbait    # No-clickbait
Small       2459       762            1697
Big         19538      4761           14777

Attributes: Post Text
Tweet: SlideHunter, 2 hours ago
Post Text: Some people are such food snobs. http://link.com/

Attributes: Title of Linked Article
• Actual title of the article linked in post text
Source: http://proyectoportal.com/

Attributes: Truth Class
[Bar chart: number of posts in each truth class (No-clickbait, Clickbait); count axis from 0 to 1800]

Preliminary Results
• Baseline classifier: ZeroR
• Attributes in model: Post Text and Article Title
Classifier      Classification Accuracy
ZeroR           50.0%
RandomForest    74.9085%*
*statistically significant at 95%
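
ZeroR always predicts the most frequent class, so its accuracy equals the share of that class in the data. The sketch below illustrates this; the function name and the balanced 500/500 sample are hypothetical, and the slides do not say which subset the reported baselines were measured on. A balanced two-class sample gives 50%, in line with the 50.0% above, while the full big data set counts (4761 vs. 14777) give about 75.6%, close to but not identical to the 75.8553% reported in the Results slide.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical balanced sample (an assumption, not taken from the slides).
balanced = ["clickbait"] * 500 + ["no-clickbait"] * 500

# Class counts of the big data set as listed in the Data slide.
big = ["clickbait"] * 4761 + ["no-clickbait"] * 14777

print(f"balanced: {zero_r_accuracy(balanced):.4f}")
print(f"big data set: {zero_r_accuracy(big):.4f}")
```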

Added Features: Use of Superlatives
Example: Is LeBron James’ NBA Finals performance the best ever? http://link.com/
• Does given post have a superlative?
Superlative?      Yes    No
# Clickbait       60     702
# No-clickbait    87     1610
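
The slides do not say how superlatives were detected (a POS tagger is one plausible route). As a minimal sketch under that caveat, the heuristic below flags a few irregular forms plus words ending in "-est", with a small exclusion list; both word lists and the function name are assumptions for illustration and are far from complete.

```python
import re

# Irregular superlatives and common "-est" words that are not superlatives.
# Both sets are illustrative assumptions, not the author's lists.
IRREGULAR = {"best", "worst", "most", "least"}
NOT_SUPERLATIVE = {
    "forest", "honest", "interest", "request", "contest", "protest",
    "suggest", "invest", "arrest", "harvest", "modest",
}


def has_superlative(post_text):
    """Return True if any word looks like a superlative (rough heuristic)."""
    words = re.findall(r"[a-z]+", post_text.lower())
    return any(
        w in IRREGULAR
        or (w.endswith("est") and len(w) > 4 and w not in NOT_SUPERLATIVE)
        for w in words
    )


print(has_superlative("Is LeBron James' NBA Finals performance the best ever?"))
```

A tagger-based check would avoid the false positives and negatives that a suffix rule inevitably produces.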

Added Features: Use of Numbers
Example: 5 incredible Italian dishes you haven’t tried before. http://link.com/
• Does given post have a number?
Number?           Yes    No
# Clickbait       191    571
# No-clickbait    416    1281
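
A minimal sketch of this yes/no feature, assuming that "number" means a digit in the post text and that digits inside the link should not count (the slides specify neither; the function name is hypothetical):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def has_number(post_text):
    """Return True if the post text, with links removed, contains a digit."""
    without_links = URL_PATTERN.sub(" ", post_text)
    return re.search(r"\d", without_links) is not None


print(has_number("5 incredible Italian dishes you haven't tried before. http://link.com/"))
```

Spelled-out numbers such as "five" are not detected by this version.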

Added Features: Number of Words
Example: Man dies when car plunges from parking garage. http://link.com/ (8 Words)
• How many words are in the post text?
Overall Mean    Clickbait    No-clickbait
12.662          11.901       13.002
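
The count of 8 in the example is reproduced by splitting on whitespace and skipping the link. This is a sketch under that assumption; the slides do not describe the tokenisation actually used, and the function name is hypothetical.

```python
def word_count(post_text):
    """Count whitespace-separated tokens, ignoring http(s) links."""
    return sum(
        1
        for token in post_text.split()
        if not token.startswith(("http://", "https://"))
    )


print(word_count("Man dies when car plunges from parking garage. http://link.com/"))  # 8
```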

Added Features: Similarity between Post Text and Title of linked article
[Example image: Post Text compared with Article Title]
• In example, # Overlaps = 4
Overall Mean    Clickbait    No-clickbait
3.997           3.150        4.378
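
One way to count overlaps is the number of distinct lowercase words shared by the post text and the article title. The sketch below assumes that definition; the slides do not state whether repeated words, stop words, or links were handled differently, and the example title is invented for illustration.

```python
import re


def word_set(text):
    """Lowercase word set of a text, with http(s) links removed."""
    text = re.sub(r"https?://\S+", " ", text)
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(post_text, article_title):
    """Number of distinct words shared by post text and article title."""
    return len(word_set(post_text) & word_set(article_title))


post = "Man dies when car plunges from parking garage. http://link.com/"
title = "Man killed after car plunges from garage"  # hypothetical title
print(overlap(post, title))
```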

Added Features: POS Ratio
Example: These global warming skeptics have …
POS Sequence: Determiner + Adjective + Singular Noun + Plural Noun + 3rd Person Verb
POS Ratio = #Sequence in Clickbait / #Sequence in All
• In example, POS Ratio = 0.8698
Minimum    Maximum    Mean     Std. Dev.
0          1          0.412    0.326
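
The ratio formula can be sketched as a lookup table built from already-tagged posts. The tag names and toy data below are hypothetical, and the slides do not say which tagger was used or how much of each post the sequence covers.

```python
from collections import Counter


def pos_ratio_table(tagged_posts):
    """Map each POS sequence to (#posts in clickbait) / (#posts in all).

    tagged_posts: iterable of (pos_sequence, is_clickbait) pairs.
    """
    in_clickbait = Counter()
    in_all = Counter()
    for sequence, is_clickbait in tagged_posts:
        key = tuple(sequence)
        in_all[key] += 1
        if is_clickbait:
            in_clickbait[key] += 1
    return {key: in_clickbait[key] / total for key, total in in_all.items()}


# Toy data with made-up tag names (assumptions for illustration only).
SEQ_A = ("DET", "ADJ", "NOUN_SG", "NOUN_PL", "VERB_3P")
SEQ_B = ("NOUN_SG", "VERB_3P", "ADV")
toy_posts = [(SEQ_A, True)] * 4 + [(SEQ_A, False)] + [(SEQ_B, False)] * 2

ratios = pos_ratio_table(toy_posts)
print(ratios[SEQ_A], ratios[SEQ_B])
```

A sequence that appears only in clickbait posts gets a ratio of 1 and one that never appears in clickbait gets 0, which matches the minimum and maximum in the table above.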

Results
• Model tested on (unbalanced) big data set
• Note: ZeroR (baseline) = 75.8553%
Attributes in Model                                         Accuracy
Post Text + Article Title + #Words + Overlap                82.2864%
Post Text + Article Title + POS Ratio                       86.6860%
Post Text + Article Title + #Words + Overlap + POS Ratio    88.2051%

Conclusion and Future Work
• Improved model to achieve a classification accuracy of 88.2051%
• Identified features that help detect clickbait
• For the future:
  - Image analysis
  - # Ads on article webpage

Thank you! Questions?
