
Clickstream Models and Sybil Detection Overview
Clickstream Models & Sybil Detection
Gang Wang, UC Santa Barbara
gangw@cs.ucsb.edu
Modeling User Clickstream Events
- User-generated events, e.g. profile load, link follow, photo browse, friend invite
- Assume each event records an event type, a user ID, and a timestamp
- Intuition: Sybil users act differently from normal users
  - Goal-oriented: focus on specific actions, with fewer extraneous events
  - Time-limited: focused on efficient use of time, so smaller gaps between events?
- If Sybil users are forced to mimic normal users, is that a win?
System Overview
[Diagram labels: Clickstream Log, Known Good Users, Sequence Clustering, Cluster Coloring, Incoming Clickstream, Legit?, Sybils]
Clickstream Models
- Clickstream log: user clicks (click type) with timestamps
- Modeling the clickstream
  - Event-only Sequence Model: order of events, e.g. ABCDA
  - Time-based Model: sequence of inter-arrival times, e.g. {t1, t2, t3, ...}
  - Hybrid Model: sequence of click events with time, e.g. A(t1)B(t2)C(t3)D(t4)A
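The three models can be illustrated with a minimal Python sketch. The sample log, the `build_models` helper, and the bucket boundaries in `BUCKETS` are all assumptions made for illustration; the slides do not specify how time gaps are encoded in the Hybrid Model.

```python
from collections import defaultdict

# Hypothetical sample log: (user_id, event_type, timestamp in seconds).
LOG = [
    ("u1", "A", 0), ("u1", "B", 2), ("u1", "C", 12), ("u1", "D", 13), ("u1", "A", 80),
    ("u2", "A", 5), ("u2", "A", 6),
]

# Assumed bucket upper bounds in seconds; the real boundaries are not given.
BUCKETS = [1, 10, 60]


def bucket(gap):
    """Return the index of the first bucket whose upper bound covers the gap."""
    for i, bound in enumerate(BUCKETS):
        if gap <= bound:
            return i
    return len(BUCKETS)


def build_models(log):
    """Build the event-only, time-based, and hybrid representations per user."""
    per_user = defaultdict(list)
    for user, event, ts in sorted(log, key=lambda r: (r[0], r[2])):
        per_user[user].append((event, ts))

    models = {}
    for user, clicks in per_user.items():
        events = [e for e, _ in clicks]
        # Inter-arrival time between each pair of consecutive clicks.
        gaps = [b[1] - a[1] for a, b in zip(clicks, clicks[1:])]
        # Hybrid: interleave each event with the bucketized gap that follows it.
        hybrid = []
        for i, e in enumerate(events):
            hybrid.append(e)
            if i < len(gaps):
                hybrid.append(f"t{bucket(gaps[i])}")
        models[user] = {
            "event_only": "".join(events),
            "time_based": gaps,
            "hybrid": hybrid,
        }
    return models


models = build_models(LOG)
print(models["u1"]["event_only"])  # ABCDA
```

Bucketizing the gaps turns the hybrid sequence into a sequence of discrete tokens, so the same n-gram machinery used for the event-only model can be applied to it.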
Clickstream Clustering
- Similarity graph
  - Vertices: users (or sessions)
  - Edges: weighted by the similarity score of two users' clickstreams
- Clustering similar clickstreams together
  - Graph partitioning using METIS
- Q: How to compare two clickstreams?
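A rough sketch of the similarity graph construction follows. The `event_set_distance` function is only a toy stand-in for a real clickstream distance, `build_similarity_graph` and `min_similarity` are hypothetical names, and the conversion `similarity = 1 - distance` is an assumption. The partitioning step itself (METIS) is not shown.

```python
from itertools import combinations


def event_set_distance(s1, s2):
    """Toy stand-in distance: 1 - Jaccard similarity of the event types used.

    Assumes both sequences are non-empty.
    """
    a, b = set(s1), set(s2)
    return 1.0 - len(a & b) / len(a | b)


def build_similarity_graph(clickstreams, distance, min_similarity=0.0):
    """Return a weighted, undirected graph as {user: {neighbor: similarity}}."""
    graph = {user: {} for user in clickstreams}
    for u, v in combinations(clickstreams, 2):
        sim = 1.0 - distance(clickstreams[u], clickstreams[v])
        # Keep only edges above the threshold so the graph stays sparse.
        if sim > min_similarity:
            graph[u][v] = sim
            graph[v][u] = sim
    return graph


streams = {"u1": "AAB", "u2": "AAC", "u3": "DDD"}
graph = build_similarity_graph(streams, event_set_distance)
```

The resulting adjacency structure is the kind of input a graph partitioner expects: users with similar clickstreams share heavy edges and should end up in the same partition.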
Distance Functions of Each Model
Click Sequence (CS) Model
- Ngram overlap
    S1 = AAB, S2 = AAC
    ngram1 = {A, B, AA, AB, AAB}
    ngram2 = {A, C, AA, AC, AAC}
- Ngram+count, Euclidean distance
    S1 = AAB, S2 = AAC
    ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}
    ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}
    V1 = (2,1,0,1,0,1,1,0)/6
    V2 = (2,0,1,1,1,0,0,1)/6
Time-based Model
- Compare the distributions of inter-arrival times using a K-S test
Hybrid Model
- Bucketize inter-arrival times
- Compute 5grams (similar to the CS Model)
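The distance functions can be sketched in Python using the S1 = AAB, S2 = AAC example from the slide. Two points are assumptions: the exact formula for n-gram overlap is not given, so a Jaccard distance over the n-gram sets is used here, and the K-S comparison is shown as the plain two-sample K-S statistic (the largest gap between the two empirical CDFs). The function names are illustrative.

```python
import math
from bisect import bisect_right
from collections import Counter


def ngrams(seq, max_n):
    """Count every contiguous n-gram of length 1..max_n (keys are tuples)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def ngram_overlap_distance(s1, s2, max_n):
    """Assumed form: Jaccard distance between the two n-gram sets."""
    g1, g2 = set(ngrams(s1, max_n)), set(ngrams(s2, max_n))
    return 1.0 - len(g1 & g2) / len(g1 | g2)


def ngram_count_distance(s1, s2, max_n):
    """Euclidean distance between count vectors normalized by total count."""
    c1, c2 = ngrams(s1, max_n), ngrams(s2, max_n)
    t1, t2 = sum(c1.values()), sum(c2.values())
    keys = set(c1) | set(c2)
    return math.sqrt(sum((c1[g] / t1 - c2[g] / t2) ** 2 for g in keys))


def ks_statistic(x, y):
    """Two-sample K-S statistic over two lists of inter-arrival times."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, v) / len(xs)  # empirical CDF of x at v
        fy = bisect_right(ys, v) / len(ys)  # empirical CDF of y at v
        d = max(d, abs(fx - fy))
    return d


print(ngram_overlap_distance("AAB", "AAC", 3))  # 0.75: 2 shared of 8 n-grams
print(ngram_count_distance("AAB", "AAC", 3))    # sqrt(6)/6, about 0.408
```

With the slide's vectors, V1 and V2 differ by 1/6 in six of the eight dimensions, which gives the sqrt(6)/6 Euclidean distance printed above. Because `ngrams` accepts any sequence, the same functions also work on a list of hybrid-model tokens.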
Detection in a Nutshell
- Inputs
  - Trained clusters
  - Input sequences for testing
- Methodology: given a test sequence A
  - K nearest neighbor: find the top-k nearest sequences in the trained clusters
  - Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster
  - Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
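The three detection methods can be compared in a small sketch. It assumes each sequence has already been turned into a fixed-length feature vector (for example, normalized n-gram counts), uses Euclidean distance, decides the k-nearest-neighbor label by majority vote, and takes a cluster center to be the mean vector. None of these choices is specified in the slides, and the toy clusters are invented for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_label(test, clusters, k=3):
    """Majority label among the k training vectors nearest to the test vector."""
    ranked = sorted(
        (euclidean(test, v), label)
        for label, vecs in clusters.items()
        for v in vecs
    )
    votes = {}
    for _, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def nearest_cluster_avg(test, clusters):
    """Cluster with the smallest average distance to the test vector."""
    def avg(vecs):
        return sum(euclidean(test, v) for v in vecs) / len(vecs)
    return min(clusters, key=lambda label: avg(clusters[label]))


def compute_centers(clusters):
    """Pre-compute one center per cluster (assumed here to be the mean vector)."""
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in clusters.items()
    }


def nearest_cluster_center(test, centers):
    """Cluster whose pre-computed center is closest to the test vector."""
    return min(centers, key=lambda label: euclidean(test, centers[label]))


# Toy trained clusters, already colored as legitimate or Sybil.
clusters = {
    "legit": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    "sybil": [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]],
}
centers = compute_centers(clusters)
print(knn_label([0.5, 0.5], clusters))               # legit
print(nearest_cluster_center([5.5, 5.0], centers))   # sybil
```

The center-based variant needs only one distance computation per cluster at test time, whereas the other two compare the test vector against every training vector.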
Clustering Sequences
How well can each method separate Sybils from legitimate users?
Cells show (False positives, False negatives) of users.

Model (Sequence Type)              Distance Function  20 clusters  50 clusters  100 clusters
Click Sequence Model (Categories)  unigram            (3%, 6%)     (1%, 7%)     (2%, 4%)
                                   unigram+count      (1%, 4%)     (1%, 3%)     (1%, 3%)
                                   10gram             (1%, 3%)     (1%, 3%)     (2%, 2%)
                                   10gram+count       (1%, 4%)     (2%, 4%)     (1%, 2%)
Time-based Model                   K-S Test           (9%, 8%)     (2%, 10%)    (5%, 10%)
Hybrid Model (Categories)          5gram              (3%, 2%)     (2%, 2%)     (2%, 2%)
                                   5gram+count        (3%, 4%)     (4%, 5%)     (1%, 2%)
Detection Accuracy
- Basics
  - Train on one group of users and test on the other group
  - Clusters trained using the Hybrid Model
- Key takeaways
  - High accuracy with 50 clicks in the test sequence
  - The Nearest Cluster (Center) method achieves high accuracy with minor computation overhead
Cells show (False positives, False negatives) of users.

Number of Clicks in the Sequence (length)  K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Length <=50                                (1.5%, 2.1%)               (0.6%, 2.6%)                     (0.4%, 2.3%)
Length <=100                               (0.9%, 1.8%)               (0.2%, 2.5%)                     (0.3%, 2.3%)
All                                        (0.6%, 3%)                 (0.4%, 2.8%)                     (0.4%, 2.3%)
Can the Model Be Effective Over Time?
- Experiment method
  - Train the model on the first two weeks of data
  - Test on the following two weeks of data
Cells show (False positives, False negatives) of users.

Model                 K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Click Sequence Model  (1.8%, 1%)                 (3%, 2%)                         (3%, 0.8%)
Hybrid Model          (3%, 2%)                   (3%, 1%)                         (1.2%, 1.4%)
Still Ongoing Work
With broad interest and applications
As a Sybil detection tool
- Code being tested internally at Renren
  - Trained with 10K users (2-week log)
  - Tested on 1 million users (1-week log)
  - 5 Sybil clusters, 22K suspicious profiles
- Further improvement
  - Training with longer clickstreams (half of the users have <5 clicks in the 2-week log)
  - More conservative labeling of Sybil clusters
As a user modeling tool
- Code being tested by LinkedIn as a user profiler
Some Useful Tools
- Graph partitioning: METIS
  http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
- Community detection: Louvain code
  https://sites.google.com/site/findcommunities/
Other Ongoing Works/Ideas
Fighting against crowdturfing
- Crowdturfing: real users are paid to spam
- How to detect these malicious real users
  - User behavior model
  - Network-wide temporal anomaly detection
Information dissemination
- Content sharing via social edges
  - How often will a user click on the content
  - How often will a user comment on the content
- Sybil detection, targeted ad placement
Thank You! Questions?
http://current.cs.ucsb.edu