Analyzing Information Spread on Twitter for Improved Content Distribution
This study by Amit Ruhela explores the spread of information on Twitter and its implications for internet content distribution. It discusses the rapid growth in online social networking usage, the utilization of OSN websites for content distribution, and the goal of enhancing web CDN caching algorithms using insights from OSN platforms. The research involves analyzing how topics propagate on OSNs, applying these insights to guide CDN caching, efficiently tracking OSN indicators, and conducting simulations to assess the effectiveness of OSN-aided caching policies. Various datasets related to Twitter usage are examined, and geographical and events-based analyses offer conclusions on topic popularity and virality detection.
Presentation Transcript
ANALYSIS OF THE SPREAD OF INFORMATION ON TWITTER AND ITS APPLICATION TO INTERNET CONTENT DISTRIBUTION By Amit Ruhela 2007CSZ8359 Department of Computer Science and Engineering, IIT Delhi
MOTIVATION
Rapid growth in usage of OSN websites by individuals, celebrities, businesses and advertisers.
31% of the traffic directed to 300,000+ popular websites is produced by the 400M users of the top 8 OSN websites (Shareaholic statistics).
ISP/CDN providers, who serve a substantial portion of Internet traffic, are looking for better solutions to manage and deliver web content.
GOAL
Use signals from OSN websites to improve the caching algorithms of Web CDNs.
OUTLINE
How do topics spread on OSNs?
Can OSN topic spreading insights be applied to guide CDN caching?
How to track relevant OSN indicators in an efficient manner?
Simulation study using real data to study the efficacy of OSN-aided caching policies
DATASETS
Tweets: SNAP Twitter7: 196 million tweets, 9.8 million users, June 11, 2009 to October 1, 2009 (Author, Time)
Details of Twitter7 Dataset: J. Yang, J. Leskovec. Temporal Variation in Online Media. In WSDM '11, 2011.
Social Relations: KAIST: 1.4 billion social relations, 41.7 million users, July 6, 2009 to July 31, 2009
Details of KAIST Dataset: H. Kwak, C. Lee, H. Park, S. Moon. What is Twitter, a social network or a news media? In WWW '10, New York, USA, 2010.
Location: Twitter: 7.4 million user locations (75.4% of the SNAP dataset users)
Location: Yahoo! PlaceFinder: 4 million user locations mapped to a (Latitude, Longitude) format (54% of all locations)
Topics: OpenCalais: entity and tag extraction for 114 million English tweets (80% of the SNAP dataset tweets); 7.5 million topics and 39 million URLs
GEOGRAPHICAL ANALYSIS
[Figure: example topics by popularity level (Highly Popular, Medium Popular, Non Popular); example topics: Barack, Hamburg, Cambridge]
Conclusions:
1. Popular topics cross regional boundaries while unpopular topics stay within them.
2. Geographic crossovers can indicate popularity growth to guide CDN caching.
EVENTS BASED ANALYSIS
[Figure: ratio of size of largest to 2nd largest component]
Conclusions:
1. Most users tweeting on popular topics form one large connected component, while unpopular topics are discussed in disconnected clusters.
2. The giant component forms when many tightly clustered sets of users discussing a topic coalesce.
3. Tracking the change in component size of topics can detect virality in advance and can possibly guide CDN caches to pre-fetch popular content.
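The component-size indicator named on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the follower graph is treated as undirected, and the function names `component_sizes` and `largest_to_second_ratio` are hypothetical.

```python
from collections import Counter


def component_sizes(users, relations):
    """Sizes of connected components among `users`, largest first.

    `relations` is an iterable of (a, b) pairs. Direction is ignored
    (an assumption), and pairs touching users outside the topic are skipped.
    """
    parent = {u: u for u in users}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in relations:
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    sizes = Counter(find(u) for u in parent)
    return sorted(sizes.values(), reverse=True)


def largest_to_second_ratio(users, relations):
    """Ratio of the largest to the 2nd largest component size."""
    sizes = component_sizes(users, relations)
    if not sizes:
        return 0.0
    if len(sizes) == 1:
        return float("inf")  # everyone is in a single component
    return sizes[0] / sizes[1]


# Users tweeting about one topic on one day, and the relations among them.
topic_users = {"a", "b", "c", "d", "e", "f"}
topic_relations = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
ratio = largest_to_second_ratio(topic_users, topic_relations)
```

Computing this ratio per topic per day and watching for a sharp rise is one way to follow the formation of a giant component.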
CONCLUSIONS
1. Popular topics cross regional boundaries while unpopular topics stay within them.
2. Most of the people talking about a popular topic on a given day tend to form a large connected subgraph (giant component).
3. The giant component forms when many tightly clustered sets of users discussing the topic merge.
SPATIAL PATTERN
Event Time: US 2:26 PM, India 2:56 AM
Related Work: N. Sastry, E. Yoneki, and J. Crowcroft. Buzztraq: predicting geographical access patterns of social cascades using social networks. In Proceedings of the Second ACM EuroSys Workshop on Social Network Systems, Germany, ACM, 2009. Knowledge about the number and location of friends of previous users can be used to generate hints that enable placing replicas of the content closer to future accesses.
Insight: The time lag across different time zones can possibly suggest the time at which content should be pre-fetched in CDN servers.
TEMPORAL PATTERN
[Figure: example popularity time series. Pattern labels: Slow, Sharp, Ephemeral, Stable, Weekly, Yearly. Example events: Movie Release, Death of Michael Jackson, Bill Clinton Visit to N. Korea, FollowFriday, #flywithme, Independence Day, Iran President Election]
1. Growth in Popularity
2. Decay in Popularity
3. Lifespan of Topics
4. Periodicity of Popularity
Insight: Temporal event detection algorithms can model growth, decay, stability and periodicity of topic popularity to guide CDN caching.
SOCIAL PATTERN
Related Work: S. Scellato, C. Mascolo, M. Musolesi, and J. Crowcroft. "Track globally, deliver locally: improving content delivery networks by tracking geographic social cascades". In Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, USA, 2011. Used geographic information extracted from social cascades to improve the caching of multimedia files in a Content Delivery Network.
Social Cohesion (per topic) = (Actual count of social relations between UGC producers) / (Maximum possible social relations between UGC producers)
[Figure: axis label: Number of UGC participants in topic]
Insight: To guide caching of content related to long-tailed topics, the immediate social network neighbourhood can indicate pockets of popularity.
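The social cohesion ratio on this slide can be sketched as a graph density over the users producing content on a topic. This is only an illustration under stated assumptions: relations are treated as undirected, so the maximum possible count is n(n-1)/2, which the slide does not specify. The function name `social_cohesion` is hypothetical.

```python
def social_cohesion(producers, relations):
    """Actual / maximum possible social relations among a topic's producers.

    Assumption: a relation is counted once per unordered pair of users,
    so the maximum is n * (n - 1) / 2 for n producers.
    """
    producers = set(producers)
    n = len(producers)
    if n < 2:
        return 0.0  # no pair of producers exists

    pairs = set()
    for a, b in relations:
        # Keep only relations between two distinct producers of this topic.
        if a != b and a in producers and b in producers:
            pairs.add(frozenset((a, b)))

    max_possible = n * (n - 1) // 2
    return len(pairs) / max_possible


# Four producers; ("a", "x") is ignored because "x" did not post on the topic,
# and ("b", "a") duplicates ("a", "b") under the undirected assumption.
cohesion = social_cohesion(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "a"), ("b", "c"), ("a", "x")],
)
```

If the directed follower graph were used instead, the denominator would become n(n-1) and each direction would count separately.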
CONCLUSIONS
To design efficient content placement strategies:
1. Time-zone information can be used to determine when content would be requested in other geographies.
2. Temporal event detection algorithms can model growth and decay phases to guide CDN caching.
3. To guide caching of content related to long-tailed topics, the immediate social network neighbourhood can indicate pockets of popularity.
MOTIVATION
It is practically infeasible to track all OSN users and all the trending topics.
Can the tracking of trends on OSNs be optimized by following just a few users?
DATASETS
Dataset | Seed Users | Total Users | Tweets | #tags > 10K Tweets | Total Tweets
Bollywood | 150 | 23 M | 406 M | 119 | 2.74 M
Politics | 55 | 7 M | 115 M | 182 | 7.73 M
Sports | 40 | 9 M | 129 M | 59 | 2.36 M
Total | 245 | 26 M * | 468 M | 360 | 12.83 M
* Captures 60% of the entire Twitter user base from India
USERS CLASSIFICATION
Users | Proportion | Followers | % of Tweets
Popular | Top 0.1% | >= 6828 | 0.1
Medium | Top 0.1 to 5% | 95 to 6828 | 58
Ordinary | Top 5 to 30% | 8 to 94 | 37
Inactive | Bottom 0 to 70% | < 8 | 4
Related Work: M. Cha, F. Benevenuto, H. Haddadi and K. Gummadi, "The World of Connections and Information Flow in Twitter," in IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2012. Classified users among Popular, Evangelist and Grassroots on the basis of the CDF of incoming links.
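The follower thresholds on this slide translate directly into a small classifier. The sketch below is illustrative only: the function name `classify_user` is hypothetical, and since the slide lists 6828 in both the Popular and Medium ranges, assigning exactly 6828 followers to Popular is an assumption.

```python
def classify_user(followers: int) -> str:
    """Map a follower count to the user class from the slide's thresholds."""
    if followers >= 6828:  # boundary value assigned to Popular (assumption)
        return "Popular"
    if followers >= 95:
        return "Medium"
    if followers >= 8:
        return "Ordinary"
    return "Inactive"


# Example: bucket a handful of accounts by follower count.
sample = {"u1": 12000, "u2": 300, "u3": 40, "u4": 2}
classes = {user: classify_user(count) for user, count in sample.items()}
```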
TWEETING VOLUME AND ADOPTION
[Figure panels: Volume of Tweeting; Early Adopters]
Observation:
1. The volumes of tweets by popular and ordinary users are not distinguishable from one another.
2. Popular users start tweeting sooner than ordinary users by approximately 10% of the growth phase duration.
PARTICIPATION OF BOTH POPULAR AND ORDINARY USERS DECAYS OVER TIME
Observation: Popular users participate longer in an event.
CONTENT COPYING CHARACTERISTICS
Observation:
1. Popular users write more original tweets than retweets, in a ratio of 60:40.
2. The tweets by popular users are retweeted 6 times more than tweets by other users.
TIME-DELAY CHARACTERISTICS OF CONTENT COPYING
Observation:
1. Popular users are the quickest to retweet, faster than other users by a factor of 8.
2. Popular users show a preference to retweet tweets by less popular users sooner, probably those who are their friends.
INFLUENCE ON GROWTH RATES
Participation of Popular Users | Growth Rate: Low | Growth Rate: Medium | Growth Rate: High
High | 28.69% | 62.30% | 9.02%
Medium | 17.87% | 59.90% | 22.22%
Low | 16.25% | 60.00% | 23.75%
Related Work:
Harrigan et al., "Influentials, novelty, and social contagion: The viral power of average friends, close communities, and old news", Social Networks (Elsevier), 34(4), 2012. It is community structure rather than hubs that substantially increases social contagion on Twitter.
Weng et al., "Virality prediction and community structure in social networks", Scientific Reports, 3(2522), 2013. The role of network structure is more likely to be a powerful driver for the emergence of trends.
Sharad Goel et al., "The Structural Virality of Online Diffusion", Management Science 62(1), 2015. A piece of content spreads not only through viral spreading but also through mass media or marketing efforts that rely on a broadcast mechanism.
Domingos and Richardson, "Mining the network value of customers", KDD '01. Key factors in determining influence are the relationship among ordinary users and the readiness of the social network to accept a novel item.
Observation:
1. Popular users do not seem to have influence on the growth rate of events.
2. The level of popular users' participation doesn't say much about the virality of topics.
CONCLUSIONS
The indicators based on popular users slightly lead the bulk of the population, but they don't give a lot more or earlier information than ordinary users.
Popular users do not predict which events will become popular. At best, tracking popular users can give a reflection of how the overall population will behave, so they can be used as markers instead of tracking all the users.
Related Work:
1. M. García-Herranz, E. M. Egido, M. Cebrián, N. A. Christakis, J. H. Fowler, "Using friends as sensors to detect global-scale contagious outbreaks", PLoS ONE abs/1211.6512 (4) (2014). Information collected from a few randomly selected individuals and their friends can detect contagious disease outbreaks in advance.
2. Muhammad Bilal Zafar, Parantapa Bhattacharya, Niloy Ganguly, Saptarshi Ghosh, and Krishna P. Gummadi. 2016. On the Wisdom of Experts vs. Crowds: Discovering Trustworthy Topical News in Microblogs. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '16). ACM, New York, NY, USA. For discovering news stories related to a topic, it is sufficient to analyze only the tweets posted by a small number of experts on the topic rather than those of the global Twitter population (the crowd).
MOTIVATION
Limitation of CDNs: Limited content can be cached due to space constraints.
Problem: How to improve cache replacement strategies in CDNs?
Use OSN signals to guide content placement strategies:
1. OSN websites know which topics are trending.
2. Long-tail content can impede LRU performance.
[Figure: Long tail of content]
Prior Work: S. Scellato et al., "Track globally, deliver locally: improving content delivery networks by tracking geographic social cascades". In Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, USA, 2011. Used geographic information extracted from social cascades to improve the caching of multimedia files in a Content Delivery Network.
Related Work:
P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube Traffic Characterization: A View from the Edge. In IMC, San Diego, CA, 2007. Video requests follow a Zipf-like distribution.
M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. Analyzing the video popularity characteristics of Large-Scale user generated content systems. IEEE/ACM Transactions on Networking 17(5), 2009. Video popularity follows a power-law distribution with an exponential cutoff.
PERFORMANCE OF LRU VS OSN-AIDED EVENT-BASED CACHING
Observation: The OSN-aided event-based algorithm performs better only when the cache size is small.
RESIDENCE TIME OF CONTENT OBJECTS IN THE CACHE
Topic | Time spent when cache size = 8 GB | Time spent when cache size = 225 GB
IRANELECTION (Popular Topic) | 8 days | 31 days
NECC09 (Medium Popular Topic) | 2 minutes | 2 hours
ICANN (Non-Popular Topic) | 6 seconds | 7 minutes
Observation: Content objects of popular topics may never get evicted from the cache.
REMOVAL OF K MOST POPULAR TOPICS
[Figure panels: Crossover = 31 GB; Crossover = 224 GB; Crossover = 1323 GB; Moves further]
K | Cache Size (Exponential consumption in 2 days) | Cache Size (Uniform consumption in 7 days)
0 | 31 GB | 78 GB
1 | 224 GB | 516 GB
2 | 1323 GB | OSN aided
5 | OSN aided | OSN aided
WORKLOAD VARIANTS
Changing number of surrogate servers
Changing count of objects per topic
Changing content selection behavior
Observation: Consistent trend of naïve LRU outperforming OSN-aided caching algorithms at larger cache sizes.
SUPPLEMENTARY ANALYSIS: CACHE HITS VS POPULARITY DISTRIBUTION
[Figure annotation: Region of Study]
Related Work: P. R. Jelenković, A. Radovanović, and M. S. Squillante. Critical sizing of LRU caches with dependent requests. Journal of Applied Probability, 43:1013-1027, 2006. LRU performs as well on a dependent sequence of random requests as it does on an independent sequence of requests with the same request frequencies, as long as the cache size is large enough.
Observation: LRU performs very well as the popularity distribution becomes more skewed.
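The observation about skew can be checked on a toy workload. The sketch below is not the simulator used in the study: it assumes independent requests drawn from a Zipf-like distribution with exponent `alpha`, unit-size objects, and hypothetical function names (`zipf_requests`, `lru_hit_ratio`).

```python
import random
from collections import OrderedDict
from itertools import accumulate


def zipf_requests(n_objects, alpha, n_requests, seed=0):
    """Draw object ids with probability proportional to 1 / rank**alpha."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_objects + 1)]
    cumulative = list(accumulate(weights))
    return rng.choices(range(n_objects), cum_weights=cumulative, k=n_requests)


def lru_hit_ratio(requests, capacity):
    """Fraction of requests served from an LRU cache holding `capacity` objects."""
    cache = OrderedDict()
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # mark as most recently used
        else:
            cache[obj] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests) if requests else 0.0


# Same catalogue and cache size, two levels of popularity skew.
mild = lru_hit_ratio(zipf_requests(10_000, 0.6, 20_000), capacity=100)
steep = lru_hit_ratio(zipf_requests(10_000, 1.2, 20_000), capacity=100)
```

With a more skewed distribution a small set of objects receives most requests, so they tend to stay near the front of the LRU list and the hit ratio rises.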
CONCLUSIONS
1. Assuming OSN-based workloads to be representative of the web workload, the wide range of parameters over which LRU outperforms OSN-aided event-based caching algorithms indicates that the trend should hold in general.
2. People are interested in general global content and not just local, community-centric content. This results in a Zipfian-like popularity distribution in the web workload, and therefore even long-tailed content is not able to disrupt LRU performance.
CONCLUSIONS FROM RESEARCH WORK
1. Tracking of OSN trends can be optimized by marking popular users only.
2. Popular users don't seem to have influence on the growth rate of events.
3. OSN-aided cache replacement algorithms have limited scope for CDNs handling Web workloads.
Thanks for listening @ruhela_amit