Building All k Nearest Neighbor Social Communities

Paris Descartes University, Seminar Series on Data Analytics,

Paris, France, April 12, 2017

Building All k Nearest Neighbor

Social Communities

Max Planck Institute for Informatics, Germany & Department of

Computer Science, University of Cyprus, Cyprus

http://www.cs.ucy.ac.cy/~dzeina/

dzeina@cs.ucy.ac.cy

undefined

Acknowledgements

Joint work with:

•

–

Sources (GPL):

https://github.com/dmsl/rayzit/tree/master/spitfire

•

–

Sources (MIT):

 https://github.com/dmsl/rayzit/

undefined

Online Social Communities

•

–

–

‘

‘

‘

‘

‘

‘

‘88

‘99

‘03

‘04

‘06

‘08

‘11

‘13

undefined

Online Social Communities

•

–

“huge value for individuals

…

  especially when they are working

in far more difficult parts of the world.” J. Bartlett, dw.com,

25.7.2016.

–

Apps: Games (MMOG), Education(MOOC), Crowdsourcing.

•

–

–

–

Topic to be revisited at the end of this talk.

undefined

•

 Smart-homes, smart cars, smart anything

…

•

 Personal assistants (e.g., siri, cortana) and

Chatbots (skype) for Commerce and Q/A

Socializing with People (and Bots)

undefined

Problem 1: Lack of Proximity

•

Current Microblogging apps are unable to

realize proximity-based applications, e.g., for:

–

V2V communication scenarios

–

Adhoc events

–

Emergency response scenarios

•

Crowd around is not in your Social graph



–

#Hashtags not useful either (difficult to be discovered)

undefined

Problem 2: Location Privacy

•

“

”

–

•

“

”

–

undefined

Problem 3: Boostrapping

•

–

“Bootstrapping refers to the starting of a self-

sustaining community that is supposed to proceed

without external input”

wikipedia

•

’

•

•

’

undefined

Rayzit Concept

•

–

Not location-based app but kNN-based! (explained next)

•

Rayzit was funded by the Appcampus Program

(Microsoft, Nokia & Aalto, Finland) in 2013.

–

Ranked among 5 best apps of the program from 3500 apps.

–

•

http://rayzit.com/

undefined

AkNN Networks

 "Continuous all k-nearest neighbor querying in smartphone networks", Georgios Chatzimilioudis,

Demetrios Zeinalipour-Yazti, Wang-Chien Lee, Marios D. Dikaiakos, In IEEE MDM'12.

undefined

•

Rayzit User Map

Large-scale

AkNN Networks

…

undefined

Presentation Outline

•

Introduction

•

–

’

•

Rayzit Crowd Messenger

–

Mobisocial @ IEEE MDM’15

•

Future Challenges

undefined

System Model

•

•

undefined

kNN Query Definition

•

–

undefined

AkNN Query Definition

•

•

≠

≠

∀

∈

∀

∈

≤

⋈

∈

∈

undefined

Centralized AkNN

•

Classical problem with many centralized

algorithms

•

•

–

Memory-resident structures &  algorithms (theory)

–

Disk-resident structures & algorithms (databases)

–

Parallel algorithms (Graphics)

•

–

Not designed for distributed Big Data Architectures

(shared-nothing cloud infrastructure)

undefined

Distributed AkNN: Issues

…

A) Communication Overhead

•

B) Load Balancing

•

•

undefined

Other Hadoop AkNN

•

–

n: number of nodes

–

m: number of servers

–

f: replication factor of an algorithm

undefined

H-NJ Algorithm

•

Hadoop Naïve kNN Join Algorithm [1]:

–

–

⋈

•

–



–



[1] Lu et. al VLDB’12

⋈

undefined

H-BNLJ Algorithm

•

H. Block Nested Loop kNN Join Algorithm [2]:

–

√

≤

√

–

⋈

–

√

–

[2] Zhang et. al EDBT’12

…

…

…

…

…

…

undefined

H-BRJ Optimization

•

–

–

√

√

–

Optimal Load Balancing = 1

√

[2] Zhang et. al EDBT’12

…

…

…

…

…

…

all

combos

√

undefined

PGBJ Algorithm

•

–

√

–

“

”

–

√

–

–

⋈

[1] Lu et al. VLDB’12

’

’

’

’

pivots &

statistics

only

geo-clustering

(voronoi partiotioning)

√

undefined

The Spitfire Algorithm

•

–

1) Quick Partitioning (each cell at least k)

–

2) Replication border cases minimally

–

3) Perform local AkNN computation efficiently

undefined

Spitfire - Partitioning

1.

–

–

–

Using equi-width and equi-depth histograms

Steps:

1.

2.

√

3.

Result:

•

•

•



√

undefined

Spitfire - Replication

2.

–

–

’

–

•

•

–

undefined

Spitfire - Replication

2.

–

“

”

–

“

”

“

”

–

“

”

undefined

Spitfire - Refinement

3.

–

⋈

∪

Same intuition:

1.

Partition using a grid

2.

Find candidates per cell

3.

Refinement per cell

⋈

kNN

undefined

Evaluation Testbed

•

9 computing nodes

•

8 GB RAM

•

CPUs@2.40GHz

•

MapReduce over

Tachyon

(memory-centric file system)

•

Datasets:

–

–

–

–

–

–

–

–

undefined

Response Time

•

Spitfire

and

PGBJ

scale better than block algorithms H-BNLJ and H-BRJ

that run out of memory!

•

Spitfire

outperforms all other algorithms in all cases.

•

Spitfire better than PGBJ on all datasets:

(67%, 75%, 14% on

Random, Oldenburg, Geolife).

Re-computation

possible every

100 sec (~every

few min)

undefined

Partitioning and Replication

•

Spitfire

partitioning is much

faster

 than

PGBJ

•

PGBJ

is growing exponentially

•

H-BNLJ

and

H-BRJ

do not partition geographically



 suffer in refinement

•

Same response times for all datasets



 partitioning independent of dataset

undefined

Replication Factor (f)

•

Plot for most skewed dataset. Others have similar results

•

Spitfire

outperforms

 (optimal value for f is 1)

•

•

H-BNLJ

and

H-BRJ

2m

0.5

, independent of

•

Replication factor offers clear comparison independent of implementation

undefined

Presentation Outline

•

Introduction

•

AkNN Query Processing

–

IEEE TKDE’16

•

–

’

•

Future Challenges

undefined

Rayzit Application

•

–

•

•

•

•

•

A message can be bounded to a

certain distance (kNN is default)

•

0.5, 5, 50, 500, 5000 km

undefined

Example Conversations

undefined

Example Conversations

undefined

Example Conversations

undefined

Example Conversations

undefined

More Topics on Rayzit

•

Topics:

Location, Surveys, Thoughs, Feelings

•

Most answered rayz (responses)

–

“Rain! Lovely.”

(228)

–

“Any good PGs near Avanshi RoadCBE?”

(142)

–

“Which windows phone do you have?”

(119)

–

“anyone around..???”

(94)

–

“Share a secret ..”

 (90)

–

“Rayzit: What is the next feature you would like to see?”

(81)

–

“I love.... (continue)” (58)

–

“Where are you from? I want to test Rayzit location.”

 (52)

–

“Where to dine in Coimbatore if you want to have a dinner

peacefully and by yourself?”

(40)

undefined

Analyzing the Rayzit Network

•

We used both Flurry App Analytics and Analytic

queries on our NoSQL database

•

–

–

32,762  Sessions

–

5,062  New Users

–

1.7 mins  Median

Session Length

•

–

App more popular in Asia, India particularly because

of large Microsoft community and Microsoft

Promotion)

undefined

Average Reply Time





undefined

Average and Maximum

Distance (Rayz, Reply)

•

People in classroom, event?

undefined

Presentation Outline

•

Introduction

•

AkNN Query Processing

–

IEEE TKDE’16

•

Rayzit Crowd Messenger

–

Mobisocial @ IEEE MDM’15

•

undefined

Future Challenges

•

•

–

–

–

…

undefined

Future Challenges

•

Shallow syntax in messages:

2m, 2morrow,

tmw, tomoz, tomz,

all refer to tomorrow.

•

’

•

•

undefined

Future Challenges

Paris Descartes University Seminar Series on Data Analytics

in collaboration with the diNo group, April 12, 2017

Building All k Nearest Neigbhor

Social Communities

Max Planck Institute for Informatics, Germany & Department of

Computer Science, University of Cyprus, Cyprus

http://www.cs.ucy.ac.cy/~dzeina/

dzeina@cs.ucy.ac.cy

Slide Note

Embed Share

Download

This content discusses the development of online social communities facilitated by microblogging tools on the Internet. It explores the structure and content of these communities and highlights both their benefits and drawbacks, emphasizing privacy, censorship, and the risks of negative discourse. The content also touches on the implications of the spread of illegal content within these communities.

olszewski_g Follow

Uploaded on Feb 28, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Building All k Nearest Neighbor Social Communities Demetris Zeinalipour Assistant Professor Max Planck Institute for Informatics, Germany & Department of Computer Science, University of Cyprus, Cyprus http://www.cs.ucy.ac.cy/~dzeina/ dzeina@cs.ucy.ac.cy Paris Descartes University, Seminar Series on Data Analytics, Paris, France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Acknowledgements Joint work with: "Distributed In-Memory Processing of All k Nearest Neighbor Queries", Georgios Chatzimilioudis, Constantinos Costa, Demetrios Zeinalipour-Yazti, Wang-Chien Lee and Evaggelia Pitoura, IEEE Transactions on Knowledge and Data Engineering (TKDE'16), IEEE Computer Society, Vol. 28, Iss. 4, pp. 925-938, 2016. Sources (GPL): https://github.com/dmsl/rayzit/tree/master/spitfire "Rayzit: An Anonymous and Dynamic Crowd Messaging Architecture", Constantinos Costa, Chrysovalantis Anastasiou, Georgios Chatzimilioudis and Demetrios Zeinalipour-Yazti, in IEEE Mobisocial '15, Vol. 2, pp. 98-103, Pittsburgh, PA, USA, 2015. Sources (MIT): https://github.com/dmsl/rayzit/ 2 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Online Social Communities With the advent the Internet, microblogging chat tools have established new means for Online Social Networks: Content: Microblogging data (small elements of content such as short text, images, video.) Structure: Implicit Graph (address book) or Explicit Graph (followers, friends) UNIX talk 88 99 03 IRC MSN Skype Facebook 04 Twitter 06 Foursquare 08 Snapchat 11 Rayzit 13 SMS 70 80 90 00 10 15 Chat Messaging Social Location Forget Anon. LBS 3 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Online Social Communities Benefits: avoid censorship and retain privacy huge value for individuals especially when they are working in far more difficult parts of the world. J. Bartlett, dw.com, 25.7.2016. Apps: Games (MMOG), Education(MOOC), Crowdsourcing. Drawbacks: accountability and traceability threats, rumor spreading and negative discourse (e.g., intimidate, control, manipulate, falsely discredit, or humiliate) leading to the shutdown of services (e.g., Secret) illegal content dissemination leading an ASN to a dark network (e.g., trade drugs, sex, cyber-arms, weapons, counterfeit currency). Topic to be revisited at the end of this talk. 4 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Socializing with People (and Bots) The trends show that socialization will NOT be confined within specific social circles: Smart-homes, smart cars, smart anything Personal assistants (e.g., siri, cortana) and Chatbots (skype) for Commerce and Q/A 5 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Problem 1: Lack of Proximity Current Microblogging apps are unable to realize proximity-based applications, e.g., for: V2V communication scenarios Adhoc events Emergency response scenarios Crowd around is not in your Social graph #Hashtags not useful either (difficult to be discovered) 6 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Problem 2: Location Privacy LB apps can continuously know (surveil, track or monitor) the location of a user while serving them. Which Location settings to keep on? Please rob me awareness project shows the impact of location release. Shows when users check-in somewhere that is not their home (Foursquare location updates). 7 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Problem 3: Boostrapping LB apps suffer from bootstrapping issues Bootstrapping refers to the starting of a self- sustaining community that is supposed to proceed without external input wikipedia Most LB apps don t have enough users online upfront. This gives the impression to new users that the app is deserted. Consequently, many people don t use it and the app gradually dies. 8 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Rayzit Concept Rayzit: a crowd messaging system that instantly forms a dynamic social network based on user proximity Not location-based app but kNN-based! (explained next) Rayzit was funded by the Appcampus Program (Microsoft, Nokia & Aalto, Finland) in 2013. Ranked among 5 best apps of the program from 3500 apps. Over 45,000 user interactions from over 10,000 users! Research Problem: Big Data AkNN Query Processing http://rayzit.com/ 9 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 AkNN Networks Example Problem: Find 2 Closest Neighbors for ALL User "Continuous all k-nearest neighbor querying in smartphone networks", Georgios Chatzimilioudis, Demetrios Zeinalipour-Yazti, Wang-Chien Lee, Marios D. Dikaiakos, In IEEE MDM'12. 10 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Large-scale AkNN Networks Rayzit User Map How to compute the kNN for millions of mobile users every few minutes? from scratch, without tracking (recording) the location of users 11 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Presentation Outline Introduction AkNN Query Processing IEEE TKDE 16 Rayzit Crowd Messenger Mobisocial @ IEEE MDM 15 Future Challenges 12 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 System Model |O| = n objects report location to QP QP computes AkNN query every few minutes. o2 QP O o1 o8 o3 o7 o5 o9 o4 o6 o11 o10 13 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 kNN Query Definition kNN(q, O): Given a multi-dimensional query point q, find the k objects in O that have the smallest distance* to q. *e.g. Manhattan (L1), Euclidean (L2), Geographical distance (earth surface distance) q O o1 o2 2NN(q,O) = {o1,o2} o3 o4 o5 14 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 AkNN Query Definition kNNs of oa not kNNs of oa Find kNN for all objects in O! AkNN(O): Given objects oa ob oc, ob kNN(oa,O) and oc O - kNN(oa,O), it holds that dist(oa,ob) dist(oa,oc) o2 o1 O AkNN = kNN Self-Join o8 o3 kNN Self-Join Definition O kNN O = {(oa, ob) | oa, ob O and ob kNN(oa, O)} o7 o5 o9 o4 o6 o11 o10 O(n2)-time search cost 15 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Centralized AkNN Classical problem with many centralized algorithms Applications: computational geometry, image processing, data compression, clustering and spatial databases. Basic Types of Algorithms: Memory-resident structures & algorithms (theory) Disk-resident structures & algorithms (databases) Parallel algorithms (Graphics) Common Problem: Not designed for distributed Big Data Architectures (shared-nothing cloud infrastructure) 16 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Distributed AkNN: Issues s1 2NN Let us partition the 11 objects over 4 servers {s1, ,s4} Problems: A) Communication Overhead O1: 2NN(o1) are located on adjacent nodes s1 and s2 => We need to minimize the replication (f) among nodes! B) Load Balancing s4: 15 distances among 6 objects (i.e., n(n-1)/2) s3:only 3 distances! => We need to yield a fair space partitioning! s2 o2 o1 o3 o8 o5 o7 o4 o9 o6 o10 o11 s3 s4 17 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Other Hadoop AkNN Bibliography included a few new distributed algorithms founded on the Map-Reduce model. n: number of nodes m: number of servers f: replication factor of an algorithm 18 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 PGBJ Algorithm Partition Group Block kNN Join [1]: Preprocessing: Scatter n pivot points in space (part.) Map1: Map each object to its nearest pivot and store min and max distances of mappings to pivots (for pruning) Midstep: Balance ngroups into m clusters Oi(balancing) Map2: Compute candidates Fifor Oi(correctness) Reduce2: Oi kNNFilocally [1] Lu et al. VLDB 12 pivots & statistics only n O1 O2 O m O 1 O 2 O 3 O m O3 Om geo-clustering (voronoi partiotioning) Workers Workers 22 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 The Spitfire Algorithm Spitfire Execution Phases 1) Quick Partitioning (each cell at least k) 2) Replication border cases minimally 3) Perform local AkNN computation efficiently Distributed Algorithm implemented in MPI 23 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Spitfire - Partitioning 1. Partitioning (Geographical) O(n) m disjoint, appr. equally populated subareas (balancing) Using equi-width and equi-depth histograms Steps: 1.Hash into equi-width buckets on x-axis 2.Group x-buckets into m approx. equi-depth groups 3.For each x-group do step 1 and 2 on y-axis Result: m appr. equi-depth groups. Each cell has at least k items O(n) assuming that number of buckets is less than n 24 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Spitfire - Replication 2. Replication Step In the partitioning step, we made sure that each neighboring cell to Oicontains k entries. Problem: Oican t complete the AkNN computation for its internal objects as there are objects (e.g., the red bullet) lying close to neighbors. Solution: Receive External Candidates (EC) from neighbors. If all nodes do this, each Oi can next perform a local AkNN computation! Next Problem: How to find EC? Oi 25 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Spitfire - Replication 7 2. Replication: How to derive EC? A) Segmenting Border into smaller b segments B) Apply Top-k hiding rule for each b to find which local objects are necessary for neighbor Oj. C) Ship those candidates over to Ojin batch mode e O4 O6 O8 and the condition mindist(o,b) < b is checked for each segment b. If this condition holds then object o is part of ECb. In our example, ECbe= {o1,o2}, ECe= {o1,o2} and ECed= {o2,o4}. Now s1sends ECbeto s2, ECeto s3, and ECedto s4, based on the adjacency described earlier. Similarly, the above steps take place in parallel on each server. Therefore, s1 receives from s2 the ECbeof set O2, from s3 the ECe of set O3, and from s4 the ECedof set O4. Hence, server s1 will be able to construct its EC1 = b BiECb= {o5,o6,o7,o8}. The External Candidate compu- tation completes and the local kNN refinement phase initiates computing kNN(o,Oi ECi), o Oi on each server si. s1 s2 a b c O1 O2 O3 O3 O1 O13 O5 O2 O12 f d (A): Segmenting Border (B): Top-k hiding rule O1 O3 O7 b O9 O10 O11 g h i b s4 s3 If k objects are closer to b than o1and o3, then o1is guaranteed NOT to be Fig. 4. (Top) o2 hides o1 from o3, (Bottom) Segment b hides o1 a kNN of o3and vice versa. Fig. 3. Server s1sends {o1,o2} to s2, {o1,o2} to s3, and {o2,o4} to s4. Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017 from o3. 4 In this section we first show that our algorithm leads to a cor- rect AkNN result, i.e., based on the External Candidates determined by computeECB. Then, we analyze its computational and communication cost. CORRECTNESS AND ANALYSIS 26 m ikNN(o,ECi Oi) = kNN(o,O), In Spitfire, we partition sub-area Ai of server si into a grid of equi-width cells Ci. Each cell c Ci contains a disjoint subset Oc Oi of objects. Next, we compute locally a correct external candidate set ECcfor each cell c Ci (similarly to the replication step). Finally, we find the kNNs for each object o Oi by computing kNN(o,Oc ECc). In Algorithm 4, objects o Oi are scanned once to build a k-min heap Hc for each cell c Ci based on the minimum distance mindist(o,c) between o and the cell-border (Line 3). The first k objects are then popped from Hc to determine threshold c, based on Equation (2) (Lines 4-5). Objects o Oi are scanned once again to determine the External Candidates ECcthat satisfy the threshold as in Equation (3) (Lines 6-10). Finally, si computes kNN(o,Oc ECc), o Oc(Line 11). Given optimal load balancing, the building phase (heap construction and External Candidates) completes in O(n time, whereas finding the kNN within Oc ECc completes in O(fSpitfiren 4.1 To prove correctness, we show that it suffices to compute the External Candidates ECBito border Bi in order to find the External Candidates ECi of the whole area Ai, given area Ai, its border Bi, and the necessary objects around Bi. In the following, we first define the notion of point hiding. Correctness of the computeECB function Definition 1 (Point Hiding). Given three points o1,o2,o3on a line, which holds the following relationship dist(o1,o3) = dist(o1,o2)+ dist(o2,o3), we say that o2hides o1and o3from each other. In Figure 4 (top) point o2hides o1and o3from each other. m) Lemma 1. Given three points o1, o2, o3 where o2 hides o1 from o3and the fact that o1is not a kNN of o2, it holds that o1is not a kNN of o3, and vice versa. n m= |Oi|. m) time, where 3.5 Given an object set O, assume that a set of servers {s1,s2,s3,s4} have been assigned to sub-areas {abde, bcfe, efih, dehg}, respectively (see Figure 3). In the following, we discuss the processing steps of server s1. The objects of s1are O1={o1,o2,o3,o4}, its adjacent servers are Adji={s2,s3,s4}, and its border segments are B1 = {a,ab,b,be,e,ed,d,da}. For simplicity we have defined the border segments to be a one-to-one mapping to the corresponding adjacent servers. As shown, border segment be is adjacent to server s2, corner e is adjacent to server s3, and segment ed is adjacent to server s4. Server s1locally computes ECbfor each b Bi. It does so by scanning all objects o O1and building a heap Hbfor each border segment bbased on mindist(o,b). The k closest objects to each b are popped from Hb as a result. In our example, {o1,o2} are the k closest objects to segment be, {o1,o2} are the k closest objects to e, and {o2,o4} are the k closest objects to segment ed. For each segment b, its pruning threshold b is deter- mined by the largest maxdist of its closest objects computed in the previous step. For instance, for segment be this is b= maxdist(o1,be), since maxdist(o1,be) > maxdist(o2,be). Given the thresholds b, all objects o O1are scanned again Running Example Proof: To prove that o1 is not a kNN of o3 it suffices to prove that there are k points closer than o1 is to o3. The fact that o1 is not a kNN of o2means that there are k other points, {p1,p2,...,pk}, in space that are closer to o2 than is o1, dist(pi,o2) dist(o1,o2). It holds that dist(pi,o3) dist(o1,o2)+ dist(o1,o3) dist(o2,o1) based on trigonometry, which gives dist(pi,o3) dist(o1,o3). Therefore there are k points, namely {p1,p2,...,pk}, that are closer to o3than is o1 Similarly, we can extend the notion of hiding from a point to a line segment, i.e., border. In Figure 4 (bottom) segment b hides o1and o3from each other. Definition 2 (Segment Hiding). Given two points o1, o3, and a segment b, we say that b hides o1 and o3 from each other, when there is always a point o b that hides o1and o3from each other. Lemma 2. Given two points o1and o3, a segment bthat hides o1from o3, and the fact that o1is not a kNN of any point on b, it holds that o1is not a kNN of o3, and vice versa. Proof: It suffices if it holds that dist(ki,o3) dist(o1,o3) for 1 i k. Given that o1is not a kNN of any point p b,

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Spitfire - Refinement 3. Local Refinement -- O( fSpitfiren2/m2 ) execute locally Oi kNN(Oi ECi) using any AkNN algorithm (we use a variant of Proximity IEEE MDM12) si Same intuition: 1.Partition using a grid kNN 2.Find candidates per cell 3.Refinement per cell ECi 27 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Evaluation Testbed 9 computing nodes 8 GB RAM 2 CPUs@2.40GHz MapReduce over Tachyon (memory-centric file system) Datasets: Random (synthetic) uniform distribution Oldenburg (realistic) skewed distribution Geolife (realistic) very skewed distribution Rayzit (real) skewed distribution up to 2x10000 users 28 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Response Time Re-computation possible every 100 sec (~every few min) Spitfire and PGBJ scale better than block algorithms H-BNLJ and H-BRJ that run out of memory! Spitfire outperforms all other algorithms in all cases. Spitfire better than PGBJ on all datasets: (67%, 75%, 14% on Random, Oldenburg, Geolife). 29 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

11 RANDOM: Total Computation for varying number of users (k=64, m=9) OLDENBURG: Total Computation for varying number of users (k=64, m=9) GEOLIFE: Total Computation for varying number of users (k=64, m=9) RAYZIT: Total Computation for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 100000 100000 100000 100000 Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire 10000 10000 10000 10000 1000 1000 1000 1000 Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory 100 100 100 100 10 10 10 10 1 1 1 1 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 2*104 Number of online users (n) Fig. 8. AkNN query response time with increasing number of users. We compare the proposed Spitfire algorithm against the three state-of-the-art AkNN algorithms and a centralized algorithm on four datasets. Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Partitioning and Replication GEOLIFE: Partitioning and Replication for varying number of users (k=64, m=9) RANDOM: Partitioning and Replication for varying number of users (k=64, m=9) OLDENBURG: Partitioning and Replication for varying number of users (k=64, m=9) RAYZIT: Partitioning and Replication for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 1000 1000 1000 1000 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire 100 100 100 100 10 10 10 10 1 1 1 1 0.1 0.1 0.1 0.1 0.01 0.01 0.01 0.01 104 105 106 104 105 106 104 105 106 2*104 Number of online users (n) Number of online users (n) Number of online users (n) Number of online users (n) Spitfire partitioning is much faster than PGBJ PGBJ is growing exponentially H-BNLJ and H-BRJ do not partition geographically suffer in refinement Same response times for all datasets partitioning independent of dataset Fig. 9. Partitioning and Replication step response time with increasing number of users. 30 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017 RANDOM: Refinement for varying number of users (k=64, m=9) OLDENBURG: Refinement for varying number of users (k=64, m=9) GEOLIFE: Refinement for varying number of users (k=64, m=9) RAYZIT: Refinement for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 10000 10000 10000 10000 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire 1000 1000 1000 1000 100 100 100 100 Out Of Memory Out Of Memory Out Of Memory 10 10 10 10 1 1 1 1 0.1 0.1 0.1 0.1 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 2*104 Number of online users (n) Fig. 10. Refinement step response time with increasing number of users and for each available dataset. RANDOM: Replication factor for varying number of users (k=64, m=9) OLDENBURG: Replication factor for varying number of users (k=64, m=9) GEOLIFE: Replication factor for varying number of users (k=64, m=9) RAYZIT: Replication factor for varying number of users (k=64, m=9) 11 11 11 11 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire Replication factor (f) Replication factor (f) Replication factor (f) Replication factor (f) 9 9 9 9 7 7 7 7 5 5 5 5 3 3 3 3 1 1 1 1 104 105 106 104 105 106 104 105 106 2*104 Number of Users (n) Number of Users (n) Number of Users (n) Number of Users (n) Fig. 11. Replication factor f with increasing number of users. The optimal value for f is 1, signifying no replication. TABLE 3 up the values shown in Figures 9 and 10 and comparing to the total response time in Figure 8, it becomes obvious that most of H-BRJ s response time is spent in communication, which is indicated theoretically by its communication complexity of O( mn) shown in Table 2. We focus on comparing only Spitfire and PGBJ for the rest of our evaluation. For 104online users, Spitfire outperforms all algorithms by at least 85% for all dataset, whereas for 105Spitfire outperforms PGBJ, by 75%, 75% and 53% for the Random, Oldenburg and Geolife datasets, respectively. Values used in our experiments Section 5.5 5.6 5.7 5.8 5.9 Dataset ALL Random ALL Random, Rayzit Random n k 64 64 64 m 9 9 9 9 [104, 105, 106] 106 106(104Rayzit) 106(104Rayzit) 106 4i, 1 i 5 64 [3, 6, 9] 5.5 Varying Number of Users (n) In this experimental series, we increase the workload of the system by growing the number of online users (n) exponen- tially and measure the response time and replication factor of the algorithms under evaluation. Total Computation: In Figure 8, we measure the total re- sponse time for all algorithms, datasets and workloads. We can clearly see that Spitfire outperforms all other algorithms in every case. It is also evident that H-BNLJ and H-BRJ do not scale. H-BRJ achieves the worst time for 106users. Adding Spitfire and PGBJ are the only algorithms that scale. For a million online users (n=106), Spitfire and PGBJ are the fastest algorithms, but Spitfire still outperforms PGBJ by 67%, 75%, 14% for the Random, Oldenburg and Geolife datasets, respec- tively. The small percentage noted for the Geolife dataset is attributed to the fact that this dataset is highly skewed (as observed in Figure 7), and that PGBJ achieves better load balancing (as shown later in Section 5.7), which in turn leads to a faster refinement step.

11 RANDOM: Total Computation for varying number of users (k=64, m=9) OLDENBURG: Total Computation for varying number of users (k=64, m=9) GEOLIFE: Total Computation for varying number of users (k=64, m=9) RAYZIT: Total Computation for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 100000 100000 100000 100000 Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire Proximity H-BNLJ H-BRJ PGBJ Spitfire 10000 10000 10000 10000 1000 1000 1000 1000 Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory Out Of Memory 100 100 100 100 10 10 10 10 1 1 1 1 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 2*104 Number of online users (n) Fig. 8. AkNN query response time with increasing number of users. We compare the proposed Spitfire algorithm against the three state-of-the-art AkNN algorithms and a centralized algorithm on four datasets. RANDOM: Partitioning and Replication for varying number of users (k=64, m=9) OLDENBURG: Partitioning and Replication for varying number of users (k=64, m=9) GEOLIFE: Partitioning and Replication for varying number of users (k=64, m=9) RAYZIT: Partitioning and Replication for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 1000 1000 1000 1000 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire 100 100 100 100 10 10 10 10 1 1 1 1 0.1 0.1 0.1 0.1 0.01 0.01 0.01 0.01 104 105 106 104 105 106 104 105 106 2*104 Number of online users (n) Number of online users (n) Number of online users (n) Number of online users (n) Fig. 9. Partitioning and Replication step response time with increasing number of users. RANDOM: Refinement for varying number of users (k=64, m=9) OLDENBURG: Refinement for varying number of users (k=64, m=9) GEOLIFE: Refinement for varying number of users (k=64, m=9) RAYZIT: Refinement for varying number of users (k=64, m=9) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) Response time in log-scale (sec) 10000 10000 10000 10000 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire 1000 1000 1000 1000 100 100 100 100 Out Of Memory Out Of Memory Out Of Memory 10 10 10 10 1 1 1 1 0.1 0.1 0.1 0.1 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 104 Number of online users (n) 105 106 2*104 Number of online users (n) Fig. 10. Refinement step response time with increasing number of users and for each available dataset. Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Replication Factor (f) GEOLIFE: Replication factor for varying number of users (k=64, m=9) RANDOM: Replication factor for varying number of users (k=64, m=9) OLDENBURG: Replication factor for varying number of users (k=64, m=9) RAYZIT: Replication factor for varying number of users (k=64, m=9) 11 11 11 11 H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire H-BNLJ H-BRJ PGBJ Spitfire Replication factor (f) Replication factor (f) Replication factor (f) Replication factor (f) 9 9 9 9 7 7 7 7 5 5 5 5 3 3 3 3 1 1 1 1 104 105 106 104 105 106 104 105 106 2*104 Number of Users (n) Number of Users (n) Number of Users (n) Number of Users (n) Plot for most skewed dataset. Others have similar results fSpitfire outperforms (optimal value for f is 1) Fig. 11. Replication factor f with increasing number of users. The optimal value for f is 1, signifying no replication. fSpitfiredrops as n increases (scalability) fH-BNLJand fH-BRJ= 2m0.5, independent of n Replication factor offers clear comparison independent of implementation 31 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017 TABLE 3 up the values shown in Figures 9 and 10 and comparing to the total response time in Figure 8, it becomes obvious that most of H-BRJ s response time is spent in communication, which is indicated theoretically by its communication complexity of O( mn) shown in Table 2. We focus on comparing only Spitfire and PGBJ for the rest of our evaluation. For 104online users, Spitfire outperforms all algorithms by at least 85% for all dataset, whereas for 105Spitfire outperforms PGBJ, by 75%, 75% and 53% for the Random, Oldenburg and Geolife datasets, respectively. Values used in our experiments Section 5.5 5.6 5.7 5.8 5.9 Dataset ALL Random ALL Random, Rayzit Random n k 64 64 64 m 9 9 9 9 [104, 105, 106] 106 106(104Rayzit) 106(104Rayzit) 106 4i, 1 i 5 64 [3, 6, 9] 5.5 Varying Number of Users (n) In this experimental series, we increase the workload of the system by growing the number of online users (n) exponen- tially and measure the response time and replication factor of the algorithms under evaluation. Total Computation: In Figure 8, we measure the total re- sponse time for all algorithms, datasets and workloads. We can clearly see that Spitfire outperforms all other algorithms in every case. It is also evident that H-BNLJ and H-BRJ do not scale. H-BRJ achieves the worst time for 106users. Adding Spitfire and PGBJ are the only algorithms that scale. For a million online users (n=106), Spitfire and PGBJ are the fastest algorithms, but Spitfire still outperforms PGBJ by 67%, 75%, 14% for the Random, Oldenburg and Geolife datasets, respec- tively. The small percentage noted for the Geolife dataset is attributed to the fact that this dataset is highly skewed (as observed in Figure 7), and that PGBJ achieves better load balancing (as shown later in Section 5.7), which in turn leads to a faster refinement step.

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Presentation Outline Introduction AkNN Query Processing IEEE TKDE 16 Rayzit Crowd Messenger Mobisocial @ IEEE MDM 15 Future Challenges 32 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Rayzit Application A user can post a new Rayz (text, audio, video shown as attachment). To minimize spamming, each user has a powerbar that decreases on every action: Rayz, reply, re-rayz, power+/- When empty no more actions are possible until it gradually reloads. To reload faster, we provide the incentive to provide useful answers that will be starred by others. A message can be bounded to a certain distance (kNN is default) 0.5, 5, 50, 500, 5000 km 33 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Example Conversations General Advice & Crowd Casting 34 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Example Conversations Philosophical & Location-based Questions 35 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Example Conversations Feelings Expression & Taboo Questions 36 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Example Conversations Inappropriate Messages (before installation of report filter) 37 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 More Topics on Rayzit Topics: Location, Surveys, Thoughs, Feelings Most answered rayz (responses) Rain! Lovely. (228) Any good PGs near Avanshi RoadCBE? (142) Which windows phone do you have? (119) anyone around..??? (94) Share a secret .. (90) Rayzit: What is the next feature you would like to see? (81) I love.... (continue) (58) Where are you from? I want to test Rayzit location. (52) Where to dine in Coimbatore if you want to have a dinner peacefully and by yourself? (40) 38 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Analyzing the Rayzit Network We used both Flurry App Analytics and Analytic queries on our NoSQL database Interval of Analysis: Oct. 13 Mar. 15 32,762 Sessions 5,062 New Users 1.7 mins Median Session Length Conclusion: App more popular in Asia, India particularly because of large Microsoft community and Microsoft Promotion) 39 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Average Reply Time Online users 28% Offline users Sporadic users Rest 28% of replies were recorded within 1 minute. Crowd Volume creates quick response possibilities Nearby users respond quickly (more interested in the discussion). 40 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Average and Maximum Distance (Rayz, Reply) Hypothesis: Closer Geographic Users means More Replies distance of rayz to replies ([1-50] replies) = 15km (max) >50 replies At 10km (max) >100 replies at 8 meters! Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017 People in classroom, event? 41

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Presentation Outline Introduction AkNN Query Processing IEEE TKDE 16 Rayzit Crowd Messenger Mobisocial @ IEEE MDM 15 Future Challenges 42 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Future Challenges Tools to handle lack of accountability have shown to encourage negative discourse, including personal attacks, threats or rumor spreading. Introduce more automated and crowd management techniques (collaborative filtering): More intelligent Report Management Classify rayz threads to categories More detailed data analysis (interaction graphs, rerayz path length, community detection, chat groups vs. location, new vs. old users, . 43 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Future Challenges Shallow syntax in messages: 2m, 2morrow, tmw, tomoz, tomz,all refer to tomorrow. Off-the-shelf Part-of-Speech (POS) taggers (e.g., Stanford NLP or MPI AIDA) don t work adequately on these messages. Data contains many infrequent entity types (e.g., physicist) and capitalization (used for recognizing named entities, e.g., AlbertEinstein) is not there. Might need to exploit spatial and temporal dimensions in data to understand what something refers to. 44 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Future Challenges Microblog NER (T-POS, T-NER) Temporal Tagging (Heideltime) Quantity Tagging (QKB) Knowledge Base (YAGO, Freebase) { RQ1, RQ2, RQ3} 45 Demetris Zeinalipour, Paris Descartes Univ., France, April 12, 2017

Dagstuhl Seminar 10042, Demetris Zeinalipour, University of Cyprus, 26/1/2010 Building All k Nearest Neigbhor Social Communities Demetris Zeinalipour Thank You! Questions? Max Planck Institute for Informatics, Germany & Department of Computer Science, University of Cyprus, Cyprus http://www.cs.ucy.ac.cy/~dzeina/ dzeina@cs.ucy.ac.cy Paris Descartes University Seminar Series on Data Analytics in collaboration with the diNo group, April 12, 2017

Building All k Nearest Neighbor Social Communities

Download Presentation

Presentation Transcript

Related

More Related Content