Big Networks: Strategies and Cautionary Tales

Outline

1. Introduction: Growth in network data
2. Three Problems, Three Strategies
   1. Big Data
   2. Deep Data
   3. Many Data
3. Examples & War Stories
4. Some Coding & Data Wrangling Hints
5. Conclusion

Introduction

Social network analysis has grown out of the case-study method with heavy theoretical roots in structural anthropology & community studies.

Moreno’s Who Shall Survive (24 nodes; 82 arcs, two relations)


Loomis, 1947 (39 nodes, ~90 relations)


Everyone’s favorite: Karate Club (34 nodes, 156 edges)

Introduction

But contemporary network data are often not the single case studies that characterized the birth of the field. Rather, we now see three common extensions:

a) Massive single networks
   - Online data, customer records, electronic medical records data.
   - Many bipartite network opportunities through social activity tracing.

b) Deep data: the problem of exquisite detail
   - Sensor data, text data, and other sorts of intensive data collection routines produce “thick” data descriptions on even a small number of nodes. (Will only hint at this today due to time.)

c) “Small” multiples
   - Same data structure repeated in many settings.
   - Add Health, Prosper, etc.

How well do our standard strategies scale for these sorts of problems?


Introduction

“Big”, in a social network, is primarily a function of its number of ties. Consequently, we need to think about how we are defining a tie, particularly in contexts where the edges represent affiliations, proximities, or similarities.

For similarities and proximities, all nodes may be “connected” to all other nodes because all pairs of nodes have a value.

To isolate the underlying structure from the noise, it is often useful to thin the network by applying a treatment. These treatments generally come in three flavors:

- Threshold methods (e.g., line islands)
- Ranking methods (e.g., top k alters)
- Likelihood methods (e.g., isolating ties with frequencies greater than we expect by chance)
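The first two flavors can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the toy similarity values are assumptions, the threshold version is a plain cutoff rather than Pajek's line-islands procedure, and the likelihood flavor is not shown.

```python
def thin_by_threshold(weights, cutoff):
    """Keep only ties whose weight is at least `cutoff`."""
    return {pair: w for pair, w in weights.items() if w >= cutoff}


def thin_by_top_k(weights, k):
    """Keep each node's k strongest ties; a tie survives if either endpoint keeps it."""
    ties_by_node = {}
    for (u, v), w in weights.items():
        ties_by_node.setdefault(u, []).append((w, (u, v)))
        ties_by_node.setdefault(v, []).append((w, (u, v)))
    kept = set()
    for ties in ties_by_node.values():
        ties.sort(key=lambda tie: tie[0], reverse=True)
        kept.update(pair for _, pair in ties[:k])
    return {pair: weights[pair] for pair in kept}


# Hypothetical similarity scores: every pair of nodes has a value.
similarity = {
    ("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.1,
    ("b", "c"): 0.8, ("b", "d"): 0.3, ("c", "d"): 0.7,
}

thresholded = thin_by_threshold(similarity, 0.5)
top_one = thin_by_top_k(similarity, 1)
print(sorted(thresholded))
print(sorted(top_one))
```

Both treatments turn a complete graph into a sparse one; which one is appropriate depends on what a tie is supposed to mean in the setting.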

Introduction

If networks now span something like 6 orders of magnitude, do we need new theory? Largely, no.

The foundational division between connectionist and positional approaches to networks stands; what’s changed is that we are forced to specify internal boundary conditions rather than taking our boundary conditions as given by the case.

What first appears as a methods problem is, usually, a theory problem.

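Connections & Positions: Network Problems

[Table: network problems by approach and scope. Columns: Ego, Complete, Multiple. Rows: Connectionist (networks as pipes) and Positional (networks as roles). Two ideas noted: patterns in networks; patterns of networks. Topics listed: community detection, reachability, homophily, degree distribution, social balance, ERGM, multi-layer networks, structural holes, density, mixing models, size, multi-level models of multiple networks, local roles (Mandel 1983; Mandel & Winship 1984), relational block models, motifs.]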

Three Problems, Three Strategies

If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three:

1. Big Networks

Connectionist: What is the relevant flow? How far? What governs spread?

Positional: What’s the social horizon for action within this structure? Roles relative to whom?

Traditional answers to these questions assume a well-bounded relevance horizon and take the full network as relevant. So things like geodesic distances are meaningful. Is that true for networks with billions of edges?

Problem: exponential runtime & interpretability.

2. Too many trees in the forest?

Connectionist: With continuous-time data, “when” is a relation? With 1000s of interactants, who’s relevant for information flow?

Positional: What gives text meaning? How is a note, phrase, or gesture situated in a wider (perhaps unseen) context?

Traditional answers rely on ethnographic sensitivity and deep implicit understanding. Can we leverage computational tools to augment and regularize this?

Problem: Too much data, too little contextual understanding, methods tuned to thin networks.

3. Too many forests?

Connectionist: Why do rumors spread faster in one setting than another?

Positional: Is the relational hierarchy similar across multiple schools?

Traditional answers rely on user judgement for many network analytic decisions; it’s a model rooted in deep data involvement.

Problem: Case paradox: the same skills that make for good judgement in a single case make for either inconsistency across cases or insensitivity to contextual nuance.

Three Problems, Three Strategies

Each problem carries with it an implicit solution strategy:

1. Big Networks: Divide & Conquer

Most real-world networks are not actually a single unified network, but rather a highly clustered network-of-networks. It’s generally faster, and more sensitive to variation, to divide the problem along natural fault lines & work within.

2. Too many trees in the forest? Build maps, find patterns.

High-fidelity data pushes us to abstract in new ways that (might?) provide insights into meaning. Think like a naturalist.

3. Too many forests? Regularize case-study insights.

The task is to weave between the ideals of case-specificity and methodological consistency. Is it OK to use a different resolution parameter in each setting? How do you develop decision rules? We think it’s a two-step problem: (a) identify what underlies choices in a single, well-known case; then (b) regularize on that meta level.

Big Networks



Divide & Conquer

• Giant networks pose two sorts of problems: practical and theoretical.
• Practical
  • Can you manipulate, clean, load, and create the necessary data structures given the scale of the data and your computational resources?
  • R/iGraph is often very good at simple large-scale calculations. Assuming you have a graph object, getting many node-level metrics is often very fast. You can often use these “simple” stats in creative ways, particularly by crossing with attributes/communities.
  • If not:
    • Sometimes the solution is “merely” adding more computational power. This is sometimes possible, though often limited by data use restrictions (privacy, DUA, etc.).
    • Most of the time the solution is to rethink the problem and make it smaller somehow. The two most common solutions are to divide or localize.
• Theoretical


  • What scale is socially relevant?
    • Disease spread is naturally on O(billions), but interventions/resources/etc. are not.
    • Most network processes are enacted locally despite global emergent implications.
  • Do you expect the “same” process at each place in the network?
    • I.e., is the susceptibility to social influence the same for each place? Does the ERGM homogeneity-of-parameters assumption hold? Likely no, which means we don’t really want to analyze it as a single network, but as a collection of linked networks.
  • Do small-scale ease-of-use practices scale?
    • Geodesics are the common path on small nets because they are easy to compute there; but are they relevant for massive networks?

Big Networks



Divide & Conquer

[Figure: whole network, first-level cluster, and 2nd-level cluster]

Most social networks admit to natural fissures, and it often makes more sense to differentiate analysis within vs. between these fissures.

• Social processes are rarely even across communities
• Between-community structure can itself be interesting
• Creates a need for multi-level modeling & analysis

Big Networks



localize

Pay attention to the base features of large-scale global networks, then use that to tailor what you work with substantively.

• Are you interested in the longest-long tail of your degree distribution?
• Is the network a set of hubs-and-spokes? Clusters? Homogeneous weave? Think about the way the global structure shapes visibility for the local structure and ask if you can just use the local (say two-step) structure.
• Can the (local) process you are interested in be sampled?
• Generally it's possible to “localize” networks around nodes (construct ego networks of k steps, for example, or all nodes within k steps with particular attributes), then analyze each of those to construct your boundary.
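Localizing around a node amounts to a bounded breadth-first search. The sketch below is illustrative only: `ego_network`, `induced_subgraph`, and the toy adjacency dictionary are assumed names and data, and the graph is treated as undirected.

```python
from collections import deque


def ego_network(adjacency, ego, k):
    """Return the set of nodes within k steps of `ego` (ego included)."""
    steps = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if steps[node] == k:
            continue  # do not expand beyond the k-step horizon
        for neighbor in adjacency.get(node, ()):
            if neighbor not in steps:
                steps[neighbor] = steps[node] + 1
                queue.append(neighbor)
    return set(steps)


def induced_subgraph(adjacency, nodes):
    """Keep only the ties among the selected nodes."""
    return {node: adjacency[node] & nodes for node in nodes}


# Hypothetical undirected network: a path a-b-c-d-e with a pendant f on b.
adjacency = {
    "a": {"b"}, "b": {"a", "c", "f"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": {"b"},
}

two_step = ego_network(adjacency, "a", 2)
local = induced_subgraph(adjacency, two_step)
print(sorted(two_step))
```

Because each ego network is small, the same function can be run over a sample of egos rather than the whole graph.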

EMR data on physicians: a 1/10 sample of patients in Ohio connects to physicians around the nation.

The 2010 Ohio physician network has 38K nodes and 2.2M edges.

We repeated this for 5 states over 3 panels. This still left us with 474K nodes and 11.3M edges.

Big Networks



Divide & Conquer

examples

The logic behind physician shared-patient networks is of coordinated care: who you talk to about patients affects treatment. This pushes us away from the total network toward identifying a set of localized communities.

The graph breaks into 116 localized communities with sizes ranging from 30 to nearly 1000.

Big Networks



Divide & Conquer

examples

Total network has 40M nodes; we used a two-level clustering procedure to identify reasonably compact communities. (These are also large.)

Here one 2nd-level community is over 6000 nodes. But the structure starts to become apparent at this level.

Big Networks



Divide & Conquer

examples

“Homo SocioNeticus: Scaling the cognitive foundations of online social behavior,” Defense Advanced Research Projects Agency (DARPA), Mark Orr, PI.

Network of 300K Twitter users. Modularity on the first-level cut was over 0.9 (Louvain, weighted graph). This is a heat-map mixing matrix, where each row/column is a community.
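A mixing matrix like the one behind the heat map can be tabulated directly from an edge list and a community assignment. This is a minimal sketch rather than the project's code; `mixing_matrix` and the toy data are illustrative assumptions.

```python
from collections import Counter


def mixing_matrix(edges, membership):
    """Count ties from each community (row) to each community (column)."""
    counts = Counter()
    for source, target in edges:
        counts[(membership[source], membership[target])] += 1
    return counts


# Hypothetical data: two communities joined by a single cross-community tie.
membership = {"a": 1, "b": 1, "c": 2, "d": 2}
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("b", "c")]

matrix = mixing_matrix(edges, membership)
communities = sorted(set(membership.values()))
for row in communities:
    print(row, [matrix[(row, col)] for col in communities])
```

With a highly modular partition, almost all of the mass sits on the diagonal, which is what makes the off-diagonal cells worth inspecting.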


Exemplar Strategies

Big Networks

Network of ~300K Twitter users. Details for 2 of the clusters:

- Communities 1 & 2: n=21487
- Community 5: n=7773

Exemplar Strategies

Big Networks

Of course, the D&C model requires being able to divide the network.

  - See Mucha’s presentation on strategies for this
  - R will do very large networks with things like fastgreedy, spectral, or stochastic block models

I’ve usually used Pajek for this. It’s optimized for very large networks (up to billions of nodes) and provides good control over resolution parameters and such. You have to go through the hassle of moving your data in/out, but I’ve found it worth the effort over a purely R solution.

iGraph in Python is more flexible for clustering, so that’s another option.

Exemplar Strategies

Big Networks

A localization example.

Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the “same” sorts of actors.

We can then ask about things like the distribution of hubs/spokes in large networks.

Exemplar Strategies

Big Networks

A localization example.

It's also possible to use simple-to-calculate scores in unique ways. So while it's time- and space/bandwidth-consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly.
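As a rough sketch of that idea, the code below computes a small profile of cheap local scores per node and then groups nodes by profile. The choice of scores, the function names, and the toy arcs are all assumptions made for illustration; grouping nodes with identical profiles is a crude stand-in for a real clustering step (k-means or hierarchical clustering on the score vectors).

```python
from collections import defaultdict


def local_scores(arcs):
    """Per-node profile: (in-degree, out-degree, reciprocated ties, open neighbor pairs)."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for source, target in arcs:
        out_nbrs[source].add(target)
        in_nbrs[target].add(source)
    nodes = set(out_nbrs) | set(in_nbrs)
    scores = {}
    for node in nodes:
        nbrs = sorted((out_nbrs[node] | in_nbrs[node]) - {node})
        # Bridging proxy: neighbor pairs with no tie between them in either direction.
        open_pairs = sum(
            1
            for i, a in enumerate(nbrs)
            for b in nbrs[i + 1:]
            if b not in out_nbrs[a] and a not in out_nbrs[b]
        )
        reciprocated = len(in_nbrs[node] & out_nbrs[node])
        scores[node] = (len(in_nbrs[node]), len(out_nbrs[node]), reciprocated, open_pairs)
    return scores


def positions(scores):
    """Group nodes that share an identical score profile."""
    groups = defaultdict(list)
    for node, profile in scores.items():
        groups[profile].append(node)
    return {profile: sorted(members) for profile, members in groups.items()}


# Hypothetical directed network: b, c, d all nominate a; a reciprocates b only.
arcs = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")]
scores = local_scores(arcs)
roles = positions(scores)
print(roles)
```

Every score here depends only on a node's immediate neighborhood, so the cost grows with local degree rather than with the size of the whole network.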

[Figure: role profile plots]

[Figure: roles overlaid on a single community]

Exemplar Strategies

Deep Networks

Characterized by lots of data on a small N of cases.

“Big” does not always involve billions of nodes. Contemporary data-collection routines can yield massive datasets on even a single person.

Example: Real-time video data on Data+ students. Only 10 nodes, but 30 terabytes of video data.

• Easy to detect features

Volfovsky, Alex, Katherine Heller, and James Moody. “Building Better Teams: A network analysis approach.” Army Research Office.

Exemplar Strategies

Deep Networks

Strategies here are often focused on three basic sorts:

a) Wrangling. Just putting the data in an analyzable format. This is non-trivial; some off-the-shelf AI tools are making this easier, but most of it requires bespoke programming. The Teams data has taken months to move from video to tabulations.

b) Sifting. Most of the data you collected is, sadly, irrelevant. Tracer data: do you want data on each movement? What about when people are sleeping? Or alone in their car?

c) Aggregating. Or, at least, irrelevant at the scale collected. Do you want the data item, or similarity between actors across multiple items? Often the fine-grained data gets turned into a vector.
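The aggregating step can be sketched as follows: fine-grained events are collapsed into one count vector per actor, and actors are then compared by the similarity of those vectors. The event log, the item names, and the choice of cosine similarity are illustrative assumptions, not the coding used in any of the studies described here.

```python
from collections import Counter
from math import sqrt


def actor_vectors(events):
    """Collapse (actor, item) events into one count vector per actor."""
    items = sorted({item for _, item in events})
    counts = Counter(events)
    actors = sorted({actor for actor, _ in events})
    return items, {actor: [counts[(actor, item)] for item in items] for actor in actors}


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical coded events from a fine-grained interaction log.
events = (
    [("ann", "laugh")] * 2 + [("ann", "question")]
    + [("bob", "laugh")] * 4 + [("bob", "question")] * 2
    + [("cy", "interrupt")]
)

items, vectors = actor_vectors(events)
ann_bob = cosine_similarity(vectors["ann"], vectors["bob"])
ann_cy = cosine_similarity(vectors["ann"], vectors["cy"])
print(items, vectors)
```

Cosine similarity ignores volume, so an actor who does the same things twice as often still looks identical; whether that is desirable is a substantive choice.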

Interaction rituals and bonding

• Speed-dating research
• 4-minute “dates”
• Men rotate while women stay in seat
• “Interaction rituals”
• Who clicks with whom?
• How do social bonds form?

Exemplar Strategies

Deep Networks

Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland

[Figure: our speed-date setup]


Exemplar Strategies

Deep Networks


What do you do for fun?  Dance?

Uh, dance, uh, I like to go, like camping.  Uh, snowboarding, but I'm not good, but I like to go anyway.

You like boarding.

Yeah.  I like to do anything.  Like I, I'm up for anything.

Really?

Yeah.

Are you open-minded about most everything?

Not everything, but a lot of stuff-

What is not everything [laugh]

I don't know.  Think of something, and I'll say if I do it or not. [laugh]

Okay.  [unintelligible].

Skydiving.  I wouldn't do skydiving I don't think.

Yeah I'm afraid of heights.

F:  Yeah, yeah, me too.

M:  [laugh] Are you afraid of heights?

F:  [laugh] Yeah [laugh]

Exemplar Strategies

Deep Networks

The Speed Date Study

• 991 4-minute dates
  • 3 events, each with ~20x20=400 dates; some data loss
• Participants: 110 graduate student volunteers in 2005
  • Participated in return for the chance to date
• Speech
  • ~70 hours from shoulder sash recorders; high noise
• Transcripts
  • ~800K words hand-transcribed w/turn boundary times
• Surveys
  • Pre-test surveys, event scorecards, post-test surveys
  • Date perceptions and follow-up interest
  • General attitudes, preferences, demographics
• Largest natural experiment with audio, text + survey info

Exemplar Strategies

Deep Networks

How we predict it

Actor-Partner Interdependence Model (Kenny, Kashy, & Cook 2006)

Exemplar Strategies

Deep Networks

Strategic Bonding: How to get a desired mutual bond?

• Men
  • Actor speech
    • vary pitch
    • avoid talking about work
    • talk about yourself more than usual
• Women
  • Actor speech
    • laugh
    • make appreciations
    • take short turns
    • talk about yourself and drinking

Exemplar Strategies

Deep Networks

Culture in Objects

• Culture is “stored” more/less ambiguously
• “We should try to approach material culture without reducing objects to instantiations of discourse or realizations of cognitive representations.” (Mukerji 1997: 36)

Exemplar Strategies

Deep Networks

Shared schema on Scents

Data & Methods

• Fragrances
  • While humans are actually quite good at smelling, they are not good at describing smells in intersubjectively agreed-upon ways (Barwich 2020)
  • Smells are deeply symbolic and cultural (Sperber 1978), and perhaps the least intellectual of the senses (Gonzalez-Crussi 1989).

Exemplar Strategies

Deep Networks

Shared schema on Scents

Slides courtesy of Craig Rawlings

Schema 1 (n=36)

Exemplar Strategies

Deep Networks

Shared schema on Scents

Schema 2 (n=46)

Exemplar Strategies

Deep Networks

Shared schema on Scents

Schema 3 (n=33)


Exemplar Strategies

Deep Networks

Shared schema on Scents

H1: Controlling for interactions and objects, shared schemas predict initial interpersonal consensus in meanings.

Exemplar Strategies

Deep Networks

Video data on children's interactions

Research Question

Can we use social network analysis to enhance our understanding of how sociality and social cohesion develop in preschool-aged children?

Specific Aims:

1. Feasibility of video data collection
2. Coding scheme development
3. Social network analysis

Exemplar Strategies

Deep Networks

Video data on children's interactions

Data Collection

• Naturalistic video observations of classroom activities
• 4 microcameras
• 15-minute segments, 4x per week

Exemplar Strategies

Deep Networks

Video data on children's interactions

Coding Scheme

• Behaviors: Conflict and Cooperation
• Modifiers: Self-initiated, Other-initiated, or Mutually-initiated; Physical or Non-physical

Exemplar Strategies

Deep Networks

Video data on children's interactions

Social Network Analysis

Dynamic Network
- Conflict vs. Cooperation
- Sex and Age


Exemplar Strategies

Deep Networks

Video data on children's interactions

What’s Next?

• Continue to create social networks to analyze in R
• Patterns of relationships facilitate prosocial behavior
• What does this mean for intervention to increase compassionate behavior early in life?

Exemplar Strategies

Multiple Networks

Large-scale public good investments and at-scale interventions have increased access to multiple networks from the same network generator.

• Add Health: Students, 129 networks, 73K nodes
• Prosper: Students, 368 networks, 48K nodes (5 waves)
• Indian Micro Finance (Banerjee 2013): 43 villages
• Indian Health Villages (Mohanan, ongoing): 80 villages
• Honduran Health Villages (Christakis 2015): 32 villages
• …
• Or any time you “divide” a big network into lots of little parts!

All are characterized by having the same data collected on multiple sites.


Exemplar Strategies

Multiple Networks

The overarching strategy is to routinize the method so it can be repeated.

This forces one to hard-code choices that might otherwise vary if you were doing it as a disconnected set of one-off analyses. The key is to try to optimize relevance to each setting without letting the variance in inputs drive the results.

Practically, this often means spending a lot of time exploring diverse settings to find solutions that work for each.
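In code, routinizing usually means one function with its choices fixed, applied identically to every setting. The sketch below is a minimal illustration under assumed names (`describe`, the `school_01`/`school_02` keys) and toy data; the two statistics are standard density and arc-level reciprocity for a directed network.

```python
def describe(arcs, n_nodes):
    """Fixed set of descriptive statistics computed the same way for every network."""
    arc_set = {(u, v) for u, v in arcs if u != v}  # hard-coded choice: drop self-loops
    possible = n_nodes * (n_nodes - 1)
    density = len(arc_set) / possible if possible else 0.0
    mutual = sum(1 for u, v in arc_set if (v, u) in arc_set)
    reciprocity = mutual / len(arc_set) if arc_set else 0.0
    return {"density": density, "reciprocity": reciprocity}


# Hypothetical settings: the same data structure repeated across sites.
networks = {
    "school_01": {"n_nodes": 3, "arcs": [("a", "b"), ("b", "a"), ("b", "c")]},
    "school_02": {"n_nodes": 4, "arcs": [("w", "x"), ("x", "y"), ("y", "z")]},
}

summary = {name: describe(net["arcs"], net["n_nodes"]) for name, net in networks.items()}
for name, stats in summary.items():
    print(name, stats)
```

Keeping every analytic choice inside the function (here, how self-loops are handled) is what makes the results comparable across settings.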

Exemplar Strategies

Multiple Networks

Prosper example:

“Easy” bits:

- HLM on ERGM/SIENA (see Dan Ragan’s presentation in this session!)
- Calculating basic descriptive statistics on each network: just setting up a do loop.

“Hard” bits: Community detection or role analysis on each setting. Consider:

[Figure: A periodic table of social elements: the 16 directed triads, arranged in columns labeled (0) through (6). Triad labels shown include 021D, 021U, 021C, 111D, 111U, 030T, 030C, 120D, 120U, and 120C.]
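The digits in each triad label are the counts of Mutual, Asymmetric, and Null dyads among the three nodes, so a triad with label MAN contains 2M + A arcs. The sketch below counts triads by that three-digit code only; it is a brute-force illustration over all triples (cubic in the number of nodes, so for small networks only), the function name and toy arcs are assumptions, and it does not resolve the D/U/C/T letter variants, which requires checking arc orientation.

```python
from collections import Counter
from itertools import combinations


def man_census(nodes, arcs):
    """Count triads by their Mutual/Asymmetric/Null dyad code, e.g. '120'."""
    arc_set = set(arcs)
    census = Counter()
    for triple in combinations(sorted(nodes), 3):
        mutual = asymmetric = null = 0
        for u, v in combinations(triple, 2):
            forward, backward = (u, v) in arc_set, (v, u) in arc_set
            if forward and backward:
                mutual += 1
            elif forward or backward:
                asymmetric += 1
            else:
                null += 1
        census[f"{mutual}{asymmetric}{null}"] += 1
    return census


# Hypothetical directed network on four nodes.
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a")]

census = man_census(nodes, arcs)
print(dict(census))
```

For real data a dedicated triad census routine in a network library is the better tool; this version only makes the labeling scheme concrete.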

Exemplar Strategies

Multiple Networks

Triadic Position Census: 36 Positions within 16 Directed Triads

[Figure: position labels shown include 012_S, 012_E, 012_I; 021D_S, 021D_E; 021U_S, 021U_E; 021C_S, 021C_B, 021C_E; 111D_S, 111D_B, 111D_E; 030T_S, 030T_B, 030T_E; 030C; 201_S, 201_B; 120D_S, 120D_E; 120U_S, 120U_E; 120C_S, 120C_B, 120C_E; 210_S, 210_B, 210_E.]

Exemplar Strategies

Multiple Networks

Triad position vectors for a simple example network with 3 positions:

All well and good… how do we do it at scale?

Exemplar Strategies

Multiple Networks

One Prosper School (6th grade)….of

Exemplar Strategies

Multiple Networks

Stage 1: Within settings:

• Build triadic involvement distance matrix
• Ward’s minimum-variance clustering
• Calculate modularity score for the partition applied to the similarity matrix at each cut level
• Accept the cut with the highest modularity score
• Units are students
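The cut-selection rule in the last two scoring steps can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the candidate partitions are supplied by hand (in practice they would come from cutting the Ward dendrogram at each level), the similarity matrix is a toy symmetric example, and `modularity` is the standard weighted modularity applied to that matrix.

```python
def modularity(similarity, labels):
    """Weighted modularity of a partition on a symmetric similarity matrix."""
    strength = [sum(row) for row in similarity]
    total = sum(strength)  # equals 2m for a symmetric matrix
    score = 0.0
    for i, row in enumerate(similarity):
        for j, weight in enumerate(row):
            if labels[i] == labels[j]:
                score += weight - strength[i] * strength[j] / total
    return score / total


def best_cut(similarity, candidate_cuts):
    """Accept the candidate partition with the highest modularity score."""
    return max(candidate_cuts, key=lambda labels: modularity(similarity, labels))


# Hypothetical similarity among four students: two tight pairs, weak cross ties.
similarity = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
# Hypothetical dendrogram cuts: one cluster, two clusters, all singletons.
candidate_cuts = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]]

chosen = best_cut(similarity, candidate_cuts)
print(chosen, round(modularity(similarity, chosen), 3))
```

Because the rule is the same in every school, the number of positions is allowed to vary by setting while the decision criterion stays fixed.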

Exemplar Strategies

Multiple Networks

One Prosper School (6th grade)… each color is a position

Exemplar Strategies

Multiple Networks

Example positions identified in a single school network (role 7 is a “leading crowd” in the simplest sum-of-in-degree sense)

Stage 1: Within settings:

• Build triadic involvement distance matrix
• Ward’s minimum-variance clustering to build dendrogram
• Calculate modularity score for the partition applied to the similarity matrix
• Accept the cut with the highest modularity score
• Result: 2912 clusters

Thus far… standard single-network model. But how do you compare blocks across networks when label values are meaningless?

Exemplar Strategies

Multiple Networks

Stage 2: 2nd-order clustering across settings

• Calculate the triad position profile for each within-setting cluster
• Identify similarity across the cluster profiles by clustering a 2nd time
• Units are clusters (of students)
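A rough sketch of the second stage: each within-setting cluster is reduced to an average position profile, and the profiles themselves are then grouped across settings. Everything named here is an assumption for illustration, including the three-element toy vectors (standing in for full triad position vectors), the `(school, cluster)` keys, and the distance cutoff. The single-linkage grouping is a simple stand-in for whatever second clustering is actually used.

```python
from math import dist  # Python 3.8+


def cluster_profile(member_vectors):
    """Mean position vector of a cluster's members, rescaled to proportions."""
    size = len(member_vectors)
    means = [sum(column) / size for column in zip(*member_vectors)]
    total = sum(means)
    return [value / total for value in means] if total else means


def group_profiles(profiles, max_distance):
    """Single-linkage grouping: profiles within max_distance share a group label."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if dist(profiles[first], profiles[second]) <= max_distance:
                parent[find(first)] = find(second)
    return {name: find(name) for name in names}


# Hypothetical within-school clusters; cluster labels mean nothing across schools.
within_setting = {
    ("school_A", 1): [[8, 1, 1], [6, 2, 2]],
    ("school_A", 2): [[1, 1, 8]],
    ("school_B", 1): [[2, 2, 16], [0, 0, 4]],
    ("school_B", 3): [[14, 3, 3]],
}

profiles = {key: cluster_profile(vectors) for key, vectors in within_setting.items()}
groups = group_profiles(profiles, max_distance=0.1)
print(groups)
```

Rescaling to proportions keeps large and small schools comparable, which is one way to stop variance in inputs from driving the second-order result.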

Exemplar Strategies

Multiple Networks

[Figure: 2nd-order clustering dendrogram of the 2912 within-school clusters. Branch labels: Popular Loners, Uninvolved, Outsiders, Hangers-on, Aloofs, Leading Crowd, Segmented, Peers, Lieutenants, Federated Friends.]

Exemplar Strategies

Multiple Networks


Role set characteristics: Core, Secondary Core Branch

Exemplar Strategies

Multiple Networks

Role set characteristics: Core, Leading Crowd

[Figure: role profile across Power Centrality, Closeness Centrality, Total Degree, Ego Density, Ego Transitivity, In-Degree, Information Centrality, Two-step Reach, Reciprocity, Betweenness, and Out-Degree]

Exemplar Strategies

Multiple Networks

[Figure: role labels: Uninvolved, Outsiders, Hangers-on, Aloofs, Leading Crowd, Segmented, Peers, Lieutenants, Federated Friends, Popular Loner; numeric annotations (8), (4), (6), (7)]

Exemplar Strategies

Multiple Networks

Practical bits: How to wrangle these data?

• Move only what you need
• Keep a clear identifying ID
• Think like an algorithm: what takes time?
  • Betweenness centrality: slow
  • PageRank: fast?
  • For your question, does it matter? If so, could you save time by doing “local bridging,” etc.?

Practical bits: How to wrangle these data?

For large network analysis in R, we recommend iGraph for most research applications. iGraph provides a wide array of network aggregation techniques, includes a variety of efficient metrics that scale well, and has a core implemented in C/C++, making it fast.

Consequently, in most cases, the major data challenge isn’t working with the graph, but representing your data as a graph.

Practical bits: How to wrangle these data?

Big Data Challenges

Format Tradeoffs: Speed, Intelligibility, & Flexibility

- Super-fast formats such as binary are unintelligible without an interpreter.
- Adaptable formats such as JSON are also difficult to interpret, and vary in terms of speed.
- Classic adjacency matrices are mathematically highly interpretable, but inefficient.
- We generally recommend node lists and edge lists, both because they are interpretable and because they are easy to generate in a database if necessary.
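As a small illustration of the node-list/edge-list recommendation, the sketch below reads two plain-text tables into a node attribute table and an adjacency structure. The column names (`id`, `grade`, `source`, `target`) and the inline data are hypothetical, and in-memory strings stand in for files or database exports.

```python
import csv
import io

# Hypothetical exports: one row per node, one row per directed tie.
NODES_CSV = "id,grade\n1,6\n2,6\n3,7\n"
EDGES_CSV = "source,target\n1,2\n2,3\n"


def read_graph(nodes_file, edges_file):
    """Build node attributes and an adjacency dict from node and edge lists."""
    attributes = {row["id"]: row for row in csv.DictReader(nodes_file)}
    adjacency = {node_id: set() for node_id in attributes}
    for row in csv.DictReader(edges_file):
        adjacency[row["source"]].add(row["target"])
    return attributes, adjacency


attributes, adjacency = read_graph(io.StringIO(NODES_CSV), io.StringIO(EDGES_CSV))
print(adjacency)
```

Starting the adjacency structure from the node list keeps isolates in the graph, which an edge list alone would silently drop.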

Sharing & Accessing Your Data

- Very large networks cannot be directly shared.
- Instead of sharing datasets, you share scripts for accessing and manipulating data that ensure reproducibility between analysis sessions.
- Consequently, version control is more important, and some fluency with version control platforms such as GitHub or Apache Subversion can be helpful.

Conclusion

Web archives, social media, online collaboration platforms, and institutional databases provide new opportunities to understand social processes at the system level.

Social network theory provides many resources for understanding social systems, but it requires thinking creatively about old problems.

For example, network boundary questions often move from being a missing-data question to a social-process question. Does the social process in question operate across the entire graph or in pockets of it?

What constitutes a tie (always an important question) has practical implications but also theoretical ones. Incomplete information often drives connectivity in organizations, which is why spanning a structural hole can be advantageous; but this is not the case for collaborations on software development platforms such as GitHub, where connectivity primarily arises from reporting relations and status seeking.

Finally, large network analysis requires thinking both quantitatively and qualitatively in many instances, because multiple causal processes can generate similar patterns.



Three Problems, Three Strategies If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three: 1. Big Networks Connectionist: What is the relevant flow? How far? What governs spread? Positional: what s the social horizon for action within this structure? Roles relative to who? Traditional answers to these questions assume a well-bounded relevance horizon and take the full network as relevant. So things like geodesic distances are meaningful. Is that true for networks with billions of edges? Problem: exponential runtime & interpretability. 2. Too many trees in the forest? 3. Too many forests?

Three Problems, Three Strategies: If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three:
1. Big Networks
2. Too many trees in the forest? Connectionist: With continuous-time data, when is a relation? With 1000s of interactants, who's relevant for information flow? Positional: What gives text meaning? How is a note, phrase, or gesture situated in a wider (perhaps unseen) context? Traditional answers rely on ethnographic sensitivity and deep implicit understanding. Can we leverage computational tools to augment and regularize this? Problem: Too much data, too little contextual understanding, methods tuned to thin networks.
3. Too many forests?

Three Problems, Three Strategies: If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three:
1. Big Networks
2. Too many trees in the forest?
3. Too many forests? Connectionist: Why do rumors spread faster in one setting than another? Positional: Is the relational hierarchy similar across multiple schools? Traditional answers rely on user judgement for many network analytic decisions; it's a model rooted in deep data involvement. Problem: Case paradox: the same skills that make for good judgement in a single case make for either inconsistency across cases or insensitivity to contextual nuance.

Three Problems, Three Strategies: Each problem carries with it an implicit solution strategy:
1. Big Networks: Divide & Conquer. Most real-world networks are not actually a single unified network, but rather a highly clustered network-of-networks. It's generally more sensitive to variation, and faster, to divide the problem along natural fault lines & work within.
2. Too many trees in the forest? Build maps, find patterns. High-fidelity data pushes us to abstract in new ways that (might?) provide insights into meaning. Think like a naturalist.
3. Too many forests? Regularize case study insights. The task is to weave between the ideals of case-specificity and methodological consistency. Is it OK to use a different resolution parameter in each setting? How do you develop decision rules? We think it's a two-step problem: (a) identify what underlies choices in a single, well-known case; then (b) regularize on that meta level.

Big Networks: Divide & Conquer. Giant networks pose two sorts of problems: Practical and Theoretical.
Practical: Can you manipulate, clean, load, and create the necessary data structures given the scale of the data and your computational resources? R/igraph is often very good at simple large-scale calculations. Assuming you have a graph object, getting many node-level metrics is often very fast. You can often use these simple stats in creative ways, particularly by crossing with attributes/communities. If not, sometimes the solution is merely adding more computational power. This is sometimes possible, though often limited by data use restrictions (privacy, DUA, etc.). Most of the time the solution is to rethink the problem and make it smaller somehow. The two most common solutions are to divide or localize.
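As a small illustration of "crossing" a cheap node-level metric with community membership, the sketch below computes degree from an edge list and averages it within each community. It uses plain Python rather than R/igraph so that it stays self-contained; the edge list, the community labels, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy edge list and community assignment.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}


def degree(edges):
    """Node-level metric: number of ties incident to each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def mean_by_group(score, group):
    """Cross a node-level score with a grouping: mean score per group."""
    totals = defaultdict(float)
    sizes = Counter()
    for node, g in group.items():
        totals[g] += score.get(node, 0)
        sizes[g] += 1
    return {g: totals[g] / sizes[g] for g in sizes}


deg = degree(edges)
mean_degree_by_community = mean_by_group(deg, community)
```

Both passes are linear in the number of edges or nodes, which is why summaries of this kind stay feasible when heavier whole-network models do not.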

Big Networks: Divide & Conquer. Giant networks pose two sorts of problems: Practical and Theoretical.
Theoretical:
- What scale is socially relevant? Disease spread is naturally on O(billions), but interventions, resources, etc. are not. Most network processes are enacted locally despite global emergent implications.
- Do you expect the same process at each place in the network? I.e., is the susceptibility to social influence the same for each place? Does the ERGM homogeneity-of-parameters assumption hold? Likely no, which means we don't really want to analyze it as a single network, but as a collection of linked networks.
- Do small-scale ease-of-use practices scale? Geodesics are the common path on small nets because they are easy to compute there; but are they relevant for massive networks?

Big Networks: Divide & Conquer. Most social networks have natural fissures, and it often makes more sense to differentiate analysis within vs. between these fissures. Social processes are rarely even across communities. Between-community structure can itself be interesting. This creates a need for multi-level modeling & analysis. [Figure labels: Whole net; First-level cluster; 2nd-level cluster]

Big Networks: Localize. Pay attention to the base features of large-scale global networks, then use that to tailor what you work with substantively. Are you interested in the longest long tail of your degree distribution? Is the network a set of hubs-and-spokes? Clusters? Homogeneous weave? Think about the way the global structure shapes visibility for the local structure and ask if you can just use the local (say, two-step) structure. Can the (local) process you are interested in be sampled? Generally it's possible to localize networks around nodes (construct ego networks of k steps, for example, or all nodes within k steps with particular attributes), then analyze each of those to construct your boundary.
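Localizing around a node amounts to a breadth-first search that stops at k steps. The sketch below is one way to do it under simple assumptions: an undirected edge list held in memory, with hypothetical node names and helper functions.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def k_step_neighborhood(adj, ego, k):
    """Return {node: distance} for every node within k steps of `ego`."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the k-step horizon
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def induced_edges(edges, nodes):
    """Edges with both endpoints inside the localized node set."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


# Hypothetical toy network: a path a-b-c-d-e plus a pendant f on a.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "f")]
adj = build_adjacency(edges)
two_step = k_step_neighborhood(adj, "a", 2)
ego_net = induced_edges(edges, two_step)
```

Filtering `two_step` by a node attribute before calling `induced_edges` gives the "all nodes within k steps with particular attributes" variant.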

Big Networks: Divide & Conquer examples. EMR data on physicians: a 1/10 sample of patients in Ohio, who connect to physicians around the nation. The 2010 Ohio Physician network has 38K nodes and 2.2M edges. We repeated this for 5 states over 3 panels. This still left us with 474K nodes and 11.3M edges.

Big Networks: Divide & Conquer examples. EMR data on physicians: a 1/10 sample of patients in Ohio, who connect to physicians around the nation. The logic behind physician shared-patient networks is of coordinated care: who you talk to about patients affects treatment. This pushes us away from the total network toward identifying a set of localized communities. The graph breaks into 116 localized communities with sizes ranging from 30 to nearly 1000.

Big Networks: Divide & Conquer examples. Total network has 40M nodes; we used a two-level clustering procedure to identify reasonably compact communities. These are also large: here one 2nd-level community is over 6000 nodes. But the structure starts to become apparent at this level. Homo SocioNeticus: Scaling the cognitive foundations of online social behavior. Defense Advanced Research Projects Agency (DARPA), Mark Orr, PI.

Big Networks: Divide & Conquer examples. Network of 300K Twitter users. Modularity on first-level cut was over 0.9 (Louvain, weighted graph). This is a heat map mixing matrix, where each row/col is a community.
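For readers who want to see what sits behind a mixing-matrix heat map and a modularity score, here is a minimal sketch. It does not run Louvain; it assumes a partition is already given and only computes the community-by-community tie weights and Newman's modularity for that partition. The toy graph and function names are hypothetical.

```python
from collections import defaultdict


def mixing_matrix(weighted_edges, membership):
    """Total tie weight between each pair of communities.

    Every edge is counted from both ends, so within-community cells hold
    twice the internal edge weight and the matrix is symmetric."""
    mix = defaultdict(float)
    for u, v, w in weighted_edges:
        cu, cv = membership[u], membership[v]
        mix[(cu, cv)] += w
        mix[(cv, cu)] += w
    return dict(mix)


def modularity(weighted_edges, membership):
    """Newman modularity Q = sum_c (e_cc - a_c^2) for a given partition."""
    mix = mixing_matrix(weighted_edges, membership)
    total = sum(mix.values())  # equals twice the total edge weight
    row = defaultdict(float)
    for (c, _), w in mix.items():
        row[c] += w
    return sum(
        mix.get((c, c), 0.0) / total - (row[c] / total) ** 2 for c in row
    )


# Hypothetical toy graph: two triangles joined by one bridging tie.
weighted_edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
membership = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

mix = mixing_matrix(weighted_edges, membership)
q = modularity(weighted_edges, membership)
```

A heat map is then just this matrix drawn with one row and column per community; a Q near 0.9 means almost all tie weight falls on the diagonal relative to what the community sizes alone would predict.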

Exemplar Strategies: Big Networks. Network of ~300K Twitter users. Details for 2 of the clusters. Communities 1 & 2: N=21487. Community 5: n=7773.

Exemplar Strategies: Big Networks. Of course, the D&C model requires being able to divide the network.
- See Mucha's presentation on strategies for this.
- R will do very large networks with things like fastgreedy, spectral, or stochastic block models.
I've usually used Pajek for this. It's optimized for very large networks (up to billions of nodes) and provides good control over resolution parameters and such. You have to go through the hassle of moving your data in/out, but I've found it worth the effort over a purely R solution. igraph in Python is more flexible for clustering, so that's another option.

Exemplar Strategies: Big Networks. A localization example. Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the same sorts of actors.

Exemplar Strategies: Big Networks. A localization example. Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the same sorts of actors. We can then ask things like the distribution of hubs/spokes in large networks.

Exemplar Strategies: Big Networks. A localization example: role profile plots. It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly.
[Figure: role profile plots; each x value is a within & between community involvement score. Role labels: Low activity, pendants; Low activity retweet bridging; Low activity, mixed bridge; Quote bridges; Reply bridges (wgt); High activity retweet bridging; Active group members; Local authorities, fighters; Superstar hubs. Role shares shown on the panels: 12%, 13.6%, 55.6%, 12.5%, 3.4%, 1%, 1.4%, 0.5%, 0.06%.]

Exemplar Strategies: Big Networks. A localization example: roles overlaid on a single community. It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly. [Figure labels: Role 7, Role 8, Role 9]
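The "cheap scores, then cluster" idea can be sketched as follows. The slides do not say which scores or which clustering routine were used, so this example makes two assumptions: the profile for each node is just a pair (ties inside its own community, ties to other communities), and positions come from a plain k-means started from hand-picked centers. All names and data are hypothetical.

```python
def involvement_scores(edges, membership):
    """Per node: (ties within own community, ties to other communities)."""
    scores = {n: [0, 0] for n in membership}
    for u, v in edges:
        idx = 0 if membership[u] == membership[v] else 1
        scores[u][idx] += 1
        scores[v][idx] += 1
    return {n: tuple(s) for n, s in scores.items()}


def nearest(point, centers):
    """Index of the closest center by squared Euclidean distance."""
    def sq_dist(j):
        return sum((p - c) ** 2 for p, c in zip(point, centers[j]))
    return min(range(len(centers)), key=sq_dist)


def kmeans(points, centers, iterations=20):
    """Plain Lloyd's k-means from caller-supplied starting centers."""
    centers = list(centers)
    labels = [nearest(pt, centers) for pt in points]
    for _ in range(iterations):
        new_centers = []
        for j in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        centers = new_centers
        new_labels = [nearest(pt, centers) for pt in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy network: two hub-and-spoke communities plus a bridge x.
edges = [("a", "a1"), ("a", "a2"), ("a", "a3"), ("a", "x"),
         ("b", "b1"), ("b", "b2"), ("b", "b3"),
         ("x", "b"), ("x", "b1")]
membership = {"a": "A", "a1": "A", "a2": "A", "a3": "A", "x": "A",
              "b": "B", "b1": "B", "b2": "B", "b3": "B"}

scores = involvement_scores(edges, membership)
nodes = sorted(scores)
labels, centers = kmeans([scores[n] for n in nodes], [(1, 0), (4, 0), (1, 2)])
role = dict(zip(nodes, labels))
```

On this toy graph the pendants, the two hubs, and the bridging node end up in three separate positions, without ever comparing full tie profiles between pairs of nodes.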

Exemplar Strategies: Deep Networks. Characterized by lots of data on a small N of cases. "Big" does not always involve billions of nodes. Contemporary data-collection routines can yield massive datasets on even a single person. Easy-to-detect features. Example: real-time video data on Data+ students. Only 10 nodes, but 30 terabytes of video data. Volfovsky, Alex, Katherine Heller, and James Moody. "Building Better Teams: A network analysis approach." Army Research Office.

Exemplar Strategies: Deep Networks. Strategies here are often focused on three basic sorts:
a) Wrangling. Just putting the data in an analyzable format. This is non-trivial; some off-the-shelf AI tools are making this easier, but most of it requires bespoke programming. Teams data has taken months to move from video to tabulations.
b) Sifting. Most of the data you collected is, sadly, irrelevant. Tracer data: do you want data on each movement? What about when people are sleeping? Or alone in their car?
c) Aggregating. Or, at least, irrelevant at the scale collected. Do you want the data item, or similarity between actors across multiple items? Often the fine-grained data gets turned into a vector.
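The aggregating step, turning fine-grained records into one vector per actor and then comparing actors, can be sketched like this. The record format, the item names, and the choice of cosine similarity are assumptions made for illustration; the talk does not specify a particular similarity measure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical fine-grained records: one (actor, item) pair per observation.
records = [
    ("ann", "laugh"), ("ann", "laugh"), ("ann", "question"),
    ("bob", "laugh"), ("bob", "laugh"), ("bob", "question"),
    ("cat", "interrupt"), ("cat", "question"),
]


def actor_vectors(records):
    """Aggregate raw observations into one count vector per actor."""
    vecs = defaultdict(Counter)
    for actor, item in records:
        vecs[actor][item] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vecs = actor_vectors(records)
sim_ann_bob = cosine(vecs["ann"], vecs["bob"])
sim_ann_cat = cosine(vecs["ann"], vecs["cat"])
```

The resulting actor-by-actor similarities form a dense valued network, which is exactly the situation where a thinning rule is needed before treating the similarities as ties.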

Exemplar Strategies: Deep Networks. Interaction rituals and bonding. Speed-dating research: 4-minute dates; men rotate while women stay in seat. Interaction rituals: Who clicks with whom? How do social bonds form? Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland.

Exemplar Strategies: Deep Networks. Our speed date setup.

Exemplar Strategies: Deep Networks. What do you do for fun? Dance? Uh, dance, uh, I like to go, like camping. Uh, snowboarding, but I'm not good, but I like to go anyway. You like boarding. Yeah. I like to do anything. Like I, I'm up for anything. Really? Yeah. Are you open-minded about most everything? Not everything, but a lot of stuff- What is not everything [laugh] I don't know. Think of something, and I'll say if I do it or not. [laugh] Okay. [unintelligible]. Skydiving. I wouldn't do skydiving I don't think. Yeah I'm afraid of heights. F: Yeah, yeah, me too. M: [laugh] Are you afraid of heights? F: [laugh] Yeah [laugh]

Exemplar Strategies: Deep Networks. The Speed Date Study.
- 991 4-minute dates: 3 events, each with ~20x20=400 dates; some data loss.
- Participants: 110 graduate student volunteers in 2005, who participated in return for the chance to date.
- Speech: ~70 hours from shoulder sash recorders; high noise.
- Transcripts: ~800K words, hand-transcribed w/ turn boundary times.
- Surveys (pre-test surveys, event scorecards, post-test surveys): date perceptions and follow-up interest; general attitudes, preferences, demographics.
Largest natural experiment with audio, text + survey info.

Exemplar Strategies: Deep Networks. Strategic Bonding: How to get a desired mutual bond?
Men, actor speech: vary pitch; avoid talking about work; talk about yourself more than usual.
Women, actor speech: laugh; make appreciations; take short turns; talk about yourself and drinking.

Exemplar Strategies: Deep Networks. Shared schema on Scents. Culture in Objects: Culture is stored more/less ambiguously. "We should try to approach material culture without reducing objects to instantiations of discourse or realizations of cognitive representations." (Mukerji 1997: 36)

Exemplar Strategies: Deep Networks. Shared schema on Scents. Data & Methods: Fragrances. While humans are actually quite good at smelling, they are not good at describing smells in intersubjectively agreed-upon ways (Barwich 2020). Smells are deeply symbolic and cultural (Sperber 1978), and perhaps the least intellectual of the senses (Gonzalez-Crussi 1989). Slides courtesy of Craig Rawlings.

Exemplar Strategies: Deep Networks. Shared schema on Scents. Schema 1 (n=36).

Exemplar Strategies: Deep Networks. Shared schema on Scents. Schema 2 (n=46).

Exemplar Strategies: Deep Networks. Shared schema on Scents. Schema 3 (n=33).

Exemplar Strategies: Deep Networks. Shared schema on Scents. H1: Controlling for interactions and objects, shared schemas predict initial interpersonal consensus in meanings.

Exemplar Strategies: Deep Networks. Video data on children's interactions. Research Question: Can we use social network analysis to enhance our understanding of how sociality and social cohesion develop in preschool-aged children? Specific Aims: 1. Feasibility of video data collection. 2. Coding scheme development. 3. Social network analysis. Slides courtesy of Craig Rawlings.

Exemplar Strategies: Deep Networks. Video data on children's interactions. Data Collection:
- Naturalistic video observations of classroom activities
- 4 microcameras
- 15-minute segments, 4x per week

Exemplar Strategies: Deep Networks. Video data on children's interactions. Coding Scheme.
Behaviors:
- Conflict and Cooperation
Modifiers:
- Self-initiated, Other-initiated, or Mutually-initiated
- Physical or Non-physical

Exemplar Strategies: Deep Networks. Video data on children's interactions. Social Network Analysis. [Video: yellow2_video1.mp4] Dynamic network:
- Conflict vs. Cooperation
- Sex and Age

Exemplar Strategies: Deep Networks. Video data on children's interactions. Social Network Analysis.

Exemplar Strategies: Deep Networks. Video data on children's interactions. What's Next? Continue to create social networks to analyze in R. Patterns of relationships facilitate prosocial behavior. What does this mean for intervention to increase compassionate behavior early in life?
