Strategies and Cautionary Tales in Big Networks

Big Networks: Strategies and Cautionary Tales
Outline
1. Introduction: Growth in network data
2. Three Problems, Three Strategies
   1. Big Data
   2. Deep Data
   3. Many Data
3. Examples & War Stories
4. Some Coding & data wrangling hints
5. Conclusion
Introduction
Social network analysis has grown out of the case-study method with heavy theoretical roots in structural anthropology & community studies.
Moreno’s Who Shall Survive, 1936 (24 nodes; 82 arcs, two relations)
Loomis, 1947 (39 nodes, ~90 relations)
Everyone’s favorite: Karate Club (34 nodes, 156 edges)
But contemporary network data are often not the single case studies that characterized the birth of the field. Rather, we now see three common extensions:
a) Massive single networks
- Online data, customer records, electronic medical records data.
- Many bipartite network opportunities through social activity tracing.
b) Deep data: the problem of exquisite detail
- Sensor data, text data and other sorts of intensive data collection routines produce “thick” data descriptions on even a small number of nodes. (Will only hint at this today due to time.)
c) “Small” multiples
- Same data structure repeated in many settings.
- Add Health, Prosper, etc.
How well do our standard strategies scale for these sorts of problems?
“Big”, in a social network, is primarily a function of its number of ties.
Consequently, we need to think about how we are defining a tie, particularly in contexts where the edges represent affiliations, proximities, or similarities.
For similarities and proximities, all nodes may be “connected” to all other nodes because all pairs of nodes have a value.
To isolate the underlying structure from the noise, it is often useful to thin the network by applying a treatment. These treatments generally come in three flavors:
- Threshold Methods (e.g., line islands)
- Ranking Methods (e.g., top k alters)
- Likelihood Methods (e.g., isolating ties with frequencies greater than we expect by chance)
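The three thinning flavors can be illustrated with a small pure-Python sketch. It is not the procedure used in the talk: the edge list, the function names, and the cutoffs are all hypothetical, and the "likelihood" version is only a crude comparison against the weight expected from node strengths, not a formal significance test.

```python
from collections import defaultdict

# Hypothetical weighted similarity ties: (node_u, node_v, weight)
edges = [
    ("a", "b", 0.9), ("a", "c", 0.2), ("a", "d", 0.4),
    ("b", "c", 0.7), ("b", "d", 0.1), ("c", "d", 0.8),
]

def thin_by_threshold(edges, cutoff):
    """Threshold method: keep ties whose weight is at least `cutoff`."""
    return [(u, v, w) for u, v, w in edges if w >= cutoff]

def thin_by_top_k(edges, k):
    """Ranking method: keep a tie if it is among the k strongest
    ties of at least one of its endpoints."""
    by_node = defaultdict(list)
    for u, v, w in edges:
        by_node[u].append((w, u, v))
        by_node[v].append((w, u, v))
    keep = set()
    for ties in by_node.values():
        for w, u, v in sorted(ties, reverse=True)[:k]:
            keep.add((u, v, w))
    return sorted(keep)

def thin_by_expectation(edges, ratio=1.0):
    """Likelihood-style method (simplified): keep ties whose weight exceeds
    `ratio` times the weight expected if ties formed in proportion to
    the strengths of the two endpoints."""
    strength = defaultdict(float)
    total = 0.0
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        total += w
    kept = []
    for u, v, w in edges:
        expected = strength[u] * strength[v] / (2.0 * total)
        if w > ratio * expected:
            kept.append((u, v, w))
    return kept

strong_ties = thin_by_threshold(edges, 0.5)
top_ties = thin_by_top_k(edges, 1)
surprising_ties = thin_by_expectation(edges)
```

Each function returns a smaller edge list, so the thinned network can be handed to whatever graph library is used downstream.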
If networks now span something like 6 orders of magnitude, do we need new theory? Largely, no.
The foundational division between connectionist and positional approaches to networks stands; what’s changed is that we are forced to specify internal boundary conditions rather than taking our boundary conditions as given by the case.
What first appears as a methods problem is, usually, a theory problem.
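Connections & Positions: Network Problems
Columns: Ego, Complete, Multiple. Rows: Connectionist (networks as pipes); Positional (networks as roles).
Topics listed: Community Detection, Reachability, Homophily, Degree Distribution, Social Balance, ERGM, Multi-layer networks, Structural Holes, Density, Mixing Models, Size, Multi-level models of multiple networks, Local Roles (Mandel 1983; Mandel & Winship 1984), Relational Block Models, Motifs.
2 ideas: patterns in networks; patterns of networks.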
Three Problems, Three Strategies
If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three:
1. Big Networks
Connectionist: What is the relevant flow? How far? What governs spread?
Positional: What’s the social horizon for action within this structure? Roles relative to whom?
Traditional answers to these questions assume a well-bounded relevance horizon and take the full network as relevant. So things like geodesic distances are meaningful. Is that true for networks with billions of edges?
Problem: exponential runtime & interpretability.
2. Too many trees in the forest?
Connectionist: With continuous-time data, “when” is a relation? With 1000s of interactants, who’s relevant for information flow?
Positional: What gives text meaning? How is a note, phrase or gesture situated in a wider (perhaps unseen) context?
Traditional answers rely on ethnographic sensitivity and deep implicit understanding. Can we leverage computational tools to augment and regularize this?
Problem: Too much data, too little contextual understanding, methods tuned to thin networks.
3. Too many forests?
Connectionist: Why do rumors spread faster in one setting than another?
Positional: Is the relational hierarchy similar across multiple schools?
Traditional answers rely on user judgement for many network analytic decisions; it’s a model rooted in deep data involvement.
Problem: Case paradox: the same skills that make for good judgement in a single case make for either inconsistency across cases or insensitivity to contextual nuance.
Each problem carries with it an implicit solution strategy:
1. Big Networks: Divide & Conquer
Most real-world networks are not actually a single unified network, but rather a highly clustered network-of-networks. It’s generally faster, and more sensitive to variation, to divide the problem along natural fault lines & work within.
2. Too many trees in the forest? Build maps, find patterns.
High-fidelity data pushes us to abstract in new ways that (might?) provide insights into meaning. Think like a naturalist.
3. Too many forests? Regularize case study insights
The task is to weave between the ideals of case-specificity and methodological consistency. Is it OK to use a different resolution parameter in each setting? How do you develop decision rules? We think it’s a two-step problem: (a) identify what underlies choices in a single, well-known case; then (b) regularize on that meta level.
Big Networks: Divide & Conquer
Giant networks pose two sorts of problems: Practical and Theoretical
Practical
Can you manipulate, clean, load, create the necessary data structures given the scale of the data and your computational resources?
- R/iGraph is often very good at simple large-scale calculations. Assuming you have a graph object, getting many node-level metrics is often very fast. You can often use these “simple” stats in creative ways, particularly by crossing with attributes/communities.
- If not, sometimes the solution is “merely” adding more computational power. This is sometimes possible, though often limited by data use restrictions (privacy, DUA etc.).
- Most of the time the solution is to rethink the problem and make it smaller somehow. The two most common solutions are to divide or localize.
Theoretical
What scale is socially relevant?
- Disease spread is naturally on O(billions), but interventions/resources/etc. are not.
- Most network processes are enacted locally despite global emergent implications.
Do you expect the “same” process at each place in the network?
- I.e., is the susceptibility to social influence the same for each place? Does the ERGM homogeneity-of-parameters assumption hold? Likely no, which means we don’t really want to analyze it as a single network, but as a collection of linked networks.
Do small-scale ease-of-use practices scale?
- Geodesics are the common path on small nets because they are easy to compute there; but are they relevant for massive networks?
Most social networks admit to natural fissures and it often makes more sense to differentiate analysis within vs. between these fissures.
- Social processes are rarely even across communities
- Between-community structure can itself be interesting
- Creates a need for multi-level modeling & analysis
Figure panels: whole net; first-level cluster; 2nd-level cluster
Big Networks: Localize
Pay attention to the base features of large-scale global networks, then use that to tailor what you work with substantively.
- Are you interested in the longest-long tail of your degree distribution?
- Is the network a set of hubs-and-spokes? Clusters? Homogeneous weave? Think about the way the global structure shapes visibility for the local structure and ask if you can just use the local (say two-step) structure.
- Can the (local) process you are interested in be sampled?
Generally it's possible to “localize” networks around nodes (construct ego networks of k steps, for example, or all nodes within k steps with particular attributes), then analyze each of those to construct your boundary.
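Localizing around a node amounts to a bounded breadth-first search. The sketch below is a minimal illustration with a hypothetical adjacency dictionary and function name; in practice a graph library's ego-network routine would do the same job on a large graph.

```python
from collections import deque

def ego_network(adj, ego, k):
    """Return (distances, edges) for the k-step neighborhood of `ego`.

    `adj` maps each node to an iterable of neighbors (undirected graph).
    `distances` maps every node within k steps to its step count from ego;
    `edges` holds the undirected ties among those nodes.
    """
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-step horizon
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    nodes = set(dist)
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adj.get(u, ())
        if v in nodes
    }
    return dist, edges

# Hypothetical toy graph: a short chain with one side branch.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4, 6], 6: [5]}
two_step_dist, two_step_edges = ego_network(adj, ego=1, k=2)
```

Filtering `two_step_dist` by a node attribute gives the "all nodes within k steps with particular attributes" variant.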
Big Networks: Divide & Conquer examples
EMR data on physicians: a 1/10 sample of Ohio patients connects to physicians around the nation.
The 2010 Ohio physician network has 38K nodes and 2.2M edges. We repeated this for 5 states over 3 panels. This still left us with 474K nodes and 11.3M edges.
The logic behind physician shared-patient networks is one of coordinated care: who you talk to about patients affects treatment. This pushes us away from the total network toward identifying a set of localized communities.
The graph breaks into 116 localized communities with sizes ranging from 30 to nearly 1000.
Total network has 40M nodes; we used a two-level clustering procedure to identify reasonably compact communities. These are also large: here one 2nd-level community is over 6000 nodes. But the structure starts to become apparent at this level.
“Homo SocioNeticus: Scaling the cognitive foundations of online social behavior,” Defense Advanced Research Projects Agency (DARPA), Mark Orr, PI
Network of 300K Twitter users. Modularity on the first-level cut was over 0.9 (Louvain, weighted graph). This is a heat map mixing matrix, where each row/col is a community.
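Both quantities mentioned here, the community mixing matrix and the modularity of a partition, are cheap to compute once a partition exists. The sketch below assumes a hypothetical weighted edge list and membership dictionary; it scores a given partition and does not run Louvain itself.

```python
from collections import defaultdict

def mixing_matrix(edges, membership):
    """Sum tie weight within and between communities (undirected).
    Keys are (community_a, community_b) with community_a <= community_b."""
    mix = defaultdict(float)
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        mix[(min(cu, cv), max(cu, cv))] += w
    return dict(mix)

def modularity(edges, membership):
    """Newman modularity Q of a partition on a weighted undirected graph."""
    total = sum(w for _, _, w in edges)
    within = defaultdict(float)    # weight of ties inside each community
    strength = defaultdict(float)  # summed node strength per community
    for u, v, w in edges:
        strength[membership[u]] += w
        strength[membership[v]] += w
        if membership[u] == membership[v]:
            within[membership[u]] += w
    return sum(
        within[c] / total - (strength[c] / (2.0 * total)) ** 2
        for c in strength
    )

# Hypothetical example: two triangles joined by a single bridging tie.
edges = [
    (1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
    (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0),
    (3, 4, 1.0),
]
membership = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

mix = mixing_matrix(edges, membership)
q = modularity(edges, membership)
```

Plotting `mix` as a heat map, with one row and column per community, gives the kind of display described on the slide.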
Exemplar Strategies: Big Networks
Network of ~300K Twitter users. Details for 2 of the clusters:
Communities 1 & 2: N=21487
Community 5: n=7773
Of course, the D&C model requires being able to divide the network.
- See Mucha’s presentation on strategies for this
- R will do very large networks with things like fastgreedy, spectral or stochastic block models
I’ve usually used Pajek for this. It’s optimized for very large networks (up to billions of nodes) and provides good control over resolution parameters and such. You have to go through the hassle of moving your data in/out, but I’ve found it worth the effort over a purely R solution.
iGraph in Python is more flexible for clustering, so that’s another option.
A localization example.
Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the “same” sorts of actors.
We can then ask things like the distribution of hubs/spokes in large networks.
A localization example.
It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly.
Figure: Role profile plots
Figure: Roles overlaid on a single community
Exemplar Strategies: Deep Networks
Characterized by lots of data on a small N of cases.
“Big” does not always involve billions of nodes. Contemporary data-collection routines can yield massive datasets on even a single person.
Example: Real-time video data on Data+ students. Only 10 nodes, but 30 terabytes of video data.
Figure: Easy to detect features
Volfovsky, Alex, Katherine Heller, James Moody. “Building Better Teams: A network analysis approach.” Army Research Office.
Strategies here are often focused on three basic sorts:
a) Wrangling. Just putting the data in an analyzable format. This is non-trivial; some off-the-shelf AI tools are making this easier, but most of it requires bespoke programming.
- Teams data has taken months to move from video to tabulations.
b) Sifting. Most of the data you collected is, sadly, irrelevant.
- Tracer data: do you want data on each movement? What about when people are sleeping? Or alone in their car?
c) Aggregating. Or, at least, irrelevant at the scale collected.
- Do you want the data item, or similarity between actors across multiple items? Often the fine-grained data gets turned into a vector.
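The aggregating step, turning fine-grained records into one vector per actor and then comparing actors, can be sketched as follows. The event records, actor names, and the choice of cosine similarity are illustrative assumptions, not the coding used in the projects described here.

```python
from collections import Counter
from math import sqrt

# Hypothetical fine-grained records: (actor, observed item)
events = [
    ("ann", "laugh"), ("ann", "laugh"), ("ann", "question"),
    ("bob", "laugh"), ("bob", "laugh"), ("bob", "question"),
    ("cat", "interrupt"),
]

def build_profiles(events):
    """Aggregate item-level events into one count vector per actor."""
    profiles = {}
    for actor, item in events:
        profiles.setdefault(actor, Counter())[item] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[item] * q[item] for item in p if item in q)
    norm = sqrt(sum(x * x for x in p.values())) * sqrt(sum(x * x for x in q.values()))
    return dot / norm if norm else 0.0

profiles = build_profiles(events)
sim_ann_bob = cosine(profiles["ann"], profiles["bob"])
sim_ann_cat = cosine(profiles["ann"], profiles["cat"])
```

The resulting actor-by-actor similarities form a dense weighted network, which is exactly the case where the thinning decisions discussed in the introduction matter.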
Interaction rituals and bonding
Speed-dating research
- 4-minute “dates”
- Men rotate while women stay in seat
- “Interaction rituals”
- Who clicks with whom?
- How do social bonds form?
Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland
Figure: Our speed date setup
What do you do for fun?  Dance?
Uh, dance, uh, I like to go, like camping.  Uh, snowboarding, but I'm not good, but I like to go anyway.
You like boarding.
Yeah.  I like to do anything.  Like I, I'm up for anything.
Really?
Yeah.
Are you open-minded about most everything?
Not everything, but a lot of stuff-
What is not everything [laugh]
I don't know.  Think of something, and I'll say if I do it or not. [laugh]
Okay.  [unintelligible].
Skydiving.  I wouldn't do skydiving I don't think.
Yeah I'm afraid of heights.
F:  Yeah, yeah, me too.
M:  [laugh] Are you afraid of heights?
F:  [laugh] Yeah [laugh]
The Speed Date Study
991 4-minute dates
- 3 events, each with ~20x20=400 dates; some data loss
- Participants: 110 graduate student volunteers in 2005, who participated in return for the chance to date
Speech
- ~70 hours from shoulder sash recorders; high noise
Transcripts
- ~800K words hand-transcribed w/turn boundary times
Surveys (pre-test surveys, event scorecards, post-test surveys)
- Date perceptions and follow-up interest
- General attitudes, preferences, demographics
Largest natural experiment with audio, text + survey info
How we predict it
Actor-Partner Interdependence Model (Kenny, Kashy, & Cook 2006)
Strategic Bonding: How to get a desired mutual bond?
Men, actor speech:
- vary pitch
- avoid talking about work
- talk about yourself more than usual
Women, actor speech:
- laugh
- make appreciations
- take short turns
- talk about yourself and drinking
Shared schema on Scents
Culture in Objects
Culture is “stored” more/less ambiguously
“We should try to approach material culture without reducing objects to instantiations of discourse or realizations of cognitive representations.” (Mukerji 1997: 36)
Data & Methods
Fragrances
While humans are actually quite good at smelling, they
are not good at describing smells in intersubjectively
agreed-upon ways (Barwich 2020)
Smells are deeply symbolic and cultural (Sperber 1978),
and perhaps the least intellectual of the senses
(Gonzalez-Crussi 1989).
Slides courtesy of Craig Rawlings
Figure: Schema 1 (n=36)
Figure: Schema 2 (n=46)
Figure: Schema 3 (n=33)
H1: Controlling for interactions and objects,
shared schemas predict initial interpersonal
consensus in meanings.
Video data on children's interactions
Research Question
Can we use social network analysis to enhance our understanding of how sociality and social cohesion develop in preschool-aged children?
Specific Aims:
1. Feasibility of video data collection
2. Coding scheme development
3. Social network analysis
Data Collection
- Naturalistic video observations of classroom activities
- 4 microcameras
- 15-minute segments, 4x per week
Coding Scheme
Behaviors:
- Conflict and Cooperation
Modifiers:
- Self-initiated, Other-initiated, or Mutually-initiated
- Physical or Non-physical
Social Network Analysis
Dynamic Network
- Conflict vs. Cooperation
- Sex and Age
What’s Next?
- Continue to create social networks to analyze in R
- Patterns of relationships facilitate prosocial behavior
- What does this mean for intervention to increase compassionate behavior early in life?
Exemplar Strategies: Multiple Networks
Large-scale public good investments and at-scale interventions have increased access to multiple networks from the same network generator.
- Add Health: Students, 129 Networks, 73K nodes
- Prosper: Students, 368 Networks, 48K nodes (5 waves)
- Indian Micro Finance (Banerjee 2013): 43 Villages
- Indian Health Villages (Mohanan, ongoing): 80 Villages
- Honduran Health Villages (Christakis 2015): 32 Villages
Or any time you “divide” a big network into lots of little parts!
All characterized by having the same data collected on multiple sites.
Overarching strategy is to routinize method so it can be repeated.
This forces one to hard-code choices that might otherwise vary if you were doing it as a
disconnected set of one-off analyses.  Key is to try and optimize relevance to each
setting without letting the variances in input drive the results.
Practically this often means spending a lot of time exploring diverse settings to find
solutions that work for each.
Prosper example:
“Easy” bits:
- HLM on ERGM/SIENA (see Dan Ragan’s presentation in this session!)
- Calculating basic descriptive statistics on each network: just setting a do loop.
“Hard” bits: Community detection or role analysis on each setting. Consider:
A periodic table of social elements: 16 directed triads
(0) 003
(1) 012
(2) 102, 021D, 021U, 021C
(3) 111D, 111U, 030T, 030C
(4) 201, 120D, 120U, 120C
(5) 210
(6) 300
Triadic Position Census: 36 Positions within 16 Directed Triads
012: 012_S, 012_E, 012_I
021D: 021D_S, 021D_E
021U: 021U_S, 021U_E
021C: 021C_S, 021C_B, 021C_E
111D: 111D_S, 111D_B, 111D_E
030T: 030T_S, 030T_B, 030T_E
030C: 030C
201: 201_S, 201_B
120D: 120D_S, 120D_E
120U: 120U_S, 120U_E
120C: 120C_S, 120C_B, 120C_E
210: 210_S, 210_B, 210_E
300: 300
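To make the triad labels concrete: the three digits of each label count the mutual, asymmetric, and null dyads in the triple (the M-A-N convention). The sketch below computes only that three-digit code by brute force over all triples. It does not resolve the D/U/C/T letter variants or the within-triad positions, both of which need the direction of each arc, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def man_code(arcs, a, b, c):
    """Return the M-A-N digits for the triple (a, b, c): the number of
    mutual, asymmetric, and null dyads, as a three-character string."""
    mutual = asymmetric = null = 0
    for u, v in ((a, b), (a, c), (b, c)):
        forward, backward = (u, v) in arcs, (v, u) in arcs
        if forward and backward:
            mutual += 1
        elif forward or backward:
            asymmetric += 1
        else:
            null += 1
    return f"{mutual}{asymmetric}{null}"

def man_census(nodes, arcs):
    """Count M-A-N codes over every triple of nodes.
    Brute force, O(n^3): fine for a classroom, not for a giant network."""
    census = Counter()
    for a, b, c in combinations(sorted(nodes), 3):
        census[man_code(arcs, a, b, c)] += 1
    return census

# Hypothetical directed network: 1 <-> 2, 2 -> 3, 3 -> 4
arcs = {(1, 2), (2, 1), (2, 3), (3, 4)}
census = man_census({1, 2, 3, 4}, arcs)
```

Graph libraries generally ship a full 16-type triad census that is far faster than this loop; the point of the sketch is only to show what the digits encode.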
Triad position vectors for a simple example network with 3 positions:
All well and good... how do we do it at scale?
One Prosper School (6th grade)... of 368.
Stage 1: Within settings:
- Build triadic involvement distance matrix
- Ward’s minimum-variance clustering
- Calculate modularity score for the partition applied to the similarity matrix at each cut level
- Accept the cut with the highest modularity score
- Units are students
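The cut-selection rule in Stage 1 can be sketched in pure Python. The sketch assumes the Ward dendrogram has already been built and cut at several levels by some clustering routine, so the candidate partitions are simply given; the similarity matrix, the dictionary of cuts, and the function names are hypothetical.

```python
def partition_modularity(sim, labels):
    """Modularity of a partition, treating the symmetric similarity
    matrix `sim` as a weighted graph (diagonal entries are ignored)."""
    n = len(sim)
    strength = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    two_m = sum(strength)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                weight = sim[i][j] if i != j else 0.0
                q += weight - strength[i] * strength[j] / two_m
    return q / two_m

def best_cut(sim, candidate_cuts):
    """Return (cut_level, modularity) for the highest-scoring cut.
    `candidate_cuts` maps a cut level to a list of cluster labels."""
    scored = [
        (partition_modularity(sim, labels), level)
        for level, labels in candidate_cuts.items()
    ]
    best_q, best_level = max(scored)
    return best_level, best_q

# Hypothetical similarity among four students: two tight pairs.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
# Hypothetical dendrogram cuts: number of clusters -> labels per student.
candidate_cuts = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}

level, score = best_cut(sim, candidate_cuts)
```

Because the rule is mechanical, the same function can be applied unchanged to every setting, which is the point of routinizing the choice.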
Figure: One Prosper School (6th grade); each color is a position
Example positions identified in a single school network (role 7 is a “leading crowd” in the simplest sum-of-in-degree sense)
Applying Stage 1 within every setting yields 2912 clusters.
Thus far: a standard single-network model. But how do you compare blocks across networks when label values are meaningless?
Stage 2: 2nd-order clustering across settings
- Calculate the triad position profile for each within-setting cluster
- Identify similarity across the cluster profiles by clustering a 2nd time
- Units are clusters (of students)
Figure: 2nd Order Clustering Dendrogram (2912 within-school clusters)
Role labels on the dendrogram:
Popular Loners
Uninvolved
Outsiders
Hangers-on
Aloofs
Leading Crowd
Segmented Peers
Lieutenants
Federated Friends
Figure: Role set characteristics: Core; Secondary Core Branch
Figure: Role set characteristics: Core; Leading Crowd
Measures profiled: Power Centrality, Closeness Centrality, Total Degree, Ego Density, Ego Transitivity, In-Degree, Information Centrality, Two-step Reach, Reciprocity, Betweenness, Out-Degree
Figure labels: Uninvolved; Outsiders; Hangers-on; Aloofs; Leading Crowd; Segmented Peers; Lieutenants; Federated Friends; Popular Loner
Practical bits: How to wrangle these data?
- Move only what you need
- Keep clear identifying IDs
- Think like an algorithm: what takes time?
  - Betweenness centrality: slow
  - PageRank: fast?
  - For your question, does it matter? If so, could you save time by doing “local bridging” etc.?
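One way to read “local bridging” is as a cheap, neighborhood-only stand-in for betweenness: count, for each node, the ties whose two endpoints share no common neighbor. That reading, and the function below, are assumptions made for illustration rather than the measure used in the talk. Each tie costs one set intersection, so it avoids the all-pairs shortest paths that make betweenness slow.

```python
def local_bridge_counts(adj):
    """For each node, count incident ties whose endpoints have no
    neighbor in common (ties that span otherwise separate neighborhoods)."""
    neighbors = {node: set(nbrs) for node, nbrs in adj.items()}
    counts = {}
    for node, nbrs in neighbors.items():
        counts[node] = sum(
            1 for other in nbrs if not (nbrs & neighbors.get(other, set()))
        )
    return counts

# Hypothetical graph: two triangles joined by the single tie 3-4.
adj = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
bridge_counts = local_bridge_counts(adj)
```

Here only nodes 3 and 4 get a non-zero score, which matches where betweenness would concentrate in this toy graph.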
For large network analysis in R, we recommend iGraph for most research applications. iGraph provides a wide array of network aggregation techniques, includes a variety of efficient metrics that scale well, and has a compiled C/C++ core, making it fast.
Consequently, in most cases, the major data challenge isn’t working with the graph, but representing your data as a graph.
Big Data Challenges
Format Tradeoffs: Speed, Intelligibility, & Flexibility
- Super-fast formats such as binary are unintelligible without an interpreter.
- Adaptable formats such as JSON are also difficult to interpret, and vary in terms of speed.
- Classic adjacency matrices are mathematically highly interpretable, but inefficient.
- We generally recommend node lists and edge lists, both because they are interpretable and because they are easy to generate in a database if necessary.
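A node list plus an edge list is just two flat tables, which is why a database can produce them easily. The sketch below reads both from CSV text into an adjacency dictionary; the column names (`id`, `grade`, `source`, `target`, `weight`) and the sample rows are hypothetical, and in practice the two inputs would be open files rather than in-memory strings.

```python
import csv
import io

# Hypothetical node list and edge list, as they might be exported from a database.
NODES_CSV = "id,grade\n1,6\n2,6\n3,7\n"
EDGES_CSV = "source,target,weight\n1,2,1.0\n2,3,0.5\n"

def load_graph(nodes_file, edges_file):
    """Read a node list and an edge list into (node attributes, adjacency).

    The adjacency dictionary is undirected: each tie is stored under both
    endpoints. Ties that reference an unknown node raise an error, which
    catches ID mismatches early.
    """
    nodes = {row["id"]: row for row in csv.DictReader(nodes_file)}
    adjacency = {node_id: {} for node_id in nodes}
    for row in csv.DictReader(edges_file):
        source, target = row["source"], row["target"]
        if source not in nodes or target not in nodes:
            raise ValueError(f"edge {source}-{target} references an unknown node")
        weight = float(row["weight"])
        adjacency[source][target] = weight
        adjacency[target][source] = weight
    return nodes, adjacency

nodes, adjacency = load_graph(io.StringIO(NODES_CSV), io.StringIO(EDGES_CSV))
```

Keeping node attributes and ties in separate tables also makes it easy to move only the columns a given analysis needs.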
Sharing & Accessing Your Data
- Very large networks cannot be directly shared.
- Instead of sharing datasets, you share scripts for accessing and manipulating data that ensure reproducibility between analysis sessions.
- Consequently, version control is more important, and some fluency with version control platforms such as GitHub or Apache Subversion can be helpful.
Conclusion
Web archives, social media, online collaboration platforms, and institutional databases provide new opportunities to understand social processes at the system level.
Social network theory provides many resources for understanding social systems, but it requires thinking creatively about old problems.
- For example, network boundary questions often move from being a missing data question to a social process question. Does the social process in question operate across the entire graph or in pockets of it?
- What constitutes a tie (always an important question) has practical implications but also theoretical ones. Incomplete information often drives connectivity in organizations, which is why spanning a structural hole can be advantageous; but this is not the case for collaborations on software development platforms such as GitHub, where connectivity primarily arises from reporting relations and status seeking.
- Finally, large network analysis requires thinking both quantitatively and qualitatively in many instances because multiple causal processes can generate similar patterns.
Uploaded on Sep 08, 2024


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Big Networks: Strategies and Cautionary Tales

  2. Outline 1. Introduction: Growth in network data 2. Three Problems, Three Strategies 1. Big Data 2. Deep Data 3. Many Data 3. Examples & War Stories 4. Some Coding & data wrangling hints 5. Conclusion

  3. Introduction Social Network analysis has grown out of the case-study method with heavy theoretical roots in structural anthropology & community studies. Moreno s Who Shall Survive 1936 (24 nodes; 82 arcs, two relations)

  4. Introduction Social Network analysis has grown out of the case-study method with heavy theoretical roots in structural anthropology & community studies. Loomis, 1947. (39 nodes, ~90 relations)

  5. Introduction Social Network analysis has grown out of the case-study method with heavy theoretical roots in structural anthropology & community studies. Everyone s favorite Karate Club (34 nodes, 156 edges)

  6. Introduction 1) But contemporary network data are often not the single case- studies that characterized the birth of the field. Rather we now see three common extensions: a) Massive single networks - Online data, customer records, electronic medical records data. - Many bi-partite network opportunities through social activity tracing. b) Deep data the problem of exquisite detail - sensor data, text data and other sorts of intensive data collection routines produce thick data descriptions on even a small number of nodes. (will only hint at this today due to time) c) Small multiples - Same data structure repeated in many settings. - Add Health, Prosper, etc. How well do our standard strategies scale for these sorts of problems?

  7. Introduction

  8. Introduction

  9. Introduction

  10. Introduction Big , in a social network, is primarily a function of its number of ties. Consequently, we need to think about how we are defining a tie, particularly in contexts where the edges represent affiliations, proximities, or similarities. For similarities and proximities, all nodes may be connected to all other nodes because all pairs of nods have a value. To isolate the underlying structure from the noise, it is often useful to thin the network by applying a treatment. These treatments generally come in three flavors: Threshold Methods (e.g., line islands) Ranking Methods (e.g., top k alters) Likelihood Methods (e.g., isolating ties with frequencies greater than we expect by chance) 10

  11. Introduction If networks now spanning something like 6 orders of magnitude, do we need new theory? Largely, no. The foundational division between connectionist and positional approaches to networks stands; what s changed is that we are forced to specify internal boundary conditions rather than taking our boundary conditions as given by the case. What first appears as a methods problem is, usually, a theory problem.

  12. Connections & Positions: Network Problems Ego Complete Multiple - Community Detection - Reachability - Homophily - Degree Distribution - Social Balance - ERGm - Multi-layer networks - Structural Holes - Density - Mixing Models - Size Connectionist: Networks as pipes - Multi-level models of multiple networks Positional: Networks as roles - Local Roles (Mandel 1983, Mandel & Winship 1984) 2 ideas: - Patterns in networks - Relational Block Models - Motifs - Patterns of networks

  13. Three Problems, Three Strategies If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three: 1. Big Networks Connectionist: What is the relevant flow? How far? What governs spread? Positional: what s the social horizon for action within this structure? Roles relative to who? Traditional answers to these questions assume a well-bounded relevance horizon and take the full network as relevant. So things like geodesic distances are meaningful. Is that true for networks with billions of edges? Problem: exponential runtime & interpretability. 2. Too many trees in the forest? 3. Too many forests?

  14. Three Problems, Three Strategies If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three: 1. Big Networks 2. Too many trees in the forest? Connectionist: With continuous time data, when is a relation? With 1000s of interactants, who s relevant for information flow? Positional: What gives text meaning? How is a note, phrase or gesture situated in a wider (perhaps unseen) context? Traditional answers rely on ethnographic sensitivity and deep implicit understanding. Can we leverage computational tools to augment and regularize this? Problem: Too much data, too little contextual understanding, methods tuned to thin networks. 3. Too many forests?

  15. Three Problems, Three Strategies If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three: 1. Big Networks 2. Too many trees in the forest? 3. Too many forests? Connectionist: Why do rumors spread faster in one setting than another? Positional: Is the relational hierarchy similar across multiple schools? Traditional answers rely on user judgement for many network analytic decisions; it's a model rooted in deep data involvement. Problem: Case paradox: the same skills that make for good judgement in a single case make for either inconsistency across cases or insensitivity to contextual nuance.

  16. Three Problems, Three Strategies Each problem carries with it an implicit solution strategy: 1. Big Networks: Divide & Conquer. Most real-world networks are not actually a single unified network, but rather a highly clustered network-of-networks. It's generally more sensitive to variation, and faster, to divide the problem along natural fault lines & work within them. 2. Too many trees in the forest? Build maps, find patterns. High fidelity data pushes us to abstract in new ways that (might?) provide insights into meaning. Think like a naturalist. 3. Too many forests? Regularize case study insights. The task is to weave between the ideals of case-specificity and methodological consistency. Is it OK to use a different resolution parameter in each setting? How do you develop decision rules? We think it's a two-step problem: (a) identify what underlies choices in a single, well-known case; then (b) regularize on that meta level.

  17. Big Networks: Divide & Conquer Giant networks pose two sorts of problems: practical and theoretical. Practical: Can you manipulate, clean, load, and create the necessary data structures given the scale of the data and your computational resources? R/igraph is often very good at simple large-scale calculations. Assuming you have a graph object, getting many node-level metrics is often very fast. You can often use these simple stats in creative ways, particularly by crossing them with attributes/communities. If not, sometimes the solution is merely adding more computational power. This is sometimes possible, though often limited by data use restrictions (privacy, DUA etc.). Most of the time the solution is to rethink the problem and make it smaller somehow. The two most common solutions are to divide or localize.

  18. Big Networks: Divide & Conquer Giant networks pose two sorts of problems: practical and theoretical. Theoretical: What scale is socially relevant? Disease spread is naturally on the order of billions, but interventions/resources/etc. are not. Most network processes are enacted locally despite global emergent implications. Do you expect the same process at each place in the network? I.e., is the susceptibility to social influence the same at each place? Does the ERGM homogeneity-of-parameters assumption hold? Likely no, which means we don't really want to analyze it as a single network, but as a collection of linked networks. Do small-scale ease-of-use practices scale? Geodesics are the common path on small nets because they are easy to compute there; but are they relevant for massive networks?

  19. Big Networks: Divide & Conquer Most social networks admit natural fissures, and it often makes more sense to differentiate analysis within vs. between these fissures. Social processes are rarely even across communities. Between-community structure can itself be interesting. This creates a need for multi-level modeling & analysis. (Figure panels: whole network; first-level cluster; 2nd-level cluster.)

  20. Big Networks: Localize Pay attention to the base features of large-scale global networks, then use that to tailor what you work with substantively. Are you interested in the longest tail of your degree distribution? Is the network a set of hubs-and-spokes? Clusters? A homogeneous weave? Think about the way the global structure shapes visibility for the local structure and ask if you can just use the local (say two-step) structure. Can the (local) process you are interested in be sampled? Generally it's possible to localize networks around nodes (construct ego networks of k steps, for example, or all nodes within k steps with particular attributes), then analyze each of those to construct your boundary.
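
One way to sketch the k-step localization idea, using only the Python standard library. The toy edge list, the function names, and the optional `keep` attribute filter are all illustrative assumptions, not anything from the projects described here; on real data an igraph neighborhood call would do the same job.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_step_ego(adj, ego, k, keep=None):
    """Breadth-first search out to k steps; returns {node: distance from ego}.

    `keep` is an optional (hypothetical) attribute filter. It is applied after
    the search, so filtered-out nodes can still act as paths to other alters.
    """
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-step horizon
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    if keep is not None:
        dist = {n: d for n, d in dist.items() if n == ego or keep(n)}
    return dist

def induced_edges(adj, nodes):
    """Edges among `nodes` only, each undirected edge listed once."""
    return sorted({tuple(sorted((u, v)))
                   for u in nodes for v in adj.get(u, ()) if v in nodes})

# Toy graph: a path a-b-c-d-e plus one extra tie b-f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("b", "f")]
adj = build_adjacency(edges)

ego_net = k_step_ego(adj, "a", 2)            # two-step horizon around "a"
local_edges = induced_edges(adj, set(ego_net))
```

Each ego network built this way is small enough to analyze with ordinary tools, and the set of ego networks can be sampled rather than enumerated.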

  21. Big Networks: Divide & Conquer examples EMR Data on physicians: a 1/10 sample of patients in Ohio connects to physicians around the nation. The 2010 Ohio Physician network has 38K nodes and 2.2M edges. We repeated this for 5 states over 3 panels. This still left us with 474K nodes and 11.3M edges.

  22. Big Networks: Divide & Conquer examples EMR Data on physicians: a 1/10 sample of patients in Ohio connects to physicians around the nation. The logic behind physician shared-patient networks is one of coordinated care: who you talk to about patients affects treatment. This pushes us away from the total network toward identifying a set of localized communities. The graph breaks into 116 localized communities with sizes ranging from 30 to nearly 1000.
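
A shared-patient network is a one-mode projection of a two-mode (patient by physician) table. The sketch below shows that projection on made-up records; the record layout, the identifiers, and the choice to weight ties by the number of shared patients are assumptions for illustration, not a description of how the EMR data were actually processed.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (patient, physician) visit records; repeats are allowed.
visits = [
    ("p1", "drA"), ("p1", "drB"), ("p1", "drA"),
    ("p2", "drA"), ("p2", "drB"), ("p2", "drC"),
    ("p3", "drC"),
]

def shared_patient_network(visits):
    """Project patient-physician records onto physicians.

    Returns {(physician_1, physician_2): number of shared patients}, with each
    pair stored once in sorted order.
    """
    physicians_by_patient = defaultdict(set)
    for patient, physician in visits:
        physicians_by_patient[patient].add(physician)  # sets drop repeat visits

    weights = defaultdict(int)
    for physicians in physicians_by_patient.values():
        for a, b in combinations(sorted(physicians), 2):
            weights[(a, b)] += 1
    return dict(weights)

net = shared_patient_network(visits)
```

Note that the projection grows with the square of the number of physicians per patient, which is one reason these networks get dense quickly and why cutting them into localized communities helps.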

  23. Big Networks: Divide & Conquer examples Total network has 40M nodes; we used a two-level clustering procedure to identify reasonably compact communities, but these are also large. Here one 2nd-level community is over 6000 nodes. But the structure starts to become apparent at this level. Homo SocioNeticus: Scaling the cognitive foundations of online social behavior. Defense Advanced Research Projects Agency (DARPA), Mark Orr, PI

  24. Big Networks: Divide & Conquer examples Network of 300K Twitter users. Modularity on the first-level cut was over 0.9 (Louvain, weighted graph). This is a heat-map mixing matrix, where each row/column is a community.
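
For readers who want to see what sits behind a mixing matrix and a modularity score, here is a minimal sketch for an undirected weighted graph with a partition already in hand. It does not run Louvain; the toy graph and community labels are made up, and the modularity formula is the standard one, Q = sum over communities of (within-community weight / total weight) minus (community strength / (2 x total weight)) squared.

```python
from collections import defaultdict

def mixing_matrix(edges, membership):
    """Total tie weight between each pair of communities (upper triangle)."""
    mix = defaultdict(float)
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        mix[(min(cu, cv), max(cu, cv))] += w
    return dict(mix)

def modularity(edges, membership):
    """Weighted modularity of a given partition of an undirected graph."""
    total = sum(w for _, _, w in edges)
    within = defaultdict(float)    # tie weight inside each community
    strength = defaultdict(float)  # summed weighted degree of each community
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        strength[cu] += w
        strength[cv] += w
        if cu == cv:
            within[cu] += w
    return sum(within[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)

# Toy example: two triangles joined by a single bridging tie, all weights 1.
edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
         (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0),
         (3, 4, 1.0)]
membership = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

mix = mixing_matrix(edges, membership)
q = modularity(edges, membership)
```

The diagonal of the mixing matrix carries the within-community weight, so a heat map that is dark on the diagonal and pale elsewhere is the visual counterpart of a high modularity score.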

  25. Exemplar Strategies Big Networks Network of ~300K Twitter users. Details for 2 of the clusters. Communities 1 & 2: N=21487. Community 5: N=7773.

  26. Exemplar Strategies Big Networks Of course, the D&C model requires being able to divide the network. - See Mucha's presentation on strategies for this - R will do very large networks with things like fastgreedy, spectral or stochastic block models. I've usually used Pajek for this. It's optimized for very large networks (up to billions of nodes) and provides good control over resolution parameters and such. You have to go through the hassle of moving your data in/out; but I've found it worth the effort over a purely R solution. igraph in Python is more flexible for clustering, so that's another option.

  27. Exemplar Strategies Big Networks A localization example. Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the same sorts of actors.

  28. Exemplar Strategies Big Networks A localization example. Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the same sorts of actors. We can then ask things like what the distribution of hubs/spokes is in large networks.

  29. Exemplar Strategies Big Networks A localization example. Role profile plots (each x value is a within & between community involvement score). It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly. Role labels in the figure: Low activity, Pendants; Low activity retweet bridging; Low activity, Mixed bridge; Quote Bridges; Reply Bridges (wgt); High activity retweet bridging; Active group members; Local Authorities, fighters; Superstar hubs. Role shares in the figure (in extraction order, not matched to labels): 12%, 13.6%, 55.6%, 12.5%, 3.4%, 1%, 1.4%, 0.5%, 0.06%.

  30. Exemplar Strategies Big Networks A localization example. Roles overlaid on a single community (figure labels: Role 7, Role 8, Role 9). It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly.
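
A rough sketch of the cheap-scores idea: for each node, count ties inside its own community and ties that cross community lines, giving a two-number involvement profile. The toy graph is made up, and the hand-set thresholds in `coarse_role` are only a hypothetical stand-in for the clustering step; a real analysis would use more scores and cluster the profiles (k-means, hierarchical, etc.) rather than threshold them.

```python
from collections import defaultdict

def involvement_scores(edges, membership):
    """Return {node: (within_community_ties, between_community_ties)}."""
    scores = {node: [0, 0] for node in membership}
    for u, v in edges:
        idx = 0 if membership[u] == membership[v] else 1
        scores[u][idx] += 1
        scores[v][idx] += 1
    return {node: tuple(pair) for node, pair in scores.items()}

def coarse_role(within, between):
    """Hypothetical hand-set rule standing in for clustering the profiles."""
    if within + between <= 1:
        return "pendant"
    if between > 0:
        return "bridge"
    return "member"

# Toy graph: two communities, one bridging tie (a1-b1), one pendant (a5).
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("a1", "a5"),
         ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
         ("a1", "b1")]
membership = {"a1": "A", "a2": "A", "a3": "A", "a5": "A",
              "b1": "B", "b2": "B", "b3": "B"}

scores = involvement_scores(edges, membership)
roles = {node: coarse_role(*pair) for node, pair in scores.items()}

# Group nodes into positions by role label.
positions = defaultdict(list)
for node, role in sorted(roles.items()):
    positions[role].append(node)
```

Both passes are linear in the number of edges, which is why this scales to graphs where a full structural equivalence model does not.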

  31. Exemplar Strategies Deep Networks Characterized by lots of data on a small N of cases. Big does not always involve billions of nodes. Contemporary data-collection routines can yield massive datasets on even a single person. Easy-to-detect features. Example: real-time video data on Data+ students. Only 10 nodes, but 30 terabytes of video data. Alex Volfovsky, Katherine Heller, James Moody. Building Better Teams: A network analysis approach. Army Research Office.

  32. Exemplar Strategies Deep Networks Strategies here are often focused on three basic sorts: a) Wrangling: just putting the data in an analyzable format. This is non-trivial; some off-the-shelf AI tools are making this easier, but most of it requires bespoke programming. Teams data has taken months to move from video to tabulations. b) Sifting: most of the data you collected is, sadly, irrelevant. Tracer data: do you want data on each movement? What about when people are sleeping? Or alone in their car? c) Aggregating: or, at least, irrelevant at the scale collected. Do you want the data item, or similarity between actors across multiple items? Often the fine-grained data gets turned into a vector.
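
To make the aggregating step concrete, here is a small sketch that collapses fine-grained records into one count vector per actor and then compares actors by cosine similarity. The records, actor names, and item codes are invented, and cosine is just one reasonable similarity choice among several.

```python
import math
from collections import Counter

# Hypothetical fine-grained records: one row per (actor, observed item).
records = [
    ("ann", "laugh"), ("ann", "laugh"), ("ann", "question"),
    ("bob", "laugh"), ("bob", "question"), ("bob", "question"),
    ("cat", "interrupt"),
]

def actor_vectors(records):
    """Collapse item-level rows into a sparse count vector per actor."""
    vectors = {}
    for actor, item in records:
        vectors.setdefault(actor, Counter())[item] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = actor_vectors(records)
sim_ann_bob = cosine(vectors["ann"], vectors["bob"])
sim_ann_cat = cosine(vectors["ann"], vectors["cat"])
```

The resulting actor-by-actor similarities can be thresholded into ties or kept as weights, which brings back the earlier question of how a tie is being defined.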

  33. Exemplar Strategies Deep Networks Interaction rituals and bonding. Speed-dating research: 4-minute dates; men rotate while women stay in their seats. Interaction rituals: Who clicks with whom? How do social bonds form? Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland

  34. Exemplar Strategies Deep Networks Our speed date setup

  36. Exemplar Strategies Deep Networks What do you do for fun? Dance? Uh, dance, uh, I like to go, like camping. Uh, snowboarding, but I'm not good, but I like to go anyway. You like boarding. Yeah. I like to do anything. Like I, I'm up for anything. Really? Yeah. Are you open-minded about most everything? Not everything, but a lot of stuff- What is not everything [laugh] I don't know. Think of something, and I'll say if I do it or not. [laugh] Okay. [unintelligible]. Skydiving. I wouldn't do skydiving I don't think. Yeah I'm afraid of heights. F: Yeah, yeah, me too. M: [laugh] Are you afraid of heights? F: [laugh] Yeah [laugh] Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland

  37. Exemplar Strategies Deep Networks The Speed Date Study: 991 4-minute dates (3 events, each with ~20x20=400 dates; some data loss). Participants: 110 graduate student volunteers in 2005, who participated in return for the chance to date. Speech: ~70 hours from shoulder sash recorders; high noise. Transcripts: ~800K words, hand-transcribed w/ turn boundary times. Surveys (pre-test surveys, event scorecards, post-test surveys): date perceptions and follow-up interest; general attitudes, preferences, demographics. Largest natural experiment with audio, text + survey info.

  38. Exemplar Strategies Deep Networks Strategic Bonding: How to get a desired mutual bond? Men, actor speech: vary pitch; avoid talking about work; talk about yourself more than usual. Women, actor speech: laugh; make appreciations; take short turns; talk about yourself and drinking.

  39. Exemplar Strategies Deep Networks Shared schema on Scents. Culture in Objects: culture is stored more/less ambiguously. We should try to approach material culture without reducing objects to instantiations of discourse or realizations of cognitive representations. (Mukerji 1997: 36)

  40. Exemplar Strategies Deep Networks Shared schema on Scents. Data & Methods: Fragrances. While humans are actually quite good at smelling, they are not good at describing smells in intersubjectively agreed-upon ways (Barwich 2020). Smells are deeply symbolic and cultural (Sperber 1978), and perhaps the least intellectual of the senses (Gonzalez-Crussi 1989). Slides courtesy of Craig Rawlings

  41. Exemplar Strategies Deep Networks Shared schema on Scents Schema 1 (n=36)

  42. Exemplar Strategies Deep Networks Shared schema on Scents Schema 2 (n=46)

  43. Exemplar Strategies Deep Networks Shared schema on Scents Schema 3 (n=33)

  44. Exemplar Strategies Deep Networks Shared schema on Scents H1: Controlling for interactions and objects, shared schemas predict initial interpersonal consensus in meanings.

  45. Exemplar Strategies Deep Networks Video data on children's interactions Research Question: Can we use social network analysis to enhance our understanding of how sociality and social cohesion develop in preschool-aged children? Specific Aims: 1. Feasibility of video data collection 2. Coding scheme development 3. Social network analysis Slides courtesy of Craig Rawlings

  46. Exemplar Strategies Deep Networks Video data on children's interactions Data Collection: naturalistic video observations of classroom activities; 4 microcameras; 15-minute segments, 4x per week.

  47. Exemplar Strategies Deep Networks Video data on children's interactions Coding Scheme Behaviors: - Conflict and Cooperation Modifiers: - Self-initiated, Other-initiated, or Mutually-initiated - Physical or Non-physical
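
As a sketch of how a coding scheme like this could feed a network, the block below stores each coded event with its behavior and modifiers, then tallies directed ties from initiator to target, one network per behavior. The field names, the child names, and the rule that a mutually-initiated event counts in both directions are all assumptions made for illustration; the actual coding protocol is not described in these slides.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CodedEvent:
    segment: int      # which video segment the event was coded in
    child_a: str      # focal child
    child_b: str      # other child
    behavior: str     # "conflict" or "cooperation"
    initiated: str    # "self", "other", or "mutual", relative to child_a
    physical: bool    # physical vs. non-physical modifier

def edges_by_behavior(events):
    """Tally directed initiator -> target counts, one network per behavior."""
    nets = defaultdict(Counter)
    for e in events:
        if e.initiated == "self":
            pairs = [(e.child_a, e.child_b)]
        elif e.initiated == "other":
            pairs = [(e.child_b, e.child_a)]
        else:  # assumption: a mutual event counts once in each direction
            pairs = [(e.child_a, e.child_b), (e.child_b, e.child_a)]
        for pair in pairs:
            nets[e.behavior][pair] += 1
    return nets

# Hypothetical coded events.
events = [
    CodedEvent(1, "kim", "lee", "cooperation", "self", False),
    CodedEvent(1, "lee", "kim", "cooperation", "other", False),
    CodedEvent(2, "kim", "max", "conflict", "mutual", True),
    CodedEvent(2, "max", "lee", "conflict", "self", True),
]

nets = edges_by_behavior(events)
```

Filtering `events` by `segment` before tallying gives one network per time slice, which is one simple route to a dynamic network.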

  48. Exemplar Strategies Deep Networks Video data on children's interactions Social Network Analysis Dynamic Network - Conflict vs. Cooperation - Sex and Age

  49. Exemplar Strategies Deep Networks Video data on children's interactions Social Network Analysis

  50. Exemplar Strategies Deep Networks Video data on children's interactions What's Next? Continue to create social networks to analyze in R. Patterns of relationships facilitate prosocial behavior. What does this mean for intervention to increase compassionate behavior early in life?
