Strategies and Cautionary Tales in Big Networks
Social network analysis has evolved from single case studies to encompass massive single networks, deep data, and small multiples (the same data structure repeated across many settings). This evolution challenges how well traditional strategies scale for complex network data. Understanding how ties are defined is crucial to interpreting network data effectively.
Big Networks: Strategies and Cautionary Tales
Outline
1. Introduction: Growth in network data
2. Three Problems, Three Strategies
   1. Big Data
   2. Deep Data
   3. Many Data
3. Examples & War Stories
4. Some Coding & data wrangling hints
5. Conclusion
Introduction
Social network analysis has grown out of the case-study method, with heavy theoretical roots in structural anthropology & community studies.
Moreno's Who Shall Survive, 1936 (24 nodes; 82 arcs, two relations)
Introduction
Loomis, 1947 (39 nodes, ~90 relations)
Introduction
Everyone's favorite: the Karate Club (34 nodes, 156 edges)
Introduction
But contemporary network data are often not the single case studies that characterized the birth of the field. Rather, we now see three common extensions:
a) Massive single networks
   - Online data, customer records, electronic medical records data.
   - Many bipartite network opportunities through social activity tracing.
b) Deep data: the problem of exquisite detail
   - Sensor data, text data, and other sorts of intensive data collection routines produce thick data descriptions on even a small number of nodes. (Will only hint at this today due to time.)
c) Small multiples
   - Same data structure repeated in many settings.
   - Add Health, Prosper, etc.
How well do our standard strategies scale for these sorts of problems?
Introduction
"Big," in a social network, is primarily a function of its number of ties. Consequently, we need to think about how we are defining a tie, particularly in contexts where the edges represent affiliations, proximities, or similarities. For similarities and proximities, all nodes may be connected to all other nodes because all pairs of nodes have a value. To isolate the underlying structure from the noise, it is often useful to thin the network by applying a treatment. These treatments generally come in three flavors:
- Threshold methods (e.g., line islands)
- Ranking methods (e.g., top k alters)
- Likelihood methods (e.g., isolating ties with frequencies greater than we expect by chance)
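The three thinning flavors can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure used in the talk: the function names, the toy weights, and the Poisson-style z-score used for the likelihood flavor are all assumptions chosen for brevity.

```python
from math import sqrt

def threshold_thin(weights, cutoff):
    """Threshold flavor: keep ties whose weight is at least `cutoff`."""
    return {pair: w for pair, w in weights.items() if w >= cutoff}

def top_k_thin(weights, k):
    """Ranking flavor: keep a tie if it is among the k strongest ties
    of at least one of its two endpoints."""
    by_node = {}
    for (i, j), w in weights.items():
        by_node.setdefault(i, []).append((w, (i, j)))
        by_node.setdefault(j, []).append((w, (i, j)))
    keep = set()
    for ties in by_node.values():
        ties.sort(key=lambda t: t[0], reverse=True)
        keep.update(pair for _, pair in ties[:k])
    return {pair: weights[pair] for pair in keep}

def likelihood_thin(counts, z_cutoff=2.0):
    """Likelihood flavor (assumed null model): expected count for a pair is
    s_i * s_j / (2 * total), where s is a node's summed tie count. Keep ties
    whose rough Poisson z-score exceeds `z_cutoff`."""
    strength = {}
    for (i, j), c in counts.items():
        strength[i] = strength.get(i, 0) + c
        strength[j] = strength.get(j, 0) + c
    total = sum(counts.values())
    kept = {}
    for (i, j), c in counts.items():
        expected = strength[i] * strength[j] / (2 * total)
        if (c - expected) / sqrt(expected) > z_cutoff:
            kept[(i, j)] = c
    return kept

# Hypothetical similarity weights and co-occurrence counts.
similarity = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5, ("c", "d"): 0.1}
cooccurrence = {("a", "b"): 10, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 1}
```

Note that the three treatments answer different questions: a threshold keeps globally strong ties, top-k keeps each node's locally strongest ties (so weakly connected nodes stay in the graph), and the likelihood flavor keeps ties that are surprising given how active each endpoint is.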
Introduction
If networks now span something like 6 orders of magnitude, do we need new theory? Largely, no. The foundational division between connectionist and positional approaches to networks stands; what's changed is that we are forced to specify internal boundary conditions rather than taking our boundary conditions as given by the case. What first appears as a methods problem is, usually, a theory problem.
Connections & Positions: Network Problems Ego Complete Multiple - Community Detection - Reachability - Homophily - Degree Distribution - Social Balance - ERGm - Multi-layer networks - Structural Holes - Density - Mixing Models - Size Connectionist: Networks as pipes - Multi-level models of multiple networks Positional: Networks as roles - Local Roles (Mandel 1983, Mandel & Winship 1984) 2 ideas: - Patterns in networks - Relational Block Models - Motifs - Patterns of networks
Three Problems, Three Strategies
If what first appears as a methods problem is, usually, a theory problem, then what are these problems? We see three:
1. Big Networks
Connectionist: What is the relevant flow? How far? What governs spread?
Positional: What's the social horizon for action within this structure? Roles relative to whom?
Traditional answers to these questions assume a well-bounded relevance horizon and take the full network as relevant, so things like geodesic distances are meaningful. Is that true for networks with billions of edges?
Problem: exponential runtime & interpretability.
2. Too many trees in the forest?
3. Too many forests?
Three Problems, Three Strategies
2. Too many trees in the forest?
Connectionist: With continuous-time data, when does a relation exist? With 1000s of interactants, who's relevant for information flow?
Positional: What gives text meaning? How is a note, phrase, or gesture situated in a wider (perhaps unseen) context?
Traditional answers rely on ethnographic sensitivity and deep implicit understanding. Can we leverage computational tools to augment and regularize this?
Problem: too much data, too little contextual understanding, methods tuned to thin networks.
Three Problems, Three Strategies
3. Too many forests?
Connectionist: Why do rumors spread faster in one setting than another?
Positional: Is the relational hierarchy similar across multiple schools?
Traditional answers rely on user judgement for many network-analytic decisions; it's a model rooted in deep data involvement.
Problem: the case paradox. The same skills that make for good judgement in a single case make for either inconsistency across cases or insensitivity to contextual nuance.
Three Problems, Three Strategies
Each problem carries with it an implicit solution strategy:
1. Big Networks: Divide & Conquer
Most real-world networks are not actually a single unified network, but rather a highly clustered network-of-networks. It's generally faster, and more sensitive to variation, to divide the problem along natural fault lines and work within them.
2. Too many trees in the forest? Build maps, find patterns.
High-fidelity data pushes us to abstract in new ways that (might?) provide insights into meaning. Think like a naturalist.
3. Too many forests? Regularize case study insights.
The task is to weave between the ideals of case-specificity and methodological consistency. Is it OK to use a different resolution parameter in each setting? How do you develop decision rules? We think it's a two-step problem: (a) identify what underlies choices in a single, well-known case; then (b) regularize on that meta level.
Big Networks: Divide & Conquer
Giant networks pose two sorts of problems: practical and theoretical.
Practical
- Can you manipulate, clean, load, and create the necessary data structures given the scale of the data and your computational resources?
- R/igraph is often very good at simple large-scale calculations. Assuming you have a graph object, getting many node-level metrics is often very fast. You can often use these simple stats in creative ways, particularly by crossing them with attributes/communities.
- If not, sometimes the solution is merely adding more computational power. This is sometimes possible, though often limited by data use restrictions (privacy, DUA, etc.).
- Most of the time the solution is to rethink the problem and make it smaller somehow. The two most common solutions are to divide or localize.
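As a rough illustration of crossing a cheap node-level metric with communities, the sketch below computes degree from an edge list and averages it within each group. It is a standard-library Python stand-in for what the slide describes doing in R/igraph; the function names, edge list, and community labels are hypothetical.

```python
from collections import defaultdict

def degree(edges):
    """Count ties per node from an undirected edge list (one pass)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def mean_by_group(values, group_of):
    """Average a node-level metric within each community or attribute value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, value in values.items():
        group = group_of[node]
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical edge list and community labels.
edges = [(1, 2), (1, 3), (1, 4), (4, 5)]
community = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

deg = degree(edges)
mean_degree = mean_by_group(deg, community)
```

Both steps are single passes over the edges or nodes, which is why this kind of summary stays feasible at scales where path-based measures do not.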
Big Networks: Divide & Conquer
Giant networks pose two sorts of problems: practical and theoretical.
Theoretical
- What scale is socially relevant? Disease spread is naturally on the order of billions, but interventions, resources, etc. are not. Most network processes are enacted locally despite global emergent implications.
- Do you expect the same process at each place in the network? That is, is the susceptibility to social influence the same for each place? Does the ERGM assumption of homogeneous parameters hold? Likely not, which means we don't really want to analyze it as a single network, but as a collection of linked networks.
- Do small-scale ease-of-use practices scale? Geodesics are the common path on small nets because they are easy to compute there; but are they relevant for massive networks?
Big Networks: Divide & Conquer
Most social networks admit natural fissures, and it often makes more sense to differentiate analysis within vs. between these fissures.
- Social processes are rarely even across communities.
- Between-community structure can itself be interesting.
- This creates a need for multi-level modeling & analysis.
[Figure: whole network, first-level cluster, second-level cluster]
Big Networks: Localize
Pay attention to the base features of large-scale global networks, then use that to tailor what you work with substantively.
- Are you interested in the long tail of your degree distribution?
- Is the network a set of hubs and spokes? Clusters? A homogeneous weave?
- Think about the way the global structure shapes visibility for the local structure, and ask if you can just use the local (say, two-step) structure.
- Can the (local) process you are interested in be sampled?
Generally it's possible to localize networks around nodes (construct ego networks of k steps, for example, or all nodes within k steps with particular attributes), then analyze each of those to construct your boundary.
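Localizing around a node can be sketched as a breadth-first search that stops after k steps. The following is a minimal standard-library Python version; the function names and the toy path graph are assumptions, and node labels are assumed to be sortable so each undirected edge can be stored once.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def ego_network(adj, ego, k):
    """Return (distance-from-ego for every node within k steps,
    set of edges among those nodes)."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        u = queue.popleft()
        if dist[u] == k:  # do not expand past the k-step horizon
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    nodes = set(dist)
    # u < v keeps one copy of each undirected edge (assumes sortable labels).
    edges = {(u, v) for u in nodes for v in adj.get(u, ()) if v in nodes and u < v}
    return dist, edges

# Hypothetical path graph 1-2-3-4-5.
adj = build_adjacency([(1, 2), (2, 3), (3, 4), (4, 5)])
dist, local_edges = ego_network(adj, ego=1, k=2)
```

Filtering the returned nodes on an attribute before analysis gives the "all nodes within k steps with particular attributes" variant mentioned above.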
Big Networks: Divide & Conquer examples
EMR data on physicians: a 1/10 sample of patients in Ohio connects to physicians around the nation. The 2010 Ohio physician network has 38K nodes and 2.2M edges. We repeated this for 5 states over 3 panels. This still left us with 474K nodes and 11.3M edges.
Big Networks: Divide & Conquer examples
EMR data on physicians, continued. The logic behind physician shared-patient networks is one of coordinated care: who you talk to about patients affects treatment. This pushes us away from the total network toward identifying a set of localized communities. The graph breaks into 116 localized communities with sizes ranging from 30 to nearly 1000.
Big Networks: Divide & Conquer examples
The total network has 40M nodes; we used a two-level clustering procedure to identify reasonably compact communities. These are also large: here, one second-level community has over 6,000 nodes. But the structure starts to become apparent at this level.
Homo SocioNeticus: Scaling the cognitive foundations of online social behavior. Defense Advanced Research Projects Agency (DARPA); Mark Orr, PI.
Big Networks: Divide & Conquer examples
Network of 300K Twitter users. Modularity on the first-level cut was over 0.9 (Louvain, weighted graph).
[Figure: heat map of the mixing matrix, where each row/column is a community]
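For a given partition, both the mixing matrix and the modularity score are simple edge counts. The sketch below is a standard-library Python illustration on an unweighted toy graph; the slide's result used Louvain on a weighted graph, and the partition here is supplied by hand rather than detected. All names and data are hypothetical.

```python
from collections import defaultdict

def mixing_matrix(edges, community):
    """Count edges within and between communities (unordered pairs)."""
    mix = defaultdict(int)
    for u, v in edges:
        a, b = sorted((community[u], community[v]))
        mix[(a, b)] += 1
    return dict(mix)

def modularity(edges, community):
    """Newman modularity for an unweighted, undirected graph:
    Q = sum over communities of (within_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    within = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            within[cu] += 1
    return sum(within[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical graph: two triangles joined by a single bridge (3-4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

mix = mixing_matrix(edges, community)
q = modularity(edges, community)
```

The diagonal of the mixing matrix holds the within-community edges; a heat map like the one on the slide is this matrix drawn with one cell per community pair.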
Exemplar Strategies: Big Networks
Network of ~300K Twitter users. Details for 2 of the clusters.
[Figure panels: Communities 1 & 2 (N=21,487); Community 5 (n=7,773)]
Exemplar Strategies: Big Networks
Of course, the D&C model requires being able to divide the network.
- See Mucha's presentation on strategies for this.
- R will do very large networks with things like fastgreedy, spectral, or stochastic block models.
I've usually used Pajek for this. It's optimized for very large networks (up to billions of nodes) and provides good control over resolution parameters and such. You have to go through the hassle of moving your data in and out, but I've found it worth the effort over a purely R solution. igraph in Python is more flexible for clustering, so that's another option.
Exemplar Strategies: Big Networks
A localization example. Twitter network with 1.1M nodes. Note the tails: these cannot be substantively the same sorts of actors. We can then ask things like the distribution of hubs/spokes in large networks.
Exemplar Strategies: Big Networks
A localization example: role profile plots.
It's also possible to use simple-to-calculate scores in unique ways. So while it's time and space/bandwidth consuming to run a full triad-based structural equivalence model over a giant network, you can calculate a host of local and bridging sorts of scores, then cluster those to get positions quickly.
[Figure: role profile plots; each x value is a within & between community involvement score. Roles shown: Low activity, pendants; Low activity retweet bridging; Low activity, mixed bridge; Quote bridges; Reply bridges (wgt); High activity retweet bridging; Active group members; Local authorities, fighters; Superstar hubs. Role shares as extracted (pairing with roles not recoverable from the transcript): 12%, 13.6%, 55.6%, 12.5%, 3.4%, 1%, 1.4%, 0.5%, 0.06%.]
Exemplar Strategies: Big Networks
A localization example: roles overlaid on a single community.
[Figure: Roles 7, 8, and 9 highlighted within one community]
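The "cheap scores, then cluster" idea can be sketched as follows. This is a simplified standard-library Python illustration, not the project's actual pipeline: it uses only two scores per node (within- and between-community tie counts) where the slides use separate scores by tie type, and it uses a plain k-means with hand-picked starting centers. All names and data are hypothetical.

```python
def involvement_scores(edges, community):
    """Per node: (ties inside own community, ties bridging to other communities)."""
    scores = {}
    for u, v in edges:
        same = community[u] == community[v]
        for node in (u, v):
            within, between = scores.get(node, (0, 0))
            scores[node] = (within + same, between + (not same))
    return scores

def nearest(point, centers):
    """Index of the closest center by squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])),
    )

def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm from caller-supplied starting centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

# Hypothetical graph: two triangles joined by a bridge between nodes 3 and 4.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

scores = involvement_scores(edges, community)
centers = kmeans(list(scores.values()), [(2.0, 0.0), (2.0, 1.0)])
roles = {node: nearest(vec, centers) for node, vec in scores.items()}
```

Every step is linear in the number of edges or nodes, which is the point of the approach: positions come from clustering short score vectors rather than from comparing full tie patterns.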
Exemplar Strategies: Deep Networks
Characterized by lots of data on a small N of cases. "Big" does not always involve billions of nodes. Contemporary data-collection routines can yield massive datasets on even a single person. Easy-to-detect features.
Example: real-time video data on Data+ students. Only 10 nodes, but 30 terabytes of video data.
Volfovsky, Alex, Katherine Heller, and James Moody. Building Better Teams: A network analysis approach. Army Research Office.
Exemplar Strategies: Deep Networks
Strategies here are often focused on three basic sorts:
a) Wrangling. Just putting the data in an analyzable format. This is non-trivial; some off-the-shelf AI tools are making this easier, but most of it requires bespoke programming. The Teams data has taken months to move from video to tabulations.
b) Sifting. Most of the data you collected is, sadly, irrelevant.
   - Tracer data: do you want data on each movement? What about when people are sleeping? Or alone in their car?
c) Aggregating. Or, at least, irrelevant at the scale collected. Do you want the data item, or the similarity between actors across multiple items? Often the fine-grained data gets turned into a vector.
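The aggregating step, turning fine-grained records into one vector per actor and then comparing actors, can be sketched like this. It is a minimal standard-library Python illustration; the record layout, item names, and the choice of cosine similarity are assumptions, not the method used for the Teams data.

```python
from math import sqrt

def profile_vectors(events, items):
    """Aggregate (actor, item, count) records into one count vector per actor,
    with positions following the order of `items`."""
    index = {item: pos for pos, item in enumerate(items)}
    vectors = {}
    for actor, item, count in events:
        vec = vectors.setdefault(actor, [0] * len(items))
        vec[index[item]] += count
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical fine-grained records: (actor, coded item, count).
events = [
    ("ann", "laugh", 3), ("ann", "question", 1),
    ("bob", "laugh", 6), ("bob", "question", 2),
    ("cat", "question", 4),
]
vectors = profile_vectors(events, items=["laugh", "question"])
```

The resulting actor-by-actor similarities form a complete weighted network, so they would typically be thinned before being analyzed as ties.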
Exemplar Strategies: Deep Networks
Interaction rituals and bonding
- Speed-dating research: 4-minute dates; men rotate while women stay in their seats.
- Interaction rituals: Who clicks with whom? How do social bonds form?
Slides courtesy of Craig Rawlings, collaborative work w. Daniel McFarland
Exemplar Strategies: Deep Networks
Our speed date setup
Exemplar Strategies: Deep Networks
What do you do for fun? Dance? Uh, dance, uh, I like to go, like camping. Uh, snowboarding, but I'm not good, but I like to go anyway. You like boarding. Yeah. I like to do anything. Like I, I'm up for anything. Really? Yeah. Are you open-minded about most everything? Not everything, but a lot of stuff- What is not everything [laugh] I don't know. Think of something, and I'll say if I do it or not. [laugh] Okay. [unintelligible]. Skydiving. I wouldn't do skydiving I don't think. Yeah I'm afraid of heights. F: Yeah, yeah, me too. M: [laugh] Are you afraid of heights? F: [laugh] Yeah [laugh]
Exemplar Strategies: Deep Networks
The Speed Date Study
- 991 4-minute dates: 3 events, each with ~20x20=400 dates, with some data loss.
- Participants: 110 graduate student volunteers in 2005, who participated in return for the chance to date.
- Speech: ~70 hours from shoulder sash recorders; high noise.
- Transcripts: ~800K words, hand-transcribed with turn boundary times.
- Surveys (pre-test surveys, event scorecards, post-test surveys): date perceptions and follow-up interest; general attitudes, preferences, demographics.
- Largest natural experiment with audio, text, and survey info.
Exemplar Strategies: Deep Networks
Strategic Bonding: how to get a desired mutual bond?
- Men, actor speech: vary pitch; avoid talking about work; talk about yourself more than usual.
- Women, actor speech: laugh; make appreciations; take short turns; talk about yourself and drinking.
Exemplar Strategies: Deep Networks
Shared schema on Scents: Culture in Objects
Culture is stored more/less ambiguously.
"We should try to approach material culture without reducing objects to instantiations of discourse or realizations of cognitive representations." (Mukerji 1997: 36)
Exemplar Strategies: Deep Networks
Shared schema on Scents: Data & Methods
Fragrances: While humans are actually quite good at smelling, they are not good at describing smells in intersubjectively agreed-upon ways (Barwich 2020). Smells are deeply symbolic and cultural (Sperber 1978), and perhaps the least intellectual of the senses (Gonzalez-Crussi 1989).
Slides courtesy of Craig Rawlings
Exemplar Strategies: Deep Networks
Shared schema on Scents
[Figures: Schema 1 (n=36); Schema 2 (n=46); Schema 3 (n=33)]
Exemplar Strategies: Deep Networks
Shared schema on Scents
H1: Controlling for interactions and objects, shared schemas predict initial interpersonal consensus in meanings.
Exemplar Strategies: Deep Networks
Video data on children's interactions
Research question: Can we use social network analysis to enhance our understanding of how sociality and social cohesion develop in preschool-aged children?
Specific aims:
1. Feasibility of video data collection
2. Coding scheme development
3. Social network analysis
Exemplar Strategies: Deep Networks
Video data on children's interactions
Data collection:
- Naturalistic video observations of classroom activities
- 4 microcameras
- 15-minute segments, 4x per week
Exemplar Strategies: Deep Networks
Video data on children's interactions
Coding scheme:
- Behaviors: conflict and cooperation
- Modifiers: self-initiated, other-initiated, or mutually initiated; physical or non-physical
Exemplar Strategies: Deep Networks
Video data on children's interactions
Social network analysis: dynamic network
- Conflict vs. cooperation
- Sex and age
[Video: yellow2_video1.mp4]
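One way to move from coded video events to a dynamic network is to bin events into time windows and count ties per window and behavior. The sketch below is a hypothetical standard-library Python illustration, not the study's code: the event fields (`time`, `a`, `b`, `behavior`), the child labels, and the 900-second window (one 15-minute segment) are all assumptions.

```python
from collections import defaultdict

def windowed_networks(events, window_seconds):
    """Group coded interaction events into time windows and count ties.

    Returns {(window_index, behavior): {(child_1, child_2): count}},
    with each pair stored in sorted order so direction is ignored."""
    nets = defaultdict(lambda: defaultdict(int))
    for event in events:
        window = int(event["time"] // window_seconds)
        pair = tuple(sorted((event["a"], event["b"])))
        nets[(window, event["behavior"])][pair] += 1
    return {key: dict(ties) for key, ties in nets.items()}

# Hypothetical coded events; `time` is seconds from the start of observation.
events = [
    {"time": 30, "a": "kid2", "b": "kid1", "behavior": "cooperation"},
    {"time": 200, "a": "kid1", "b": "kid2", "behavior": "cooperation"},
    {"time": 950, "a": "kid1", "b": "kid3", "behavior": "conflict"},
]
nets = windowed_networks(events, window_seconds=900)
```

Keeping conflict and cooperation as separate layers per window preserves the contrast the coding scheme was built to capture; the initiation and physical/non-physical modifiers could be added to the key in the same way.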
Exemplar Strategies: Deep Networks
Video data on children's interactions
Social network analysis
Exemplar Strategies: Deep Networks
Video data on children's interactions
What's next?
- Continue to create social networks to analyze in R.
- Patterns of relationships facilitate prosocial behavior.
- What does this mean for interventions to increase compassionate behavior early in life?