Overview of Large Scale Log Studies in HCI
This CHI 2011 course presentation covers the design and analysis of large-scale log studies: the benefits and drawbacks of using logs to understand user behavior, the insights logs can provide, the challenges involved, and strategies for generating logs for research.
Presentation Transcript
Design and Analysis of Large Scale Log Studies: A CHI 2011 Course (v11). Susan Dumais, Robin Jeffries, Daniel M. Russell, Diane Tang, Jaime Teevan. CHI Tutorial, May 2011.
Introduction. Daniel M. Russell, Google.
What Can We (HCI) Learn from Log Analysis? Logs are the traces of human behavior, seen through the lenses of whatever sensors we have. They capture actual behaviors, as opposed to recalled behavior or subjective impressions of behavior.
Benefits
- A portrait of real behavior, warts and all, and therefore a more complete, accurate picture of ALL behaviors, including the ones people don't want to talk about
- Large sample size / liberation from the tyranny of small N
- Coverage (long tail) and diversity
- A simple framework for comparative experiments
- Can see behaviors at a resolution / precision that was previously impossible
- Can inform more focused experiment design
Drawbacks
- Not annotated. (Contrast with a timestamped think-aloud record: "00:32 now I know... 00:35 you get a lot of weird things.. hold on... 00:38 Are Filipinos ready for gay flicks? 00:40 How does that have to do with what I just did...? 00:43 Ummm... 00:44 So that's where you can get surprised; you're like, where is this... how does this relate... umm")
- Not controlled
- No demographics
- Doesn't tell us the why
- Privacy concerns: AOL / Netflix / Enron / Facebook public data; medical data and other kinds of personally identifiable data
What Are Logs, for This Discussion? User behavior events over time; user activity, primarily on the web:
- Edit history
- Clickstream
- Queries
- Annotation / tagging
- Pageviews
- All other instrumentable events (mousetracks, menu events, ...)
- Web crawls (e.g., content changes), i.e., programmatic changes of content
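As a concrete (and hypothetical) illustration of the kinds of events listed above, a behavioral log record is typically a timestamped, typed event tied to a temporary user ID. The field names in this minimal sketch are assumptions for illustration, not from any particular system.

```python
# A minimal sketch of what one behavioral log event might look like.
# Field names are illustrative, not from any particular logging system.
import json
from datetime import datetime, timezone

event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when the event occurred
    "user_id": "142039",          # temporary ID (e.g., a cookie), not a person
    "event_type": "query",        # query, click, pageview, edit, ...
    "payload": {"query": "chi 2011", "results_shown": 10},
}

# Logs are often written as one JSON record per line ("JSONL").
print(json.dumps(event))
```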
How to Generate Logs
- Use existing logged data: explore sources in your community (e.g., proxy logs); work with a company (e.g., as an intern or visiting researcher); construct targeted questions
- Generate your own logs: focuses on questions of unique interest to you
- Construct community resources: shared software and tools, such as a client-side logger (e.g., the VIBE logger); shared data sets; a shared experimental platform to deploy experiments (and to attract visitors)
- Other ideas?
Interesting Sources of Log Data
- Anyone who runs a web service
- Proxy (or library) logs at your institution
- Publicly available social resources: Wikipedia (content, edit history), Twitter, Delicious, Flickr, Facebook public data?
- Others? GPS, virtual worlds, cell call logs
Other Kinds of Large Data Sets
- Mechanical Turk (may or may not be truly log-like)
- Other rater panels, particularly ones that generate behavioral logs
- Medical data sets
- Temporal records of many kinds
- Example: logs from web servers for your web site
- Example: an app that generates logs, a la the instrumented SketchUp application (Akers et al., 2009)
Audience Discussion: What kind of logs do you need to analyze? What kinds of logs does your work generate? Open discussion.
Overview
- Perspectives on log analysis: Understanding User Behavior (Teevan); Design and Analysis of Experiments (Tang & Jeffries); discussion on appropriate log study design (all)
- Practical considerations for log analysis: collection and storage (Dumais); data cleaning (Russell); discussion of log analysis and the HCI community (all)
Section 1: Understanding User Behavior. Jaime Teevan & Susan Dumais, Microsoft Research.
Kinds of User Data (built up over three slides into an observational vs. experimental view)
- User studies: controlled interpretation of behavior with detailed instrumentation. Observational: in-lab behavior observations. Experimental: controlled tasks, controlled systems, laboratory studies.
- User panels: in the wild, real-world tasks, probe for detail. Observational: ethnography, field studies, case reports. Experimental: diary studies, critical incident surveys.
- Log analysis: no explicit feedback but lots of implicit feedback. Observational: behavioral log analysis. Experimental: A/B testing, interleaved results.
- Goal of observational data: build an abstract picture of behavior. Goal of experimental data: decide if one approach is better than another.
Web Service Logs
- Example sources: search engine, commercial site
- Types of information: queries, clicks, edits; results, ads, products
- Example analysis: click entropy. Teevan, Dumais and Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. SIGIR 2008. [Figure: results for an ambiguous query that different users interpret as a recruiting firm, an academic field, or a government contractor]
Web Browser Logs
- Example sources: proxy, logging tool
- Types of information: URL visits, paths followed; content shown, settings
- Example analysis: revisitation. Adar, Teevan and Dumais. Large Scale Analysis of Web Revisitation Patterns. CHI 2008.
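A revisitation analysis like the one cited above ultimately rests on counting how often a (user, URL) pair recurs in a browser log. The sketch below, with made-up data, shows that core counting step; it is an illustration, not the method of Adar et al.

```python
# Rough sketch of a revisitation count: how often each (user, URL) pair
# recurs in a browser log. The tuples below stand in for real log data.
from collections import Counter

visits = [
    ("142039", "http://chi2011.org"),
    ("142039", "http://chi2011.org"),
    ("659327", "http://example.com/hotels"),
]

revisit_counts = Counter(visits)
for (user, url), n in revisit_counts.items():
    if n > 1:
        print(f"user {user} visited {url} {n} times")
```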
Web Browser Logs
- Example sources: proxy, logging tool
- Types of information: URL visits, paths followed; content shown, settings
- Example analysis: DiffIE. Teevan, Dumais and Liebling. A Longitudinal Study of How Highlighting Web Content Change Affects ... Interactions. CHI 2010.
Rich Client-Side Logs
- Example sources: client application, operating system
- Types of information: web client interactions, other client interactions
- Example analysis: Stuff I've Seen. Dumais et al. Stuff I've Seen: A System for Personal Information Retrieval and Re-use. SIGIR 2003.
Logs Can Be Rich and Varied
- Sources of log data: web service (search engine, commerce site); web browser (proxy, toolbar, browser plug-in); client application
- Types of information logged: interactions (queries, clicks, URL visits, system interactions); context (results, ads, web pages shown)
Using Log Data: What can we learn from log analysis? What can't we learn from log analysis? How can we supplement the logs?
Using Log Data: What can we learn from log analysis? (Now: about people's behavior. Later: experiments.) What can't we learn from log analysis? How can we supplement the logs?
Generalizing About Behavior. [Figure: a ladder of abstraction from observed button clicks and feature use, through information use and information needs (e.g., the query "chi 2011"), up to what people think and human behavior in general]
Generalizing Across Systems. [Figure: a ladder from logs of a particular run (Bing version 2.0, to build new features), to logs from a Web search engine (Bing use), to many Web search engines (Web search engine use, to build better systems), to many search verticals (search engine use), to browsers, search, and email together (information seeking, to build new tools)]
What We Can Learn from Query Logs
- Summary measures: query frequency (queries appear 3.97 times on average [Silverstein et al. 1999]); query length (2.35 terms on average [Jansen et al. 1998])
- Analysis of query intent: query types and topics (navigational, informational, transactional [Broder 2002])
- Temporal features: session length (sessions are 2.20 queries long on average [Silverstein et al. 1999]); common re-formulations [Lau and Horvitz 1999]
- Click behavior: queries that lead to clicks; relevant results for a query [Joachims 2002]
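As a sketch of how two of these summary measures are computed, the snippet below derives mean query length and mean session length from a toy, already-sessionized query log. The data and numbers are illustrative, not the published figures.

```python
# Two of the summary measures above, computed from a toy query log:
# mean terms per query and mean queries per session.
from statistics import mean

sessions = [
    ["chi 2011", "pan pacific hotel", "fairmont waterfront hotel"],
    ["restaurants vancouver", "vancouver bc restaurants"],
    ["uist 2011"],
]

queries = [q for session in sessions for q in session]
print("mean query length (terms):", mean(len(q.split()) for q in queries))
print("mean session length (queries):", mean(len(s) for s in sessions))
```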
[Example query log: a table of (Query, Time, User) rows]
Query | Time | User
chi 2011 | 10:41am 2/18/10 | 142039
pan pacific hotel | 10:44am 2/18/10 | 142039
fairmont waterfront hotel | 10:56am 2/18/10 | 142039
chi 2011 | 11:21am 2/18/10 | 659327
restaurants vancouver | 11:59am 2/18/10 | 318222
vancouver bc restaurants | 12:01pm 2/18/10 | 318222
uist conference | 12:17pm 2/18/10 | 318222
chi 2011 | 12:18pm 2/18/10 | 142039
daytrips in bc, canada | 1:30pm 2/18/10 | 554320
uist 2011 | 1:30pm 2/18/10 | 659327
chi program | 1:48pm 2/18/10 | 142039
chi2011.org | 2:32pm 2/18/10 | 435451
mark ackerman | 2:42pm 2/18/10 | 435451
fairmont waterfront hotel | 4:56pm 2/18/10 | 142039
chi 2011 | 5:02pm 2/18/10 | 142039
From such a log we can study query typology and query behavior, and put the analysis to use: ranking (e.g., precision), system design (e.g., caching), user interface (e.g., history), test set development, long term trends, and complementary research.
Partitioning the Data: language, location, time, user activity, individual, entry point, device, system variant [Baeza-Yates et al. 2007]
Partition by Time: periodicities; spikes; real-time data (new behavior, immediate feedback); individual behavior within a session and across sessions [Beitzel et al. 2004]
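A common way to get the within-session / across-sessions split is to segment a user's timestamped events with an inactivity timeout. The sketch below assumes a 30-minute gap, a conventional but arbitrary choice, and uses made-up timestamps.

```python
# Split one user's query stream into sessions using an inactivity timeout.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed timeout, not a universal rule

events = [  # (timestamp, query) for a single user, sorted by time
    (datetime(2010, 2, 18, 10, 41), "chi 2011"),
    (datetime(2010, 2, 18, 10, 44), "pan pacific hotel"),
    (datetime(2010, 2, 18, 12, 18), "chi 2011"),
]

sessions, current = [], [events[0]]
for prev, cur in zip(events, events[1:]):
    if cur[0] - prev[0] > SESSION_GAP:
        sessions.append(current)   # close the current session
        current = []
    current.append(cur)
sessions.append(current)

print(len(sessions), "sessions")   # -> 2 sessions for the toy data
```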
Partition by User [Teevan et al. 2007]: a temporary ID (e.g., cookie, IP address) has high coverage but high churn and does not necessarily map directly to users; a user account covers only a subset of users.
What Logs Cannot Tell Us: people's intent, people's success, people's experience, people's attention, people's beliefs about what's happening. Logs are limited to existing interactions, and behavior can mean many things.
Example: Click Entropy. Question: how ambiguous is a query? Approach: look at variation in clicks [Teevan et al. 2008]. Click entropy is low if there is no variation ("human computer interaction") and high if there is lots of variation ("hci"). [Figure: results for an ambiguous query that different users interpret as a recruiting firm, an academic field, or a government contractor]
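Concretely, click entropy can be computed as the Shannon entropy of the distribution of clicks over results for a query. This small sketch uses made-up click counts and shows only the calculation, not the exact procedure of Teevan et al.

```python
# Click entropy: Shannon entropy of the distribution of results clicked for a
# query. Low when most users click the same result, higher when clicks spread.
from math import log2

def click_entropy(click_counts):
    total = sum(click_counts)
    probs = [c / total for c in click_counts if c > 0]
    return -sum(p * log2(p) for p in probs)

# e.g. an unambiguous navigational query vs. a more ambiguous one (made-up counts)
print(click_entropy([98, 1, 1]))    # ~0.16 bits: almost all clicks on one URL
print(click_entropy([40, 30, 30]))  # ~1.57 bits: clicks spread over several URLs
```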
Which Has Lower Variation in Clicks?
- www.usajobs.gov vs. federal government jobs
- find phone number vs. msn live search
- singapore pools vs. singaporepools.com
Results change: click entropy = 1.5 vs. 2.0; result entropy = 5.7 vs. 10.7
- tiffany vs. tiffany's
- nytimes vs. connecticut newspapers
Result quality varies: click entropy = 2.5 vs. 1.0; click position = 2.6 vs. 1.6
- campbells soup recipes vs. vegetable soup recipe
- soccer rules vs. hockey equipment
Task affects the number of clicks: click entropy = 1.7 vs. 2.2; clicks/user = 1.1 vs. 2.1
Dealing with Log Limitations
- Look at the data (e.g., the raw query log shown earlier)
- Clean the data (e.g., the raw log may contain rapidly typed partial queries such as "fair", "fairmont", "fairmont water", "fairmont waterfront" seconds before "fairmont waterfront hotel")
- Supplement the data: enhance log data by collecting associated information (e.g., what's shown); instrumented panels (critical incident, by individual)
- Converging methods: usability studies, eye tracking, surveys, field studies, diary studies
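One small cleaning step implied by the partial-query example above is collapsing rapidly typed prefix queries into the final query they lead up to. The sketch below is an illustration under assumptions (the 60-second threshold and the data are invented), not the course's prescribed procedure.

```python
# Collapse rapid "prefix" queries (partial typing) into the final query.
from datetime import datetime, timedelta

MAX_GAP = timedelta(seconds=60)  # assumed threshold for "rapid" retyping

queries = [  # (timestamp, query) for one user, sorted by time
    (datetime(2010, 2, 18, 10, 55, 10), "fair"),
    (datetime(2010, 2, 18, 10, 55, 40), "fairmont"),
    (datetime(2010, 2, 18, 10, 56, 5), "fairmont waterfront hotel"),
    (datetime(2010, 2, 18, 11, 21, 0), "chi 2011"),
]

cleaned = []
for ts, q in queries:
    if cleaned:
        prev_ts, prev_q = cleaned[-1]
        if q.startswith(prev_q) and ts - prev_ts <= MAX_GAP:
            cleaned[-1] = (ts, q)   # replace the prefix with the fuller query
            continue
    cleaned.append((ts, q))

print([q for _, q in cleaned])      # -> ['fairmont waterfront hotel', 'chi 2011']
```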
Example: Re-Finding Intent
- Large-scale log analysis of re-finding [Tyler and Teevan 2010]: Do people know they are re-finding? Do they mean to re-find the result they do? Why are they returning to the result?
- Small-scale critical incident user study: a browser plug-in logs queries and clicks and pops up a survey on repeat clicks and on 1/8 of new clicks
- Insight into intent plus a rich, real-world picture: re-finding is often targeted towards a particular URL; it is not targeted when the query changes or when the re-click happens in the same session
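The survey trigger in such a critical incident study hinges on detecting repeat clicks, i.e., a user clicking a URL they have clicked before. A minimal sketch of that detection with invented data might look like this; it is not the actual Tyler and Teevan plug-in.

```python
# Flag "repeat clicks": a user clicking a URL they have clicked before.
seen = set()  # (user, url) pairs already clicked

clicks = [
    ("142039", "chi 2011", "chi2011.org"),
    ("142039", "chi program", "chi2011.org"),  # same URL again -> re-finding?
]

for user, query, url in clicks:
    if (user, url) in seen:
        print(f"repeat click: user {user} returned to {url} via '{query}'")
    seen.add((user, url))
```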
Summary: Understanding User Behavior
- Log data gives a rich picture of real-world behavior
- There are many potential sources of log data
- Partition the data to view interesting slices
- Recognize what the data can and cannot tell you
- Supplement logs with complementary data
Section 2: Design and Analysis of Experiments. Robin Jeffries & Diane Tang.
What Do We Mean by an Experiment?
- A change to the user experience, directly or indirectly
- Have a hypothesis; collect metrics to verify or nullify the hypothesis (measurability is key!)
- Running on a live (web) app, with data coming in from real users doing their own tasks
- Multiple arms, each providing a different experience: at minimum, the new experience and the original (control); can be an entire space of parameters with multiple values for each parameter
Example Changes and Hypotheses
- Visible changes. Underlines: if I remove underlines, the page will be cleaner and easier to parse, and users will find what they need faster. Left nav: by adding links to subpages, users will be able to better navigate the site. Adding a new feature: usage of this feature is better than what was previously shown in its place.
- Less visible changes. Ranking: if I change the order of the (search) results, users will find what they are looking for faster (higher up on the page).
Why Do Experiments?
- To test your hypothesis; in reality (or ultimately), to gather data to make an informed, data-driven decision
- Little changes can have big impacts; you won't know until you measure
- With big changes, who knows what will happen; your intuition is not always correct
- Law of unintended side effects: what you wanted to impact gets better, but something else gets worse. You want to know that.
What Can We Learn from Experiments?
- How (standard) metrics change
- Whether and how often users interact with a new feature
- How users interact with a new feature
- Whether behavior changes over time (learning / habituation)
- But remember: you are following a cookie, not a person
What Can't We Learn from Experiments?
- WHY: figuring out why people do things requires more direct user input
- Tracking a user over time: without special tracking software we only have a cookie, and cookie != user
- Measuring satisfaction or feelings directly: only indirect measures (e.g., how often users return)
- Did users even notice the change? Did users tell their friends about feature X? Did users get a bad impression of the product? Did users find the product enjoyable to use? Is the product lacking an important feature? Would something we didn't test have done better than what we did test? Is the user confused, and why?
Section Outline
- Background
- Experiment design: What am I testing and what am I measuring?
- Experiment sizing: How many observations do I need?
- Running experiments: What do I need to do?
- Analyzing experiments: I've got numbers; what do they mean?
Basic Experiment Definitions
- An incoming request R has a cookie C and attributes A: language, country, browser, etc.
- Diversion: is a request in the experiment? The unit of diversion is the cookie vs. the request, and may also depend on attributes.
- Triggering: which subset of diverted requests does the experiment actually change (impact)? E.g., a weather onebox vs. page chrome: for page chrome, triggering == diversion; for the weather onebox, triggering << diversion.
- On triggered requests, the experiment changes what is served to the user.
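Cookie-based diversion is often implemented by hashing a stable cookie ID, salted with an experiment name, into a bucket and mapping buckets to arms. The sketch below is a generic illustration under those assumptions, not any production diversion system.

```python
# Hash a cookie ID into a bucket and assign the request to an experiment arm.
import hashlib

def assign_arm(cookie_id, experiment_name, treatment_fraction=0.5):
    # Salting with the experiment name lets different experiments divert independently.
    key = f"{experiment_name}:{cookie_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    return "treatment" if bucket < treatment_fraction * 1000 else "control"

print(assign_arm("142039", "remove_underlines"))  # same cookie -> same arm every time
```

Because the hash is deterministic, the same cookie lands in the same arm for the life of the experiment, which is what makes cookie-level (rather than request-level) diversion possible.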
Experiment Design
- What decision do you want to make?
- Three interlinked questions: What do you want to test (what space will you explore, what factors will you vary, and what hypotheses do you have about those changes)? What metrics will you use to test these hypotheses? How will you make your decision?
- Every outcome should lead to a decision.
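Once metrics are chosen, a hypothesis about a binary metric such as click-through rate is often checked by comparing the control and treatment arms with a two-proportion z-test. The counts below are invented for illustration, and this is only one of several reasonable analyses.

```python
# Compare click-through rate between two experiment arms with a z-test.
from math import sqrt

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))      # standard error of the difference
    return (p_b - p_a) / se                           # z statistic

z = two_proportion_z(clicks_a=4_800, n_a=100_000, clicks_b=5_150, n_b=100_000)
print(f"z = {z:.2f}")  # |z| > 1.96 -> significant at the 5% level (two-sided)
```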