Project Topics and Data Analysis Overview
Introduction to project topics selection, data analysis objectives, and assignment requirements for QM222 Fall 2017 Section A1. Topics include descriptive statistics, project data sets, defining research questions, and choosing variables to analyze relationships. Students are tasked with finding topics, defining questions, identifying relevant data sets, and considering potential stakeholders. Excel skills training and TA office hours schedule provided.
Presentation Transcript

QM222 Class 2 Section A1: Projects, Data and Datasets, some Descriptive Statistics
Choose your seat. Sit at approximately this location in KCB as well. Name cards?
QM222 Fall 2017 Section A1

Some to-dos
- Sign up for an appointment (see the signup sheet; Sept. 20 is the last day): https://docs.google.com/a/bu.edu/spreadsheets/d/188IrHsjGhE758eIQ1Jcru1WGKFmJJYrmD1ppcdcMhY/edit?usp=sharing
- Excel Checklist: do by the end of next week (Sept. 15). Go to our TA's office hours, or any other ones, and do/learn an Excel checklist about using Excel for formulas. Checklist at: http://sites.bu.edu/qm222projectcourse, under QM222 Project Course, General.
- QM222 TA office hours are held in Room 206A (off the Undergrad Lounge).
[Office-hours schedule table: hourly slots by day of the week. The grid layout was lost in extraction, so slot assignments cannot be recovered. Names appearing in the table: Roland, Lexi Klein, Shogo, James, Cristiane, Sanya Seth, Nick Lord, Rachel Mann, Ata, Maciej.]

Today's Objectives
- Introduction to Project
- Data sets and data characteristics
- Describing a single variable (QM221 review), part 1:
  - Measuring a variable's middle
  - Measuring a variable's spread
  - How to calculate them in Excel

Choosing your project topic
- Projects need to be a question about how variables relate to each other.
- Projects should have a topic about something you are interested in.
- You need to be able to find a data set to answer your question.

Assignment 1 (due Wednesday Sept. 20)
Find Two Possible Topics
- What specific question or questions will your project address?
- What is the data set you plan to use?
- You are going to measure a relationship between two variables. What are these 2 variables? (Note: a variable can be categorical, such as gender.)
- What company, governmental body or other organization would be interested in knowing the answer to this question?

Some of previous years' questions/topics:
- How does working long days affect people's happiness with their marriage?
- Predicting flow of the Elbow River at Bragg Creek during the spring/summer
- What drives dividends?
- Financial Ratios and Profitability in the Apparel Industry
- Impact of Advertising on Sales of (student's own product he sold to students)
- How MPG affects sales prices of cars
- What factors affect whether pregnant 18-26 yr. old women abort a pregnancy?
- Lots of sports topics: how injuries affect a basketball player's later performance; the impact of offensive coordinators on team success in the NFL; what's more important to winning golf tournaments, putting or long drives; etc.
- The effect of earnings on depression; of past drug use on later earnings; etc.
- What aspects of countries predict greater future internet growth?
- What past payment patterns predict credit card defaults?
- Are employees with certain college majors more likely than others to leave firms?
- What recent trends are there in the color palette of movies, and do these color palettes lead to more successful movies?
- Estimating the impact of an advertising catalog on demand for a fashion store chain (family business)

For the question you choose, there must be a data set that allows you to answer it.
- In fact, sometimes people choose topics by looking at data sets that are available.
- What do data sets look like? Datasets are rectangular tables/spreadsheets of data.

What data sets look like (Movie data from IMDB, metacritic)
Title | Year | Metascore | Budget | Studio | LifetimeGross | LifetimeTheater | Opening | OpeningTheaters
Ratatouille | 2007 | 96 | 150000000 | BV | 206445654 | 3940 | 47027395 | 3940
The Social Network | 2010 | 95 | 40000000 | Sony | 96962694 | 2921 | 22445653 | 2771
Spirited Away | 2002 | 94 | 19000000 | BV | 10055859 | 714 | 449839 | 26
WALL-E | 2008 | 94 | 180000000 | BV | 223808164 | 3992 | 63087526 | 3992
Pulp Fiction | 1994 | 94 | 8000000 | Mira. | 107928762 | 1494 | 9311882 | 1338
Sideways | 2004 | 94 | 17000000 | FoxS | 71503593 | 1786 | 207042 | 4
The Hurt Locker | 2009 | 94 | 15000000 | Sum. | 17017811 | 535 | 145352 | 4
The Lord of the Rings: The Return of the King | 2003 | 94 | 94000000 | NL | 377845905 | 3703 | 72629713 | 3703
Crouching Tiger, Hidden Dragon | 2000 | 93 | 15000000 | SPC | 128078872 | 2027 | 663205 | 16
Schindler's List | 1993 | 93 | 25000000 | Uni. | 96065768 | 1389 | 656636 | 25
Toy Story 3 | 2010 | 92 | 200000000 | BV | 415004880 | 4028 | 110307189 | 4028
Toy Story | 1995 | 92 | 30000000 | BV | 191796233 | 2574 | 29140617 | 2457
The Lord of the Rings: The Fellowship of the Ring | 2001 | 92 | 109000000 | NL | 315544750 | 3381 | 47211490 | 3359
Waltz with Bashir | 2008 | 91 | 2000000 | SPC | 2283849 | 208 | 50021 | 5
Secrets & Lies | 1996 | 91 | 4500000 | Oct. | 13417292 | 296 | 60813 | 4
Do the Right Thing | 1989 | 91 | 6000000 | Uni. | 27545445 | 534 | 3563535 | 353
Star Wars Ep. IV: A New Hope | 1977 | 91 | 11000000 | Fox | 460998007 | 1750 | 1554475 | 43
Raiders of the Lost Ark | 1981 | 90 | 20000000 | Par. | 248159971 | 1078 | 8305823 | 1078
The Incredibles | 2004 | 90 | 92000000 | BV | 261441092 | 3933 | 70467623 | 3933
Finding Nemo | 2003 | 90 | 94000000 | BV | 380843261 | 3425 | 70251710 | 3374
Topsy-Turvy (Gilbert & Sullivan) | 1999 | 90 | 20000000 | USA | 6208548 | 224 | 31387 | 2
Being John Malkovich | 1999 | 90 | 13000000 | USA | 22863596 | 630 | 637721 | 25
American Splendor | 2003 | 90 | 2000000 | FL | 6010990 | 272 | 159705 | 6
United 93 | 2006 | 90 | 18000000 | Uni. | 31483450 | 1871 | 11478360 | 1795
L.A. Confidential | 1997 | 90 | 35000000 | WB | 64616940 | 1625 | 5211198 | 769
Before Sunset | 2004 | 90 | 2000000 | WIP | 5820649 | 204 | 219425 | 20
Saving Private Ryan | 1998 | 90 | 65000000 | DW | 216540909 | 2807 | 30576104 | 2463
The Truman Show | 1998 | 90 | 60000000 | Par. | 125618201 | 2911 | 31542121 | 2315

How Data-sets are organized
- Each row is an observation: one occurrence of the thing you are examining. In the movie data set on the previous slide, the observation is one movie.
- n is the number of observations.
- How many observations can you use? Billions. You are limited only by your computer's memory!
- How many observations do you need? Thousands if each observation is an individual person; far fewer if each observation is a country, team, company etc. I'd advise 100 if possible.
- The more observations, the more likely you'll find definitive results.

How Data-sets are organized
- Each column is a variable: something you know about the observation. In the movie data, some variables are year, metascore and budget.
- Notation: X is a variable; X_i is the value of X for observation i.
- How many variables will you need? Of course, you will need the 2 (or more) variables whose relationship you are studying. But as we'll learn, it will be good to collect a lot more variables that might affect any of these 2+ key variables.
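
The row-and-column layout can be sketched in a few lines of Python. This is only an illustration: the three rows are copied from the movie data, while the names `movies`, `n`, `metascore` and `x_1` are made up for the example and are not part of the course materials.

```python
# Each dict is one observation (one movie); each key is a variable.
# Values are copied from the first three movies in the movie data.
movies = [
    {"Title": "Ratatouille", "Year": 2007, "Metascore": 96, "Budget": 150_000_000},
    {"Title": "The Social Network", "Year": 2010, "Metascore": 95, "Budget": 40_000_000},
    {"Title": "Spirited Away", "Year": 2002, "Metascore": 94, "Budget": 19_000_000},
]

n = len(movies)  # n is the number of observations (rows)

# The variable X = Metascore is one column of the table.
metascore = [row["Metascore"] for row in movies]

# X_i is the value of X for observation i. The slide notation counts from 1,
# Python lists count from 0, so X_1 is metascore[0].
x_1 = metascore[0]

print(n, metascore, x_1)
```

In Excel the same idea holds: one row per movie, one column per variable.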

Types of Variables
- Numerical (also known as Quantitative): these variables take on a number and represent some kind of measurement.
- Categorical: puts an observation into a category, but is not easily represented by a number.
- In the movie data, which variables are numerical? Which variables are categorical?
- We will learn how to apply statistical tools to categorical data.

These data are on starting salaries of Questrom recent UG graduates
Domestic/International | Main Concentration | Year | Salary
D | ACC | 2017 | $80,000
D | ACC | 2017 | $75,000
D | ACC | 2017 | $70,000
D | ACC | 2017 | $70,000
D | ACC | 2017 | $61,000
D | ACC | 2017 | $60,000
D | ACC | 2017 | $60,000
D | ACC | 2017 | $60,000
D | ACC | 2017 | $60,000
D | ACC | 2016 | $75,000
D | ACC | 2016 | $70,000
D | ACC | 2016 | $70,000
D | ACC | 2016 | $61,000
D | ACC | 2016 | $60,000
D | ACC | 2016 | $59,000
D | ACC | 2016 | $58,000
D | ACC | 2016 | $57,000
Q1: What does an observation represent in this data set?
Q2: Which variables are numerical, which are categorical?
A1: a person, a graduate
A2: numerical: Year, Salary; categorical: Domestic/Intl, Main Concentration
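
A rough way to check the answer to Q2 is to ask, for each column, whether every value is a number. The sketch below does this for three rows shaped like the starting-salary data; the helper `variable_type` is hypothetical and deliberately naive.

```python
# Three observations shaped like the starting-salary data on this slide.
graduates = [
    {"Domestic/International": "D", "Main Concentration": "ACC", "Year": 2017, "Salary": 80000},
    {"Domestic/International": "D", "Main Concentration": "ACC", "Year": 2017, "Salary": 75000},
    {"Domestic/International": "D", "Main Concentration": "ACC", "Year": 2016, "Salary": 57000},
]


def variable_type(values):
    """Return 'numerical' if every value is a number, otherwise 'categorical'."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if numeric else "categorical"


# Classify each column (variable) by looking at its values across all rows.
variable_types = {
    name: variable_type([row[name] for row in graduates])
    for name in graduates[0]
}

for name, kind in variable_types.items():
    print(f"{name}: {kind}")
```

This check is only a first pass. A categorical variable stored as numbers (for example 1 and 0) would be labeled numerical here, so the final call still depends on what the variable means.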

Data sets can be: Cross sectional v. Time Series v. Panel Data Sets
- Cross section: at one point of time; each observation is a different person, company etc.
- Time series: each observation is a different point of time.
- Cross section-time series: each observation is a different company etc. at a specific point of time. Time here is a variable like all others.
- Panel data (longitudinal data): the same people, companies etc. are observed at different points of time; each observation is a specific company etc. at a specific point of time.

(The same Questrom starting-salary data as on the earlier slide: 17 recent graduates, all D and ACC, from 2016 and 2017.)
Q: What kind of dataset is this?
A. Cross-section
B. Time-series
C. Cross section-Time series
D. Panel/Longitudinal
A: C. Cross section-Time series

These are data on a score 0-100 for whether lots of people googled The Big Bang Theory that day (Google Trends)
Day (weekday) | Day (date) | The Big Bang Theory: (US)
Friday | 1/1/2016 | 55
Saturday | 1/2/2016 | 66
Sunday | 1/3/2016 | 57
Monday | 1/4/2016 | 50
Tuesday | 1/5/2016 | 55
Wednesday | 1/6/2016 | 68
Thursday | 1/7/2016 | 92
Friday | 1/8/2016 | 64
Saturday | 1/9/2016 | 56
Sunday | 1/10/2016 | 57
Monday | 1/11/2016 | 54
Tuesday | 1/12/2016 | 51
Wednesday | 1/13/2016 | 50
Thursday | 1/14/2016 | 73
Friday | 1/15/2016 | 57
Saturday | 1/16/2016 | 54
Sunday | 1/17/2016 | 49
Monday | 1/18/2016 | 42
Tuesday | 1/19/2016 | 48
Wednesday | 1/20/2016 | 50
Thursday | 1/21/2016 | 76
Friday | 1/22/2016 | 57
Saturday | 1/23/2016 | 65
Sunday | 1/24/2016 | 39
Monday | 1/25/2016 | 47
Tuesday | 1/26/2016 | 53
Q1: Which variables are numerical, which are categorical?
Q2: What kind of dataset is this?
A. Cross-section
B. Time-series
C. Cross section-Time series
D. Panel/Longitudinal
A1: Numerical: Score. Categorical: Day (weekday). Day (date): hard to say; we would make it into a numerical one.
A2: B. Time-series

Data on individual STEM PhDs
Name | Year | Fulbright? | PhD Year | Region | location log(GDP) | Female
ABARCA ARENAS, LUIS GERARDO | 2000 | 1 | 2000 | LatinAmer | 9.1302 | 0
ABARCA ARENAS, LUIS GERARDO | 2001 | 1 | 2000 | LatinAmer | 9.1429 | 0
ABARCA ARENAS, LUIS GERARDO | 2002 | 1 | 2000 | LatinAmer | 9.1589 | 0
ABARCA ARENAS, LUIS GERARDO | 2003 | 1 | 2000 | LatinAmer | 9.1787 | 0
ABARCA ARENAS, LUIS GERARDO | 2004 | 1 | 2000 | LatinAmer | 9.2325 | 0
ABARCA ARENAS, LUIS GERARDO | 2005 | 1 | 2000 | LatinAmer | 9.3897 | 0
ABARCA ARENAS, LUIS GERARDO | 2006 | 1 | 2000 | LatinAmer | 9.4667 | 0
ABARCA ARENAS, LUIS GERARDO | 2007 | 1 | 2000 | LatinAmer | 9.5198 | 0
ABEBE, TILAHUN | 2001 | 1 | 2001 | ME/Africa | 6.0818 | 0
ABEBE, TILAHUN | 2002 | 1 | 2001 | ME/Africa | 6.0607 | 0
ABEBE, TILAHUN | 2003 | 1 | 2001 | ME/Africa | 6.0121 | 0
ABEBE, TILAHUN | 2004 | 1 | 2001 | ME/Africa | 6.0589 | 0
ABEBE, TILAHUN | 2005 | 1 | 2001 | ME/Africa | 6.2343 | 0
ABEBE, TILAHUN | 2006 | 1 | 2001 | ME/Africa | 6.3581 | 0
ABEBE, TILAHUN | 2007 | 1 | 2001 | ME/Africa | 6.4982 | 0
ADIDHARMA, HERTANTO | 1999 | 0 | 1999 | Asia | 7.8368 | 0
ADIDHARMA, HERTANTO | 2000 | 0 | 1999 | Asia | 7.9047 | 0
ADIDHARMA, HERTANTO | 2001 | 0 | 1999 | Asia | 7.9383 | 0
ADIDHARMA, HERTANTO | 2002 | 0 | 1999 | Asia | 7.9472 | 0
ADIDHARMA, HERTANTO | 2003 | 0 | 1999 | Asia | 8.0212 | 0
ADIDHARMA, HERTANTO | 2004 | 0 | 1999 | Asia | 8.0659 | 0
ADIDHARMA, HERTANTO | 2005 | 0 | 1999 | Asia | 8.1452 | 0
ADIDHARMA, HERTANTO | 2006 | 0 | 1999 | Asia | 8.2113 | 0
ADIDHARMA, HERTANTO | 2007 | 0 | 1999 | Asia | 8.2554 | 0
ADOMAVICIUS, GEDIMINAS | 2002 | 1 | 2002 | Europe | 9.0913 | 1
ADOMAVICIUS, GEDIMINAS | 2003 | 1 | 2002 | Europe | 9.2234 | 1
ADOMAVICIUS, GEDIMINAS | 2004 | 1 | 2002 | Europe | 9.3619 | 1
ADOMAVICIUS, GEDIMINAS | 2005 | 1 | 2002 | Europe | 9.4910 | 1
ADOMAVICIUS, GEDIMINAS | 2006 | 1 | 2002 | Europe | 9.5882 | 1
ADOMAVICIUS, GEDIMINAS | 2007 | 1 | 2002 | Europe | 9.7197 | 1
ADRIANSYAH, A | 2000 | 0 | 2000 | Asia | 7.9047 | 0
ADRIANSYAH, A | 2001 | 0 | 2000 | Asia | 7.9383 | 0
ADRIANSYAH, A | 2002 | 0 | 2000 | Asia | 7.9472 | 0
ADRIANSYAH, A | 2003 | 0 | 2000 | Asia | 8.0212 | 0
ADRIANSYAH, A | 2004 | 0 | 2000 | Asia | 8.0659 | 0
Q: What kind of dataset is this?
A. Cross-section
B. Time-series
C. Cross section-Time series
D. Panel/Longitudinal
A: D. Panel/Longitudinal
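
The three quiz answers follow one rule of thumb: look at which unit (person, company, show) and which time period each observation belongs to. Below is a minimal sketch of that rule, assuming each observation can be reduced to a (unit, period) pair. The function `dataset_kind` and the graduate labels `grad1` to `grad3` are hypothetical; the salary data contain no names.

```python
def dataset_kind(observations):
    """Classify a list of (unit, period) pairs into the four slide categories."""
    units = [unit for unit, _ in observations]
    periods = {period for _, period in observations}

    if len(periods) == 1:
        return "cross-section"          # everything observed at one point of time
    if len(set(units)) == 1:
        return "time series"            # one unit followed over time
    # Several units and several periods: it is a panel only if the same
    # units show up again at different points of time.
    same_units_repeat = len(set(units)) < len(units)
    return "panel/longitudinal" if same_units_repeat else "cross section-time series"


# Starting salaries: different graduates, two graduation years.
salaries = [("grad1", 2017), ("grad2", 2017), ("grad3", 2016)]

# Google Trends: one show, one score per day.
trends = [("The Big Bang Theory (US)", "1/1/2016"),
          ("The Big Bang Theory (US)", "1/2/2016"),
          ("The Big Bang Theory (US)", "1/3/2016")]

# STEM PhDs: the same people appear in several years.
phds = [("ABEBE, TILAHUN", 2001), ("ABEBE, TILAHUN", 2002),
        ("ADRIANSYAH, A", 2000), ("ADRIANSYAH, A", 2001)]

for name, data in [("salaries", salaries), ("trends", trends), ("phds", phds)]:
    print(name, "->", dataset_kind(data))
```

Real data sets can be messier than this rule allows (for example, a panel where some people drop out), so treat the function as a way to organize your thinking rather than a definitive test.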

Finding Datasets
- Of course, you can just google the names of the two things you will relate plus the word "data", and see what you come up with.
- But I have given you lots of suggestions of where you might look for data sets. Go to: http://sites.bu.edu/qm222projectcourse, Data Sets. Let's look at these suggestions.
- For any data set you decide to use, the website will give you careful instructions on how to choose the variables that you want and download the data.
- You will also need the codebook for that dataset, which tells you exactly how each numerical variable is measured and its units. For categorical variables, categories are often given numbers; you need to know what 1 represents, what 2 represents etc.
- Let's go to one of these sites.
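
To see why the codebook matters, here is a small sketch of decoding numbered categories. Everything in it is hypothetical: the variable names, the code-to-label mappings and the sample row are made up for illustration, and a real codebook must be read from the data provider.

```python
# Hypothetical codebook: which label each numeric code stands for.
# These mappings are assumptions for the example, not taken from a real codebook.
codebook = {
    "Fulbright?": {0: "no", 1: "yes"},
    "Female": {0: "male", 1: "female"},
}


def decode(row, codebook):
    """Replace coded values with their labels; leave other variables unchanged."""
    return {
        name: codebook.get(name, {}).get(value, value)
        for name, value in row.items()
    }


# One made-up observation with two coded categorical variables.
raw_row = {"Year": 2003, "Fulbright?": 1, "Region": "Asia", "Female": 0}
decoded_row = decode(raw_row, codebook)

print(decoded_row)
```

Variables without a codebook entry (Year, Region) pass through untouched, and so does any code the codebook does not list.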

Measuring the middle
- Mean: $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$, where N is the number of observations. Add up all the values and divide by the number of observations.
- Median: half the observations are greater than the median, half the observations are smaller than the median.
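
As a quick check of both definitions, the sketch below computes the mean and median of the 17 starting salaries from the Questrom slide, once by hand and once with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

# The 17 starting salaries from the Questrom data (2017 rows, then 2016 rows).
salaries = [80000, 75000, 70000, 70000, 61000, 60000, 60000, 60000, 60000,
            75000, 70000, 70000, 61000, 60000, 59000, 58000, 57000]

# Mean by hand: add up all the values and divide by the number of observations.
by_hand_mean = sum(salaries) / len(salaries)

# Median by hand: sort, then take the middle value.
# (With an odd number of observations there is exactly one middle value.)
ordered = sorted(salaries)
by_hand_median = ordered[len(ordered) // 2]

print(round(by_hand_mean, 2), round(mean(salaries), 2))
print(by_hand_median, median(salaries))
```

The mean comes out above the median here because a few high salaries pull the average up while leaving the middle observation unchanged.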

Measures of the Spread
- Range: Max - Min
  - Pros: Simple
  - Cons: Can be highly affected by ONE unusual observation.
- Standard deviation: measures how far the data are spread out around the mean.
  $\text{Std. Dev.} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
  That is, (Std. Dev.)^2 is the average of the squared deviations from the average.
  - Pros: A single measure based on all observations
  - Cons: Not as intuitive to some people.
- Pairs of percentiles, for instance the 25th and 75th percentiles
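
A short sketch of the range and the standard deviation, using the first five Google Trends scores from the earlier slide (55, 66, 57, 50, 55). It computes the standard deviation two ways: dividing the summed squared deviations by n, and dividing by n - 1, which is the sample version that `statistics.stdev` (and Excel's =stdev.s) reports. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

scores = [55, 66, 57, 50, 55]  # first five daily scores in the Google Trends data

# Range: max minus min.
data_range = max(scores) - min(scores)

# Standard deviation from the definition: deviations from the mean, squared.
x_bar = mean(scores)
squared_devs = [(x - x_bar) ** 2 for x in scores]
sd_divide_by_n = sqrt(sum(squared_devs) / len(scores))        # same as pstdev
sd_sample = sqrt(sum(squared_devs) / (len(scores) - 1))       # same as stdev

# One unusual observation (92 is the highest score in the data) changes the
# range a lot, which is the weakness of the range noted on the slide.
with_outlier = scores + [92]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(data_range, range_with_outlier)
print(round(sd_divide_by_n, 3), round(pstdev(scores), 3))
print(round(sd_sample, 3), round(stdev(scores), 3))
```

With many observations the two standard deviations are nearly identical; the difference only matters in small samples like this one.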

Checking your understanding
Q: What share of observations are between the 25th and 75th percentiles?
Answer: 50%

Getting these statistics in Excel (for a variable with data in cells a2:a64)
- Mean: =average(a2:a64)
- Median: =median(a2:a64)
- Range: =max(a2:a64)-min(a2:a64)
- Standard deviation: =stdev(a2:a64) or =stdev.s(a2:a64); both are for a sample
- 25th and 75th percentiles: =percentile(a2:a64,0.25) gives the value at the 25th percentile of the data set; =percentile(a2:a64,0.75) gives the value at the 75th percentile of the data set.
Do the In-Class exercise (also posted on sites.bu.edu/qm222projectcourse, Other Materials).
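
For anyone who wants to cross-check the Excel percentile results outside Excel, here is a sketch in Python using the 17 starting salaries from the Questrom slide. It assumes that `statistics.quantiles` with `method="inclusive"` interpolates the same way as Excel's =percentile, which to my understanding is the case; compare against your own spreadsheet before relying on it.

```python
from statistics import quantiles

# The 17 starting salaries from the Questrom data.
salaries = [80000, 75000, 70000, 70000, 61000, 60000, 60000, 60000, 60000,
            75000, 70000, 70000, 61000, 60000, 59000, 58000, 57000]

# n=4 cuts the data into quarters and returns the three cut points:
# the 25th percentile, the 50th percentile (median) and the 75th percentile.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")

print("25th percentile:", q1)
print("median:", q2)
print("75th percentile:", q3)
```

The pair (25th percentile, 75th percentile) is the spread measure mentioned earlier: it describes where the middle of the data lies without being moved by one extreme salary.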

Today we:
- Got suggestions on choosing a topic and finding a dataset
- Learned about the characteristics of a dataset
- Reviewed some descriptive statistics and how to use Excel to derive them.

Some to-dos
- Sign up for an appointment (see the signup sheet; Sept. 20 is the last day): https://docs.google.com/a/bu.edu/spreadsheets/d/188IrHsjGhE758eIQ1Jcru1WGKFmJJYrmD1ppcdcMhY/edit?usp=sharing
- Excel Checklist: do by the end of next week (Sept. 15). Go to our TA's office hours, or any other ones, and do/learn an Excel checklist about using Excel for formulas: http://sites.bu.edu/qm222projectcourse, under QM222 Project Course, General.
- Maybe sign up for the URO?