Introduction to Statistics for Future Data Scientists
This content delves into the foundations of statistics for aspiring data scientists. It covers various topics such as statistical thinking, multivariate data analysis, real-world data scenarios, and the application of technology in statistical exploration. Examples include investigating the value of fireplaces in homes, examining childhood growth patterns, and discussing issues like bias and confounding variables. The goal is to equip future data scientists with a strong statistical foundation to navigate complex data environments effectively.
Presentation Transcript
Intro Stats for Future Data Scientists
  Brianna Heggeseth and Dick De Veaux
  Williams College
Motivation
  USCOTS 2015: What is wrong with Stat 101? ... What, How, and When
  Teaching Future Data Scientists
    Already learning statistical tools in other courses
    What do we offer?
  Revised GAISE Report (Everson, Mocko, et al.)
    Statistical thinking ("think with data") in multivariate situations
    Real data with context and purpose
    Technology to explore concepts and analyze data (says Dick)
What we taught
  Review EDA and Data Collection
    Multivariate datasets: Fireplace worth, Crime in SF
    Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.
How much is a fireplace worth?
  [Figure: home price ($) by whether the home has a fireplace. Red: fireplace, blue: no fireplace.]
  Data and Analysis at ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home
Helping the SFPD
  Data: http://data.sfgov.org
  Code for Shiny Apps available at: https://github.com/bcheggeseth/ShinyApps/
What we taught (continued)
  Modeling (Explaining Variability)
    Data Example: Childhood Growth
    Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.
Do boys and girls grow the same?
  Data: Kids198 from Stat2 textbook
  [Figure: scatterplot of Weight against Age (in months)]
  Two lines with indicator variables and an interaction term. Male line is [equation not preserved]; Female line is [equation not preserved].
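The "two lines" idea on this slide can be sketched in a few lines of code. The course itself used R; the sketch below uses Python only for illustration. It assumes a model of the form Weight = b0 + b1*Age + b2*Male + b3*Age*Male with females as the baseline group. The function names and the synthetic data are hypothetical and are not taken from the Kids198 data or the authors' materials. With a single indicator and its interaction with Age, the least-squares fit matches fitting one line per group, which the code uses directly.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interaction_coefficients(ages, weights, is_male):
    """Coefficients of Weight = b0 + b1*Age + b2*Male + b3*Age*Male.

    With one indicator and its interaction with Age, the least-squares
    coefficients equal those from a separate line per group, so we fit the
    two lines and re-express them in the single-model parameterization.
    """
    female = [(a, w) for a, w, m in zip(ages, weights, is_male) if not m]
    male = [(a, w) for a, w, m in zip(ages, weights, is_male) if m]
    f_int, f_slope = fit_line(*zip(*female))
    m_int, m_slope = fit_line(*zip(*male))
    return {
        "b0": f_int,              # intercept of the female (baseline) line
        "b1": f_slope,            # slope of the female line
        "b2": m_int - f_int,      # how much the male intercept differs
        "b3": m_slope - f_slope,  # how much the male slope differs
    }


# Synthetic data, made up for illustration (NOT the Kids198 data).
rng = random.Random(1)
ages, weights, is_male = [], [], []
for _ in range(200):
    male = rng.random() < 0.5
    age = rng.uniform(100, 220)
    true_int, true_slope = (-30.0, 0.75) if male else (0.0, 0.50)
    ages.append(age)
    weights.append(true_int + true_slope * age + rng.gauss(0, 10))
    is_male.append(male)

coef = interaction_coefficients(ages, weights, is_male)
print("female line: intercept", coef["b0"], "slope", coef["b1"])
print("male line:   intercept", coef["b0"] + coef["b2"],
      "slope", coef["b1"] + coef["b3"])
```

Reading the output this way mirrors the interpretation focus of the slide: b2 and b3 describe how the male line differs from the female line, so a b3 near zero would suggest the two groups grow at about the same rate.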
What we taught (continued)
  Inference (Random Variability)
    Data examples: Trump and Babies
    Introduce via simulation to gain intuition.
Trump and Babies
  Babies:
    Data: U.S. babies born in 1998 (census) https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm
    Q: What is the median gestational age? Calculate it with the population!
    What if you only had a sample of 500 babies? Simulate a sample! Try again! And again!
    Get a sampling distribution
  Trump:
    Data: Election polling (sample). We used recent CNN/ORC polling (during primaries).
    Q: What percent of Republicans favor Trump? But we only have a sample.
    Treat our sample as our population. Simulate a sample from it! Try again! And again!
    Get a bootstrap distribution
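The two simulation recipes on this slide can be sketched as follows. The course used R; this Python version is only an illustration, and everything in it is an assumption: the function names are hypothetical, the "population" of gestational ages is synthetic rather than the CDC file, and the poll responses are invented rather than the CNN/ORC results. The first function draws repeated samples from a known population (a sampling distribution); the second resamples with replacement from a single sample (a bootstrap distribution).

```python
import random
import statistics


def sampling_distribution(population, n, reps, rng):
    """Medians of `reps` random samples of size n drawn from the population."""
    return [statistics.median(rng.sample(population, n)) for _ in range(reps)]


def bootstrap_distribution(sample, reps, rng):
    """Proportions from `reps` resamples drawn WITH replacement from one sample."""
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]


rng = random.Random(2016)

# Stand-in population of gestational ages in weeks (synthetic, not the CDC data).
population = [round(rng.gauss(39, 2)) for _ in range(20000)]
true_median = statistics.median(population)
sample_medians = sampling_distribution(population, n=500, reps=1000, rng=rng)

# Stand-in poll of 500 respondents: 1 = favors the candidate, 0 = does not
# (invented counts, not the CNN/ORC poll).
poll = [1] * 200 + [0] * 300
boot_props = sorted(bootstrap_distribution(poll, reps=1000, rng=rng))
# Roughly the middle 95% of the resampled proportions.
interval = (boot_props[24], boot_props[974])

print("population median:", true_median)
print("spread of sample medians:", statistics.pstdev(sample_medians))
print("bootstrap percentile interval:", interval)
```

The only difference between the two functions is where the draws come from: the full population when it is available, or the sample itself when it is not. That parallel is what the slide's side-by-side layout emphasizes.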
How we taught
  Data Drives Everything
    In class: Address data questions with models and discuss issues as needed
    Out of class: Open-ended data analysis assignments and group project presentations
  Use Technology Statisticians Use
    R and RMarkdown in lecture and guided homework problems (with lots of code examples)
    Some computation, more interpretation of output
  General Approach to Inference
    Course is not a cookbook of tests
Lecture Slides RMarkdown, Homework RMarkdown
  Lectures available: https://github.com/bcheggeseth/Stat201Lectures
When we taught
  Early and often
    Multivariate questions and data collection complexities
    Sampling variability and inference
  General Order of Topics
    4th Ed. of Stats: Data and Models with some adjustments
    1. Review EDA (Chp 1 - 5)
    2. Sampling (Chp 11)
    3. Sampling variability and inference via computing (notes)
    4. Simple and multiple linear regression (Chp 6 - 9 + notes)
    5. Experiments (Chp 12)
    6. Formalize sampling variability via probability (Chp 13 - 17)
    7. Formal inference (Chp 18 - 25)
Thank you!
  Contact Info
    Brianna Heggeseth
    Email: bch2@williams.edu
  Shiny App Code: https://github.com/bcheggeseth/ShinyApps/
  Lecture Notes: https://github.com/bcheggeseth/Stat201Lectures
  ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home