Introduction to Statistics for Future Data Scientists

Brianna Heggeseth and Dick De Veaux

Williams College

Motivation

• USCOTS 2015
  – What is wrong with Stat 101? What, How, and When
• Teaching Future Data Scientists
  – Already learning statistical tools in other courses
  – What do we offer?
• Revised GAISE Report (Everson, Mocko, et al.)
  – Statistical thinking (“think with data”) in multivariate situations
  – Real data with context and purpose
  – Technology to explore concepts and analyze data

(says Dick)

What we taught

Review EDA and Data Collection
• Multivariate datasets: Fireplace worth, Crime in SF
  – Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.

How much is a fireplace worth?

[Figure: home price ($) by whether the home has a fireplace. Red: fireplace, blue: no fireplace.]

Data and Analysis at ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home

Helping the SFPD

Data: http://data.sfgov.org

Code for Shiny Apps available at: https://github.com/bcheggeseth/ShinyApps/

What we taught

Modeling (Explaining Variability)
• Data Example: Childhood Growth
  – Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.

Do boys and girls grow the same?

[Figure: Weight versus Age (in months).]

Two lines with indicator variables and an interaction term: a male line and a female line.

Data: Kids198 from Stat2 textbook
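
The two-line idea on this slide can be sketched in code. What follows is a minimal Python sketch (the course itself used R), assuming a model of the form weight = b0 + b1*age + b2*male + b3*age*male. Under that assumption the female line is b0 + b1*age and the male line is (b0 + b2) + (b1 + b3)*age. The helper names `fit_line` and `interaction_coefficients`, the data values, and the resulting coefficients are invented for illustration; they are not the Kids198 data or the fitted lines shown in the talk.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interaction_coefficients(age, weight, male):
    """Coefficients of weight = b0 + b1*age + b2*male + b3*age*male.

    With a single 0/1 indicator and its interaction with age, least squares
    reproduces the separate per-group lines, so each group is fit on its own
    and the results are converted to the b0..b3 parameterization.
    """
    female_pts = [(a, w) for a, w, m in zip(age, weight, male) if m == 0]
    male_pts = [(a, w) for a, w, m in zip(age, weight, male) if m == 1]
    f_int, f_slope = fit_line(*zip(*female_pts))
    m_int, m_slope = fit_line(*zip(*male_pts))
    return {
        "b0": f_int,              # female intercept (baseline group)
        "b1": f_slope,            # female slope
        "b2": m_int - f_int,      # shift in intercept for males
        "b3": m_slope - f_slope,  # shift in slope for males (the interaction)
    }


# Invented data for illustration only (NOT the Kids198 dataset).
ages = [100, 130, 160, 190, 220] * 2   # age in months
males = [0] * 5 + [1] * 5              # indicator: 1 = male, 0 = female
weights = [10 + 0.5 * a for a in ages[:5]] + [-5 + 0.65 * a for a in ages[5:]]

coef = interaction_coefficients(ages, weights, males)
print("Female line: weight = {:.2f} + {:.2f} * age".format(coef["b0"], coef["b1"]))
print("Male line:   weight = {:.2f} + {:.2f} * age".format(
    coef["b0"] + coef["b2"], coef["b1"] + coef["b3"]))
```

Reading the output this way keeps the focus on interpretation: b2 and b3 say how the male line differs from the female line, rather than being coefficients to memorize.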

What we taught

Inference (Random Variability)
• Data examples: Trump and Babies
  – Introduce via simulation to gain intuition.

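Trump and Babies

Babies
• Data: U.S. babies born in 1998 (census): https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm
• Q: What is the median gestational age? Calculate it with the population!
• What if you only had a sample of 500 babies? Simulate a sample! Try again! And again!
• Get a sampling distribution

Trump
• Data: Election polling (sample). We used recent CNN/ORC polling (during primaries).
• Q: What percent of Republicans favor Trump? But we only have a sample.
• Treat our sample as our “population”. Simulate a sample from it! Try again! And again!
• Get a bootstrap distribution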
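
The contrast between the two simulations on this slide can be sketched as follows. This is a minimal Python sketch rather than the R code used in the course, and every number in it is a hypothetical stand-in: the gestational ages are randomly generated, not the CDC data, and the poll counts are made up, not the CNN/ORC results. The function names `sampling_distribution`, `bootstrap_distribution`, and `proportion` are likewise invented here.

```python
import random
import statistics


def sampling_distribution(population, n, stat, reps, rng):
    """Statistic computed on `reps` fresh samples of size n drawn from the population."""
    return [stat(rng.sample(population, n)) for _ in range(reps)]


def bootstrap_distribution(sample, stat, reps, rng):
    """Statistic computed on `reps` resamples drawn WITH replacement from one sample."""
    return [stat(rng.choices(sample, k=len(sample))) for _ in range(reps)]


def proportion(values):
    """Share of 1s in a list of 0/1 responses."""
    return sum(values) / len(values)


rng = random.Random(1998)  # fixed seed so the run is repeatable

# Babies: the whole population is available, so real samples can be drawn from it.
# Hypothetical stand-in for gestational ages in weeks (NOT the CDC data).
population = [round(rng.gauss(39, 2)) for _ in range(20_000)]
sample_medians = sampling_distribution(population, 500, statistics.median, 1000, rng)

# Polling: only one sample exists, so it is resampled in place of the population.
# Hypothetical poll of 500 respondents, 200 favoring the candidate (NOT CNN/ORC data).
poll = [1] * 200 + [0] * 300
boot_props = bootstrap_distribution(poll, proportion, 1000, rng)

print("Spread of sample medians:", statistics.stdev(sample_medians))
print("Bootstrap standard error of the proportion:", statistics.stdev(boot_props))
```

The only difference between the two functions is where the draws come from and whether they are made with replacement, which is the point of the side-by-side comparison.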

How we taught

Data Drives Everything
• In class: Address data questions with models and discuss issues as needed
• Out of class: Open-ended data analysis assignments and group project presentations

Use Technology Statisticians Use
• R and RMarkdown in lecture and guided homework problems (with lots of code examples)
• Some computation, more interpretation of output

General Approach to Inference
• Course is not a cookbook of tests

Lecture Slides: RMarkdown

Homework: RMarkdown

Lectures available: https://github.com/bcheggeseth/Stat201Lectures

When we taught

• Early and often
  – Multivariate questions and data collection complexities
  – Sampling variability and inference
• General Order of Topics
  – 4th Ed. of Stats: Data and Models with some adjustments

1. Review EDA (Chp 1 - 5)
2. Sampling (Chp 11)
3. Sampling variability and inference via computing (notes)
4. Simple and multiple linear regression (Chp 6 - 9 + notes)
5. Experiments (Chp 12)
6. Formalize sampling variability via probability (Chp 13 - 17)
7. Formal inference (Chp 18 - 25)

Thank you!

Contact Info
Brianna Heggeseth
Email: bch2@williams.edu
Shiny App Code: https://github.com/bcheggeseth/ShinyApps/
Lecture Notes: https://github.com/bcheggeseth/Stat201Lectures
ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home

Slide Note

Thank you Gayla. Thank you for being here this morning, on the last day of the conference.

Last fall, my colleague Dick De Veaux and I were each scheduled to teach a section of our advanced introductory statistics course. I am going to share our collaborative work redesigning this intro course to make it more relevant to our students, who will be the future data scientists: students who will need to extract information from data no matter what their exact job title might be.


This content delves into the foundations of statistics for aspiring data scientists. It covers various topics such as statistical thinking, multivariate data analysis, real-world data scenarios, and the application of technology in statistical exploration. Examples include investigating the value of fireplaces in homes, examining childhood growth patterns, and discussing issues like bias and confounding variables. The goal is to equip future data scientists with a strong statistical foundation to navigate complex data environments effectively.

Uploaded by pace_t on Feb 28, 2025


Presentation Transcript

Intro Stats for Future Data Scientists Brianna Heggeseth and Dick De Veaux Williams College

Motivation. USCOTS 2015: What is wrong with Stat 101? What, How, and When. Teaching Future Data Scientists: Already learning statistical tools in other courses. What do we offer? Revised GAISE Report (Everson, Mocko, et al.): Statistical thinking (“think with data”) in multivariate situations; real data with context and purpose; technology to explore concepts and analyze data. (says Dick)

What we taught. Review EDA and Data Collection. Multivariate datasets: Fireplace worth, Crime in SF. Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.

How much is a fireplace worth? [Figure: home price ($) by whether the home has a fireplace. Red: fireplace, blue: no fireplace.] Data and Analysis at ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home

Helping the SFPD. Data: http://data.sfgov.org Code for Shiny Apps available at: https://github.com/bcheggeseth/ShinyApps/

What we taught. Modeling (Explaining Variability). Data example: Childhood Growth. Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.

Do boys and girls grow the same? [Figure: Weight versus Age (in months).] Two lines with indicator variables and an interaction term: a male line and a female line. Data: Kids198 from Stat2 textbook.

What we taught. Inference (Random Variability). Data examples: Trump and Babies. Introduce via simulation to gain intuition.

Trump and Babies. Babies: Data: U.S. babies born in 1998 (census), https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm Q: What is the median gestational age? Calculate it with the population! What if you only had a sample of 500 babies? Simulate a sample! Try again! And again! Get a sampling distribution. Trump: Data: Election polling (sample). We used recent CNN/ORC polling (during primaries). Q: What percent of Republicans favor Trump? But we only have a sample. Treat our sample as our “population”. Simulate a sample from it! Try again! And again! Get a bootstrap distribution.

How we taught. Data Drives Everything: In class, address data questions with models and discuss issues as needed. Out of class, open-ended data analysis assignments and group project presentations. Use Technology Statisticians Use: R and RMarkdown in lecture and guided homework problems (with lots of code examples). Some computation, more interpretation of output. General Approach to Inference: Course is not a cookbook of tests.

Lecture Slides: RMarkdown. Homework: RMarkdown. Lectures available: https://github.com/bcheggeseth/Stat201Lectures

When we taught. Early and often: Multivariate questions and data collection complexities; sampling variability and inference. General Order of Topics: 4th Ed. of Stats: Data and Models with some adjustments. 1. Review EDA (Chp 1 - 5). 2. Sampling (Chp 11). 3. Sampling variability and inference via computing (notes). 4. Simple and multiple linear regression (Chp 6 - 9 + notes). 5. Experiments (Chp 12). 6. Formalize sampling variability via probability (Chp 13 - 17). 7. Formal inference (Chp 18 - 25).

Thank you! Contact Info: Brianna Heggeseth. Email: bch2@williams.edu Shiny App Code: https://github.com/bcheggeseth/ShinyApps/ Lecture Notes: https://github.com/bcheggeseth/Stat201Lectures ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home
