Introduction to Statistics for Future Data Scientists

 
Intro Stats for Future Data Scientists
 
Brianna Heggeseth and Dick De Veaux
Williams College
 
Motivation
 
USCOTS 2015
What is wrong with Stat 101? ... What, How, and When
Teaching Future Data Scientists
Already learning statistical tools in other courses
What do we offer?
Revised GAISE Report (Everson, Mocko, et al.)
Statistical thinking (“think with data”) in multivariate situations
Real data with context and purpose
Technology to explore concepts and analyze data
 
 
 
(says Dick)
 
What we taught
 
Review EDA and Data Collection
Multivariate datasets: Fireplace worth, Crime in SF
Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.
 
How much is a fireplace worth?
 
Home price ($)
 
Red: Fireplace, Blue: No Fireplace
 
Fireplace in home?
 
Data and Analysis at ASA Stat 101 Toolkit:
 
http://community.amstat.org/stats101/home
 
Helping the SFPD
 
Code for Shiny Apps available at: https://github.com/bcheggeseth/ShinyApps/
 
Data: http://data.sfgov.org
 
What we taught
 
Review EDA and Data Collection
Multivariate datasets: Fireplace worth, Crime in SF
Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.
Modeling (Explaining Variability)
Data Example: Childhood Growth
Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.
 
Do boys and girls grow the same?
 
Two lines with indicator variables and an interaction term,
 
Male line is
 
Female line is
 
Data: Kids198 from Stat2 textbook
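 
To make the "two lines" idea concrete, here is a minimal sketch of a regression with an indicator variable and an interaction term. It is only an illustration: the course itself used R, the data below are made-up, noise-free toy values (not the Kids198 data), and the names `fit_line`, `interaction_coefficients`, and `b0` to `b3` are hypothetical. It relies on the fact that a model with its own intercept and slope for each group gives the same least-squares fit as one line per group.

```python
# Hypothetical, noise-free toy data (NOT the Kids198 data):
# girls: weight = 10 + 0.5 * age, boys: weight = 5 + 0.6 * age
ages = [100, 120, 140, 160, 180, 200]
girls = [(a, 10 + 0.5 * a) for a in ages]
boys = [(a, 5 + 0.6 * a) for a in ages]


def fit_line(points):
    """Ordinary least-squares (intercept, slope) for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interaction_coefficients(female_points, male_points):
    """Coefficients of weight = b0 + b1*age + b2*male + b3*age*male.

    The model has a separate intercept and slope for each group, so its
    least-squares fit matches fitting one line per group.
    """
    f_int, f_slope = fit_line(female_points)
    m_int, m_slope = fit_line(male_points)
    return {
        "b0": f_int,              # female intercept (male = 0)
        "b1": f_slope,            # female slope
        "b2": m_int - f_int,      # intercept shift for males
        "b3": m_slope - f_slope,  # slope shift for males (interaction)
    }


coef = interaction_coefficients(girls, boys)
female_line = (coef["b0"], coef["b1"])
male_line = (coef["b0"] + coef["b2"], coef["b1"] + coef["b3"])
```

Reading the output this way keeps the focus on interpretation: `b2` and `b3` say how the male line differs from the female line in starting point and in growth rate.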
 
What we taught
 
Review EDA and Data Collection
Multivariate datasets: Fireplace worth, Crime in SF
Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.
Modeling (Explaining Variability)
Data Example: Childhood Growth
Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.
Inference (Random Variability)
Data examples: Trump and Babies
Introduce via simulation to gain intuition.
 
Trump and Babies
 
Trump and Babies
 
Data: U.S. Babies born in 1998 (census)
https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm
Q: What is the median gestational age?
Calculate it with the population!
What if you only had a sample of 500 babies? Simulate a sample! Try Again! And Again!
Get a sampling distribution
 
Data: Election Polling (sample)
We used recent CNN/ORC polling (during primaries).
Q: What percent of Republicans favor Trump?
But we only have a sample.
Treat our sample as our “population”. Simulate a sample from it! Try Again! And Again!
Get a bootstrap distribution
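 
The two simulations on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the course used R and Shiny apps, whereas this sketch uses Python's standard library; the "population" of gestational ages and the poll responses are made-up stand-ins (not the CDC records or the CNN/ORC poll); and the function names are hypothetical.

```python
import random
import statistics

rng = random.Random(2015)  # fixed seed so reruns give the same draws

# Hypothetical stand-in "population" of gestational ages in weeks
# (made-up values, NOT the CDC birth records).
population = [rng.choice(range(34, 43)) for _ in range(20_000)]
population_median = statistics.median(population)


def sampling_distribution(pop, n, reps):
    """Medians of `reps` samples of size `n`, each drawn without replacement from the population."""
    return [statistics.median(rng.sample(pop, n)) for _ in range(reps)]


def bootstrap_distribution(sample, reps):
    """Proportions from `reps` resamples drawn with replacement from the sample."""
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]


# "What if you only had a sample of 500 babies? ... Try Again! And Again!"
medians = sampling_distribution(population, n=500, reps=1000)

# Hypothetical poll: 1 = favors the candidate, 0 = does not (made-up, 40% favor).
poll = [1] * 160 + [0] * 240

# "Treat our sample as our population. Simulate a sample from it!"
boot_props = bootstrap_distribution(poll, reps=1000)

# Roughly the middle 95% of the bootstrap proportions.
ordered = sorted(boot_props)
interval = (ordered[24], ordered[974])
```

The only difference between the two functions is where the draws come from: the sampling distribution draws from the full population, while the bootstrap distribution redraws from the one sample in hand.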
 
How we taught
 
Data Drives Everything
In class: Address data questions with models and discuss issues as needed
Out of class: Open-ended data analysis assignments and group project presentations
Use Technology Statisticians Use
R and RMarkdown in lecture and guided homework problems (with lots of code examples)
Some computation, more interpretation of output
General Approach to Inference
Course is not a cookbook of tests
 
Homework RMarkdown
 
Lecture Slides RMarkdown
 
Lectures available: https://github.com/bcheggeseth/Stat201Lectures
 
When we taught
 
Early and often
Multivariate questions and data collection complexities
Sampling variability and inference
General Order of Topics
4th Ed. of Stats: Data and Models with some adjustments
1. Review EDA (Chp 1 - 5)
2. Sampling (Chp 11)
3. Sampling variability and inference via computing (notes)
4. Simple and multiple linear regression (Chp 6 - 9 + notes)
5. Experiments (Chp 12)
6. Formalize sampling variability via probability (Chp 13 - 17)
7. Formal inference (Chp 18 - 25)
 
 
Thank you!
 
Contact Info
Brianna Heggeseth
Email: bch2@williams.edu
 
Shiny App Code:
https://github.com/bcheggeseth/ShinyApps/
 
Lecture Notes:
https://github.com/bcheggeseth/Stat201Lectures
 
ASA Stat 101 Toolkit:
http://community.amstat.org/stats101/home
 
 
 
Slide Note

Thank you Gayla. Thank you for being here this morning, on the last day of the conference.

Last fall, my colleague Dick De Veaux and I were each scheduled to teach a section of our advanced introductory statistics course. I am going to share with you our collaborative work redesigning this intro course so that it is more relevant to our students, who will be the future data scientists: students who will need to extract information from data no matter what their exact job title might be.


This content delves into the foundations of statistics for aspiring data scientists. It covers various topics such as statistical thinking, multivariate data analysis, real-world data scenarios, and the application of technology in statistical exploration. Examples include investigating the value of fireplaces in homes, examining childhood growth patterns, and discussing issues like bias and confounding variables. The goal is to equip future data scientists with a strong statistical foundation to navigate complex data environments effectively.

  • Statistics
  • Data Science
  • Multivariate Analysis
  • Real-World Data
  • Statistical Thinking

Uploaded on Feb 28, 2025



Presentation Transcript


  1. Intro Stats for Future Data Scientists Brianna Heggeseth and Dick De Veaux Williams College

  2. Motivation USCOTS 2015 What is wrong with Stat 101?....What, How, and When Teaching Future Data Scientists Already learning statistical tools in other courses What do we offer? Revised GAISE Report (Everson, Mocko, et al) Statistical thinking (“think with data”) in multivariate situations Real data with context and purpose Technology to explore concepts and analyze data (says Dick)

  3. What we taught Review EDA and Data Collection Multivariate datasets: Fireplace worth, Crime in SF Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification.

  4. How much is a fireplace worth? Red: Fireplace, Blue: No Fireplace Home price ($) Fireplace in home? Data and Analysis at ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home

  5. Helping the SFPD Data: http://data.sfgov.org Code for Shiny Apps available at: https://github.com/bcheggeseth/ShinyApps/

  6. What we taught Review EDA and Data Collection Multivariate datasets: Fireplace worth, Crime in SF Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification. Modeling (Explaining Variability) Data Example: Childhood Growth Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use.

  7. Do boys and girls grow the same? Data: Kids198 from Stat2 textbook [Scatterplot: Weight vs. Age (in months)] Two lines with indicator variables and an interaction term, Male line is Female line is

  8. What we taught Review EDA and Data Collection Multivariate datasets: Fireplace worth, Crime in SF Discuss multivariate and sampling issues such as bias, confounding, lurking variables, and effect modification. Modeling (Explaining Variability) Data Example: Childhood Growth Introduce multiple regression with indicators and interactions, focusing on interpretation and practical use. Inference (Random Variability) Data examples: Trump and Babies Introduce via simulation to gain intuition.

  9. Trump and Babies

  10. Trump and Babies Data: U.S. Babies born in 1998 (census) https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm Data: Election Polling (sample) We used recent CNN/ORC polling (during primaries). Q: What is the median gestational age? Calculate it with the population! Q: What percent of Republicans favor Trump? But we only have a sample. What if you only had a sample of 500 babies? Simulate a sample! Try Again! And Again! Treat our sample as our “population”. Simulate a sample from it! Try Again! And Again! Get a sampling distribution Get a bootstrap distribution

  11. How we taught Data Drives Everything In class: Address data questions with models and discuss issues as needed Out of class: Open-ended data analysis assignments and group project presentations Use Technology Statisticians Use R and RMarkdown in lecture and guided homework problems (with lots of code examples) Some computation, more interpretation of output General Approach to Inference Course is not a cookbook of tests

  12. Lecture Slides RMarkdown Homework RMarkdown Lectures available: https://github.com/bcheggeseth/Stat201Lectures

  13. When we taught Early and often Multivariate questions and data collection complexities Sampling variability and inference General Order of Topics 4th Ed. of Stats: Data and Models with some adjustments 1. Review EDA (Chp 1 - 5) 2. Sampling (Chp 11) 3. Sampling variability and inference via computing (notes) 4. Simple and multiple linear regression (Chp 6 - 9 + notes) 5. Experiments (Chp 12) 6. Formalize sampling variability via probability (Chp 13 - 17) 7. Formal inference (Chp 18 - 25)

  14. Thank you! Contact Info Brianna Heggeseth Email: bch2@williams.edu Shiny App Code: https://github.com/bcheggeseth/ShinyApps/ Lecture Notes: https://github.com/bcheggeseth/Stat201Lectures ASA Stat 101 Toolkit: http://community.amstat.org/stats101/home
