The Key Distinctions in Statistics

 
Summer 2017
 
Summer Institutes
 
134
 
Sampling Distributions
 
 
The most important distinction in
statistics
 
When analysing data (or reading the literature), think
about whether you want to discuss the sample that you
observed or want to make statements that are more
generally true
 
Statistics is the only field that gives us the correct
framework to generalise from our sample to the
population
 
 
The most important distinction in
statistics
 
Example: T cell counts from 40 women with triple
negative breast cancer were observed. What can we do
with this information?
 
 
 
Option 1: Discuss the 40 women. What was the mean T
cell count? What was its variation?
 
 
 
Option 2: Generalise the information about the 40
women to make statements about all women with triple
negative breast cancer
 
 
These are 2 different approaches to using the same
information
 
 
Language for making these distinctions
 
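Population
Size N (usually $\infty$)
Mean = $\mu = \sum_j p_j X_j$ (or the corresponding integral)
Variance = $\sigma^2 = \sum_j p_j (X_j - \mu)^2$ (or the corresponding integral)

Sample
Size n
Sample mean = $\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j$
Sample variance = $s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar{X})^2$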
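As a minimal sketch of the sample formulas (the five data values are hypothetical, chosen only for illustration), the sample mean and the sample variance with its n-1 divisor can be computed by hand and checked against Python's standard library:

```python
import statistics

# Hypothetical sample of n = 5 observations (illustrative values only).
sample = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(sample)

# Sample mean: (1/n) * sum of X_j
x_bar = sum(sample) / n

# Sample variance: (1/(n-1)) * sum of (X_j - x_bar)^2
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# statistics.variance uses the same n-1 divisor.
assert abs(x_bar - statistics.mean(sample)) < 1e-12
assert abs(s2 - statistics.variance(sample)) < 1e-12

print(x_bar, s2)
```

These are statements about the sample only; the population quantities μ and σ² remain unknown.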
 
 
 
 
Generalising the sample to the
population
 
Issue: We can calculate the sample mean and sample
variance from our data, but the true mean and true
variance are generally unknown
 
 
 
Fortunately, statisticians have learnt some things about
how to recover* the true mean and true variance based
only on sample means and sample variances
 
 
 
 
 
 
 
 
 
* with high probability
 
 
 
 
How do sample means behave?
 
Suppose we observe data $X_1, X_2, \ldots, X_n$.
We can calculate $\bar{X}$ exactly, but what
can we say about μ?


Idea: μ is probably close to $\bar{X}$

Goal: Make this more rigorous
 
 
 
 
Sums of Normal Random Variables

In general, neither $\sum_i X_i$ nor $\bar{X}$ will have
the same distribution as the X’s

Example:
$X_1, X_2, X_3$ follow F-distributions.
$X_1 + X_2 + X_3$ does not follow an F-distribution
$(X_1 + X_2 + X_3)/3$ does not follow an F-distribution


Exception to the rule:
If $X_1, X_2, \ldots, X_n$ are independent and
normally distributed with means $\mu_i$ and variances $\sigma_i^2$,
then $X_1 + \cdots + X_n$ follows a normal distribution with mean
$\mu_1 + \cdots + \mu_n$ and variance
$\sigma_1^2 + \cdots + \sigma_n^2$
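The rule for sums of independent normals can be checked by simulation. The sketch below uses three hypothetical (mean, standard deviation) pairs and only compares the simulated mean and variance of the sum with the values the rule predicts; it does not test normality itself.

```python
import random
import statistics

random.seed(0)

# Hypothetical (mean, standard deviation) pairs for three independent normals.
params = [(1.0, 2.0), (-3.0, 1.0), (5.0, 0.5)]

def draw_sum():
    """Draw one value of X_1 + X_2 + X_3."""
    return sum(random.gauss(mu, sigma) for mu, sigma in params)

sums = [draw_sum() for _ in range(20000)]

expected_mean = sum(mu for mu, _ in params)            # mu_1 + mu_2 + mu_3
expected_var = sum(sigma ** 2 for _, sigma in params)  # sum of the variances

print(statistics.mean(sums), expected_mean)
print(statistics.variance(sums), expected_var)
```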
 
 
Central limit theorem

What can we do instead? Use the central
limit theorem!


Central limit theorem:
If $X_1, X_2, \ldots, X_n$ are independent and have
the same distribution, and that distribution has
mean μ and variance σ², then if n is large,

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

approximately and under relatively weak
conditions.

As a rule of thumb, this applies for $n \ge 30$.
As n increases, the normal approximation
improves.
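A small simulation can make the theorem concrete. This sketch assumes a skewed Exponential(1) population (mean 1, variance 1), a choice made here for illustration rather than taken from the slides, and looks at the means of many samples of size 30.

```python
import random
import statistics
from math import sqrt

random.seed(1)

n = 30                  # sample size
reps = 5000             # number of simulated samples
mu, sigma2 = 1.0, 1.0   # mean and variance of the Exponential(1) population

# Draw `reps` samples of size n and keep the mean of each one.
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# If X-bar is roughly N(mu, sigma2 / n), about 95% of the sample means
# should fall within 1.96 standard errors of mu.
half_width = 1.96 * sqrt(sigma2 / n)
frac_within = sum(abs(m - mu) <= half_width for m in means) / reps

print(statistics.fmean(means), mu)             # centre of the sample means
print(statistics.variance(means), sigma2 / n)  # spread of the sample means
print(frac_within)
```

Even though the population is far from normal, the sample means centre on μ with variance close to σ²/n.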
 
 
Distribution of the Sample Mean
 
Population of X’s (mean = μ)

[Figure: five samples of size n are drawn from the population; each sample yields its own sample mean $\bar{X}$]
 
 
 
Central Limit Theorem - Illustration
 
[Figure: histograms of the population and of the means of samples of size 5, 10 and 30]
 
 
Central limit theorem

The central limit theorem allows us to use
the sample ($X_1, \ldots, X_n$) to discuss the
population (μ)


We do not need to know the distribution of
the data to make statements about the true
mean of the population!
 
 
Distribution of the Sample Mean
 
EXAMPLE:
Suppose that for Seattle sixth grade students the mean
number of missed school days is 5.4 days with a
standard deviation of 2.8 days.  What is the
probability that a random sample of size 49 (say
Ridgecrest’s 6th graders) will have a mean number of
missed days greater than 6 days?
 
 
 
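Find the probability that a random sample of
size 49 from this population will have a mean
greater than 6 days.

μ = 5.4 days
σ = 2.8 days
n = 49

$$\sigma_{\bar{X}} = \sigma / \sqrt{n} = 2.8 / \sqrt{49} = 0.4, \qquad \mu_{\bar{X}} = \mu = 5.4$$

$$P(\bar{X} > 6) = P\left(\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{6 - 5.4}{0.4}\right) = P(Z > 1.5) = 0.0668$$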
 
 
 
 
 
 
[Figure: the population distribution and the sampling distribution of $\bar{X}$ (for samples of size 49)]
 
 
 
Exercise

What is the probability that a random sample (size
49) from this population has a mean between 4 and
6 days?
 
 
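Solution

$$P(4 < \bar{X} < 6) = P\left(\frac{4 - 5.4}{0.4} < Z < \frac{6 - 5.4}{0.4}\right) = P(-3.5 < Z < 1.5) = P(Z < 1.5) - P(Z < -3.5) = 0.933$$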
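The missed-school-days probabilities can also be computed numerically. This is a sketch using the normal approximation from the central limit theorem and the values given in the example (μ = 5.4, σ = 2.8, n = 49); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 5.4, 2.8, 49   # values from the example
se = sigma / sqrt(n)          # standard error of the mean, 0.4

# By the central limit theorem, X-bar is approximately N(mu, se^2).
xbar_dist = NormalDist(mu, se)

p_greater_6 = 1 - xbar_dist.cdf(6)                   # P(X-bar > 6)
p_between_4_6 = xbar_dist.cdf(6) - xbar_dist.cdf(4)  # P(4 < X-bar < 6)

print(round(p_greater_6, 4))    # 0.0668
print(round(p_between_4_6, 3))  # 0.933
```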
 
 
Confidence Intervals
 
 
Confidence Intervals
 
Confidence intervals are not just intervals!
 
“(L, U) is a 100p% confidence interval for a
parameter θ”
 
means that
 
“For any possible true value of the parameter θ,
the interval (L, U) contains θ with
probability at least p.”
 
 
 
Confidence intervals only concern
parameters.
Prediction intervals (different!) are intervals
about random variables.
 
 
Confidence Intervals for the mean
 
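Because

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

we know that

$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le +1.96\right) = 0.95.$$

Rearranging gives us that

$$\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right)$$

is a 95% confidence interval for the true
mean μ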
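What "95% confidence" means can be illustrated by repeated sampling. The sketch below assumes a hypothetical normal population (μ = 10, σ = 3, n = 25, all chosen for illustration) and counts how often the interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ contains the true mean.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(2)

mu, sigma, n = 10.0, 3.0, 25   # hypothetical true mean, known sigma, sample size
reps = 4000                    # number of simulated samples
half_width = 1.96 * sigma / sqrt(n)

covered = 0
for _ in range(reps):
    x_bar = fmean(random.gauss(mu, sigma) for _ in range(n))
    # Does this sample's interval contain the true mean?
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

coverage = covered / reps
print(coverage)   # should be close to 0.95
```

The interval is the random quantity here: μ stays fixed while the interval moves from sample to sample.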
 
 
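Confidence Intervals - σ known

Confidence Interval for the Population Mean

If we desire a (1 - α) confidence interval we can
derive it based on the statement

$$P\left(Q_Z^{(\alpha/2)} \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le Q_Z^{(1-\alpha/2)}\right) = 1 - \alpha$$

That is, we find constants $Q_Z^{(\alpha/2)}$ and $Q_Z^{(1-\alpha/2)}$ (quantiles of the standard normal distribution) that have
exactly (1 - α) probability between them.

A (1 - α) confidence interval for the population mean is

$$\left(\bar{X} + Q_Z^{(\alpha/2)}\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + Q_Z^{(1-\alpha/2)}\,\frac{\sigma}{\sqrt{n}}\right)$$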
 
 
Confidence Intervals - σ known - EXAMPLE


Suppose gestational times are normally
distributed with a standard deviation of 6 days.  A
sample of 30 second-time mothers yields a mean
pregnancy length of 279.5 days.  Construct a 90%
confidence interval for the mean length of second
pregnancies based on this sample.
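One way to carry out this calculation, as a sketch: the standard normal quantile is taken from Python's `statistics.NormalDist`, and the variable names are illustrative rather than part of the original example.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 279.5, 6.0, 30   # values from the example
alpha = 0.10                       # 90% confidence

# Upper alpha/2 quantile of the standard normal, about 1.645.
z = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z * sigma / sqrt(n)

lower, upper = x_bar - half_width, x_bar + half_width
print(round(lower, 2), round(upper, 2))   # roughly 277.7 and 281.3 days
```

Because the standard normal is symmetric, the lower quantile is simply -z, so the interval is $\bar{X} \pm z\,\sigma/\sqrt{n}$.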
 
 
Confidence Intervals - σ unknown

To get a CI for μ using the methods outlined
above, we need $\bar{X}$ and σ². But usually, σ is
unknown - we only have $\bar{X}$ and s². It turns out
that even though

$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

is normally distributed,

$$\frac{\bar{X} - \mu}{s / \sqrt{n}}$$

is not (quite)!

W.S. Gosset worked for Guinness Brewing in
Dublin, Ireland.  He was forced to publish under the
pseudonym “Student”.  In 1908 he derived the
distribution of

$$\frac{\bar{X} - \mu}{s / \sqrt{n}}$$

which is now known as Student’s t-distribution.
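A quick simulation shows why replacing σ with s matters. This sketch assumes a hypothetical standard normal population and a deliberately small sample size (n = 5), and compares how often each standardised mean exceeds 1.96 in absolute value.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(3)

n, reps = 5, 20000
mu, sigma = 0.0, 1.0   # hypothetical normal population

def exceeds(scale_by_s):
    """Return True when |(x_bar - mu) / (scale / sqrt(n))| > 1.96 for one sample."""
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    scale = stdev(xs) if scale_by_s else sigma
    return abs((fmean(xs) - mu) / (scale / sqrt(n))) > 1.96

frac_known_sigma = sum(exceeds(False) for _ in range(reps)) / reps
frac_using_s = sum(exceeds(True) for _ in range(reps)) / reps

print(frac_known_sigma)   # close to 0.05, as the normal distribution predicts
print(frac_using_s)       # noticeably larger: the t-distribution has heavier tails
```

With s in the denominator, the normal cutoff of 1.96 is too small, which is why the t-distribution's wider quantiles are needed.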
 
 
Normal and t distributions
 
 
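Confidence Intervals - σ² unknown
t Distribution

When σ is unknown we replace it with the
estimate, s, and use the t-distribution.  The statistic

$$\frac{\bar{X} - \mu}{s / \sqrt{n}}$$

has a t-distribution with n-1 degrees of freedom.
We can use this distribution to obtain a confidence
interval for μ even when σ is not known.


A (1 - α) Confidence Interval for the Population
Mean when σ is unknown

$$\left(\bar{X} + t_{n-1}^{(\alpha/2)}\,\frac{s}{\sqrt{n}},\ \bar{X} + t_{n-1}^{(1-\alpha/2)}\,\frac{s}{\sqrt{n}}\right)$$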
 
 
 
 
Confidence Intervals - σ² unknown
t Distribution - EXAMPLE


Given our 30 moms with a mean
gestation of 279.5 days and a variance of
28.3 days², we can now compute a 95%
confidence interval for the mean length of
pregnancies for second time mothers:
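One way to finish this calculation, as a sketch: the quantile $t_{29}^{(0.975)} \approx 2.045$ is taken from a standard t table (it is an assumed value, not one given in the slides), and by symmetry $t_{29}^{(0.025)} = -t_{29}^{(0.975)}$.

```latex
\begin{align*}
\frac{s}{\sqrt{n}} &= \sqrt{\frac{28.3}{30}} \approx 0.971 \\
\bar{X} \pm t_{29}^{(0.975)}\,\frac{s}{\sqrt{n}} &\approx 279.5 \pm 2.045 \times 0.971 \approx 279.5 \pm 1.99 \\
&\Rightarrow (277.5,\ 281.5) \text{ days}
\end{align*}
```

The t quantile 2.045 is larger than the normal quantile 1.96, which reflects the extra uncertainty from estimating σ with s.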
 
 
Confidence Intervals -
sample variance

Q: Can we use the sample variance to derive a confidence interval for
the population variance?
A: Yes. We’ll need the Chi-square distribution

Definition: The sum of n squared independent
standard normal random variables is a random variable
with a Chi-square distribution with n degrees
of freedom.
Let $Z_i$ be independent standard normals, N(0,1).  Let

$$X = \sum_{i=1}^{n} Z_i^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$$

X has a $\chi^2(n)$ distribution
 
 
Chi-square Distribution


Properties of $\chi^2(n)$:  Let X ~ $\chi^2(n)$.
1. $X \ge 0$
2. E[X] = n
3. V[X] = 2n
4. n, the parameter of the distribution, is called the degrees of freedom.
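The first three properties can be checked by simulation. This sketch builds Chi-square draws directly from the definition, as sums of n squared independent standard normals, with n = 5 chosen arbitrarily for illustration.

```python
import random
from statistics import fmean, variance

random.seed(4)

n, reps = 5, 20000   # degrees of freedom and number of simulated draws

# Each draw is the sum of n squared independent standard normals.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

print(min(draws) >= 0)         # property 1: never negative
print(fmean(draws), n)         # property 2: mean close to n
print(variance(draws), 2 * n)  # property 3: variance close to 2n
```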
 
 
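Chi-square Distribution
Sample Variance


The Chi-square distribution describes the
distribution of the sample variance.  Recall

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

and

$$\frac{(n-1)\,s^2}{\sigma^2} = \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{\sigma}\right)^2$$

Now the right side almost looks like

$$\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$$

which would be $\chi^2(n)$.
Since μ is estimated by $\bar{X}$, one degree of freedom is
lost leading to

$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2$$

with n-1 degrees of freedom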
 
 
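Chi-square Distribution
Confidence Interval for σ²


We can use the Chi-square distribution to obtain
a (1 - α) confidence interval for the population
variance.

$$P\left(Q_{\chi^2(n-1)}^{(\alpha/2)} \le \frac{(n-1)\,s^2}{\sigma^2} \le Q_{\chi^2(n-1)}^{(1-\alpha/2)}\right) = 1 - \alpha$$

Now, inverting this statement yields:

$$P\left(\frac{(n-1)\,s^2}{Q_{\chi^2(n-1)}^{(1-\alpha/2)}} \le \sigma^2 \le \frac{(n-1)\,s^2}{Q_{\chi^2(n-1)}^{(\alpha/2)}}\right) = 1 - \alpha$$

Therefore,
a (1 - α) Confidence Interval for the Population
Variance is

$$\left(\frac{(n-1)\,s^2}{Q_{\chi^2(n-1)}^{(1-\alpha/2)}},\ \frac{(n-1)\,s^2}{Q_{\chi^2(n-1)}^{(\alpha/2)}}\right)$$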
 
 
Chi-square Distribution
Confidence Interval for σ² - Exercise


Suppose that, for the second-time mothers, we were
not happy using the standard deviation of 6 days
since it was based on the population of all mothers
regardless of parity.  The sample variance was
28.3 days².  What is a 95% confidence interval for
the variance of the length of second pregnancies?
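A sketch of one way to work this exercise, assuming n = 30 as in the earlier examples and the tabulated Chi-square quantiles for 29 degrees of freedom, $Q^{(0.025)} \approx 16.047$ and $Q^{(0.975)} \approx 45.722$ (table values, not given in the slides):

```latex
\begin{align*}
(n-1)\,s^2 &= 29 \times 28.3 = 820.7 \\
\left(\frac{820.7}{45.722},\ \frac{820.7}{16.047}\right) &\approx (17.95,\ 51.14) \text{ days}^2
\end{align*}
```

Unlike the intervals for the mean, this interval is not symmetric around the sample variance, because the Chi-square distribution is skewed.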
 
 
Summary

  • General (1 - α) Confidence Intervals.
    Confidence intervals are only for parameters!
  • CI for μ, σ assumed known → Z.
  • CI for μ, σ unknown → T.
  • CI for σ² → χ²
  • Higher confidence → wider interval
  • Larger sample size → narrower interval

In statistics, the crucial difference between sample and population data shapes how we interpret information and draw conclusions. By generalizing sample data to the population, statisticians can estimate true means and variances with confidence. Sample means help us infer about the population, although they may not have the same distribution. Recognizing these distinctions is essential for accurate statistical analysis.

  • Statistics
  • Sampling Distributions
  • Population vs Sample
  • Data Analysis
  • Statistical Inference

Uploaded on Sep 25, 2024 | 0 Views



