Understanding Variation in Statistical Studies
Variability is key in statistical studies, shaping the essence of statistical analysis. Students often struggle to grasp the concept of variability, despite being taught statistical methods. The term "variation" takes on different meanings in various statistical contexts, presenting challenges in comprehension. It is crucial to understand and appreciate the diverse manifestations of variation in data analysis to enhance statistical thinking skills.
Presentation Transcript
VARIATION: THE HEART OF STATISTICAL STUDY
Variability is the heart and soul of the statistical endeavour. If it weren't for the complexity of variability in data, statistics would be the pastime of middle-schoolers rather than one of the world's leading scientific fields.
STUDENTS NEED TO BE THINKING MORE ABOUT VARIATION
Research shows that students find it hard to appreciate the role of variability, even when they are being taught to account and allow for it. From their earliest encounters with statistical thinking, students become very accustomed to questions about data and chance that prompt single point-value answers:
What is the probability that ...?
How many would you expect?
What is the average of ...?
These types of tasks, in both research and assessment, have tended to mask the issue of variability.
Consider a question like: what is the probability of a Year 12 student being taller than 170 cm? This question evokes a specific answer: there is a 34% probability. Students understand the meaning of their answer (or relatively so).
But when a student is told that Y12 boys have more variation in their data, they haven't a clue what that means.
They can calculate both IQRs.
They can tell you which IQR is bigger.
They'll even tell you that this means boys' heights have more variation.
But they have no idea what that means.
VARIATION, VARIATION, AND VARIATION
It seems like every time we teach stats, the word "variation" comes up in a different way.
For example, in inference we say that "sampling variability" means that when I take another sample, I will get somewhat different results.
But then in the analysis section, I say "more variation" has something to do with a bigger IQR.
And then in bivariate, experiments, and others, I'm dealing with "sources of variation", which certainly seems different to the other two.
Many students think statisticians just like the word, and that's why it comes up in so many different contexts. But actually, we use the same word in so many different ways because variation is a concept that has many manifestations.
VARIATION WITHIN SAMPLES
It all comes from the same underlying idea. Let's say we wanted to study boys' heights. We know that when we take 30 people's heights, we're not going to keep getting the same number. There will be variation; there will be a variety of heights.
One thing that contributes to that variety is VARIATION WITHIN SAMPLE (SPREAD). This contributor is almost entirely a function of how the population behaves: if there is a lot of variation in the heights of boys in the population, there will more than likely be a lot of variation in the heights of boys in the sample. IT IS NOT DECREASED BY SAMPLE SIZE.
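The claim that within-sample spread is not decreased by sample size can be checked with a small simulation. This is only a sketch: it assumes, purely for illustration, that boys' heights follow a normal distribution with mean 175 cm and standard deviation 7 cm (figures not taken from the presentation), and the function name is made up for this example.

```python
import random
import statistics


def within_sample_sd(n, pop_mean=175.0, pop_sd=7.0, seed=1):
    """Draw one sample of n heights and return its standard deviation.

    pop_mean and pop_sd are assumed values, chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heights = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
    return statistics.stdev(heights)


if __name__ == "__main__":
    for n in (30, 300, 3000):
        print(f"n = {n:4d}: spread within sample (SD) = {within_sample_sd(n):.2f} cm")
```

Under these assumptions the printed spread should hover around the population value at every sample size, rather than shrinking as n grows.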
VARIATION BETWEEN SAMPLES
Another thing that contributes to that variety is VARIATION BETWEEN SAMPLES. This is the idea that all samples are different, and it is by chance that we picked this one. This contributes to the variability (variety) in the data. One cannot know exactly how much it has contributed in any particular sample. However, the amount of variation contributed by samples of the same size is largely predictable, and for descriptors such as the mean or median, the extent of possible variation can be estimated by constructing a confidence interval.
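One way to estimate how far a median could move from sample to sample is a percentile bootstrap interval. The presentation does not say which interval method it has in mind, so the sketch below is just one possible technique; the height data and the function name are hypothetical.

```python
import random
import statistics


def bootstrap_interval(data, stat=statistics.median, resamples=2000, level=0.95, seed=3):
    """Percentile bootstrap interval for a sample statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement many times and record the statistic.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(resamples))
    lower = estimates[int((1 - level) / 2 * resamples)]
    upper = estimates[int((1 + level) / 2 * resamples) - 1]
    return lower, upper


# Hypothetical heights (cm) of 15 boys, made up for this example.
heights = [162, 165, 168, 169, 171, 172, 174, 175, 175, 177, 179, 181, 183, 186, 190]

if __name__ == "__main__":
    low, high = bootstrap_interval(heights)
    print(f"sample median = {statistics.median(heights)} cm")
    print(f"bootstrap interval for the median: {low} to {high} cm")
```

The width of the interval is a description of variation between samples: it suggests how different the median might plausibly have been had chance handed us another sample of the same size.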
Used in everyday teaching
VARIATION WITHIN SAMPLES. When we say "Therefore the sample of boys' heights has significantly more variation than the girls'", what we mean is: the boys' heights have more variety than the girls'. Could this be a mirror of the population? What would you expect?
VARIATION BETWEEN SAMPLES. When we say "A sample of size 10 will have more variation than a sample of size 100", what we mean is: chance plays a more pivotal role in the look of your data with a sample of size 10 than with a sample of size 100.
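The second statement can be illustrated by taking many samples of each size and seeing how much their means differ from one another. This is a sketch under an assumed normal population (mean 175 cm, SD 7 cm, both invented for the example), not a result from the presentation.

```python
import random
import statistics


def sd_of_sample_means(n, repeats=2000, pop_mean=175.0, pop_sd=7.0, seed=2):
    """Take many samples of size n and return the SD of their means.

    A larger value means chance plays a bigger role in what one sample shows.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (10, 100):
        print(f"n = {n:3d}: sample means vary with SD of about {sd_of_sample_means(n):.2f} cm")
```

Under these assumptions the means of size-10 samples should bounce around roughly three times as much as the means of size-100 samples, which is what "more variation between samples" is pointing at.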
VARIATION FROM NON-SAMPLING ERRORS
Another thing that contributes to that variety is VARIATION FROM NON-SAMPLING ERRORS (SOURCES OF VARIATION THAT NEED TO BE MANAGED). This is variation caused by factors of data collection that unduly affect the variety in that basket. Whether they increase the variety or decrease it, both are equally problematic, because the discoveries you are making as a scientist are compromised.
HOW SHOULD THE IDEAL STATS STUDENT DEAL WITH VARIATION?
We should try to quantify it and describe it meaningfully.
We should try to seek explanations for it.
When collecting data, we should control our processes so that we don't change it unnecessarily (if this one isn't attended to, it can render all the above effort inaccurate).
We should not decide on the use of a model without considering it.
VARIATION IN THE PLAN SECTION
WHAT DOES MANAGING SOURCES OF VARIATION MEAN?
Managing sources of variation: managing anything that might unduly increase, decrease, or otherwise change the variation in our data.
CONSIDERING VARIATION IN THE PLAN SECTION
In the plan section, the statistician should make every effort to manage sources of variation, possibly through:
The application of correct statistical design in order to limit undue effects on data collection.
If sampling is involved, random sampling methods should be followed to limit undue effects caused by a biased or unrepresentative selection.
If running an experiment with two independent groups, groups should be randomised to limit undue variation caused by unmatched groups.
Etc.
Expert statistician: attempt to anticipate variation by researching the results of similar studies completed previously. Decisions around design are often made based on anticipated variation. http://powerandsamplesize.com/
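As an illustration of a design decision driven by anticipated variation, the sketch below uses the standard normal-approximation formula for the sample size needed to estimate a mean to within a chosen margin of error, n = (z * SD / margin)^2 rounded up. It is not a description of what the linked site computes, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(anticipated_sd, margin_of_error, confidence=0.95):
    """Approximate sample size needed to estimate a mean within +/- margin_of_error.

    anticipated_sd would come from earlier, similar studies (an assumption here).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.ceil((z * anticipated_sd / margin_of_error) ** 2)


if __name__ == "__main__":
    # Hypothetical anticipated spreads (cm), aiming for a margin of 2 cm.
    for sd in (7, 14):
        print(f"anticipated SD {sd} cm -> about {sample_size_for_mean(sd, 2)} people needed")
```

The point of the sketch is the direction of the effect: the more variation we anticipate, the larger the sample we need to plan for, and doubling the anticipated SD roughly quadruples the required sample size.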
STANDARDS THAT ALLOW MORE OPPORTUNITY TO MANAGE SOURCES OF VARIATION
In 11 Bivariate, some possible ideas include:
How the measures are taken: where to start? Where to end? We should remove anything that would affect the consistency of the start point and end point.
How to make sure things are zeroed (with a tape measure, a student other than the one measuring could double-check that the tape measure's zero starts where it is meant to start).
Inadequacy of equipment and how you will deal with it (a weighing scale looks different from different angles, so the person taking the measurement could stand in a fixed position relative to the scale).
Remove any other barrier to consistency.
In standards such as experiments, students should be encouraged to go beyond surface thinking to deeper thinking. I explain it like this.
Surface thinking: a person running your experiment would have considered that difficulty almost immediately, and come up with your method of dealing with it with little thought.
Deeper thinking: only a person who has pondered deeply the many aspects that could affect the results (contribute unduly to variation) would have thought of that difficulty and/or the way you dealt with it.
For example, take the question: does colour affect the number of words you can remember?
Surface thinking: every student will have the same amount of time to look at and memorise the words.
Deeper thinking: I made sure the words chosen were from different topics and of different lengths. I was worried that if they were all from the same topic, someone might find them easier to remember. For example, if most of them are sport-related, the sporty kids will find them a lot easier to memorise than the non-sporty kids.
VARIATION IN THE ANALYSIS SECTION
I think it is superior for a student to think of data as a story of variation from which we extract measures of centre than to think of data through its centre and consider variability around that centre. So I think it would be superior for a student to acknowledge the statistical endeavour as an attempt to consider the variation of the data, and that:
One facet of this is to extract the central tendency from the variation in the data (CENTRE).
Another facet is to describe how much variation there is in the data (SPREAD).
Another facet is to describe the pattern of that variation (SHAPE).
Another facet is to describe unusual aspects of the variation (UNUSUAL).
DESCRIBING CENTRE: INFERENCE
Description of:
Mean
Median
Difference in mean/median
Shift of middle 50%
Etc.
DESCRIBING SPREAD: INFERENCE
RANGE
o A measure of the extent of variation. What is the extent of variety in the sample of heights?
IQR
o A measure of typical variation, or the variation of the generality of sample points. E.g. everyone greater than the upper quartile can be thought of as "taller people". Everyone lower than the lower quartile can be thought of as "shorter people". All the people in the IQR could be thought of as the general case of heights. Therefore, the IQR is a good measure of the amount of variety of heights amongst the more typical boys.
STANDARD DEVIATION
o Again, a measure of typical variation. Generally, the standard deviation is favoured as the measure of typical variation in more symmetrical data, and the IQR is favoured as the measure of typical variation in skewed data.
To make a comparative statement about which group (e.g. boys or girls) has more spread, it is superior to use typical measures of spread.
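The three measures of spread above can be computed side by side. This is a minimal sketch using Python's standard library; the two height lists are hypothetical, and quartile conventions differ between tools, so the IQR may not match other software exactly.

```python
import statistics


def describe_spread(data):
    """Return the range, IQR and standard deviation of a sample."""
    # statistics.quantiles uses its default "exclusive" method here;
    # other software may place the quartiles slightly differently.
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return {
        "range": max(data) - min(data),  # extent of variation
        "iqr": q3 - q1,                  # typical variation (middle 50%)
        "sd": statistics.stdev(data),    # typical variation (symmetric data)
    }


# Hypothetical heights (cm), invented for this example.
boys = [158, 163, 166, 170, 172, 175, 178, 181, 185, 192]
girls = [157, 160, 162, 163, 165, 166, 167, 169, 171, 174]

if __name__ == "__main__":
    for name, heights in (("boys", boys), ("girls", girls)):
        summary = describe_spread(heights)
        print(name, {key: round(value, 1) for key, value in summary.items()})
```

Comparing the IQR or SD between the two groups, rather than the range, follows the advice to use typical measures of spread when deciding which group has more variation.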
DESCRIBING SHAPE: INFERENCE
SKEW/SYMMETRY
Indicates where the majority of the data lies; students should ponder why that is.
Indicates if there is a tail; students should ponder why that is.
UNIMODAL/BIMODAL
If a dataset truly is bimodal, it could indicate groups within the data. However, students seem to struggle to understand the difference between a unimodal dataset with two towers in and around the peak and truly bimodal data.
DESCRIBING UNUSUAL FEATURES: INFERENCE
OUTLIERS? (Why?)
GAPS? (Why?)
BIVARIATE
Centre: Direction (Positive vs Negative)
Centre: Form (Linear vs Non-Linear)
Spread: Strength
Spread: Scatter (increasing/decreasing/consistent?)
Unusual: Groupings
Unusual: Outliers
Unusual: Gaps
DESCRIBING CENTRE: BIVARIATE
Direction (Positive vs Negative)
Form (Linear vs Non-Linear)
DESCRIBING SPREAD: BIVARIATE
STRENGTH
This is a global statement of how much variation you see in the data as you move through the ages of the cars.
Strong: little variation in price as you move through the ages of cars.
Moderate: moderate variation in price as you move through the ages of cars.
Weak: a lot of variation in price as you move through the ages of cars.
Contextually, the better an indicator of the y-value the x-variable is, the stronger the relationship. For example:
Very strong (often trivial): Customers Served in an Hour vs Average Time Spent per Customer
Moderate: Customers Served in an Hour vs Experience
Weak: Customers Served in an Hour vs Hours Working per Week
No relationship: Customers Served in an Hour vs Height
DESCRIBING SPREAD: BIVARIATE
SCATTER
Is the variation consistent as we increase the age of a car (constant scatter), or is there a reasonable difference (non-constant scatter)?
SO TO SUM UP
Strength: amount of variation as we move along x.
Scatter: consistency in variation as we move along x.
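Strength and scatter can both be read off the residuals of a fitted straight line. The sketch below is one rough way to put numbers on them, not a method given in the presentation: it fits a least-squares line, reports the correlation and overall residual spread (strength), and compares the residual spread in the lower and upper halves of x (scatter). The function name and the car data are hypothetical.

```python
import math
import statistics


def describe_scatterplot(xs, ys):
    """Summarise strength and scatter around a least-squares line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Order the residuals by x, then split into low-x and high-x halves.
    ordered = [r for _, r in sorted(zip(xs, residuals))]
    half = len(ordered) // 2
    return {
        "r": sxy / math.sqrt(sxx * syy),                 # strength
        "residual_sd": statistics.pstdev(residuals),     # strength
        "sd_low_x": statistics.pstdev(ordered[:half]),   # scatter, left half
        "sd_high_x": statistics.pstdev(ordered[half:]),  # scatter, right half
    }


# Hypothetical car ages (years) and prices ($1000s), invented for this example.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prices = [30, 24, 27, 19, 22, 17, 15, 14, 12, 11]

if __name__ == "__main__":
    for key, value in describe_scatterplot(ages, prices).items():
        print(f"{key}: {value:.2f}")
```

A small residual spread overall suggests a strong relationship; similar spreads in the two halves suggest constant scatter, while clearly different spreads suggest non-constant scatter.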
DESCRIBING SHAPE: BIVARIATE
SKEW/SYMMETRY (not assessed in NCEA curriculum)
DESCRIBING UNUSUAL FEATURES: BIVARIATE
OUTLIERS?
GROUPS?
GAPS?
TIME SERIES
Centre: Trend
Spread: Seasonal
Spread: Relative Contributions (From Decomposition Graph)
Unusual: Residuals
DESCRIBING CENTRE: TIME SERIES
TREND
Changes in direction
DESCRIBING SPREAD: TIME SERIES
PEAK AND TROUGH OF RAW DATA
The extent of variability in y over (e.g.) the last 10 years.
SEASONAL VARIATION
Provides an idea of variation in y within a season. Gives us a pattern to explore. Again, it is crucial for us to find explanations.
RELATIVE CONTRIBUTION
Provides an idea of which component (Trend, Seasonal, Residual) contributes more to the variation in y.
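To make "relative contribution" concrete, the sketch below performs a classical additive decomposition (centred moving average for the trend, averaged detrended values for the seasonal effects) and then compares the range of each component. Using the range as the yardstick is an assumption made for this example, the software behind the decomposition graph may work differently, and the quarterly series is invented.

```python
import statistics


def decompose_additive(values, period):
    """Classical additive decomposition; period must be even (e.g. 4 quarters)."""
    half = period // 2
    n = len(values)
    trend = [None] * n  # the first and last `half` points have no trend value
    for t in range(half, n - half):
        window = values[t - half : t + half + 1]
        # Centred moving average: the two end points get half weight.
        trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    # Average the detrended values for each position in the cycle.
    by_position = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_position[t % period].append(values[t] - trend[t])
    raw = [statistics.fmean(group) for group in by_position]
    centre = statistics.fmean(raw)  # shift so seasonal effects average to zero
    seasonal = [raw[t % period] - centre for t in range(n)]
    residual = [values[t] - trend[t] - seasonal[t] for t in range(n) if trend[t] is not None]
    return trend, seasonal, residual


def component_ranges(values, period):
    """Range (max minus min) of each component, as a rough size comparison."""
    trend, seasonal, residual = decompose_additive(values, period)
    known_trend = [x for x in trend if x is not None]
    return {
        "trend": max(known_trend) - min(known_trend),
        "seasonal": max(seasonal) - min(seasonal),
        "residual": max(residual) - min(residual),
    }


# Hypothetical quarterly series (5 years), invented for this example.
sales = [12, 9, 7, 13, 14, 11, 8, 15, 17, 12, 10, 17, 18, 15, 11, 19, 21, 16, 13, 21]

if __name__ == "__main__":
    print(component_ranges(sales, period=4))
```

The component with the largest range is the one contributing most to the variation in y over the period, and the residual range gives a sense of how much is left unexplained.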
DESCRIBING SHAPE: TIME SERIES
Not applicable
DESCRIBING UNUSUAL FEATURES: TIME SERIES
RESIDUALS
VARIATION WHEN CHOOSING A MODEL
WHICH ONE WOULD YOU MODEL WITH A NORMAL DISTRIBUTION?
Reasonable to assume the use of a Normal Distribution for modelling: at such a small sample size, it's only natural that the shape of our variable is yet to settle. It is only logical at this sample size to have some discrepancy from the model.
Very strong evidence that this data is not well modelled by a Normal Distribution, and another distribution should be considered: at such a large sample size, even small discrepancies from the model are concerning.
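The idea that shape "is yet to settle" in small samples can be seen by drawing many samples from a perfectly normal model and measuring how lopsided each one looks. The sketch below uses sample skewness as the measure of lopsidedness; that choice, the sample sizes, and the function names are assumptions made for this example.

```python
import random
import statistics


def sample_skewness(data):
    """Moment-based skewness: 0 for perfectly symmetric data."""
    mean = statistics.fmean(data)
    m2 = statistics.fmean([(x - mean) ** 2 for x in data])
    m3 = statistics.fmean([(x - mean) ** 3 for x in data])
    return m3 / m2 ** 1.5


def typical_skewness(n, repeats=200, seed=4):
    """SD of the skewness across many samples of size n from a normal model."""
    rng = random.Random(seed)
    values = [sample_skewness([rng.gauss(0, 1) for _ in range(n)]) for _ in range(repeats)]
    return statistics.pstdev(values)


if __name__ == "__main__":
    for n in (20, 1000):
        print(f"n = {n:4d}: skewness typically strays by about {typical_skewness(n):.2f}")
```

Even though every sample here truly comes from a normal distribution, small samples often look noticeably skewed by chance, while large samples rarely do. That is why the same visible discrepancy is forgivable at a small sample size and concerning at a large one.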
THREE INTERTWINED PERSPECTIVES
Research about variation suggests that individuals' reasoning about variation can be captured from three perspectives:
A DESIGN PERSPECTIVE that integrates acknowledgement and anticipation of variability in the design of studies (usually surfaces in the Plan section of numerous standards).
A DATA-CENTRIC PERSPECTIVE that integrates the processes of representing, measuring, and describing variation in exploratory data analysis (usually in the Analysis section of all standards).
A MODELING PERSPECTIVE that integrates reasoning for fitting models to patterns of variability in data and statistics, and judging the fit of models (usually in the Bivariate and Distributions standards).
(reference)
DEVELOPING INTUITIVE IDEAS OF VARIABILITY
Recognizing that variability is everywhere.
Some things vary just a little, some vary a lot.
We can try to understand why things vary.
Considering data as an entity, rather than as individual points.
DESCRIBING AND REPRESENTING VARIABILITY
Graphs of data show how things vary and may reveal patterns to help us focus on features of distributions and identify the signal in the noise.
Different graphs may reveal different aspects of the variability in a data set, so it is important to study more than a single graph of a data set.
Different numerical summaries tell us different things about the spread of a data set.
USING VARIABILITY TO MAKE COMPARISONS
When making comparisons of two or more data sets, examining their graphs on the same scale allows us to compare the variability and speculate about why there are differences in the data sets.
Advanced students should be introduced to the concept of a distribution of means and understand it as the basis of confidence interval construction.
RECOGNIZING VARIABILITY IN SPECIAL TYPES OF DISTRIBUTIONS
In a normal distribution, the mean and SD provide useful and specific information about variability.
There is variability in a bivariate data distribution, and we need to consider the variability of both variables as well as the variability of the y values given individual values of x.
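What "useful and specific information" means for a normal model can be shown with Python's standard library: once the mean and SD are known, the proportion of values within any number of SDs of the mean, or above any threshold, follows directly. The function names and the example mean and SD below are hypothetical.

```python
from statistics import NormalDist


def proportion_within(k_sds):
    """Proportion of a normal distribution within k_sds SDs of the mean."""
    standard = NormalDist()  # mean 0, SD 1
    return standard.cdf(k_sds) - standard.cdf(-k_sds)


def probability_above(threshold, mean, sd):
    """Probability that a normally distributed value exceeds threshold."""
    return 1 - NormalDist(mean, sd).cdf(threshold)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} SD of the mean: {proportion_within(k):.1%}")
    # Hypothetical model for student heights: mean 167 cm, SD 8 cm.
    print(f"P(height > 170 cm) = {probability_above(170, 167, 8):.2f}")
```

Roughly 68% of a normal distribution lies within one SD of the mean and roughly 95% within two, which is why a single-number answer such as "the probability of being taller than 170 cm" is really a statement about variation.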
IDENTIFYING PATTERNS OF VARIABILITY IN FITTING MODELS
There is variability involved in fitting models and judging the fit of models (e.g., fitting the normal curve to a distribution of data, or fitting a straight line to a scatterplot of bivariate data).
The variability of the deviations from the model (residuals) can tell us about how well the model fits the data.