
Understanding Associations in Data Analysis
Explore the concept of association in data analysis, covering bivariate relationships and quantitative, semiquantitative, and qualitative variables. Learn about covariance, the Pearson correlation coefficient, the Spearman coefficient, and indices such as chi-square and Cramér's V. Understand how to analyze associations between variables of different scales and recognize nonlinear relationships.
Theme 5. Association
1. Introduction.
2. Bivariate tables and graphs.
3. Quantitative variables: covariance, Pearson correlation coefficient, variance-covariance matrix and correlation matrix.
4. Semiquantitative variables: Spearman coefficient.
5. Qualitative variables: chi-square and Cramér's V indices.
6. Association between variables of different scales.
7. Concept of nonlinear relationships.
Introduction. So far we have focused on measures of central tendency, variability, skewness and kurtosis of a single variable. However, in practice it is common to examine two or more variables together (e.g., the relationship between performance and intelligence). Here we will focus on the relationship between two variables (from n paired observations) and calculate, in particular, an index that gives us the degree of relationship between the two variables: the Pearson coefficient of linear correlation.
Graphical representation. [Three scatterplots of performance vs. IQ: negative linear relation, no relation, positive linear relation.] Note: the Pearson correlation coefficient measures linear correlation.
Graphical representation. [Two scatterplots of performance vs. IQ: nonlinear relation, linear relation.] Note: the Pearson correlation coefficient measures linear correlation.
Graphical representation. [Three scatterplots of performance vs. IQ: perfect linear relation, strong linear relation, weak linear relation.] Now we need an index that reports the extent to which X and Y are related, and whether the relationship is positive or negative.
Covariance and Pearson's index. Scenario 1 (positive linear relationship): when X is above its mean, Y is typically above its mean. Scenario 2 (negative linear relationship): when X is above its mean, Y is typically below its mean. [Two scatterplots of performance vs. intelligence illustrating the two scenarios.]
Covariance. Here is the formula: $s_{xy} = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n}$. In scenario 1 the covariance will be positive, and in scenario 2 the covariance will be negative. Therefore the covariance gives us an idea of whether the relationship between X and Y is positive or negative. Problem: the covariance is not a bounded index (e.g., how do we interpret a covariance of 6 in terms of the degree of association?), and it does not account for the variability of the variables. So we use another index.
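A minimal Python sketch of this formula (not from the original slides; the IQ and performance values below are invented purely for illustration):

```python
# Population covariance s_xy: average product of deviations from the means.
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Hypothetical IQ / performance data for illustration only.
iq = [90, 100, 110, 120, 130]
performance = [4.0, 5.5, 6.0, 7.5, 8.0]
print(covariance(iq, performance))  # positive, so the linear relation is positive
```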
Pearson coefficient. The Pearson correlation coefficient: $r_{xy} = \dfrac{s_{xy}}{s_x s_y} = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n\, s_x s_y}$.
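To show how the coefficient combines the covariance with the two standard deviations, here is a hedged sketch that reuses the same invented IQ/performance data from the covariance example:

```python
# Pearson's r = covariance divided by the product of the (population) SDs.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    s_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    s_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return s_xy / (s_x * s_y)

iq = [90, 100, 110, 120, 130]
performance = [4.0, 5.5, 6.0, 7.5, 8.0]
print(round(pearson_r(iq, performance), 3))  # close to +1: strong positive linear relation
```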
Properties of Pearson's r. Property 1: the Pearson correlation index lies between -1 and +1. A Pearson correlation index of -1 indicates a perfect negative linear relationship. An index of +1 indicates a perfect positive linear relationship. An index of 0 indicates no linear relationship. (Notice that a value close to 0 does not rule out some kind of nonlinear relationship: the Pearson index only measures linear relationships.)
Properties of Pearson's r. Property 2: the Pearson correlation index (in absolute value) does not change when we apply a linear transformation to the variables. For example, the Pearson correlation between temperature (in degrees Celsius) and the level of depression is the same as the correlation between temperature (measured in degrees Fahrenheit) and the level of depression.
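A quick illustrative check of Property 2, assuming made-up temperature and depression scores and using NumPy's corrcoef:

```python
# Celsius -> Fahrenheit is a linear transformation, so |r| is unchanged.
import numpy as np

celsius = np.array([5.0, 12.0, 18.0, 25.0, 31.0])
depression = np.array([14.0, 12.0, 9.0, 7.0, 6.0])
fahrenheit = celsius * 9 / 5 + 32  # linear transformation of X

r_c = np.corrcoef(celsius, depression)[0, 1]
r_f = np.corrcoef(fahrenheit, depression)[0, 1]
print(r_c, r_f)  # identical values (the transformation has a positive slope)
```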
More on Pearson's r: Interpretation. We have to consider what we are measuring in order to interpret the strength of the relationship between the variables under study. In any case, it is very important to draw a scatterplot. For example, in the scatterplot of performance vs. IQ on the left it is clear that there is no relationship between intelligence and performance. However, if we calculate the Pearson correlation index it will give a very high value, caused by the atypical score in the top right corner.
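The effect of a single atypical score can be simulated with invented data; the sketch below assumes IQ and performance are unrelated by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(100, 10, 30)
performance = rng.normal(6, 1, 30)           # unrelated to IQ by construction

r_without = np.corrcoef(iq, performance)[0, 1]
iq_out = np.append(iq, 160)                  # one atypical score, top-right corner
perf_out = np.append(performance, 10)
r_with = np.corrcoef(iq_out, perf_out)[0, 1]
print(round(r_without, 2), round(r_with, 2))  # the outlier inflates r considerably
```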
More on Pearson's r: Interpretation (2). It is important to note that "correlation does not imply causation". The fact that two variables are highly correlated does not imply that X causes Y or that Y causes X.
More on Pearson's r: Interpretation (3). It is important to note that the Pearson correlation coefficient may be affected by third variables. For example, if we went to a school, measured height and gave a test of numerical ability, the taller children would also show more numerical ability... of course, that may simply be because the older children are taller than the younger ones. If this "third variable" (age) is controlled for (by means of "partial correlation"), there will hardly be any relationship between height and numerical ability. [Scatterplot of numerical ability vs. height, with groups of children aged 6 to 14 years.] There are many cases where a third variable is the cause of a high relationship between X and Y (and it is often difficult to identify).
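As a sketch of how partial correlation removes the effect of a third variable, here is the standard first-order formula (not given on the slides) applied to hypothetical school data where age drives both height and ability:

```python
import numpy as np

def partial_corr(x, y, z):
    # First-order partial correlation of x and y controlling for z.
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical data: age drives both height and numerical ability.
age = np.array([6, 6, 8, 8, 10, 10, 12, 12, 14, 14], dtype=float)
height = 80 + 7 * age + np.array([2, -1, 1, -2, 0, 3, -3, 1, -1, 2], dtype=float)
ability = 10 + 3 * age + np.array([-1, 2, 0, 1, -2, 1, 2, -1, 0, -2], dtype=float)

print(round(np.corrcoef(height, ability)[0, 1], 2))   # high zero-order correlation
print(round(partial_corr(height, ability, age), 2))   # much smaller once age is controlled
```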
More on Pearson's r: Interpretation (4). The value of the Pearson coefficient depends in part on the variability of the group. If we compute the Pearson coefficient between intelligence and performance with all subjects, its value is quite high. However, if we use only the individuals with low IQ (or only those with high IQ) and calculate the correlation with performance, the value of the Pearson coefficient will be considerably lower. [Scatterplot of performance vs. IQ, with the low-IQ and high-IQ ranges marked.] A heterogeneous group will show a greater degree of relationship between the variables than a homogeneous group.
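Range restriction can also be illustrated with simulated data; the numbers below are assumptions for the sketch, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
iq = rng.normal(100, 15, 500)
performance = 0.05 * iq + rng.normal(0, 0.6, 500)    # linear relation plus noise

r_full = np.corrcoef(iq, performance)[0, 1]
low = iq < 90                                        # keep only the low-IQ subjects
r_low = np.corrcoef(iq[low], performance[low])[0, 1]
print(round(r_full, 2), round(r_low, 2))             # r in the restricted group is clearly smaller
```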
5.4 Other coefficients. Of course, it is possible to obtain measures of the degree of relationship between variables when they are not quantitative. The case in which the variables X and Y are ordinal: remember, when we have variables on an ordinal scale we can establish an order between the values, but we do not know the distances between values. (If we knew the distances between the values we would have at least an interval scale.) We can calculate the Spearman correlation coefficient or Kendall's coefficient. (We will see the first one.)
Spearman's rank correlation coefficient. What we have are two sequences of ordinal values (ranks). The Spearman coefficient is a special case of the Pearson correlation coefficient: $r_s = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the rank of subject i on X and the rank of subject i on Y.
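A minimal Python sketch of this formula, assuming untied ranks (the rank values are invented; with tied ranks, scipy.stats.spearmanr is the safer choice):

```python
from scipy import stats

def spearman_from_ranks(rank_x, rank_y):
    # r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when there are no ties.
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rank_x = [1, 2, 3, 4, 5]            # hypothetical ordinal positions on X
rank_y = [2, 1, 3, 5, 4]            # positions of the same subjects on Y
print(spearman_from_ranks(rank_x, rank_y))   # 0.8
print(stats.spearmanr(rank_x, rank_y)[0])    # same value
```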
Spearman's rank correlation coefficient (properties). First: it is bounded, like the Pearson coefficient, between -1 and +1. A Spearman coefficient of +1 means that whoever is first in X is first in Y, whoever is second in X is second in Y, etc. A Spearman coefficient of -1 means that whoever is first in X is last in Y, etc. Second: its calculation is simpler than that of the Pearson correlation coefficient. However, with computers this is irrelevant these days...
5.5 Qualitative variables: the χ² test as a measure of association. The chi-square test is a nonparametric test used to measure the association between two variables when we have contingency tables. It is also used, more generally, to assess the divergence between observed (empirical) scores and predicted (theoretical) scores. Generally, the chi-square statistic is obtained as: $\chi^2 = \sum \dfrac{(f_e - f_t)^2}{f_t}$, where $f_e$ are the empirical frequencies and $f_t$ are the theoretical frequencies.
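A tiny sketch of the statistic as defined above, with made-up empirical and theoretical frequencies:

```python
# Chi-square: sum over cells of (empirical - theoretical)^2 / theoretical.
def chi_square(f_e, f_t):
    return sum((e - t) ** 2 / t for e, t in zip(f_e, f_t))

observed = [30, 14, 34, 45, 57, 20]    # hypothetical empirical frequencies
expected = [20, 20, 30, 50, 60, 20]    # hypothetical theoretical frequencies
print(round(chi_square(observed, expected), 2))
```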
The χ² test as a measure of association: the case of two qualitative variables. The empirical frequencies are those we have in the contingency table. Now, how do we compute the theoretical frequencies? The process is simple: if both variables are independent, the theoretical frequency of each cell is the result of multiplying the sum of the frequencies of its row by the sum of the frequencies of its column, divided by N. The statistic is then $\chi^2 = \sum \dfrac{(f_e - f_t)^2}{f_t}$. To calculate chi-square from crosstabs on the Internet: http://vassarstats.net/newcs.html
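The following sketch builds the theoretical frequencies from an invented 2x3 contingency table exactly as described (row total times column total, divided by N), and checks the result against scipy.stats.chi2_contingency:

```python
import numpy as np
from scipy import stats

table = np.array([[20, 30, 50],
                  [30, 20, 50]])               # empirical frequencies

row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
n = table.sum()
expected = row_totals @ col_totals / n         # theoretical frequencies under independence
chi2 = ((table - expected) ** 2 / expected).sum()

# scipy computes the same statistic (no Yates correction is applied here, since dof > 1).
chi2_scipy, p, dof, exp_scipy = stats.chi2_contingency(table)
print(round(chi2, 2), round(chi2_scipy, 2))
```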
χ² as a measure of association: derived coefficients and interpretation. From the chi-square statistic, a number of measures of association between variables can be derived. They quantify the strength of the relationship between two variables. Case of 2x2 tables: the phi coefficient, $\phi = \sqrt{\dfrac{\chi^2}{n}}$. This index is interpreted analogously to the Pearson coefficient.
The χ² test as a measure of association: other coefficients. If we have more than 2 rows or columns: Cramér's index, $V = \sqrt{\dfrac{\chi^2}{n \cdot m}}$, where m is the smaller of (number of rows - 1) and (number of columns - 1). This index is interpreted similarly to Pearson's r (except for the issue of the sign; V is always positive). Note that if the table is 2x2 this index matches the phi index (see the previous slide).
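A short sketch of Cramér's V built on the chi-square statistic described above (the 2x2 and 3x4 tables are hypothetical); for the 2x2 table the value coincides with the phi coefficient:

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    # V = sqrt(chi2 / (n * m)), m = min(rows - 1, columns - 1).
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    m = min(table.shape[0] - 1, table.shape[1] - 1)
    return np.sqrt(chi2 / (n * m))

table_2x2 = np.array([[30, 10],
                      [15, 25]])
table_3x4 = np.array([[10, 20, 15, 5],
                      [12, 8, 20, 10],
                      [25, 10, 5, 10]])

# For a 2x2 table, m = 1, so Cramér's V equals the phi coefficient (in absolute value).
print(round(cramers_v(table_2x2), 2))
print(round(cramers_v(table_3x4), 2))
```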