Understanding Linear Regression Analysis: Testing for Association Between X and Y Variables
The provided images and text explain the process of testing for association between two quantitative variables using Linear Regression Analysis. It covers topics such as estimating slopes for Least Squares Regression lines, understanding residuals, conducting T-Tests for population regression lines, and checking conditions for statistical validity. The conclusion involves determining whether there is sufficient evidence to confirm a positive or negative linear association between the X and Y variables based on the analysis results.
Presentation Transcript
Is there an association between X and Y in either of these plots? What would the slopes be for the LSR lines?
[Figure: scatterplots (A) and (B)]
LinReg T-Test
* We want to test whether there is an association between two quantitative variables (x and y).
* We look at a SAMPLE of data on a scatterplot and estimate the true association (+ or -) of the POPULATION plot.
* LSR line (sample): $\hat{y} = b_0 + b_1 x$
* Population model (that we don't know): $\mu_y = \beta_0 + \beta_1 x$
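As a minimal sketch of how the sample LSR line is estimated, the snippet below computes the least-squares slope and intercept from paired data. The function name `least_squares_line` and the five data points are made up for illustration; they do not come from the slides.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical sample data (not from the presentation)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

The sample slope estimates the unknown population slope, which is the quantity the LinReg t-test asks about.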
Residuals (errors): Independent, Normally distributed N(0, Se)
Se = standard deviation of the residuals = the average y-distance of the points from the LSR line (from the predictions)
Interpretation: On average, the actual y-data is _______ y-units away from the predicted values.
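The standard deviation of the residuals can be sketched as follows, assuming the usual definition Se = sqrt(SSE / (n - 2)). The data and the fitted line (intercept 0.05, slope 1.99) are hypothetical values chosen for illustration, not numbers from the slides.

```python
import math

def residual_standard_deviation(xs, ys, intercept, slope):
    """Se = sqrt(sum of squared residuals / (n - 2))."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r * r for r in residuals)
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data and its least-squares line (assumed for illustration)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

s_e = residual_standard_deviation(xs, ys, b0, b1)
print(f"Se = {s_e:.3f} y-units")
```

The printed value fills the blank in the interpretation sentence: it is the typical distance, in y-units, between the actual y-values and the predictions.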
[Figure: two scatterplots, one with small Se and one with large Se]
T-Test: Testing the slope of the population regression line
Hypotheses:
Ho: $\beta_1 = 0$ (there is no linear association between X and Y)
Ha: $\beta_1 > 0$, $\beta_1 < 0$, or $\beta_1 \neq 0$ (there is a linear association between X and Y)
Conditions (and how to check each one):
1) Random
2) Linear data: scatterplot is linear with no outliers
3) Independence: each observation is independent of the others
4) Normal residuals: Normal probability plot of the residuals is linear (or histogram is symmetric)
5) Equal variance: residual plot shows no change in the spread of the residuals
Conditions met, so use a LinReg t-test.
Mechanics:
Test statistic: $t = \dfrac{\text{statistic} - \text{parameter}}{\text{std. dev. of statistic (SE)}}$
P-Value: P(t > or < test statistic) = tcdf(LB, UB, df)
df = n - 2
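A rough sketch of the "with actual data" mechanics: the slope's standard error is taken here as Se / sqrt(sum of squared x-deviations), which is the standard formula but is not spelled out on the slides. The function name `slope_t_statistic` and the data are hypothetical.

```python
import math

def slope_t_statistic(xs, ys, hypothesized_slope=0.0):
    """Return (t, df) for testing the population slope against a hypothesized value."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Standard deviation of the residuals, with n - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    # Standard error of the slope (assumed standard formula)
    se_slope = s_e / math.sqrt(sxx)
    t = (slope - hypothesized_slope) / se_slope
    return t, n - 2

# Hypothetical sample data (not from the presentation)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

t_stat, df = slope_t_statistic(xs, ys)
print(f"t = {t_stat:.2f}, df = {df}")
```

The resulting t and df are what would be passed to tcdf on the calculator to get the P-value.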
Conclusion: Reject / Fail to reject Ho .... We have sufficient/insufficient evidence that there is a positive/negative linear association between x-variable and y-variable.
Confidence Interval: conditions met, so use a LinReg T Interval.
Formula: statistic ± (critical value)(SE) = (a, b)
Get the t* from the INVT program.
Sentence: We are ____% confident that as X increases by 1 X unit, Y increases/decreases by between a and b Y units.
Example: AIRFARES data (distance traveled vs. airfare paid). Assume we are told that a 95% confidence interval for the slope was (0.05435, 0.1804).
Interpret: We are 95% confident that as the distance of the flight increases by 1 mile, the airfare increases by between $0.05435 and $0.1804.
2 ways to do the mechanics of the test:
1) With actual data
2) With computer output
NOTE: to check conditions, they will give you all the plots.

Multiple Regression Model of HusbandsAndWives
Response attribute (numeric): Age_Wife

Predictor     Coefficient   Std Error   t Statistic   P Value
Constant      1.5740        1.1501      1.369         0.1730
Age_Husband   0.9112        0.0259      35.249        0.0000

R2: 0.8809
Regression Equation: Age_Wife = 1.5740 + 0.9112 Age_Husband
R-Squared: 0.880894
Adjusted R-Squared: 0.880186
Standard Deviation of the Error: 3.95101
Example: Does a relationship exist between High School GPA and freshman year performance in college? A random sample of 40 freshmen at a local college was taken and their HS GPA and the GPA from their first full year were recorded. Hypotheses:
Multiple Regression Model of Sample of SATGPA
Response attribute (numeric): FYGPA

Predictor   Coefficient   Std Error   t Statistic   P Value
Constant    -1.0149       0.6720      -1.510        0.1392
HSGPA       1.0903        0.2045      5.333         0.0000

R2: 0.4280
Regression Equation: FYGPA = -1.0149 + 1.0903 HSGPA
R-Squared: 0.428036
Adjusted R-Squared: 0.412985
Standard Deviation of the Error: 0.626502
MECHANICS: (use the SATGPA computer output above: HSGPA coefficient 1.0903, Std Error 0.2045, t = 5.333, P = 0.0000)
CONCLUSION:
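One way to sketch the mechanics from the SATGPA computer output is shown below. The slope (1.0903), its standard error (0.2045), and n = 40 come from the example; the two-sided alternative is an assumption, since the slide leaves the hypotheses blank. The helper `t_cdf` is a hypothetical stand-in for the calculator's tcdf, built with Simpson's rule on the t density rather than any particular library.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about 0."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

# Values read from the SATGPA output
slope, se_slope, n = 1.0903, 0.2045, 40

t_stat = (slope - 0) / se_slope   # statistic minus parameter, over SE
df = n - 2
# Two-sided P-value (assumes Ha: slope != 0)
p_value = 2 * (1 - t_cdf(abs(t_stat), df))

print(f"t = {t_stat:.3f}, df = {df}, P-value = {p_value:.2e}")
```

The recomputed t matches the 5.333 in the output up to rounding, and the P-value is small enough to display as 0.0000 at four decimal places.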
p. 675 #17 If you reject Ho, complete an appropriate confidence interval
p. 674 #5 You will need to look at #3 for context/data
Conditions:
1) Random - assume 2005 movies are representative of all movies
2) Linear data - the scatterplot is weakly linear with 1 possible outlier
3) Independence - each movie is independent of the others
4) Normal residuals - the Normal probability plot of the residuals looks approximately linear
5) Equal variance - the residual plot shows no change in the spread of the residuals but does show the one possible outlier
Conditions met, so use a LinReg T-Interval.
0.7144 ± (1.98)(0.1541) = (0.4094, 1.0194)
df = 118
We are 95% confident that for every increase of 1 minute in the run time of a movie, the budget increases by between 0.4094 and 1.0194 million dollars.
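The interval arithmetic can be sketched in a few lines. The slope (0.7144), critical value (1.98), and standard error (0.1541) are the values quoted on the slide; the function name `slope_confidence_interval` is made up for illustration.

```python
def slope_confidence_interval(slope, t_star, se_slope):
    """statistic +/- (critical value)(SE), returned as (low, high)."""
    margin = t_star * se_slope
    return slope - margin, slope + margin

# Values quoted in the movie run-time vs. budget example
low, high = slope_confidence_interval(0.7144, 1.98, 0.1541)
print(f"({low:.4f}, {high:.4f}) million dollars per minute")
```

With these rounded inputs the endpoints agree with the slide's (0.4094, 1.0194) to about three decimal places; the small difference is presumably due to rounding of t* or the SE.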
Complete #13: For letter (b), don't check all conditions, just say what condition isn't met. Then go on to letter (c), which tells you to do the test.
13)
(a) Ho: $\beta_1 = 0$   Ha: $\beta_1 < 0$
(b) No, the "Normal residuals" condition is not satisfied because the histogram of the residuals is not approximately Normal. It is right skewed.
(c) Conditions not met, proceeding anyway with a LinReg t-test.
t = -0.02996 / 0.0043 = -7.04
P(t < -7.04) = 8.907 x 10^-8
df = 26
We reject Ho because the p-value of 8.907 x 10^-8 < α = 0.05. We have sufficient evidence that there is a negative linear association between the year and the difference between the marital ages of men and women. The data suggest there has been a decrease in the difference in ages at first marriage since 1975.