
Linear Regression Analysis and Correlation Techniques
Learn about correlation and regression analysis in statistics for business, where you explore the relationship between variables through scatter plots, examine patterns, identify outliers, and calculate correlation coefficients to measure the strength of associations. Understand the properties of correlation coefficients and interpret the results using empirical rules.
Presentation Transcript
Statistics for Business STAT130 Unit 8: Correlation and Regression Analysis
Chapter 13 Simple Linear Regression Analysis
Introduction In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more quantitative variables exists. Correlation and regression are the most commonly used techniques for investigating the relationship between two or more variables. Correlation is a statistical method used to determine whether a relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.
Relationships Between Variables Dependent Variable: measures an outcome of a study and is sometimes called the response variable. Independent Variable: explains or causes changes in the response variable and is sometimes called the explanatory or predictor variable. A scatter plot shows the relationship between two variables. Always plot the independent variable on the horizontal axis and the dependent variable on the vertical axis.
Examining Relationships In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern by its: Form: describes the type of trend between X and Y (linear, quadratic, exponential). Direction: describes the direction of the trend, upward (positive) or downward (negative). Strength: measures the amount of scatter around the general trend. An important kind of deviation is an outlier, an individual that falls outside the overall pattern of the relationship.
Correlation Coefficient The coefficient of correlation, denoted \(\rho\) in the population and \(r\) in the sample, is used to measure the strength of the linear association between two quantitative variables. The sample coefficient of correlation is \(r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}\). Correlation in MegaStat: MegaStat > Correlation / Regression > Correlation Matrix
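The sample correlation formula can be sketched in a few lines of Python. This is only an illustration: the function name `correlation` and the five data points are made up for the example and do not come from the course worksheets.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 4))
```

For these made-up points SS_xy = 6, SS_xx = 10 and SS_yy = 6, so r is 6 divided by the square root of 60, a fairly strong positive association.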
Properties of r \(-1 \le r \le 1\). \(r < 0\): negative linear association. \(r > 0\): positive linear association. \(r = \pm 1\): perfect linear relationship. r is independent of units. Empirical rule to interpret r: \(|r|\) close to 1: strong linear association; \(|r|\) close to 0.5: moderate linear association; \(|r|\) close to 0: weak or no linear association.
Testing the significance of \(\rho\) We can test to see if the correlation is significant using the hypotheses \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\). The test statistic is \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), which follows a t-distribution with n-2 degrees of freedom. MegaStat produces 95% and 99% critical limits for \(\rho\). If r falls outside these limits then \(H_0\) is rejected at the corresponding significance level.
Example A car dealer wants to find whether there is a linear relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected, and the data recorded. There is a strong negative linear relationship. [Scatter plot: Price (vertical axis) versus Odometer (horizontal axis)]
Example Compute and interpret the coefficient of correlation between odometer and price. r = -0.806: strong negative correlation (linear relationship). r falls outside the 99% critical limits (\(\pm 0.256\)), which means that there is a significant linear relationship between the odometer reading and the price of the car (\(\rho \neq 0\)).
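As a rough check, the t statistic for testing the correlation can be computed directly from the values quoted in this example (r = -0.806, n = 100). The function name `correlation_t_statistic` is chosen here for illustration only.

```python
import math

def correlation_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Values quoted in the odometer example.
t = correlation_t_statistic(-0.806, 100)
print(round(t, 2))
```

The result is about -13.5, which agrees with the slope test statistic t = -13.495 reported later for the same data up to the rounding of r.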
Simple Linear Regression Regression is used to predict the value of one variable (the dependent variable, y) based on the value of other variables (independent variables \(x_1, x_2, \ldots, x_k\)). The objective of regression analysis is to build a regression model that can be used to describe, predict and control the dependent variable on the basis of the independent variable. Examples: Relationship between odometer reading (X) and a used car's selling price (Y). Relationship between years of experience (X) and the salary of an accountant (Y).
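The Model The simple linear regression model is \(y = \beta_0 + \beta_1 x + \varepsilon\), where y = dependent variable; x = independent variable; \(\mu_{y|x} = \beta_0 + \beta_1 x\) = the mean value of y given x; \(\beta_0\) = y-intercept; \(\beta_1\) = slope of the line; \(\varepsilon\) = error variable. \(\beta_0\) and \(\beta_1\) are called regression parameters; \(b_0\) is the estimate of \(\beta_0\) and \(b_1\) is the estimate of \(\beta_1\).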
Model Assumptions 1) Constant Variance Assumption: At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. 2) Normality Assumption: At any given value of x, the population of potential error term values has a normal distribution. 3) Independence Assumption: Any one value of the error term \(\varepsilon\) is statistically independent of any other value of \(\varepsilon\).
Estimating the coefficients The regression equation that estimates the equation of the first order linear model is \(\hat{y} = b_0 + b_1 x\). The estimates of the coefficients are \(b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\) and \(b_0 = \bar{y} - b_1 \bar{x}\). Regression in MegaStat: MegaStat > Correlation / Regression > Regression Analysis
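The least squares estimates can be sketched as follows. The helper name `fit_line` and the data are hypothetical and serve only to show the arithmetic behind b0 and b1.

```python
def fit_line(xs, ys):
    """Return (b0, b1) with b1 = SS_xy / SS_xx and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 2))
```

With these points the fitted line is y-hat = 2.2 + 0.6x, since SS_xy = 6, SS_xx = 10, the mean of x is 3 and the mean of y is 4.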
Interpretation of Regression Coefficients The intercept, \(b_0\), is the estimated average value of y when the value of x is zero. The slope, \(b_1\), is the estimated change in the average value of y as a result of a one-unit increase in x.
Example A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. Find and plot the regression line. [Scatter plot of Price versus Odometer with fitted line \(\hat{y} = -0.031x + 6{,}533.383\), \(R^2 = 0.650\)]
Example The estimated regression equation is \(\hat{y} = 6533 - 0.0312x\). Interpretation of Regression Coefficients: The intercept is \(b_0 = 6533\). Do not interpret the intercept as the price of cars that have not been driven (Why?). The slope is \(b_1 = -0.0312\) dollars per mile: for each additional mile on the odometer, the price decreases by an average of $0.0312.
Testing the Significance of the Slope A regression model is not likely to be useful unless there is a significant relationship between x and y. To test significance, we use the null hypothesis \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). The test statistic is \(t = \frac{b_1}{s_{b_1}}\), where \(s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}\), which follows a t-distribution with df = n-2. This test is equivalent to testing whether the correlation coefficient equals zero.
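A minimal sketch of the slope test statistic follows. It assumes the usual definition of the standard error, s = sqrt(SSE / (n - 2)), which the slides do not spell out; the function name and the data are hypothetical.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b1, s_b1, t) for the test of H0: beta_1 = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # Assumed definition: s = sqrt(SSE / (n - 2)).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    s_b1 = s / math.sqrt(ss_xx)
    return b1, s_b1, b1 / s_b1

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, s_b1, t = slope_t_statistic(x, y)
print(round(t, 4))
```

The resulting t would then be compared with a t-distribution on n - 2 = 3 degrees of freedom. It equals the value obtained from r*sqrt(n-2)/sqrt(1-r^2) for the same data, which illustrates the equivalence of the two tests.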
Example: Testing the Slope Test whether the odometer reading of a car and its price are linearly related. Hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). Test statistic: t = -13.495. P-value \(\approx 0\). Conclusion: There is overwhelming evidence to infer that the odometer reading affects the auction selling price, i.e. reject \(H_0: \beta_1 = 0\).
An F Test for Model Test validity of the model. For simple regression, this is another way to test the null hypothesis \(H_0: \beta_1 = 0\). The F test tests the significance of the overall regression relationship between x and y and is given in the ANOVA table in the regression output. The test statistic is the square of the test statistic in the t-test, and the p-value is the same.
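As a quick worked check of the relationship between the two tests, squaring the t statistic reported for the odometer example gives the corresponding F statistic. The F value here is derived from the reported t and is not read from the course's ANOVA output.

```latex
F = t^2 = (-13.495)^2 \approx 182.1
```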
The Simple Coefficient of Determination (\(r^2\)) For the simple linear regression model: Total variation = \(\sum (y_i - \bar{y})^2\); Explained variation = \(\sum (\hat{y}_i - \bar{y})^2\); Unexplained variation = \(\sum (y_i - \hat{y}_i)^2\); and \(\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2\). The simple coefficient of determination is \(r^2 = \frac{\text{Explained variation}}{\text{Total variation}}\).
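The decomposition of the total variation can be verified numerically. The sketch below uses hypothetical data and a helper name, `variation_parts`, chosen for this example.

```python
def variation_parts(xs, ys):
    """Return (total, explained, unexplained) variation for a least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((f - y_bar) ** 2 for f in fitted)
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return total, explained, unexplained

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
total, explained, unexplained = variation_parts(x, y)
r_squared = explained / total
print(round(total, 2), round(explained, 2), round(unexplained, 2), round(r_squared, 2))
```

For these points the total variation of 6 splits into 3.6 explained and 2.4 unexplained, so the line accounts for 60% of the variation in y.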
The Simple Coefficient of Determination (\(r^2\)) The simple coefficient of determination (\(r^2\)) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X). It is the square of the coefficient of correlation (r). \(0 \le r^2 \le 1\). \(r^2 = 1\): Perfect match between the line and the data. \(r^2 = 0\): There is no linear relationship between x and y. It does not give any information on the direction of the relationship between the variables. The larger the value of \(r^2\), the better the fit is.
Example Find the simple coefficient of determination for the odometer example; what does this statistic tell you about the model? From the regression output, \(r^2 = 0.65\): 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
Using the Regression Equation Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to make predictions for y. Example Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer. \(\hat{y} = 6533 - 0.0312x = 6533 - 0.0312(40{,}000) = 5{,}285\), that is, $5,285.
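The same point prediction can be written as a small function. The coefficients are the rounded estimates quoted in the example; the function name `predict_price` is hypothetical.

```python
def predict_price(odometer_miles):
    """Point prediction from the fitted line: y-hat = 6533 - 0.0312 * x."""
    return 6533 - 0.0312 * odometer_miles

price = predict_price(40_000)
print(round(price))
```

Predictions like this are only reasonable for odometer readings within the range of the sampled cars, which is also why the intercept is not read as the price of an undriven car.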
Confidence and Prediction Intervals There are two different intervals for the response variable: Confidence interval for the mean response: What is the mean response, \(\mu_{y|x}\), for a given value, \(x_0\), of the predictor variable? Prediction interval for an individual value of the response: What would one predict a new observation, y, to be for a given value, \(x_0\), of the predictor variable? The point estimate for both inferences is the value of \(\hat{y}\) for the specified value of \(x_0\).
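Confidence and Prediction Intervals A \((1-\alpha)100\%\) confidence interval for the mean value of y when \(x = x_0\) is \(\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}\). A \((1-\alpha)100\%\) prediction interval for an individual value of y when \(x = x_0\) is \(\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}\). Here s is the standard error and \(t_{\alpha/2}\) is based on (n-2) degrees of freedom.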
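Both intervals can be sketched in Python under two assumptions that are not stated on the slides: the standard error is taken as s = sqrt(SSE / (n - 2)), and the critical value is supplied by the caller from a t table. The function name `intervals` and the data are hypothetical.

```python
import math

def intervals(xs, ys, x0, t_crit):
    """Return (y_hat, confidence_interval, prediction_interval) at x0.

    t_crit is t_{alpha/2} with n - 2 degrees of freedom, looked up by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # Assumed definition of the standard error: s = sqrt(SSE / (n - 2)).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x0
    distance = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s * math.sqrt(distance)
    pi_half = t_crit * s * math.sqrt(1 + distance)
    return (y_hat,
            (y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))

# Hypothetical data; 3.182 is the table value t_{0.025} for 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat, ci, pi = intervals(x, y, x0=4, t_crit=3.182)
print(round(y_hat, 2), [round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

The prediction interval is always wider than the confidence interval at the same x0 because of the extra 1 under the square root, which accounts for the variability of a single new observation around the mean.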
Confidence and Prediction Intervals A prediction interval is intended to trap a new observation of the dependent variable given values of the independent variables, while the confidence interval is intended to trap the mean of the dependent variable given values of the independent variables. In MegaStat: In the regression window, choose "Type in predictor values" and enter the predictor values for the independent variable.
Example: Prediction The car dealer wants to bid on a lot of 250 Ford Taurus cars, where each car has been driven for about 40,000 miles. The dealer needs to estimate the mean price per car. The 95% confidence interval is ($5252, $5322). Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer. The dealer would like to predict the price of a single car. The 95% prediction interval is ($4984, $5590).
Exercises 1) The president of a company that manufactures car seats has been concerned about the number and cost of machine breakdowns. The problem is that the machines are old and becoming quite unreliable. However, the cost of replacing them is quite high and the president is not certain that the cost can be made up in today's slow economy. To help make a decision about replacement, he gathered data about last month's costs for repairs and the ages (in months) of the plant's 20 welding machines (worksheet: Repair).
Exercises a) Find the sample regression line. b) Interpret the coefficients. c) Determine the coefficient of determination and discuss what this statistic tells you. d) Test at 5% significance level whether the age of a machine and its repair cost are linearly related. e) Find a 95% prediction interval for the monthly repair cost of a welding machine that is 120 months old. f) Find a 95% confidence interval for the average monthly repair cost of welding machines that are 120 months old.
Exercises 2) Ten cars between 1 and 6 years old were randomly selected from the classified ads. The data were obtained (Worksheet: Cars), where x denotes age, in years, and y denotes price, in hundreds of dollars. a) Develop a scatter plot and describe the relationship between the price and the age of the cars. b) Compute the correlation coefficient. c) Determine the regression equation for the data. d) Interpret carefully the regression coefficients.
Exercises e) Compute and interpret the coefficient of determination, \(r^2\). f) Does the age of the car seem a good predictor for its price? Test the appropriate hypothesis at \(\alpha = 0.05\). g) Obtain a point prediction for the mean price of all 4-year-old cars. h) Obtain a 95% confidence interval for the mean price of all 4-year-old cars. i) Obtain a point prediction for the mean price of all 12-year-old cars. Comment on the accuracy of this prediction.
Exercises 3) A fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in the suburb is selected. The amount of damage (in $1,000) and the distance (in miles) between the fire and the nearest fire station are recorded for each fire. a) Develop a scatter plot and describe the relationship between the distance and the damage. b) Find and interpret the correlation coefficient.
Exercises c) Determine the regression equation for the data. d) Interpret carefully the regression coefficients. e) Compute and interpret \(R^2\). f) Does the distance seem a good predictor for the damage in the burning house? Test the appropriate hypothesis at \(\alpha = 0.05\). g) Obtain a point estimate for the mean damage of all houses on fire that are 4 miles away from the nearest fire station. h) Obtain a 90% prediction interval for the damage of a house on fire that is 4 miles away from the nearest fire station.
Chapter 14 Multiple Regression
The Multiple Regression Model Simple linear regression used one independent variable to explain the dependent variable. Some relationships are too complex to be described using a single independent variable. Multiple regression uses two or more independent variables to describe the dependent variable. This allows multiple regression models to handle more complex situations. There is no limit to the number of independent variables a model can use. Multiple regression has only one dependent variable (y).
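The Multiple Regression Model The multiple linear regression model relating y to \(x_1, x_2, \ldots, x_k\) is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon\), where \(\beta_0, \beta_1, \beta_2, \ldots, \beta_k\) are unknown parameters and \(\varepsilon\) is an error term. The estimated regression equation is given by \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\), where \(b_0, b_1, b_2, \ldots, b_k\) are the least squares point estimates of the parameters \(\beta_0, \beta_1, \beta_2, \ldots, \beta_k\).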
Multiple Regression Coefficients interpretation: \(b_i\) represents an estimate of the change in y corresponding to a one-unit increase in \(x_i\) when all other independent variables are held constant. Coefficient of Multiple Determination \(R^2\): The multiple coefficient of determination, \(R^2\), is the proportion of the total variation in the n observed values of the dependent variable that is explained by the multiple regression model. Confidence and Prediction Intervals: Similar to simple linear regression.
The Overall F Test (Validity of the model) This F test is used to find out if all of the regression coefficients, except the intercept, are equal to zero. To test \(H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\) vs. \(H_a\): at least one of \(\beta_1, \beta_2, \ldots, \beta_k \neq 0\), the test statistic is \(F = \frac{(\text{Explained variation})/k}{(\text{Unexplained variation})/[n-(k+1)]}\), which follows an F distribution with k and n-k-1 degrees of freedom.
Testing the Significance of an Independent Variable To test the significance of an independent variable \(x_j\), we test \(H_0: \beta_j = 0\) vs. \(H_a: \beta_j \neq 0\). Test statistic: \(t = b_j / s_{b_j}\), which follows a t distribution with df = n-k-1. Note on significance testing: Whether the independent variable \(x_j\) is significantly related to y in a particular regression model depends on what other independent variables are included in the model. That is, changing independent variables can cause a significant variable to become insignificant or cause an insignificant variable to become significant.
Example A researcher wanted to find the effect of driving experience and the number of driving violations on auto insurance premiums. A random sample of 12 drivers insured with the same company and having similar auto insurance policies was selected from a large city. The data includes the monthly auto insurance premiums (in $) paid by these drivers, their driving experiences (in years), and the numbers of driving violations committed by them during the past three years. Find the regression equation of monthly premiums paid by drivers on the driving experiences and the numbers of driving violations.
Example From the scatter plots we have: Moderate negative linear relationship between the monthly premium and the driving experience. Strong positive linear relationship between the monthly premium and the number of driving violations.
Example: Estimated Model The proposed regression model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\), where y = the monthly auto insurance premium in dollars; \(x_1\) = the driving experience in years of a driver; \(x_2\) = the number of driving violations committed by a driver during the past three years. The estimated regression model is \(\hat{y} = 110.28 - 2.75x_1 + 16.11x_2\).
Example: Coefficients Interpretation \(b_0\) = $110.28: A driver with no driving experience and no driving violations committed in the past three years is expected to pay an auto insurance premium of $110.28 per month. That may not be true because none of the drivers in our sample has both zero experience and zero driving violations. \(b_1\) = -$2.75: A driver with one extra year of experience but the same number of driving violations is expected to pay $2.75 less per month for the auto insurance premium. \(b_2\) = $16.11: A driver with one extra driving violation during the past three years but with the same years of driving experience is expected to pay $16.11 more per month for the auto insurance premium.
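The interpretation of the coefficients can be illustrated by evaluating the estimated equation for a driver. The driver profile (10 years of experience, 2 violations) and the function name `predicted_premium` are hypothetical; only the three coefficients come from the example.

```python
def predicted_premium(experience_years, violations):
    """Estimated monthly premium: y-hat = 110.28 - 2.75*x1 + 16.11*x2."""
    return 110.28 - 2.75 * experience_years + 16.11 * violations

# Hypothetical driver: 10 years of experience, 2 violations in the past three years.
premium = predicted_premium(10, 2)
one_more_year = predicted_premium(11, 2)
one_more_violation = predicted_premium(10, 3)
print(round(premium, 2), round(one_more_year, 2), round(one_more_violation, 2))
```

Holding violations fixed, the extra year of experience lowers the estimate by $2.75; holding experience fixed, the extra violation raises it by $16.11, which is exactly what the two slopes describe.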
Example: \(R^2\) Find and interpret the coefficient of determination. The value of \(R^2 = 0.931\) tells us that the two independent variables, years of driving experience and the number of driving violations, explain 93.1% of the variation in the auto insurance premiums.
Example: Overall F-test At the 5% level, can you conclude that the number of years of driving experience and the number of driving violations are useful in the regression model? \(H_0: \beta_1 = \beta_2 = 0\) vs. \(H_a\): at least one of \(\beta_1, \beta_2 \neq 0\). F = 60.88 and P-value = 0.00000589. Reject \(H_0\). The number of years of driving experience and the number of driving violations should be retained in the model.
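The reported F statistic can be roughly reproduced from the coefficient of determination. Dividing the numerator and denominator of the F formula by the total variation gives F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The sketch below uses the rounded R^2 = 0.931 with n = 12 and k = 2 from this example; the function name is hypothetical.

```python
def overall_f_from_r_squared(r_squared, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Rounded R^2 from the example, 12 drivers, 2 independent variables.
f_stat = overall_f_from_r_squared(0.931, n=12, k=2)
print(round(f_stat, 2))
```

This gives a value close to the reported F = 60.88; the small gap is what one would expect from using R^2 rounded to three decimals.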
Example: Testing significance At the 5% level, can you conclude that the number of years of driving experience is useful in the regression model? \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). t = -2.812, P-value = 0.0203. Reject \(H_0\). The number of years of driving experience is a useful predictor and should be retained in the model.