Understanding Car Purchase Discounts Through Regression Analysis
A new-car dealership owner conducted a study on 100 purchasers of mid-size cars to analyze the relationship between customer characteristics (age, annual income, sex) and negotiated discounts. The dataset includes information on purchaser demographics and the discount received. By examining the univariate statistics and exploring potential explanatory variables, insights can be gained into how customer attributes may influence discounts in car purchases.

Regression Analysis: How to DO It
Example: The car discount dataset
The slides marked with this symbol will be skipped during our first discussion of this dataset. After we cover hypothesis testing, we'll return to them.

Discounts on Car Purchases
Of course, no one pays list price for a new car. Realizing this, the owner of a new-car dealership has decided to conduct a study to better understand the relationship between customer characteristics and customer success in negotiating a discount from his salespeople.
He collects data on a sample of 100 purchasers of mid-size cars (he has already sold several thousand of these cars). Specifically, he notes the age, annual income, and sex of each purchaser (obtained from credit records; men are coded as 0 and women as 1), together with the discount from list price that the purchaser finally received.

Discounts on Car Purchases
Discount ($) negotiated on the purchase of a car, age of purchaser (years), annual income ($), and sex (M/F = 0/1):

Discount   Age   Income   Sex
1003       28    47658    1
1394       41    32126    1
2542       21    28374    1
1658       47    29321    0
1374       29    38016    1
1536       43    25343    0
1402       54    30310    0
 692       35    45709    0
 947       41    46242    0
1415       19    27933    1

Discounts on Car Purchases
Why mid-size cars only?
  To avoid needing to include model/price of car
Other possible explanatory variables?
  About purchaser
    Negotiation training
    Preparatory research
    Significant other
  About salesperson
    Identity
    Biases

Look at the Univariate Statistics
This will give you a sense of how each variable varies individually.
  Estimate of population mean (or proportion)
  Standard deviation and extremes
  95%-confidence interval for population mean (or proportion):
    Estimate ± (~2) × (standard error of the mean)
    Estimate ± margin of error (at 95%-confidence level)

Univariate statistics

                             Discount      Age           Income        Sex
mean                         1268.24       37.1          35705.17      0.46
standard deviation           538.665375    9.91122209    10273.7291    0.50090827
standard error of the mean   53.8665375    0.99112221    1027.37291    0.05009083
minimum                      130           19            19119         0
median                       1310.5        37            34401.5       0
maximum                      2542          58            64648         1
range                        2412          39            45529         1
skewness                     -0.018        0.154         0.452         0.163
kurtosis                     -0.710        -0.633        -0.270        -2.014

number of observations: 100
t-statistic for computing 95%-confidence intervals: 1.9842

For example, $1,268.24 ± 1.9842 × $53.87, or 46% ± 1.9842 × 5.01%.
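
The interval arithmetic on this slide can be reproduced in a few lines. This is only a sketch: the helper name confidence_interval is made up for illustration, the inputs are copied from the table, and the t-statistic 1.9842 is taken from the slide rather than computed.

```python
import math

# Values copied from the univariate statistics table (100 purchasers).
n = 100
t_95 = 1.9842  # t-statistic for 95% intervals, as reported on the slide


def confidence_interval(mean, std_dev, n_obs, t_stat):
    """Return (standard error, margin of error, lower limit, upper limit)."""
    std_error = std_dev / math.sqrt(n_obs)
    margin = t_stat * std_error
    return std_error, margin, mean - margin, mean + margin


# Mean discount, and the proportion of purchasers who are women (mean of Sex).
discount_ci = confidence_interval(1268.24, 538.665375, n, t_95)
sex_ci = confidence_interval(0.46, 0.50090827, n, t_95)

print("Discount: se=%.2f, margin=%.2f, interval=(%.2f, %.2f)" % discount_ci)
print("Proportion female: se=%.4f, margin=%.4f, interval=(%.4f, %.4f)" % sex_ci)
```

The standard errors come out equal to the table's values because each is the standard deviation divided by the square root of 100.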

The Full Regression
The most-complete model provides:
  The best predictive model (pretty much)
  The most accurate estimate of the pure effect of each explanatory variable on the dependent variable
    Specifically, the difference in the dependent variable typically associated with one unit of difference in one explanatory variable when the others are held constant.

Regression: Discount

                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression: 301.19175
coefficient of determination: 69.68%
adjusted coef of determination: 68.74%
number of observations: 100
residual degrees of freedom: 96
t-statistic for computing 95%-confidence intervals: 1.9850

The Adjusted Coefficient of Determination in the Full Model
How much of the story (how much of the overall variation in the dependent variable) is potentially explained by the fact that the explanatory variables themselves vary across the population?
  adjusted r² = 1 − Var(ε) / Var(Y) (roughly) = 68.74%
How can it be increased?
  By including new relevant variables
  Including a new "garbage" variable will leave it, on average, unchanged
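
As a rough check, the adjusted figure can be recovered from the unadjusted coefficient of determination with the usual degrees-of-freedom correction. The slides do not show this formula, so treat the sketch below as an assumption about how the spreadsheet computes it; the function name is hypothetical, and the input 69.68% is the rounded value displayed in the regression output.

```python
def adjusted_r_squared(r_squared, n_obs, n_explanatory):
    """Standard degrees-of-freedom adjustment to the coefficient of determination."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_explanatory - 1)


# Full model: Discount regressed on Age, Income, and Sex (100 observations).
full_model = adjusted_r_squared(0.6968, 100, 3)
print(f"adjusted coefficient of determination: {full_model:.4f}")
```

Because the input is rounded to four digits, the result agrees with the reported 68.74% only to within rounding.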

The Coefficients
The coefficient of an explanatory variable in the most-complete model:
  Is an estimate of the average difference in the dependent variable for two distinct individuals who differ (by one unit) only in that explanatory variable.
  Is an estimate of the average difference we'd expect to see in a specific individual if one aspect alone were slightly different (and all other aspects were the same).
coefficient ± (~2) × (standard error of coefficient)

Regression: Discount

                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression: 301.19175
coefficient of determination: 69.68%
adjusted coef of determination: 68.74%
number of observations: 100
residual degrees of freedom: 96
t-statistic for computing 95%-confidence intervals: 1.9850

$9.49 ± 1.9850 × $3.63 per year of Age (with same Income and Sex), or
-$0.0353 ± 1.9850 × $0.0037 per dollar of Income (with same Age and Sex), or
$446.29 ± 1.9850 × $64.56 more for a woman (1) than for a man (0) (with same Age and Income)
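
The three intervals quoted above can be spelled out numerically. A minimal sketch, with the coefficient and standard-error pairs copied from the regression table and the t-statistic 1.9850 taken as given from the output:

```python
t_95 = 1.9850  # reported t-statistic for 96 residual degrees of freedom

# (coefficient, standard error) pairs copied from the full regression table.
coefficients = {
    "Age": (9.48991379, 3.6320188),
    "Income": (-0.035313, 0.00366827),
    "Sex": (446.294355, 64.5567912),
}

intervals = {}
for name, (coef, std_err) in coefficients.items():
    margin = t_95 * std_err  # margin of error at the 95%-confidence level
    intervals[name] = (coef - margin, coef + margin)
    low, high = intervals[name]
    print(f"{name}: {coef:.4f} +/- {margin:.4f} -> ({low:.4f}, {high:.4f})")
```

None of the three intervals contains zero, which is consistent with the small significance levels reported for each variable.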

[Make single prediction]
Predictions, using most-recent regression

                 constant     Age          Income       Sex
coefficients     1971.7256    9.4899138    -0.035313    446.29435

values for prediction
  Age                                 30          31          30          30
  Income                              35000       35000       36000       35000
  Sex                                 1           1           1           0

predicted value of Discount           1466.762    1476.252    1431.449    1020.468
standard error of prediction          305.5644    305.2995    305.8382    305.2382
standard error of regression          301.1917    301.1917    301.1917    301.1917
standard error of estimated mean      51.50843    49.91316    53.10864    49.53689

confidence level: 95.00%
t-statistic: 1.9850
residual degr. freedom: 96

confidence limits for prediction
  lower                               860.2218    870.2374    824.3652    414.5748
  upper                               2073.303    2082.267    2038.533    1626.361
confidence limits for estimated mean
  lower                               1364.519    1377.175    1326.029    922.1379
  upper                               1569.006    1575.329    1536.869    1118.798

Tests involving Coefficients
In the full model, how strongly does the evidence support saying, "Sex > $200"?
H0: Sex ≤ $200, significance 0.01204% (overwhelmingly strong evidence against H0, hence supporting original statement)

estimate/prediction of unknown quantity: 446.294
measure of uncertainty: 64.557
sample size: 100
number of explanatory variables in regression, or 0 if dealing with a population mean: 3

Null hypothesis: true value = 200
significance level of data with respect to null hypothesis (from t-distribution with 96 degrees of freedom): 100.00000%, 0.02408%, 0.01204%

From Session-2's Hypothesis_Testing_Tool.xls

Tests involving Coefficients
Other statements?

Statement     Significance level of data               Strength of evidence
              (with respect to opposite statement)     supporting statement
Sex > $200    0.01204%                                 overwhelming
Sex > $300    1.28444%                                 very strong
Sex > $350    6.95385%                                 somewhat strong
Sex > $400    23.75235%                                quite weak

From Session-1's Hypothesis_Testing_Tool.xls
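
The significance levels in this table look like one-tailed areas under a t-distribution with 96 degrees of freedom. The sketch below works on that assumption; it is not the workbook's actual method. It integrates the t density numerically with Simpson's rule so that only the standard library is needed, and the helper names t_pdf and upper_tail are made up for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))


def upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    tail = 0.5 - area
    return tail if t >= 0 else 1 - tail


# Inputs as shown on the previous slide: Sex coefficient and its standard error.
estimate, std_err, df = 446.294, 64.557, 96

significance = {}
for threshold in (200, 300, 350, 400):
    t_ratio = (estimate - threshold) / std_err
    significance[threshold] = upper_tail(t_ratio, df)
    print(f"H0: Sex <= ${threshold}: t = {t_ratio:.4f}, "
          f"significance = {significance[threshold]:.5%}")
```

The further the hypothesized value is from the estimate, measured in standard errors, the smaller the tail area and the stronger the evidence for the statement.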

Predictions
Based on ANY model, what would we predict the dependent variable to be, if all we knew about an individual were the given values for the listed explanatory variables?
  Prediction ± (~2) × (standard error of the prediction)
What would we expect to see, on average, across a large pool of similar individuals?
  Prediction ± (~2) × (std. error of the estimated mean)

[Make multiple predictions]
Prediction, using most-recent regression

                          constant    Age         Income      Sex
coefficients              1971.726    9.489914    -0.03531    446.2944
values for prediction                 30          35000       1

predicted value of Discount: 1466.762
standard error of prediction: 305.5644
standard error of regression: 301.1917
standard error of estimated mean: 51.50843

confidence level: 95.00%
t-statistic: 1.9850
residual degr. freedom: 96

confidence limits for prediction: lower 860.2218, upper 2073.303
confidence limits for estimated mean: lower 1364.519, upper 1569.006

$1,466.76 ± 1.9850 × $305.56, an individual prediction for a 30-year-old woman earning $35,000/year
$1,466.76 ± 1.9850 × $51.51, an estimate of the large-group mean for 30-year-old women earning $35,000/year
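
The point prediction and both intervals can be rebuilt from the numbers on this slide. In this sketch the standard error of the estimated mean is simply copied from the output, since computing it from scratch would need the full dataset. The relation used to combine it with the standard error of regression is the standard one, and it reproduces the reported 305.5644. The function name predict_discount is hypothetical.

```python
import math

t_95 = 1.9850
coef = {"constant": 1971.72565, "Age": 9.48991379,
        "Income": -0.035313, "Sex": 446.294355}
se_regression = 301.19175      # standard error of regression (from the output)
se_estimated_mean = 51.50843   # copied from the slide, not recomputed here


def predict_discount(age, income, sex):
    """Point prediction from the full model's coefficients."""
    return (coef["constant"] + coef["Age"] * age
            + coef["Income"] * income + coef["Sex"] * sex)


prediction = predict_discount(30, 35000, 1)

# An individual's uncertainty combines noise around the regression line
# with uncertainty about where the line itself lies.
se_prediction = math.sqrt(se_regression ** 2 + se_estimated_mean ** 2)

individual_limits = (prediction - t_95 * se_prediction,
                     prediction + t_95 * se_prediction)
group_mean_limits = (prediction - t_95 * se_estimated_mean,
                     prediction + t_95 * se_estimated_mean)

print(f"prediction: {prediction:.3f}")
print(f"standard error of prediction: {se_prediction:.4f}")
print("individual: (%.2f, %.2f)" % individual_limits)
print("group mean: (%.2f, %.2f)" % group_mean_limits)
```

The limits differ from the slide's in the second decimal place because the t-statistic is used here at its displayed four-digit precision.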

Significance
The significance level of the t-ratio (for each variable separately)
  Sometimes called the p-value for that variable
  How strong is the evidence that, in a model already containing all of the other explanatory variables, this variable belongs (i.e., has a non-zero coefficient of its own)?
  Equivalently, is this a variable whose value we'd like to know when predicting for a specific individual?
  Close to zero = strong evidence it DOES belong (our null hypothesis is that it doesn't)

Significance (continued)
Null hypothesis: In the current model, the true coefficient of this variable is 0.
  The coefficient of this variable is our estimate.
  (coefficient) / (standard error of the coefficient) tells us how many standard deviations away from the hypothesized truth (0) the estimate is.
  significance = Pr(we'd be this far away just by chance)
Close to 0% = (recall coin-flipping story) highly contradictory to null hypothesis, strongly supportive of alternative (it DOES belong)
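
The t-ratios in the regression output are just this quotient. A short sketch, with values copied from the full Discount regression:

```python
# (coefficient, standard error of coefficient) from the full Discount regression.
rows = {
    "constant": (1971.72565, 146.147064),
    "Age": (9.48991379, 3.6320188),
    "Income": (-0.035313, 0.00366827),
    "Sex": (446.294355, 64.5567912),
}

# Each t-ratio says how many standard errors the estimate lies from zero.
t_ratios = {name: coef / std_err for name, (coef, std_err) in rows.items()}

for name, t in t_ratios.items():
    print(f"{name}: t-ratio = {t:.4f}")
```

Age sits about 2.6 standard errors from zero, while Income and Sex sit much further away, which is why their reported significance levels round to 0.0000%.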

Significance (continued)
The significance level deals with the marginal contribution of a variable to the current model.
Adding an irrelevant explanatory variable to a regression model will increase the adjusted coefficient of determination about half the time.
The significance level tells us if the coefficient of determination went up by enough to argue that the new variable is relevant.

The Beta-Weights
Why is Discount varying from one sale to the next?
What's the relative explanatory power of (variation in) each of the explanatory variables (in explaining the currently-observed variability in the dependent variable across the population)?
The comparative magnitudes of the beta-weights (for all of the explanatory variables together in the model) answer this question.

Regression: Discount

                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression: 301.19175
coefficient of determination: 69.68%
adjusted coef of determination: 68.74%
number of observations: 100
residual degrees of freedom: 96
t-statistic for computing 95%-confidence intervals: 1.9850

Why does Discount vary across the population?
Primarily, because Income varies.
Secondarily, because some purchasers are men and others are women (i.e., Sex varies).

The Beta-Weights (continued)
Each answers the question: If two individuals have the same values for all the explanatory variables in the model except one, and for this one their values differ by one standard deviation's worth of variability (in this variable), then their predicted values for the dependent variable would differ by how many standard deviations (of variability in the dependent variable)?
Typical variation in each of the explanatory variables alone can explain (relatively) how much of the observed variability in the dependent variable?
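
Read that way, a beta-weight is the coefficient rescaled by the ratio of standard deviations. The slides do not state the formula, so this sketch assumes the standard one. It reproduces the reported beta-weights from the univariate standard deviations and the full-model coefficients.

```python
# Standard deviations from the univariate statistics table.
sd = {"Discount": 538.665375, "Age": 9.91122209,
      "Income": 10273.7291, "Sex": 0.50090827}

# Coefficients from the full regression of Discount on Age, Income, and Sex.
coef = {"Age": 9.48991379, "Income": -0.035313, "Sex": 446.294355}

# A one-standard-deviation difference in x, expressed in standard deviations of y.
beta_weights = {name: c * sd[name] / sd["Discount"] for name, c in coef.items()}

# Rank by magnitude: sign shows direction, absolute value shows importance.
for name, beta in sorted(beta_weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: beta-weight = {beta:.4f}")
```

Sorting by absolute value gives the ranking used in the slides: Income first, then Sex, then Age.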

We Can Explore Other Models
We can drop variables
  Are older or younger purchasers currently getting larger discounts?
We can change the dependent variable
  Are the female purchasers, on average, older or younger than the male purchasers?
  What's the impact of aging on purchaser income?

Are the female purchasers, on average, older or younger than the male purchasers?

Regression: Age

                     constant      Sex
coefficient          38.9074074    -3.9291465
std error of coef    1.32861376    1.95893412
t-ratio              29.2842       -2.0058
significance         0.0000%       4.7639%
beta-weight                        -0.1986

standard error of regression: 9.76327735
coefficient of determination: 3.94%
adjusted coef of determination: 2.96%
number of observations: 100
residual degrees of freedom: 98
t-statistic for computing 95%-confidence intervals: 1.9845

Male purchasers are, on average, 38.91 years old.
Female purchasers are, on average, 3.93 years younger than the men.
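
With a single 0/1 explanatory variable, the constant is the mean for the group coded 0 and the coefficient is the gap between the two group means. The sketch below checks this against the overall mean age of 37.1 from the univariate statistics. The count of 46 women is inferred from the mean of Sex (0.46) across 100 purchasers.

```python
n_total = 100
n_women = 46                  # inferred: mean of Sex is 0.46 across 100 purchasers
n_men = n_total - n_women

mean_age_men = 38.9074074     # regression constant (Sex = 0)
sex_coefficient = -3.9291465  # average age difference, women minus men
mean_age_women = mean_age_men + sex_coefficient

# Weighting the two group means by group size should recover the overall mean.
overall_mean_age = (n_men * mean_age_men + n_women * mean_age_women) / n_total

print(f"mean age, women: {mean_age_women:.2f}")
print(f"overall mean age: {overall_mean_age:.2f}")
```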

If the pure effect of an additional year of age is to increase a purchaser's discount, then what explains the negative coefficient of Age below?

Regression: Discount

                     constant      Age
coefficient          1817.16511    -14.795825
std error of coef    202.794627    5.28272248
t-ratio              8.9606        -2.8008
significance         0.0000%       0.6142%
beta-weight                        -0.2722

standard error of regression: 520.957868
coefficient of determination: 7.41%
adjusted coef of determination: 6.47%
number of observations: 100
residual degrees of freedom: 98
t-statistic for computing 95%-confidence intervals: 1.9845

An older patron is likely to have a higher income (which typically is associated with a smaller discount).
An older patron is more likely to be male (which typically is associated with a smaller discount).

A Reconciliation across Models
On these next three slides, we'll focus on the "older people have higher incomes" effect:
As a patron ages by a year (and his/her sex stays unchanged!), his/her discount typically drops by $8.47.

Regression: Discount

                     constant      Age           Sex
coefficient          1292.48764    -8.4694585    630.367989
std error of coef    178.496906    4.34612096    85.9945283
t-ratio              7.2410        -1.9487       7.3303
significance         0.0000%       5.4216%       0.0000%
beta-weight                        -0.1558       0.5862

As the patron ages by a year (and his/her sex stays unchanged!), his/her income typically rises by $508.58.

Regression: Income

                     constant      Age           Sex
coefficient          19234.7835    508.57672     -5212.6301
std error of coef    3542.55077    86.25558      1706.69615
t-ratio              5.4296        5.8962        -3.0542
significance         0.0000%       0.0000%       0.2913%
beta-weight                        0.4906        -0.2541

The combined age and income effects are precisely what we originally estimated for an additional year of age, when income was not held constant.

Regression: Discount

                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

  9.48991379    impact of Age
  508.57672     additional Income
  -17.959372    impact of additional Income
  -8.4694585    net consequence of aging a year and earning more as a result
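
The reconciliation is a two-step calculation: the direct Age effect from the full model, plus the extra income that typically comes with a year of age multiplied by the Income effect. A minimal sketch using the coefficients from the three regressions involved (variable names are illustrative):

```python
# From the full model (Discount on Age, Income, Sex).
age_direct = 9.48991379        # dollars of discount per year of Age, Income held fixed
income_coef = -0.035313        # dollars of discount per dollar of Income

# From the Income model (Income on Age, Sex).
income_per_year = 508.57672    # dollars of extra Income per year of Age

indirect = income_per_year * income_coef   # discount change routed through Income
net = age_direct + indirect                # total change when Income is free to move

# From the reduced model (Discount on Age, Sex), for comparison.
reduced_model_age_coef = -8.4694585

print(f"indirect effect via Income: {indirect:.4f}")
print(f"net effect of a year of age: {net:.4f}")
print(f"Age coefficient with Income omitted: {reduced_model_age_coef:.4f}")
```

The tiny remaining gap comes from the Income coefficient being displayed to only six decimal places.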

Conclusion
To the extent that Income covaries with Age, if Income is omitted from our model, Age gets blamed for part of Income's effect on Discount.

Regression: Discount

                     constant      Age           Sex
coefficient          1292.48764    -8.4694585    630.367989
std error of coef    178.496906    4.34612096    85.9945283
t-ratio              7.2410        -1.9487       7.3303
significance         0.0000%       5.4216%       0.0000%
beta-weight                        -0.1558       0.5862

This yields the most accurate possible predictions based on Age and Sex alone, but grossly misestimates the pure effect of Age.
And that is why we try to use the most-complete model to estimate the pure effect of any variable on the dependent variable, and why our next session will focus on building the model itself.
Summary: Questions a Regression Study can Answer

Make an Individual Prediction
Predict a variable (with an unknown value) for an individual, given some specific information about that individual.
Regress the variable-to-be-predicted (the dependent variable) onto the known variables (the independent or explanatory variables), and make a prediction.
The margin of error in the prediction is (~2) × (the standard error of the prediction).
Example: Predict the discount from list price that a 30-year-old woman who buys an intermediate-sized vehicle from the dealership would receive.
  $1,668.77 ± 1.9847 × $420.06

Estimate a Group Mean
Estimate the mean value of a variable, across a (large) group of individuals who share certain specific characteristics.
Regress the first variable onto the others. Then make a prediction of the variable for one of the individuals (which will be used as the estimate of the mean across this group of similar individuals).
The margin of error in the estimated mean is (~2) × (the standard error of the estimated mean).
Example: Estimate the mean discount received by 30-year-old women (plural!) who buy intermediate-sized vehicles from the dealership.
  $1,668.77 ± 1.9847 × $65.60

Estimate a Pure Difference (1)
What is the mean difference in the value of the dependent variable typically associated with a one-unit difference in another variable, when everything else of relevance remains unchanged?
Regress the dependent variable onto all of the other variables in the study (the "most complete" model), and look at the coefficient of the other variable.
The margin of error in the estimated mean associated difference is (~2) × (the standard error of the coefficient).
Example: What is the average difference in negotiated discount associated with an incremental year of age of the purchaser of an intermediate-sized car from the dealership, when all other characteristics of that purchaser remain unchanged?
  $9.49 ± 1.9850 × $3.63

Estimate a Pure Effect (2)
Example: What is the average difference in negotiated discount associated with an incremental year of age of the purchaser of an intermediate-sized car from the dealership, when all other characteristics of that purchaser remain unchanged?
Example: What is the average effect of an incremental year of age on negotiated discount?
If you're willing to assert that the linkage between age and negotiated discount is causal (we'll discuss causality in our next class), then the "average pure difference" and "average pure effect" questions can be viewed as the same.

Estimate a Confounded Difference (1)
What is the mean difference in the value of the dependent variable typically associated with a one-unit difference in another variable, when all remaining variables consequently may take different values themselves?
Regress the dependent variable onto just the one variable, and look at the coefficient of the explanatory variable.
The margin of error in the estimated mean difference is (~2) × (the standard error of the coefficient).
Example: As 30-year-old purchasers age by a year, estimate the average change in their negotiated discounts.
  -$14.80 ± 1.9845 × $5.28

Estimate a Confounded Difference (2)
Example (continued): As 30-year-old purchasers age by a year, estimate the average change in their negotiated discounts.
The older purchasers would, on average, receive smaller discounts. This is because, as Age increases for purchasers, Income tends to increase as well. The additional Age tends to increase Discount, the additional Income tends to decrease Discount, and the net effect just happens to be a decrease.

Measure the Potential Explanatory Power of a Model
How much of the variation in the dependent variable is potentially explained by the fact that several explanatory variables vary from one individual to the next?
Regress the first variable (the dependent variable) onto the other variables (the independent or explanatory variables), and look at the adjusted coefficient of determination.
Example: How much of the variation in negotiated Discounts on intermediate-size cars can be potentially explained by the facts that Age, Income, and Sex all vary from one purchaser to the next?
  68.74%

Rank the Explanatory Variables by Relative Explanatory Importance
When all the variables are considered together, typical variation in which one would lead to the greatest expected variation in the dependent variable?
Regress the variable-to-be-predicted (the dependent variable) onto the explanatory variables. Compare the magnitudes (absolute values) of the beta-weights of the explanatory variables.
Example: Why does Discount vary from one purchaser to the next?
  Primarily, because Income varies (-0.6735). And secondarily, because Sex varies (some are men, and others women) (0.4150).

Evaluate a Variable's Model Inclusion (1)
Given a particular regression model, how strong is the (supporting) evidence that a specific one of the explanatory variables has a true non-zero effect on the dependent variable (and therefore "belongs" in the model)?
To see if evidence supports a claim, we always take the opposite as the null hypothesis: in this case, to say that a variable does not belong in the model we say H0: coefficient (of the explanatory variable) = 0.
The displayed significance level for that variable is with respect to the "doesn't belong" null hypothesis, so a large numeric significance level indicates little or no evidence that the variable belongs in the model. However, a small significance level provides strong evidence against the null hypothesis, and therefore strong evidence that the explanatory variable plays a non-zero role in the relationship.

Evaluate a Variable's Model Inclusion (2)
Example:

Regression: Discount

                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

We see here overwhelmingly strong evidence that Income and Sex have non-zero effects and belong in our model, and very strong evidence that Age belongs as well.