R Short Course Session 5 Overview: Linear and Logistic Regression


In this session, Dr. Daniel Zhao and Dr. Sixia Chen from the Department of Biostatistics and Epidemiology at the College of Public Health, OUHSC, cover topics on linear regression including fitting models, checking results, examining normality, outliers, collinearity, model selection, and comparisons. Additionally, logistic regression is discussed with fitting models, odds ratio calculations, and goodness of fit tests. Practical examples using the Prestige dataset in R are provided, along with detailed instructions on model summaries, pair comparisons, and result checks using various statistical techniques.


Uploaded on Oct 07, 2024



Presentation Transcript


  1. R Short Course Session 5. Daniel Zhao, PhD, and Sixia Chen, PhD. Department of Biostatistics and Epidemiology, College of Public Health, OUHSC. 9/23/2020

  2. Outline: Linear Regression
     Fit linear regression and check results
     Examine normality, independence and heteroscedasticity
     Examine outliers and influence points
     Examine collinearity
     Model selection procedure
     Model comparisons

  3. Outline (2): Logistic Regression
     Fit logistic model and check results
     Odds ratio and confidence intervals
     Model selection procedure
     Goodness of fit test

  4. Linear Regression
     Fitting linear regression of y vs x: lm(y~x,data=,weights=,singular.ok=TRUE)
     Example: input data is the Prestige dataset in R package car
     Variables: education, income, women, prestige, census and type

  5. Linear Regression (2) summary(Prestige)

  6. Linear Regression (3) pairs(Prestige)

  7. Fit linear regression reg1<- lm(prestige~education+log2(income)+women, data=Prestige)

  8. Check the results summary(reg1)

  9. Check the results (2) attributes(reg1)

  10. Check the results (3)
      reg1$coefficients
      reg1$df.residual
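
The slides above inspect a fitted model through summary(), attributes() and the `$` accessors. As a minimal sketch of the same idea with base R extractor functions, the block below uses the built-in mtcars data as a stand-in for Prestige (an assumption made only so the example runs without extra packages); the names `fit`, `coef_table` and `ci` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the Prestige data used in the slides
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))

# t-based 95% confidence intervals for the coefficients
ci <- confint(fit, level = 0.95)

print(coef_table)
print(ci)
print(df.residual(fit))   # same value as fit$df.residual
```

Extractor functions such as coef(), confint() and df.residual() return the same quantities as the `$` components, and they keep working if the internal layout of the fitted object changes.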

  11. Examine Normality
      QQ plot for studentized residuals:
      qqPlot(reg1, main="QQ Plot")
      Distribution of studentized residuals:
      library(MASS)
      sresid <- studres(reg1)
      par(mfrow=c(1,2))
      qqPlot(reg1, main="QQ Plot")
      hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
      xfit<-seq(min(sresid),max(sresid),length=40)
      yfit<-dnorm(xfit)
      lines(xfit, yfit)

  12. Examine Normality (2)

  13. Examine Normality (3)
      shapiro.test(reg1$residuals) # Shapiro-Wilk test
      ad.test(reg1$residuals) # Anderson-Darling test; requires R package nortest

  14. Examine Independence Plots:

  15. Examine Independence (2) Durbin Watson Test: durbinWatsonTest(reg1)
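
To make the Durbin-Watson statistic less of a black box, here is a small sketch that computes it by hand from the residuals. It uses built-in mtcars as a stand-in dataset (an assumption; the row order of mtcars has no time meaning, and the test is only informative when observations have a natural order). The names `e`, `dw` and `r1` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the Prestige data used in the slides
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)
n <- length(e)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (uncentered form)
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

print(dw)             # values near 2 suggest little first-order autocorrelation
print(2 * (1 - r1))   # common approximation to dw
```

The statistic lies between 0 and 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.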

  16. Examine heteroscedasticity Plots:

  17. Examine heteroscedasticity (2)
      Goldfeld-Quandt test: gqtest(reg1). Note that we need to install R package lmtest.
      Non-constant error variance test: ncvTest(reg1)

  18. Examine Outliers
      outlierTest(reg1) # Bonferroni p-value for most extreme obs
      lm.influence(reg1) # Calculate the diagonal of the hat matrix and the influence of each point on the estimated regression coefficients and residual standard deviation

  19. Examine Outliers (2)
      lm.influence(reg1)$hat # calculate leverage
      lm.influence(reg1)$hat[lm.influence(reg1)$hat > 3*(3+1)/102] # identify high leverage cases: hat value > 3*(k+1)/n, with k=3 predictors and n=102 observations
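
The hard-coded threshold 3*(3+1)/102 on the slide can be computed from the fitted model instead, so it stays correct if predictors or rows change. A minimal sketch, using built-in mtcars as a stand-in dataset (an assumption) and illustrative names `h`, `p`, `n` and `threshold`:

```r
# Stand-in data: built-in mtcars, NOT the Prestige data used in the slides
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # leverages; same values as lm.influence(fit)$hat
p <- length(coef(fit))     # number of coefficients, intercept included (k + 1)
n <- nobs(fit)             # number of observations used in the fit

threshold <- 3 * p / n     # the rule from the slide, written generally
high_leverage <- h[h > threshold]
print(threshold)
print(high_leverage)
```

Because the leverages sum to the number of coefficients, p/n is the average leverage, and the rule flags cases with more than three times the average.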

  20. Examine influence points
      influence.measures(reg1) # calculate DFFITS, Cook's D, DFBETAS and covariance ratios

  21. Examine influence points (2)
      Cook's D plot (identify D values > 4/(n-k-1), where k is the number of predictors):
      cutoff<-4/(nrow(Prestige)-length(reg1$coefficients)) # length(reg1$coefficients) equals k+1
      plot(reg1, which=4, cook.levels=cutoff)
      Influence plot:
      influencePlot(reg1, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
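
The 4/(n-k-1) rule can also be applied numerically rather than read off the plot. The sketch below lists the flagged observations using base R's cooks.distance(); it uses built-in mtcars as a stand-in dataset (an assumption), and the names `d`, `k`, `cutoff` and `flagged` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the Prestige data used in the slides
fit <- lm(mpg ~ wt + hp, data = mtcars)

d <- cooks.distance(fit)        # one Cook's distance per observation
n <- nobs(fit)
k <- length(coef(fit)) - 1      # number of predictors, intercept excluded

cutoff <- 4 / (n - k - 1)       # rule of thumb quoted on the slide
flagged <- d[d > cutoff]
print(cutoff)
print(sort(flagged, decreasing = TRUE))
```

The cutoff is a screening heuristic, so flagged cases are candidates for a closer look rather than automatic exclusions.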

  22. Cook's D plot

  23. Influence Plot

  24. Examine Collinearity
      Variance inflation factors: vif(reg1)
      Problem? sqrt(vif(reg1)) > 2
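
For numeric predictors, a variance inflation factor is 1/(1 - R^2), where R^2 comes from regressing that predictor on the remaining ones. The sketch below computes this by hand in base R to show what the vif() output measures; it uses built-in mtcars as a stand-in dataset (an assumption), and `X` and `vif_manual` are illustrative names.

```r
# Stand-in data: three numeric predictors from built-in mtcars (NOT Prestige)
X <- mtcars[, c("wt", "hp", "disp")]

# Regress each predictor on the others and convert R^2 into a VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
print(sqrt(vif_manual) > 2)   # the rule of thumb from the slide
```

sqrt(VIF) is the factor by which a coefficient's standard error is inflated relative to uncorrelated predictors, which is why the slide's check is written on the square-root scale.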

  25. Examine Collinearity (2)
      Added variable plots: avPlots(reg1)

  26. Model Selection Procedures
      step(object=, scope=list(lower=,upper=), direction=c("both","backward","forward"), steps=1000, k=2, ...)
      #object is an object representing a model
      #scope defines the range of models examined in the stepwise search
      #direction controls the mode of stepwise search

  27. Model Selection Procedures (2)
      edu2<-Prestige$education^2 # squared and interaction terms are built from the Prestige columns
      loginc2<-log2(Prestige$income)^2
      edulogin<-Prestige$education*log2(Prestige$income)
      reg2<-lm(prestige~education+edu2+log2(income)+loginc2+edulogin+women,data=Prestige)
      step(reg2,direction='both')
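
As an alternative to building squared and product terms as separate variables, the same kind of model can be written with I() and `:` inside the formula, and the scope argument can drive a forward search. This is a sketch on the built-in mtcars data (a stand-in, not Prestige); the model names `full`, `null`, `sel_both` and `sel_fwd` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the Prestige data used in the slides
full <- lm(mpg ~ wt + I(wt^2) + hp + I(hp^2) + wt:hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Stepwise search in both directions, starting from the full model
sel_both <- step(full, direction = "both", trace = 0)

# Forward search from the intercept-only model, bounded by the scope
sel_fwd <- step(null,
                scope = list(lower = ~ 1,
                             upper = ~ wt + I(wt^2) + hp + I(hp^2) + wt:hp),
                direction = "forward", trace = 0)

print(formula(sel_both))
print(formula(sel_fwd))
print(c(full = AIC(full), both = AIC(sel_both), forward = AIC(sel_fwd)))
```

Setting trace = 0 suppresses the step-by-step log; leaving it at the default prints each candidate model and its AIC.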

  28. Model Selection Procedures (3)

  29. Model Comparisons anova(reg1,reg2)

  30. Logistic Regression
      Input data: plasma in package HSAUR
      Variables:
      fibrinogen: the fibrinogen level in the blood
      globulin: the globulin level in the blood
      ESR: the erythrocyte sedimentation rate, either less than or greater than 20 mm/hour

  31. Logistic Regression (2)
      plasma data: head(plasma)
      Objective: fit a logistic regression with ESR as the dependent variable and the other two variables as independent variables

  32. Logistic Regression (3)
      glm(formula, data=, family=, weights=, ...)
      #formula is y~x type
      #data is the input dataset
      #family can be gaussian, binomial or others
      #weights specifies weighted or unweighted analysis
      #an intercept is included by default; use y~x-1 in the formula to omit it (intercept= is an argument of glm.fit, not of glm)

  33. Fit logistic model
      fit<-glm(ESR~fibrinogen+globulin,data=plasma,family=binomial('logit'))

  34. Fit logistic model (2) summary(fit)

  35. Fit logistic model (3) attributes(fit)

  36. Logistic regression plot
      attach(plasma)
      ESR2<-rep(1,dim(plasma)[1])
      ESR2[ESR=='ESR > 20']<-0 # codes 'ESR > 20' as 0 and all other observations as 1
      fit<-glm(ESR2~fibrinogen,data=plasma,family=binomial('logit'))
      plot(fibrinogen, ESR2)
      lines(fibrinogen[order(fibrinogen)],fit$fitted.values[order(fibrinogen)])

  37. Logistic regression plot (2)

  38. Odds ratio and confidence intervals
      Calculate odds ratios: exp(coef(fit))
      Calculate the variance-covariance matrix of the coefficients: vcov(fit)

  39. Odds ratio and confidence intervals (2)
      Confidence intervals for the coefficients: confint.default(fit)
      Confidence intervals for the odds ratios: exp(confint.default(fit))
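
The confint.default() intervals are Wald intervals, that is, estimate plus or minus a normal quantile times the standard error taken from vcov(). The sketch below rebuilds them by hand and exponentiates to the odds-ratio scale. It uses simulated data rather than the plasma data (an assumption, so the example runs without HSAUR), and the names `x`, `y`, `fit` and `or_table` are illustrative.

```r
# Simulated stand-in data, NOT the plasma data used in the slides
set.seed(2020)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))

fit <- glm(y ~ x, family = binomial("logit"))

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # standard errors from the variance-covariance matrix
z   <- qnorm(0.975)            # normal quantile for a 95% interval

# Odds ratios with Wald confidence limits
or_table <- cbind(OR    = exp(est),
                  lower = exp(est - z * se),
                  upper = exp(est + z * se))
print(or_table)
```

Calling confint(fit) instead of confint.default(fit) on a glm gives profile-likelihood intervals, which can differ noticeably from the Wald intervals in small samples.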

  40. Model selection
      fit2<-glm(ESR2~fibrinogen+globulin,data=plasma,family=binomial('logit'))
      step(fit2) # backward selection by default when scope is not given

  41. Model selection (2)

  42. Hosmer-Lemeshow goodness of fit test
      hoslem.test(x,y,g=10) in R package ResourceSelection
      #x is a numeric vector of observations, binary (0/1)
      #y is expected values
      #g is number of bins to use to calculate quantiles

  43. Example hoslem.test(ESR2,fit2$fitted.values)
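
To show what the Hosmer-Lemeshow test computes, the sketch below builds the statistic by hand in base R: fitted probabilities are cut into g quantile groups, and observed and expected counts of events and non-events are compared with a chi-square statistic on g - 2 degrees of freedom. It uses simulated data rather than the plasma data (an assumption), and it is meant as an illustration of the idea, not a replacement for hoslem.test(); all variable names are illustrative.

```r
# Simulated stand-in data, NOT the plasma data used in the slides
set.seed(2020)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))

fit <- glm(y ~ x, family = binomial("logit"))
p_hat <- fitted(fit)

g <- 10                                              # number of groups
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

n_g  <- tapply(y, grp, length)                       # group sizes
obs1 <- tapply(y, grp, sum)                          # observed events per group
exp1 <- tapply(p_hat, grp, sum)                      # expected events per group
obs0 <- n_g - obs1                                   # observed non-events
exp0 <- n_g - exp1                                   # expected non-events

chisq <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
p_value <- pchisq(chisq, df = g - 2, lower.tail = FALSE)
print(c(statistic = chisq, df = g - 2, p.value = p_value))
```

A large p-value gives no evidence of lack of fit. The result depends on the choice of g, so it is worth checking that conclusions are stable for a few nearby values.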

  44. Questions Contact email: daniel-zhao@ouhsc.edu and Sixia-Chen@ouhsc.edu
