R Short Course Session 5 Overview: Linear and Logistic Regression

 
R Short Course
Session 5
 
Daniel Zhao, PhD
Sixia Chen, PhD
Department of Biostatistics and Epidemiology
College of Public Health, OUHSC
9/23/2020
 
Outline
 
Linear Regression
Fit linear regression and check results
Examine normality, independence and
heteroscedasticity
Examine outliers and influence points
Examine collinearity
Model selection procedure
Model comparisons
 
Outline (2)
 
Logistic Regression
Fit logistic model and check results
Odds ratio and confidence intervals
Model selection procedure
Goodness of fit test
 
Linear Regression
 
Fitting linear regression of y vs x:
lm(y~x,data=,weights=,singular.ok=TRUE)
Example: the input data is the ‘Prestige’ dataset, available after
loading the R package ‘car’ with library(car)
Variables: education, income, women,
prestige, census and type
 
Linear Regression (2)
 
summary(Prestige)
 
Linear Regression (3)
 
pairs(Prestige)
 
 
Fit linear regression
 
reg1 <- lm(prestige~education+log2(income)+women, data=Prestige)
 
Check the results
 
summary(reg1)
 
Check the results (2)
 
attributes(reg1)
 
Check the results (3)
 
reg1$coefficients
 
 
reg1$df.residual
 
 
Examine Normality
 
QQ plot for studentized residuals: qqPlot(reg1, main="QQ Plot")
Distribution of studentized residuals:
library(car)   # provides qqPlot()
library(MASS)  # provides studres()
sresid <- studres(reg1)
par(mfrow=c(1,2))
qqPlot(reg1, main="QQ Plot")
hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length.out=40)
yfit <- dnorm(xfit)  # standard normal density for comparison
lines(xfit, yfit)
 
Examine Normality (2)
 
Examine Normality (3)
 
shapiro.test(reg1$residuals)
 
 
 
ad.test(reg1$residuals) # Anderson-Darling test; requires R package 'nortest'
 
 
Examine Independence
 
Plots:
 
Examine Independence (2)
 
Durbin Watson Test: durbinWatsonTest(reg1)
 
Examine heteroscedasticity
 
Plots:
 
Examine heteroscedasticity (2)
 
Goldfeld-Quandt test: gqtest(reg1). This requires
the R package ‘lmtest’
 
 
Non-constant error variance test:
ncvTest(reg1)
 
 
Examine Outliers
 
outlierTest(reg1) # Bonferroni p-value for the most extreme observation


lm.influence(reg1) # Calculate the diagonal of the hat matrix and the influence of each point on the estimated regression coefficients and residual standard deviation
 
Examine Outliers (2)
 
lm.influence(reg1)$hat #calculate leverage
 
 
lev <- lm.influence(reg1)$hat
lev[lev > 3*(3+1)/102] # Identify high leverage cases: cutoff 3*(k+1)/n with k=3 predictors and n=102 observations
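The leverage cutoff above hard-codes k = 3 and n = 102. A minimal sketch of the same 3*(k+1)/n rule computed from the fitted model follows. It assumes the built-in mtcars data as a stand-in for Prestige; the model and the names fit_demo, cutoff and high_leverage are illustrative only.

```r
# Stand-in model on mtcars (ships with R); the slides use Prestige from 'car'.
fit_demo <- lm(mpg ~ wt + hp + qsec, data = mtcars)

h <- hatvalues(fit_demo)       # leverages, same values as lm.influence(fit_demo)$hat
n <- nobs(fit_demo)            # number of observations used in the fit
p <- length(coef(fit_demo))    # number of coefficients, i.e. k + 1

cutoff <- 3 * p / n            # generalizes 3*(3+1)/102 from the slide
high_leverage <- h[h > cutoff] # named vector of flagged cases (may be empty)
print(high_leverage)
```

Because the leverages always sum to the number of coefficients, their average is (k+1)/n, so this rule flags cases with more than three times the average leverage.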
 
 
 
Examine influence points
 
influence.measures(reg1) # Calculate DFFITS, Cook’s D, DFBETAS and covariance ratios
 
Examine influence points (2)
 
Cook’s D plot (identify D values > 4/(n-k-1)):
cutoff <- 4/(nrow(Prestige)-length(reg1$coefficients)) # length(reg1$coefficients) is k+1
plot(reg1, which=4, cook.levels=cutoff)
abline(h=cutoff, lty=2) # draw the cutoff on the Cook's D plot
Influence Plot:
influencePlot(reg1, id=list(method="identify"), main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
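The Cook's D rule can also be applied without a plot. The sketch below lists the observations whose Cook's distance exceeds 4/(n-k-1). It assumes the built-in mtcars data as a stand-in for Prestige; fit_demo and flagged are illustrative names.

```r
# Stand-in model on mtcars; the slides use reg1 fitted to Prestige.
fit_demo <- lm(mpg ~ wt + hp + qsec, data = mtcars)

d <- cooks.distance(fit_demo)       # one Cook's distance per observation
n <- nobs(fit_demo)                 # sample size
k <- length(coef(fit_demo)) - 1     # number of predictors (intercept excluded)

cutoff <- 4 / (n - k - 1)           # same rule as on the slide
flagged <- names(d)[d > cutoff]     # row names of potentially influential cases
print(sort(d[flagged], decreasing = TRUE))
```

Note that n-k-1 is the residual degrees of freedom, so df.residual(fit_demo) gives the same denominator.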
 
 
Cook’s D plot
 
Influence Plot
 
Examine Collinearity
 
Variance inflation factors: vif(reg1)
 
 
Problem? sqrt(vif(reg1)) > 2
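For a model with numeric predictors only, the variance inflation factor of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below computes this by hand in base R, assuming three mtcars columns as stand-in predictors; it is meant to show what the number measures, not to replace vif() from ‘car’.

```r
# Stand-in predictors from mtcars; the slides use education, log2(income), women.
X <- mtcars[, c("wt", "hp", "qsec")]

vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  # Regress predictor v on all the other predictors
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
print(sqrt(vif_manual) > 2)  # the rule of thumb used on the slide
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient variance is inflated by that factor.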
 
Examine Collinearity (2)
 
Added variable plots: avPlots(reg1)
 
Model Selection Procedures
 
step(object=, scope=list(lower=,upper=), direction=c("both","backward","forward"), steps=1000, k=2, ...)
#object is an object representing a model
#scope defines the range of models examined in the stepwise search
#direction controls the mode of stepwise search
#steps is the maximum number of steps; k is the penalty per degree of freedom (k=2 gives AIC)
Model Selection Procedures (2)
 
edu2 <- Prestige$education^2
loginc2 <- log2(Prestige$income)^2
edulogin <- Prestige$education*log2(Prestige$income)
reg2 <- lm(prestige~education+edu2+log2(income)+loginc2+edulogin+women, data=Prestige)
step(reg2, direction='both')
 
 
 
Model Selection Procedures (3)
 
Model Comparisons
 
anova(reg1,reg2)
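When one model is nested in the other, anova() reports a partial F test. The sketch below reproduces that F statistic by hand from the residual sums of squares. It assumes the built-in mtcars data as a stand-in and writes the squared and interaction terms directly in the formula with I() and ":"; the names small and big are illustrative.

```r
# Nested stand-in models on mtcars; 'small' is a special case of 'big'.
small <- lm(mpg ~ wt + hp, data = mtcars)
big   <- lm(mpg ~ wt + hp + I(wt^2) + wt:hp, data = mtcars)

cmp <- anova(small, big)            # partial F test for the extra terms
print(cmp)

# The same test by hand
rss_small <- deviance(small)        # residual sum of squares, smaller model
rss_big   <- deviance(big)          # residual sum of squares, larger model
df_diff   <- df.residual(small) - df.residual(big)   # number of extra terms

f_manual <- ((rss_small - rss_big) / df_diff) / (rss_big / df.residual(big))
p_manual <- pf(f_manual, df_diff, df.residual(big), lower.tail = FALSE)
```

A small p-value suggests the extra terms improve the fit beyond what chance would explain.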
 
Logistic Regression
 
Input data: ‘plasma’ in package ‘HSAUR’
Variables:
fibrinogen: the fibrinogen level in the blood
globulin: the globulin level in the blood
ESR: the erythrocyte sedimentation rate, either
less than or greater than 20 mm/hour
 
Logistic Regression (2)
 
‘plasma’ data: head(plasma)
 
 
 
 
 
Objective: fit a logistic regression with ESR as the
dependent variable and the other two variables as
independent variables
 
 
Logistic Regression (3)
 
glm(formula, family=, data=, weights=, ...)
#formula is y~x type (use y~x-1 to fit without an intercept)
#data is the input dataset
#family can be gaussian, binomial or others
#weights specifies weighted or unweighted analysis
 
Fit logistic model
 
fit <- glm(ESR~fibrinogen+globulin, data=plasma, family=binomial('logit'))
 
Fit logistic model (2)
 
summary(fit)
 
 
Fit logistic model (3)
 
attributes(fit)
 
Logistic regression plot
 
attach(plasma)
ESR2 <- rep(1, dim(plasma)[1])
ESR2[ESR=='ESR > 20'] <- 0 # ESR2 is 0 when ESR > 20 and 1 otherwise
fit <- glm(ESR2~fibrinogen, data=plasma, family=binomial('logit')) # models P(ESR2 = 1)
plot(fibrinogen, ESR2)
lines(fibrinogen[order(fibrinogen)], fit$fitted.values[order(fibrinogen)])
 
Logistic regression plot (2)
 
Odds ratio and confidence intervals
 
Calculate Odds ratio: exp(coef(fit))
 
Calculate variance covariance for coefficients:
vcov(fit)
 
 
 
 
 
Odds ratio and confidence intervals (2)
 
Confidence interval for coefficient:
confint.default(fit)
 
 
Confidence interval for Odds ratio:
exp(confint.default(fit))
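confint.default() returns Wald intervals, that is, estimate ± z × standard error on the log-odds scale. The sketch below rebuilds them by hand and exponentiates to get odds ratios with limits. It assumes the built-in mtcars data (am as a 0/1 outcome, wt as the predictor) as a stand-in for plasma; fit_demo and or_table are illustrative names.

```r
# Stand-in logistic model on mtcars; the slides use the plasma data.
fit_demo <- glm(am ~ wt, data = mtcars, family = binomial("logit"))

est <- coef(fit_demo)                 # log-odds coefficients
se  <- sqrt(diag(vcov(fit_demo)))     # standard errors from the covariance matrix
z   <- qnorm(0.975)                   # normal quantile for a 95% interval

ci_manual <- cbind(lower = est - z * se, upper = est + z * se)

# Exponentiate to move from the log-odds scale to the odds-ratio scale
or_table <- exp(cbind(OR = est, ci_manual))
print(or_table)
```

The exponentiated intercept row is an odds, not an odds ratio, so it is usually left out when reporting.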
 
Model selection
 
fit2 <- glm(ESR2~fibrinogen+globulin, data=plasma, family=binomial('logit'))
step(fit2) # backward selection by default when no scope is given
 
Model selection (2)
 
Hosmer-Lemeshow goodness of fit test

hoslem.test(x, y, g=10) in R package ‘ResourceSelection’
#x is a numeric vector of observations, binary (0/1)
#y is the expected values (fitted probabilities)
#g is the number of bins used to calculate quantiles
 
Example
 
hoslem.test(ESR2,fit2$fitted.values)
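The idea behind the test is to group observations into g bins by fitted probability and compare observed with expected counts in each bin. The sketch below is a hand-rolled version of the usual statistic in base R, on simulated data (an assumption, not the plasma data); the implementation in ‘ResourceSelection’ may differ in details such as how ties in the quantiles are handled.

```r
# Simulated stand-in data so the example runs without extra packages.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit_demo <- glm(y ~ x, family = binomial("logit"))

p_hat <- fitted(fit_demo)             # expected values (fitted probabilities)
g <- 10                               # number of bins

# Bin observations by quantiles of the fitted probabilities
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 1 / g)),
            include.lowest = TRUE)

n_bin <- tapply(y, bins, length)      # observations per bin
obs1  <- tapply(y, bins, sum)         # observed events per bin
exp1  <- tapply(p_hat, bins, sum)     # expected events per bin
obs0  <- n_bin - obs1                 # observed non-events
exp0  <- n_bin - exp1                 # expected non-events

hl_stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
hl_p    <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)
print(c(statistic = hl_stat, df = g - 2, p.value = hl_p))
```

A large p-value gives no evidence of lack of fit; a small one suggests the fitted probabilities do not match the observed event rates across bins.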
 
 
Questions
 
Contact email: daniel-zhao@ouhsc.edu and Sixia-Chen@ouhsc.edu

In this session, Dr. Daniel Zhao and Dr. Sixia Chen from the Department of Biostatistics and Epidemiology at the College of Public Health, OUHSC, cover linear regression: fitting models, checking results, examining normality, independence, heteroscedasticity, outliers, influence points and collinearity, model selection, and model comparisons. They also cover logistic regression: fitting models, odds ratios and confidence intervals, model selection, and goodness-of-fit tests. Examples use the Prestige dataset from the R package car and the plasma dataset from the R package HSAUR.





