R Short Course Session 5 Overview: Linear and Logistic Regression
In this session, Dr. Daniel Zhao and Dr. Sixia Chen from the Department of Biostatistics and Epidemiology at the College of Public Health, OUHSC, cover topics on linear regression including fitting models, checking results, examining normality, outliers, collinearity, model selection, and comparisons. Additionally, logistic regression is discussed with fitting models, odds ratio calculations, and goodness of fit tests. Practical examples using the Prestige dataset in R are provided, along with detailed instructions on model summaries, pair comparisons, and result checks using various statistical techniques.
R Short Course Session 5 Daniel Zhao, PhD Sixia Chen, PhD Department of Biostatistics and Epidemiology College of Public Health, OUHSC 9/23/2020
Outline: Linear Regression
Fit linear regression and check results
Examine normality, independence and heteroscedasticity
Examine outliers and influence points
Examine collinearity
Model selection procedure
Model comparisons
Outline (2): Logistic Regression
Fit logistic model and check results
Odds ratio and confidence intervals
Model selection procedure
Goodness of fit test
Linear Regression
Fitting linear regression of y vs x: lm(y~x, data=, weights=, singular.ok=TRUE)
Example: the input data is the Prestige dataset, available after library(car)
Variables: education, income, women, prestige, census and type
Linear Regression (2) summary(Prestige)
Linear Regression (3) pairs(Prestige)
Fit linear regression reg1<- lm(prestige~education+log2(income)+women, data=Prestige)
Check the results summary(reg1)
Check the results (2) attributes(reg1)
Check the results (3) reg1$coefficients reg1$df.residual
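The pieces of a fitted model can also be pulled out with accessor functions. The following is a minimal sketch, assuming the built-in mtcars data and a stand-in model mpg ~ wt + hp (not the course's Prestige model) so that it runs without extra packages.

```r
# Stand-in model on built-in data (assumption: illustrative only, not reg1)
fit <- lm(mpg ~ wt + hp, data = mtcars)

coefs   <- coef(fit)            # same values as fit$coefficients
coef_tb <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
ci      <- confint(fit)         # 95% confidence intervals for the coefficients
df_res  <- df.residual(fit)     # same value as fit$df.residual

print(round(coef_tb, 4))
print(ci)
```

The residual degrees of freedom equal the number of rows minus the number of estimated coefficients, here 32 - 3.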
Examine Normality
QQ plot for studentized residuals (qqPlot is in package car):
library(car)
qqPlot(reg1, main="QQ Plot")
Distribution of studentized residuals:
library(MASS)
sresid <- studres(reg1)
par(mfrow=c(1,2))
qqPlot(reg1, main="QQ Plot")
hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
Examine Normality (3)
shapiro.test(reg1$residuals)
library(nortest) # ad.test (Anderson-Darling) comes from package nortest
ad.test(reg1$residuals)
Examine Independence Plots:
Examine Independence (2) Durbin Watson Test: durbinWatsonTest(reg1)
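The Durbin-Watson statistic itself is simple to compute from the residuals: it is the sum of squared successive differences divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation. The sketch below computes only the statistic (no p-value) and assumes the built-in mtcars data with a stand-in model; since mtcars rows have no natural time order, the number is purely illustrative.

```r
# Stand-in model on built-in data (assumption: illustrative only, not reg1)
fit <- lm(mpg ~ wt + hp, data = mtcars)

e  <- residuals(fit)
# DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, always between 0 and 4
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```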
Examine heteroscedasticity Plots:
Examine heteroscedasticity (2) Goldfeld-Quandt test: gqtest(reg1). Note that we need to install R package lmtest Non-constant error variance test: ncvTest(reg1)
Examine Outliers
outlierTest(reg1) # Bonferroni p-value for the most extreme observation
lm.influence(reg1) # Diagonal of the hat matrix and the influence of each point on the coefficient and residual standard deviation estimates
Examine Outliers (2)
lm.influence(reg1)$hat # Calculate leverage
lm.influence(reg1)$hat[lm.influence(reg1)$hat > 3*(3+1)/102] # Identify high leverage cases: hat > 3(k+1)/n, with k=3 predictors and n=102 observations
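The hard-coded cutoff can be derived from the fitted model instead. This is a minimal sketch of the 3(k+1)/n rule, assuming the built-in mtcars data and a stand-in model rather than reg1.

```r
# Stand-in model on built-in data (assumption: illustrative only, not reg1)
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)            # leverages, same values as lm.influence(fit)$hat
k <- length(coef(fit)) - 1     # number of predictors (intercept excluded)
n <- nrow(mtcars)              # number of observations
cutoff <- 3 * (k + 1) / n      # rule-of-thumb threshold used on the slide

high_leverage <- h[h > cutoff] # named vector of flagged cases (may be empty)
print(cutoff)
print(high_leverage)
```

The leverages always sum to the number of estimated coefficients, so the average leverage is (k+1)/n and the rule flags cases with more than three times the average.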
Examine influence points
influence.measures(reg1) # Calculate DFFITS, Cook's D, DFBETAS and covariance ratios
Examine influence points (2)
Cook's D plot (identify D values > 4/(n-k-1)):
cutoff<-4/(nrow(Prestige)-length(reg1$coefficients)) # length(reg1$coefficients) is k+1
plot(reg1, which=4, cook.levels=cutoff)
Influence Plot:
influencePlot(reg1, id=list(method="identify"), main="Influence Plot", sub="Circle size is proportional to Cook's Distance") # car >= 3.0 syntax; older versions used id.method="identify"
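To list the flagged observations rather than read them off the plot, the Cook's distances can be compared with the 4/(n-k-1) cutoff directly. A minimal sketch, assuming the built-in mtcars data and a stand-in model rather than reg1:

```r
# Stand-in model on built-in data (assumption: illustrative only, not reg1)
fit <- lm(mpg ~ wt + hp, data = mtcars)

d <- cooks.distance(fit)       # one Cook's D value per observation
n <- nrow(mtcars)              # number of observations
k <- length(coef(fit)) - 1     # number of predictors (intercept excluded)
cutoff <- 4 / (n - k - 1)      # rule-of-thumb threshold from the slide

flagged <- sort(d[d > cutoff], decreasing = TRUE)  # may be empty
print(cutoff)
print(flagged)
```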
Examine Collinearity
Variance inflation factors: vif(reg1)
Problem? sqrt(vif(reg1)) > 2
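For numeric predictors, the variance inflation factor of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below computes this by hand; the helper vif_by_hand is hypothetical (not part of car), and the built-in mtcars data stands in for Prestige.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2) for numeric predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all the other predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

v <- vif_by_hand(mtcars, c("wt", "hp", "disp"))
print(v)
print(sqrt(v) > 2)   # same rule of thumb as on the slide
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient variance is inflated by collinearity.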
Examine Collinearity (2)
Added variable plots: avPlots(reg1)
Model Selection Procedures
step(object=, scope=list(lower=, upper=), direction=c("both", "backward", "forward"), steps=1000, k=2, ...)
#object is an object representing a model
#scope defines the range of models examined in the stepwise search
#direction controls the mode of stepwise search
Model Selection Procedures (2)
edu2<-Prestige$education^2
loginc2<-log2(Prestige$income)^2
edulogin<-Prestige$education*log2(Prestige$income)
reg2<-lm(prestige~education+edu2+log2(income)+loginc2+edulogin+women,data=Prestige)
step(reg2,direction='both')
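The scope argument matters most for forward selection, where the search starts from a small model and needs an upper limit. A minimal sketch, assuming the built-in mtcars data and an arbitrary candidate set of predictors (not the course's Prestige models):

```r
# Start from the intercept-only model (assumption: mtcars stands in for Prestige)
null_fit <- lm(mpg ~ 1, data = mtcars)

# Forward search between the intercept-only model and the full candidate set;
# k = 2 is the AIC penalty, trace = 0 suppresses the step-by-step printout
sel <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ wt + hp + disp + qsec),
            direction = "forward", k = 2, trace = 0)

print(formula(sel))
print(extractAIC(sel))   # equivalent degrees of freedom and AIC of the chosen model
```

Setting k = log(n) instead of 2 would make the search use a BIC-type penalty.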
Model Comparisons anova(reg1,reg2)
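For two nested linear models, anova() reports a partial F test. The sketch below reproduces that F statistic and p-value by hand from the residual sums of squares, assuming the built-in mtcars data and two stand-in nested models rather than reg1 and reg2.

```r
# Two nested stand-in models (assumption: illustrative only)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + disp, data = mtcars)

rss_s <- deviance(small); df_s <- df.residual(small)  # RSS and residual df, small model
rss_l <- deviance(large); df_l <- df.residual(large)  # RSS and residual df, large model

# F = [(RSS_small - RSS_large) / (df_small - df_large)] / (RSS_large / df_large)
F_stat <- ((rss_s - rss_l) / (df_s - df_l)) / (rss_l / df_l)
p_val  <- pf(F_stat, df_s - df_l, df_l, lower.tail = FALSE)

tab <- anova(small, large)
print(tab)
print(c(F = F_stat, p = p_val))
```

A small p-value indicates that the extra terms in the larger model reduce the residual sum of squares by more than chance would explain.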
Logistic Regression
Input data: plasma in package HSAUR
Variables:
fibrinogen: the fibrinogen level in the blood
globulin: the globulin level in the blood
ESR: the erythrocyte sedimentation rate, either less than or greater than 20 mm/hour
Logistic Regression (2) plasma data: head(plasma) Objective: fit logistic regression by using ESR as dependent variable and other two as independent variables
Logistic Regression (3)
glm(formula, data=, family=, weights=, ...)
#formula is y~x type
#data is the input dataset
#family can be gaussian, binomial or others
#weights specifies weighted or unweighted analysis
#the intercept is included by default; to drop it, write the formula as y~x-1
Fit logistic model fit<-glm(ESR~fibrinogen+globulin,data=plasma,family=binomial('logit'))
Fit logistic model (2) summary(fit)
Fit logistic model (3) attributes(fit)
Logistic regression plot
attach(plasma)
ESR2<-rep(1,dim(plasma)[1])
ESR2[ESR=='ESR > 20']<-0 # ESR2 is 1 for 'ESR < 20' and 0 for 'ESR > 20', so this fit models the probability of ESR < 20
fit<-glm(ESR2~fibrinogen,data=plasma,family=binomial('logit'))
plot(fibrinogen, ESR2)
lines(fibrinogen[order(fibrinogen)],fit$fitted.values[order(fibrinogen)])
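An alternative to connecting the sorted fitted values is to predict on an evenly spaced grid, which gives a smooth curve even when the data are sparse. A minimal sketch, assuming the built-in mtcars data with am (0/1) as a stand-in response and wt as the single predictor:

```r
# Stand-in logistic model (assumption: mtcars replaces the plasma data)
fit <- glm(am ~ wt, data = mtcars, family = binomial("logit"))

# Evenly spaced predictor values over the observed range
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
# type = "response" returns probabilities rather than log-odds
grid$p <- predict(fit, newdata = grid, type = "response")

plot(mtcars$wt, mtcars$am, xlab = "wt", ylab = "P(am = 1)")
lines(grid$wt, grid$p)
```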
Odds ratio and confidence intervals Calculate Odds ratio: exp(coef(fit)) Calculate variance covariance for coefficients: vcov(fit)
Odds ratio and confidence intervals (2) Confidence interval for coefficient: confint.default(fit) Confidence interval for Odds ratio: exp(confint.default(fit))
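The interval from confint.default is a Wald interval: estimate plus or minus a normal quantile times the standard error, exponentiated to get the odds ratio scale. The sketch below builds it by hand from coef() and vcov(), assuming the built-in mtcars data and a stand-in logistic model rather than the plasma fit.

```r
# Stand-in logistic model (assumption: mtcars replaces the plasma data)
fit <- glm(am ~ wt, data = mtcars, family = binomial("logit"))

beta <- coef(fit)                 # log odds ratios
se   <- sqrt(diag(vcov(fit)))     # standard errors from the variance-covariance matrix
z    <- qnorm(0.975)              # normal quantile for a 95% interval

or_tab <- cbind(OR    = exp(beta),
                lower = exp(beta - z * se),
                upper = exp(beta + z * se))
print(or_tab)
```

Note that confint(fit), without .default, uses profile likelihood for glm objects and will generally give slightly different limits.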
Model selection
fit2<-glm(ESR2~fibrinogen+globulin,data=plasma,family=binomial('logit'))
step(fit2) ###backward selection by default
Hosmer-Lemeshow goodness of fit test hoslem.test(x,y,g=10) in R package ResourceSelection #x is a numeric vector of observations, binary (0/1) #y is expected values #g is number of bins to use to calculate quantiles
Example hoslem.test(ESR2,fit2$fitted.values)
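To see what the test computes, the Hosmer-Lemeshow statistic can be sketched in base R: group observations by quantiles of the fitted probabilities, compare observed and expected event counts per group, and refer the sum to a chi-square distribution with g-2 degrees of freedom. The data below are simulated purely for illustration, and this hand computation is written in the spirit of hoslem.test, not as a reproduction of its exact binning.

```r
# Simulated data (assumption: illustrative only, not the plasma data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial("logit"))

p <- fitted(fit)   # expected probabilities
g <- 10            # number of groups
# Group observations by quantiles of the fitted probabilities
grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
           include.lowest = TRUE)

obs1 <- tapply(y, grp, sum)      # observed events per group
exp1 <- tapply(p, grp, sum)      # expected events per group
n_g  <- tapply(p, grp, length)   # group sizes

# Sum over groups of (O - E)^2 / (E * (1 - E / n_g)), combining events and non-events
hl      <- sum((obs1 - exp1)^2 / (exp1 * (1 - exp1 / n_g)))
p_value <- pchisq(hl, df = g - 2, lower.tail = FALSE)
print(c(statistic = hl, p.value = p_value))
```

A large p-value gives no evidence of lack of fit; it does not by itself show that the model is correct.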
Questions Contact email: daniel-zhao@ouhsc.edu and Sixia-Chen@ouhsc.edu