R Short Course Session 5 Overview: Linear and Logistic Regression

R Short Course

Session 5

Daniel Zhao, PhD

Sixia Chen, PhD

Department of Biostatistics and Epidemiology

College of Public Health, OUHSC

9/23/2020

Outline

• Linear Regression
  – Fit linear regression and check results
  – Examine normality, independence and heteroscedasticity
  – Examine outliers and influence points
  – Examine collinearity
  – Model selection procedure
  – Model comparisons

Outline (2)

• Logistic Regression
  – Fit logistic model and check results
  – Odds ratio and confidence intervals
  – Model selection procedure
  – Goodness of fit test

Linear Regression

• Fitting linear regression of y vs x: lm(y~x, data=, weights=, singular.ok=TRUE)
• Example: Input data is the ‘Prestige’ dataset in R package ‘car’ (run library(car) first)
• Variables: education, income, women, prestige, census and type

Linear Regression (2)

• summary(Prestige)

Linear Regression (3)

• pairs(Prestige)

Fit linear regression

• reg1 <- lm(prestige~education+log2(income)+women, data=Prestige)

Check the results

• summary(reg1)

Check the results (2)

• attributes(reg1)

Check the results (3)

• reg1$coefficients
• reg1$df.residual
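The pieces of the fitted object listed above can also be pulled out with extractor functions. The following is a minimal sketch, assuming the 'carData' package (installed together with 'car') supplies the Prestige data; the names coef_table, ci and fit_summary are illustrative and not from the slides.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")

reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)

# Coefficient table as a numeric matrix:
# estimate, standard error, t value and p-value for each term
coef_table <- coef(summary(reg1))
print(round(coef_table, 4))

# 95% confidence intervals for the coefficients (t-based)
ci <- confint(reg1, level = 0.95)
print(ci)

# Overall fit statistics stored in the summary object
fit_summary <- summary(reg1)
cat("R-squared:", fit_summary$r.squared, "\n")
cat("Residual df:", reg1$df.residual, "\n")
```

Working with coef(summary(reg1)) rather than the printed summary makes it easy to reuse the estimates and standard errors in later calculations.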

Examine Normality

• QQ plot for studentized residuals: qqPlot(reg1, main="QQ Plot")
• Distribution of studentized residuals
  – library(MASS)
  – sresid <- studres(reg1)
  – par(mfrow=c(1,2))
  – qqPlot(reg1, main="QQ Plot")
  – hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
  – xfit <- seq(min(sresid), max(sresid), length=40)
  – yfit <- dnorm(xfit)
  – lines(xfit, yfit)

Examine Normality (2)

Examine Normality (3)

• shapiro.test(reg1$residuals)
• ad.test(reg1$residuals) # Anderson-Darling test; requires R package ‘nortest’

Examine Independence

• Plots:

Examine Independence (2)

• Durbin-Watson test: durbinWatsonTest(reg1)

Examine heteroscedasticity

• Plots:

Examine heteroscedasticity (2)

• Goldfeld-Quandt test: gqtest(reg1). Note that we need to install R package ‘lmtest’
• Non-constant error variance test: ncvTest(reg1)
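A graphical check of constant variance can be sketched with base R's plot method for lm objects. This is one common choice and not necessarily the plots used in the session; it assumes Prestige is available from the 'carData' package.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")
reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)

old_par <- par(mfrow = c(1, 2))
plot(reg1, which = 1)  # residuals vs fitted values
plot(reg1, which = 3)  # scale-location: sqrt(|standardized residuals|) vs fitted
par(old_par)           # restore the previous plotting layout
```

A funnel shape in the first panel, or a clear trend in the second, would suggest that the error variance changes with the fitted value.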

Examine Outliers

• outlierTest(reg1) # Bonferroni p-value for the most extreme observation
• lm.influence(reg1) # Computes the diagonal of the hat matrix (leverages) and the influence of each point on the estimated regression coefficients and residual standard deviation

Examine Outliers (2)

• lm.influence(reg1)$hat # calculate leverage
• lm.influence(reg1)$hat[lm.influence(reg1)$hat > 3*(3+1)/102] # identify high-leverage cases: hat > 3(k+1)/n, with k=3 predictors and n=102 observations
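The leverage cutoff 3(k+1)/n can be computed from the fitted model instead of being typed in by hand. A minimal sketch, assuming Prestige comes from the 'carData' package; the names lev, cutoff and high_leverage are illustrative.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")
reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)

lev <- hatvalues(reg1)     # same values as lm.influence(reg1)$hat
n <- length(lev)           # number of observations used in the fit
p <- length(coef(reg1))    # number of coefficients, k + 1

cutoff <- 3 * p / n        # 3(k+1)/n rule from the slide
high_leverage <- lev[lev > cutoff]
print(sort(high_leverage, decreasing = TRUE))
```

Deriving n and p from the model keeps the cutoff correct if predictors are added or rows with missing values are dropped.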

Examine influence points

• influence.measures(reg1) # calculate DFFITS, Cook’s D, DFBETAS and covariance ratios

Examine influence points (2)

• Cook’s D plot (identify D values > 4/(n-k-1)):
  – cutoff <- 4/(nrow(Prestige) - length(reg1$coefficients)) # length(reg1$coefficients) is k+1
  – plot(reg1, which=4, cook.levels=cutoff)
• Influence Plot: influencePlot(reg1, id=list(method="identify"), main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
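The same 4/(n-k-1) rule can be applied numerically to list the flagged observations. A minimal sketch, assuming Prestige comes from the 'carData' package; it uses the fact that the residual degrees of freedom of reg1 equal n-k-1.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")
reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)

d <- cooks.distance(reg1)        # Cook's distance for every observation
cutoff <- 4 / reg1$df.residual   # df.residual = n - k - 1
flagged <- d[d > cutoff]
print(sort(flagged, decreasing = TRUE))
```

The names of the flagged vector are the occupation labels (row names of Prestige), which helps when matching points on the plots.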

Cook’s D plot

Influence Plot

Examine Collinearity

• Variance inflation factors: vif(reg1)
• Problem? sqrt(vif(reg1)) > 2
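To see what the variance inflation factor measures, it can be reproduced by regressing each predictor on the others and taking 1/(1 - R^2). This is a sketch in base R, assuming Prestige comes from 'carData'; the helper column log2income and the name vif_manual are illustrative, and the values should agree with vif(reg1) for this model.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")
Prestige$log2income <- log2(Prestige$income)  # illustrative helper column

predictors <- c("education", "log2income", "women")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = Prestige)
  1 / (1 - summary(aux)$r.squared)
})

print(vif_manual)
print(sqrt(vif_manual) > 2)  # rule of thumb from the slide
```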

Examine Collinearity (2)

• Added variable plots: avPlots(reg1)

Model Selection Procedures

• step(object=, scope=list(lower=, upper=), direction=c("both","backward","forward"), steps=1000, k=2, …)
  # object is an object representing a model
  # scope defines the range of models examined in the stepwise search
  # direction controls the mode of stepwise search
  # k is the penalty per parameter; k=2 gives AIC
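The scope argument matters most for forward selection, where the search starts from a small model and needs an upper limit. A minimal sketch, assuming Prestige comes from 'carData'; null_model, upper_formula and forward_fit are illustrative names.

```r
# Assumption: Prestige is available from 'carData' (a dependency of 'car').
data("Prestige", package = "carData")

# Start from the intercept-only model
null_model <- lm(prestige ~ 1, data = Prestige)

# Largest model the search is allowed to reach
upper_formula <- ~ education + log2(income) + women

forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = upper_formula),
                    direction = "forward",
                    trace = 0)  # trace = 0 suppresses the step-by-step log

print(formula(forward_fit))
print(AIC(forward_fit))
```

With direction = "forward", a term is added only when it lowers the AIC, so the selected model never has a higher AIC than the starting model.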

Model Selection Procedures (2)

• edu2 <- Prestige$education^2
• loginc2 <- log2(Prestige$income)^2
• edulogin <- Prestige$education*log2(Prestige$income)
• reg2 <- lm(prestige~education+edu2+log2(income)+loginc2+edulogin+women, data=Prestige)
• step(reg2, direction='both')

Model Selection Procedures (3)

Model Comparisons

• anova(reg1, reg2) # F test comparing the nested models reg1 and reg2

Logistic Regression

• Input data: ‘plasma’ in package ‘HSAUR’
• Variables:
  – fibrinogen: the fibrinogen level in the blood
  – globulin: the globulin level in the blood
  – ESR: the erythrocyte sedimentation rate, either less than or greater than 20 mm/hour

Logistic Regression (2)

• ‘plasma’ data: head(plasma)
• Objective: fit a logistic regression using ESR as the dependent variable and the other two variables as independent variables

Logistic Regression (3)

• glm(formula, data=, family=, weights=, …)
  # formula is y~x type; the intercept is included by default and can be dropped with y~x-1
  # data is the input dataset
  # family can be gaussian, binomial or others
  # weights specifies weighted or unweighted analysis

Fit logistic model

• fit <- glm(ESR~fibrinogen+globulin, data=plasma, family=binomial('logit'))

Fit logistic model (2)

• summary(fit)

Fit logistic model (3)

• attributes(fit)

Logistic regression plot

• attach(plasma)
• ESR2 <- rep(1, dim(plasma)[1])
• ESR2[ESR=='ESR > 20'] <- 0 # ESR2 is 1 for 'ESR < 20' and 0 for 'ESR > 20'
• fit <- glm(ESR2~fibrinogen, data=plasma, family=binomial('logit'))
• plot(fibrinogen, ESR2)
• lines(fibrinogen[order(fibrinogen)], fit$fitted.values[order(fibrinogen)])
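A smoother fitted curve can be drawn by predicting on an evenly spaced grid rather than at the observed points only. The sketch below keeps the 0/1 coding from the slide but avoids attach() by storing ESR2 in the data frame; it assumes the 'HSAUR' package is installed, and fit_fib and grid are illustrative names.

```r
# Assumption: the 'HSAUR' package is installed and provides the plasma data.
data("plasma", package = "HSAUR")

# Same coding as the slide: 1 for "ESR < 20", 0 for "ESR > 20"
plasma$ESR2 <- ifelse(plasma$ESR == "ESR > 20", 0, 1)

fit_fib <- glm(ESR2 ~ fibrinogen, data = plasma, family = binomial("logit"))

# Evenly spaced fibrinogen values covering the observed range
grid <- data.frame(fibrinogen = seq(min(plasma$fibrinogen),
                                    max(plasma$fibrinogen),
                                    length.out = 100))

# type = "response" returns fitted probabilities instead of log-odds
grid$prob <- predict(fit_fib, newdata = grid, type = "response")

plot(plasma$fibrinogen, plasma$ESR2,
     xlab = "fibrinogen", ylab = "ESR2 (1 = ESR < 20)")
lines(grid$fibrinogen, grid$prob)
```

Because ESR2 equals 1 for 'ESR < 20', the curve shows the estimated probability that ESR is below 20 mm/hour.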

Logistic regression plot (2)

Odds ratio and confidence intervals

• Calculate odds ratios: exp(coef(fit))
• Calculate the variance-covariance matrix of the coefficients: vcov(fit)

Odds ratio and confidence intervals (2)

• Wald confidence intervals for the coefficients: confint.default(fit)
• Confidence intervals for the odds ratios: exp(confint.default(fit))
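The intervals from confint.default() are Wald intervals, estimate ± z × standard error, and can be rebuilt from coef() and vcov(). A minimal sketch, assuming the 'HSAUR' package is installed; fit_fib, wald_ci and or_table are illustrative names.

```r
# Assumption: the 'HSAUR' package is installed and provides the plasma data.
data("plasma", package = "HSAUR")
plasma$ESR2 <- ifelse(plasma$ESR == "ESR > 20", 0, 1)  # coding from the slide
fit_fib <- glm(ESR2 ~ fibrinogen, data = plasma, family = binomial("logit"))

est <- coef(fit_fib)              # log-odds estimates
se <- sqrt(diag(vcov(fit_fib)))   # standard errors from the covariance matrix
z <- qnorm(0.975)                 # normal quantile for a 95% interval

wald_ci <- cbind(lower = est - z * se, upper = est + z * se)

# Exponentiate to move from the log-odds scale to the odds-ratio scale
or_table <- cbind(OR = exp(est), exp(wald_ci))
print(or_table)
```

Exponentiating the endpoints, rather than computing a symmetric interval around the odds ratio, keeps the interval inside the positive range.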

Model selection

• fit2 <- glm(ESR2~fibrinogen+globulin, data=plasma, family=binomial('logit'))
• step(fit2) # backward selection by default when no scope is given

Model selection (2)

Hosmer-Lemeshow goodness of fit test

• hoslem.test(x, y, g=10) in R package ‘ResourceSelection’
  # x is a numeric vector of observations, binary (0/1)
  # y is the vector of expected (fitted) values
  # g is the number of bins used to calculate quantiles

Example

• hoslem.test(ESR2, fit2$fitted.values)

Questions

• Contact email: daniel-zhao@ouhsc.edu and Sixia-Chen@ouhsc.edu


In this session, Dr. Daniel Zhao and Dr. Sixia Chen from the Department of Biostatistics and Epidemiology at the College of Public Health, OUHSC, cover topics on linear regression including fitting models, checking results, examining normality, outliers, collinearity, model selection, and comparisons. Additionally, logistic regression is discussed with fitting models, odds ratio calculations, and goodness of fit tests. Practical examples using the Prestige dataset in R are provided, along with detailed instructions on model summaries, pair comparisons, and result checks using various statistical techniques.

