Categorical Data Analysis in Population Studies

Chapter 8

Categorical Data Analysis

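Inference for a Single Proportion ($\pi$)

• Goal: Estimate the proportion of individuals in a population with a certain characteristic ($\pi$). This is equivalent to estimating a binomial probability.

• Sample: Take an SRS of $n$ individuals from the population and observe $y$ that have the characteristic. The sample proportion is $\hat{\pi} = y/n$ and has the following sampling properties:

– Mean and standard deviation of the sampling distribution: $\mu_{\hat{\pi}} = \pi$, $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$

– Estimated standard error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$

– Shape: approximately normal for large samples (rule of thumb: $n\pi \ge 5$ and $n(1-\pi) \ge 5$)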

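Large-Sample Confidence Interval for $\pi$

• Take an SRS of size $n$ from a population where $\pi$ is the true (unknown) proportion of successes.

– Observe $y$ successes

– Set the confidence level $(1-\alpha)$ and obtain $z_{\alpha/2}$ from the $z$-table

• Point estimate: $\hat{\pi} = y/n$

• Estimated standard error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$

• Margin of error: $m = z_{\alpha/2}\, SE_{\hat{\pi}}$

• $(1-\alpha)100\%$ confidence interval for $\pi$: $\hat{\pi} \pm m$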

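Example - Ginkgo and Acetazolamide for AMS

• Study Goal: Measure the effect of Ginkgo and Acetazolamide on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers

• Parameter: $\pi$ = true proportion of all trekkers receiving Ginkgo and Acetazolamide (G&A) who would suffer from AMS

• Sample Data: $n = 126$ trekkers received G&A, $y = 18$ suffered from AMS

• Point estimate and estimated standard error: $\hat{\pi} = 18/126 = 0.143$, $SE_{\hat{\pi}} = \sqrt{(0.14)(0.86)/126} = 0.031$

• Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(0.031) = 0.061$

• 95% CI for $\pi$: $0.143 \pm 0.061 \equiv (0.082, 0.204)$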
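The interval for this example can be reproduced in a few lines of R. This is a minimal sketch of the large-sample calculation using the counts quoted above; the variable names (`pi.hat`, `se`, `m`, `ci`) are illustrative, and `qnorm(0.975)` is used in place of the tabled value 1.96.

```r
# Large-sample 95% confidence interval for a single proportion
y <- 18; n <- 126                         # AMS cases and trekkers receiving G&A
pi.hat <- y / n                           # point estimate
se <- sqrt(pi.hat * (1 - pi.hat) / n)     # estimated standard error
m <- qnorm(0.975) * se                    # margin of error (z_{.025} is about 1.96)
ci <- c(lower = pi.hat - m, upper = pi.hat + m)
round(ci, 3)
```

The result agrees with the hand calculation to three decimal places.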

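Sample Size for Margin of Error = $E$

• Goal: Estimate $\pi$ within $E$ with $100(1-\alpha)\%$ confidence

• The confidence interval will have a width of $2E$

• Set the margin of error equal to $E$ and solve for $n$: $m = E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n} \;\Rightarrow\; n = z_{\alpha/2}^2\,\pi(1-\pi)/E^2$

• Since $\pi$ is unknown, an educated guess can be used, or set $\pi = 0.5$. This is the most conservative choice, as $\pi(1-\pi)$ is largest for $\pi = 0.5$

• For $\alpha = 0.05$ and $\pi = 0.5$: $z_{.025} \approx 2 \;\Rightarrow\; n \approx 2^2(0.5)(1-0.5)/E^2 = 1/E^2$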
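The sample-size calculation is easy to wrap in a small helper. The sketch below assumes the large-sample formula $n = z_{\alpha/2}^2\,\pi(1-\pi)/E^2$; the function name `n.for.margin` and its arguments are hypothetical, chosen only for illustration, and the margin 0.03 is an arbitrary example value.

```r
# Sample size so that a 100(1 - alpha)% CI for pi has margin of error E
n.for.margin <- function(E, pi.guess = 0.5, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)                        # z_{alpha/2}
  ceiling(z^2 * pi.guess * (1 - pi.guess) / E^2)   # round up to a whole subject
}

n.for.margin(0.03)                   # conservative choice, pi = 0.5
n.for.margin(0.03, pi.guess = 0.15)  # smaller n if pi is believed to be near 0.15
```

Using an educated guess away from 0.5 always gives a smaller required sample size, which is why $\pi = 0.5$ is the conservative default.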

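Wilson-Agresti-Coull Method

• For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities

• Simple approach that works well in practice:

– Adjust the observed number of successes ($y$) and the sample size ($n$): $\tilde{y} = y + 0.5\,z_{\alpha/2}^2$, $\tilde{n} = n + z_{\alpha/2}^2$

– Note that for $\alpha = 0.05$, $z_{.025}^2 = 1.96^2 \approx 4$, so $\tilde{y} \approx y + 2$ and $\tilde{n} \approx n + 4$

• Point estimate: $\tilde{\pi} = \tilde{y}/\tilde{n}$

• Estimated standard error: $SE_{\tilde{\pi}} = \sqrt{\tilde{\pi}(1-\tilde{\pi})/\tilde{n}}$

• Margin of error: $m = z_{\alpha/2}\, SE_{\tilde{\pi}}$

• $(1-\alpha)100\%$ confidence interval for $\pi$: $\tilde{\pi} \pm m$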

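Example: Lister’s Tests with Antiseptic

• Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870)

• $n = 12$ patients received antiseptic, $y = 1$ died

• Adjusted counts: $\tilde{y} = 1 + 0.5(1.96)^2 = 1 + 0.5(3.84) = 2.92$, $\tilde{n} = 12 + (1.96)^2 = 12 + 3.84 = 15.84$

• Point estimate and standard error: $\tilde{\pi} = 2.92/15.84 = 0.1843$, $SE_{\tilde{\pi}} = \sqrt{0.1843(0.8157)/15.84} = 0.0974$

• Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(0.0974) = 0.1910$

• 95% CI for $\pi$: $0.1843 \pm 0.1910 \equiv (-0.0067, 0.3753)$, reported as $(0, 0.38)$ since a proportion cannot be negative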
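The adjusted interval for this example can be checked in R. This is a sketch of the adjustment described above (add $0.5\,z^2$ to the successes and $z^2$ to the sample size); the variable names are illustrative, and truncating the limits to $[0, 1]$ is an added convenience rather than part of the method itself.

```r
# Wilson-Agresti-Coull adjusted 95% CI for Lister's data
y <- 1; n <- 12
z <- qnorm(0.975)                          # about 1.96
y.adj <- y + 0.5 * z^2                     # adjusted number of successes
n.adj <- n + z^2                           # adjusted sample size
pi.adj <- y.adj / n.adj                    # adjusted point estimate
se.adj <- sqrt(pi.adj * (1 - pi.adj) / n.adj)
m <- z * se.adj                            # margin of error
wac.ci <- c(lower = max(0, pi.adj - m), upper = min(1, pi.adj + m))
round(c(estimate = pi.adj, wac.ci), 4)
```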

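Significance Test for a Proportion

• Goal: test whether a proportion ($\pi$) equals some null value $\pi_0$, that is, $H_0: \pi = \pi_0$

• Test statistic: $z_{obs} = \dfrac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$

• Rejection region (RR) and P-value for each alternative:

– $H_a: \pi > \pi_0$: RR: $z_{obs} \ge z_{\alpha}$, P-value $= P(Z \ge z_{obs})$

– $H_a: \pi < \pi_0$: RR: $z_{obs} \le -z_{\alpha}$, P-value $= P(Z \le z_{obs})$

– $H_a: \pi \ne \pi_0$: RR: $|z_{obs}| \ge z_{\alpha/2}$, P-value $= 2P(Z \ge |z_{obs}|)$

• The large-sample test works well when $n\pi_0 \ge 5$ and $n(1-\pi_0) \ge 5$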

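Ginkgo and Acetazolamide for AMS

• Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?

• $H_0: \pi = 0.25$ vs $H_a: \pi < 0.25$

• Data: $n = 126$, $y = 18$, $\hat{\pi} = 18/126 = 0.143$

• Test statistic: $z_{obs} = \dfrac{0.143 - 0.25}{\sqrt{0.25(0.75)/126}} = \dfrac{-0.107}{0.039} = -2.75$

• RR ($\alpha = 0.05$): $z_{obs} \le -z_{.05} = -1.645$

• P-value: $P(Z \le -2.75) = 0.0030$

• Exact P-value: $P(Y \le 18 \mid Y \sim \mathrm{Bin}(126, 0.25)) = 0.0025$

Strong evidence that the incidence rate is below 25% ($\pi < 0.25$)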
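The large-sample test can also be carried out directly from its formula. The sketch below uses base R only; the variable names are illustrative. Because it does not round intermediate values, the statistic comes out near -2.78 rather than exactly matching the hand calculation, which rounds the estimate and standard error.

```r
# Large-sample z-test of H0: pi = 0.25 vs Ha: pi < 0.25
y <- 18; n <- 126; pi0 <- 0.25
pi.hat <- y / n
z.obs <- (pi.hat - pi0) / sqrt(pi0 * (1 - pi0) / n)  # SE uses the null value
p.value <- pnorm(z.obs)                              # lower tail for Ha: pi < pi0
p.exact <- pbinom(y, n, pi0)                         # exact P(Y <= 18), Y ~ Bin(126, 0.25)
round(c(z.obs = z.obs, p.value = p.value, p.exact = p.exact), 4)
```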

R Code/Output

y <- 18; n <- 126

binom.test(y, n, p=0.25, alternative="less")

> binom.test(y, n, p=0.25, alternative="less")

        Exact binomial test

data:  y and n

number of successes = 18, number of trials = 126, p-value = 0.002465

alternative hypothesis: true probability of success is less than 0.25

95 percent confidence interval:

 0.0000000 0.2044495

sample estimates:

probability of success

             0.1428571

The 95% Confidence Interval is 1-sided as the alternative is

“less” than the null value.

Multinomial Experiment / Distribution

• Extension of the Binomial Distribution to experiments where each trial can end in exactly one of $k$ categories

• $n$ independent trials

• Probability a trial results in category $i$ is $\pi_i$

• $n_i$ is the number of trials resulting in category $i$

• $\pi_1 + \ldots + \pi_k = 1$

• $n_1 + \ldots + n_k = n$

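Multinomial Distribution / Test for Cell Probabilities

• Probability function: $p(n_1,\ldots,n_k) = \dfrac{n!}{n_1!\cdots n_k!}\,\pi_1^{n_1}\cdots\pi_k^{n_k}$, where $\sum_{i=1}^{k} n_i = n$, $\sum_{i=1}^{k} \pi_i = 1$, $n_i \ge 0$, $\pi_i \ge 0$

• Testing whether the category probabilities are specific values:

– $H_0: \pi_1 = \pi_{10}, \ldots, \pi_k = \pi_{k0}$, with $\sum_{i=1}^{k} \pi_{i0} = 1$

– $H_A$: At least one cell probability is not as specified

– Expected cell counts under $H_0$: $E_i = n\pi_{i0}$, $i = 1,\ldots,k$

– Test statistic: $X^2_{obs} = \sum_{i=1}^{k} \dfrac{(n_i - E_i)^2}{E_i}$

– Rejection region: $X^2_{obs} \ge \chi^2_{\alpha, k-1}$; P-value: $P(\chi^2_{k-1} \ge X^2_{obs})$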

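Example – English Premier League -2013

• Home Team Games can end in Win, Draw, Lose ($k$ = 3)

• Season: $n$ = 380 games (all 20 teams play each other Home/Away)

• Test $H_0: \pi_W = \pi_L = 0.40$, $\pi_D = 0.20$

• Data: $n_W$ = 179, $n_D$ = 78, $n_L$ = 123

• Expected counts under $H_0$: $E_W = E_L = 380(0.40) = 152$, $E_D = 380(0.20) = 76$ (note: $E_W + E_D + E_L = 380$)

• Test statistic: $X^2_{obs} = \dfrac{(179-152)^2}{152} + \dfrac{(78-76)^2}{76} + \dfrac{(123-152)^2}{152} = 4.796 + 0.053 + 5.533 = 10.382$

• Rejection region ($df = k - 1 = 2$): $X^2_{obs} \ge \chi^2_{.05,2} = 5.991$

• P-value: $P(\chi^2_2 \ge 10.382) = 0.0056$

• Sample proportions: $\hat{\pi}_W = 179/380 = 0.4711$, $\hat{\pi}_D = 78/380 = 0.2053$, $\hat{\pi}_L = 123/380 = 0.3237$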
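The goodness-of-fit statistic for this example can be built up step by step, which makes the roles of the expected counts and degrees of freedom explicit. This is a sketch in base R with illustrative variable names; it is not the only way to run the test.

```r
# Multinomial goodness-of-fit test computed from its definition
obs <- c(W = 179, D = 78, L = 123)          # observed home-team results
pi0 <- c(W = 0.40, D = 0.20, L = 0.40)      # null cell probabilities
E <- sum(obs) * pi0                         # expected counts under H0
X2 <- sum((obs - E)^2 / E)                  # Pearson chi-square statistic
df <- length(obs) - 1                       # k - 1 degrees of freedom
p.value <- pchisq(X2, df, lower.tail = FALSE)
round(c(X2 = X2, df = df, p.value = p.value), 4)
```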

English Premier League -2013 – R Code

#### Multinomial Goodness of Fit Test

## Give counts

game.count <- c(179, 78, 123)

## Give null values for probabilities

prob.null <- c(0.40, 0.20, 0.40)

## Use chisq.test function for the test

chisq.test(game.count, p=prob.null)

> chisq.test(game.count, p=prob.null)

        Chi-squared test for given probabilities

data:  game.count

X-squared = 10.382, df = 2, p-value = 0.005568

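Goodness of Fit Test for a Probability Distribution

• Data are collected, and we wish to determine whether they come from a specific probability distribution (e.g. Poisson, Normal, Gamma)

• Estimate any unknown model parameters ($p$ estimates)

• Break down the range of data values into $k > p$ intervals (typically where ≥ 80% have expected counts ≥ 5), and obtain observed ($n_i$) and expected ($E_i$) values for each interval

• Test statistic: $X^2_{obs} = \sum_{i=1}^{k} \dfrac{(n_i - E_i)^2}{E_i}$

• P-value: $P(\chi^2_{k-1-p} \ge X^2_{obs})$

• Assessing quality of fit to the hypothesized distribution:

P-value      Quality of Fit
> .25        Excellent
.15-.25      Good
.05-.15      Moderately Good
.01-.05      Poor
< .01        Unacceptable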

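Example – Goals in 2013 Brazil Soccer

• League has 20 teams, each team plays the other 19 teams twice (380 games)

• Games are 90 minutes, with no overtime

• Mean and variance of the total goals in a game are 2.46 and 2.61, respectively

• For the Poisson distribution, the theoretical mean and variance are the same. For these empirical data, they are close

• Expected counts under $Y \sim \mathrm{Poisson}(2.46)$: $E_i = 380\,\dfrac{e^{-2.46}\,2.46^{i}}{i!}$ for $i = 0,\ldots,6$, and $E_{7+} = 380 - \sum_{i=0}^{6} E_i$

Total goals   Count   Expected   X2
0             37      32.363     0.664
1             75      79.715     0.279
2             93      98.175     0.273
3             92      80.607     1.610
4             44      49.637     0.640
5             23      24.453     0.086
6             9       10.039     0.107
7+            7       5.011      0.789
Sum           380     380        4.449

• $df = k - 1 - p = 8 - 1 - 1 = 6$, $\chi^2_{.05,6} = 12.592$, P-value = 0.616, so the Poisson distribution fits well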
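The Poisson fit for this example can be sketched in R. The code assumes the observed numbers of games with 0, 1, ..., 6, and 7 or more total goals are 37, 75, 93, 92, 44, 23, 9, 7 and that the unrounded sample mean is 2.463158 goals per game; the variable names are illustrative. One degree of freedom is removed because the Poisson mean is estimated from the data.

```r
# Chi-square goodness-of-fit test for a Poisson distribution
goals <- c(37, 75, 93, 92, 44, 23, 9, 7)   # games with 0,...,6 and 7+ total goals
n.games <- sum(goals)                       # 380 games
lambda <- 2.463158                          # sample mean goals per game (estimated)
E <- n.games * dpois(0:6, lambda)           # expected counts for 0,...,6 goals
E <- c(E, n.games - sum(E))                 # remaining expected count goes to 7+
X2 <- sum((goals - E)^2 / E)
df <- length(goals) - 1 - 1                 # k - 1 - p, with p = 1 estimated parameter
p.value <- pchisq(X2, df, lower.tail = FALSE)
round(c(X2 = X2, df = df, p.value = p.value), 3)
```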

Comparing Two Population Proportions

•

Goal: Compare two populations/treatments wrt

a nominal (binary) outcome

•

Sampling Design: Independent vs Dependent

Samples

•

Methods based on large vs small samples

•

Contingency tables used to summarize data

•

Measures of Association: Absolute Risk,

Relative Risk, Odds Ratio

Contingency Tables

•

Tables representing all combinations of

levels of explanatory and response variables

• Numbers in the table represent counts of the number of cases in each cell

• Row and column totals are called marginal counts

2x2 Tables - Notation

Example - Firm Type/Product Quality

• Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers

• Outcomes: High Quality (High Count) vs Low Quality (Low Count)

Source: P. Temin (1988). “Product Quality and Vertical Integration in the Early Cotton Textile Industry,” Journal of Economic History, Vol. 48, #4, pp. 891-907.

Notation

• Proportion in Population 1 with the characteristic of interest: $\pi_1$

• Sample size from Population 1: $n_1$

• Number of individuals in Sample 1 with the characteristic of interest: $y_1$

• Sample proportion from Sample 1 with the characteristic of interest: $\hat{\pi}_1 = y_1/n_1$

• Similar notation for Population/Sample 2

Example - Cotton Textile Producers

$\pi_1$ - True proportion of all non-integrated firms that would produce High quality

$\pi_2$ - True proportion of all vertically integrated firms that would produce High quality

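Notation (Continued)

• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions with the characteristic (two other measures given below)

• Estimator: $\hat{\pi}_1 - \hat{\pi}_2 = \dfrac{y_1}{n_1} - \dfrac{y_2}{n_2}$

• Standard Error (and its estimator): $\sigma_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}$, $\quad SE_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\dfrac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \dfrac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$

• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $SE_{pooled} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$, where $\hat{\pi} = \dfrac{y_1 + y_2}{n_1 + n_2}$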

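Cotton Textile Producers (Continued)

• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions that produce High quality output

• Sample data: $y_1 = 33$ of $n_1 = 88$ non-integrated firms and $y_2 = 5$ of $n_2 = 84$ vertically integrated firms produced High quality

• Estimator: $\hat{\pi}_1 - \hat{\pi}_2 = \dfrac{33}{88} - \dfrac{5}{84} = 0.3750 - 0.0595 = 0.3155$

• Standard Error (estimate): $SE_{\hat{\pi}_1 - \hat{\pi}_2} = \sqrt{\dfrac{0.3750(0.6250)}{88} + \dfrac{0.0595(0.9405)}{84}} = 0.0577$

• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \dfrac{33 + 5}{88 + 84} = 0.2209$, $\quad SE_{pooled} = \sqrt{0.2209(0.7791)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)} = 0.0633$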

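Significance Tests for $\pi_1 - \pi_2$

• Testing whether $\pi_1 = \pi_2$ can be done by interpreting “plausible values” of $\pi_1 - \pi_2$ from the confidence interval:

– If the entire interval is positive, conclude $\pi_1 > \pi_2$ ($\pi_1 - \pi_2 > 0$)

– If the entire interval is negative, conclude $\pi_1 < \pi_2$ ($\pi_1 - \pi_2 < 0$)

– If the interval contains 0, do not conclude that $\pi_1 \ne \pi_2$

• Alternatively, we can conduct a significance test:

– $H_0: \pi_1 - \pi_2 = 0$ vs $H_a: \pi_1 - \pi_2 \ne 0$ (2-sided) or $H_a: \pi_1 - \pi_2 > 0$ (1-sided)

– Test Statistic: $z_{obs} = \dfrac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $\hat{\pi} = \dfrac{y_1 + y_2}{n_1 + n_2}$

– RR: $|z_{obs}| \ge z_{\alpha/2}$ (2-sided), $\quad z_{obs} \ge z_{\alpha}$ (1-sided)

– P-value: $2P(Z \ge |z_{obs}|)$ (2-sided), $\quad P(Z \ge z_{obs})$ (1-sided)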

Example - Cotton Textile Production

Strong evidence of differences in quality by firm type
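The test and interval for this example can be computed from first principles. The sketch below assumes the counts 33 of 88 (non-integrated) and 5 of 84 (vertically integrated) and uses the pooled standard error for the test and the unpooled standard error for the confidence interval; variable names are illustrative.

```r
# Two-sample z-test and 95% CI for pi1 - pi2 (cotton textile producers)
y1 <- 33; n1 <- 88                      # non-integrated firms, High quality
y2 <- 5;  n2 <- 84                      # vertically integrated firms, High quality
p1 <- y1 / n1; p2 <- y2 / n2
p.pool <- (y1 + y2) / (n1 + n2)         # pooled proportion under H0: pi1 = pi2
se.pool <- sqrt(p.pool * (1 - p.pool) * (1 / n1 + 1 / n2))
z.obs <- (p1 - p2) / se.pool
p.value <- 2 * pnorm(abs(z.obs), lower.tail = FALSE)   # 2-sided P-value
se.unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se.unpooled
round(c(z.obs = z.obs, z.squared = z.obs^2, p.value = p.value), 4)
round(ci, 4)
```

The squared z-statistic equals the chi-square statistic that `prop.test` reports when the continuity correction is turned off.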

R Code and Output

y1 <- 33; n1 <- 88

y2 <- 5;  n2 <- 84

prop.test(c(y1,y2), c(n1,n2), correct=F)

> prop.test(c(y1,y2), c(n1,n2), correct=F)

        2-sample test for equality of proportions without continuity correction

data:  c(y1, y2) out of c(n1, n2)

X-squared = 24.851, df = 1, p-value = 6.195e-07

alternative hypothesis: two.sided

95 percent confidence interval:

 0.2023778 0.4285746

sample estimates:

    prop 1     prop 2

0.37500000 0.05952381

Measures of Association

• Absolute Risk (AR): $\pi_1 - \pi_2$

• Relative Risk (RR): $\pi_1 / \pi_2$

• Odds Ratio (OR): $o_1 / o_2$, where $o = \pi/(1-\pi)$

• Note that if $\pi_1 = \pi_2$ (No association between outcome and grouping variables):

– AR=0

– RR=1

– OR=1

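Relative Risk

• Ratio of the probability that the outcome characteristic is present for one group, relative to the other

• Sample proportions with characteristic from groups 1 and 2: $\hat{\pi}_1 = \dfrac{y_1}{n_1}$, $\quad \hat{\pi}_2 = \dfrac{y_2}{n_2}$

• Estimated Relative Risk: $RR = \dfrac{\hat{\pi}_1}{\hat{\pi}_2}$

• 95% Confidence Interval for Population Relative Risk: $\left(RR\,e^{-1.96\sqrt{v}},\; RR\,e^{1.96\sqrt{v}}\right)$, where $v = \dfrac{1-\hat{\pi}_1}{y_1} + \dfrac{1-\hat{\pi}_2}{y_2}$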

Relative Risk

•

Interpretation

–

Conclude that the probability that the outcome

is present is higher (in the population) for group

1 if the entire interval is above 1

–

Conclude that the probability that the outcome

is present is lower (in the population) for group

1 if the entire interval is below 1

–

Do not conclude that the probability of the

outcome differs for the two groups if the

interval contains 1
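As an illustration of the relative risk and its interval, the sketch below applies the usual large-sample interval on the log scale to the cotton textile counts used earlier (33 of 88 versus 5 of 84). The function name `rel.risk.ci` is hypothetical, and the choice of this data set is only for illustration.

```r
# Estimated relative risk with a large-sample CI computed on the log scale
rel.risk.ci <- function(y1, n1, y2, n2, conf = 0.95) {
  p1 <- y1 / n1; p2 <- y2 / n2
  rr <- p1 / p2                              # estimated relative risk
  v <- (1 - p1) / y1 + (1 - p2) / y2         # estimated variance of log(RR)
  z <- qnorm(1 - (1 - conf) / 2)
  c(RR = rr, lower = rr * exp(-z * sqrt(v)), upper = rr * exp(z * sqrt(v)))
}

rr.cotton <- rel.risk.ci(33, 88, 5, 84)
round(rr.cotton, 2)
```

Since the whole interval lies above 1, the interpretation rules above lead to the conclusion that the probability of the outcome is higher for group 1.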

Example - Concussions in NCAA Athletes

• Units: Game exposures among college soccer players, 1997-1999

• Outcome: Presence/Absence of a Concussion

• Group Variable: Gender (Female vs Male)

• Contingency Table of case outcomes:

Source: Covassin, et al (2003). “Sex Differences and the Incidence of Concussions Among Collegiate Athletes,” Journal of Athletic Training, Vol. 38, #3, pp. 238-244

Example - Concussions in NCAA Athletes

There is strong evidence that females have a higher risk of concussion

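Odds Ratio

• Odds of an event is the probability it occurs divided by the probability it does not occur

• Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2

• Sample odds of the outcome for each group: $odds_1 = \dfrac{\hat{\pi}_1}{1-\hat{\pi}_1} = \dfrac{y_1}{n_1 - y_1}$, $\quad odds_2 = \dfrac{\hat{\pi}_2}{1-\hat{\pi}_2} = \dfrac{y_2}{n_2 - y_2}$

• Estimated Odds Ratio: $OR = \dfrac{odds_1}{odds_2} = \dfrac{y_1(n_2 - y_2)}{y_2(n_1 - y_1)}$

• 95% Confidence Interval for Population Odds Ratio: $\left(OR\,e^{-1.96\sqrt{v}},\; OR\,e^{1.96\sqrt{v}}\right)$, where $v = \dfrac{1}{y_1} + \dfrac{1}{n_1 - y_1} + \dfrac{1}{y_2} + \dfrac{1}{n_2 - y_2}$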

Odds Ratio

•

Interpretation

–

Conclude that the probability that the outcome

is present is higher (in the population) for group

1 if the entire interval is above 1

–

Conclude that the probability that the outcome

is present is lower (in the population) for group

1 if the entire interval is below 1

–

Do not conclude that the probability of the

outcome differs for the two groups if the

interval contains 1

Osteoarthritis in Former Soccer Players

• Units: 68 former British professional football players and 136 age/sex matched controls

• Outcome: Presence/Absence of Osteoarthritis (OA)

• Data:

– Of $n_1$ = 68 former professionals, $y_1$ = 9 had OA, $n_1 - y_1$ = 59 did not

– Of $n_2$ = 136 controls, $y_2$ = 2 had OA, $n_2 - y_2$ = 134 did not

Source: Shepard, et al (2003). “Ex-professional association footballers have an increased prevalence of osteoarthritis of the hip compared with age matched controls despite not having sustained notable hip injuries,” Brit. J. Sports Med., Vol. 37, #1, pp. 80-81

The entire 95% confidence interval for the odds ratio is above 1, so conclude that OA is more prevalent among former professional players.
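The odds ratio and its interval for the osteoarthritis data can be sketched as follows. The code uses the standard large-sample interval on the log scale; the function name `odds.ratio.ci` is hypothetical and introduced only for this illustration.

```r
# Estimated odds ratio with a large-sample CI computed on the log scale
odds.ratio.ci <- function(y1, n1, y2, n2, conf = 0.95) {
  or <- (y1 * (n2 - y2)) / (y2 * (n1 - y1))           # sample odds ratio
  v <- 1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2) # estimated variance of log(OR)
  z <- qnorm(1 - (1 - conf) / 2)
  c(OR = or, lower = or * exp(-z * sqrt(v)), upper = or * exp(z * sqrt(v)))
}

or.oa <- odds.ratio.ci(y1 = 9, n1 = 68, y2 = 2, n2 = 136)
round(or.oa, 2)
```

The interval is wide because only 2 controls had OA, but its lower limit is still well above 1.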

Fisher’s Exact Test

• Method for testing whether $\pi_1 = \pi_2$ when one or both of the group sample sizes is small

• Measures (conditional on the group sizes and number of cases with and without the characteristic) the chances we would see differences of this magnitude or larger in the sample proportions, if there were no differences in the populations

Example – Echinacea Purpurea for Colds

• Healthy adults randomized to receive EP ($n_1$ = 24) or placebo ($n_2$ = 22, two were dropped)

• Among EP subjects, 14 of 24 developed cold after exposure to RV-39 (58%)

• Among Placebo subjects, 18 of 22 developed cold after exposure to RV-39 (82%)

• Out of a total of 46 subjects, 32 developed cold

• Out of a total of 46 subjects, 24 received EP

Source: S.J. Sperber, et al (2004), “Echinacea Purpurea for Prevention of Experimental Rhinovirus Colds,” Clinical Infectious Diseases, Vol. 38, #10, pp. 1367-1371.

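Example – Echinacea Purpurea for Colds

• Conditional on 32 people developing colds, 24 receiving EP, and 22 receiving placebo, the outcomes that would have been as strong or stronger evidence that EP reduced the risk of developing a cold (1-sided test) are those with 14 or fewer colds among the EP subjects (10 to 14, since at most 22 colds can occur in the placebo group). The P-value is the sum of the conditional probabilities of these outcomes. P-value from R is .079 (next slide).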
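The conditional calculation behind the exact test can be written out with the hypergeometric distribution. In this sketch the number of colds among the 24 EP subjects is treated as a draw from an urn with 32 "cold" and 14 "no cold" subjects; variable names are illustrative.

```r
# One-sided Fisher exact P-value from the hypergeometric distribution
n.ep <- 24                     # subjects receiving EP
colds <- 32; no.colds <- 14    # column totals: developed cold / did not
ep.colds.obs <- 14             # observed colds among EP subjects

# Possible numbers of colds in the EP group given the fixed margins
possible <- max(0, n.ep - no.colds):min(n.ep, colds)
probs <- dhyper(possible, colds, no.colds, n.ep)

# Outcomes at least as favorable to EP: 14 or fewer colds in the EP group
p.one.sided <- sum(probs[possible <= ep.colds.obs])
round(p.one.sided, 5)
```

The same value is returned by `phyper(14, 32, 14, 24)`, and it matches the one-sided P-value that `fisher.test` reports for this table.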

R Code/Output

ep.cold <- matrix(c(10,4, 14,18), ncol=2)

fisher.test(ep.cold, alt="greater")

fisher.test(ep.cold, alt="two.sided")

> fisher.test(ep.cold, alt="greater")

        Fisher's Exact Test for Count Data

data:  ep.cold

p-value = 0.07867

alternative hypothesis: true odds ratio is greater than 1

95 percent confidence interval:

 0.8653928       Inf

sample estimates:

odds ratio

  3.132591

McNemar’s Test for Paired Samples

•

Common subjects (or matched pairs) being observed

under 2 conditions (2 treatments, before/after, 2

diagnostic tests) in a crossover setting

•

Two possible outcomes (Presence/Absence of

Characteristic) on each measurement

•

Four possibilities for each subject/pair wrt outcome:

–

Present in both conditions

–

Absent in both conditions

–

Present in Condition 1, Absent in Condition 2

–

Absent in Condition 1, Present in Condition 2

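McNemar’s Test for Paired Samples

• Data: $n_{12}$ = # of pairs where the characteristic is present in condition 1 and not 2, and $n_{21}$ = # where present in 2 and not 1

• $H_0$: Probability the outcome is Present is the same for the 2 conditions ($\pi_1 = \pi_2$)

• $H_A$: Probabilities differ for the 2 conditions ($\pi_1 \ne \pi_2$)

• Test statistic: $z_{obs} = \dfrac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}}$, with RR: $|z_{obs}| \ge z_{\alpha/2}$ and P-value $= 2P(Z \ge |z_{obs}|)$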

Example - Reporting of Silicone Breast

Implant Leakage in Revision Surgery

•

Subjects - 165 women having revision surgery involving

silicone gel breast implants

•

Conditions (Each being observed on all women)

–

Self Report of Presence/Absence of Rupture/Leak

–

Surgical Record of Presence/Absence of Rupture/Leak

Source: Brown and Pennello (2002), “Replacement Surgery and Silicone Gel Breast Implant Rupture”,

Journal of Women’s Health & Gender-Based Medicine

, Vol. 11, pp 255-264

Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery

• $H_0$: Tendency to report ruptures/leaks is the same for self reports and surgical records

• $H_A$: Tendencies differ
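McNemar's statistic depends only on the two discordant counts, so it is easy to compute by hand. The sketch below assumes the off-diagonal cells of the 2x2 table for this example are 28 and 5 (the pairs where self report and surgical record disagree); variable names are illustrative.

```r
# McNemar's test from the discordant pair counts
n12 <- 28; n21 <- 5                       # pairs where the two reports disagree
z.obs <- (n12 - n21) / sqrt(n12 + n21)    # z-statistic
p.value <- 2 * pnorm(abs(z.obs), lower.tail = FALSE)
round(c(z.obs = z.obs, chi.square = z.obs^2), 3)
p.value
```

The squared z-statistic is the chi-square value reported by `mcnemar.test` without the continuity correction.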

R Code and Output

rupture <- matrix(c(69,5, 28,63), ncol=2)

mcnemar.test(rupture, correct=F)

> mcnemar.test(rupture, correct=F)

        McNemar's Chi-squared test

data:  rupture

McNemar's chi-squared = 16.03, df = 1, p-value = 6.234e-05

Note that the mcnemar.test function reports $z^2$, which is chi-square with 1 degree of freedom (thus it is equivalent to the z-test).

Mantel-Haenszel Test / CI for Multiple Tables

•

Data collected from q studies or strata in 2x2

contingency tables with common groupings/outcomes

•

Each table has 4 cells: $n_{h11}$, $n_{h12}$, $n_{h21}$, $n_{h22}$, $h = 1,\ldots,q$

•

They can be combined for an overall Chi-square statistic

or odds ratio and confidence Interval

Mantel-Haenszel Computations

Associations Between Categorical

Variables

•

Case where both explanatory (independent)

variable and response (dependent) variable

are qualitative

•

Association: The distributions of responses

differ among the levels of the explanatory

variable (e.g. Party affiliation by gender)

Contingency Tables

•

Cross-tabulations of frequency counts where the

rows (typically) represent the levels of the

explanatory variable and the columns represent

the levels of the response variable.

•

 Numbers within the table represent the numbers

of individuals falling in the corresponding

combination of levels of the two variables

•

Row and column totals are called the

marginal

distributions

 for the two variables

Example – Acute Mountain Sickness in Hikers

•

Explanatory Variable: Treatment (Placebo,

Acetazolamide, Ginkgo, Acetazolamide/Ginkgo)

•

Response: Presence/Absence of Occurrence of Acute

Mountain Sickness in Himalayan Trekkers

•

Units: $n$ = 487 Hikers

•

Hikers randomly assigned to treatment condition

Source: J.H. Gertsch, B. Basnyat, E.W. Johnson, J. Onopa, and P.S. Holck (2004). "Randomized, Double-Blind Placebo Controlled

Comparison of Ginkgo Biloba and Acetazolamide for Prevention of Acute Mountain Sickness Among Himalayan Trekkers: the Prevention of

High Altitude Illness Trial", BMJ, 328: pp 797-

Example – Acute Mountain Sickness in Hikers

For each treatment (row) we can compute the percentage of hikers in

the AMS presence/absence conditions, the

conditional distribution

Of the 119 hikers in the Placebo condition, 40 suffered from AMS, a

proportion of 40/119 = 0.3361, or 33.61% as a percentage.
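The conditional distributions can be computed for every treatment at once. This sketch uses the observed counts for the four treatment groups; the row labels are an assumption based on the order in which the treatments are listed (Placebo, Acetazolamide, Ginkgo, Acetazolamide/Ginkgo).

```r
# Row (conditional) percentages of AMS outcome within each treatment
ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108), ncol = 2,
                  dimnames = list(
                    Treatment = c("Placebo", "Acetazolamide", "Ginkgo",
                                  "Acetazolamide/Ginkgo"),   # assumed row order
                    Outcome = c("AMS", "No AMS")))
row.pct <- 100 * prop.table(ams.obs, margin = 1)   # divide each cell by its row total
round(cbind(row.pct, n = rowSums(ams.obs)), 2)
```

Each row of percentages sums to 100, and the Placebo row reproduces the 33.61% computed above.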

Guidelines for Contingency Tables

•

Compute percentages for the response (column)

variable within the categories of the explanatory

(row) variable. Note that in journal articles, rows

and columns may be interchanged.

•

Divide the cell totals by the row (explanatory

category) total and multiply by 100 to obtain a

percent, the row percents will add to 100

•

Give title and clearly define variables and

categories.

•

Include row (explanatory) total sample sizes

Independence & Dependence

•

Statistically Independent: Population conditional

distributions of one variable are the same across

all levels of the other variable

•

Statistically Dependent: Conditional Distributions

are not all equal

•

When testing, researchers typically wish to

demonstrate dependence (alternative hypothesis),

and wish to refute independence (null hypothesis)

Pearson’s Chi-Square Test

• Can be used for nominal or ordinal explanatory and response variables

• Variables can have any number of distinct levels

• Tests whether the distribution of the response variable is the same for each level of the explanatory variable ($H_0$: No association between the variables)

• $r$ = # of levels of explanatory variable

• $c$ = # of levels of response variable

Pearson’s Chi-Square Test

•

Intuition behind test statistic

–

Obtain marginal distribution of outcomes for

the response variable

–

Apply this common distribution to all levels of

the explanatory variable, by multiplying each

proportion by the corresponding sample size

–

Measure the difference between actual cell

counts and the expected cell counts in the

previous step

Pearson’s Chi-Square Test

• Notation to obtain test statistic

– Rows represent explanatory variable ($r$ levels)

– Cols represent response variable ($c$ levels)

Pearson’s Chi-Square Test

• Observed frequency ($n_{ij}$): The number of individuals falling in a particular cell

• Expected frequency ($E_{ij}$): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence.

– Computed by multiplying the row total and the column total, and dividing by the overall sample size: $E_{ij} = \dfrac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$

– Applies the overall marginal probability of the response category to the sample size of the explanatory category

Pearson’s Chi-Square Test

• Large-sample test (at least 80% of $E_{ij}$ ≥ 5)

• $H_0$: Variables are statistically independent (No association between variables)

• $H_A$: Variables are statistically dependent (Association exists between variables)

• Test Statistic: $X^2_{obs} = \sum_{i=1}^{r}\sum_{j=1}^{c} \dfrac{(n_{ij} - E_{ij})^2}{E_{ij}}$

• P-value: Area above $X^2_{obs}$ in the chi-squared distribution with $(r-1)(c-1)$ degrees of freedom.
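The expected counts and test statistic can be built directly from the row and column totals. The sketch below uses the acute mountain sickness counts (4 treatments by 2 outcomes) as an illustration; variable names are illustrative.

```r
# Pearson chi-square test computed from its definition
ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108), ncol = 2)
n <- sum(ams.obs)
E <- outer(rowSums(ams.obs), colSums(ams.obs)) / n   # (row total)(column total)/n
X2 <- sum((ams.obs - E)^2 / E)                       # Pearson statistic
df <- (nrow(ams.obs) - 1) * (ncol(ams.obs) - 1)      # (r - 1)(c - 1)
p.value <- pchisq(X2, df, lower.tail = FALSE)
round(E, 2)
c(X2 = X2, df = df, p.value = p.value)
```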

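Example – Acute Mountain Sickness in Hikers

Note that overall, (115/487)100% = 23.61% of all hikers suffered from AMS. If we apply that percentage to the 119 that received Placebo, we would expect (0.2361)(119) = 28.10 to have occurred in the first cell of the table. The full table of expected counts ($E_{ij}$):

Treatment               AMS      No AMS    Total
Placebo                 28.10    90.90     119
Acetazolamide           27.86    90.14     118
Ginkgo                  29.28    94.72     124
Acetazolamide/Ginkgo    29.75    96.25     126
Total                   115      372       487

Observed Cell Counts ($n_{ij}$):

Treatment               AMS      No AMS    Total
Placebo                 40       79        119
Acetazolamide           14       104       118
Ginkgo                  43       81        124
Acetazolamide/Ginkgo    18       108       126
Total                   115      372       487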

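Example – Acute Mountain Sickness in Hikers

Computation of $X^2_{obs}$

Example – Acute Mountain Sickness in Hikers

• $H_0$: Incidence of AMS is independent of treatment condition

• $H_A$: Incidence of AMS differs by treatment condition

• Test Statistic: $X^2_{obs} = 30.12$

• RR ($\alpha = 0.05$): $X^2_{obs} \ge \chi^2_{.05,3} = 7.815$

• P-value: $P(\chi^2_3 \ge 30.12) = 1.3 \times 10^{-6}$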

Likelihood Ratio Statistic
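The likelihood ratio statistic is an alternative to Pearson's statistic for the same hypotheses. The sketch below assumes the standard definition $G^2 = 2\sum_{i}\sum_{j} n_{ij}\ln(n_{ij}/E_{ij})$, referred to the chi-squared distribution with $(r-1)(c-1)$ degrees of freedom, and applies it to the acute mountain sickness counts; this definition is supplied here as an assumption, and the variable names are illustrative.

```r
# Likelihood ratio (G-squared) test of independence for a two-way table
ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108), ncol = 2)
n <- sum(ams.obs)
E <- outer(rowSums(ams.obs), colSums(ams.obs)) / n   # expected counts under independence
G2 <- 2 * sum(ams.obs * log(ams.obs / E))            # assumes no observed count is 0
df <- (nrow(ams.obs) - 1) * (ncol(ams.obs) - 1)
p.value <- pchisq(G2, df, lower.tail = FALSE)
c(G2 = G2, df = df, p.value = p.value)
```

For large samples the likelihood ratio and Pearson statistics are usually close, as they are here.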

R Code – chisq.test Function

## Set up a matrix of observed counts with 4 rows

##    (trts), 2 columns (outcomes)

## Default is to enter data by columns (AMS first, then No AMS)

ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108), ncol=2)

## Use chisq.test function on matrix of observed counts

ams.X2 <- chisq.test(ams.obs, correct=F)

ams.X2

cbind(ams.obs, ams.X2$expected)     # Print n’s and E’s

> ams.X2

        Pearson's Chi-squared test

data:  ams.obs

X-squared = 30.12, df = 3, p-value = 1.302e-06

> cbind(ams.obs, ams.X2$expected)

     [,1] [,2]     [,3]     [,4]

[1,]   40   79 28.10062 90.89938

[2,]   14  104 27.86448 90.13552

[3,]   43   81 29.28131 94.71869

[4,]   18  108 29.75359 96.24641

Misuses of chi-squared Test

•

Expected frequencies too small (at least

80% of expected counts should be at least 5,

not necessary for the observed counts)

•

Dependent samples (the same individuals

are in each row, see McNemar’s test)

•

Can be used for nominal or ordinal

variables, but more powerful methods exist

when both variables are ordinal and a

directional association is hypothesized

Residual Analysis

• Once dependence has been determined from a chi-square test, we are often interested in determining which cells contributed

• Residual: $n_{ij} - E_{ij}$ measures the difference between the observed and expected counts

– Positive implies observed more than expected

– Residual’s practical importance depends on the level of $E_{ij}$

• Adjusted Residual (computed for each cell): $\dfrac{n_{ij} - E_{ij}}{\sqrt{E_{ij}(1 - \text{row proportion})(1 - \text{column proportion})}}$

Adjusted residuals above about 3 in absolute value give strong evidence against independence in that cell (these are like “z-statistics”)

Example – Acute Mountain Sickness in Hikers

Adjusted residuals are computed in the following table.

Row proportion for Placebo: 119/487 = 0.2444

Column proportion for AMS: 115/487 = 0.2361

For the Placebo/AMS cell: $(40 - 28.10)/\sqrt{28.10(1 - 0.2444)(1 - 0.2361)} = 2.95$

All adjusted residuals are close to or above 3 in absolute value. When Acetazolamide is taken, there are large negative residuals for AMS and large positive residuals for No AMS; the opposite holds when Acetazolamide is not taken.
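The adjusted residuals can be computed for all cells from the row and column proportions. This is a sketch of the formula applied to the acute mountain sickness counts; variable names are illustrative.

```r
# Adjusted residuals for every cell of the AMS table
ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108), ncol = 2)
n <- sum(ams.obs)
row.prop <- rowSums(ams.obs) / n          # e.g. Placebo: 119/487
col.prop <- colSums(ams.obs) / n          # e.g. AMS: 115/487
E <- n * outer(row.prop, col.prop)        # expected counts under independence
adj.res <- (ams.obs - E) / sqrt(E * outer(1 - row.prop, 1 - col.prop))
round(adj.res, 3)
```

These values agree with the `stdres` component returned by `chisq.test`.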

R Code/Output

ams.obs <- matrix(c(40, 14, 43, 18,  79, 104, 81, 108),

ncol=2)

## Use chisq.test function on matrix of observed counts

ams.X2 <- chisq.test(ams.obs, correct=F)

ams.X2

ams.X2$stdres

> ams.X2$stdres

          [,1]      [,2]

[1,]  2.954610 -2.954610

[2,] -3.452410  3.452410

[3,]  3.359862 -3.359862

[4,] -2.863550  2.863550

Ordinal Explanatory and Response Variables

•

Pearson’s Chi-square test can be used to test

associations among ordinal variables, but more

powerful methods exist

•

When theories exist that the association is

directional (positive or negative), measures exist

to describe and test for these specific alternatives

from independence:

– Gamma

– Kendall’s $\tau_b$

Concordant and Discordant Pairs

• Concordant Pairs - Pairs of individuals where one individual scores “higher” on both ordered variables than the other individual

• Discordant Pairs - Pairs of individuals where one individual scores “higher” on one ordered variable and the other individual scores “higher” on the other

• $C$ = # Concordant Pairs, $D$ = # Discordant Pairs

– Under Positive association, expect $C > D$

– Under Negative association, expect $C < D$

– Under No association, expect $C \approx D$

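Measures of Association

• Goodman and Kruskal’s Gamma: $\hat{\gamma} = \dfrac{C - D}{C + D}$

• Kendall’s $\tau_b$: $\hat{\tau}_b = \dfrac{C - D}{0.5\sqrt{\left(n^2 - \sum_i n_{i\cdot}^2\right)\left(n^2 - \sum_j n_{\cdot j}^2\right)}}$, where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals

When there’s no association between the ordinal variables, the population based values of these measures are 0.

Statistical software packages provide these tests and CI’s.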

Example – Language Lateralization and Handedness

•

Language Lateralization (Strong Left, Moderate

Left, Bilateral, Moderate Right, Strong Right)

•

Handedness (Strong Left, Moderate Left, Mixed,

Moderate Right, Strong Right)

•

Concordant Pairs - Pairs of subjects where one

scores higher on both language lateralization and

handedness than the other

•

Discordant Pairs - Pairs of subjects where one

scores higher on language lateralization and the

other scores higher on handedness

Source: M. Somers, et al. (2015). “On the Relationship Between Degree of Hand Preference and Degree of Language

Lateralization,”

Brain & Language

, Vol. 144, pp. 10-15.

•

Concordant Pairs: Beginning in bottom left cell,

each individual in a given cell is concordant with

each individual in cells “Northeast” of theirs

•

Discordant Pairs: Beginning in top left cell, each

individual in a given cell is discordant with each

individual in cells “Southeast” of theirs
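The counting rule for concordant and discordant pairs can be sketched in base R. The code assumes the 5x5 table of counts used in the R code for this example, with both variables in increasing order along the rows and columns of the matrix, so that "higher on both" means below and to the right; with rows listed in the opposite order this is the same as the Northeast rule. The function name `count.pairs` is hypothetical.

```r
# Count concordant (C) and discordant (D) pairs in an ordered two-way table
count.pairs <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  n.conc <- 0; n.disc <- 0
  for (i in seq_len(nr - 1)) {
    below <- tab[(i + 1):nr, , drop = FALSE]      # cells higher on the row variable
    for (j in seq_len(nc)) {
      # higher on both variables: concordant with cell (i, j)
      if (j < nc) n.conc <- n.conc + tab[i, j] * sum(below[, (j + 1):nc])
      # higher on rows, lower on columns: discordant with cell (i, j)
      if (j > 1)  n.disc <- n.disc + tab[i, j] * sum(below[, 1:(j - 1)])
    }
  }
  c(C = n.conc, D = n.disc)
}

langhand.obs <- matrix(c(0,7,10,80,23, 0,3,8,33,6, 0,1,6,13,3,
                         1,4,10,34,9, 2,8,17,23,7), ncol=5)
cd <- count.pairs(langhand.obs)
gamma.hat <- unname((cd["C"] - cd["D"]) / (cd["C"] + cd["D"]))
cd
round(gamma.hat, 3)
```

Because there are more discordant than concordant pairs, the estimated gamma is negative for this arrangement of the table.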

Example – Language Lateralization and Handedness

Example – Language Lateralization and Handedness

R Code

## Set up matrix of counts (Rows in reverse order from EXCEL)

langhand.obs <- matrix(c(0,7,10,80,23, 0,3,8,33,6, 0,1,6,13,3,

                         1,4,10,34,9, 2,8,17,23,7), ncol=5)

install.packages("vcdExtra")

library(vcdExtra)

GKgamma(langhand.obs)

## "String-out" matrix into n = 308 value of lang and hand

n.tot <- sum(langhand.obs)

lang <- rep(0, n.tot)

hand <- rep(0, n.tot)

n.count <- 0

for (i1 in 1:nrow(langhand.obs)) {

  for (i2 in 1:ncol(langhand.obs)) {

    n.cell <- langhand.obs[i1,i2]

    if (n.cell > 0) {   # skip empty cells: (n.count+1):n.count would count backwards

      lang[(n.count+1):(n.count+n.cell)] <- i1

      hand[(n.count+1):(n.count+n.cell)] <- i2

      n.count <- n.count+n.cell

    }

  }

}

cor.test(lang, hand, method = "kendall")

R Output

> GKgamma(langhand.obs)

gamma        : -0.279

std. error   : 0.072

CI           : -0.42 -0.137

> cor.test(lang, hand, method = "kendall")

        Kendall's rank correlation tau

data:  lang and hand

z = -3.8779, p-value = 0.0001054

alternative hypothesis: true tau is not equal to 0

sample estimates:

tau

-0.1889155

Inter-Rater Agreement – Cohen’s Kappa

•

Two Raters rate the same items, typically on an ordinal

scale

•

Goal: Measure Strength of their agreement above

“chance”

Agreement Among Movie Reviewers

Reviews by Gene Siskel and Roger Ebert (160 movies from April 1995 through September 1996)

A. Agresti and L. Winner (1997). “Evaluating Agreement and Disagreement Among Movie Reviewers,”

Chance

, Vol. 10, #2, pp. 10—14.
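Agreement above chance can be computed by hand from the cross-classification of the two reviewers' ratings. The sketch below assumes the standard definition of unweighted kappa, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed proportion of agreement and $p_e$ is the agreement expected if the raters were independent; this definition is supplied here as an assumption, and the variable names are illustrative.

```r
# Unweighted Cohen's kappa from a square table of two raters' ratings
siskel.ebert <- matrix(c(24,8,10, 8,13,9, 13,11,64), ncol=3)
n <- sum(siskel.ebert)                                # 160 movies
p.obs <- sum(diag(siskel.ebert)) / n                  # observed agreement (diagonal cells)
p.exp <- sum(rowSums(siskel.ebert) * colSums(siskel.ebert)) / n^2  # chance agreement
kappa.hat <- (p.obs - p.exp) / (1 - p.exp)
round(c(p.obs = p.obs, p.exp = p.exp, kappa = kappa.hat), 4)
```

The reviewers agree on about 63% of movies, compared with about 40% expected by chance, which gives a kappa near 0.39.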

R Code/Output

siskel.ebert <- matrix(c(24,8,10, 8,13,9, 13,11,64),

ncol=3)

install.packages("psych")

library(psych)

cohen.kappa(siskel.ebert)

> cohen.kappa(siskel.ebert)

Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha =

alpha, levels = levels)

Cohen Kappa and Weighted Kappa correlation coefficients

and confidence boundaries

                 lower estimate upper

unweighted kappa  0.27     0.39  0.51

weighted kappa    0.32     0.46  0.60

 Number of subjects = 160


Chapter 8 Categorical Data Analysis

Inference for a Single Proportion () Goal: Estimate proportion of individuals in a population with a certain characteristic ( ). This is equivalent to estimating a binomial probability Sample: Take a SRS of n individuals from the population and observe y that have the characteristic. The sample proportion is y/n and has the following sampling properties: y ^ = Sample proportion : n n 1 ( ) = = Mean and Std. Dev. of sampling distributi on : ^ ^ ^ ^ 1 = Estimated Standard Error : SE ^ n n Shape approximat : ely normal for large samples (Rule of thumb : , 1 ( n ) ) 5

Large-Sample Confidence Interval for Take SRS of size n from population where is true (unknown) proportion of successes. Observe y successes Set confidence level (1- ) and obtain z /2from z-table y n ^ = Point Estimate: ^ ^ 1 = Estimated Standard Error: SE ^ n = Margin of error: SE m z /2 ^ ^ ( ) 1 100% confidence interval for : m

Example - Ginkgo and Azet for AMS Study Goal: Measure effect of Ginkgo and Acetazolamide on occurrence of Acute Mountain Sickness (AMS) in Himalayan Trackers Parameter: = True proportion of all trekkers receiving Ginkgo&Acetaz who would suffer from AMS. Sample Data: n=126 trekkers received G&A, y=18 suffered from AMS 18 (. 14 )(. 86 ) ^ = = = = 143 . SE 031 . ^ 126 126 = 100 ) . 1 = = Margin of error (( 1 % 95 %) : 96 (. 031 ) 061 . m 95 % CI for : 143 . 061 . (. 082 204 ,. )

Sample Size for Margin of Error = E Goal: Estimate within E with 100(1- ) Confidence Confidence Interval will have width of 2E ( ) ( ) 2 1 1 z = = = /2 m E z n /2 2 n E = Since is unknown, an educated guess can be used or set This is most conservative as 0.5 ( ) = 1 is largest for 0.5 ( ) ( )( ) = = = 2 2 0.05, 0.5 1 2 0.5 1 0.5 1 z /2 1 n 2 E

Wilson-Agresti-Coull Method For moderate to small sample sizes, large-sample methods may not work well wrt coverage probabilities Simple approach that works well in practice: Adjust observed number of Successes (y) and sample size (n) ~ ~ 2 2 /2 /2 ~ ~ Point Estimate: n = + = + = = 2 .025 z 2 0.5 Note that for 0.05, 1.96 4 y y z n n z y = ~ ~ ~ 1 = Estimated Standard Error: SE ~ ~ n = Margin of error: SE m z /2 ~ ~ )100% confidence interval for : (1 m

Example: Listers Tests with Antiseptic Experiments with antiseptic in patients with upper limb amputations (John Lister, circa 1870) n=12 patients received antiseptic y=1 died ~ y ( ) ( ) 2 1 0.5 1.96 = + 1 0.5 3.84 = + = 2.92 ~ n ( ) 2 = + = 12 3.84 15.84 + = 12 1.96 2.92 15.84 .1843(.8157) 15.84 95%): 1.96(.0974) ( .0067,.3953) ~ = = = = .1843 SE .0974 ~ = = Margin of error((1- )100% 95% CI for : .1843 .1910 .1910 (0,.40)

Significance Test for a Proportion Goal test whether a proportion ( ) equals some null value 0H0: = ^ = 0 Test Statistic : z obs 1 ( o ) 0 n = : : - value ( P ) H RR z z P P = Z z 0 a obs obs z : : - value ( ) H RR z z P Z 0 a obs z obs = : : - value 2 ( ) H RR z P P Z z 0 / 2 a obs obs Large-sample test works well when n 0 and n(1- 0) 5

Ginkgo and Acetazolamide for AMS Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A? H0: =0.25 Ha: < 0.25 ^ 18 126 18 0.143 126 .143 .25 .107 Test Statistic: .039 .25(.75) 126 ( .05): 1.645 obs RR z z P Y Y = = = = = 0.25 n y 0 = = = 2.75 z obs = = = = -value = ( P Z = 2.75) = .0030 P .05 18| ( ) ( ) Exact P-value: ~ Bin 126, 0.25 .0025 n Strong evidence that incidence rate is below 25% ( < 0.25)

R Code/Output y <- 18; n <- 126 binom.test(y, n, p=0.25, alternative="less") > binom.test(y, n, p=0.25, alternative="less") Exact binomial test data: y and n number of successes = 18, number of trials = 126, p- value = 0.002465 alternative hypothesis: true probability of success is less than 0.25 95 percent confidence interval: 0.0000000 0.2044495 sample estimates: probability of success 0.1428571 The 95% Confidence Interval is 1-sided as the alternative is less than the null value.

Multinomial Experiment / Distribution Extension of Binomial Distribution to experiments where each trial can end in exactly one of k categories n independent trials Probability a trial results in category i is i ni is the number of trials resulting in category i 1+ + k = 1 n1+ +nk = n

Multinomial Distribution / Test for Cell Probabilities ! n ( ) = k n k n ,..., ... p n n 1 1 1 k !... ! n n 1 k k k = = , 1, 0, 0 n n n i i i i = = 1 1 i i Testing whether the category probabilities are specific values: k = = = : ,..., 1 H 0 1 10 0 0 k k i = 1 i : At least one cell probability is not as s Expected cell counts under pecified i = H A = : 1,..., H E n k 0 0 i i ( ) 2 n E k = i i 2 obs Test Statistic: X E = 1 i i ( ) 2 obs 2 2 k 2 obs Rejection Region: P-value: X P X 1 , 1 k

Example English Premier League -2013 Home Team Games can end in Win, Draw, Lose (k = 3) Season: n = 380 games (All 20 teams play Home/Away) Test H0: W = L = 0.40, D = 0.20 Data: nW = 179, nD = 78, nL = 123 ( ) E ( ) + = = = = = Expected Counts Under Note: W + : 380 0.40 = 152 + 380 0.20 152 76 152 = + 76 H E E E + 0 W L D E + = 0.40 0.20 0.40 1.00 + + = = 380 E n 0 0 0 D L W D L ( ) ( ) ( ) ( ) 2 2 2 2 179 152 152 78 76 76 123 152 152 n E = = + + = 4.796 0.053 5.533 + + = i i 2 obs Test Statistic: 10.382 X E i i = 1 3 1 Rejection Region: = = 2 obs 2 .05,2 5.991 df k X ( ) = = 2 2 10.382 .0056 P P 179 380 78 380 123 380 ^ ^ ^ = = = = = = Sample Proportions: 0.4711 0.2053 0.3237 W D L

English Premier League 2013 - R Code
#### Multinomial Goodness of Fit Test
## Give counts
game.count <- c(179, 78, 123)
## Give null values for probabilities
prob.null <- c(0.40, 0.20, 0.40)
## Use chisq.test function for the test
chisq.test(game.count, p=prob.null)

> chisq.test(game.count, p=prob.null)
        Chi-squared test for given probabilities
data:  game.count
X-squared = 10.382, df = 2, p-value = 0.005568
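
As a cross-check on the `chisq.test` output, the expected counts, test statistic, critical value, and P-value can be computed directly from the multinomial formulas. This is a minimal sketch; the object names (`n.obs`, `pi0`, `E`, `X2`) are illustrative and not from the slides.

```r
# Multinomial goodness-of-fit test computed from the formulas
n.obs <- c(W = 179, D = 78, L = 123)      # observed home-team results
pi0   <- c(W = 0.40, D = 0.20, L = 0.40)  # null cell probabilities

E       <- sum(n.obs) * pi0               # expected counts E_i = n * pi_i0
X2      <- sum((n.obs - E)^2 / E)         # chi-square statistic
df      <- length(n.obs) - 1              # k - 1 degrees of freedom
crit    <- qchisq(0.95, df)               # rejection region: X2 >= crit
p.value <- pchisq(X2, df, lower.tail = FALSE)

round(c(X2 = X2, crit = crit, p.value = p.value), 4)
```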

Goodness of Fit Test for a Probability Distribution
• Goal: Determine whether collected data come from a specific probability distribution (e.g. Poisson, Normal, Gamma)
• Estimate any unknown model parameters ($p$ estimates)
• Break the range of data values into $k > p$ intervals (typically where 80% have expected counts $\ge 5$); obtain observed ($n_i$) and expected ($E_i$) counts for each interval
• Test Statistic: $X^2_{obs} = \sum_{i=1}^{k} \dfrac{(n_i - E_i)^2}{E_i}$
• P-value: $P(\chi^2_{k-1-p} \ge X^2_{obs})$
• Assessing quality of fit to the hypothesized distribution:

P-value      Quality of Fit
> .25        Excellent
.15-.25      Good
.05-.15      Moderately Good
.01-.05      Poor
< .01        Unacceptable

Example - Goals in 2013 Brazil Soccer League
• League has 20 teams; each team plays the other 19 teams twice (380 games)
• Games are 90 minutes, with no overtime
• Mean and variance of the total goals in a game are 2.46 and 2.61, respectively (Mean = 2.463158, Max = 8)
• For a Poisson distribution, the theoretical mean and variance are the same. For these empirical data, they are close
• Expected counts: $E_i = 380\,P(Y = i \mid Y \sim \text{Poi}(2.46)) = 380\,\dfrac{e^{-2.46}\,2.46^{i}}{i!}$, $i = 0,\ldots,6$; $E_{7+} = 380 - \sum_{i=0}^{6} E_i$

Goals   Count   Expected    X2
0       37      32.36292    0.664418
1       75      79.71498    0.278882
2       93      98.1753     0.272815
3       92      80.60709    1.610262
4       44      49.637      0.640162
5       23      24.45275    0.086309
6       9       10.0385     0.107434
7+      7       5.011467    0.789043
Sum     380     380         4.449324

• $\chi^2_{.05,6} = 12.59159$; P-value = 0.616108
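
The Poisson fit above can be reproduced with a short R sketch. It assumes the expected counts were computed with the unrounded sample mean 2.463158 (which is what the tabled values suggest) rather than 2.46; the object names are illustrative.

```r
# Goodness-of-fit of goals per game to a Poisson distribution
count      <- c(37, 75, 93, 92, 44, 23, 9, 7)   # games with 0,1,...,6 and 7+ goals
n.games    <- sum(count)                        # 380 games
lambda.hat <- 2.463158                          # sample mean (assumed unrounded value)

E  <- n.games * dpois(0:6, lambda.hat)          # expected counts for 0..6 goals
E  <- c(E, n.games - sum(E))                    # 7+ cell takes the remainder
X2 <- sum((count - E)^2 / E)                    # chi-square statistic

df      <- length(count) - 1 - 1                # k - 1 - p, with p = 1 estimated parameter
p.value <- pchisq(X2, df, lower.tail = FALSE)

round(c(X2 = X2, df = df, crit = qchisq(0.95, df), p.value = p.value), 4)
```

One degree of freedom is subtracted for the estimated Poisson mean, which is why the reference distribution has 6 rather than 7 degrees of freedom.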

Comparing Two Population Proportions
• Goal: Compare two populations/treatments wrt a nominal (binary) outcome
• Sampling Design: Independent vs Dependent Samples
• Methods based on large vs small samples
• Contingency tables used to summarize data
• Measures of Association: Absolute Risk, Relative Risk, Odds Ratio

Contingency Tables
• Tables representing all combinations of levels of explanatory and response variables
• Numbers in table represent counts of the number of cases in each cell
• Row and column totals are called marginal counts

2x2 Tables - Notation

                 Outcome Present   Outcome Absent        Group Total
Group 1          y1                n1-y1                 n1
Group 2          y2                n2-y2                 n2
Outcome Total    y1+y2             (n1+n2)-(y1+y2)       n1+n2

Example - Firm Type/Product Quality

                         High Quality   Low Quality   Group Total
Not Integrated           33             55            88
Vertically Integrated    5              79            84
Outcome Total            38             134           172

• Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers
• Outcomes: High Quality (High Count) vs Low Quality (Count)
• Source: P. Temin (1988). Product Quality and Vertical Integration in the Early Cotton Textile Industry, Journal of Economic History, Vol. 48, #4, pp. 891-907.

Notation
• Proportion in Population 1 with the characteristic of interest: $\pi_1$
• Sample size from Population 1: $n_1$
• Number of individuals in Sample 1 with the characteristic of interest: $y_1$
• Sample proportion from Sample 1 with the characteristic of interest: $\hat{\pi}_1 = y_1 / n_1$
• Similar notation for Population/Sample 2

Example - Cotton Textile Producers
• $\pi_1$ - True proportion of all non-integrated firms that would produce High quality
• $\pi_2$ - True proportion of all vertically integrated firms that would produce High quality
• $n_1 = 88$, $y_1 = 33$, $\hat{\pi}_1 = y_1/n_1 = 33/88 = 0.375$
• $n_2 = 84$, $y_2 = 5$, $\hat{\pi}_2 = y_2/n_2 = 5/84 = 0.060$

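Notation (Continued)
• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions with the characteristic (two other measures given below)
• Estimator: $\hat{D} = \hat{\pi}_1 - \hat{\pi}_2$
• Standard Error (and its estimator): $SE_{\hat{D}} = \sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}$, $\widehat{SE}_{\hat{D}} = \sqrt{\dfrac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \dfrac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$
• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \dfrac{y_1 + y_2}{n_1 + n_2}$, $\widehat{SE}_{P} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$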

Cotton Textile Producers (Continued)
• Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions that produce High quality output
• Estimator: $\hat{D} = \hat{\pi}_1 - \hat{\pi}_2 = 0.375 - 0.060 = 0.315$
• Standard Error (and its estimate): $\widehat{SE}_{\hat{D}} = \sqrt{\dfrac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \dfrac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}} = \sqrt{\dfrac{0.375(0.625)}{88} + \dfrac{0.060(0.94)}{84}} = \sqrt{.003335} = .0577$
• Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \dfrac{33 + 5}{88 + 84} = 0.221$, $\widehat{SE}_{P} = \sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)} = .0633$

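Significance Tests for $\pi_1 - \pi_2$
• Testing whether $\pi_1 = \pi_2$ can be done by interpreting "plausible values" of $\pi_1 - \pi_2$ from the confidence interval:
– If entire interval is positive, conclude $\pi_1 > \pi_2$ ($\pi_1 - \pi_2 > 0$)
– If entire interval is negative, conclude $\pi_1 < \pi_2$ ($\pi_1 - \pi_2 < 0$)
– If interval contains 0, do not conclude that $\pi_1 \ne \pi_2$
• Alternatively, we can conduct a significance test:
– $H_0: \pi_1 = \pi_2$; $H_a: \pi_1 \ne \pi_2$ (2-sided); $H_A: \pi_1 > \pi_2$ (1-sided)
– Test Statistic: $z_{obs} = \dfrac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
– RR: $|z_{obs}| \ge z_{\alpha/2}$ (2-sided); $z_{obs} \ge z_{\alpha}$ (1-sided)
– P-value: $2P(Z \ge |z_{obs}|)$ (2-sided); $P(Z \ge z_{obs})$ (1-sided)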

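Example - Cotton Textile Production
• 95% Confidence Interval for $\pi_1 - \pi_2$: $(\hat{\pi}_1 - \hat{\pi}_2) \pm z_{.025}\sqrt{\dfrac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \dfrac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}} = (0.375 - 0.060) \pm 1.96(0.0577) = 0.315 \pm 0.113 = (0.202, 0.428)$
• $H_0: \pi_1 = \pi_2$ ($\pi_1 - \pi_2 = 0$) vs $H_A: \pi_1 \ne \pi_2$ ($\pi_1 - \pi_2 \ne 0$)
• TS: $z_{obs} = \dfrac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.375 - 0.060}{\sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)}} = \dfrac{0.315}{0.0633} = 4.98$
• RR: $|z_{obs}| \ge z_{.025} = 1.96$
• P-value: $2P(Z \ge 4.98) \approx 0$
• Strong evidence of differences in quality by firm type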

R Code and Output
y1 <- 33; n1 <- 88
y2 <- 5; n2 <- 84
prop.test(c(y1,y2), c(n1,n2), correct=F)

> prop.test(c(y1,y2), c(n1,n2), correct=F)
        2-sample test for equality of proportions without continuity correction
data:  c(y1, y2) out of c(n1, n2)
X-squared = 24.851, df = 1, p-value = 6.195e-07
alternative hypothesis: two.sided
95 percent confidence interval:
 0.2023778 0.4285746
sample estimates:
    prop 1     prop 2
0.37500000 0.05952381
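
The `prop.test` results can be tied back to the hand formulas: the confidence interval uses the unpooled standard error, while the test statistic uses the pooled one, and the reported X-squared is the square of the z statistic. The sketch below illustrates this; the object names are illustrative choices.

```r
# Two-sample comparison of proportions computed from the formulas
y1 <- 33; n1 <- 88    # not integrated firms: high quality / total
y2 <- 5;  n2 <- 84    # vertically integrated firms: high quality / total

p1 <- y1 / n1; p2 <- y2 / n2
d.hat <- p1 - p2                                              # estimated difference

se.unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # used for the CI
ci <- d.hat + c(-1, 1) * qnorm(0.975) * se.unpooled           # 95% CI

p.pool    <- (y1 + y2) / (n1 + n2)                            # pooled proportion under H0
se.pooled <- sqrt(p.pool * (1 - p.pool) * (1 / n1 + 1 / n2))  # used for the test
z.obs     <- d.hat / se.pooled
p.value   <- 2 * pnorm(-abs(z.obs))                           # 2-sided P-value

round(c(lower = ci[1], upper = ci[2], z.obs = z.obs, z.sq = z.obs^2), 4)
```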

Measures of Association
• Absolute Risk (AR): $\pi_1 - \pi_2$
• Relative Risk (RR): $\pi_1 / \pi_2$
• Odds Ratio (OR): $o_1 / o_2$, where $o = \pi/(1-\pi)$
• Note that if $\pi_1 = \pi_2$ (No association between outcome and grouping variables): AR = 0, RR = 1, OR = 1

Relative Risk
• Ratio of the probability that the outcome characteristic is present for one group, relative to the other
• Sample proportions with characteristic from groups 1 and 2: $\hat{\pi}_1 = \dfrac{y_1}{n_1}$, $\hat{\pi}_2 = \dfrac{y_2}{n_2}$

Relative Risk
• Estimated Relative Risk: $RR = \dfrac{\hat{\pi}_1}{\hat{\pi}_2}$
• 95% Confidence Interval for Population Relative Risk: $\left(RR \cdot e^{-1.96\sqrt{v}},\; RR \cdot e^{1.96\sqrt{v}}\right)$, where $e = 2.71828$ and $v = \dfrac{1-\hat{\pi}_1}{y_1} + \dfrac{1-\hat{\pi}_2}{y_2}$

Relative Risk Interpretation
• Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
• Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
• Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1

Example - Concussions in NCAA Athletes
• Units: Game exposures among college soccer players 1997-1999
• Outcome: Presence/Absence of a Concussion
• Group Variable: Gender (Female vs Male)
• Contingency Table of case outcomes:

Gender    Concussion   No Concussion   Total
Female    158          74924           75082
Male      101          75633           75734
Total     259          150557          150816

• Source: Covassin, et al (2003). Sex Differences and the Incidence of Concussions Among Collegiate Athletes, Journal of Athletic Training, Vol. 38, #3, pp. 238-244

Example - Concussions in NCAA Athletes
• Among Females: $\hat{\pi}_F = \dfrac{158}{75082} = 0.0021$ (2.1 concussions per 1000 female player/games)
• Among Males: $\hat{\pi}_M = \dfrac{101}{75734} = 0.0013$ (1.3 concussions per 1000 male player/games)
• $RR(F/M) = \dfrac{\hat{\pi}_F}{\hat{\pi}_M} = \dfrac{.0021}{.0013} = 1.62$
• $v = \dfrac{1 - .0021}{158} + \dfrac{1 - .0013}{101} = .0162$, $\sqrt{v} = .1273$
• 95% CI for Population Relative Risk: $\left(1.62\,e^{-1.96(.1273)},\; 1.62\,e^{1.96(.1273)}\right) \approx (1.27, 2.13)$
• There is strong evidence that females have a higher risk of concussion
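
The relative risk and its confidence interval can be wrapped in a small helper. `rel.risk.ci` is a hypothetical function name introduced here for illustration, not part of base R. Because it works from unrounded proportions, its output will not match the rounded hand calculation on the slide digit for digit, though the conclusion (interval entirely above 1) is the same.

```r
# Relative risk with a large-sample CI on the log scale
# (rel.risk.ci is an illustrative helper, not a base R function)
rel.risk.ci <- function(y1, n1, y2, n2, conf = 0.95) {
  p1 <- y1 / n1; p2 <- y2 / n2
  rr <- p1 / p2                              # estimated relative risk
  v  <- (1 - p1) / y1 + (1 - p2) / y2        # estimated variance of log(RR)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(RR = rr, lower = rr * exp(-z * sqrt(v)), upper = rr * exp(z * sqrt(v)))
}

# Concussions: females (group 1) vs males (group 2)
rr.conc <- rel.risk.ci(158, 75082, 101, 75734)
round(rr.conc, 3)
```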

Odds Ratio
• Odds of an event is the probability it occurs divided by the probability it does not occur
• Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2
• Sample odds of the outcome for each group: $odds_1 = \dfrac{y_1/n_1}{(n_1 - y_1)/n_1} = \dfrac{y_1}{n_1 - y_1}$, $odds_2 = \dfrac{y_2}{n_2 - y_2}$

Odds Ratio
• Estimated Odds Ratio: $OR = \dfrac{odds_1}{odds_2} = \dfrac{y_1/(n_1 - y_1)}{y_2/(n_2 - y_2)} = \dfrac{y_1(n_2 - y_2)}{y_2(n_1 - y_1)}$
• 95% Confidence Interval for Population Odds Ratio: $\left(OR \cdot e^{-1.96\sqrt{v}},\; OR \cdot e^{1.96\sqrt{v}}\right)$, where $e = 2.71828$ and $v = \dfrac{1}{y_1} + \dfrac{1}{n_1 - y_1} + \dfrac{1}{y_2} + \dfrac{1}{n_2 - y_2}$

Odds Ratio Interpretation
• Conclude that the odds (and hence the probability) that the outcome is present are higher (in the population) for group 1 if the entire interval is above 1
• Conclude that the odds (and hence the probability) that the outcome is present are lower (in the population) for group 1 if the entire interval is below 1
• Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1

Osteoarthritis in Former Soccer Players
• Units: 68 former British professional football players and 136 age/sex matched controls
• Outcome: Presence/Absence of Osteoarthritis (OA)
• Data: Of $n_1 = 68$ former professionals, $y_1 = 9$ had OA, $n_1 - y_1 = 59$ did not. Of $n_2 = 136$ controls, $y_2 = 2$ had OA, $n_2 - y_2 = 134$ did not
• $odds_1 = \dfrac{9}{59} = .1525$, $odds_2 = \dfrac{2}{134} = .0149$
• $OR = \dfrac{odds_1}{odds_2} = \dfrac{.1525}{.0149} = 10.23$
• $v = \dfrac{1}{9} + \dfrac{1}{59} + \dfrac{1}{2} + \dfrac{1}{134} = .6355$, $\sqrt{v} = .797$
• 95% CI for Population Odds Ratio: $\left(10.23\,e^{-1.96(.797)},\; 10.23\,e^{1.96(.797)}\right) = (2.14, 48.80)$
• Entire interval is above 1
• Source: Shepard, et al (2003). Ex-professional association footballers have an increased prevalence of osteoarthritis of the hip compared with age matched controls despite not having sustained notable hip injuries, Brit. J. Sports Med., Vol. 37, #1, pp. 80-81
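
The odds ratio calculation follows the same pattern as the relative risk. `odds.ratio.ci` below is a hypothetical helper name used only for illustration; it uses unrounded inputs, so the last digits can differ slightly from the hand calculation.

```r
# Odds ratio with a large-sample CI on the log scale
# (odds.ratio.ci is an illustrative helper, not a base R function)
odds.ratio.ci <- function(y1, n1, y2, n2, conf = 0.95) {
  or <- (y1 * (n2 - y2)) / (y2 * (n1 - y1))             # estimated odds ratio
  v  <- 1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2) # estimated variance of log(OR)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(OR = or, lower = or * exp(-z * sqrt(v)), upper = or * exp(z * sqrt(v)))
}

# Osteoarthritis: former professionals (group 1) vs controls (group 2)
or.oa <- odds.ratio.ci(9, 68, 2, 136)
round(or.oa, 2)
```

The interval is wide because two of the four cells are small (9 and 2), and the terms $1/y_1$ and $1/y_2$ dominate $v$.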

Fisher's Exact Test
• Method of testing whether $\pi_2 = \pi_1$ when one or both of the group sample sizes is small
• Measures (conditional on the group sizes and number of cases with and without the characteristic) the chances we would see differences of this magnitude or larger in the sample proportions, if there were no differences in the populations

Example - Echinacea Purpurea for Colds
• Healthy adults randomized to receive EP (n1=24) or placebo (n2=22, two were dropped)
• Among EP subjects, 14 of 24 developed cold after exposure to RV-39 (58%)
• Among Placebo subjects, 18 of 22 developed cold after exposure to RV-39 (82%)
• Out of a total of 46 subjects, 32 developed cold
• Out of a total of 46 subjects, 24 received EP
• Source: S.J. Sperber, et al (2004), Echinacea Purpurea for Prevention of Experimental Rhinovirus Colds, Clinical Infectious Diseases, Vol. 38, #10, pp. 1367-1371.

Example - Echinacea Purpurea for Colds
• Conditional on 32 people developing colds, 24 receiving EP, and 22 receiving placebo, the following table gives the outcomes that would have been as strong or stronger evidence that EP reduced risk of developing cold (1-sided test). P-value from R is .079 (next slide).

EP/Cold   Placebo/Cold   Probability
14        18             0.059808
13        19             0.016025
12        20             0.002604
11        21             0.000229
10        22             0.000008
Sum                      0.078674

• Probabilities: $p(y_{EP}, y_{PL}) = \dfrac{\binom{n_{EP}}{y_{EP}}\binom{n_{PL}}{y_{PL}}}{\binom{n_{EP} + n_{PL}}{y_{EP} + y_{PL}}}$
• $p(14, 18) = \dfrac{\binom{24}{14}\binom{22}{18}}{\binom{46}{32}} = \dfrac{1961256\,(7315)}{239877544005} = .059808$
• ...
• $p(10, 22) = \dfrac{\binom{24}{10}\binom{22}{22}}{\binom{46}{32}} = \dfrac{1961256\,(1)}{239877544005} = .0000082$
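
The table of conditional probabilities can be generated directly from the hypergeometric formula with `choose`, and the 1-sided P-value checked against `phyper`. This is a minimal sketch; the object names are illustrative.

```r
# Fisher's exact test probabilities for the Echinacea example
n.ep <- 24; n.pl <- 22   # group sizes (EP, placebo)
n.cold <- 32             # total number who developed a cold

y.ep <- 14:10            # EP colds: observed value and all more extreme values
y.pl <- n.cold - y.ep    # corresponding placebo colds (18..22)

# Hypergeometric probability of each table, conditional on the margins
probs <- choose(n.ep, y.ep) * choose(n.pl, y.pl) / choose(n.ep + n.pl, n.cold)
p.value <- sum(probs)    # 1-sided P-value

# Same P-value from the hypergeometric CDF: P(EP colds <= 14)
p.hyper <- phyper(14, n.ep, n.pl, n.cold)

data.frame(EP.cold = y.ep, PL.cold = y.pl, prob = round(probs, 6))
```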

R Code/Output
ep.cold <- matrix(c(10,4, 14,18), ncol=2)
fisher.test(ep.cold, alt="greater")
fisher.test(ep.cold, alt="two.sided")

> fisher.test(ep.cold, alt="greater")
        Fisher's Exact Test for Count Data
data:  ep.cold
p-value = 0.07867
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.8653928       Inf
sample estimates:
odds ratio
  3.132591

McNemar's Test for Paired Samples
• Common subjects (or matched pairs) being observed under 2 conditions (2 treatments, before/after, 2 diagnostic tests) in a crossover setting
• Two possible outcomes (Presence/Absence of Characteristic) on each measurement
• Four possibilities for each subject/pair wrt outcome:
– Present in both conditions
– Absent in both conditions
– Present in Condition 1, Absent in Condition 2
– Absent in Condition 1, Present in Condition 2

McNemar's Test for Paired Samples

Condition 1\2   Present   Absent
Present         n11       n12
Absent          n21       n22

McNemar's Test for Paired Samples
• Data: $n_{12}$ = # of pairs where the characteristic is present in condition 1 and not 2, and $n_{21}$ = # where present in 2 and not 1
• $H_0$: Probability the outcome is Present is the same for the 2 conditions ($\pi_1 = \pi_2$)
• $H_A$: Probabilities differ for the 2 conditions ($\pi_1 \ne \pi_2$)
• Large-Sample Test (Normal Approximation to Binomial):
– T.S.: $z_{obs} = \dfrac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}}$
– P-val: $2P(Z \ge |z_{obs}|)$

Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
• Subjects - 165 women having revision surgery involving silicone gel breast implants
• Conditions (Each being observed on all women):
– Self Report of Presence/Absence of Rupture/Leak
– Surgical Record of Presence/Absence of Rupture/Leak
• [Table: SELF * SURG crosstabulation of self-reported vs surgical-record rupture/leak counts]
• Source: Brown and Pennello (2002), Replacement Surgery and Silicone Gel Breast Implant Rupture, Journal of Women's Health & Gender-Based Medicine, Vol. 11, pp 255-264

Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
• $H_0$: Tendency to report ruptures/leaks is the same for self reports and surgical records
• $H_A$: Tendencies differ
• T.S.: $z_{obs} = \dfrac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}} = \dfrac{28 - 5}{\sqrt{28 + 5}} = 4.00$
• P-val: $2P(Z \ge |z_{obs}|) = 2P(Z \ge 4) = 2(.0000317) \approx 0$
• Exact P-value: $2P(Y \ge 28 \mid Y \sim B(n = 33, \pi = 0.5)) = .0001$
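
Both the large-sample and the exact versions of McNemar's test depend only on the two discordant counts, which the sketch below makes explicit. The object names are illustrative; `binom.test` is used for the exact P-value because, under $H_0$, $n_{12}$ is Binomial with $n = n_{12} + n_{21}$ and $\pi = 0.5$.

```r
# McNemar's test from the discordant pairs only
n12 <- 28   # rupture/leak reported by self but not in surgical record
n21 <- 5    # rupture/leak in surgical record but not self-reported

z.obs  <- (n12 - n21) / sqrt(n12 + n21)        # large-sample z statistic
p.norm <- 2 * pnorm(-abs(z.obs))               # 2-sided normal-approximation P-value

# Exact 2-sided P-value: n12 ~ Binomial(n12 + n21, 0.5) under H0
p.exact <- binom.test(n12, n12 + n21, p = 0.5)$p.value

c(z.obs = z.obs, z.sq = z.obs^2, p.norm = p.norm, p.exact = p.exact)
```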

R Code and Output
rupture <- matrix(c(69,5, 28,63), ncol=2)
mcnemar.test(rupture, correct=F)

> mcnemar.test(rupture, correct=F)
        McNemar's Chi-squared test
data:  rupture
McNemar's chi-squared = 16.03, df = 1, p-value = 6.234e-05

Note that the mcnemar.test function reports $z^2$, which is chi-square with 1 degree of freedom (thus it is equivalent to the z-test).

Mantel-Haenszel Test / CI for Multiple Tables
• Data collected from $q$ studies or strata in 2x2 contingency tables with common groupings/outcomes
• Each table has 4 cells: $n_{h11}, n_{h12}, n_{h21}, n_{h22}$, $h = 1,\ldots,q$
• They can be combined for an overall Chi-square statistic or odds ratio and confidence interval

Table 1
Trt\Response   1        2        Total
1              n_111    n_112    n_11.
2              n_121    n_122    n_12.
Total          n_1.1    n_1.2    n_1..

Table q
Trt\Response   1        2        Total
1              n_q11    n_q12    n_q1.
2              n_q21    n_q22    n_q2.
Total          n_q.1    n_q.2    n_q..

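Mantel-Haenszel Computations
• Test Statistic: $X^2_{MH} = \dfrac{\left[\sum_{h=1}^{q}\left(n_{h11} - \dfrac{n_{h1\cdot}\,n_{h\cdot 1}}{n_{h\cdot\cdot}}\right)\right]^2}{\sum_{h=1}^{q} \dfrac{n_{h1\cdot}\,n_{h2\cdot}\,n_{h\cdot 1}\,n_{h\cdot 2}}{n_{h\cdot\cdot}^2\,(n_{h\cdot\cdot} - 1)}}$
• Rejection Region: $X^2_{MH} \ge \chi^2_{\alpha,1}$
• P-value: $P(\chi^2_1 \ge X^2_{MH})$
• Overall Odds Ratio: $\widehat{OR}_{MH} = \dfrac{R}{S}$, where $R = \sum_{h=1}^{q} R_h$, $R_h = \dfrac{n_{h11}\,n_{h22}}{n_{h\cdot\cdot}}$, $S = \sum_{h=1}^{q} S_h$, $S_h = \dfrac{n_{h12}\,n_{h21}}{n_{h\cdot\cdot}}$
• $\hat{v} = \hat{V}\left(\log \widehat{OR}_{MH}\right) = \dfrac{1}{S^2}\sum_{h=1}^{q} S_h^2\left(\dfrac{1}{n_{h11}} + \dfrac{1}{n_{h12}} + \dfrac{1}{n_{h21}} + \dfrac{1}{n_{h22}}\right)$
• 95% CI for Overall Odds Ratio: $\left(\widehat{OR}_{MH}\,e^{-1.96\sqrt{\hat{v}}},\; \widehat{OR}_{MH}\,e^{1.96\sqrt{\hat{v}}}\right)$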
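
The Mantel-Haenszel formulas can be sketched as a small R function operating on a 2x2xq array. The function name `mh.by.hand` and the counts in `tabs` are hypothetical, made up purely for illustration and not taken from any study in these slides. The statistic and pooled odds ratio should agree with base R's `mantelhaen.test(tabs, correct = FALSE)`, although that function builds its confidence interval from a different variance estimate than the one given above.

```r
# Mantel-Haenszel statistic, pooled odds ratio and CI from the slide formulas
# (mh.by.hand is an illustrative helper; tabs holds made-up counts)
mh.by.hand <- function(tabs) {
  n11 <- tabs[1, 1, ]; n12 <- tabs[1, 2, ]
  n21 <- tabs[2, 1, ]; n22 <- tabs[2, 2, ]
  r1 <- n11 + n12; r2 <- n21 + n22     # row (treatment) totals per stratum
  c1 <- n11 + n21; c2 <- n12 + n22     # column (response) totals per stratum
  n  <- r1 + r2                        # stratum sizes

  X2 <- sum(n11 - r1 * c1 / n)^2 / sum(r1 * r2 * c1 * c2 / (n^2 * (n - 1)))

  Rh <- n11 * n22 / n; Sh <- n12 * n21 / n
  or.mh <- sum(Rh) / sum(Sh)           # pooled odds ratio R / S
  v  <- sum(Sh^2 * (1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)) / sum(Sh)^2
  ci <- or.mh * exp(c(-1, 1) * 1.96 * sqrt(v))

  list(X2 = X2, p.value = pchisq(X2, 1, lower.tail = FALSE), OR = or.mh, CI = ci)
}

# Hypothetical counts for q = 3 strata, filled column by column within each table
tabs <- array(c(10, 6, 8, 12,
                15, 9, 7, 14,
                 9, 5, 6, 11), dim = c(2, 2, 3))
res <- mh.by.hand(tabs)
res
```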

Associations Between Categorical Variables
• Case where both explanatory (independent) variable and response (dependent) variable are qualitative
• Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)
