Categorical Data Analysis in Population Studies
Inference methods for estimating proportions in a population are essential in categorical data analysis. These include techniques for single proportions, confidence intervals, sample size determination, and the Wilson-Agresti-Coull method for small sample sizes. Illustrated with examples and visuals, these methods are crucial for making accurate statistical inferences in studies involving categorical data.
Chapter 8 Categorical Data Analysis
Inference for a Single Proportion ($\pi$)
Goal: Estimate the proportion of individuals in a population with a certain characteristic ($\pi$). This is equivalent to estimating a binomial probability.
Sample: Take a simple random sample (SRS) of $n$ individuals from the population and observe $y$ that have the characteristic. The sample proportion is $y/n$ and has the following sampling properties:
Sample proportion: $\hat{\pi} = y/n$
Mean and Std. Dev. of sampling distribution: $\mu_{\hat{\pi}} = \pi$, $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
Shape: approximately normal for large samples (Rule of thumb: $n\pi \ge 5$ and $n(1-\pi) \ge 5$)
Large-Sample Confidence Interval for $\pi$
Take an SRS of size $n$ from a population where $\pi$ is the true (unknown) proportion of successes. Observe $y$ successes.
Set the confidence level $(1-\alpha)$ and obtain $z_{\alpha/2}$ from the z-table.
Point Estimate: $\hat{\pi} = y/n$
Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
Margin of error: $m = z_{\alpha/2}\, SE_{\hat{\pi}}$
$(1-\alpha)100\%$ confidence interval for $\pi$: $\hat{\pi} \pm m$
Example - Ginkgo and Acetazolamide for AMS
Study Goal: Measure the effect of Ginkgo and Acetazolamide on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers
Parameter: $\pi$ = True proportion of all trekkers receiving Ginkgo & Acetazolamide (G&A) who would suffer from AMS
Sample Data: $n = 126$ trekkers received G&A, $y = 18$ suffered from AMS
$\hat{\pi} = 18/126 = .143$, $SE_{\hat{\pi}} = \sqrt{(.14)(.86)/126} = .031$
Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(.031) = .061$
95% CI for $\pi$: $.143 \pm .061 \equiv (.082, .204)$
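The large-sample interval above can be sketched in R as follows. This is a minimal illustration rather than code from the slides: the helper name `prop_ci_large` is hypothetical, and it uses `qnorm` for $z_{\alpha/2}$ instead of a z-table, so results may differ from the hand calculation in the third decimal place.

```r
# Large-sample confidence interval for a single proportion.
# prop_ci_large() is a hypothetical helper name, not a base R function.
prop_ci_large <- function(y, n, conf.level = 0.95) {
  pi_hat <- y / n                              # point estimate
  se <- sqrt(pi_hat * (1 - pi_hat) / n)        # estimated standard error
  z <- qnorm(1 - (1 - conf.level) / 2)         # z_{alpha/2}
  m <- z * se                                  # margin of error
  c(estimate = pi_hat, se = se, margin = m,
    lower = pi_hat - m, upper = pi_hat + m)
}

# AMS data from the example: 18 of 126 trekkers
ci_ams <- prop_ci_large(y = 18, n = 126)
print(round(ci_ams, 3))
```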
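Sample Size for Margin of Error = E
Goal: Estimate $\pi$ within $E$ with $100(1-\alpha)\%$ confidence
The confidence interval will have a width of $2E$
$m = E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n} \;\Rightarrow\; n = \frac{z_{\alpha/2}^2\, \pi(1-\pi)}{E^2}$
Since $\pi$ is unknown, an educated guess can be used, or set $\pi = 0.5$. This is the most conservative choice, as $\pi(1-\pi)$ is largest for $\pi = 0.5$.
For $\alpha = 0.05$ and $\pi = 0.5$: $z_{\alpha/2} = 1.96 \approx 2$, so $n \approx \frac{2^2 (0.5)(1-0.5)}{E^2} = \frac{1}{E^2}$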
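As a rough sketch of the sample size formula, the R snippet below computes $n$ for a chosen margin of error. The function name `sample_size_prop`, the margin $E = 0.03$, and the guess $\pi = 0.15$ are illustrative assumptions, not values from the slides; the result is rounded up to the next whole subject.

```r
# Sample size needed to estimate a proportion within margin E.
# sample_size_prop() is a hypothetical helper; E and pi_guess below are
# illustrative values only.
sample_size_prop <- function(E, pi_guess = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  ceiling(z^2 * pi_guess * (1 - pi_guess) / E^2)
}

n_conservative <- sample_size_prop(E = 0.03)                 # pi = 0.5
n_informed <- sample_size_prop(E = 0.03, pi_guess = 0.15)    # educated guess
print(c(conservative = n_conservative, informed = n_informed))
```

The conservative choice never under-sizes the study, at the cost of a larger sample when the true proportion is far from 0.5.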
Wilson-Agresti-Coull Method
For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities
Simple approach that works well in practice: adjust the observed number of successes ($y$) and the sample size ($n$)
$\tilde{y} = y + 0.5\, z_{\alpha/2}^2$, $\tilde{n} = n + z_{\alpha/2}^2$
Note that for $\alpha = 0.05$: $z_{.025} = 1.96$ and $z_{.025}^2 \approx 4$
Point Estimate: $\tilde{\pi} = \tilde{y}/\tilde{n}$
Estimated Standard Error: $SE_{\tilde{\pi}} = \sqrt{\tilde{\pi}(1-\tilde{\pi})/\tilde{n}}$
Margin of error: $m = z_{\alpha/2}\, SE_{\tilde{\pi}}$
$(1-\alpha)100\%$ confidence interval for $\pi$: $\tilde{\pi} \pm m$
Example: Lister's Tests with Antiseptic
Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870)
$n = 12$ patients received antiseptic, $y = 1$ died
$\tilde{y} = 1 + 0.5(1.96)^2 = 1 + 0.5(3.84) = 2.92$
$\tilde{n} = 12 + (1.96)^2 = 12 + 3.84 = 15.84$
$\tilde{\pi} = 2.92/15.84 = .1843$, $SE_{\tilde{\pi}} = \sqrt{.1843(.8157)/15.84} = .0974$
Margin of error ($(1-\alpha)100\% = 95\%$): $1.96(.0974) = .1910$
95% CI for $\pi$: $.1843 \pm .1910 \equiv (-.0067, .3753) \rightarrow (0, .38)$
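The adjustment can be sketched in R as below. The helper name `wac_ci` is hypothetical; truncating the limits to the interval [0, 1] mirrors the last step of the example, where a negative lower limit is replaced by 0.

```r
# Wilson-Agresti-Coull adjusted interval for a proportion.
# wac_ci() is a hypothetical helper name, not a base R function.
wac_ci <- function(y, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  y_adj <- y + 0.5 * z^2                 # adjusted number of successes
  n_adj <- n + z^2                       # adjusted sample size
  pi_adj <- y_adj / n_adj
  se <- sqrt(pi_adj * (1 - pi_adj) / n_adj)
  m <- z * se
  c(estimate = pi_adj, se = se,
    lower = max(0, pi_adj - m),          # truncate at 0
    upper = min(1, pi_adj + m))          # truncate at 1
}

# Lister data: 1 death among 12 patients
ci_lister <- wac_ci(y = 1, n = 12)
print(round(ci_lister, 4))
```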
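Significance Test for a Proportion
Goal: test whether a proportion ($\pi$) equals some null value $\pi_0$
$H_0: \pi = \pi_0$
Test Statistic: $z_{obs} = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
$H_a: \pi > \pi_0$: RR: $z_{obs} \ge z_{\alpha}$, P-value $= P(Z \ge z_{obs})$
$H_a: \pi < \pi_0$: RR: $z_{obs} \le -z_{\alpha}$, P-value $= P(Z \le z_{obs})$
$H_a: \pi \ne \pi_0$: RR: $|z_{obs}| \ge z_{\alpha/2}$, P-value $= 2P(Z \ge |z_{obs}|)$
The large-sample test works well when $n\pi_0 \ge 5$ and $n(1-\pi_0) \ge 5$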
Ginkgo and Acetazolamide for AMS
Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?
$H_0: \pi = 0.25$, $H_a: \pi < 0.25$
$n = 126$, $y = 18$, $\hat{\pi} = 18/126 = 0.143$
Test Statistic: $z_{obs} = \frac{.143 - .25}{\sqrt{.25(.75)/126}} = \frac{-.107}{.039} = -2.75$
RR ($\alpha = .05$): $z_{obs} \le -z_{.05} = -1.645$
P-value $= P(Z \le -2.75) = .0030$
Exact P-value: $P(Y \le 18 \mid Y \sim \text{Bin}(n = 126, \pi = 0.25)) = .0025$
Strong evidence that the incidence rate is below 25% ($\pi < 0.25$)
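A minimal R sketch of the large-sample z-test follows. The helper name `prop_z_test` is hypothetical. Because it carries unrounded values through the calculation, the statistic comes out slightly larger in magnitude than the hand calculation based on rounded inputs.

```r
# Large-sample z-test for a single proportion.
# prop_z_test() is a hypothetical helper name, not a base R function.
prop_z_test <- function(y, n, pi0,
                        alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  pi_hat <- y / n
  z <- (pi_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)   # null SE uses pi0
  p <- switch(alternative,
              less = pnorm(z),
              greater = pnorm(z, lower.tail = FALSE),
              two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  c(z = z, p.value = p)
}

# AMS data: H0: pi = 0.25 versus Ha: pi < 0.25
z_ams <- prop_z_test(y = 18, n = 126, pi0 = 0.25, alternative = "less")
print(round(z_ams, 4))
```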
R Code/Output
y <- 18; n <- 126
binom.test(y, n, p=0.25, alternative="less")
> binom.test(y, n, p=0.25, alternative="less")
Exact binomial test
data: y and n
number of successes = 18, number of trials = 126, p-value = 0.002465
alternative hypothesis: true probability of success is less than 0.25
95 percent confidence interval:
0.0000000 0.2044495
sample estimates:
probability of success
0.1428571
The 95% confidence interval is 1-sided because the alternative hypothesis is 1-sided (less than the null value).
Multinomial Experiment / Distribution
Extension of the Binomial Distribution to experiments where each trial can end in exactly one of $k$ categories
$n$ independent trials
The probability that a trial results in category $i$ is $\pi_i$
$n_i$ is the number of trials resulting in category $i$
$\pi_1 + \cdots + \pi_k = 1$, $n_1 + \cdots + n_k = n$
Multinomial Distribution / Test for Cell Probabilities
$p(n_1,\ldots,n_k) = \frac{n!}{n_1! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k}$, where $\sum_{i=1}^{k} n_i = n$, $\sum_{i=1}^{k} \pi_i = 1$, $n_i \ge 0$, $\pi_i \ge 0$
Testing whether the category probabilities are specific values:
$H_0: \pi_1 = \pi_{10}, \ldots, \pi_k = \pi_{k0}$, with $\sum_{i=1}^{k} \pi_{i0} = 1$
$H_A$: At least one cell probability is not as specified
Expected cell counts under $H_0$: $E_i = n\pi_{i0}$, $i = 1,\ldots,k$
Test Statistic: $X^2_{obs} = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$
Rejection Region: $X^2_{obs} \ge \chi^2_{\alpha, k-1}$
P-value: $P(\chi^2_{k-1} \ge X^2_{obs})$
Example - English Premier League 2013
Home team games can end in Win, Draw, Lose ($k = 3$)
Season: $n = 380$ games (all 20 teams play Home/Away)
Test $H_0: \pi_W = \pi_L = 0.40$, $\pi_D = 0.20$
Data: $n_W = 179$, $n_D = 78$, $n_L = 123$
Expected Counts Under $H_0$: $E_W = E_L = 380(0.40) = 152$, $E_D = 380(0.20) = 76$
Note: $\pi_{W0} + \pi_{D0} + \pi_{L0} = 0.40 + 0.20 + 0.40 = 1.00$ and $E_W + E_D + E_L = 152 + 76 + 152 = 380 = n$
Test Statistic: $X^2_{obs} = \frac{(179-152)^2}{152} + \frac{(78-76)^2}{76} + \frac{(123-152)^2}{152} = 4.796 + 0.053 + 5.533 = 10.382$
Rejection Region: $df = k - 1 = 3 - 1 = 2$, $X^2_{obs} \ge \chi^2_{.05,2} = 5.991$
P-value: $P(\chi^2_2 \ge 10.382) = .0056$
Sample Proportions: $\hat{\pi}_W = 179/380 = 0.4711$, $\hat{\pi}_D = 78/380 = 0.2053$, $\hat{\pi}_L = 123/380 = 0.3237$
English Premier League 2013 - R Code
#### Multinomial Goodness of Fit Test
## Give counts
game.count <- c(179, 78, 123)
## Give null values for probabilities
prob.null <- c(0.40, 0.20, 0.40)
## Use chisq.test function for the test
chisq.test(game.count, p=prob.null)
> chisq.test(game.count, p=prob.null)
Chi-squared test for given probabilities
data: game.count
X-squared = 10.382, df = 2, p-value = 0.005568
Goodness of Fit Test for a Probability Distribution
Data are collected and we wish to determine whether they come from a specific probability distribution (e.g. Poisson, Normal, Gamma)
Estimate any unknown model parameters ($p$ estimates)
Break down the range of data values into $k > p$ intervals (typically where 80% have expected counts $\ge 5$) and obtain observed ($n_i$) and expected ($E_i$) values for each interval
Test Statistic: $X^2_{obs} = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$
P-value: $P(\chi^2_{k-1-p} \ge X^2_{obs})$
Assessing quality of fit to the hypothesized distribution:
P-value     Quality of Fit
> .25       Excellent
.15-.25     Good
.05-.15     Moderately Good
.01-.05     Poor
< .01       Unacceptable
Example - Goals in 2013 Brazil Soccer League
The league has 20 teams, and each team plays the other 19 teams twice
Games are 90 minutes, with no overtime
Mean and variance of the total goals in a game are 2.46 and 2.61, respectively (Mean = 2.463158, Max = 8)
For the Poisson distribution, the theoretical mean and variance are the same. For this empirical data, they are close.
$E_i = 380\, P(Y = i \mid Y \sim \text{Poi}(2.46)) = 380\, \frac{e^{-2.46}\, 2.46^i}{i!}$, $i = 0,\ldots,6$; $E_{7+} = 380 - \sum_{i=0}^{6} E_i$
Tgoals   Count   Expected    X2
0        37      32.36292    0.664418
1        75      79.71498    0.278882
2        93      98.1753     0.272815
3        92      80.60709    1.610262
4        44      49.637      0.640162
5        23      24.45275    0.086309
6        9       10.0385     0.107434
7+       7       5.011467    0.789043
sum      380     380         4.449324
X2(.05,6) = 12.59159, P-value = 0.616108
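The table above can be reproduced with a short R sketch. It assumes the expected counts were computed from the unrounded sample mean 2.463158 (which matches the tabled values), and it uses $k - 1 - p = 8 - 1 - 1 = 6$ degrees of freedom because one parameter (the Poisson mean) was estimated.

```r
# Poisson goodness-of-fit computation for the goals data.
count <- c(37, 75, 93, 92, 44, 23, 9, 7)   # games with 0,1,...,6 and 7+ goals
n_games <- sum(count)
lambda_hat <- 2.463158                      # sample mean reported on the slide

# Expected counts for 0..6 goals, then the 7+ cell by subtraction
expected <- n_games * dpois(0:6, lambda_hat)
expected <- c(expected, n_games - sum(expected))

x2_obs <- sum((count - expected)^2 / expected)
df <- length(count) - 1 - 1                 # k - 1 - (estimated parameters)
crit <- qchisq(0.95, df)
p_value <- pchisq(x2_obs, df, lower.tail = FALSE)
print(round(c(X2 = x2_obs, df = df, critical = crit, p.value = p_value), 4))
```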
Comparing Two Population Proportions
Goal: Compare two populations/treatments with respect to a nominal (binary) outcome
Sampling Design: Independent vs Dependent Samples
Methods based on large vs small samples
Contingency tables used to summarize data
Measures of Association: Absolute Risk, Relative Risk, Odds Ratio
Contingency Tables
Tables representing all combinations of levels of explanatory and response variables
Numbers in the table represent counts of the number of cases in each cell
Row and column totals are called marginal counts
2x2 Tables - Notation
                Outcome Present   Outcome Absent     Group Total
Group 1         y1                n1-y1              n1
Group 2         y2                n2-y2              n2
Outcome Total   y1+y2             (n1+n2)-(y1+y2)    n1+n2
Example - Firm Type/Product Quality
                        High Quality   Low Quality   Group Total
Not Integrated          33             55            88
Vertically Integrated   5              79            84
Outcome Total           38             134           172
Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers
Outcomes: High Quality (High Count) vs Low Quality (Low Count)
Source: P. Temin (1988). Product Quality and Vertical Integration in the Early Cotton Textile Industry, Journal of Economic History, Vol. 48, #4, pp. 891-907.
Notation
Proportion in Population 1 with the characteristic of interest: $\pi_1$
Sample size from Population 1: $n_1$
Number of individuals in Sample 1 with the characteristic of interest: $y_1$
Sample proportion from Sample 1 with the characteristic of interest: $\hat{\pi}_1 = y_1/n_1$
Similar notation for Population/Sample 2
Example - Cotton Textile Producers
$\pi_1$ - True proportion of all non-integrated firms that would produce High quality
$\pi_2$ - True proportion of all vertically integrated firms that would produce High quality
$n_1 = 88$, $y_1 = 33$, $\hat{\pi}_1 = 33/88 = 0.375$
$n_2 = 84$, $y_2 = 5$, $\hat{\pi}_2 = 5/84 = 0.060$
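Notation (Continued)
Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions with the characteristic (two other measures given below)
Estimator: $\hat{D} = \hat{\pi}_1 - \hat{\pi}_2$
Standard Error (and its estimator): $\sigma_{\hat{D}} = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$, $SE_{\hat{D}} = \sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$
Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \frac{y_1 + y_2}{n_1 + n_2}$, $SE_{\hat{D}_P} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$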
Cotton Textile Producers (Continued)
Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions that produce High quality output
Estimator: $\hat{D} = \hat{\pi}_1 - \hat{\pi}_2 = 0.375 - 0.060 = 0.315$
Standard Error (and its estimate): $SE_{\hat{D}} = \sqrt{\frac{0.375(0.625)}{88} + \frac{0.060(0.94)}{84}} = \sqrt{.003335} = .0577$
Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \frac{33 + 5}{88 + 84} = 0.221$, $SE_{\hat{D}_P} = \sqrt{0.221(0.779)\left(\frac{1}{88} + \frac{1}{84}\right)} = .0633$
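Significance Tests for $\pi_1 - \pi_2$
Testing whether $\pi_1 = \pi_2$ can be done by interpreting plausible values of $\pi_1 - \pi_2$ from the confidence interval:
If the entire interval is positive, conclude $\pi_1 > \pi_2$ ($\pi_1 - \pi_2 > 0$)
If the entire interval is negative, conclude $\pi_1 < \pi_2$ ($\pi_1 - \pi_2 < 0$)
If the interval contains 0, do not conclude that $\pi_1 \ne \pi_2$
Alternatively, we can conduct a significance test:
$H_0: \pi_1 = \pi_2$; $H_A: \pi_1 \ne \pi_2$ (2-sided) or $H_A: \pi_1 > \pi_2$ (1-sided)
Test Statistic: $z_{obs} = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
RR: $|z_{obs}| \ge z_{\alpha/2}$ (2-sided), $z_{obs} \ge z_{\alpha}$ (1-sided)
P-value: $2P(Z \ge |z_{obs}|)$ (2-sided), $P(Z \ge z_{obs})$ (1-sided)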
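Example - Cotton Textile Production
95% Confidence Interval for $\pi_1 - \pi_2$: $(\hat{\pi}_1 - \hat{\pi}_2) \pm z_{.025}\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$
$(0.375 - 0.060) \pm 1.96(0.0577) \equiv 0.315 \pm 0.113 \equiv (0.202, 0.428)$
$H_0: \pi_1 = \pi_2$ ($\pi_1 - \pi_2 = 0$), $H_A: \pi_1 \ne \pi_2$ ($\pi_1 - \pi_2 \ne 0$)
TS: $z_{obs} = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.375 - 0.060}{\sqrt{0.221(0.779)\left(\frac{1}{88} + \frac{1}{84}\right)}} = \frac{0.315}{0.0633} = 4.98$
RR: $|z_{obs}| \ge z_{.025} = 1.96$
P-value: $2P(Z \ge 4.98) \approx 0$
Strong evidence of differences in quality by firm type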
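The hand calculations for the two-sample interval and test can be checked with a few lines of R. This is only a sketch with illustrative variable names; the interval uses the unpooled standard error and the test uses the pooled one, as in the formulas above. The squared z statistic should agree with the X-squared value that `prop.test` reports without continuity correction.

```r
# Two-sample comparison of proportions, computed step by step.
y1 <- 33; n1 <- 88     # not integrated firms: high quality / total
y2 <- 5;  n2 <- 84     # vertically integrated firms: high quality / total

p1 <- y1 / n1
p2 <- y2 / n2
d_hat <- p1 - p2

# Unpooled SE for the confidence interval
se_d <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- d_hat + c(-1, 1) * qnorm(0.975) * se_d

# Pooled SE for the test of H0: pi1 = pi2
p_pool <- (y1 + y2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_obs <- d_hat / se_pool
p_value <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)

print(round(c(lower = ci[1], upper = ci[2], z = z_obs, z.squared = z_obs^2), 4))
```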
R Code and Output
y1 <- 33; n1 <- 88
y2 <- 5; n2 <- 84
prop.test(c(y1,y2), c(n1,n2), correct=F)
> prop.test(c(y1,y2), c(n1,n2), correct=F)
2-sample test for equality of proportions without continuity correction
data: c(y1, y2) out of c(n1, n2)
X-squared = 24.851, df = 1, p-value = 6.195e-07
alternative hypothesis: two.sided
95 percent confidence interval:
0.2023778 0.4285746
sample estimates:
prop 1 prop 2
0.37500000 0.05952381
Measures of Association
Absolute Risk (AR): $\pi_1 - \pi_2$
Relative Risk (RR): $\pi_1 / \pi_2$
Odds Ratio (OR): $o_1 / o_2$, where $o = \pi/(1-\pi)$
Note that if $\pi_1 = \pi_2$ (no association between outcome and grouping variables): AR = 0, RR = 1, OR = 1
Relative Risk
Ratio of the probability that the outcome characteristic is present for one group, relative to the other
Sample proportions with the characteristic from groups 1 and 2: $\hat{\pi}_1 = y_1/n_1$, $\hat{\pi}_2 = y_2/n_2$
Relative Risk
Estimated Relative Risk: $RR = \hat{\pi}_1 / \hat{\pi}_2$
95% Confidence Interval for Population Relative Risk: $\left(RR\, e^{-1.96\sqrt{v}},\; RR\, e^{1.96\sqrt{v}}\right)$
where $e = 2.71828$ and $v = \frac{1-\hat{\pi}_1}{y_1} + \frac{1-\hat{\pi}_2}{y_2}$
Relative Risk Interpretation
Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
Example - Concussions in NCAA Athletes
Units: Game exposures among college soccer players 1997-1999
Outcome: Presence/Absence of a Concussion
Group Variable: Gender (Female vs Male)
Contingency Table of case outcomes:
Gender   Concussion   No Concussion   Total
Female   158          74924           75082
Male     101          75633           75734
Total    259          150557          150816
Source: Covassin, et al (2003). Sex Differences and the Incidence of Concussions Among Collegiate Athletes, Journal of Athletic Training, Vol. 38, #3, pp. 238-244
Example - Concussions in NCAA Athletes
Among Females: $\hat{\pi}_F = 158/75082 = 0.0021$ (2.1 concussions per 1000 female player/games)
Among Males: $\hat{\pi}_M = 101/75734 = 0.0013$ (1.3 concussions per 1000 male player/games)
$RR(F/M) = \hat{\pi}_F / \hat{\pi}_M = .0021/.0013 = 1.62$
$v = \frac{1 - .0021}{158} + \frac{1 - .0013}{101} = .0162$, $\sqrt{v} = .1273$
95% CI for Population Relative Risk: $\left(1.62\, e^{-1.96(.1273)},\; 1.62\, e^{1.96(.1273)}\right) \equiv (1.26, 2.08)$
There is strong evidence that females have a higher risk of concussion
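The relative risk interval can be sketched in R as below. The helper name `rel_risk_ci` is hypothetical. Because it works from the raw counts rather than from proportions rounded to four decimals, its estimate and limits come out somewhat lower than the hand calculation, although the conclusion (entire interval above 1) is the same.

```r
# Relative risk with a large-sample confidence interval on the log scale.
# rel_risk_ci() is a hypothetical helper name, not a base R function.
rel_risk_ci <- function(y1, n1, y2, n2, conf.level = 0.95) {
  p1 <- y1 / n1
  p2 <- y2 / n2
  rr <- p1 / p2
  v <- (1 - p1) / y1 + (1 - p2) / y2      # estimated variance of log(RR)
  z <- qnorm(1 - (1 - conf.level) / 2)
  c(rr = rr, v = v,
    lower = rr * exp(-z * sqrt(v)),
    upper = rr * exp(z * sqrt(v)))
}

# Concussion data: females (group 1) versus males (group 2)
rr_concussion <- rel_risk_ci(y1 = 158, n1 = 75082, y2 = 101, n2 = 75734)
print(round(rr_concussion, 4))
```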
Odds Ratio
Odds of an event is the probability it occurs divided by the probability it does not occur
Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2
Sample odds of the outcome for each group: $odds_1 = \frac{y_1/n_1}{(n_1 - y_1)/n_1} = \frac{y_1}{n_1 - y_1}$, $odds_2 = \frac{y_2}{n_2 - y_2}$
Odds Ratio
Estimated Odds Ratio: $OR = \frac{odds_1}{odds_2} = \frac{y_1/(n_1 - y_1)}{y_2/(n_2 - y_2)} = \frac{y_1(n_2 - y_2)}{y_2(n_1 - y_1)}$
95% Confidence Interval for Population Odds Ratio: $\left(OR\, e^{-1.96\sqrt{v}},\; OR\, e^{1.96\sqrt{v}}\right)$
where $e = 2.71828$ and $v = \frac{1}{y_1} + \frac{1}{n_1 - y_1} + \frac{1}{y_2} + \frac{1}{n_2 - y_2}$
Odds Ratio Interpretation
Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
Osteoarthritis in Former Soccer Players
Units: 68 former British professional football players and 136 age/sex matched controls
Outcome: Presence/Absence of Osteoarthritis (OA)
Data: Of $n_1 = 68$ former professionals, $y_1 = 9$ had OA, $n_1 - y_1 = 59$ did not
Of $n_2 = 136$ controls, $y_2 = 2$ had OA, $n_2 - y_2 = 134$ did not
$odds_1 = 9/59 = .1525$, $odds_2 = 2/134 = .0149$
$OR = odds_1/odds_2 = .1525/.0149 = 10.23$
$v = \frac{1}{9} + \frac{1}{59} + \frac{1}{2} + \frac{1}{134} = .6355$, $\sqrt{v} = .797$
95% CI for Population Odds Ratio: $\left(10.23\, e^{-1.96(.797)},\; 10.23\, e^{1.96(.797)}\right) \equiv (2.14, 48.80)$
The entire interval is above 1
Source: Shepard, et al (2003). Ex-professional association footballers have an increased prevalence of osteoarthritis of the hip compared with age matched controls despite not having sustained notable hip injuries, Brit. J. Sports Med., Vol. 37, #1, pp. 80-81
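A minimal R sketch of the odds ratio interval follows. The helper name `odds_ratio_ci` is hypothetical. It uses unrounded odds, so the estimate and upper limit differ slightly from the hand calculation based on rounded values.

```r
# Odds ratio with a large-sample confidence interval on the log scale.
# odds_ratio_ci() is a hypothetical helper name, not a base R function.
odds_ratio_ci <- function(y1, n1, y2, n2, conf.level = 0.95) {
  or <- (y1 * (n2 - y2)) / (y2 * (n1 - y1))
  v <- 1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2)  # variance of log(OR)
  z <- qnorm(1 - (1 - conf.level) / 2)
  c(or = or, v = v,
    lower = or * exp(-z * sqrt(v)),
    upper = or * exp(z * sqrt(v)))
}

# Osteoarthritis data: former professionals (group 1) versus controls (group 2)
or_oa <- odds_ratio_ci(y1 = 9, n1 = 68, y2 = 2, n2 = 136)
print(round(or_oa, 4))
```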
Fisher's Exact Test
Method for testing whether $\pi_2 = \pi_1$ when one or both of the group sample sizes is small
Measures (conditional on the group sizes and the number of cases with and without the characteristic) the chances we would see differences of this magnitude or larger in the sample proportions, if there were no differences in the populations
Example - Echinacea Purpurea for Colds
Healthy adults randomized to receive EP ($n_1 = 24$) or placebo ($n_2 = 22$, two were dropped)
Among EP subjects, 14 of 24 developed a cold after exposure to RV-39 (58%)
Among Placebo subjects, 18 of 22 developed a cold after exposure to RV-39 (82%)
Out of a total of 46 subjects, 32 developed a cold
Out of a total of 46 subjects, 24 received EP
Source: S.J. Sperber, et al (2004), Echinacea Purpurea for Prevention of Experimental Rhinovirus Colds, Clinical Infectious Diseases, Vol. 38, #10, pp. 1367-1371.
Example - Echinacea Purpurea for Colds
Conditional on 32 people developing colds, 24 receiving EP, and 22 receiving placebo, the following table gives the outcomes that would have been as strong or stronger evidence that EP reduced the risk of developing a cold (1-sided test). P-value from R is .079 (next slide).
EP/Cold   Placebo/Cold   Probability
14        18             0.059808
13        19             0.016025
12        20             0.002604
11        21             0.000229
10        22             0.000008
Sum                      0.078674
Probabilities: $p(y_{EP}, y_{PL}) = \frac{\binom{n_{EP}}{y_{EP}}\binom{n_{PL}}{y_{PL}}}{\binom{n_{EP} + n_{PL}}{y_{EP} + y_{PL}}}$
$p(14, 18) = \frac{\binom{24}{14}\binom{22}{18}}{\binom{46}{32}} = \frac{1961256(7315)}{239877544005} = .059808$
...
$p(10, 22) = \frac{\binom{24}{10}\binom{22}{22}}{\binom{46}{32}} = \frac{1961256(1)}{239877544005} = .0000082$
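The tabled probabilities can be reproduced directly from the binomial-coefficient formula, as in the R sketch below (variable names are illustrative). The same one-sided P-value is also available from the hypergeometric distribution function `phyper`, treating the 24 EP subjects as the "marked" group among the 32 subjects who developed colds.

```r
# Conditional (hypergeometric) probabilities behind the one-sided exact test.
n_ep <- 24; n_pl <- 22     # group sizes
n_cold <- 32               # total number who developed colds

y_ep <- 14:10              # EP colds: observed value and more extreme outcomes
y_pl <- n_cold - y_ep      # corresponding placebo colds

prob <- choose(n_ep, y_ep) * choose(n_pl, y_pl) / choose(n_ep + n_pl, n_cold)
p_one_sided <- sum(prob)

# Same P-value via the hypergeometric CDF: P(EP colds <= 14)
p_hyper <- phyper(14, n_ep, n_pl, n_cold)

print(data.frame(EP.cold = y_ep, Placebo.cold = y_pl,
                 Probability = round(prob, 6)))
print(round(c(sum = p_one_sided, phyper = p_hyper), 6))
```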
R Code/Output
## Rows: EP, placebo; columns: no cold, cold
ep.cold <- matrix(c(10,4, 14,18), ncol=2)
fisher.test(ep.cold, alt="greater")
fisher.test(ep.cold, alt="two.sided")
> fisher.test(ep.cold, alt="greater")
Fisher's Exact Test for Count Data
data: ep.cold
p-value = 0.07867
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.8653928 Inf
sample estimates:
odds ratio
3.132591
McNemar's Test for Paired Samples
Common subjects (or matched pairs) being observed under 2 conditions (2 treatments, before/after, 2 diagnostic tests) in a crossover setting
Two possible outcomes (Presence/Absence of Characteristic) on each measurement
Four possibilities for each subject/pair with respect to the outcome:
Present in both conditions
Absent in both conditions
Present in Condition 1, Absent in Condition 2
Absent in Condition 1, Present in Condition 2
McNemar's Test for Paired Samples
Condition 1\2   Present   Absent
Present         n11       n12
Absent          n21       n22
McNemar's Test for Paired Samples
Data: $n_{12}$ = # of pairs where the characteristic is present in condition 1 and not 2, and $n_{21}$ = # where present in 2 and not 1
$H_0$: Probability the outcome is Present is the same for the 2 conditions ($\pi_1 = \pi_2$)
$H_A$: Probabilities differ for the 2 conditions ($\pi_1 \ne \pi_2$)
Large-Sample Test (Normal Approximation to Binomial)
T.S.: $z_{obs} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}}$
P-value: $2P(Z \ge |z_{obs}|)$
Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
Subjects - 165 women having revision surgery involving silicone gel breast implants
Conditions (each being observed on all women):
Self Report of Presence/Absence of Rupture/Leak
Surgical Record of Presence/Absence of Rupture/Leak
[Table: SELF * SURGICAL Crosstabulation (counts)]
Source: Brown and Pennello (2002), Replacement Surgery and Silicone Gel Breast Implant Rupture, Journal of Women's Health & Gender-Based Medicine, Vol. 11, pp. 255-264
Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
$H_0$: Tendency to report ruptures/leaks is the same for self reports and surgical records
$H_A$: Tendencies differ
T.S.: $z_{obs} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}} = \frac{28 - 5}{\sqrt{28 + 5}} = 4.00$
P-value: $2P(Z \ge |z_{obs}|) = 2P(Z \ge 4) = 2(.0000317) \approx 0$
Exact P-value: $2P(Y \ge 28 \mid Y \sim B(n = 33, \pi = 0.5)) = .0001$
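Both P-values can be computed in a few lines of R, sketched below with illustrative variable names. The exact version conditions on the $n_{12} + n_{21} = 33$ discordant pairs and treats $n_{12}$ as Binomial(33, 0.5) under $H_0$; `pbinom(n12 - 1, ..., lower.tail = FALSE)` gives $P(Y \ge n_{12})$.

```r
# McNemar's test from the discordant pair counts.
n12 <- 28   # present by self report, absent in surgical record
n21 <- 5    # absent by self report, present in surgical record

# Large-sample z-test
z_obs <- (n12 - n21) / sqrt(n12 + n21)
p_normal <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)

# Exact binomial version: 2 * P(Y >= n12) with Y ~ Bin(n12 + n21, 0.5)
p_exact <- 2 * pbinom(n12 - 1, n12 + n21, 0.5, lower.tail = FALSE)

print(signif(c(z = z_obs, z.squared = z_obs^2,
               p.normal = p_normal, p.exact = p_exact), 4))
```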
R Code and Output
## Rows: self report (present, absent); columns: surgical record (present, absent)
rupture <- matrix(c(69,5, 28,63), ncol=2)
mcnemar.test(rupture, correct=F)
> mcnemar.test(rupture, correct=F)
McNemar's Chi-squared test
data: rupture
McNemar's chi-squared = 16.03, df = 1, p-value = 6.234e-05
Note that the mcnemar.test function reports $z^2$, which is chi-square with 1 degree of freedom (thus it is equivalent to the 2-sided z-test).
Mantel-Haenszel Test / CI for Multiple Tables
Data collected from $q$ studies or strata in 2x2 contingency tables with common groupings/outcomes
Each table has 4 cells: $n_{h11}, n_{h12}, n_{h21}, n_{h22}$, $h = 1,\ldots,q$
They can be combined for an overall Chi-square statistic or odds ratio and confidence interval
Table 1
Trt\Response   1       2       Total
1              n_111   n_112   n_11.
2              n_121   n_122   n_12.
Total          n_1.1   n_1.2   n_1..
Table q
Trt\Response   1       2       Total
1              n_q11   n_q12   n_q1.
2              n_q21   n_q22   n_q2.
Total          n_q.1   n_q.2   n_q..
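Mantel-Haenszel Computations
MH Test Statistic: $\chi^2_{MH} = \frac{\left[\sum_{h=1}^{q}\left(n_{h11} - \frac{n_{h1\cdot}\, n_{h\cdot 1}}{n_{h\cdot\cdot}}\right)\right]^2}{\sum_{h=1}^{q} \frac{n_{h1\cdot}\, n_{h2\cdot}\, n_{h\cdot 1}\, n_{h\cdot 2}}{n_{h\cdot\cdot}^2\,(n_{h\cdot\cdot} - 1)}}$
Rejection Region: $\chi^2_{MH} \ge \chi^2_{\alpha,1}$
P-value: $P(\chi^2_1 \ge \chi^2_{MH})$
$R_h = \frac{n_{h11}\, n_{h22}}{n_{h\cdot\cdot}}$, $S_h = \frac{n_{h12}\, n_{h21}}{n_{h\cdot\cdot}}$, $R = \sum_{h=1}^{q} R_h$, $S = \sum_{h=1}^{q} S_h$, $\widehat{OR}_{MH} = \frac{R}{S}$
$v = \hat{V}\left(\log(\widehat{OR}_{MH})\right) = \frac{1}{S^2}\sum_{h=1}^{q} S_h^2\left(\frac{1}{n_{h11}} + \frac{1}{n_{h12}} + \frac{1}{n_{h21}} + \frac{1}{n_{h22}}\right)$
95% CI for Overall Odds Ratio: $\left(\widehat{OR}_{MH}\, e^{-1.96\sqrt{v}},\; \widehat{OR}_{MH}\, e^{1.96\sqrt{v}}\right)$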
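The computations above can be sketched in R for a stack of 2x2 tables. The counts below are made up purely for illustration (they do not come from the slides or any study), and the variable names are arbitrary. As a check, the hand-computed statistic and common odds ratio are compared with base R's `mantelhaen.test` (with `correct = FALSE`); note that `mantelhaen.test` builds its confidence interval from a different variance estimate, so only the statistic and the point estimate are expected to match.

```r
# Mantel-Haenszel statistic, common odds ratio, and CI for q 2x2 tables.
# HYPOTHETICAL data: two strata, filled column by column (n11, n21, n12, n22).
tabs <- array(c(10, 5, 20, 25,
                15, 10, 10, 15), dim = c(2, 2, 2))

n11 <- tabs[1, 1, ]; n12 <- tabs[1, 2, ]
n21 <- tabs[2, 1, ]; n22 <- tabs[2, 2, ]
row1 <- n11 + n12; row2 <- n21 + n22      # treatment totals per stratum
col1 <- n11 + n21; col2 <- n12 + n22      # response totals per stratum
tot <- row1 + row2                        # stratum sizes

# Test statistic: squared sum of (observed - expected) over summed variances
chi_mh <- sum(n11 - row1 * col1 / tot)^2 /
  sum(row1 * row2 * col1 * col2 / (tot^2 * (tot - 1)))
p_value <- pchisq(chi_mh, df = 1, lower.tail = FALSE)

# Common odds ratio and the variance of its log, as given on the slide
R_h <- n11 * n22 / tot
S_h <- n12 * n21 / tot
or_mh <- sum(R_h) / sum(S_h)
v <- sum(S_h^2 * (1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)) / sum(S_h)^2
ci <- or_mh * exp(c(-1, 1) * qnorm(0.975) * sqrt(v))

# Cross-check against base R
builtin <- mantelhaen.test(tabs, correct = FALSE)
print(round(c(chi.mh = chi_mh, p.value = p_value, or.mh = or_mh,
              lower = ci[1], upper = ci[2]), 4))
```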
Associations Between Categorical Variables
Case where both the explanatory (independent) variable and the response (dependent) variable are qualitative
Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)