Understanding Categorical Data Analysis for Proportion Estimation

In the realm of categorical data analysis, estimating proportions is crucial for understanding population characteristics. This involves sampling, calculating sample proportions, standard errors, and constructing confidence intervals. Through examples like studying the effects of treatments on medical conditions or analyzing historical experiments, various methods like Large-Sample Confidence Intervals and the Wilson-Agresti-Coull Method are employed to make reliable inferences.


Uploaded on Sep 18, 2024


Presentation Transcript


  1. Chapter 10 Categorical Data Analysis

  2. Inference for a Single Proportion ($\pi$)
     Goal: Estimate the proportion of individuals in a population with a certain characteristic ($\pi$). This is equivalent to estimating a binomial probability.
     Sample: Take a SRS of n individuals from the population and observe y that have the characteristic. The sample proportion is y/n and has the following sampling properties:
     Sample proportion: $\hat{\pi} = y/n$
     Mean and Std. Dev. of sampling distribution: $\mu_{\hat{\pi}} = \pi$, $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
     Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
     Shape: approximately normal for large samples (Rule of thumb: $n\pi,\ n(1-\pi) \ge 5$)

  3. Large-Sample Confidence Interval for $\pi$
     Take a SRS of size n from a population where $\pi$ is the true (unknown) proportion of successes. Observe y successes.
     Set the confidence level $(1-\alpha)$ and obtain $z_{\alpha/2}$ from the z-table.
     Point Estimate: $\hat{\pi} = y/n$
     Estimated Standard Error: $SE_{\hat{\pi}} = \sqrt{\hat{\pi}(1-\hat{\pi})/n}$
     Margin of error: $m = z_{\alpha/2}\, SE_{\hat{\pi}}$
     $(1-\alpha)100\%$ confidence interval for $\pi$: $\hat{\pi} \pm m$

  4. Example - Ginkgo and Acetaz for AMS
     Study Goal: Measure the effect of Ginkgo and Acetazolamide on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers.
     Parameter: $\pi$ = true proportion of all trekkers receiving Ginkgo & Acetaz who would suffer from AMS.
     Sample Data: n = 126 trekkers received G&A, y = 18 suffered from AMS.
     $\hat{\pi} = 18/126 = .143$, $SE_{\hat{\pi}} = \sqrt{(.14)(.86)/126} = .031$
     Margin of error ($(1-\alpha)100\% = 95\%$): $m = 1.96(.031) = .061$
     95% CI for $\pi$: $.143 \pm .061 \equiv (.082, .204)$
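
The interval in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `large_sample_ci` is an illustrative choice, not something taken from the slides. Because it carries unrounded values, the last digit may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_ci(y, n, conf=0.95):
    """Large-sample (Wald) confidence interval for a single proportion."""
    p_hat = y / n                                   # point estimate y/n
    se = sqrt(p_hat * (1 - p_hat) / n)              # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z_{alpha/2}
    m = z * se                                      # margin of error
    return p_hat, se, m, (p_hat - m, p_hat + m)


# Ginkgo & Acetaz data from the slide: y = 18 AMS cases among n = 126 trekkers
p_hat, se, m, (lo, hi) = large_sample_ci(18, 126)
print(f"estimate={p_hat:.3f}  SE={se:.3f}  m={m:.3f}  CI=({lo:.3f}, {hi:.3f})")
```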

  5. Wilson-Agresti-Coull Method
     For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities.
     Simple approach that works well in practice: adjust the observed number of successes (y) and the sample size (n):
     $\tilde{y} = y + 0.5\, z_{\alpha/2}^2$, $\tilde{n} = n + z_{\alpha/2}^2$
     Note that for $\alpha = 0.05$, $z_{.025}^2 = 1.96^2 \approx 4$
     Point Estimate: $\tilde{\pi} = \tilde{y}/\tilde{n}$
     Estimated Standard Error: $SE_{\tilde{\pi}} = \sqrt{\tilde{\pi}(1-\tilde{\pi})/\tilde{n}}$
     Margin of error: $m = z_{\alpha/2}\, SE_{\tilde{\pi}}$
     $(1-\alpha)100\%$ confidence interval for $\pi$: $\tilde{\pi} \pm m$

  6. Example: Lister's Tests with Antiseptic
     Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870)
     n = 12 patients received antiseptic, y = 1 died
     $\tilde{y} = 1 + 0.5(1.96)^2 = 1 + 0.5(3.84) = 2.92$
     $\tilde{n} = 12 + (1.96)^2 = 12 + 3.84 = 15.84$
     $\tilde{\pi} = 2.92/15.84 = .1843$, $SE_{\tilde{\pi}} = \sqrt{.1843(.8157)/15.84} = .0974$
     Margin of error ($(1-\alpha)100\% = 95\%$): $1.96(.0974) = .1910$
     95% CI for $\pi$: $.1843 \pm .1910 \equiv (-.0067, .3753)$, reported as $(0, .38)$ since a proportion cannot be negative
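
The adjusted-count calculation can be sketched in Python as follows. The function name `wilson_agresti_coull_ci` is hypothetical, and clipping the limits to [0, 1] is an assumption of this sketch that mirrors how the slide reports a lower limit of 0.

```python
from math import sqrt
from statistics import NormalDist


def wilson_agresti_coull_ci(y, n, conf=0.95):
    """Adjusted interval: add 0.5*z^2 successes and z^2 trials, then proceed as usual."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z_{alpha/2}
    y_adj = y + 0.5 * z ** 2                        # adjusted number of successes
    n_adj = n + z ** 2                              # adjusted sample size
    p_adj = y_adj / n_adj                           # adjusted point estimate
    se = sqrt(p_adj * (1 - p_adj) / n_adj)          # estimated standard error
    m = z * se                                      # margin of error
    # A proportion lies in [0, 1], so the limits are clipped (sketch assumption).
    return p_adj, se, m, (max(0.0, p_adj - m), min(1.0, p_adj + m))


# Lister data from the slide: y = 1 death among n = 12 patients
p_adj, se, m, (lo, hi) = wilson_agresti_coull_ci(1, 12)
print(f"estimate={p_adj:.4f}  SE={se:.4f}  m={m:.4f}  CI=({lo:.4f}, {hi:.4f})")
```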

  7. Sample Size for Margin of Error = E
     Goal: Estimate $\pi$ within E with $100(1-\alpha)\%$ confidence. The confidence interval will have a width of 2E.
     $m = E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n} \;\Rightarrow\; n = \frac{z_{\alpha/2}^2\, \pi(1-\pi)}{E^2}$
     Since $\pi$ is unknown, an educated guess can be used, or set $\pi = 0.5$. This is most conservative, as $\pi(1-\pi)$ is largest for $\pi = 0.5$.
     For $\alpha = 0.05$, $\pi = 0.5$: $z_{\alpha/2} \approx 2$, so $n = \frac{2^2 (0.5)(1-0.5)}{E^2} = \frac{1}{E^2}$
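
As a rough illustration of the sample-size formula, the sketch below rounds the result up to the next whole subject. The function name `sample_size` and the margins 0.03 and 0.05 are illustrative assumptions, not values from the slides. It uses z = 1.96 rather than the slide's shortcut z ≈ 2, so it returns slightly smaller values than 1/E².

```python
from math import ceil
from statistics import NormalDist


def sample_size(E, pi=0.5, conf=0.95):
    """Smallest n whose large-sample margin of error is at most E.

    pi is an educated guess for the true proportion; 0.5 is the conservative default.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)    # z_{alpha/2}
    return ceil(z ** 2 * pi * (1 - pi) / E ** 2)


n_3pts = sample_size(0.03)            # conservative, margin of 3 percentage points
n_5pts = sample_size(0.05)            # conservative, margin of 5 percentage points
n_guess = sample_size(0.05, pi=0.2)   # with an educated guess of pi = 0.2
print(n_3pts, n_5pts, n_guess)
```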

  8. Significance Test for a Proportion
     Goal: test whether a proportion ($\pi$) equals some null value $\pi_0$. $H_0: \pi = \pi_0$
     Test Statistic: $z_{obs} = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
     $H_a: \pi > \pi_0$   RR: $z_{obs} \ge z_{\alpha}$   P-value: $P(Z \ge z_{obs})$
     $H_a: \pi < \pi_0$   RR: $z_{obs} \le -z_{\alpha}$   P-value: $P(Z \le z_{obs})$
     $H_a: \pi \ne \pi_0$   RR: $|z_{obs}| \ge z_{\alpha/2}$   P-value: $2P(Z \ge |z_{obs}|)$
     Large-sample test works well when $n\pi_0$ and $n(1-\pi_0) \ge 5$

  9. Ginkgo and Acetaz for AMS
     Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?
     $H_0: \pi = 0.25$   $H_a: \pi < 0.25$
     $y = 18$, $n = 126$, $\hat{\pi} = 18/126 = 0.143$, $\pi_0 = 0.25$
     Test Statistic: $z_{obs} = \frac{0.143 - 0.25}{\sqrt{.25(.75)/126}} = \frac{-.107}{.039} = -2.75$
     RR ($\alpha = .05$): $z_{obs} \le -z_{.05} = -1.645$
     P-value: $P(Z \le -2.75) = .0030$
     Strong evidence that incidence rate is below 25% ($\pi < 0.25$)
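
The lower-tailed test in this example can be checked with a short standard-library sketch. The function name `one_proportion_z_test` is hypothetical. Unrounded arithmetic gives a statistic near -2.78 rather than the slide's -2.75, which comes from rounding the numerator and denominator first; the conclusion is unchanged.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(y, n, pi0):
    """z statistic and lower-tail P-value for H0: pi = pi0 vs Ha: pi < pi0."""
    p_hat = y / n
    z_obs = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)   # SE uses the null value pi0
    p_lower = NormalDist().cdf(z_obs)                   # P(Z <= z_obs)
    return z_obs, p_lower


# G&A data from the slide: y = 18, n = 126, null value 0.25
z_obs, p_value = one_proportion_z_test(18, 126, 0.25)
reject = z_obs <= -NormalDist().inv_cdf(0.95)           # rejection region at alpha = .05
print(f"z_obs={z_obs:.2f}  P-value={p_value:.4f}  reject H0: {reject}")
```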

  10. Comparing Two Population Proportions
      Goal: Compare two populations/treatments with respect to a nominal (binary) outcome
      Sampling Design: Independent vs Dependent Samples
      Methods based on large vs small samples
      Contingency tables used to summarize data
      Measures of Association: Absolute Risk, Relative Risk, Odds Ratio

  11. Contingency Tables
      Tables representing all combinations of levels of explanatory and response variables
      Numbers in table represent counts of the number of cases in each cell
      Row and column totals are called marginal counts

  12. 2x2 Tables - Notation
      | | Outcome Present | Outcome Absent | Group Total |
      |---|---|---|---|
      | Group 1 | y1 | n1-y1 | n1 |
      | Group 2 | y2 | n2-y2 | n2 |
      | Outcome Total | y1+y2 | (n1+n2)-(y1+y2) | n1+n2 |

  13. Example - Firm Type/Product Quality
      | | High Quality | Low Quality | Group Total |
      |---|---|---|---|
      | Not Integrated | 33 | 55 | 88 |
      | Vertically Integrated | 5 | 79 | 84 |
      | Outcome Total | 38 | 134 | 172 |
      Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers
      Outcomes: High Quality (High Count) vs Low Quality (Low Count)
      Source: Temin (1988)

  14. Notation
      Proportion in Population 1 with the characteristic of interest: $\pi_1$
      Sample size from Population 1: $n_1$
      Number of individuals in Sample 1 with the characteristic of interest: $y_1$
      Sample proportion from Sample 1 with the characteristic of interest: $\hat{\pi}_1 = y_1/n_1$
      Similar notation for Population/Sample 2

  15. Example - Cotton Textile Producers
      $\pi_1$ - True proportion of all non-integrated firms that would produce high quality
      $\pi_2$ - True proportion of all vertically integrated firms that would produce high quality
      $n_1 = 88$, $y_1 = 33$, $\hat{\pi}_1 = y_1/n_1 = 33/88 = 0.375$
      $n_2 = 84$, $y_2 = 5$, $\hat{\pi}_2 = y_2/n_2 = 5/84 = 0.060$

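  16. Notation (Continued)
      Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions with the characteristic (2 other measures given below)
      Estimator: $D = \hat{\pi}_1 - \hat{\pi}_2$
      Standard Error (and its estimate): $\sigma_D = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$, $SE_D = \sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$
      Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $SE_{D_P} = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, where $\hat{\pi} = \frac{y_1 + y_2}{n_1 + n_2}$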

  17. Cotton Textile Producers (Continued)
      Parameter of Primary Interest: $\pi_1 - \pi_2$, the difference in the 2 population proportions that produce high quality output
      Estimator: $D = \hat{\pi}_1 - \hat{\pi}_2 = 0.375 - 0.060 = 0.315$
      Standard Error (and its estimate): $SE_D = \sqrt{\frac{0.375(0.625)}{88} + \frac{0.060(0.94)}{84}} = \sqrt{.003335} = .0577$
      Pooled Estimated Standard Error when $\pi_1 = \pi_2 = \pi$: $\hat{\pi} = \frac{33+5}{88+84} = 0.221$, $SE_{D_P} = \sqrt{0.221(0.779)\left(\frac{1}{88} + \frac{1}{84}\right)} = .0633$

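  18. Significance Tests for $\pi_1 - \pi_2$
      Deciding whether $\pi_1 = \pi_2$ can be done by interpreting plausible values of $\pi_1 - \pi_2$ from the confidence interval:
      If entire interval is positive, conclude $\pi_1 > \pi_2$ ($\pi_1 - \pi_2 > 0$)
      If entire interval is negative, conclude $\pi_1 < \pi_2$ ($\pi_1 - \pi_2 < 0$)
      If interval contains 0, do not conclude that $\pi_1 \ne \pi_2$
      Alternatively, we can conduct a significance test:
      $H_0: \pi_1 = \pi_2$   $H_a: \pi_1 \ne \pi_2$ (2-sided)   $H_a: \pi_1 > \pi_2$ (1-sided)
      Test Statistic: $z_{obs} = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
      RR: $|z_{obs}| \ge z_{\alpha/2}$ (2-sided)   $z_{obs} \ge z_{\alpha}$ (1-sided)
      P-value: $2P(Z \ge |z_{obs}|)$ (2-sided)   $P(Z \ge z_{obs})$ (1-sided)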

  19. Example - Cotton Textile Production
      $H_0: \pi_1 = \pi_2$ ($\pi_1 - \pi_2 = 0$)
      $H_A: \pi_1 \ne \pi_2$ ($\pi_1 - \pi_2 \ne 0$)
      TS: $z_{obs} = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.375 - 0.060}{\sqrt{0.221(0.779)\left(\frac{1}{88} + \frac{1}{84}\right)}} = \frac{0.315}{0.0633} = 4.98$
      RR: $|z_{obs}| \ge z_{.025} = 1.96$
      P-value: $2P(Z \ge 4.98) \approx 0$
      Again, there is strong evidence that non-integrated firms are more likely to produce high quality output than integrated firms
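
The cotton textile calculations (difference, unpooled and pooled standard errors, and the z test) can be reproduced with the sketch below. The function name `two_proportion_summary` is an illustrative choice, and only the Python standard library is assumed.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(y1, n1, y2, n2):
    """Difference in sample proportions, both standard errors, and the pooled z test."""
    p1, p2 = y1 / n1, y2 / n2
    d = p1 - p2                                                   # estimator D
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # SE for a confidence interval
    p_pool = (y1 + y2) / (n1 + n2)                                # pooled proportion under H0
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))   # SE for the test
    z_obs = d / se_pooled
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return d, se_unpooled, se_pooled, z_obs, p_two_sided


# Not integrated: 33 of 88 high quality; vertically integrated: 5 of 84
d, se_unpooled, se_pooled, z_obs, p_value = two_proportion_summary(33, 88, 5, 84)
print(f"D={d:.3f}  SE={se_unpooled:.4f}  pooled SE={se_pooled:.4f}  z={z_obs:.2f}  P={p_value:.1e}")
```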

  20. Fisher's Exact Test
      Method of testing whether $\pi_2 = \pi_1$ when one or both of the group sample sizes is small
      Measures (conditional on the group sizes and number of cases with and without the characteristic) the chances we would see differences of this magnitude or larger in the sample proportions, if there were no differences in the populations

  21. Example - Echinacea Purpurea for Colds
      Healthy adults randomized to receive EP (n1=24) or placebo (n2=22, two were dropped)
      Among EP subjects, 14 of 24 developed cold after exposure to RV-39 (58%)
      Among Placebo subjects, 18 of 22 developed cold after exposure to RV-39 (82%)
      Out of a total of 46 subjects, 32 developed cold
      Out of a total of 46 subjects, 24 received EP
      Source: Sperber, et al (2004)

  22. Example - Echinacea Purpurea for Colds
      Conditional on 32 people developing colds and 24 receiving EP and 22 receiving placebo, the following table gives the outcomes that would have been as strong or stronger evidence that EP reduced risk of developing cold (1-sided test). P-value from SPSS is .079 (next slide).
      | EP/Cold | Placebo/Cold | Probability |
      |---|---|---|
      | 14 | 18 | 0.059808 |
      | 13 | 19 | 0.016025 |
      | 12 | 20 | 0.002604 |
      | 11 | 21 | 0.000229 |
      | 10 | 22 | 0.000008 |
      | Sum | | 0.078674 |
      Probabilities: $p(y_{EP}, y_{PL}) = \frac{\binom{n_{EP}}{y_{EP}}\binom{n_{PL}}{y_{PL}}}{\binom{n_{EP}+n_{PL}}{y_{EP}+y_{PL}}}$
      $p(14,18) = \frac{\binom{24}{14}\binom{22}{18}}{\binom{46}{32}} = \frac{1961256(7315)}{239877544005} = .059808$
      ...
      $p(10,22) = \frac{\binom{24}{10}\binom{22}{22}}{\binom{46}{32}} = \frac{1961256(1)}{239877544005} = .0000082$
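
The table of conditional (hypergeometric) probabilities and the one-sided P-value can be recomputed exactly with `math.comb`. This is a sketch under the slide's setup; the function names `hypergeom_prob` and `fisher_lower_tail` are hypothetical.

```python
from math import comb


def hypergeom_prob(y_ep, n_ep, n_pl, total_colds):
    """P(y_ep colds in the EP group), conditional on group sizes and the total number of colds."""
    y_pl = total_colds - y_ep
    return comb(n_ep, y_ep) * comb(n_pl, y_pl) / comb(n_ep + n_pl, total_colds)


def fisher_lower_tail(y_ep, n_ep, n_pl, total_colds):
    """One-sided P-value: outcomes with as few or fewer colds in the EP group than observed."""
    lowest = max(0, total_colds - n_pl)   # smallest EP count that is possible
    return sum(hypergeom_prob(k, n_ep, n_pl, total_colds)
               for k in range(lowest, y_ep + 1))


# EP: 14 of 24 developed colds; placebo: 18 of 22; 32 colds in total
p_14 = hypergeom_prob(14, 24, 22, 32)
p_value = fisher_lower_tail(14, 24, 22, 32)
print(f"p(14,18)={p_14:.6f}  one-sided P-value={p_value:.6f}")
```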

  23. Example - SPSS Output
      [SPSS output: TRT * COLD crosstabulation and Chi-Square Tests table with columns Value, df, Asymp. Sig. (2-sided), Exact Sig. (2-sided), Exact Sig. (1-sided); the numeric entries are not recoverable from the transcript.]
      a. Computed only for a 2x2 table
      b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.70.

  24. McNemar's Test for Paired Samples
      Common subjects (or matched pairs) being observed under 2 conditions (2 treatments, before/after, 2 diagnostic tests) in a crossover setting
      Two possible outcomes (Presence/Absence of Characteristic) on each measurement
      Four possibilities for each subject/pair with respect to outcome:
      Present in both conditions
      Absent in both conditions
      Present in Condition 1, Absent in Condition 2
      Absent in Condition 1, Present in Condition 2

  25. McNemar's Test for Paired Samples
      | Condition 1\2 | Present | Absent |
      |---|---|---|
      | Present | n11 | n12 |
      | Absent | n21 | n22 |

  26. McNemar's Test for Paired Samples
      Data: $n_{12}$ = # of pairs where the characteristic is present in condition 1 and not 2, and $n_{21}$ = # of pairs where it is present in condition 2 and not 1
      $H_0$: Probability the outcome is Present is the same for the 2 conditions ($\pi_1 = \pi_2$)
      $H_A$: Probabilities differ for the 2 conditions ($\pi_1 \ne \pi_2$)
      Large-Sample Test (Normal Approximation to Binomial)
      T.S.: $z_{obs} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}}$
      P-val: $2P(Z \ge |z_{obs}|)$

  27. Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
      Subjects - 165 women having revision surgery involving silicone gel breast implants
      Conditions (each being observed on all women):
      Self Report of Presence/Absence of Rupture/Leak
      Surgical Record of Presence/Absence of Rupture/Leak
      [SPSS crosstabulation of self report by surgical record; the cell counts are not recoverable from the transcript.]
      Source: Brown and Pennello (2002), "Replacement Surgery and Silicone Gel Breast Implant Rupture", Journal of Women's Health & Gender-Based Medicine, Vol. 11, pp. 255-264

  28. Example - Reporting of Silicone Breast Implant Leakage in Revision Surgery
      $H_0$: Tendency to report ruptures/leaks is the same for self reports and surgical records
      $H_A$: Tendencies differ
      T.S.: $z_{obs} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}} = \frac{28 - 5}{\sqrt{28 + 5}} = 4.00$
      P-val: $2P(Z \ge |z_{obs}|) = 2P(Z \ge 4) = 2(.0000317) \approx 0$
      Exact P-value: $2P(Y \ge 28 \mid Y \sim B(n = 33, \pi = 0.5))$
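
Both the normal-approximation and the exact binomial versions of this test can be sketched with the standard library. The function name `mcnemar` is hypothetical; the exact P-value doubles the upper binomial tail, following the slide's expression.

```python
from math import comb, sqrt
from statistics import NormalDist


def mcnemar(n12, n21):
    """McNemar's test from the two discordant counts."""
    z_obs = (n12 - n21) / sqrt(n12 + n21)
    p_normal = 2 * (1 - NormalDist().cdf(abs(z_obs)))   # large-sample P-value
    n = n12 + n21                                       # discordant pairs only
    k = max(n12, n21)
    # Under H0 each discordant pair is equally likely to fall either way: Y ~ B(n, 0.5)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return z_obs, p_normal, p_exact


# Implant data from the slide: n12 = 28, n21 = 5
z_obs, p_normal, p_exact = mcnemar(28, 5)
print(f"z_obs={z_obs:.2f}  normal P={p_normal:.1e}  exact P={p_exact:.1e}")
```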

  29. Multinomial Experiment / Distribution
      Extension of Binomial Distribution to experiments where each trial can end in exactly one of k categories
      n independent trials
      Probability a trial results in category i is $\pi_i$
      $n_i$ is the number of trials resulting in category i
      $\pi_1 + \dots + \pi_k = 1$
      $n_1 + \dots + n_k = n$

  30. Multinomial Distribution / Test for Cell Probabilities
      $p(n_1, \dots, n_k) = \frac{n!}{n_1! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k}$, with $\sum_{i=1}^{k} n_i = n$, $\sum_{i=1}^{k} \pi_i = 1$, $n_i \ge 0$, $\pi_i \ge 0$
      Testing whether the category probabilities are specific values:
      $H_0: \pi_1 = \pi_{10}, \dots, \pi_k = \pi_{k0}$, where $\sum_{i=1}^{k} \pi_{i0} = 1$
      $H_A$: At least one cell probability is not as specified
      Expected cell counts under $H_0$: $E_i = n\pi_{i0}$, $i = 1, \dots, k$
      Test Statistic: $\chi^2_{obs} = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$
      Rejection Region: $\chi^2_{obs} \ge \chi^2_{\alpha, k-1}$
      P-value: $P(\chi^2_{k-1} \ge \chi^2_{obs})$
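
The slide gives no numerical example for this test, so the sketch below uses hypothetical counts (120 rolls of a die tested against equal cell probabilities) purely to show the mechanics. The function name `chi_square_gof` and the data are assumptions; the critical value 11.07 is the standard table value for 5 degrees of freedom at $\alpha = .05$.

```python
def chi_square_gof(observed, null_probs):
    """Pearson chi-square statistic for H0: cell probabilities equal null_probs."""
    n = sum(observed)
    expected = [n * p for p in null_probs]          # E_i = n * pi_i0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected


observed = [18, 22, 20, 25, 15, 20]                 # hypothetical counts, n = 120
null_probs = [1 / 6] * 6                            # H0: all six categories equally likely
chi2_obs, expected = chi_square_gof(observed, null_probs)

CRITICAL_05_DF5 = 11.07                             # chi-square table value, alpha = .05, df = k - 1 = 5
reject = chi2_obs >= CRITICAL_05_DF5
print(f"chi2_obs={chi2_obs:.2f}  df={len(observed) - 1}  reject H0: {reject}")
```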

  31. Goodness of Fit Test for a Probability Distribution
      Data are collected and we wish to determine whether they come from a particular probability distribution (e.g. Poisson, Normal, Gamma)
      Estimate any unknown model parameters (p estimates)
      Break down the range of data values into k > p intervals (typically where 80% have expected counts $\ge 5$) and obtain observed (n) and expected (E) values for each interval
      Test Statistic: $\chi^2_{obs} = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$
      P-value: $P(\chi^2_{k-1-p} \ge \chi^2_{obs})$
      Assessing quality of fit to hypothesized distribution:
      | P-value | Quality of Fit |
      |---|---|
      | > .25 | Excellent |
      | .15-.25 | Good |
      | .05-.15 | Moderately Good |
      | .01-.05 | Poor |
      | < .01 | Unacceptable |

  32. Associations Between Categorical Variables
      Case where both explanatory (independent) variable and response (dependent) variable are qualitative
      Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)

  33. Contingency Tables
      Cross-tabulations of frequency counts where the rows (typically) represent the levels of the explanatory variable and the columns represent the levels of the response variable.
      Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables
      Row and column totals are called the marginal distributions for the two variables

  34. Example - Cyclones Near Antarctica
      Period of Study: September 1973 - May 1975
      Explanatory Variable: Region (40-49, 50-59, 60-79) (Degrees South Latitude)
      Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8)) (Number of months in parentheses)
      Units: Cyclones in the study area
      Treating the observed cyclones as a random sample of all cyclones that could have occurred
      Source: Howarth (1983), "An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent", Annals of the Association of American Geographers, Vol. 73, pp. 519-537

  35. Example - Cyclones Near Antarctica
      | Region\Season | Autumn | Winter | Spring | Summer | Total |
      |---|---|---|---|---|---|
      | 40-49S | 370 | 452 | 273 | 422 | 1517 |
      | 50-59S | 526 | 624 | 513 | 1059 | 2722 |
      | 60-79S | 980 | 1200 | 995 | 1751 | 4926 |
      | Total | 1876 | 2276 | 1781 | 3232 | 9165 |
      For each region (row) we can compute the percentage of storms occurring during each season, the conditional distribution. Of the 1517 cyclones in the 40-49 band, 370 occurred in Autumn, a proportion of 370/1517=.244, or 24.4% as a percentage.
      | Region\Season | Autumn | Winter | Spring | Summer | Total% (n) |
      |---|---|---|---|---|---|
      | 40-49S | 24.4 | 29.8 | 18.0 | 27.8 | 100.0 (1517) |
      | 50-59S | 19.3 | 22.9 | 18.9 | 38.9 | 100.0 (2722) |
      | 60-79S | 19.9 | 24.4 | 20.2 | 35.5 | 100.0 (4926) |
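
The conditional distributions can be computed by dividing each cell by its row total. The sketch below uses the cyclone counts with regions as rows and seasons as columns; the names `counts` and `row_percentages` are illustrative choices.

```python
regions = ["40-49S", "50-59S", "60-79S"]
seasons = ["Autumn", "Winter", "Spring", "Summer"]

# Observed cyclone counts: one row per region, one column per season
counts = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def row_percentages(table):
    """Conditional distribution of the column variable within each row, in percent."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


pct = row_percentages(counts)
for region, row in zip(regions, pct):
    cells = "  ".join(f"{season}: {value:.1f}" for season, value in zip(seasons, row))
    print(f"{region}  {cells}")
```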

  36. Example - Cyclones Near Antarctica
      [Figure: bar chart of regpct (percent of cyclones) by season (Autumn, Winter, Spring, Summer), with separate bars for each region (40-49S, 50-59S, 60-79S)]
      Graphical Conditional Distributions for Regions

  37. Guidelines for Contingency Tables
      Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged.
      Divide the cell totals by the row (explanatory category) total and multiply by 100 to obtain a percent; the row percents will add to 100.
      Give a title and clearly define variables and categories.
      Include row (explanatory) total sample sizes.

  38. Independence & Dependence
      Statistically Independent: Population conditional distributions of one variable are the same across all levels of the other variable
      Statistically Dependent: Conditional distributions are not all equal
      When testing, researchers typically wish to demonstrate dependence (alternative hypothesis), and wish to refute independence (null hypothesis)

  39. Pearson's Chi-Square Test
      Can be used for nominal or ordinal explanatory and response variables
      Variables can have any number of distinct levels
      Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables)
      r = # of levels of explanatory variable
      c = # of levels of response variable

  40. Pearson's Chi-Square Test
      Intuition behind test statistic:
      Obtain marginal distribution of outcomes for the response variable
      Apply this common distribution to all levels of the explanatory variable, by multiplying each proportion by the corresponding sample size
      Measure the difference between actual cell counts and the expected cell counts in the previous step

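  41. Pearson's Chi-Square Test
      Notation to obtain test statistic
      Rows represent explanatory variable (r levels)
      Cols represent response variable (c levels)
      | | 1 | 2 | ... | c | Total |
      |---|---|---|---|---|---|
      | 1 | n11 | n12 | ... | n1c | n1. |
      | 2 | n21 | n22 | ... | n2c | n2. |
      | ... | ... | ... | ... | ... | ... |
      | r | nr1 | nr2 | ... | nrc | nr. |
      | Total | n.1 | n.2 | ... | n.c | n.. |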

  42. Pearson's Chi-Square Test
      Observed frequency ($n_{ij}$): The number of individuals falling in a particular cell
      Expected frequency ($E_{ij}$): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence.
      Computed by multiplying the row total and the column total, and dividing by the overall sample size: $E_{ij} = n_{i.}\, n_{.j} / n_{..}$
      Applies the overall marginal probability of the response category to the sample size of the explanatory category

  43. Pearson's Chi-Square Test
      Large-sample test (at least 80% of $E_{ij}$ > 5)
      H0: Variables are statistically independent (No association between variables)
      Ha: Variables are statistically dependent (Association exists between variables)
      Test Statistic: $\chi^2_{obs} = \sum_{i}\sum_{j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$
      P-value: Area above $\chi^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8)

  44. Example - Cyclones Near Antarctica
      Observed Cell Counts ($n_{ij}$):
      | Region\Season | Autumn | Winter | Spring | Summer | Total |
      |---|---|---|---|---|---|
      | 40-49S | 370 | 452 | 273 | 422 | 1517 |
      | 50-59S | 526 | 624 | 513 | 1059 | 2722 |
      | 60-79S | 980 | 1200 | 995 | 1751 | 4926 |
      | Total | 1876 | 2276 | 1781 | 3232 | 9165 |
      Note that overall: (1876/9165)100%=20.5% of all cyclones occurred in Autumn. If we apply that percentage to the 1517 that occurred in the 40-49S band, we would expect (0.205)(1517)=310.5 to have occurred in the first cell of the table. The full table of $E_{ij}$:
      | Region\Season | Autumn | Winter | Spring | Summer | Total |
      |---|---|---|---|---|---|
      | 40-49S | 310.5 | 376.7 | 294.8 | 535.0 | 1517 |
      | 50-59S | 557.2 | 676.0 | 529.0 | 959.9 | 2722 |
      | 60-79S | 1008.3 | 1223.3 | 957.3 | 1737.1 | 4926 |
      | Total | 1876 | 2276 | 1781 | 3232 | 9165 |

  45. Example - Cyclones Near Antarctica
      Computation of $\chi^2_{obs}$:
      | Region | Season | n_ij | E_ij | (n-E)^2 | ((n-E)^2)/E |
      |---|---|---|---|---|---|
      | 40-49S | Autumn | 370 | 310.5 | 3540.25 | 11.4017713 |
      | 40-49S | Winter | 452 | 376.7 | 5670.09 | 15.0520042 |
      | 40-49S | Spring | 273 | 294.8 | 475.24 | 1.61207598 |
      | 40-49S | Summer | 422 | 535.0 | 12769 | 23.8672897 |
      | 50-59S | Autumn | 526 | 557.2 | 973.44 | 1.74702082 |
      | 50-59S | Winter | 624 | 676.0 | 2704 | 4 |
      | 50-59S | Spring | 513 | 529.0 | 256 | 0.48393195 |
      | 50-59S | Summer | 1059 | 959.9 | 9820.81 | 10.2310762 |
      | 60-79S | Autumn | 980 | 1008.3 | 800.89 | 0.79429733 |
      | 60-79S | Winter | 1200 | 1223.3 | 542.89 | 0.44379138 |
      | 60-79S | Spring | 995 | 957.3 | 1421.29 | 1.4846861 |
      | 60-79S | Summer | 1751 | 1737.1 | 193.21 | 0.11122561 |
      | | | | | $\chi^2_{obs}$ | 71.2291706 |

  46. Example - Cyclones Near Antarctica
      H0: Seasonal distribution of cyclone occurrences is independent of latitude band
      Ha: Seasonal distributions of cyclone occurrences differ among latitude bands
      Test Statistic: $\chi^2_{obs} = 71.2$
      RR: $\chi^2_{obs} \ge \chi^2_{.05,6} = 12.59$
      P-value: Area in chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2
      From Table 8, $P(\chi^2 \ge 22.46) = .001 \Rightarrow P < .001$
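
The whole test can be sketched in plain Python. The helper names are illustrative assumptions. The P-value helper `chi2_sf_even_df` uses a closed form for the chi-squared upper tail that holds only when the degrees of freedom are even, which is the case here (df = 6). With unrounded expected counts the statistic comes out near 71.19, slightly below the hand value that uses expected counts rounded to one decimal.

```python
from math import exp

# Observed cyclone counts: rows = regions (40-49S, 50-59S, 60-79S), columns = seasons
counts = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def expected_counts(table):
    """E_ij = (row total)(column total) / (grand total)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


def pearson_chi_square(table):
    """Pearson chi-square statistic and its degrees of freedom (r-1)(c-1)."""
    expected = expected_counts(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi2_sf_even_df(x, df):
    """P(chi-square_df >= x); this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k          # (x/2)^k / k!
        total += term
    return exp(-x / 2) * total


chi2_obs, df = pearson_chi_square(counts)
p_value = chi2_sf_even_df(chi2_obs, df)
print(f"chi2_obs={chi2_obs:.3f}  df={df}  P-value={p_value:.2e}")
```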

  47. Likelihood Ratio Statistic
      Alternative statistic provided by many computer packages:
      LR Test Statistic: $X^2_{LR} = 2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij} \ln\left(\frac{n_{ij}}{E_{ij}}\right) = 2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij} \ln\left(\frac{n_{..}\, n_{ij}}{n_{i.}\, n_{.j}}\right)$
      Rejection Region: $X^2_{LR} \ge \chi^2_{\alpha,(r-1)(c-1)}$
      P-value: $P(\chi^2_{(r-1)(c-1)} \ge X^2_{LR})$
      | Row(i) | Column(j) | n_ij | n_i. | n_.j | X2(LR) |
      |---|---|---|---|---|---|
      | 1 | 1 | 370 | 1517 | 1876 | 129.6947 |
      | 1 | 2 | 452 | 1517 | 2276 | 164.6768 |
      | 1 | 3 | 273 | 1517 | 1781 | -41.9335 |
      | 1 | 4 | 422 | 1517 | 3232 | -200.192 |
      | 2 | 1 | 526 | 2722 | 1876 | -60.5646 |
      | 2 | 2 | 624 | 2722 | 2276 | -99.8393 |
      | 2 | 3 | 513 | 2722 | 1781 | -31.4258 |
      | 2 | 4 | 1059 | 2722 | 3232 | 208.0912 |
      | 3 | 1 | 980 | 4926 | 1876 | -55.8208 |
      | 3 | 2 | 1200 | 4926 | 2276 | -46.1601 |
      | 3 | 3 | 995 | 4926 | 1781 | 76.9673 |
      | 3 | 4 | 1751 | 4926 | 3232 | 27.84263 |
      | Sum | | | | | 71.33672 |
      Note: The formula on page 512 of textbook is incorrect
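
The likelihood ratio statistic for the cyclone table can be recomputed with the sketch below. The function name `likelihood_ratio_statistic` is hypothetical; cells with a zero count are skipped on the usual convention that they contribute 0, which does not arise in this table.

```python
from math import log

# Observed cyclone counts: rows = regions, columns = seasons
counts = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def likelihood_ratio_statistic(table):
    """X2(LR) = 2 * sum over cells of n_ij * ln(n_ij / E_ij)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n      # expected count under independence
            if n_ij > 0:                            # a zero cell contributes nothing
                total += 2 * n_ij * log(n_ij / e_ij)
    return total


x2_lr = likelihood_ratio_statistic(counts)
print(f"X2(LR)={x2_lr:.3f}")
```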

  48. SPSS Output - Cyclone Example
      REGION * SEASON Crosstabulation
      | REGION | | Autumn | Winter | Spring | Summer | Total |
      |---|---|---|---|---|---|---|
      | 40-49S | Count | 370 | 452 | 273 | 422 | 1517 |
      | | Expected Count | 310.5 | 376.7 | 294.8 | 535.0 | 1517.0 |
      | | % within REGION | 24.4% | 29.8% | 18.0% | 27.8% | 100.0% |
      | 50-59S | Count | 526 | 624 | 513 | 1059 | 2722 |
      | | Expected Count | 557.2 | 676.0 | 529.0 | 959.9 | 2722.0 |
      | | % within REGION | 19.3% | 22.9% | 18.8% | 38.9% | 100.0% |
      | 60-79S | Count | 980 | 1200 | 995 | 1751 | 4926 |
      | | Expected Count | 1008.3 | 1223.3 | 957.3 | 1737.1 | 4926.0 |
      | | % within REGION | 19.9% | 24.4% | 20.2% | 35.5% | 100.0% |
      | Total | Count | 1876 | 2276 | 1781 | 3232 | 9165 |
      | | Expected Count | 1876.0 | 2276.0 | 1781.0 | 3232.0 | 9165.0 |
      | | % within REGION | 20.5% | 24.8% | 19.4% | 35.3% | 100.0% |
      Chi-Square Tests
      | | Value | df | Asymp. Sig. (2-sided) (P-value) |
      |---|---|---|---|
      | Pearson Chi-Square | 71.189 (a) | 6 | .000 |
      | Likelihood Ratio | 71.337 | 6 | .000 |
      | Linear-by-Linear Association | 23.418 | 1 | .000 |
      | N of Valid Cases | 9165 | | |
      a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 294.79.

  49. Misuses of chi-squared Test
      Expected frequencies too small (at least 80% of expected counts should be above 5; this is not necessary for the observed counts)
      Dependent samples (the same individuals are in each row; see McNemar's test)
      Can be used for nominal or ordinal variables, but more powerful methods exist for when both variables are ordinal and a directional association is hypothesized

  50. Residual Analysis
      Once dependence has been determined from a chi-squared test, we are often interested in determining which cells contributed
      Residual: $f_o - f_e$ measures the difference between the observed and expected counts
      Positive implies observed more than expected
      A residual's practical importance depends on the level of $f_e$
      Adjusted Residual (computed for each cell): $\frac{f_o - f_e}{\sqrt{f_e\,(1 - \text{row proportion})(1 - \text{column proportion})}}$
      Adjusted residuals above 3 in absolute value give strong evidence against independence in that cell
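
As an illustration, the adjusted residuals for the cyclone table from the earlier slides can be computed as follows. The function name `adjusted_residuals` is an assumption of this sketch; row and column proportions are the marginal totals divided by the overall sample size.

```python
from math import sqrt

# Observed cyclone counts: rows = regions (40-49S, 50-59S, 60-79S), columns = seasons
counts = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def adjusted_residuals(table):
    """(f_o - f_e) / sqrt(f_e * (1 - row proportion) * (1 - column proportion)) per cell."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, f_o in enumerate(row):
            f_e = row_tot[i] * col_tot[j] / n                       # expected count
            denom = sqrt(f_e * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            out_row.append((f_o - f_e) / denom)
        result.append(out_row)
    return result


adj = adjusted_residuals(counts)
for row in adj:
    print("  ".join(f"{value:6.2f}" for value in row))
```

Cells whose printed value exceeds 3 in absolute value are the ones contributing most strongly to the evidence against independence.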
