Addressing Missing Race Data in Pre-Invasive Cervical Cancer Study
This study examines missing race data in pre-invasive cervical cancer cases reported by three state registries and the impact of that missingness on analysis. It introduces multiple imputation as a way to handle the missing data and reviews missing-data mechanisms and methods for treating missing values.
Presentation Transcript
Multiple Imputation and Missing Race in the Pre-Invasive Cervical Cancer Study among Three States
2010 NAACCR Conference, Quebec City, June 22, 2010
Bin Huang, Kentucky Cancer Registry, University of Kentucky
The Pre-invasive Cervical Cancer Study
  HPV vaccine
    Quadrivalent vaccine licensed for females in June 2006
    ACS developed the guideline for HPV vaccine use, June 2007
    Anticipated reductions in cervical cancers and other anogenital cancers
  Need for surveillance systems
    Collection of population data for pre-invasive cervical cancer cases
    Monitoring effectiveness and efficacy
  CDC-funded study
    Includes three cancer registries: Michigan, Kentucky, Louisiana
    Pre-pilot period: Sept-Dec 2008
    Data collection: Jan 2009-Dec 2009
Missing Data in the Study
  Race: 30% missing
  Overall cases with complete data: 68.7%
  Missing data can cause bias or lead to inefficient analyses.
Missing Data Mechanism
  Missing completely at random (MCAR): the missingness is independent of both the missing and the observed responses.
  Missing at random (MAR): given the observed values, the missingness is independent of the missing responses.
  Not missing at random (NMAR): the missingness depends on the missing responses even after conditioning on the observed values.
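The three mechanisms can be written compactly in the usual notation. The symbols are not on the slide and are introduced here as assumptions: R is the missingness indicator, and Y_obs and Y_mis are the observed and missing parts of the data.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{NMAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}})
\end{align*}
```

In this study R would indicate whether race is recorded, and MAR would mean that, after accounting for variables such as registry, age, and county composition, missingness does not depend on the unrecorded race itself.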
Methods to Treat Missing Data
  Available case methods
    Complete case method (listwise deletion)
    Pairwise deletion
  Single imputation methods
    Mean substitution
    Hot deck imputation
    Regression substitution
  Modern approaches
    Maximum likelihood (ML) method
    Bayesian method
    Multiple imputation (MI)
Multiple Imputation (MI)
  MI is a three-step approach to estimation for incomplete data, first proposed by Rubin in 1977.
  MI assumes the missing data are MAR.
  1. Imputation: the missing data are filled in m times to generate m complete data sets. The imputation model preserves the distributional relationship between the missing values and the observed values.
  2. Analysis: each of the m complete data sets is analyzed separately using standard statistical analyses.
  3. Combination: the results from the m complete data sets are combined to produce inferential results.
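The combination step is usually carried out with Rubin's rules: average the m point estimates, then add the average within-imputation variance to an inflated between-imputation variance. The sketch below is a minimal illustration of that step only. The function name `pool_rubin` and the input numbers are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical log odds ratios and standard errors from m = 5 imputed data sets.
log_ors = [0.74, 0.80, 0.77, 0.79, 0.75]
ses = [0.36, 0.37, 0.36, 0.38, 0.37]
pooled, pooled_se = pool_rubin(log_ors, [s ** 2 for s in ses])
print(f"pooled log OR = {pooled:.3f}, SE = {pooled_se:.3f}")
```

The pooled standard error is never smaller than the average completed-data standard error, because the between-imputation term carries the extra uncertainty due to the missing values.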
Software Available
  SAS: PROC MI; PROC MIANALYZE
    MCMC option: assumes multivariate normality
  SOLAS (Statistical Solutions Inc)
    Same assumption as SAS PROC MI
  S-Plus: NORM
  IVEware: SAS callable
    PROC IMPUTE; PROC DESCRIBE; PROC REGRESS
    Does not assume multivariate normality
Aim of the Study
  To impute the missing race with MI
  To examine the difference in estimates between the complete case method and the MI method
    Percentage of race
    The correlation between having AIS and race
Data
  Pre-invasive cervical cancer cases
  Three states: Kentucky, Louisiana and Michigan
  Total: 3843
    Kentucky: 953 (24.8%), Louisiana: 653 (17.0%), Michigan: 2237 (58.2%)
  Variables (17)
    Demographics: race, address, age, ethnicity
    Data sources: reporting facility, facility type, time at diagnosis
    Disease data: site, histology code, histology terminology code, sequence code
  Added variables (2), from the 2000 US Census
    % of Whites at county level
    % of Blacks at county level
Data Collection Process
  Data entry methods (listed for Kentucky, Michigan, Louisiana)
    Web-based entry; AIM reprogrammed; modifications to already existing methods; AIM reprogrammed; new web-based entry form; hard copy
  Facilities reporting
    Kentucky: 38 hospital-based path labs; 10 independent free-standing path labs
    Michigan: 74 hospital-based path labs; 2 independent labs; 1 out-of-state cancer registry
    Louisiana: 47 out of 104 hospitals; 0 out of 3 medical oncology centers; 11 out of 13 pathology laboratories; 5 out of 16 surgery centers; 6 out of 10 physician offices that use E-Path
Descriptive Analysis
Demographics for Cases in the Cervical Cancer Study

  Variable               N      %
  State
    KY                   953    24.8
    LA                   653    17.0
    MI                   2237   58.2
  Age at Diagnosis
    15-20                297    7.7
    21-25                933    24.3
    26-35                1474   38.4
    36-50                884    23.0
    50+                  247    6.4
    Missing              8      0.2
    Average              31.9
  Ethnicity (NHIA)
    Non-Hispanics        3696   96.2
    Hispanics            147    3.8
  County Code
    Known                3338   86.9
    Missing              505    13.1
  Race
    White                2154   56.1
    Black                491    12.8
    Other                39     1.0
    Missing              1159   30.2
  State at Diagnosis
    KY                   953    24.8
    LA                   651    16.9
    MI                   1718   44.7
    Other                4      0.1
    Missing              517    13.5
Descriptive Analysis (cont.)
Characteristics of Cases in the Cervical Cancer Study

  Variable               N      %
  Site
    C530                 344    9.0
    C531                 86     2.2
    C538                 136    3.5
    C539                 3277   85.3
  Report Source
    Hospital             984    25.6
    Laboratory           2731   71.1
    Physician            105    2.7
    Other                23     0.6
  Histology
    Carcinoma            137    3.6
    Squamous             3568   92.8
    Adenoma In Situ      138    3.6
  Histology Terminology
    AIS                  136    3.5
    CIN III              2854   74.3
    CIS                  382    9.9
    Severe Dysplasia     471    12.3
Comparison Among the Three States

  Characteristic         Kentucky       Louisiana      Michigan       P-Value
                         N      %       N      %       N      %
  Race                                                                <0.0001
    White                701    73.6    427    65.4    1026   45.9
    Black                40     4.2     189    28.9    262    11.7
    Other                2      0.2     12     1.8     25     1.1
    Missing              210    22.0    25     3.8     924    41.3
  Histology Terminology                                               <0.0001
    AIS                  28     2.9     17     2.6     91     4.1
    CIN III              513    53.8    437    66.9    1904   85.1
    CIS                  128    13.4    113    17.3    141    6.3
    Severe Dysplasia     284    29.8    86     13.2    101    4.5
  Report Source                                                       <0.0001
    Hospital             0      0       305    46.7    679    30.4
    Laboratory           952    99.9    325    49.8    1454   65.0
    Physician            0      0       3      0.5     102    4.6
    Other                1      0.1     20     3.1     2      0.1
Missing Cases: Race, State at Diagnosis, County at Diagnosis
  [Bar chart: percentage of missing values (y-axis, 0 to 45) for race, state at diagnosis, and county at diagnosis, by registry (KY, LA, MI).]
Comparison Between Known and Unknown Races

  Variable               Known Race     Missing Race   P-Value
                         N      %       N      %
  Age at Diagnosis                                     0.0459
    15-20                205    7.7     92     8
    21-25                658    24.5    275    23.8
    26-35                995    37.1    479    41.5
    36-50                637    23.8    247    21.4
    50+                  186    6.9     61     5.3
    Average              32.2           31.2
  Ethnicity (NHIA)                                     0.0765
    Non-Hispanics        2591   96.5    1105   95.3
    Hispanics            93     3.5     54     4.7
  Histology Terminology                                0.0033
    AIS                  101    3.8     35     3
    CIN III              1960   73      894    77.1
    CIS                  296    11      86     7.4
    Severe Dysplasia     327    12.2    144    12.4
  Report Source                                        <0.0001
    Hospital             855    31.9    129    11.1
    Laboratory           1776   66.2    955    82.4
    Physician            30     1.1     75     6.5
    Other                23     0.9     0      0
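The slides do not say which test produced these P-values. As a rough check, the sketch below applies a Pearson chi-square test without continuity correction (an assumption) to the ethnicity counts from this slide. The helper name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail equals P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: known race, missing race. Columns: non-Hispanic, Hispanic.
ethnicity = [(2591, 93), (1105, 54)]
stat, p_value = chi_square_2x2(ethnicity)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Under that assumption the result comes out close to the 0.0765 reported for ethnicity, which suggests, but does not confirm, that the slide used the same test.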
MI Methods
  IVEware and SAS PROC MI
  Both methods were used; only the results from IVEware are presented.
  IVEware: http://www.isr.umich.edu/src/smp/ive/
Missing Pattern, All States
Missing Pattern for Three State Data (O = observed, X = missing)

  State at   Race   County   Age at     Year at    Month at   N      Percent
  Diagnosis                  Diagnosis  Diagnosis  Diagnosis
  O          O      O        O          O          O          2639   68.7
  O          O      O        O          O          X          2      0.1
  O          O      O        X          X          O          1      0.0
  O          O      O        X          O          X          2      0.1
  O          O      X        O          O          O          5      0.1
  O          X      O        O          O          O          677    17.6
  O          X      X        O          O          O          1      0.0
  X          O      O        O          O          O          10     0.3
  X          O      X        O          O          O          26     0.7
  X          X      O        O          O          O          8      0.2
  X          X      X        O          O          O          468    12.2
  X          X      X        X          O          O          5      0.1
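A pattern table like this one can be produced by reducing each record to a string of observed/missing flags and counting the distinct strings. The sketch below uses hypothetical records and field names, not the study data.

```python
from collections import Counter

def missing_patterns(records, fields):
    """Count rows by pattern: 'O' = observed, 'X' = missing (None)."""
    counts = Counter(
        tuple("X" if rec.get(f) is None else "O" for f in fields)
        for rec in records
    )
    total = sum(counts.values())
    return [(pattern, n, 100 * n / total) for pattern, n in counts.most_common()]

# Hypothetical records; None marks a missing value.
fields = ["state", "race", "county", "age"]
records = [
    {"state": "KY", "race": "White", "county": "067", "age": 31},
    {"state": "MI", "race": None, "county": "163", "age": 27},
    {"state": None, "race": None, "county": None, "age": 45},
    {"state": "LA", "race": "Black", "county": "071", "age": 29},
]
for pattern, n, pct in missing_patterns(records, fields):
    print(" ".join(pattern), n, f"{pct:.1f}")
```

Summing N over the rows with an X in a given column gives that variable's total missing count, which is a quick way to cross-check a pattern table against the descriptive tables.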
Associations
  Multivariate logistic regression showed that race is significantly associated with ethnicity, histology terminology type, age, and state.
  Most notably, the percentage of each race at the county level is the most dominant variable predicting race.
Imputation Model
  Variables include race, registry, age, ethnicity, facility type, site, histology terminology code, sequence code, and percentages of races at county level.
  10 imputation sets
Frequency of Race

  Race / Method      All                   Kentucky              Louisiana             Michigan
                     N     %     S.E.      N     %     S.E.      N     %     S.E.      N     %     S.E.
  White
    Complete Case    2154  80.3  0.0086    701   94.4  0.0087    427   68.0  0.0226    1026  78.1  0.0129
    MI Method        3141  81.7  0.0065    894   93.8  0.0086    444   68.0  0.0183    1803  80.6  0.0089
  Black
    Complete Case    491   18.3  0.0175    40    5.4   0.0357    189   30.1  0.0334    262   20.0  0.0247
    MI Method        650   16.9  0.0065    56    5.8   0.0082    197   30.1  0.0180    398   17.8  0.0087
  Other
    Complete Case    39    1.5   0.0195    2     0.3   0.0387    12    1.9   0.0394    25    1.9   0.0273
    MI Method        52    1.4   0.0023    4     0.4   0.0024    12    1.9   0.0054    36    1.6   0.0036
Logistic Regression Analysis with AIS Status as the Dependent Variable

  Effect                            Complete Case               MI
                                    O.R.    95% C.I.            O.R.    95% C.I.
  Registry (Baseline = Michigan)
    Kentucky                        0.524   0.313 - 0.879       0.636   0.398 - 1.015
    Louisiana                       0.615   0.352 - 1.075       0.652   0.370 - 1.148
  Race (Baseline = Black)
    White                           3.71    1.594 - 8.645       2.16    1.048 - 4.466
    Other                           3.86    0.744 - 19.993      2.42    0.440 - 13.332
  Age                               1.03    1.012 - 1.045       1.02    1.007 - 1.038
  Sequence (1st vs. 2nd)            0.07    0.033 - 0.148       0.04    0.022 - 0.086
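Odds-ratio intervals of this kind are normally symmetric on the log scale, so the reported odds ratio should sit at the geometric midpoint of its interval, and the interval width implies a log-scale standard error. The sketch below assumes such log-symmetric intervals and a normal critical value of 1.96. Both are assumptions, and MI intervals in particular are often based on a t reference distribution, so the standard error shown is approximate. The helper name `log_scale_summary` is hypothetical.

```python
import math

def log_scale_summary(lower, upper, z=1.96):
    """Recover the implied odds ratio and log-scale SE from a 95% CI."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    implied_or = math.exp((log_lower + log_upper) / 2)  # geometric midpoint
    se_log_or = (log_upper - log_lower) / (2 * z)
    return implied_or, se_log_or

# White vs. Black intervals as reported on the slide.
for label, lower, upper in [("Complete case", 1.594, 8.645), ("MI", 1.048, 4.466)]:
    implied_or, se = log_scale_summary(lower, upper)
    print(f"{label}: OR = {implied_or:.2f}, SE(log OR) = {se:.3f}")
```

Both implied odds ratios agree with the reported 3.71 and 2.16, and the MI interval is narrower on the log scale, consistent with MI using all 3843 cases rather than only the complete ones.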
Summary
  The high percentage of cases with missing race likely introduced bias into the estimated race proportions, mainly in the data from Michigan.
  The results show that whites have much higher odds of AIS than blacks (O.R. 3.71 with complete cases, 2.16 with MI).
  Quantitative differences in estimates between the two methods were found in the logistic model.
  MI is relatively easy to implement and is appropriate for a wide range of datasets.
Acknowledgements
  CDC: Deblina Datta and staff
  Kentucky Cancer Registry: Thomas Tucker, Mary Jane Byrne, Brent Shelton
  Michigan Cancer Registry: Glenn Copland, Won Silva and staff
  Louisiana Cancer Registry: Vivien Chen and staff
  Macro International: Benita O Colma
Words to Share
  John Wooden: "Be quick, but don't hurry." "If you don't have time to do it right, how will you find time to do it again?"
Questions?
  Bin Huang, bhuang@kcr.uky.edu, 859-219-0773 x 280
  Thank You! Merci!