Predictor Selection in Biostatistics: Goals and Methods

Explore the process of predictor selection in biostatistics for different inferential goals such as prediction, causal effect estimation, and risk factor identification. Learn about the methods developed for each goal and the importance of model validation based on prediction error assessment.

Presentation Transcript


  1. Biostatistics 208, Lecture 7: Predictor Selection. Aaron Wolfe Scheffler, Division of Biostatistics, UCSF

  2. Course schedule (all lectures by Scheffler; readings from VGSM):
  - Lecture 1, Overview & Intro. to Linear Regression: overview of regression; linear regression intro. Reading: Chap. 2, Section 3.3
  - Lecture 2, Linear Regression for Multiple Predictors: linear regression with categorical and/or continuous predictors. Reading: Sections 4.1-4.2
  - Lecture 3, Categorical Predictors: F tests, tests for trend, introduction to confounding adjustment in multi-predictor models. Reading: Sections 4.3-4.4
  - Lecture 4, Confounding and Mediation: confounding, mediation and causal inference. Reading: Sections 4.4-4.5
  - Lecture 5, Interaction: conceptual introduction, product terms, testing for interaction, lincom statements. Reading: Section 4.6
  - Lecture 6, Model Diagnostics: linearity, normality, constant variance, influential points. Reading: Section 4.7
  - Lecture 7, Predictor Selection: causal diagrams, three inferential goals; number of predictors, collinearity. Reading: Chapter 10
  - Lecture 8, Binary Outcome Data: contingency tables; binary outcomes; measures of association; logistic model. Reading: Sections 3.4 and 5.1
  - Lecture 9, Multiple Logistic Regression: multiple logistic regression; confounding; interaction, causal inference. Reading: Sections 5.2.1-5.2.4
  - Lecture 10, Prediction and Model Diagnostics: prediction; model assessment; outliers; goodness of fit. Reading: Sections 5.2.5-5.2.6 and 5.4
  - Lecture 11, Case Control Studies, Alternate Models: case-control studies; conditional logistic regression; alternate binary regression models. Reading: Sections 5.3 and 5.5

  3. Predictor Selection (Chapter 10 of text)
  - Given a (potentially large) number of available predictors, which ones should be included in a regression model?
  - The decision depends on the inferential goal:
    1. predict future outcomes
    2. estimate the causal effect of a primary predictor
    3. identify important risk factors for an outcome
  - Methods for all three goals are under continuing development, especially with recent advances in machine learning
  - Biostatistics 210 & 215 offer more detailed coverage

  4. Goal 1: Prediction

  5. Goal 1: Prediction
  - Includes diagnosis and prognostic risk stratification (e.g., risk of fracture, risk of hospital readmission)
  - Often used in making decisions at the level of the individual
  - Causal relationships are useful, but not the primary focus
  - Multiple predictors are often needed; a single predictor is usually not enough
    - e.g., the odds ratio for a binary diagnostic variable with sensitivity and specificity of 90% is 81, yet such a test still misclassifies 10% of cases and of controls (see the worked calculation below)
  - A large set of candidate predictors is considered and their influence is modeled in a data-driven manner
  - Need a method to select which models and which variables produce the best prediction
  - Solution: model validation (evaluating a model's predictive performance) based on assessment of prediction error (PE)
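
As a worked check of the odds ratio quoted on this slide, assuming the odds ratio compares the odds of a positive test between cases and non-cases with sensitivity = specificity = 0.90:

```latex
% Odds of a positive test among cases = sens/(1-sens);
% among non-cases = (1-spec)/spec.
\mathrm{OR}
  = \frac{\mathrm{sens}/(1-\mathrm{sens})}{(1-\mathrm{spec})/\mathrm{spec}}
  = \frac{0.9/0.1}{0.1/0.9}
  = 9 \times 9 = 81
```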

  6. Prediction Error
  - Prediction error measures how well a model predicts a new outcome, i.e., for new observations of the outcome and predictors not used in fitting the model (out of sample)
  - Goodness of fit, by contrast, is typically defined for the training/learning data
  - Types of prediction error:
    1. overall performance: distance between observed and predicted outcomes (possibly on a transformed scale)
    2. discrimination: how well does the model distinguish outcomes between individuals (e.g., cases from controls, high-risk patients from low, early from late events)? Generally not discussed for continuous outcomes
    3. calibration: how accurately does the model estimate average outcomes or failure rates?
  - Both discrimination and calibration are important
  - Caution: terms and definitions for prediction error, discrimination, and calibration depend on the outcome variable type (e.g., continuous or binary); discussed further in Lecture 10
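
A minimal sketch (not from the lecture) of one metric of each type for a binary outcome, using simulated held-out data and the scikit-learn and statsmodels libraries; the data, predicted probabilities, and variable names are all hypothetical:

```python
# One example metric for each type of prediction error on hypothetical held-out data.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.binomial(1, 0.3, size=200)                       # hypothetical held-out outcomes
p_hat = np.clip(0.3 + 0.2 * (y_test - 0.3) + rng.normal(0, 0.1, 200), 0.01, 0.99)

# 1. Overall performance: Brier score (mean squared distance between
#    observed outcomes and predicted probabilities).
print("Brier score:", brier_score_loss(y_test, p_hat))

# 2. Discrimination: C-statistic / area under the ROC curve.
print("C-statistic:", roc_auc_score(y_test, p_hat))

# 3. Calibration: slope of observed outcomes on the logit of the predictions
#    (a slope near 1 suggests predictions are neither too extreme nor too flat).
logit_p = np.log(p_hat / (1 - p_hat))
cal_fit = sm.Logit(y_test, sm.add_constant(logit_p)).fit(disp=0)
print("Calibration slope:", cal_fit.params[1])
```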

  7. Prediction: how to understand model performance
  - Bias-variance tradeoff and over-fitting
    - Predicted values from regression are biased when important predictors are omitted from the model, and unstable (higher variance) when unimportant predictors are included
  - Excess predictors yield over-fitted estimates that reflect minor features of the sample at hand
    - over-fit models may yield poor prediction performance
  (Figure: three candidate fits labeled "less variability/more bias", "more variability/less bias", and "better compromise?")
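
A small simulation sketch, not from the slides, illustrating the tradeoff: polynomial fits that are too simple, about right, and too flexible are compared on the data used to fit them and on new data drawn from the same (assumed quadratic) truth:

```python
# Under-, well-, and over-specified polynomial fits compared on training and new data.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=50):
    x = rng.uniform(-2, 2, n)
    y = 1 + x - 0.5 * x**2 + rng.normal(0, 0.5, n)   # true model is quadratic
    return x, y

x_train, y_train = simulate()
x_test, y_test = simulate()

for degree in (1, 2, 10):                             # too few, right, too many terms
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
```

The over-fitted degree-10 model typically has the smallest training error but a larger error on the new data than the correctly specified quadratic.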

  8. Example prediction tool development process
  (Diagram of the development workflow: decide how many predictors to include and their form, e.g., interactions, splines; prediction error is assessed when choosing a model and confirmed with the final model on a test set)

  9. Model selection strategies to avoid over-fitting
  1. Pre-specify well-motivated predictors and how to model them
  2. Eliminate predictors without using the outcome
  3. Do not use training data to perform model selection
  4. Select the model to minimize an optimism-corrected PE measure
    - PE calculated on the training data is biased (overly optimistic) and should not be used to perform model selection
    - use methods such as the bootstrap or cross-validation, rather than the learning/training data, to inform model choices
  5. Shrink coefficient estimates for poorly performing predictors (e.g., LASSO regression methods; see the sketch below)
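
A minimal sketch of strategy 5 (shrinkage), assuming a continuous outcome and the scikit-learn library; the predictors X and outcome y are simulated and purely illustrative:

```python
# LASSO with cross-validated penalty selection: coefficients of poorly
# performing predictors are shrunk toward (often exactly) zero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))                      # 30 candidate predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)   # only the first two matter

model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
kept = np.flatnonzero(lasso.coef_)
print("chosen penalty:", lasso.alpha_)
print("predictors with nonzero coefficients:", kept)
```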

  10. Optimism-corrected estimates of PE
  - Naïve measures penalized for the number of predictors: adjusted R², AIC, BIC (obtained via the estat ic postestimation command in Stata)
    - these retain the disadvantage that they are based on the estimation sample
  - Use different data to estimate tuning parameters, make modeling decisions, and evaluate PE, i.e., measure out-of-sample performance:
    - construct additional validation samples (as in the SOF example)
    - k-fold cross-validation
    - bootstrap
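
For reference, the usual textbook forms of the penalized in-sample measures named above (these standard definitions are not spelled out on the slide):

```latex
% \hat{L} = maximized likelihood, k = number of estimated parameters,
% p = number of predictors, n = sample size.
\mathrm{AIC} = -2\log\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\log\hat{L} + k\log n, \qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
```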

  11. Optimism-corrected PE: k-fold cross-validation
  - Divide the data into k = 5 to 10 subsets, then for each subset:
    - fit the model to the other subsets combined (k-1 subsets)
    - obtain predictions for the excluded subset
    - calculate PE from the predictions in the excluded subset; repeat
  - Calculate the optimism-corrected PE by averaging over the subset results
  - Do this for each candidate model; select the model with minimum cross-validated PE
  - Can be applied within a learning/training sample to aid in choosing the model with the best out-of-sample prediction performance
  (Diagram: five iterations, each using four folds for training and the remaining fold for validation)
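
A minimal sketch of the procedure just described, assuming a continuous outcome, squared-error PE, and the scikit-learn library; the two candidate predictor sets and the simulated data are hypothetical:

```python
# 5-fold cross-validated prediction error for two candidate models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n = 150
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

candidates = {"x1 only": [0], "x1-x5": [0, 1, 2, 3, 4]}   # candidate predictor sets

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for label, cols in candidates.items():
    fold_pe = []
    for train_idx, val_idx in kf.split(X):
        # fit to the other k-1 subsets combined, predict the excluded subset
        fit = LinearRegression().fit(X[train_idx][:, cols], y[train_idx])
        pred = fit.predict(X[val_idx][:, cols])
        fold_pe.append(np.mean((y[val_idx] - pred) ** 2))
    # average over subsets to get the cross-validated PE for this candidate model
    print(f"{label}: cross-validated MSE = {np.mean(fold_pe):.3f}")
```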

  12. Optimism-corrected PE: bootstrapping
  - In each of, say, 200 bootstrap samples:
    - fit the model to the bootstrap sample and obtain predictions
    - evaluate PE in both the bootstrap sample and the original sample
    - the average difference between the original-sample and bootstrap-sample PE estimates is the optimism
  - Fit the model to the original data, calculate the naïve PE, and penalize it by the average bootstrap optimism:
    - optimism-corrected PE = naïve PE + average over bootstrap samples of (PE in original sample - PE in bootstrap sample)
  - Do this for each candidate model; select the model with minimum optimism-corrected PE
    - this will be illustrated for logistic regression in Lecture 10
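
A hedged sketch of this bootstrap optimism correction, assuming a continuous outcome and squared-error PE; the data, model, and number of resamples are illustrative only:

```python
# Bootstrap optimism correction of a naive (apparent) prediction error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 120
X = rng.normal(size=(n, 8))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

def mse(model, X, y):
    return np.mean((y - model.predict(X)) ** 2)

# naive (apparent) PE: model fit and evaluated on the same data
full_fit = LinearRegression().fit(X, y)
pe_naive = mse(full_fit, X, y)

# bootstrap optimism: for each resample, compare PE of the bootstrap-fit model
# on the original data with its (too optimistic) PE on the bootstrap sample
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    boot_fit = LinearRegression().fit(X[idx], y[idx])
    optimism.append(mse(boot_fit, X, y) - mse(boot_fit, X[idx], y[idx]))

pe_corrected = pe_naive + np.mean(optimism)
print(f"naive PE {pe_naive:.3f}, optimism-corrected PE {pe_corrected:.3f}")
```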

  13. Recommendations for Goal 1
  - For clinical use, select easily available predictors
  - Pay attention to nonlinearity and interactions in fitting candidate models
  - To avoid over-fitting:
    - eliminate candidate predictors without using the outcome
    - select the model using cross-validation or the bootstrap
  - Consider the appropriate metric for PE (overall fit, discrimination, calibration)
  - Validate the model in an external test set (Altman & Royston. What do we mean by validating a prognostic model? Stat Med. 2000;19:453-73)
  - Consider applying modern supervised machine learning tools (e.g., lasso, random forests, super learner) in applications where the number of predictors and interpretability are not of primary concern
    - Reference: James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning (2nd edition; https://www.statlearning.com)
  - These topics are treated in more detail in Biostatistics 202, 210, 216, and DATASCI 225

  14. Goal 2: Assessing a predictor of primary interest

  15. Goal 2: Assessing a predictor of primary interest
  - The research question focuses on a single predictor
    - Example: Does maternal vitamin use reduce the risk of birth defects?
  - Ruling out confounding is key: we want unbiased estimates, obtained by including a minimally sufficient adjustment set
  - Minimizing PE is not critical, so over-fitting is not a central issue, assuming we are deliberate about the predictors we include
  - Directed acyclic graphs (DAGs) are central to model selection for this goal
    - DAGs: Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37-48.

  16. Maternal vitamin use and birth defects
  - Question: what MSAS (minimum sufficient adjustment set of predictors) will allow us to estimate our causal effect of interest?
  - Assumptions about causal pathways: no other common causes of vitamin use and birth defects
  - In summary, no excluded confounders or causal links

  17. Open and blocked backdoor paths
  - If any of the backdoor paths between vitamin use and birth defects remains open, we would expect to find an association between vitamin use and birth defects even if there were no causal effect
    - this undermines our goal of assessing the effect of vitamin use on birth defects
  - A backdoor path is blocked provided we control for at least one non-collider on the path
  - A backdoor path including a collider is blocked provided we do not control for the collider
    - if we do control for the collider, we need to control for a non-collider on the backdoor path to block it

  18. What do we need to control for to block all backdoor paths?
  - Controlling for pre-natal care blocks the first three backdoor paths, but it is also a collider
  - Controlling for it opens the fourth backdoor path, but controlling for any non-collider on that path will block it
  - Controlling for SES or difficulty conceiving would be cheaper and easier than maternal genetics (a study design consideration)

  19. Determination of the MSAS using dagitty.net

  20. Remaining issues to consider in confounding control
  1. Unmeasured confounders
    - DAGs can help determine vulnerability to unmeasured confounders, provided we have sufficient information about their effects
    - in some cases, assessment of the expected direction and magnitude of the resulting bias is possible via simulation or analytic approaches
  2. Other plausible causal pathways (edges) to consider
    - of course, we may also have excluded important causal pathways!

  21. Remaining issues to consider in confounding control
  - MSAS options resulting from considering other plausible causal connections

  22. Insights from DAGs
  - Exclude from the model:
    - mediators (unless estimating a direct effect)
    - common effects of the exposure and outcome (colliders)
    - redundant confounders
  - Adjusting for a confounder/collider (e.g., PNC) may require adjusting for additional factors
  - There is often more than one minimum sufficient adjustment set (MSAS)
  - Be careful about controlling for near-instrumental variables
    - Myers JA, Rassen JA, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol. 2011;174:1213-22.
    - Arah OA. Bias analysis for uncontrolled confounding in the health sciences. Annu Rev Public Health. 2017;38:23-38.

  23. Weakly supported (B-list) confounders
  - Frequently there are B-list measured variables with weakly supported confounding roles, compared to your A-list confounders
  - Including all of them might overburden your model or introduce collider bias
  - If the DAGs including and excluding the B-list confounders have a common MSAS, go with it
  - Otherwise:
    - use the most feasible MSAS based on the DAG including the B-list
    - sequentially drop B-list confounders if the adjusted coefficient estimate for the primary predictor changes by < 5% or 10% (see the sketch below)
  - If uncertainty about causal direction remains and adjustment affects the estimate for the primary predictor by > 5% or 10%, report and discuss the implications
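
A small sketch of the change-in-coefficient check for a B-list confounder, using simulated data, hypothetical variable names, and the statsmodels library; the 10% threshold is an example choice, not a rule from the slide:

```python
# Drop a B-list confounder and see how much the adjusted coefficient
# for the primary exposure changes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
a_conf = rng.normal(size=n)                     # A-list (well-supported) confounder
b_conf = rng.normal(size=n)                     # B-list (weakly supported) confounder
exposure = 0.5 * a_conf + 0.1 * b_conf + rng.normal(size=n)
outcome = 1.0 * exposure + 0.8 * a_conf + 0.05 * b_conf + rng.normal(size=n)

def exposure_coef(covariates):
    X = sm.add_constant(np.column_stack([exposure] + covariates))
    return sm.OLS(outcome, X).fit().params[1]   # coefficient on the exposure

beta_full = exposure_coef([a_conf, b_conf])     # adjusted for A-list + B-list
beta_drop = exposure_coef([a_conf])             # B-list confounder dropped
change = abs(beta_drop - beta_full) / abs(beta_full)
print(f"relative change in exposure coefficient: {change:.1%}")
# if the change is below the pre-specified threshold (e.g., 10%),
# the B-list confounder can reasonably be dropped
```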

  24. Limitations of DAGs for predictor selection
  - No good way to represent interactions, and no guidance about keeping or excluding them
    - interactions between the primary predictor and important adjustment variables should be checked (text, section 10.2.3)
  - No representation of the functional form (i.e., linearity) of our model
  - Co-linearity and allowable numbers of predictors are ignored; more later about these issues
  - Can be hard to specify convincingly without subject matter knowledge
  - But DAGs help us think through the consequences of including or excluding variables in the analysis set or model

  25. Recommendations for Goal 2
  - Use a DAG to identify MSASs; exclude mediators and common effects of the exposure and outcome
    - use dagitty.net for complicated DAGs
    - choose the most feasible MSAS and adjust for the identified confounders
    - use sensitivity analyses to deal with weakly supported potential confounders, i.e., the B-list
  - More later on the number of predictors, interactions with the main predictor, mediation, and high correlation among confounders
  - Alternative procedure when you can't draw a convincing DAG:
    - identify an A-list of confounders strongly supported by the literature and/or face validity
    - identify a B-list of plausible but unclearly supported potential confounders
    - exclude mediators and common effects of the exposure and outcome
    - use the change-in-coefficient criterion (or shrinkage) to exclude unimportant B-list confounders

  26. Recommendations for Goal 2 (continued)
  - Hypothesis testing is only of interest for the primary predictor, so this goal is somewhat robust against inflation of type-I error
    - type-I errors for adjustment variables are irrelevant because effect estimates for adjustment variables are not the main focus
    - over-fitting is primarily a problem for Goal 1, not Goal 2
  - We carefully construct our models based on prior knowledge, rather than signal in the data
  - Typically fewer variables are included than for Goals 1 and 3, which reduces over-fitting

  27. Goal 3: Risk factor identification

  28. Goal 3: evaluating multiple predictors as risk factors for an outcome
  - What are the risk factors for an outcome?
  - The most difficult of the three inferential goals; a fishing expedition
  - Instead of one predictor of primary interest, several variables may be targeted
  - This introduces the possibility of false positives; need to control for multiplicity (covered in other courses)
  - Examples:
    - disease with an unknown cause
    - identifying genetic risk factors

  29. Goal 3: potential problems
  - Risk factors are identified without (semi-)complete knowledge of the causal model
  - Many possible mediating and interaction relationships
  - False positive findings, particularly for interactions
  - No single model will summarize the causal relationships
    - addressing potential confounding for all included factors is difficult
    - mediation is problematic if the causal model is misspecified (e.g., in assessing the effects of SES and difficulty conceiving in the example below)
  (Figure: mediation by PNC)

  30. Recommendations for Goal 3
  - Ruling out confounding is still central
  - Best (but labor-intensive) solution: treat each predictor as primary in turn and use the DAG-based methods for Goal 2 (not the position taken in Section 10.3.2 of the book)
  - An alternative approach: fit a single big model
    - include potential confounders needed for face validity, regardless of statistical criteria
    - retain potential risk factors that meet a liberal statistical inclusion criterion (p < 0.2)
    - note: the change-in-coefficient criterion is not applicable here (many coefficients are involved)
  - Cautiously interpret weaker, less plausible findings
  - Multiple models may be required to deal with mediation
  - References:
    - Westreich D, Greenland S. The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol. 2013;177:292-8.
    - Greenland S, et al. Outcome modelling strategies in epidemiology: traditional methods and basic alternatives. Int J Epidemiol. 2016. PMID: 27097747

  31. Additional topics in predictor selection
  - Number of predictors: how many can you reasonably include?
  - Co-linearity
  - Standard algorithms for predictor selection in regression models (if time)

  32. Number of predictors to include in a model?
  - Too many predictors can:
    - degrade precision in primary effect estimates
    - in smaller datasets, swamp a real association
    - induce bias in estimates (e.g., by conditioning on a collider)

  33. Recommendations: number of predictors for regression models
  - Use 10-15 observations per predictor (10 events per predictor for binary/survival outcomes) as a cautionary flag
    - or fewer? Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol. 2007;165(6):710-718.
  - If close to 10, check for:
    - high correlations between predictors
    - inflated SEs when a new covariate is added
    - inconsistency between t/Wald and likelihood ratio tests (logistic and Cox models)
    - gross inconsistency with smaller models
  - If trouble is apparent:
    - use a more stringent inclusion criterion
    - omit variables included only for face validity
  - With a binary primary predictor, many potential confounders, and a rare outcome, consider propensity scores
    - note: propensity scores don't solve the problem with a rare predictor, or control for unmeasured confounders

  34. Co-linearity between predictors
  - Often not a big deal, as you will see
  - Two predictors are co-linear if their correlation is sufficient to substantially degrade the precision of the associated inferences about coefficients
  - Can make individual coefficient estimates very imprecise, even when the F test for the combined effects is statistically significant
  - Co-linear predictors give similar information about the outcome
    - p-values don't help distinguish between them
  - Variance inflation factor (VIF): the variability (SE) of beta_j, the coefficient for the jth predictor, increases with the correlation r_j between x_j and the other predictors in the model (see Lecture 2 & sect. 4.2.2.2)
    - VIF_j = 1 / (1 - r_j^2), where r_j^2 is the R^2 from regressing x_j on the other predictors
    - describes the impact of co-linearity on the estimated precision of the coefficients; VIF_j enters the formula for the variance of beta_j as a multiplier
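
A minimal sketch of the VIF calculation using simulated predictors and the statsmodels library; the variables and the strength of their correlation are hypothetical:

```python
# VIF_j = 1 / (1 - r_j^2), computed here with statsmodels' variance_inflation_factor.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)         # strongly co-linear with x1
x3 = rng.normal(size=n)                          # roughly independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    # each VIF shows how much the variance of that coefficient is inflated
    # relative to a model in which the predictors were uncorrelated
    print(name, "VIF =", round(variance_inflation_factor(X, j), 2))
```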

  35. Recommendations: dealing with co-linearity
  - Goal 1: use cross-validation to select one of the co-linear predictors
    - typically handled automatically by machine learning methods, so no direct intervention is needed
  - Goal 2: strong co-linearity between the predictor of primary interest and a confounder in every MSAS is a problem
  - Goal 3: see if the final model avoids the problem, e.g., one predictor clearly dominates, or both are independently predictive; or, if it makes sense, create summary variables or select among them
  - Co-linearity between adjustment variables (e.g., between confounders in Goal 2, or between variables included in polynomial or spline terms): not a problem

  36. Model selection algorithms for regression
  - Procedures for selecting predictors based on statistical quantities (e.g., p-values), some implemented in Stata:
    - univariate screening; the change-in-estimate criterion; backwards, forwards, and stepwise selection; all subsets
  - Epidemiologists and statisticians are often uncomfortable with these methods: they are blind to causal considerations
  - For Goal 1, screening with cross-validation and/or many modern machine learning approaches works better than these regression-based alternatives
  - For Goals 2 and 3, DAG-based procedures are preferable since they at least acknowledge causality
  - But we discuss these techniques since you will encounter them in your training and research (a backwards-elimination sketch follows)
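
A small sketch of p-value-based backwards elimination, one of the automated procedures listed above, using simulated data, hypothetical variable names, and the statsmodels library; as the slide cautions, the procedure is blind to causal considerations:

```python
# Backwards elimination: repeatedly drop the least significant predictor
# until all remaining p-values fall below the retention criterion.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{j}" for j in range(1, 7)])
y = 1.5 * X["x1"] - 1.0 * X["x2"] + rng.normal(size=n)   # only x1 and x2 matter

keep = list(X.columns)
while len(keep) > 1:
    fit = sm.OLS(y, sm.add_constant(X[keep])).fit()
    pvals = fit.pvalues.drop("const")
    if pvals.max() < 0.10:                     # retention criterion (arbitrary here)
        break
    keep.remove(pvals.idxmax())                # drop the least significant predictor

print("retained predictors:", keep)
```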

  37. Conducting sensitivity analyses
  - Multiple plausible DAGs: show whether substantive results differ by MSAS
  - For selection algorithms, check sensitivity to:
    - the model selection procedure (forwards, backwards, stepwise)
    - the number of predictors retained / the retention criterion
    - a priori variable choices (e.g., between co-linear variables)

  38. Summary recommendations
  - Goal 1, prediction: select the model using an aggressive search plus cross-validation/bootstrapping; validate in an external sample
  - Goal 2, predictor of primary interest: adjust for an MSAS based on a strongly motivated DAG; use the change-in-estimate criterion or sensitivity analysis to deal with weakly motivated predictors
  - Goal 3, identifying multiple independent predictors: treat each predictor as primary in turn, using the methods for Goal 2; or use an inclusive model (all variables in) and interpret weak findings cautiously (especially for causal exploration)
  - Number of predictors: relax the 10-OPV/EPV rule of thumb if necessary to rule out confounding, but check for bad behavior
  - Co-linearity:
    - between control variables, not a problem unless it occurs between a predictor of primary interest and a non-ignorable confounder
    - in prediction models, can be handled using methods like cross-validation
