Machine Learning for Improved Risk Stratification in Health Care

Marvin Ploetz

Philip Docena

Ojaswi Pandey

Aakash Mohpal

23 April, 2019

Objectives

•

Propose an alternative - machine learning based - approach

to patient risk stratification for ECM in Estonia

•

Illustrate the use and applicability of machine learning to

other areas of work relevant to EHIF

Overview

•

Big Data and Machine Learning in Health Care

•

Machine Learning Basics

•

Context of ECM

•

Research Question

•

Data Overview & Sample Construction

•

Feature Engineering

•

Evaluation & Modelling Choices

•

Results

•

Conclusions

Big Data and Machine

Learning in Health Care

•

Take advantage of

massive amounts of data

and provide the right

intervention to the right

patient at the right time

•

Personalized care to the

patient

•

Potentially benefit all

agents in the health care

system: patient,

provider, payer,

management

Right patient

Right intervention

Right time

Personalized medicine

Patients

Providers

Payers

Benefits

•

Osteoarthritis

: a common and painful chronic condition

•

Often requires replacement of hip and knees

•

More than 500,000 Medicare beneficiaries receive replacements each year

•

Medical costs

: roughly $15,000 per surgery

•

Medical benefits: accrue over time, since the first months after surgery are painful and spent in disability

•

Therefore, a joint replacement only makes sense if the patient lives long enough to enjoy it. If the patient dies soon after, the surgery could be futile and painful

•

Prediction/classification problem

: Can we predict which surgeries will be futile using only

data available at the time of surgery?

20% of 7.4

million

beneficiaries

98,090 had hip or

knee replacement

in 2010

1.4% died within one month

of surgery

4.2% died within 1-12 months

65,395

observations

32,695

observations

3,305 independent

variables

Traditional Analysis: About Averages

Big Data and ML Analytics: Predict Individual Risks

Model to

predict

riskiest

patients

Train data

Test data

The first column sorts the test sample by risk percentiles. In the riskiest 1 percent of the population, the observed mortality rate within 1-12 months of surgery was 43.5%. Reallocating these surgeries to patients with a median risk level (50th percentile) would have averted 1,984 futile procedures and reallocated $30m to other beneficiaries.

•

Apply natural language processing

algorithms to extract data from EHRs

•

Extract 101.6m data points from 1.3m

EHRs of pediatric patients

•

High diagnostic accuracy among multiple

organ systems and comparable to

performance of experienced pediatric

physicians

•

Most common form of cancer afflicting

2.5 million patients worldwide in 2015

•

Need to distinguish malignant tumors

from benign ones

•

Early detection is key

•

Data: 62,219 mammography findings

from the Wisconsin State Cancer

Reporting System

•

A Neural Network based algorithm does

as well as radiologists in classifying the

tumors

Machine Learning Basics

•

Collection of large and complex data sets

which are difficult to process using common

database management tools or traditional

data processing applications

•

Not only about size: finding insights from

complex, noisy, heterogeneous, and

longitudinal data sets

•

This includes capturing, storing, searching,

sharing and analyzing

1)

Supervised

 – Making predictions using labeled/structured data

•

Classification

: use data to predict which category something falls into

•

Examples: If an image contains a store front or not; If a patient is high risk or not

•

Regression

: use data to make predictions on a continuous scale

•

Examples: Predict stock price of a company; given historical data, what will the

temperature be tomorrow

2)

Unsupervised

 – Detecting patterns from unstructured data

•

Problems where we have little or no idea what the results should look like

•

Provide algorithms with data and ask to look for hidden features and cluster the

data in a way it makes sense

•

Examples: identify patterns from genomics data, separating voice from noise in audio files

Data

Feature

engineering

/ Data

construction

Train data

80%

Test data

20%

Build

Machine

Learning

Model

Collect data

Validate model

results using test

data

Data

Data

Data

Model Results

Standardize and

clean data

Split data in

test/train

Build model using

train data

• Accuracy = (TP+TN)/All

• Precision = TP/(TP+FP)

• Recall = TP/(TP+FN)

Case I: High recall, low precision

• Accuracy = 180/230 = 78%

• Precision = 100/145 = 69%

• Recall = 100/105 = 95%

Case II: Low recall, high precision

• Accuracy = 190/230 = 83%

• Precision = 90/95 = 95%

• Recall = 90/125 = 72%
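The three definitions can be checked against the two example cases with a few lines of Python. This is a minimal sketch: the function name `classification_metrics` is ours, and the counts are read from the two confusion matrices in the slides (Case I: 100 TP, 45 FP, 5 FN, 80 TN; Case II: 90 TP, 5 FP, 35 FN, 100 TN).

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases classified correctly
        "precision": tp / (tp + fp),     # share of flagged cases that are truly positive
        "recall": tp / (tp + fn),        # share of truly positive cases that were flagged
    }

# Counts from the two example confusion matrices.
case_1 = classification_metrics(tp=100, fp=45, fn=5, tn=80)
case_2 = classification_metrics(tp=90, fp=5, fn=35, tn=100)

for name, metrics in (("Case I", case_1), ("Case II", case_2)):
    print(name, {key: f"{value:.0%}" for key, value in metrics.items()})
```

Both cases have 230 observations and similar accuracy, yet they trade precision against recall in opposite directions. That is why accuracy alone says little about which model is preferable.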

•

Plot the true and false positive

rate for every classification

threshold

•

A perfect model has a curve

that passes through the upper

left corner (AUC = 1)

•

The diagonal (red line)

represents random guessing

(AUC = 0.5)
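How the curve and its area arise can be sketched in plain Python. The labels and scores below are made up for illustration, and the helper names `roc_points` and `auc` are ours. The sketch sweeps the classification threshold from high to low, records one (false positive rate, true positive rate) point per threshold, and integrates with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; every score >= threshold is flagged positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data (hypothetical): three positives, three negatives.
y_true = [1, 1, 0, 1, 0, 0]
perfect = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]  # every positive outranks every negative
constant = [0.5] * 6                       # uninformative: same score for everyone

print(auc(roc_points(y_true, perfect)))    # close to 1: curve hugs the upper left corner
print(auc(roc_points(y_true, constant)))   # 0.5: the diagonal
```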

•

A non-parametric supervised

learning method used for

classification and regression

•

Built in the form of a tree structure

•

Breaks data down in smaller and

smaller subsets while incrementally

building tree

•

Final result is tree with decision

nodes and leaf nodes

Decision tree for the golf example:

Outlook
  Rainy -> No Golf
  Overcast -> Golf
  Sunny -> Windy
    False -> Play Golf
    True -> No Golf
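A fitted tree is just a set of nested rules, which the following sketch makes explicit. The ten rows are the golf table from the slides. The branch assignments (Rainy to "No", Overcast to "Yes", Sunny split on Windy) are our reading of the tree diagram, so treat them as an assumption.

```python
# Training rows from the golf table: (outlook, temperature, humidity, windy, play)
ROWS = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
]

def predict(outlook, windy):
    """Walk the two-level tree: split on Outlook, then on Windy for sunny days."""
    if outlook == "Rainy":
        return "No"                    # leaf node
    if outlook == "Overcast":
        return "Yes"                   # leaf node
    return "No" if windy else "Yes"    # Sunny branch: decision node on Windy

correct = sum(predict(outlook, windy) == play
              for outlook, _, _, windy, play in ROWS)
print(f"{correct}/{len(ROWS)} training rows classified correctly")
```

The single miss is the rainy, cool, non-windy day on which golf was played. A deeper tree could split the Rainy branch further, at the risk of overfitting such a small table.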

•

A collection of decision

trees whose results are

aggregated into one final

output

•

Use different sub-samples

of the data and different set

of features

•

Helps reduce overfitting, mainly by lowering

variance compared with a single tree

Context of ECM

A Big Challenge of the Estonian Healthcare System

•

Changes in the demand for health care due to population ageing and

rise of non-communicable diseases

•

Chronic conditions as the driving force behind needs for better care

integration

•

Low coverage of preventive services and considerable share of

avoidable specialist and hospital care

•

Opportunity to improve management of specific patient groups at the

PHC level ->

care management for empaneled patients

•

Prediction for which patients breaches in care coordination will occur

->

risk-stratification of patients

Risk

Stratification

Until Now

•

No actual prediction

analysis done

•

Involvement of providers

to gain

trust/understanding

•

Behavioral and social

criteria are key, but

sparsely available -> use

insider knowledge of

doctors

Review by GPs (Behavioral & social factors,

information not in data)

Dominant/complex

condition (cancer,

schizophrenia, rare

disease etc.)

DM/ Hypertension/ Hyperlipidemia

Min. and Max.

Number/Combination of:

CVD/ Respiratory/ Mental Health/

Functional Impairment

Not eligible

Not eligible

Not eligible

Not eligible

ECM Candidate

Yes

No

No

Yes

Yes

No

Yes

No

Enhanced Care Management So Far In Estonia

•

Successful enhanced care management

 pilot with 15 GPs and < 1,000

patients to assess the feasibility and acceptability of enhanced care

management

•

Commitment of the Estonian Health Insurance Fund (EHIF) to

scale-up

the

care management pilot

•

Model for risk stratification: clinical algorithm + provider intuition

•

Need for a

better risk-stratification approach!?

Research Question

The Prediction Problem

•

Target patients

- Who benefits from care management?

•

combination of disease, social and behavioral factors…

•

Objective of ECM -

Ultimately improve health outcomes for patients with

cardio-vascular, respiratory, and mental disease.

•

 What is the right proxy prediction variable in the data?

•

There is

not

one single relevant adverse event (e.g. death,

hospital admission, health complication, high healthcare spending)

•

Some discussions on how to choose the dependent variable…

-> Unplanned hospital admissions have a large negative impact on patient

lives, are costly and relatively frequent. Some are also avoidable…

Many Patients

Repeatedly Have

Hospitalizations

•

22 percent of patients

need to be hospitalized

again in the following

year…

Hospitalizations

account for a bulk

of healthcare costs

Predicting Hospital Admissions

•

Hospital admissions are the main (avoidable) adverse health event

•

But predicting hospitalizations is a hard problem

•

Social factors matter a lot, patients may have a lot or no contacts with the

healthcare systems at all…

•

Tradeoff to choose which hospitalizations we want to predict

•

Admissions due to specific conditions vs. hospitalizations in general

Predicting Hospital Admissions

Key question

Not

“What is the best algorithm for predicting hospital

admissions?”

But

“How can we obtain the most useful prediction of

hospital admissions for a specific purpose?”

Data Overview & Sample

Construction

Administrative Claims Data (in Estonia)

•

Very reliable

•

High-quality data availability

as of 2007/2008

•

Comprehensive coding requirements

for providers

•

Reporting lag of data

is on average 2 weeks

•

No info on clinical outcomes

(i.e. test results)

•

Limited information on

social conditions and behavioral

characteristics

•

Need for a lot of feature engineering to create “meaningful” variables

at the patient level

Description of Available Data

Patient Cohort

Selection for

the ML Analysis

Characteristics of Patients in the ML sample vs.

Total Population

•

Relative to the population, the ML sample is older and more likely to be female.

Characteristics of Patients in the ML sample vs.

Total Population

Most Common

Chronic

Conditions

•

The ML Sample population is

also more sick on average

(i.e. the prevalence of

chronic conditions is higher)

Characteristics of Patients in the ML sample vs.

Total Population

Feature Selection &

Engineering

Feature

Selection &

Engineering

•

Series of attempts with

interim features to extract

better performance…

•

Final set: 141 features

Features Used…


Getting to Know the Data: Diagnosis and Admissions

•

Afib (Atrial Fibrillation And Flutter), Chf (Congestive Heart Failure), Htn (Hypertension),

and Ischemic Htd (Ischemic Heart Disease) are strong indicators of potential admissions

in the following year (2017)

•

Patient groups with these conditions have a non-trivial (~10% likelihood) of hospital

admissions

•

This likelihood increases to ~20%-~30% with one 2016 hospital admission and to >50%

with 3 and more admissions in 2016

Single DGN

Pairs of DGNs

Evaluation & Modelling

Choices

ML Models Selected for Evaluation

•

Selection criteria:

•

Algorithms are

readily available, easy-to-use, comprehensive

and

well-tested open-source

libraries in Python (

scikit

•

Algorithms and results are

relatively easy to describe/explain

(common algorithms)

•

For interpretability and model familiarity, no attempt at exploring more complex models; no deep networks

•

Included in comparison:

•

Decision Tree

•

Random Forest and Extremely Randomized Trees (ExtraTrees)

•

k-Nearest Neighbors*

•

Gaussian Naïve-Bayes**

•

Logistic Regression (L1, L2)

•

SVM (RBF, polynomial)***

•

Multi-layer Perceptrons (1 hidden layer)

•

Adaboost (Decision Tree and Random Forest)

•

Gradient Boosted Trees (scikit GBT, not XGBoost)

•

Calibrated (isotonic) variations of above classifiers

•

Neural Networks

•

Eventually excluded: *kNN for execution time and memory requirements, **NB for weak performance, and ***SVMs for very slow

training (but considered for final paper)

Evaluation metrics

•

Variable to be predicted:

Yes/No

hospital admission in 2017

•

Use data from 2011-2016

•

We deal with an unbalanced sample (i.e. 7.5% of patients had an

admission in 2017)

•

Appropriate metrics of model performance in an unbalanced dataset:

•

Precision, Recall, ROC curve and area under the curve (AUC)

•

Problem-specific custom metric that penalizes one type of error more heavily: cost of a false positive (cost of ECM) vs. cost of a missed positive (cost of subsequent hospitalization)

•

Different ML models have different strengths, but differences should

not be huge

Intuitive Interpretation of Metrics

•

Precision

 is the probability that a patient classified as a patient with a

hospital admission by an algorithm is actually going to have a hospital

admission.

•

Recall

 is the probability that a patient who is going to have a hospital

admission is being classified as such by an algorithm.

•

Which one is more important?

•

It depends a lot on the application. There is a tradeoff between

maximizing either of them…

Future: Deriving a Custom Score with Cost Data

•

We can represent savings in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

$$
\begin{aligned}
\text{Savings} &= \text{cost under status quo} - \text{cost under model} \\
&= (TP + FN)\,c_{H} - \left[(TP + FP)\,c_{ECM} + FN\,c_{H} + TP\,(1 - e)\,c_{H}\right] \\
&= TP\,e\,c_{H} - (TP + FP)\,c_{ECM} \\
&= TP\,c_{ECM}\,(e\,k - 1) - FP\,c_{ECM}
\end{aligned}
$$

where

$c_{ECM}$: per patient annual average cost for ECM enrollment

$k$: the multiple of $c_{ECM}$ that gives $c_{H}$, i.e. $c_{H} = k\,c_{ECM}$

$c_{H}$: the average annual cost of hospital admission(s) per patient

$e$: the impact of ECM enrollment on hospital admissions (decrease in probability)

•

We can convert the previous calculation into a score with a maximum positive value of 1 by normalizing over the maximum achievable savings (all positive cases found, no false positives):

$$
\text{Savings coefficient} = \frac{TP\,(e\,k - 1) - FP}{(TP + FN)\,(e\,k - 1)}
$$
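The score can be written as a small function. This is a sketch under our own naming: `c_ecm` stands for the per-patient annual ECM cost, `k` for the ratio of hospital-admission cost to ECM cost, and `e` for the reduction in admission probability. The example counts are hypothetical, not results from the study.

```python
def savings(tp, fp, c_ecm, k, e):
    """Net savings: TP * c_ecm * (e*k - 1) - FP * c_ecm."""
    return tp * c_ecm * (e * k - 1) - fp * c_ecm

def savings_coefficient(tp, fp, fn, k, e):
    """Savings relative to a perfect model (all positives found, no false positives)."""
    gain = e * k - 1  # net gain per true positive, in units of c_ecm
    if gain <= 0:
        raise ValueError("ECM must save more per true positive than it costs (e*k > 1)")
    return (tp * gain - fp) / ((tp + fn) * gain)

# Hypothetical counts: 100 patients with an admission, 900 without.
perfect = savings_coefficient(tp=100, fp=0, fn=0, k=30, e=0.15)
partial = savings_coefficient(tp=60, fp=40, fn=40, k=30, e=0.15)
flag_all = savings_coefficient(tp=100, fp=900, fn=0, k=30, e=0.15)

print(round(perfect, 3), round(partial, 3), round(flag_all, 3))
```

The score is capped at 1 but has no lower bound. A model that flags everyone reaches full recall and still scores negative, because the cost of enrolling false positives outweighs the avoided admissions.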

Future: Custom Evaluation Score Based on Cost Data

•

A hypothetical exercise (not all the benefits of ECM are being captured)

•

Hypothetical cost/savings assumptions (based on historical data, references

from the literature):

•

ECM Prevention-to-treatment cost ratio is 1:30

•

Impact of ECM enrollment on hospital admissions (decrease in probability) is 10%, 15%, to

20%
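As a worked reading of these assumptions: if the 1:30 prevention-to-treatment ratio is taken to mean $k = 30$ (our interpretation, not stated explicitly), the savings formula implies a break-even precision. Below that precision, enrolling the flagged patients costs more than it saves:

$$
TP\,(e\,k - 1) - FP > 0 \iff TP\,e\,k > TP + FP \iff \frac{TP}{TP + FP} > \frac{1}{e\,k}
$$

With $k = 30$, the break-even precision is $1/3 \approx 33\%$ for $e = 10\%$, $1/4.5 \approx 22\%$ for $e = 15\%$, and $1/6 \approx 17\%$ for $e = 20\%$.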

Model Implementation Approach

•

Dataset

•

Size:

~610k records, randomly re-shuffled

•

Split:

80-20 split, so ~490k training records and ~120k for testing

•

Highly unbalanced:

only 7.5% positive samples

•

Algorithms:

optimized via cross-validated parameter grid search

•

Parameter grid:

•

Parameters and values are based on known useful combinations and trials on small sets; the parameter grid is limited to max two parameters per model (to manage execution time growth)

•

Cross validation:

•

5-fold CV over training set, stratified to maintain target variable imbalance

•

CV scoring metric:

Log loss and custom cost-sensitive metric

•

Benchmarks:

•

‘Expert’ algorithm developed for the same problem/dataset (see above, slide 5)

•

Random selection of patients (using the prevailing positive case rate in the training set, 7.5%)
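The split and cross-validation scheme can be sketched in plain Python. The labels are synthetic, with the 7.5% positive rate quoted above. The helper `stratified_folds` is our own. Stratifying the 80-20 split itself is an assumption, since the slides only say the records were randomly re-shuffled. The grid search over model parameters is omitted.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign record indices to n_folds folds, class by class,
    so that every fold keeps roughly the overall positive rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Synthetic labels: 7.5% positives (an admission), 92.5% negatives.
labels = [1] * 750 + [0] * 9250

# 80-20 split: hold out one of five stratified folds as the test set.
outer = stratified_folds(labels, n_folds=5, seed=1)
test_idx = outer[0]
train_idx = [i for fold in outer[1:] for i in fold]

# 5-fold stratified cross-validation inside the training set only.
train_labels = [labels[i] for i in train_idx]
cv_folds = stratified_folds(train_labels, n_folds=5, seed=2)
for fold_number, fold in enumerate(cv_folds):
    rate = sum(train_labels[i] for i in fold) / len(fold)
    print(f"CV fold {fold_number}: {len(fold)} records, {rate:.1%} positive")
```

In each cross-validation round one fold serves as the validation set and the other four as training data. The held-out test set is touched only once, for the final comparison against the benchmarks.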

First Results

Precision and Recall for Log Loss Models

ML models

outperform

 the benchmark on

precision

, but

lag

 behind on

recall

•

ML models have difficulty identifying all positive cases (i.e. patients with an

admission). Most positive samples have low probabilities of being a positive

case – a typical consequence of highly unbalanced datasets.

•

But classification above the 50% threshold is highly accurate (few false

alarms)

ROC and ROC-AUC for Log Loss Models

•

ROC curves are closely clustered

•

Optimized ROC curves are very close

to each other (suggesting reasonably

effective optimization)

•

Decision Tree based algorithms tend

to have lower AUC

Comparison to the Expert Benchmark ROC curve is

not possible as the benchmark model does not

produce probability estimates

Summary So Far…

•

Performance of the ML models is promising, in line with known expectations (AUC close to 0.75), and beats the benchmark on precision

•

But clear

weakness

 on recall (i.e., Patients with a high chance of a hospital

admission are not detected consistently)

•

Results on

original

 dataset have room for improvement

•

Why the sub-par classification capacity?

Next Round of Results…

Dealing with an Unbalanced Dataset…

•

How difficult is prediction using

standard ML models on the original

unbalanced

 dataset?

•

Example: Random forest

•

Quite difficult, the distributions of class

predictions are not separable

•

All models are consistently putting low

estimates for positive samples, well below

the 50% threshold (poor recall)

•

Almost total overlap (see the red and

green distributions on the right)

•

Change of

classification threshold (default

at 50%) does not help

Dealing with an Unbalanced Dataset…

•

Improve results via:

•

better features (possible, as a follow-up phase)

•

more complex models (possible, as a follow-up phase)

•

or

influence training

directly?

•

Rebalancing techniques (e.g., under-sampling majority) could be

applied during

training

•

Overall goal is to identify more positive cases, at (an acceptable) expense of

false positives, subject to tradeoff factors

•

So the accurate prediction of probabilities is not the main goal

•

Consider

some

 amount of rebalancing on the training set

only

•

Retain full set of minority class

and decrease the majority class to reach ratio

•

No hard rule on single most effective rebalancing ratio, so several trials
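Under-sampling as described (keep every minority record, thin out the majority class to a target ratio, training set only) might look like the sketch below. The function name, the `ratio` parameter and the synthetic data are assumptions for illustration. The ratios used in the study are not stated.

```python
import random

def undersample_majority(records, labels, ratio, seed=0):
    """Keep all positive (minority) records and a random subset of negatives
    so that negatives : positives is about `ratio` : 1."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), int(ratio * len(positives)))
    kept = sorted(positives + rng.sample(negatives, n_keep))
    return [records[i] for i in kept], [labels[i] for i in kept]

# Synthetic training set with 7.5% positives.
train_labels = [1] * 75 + [0] * 925
train_records = list(range(len(train_labels)))

# Try several ratios, since no single ratio is known to be best.
for ratio in (1, 3, 5):
    _, y = undersample_majority(train_records, train_labels, ratio)
    print(f"ratio {ratio}:1 -> {len(y)} records, {sum(y) / len(y):.1%} positive")
```

The test set keeps its original class balance, so reported precision and recall still reflect the real 7.5% prevalence.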

Effect of Under-sampling

•

‘Probability’ distribution for original and under-sampled training datasets.

•

Predictions are over-estimated as expected

•

Model can now detect

more

 positive samples than before

(more red/positive samples above

0.50)

, thus improved recall, in exchange for some precision loss.

Results for Models Based on Resampling

Next Round of Results…

More complex models: Preliminary Results for

Neural Networks

•

Preliminary results are from a run

of a Neural Network algorithm.

•

A neural network

 is a series of

algorithms that endeavors to

recognize underlying

relationships in a set of data

through a process that mimics

the way the human brain

operates.


•

The model detects many more positive samples than before (more red/positive samples above 0.50)…

More complex models: Preliminary Results for

Neural Networks II

•

…The model recall is 69% and

precision is 13% (outperforming the

expert reference model on both

measures)

•

The resulting ROC-AUC is comparable

to the one of the top high-precision

models presented above (i.e. 0.73)

•

We now have a

high-precision

 and a

high-recall

 model…

•

Different classifiers have different misclassification rates

•

Crucially,

a few models misclassify samples that other models get

right

, so taking an average over several classifiers might improve

results

•

Create a model that ensembles multiple classifiers to reduce

prediction variance

•

Hard voting: every individual classifier votes for a class, and the majority wins.

•

Soft voting: every individual classifier provides a probability value that a

specific data point belongs to a particular target class. The predictions are

weighted by the classifier's importance and summed up.

•

A Soft Voting Ensemble Model

•

There is no significant advantage

to ensembling in this sample,

but there is indeed a small gain.

Conclusions

ML vs. the Old Approach

Which Patients Do the ML Models Identify?


Comparisons – Old Approach and the Literature

•

The ML models are better than the old approach at predicting

hospital admissions (but this is only one aspect/one goal of ECM)

•

Results are comparable to best performing results from the literature

-> Johns Hopkins Adjusted Clinical Groups (the leading proprietary risk

stratification tool) -

Haas et al.; Risk-Stratification Methods for

Identifying Patients for Care Coordination; The American Journal of

Managed Care (September 2013)

•

Predicting hospital admissions still remains a hard problem…

How to use ML techniques for ECM?

•

Use ML instead of the old approach or combine them?

•

Both approaches have advantages and disadvantages:

•

Mainly interpretability vs. prediction performance

•

What is the objective of ECM?

•

How long is ECM enrolment going to be for a patient?

•

Use ML in addition to the old algorithm?

•

Implement ML predictions for other purposes than hospital admissions

•

Use dashboards as a chance to give more frequent feedback for GPs

•

Move from retrospective feedback to forward-looking information sharing for

better decision making by care teams

Some More Observations

•

Improving the ML models: additional data on social status and

conditions of patients is key

•

Updating the models based on new available information (every 3, 6, 12

months) can improve performance considerably

•

More trials and evaluations offer more chances for model improvement

•

Implementation:

Analysis was carried out using Python. All code will be

made available and can be adapted.

•

Data cleaning and preparation is a lengthy process…

•

Running the more advanced models takes some time. The availability of

multiple/scalable computing resources is key…

Other Potential ML Applications at EHIF

•

Predicting costs per patient and which patients are going to be the

high-cost patients in the next year

•

Predicting volumes of care services at different providers

•

Predicting which patients on a waiting list can benefit the most from a

given surgery (see above example)

•

Unsupervised machine learning: Identifying provider fraud and outlier

providers (in terms of their performance or their care provision)

•

…


Presentation Transcript

MACHINE LEARNING FOR IMPROVED RISK STRATIFICATION OF NCD PATIENTS IN ESTONIA
Big Data and Machine Learning in Health Care
Marvin Ploetz, Philip Docena, Ojaswi Pandey, Aakash Mohpal
23 April, 2019

Objectives: Propose an alternative, machine learning based approach to patient risk stratification for ECM in Estonia. Illustrate the use and applicability of machine learning to other areas of work relevant to EHIF.

Overview: Big Data and Machine Learning in Health Care; Machine Learning Basics; Context of ECM; Research Question; Data Overview & Sample Construction; Feature Engineering; Evaluation & Modelling Choices; Results; Conclusions

Big Data and Machine Learning in Health Care

Big Data and Machine Learning in Health Care. Take advantage of massive amounts of data and provide the right intervention to the right patient at the right time. Personalized care to the patient. Potentially benefit all agents in the health care system: patient, provider, payer, management. Data types shown: vital signs, activity data, behavioral data, nutritional data, EMR, clinical notes, medical images, genome data. Diagram labels: Patients; Providers; Payers; Other stakeholders; Public health; Pharma companies and drug discoveries; Claims and billing; Approvals and denials; Population health and risk.

Uses of Machine Learning in Health Care. Big data and machine learning analytics in health care: improve care and efficiency, lower costs; personalized medicine; proactively prevent diseases; assist diagnostics; improve clinical trials; predict disease risk; study population health; find cures for conditions. Benefits: right patient, right intervention, right time. Beneficiaries: patients, providers, payers.

Example 1: Hip and knee replacement in the US. Osteoarthritis: a common and painful chronic condition that often requires replacement of hips and knees. More than 500,000 Medicare beneficiaries receive replacements each year. Medical costs: roughly $15,000 per surgery. Medical benefits: accrue over time, since the first months after surgery are painful and spent in disability. Therefore, a joint replacement only makes sense if the patient lives long enough to enjoy it. If the patient dies soon after, the surgery could be futile and painful. Prediction/classification problem: Can we predict which surgeries will be futile using only data available at the time of surgery?

Example 1: Hip and knee replacement in the US. Sample: 20% of 7.4 million beneficiaries, of whom 98,090 had a hip or knee replacement in 2010; 1.4% died within one month of surgery and 4.2% died within 1-12 months. Train data: 65,395 observations. Test data: 32,695 observations. 3,305 independent variables. Model to predict riskiest patients. Traditional Analysis: About Averages. Big Data and ML Analytics: Predict Individual Risks.

Example 1: Hip and knee replacement in the US

Predicted mortality percentile | Observed mortality rate | Futile procedures averted | Futile spending ($ millions)
1  | 43.5% | 1,984  | 30
2  | 42.2% | 3,844  | 58
5  | 35.8% | 8,061  | 121
10 | 24.2% | 10,512 | 158
20 | 15.2% | 12,317 | 185
30 | 13.6% | 16,151 | 242

The first column sorts the test sample by risk percentiles. In the riskiest 1 percent of the population, the observed mortality rate within 1-12 months of surgery was 43.5%. Reallocating these surgeries to patients with a median risk level (50th percentile) would have averted 1,984 futile procedures and reallocated $30m to other beneficiaries.

Example 2: Diagnoses of pediatric conditions. Apply natural language processing algorithms to extract data from EHRs. Extract 101.6m data points from 1.3m EHRs of pediatric patients. High diagnostic accuracy across multiple organ systems, comparable to the performance of experienced pediatric physicians.


Example 3: Breast cancer screening. Most common form of cancer, afflicting 2.5 million patients worldwide in 2015. Need to distinguish malignant tumors from benign ones. Early detection is key. Data: 62,219 mammography findings from the Wisconsin State Cancer Reporting System. A neural network based algorithm does as well as radiologists in classifying the tumors.

Machine Learning Basics

Definition of Big Data. Collection of large and complex data sets which are difficult to process using common database management tools or traditional data processing applications. Not only about size: finding insights from complex, noisy, heterogeneous, and longitudinal data sets. This includes capturing, storing, searching, sharing and analyzing. Diagram labels: Volume, Variety, Velocity.

Types of Machine Learning Problems. 1) Supervised: making predictions using labeled/structured data. Classification: use data to predict which category something falls into (examples: whether an image contains a store front or not; whether a patient is high risk or not). Regression: use data to make predictions on a continuous scale (examples: predict the stock price of a company; given historical data, predict tomorrow's temperature). 2) Unsupervised: detecting patterns from unstructured data. Problems where we have little or no idea what the results should look like. Provide algorithms with data and ask them to look for hidden features and cluster the data in a way that makes sense. Examples: identifying patterns in genomics data; separating voice from noise in audio files.

Machine Learning Implementation. Collect data -> Standardize and clean data (feature engineering / data construction) -> Split data in test/train (train data 80%, test data 20%) -> Build machine learning model using train data -> Validate model results using test data.

Assessing Model Performance: Precision and Recall

                 | Actual: True        | Actual: False
Predicted: True  | True positive (TP)  | False positive (FP)
Predicted: False | False negative (FN) | True negative (TN)

Accuracy = (TP+TN)/All
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)

Assessing Model Performance: Precision and Recall

Case I: High recall, low precision

                 | Actual: True | Actual: False
Predicted: True  | 100 TP       | 45 FP
Predicted: False | 5 FN         | 80 TN

Accuracy = 180/230 = 78%; Precision = 100/145 = 69%; Recall = 100/105 = 95%

Case II: Low recall, high precision

                 | Actual: True | Actual: False
Predicted: True  | 90 TP        | 5 FP
Predicted: False | 35 FN        | 100 TN

Accuracy = 190/230 = 83%; Precision = 90/95 = 95%; Recall = 90/125 = 72%

Assessing Model Performance: ROC Curve. Plot the true and false positive rate for every classification threshold. A perfect model has a curve that passes through the upper left corner (AUC = 1). The diagonal (red line) represents random guessing (AUC = 0.5).

Decision Tree: Playing Golf. A non-parametric supervised learning method used for classification and regression. Built in the form of a tree structure. Breaks data down into smaller and smaller subsets while incrementally building the tree. The final result is a tree with decision nodes and leaf nodes.

Outlook  | Temperature | Humidity | Windy | Play Golf
Rainy    | Hot         | High     | False | No
Rainy    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Sunny    | Mild        | High     | False | Yes
Sunny    | Cool        | Normal   | False | Yes
Sunny    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Rainy    | Mild        | High     | False | No
Rainy    | Cool        | Normal   | False | Yes
Sunny    | Mild        | Normal   | False | Yes

Decision Tree: Playing Golf

Outlook
  Rainy -> No Golf
  Overcast -> Golf
  Sunny -> Windy
    False -> Play Golf
    True -> No Golf

Decision tree to Random Forest. A collection of decision trees whose results are aggregated into one final output. Uses different sub-samples of the data and different sets of features. Helps reduce overfitting, mainly by lowering variance compared with a single tree.

Context of ECM

A Big Challenge of the Estonian Healthcare System Changes in the demand for health care due to population ageing and rise of non-communicable diseases Chronic conditions as the driving force behind needs for better care integration Low coverage of preventive services and considerable share of avoidable specialist and hospital care Opportunity to improve management of specific patient groups at the PHC level -> care management for empaneled patients Prediction for which patients breaches in care coordination will occur -> risk-stratification of patients

Risk Stratification Until Now. No actual prediction analysis done. Involvement of providers to gain trust/understanding. Behavioral and social criteria are key, but sparsely available -> use insider knowledge of doctors. Eligibility flowchart (each step has a Yes/No branch leading either to "Not eligible" or to the next step): 1) DM/ Hypertension/ Hyperlipidemia; 2) Min. and max. number/combination of CVD/ Respiratory/ Mental Health/ Functional Impairment; 3) Dominant/complex condition (cancer, schizophrenia, rare disease etc.); 4) Review by GPs (behavioral & social factors, information not in data). Final outcome: ECM Candidate.

Enhanced Care Management So Far In Estonia Successful enhanced care management pilot with 15 GPs and < 1,000 patients to assess the feasibility and acceptability of enhanced care management Commitment of the Estonian Health Insurance Fund (EHIF) to scale-up the care management pilot Model for risk stratification: - Clinical algorithm + provider intuition Need for a better risk-stratification approach!?

Research Question

The Prediction Problem Target patients - Who benefits from care management? A combination of disease, social and behavioral factors Objective of ECM - Ultimately improve health outcomes for patients with cardio-vascular, respiratory, and mental disease. What is the right proxy prediction variable in the data? There is not one single relevant adverse event (e.g. death, hospital admission, health complication, high healthcare spending) Some discussions on how to choose the dependent variable -> Unplanned hospital admissions have a large negative impact on patient lives, are costly and relatively frequent. Some are also avoidable

Many Patients Repeatedly Have Hospitalizations

[Bar chart: Patients with an Admission in 2011 - Subsequent Hospitalization Rates. Y-axis: percentage of patients who were hospitalized in 2011. X-axis: one year later through seven years later. Bar values as extracted: 22.9, 20.2, 18.8, 17.7, 16.3, 13.6, 9.3]

• 22 percent of patients need to be hospitalized again in the following year

Hospitalizations account for the bulk of healthcare costs

Average costs (in Euros) in different types of care in 2016

Type of care                      Average cost
Inpatient Care                    167.63
Outpatient Care                   148.17
Day Care                          37.72
PHC                               22.53
Inpatient Nursing Care            10.01
Outpatient Rehabilitation Care    6.41
Inpatient Rehabilitation Care     6.01
Outpatient Nursing Care           4.94

[Chart legend: ML Sample (N=712,104), General Population (N=1,260,630); only one series of values survives in the transcript]

Predicting Hospital Admissions

• Hospital admissions are the main (avoidable) adverse health event
• But predicting hospitalizations is a hard problem
  - Social factors matter a lot
  - Patients may have many contacts with the healthcare system or none at all
• Tradeoff in choosing which hospitalizations we want to predict
  - Admissions due to specific conditions vs. hospitalizations in general

Predicting Hospital Admissions

Hospital Admissions Excluded

ICD-10 Chapter    Title
A00-B99           Certain infectious and parasitic diseases
C00-D48           Neoplasms
O00-O99           Pregnancy, childbirth and the puerperium
P00-P96           Certain conditions originating in the perinatal period
S00-T98           Injury, poisoning
V01-X59           Accidents

Key question

• Not: What is the best algorithm for predicting hospital admissions?
• But: How can we obtain the most useful prediction of hospital admissions for a specific purpose?
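The exclusion table above can be applied mechanically when building the target variable. The following is a minimal sketch, not the authors' implementation: it assumes each admission record carries a principal diagnosis as an ICD-10 code, and the field name `diagnosis` and the function names are hypothetical.

```python
# Excluded ICD-10 ranges, copied from the slide's exclusion table.
EXCLUDED_RANGES = [
    ("A00", "B99"),  # Certain infectious and parasitic diseases
    ("C00", "D48"),  # Neoplasms
    ("O00", "O99"),  # Pregnancy, childbirth and the puerperium
    ("P00", "P96"),  # Certain conditions originating in the perinatal period
    ("S00", "T98"),  # Injury, poisoning
    ("V01", "X59"),  # Accidents
]


def is_excluded(icd10_code: str) -> bool:
    """Return True if the code's 3-character category lies in an excluded range.

    ICD-10 categories are one letter plus two digits, so plain string
    comparison orders them correctly.
    """
    category = icd10_code.strip().upper()[:3]
    return any(low <= category <= high for low, high in EXCLUDED_RANGES)


def target_admissions(admissions):
    """Keep admissions whose diagnosis is not excluded (hypothetical 'diagnosis' field)."""
    return [a for a in admissions if not is_excluded(a["diagnosis"])]


# Toy records for illustration only.
example = [
    {"patient_id": 1, "diagnosis": "I50.0"},  # heart failure: kept
    {"patient_id": 2, "diagnosis": "S72.0"},  # femur fracture: excluded
    {"patient_id": 3, "diagnosis": "C50.9"},  # breast neoplasm: excluded
]
kept = target_admissions(example)
```

With the toy records, only the heart-failure admission remains as a prediction target.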

Data Overview & Sample Construction

Administrative Claims Data (in Estonia)

• Very reliable
• High-quality data available as of 2007/2008
• Comprehensive coding requirements for providers
• Reporting lag of data is on average 2 weeks
• No info on clinical outcomes (i.e. test results)
• Limited information on social conditions and behavioral characteristics
• Need for a lot of feature engineering to create meaningful variables at the patient level

Description of Available Data

• Administrative
• Beneficiary
• Family Doctor
• Patient-Year Level
• Types of Care
  1. Day Care
  2. Inpatient Care
  3. Inpatient Nursing Care
  4. Inpatient Rehabilitation Care
  5. Outpatient Care
  6. Outpatient Nursing Care
  7. Outpatient Rehabilitation Care
  8. Primary Health Care
• Utilization, Diagnosis, Procedures (Surgical and Other)
• Medications: Prescriptions and Filling of Prescriptions

Patient Cohort Selection for the ML Analysis

Characteristics of Patients in the ML sample vs. Total Population

[Chart: Age distribution of the population in the data. Percentage of population vs. percentage of sample by age group: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85+]

[Chart: Gender Distribution. % of women, General Pop. vs. ML Sample]

• Relative to the population, the ML sample is older and more likely to be female.

Characteristics of Patients in the ML sample vs. Total Population

Insurance type (% of patients)

Insurance type     General Population    ML Sample
1 = General        49.39                 43.63
2 = Unemployed     2.18                  2.04
3 = Pensioner      28.94                 37.59
4 = Disabled       9.67                  12.51
5 = Welfare        0                     0.00
6 = Widow          0.28                  0.07
Uninsured          9.54                  4.15

[Bar chart: Average costs (in Euros) for prescriptions prescribed to patients in 2016. Categories: total price of prescriptions; total price share of prescriptions paid by patients. Series: ML Sample (N=712,104); ML Sample, conditional on patients being hospitalized at least once in 2016. Bar values as extracted (series assignment not recoverable): 375.99, 180.21, 141.58, 75.01]

Most Common Chronic Conditions

Top-20 Chronic Conditions: Percentages of Patients With Condition

Condition                         % of patients
Hypertension                      48.1
Joint Arthrosis                   31.1
Hyperlipidemia                    26.7
Chronic Gastritis/GERD            25.0
Congestive Heart Failure          15.7
Neuropathies                      15.2
Thyroid Diseases                  14.5
Mood Disorders                    14.3
Ischemic Heart Disease            13.8
Cardiac Arrhythmias               13.0
Dizziness                         11.4
Obesity                           10.9
Diabetes Mellitus                 10.8
Anemia                            10.2
Migraine                          9.75
Hemorrhoids                       9.59
Vision And Hearing Impairments    8.87
COPD                              8.53
Stroke                            8.41
Asthma                            8.10

[Chart legend: General Population (N=1,260,630), ML Sample (N=712,104); only one series of values survives in the transcript]

• The ML sample population is also sicker on average (i.e. the prevalence of chronic conditions is higher)

Characteristics of Patients in the ML sample vs. Total Population

Percentage of people living in a given county (%)

Name of county    Poverty rate (%)    General Population    ML sample
Harju             1 = 12.6            43.15                 43.23
Saare             2 = 12.6-15.8       2.65                  2.53
Tartu             3 = 15.8-17.63      11.22                 10.86
Järva             3 = 15.8-17.63      2.39                  2.4
Rapla             4 = 17.63-18.3      2.55                  2.45
Pärnu             5 = 18.3-21.7       6.63                  6.53
Lääne             5 = 18.3-21.7       1.62                  1.61
Viljandi          5 = 18.3-21.7       3.75                  3.7
Hiiu              5 = 18.3-21.7       0.78                  0.75
Lääne-Viru        5 = 18.3-21.7       4.65                  4.61
Jõgeva            6 = 21.7-24.7       2.38                  2.34
Põlva             6 = 21.7-24.7       2.07                  2.06
Võru              7 = 24.7-25.1       2.85                  2.8
Ida-Viru          8 = 25.1-26.9       11.05                 11.88
Valga             8 = 25.1-26.9       2.25                  2.24

Feature Selection & Engineering

Feature Selection & Engineering

• Series of attempts with interim features to extract better performance
• Final set: 141 features

Features Used

Feature category 1: Healthcare utilization

• Total number of hospital admissions
  - Inpatient admissions
  - Inpatient nursing admissions
  - Inpatient rehab admissions
• Total number of hospital stay days
  - Stay days in inpatient care
  - Stay days in inpatient nursing care
  - Stay days in inpatient rehabilitation care
• Total number of PHC visits
• Total number of specialist visits
  - Outpatient specialist visits
  - Outpatient rehabilitation specialist visits
• Total number of surgeries
  - Emergency surgeries
  - Surgeries that took between 1 and 3 hours
  - Respiratory surgeries
• Whether lab tests were done
  - Cholesterol fractions
  - Cholesterol
  - Glucose
• Total number of prescriptions
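Utilization features of this kind are typically built by aggregating claim rows to the patient-year level. The sketch below illustrates the idea in plain Python. It is an assumption about the general approach, not the project's actual pipeline: the claim fields (`patient_id`, `year`, `care_type`, `stay_days`) and feature names are hypothetical, while the care-type labels follow the eight types of care listed in the data description.

```python
from collections import defaultdict

# Care types counted as hospital admissions or specialist visits (labels from the data description).
INPATIENT_TYPES = {"Inpatient Care", "Inpatient Nursing Care", "Inpatient Rehabilitation Care"}
SPECIALIST_TYPES = {"Outpatient Care", "Outpatient Rehabilitation Care"}


def utilization_features(claims):
    """Aggregate claim rows into one feature dict per (patient_id, year)."""
    features = defaultdict(lambda: {
        "hospital_admissions": 0,
        "hospital_stay_days": 0,
        "phc_visits": 0,
        "specialist_visits": 0,
    })
    for claim in claims:
        row = features[(claim["patient_id"], claim["year"])]
        care_type = claim["care_type"]
        if care_type in INPATIENT_TYPES:
            row["hospital_admissions"] += 1
            row["hospital_stay_days"] += claim.get("stay_days", 0)
        elif care_type == "Primary Health Care":
            row["phc_visits"] += 1
        elif care_type in SPECIALIST_TYPES:
            row["specialist_visits"] += 1
    return dict(features)


# Toy claims for illustration only.
claims = [
    {"patient_id": 1, "year": 2016, "care_type": "Inpatient Care", "stay_days": 4},
    {"patient_id": 1, "year": 2016, "care_type": "Inpatient Nursing Care", "stay_days": 10},
    {"patient_id": 1, "year": 2016, "care_type": "Primary Health Care"},
    {"patient_id": 2, "year": 2016, "care_type": "Primary Health Care"},
    {"patient_id": 2, "year": 2016, "care_type": "Outpatient Care"},
]
features = utilization_features(claims)
```

Each resulting dictionary is one row of the patient-year feature table; the per-care-type breakdowns on the slide would be further counters of the same form.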

Features Used

Feature category 2: Health status

• Total number of major chronic conditions
• Any of: Joint arthrosis, Chronic gastritis
• Whether a patient had a major chronic condition
  - Hypertension
  - Diabetes Mellitus
  - Hyperlipidemia
  - COPD
  - Asthma
  - Dementia
  - Vision and hearing impairments
• Prescriptions
  - Diabetic agents
  - Diuretics
  - NSAIDs
  - Anticoagulants
  - Antiplatelets
  - Antihypertensives
  - Antidepressants
  - Narcotics
• Total price of prescriptions (in Euros)
• Total out-of-pocket expenditures on prescriptions by patients (in Euros)

Features Used

Feature category 3: Patient behavior

• % of prescriptions picked up by patients

Feature category 4: Socioeconomic status

• Insurance status: 1 = General, 2 = Unemployed, 3 = Pensioner, 4 = Disabled, 5 = Welfare, 6 = Widow, Uninsured
• Poverty rates at the county level: 1 = 12.6, 2 = 12.6-15.8, 3 = 15.8-17.63, 4 = 17.63-18.3, 5 = 18.3-21.7, 6 = 21.7-24.7, 7 = 24.7-25.1, 8 = 25.1-26.9

Feature category 5: Quality of care received

• Total number of family doctors utilized across time
• Admission rates of GPs, standardized by age and gender of patients in the patient list
• Compliance with diabetes guidelines by PHC doctor
  - All tests done
  - No tests done

Getting to Know the Data: Diagnosis and Admissions

[Charts: Single DGN; Pairs of DGNs]

• Afib (Atrial Fibrillation and Flutter), Chf (Congestive Heart Failure), Htn (Hypertension), and Ischemic Htd (Ischemic Heart Disease) are strong indicators of potential admissions in the following year (2017)
• Patient groups with these conditions have a non-trivial likelihood (~10%) of hospital admission
• This likelihood increases to ~20%-30% with one hospital admission in 2016 and to >50% with three or more admissions in 2016
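The relationship described above, next-year admission likelihood rising with the number of prior-year admissions, is a simple conditional rate. A minimal sketch of that tabulation follows. The field names `admissions_2016` and `admitted_2017` are hypothetical, and the toy data are invented for illustration and do not reproduce the rates on the slide.

```python
from collections import defaultdict


def admission_rate_by_prior_admissions(patients, cap=3):
    """Share of patients admitted in 2017, grouped by number of 2016 admissions.

    Counts at or above `cap` are pooled into one bucket ("3 and more").
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [admitted in 2017, total patients]
    for patient in patients:
        bucket = min(patient["admissions_2016"], cap)
        counts[bucket][0] += 1 if patient["admitted_2017"] else 0
        counts[bucket][1] += 1
    return {bucket: admitted / total for bucket, (admitted, total) in sorted(counts.items())}


# Toy cohort for illustration only.
patients = [
    {"admissions_2016": 0, "admitted_2017": False},
    {"admissions_2016": 0, "admitted_2017": False},
    {"admissions_2016": 0, "admitted_2017": True},
    {"admissions_2016": 0, "admitted_2017": False},
    {"admissions_2016": 1, "admitted_2017": True},
    {"admissions_2016": 1, "admitted_2017": False},
    {"admissions_2016": 3, "admitted_2017": True},
    {"admissions_2016": 5, "admitted_2017": True},
]
rates = admission_rate_by_prior_admissions(patients)
```

Restricting `patients` to a diagnosis group (for example, those with Chf) before calling the function gives the per-condition version of the same table.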

Evaluation & Modelling Choices

ML Models Selected for Evaluation

Selection criteria:
• Algorithms are readily available in easy-to-use, comprehensive and well-tested open-source Python libraries (scikit)
• Algorithms and results are relatively easy to describe/explain (common algorithms)
• For interpretability and model familiarity, no attempt at exploring more complex models; no deep networks

Included in comparison:
• Decision Tree
• Random Forest and Extremely Randomized Trees (ExtraTrees)
• k-Nearest Neighbors*
• Gaussian Naïve Bayes**
• Logistic Regression (L1, L2)
• SVM (RBF, polynomial)***
• Multi-layer Perceptrons (1 hidden layer)
• AdaBoost (Decision Tree and Random Forest)
• Gradient Boosted Trees (scikit GBT, not XGBoost)
• Calibrated (isotonic) variations of the above classifiers
• Neural Networks

Eventually excluded: *kNN for execution time and memory requirements, **NB for weak performance, and ***SVMs for very slow training (but considered for final paper)

Evaluation metrics

• Variable to be predicted: Yes/No hospital admission in 2017
• Use data from 2011-2016
• We deal with an unbalanced sample (i.e. 7.5% of patients had an admission in 2017)
• Appropriate metrics of model performance in an unbalanced dataset:
  - Precision, Recall, ROC curve and area under the curve (AUC)
  - Problem-specific custom metric that penalizes one type of error more heavily: cost of a false positive (cost of ECM) vs. cost of a missed positive (cost of subsequent hospitalization)
• Different ML models have different strengths, but differences should not be huge
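Two points from this slide can be made concrete. First, with 7.5% positives, a model that predicts "no admission" for everyone reaches 92.5% accuracy while finding no admissions at all, which is why accuracy is not among the listed metrics. Second, a problem-specific metric can weight the two error types differently. The sketch below is one possible form of such a metric, not the one used in the project, and the unit costs are hypothetical placeholders.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels given as 0/1 or booleans."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def total_error_cost(y_true, y_pred, cost_fp, cost_fn):
    """Cost-weighted error: false positives and missed positives priced separately."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fp * cost_fp + fn * cost_fn


# 1,000 patients with the 7.5% admission rate from the slide.
y_true = [1] * 75 + [0] * 925
all_negative = [0] * 1000
baseline_accuracy = accuracy(y_true, all_negative)

# Hypothetical unit costs: ECM for a false positive, hospitalization for a missed positive.
COST_ECM = 100
COST_HOSPITALIZATION = 1000
baseline_cost = total_error_cost(y_true, all_negative, COST_ECM, COST_HOSPITALIZATION)
```

Under these placeholder costs the all-negative baseline looks accurate but incurs the full cost of every missed admission; a model is preferable when its total error cost is lower.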

Intuitive Interpretation of Metrics

• Precision is the probability that a patient classified by an algorithm as having a hospital admission is actually going to have a hospital admission.
• Recall is the probability that a patient who is going to have a hospital admission is classified as such by an algorithm.
• Which one is more important? It depends a lot on the application.
• There is a tradeoff between maximizing either of them
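In terms of true positives (TP), false positives (FP) and false negatives (FN), precision is TP / (TP + FP) and recall is TP / (TP + FN). The tradeoff appears when a model's risk scores are turned into Yes/No predictions with a threshold. The sketch below illustrates this with invented scores; the function name and the data are for illustration only.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when patients with score >= threshold are flagged."""
    y_pred = [score >= threshold for score in scores]
    tp = sum(1 for truth, pred in zip(y_true, y_pred) if truth and pred)
    fp = sum(1 for truth, pred in zip(y_true, y_pred) if not truth and pred)
    fn = sum(1 for truth, pred in zip(y_true, y_pred) if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 1 = admitted in the following year, with hypothetical model risk scores.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

strict = precision_recall(y_true, scores, threshold=0.75)   # flags 2 patients
lenient = precision_recall(y_true, scores, threshold=0.35)  # flags 4 patients
```

The strict threshold flags only patients who are in fact admitted (precision 1.0) but misses one of three admissions; the lenient threshold catches all three admissions (recall 1.0) at the price of one false alarm (precision 0.75).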
