Machine Learning for Public Health in R
In this guide, explore machine learning for public health in R: key tasks such as clustering, classification, and prediction/regression; methods like K-means, SVM, and recursive partitioning trees (RPT) for both unsupervised and supervised learning; distance measurement, scaling techniques, and the role of content knowledge in decision-making; and the use of unsupervised clustering through K-means in practical public health scenarios.
Presentation Transcript
Machine Learning for Public Health in R
Mike Dolan Fliss
Main tasks of Machine Learning: 1. Clustering, 2. Classification, 3. Prediction / Regression. We do this sometimes, but it isn't CI (causal inference)! But also: web scraping, text mining, data viz, etc.
Today: Methods sampler & applications. More unsupervised: K-means (we'll do some!), hierarchical clustering, Principal Component Analysis, Latent Class Analysis (not covering, but useful for categoricals). More supervised: K-nearest neighbors (KNN), support vector machines (SVM), recursive partitioning trees (RPT and friends).
Many methods rely on distance. Distance how? Continuous numeric: Euclidean (or Manhattan, etc.) via dist(). Text: Levenshtein via adist(), etc. Geometry/spatial: st_distance(), or just x/y Euclidean in a pinch. Categorical: binary (Jaccard distance); dummify with library(dummies). Really! Basically indicator coding.
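A minimal sketch of these distance functions on made-up data (the vectors and data frames here are illustrative, not from the county dataset):

# Euclidean distance between numeric rows (also "manhattan", etc.)
x = data.frame(exposure = c(1, 2, 5), outcome = c(10, 12, 30))
dist(x, method = "euclidean")

# Levenshtein (edit) distance between strings
adist(c("Durham", "Durhem", "Wake"))

# Categorical: dummify to indicators, then use a binary / Jaccard-style distance
f = data.frame(region = c("coast", "piedmont", "coast"))
m = model.matrix(~ region - 1, f)   # base-R indicator coding; library(dummies) does similar
dist(m, method = "binary")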
Scaling: 2 pounds vs. 2 feet vs. 2 inches. Standardization helps, but you may not want to artificially spread the data. Custom weighting is another option. These decisions have consequences; content knowledge helps!
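A quick illustration of why scaling and weighting matter before computing distances (the variable names are made up):

df = data.frame(weight_lbs = c(120, 180, 250), height_ft = c(5.0, 5.5, 6.5))
dist(df)                             # unscaled: pounds dominate because their range is larger
dist(scale(df))                      # standardized: each variable contributes comparably
dist(scale(df) %*% diag(c(1, 2)))    # custom weighting: upweight height if content knowledge says so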
K-means (illustration from Wikipedia, yay)
K-means Demo Find county clusters of exposure and outcome (two variables only). Assume 4 clusters.
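A sketch of how this two-variable demo might be run; the column names are borrowed from the hierarchical clustering code later in the deck, and the rest is an assumption:

library(ggplot2)
set.seed(1)
km4 = kmeans(scale(county_profile[, c("pct_pnc5", "pct_preterm")]), centers = 4, nstart = 25)
county_profile$cluster4 = factor(km4$cluster)
ggplot(county_profile, aes(pct_pnc5, pct_preterm, color = cluster4)) + geom_point()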
K-means Demo: Let's cluster on % of births, % preterm, % PNC. Do you have an intuition about what's going to happen?
K-means Demo: Let's cluster on % of births. Selecting the number of clusters can be done somewhat statistically, but there's also the whole utility / communication thing.
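One "somewhat statistical" check is an elbow plot of total within-cluster sum of squares across candidate k; a sketch, assuming the same scaled county variables as above:

wss = sapply(1:10, function(k)
  kmeans(scale(county_profile[, c("pct_pnc5", "pct_preterm")]), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# look for the "elbow" -- but utility and communication matter as much as the statistics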
K-means Demo Have an intuition about what happened here?
Hierarchical Clustering
library(ggplot2)
h_clust_data = county_profile[, 1:3]
row.names(h_clust_data) = county_results$county_name
h_clusters = hclust(dist(h_clust_data))
county_profile$h_clusters_cut = cutree(h_clusters, k = 4)  # k = number of groups; use h = for a height cut
ggplot(county_profile, aes(pct_pnc5, pct_preterm,
                           color = as.factor(h_clusters_cut),
                           group = as.factor(h_clusters_cut))) +
  geom_point() + geom_density2d()
head(county_profile)
Unlike k-means' random starts, hierarchical clustering repeatedly joins nearest neighbors.
Principal Component Analysis: reduce high dimensionality; get linearly uncorrelated, orthogonal vectors.
county_pca = prcomp(county_profile[, 1:3], center = TRUE, scale. = TRUE)
plot(county_pca, type = "l")
summary(county_pca)
predict(county_pca)
library(ggfortify)
autoplot(county_pca, data = county_profile, label = TRUE, loadings = TRUE)
Support Vector Machines: partition space using linear vectors. Can be single or knotted linear splines. Data may not be linearly separable with a hard margin; you may need a soft margin. Least squares and equivalents can maximize the margin and minimize the error. Extendable as Support Vector Clustering (with kernels) to curved boundaries. (Wikipedia and https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72)
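We won't demo SVMs, but a minimal fit in R might look like this, using the e1071 package and the birth data frame (to_model) defined later in the deck; the formula is illustrative:

library(e1071)
svm_fit = svm(preterm_f ~ pnc5_f + smoker_f + mage, data = to_model,
              kernel = "radial",   # nonlinear kernel
              cost = 1)            # smaller cost = softer margin
table(predict(svm_fit, to_model), to_model$preterm_f)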
Why fish for patterns? Why clustering? Identification of hard-to-notice sub-groups for targeting interventions! Imagine including median household income (MHHI) in the county model: which low-MHHI counties perform well? Which high-MHHI counties perform low, and why? EMM and EMM-like questions. Collapse dimensions for story-telling!
Story-telling example: multi-drug ED visits.
fviz_cluster(rates_kmeans, data = rates_for_means) +
  geom_text(aes(label = rates$County, color = cluster))
K Nearest Neighbors: majority vote of your K nearest neighbors!
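A sketch with the class package, reusing the 0-1 rescaled birth data (to_model_01) built later in the deck; the split and k are illustrative:

library(class)
set.seed(1)
idx   = sample(nrow(to_model_01), size = 0.8 * nrow(to_model_01))
feats = setdiff(names(to_model_01), "preterm_f")
pred  = knn(train = to_model_01[idx, feats],
            test  = to_model_01[-idx, feats],
            cl    = to_model$preterm_f[idx],
            k     = 15)                      # majority vote of the 15 nearest neighbors
table(pred, to_model$preterm_f[-idx])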
Supervised Learning / Prediction Examples
How does our CI model do in prediction? Not well. Nearly a coin flip on preterm birth for many people. Why? Short talk about it, leading into Sara next week.
Reasonable question: Can we predict preterm birth? Not well with our CI GLM. Nearly a coin flip on preterm birth for many people. Why? Could try: naïve k-means, RPT. More from Sara next week, no pressure. ;)
library(dplyr)
to_model = births[, c("pnc5_f", "preterm_f", "smoker_f", "raceeth_f", "cores", "mage")]
to_model = na.omit(to_model)
head(to_model)

# Rescale every column to 0-1 so no single variable dominates the distance
scale_01 = function(x) {
  x = as.numeric(x)
  x = (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
  return(x)
}
to_model_01 = data.frame(lapply(to_model, FUN = scale_01))
summary(to_model_01)

# Cluster on everything except the outcome
km = kmeans(to_model_01[, names(to_model_01) != "preterm_f"], 2)
to_model$cluster = km$cluster
table(to_model$cluster, to_model$preterm_f)
prop.table(table(to_model$cluster, to_model$preterm_f), margin = 2)
# ^ meh. cluster 1 is a little more preterm.

km2 = kmeans(to_model_01[, names(to_model_01) != "preterm_f"], 10)
to_model$cluster2 = km2$cluster
prop.table(table(to_model$cluster2, to_model$preterm_f), margin = 1) * 100  # meh. cluster 1 is a little more preterm.
to_model %>% group_by(cluster2) %>%
  summarise(n = n(), pct_preterm = sum(preterm_f == "preterm", na.rm = TRUE) / n)
library(rpart)
library(rattle)  # for fancyRpartPlot()
str(to_model)
tree1 = rpart(preterm_f ~ pnc5_f + smoker_f + raceeth_f + mage, data = to_model,
              method = "class", parms = list(split = "information"),
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.0001))
# no interaction terms, though. dropped cores. need minbucket.
summary(tree1)
plot(tree1)
fancyRpartPlot(tree1)
to_model$rpart_pred = predict(tree1, to_model[, c("pnc5_f", "smoker_f", "raceeth_f", "mage")], "class")
to_model$rpart_pred
table(to_model$rpart_pred, to_model$preterm_f)
table(to_model$rpart_pred)  # DUH
Real World ML Projects: get more of a sense of these in practice. You can do these!
Predicting Tobacco Retailer Characteristics: web scraping, text mining, MTurk, and tree-based classifiers.
Background: Tobacco retailer licensing (TRL) is a main mechanism of control. TRL enables area-based policies (e.g., no retailers within certain distances of schools, parks, or certain kinds of stores; maximum density; etc.) with evidence-based effects on health. Unlike alcohol, not all states have a census of retail shops. Vaping is a new challenge here. Building and maintaining custom lists of retailers is therefore necessary.
Background: Tobacco retailer list generation and validation process. Web-scraped retailers augment the dataset after machine learning classification and imputation tasks, matching against known retailers, and validation of low-confidence data through Amazon MTurk.
Actual Data: Characteristics of the Counter Tools Tobacco Retailer Dataset. A 16,544-record subset of over 19,000 surveyed retailers with complete store type, store name, and tobacco selling status at time of analysis, representing 14 US states.
Web Scraping: Web-Scraped Stores vs. Counter Tools (CT) Dataset in Durham County (n=220 and n=218 respectively), with 52% of the scraped retailers linked to the CT dataset on individual hand review. Search results were generated centered around reverse-geocoded county subdivision centroids.
Text Mining: Term Frequency-Inverse Document Frequency (TF-IDF) is a measure of the statistical unlikeliness of words (or n-gram tokens) within a subset of a larger set, sometimes called the corpus. Feature engineering is the process of combining data into the features fed to a model; in this case, for instance, how to combine TF-IDF scores for tokens into a score for the full store name.
Text Mining in R: the tidytext package makes this very easy! Works in dplyr. Also see the tm and wordcloud packages.

library(tidytext)
library(dplyr)
library(wordcloud)
library(RColorBrewer)

#_____________________________________________
# Create n-grams ####
#_____________________________________________
create_ngram_df = function(names_df, max_ngrams) {
  results_df = data.frame(line = integer(0), token = character(0), n = integer(0),
                          stringsAsFactors = FALSE)
  for (n in 1:max_ngrams) {
    df = names_df %>% unnest_tokens(token, name, token = "ngrams", n = n)  # note: expects a tibble!
    df$n = n
    results_df = bind_rows(results_df, df)
  }
  return(results_df)
}
all_tokens = create_ngram_df(names_df, 5); head(all_tokens)
table(all_tokens$n)  # token counts
# Might be good to drop single-character words (N, E, I, 1, etc.)
# 'line' is really an id; might be nice to clarify that.
# Previous note: problem, unnests even characters within words for some reason (???)

#_____________________________________________
# Create token count df ####
#_____________________________________________
token_counts = count(all_tokens, token, sort = TRUE) %>%
  arrange(desc(nn)) %>% mutate(rank = 1:n()); head(token_counts)  # count and sort

# Tokens > 100: Cloud...
wordcloud(token_counts$token, freq = token_counts$nn, min.freq = 300,
          colors = brewer.pal(9, "Blues")[5:9], scale = c(7, 1))
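The n-gram counts above can be turned into TF-IDF scores with tidytext::bind_tf_idf(); a sketch, assuming each store name (line) is treated as its own document:

token_tf_idf = all_tokens %>%
  count(line, token, name = "term_n") %>%      # term frequency per store name
  bind_tf_idf(token, line, term_n) %>%         # tokens frequent here but rare in the corpus score high
  arrange(desc(tf_idf))
head(token_tf_idf)
# feature engineering step (not shown): aggregate token tf_idf into one score per store name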
Classify: Many tools! Multinomial logistic regression, recursive partitioning trees, random forest models. Smart adjustments to address overfitting and bad models: minimum leaf / branch size; pruning branches by hand or code. Ensemble techniques (combining multiple weaker learners): bagging (Bootstrap AGGregation) runs many small models on random samples with replacement and combines them; boosting upweights what you get wrong; random forests.
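A bagging-style ensemble sketch with the randomForest package, on the same hypothetical training frame used in the next slide's code (df_train, with str_typ as the store-type label):

library(randomForest)
set.seed(42)
rf = randomForest(str_typ ~ ., data = df_train,
                  ntree = 500,      # many trees on bootstrap samples (bagging)
                  nodesize = 5)     # minimum leaf size guards against overfitting
rf                                  # out-of-bag error estimate
varImpPlot(rf)                      # which engineered features matter most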
Classify Example: Simplified decision tree for store type classification based on modified TF-IDF scores. Note that some tree-based classifiers may run multiple trees, or have too many branches and leaves to effectively visualize.

m1 = rpart(str_typ ~ ., data = df_train, method = "class",
           control = rpart.control(minsplit = 2, cp = 0))
predictTest(m1, df_test)  # predictTest() is not base rpart; a helper defined elsewhere in the project
conf_mat_rpart = predictTest(m1, df_test, confuMat = T)
Classify in R: Multinomial & RPT store type classification models based on cleaned and tokenized store names. Includes sensitivity (Sn), specificity (Sp), and accuracy (Acc) metrics for store types and overall model sensitivity. Specificity and accuracy metrics were uninterpretable until the multiple-category model and are omitted.
Linkage (sometimes called matching*): Linkage can be considered a classification task: classifying pairs of observations as matches or not, using multiple distance measures as the features. It can be done unsupervised, or, after some unsupervised work curates a human review set, rerun as supervised with a true training and test set. Smart techniques can decrease complexity / maximize efficiency. Again, be careful with assumptions, feature engineering, and training. More on this in the next project!
Validate / MTurk: give small tasks to human reviewers. Inter-Rater Reliability (IRR) and other tools are useful for assessing survey (and surveyor!) quality.
Death by Legal Intervention / by Law Enforcement Again, linkage as a classification problem
Research Intent: Public attention is focused on Legal Intervention* deaths due to concerns over racial bias and unnecessary force. AIM 1: describe the level of agreement between NVDRS (2013-2015) and crowd-sourced data: Mapping Police Violence (MPV), the Guardian's The Counted, and the Washington Post's Fatal Force. AIM 2: describe demographics/circumstances where there is disagreement.
Pilot: First in NC VDRS data.
Data Year | NC-VDRS | MPV
2013      | 31      | 34
2014      | 26      | 39
Small numbers, but sufficient to pilot our techniques. Awaiting updated 2014 NC-VDRS data. Used the model to inform suggestions for national datasets. For more meaningful comparisons of difference, we need cross-dataset linking.
Linking: Required to assess characteristics of overlap and unknown / differently characterized relationships. Building a linking model now on NC VDRS, national The Counted, and Mapping Police Violence data (with full names). Using content-aware distance matrices, feeding an eventual supervised machine-learning / tree model. Will then apply to national VDRS data (no name available here for linkage).
Distance Matrices: For all elements of datasets A (1..n) and B (1..m), fill an A x B matrix of the pairwise distance comparisons by some method. The minimum distances along a row or column suggest linking on that method. A trivial example would be a matrix of 1s and 0s for exact string match, 0 being a perfect match (0 distance) and 1 representing not-a-match. Efficiency note: for large datasets the number of pairwise comparisons grows quickly.
Distance Matrices: Simpler but non-trivial linking methods use approximate text distance (Levenshtein distance or others) on a fully concatenated string. Example: A = "John Q Smith 2017/02/12 White Raleigh NC", B = "John P Smith 2016/12/21 White Raleigh NC". Number of substitutions, additions, and deletions to get from A to B = 5. But this doesn't take advantage of the content (e.g. date distance or place distance).
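A sketch of the naive concatenated-string approach with base R's adist(), using the two made-up records above:

a = "John Q Smith 2017/02/12 White Raleigh NC"
b = "John P Smith 2016/12/21 White Raleigh NC"
adist(a, b)   # Levenshtein edit distance (the slide counts 5 substitutions for this pair),
              # but it treats date digits like any other characters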
Distance Matrices: First build individual content-specific distance matrices for each element (example distance range in parentheses). Name: text distance (0-32). Race-eth: binary (0-1). Date: days different (0-707). City: text distance (0-19), or geospatial distance. State: binary (0-1).
Distance Matrices: Then collapse them into a single n-dimensional distance by any of a number of methods (across name, race-eth, date, city, state, etc.): an unweighted sum, a scaled sum (0-1), or a normalized sum (~-3 to ~3). These aggregate link indices perform well by themselves.
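A sketch of building content-specific distance matrices and collapsing them into a single scaled index; the records, fields, and 0-1 scaling here are illustrative, not from the VDRS data:

A = data.frame(name = c("John Q Smith", "Jane Doe"),
               date = as.Date(c("2017-02-12", "2015-06-01")),
               race = c("White", "Black"))
B = data.frame(name = c("John P Smith", "Jane A Doe"),
               date = as.Date(c("2016-12-21", "2015-06-03")),
               race = c("White", "Black"))
d_name = adist(A$name, B$name)                                    # text distance
d_date = abs(outer(as.numeric(A$date), as.numeric(B$date), "-"))  # days apart
d_race = outer(A$race, B$race, FUN = "!=") * 1                    # binary 0/1
rescale01  = function(m) (m - min(m)) / (max(m) - min(m))         # scale each matrix to 0-1
link_index = rescale01(d_name) + rescale01(d_date) + d_race       # unweighted scaled sum
which(link_index == min(link_index), arr.ind = TRUE)              # best candidate pairs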
Exploratory Tree Models: Predictive trees can also be used for categorization. They benefit from: 1. maintaining separate distance matrix information as decision nodes; 2. being able to reuse covariates if useful (natural interactions); 3. not requiring a single a priori or parameterized generalized linear relationship.
Exploratory Tree Models With violent deaths / deaths by police being relatively rare at the local level, even without decedent name in the tree model, date distance (in days) and the approximate text distance in the city name, by themselves, correctly categorize 99% of links between the Mapping Police Violence and The Counted datasets. Name is more useful when using the entire violent death dataset. Treating dates numerically instead of as strings in a concatenated ID may have application to other death linking projects.