Workshop on Latent Class Analysis: An Introduction
This workshop provides a conceptual introduction to Latent Class Analysis (LCA) by Bethany C. Bray, Ph.D., exploring the identification of latent classes and model selection. Learn about latent classes of adolescent drinking behavior, including model parameters and the inclusion of covariates. Discover the basic ideas behind LCA, such as dividing individuals into subgroups based on unobservable constructs with unknown class membership. Explore the importance of mutually exclusive and exhaustive latent classes in statistical analysis.
Presentation Transcript
1 & 1 WORKSHOP: LATENT CLASS ANALYSIS Bethany C. Bray, Ph.D. Associate Director, The Methodology Center, Penn State methodology.psu.edu bethanycbray.wordpress.com
OVERVIEW Conceptual introduction to latent class analysis (LCA) An example: Latent classes of adolescent drinking behavior Parameters estimated in LCA Model identification, model selection
OVERVIEW Including grouping variables Including covariates Question & Answer
ABBREVIATIONS
LCA = latent class analysis: static, categorical latent variable measured with categorical items
LPA = latent profile analysis: static, categorical latent variable measured with continuous items
LTA = latent transition analysis: dynamic, categorical latent variable
THE BASIC IDEAS Individuals can be divided into subgroups based on unobservable construct The construct of interest is the latent variable Subgroups are called latent classes True class membership is unknown Unknown due to measurement error Measurement of the construct is typically based on several categorical indicators Latent classes are mutually exclusive & exhaustive
ESTIMATED PARAMETERS Latent class prevalences e.g., probability of membership in HEAVY DRINKERS latent class Item-response probabilities e.g., probability of reporting 5+ DRINKS IN THE PAST 2 WEEKS given membership in HEAVY DRINKERS latent class
LATENT CLASSES OF ADOLESCENT DRINKING BEHAVIOR
DRINKING IN 12TH GRADE Data from the 2004 cohort of the Monitoring the Future public release: n = 2490 high school seniors who answered at least one question about alcohol use (48% boys, 52% girls). Goals of the study: describe alcohol use behavior among U.S. 12th graders; examine gender differences in measurement and behavior; predict behavior from skipping school and grades
DRINKING IN 12TH GRADE Seven indicators of drinking behavior

Item                         Proportion Yes
Lifetime alcohol use         82%
Past-year alcohol use        73%
Past-month alcohol use       50%
Lifetime drunkenness         57%
Past-year drunkenness        49%
Past-month drunkenness       29%
5+ drinks in past 2 weeks    26%
WE WILL USE LCA TO Identify and describe underlying classes of drinking behavior in U.S. 12th grade students
THE 5-CLASS MODEL What would you name these 5 classes? Probability of Yes response

Item                    Class 1 (18%)  Class 2 (22%)  Class 3 (9%)  Class 4 (17%)  Class 5 (34%)
Lifetime alcohol use    .00            1.00           1.00          1.00           1.00
Past-year alcohol       .00            .61            1.00          1.00           1.00
Past-month alcohol      .00            .00            1.00          .39            1.00
Lifetime drunk          .00            .24            .29           1.00           1.00
Past-year drunk         .00            .00            .00           1.00           1.00
Past-month drunk        .00            .00            .00           .00            .92
5+ drinks past 2 wk     .00            .00            .16           .00            .73
THE 5-CLASS MODEL What would you name these 5 classes? Probability of Yes response

Item                    Non-Drinkers  Experimenters  Light Drinkers  Past Partiers  Heavy Drinkers
Lifetime alcohol use    .00           1.00           1.00            1.00           1.00
Past-year alcohol       .00           .61            1.00            1.00           1.00
Past-month alcohol      .00           .00            1.00            .39            1.00
Lifetime drunk          .00           .24            .29             1.00           1.00
Past-year drunk         .00           .00            .00             1.00           1.00
Past-month drunk        .00           .00            .00             .00            .92
5+ drinks past 2 wk     .00           .00            .16             .00            .73
GRAPHICAL REPRESENTATION [Figure: the latent variable Drinking Classes with its indicators, including Lifetime Use, Past-Year Use, and 5+ Drinks]
LATENT CLASS NOTATION Y represents the vector of all possible response patterns; y represents a particular response pattern. Example: y = (Y, Y, N, N, N, N, N). X represents the vector of all covariates of interest; x represents a particular covariate
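LATENT CLASS NOTATION The latent class model can be expressed as
\[ P(\mathbf{Y}_i = \mathbf{y} \mid \mathbf{X}_i = \mathbf{x}_i) = \sum_{c=1}^{K} \gamma_c(\mathbf{x}_i) \prod_{m=1}^{M} \prod_{r_m=1}^{R_m} \rho_{m r_m \mid c}^{I(y_m = r_m)} \]
where
\[ \gamma_c(\mathbf{x}_i) = P(C_i = c \mid \mathbf{X}_i = \mathbf{x}_i) = \frac{\exp[\beta_{0c} + \beta_{1c} x_{i1} + \cdots + \beta_{pc} x_{ip}]}{1 + \sum_{j=1}^{K-1} \exp[\beta_{0j} + \beta_{1j} x_{i1} + \cdots + \beta_{pj} x_{ip}]} \]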
LATENT CLASS NOTATION with c = 1, 2, ..., K latent classes and m = 1, 2, ..., M indicators, each with r_m = 1, 2, ..., R_m response options.
$\gamma_c$ = probability of membership in latent class c (latent class membership probabilities)
$\rho_{m r_m \mid c}$ = probability of response r_m to indicator m, conditional on membership in latent class c (item-response probabilities)
The exponent $I(y_m = r_m)$ is an indicator function: it equals 1 when the response to indicator m is r_m and 0 otherwise.
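As a minimal sketch of how the two parameter sets combine, the Python snippet below plugs the 5-class estimates from the drinking example into the latent class model without covariates. The function and variable names are illustrative only and do not come from any LCA software; the posterior class probabilities are an extra step obtained by Bayes' rule.

```python
from itertools import product

# Class prevalences (gamma) from the 5-class drinking example.
gamma = [0.18, 0.22, 0.09, 0.17, 0.34]

# Item-response probabilities (rho): P(Yes | class), one row per item.
rho_yes = [
    [0.00, 1.00, 1.00, 1.00, 1.00],  # lifetime alcohol use
    [0.00, 0.61, 1.00, 1.00, 1.00],  # past-year alcohol use
    [0.00, 0.00, 1.00, 0.39, 1.00],  # past-month alcohol use
    [0.00, 0.24, 0.29, 1.00, 1.00],  # lifetime drunkenness
    [0.00, 0.00, 0.00, 1.00, 1.00],  # past-year drunkenness
    [0.00, 0.00, 0.00, 0.00, 0.92],  # past-month drunkenness
    [0.00, 0.00, 0.16, 0.00, 0.73],  # 5+ drinks in past 2 weeks
]

def joint_by_class(y):
    """Return gamma_c * prod_m rho for each class c, for a 0/1 response pattern y."""
    out = []
    for c, prevalence in enumerate(gamma):
        p = prevalence
        for m, answer in enumerate(y):
            p_yes = rho_yes[m][c]
            p *= p_yes if answer == 1 else 1.0 - p_yes
        out.append(p)
    return out

def pattern_probability(y):
    """P(Y = y): sum of the class-specific joint probabilities."""
    return sum(joint_by_class(y))

def posterior(y):
    """P(class | y) by Bayes' rule; undefined for patterns with probability 0."""
    joint = joint_by_class(y)
    total = sum(joint)
    return [j / total for j in joint]

y = (1, 1, 0, 0, 0, 0, 0)  # the example pattern (Y, Y, N, N, N, N, N)
print(pattern_probability(y))
print(posterior(y))

# The probabilities of all 2^7 = 128 response patterns add up to 1.
total = sum(pattern_probability(p) for p in product((0, 1), repeat=7))
print(total)
```

With these rounded estimates, only the second class (later named Experimenters) gives the example pattern a nonzero probability, so its posterior probability is 1.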
ITEM-RESPONSE PROBABILITIES ρ parameters express the relation between the discrete latent variable in an LCA and the observed indicator variables. Similar conceptually to factor loadings. Basis for interpretation of latent classes. Are probabilities (between 0 and 1)
RHO PARAMETERS 0 ≤ ρ ≤ 1. When latent variable and manifest variable completely correspond, ρ = 0 OR ρ = 1. When latent variable does not at all predict manifest variable, ρ equals the marginal probability of the response for all classes. So, if we are trying to measure a latent variable, what kind of ρs do we like?
WHAT DO WE LOOK FOR? Homogeneity: degree to which ρ parameters for a particular latent class are close to 0 or 1. Latent class separation: degree to which latent classes can clearly be distinguished from each other
WHAT DO WE LOOK FOR? High homogeneity + High latent class separation. Probability of correctly performing practical task

          Latent Class 1  Latent Class 2
Task 1    .10             .91
Task 2    .15             .90
Task 3    .05             .89
Task 4    .10             .95
Task 5    .12             .90
WHAT DO WE LOOK FOR? High homogeneity + Lower latent class separation. Probability of correctly performing practical task

          Latent Class 1  Latent Class 2
Task 1    .80             .91
Task 2    .82             .90
Task 3    .81             .89
Task 4    .80             .95
Task 5    .84             .90
MODEL IDENTIFICATION What is maximum likelihood estimation? The likelihood function expresses the likelihood of the observed data, given the model being fit, as a function of all possible parameter estimates. Winning parameter estimates (if identified): the set that maximizes the likelihood
DEALING WITH MODEL ID Many estimation procedures require initial values for the parameters to kick off the estimation procedure. If different starting values produce very different estimates and different G2 values, the model is not well-identified. Run many different sets of starting values, say 100 or more, and look at the distribution of G2 values
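The starting-value check can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the procedure of any particular LCA program: it assumes binary indicators and an EM algorithm, uses simulated (hypothetical) data, and tracks the log-likelihood rather than G2 (the two order the solutions identically, since G2 is a decreasing function of the log-likelihood for fixed data). Only 20 starts are run to keep the sketch fast.

```python
import math
import random
from collections import Counter

def fit_lca(counts, n_classes, seed, n_iter=100):
    """Run EM from one random start on pattern counts; return the log-likelihood
    evaluated at the parameters entering the final iteration."""
    rng = random.Random(seed)
    n_items = len(next(iter(counts)))
    n_total = sum(counts.values())
    gamma = [1.0 / n_classes] * n_classes
    rho = [[rng.uniform(0.1, 0.9) for _ in range(n_items)] for _ in range(n_classes)]
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: accumulate posterior-weighted counts over response patterns.
        loglik = 0.0
        class_weight = [0.0] * n_classes
        yes_weight = [[0.0] * n_items for _ in range(n_classes)]
        for y, freq in counts.items():
            joint = []
            for c in range(n_classes):
                p = gamma[c]
                for m in range(n_items):
                    p *= rho[c][m] if y[m] == 1 else 1.0 - rho[c][m]
                joint.append(p)
            total = sum(joint)
            loglik += freq * math.log(total)
            for c in range(n_classes):
                w = freq * joint[c] / total
                class_weight[c] += w
                for m in range(n_items):
                    if y[m] == 1:
                        yes_weight[c][m] += w
        # M-step: update prevalences and item-response probabilities.
        for c in range(n_classes):
            gamma[c] = class_weight[c] / n_total
            for m in range(n_items):
                r = yes_weight[c][m] / max(class_weight[c], 1e-12)
                rho[c][m] = min(max(r, 1e-6), 1.0 - 1e-6)  # keep away from 0 and 1
    return loglik

def simulate(n, seed=0):
    """Hypothetical data: two well-separated classes, five binary items."""
    rng = random.Random(seed)
    true_rho = [[0.9] * 5, [0.1] * 5]
    data = []
    for _ in range(n):
        c = 0 if rng.random() < 0.4 else 1
        data.append(tuple(1 if rng.random() < true_rho[c][m] else 0 for m in range(5)))
    return Counter(data)

counts = simulate(500)
logliks = [fit_lca(counts, n_classes=2, seed=s) for s in range(20)]
best = max(logliks)
share_at_best = sum(abs(ll - best) < 1e-3 for ll in logliks) / len(logliks)
print(round(best, 2), share_at_best)
```

If most starts land on the same best value, that is reassuring; a wide spread of values across starts would suggest an identification problem.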
ABSOLUTE VS. RELATIVE Absolute model fit refers to whether a specified LCA model provides an adequate representation of the data (adequate according to some test statistic). To test absolute model fit, we need the distribution of the test statistic under the null hypothesis. H0: the specified model fits the data
COMMON TEST STATISTIC: G2 As in many contingency table methods, LCA computes predicted response pattern proportions according to the model and estimated parameters These predicted response pattern proportions are compared to the observed response pattern proportions This comparison is expressed in the likelihood ratio statistic G2
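For concreteness, the likelihood-ratio statistic is usually written as $G^2 = 2 \sum_{\mathbf{y}} f_{\mathbf{y}} \ln(f_{\mathbf{y}} / \hat{f}_{\mathbf{y}})$, where $f_{\mathbf{y}}$ is the observed count and $\hat{f}_{\mathbf{y}}$ the model-predicted count for response pattern y. The Python sketch below computes it for a toy example; the counts are hypothetical and the function name is illustrative.

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio statistic G2 = 2 * sum(obs * ln(obs / exp)) over response patterns."""
    g2 = 0.0
    for pattern, obs in observed.items():
        if obs > 0:  # patterns with zero observed count contribute nothing
            g2 += obs * math.log(obs / expected[pattern])
    return 2.0 * g2

# Hypothetical counts for two binary items (four response patterns), n = 100.
observed = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}

# Expected counts under a 1-class model: items independent, each with P(Yes) = .5.
expected = {pattern: 100 * 0.5 * 0.5 for pattern in observed}

print(round(g_squared(observed, expected), 2))
# A model that reproduces the observed counts exactly gives G2 = 0.
print(g_squared(observed, dict(observed)))
```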
ISSUES WITH THIS APPROACH There are issues with this approach to model selection in LCA, and especially in LTA When data are sparse, G2 not distributed as chi-square This makes it hard to test the fit of model
DIFFERENCE IN G2 VS. BLRT It is tempting to calculate the G2 difference for two competing models For example, 3 vs. 4 classes But test is not appropriate because we do not know the correct reference distribution for the test One solution: bootstrap the G2 difference H0: 3 class model sufficient H1: 4 classes required
ABSOLUTE VS. RELATIVE Relative model fit refers to deciding whether Model A or Model B is better AIC, BIC good tools for relative model fit These are information criteria (penalized log-likelihood) Optimize balance between fit and parsimony Usually scaled so that smaller AIC, BIC is better
AIC AND BIC
\[ \mathrm{AIC} = G^2 + 2p \qquad \mathrm{BIC} = G^2 + [\log(n)]\,p \]
p = number of parameters estimated in the model; n = sample size
SELECTING THE # OF CLASSES

Classes  G2    df   AIC   BIC   BLRT
1        9510  120  9524  9564  .01
2        3019  112  3049  3137  .01
3        911   104  957   1091  .01
4        209   96   271   452   .01
5        4     88   81    308   .08
6        4     80   98    372   N/A
7        3     72   113   434   N/A

BLRT not significant for 5-class model, indicating 6 classes are not needed
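The AIC and BIC values for the 1- to 4-class models can be reproduced from G2 with a few lines of Python. This sketch assumes the usual parameter count for binary indicators, (K - 1) class prevalences plus K x M item-response probabilities; that count is not stated on the slides, and the function names are illustrative. Because the reported G2 values are rounded, the BIC values match only to within about one unit.

```python
import math

N = 2490      # sample size from the drinking example
N_ITEMS = 7   # binary indicators

def n_parameters(n_classes, n_items=N_ITEMS):
    # Assumed count: (K - 1) free prevalences plus K * M item-response probabilities.
    return (n_classes - 1) + n_classes * n_items

def aic(g2, p):
    return g2 + 2 * p

def bic(g2, p, n=N):
    return g2 + math.log(n) * p

# Rounded G2 values for the 1- to 4-class models in the drinking example.
g2_by_classes = {1: 9510, 2: 3019, 3: 911, 4: 209}

for k, g2 in g2_by_classes.items():
    p = n_parameters(k)
    print(k, p, aic(g2, p), round(bic(g2, p), 1))
```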
INCLUDING GROUPING VARIABLES
MULTIPLE-GROUPS LCA Two reasons to include a grouping variable: To explore measurement invariance e.g., Do the items map onto the latent construct in the same way for males and females? To divide sample into groups for comparison purposes e.g., How does the probability of membership in the HEAVY DRINKERS latent class differ in the experimental and control groups?
MULTIPLE-GROUPS LCA ρ parameters may vary as a function of the grouping variable: allows test of measurement invariance. γ parameters may vary as a function of the grouping variable: allows comparison of latent class prevalences
WE WILL USE LCA TO Identify and describe underlying classes of drinking behavior in U.S. 12th grade students Include a grouping variable (i.e., sex) Test for measurement invariance across males and females Examine sex differences in prevalence of behavior types
MEASUREMENT INVARIANCE Models with ρ parameters free and constrained equal across groups are statistically nested. Free parameters allow measurement to differ across groups; constrained parameters equate corresponding measurement parameters across groups. In general, two models are nested if the simpler model can be arrived at by imposing parameter restrictions on the more complex model.
TESTING MI H0: Simpler model is adequate H1: Simpler model is not adequate Often, we hope to fail to reject the null hypothesis If non-significant, strong support for measurement invariance Our result not significant G2=18 with 35 df, p>.05 Measurement invariance is plausible Keep parameter restrictions
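The p-value behind this result can be checked with a short Python sketch. It assumes the G2 difference between the nested models is referred to a chi-square distribution with 35 df, and it implements the upper-tail probability with the standard finite-series formulas for integer df so that only the standard library is needed. The function name `chi2_sf` is illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    z = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # standard normal density at chi
    if df % 2 == 1:
        # Odd df: 2*Q(chi) + 2*Z(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1))
        total, term = 0.0, chi
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += term
        return math.erfc(chi / math.sqrt(2.0)) + 2.0 * z * total
    # Even df: sqrt(2*pi)*Z(chi) * (1 + sum_{r=1}^{(df-2)/2} chi^(2r) / (2*4*...*(2r)))
    total, term = 1.0, 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        total += term
    return math.sqrt(2.0 * math.pi) * z * total

# G2 difference and df reported for the measurement invariance test.
p_value = chi2_sf(18, 35)
print(round(p_value, 3))
print(p_value > 0.05)  # consistent with the reported non-significant result
```

The 35 df match the number of item-response probabilities constrained equal across the two groups (7 items x 5 classes).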
SEX DIFFERENCES Sex differences in probabilities of membership in drinking classes: γ parameters

Class            Males  Females
Nondrinkers      18%    18%
Experimenters    22%    23%
Light Drinkers   9%     9%
Past Partiers    13%    21%
Heavy Drinkers   38%    28%
INCLUDING COVARIATES
WE WILL USE LCA TO Identify and describe underlying classes of drinking behavior in U.S. 12th grade students Include a grouping variable (i.e., sex) Test for measurement invariance across males and females Examine sex differences in prevalence of behavior types Explore whether skipping school and grades predict drinking class membership
GRAPHICAL REPRESENTATION [Figure: the covariate Skipping School predicting the latent variable Drinking Classes, with its indicators, including Lifetime Use, Past-Year Use, and 5+ Drinks]
LCA WITH COVARIATES Regress latent class variable on predictors: logistic regression with a latent outcome. β parameters express relation between covariates and class membership
INTERPRETING BETA PARAMETERS
\[ \log\left( \frac{\gamma_1(x_i)}{\gamma_2(x_i)} \right) = \beta_{01} + \beta_{11} x_i \]
$\beta_{11}$ is a logistic regression coefficient influencing the log-odds that an individual falls into Class 1 relative to Class 2
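To make the interpretation concrete, the Python sketch below turns β values into class membership probabilities for a 2-class model with one covariate, treating the last class as the reference class. The coefficient values are hypothetical and the function name is illustrative.

```python
import math

def class_probabilities(x, betas):
    """betas: list of (intercept, slope) pairs for classes 1..K-1; class K is the reference."""
    scores = [math.exp(b0 + b1 * x) for b0, b1 in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical coefficients for a 2-class model with one covariate.
beta_01, beta_11 = -0.5, 0.8

log_odds = {}
for x in (0, 1):
    g1, g2 = class_probabilities(x, [(beta_01, beta_11)])
    log_odds[x] = math.log(g1 / g2)
    print(x, round(g1, 3), round(g2, 3), round(log_odds[x], 3))

# A one-unit increase in x shifts the log-odds of Class 1 vs. Class 2 by beta_11,
# so exp(beta_11) is the corresponding odds ratio.
print(round(log_odds[1] - log_odds[0], 3), round(math.exp(beta_11), 3))
```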