Conditional Random Fields
Conditional Random Fields (CRF) offer a generalized approach beyond the limitations of Hidden Markov Models (HMM). They address scenarios where HMMs fall short due to arbitrary dependencies in state transitions and observations. CRFs find applications in various fields like bioinformatics, natural language processing, speech recognition, and malware detection. This overview delves into the structure, limitations, extensions, and applications of HMMs and CRFs. It explores how CRFs provide a broader perspective by relaxing the strong independence assumptions of HMMs, allowing for modeling complex interactions among variables efficiently.
Presentation Transcript
Conditional Random Fields
Mark Stamp
Intro
Hidden Markov Model (HMM) is used in:
o Bioinformatics
o Natural language processing
o Speech recognition
o Malware detection/analysis
And many, many other applications
Bottom line: HMMs are very useful
o Everybody knows that!
Generic View of HMM
(Figure: hidden state sequence X_0 → X_1 → ... → X_T, each X_i emitting an observation O_i.)
Matrix A is the state transition matrix, defining a Markov process
o Implies that X_i only depends on X_{i-1}
Matrix B holds the observation probabilities
o Probability of O_i only depends on X_i
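To make the two assumptions concrete, here is a minimal generation sketch in Python; the matrices A and B and the initial distribution pi are hypothetical values, not taken from the slides.

```python
import numpy as np

# Hypothetical two-state HMM with three observation symbols.
A  = np.array([[0.7, 0.3],            # P(X_i | X_{i-1}); rows index the previous state
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],       # P(O_i | X_i); rows index the current state
               [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])             # initial state distribution

def sample_hmm(T, seed=0):
    """Generate T (state, observation) pairs from the HMM."""
    rng = np.random.default_rng(seed)
    x = rng.choice(2, p=pi)           # X_0 drawn from pi
    pairs = []
    for _ in range(T):
        o = rng.choice(3, p=B[x])     # O_i depends only on X_i
        pairs.append((x, o))
        x = rng.choice(2, p=A[x])     # X_{i+1} depends only on X_i
    return pairs

print(sample_hmm(5))
```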
HMM Limitations
Assumptions:
o The observation depends only on the current state
o The current state depends only on the previous state
o These are strong independence assumptions
Often such independence is not realistic
o An observation can depend on several states
o And/or the current state might depend on several previous states
HMMs
Within the HMM framework, we can:
Increase N, the number of hidden states
And/or use a higher-order Markov process
o Order 2 means the hidden state depends on the 2 immediately previous hidden states
o Order > 1 relaxes the independence constraint
More hidden states means more breadth; higher order means increased depth
Beyond HMMs
HMMs do not fit some situations
o For example, arbitrary dependencies on state transitions and/or observations
Here, we focus on a generalization of the HMM
o Conditional Random Fields (CRF)
There are other generalizations
o We mention a few
Mostly focused on the big picture
HMM Revisited
The HMM diagram illustrates the graph structure of an HMM
o That is, an HMM is a directed line graph
Can other types of graphs work? Would they make sense?
MEMM
In an HMM, the observation sequence O is related to the states X via the B matrix
o And O affects X in training, not scoring
o We might want X to depend on O in scoring
Maximum Entropy Markov Model
o State X_i is a function of X_{i-1} and O_i
MEMM is focused on HMM Problem 2
o That is, determining the (hidden) states
Generic MEMM
How does this differ from an HMM?
o State X_i is a function of X_{i-1} and O_i
o We cannot generate O_i using the MEMM, while we can do so using an HMM
MEMM vs HMM
HMM: find the best state sequence X
o That is, solve HMM Problem 2
o Solution is the X that maximizes P(X|O) = Π_i P(O_i | X_i) P(X_i | X_{i-1})
MEMM: find the best state sequence X
o Solution is the X that maximizes P(X|O) = Π_i P(X_i | X_{i-1}, O_i),
where P(x | y, o) = 1/Z(o, y) exp( Σ_j w_j f_j(o, x) )
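The per-step MEMM probability can be illustrated with a small numeric sketch; the states, feature functions f_j, and weights w_j below are hypothetical, chosen only to show how the exponentiated weighted feature sum is normalized by Z(o, y).

```python
import math

# Hypothetical feature functions f_j(o, x) and weights w_j.
features = [
    lambda o, x: 1.0 if o == "capitalized" and x == "NAME" else 0.0,
    lambda o, x: 1.0 if x == "OTHER" else 0.0,
]
weights = [1.5, 0.2]
states = ["NAME", "OTHER"]

def memm_prob(x, y, o):
    """P(x | y, o) = exp(sum_j w_j * f_j(o, x)) / Z(o, y)."""
    score = lambda s: math.exp(sum(w * f(o, s) for w, f in zip(weights, features)))
    z = sum(score(s) for s in states)      # Z(o, y): normalize over successors of y
    return score(x) / z

print(round(memm_prob("NAME", "OTHER", "capitalized"), 3))   # ~0.786
```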
MEMM vs HMM
Note the Σ_j w_j f_j(o, x) in the MEMM probability
o This sum is over the entire sequence
o Any useful feature of the input observation can affect the probability
MEMM is more general in this sense
o As compared to the HMM, that is
But the MEMM creates a new problem
o A problem that does not occur in the HMM
Label Bias Problem
MEMM uses dynamic programming (DP)
o Also known as the Viterbi algorithm
HMM (Problem 2) does not use DP
o The HMM α-pass uses a sum, DP uses a max
In MEMM, probability is conserved
o Probability must be split between successor states (not so in HMM)
o Is this good or bad?
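The sum-versus-max distinction can be shown side by side; the two-state model below is hypothetical and reuses the standard α-pass and Viterbi recursions.

```python
import numpy as np

# Hypothetical HMM: 2 hidden states, 3 observation symbols.
A   = np.array([[0.7, 0.3], [0.4, 0.6]])              # state transitions
B   = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])    # observation probabilities
pi  = np.array([0.6, 0.4])                             # initial distribution
obs = [0, 2, 1]                                        # an observation sequence

alpha = pi * B[:, obs[0]]      # alpha-pass: SUM over predecessor states
delta = pi * B[:, obs[0]]      # Viterbi/DP:  MAX over predecessor states
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
    delta = (delta[:, None] * A).max(axis=0) * B[:, o]

print("P(O | model), from the sum:", alpha.sum())
print("best single-path score, from the max:", delta.max())
```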
Label Bias Problem
Only one possible successor in the MEMM?
o All probability is passed along to that state
o In effect, the observation is ignored
o More generally, with one dominant successor, the observation doesn't matter much
CRF solves the label bias problem of the MEMM
o So, the observation matters
We won't go into details here
Label Bias Problem
Example
o Hot (H), Cold (C), and Medium (M) states
In the M state
o The observation does little (MEMM)
o The observation can matter more (HMM)
(Figure: state transition diagram over H, M, and C, with edge probabilities 0.7, 0.99, 0.3, 0.01, 0.3, 0.1, 0.6.)
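A tiny sketch of why a dominant successor makes the observation nearly irrelevant under local (MEMM-style) normalization; the transition and observation values are assumed for illustration and are not read off the figure.

```python
# From some state, successor probabilities before seeing the observation:
p_next = {"H": 0.99, "C": 0.01}
# Likelihood of the current observation under each successor state:
p_obs = {"H": 0.2, "C": 0.8}          # the observation actually favors C

# MEMM-style local normalization: combine, then renormalize over successors.
unnorm = {s: p_next[s] * p_obs[s] for s in p_next}
z = sum(unnorm.values())
posterior = {s: round(v / z, 3) for s, v in unnorm.items()}

print(posterior)   # {'H': 0.961, 'C': 0.039}: H still dominates despite the observation
```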
Conditional Random Fields
CRFs are a generalization of HMMs
Generalization to other graphs
o Undirected graphs
The Linear Chain CRF is the simplest case
But CRFs also generalize to arbitrary (undirected) graphs
o That is, they can have arbitrary dependencies between states and observations
Simplest Case of CRF
How is it different from the HMM/MEMM?
More things can depend on each other
o The case illustrated here is a linear chain CRF
o More general graph structures can work
Another View
Next, consider a deeper connection between HMM and CRF
But first, we need some background
o Naïve Bayes
o Logistic regression
These topics are very useful in their own right
o So wake up and pay attention!
What Are We Doing Here?
Recall, O is the observation and X is the state
Ideally, we want to model P(X,O)
o All possible interactions of Xs and Os
But P(X,O) involves lots of parameters
o Like the complete covariance matrix
o Lots of data needed for training
o And too much work to train
Generally, this problem is intractable
What to Do?
Simplify, simplify, simplify
o Need to make the problem tractable
o And then hope we get decent results
In Naïve Bayes, assume independence
In regression analysis, try to fit a specific function to the data
Eventually, we'll see this is relevant
o With respect to HMMs and CRFs, that is
Naïve Bayes
Why is it "naïve"?
Assume the features in X are independent
o Probably not true, but it simplifies things
o And often works well in practice
Why does independence simplify?
o Recall covariance: for X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), if the means are 0, then Cov(X,Y) = (x_1 y_1 + ... + x_n y_n) / n
Naïve Bayes
Independence implies the covariance is 0
If so, only the diagonal elements of the covariance matrix are non-zero
We only need means and variances
o Not the entire covariance matrix
o Far fewer parameters to estimate
o And a lot less data needed for training
Bottom line: a practical solution
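A quick numeric sketch (randomly generated, hypothetical data) showing that independent features leave only the diagonal of the covariance matrix, so means and variances suffice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independently generated features with variances 1 and 9.
data = np.column_stack([rng.normal(0, 1, 10_000),
                        rng.normal(0, 3, 10_000)])

# Off-diagonal entries are near 0; only the diagonal (the variances) matters.
print(np.cov(data, rowvar=False).round(2))
```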
Naïve Bayes
Why is it "Bayes"? Because it uses Bayes' Theorem:
o That is, P(A|B) = P(B|A) P(A) / P(B)
o Or, P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|A^c) P(A^c))
o More generally, P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j), where the A_j form a partition
Bayes Formula Example
Consider a test for an illegal drug:
o If you use the drug, the test is 98% positive (TPR = sensitivity)
o If you don't use it, the test is 99% negative (TNR = specificity)
o In the overall population, 5/1000 use the drug
Let A = uses the drug, B = tests positive
Then P(A|B) = (.98)(.005) / ((.98)(.005) + (.01)(.995)) = 0.329966, or about 33%
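The same calculation as a short sketch, using only the numbers from the slide.

```python
# P(positive | user), P(negative | non-user), P(user)
sensitivity, specificity, prevalence = 0.98, 0.99, 0.005

# Bayes' theorem: P(user | positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_user_given_positive = sensitivity * prevalence / p_positive
print(round(p_user_given_positive, 6))   # 0.329966
```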
Naïve Bayes
Why is this relevant?
Suppose we classify based on an observation O
o Compute P(X|O) = P(O|X) P(X) / P(O)
o Where X is one possible class (state)
o And P(O|X) is easy to compute
Repeat for all possible classes X
o The biggest probability gives the most likely class X
o We can ignore P(O) since it's constant
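A minimal classification sketch along these lines; the class priors and per-class likelihoods are hypothetical stand-ins for values estimated from training data.

```python
# Hypothetical priors P(X) and likelihoods P(O|X) for a single observation O.
priors      = {"malware": 0.3, "benign": 0.7}
likelihoods = {"malware": 0.08, "benign": 0.02}    # assumed easy to compute per class

# P(X|O) is proportional to P(O|X) P(X); P(O) is constant over classes, so ignore it.
scores = {x: likelihoods[x] * priors[x] for x in priors}
print(scores, "-> most likely class:", max(scores, key=scores.get))
```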
Regression Analysis
Generically, a method for measuring the relationship between 2 or more things
o E.g., house price vs size
First, we consider linear regression
o Since it's the simplest case
Then logistic regression
o More complicated, but often more useful
o Better for binary classifiers
Linear Regression
(Scatter plot: x-axis is house size, y-axis is sale price.)
Suppose x is the house size in ft²
o It could be a vector x of observations instead
And y is the sale price
Points represent recent sales results
How do we use this info?
o Given a house to sell
o Given a recent sale
Linear Regression
(Scatter plot with the best-fit line drawn through the points.)
The blue line is the best fit
o Minimum squared error (linear least squares)
o Measured as the vertical distance from each point to the line
What good is it?
o Given a new point, how well does it fit in?
o Given x, predict y
o This sounds familiar...
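A minimal least-squares sketch with made-up size/price data, using numpy's polynomial fit to get the slope and intercept.

```python
import numpy as np

# Hypothetical recent sales: (size in square feet, price in $1000s).
size  = np.array([1100, 1400, 1600, 1900, 2300, 2600])
price = np.array([199,  245,  280,  324,  390,  438])

# Fit price = b1 * size + b0 by minimizing the squared vertical error.
b1, b0 = np.polyfit(size, price, deg=1)
print(f"predicted price for a 2000 ft^2 house: {b1 * 2000 + b0:.0f} (in $1000s)")
```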
Regression Analysis
In many problems, there are only 2 outcomes
o Binary classifier, e.g., malware vs benign
o Or malware of a specific type vs other
Then x is an observation (vector)
But each y is either 0 or 1
o Linear regression is not so good here (why?)
o A better idea: logistic regression
o Fit a logistic function instead of a line
Binary Classification
Suppose we compute a score for many files
The score is on the x-axis, the output on the y-axis
o 1 if the file is malware
o 0 if the file is other
Linear regression is not very useful here
Binary Classification
Instead of a line, use a function better suited to 0/1 data
The logistic function
o Transition from 0 to 1 is more abrupt than a line
o Why is this better?
o Less wasted time between 0 and 1
Logistic Regression
The logistic function
o F(t) = 1 / (1 + e^(-t))
o Input: -∞ to ∞
o Output: 0 to 1, which can be interpreted as a probability P(t)
Here, t = b_0 + b_1 x
o Or t = b_0 + b_1 x_1 + ... + b_m x_m
o I.e., x is the observation
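A direct sketch of the logistic function, showing how any real input is squashed into (0, 1).

```python
import math

def logistic(t):
    """F(t) = 1 / (1 + e^(-t)), mapping (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

for t in (-5, -1, 0, 1, 5):
    print(t, round(logistic(t), 3))   # 0.007, 0.269, 0.5, 0.731, 0.993
```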
Logistic Regression
Instead of fitting a line to the data
o Fit a logistic function to the data
And instead of least squares error
o Measure deviance, the distance from the ideal case (where "ideal" means the saturated model)
An iterative process finds the parameters
o Find the best-fit F(t) using the data points
o More complex training than the linear case
o But better suited to binary classification
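A minimal fit sketch with made-up scores and 0/1 labels; scikit-learn's LogisticRegression is used here only as one convenient iterative solver, not as the specific tool assumed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical file scores (x) and labels (y: 1 = malware, 0 = other).
x = np.array([0.4, 0.9, 1.3, 2.0, 2.6, 3.1, 3.8, 4.5]).reshape(-1, 1)
y = np.array([0,   0,   0,   0,   1,   1,   1,   1])

model = LogisticRegression().fit(x, y)       # iteratively fits b_0 and b_1
print(model.predict_proba([[2.4]])[0, 1])    # estimated P(malware) for a new score
```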
Conditional Probability
Recall, we would like to model P(X,O)
o Observe that P(X,O) includes all relationships between Xs and Os
o Too complex, too many parameters
So we settle for P(X|O)
o A lot fewer parameters
o The problem is tractable
o Works well in practice
Generative vs Discriminative
We are interested in P(X|O)
Generative models
o Focus on P(O|X) P(X)
o From Naïve Bayes (without the denominator)
Discriminative models
o Focus directly on P(X|O)
o Like logistic regression
Tradeoffs?
Generative vs Discriminative
Naïve Bayes is a generative model
o Since it uses P(O|X) P(X)
o Good in the unsupervised case, with unlabeled data
Logistic regression is discriminative
o Deals directly with P(X|O)
o No need to expend effort modeling O
o So, more freedom to model X
o Unsupervised use is an active area of research
HMM and Naïve Bayes
Connection(s) between NB and HMM?
Recall HMM Problem 2
o For a given O, find the best (hidden) state sequence X
We use P(X|O) to determine the best X
The alpha pass is used in solving Problem 2
Looking closely at the alpha pass
o It is based on computing P(O|X) P(X)
o With probabilities from the model
HMM and Naïve Bayes
Connection(s) between NB and HMM?
An HMM can be viewed as a sequential version of Naïve Bayes
o Classifications over a series of observations
o The HMM uses info about state transitions
Conversely, Naïve Bayes is a static version of an HMM
Bottom line: the HMM is a generative model
CRF and Logistic Regression
Connection between CRF and regression?
A linear chain CRF is a sequential version of logistic regression
o Classification over a series of observations
o The CRF uses info about state transitions
Conversely, logistic regression can be viewed as a static (linear chain) CRF
Bottom line: the CRF is a discriminative model
Generative vs Discriminative
Naïve Bayes and logistic regression
o A generative-discriminative pair
HMM and (linear chain) CRF
o Another generative-discriminative pair
o Sequential versions of those above
Are there other such pairs?
o Yes, based on further generalizations
o What's more general than sequential?
General CRF
We can define a CRF on any (undirected) graph structure
o Not just a linear chain
In a general CRF, training and scoring are not as efficient, so
o The linear chain CRF is used most in practice
In special cases, it might be worth considering a more general CRF
Generative Directed Model
We can view an HMM as defined on a (directed) line graph
We could consider a similar process on more general (directed) graph structures
This more general case is known as a generative directed model
The algorithms (training, scoring, etc.) are not as efficient in the more general case
Generative-Discriminative Pair
Generative directed model
o As the name implies, a generative model
General CRF
o A discriminative model
So, this gives us a 3rd generative-discriminative pair
Summary on the next slide
HCRF
Yes, you guessed it
o Hidden Conditional Random Field
So, what is hidden?
To be continued...
Algorithms
Where are the algorithms?
o This is a CS class, after all
Yes, CRF algorithms do exist
o Omitted, since a lot of background is needed
o It would take too long to cover it all
o We've got better things to do
So, just use existing implementations
o It's your lucky day
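As one example of using an existing implementation, here is a hedged sketch with the sklearn-crfsuite package; the package choice, the toy data, and the feature encoding are all assumptions, and any linear chain CRF library with a similar fit/predict interface would serve as well.

```python
# pip install sklearn-crfsuite   (assumed to be available)
import sklearn_crfsuite

# Toy sequence-labeling data: each sequence is a list of per-token feature dicts,
# and the labels form the corresponding (hidden) state sequence.
X_train = [[{"word": "san"}, {"word": "francisco"}, {"word": "is"}, {"word": "foggy"}]]
y_train = [["PLACE", "PLACE", "OTHER", "OTHER"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)                                       # training
print(crf.predict([[{"word": "san"}, {"word": "francisco"}]]))  # scoring: best label sequence
```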
References
E. Chen, Introduction to conditional random fields
Y. Ko, Maximum entropy Markov models and conditional random fields
A. Quattoni, Tutorial on conditional random fields for sequence prediction
References
C. Sutton and A. McCallum, An introduction to conditional random fields, Foundations and Trends in Machine Learning, 4(4):267-373, 2011
H.M. Wallach, Conditional random fields: An introduction, 2004