Conditional Random Fields
Conditional Random Fields (CRF) offer a generalized approach beyond the limitations of Hidden Markov Models (HMM). They address scenarios where HMMs fall short due to arbitrary dependencies in state transitions and observations. CRFs find applications in various fields like bioinformatics, natural language processing, speech recognition, and malware detection. This overview delves into the structure, limitations, extensions, and applications of HMMs and CRFs. It explores how CRFs provide a broader perspective by relaxing the strong independence assumptions of HMMs, allowing for modeling complex interactions among variables efficiently.
Presentation Transcript
Conditional Random Fields
Mark Stamp
Intro
Hidden Markov Model (HMM) is used in:
o Bioinformatics
o Natural language processing
o Speech recognition
o Malware detection/analysis
And many, many other applications
Bottom line: HMMs are very useful
o Everybody knows that!
Generic View of HMM
(Figure: hidden state sequence X_0 → X_1 → ... → X_T, each X_i emitting an observation O_i.)
Matrix A is the state transition matrix, defining a Markov process
o Implies that X_i only depends on X_{i-1}
Matrix B holds the observation probabilities
o Probability of O_i only depends on X_i
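To make the two assumptions concrete, here is a minimal generation sketch in Python; the matrices A and B and the initial distribution pi are hypothetical values, not taken from the slides.

```python
import numpy as np

# Hypothetical two-state HMM with three observation symbols.
A  = np.array([[0.7, 0.3],            # P(X_i | X_{i-1}); rows index the previous state
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],       # P(O_i | X_i); rows index the current state
               [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])             # initial state distribution

def sample_hmm(T, seed=0):
    """Generate T (state, observation) pairs from the HMM."""
    rng = np.random.default_rng(seed)
    x = rng.choice(2, p=pi)           # X_0 drawn from pi
    pairs = []
    for _ in range(T):
        o = rng.choice(3, p=B[x])     # O_i depends only on X_i
        pairs.append((x, o))
        x = rng.choice(2, p=A[x])     # X_{i+1} depends only on X_i
    return pairs

print(sample_hmm(5))
```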
HMM Limitations
Assumptions:
o The observation depends only on the current state
o The current state depends only on the previous state
o These are strong independence assumptions
Often such independence is not realistic
o An observation can depend on several states
o And/or the current state might depend on several previous states
HMMs
Within the HMM framework, we can:
Increase N, the number of hidden states
And/or use a higher-order Markov process
o Order 2 means the hidden state depends on the 2 immediately previous hidden states
o Order > 1 relaxes the independence constraint
More hidden states means more breadth; higher order means increased depth
Beyond HMMs
HMMs do not fit some situations
o For example, arbitrary dependencies on state transitions and/or observations
Here, we focus on a generalization of the HMM
o Conditional Random Fields (CRF)
There are other generalizations
o We mention a few
Mostly focused on the big picture
HMM Revisited
The HMM diagram illustrates the graph structure of an HMM
o That is, an HMM is a directed line graph
Can other types of graphs work? Would they make sense?
MEMM
In an HMM, the observation sequence O is related to the states X via the B matrix
o And O affects X in training, not scoring
o We might want X to depend on O in scoring
Maximum Entropy Markov Model
o State X_i is a function of X_{i-1} and O_i
MEMM is focused on HMM Problem 2
o That is, determining the (hidden) states
Generic MEMM
How does this differ from an HMM?
o State X_i is a function of X_{i-1} and O_i
o We cannot generate O_i using the MEMM, while we can do so using an HMM
MEMM vs HMM
HMM: find the best state sequence X
o That is, solve HMM Problem 2
o Solution is the X that maximizes P(X|O) = Π_i P(O_i | X_i) P(X_i | X_{i-1})
MEMM: find the best state sequence X
o Solution is the X that maximizes P(X|O) = Π_i P(X_i | X_{i-1}, O_i),
where P(x | y, o) = 1/Z(o, y) exp( Σ_j w_j f_j(o, x) )
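The per-step MEMM probability can be illustrated with a small numeric sketch; the states, feature functions f_j, and weights w_j below are hypothetical, chosen only to show how the exponentiated weighted feature sum is normalized by Z(o, y).

```python
import math

# Hypothetical feature functions f_j(o, x) and weights w_j.
features = [
    lambda o, x: 1.0 if o == "capitalized" and x == "NAME" else 0.0,
    lambda o, x: 1.0 if x == "OTHER" else 0.0,
]
weights = [1.5, 0.2]
states = ["NAME", "OTHER"]

def memm_prob(x, y, o):
    """P(x | y, o) = exp(sum_j w_j * f_j(o, x)) / Z(o, y)."""
    score = lambda s: math.exp(sum(w * f(o, s) for w, f in zip(weights, features)))
    z = sum(score(s) for s in states)      # Z(o, y): normalize over successors of y
    return score(x) / z

print(round(memm_prob("NAME", "OTHER", "capitalized"), 3))   # ~0.786
```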
MEMM vs HMM
Note the Σ_j w_j f_j(o, x) in the MEMM probability
o This sum is over the entire sequence
o Any useful feature of the input observation can affect the probability
MEMM is more general in this sense
o As compared to the HMM, that is
But the MEMM creates a new problem
o A problem that does not occur in the HMM
Label Bias Problem
MEMM uses dynamic programming (DP)
o Also known as the Viterbi algorithm
HMM (Problem 2) does not use DP
o The HMM α-pass uses a sum, DP uses a max
In MEMM, probability is conserved
o Probability must be split between successor states (not so in HMM)
o Is this good or bad?
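The sum-versus-max distinction can be shown side by side; the two-state model below is hypothetical and reuses the standard α-pass and Viterbi recursions.

```python
import numpy as np

# Hypothetical HMM: 2 hidden states, 3 observation symbols.
A   = np.array([[0.7, 0.3], [0.4, 0.6]])              # state transitions
B   = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])    # observation probabilities
pi  = np.array([0.6, 0.4])                             # initial distribution
obs = [0, 2, 1]                                        # an observation sequence

alpha = pi * B[:, obs[0]]      # alpha-pass: SUM over predecessor states
delta = pi * B[:, obs[0]]      # Viterbi/DP:  MAX over predecessor states
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
    delta = (delta[:, None] * A).max(axis=0) * B[:, o]

print("P(O | model), from the sum:", alpha.sum())
print("best single-path score, from the max:", delta.max())
```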
Label Bias Problem
Only one possible successor in the MEMM?
o All probability is passed along to that state
o In effect, the observation is ignored
o More generally, with one dominant successor, the observation doesn't matter much
CRF solves the label bias problem of the MEMM
o So, the observation matters
We won't go into details here
Label Bias Problem
Example
o Hot (H), Cold (C), and Medium (M) states
In the M state
o The observation does little (MEMM)
o The observation can matter more (HMM)
(Figure: state transition diagram over H, M, and C, with edge probabilities 0.7, 0.99, 0.3, 0.01, 0.3, 0.1, 0.6.)
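A tiny sketch of why a dominant successor makes the observation nearly irrelevant under local (MEMM-style) normalization; the transition and observation values are assumed for illustration and are not read off the figure.

```python
# From some state, successor probabilities before seeing the observation:
p_next = {"H": 0.99, "C": 0.01}
# Likelihood of the current observation under each successor state:
p_obs = {"H": 0.2, "C": 0.8}          # the observation actually favors C

# MEMM-style local normalization: combine, then renormalize over successors.
unnorm = {s: p_next[s] * p_obs[s] for s in p_next}
z = sum(unnorm.values())
posterior = {s: round(v / z, 3) for s, v in unnorm.items()}

print(posterior)   # {'H': 0.961, 'C': 0.039}: H still dominates despite the observation
```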
Conditional Random Fields
CRFs are a generalization of HMMs
Generalization to other graphs
o Undirected graphs
The Linear Chain CRF is the simplest case
But CRFs also generalize to arbitrary (undirected) graphs
o That is, they can have arbitrary dependencies between states and observations
Simplest Case of CRF
How is it different from the HMM/MEMM?
More things can depend on each other
o The case illustrated here is a linear chain CRF
o More general graph structures can work
Another View
Next, consider a deeper connection between HMM and CRF
But first, we need some background
o Naïve Bayes
o Logistic regression
These topics are very useful in their own right
o So wake up and pay attention!
What Are We Doing Here?
Recall, O is the observation and X is the state
Ideally, we want to model P(X,O)
o All possible interactions of Xs and Os
But P(X,O) involves lots of parameters
o Like the complete covariance matrix
o Lots of data needed for training
o And too much work to train
Generally, this problem is intractable
What to Do?
Simplify, simplify, simplify
o Need to make the problem tractable
o And then hope we get decent results
In Naïve Bayes, assume independence
In regression analysis, try to fit a specific function to the data
Eventually, we'll see this is relevant
o With respect to HMMs and CRFs, that is
Naïve Bayes
Why is it "naïve"?
Assume the features in X are independent
o Probably not true, but it simplifies things
o And often works well in practice
Why does independence simplify?
o Recall covariance: for X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), if the means are 0, then Cov(X,Y) = (x_1 y_1 + ... + x_n y_n) / n
Naïve Bayes
Independence implies the covariance is 0
If so, only the diagonal elements of the covariance matrix are non-zero
We only need means and variances
o Not the entire covariance matrix
o Far fewer parameters to estimate
o And a lot less data needed for training
Bottom line: a practical solution
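A quick numeric sketch (randomly generated, hypothetical data) showing that independent features leave only the diagonal of the covariance matrix, so means and variances suffice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independently generated features with variances 1 and 9.
data = np.column_stack([rng.normal(0, 1, 10_000),
                        rng.normal(0, 3, 10_000)])

# Off-diagonal entries are near 0; only the diagonal (the variances) matters.
print(np.cov(data, rowvar=False).round(2))
```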
Naïve Bayes
Why is it "Bayes"? Because it uses Bayes' Theorem:
o That is, P(A|B) = P(B|A) P(A) / P(B)
o Or, P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|A^c) P(A^c))
o More generally, P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j), where the A_j form a partition
Bayes Formula Example
Consider a test for an illegal drug:
o If you use the drug, the test is 98% positive (TPR = sensitivity)
o If you don't use it, the test is 99% negative (TNR = specificity)
o In the overall population, 5/1000 use the drug
Let A = uses the drug, B = tests positive
Then P(A|B) = (.98)(.005) / ((.98)(.005) + (.01)(.995)) = 0.329966, or about 33%
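The same calculation as a short sketch, using only the numbers from the slide.

```python
# P(positive | user), P(negative | non-user), P(user)
sensitivity, specificity, prevalence = 0.98, 0.99, 0.005

# Bayes' theorem: P(user | positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_user_given_positive = sensitivity * prevalence / p_positive
print(round(p_user_given_positive, 6))   # 0.329966
```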
Naïve Bayes
Why is this relevant?
Suppose we classify based on an observation O
o Compute P(X|O) = P(O|X) P(X) / P(O)
o Where X is one possible class (state)
o And P(O|X) is easy to compute
Repeat for all possible classes X
o The biggest probability gives the most likely class X
o We can ignore P(O) since it's constant
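A minimal classification sketch along these lines; the class priors and per-class likelihoods are hypothetical stand-ins for values estimated from training data.

```python
# Hypothetical priors P(X) and likelihoods P(O|X) for a single observation O.
priors      = {"malware": 0.3, "benign": 0.7}
likelihoods = {"malware": 0.08, "benign": 0.02}    # assumed easy to compute per class

# P(X|O) is proportional to P(O|X) P(X); P(O) is constant over classes, so ignore it.
scores = {x: likelihoods[x] * priors[x] for x in priors}
print(scores, "-> most likely class:", max(scores, key=scores.get))
```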
Regression Analysis
Generically, a method for measuring the relationship between 2 or more things
o E.g., house price vs size
First, we consider linear regression
o Since it's the simplest case
Then logistic regression
o More complicated, but often more useful
o Better for binary classifiers
Linear Regression
(Scatter plot: x-axis is house size, y-axis is sale price.)
Suppose x is the house size in ft²
o It could be a vector x of observations instead
And y is the sale price
Points represent recent sales results
How do we use this info?
o Given a house to sell
o Given a recent sale
Linear Regression
(Scatter plot with the best-fit line drawn through the points.)
The blue line is the best fit
o Minimum squared error (linear least squares)
o Measured as the vertical distance from each point to the line
What good is it?
o Given a new point, how well does it fit in?
o Given x, predict y
o This sounds familiar...
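A minimal least-squares sketch with made-up size/price data, using numpy's polynomial fit to get the slope and intercept.

```python
import numpy as np

# Hypothetical recent sales: (size in square feet, price in $1000s).
size  = np.array([1100, 1400, 1600, 1900, 2300, 2600])
price = np.array([199,  245,  280,  324,  390,  438])

# Fit price = b1 * size + b0 by minimizing the squared vertical error.
b1, b0 = np.polyfit(size, price, deg=1)
print(f"predicted price for a 2000 ft^2 house: {b1 * 2000 + b0:.0f} (in $1000s)")
```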
Regression Analysis
In many problems, there are only 2 outcomes
o Binary classifier, e.g., malware vs benign
o Or malware of a specific type vs other
Then x is an observation (vector)
But each y is either 0 or 1
o Linear regression is not so good here (why?)
o A better idea: logistic regression
o Fit a logistic function instead of a line
Binary Classification
Suppose we compute a score for many files
The score is on the x-axis, the output on the y-axis
o 1 if the file is malware
o 0 if the file is other
Linear regression is not very useful here
Binary Classification
Instead of a line, use a function better suited to 0/1 data
The logistic function
o Transition from 0 to 1 is more abrupt than a line
o Why is this better?
o Less wasted time between 0 and 1
Logistic Regression
The logistic function
o F(t) = 1 / (1 + e^(-t))
o Input: -∞ to ∞
o Output: 0 to 1, which can be interpreted as a probability P(t)
Here, t = b_0 + b_1 x
o Or t = b_0 + b_1 x_1 + ... + b_m x_m
o I.e., x is the observation
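A direct sketch of the logistic function, showing how any real input is squashed into (0, 1).

```python
import math

def logistic(t):
    """F(t) = 1 / (1 + e^(-t)), mapping (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

for t in (-5, -1, 0, 1, 5):
    print(t, round(logistic(t), 3))   # 0.007, 0.269, 0.5, 0.731, 0.993
```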
Logistic Regression
Instead of fitting a line to the data
o Fit a logistic function to the data
And instead of least squares error
o Measure deviance, the distance from the ideal case (where "ideal" means the saturated model)
An iterative process finds the parameters
o Find the best-fit F(t) using the data points
o More complex training than the linear case
o But better suited to binary classification
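A minimal fit sketch with made-up scores and 0/1 labels; scikit-learn's LogisticRegression is used here only as one convenient iterative solver, not as the specific tool assumed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical file scores (x) and labels (y: 1 = malware, 0 = other).
x = np.array([0.4, 0.9, 1.3, 2.0, 2.6, 3.1, 3.8, 4.5]).reshape(-1, 1)
y = np.array([0,   0,   0,   0,   1,   1,   1,   1])

model = LogisticRegression().fit(x, y)       # iteratively fits b_0 and b_1
print(model.predict_proba([[2.4]])[0, 1])    # estimated P(malware) for a new score
```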
Conditional Probability
Recall, we would like to model P(X,O)
o Observe that P(X,O) includes all relationships between Xs and Os
o Too complex, too many parameters
So we settle for P(X|O)
o A lot fewer parameters
o The problem is tractable
o Works well in practice
Generative vs Discriminative
We are interested in P(X|O)
Generative models
o Focus on P(O|X) P(X)
o From Naïve Bayes (without the denominator)
Discriminative models
o Focus directly on P(X|O)
o Like logistic regression
Tradeoffs?
Generative vs Discriminative
Naïve Bayes is a generative model
o Since it uses P(O|X) P(X)
o Good in the unsupervised case, with unlabeled data
Logistic regression is discriminative
o Deals directly with P(X|O)
o No need to expend effort modeling O
o So, more freedom to model X
o Unsupervised use is an active area of research
HMM and Naïve Bayes
Connection(s) between NB and HMM?
Recall HMM Problem 2
o For a given O, find the best (hidden) state sequence X
We use P(X|O) to determine the best X
The alpha pass is used in solving Problem 2
Looking closely at the alpha pass
o It is based on computing P(O|X) P(X)
o With probabilities from the model
HMM and Naïve Bayes
Connection(s) between NB and HMM?
An HMM can be viewed as a sequential version of Naïve Bayes
o Classifications over a series of observations
o The HMM uses info about state transitions
Conversely, Naïve Bayes is a static version of an HMM
Bottom line: the HMM is a generative model
CRF and Logistic Regression
Connection between CRF and regression?
A linear chain CRF is a sequential version of logistic regression
o Classification over a series of observations
o The CRF uses info about state transitions
Conversely, logistic regression can be viewed as a static (linear chain) CRF
Bottom line: the CRF is a discriminative model
Generative vs Discriminative
Naïve Bayes and logistic regression
o A generative-discriminative pair
HMM and (linear chain) CRF
o Another generative-discriminative pair
o Sequential versions of those above
Are there other such pairs?
o Yes, based on further generalizations
o What's more general than sequential?
General CRF
We can define a CRF on any (undirected) graph structure
o Not just a linear chain
In a general CRF, training and scoring are not as efficient, so
o The linear chain CRF is used most in practice
In special cases, it might be worth considering a more general CRF
Generative Directed Model
We can view an HMM as defined on a (directed) line graph
We could consider a similar process on more general (directed) graph structures
This more general case is known as a generative directed model
The algorithms (training, scoring, etc.) are not as efficient in the more general case
Generative-Discriminative Pair
Generative directed model
o As the name implies, a generative model
General CRF
o A discriminative model
So, this gives us a 3rd generative-discriminative pair
Summary on the next slide
HCRF
Yes, you guessed it
o Hidden Conditional Random Field
So, what is hidden?
To be continued...
Algorithms
Where are the algorithms?
o This is a CS class, after all
Yes, CRF algorithms do exist
o Omitted, since a lot of background is needed
o It would take too long to cover it all
o We've got better things to do
So, just use existing implementations
o It's your lucky day
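As one example of using an existing implementation, here is a hedged sketch with the sklearn-crfsuite package; the package choice, the toy data, and the feature encoding are all assumptions, and any linear chain CRF library with a similar fit/predict interface would serve as well.

```python
# pip install sklearn-crfsuite   (assumed to be available)
import sklearn_crfsuite

# Toy sequence-labeling data: each sequence is a list of per-token feature dicts,
# and the labels form the corresponding (hidden) state sequence.
X_train = [[{"word": "san"}, {"word": "francisco"}, {"word": "is"}, {"word": "foggy"}]]
y_train = [["PLACE", "PLACE", "OTHER", "OTHER"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)                                       # training
print(crf.predict([[{"word": "san"}, {"word": "francisco"}]]))  # scoring: best label sequence
```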
References
E. Chen, Introduction to conditional random fields
Y. Ko, Maximum entropy Markov models and conditional random fields
A. Quattoni, Tutorial on conditional random fields for sequence prediction
References
C. Sutton and A. McCallum, An introduction to conditional random fields, Foundations and Trends in Machine Learning, 4(4):267-373, 2011
H.M. Wallach, Conditional random fields: An introduction, 2004