Revealing Introduction to Hidden Markov Models

a revealing introduction to hidden markov models n.w
1 / 63
Embed
Share

Explore the concepts of Hidden Markov Models (HMMs) and Markov Chains in this revealing introduction. Learn about the applications, benefits, and structure of HMMs through examples and visual representations. Understand how HMMs leverage machine learning techniques for tasks like speech recognition and information security.

  • Markov Models
  • HMMs
  • Machine Learning
  • Introduction
  • Applications

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. A Revealing Introduction to Hidden Markov Models Mark Stamp A Revealing Introduction to HMMs 1

  2. Hidden Markov Models What is a hidden Markov model (HMM)? o A machine learning technique o trained via a discrete hill climb technique o Two for the price of one! Where are HMMs used? o Speech recognition, information security, and far too many other things to list o Q: Why are HMMs so useful? o A: Widely applicable and efficient algorithms A Revealing Introduction to HMMs 2

  3. Markov Chain Markov chain o Memoryless random process o Transitions depend only on current state (Markov chain of order 1) o and transition probability matrix Example? o See next slide A Revealing Introduction to HMMs 3

  4. Markov Chain 0.7 Suppose we re interested in average annual temperature o Only consider Hot and Cold From recorded history, obtain probabilities for o Year-to-year transitions o Based on thermometer readings for recent years H 0.3 0.4 C 0.6 A Revealing Introduction to HMMs 4

  5. Markov Chain 0.7 Transition probability matrix H 0.3 0.4 Matrix is denoted as A C 0.6 Note, Ais row stochastic A Revealing Introduction to HMMs 5

  6. Markov Chain Can also include begin, end states Matrix for begin state denoted o In this example, 0.7 0.6 H 0.3 0.4 begin end C Note that also row stochastic 0.4 0.6 A Revealing Introduction to HMMs 6

  7. Hidden Markov Model HMM includes a Markov chain o But the Markov process is hidden , i.e., we can t directly observe the Markov process o Instead, observe things that are probabilistically related to hidden states o It s as if there is a curtain between Markov chain and the observations Example on next few slides A Revealing Introduction to HMMs 7

  8. HMM Example Consider H/C annual temp example Suppose we want to know H or C annual temperature in distant past o Before thermometers were invented o Note, we only distinguish between H and C We ll assume transition between Hot and Cold years is same as today o Implies that the A matrix is known A Revealing Introduction to HMMs 8

  9. HMM Example Assume temps follow a Markov process o But, we cannot observe temperature in past We find modern evidence that tree ring size is related to temperature o Discover this by looking at recent data We ll only consider 3 tree ring sizes o Small, Medium, Large (S, M, L, respectively) Measure tree ring sizes and recorded temperatures to determine relationship A Revealing Introduction to HMMs 9

  10. HMM Example We find that tree ring sizes and temperature related by This is known as the B matrix The matrix B is also row stochastic A Revealing Introduction to HMMs 10

  11. HMM Example Can we now find H/C temps in past? We cannot measure (observe) temps But we can measure tree ring sizes and tree ring size is related to temp o Based on probabilities in the B matrix Can we say something intelligent about temps over some interval in the past? A Revealing Introduction to HMMs 11

  12. HMM Notation A lot of notation is required o Notation may be the hardest part A Revealing Introduction to HMMs 12

  13. HMM Notation For notational convenience, observations from V = {0,1, ,M-1} That is, The matrix A = {aij} is N x N, where o The matrix B = {bj(k)} is N x M, where o o Atypical notation for elements of B A Revealing Introduction to HMMs 13

  14. HMM Example Consider our temperature example What are the observations? o V = {0,1,2}, corresponding to S,M,L What are states of Markov process? o Q = {H,C} What are A,B, , and T? o A,B, on previous slides o T is number of tree rings measured What are N and M? o N = 2 and M = 3 A Revealing Introduction to HMMs 14

  15. Generic HMM Generic view of HMM HMM defined by A,B, and We denote HMM model as = (A,B, ) A Revealing Introduction to HMMs 15

  16. HMM Example Suppose that we observe tree ring sizes o Over a 4 year period of interest: S,M,S,L o Then = (0, 1, 0, 2) Most likely (hidden) state sequence? o That is, most likely X = (X0, X1, X2, X3) Let X0be prob. of starting in state X0 Note prob. of initial observation o And aX0,X1 is prob. of transition X0 to X1 And so on A Revealing Introduction to HMMs 16

  17. HMM Example Bottom line? We can compute P(X) for any given X For X = (X0, X1, X2, X3) we have Assuming we observe (0,1,0,2), then what is probability of, say, HHCC? Plug into formula above to find A Revealing Introduction to HMMs 17

  18. HMM Example Do same for all 4-state seq s We find that the winner is o CCCH Not so fast my friend! A Revealing Introduction to HMMs 18

  19. HMM Example The pathCCCH scores the highest In dynamic programming (DP), we find highest scoring path But, in HMM we maximize expected number of correct states o Sometimes called EM algorithm o That is, expectation maximization How does HMM work in this example? A Revealing Introduction to HMMs 19

  20. HMM Example For first position in sequence o Sum probabilities for all paths that have H in 1st position, compare to sum of probs for paths with C in 1st position biggest wins Repeat for each position and we find A Revealing Introduction to HMMs 20

  21. HMM Example So, HMM (or EM) solution gives us CHCH While DP (or path) solution is CCCH Which solution is better? Neither solution is better! o Just using different definitions of best A Revealing Introduction to HMMs 21

  22. HMM Paradox? HMM maximizes expected number of correct states o Whereas DP chooses best overall path Possible for HMM to choose a path that is impossible o Could be a transition probability of 0 Cannot have impossible path with DP Is this a flaw with HMM? o No, it s a feature A Revealing Introduction to HMMs 22

  23. Probability of Observations Table computed for = (0,1,0,2) For this sequence, P( ) = .000412 + .000035 + .000706 + + .000847 = left to the reader Similarly for other observations A Revealing Introduction to HMMs 23

  24. HMM Model An HMM is defined by the three matrices, A, B, and Note that M and N are implied, since they are the dimensions of matrices So, we denote an HMM model as = (A,B, ) A Revealing Introduction to HMMs 24

  25. The Three Problems HMMs used to solve 3 problems Problem 1: Given a model = (A,B, ) and observation sequence O, find P(O| ) o That is, we can score an observation sequence to see how well it fits a given model Problem 2: Given = (A,B, ) and O, find an optimal state sequence (in HMM sense) o Uncover hidden part (like previous example) Problem 3: Given O, N, and M, find the model that maximizes probability of O o That is, train a model to fit observations A Revealing Introduction to HMMs 25

  26. HMMs in Practice We often, we use HMMs as follows: Given an observation sequence o Assume that (hidden) Markov process exists Train a model based on observations o That is, solve Problem 3 o Best N can be found by trial and error Then given a sequence of observations, score it versus the model we trained o This is Problem 1 high score implies similar to training data, low score says it s not similar A Revealing Introduction to HMMs 26

  27. HMMs in Practice Previous slide gives sense in which HMM is a machine learning technique o To train model, we do not need to specify anything except the parameter N o Best N often found by trial and error So, we don t need to think too much o Just train HMM and then use it o Practical, since efficient algorithms exist for training HMM and scoring A Revealing Introduction to HMMs 27

  28. The Three Solutions We give detailed solutions to 3 problems o Note: We must find efficient solutions The three problems: o Problem 1: Score an observation sequence versus a given model o Problem 2: Given a model, uncover hidden part o Problem 3: Given an observation sequence, train a model Recall that we considered example for 2 and 1 o But direct solutions are soooooo inefficient A Revealing Introduction to HMMs 28

  29. Solution 1 Score observations versus a given model o Given model = (A,B, ) and observation sequence O=(O0,O1, ,OT-1), find P(O| ) Denote hidden states as X = (X0, X1, . . . , XT-1) Then from definition of B, P(O|X, )=bX0(O0) bX1(O1) bXT-1(OT-1) And from definition of A and , P(X| )= X0 aX0,X1 aX1,X2 aXT-2,XT-1 A Revealing Introduction to HMMs 29

  30. Solution 1 Elementary conditional probability fact: P(O,X| ) = P(O|X, ) P(X| ) Sum over all possible state sequences X, P(O| ) = P(O,X| ) = P(O|X, ) P(X| ) = X0bX0(O0)aX0,X1bX1(O1) aXT-2,XT-1bXT-1(OT-1) This works but way too costly Requires about 2TNT multiplications o Why? There better be a better way A Revealing Introduction to HMMs 30

  31. Forward Algorithm Instead, use forward algorithm o Or alpha pass For t = 0,1, ,T-1 and i=0,1, ,N-1, let t(i) = P(O0,O1, ,Ot,Xt=qi| ) Probability of partial sum to t, and Markov process is in state qi at step t Can be computed recursively, efficiently A Revealing Introduction to HMMs 31

  32. Forward Algorithm Let 0(i) = ibi(O0) for i = 0,1, ,N-1 For t = 1,2, ,T-1 and i=0,1, ,N-1, let t(i) = ( t-1(j)aji)bi(Ot) o Where the sum is from j = 0 to N-1 From definition of t(i) we see P(O| ) = T-1(i) o Where the sum is from i = 0 to N-1 This requires only N2T multiplications A Revealing Introduction to HMMs 32

  33. Solution 2 Given a model, find hidden states o Given = (A,B, ) and O, find an optimal state sequence o Recall that optimal means maximize expected number of correct states o In contrast, DP finds best scoring path For temp/tree ring example, solved this o But hopelessly inefficient approach A better way: backward algorithm o Or beta pass A Revealing Introduction to HMMs 33

  34. Backward Algorithm For t = 0,1, ,T-1 and i = 0,1, ,N-1, let t(i) = P(Ot+1,Ot+2, ,OT-1|Xt=qi, ) Probability of partial sum from t to end and Markov process in state qi at step t Analogous to the forward algorithm As with forward algorithm, this can be computed recursively and efficiently A Revealing Introduction to HMMs 34

  35. Backward Algorithm Let T-1(i) = 1 for i = 0,1, ,N-1 For t = T-2,T-3, ,1 and i = 0,1, ,N-1, let t(i) = aijbj(Ot+1) t+1(j) Where the sum is from j = 0 to N-1 A Revealing Introduction to HMMs 35

  36. Solution 2 For t = 1,2, ,T-1 and i=0,1, ,N-1 define t(i) = P(Xt=qi|O, ) o Most likely state at t is qi that maximizes t(i) We see that t(i) = t(i) t(i)/P(O| ) o And recall P(O| ) = T-1(i) The bottom line? o Forward algorithm solves Problem 1 o Forward/backward algorithms solve Problem 2 (how?) A Revealing Introduction to HMMs 36

  37. Solution 3 Train a model: Given O, N, and M, find that maximizes probability of O We ll iteratively adjust = (A,B, ) to better fit the given observations O o The size of matrices are fixed (N and M) o But elements of matrices can change It is nice that this works o andamazing that it s efficient! A Revealing Introduction to HMMs 37

  38. Solution 3 For t=0,1, ,T-2 and i,j in {0,1, ,N-1}, define di-gammas as t(i,j) = P(Xt=qi, Xt+1=qj|O, ) Note t(i,j) is prob of being in state qi at time t and transiting to state qj at t+1 Then t(i,j) = t(i)aijbj(Ot+1) t+1(j)/P(O| ) And t(i) = t(i,j) o Where sum is from j = 0 to N 1 A Revealing Introduction to HMMs 38

  39. Model Re-estimation Given di-gammas and gammas For i = 0,1, ,N-1 let i = 0(i) For i = 0,1, ,N-1 and j = 0,1, ,N-1 aij = t(i,j)/ t(i) o Where both sums are from t = 0 to T-2 For j = 0,1, ,N-1 and k = 0,1, ,M-1 bj(k) = t(j)/ t(j) o Both sums from from t = 0 to T-1 but only t for which Ot = k are counted in numerator Why does this work? A Revealing Introduction to HMMs 39

  40. Solution 3 To summarize 1. Initialize = (A,B, ) 2. Compute t(i), t(i), t(i,j), t(i) 3. Re-estimate the model = (A,B, ) 4. If P(O| ) increases by more than (where is small), goto 2 A Revealing Introduction to HMMs 40

  41. Solution 3 Some fine points Model initialization o If we have a good guess for = (A,B, ) then we can use it for initialization o If not, let i 1/N, ai,j 1/N, bj(k) 1/M o Subject to row stochastic conditions o But, do not initialize to exactly uniform values Stopping conditions o Stop after some number of iterations and/or o Stop if increase in P(O| ) is too small A Revealing Introduction to HMMs 41

  42. HMM as Discrete Hill Climb Algorithm on previous slides shows that HMM is a discrete hill climb HMM consists of discrete states, Xt o Climb on the elements of the matrices And re-estimation process improves model by modifying parameters o So, climbs toward improved model o This happens in a high-dimensional space A Revealing Introduction to HMMs 42

  43. Dynamic Programming Brief detour For = (A,B, )as above, it s easy to define a dynamic program (DP) Executive summary: o DP is forward algorithm, with sum replaced by max Details on next few slides A Revealing Introduction to HMMs 43

  44. Dynamic Programming Let 0(i) = i bi(O0) for i=0,1, ,N-1 For t=1,2, ,T-1 and i=0,1, ,N-1 compute t(i) = max ( t-1(j)aji)bi(Ot) o Where the max is over j in {0,1, ,N-1} Note that at each t, the DP computes best path for each state, up to that point So, probability of best path is max T-1(j) This max gives the highest probability o Not the best path, for that, see next few slides A Revealing Introduction to HMMs 44

  45. Dynamic Programming To determine optimal path o While computing deltas, keep track of pointers to previous state o When finished, construct optimal path by tracing back points For example, consider temp example: recall that we observe (0,1,0,2) Probabilities for path of length 1: These are the only paths of length 1 A Revealing Introduction to HMMs 45

  46. Dynamic Programming Probabilities for each path of length 2 Best path of length 2 ending with H is CH Best path of length 2 ending with C is CC A Revealing Introduction to HMMs 46

  47. Dynamic Program Continuing, we compute best path ending at H and C at each step And save pointers why? A Revealing Introduction to HMMs 47

  48. Dynamic Program Best final score is .002822 o And thanks to pointers, best path is CCCH But what about underflow? o A serious problem in bigger cases A Revealing Introduction to HMMs 48

  49. Underflow Resistant DP Common trick to prevent underflow: o Instead of multiplying probabilities o add logarithms of probabilities Why does this work? o Because log(xy) = log x + log y o Adding logs does not tend to 0 Note that these logs are negative and we must avoid 0 probabilities A Revealing Introduction to HMMs 49

  50. Underflow Resistant DP Underflow resistant DP algorithm: Let 0(i) = log( i bi(O0)) for i=0,1, ,N-1 For t=1,2, ,T-1 and i=0,1, ,N-1 compute t(i) = max ( t-1(j) + log(aji) + log(bi(Ot))) o Where the max is over j in {0,1, ,N-1} And score of best path is max T-1(j) o As before, must also keep track of paths A Revealing Introduction to HMMs 50

More Related Content