Understanding Automated Speech Recognition Technologies


Explore the world of Automated Speech Recognition (ASR), including setup, basics, observations, preprocessing, language modeling, acoustic modeling, and Hidden Markov Models. Learn about the process of converting speech signals into transcriptions, the importance of language modeling in ASR accuracy, and various techniques used for feature extraction and modeling. Gain insights into the complexities of ASR technologies and their applications in modern speech recognition systems.



Presentation Transcript


  1. Automated Speech Recognition By: Amichai Painsky

  2. Automated Speech Recognition - setup. Input speech waveform -> Preprocessing -> Modeling -> Output transcription: "The boy is in the red house"

  3. ASR - basics. Observations $O = o_1, \dots, o_T$ representing a speech signal; a vocabulary $V$ of different words. Our goal: find the most likely word sequence $W^{*} = \arg\max_W P(W \mid O)$. Since $P(W \mid O) \propto P(W)\,P(O \mid W)$ (language modeling times acoustic modeling), we have $W^{*} = \arg\max_W P(W)\,P(O \mid W)$.
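As a rough, hedged illustration of this decoding rule (the candidate sentences and their log-scores below are invented for the example, not taken from the slides):

```python
# Invented (log P(O|W), log P(W)) pairs for a few candidate transcriptions.
# In a real recognizer these come from the acoustic and language models.
candidates = {
    "the boy is in the red house": (-42.0, -18.0),
    "the buoy is in the read house": (-41.5, -25.0),
    "a boy is in the red house": (-44.0, -19.0),
}

# W* = argmax_W P(O|W) P(W) = argmax_W [log P(O|W) + log P(W)]
best = max(candidates, key=lambda w: sum(candidates[w]))
print(best)  # -> "the boy is in the red house"
```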

  4. Observations preprocessing. A sampled waveform is converted into a sequence of parameter vectors at a certain frame rate. A frame rate of 10 ms is usually taken, because a speech signal is assumed to be stationary for about 10 ms. Many different ways to extract meaningful features have been developed, some based on acoustic concepts or knowledge of the human vocal tract and psychophysical knowledge of human perception.
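A rough Python sketch of the framing step (the 25 ms window, the 16 kHz signal, and the use of log energy as a stand-in feature are illustrative choices, not specified in the slides):

```python
import numpy as np

def frame_signal(waveform, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Slice a sampled waveform into overlapping frames taken every 10 ms,
    within which speech is assumed to be roughly stationary."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    frames = np.stack([waveform[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    # One crude feature per frame: log energy. Real systems extract richer
    # parameter vectors (e.g. MFCCs) motivated by the vocal tract and by
    # human perception, as the slide notes.
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return frames, log_energy

# One second of a synthetic 16 kHz tone -> 98 frames of 400 samples each.
wave = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
frames, feats = frame_signal(wave, 16000)
print(frames.shape, feats.shape)  # (98, 400) (98,)
```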

  5. Language modeling. Most generally, the probability of a sequence of $m$ words is $P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$. Language is highly structured, and limited histories are capable of capturing quite a bit of this structure. Bigram models: $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$. More powerful: two-word history (trigram) models. Longer history -> exponentially increasing number of models -> more data is required to train, more parameters, more overfitting. Partial Matching modeling.
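A small sketch of bigram estimation by counting (the add-one smoothing and the two-sentence toy corpus are illustrative choices, not from the slides):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(w_i | w_{i-1}) by counting bigrams, with add-one smoothing."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])                 # history counts
        bigrams.update(zip(words[:-1], words[1:]))  # (history, word) counts
    V = len(vocab)
    return lambda prev, word: (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

corpus = ["the boy is in the red house", "the boy is in the garden"]
p = train_bigram_lm(corpus)
print(p("the", "boy"), p("the", "garden"))  # "boy" is the more likely continuation
```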

  6. Acoustic modeling. Determines what sound is pronounced when a given sentence is uttered. The number of possibilities is infinite! (It depends on the speaker, the ambience, microphone placement, etc.) Possible solution: a parametric model in the form of a Hidden Markov Model. Notice that other solutions may also apply (for example, neural nets).

  7. Hidden Markov Model. A simple example of an HMM:

  8. Hidden Markov Model
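The HMM figures on slides 7 and 8 are not reproduced in this transcript. As a stand-in, the sketch below defines a small two-state HMM over the binary symbols {0, 1}, the kind of model the 10110 example on the following slides refers to; all of the probabilities are assumed for illustration:

```python
import numpy as np

pi = np.array([0.6, 0.4])            # initial state distribution pi_i (assumed)
A = np.array([[0.7, 0.3],            # transition probabilities a_ij (assumed)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # emission probabilities b_j(k) (assumed):
              [0.2, 0.8]])           # row = state, column = observed symbol

# Probability of one particular state path q for the observations 1, 0, 1:
obs, q = [1, 0, 1], [0, 1, 0]
p = pi[q[0]] * B[q[0], obs[0]]
for t in range(1, len(obs)):
    p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
print(p)  # P(O, q | lambda) for this single path
```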

  9. Hidden Markov Model - Forward Algorithm. Given an observation sequence (for example 10110), what is the probability it was generated by a given HMM (for example, the HMM from the previous slides)? For a path $q = 12312$ and a given HMM $\lambda$: $P(O, q \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)$. Therefore, summing over all possible paths: $P(O \mid \lambda) = \sum_{q} P(O, q \mid \lambda)$.

  10. Hidden Markov Model - Forward Algorithm. Complexity: for a sequence of $T$ observations, each path necessitates about $2T$ multiplications. The total number of paths is $N^T$, therefore the cost is $O(T \cdot N^T)$. A more efficient approach: the forward algorithm.
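A direct (and deliberately naive) Python rendering of this exhaustive sum, reusing the illustrative two-state HMM from the previous sketch; it enumerates all $N^T$ paths and therefore scales as $O(T \cdot N^T)$:

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                        # illustrative HMM (assumed values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

def brute_force_likelihood(obs, pi, A, B):
    """P(O | lambda) by summing P(O, q | lambda) over every state path.
    There are N**T paths and each costs about 2T multiplications."""
    N, T = len(pi), len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

print(brute_force_likelihood([1, 0, 1, 1, 0], pi, A, B))
```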

  11. Hidden Markov Model - Forward Algorithm. Forward algorithm: calculates the probabilities for all subsequences at each time step, using the results from the previous step (dynamic programming). Define $\alpha_t(i)$ as the probability of being in state $i$ at time $t$ and having observed the partial sequence $o_1, \dots, o_t$: $\alpha_t(i) = P(o_1, \dots, o_t, q_t = i \mid \lambda)$.

  12. Hidden Markov Model - Forward Algorithm. With $\alpha_t(i)$ defined as above, the recursion is: initialization $\alpha_1(i) = \pi_i\, b_i(o_1)$; induction $\alpha_{t+1}(j) = \bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\bigr] b_j(o_{t+1})$; termination $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

  13. Hidden Markov Model - Forward Algorithm. Complexity: at time $t$, each calculation only involves the $N$ previous values $\alpha_{t-1}(i)$. The length of the sequence is $T$. Therefore, for each state we need $O(NT)$ operations, and for the total of $N$ states we need $O(N^2 T)$.
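The same quantity computed with the forward recursion, again on the assumed two-state HMM; note that each time step only touches the $N$ values from the previous step:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t, i] = P(o_1..o_t, q_t = i | lambda), O(N^2 T)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination: P(O | lambda)

pi = np.array([0.6, 0.4])                             # illustrative HMM (assumed)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, likelihood = forward([1, 0, 1, 1, 0], pi, A, B)
print(likelihood)  # equals the brute-force sum over all paths
```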

  14. Hidden Markov Model - Viterbi Algorithm. Previously: given an observation sequence (for example 10110), what is the probability it was generated by a given HMM? We now ask: given an observation sequence, what is the sequence of states that is most likely to have generated it?

  15. Hidden Markov Model - Viterbi Algorithm. Define $\delta_t(i)$ as the probability of the best path from the start state to state $i$ at time $t$: $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$. Our objective is $\max_i \delta_T(i)$ and the path that attains it. We solve this with the same forward recursion as before, but this time with maximization instead of summation: $\delta_{t+1}(j) = \bigl[\max_i \delta_t(i)\, a_{ij}\bigr] b_j(o_{t+1})$.

  16. Hidden Markov Model - Viterbi Algorithm

  17. Hidden Markov Model - Viterbi Algorithm, example. Observation sequence: 101
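A sketch of the Viterbi recursion applied to the slide's 101 example, using the same assumed two-state HMM as in the earlier sketches:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Forward-style recursion with max instead of sum, plus back-pointers
    to recover the most likely state sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)          # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # backtrack along the pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])                       # illustrative HMM (assumed values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
states, prob = viterbi([1, 0, 1], pi, A, B)
print(states, prob)  # most likely state path for 101 and its probability
```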

  18. Hidden Markov Model - model fitting. In practice, the parameters $\lambda = (A, B, \pi)$ of the HMM are unknown. We are interested in $\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda)$. There is no analytical maximum likelihood solution, so we turn to the Baum-Welch (forward-backward) algorithm. Basic idea: count the visits to each state and the number of transitions to derive a probability estimator.

  19. Hidden Markov Model - model fitting. Define the backward variable $\beta_t(i) = P(o_{t+1}, \dots, o_T \mid q_t = i, \lambda)$: the conditional probability that $o_{t+1}, \dots, o_T$ are observed, given that the system is in state $i$ at time $t$ and given the model $\lambda$. This can be calculated inductively: $\beta_T(i) = 1$ and $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$. Recall also $\alpha_t(i)$, the probability of being in state $i$ at time $t$ and having observed the partial sequence $o_1, \dots, o_t$.

  20. Hidden Markov Model - model fitting. With $\beta_t(i) = P(o_{t+1}, \dots, o_T \mid q_t = i, \lambda)$ and $\alpha_t(i) = P(o_1, \dots, o_t, q_t = i \mid \lambda)$ as defined above, the probability of being in state $i$ at time $t$, given the entire observation sequence and the model, is simply: $\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$.
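A sketch of the backward pass and the resulting state posteriors $\gamma_t(i)$, again on the assumed two-state HMM (the forward pass is repeated here so the snippet runs on its own):

```python
import numpy as np

pi = np.array([0.6, 0.4])                       # illustrative HMM (assumed values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

def forward(obs, pi, A, B):
    """alpha[t, i] = P(o_1..o_t, q_t = i | lambda)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda), computed from the end."""
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

obs = [1, 0, 1, 1, 0]
alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
# gamma[t, i] = alpha_t(i) beta_t(i) / P(O | lambda): state posterior at each time.
gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
print(gamma.sum(axis=1))  # every row sums to 1
```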

  21. Hidden Markov Model - model fitting. Define $\xi_t(i, j)$: the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the model and the observation sequence: $\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$.

  22. Hidden Markov Model - model fitting. We are now ready to introduce the parameter estimators. Transition probability estimator: $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, the expected number of transitions from state $i$ to state $j$, normalized by the expected number of visits to state $i$.

  23. Hidden Markov Model - model fitting. Observation probability estimator: $\hat{b}_j(k) = \dfrac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$, the expected number of times in state $j$ at which the symbol $v_k$ was observed, normalized by the expected number of times the system visited state $j$.

  24. Hidden Markov Model - model fitting. Notice that the parameters we wish to estimate actually appear on both sides of the equation: $\gamma$ and $\xi$ are computed from the current $(A, B, \pi)$, while the estimators above produce new values for those same parameters.

  25. Hidden Markov Model - model fitting. Since the parameters we wish to estimate appear on both sides of the equation, we use an iterative procedure: after starting with an initial guess for the parameters, we gradually update them at each iteration and terminate once the parameters stop changing beyond a certain tolerance.
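Putting the pieces together, a compact single-sequence, discrete-symbol Baum-Welch sketch; the forward and backward helpers from the previous sketch are repeated so the block is self-contained, and the initial guesses are arbitrary:

```python
import numpy as np

def forward(obs, pi, A, B):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, A, B):
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, pi, A, B, n_iter=100, tol=1e-6):
    """Iterate: E-step (gamma, xi from the current parameters), M-step
    (re-estimate pi, A, B), until the parameters stop changing."""
    obs = np.asarray(obs)
    for _ in range(n_iter):
        alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
        likelihood = alpha[-1].sum()
        gamma = alpha * beta / likelihood                  # P(q_t = i | O, lambda)
        # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.stack([gamma[obs == k].sum(axis=0)      # expected counts of symbol k
                          for k in range(B.shape[1])], axis=1)
        new_B /= gamma.sum(axis=0)[:, None]
        converged = max(np.abs(new_A - A).max(), np.abs(new_B - B).max()) < tol
        pi, A, B = new_pi, new_A, new_B
        if converged:
            break
    return pi, A, B

obs = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]                       # toy binary data
pi0 = np.array([0.5, 0.5])                                  # arbitrary initial guess
A0 = np.array([[0.6, 0.4], [0.3, 0.7]])
B0 = np.array([[0.7, 0.3], [0.4, 0.6]])
print(baum_welch(obs, pi0, A0, B0))
```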

  26. Hidden Markov Model - model fitting. For continuous observations, $b_j(o_t) = \mathcal{N}(o_t; \mu_j, \sigma_j^2)$. We estimate the mean and variance for each state $j$: $\hat{\mu}_j = \dfrac{\sum_{t=1}^{T} \gamma_t(j)\, o_t}{\sum_{t=1}^{T} \gamma_t(j)}, \quad \hat{\sigma}_j^2 = \dfrac{\sum_{t=1}^{T} \gamma_t(j)\, (o_t - \hat{\mu}_j)^2}{\sum_{t=1}^{T} \gamma_t(j)}$.
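A brief sketch of this Gaussian re-estimation step, assuming scalar observations and a gamma matrix already obtained from a forward-backward pass (the toy numbers below are invented):

```python
import numpy as np

def update_gaussian_emissions(obs, gamma):
    """Re-estimate mu_j and sigma_j^2 as gamma-weighted averages, where
    gamma[t, j] = P(q_t = j | O, lambda) comes from forward-backward."""
    obs = np.asarray(obs, dtype=float)             # shape (T,), scalar observations
    weights = gamma / gamma.sum(axis=0)            # per-state weights over time
    mu = weights.T @ obs                           # mu_j
    var = (weights * (obs[:, None] - mu) ** 2).sum(axis=0)  # sigma_j^2
    return mu, var

obs = [0.1, 0.2, 0.15, 1.9, 2.1, 2.0]              # toy continuous observations
gamma = np.array([[0.99, 0.01]] * 3 + [[0.01, 0.99]] * 3)   # invented posteriors
print(update_gaussian_emissions(obs, gamma))       # state 0 tracks the small values
```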

  27. Conclusions and final remarks. We learned how to: I. Estimate HMM parameters from a sequence of observations; II. Determine the probability of observing a sequence given an HMM; III. Determine the most likely sequence of states, given an HMM and a sequence of observations. Notice that the states may represent words, syllables, phonemes, etc.; this is up to the system architect to decide. For example, words are more informative than syllables, but result in more states and less accurate probability estimation (curse of dimensionality).

  28. Questions? Thank you!
