Speech Recognition and Acoustic Modeling

專題研究
 WEEK2
Prof. Lin-
s
han Lee
  
TA. 
Tao Tu
 
1. Recap
2. 
Apply HMM to Acoustic Modeling
3. 
Acoustic Training
4. Homework
Outline
2
 
Recap
3
語音辨識系統
Front-end
Signal
Processing
 
Acoustic
Models
Lexicon
Feature
Vectors
Linguistic Decoding
and
Search Algorithm
Output
Sentence
Speech
Corpora
Acoustic
Model
Training
Language
Model
Constructi
on
Text
Corpora
Language
Model
Input Speech
4
Today
Last time
Feature Extraction
Feature Extraction
5
How to do recognition?
How to map speech O to a word sequence W ?
P(O|W): acoustic model
P(W): language model
 
 
 
 
6
 
Apply HMM to Acoustic Modeling
7
Hidden Markov Model
Simplified HMM
Hidden Markov Model
Elements of an HMM {S,A,B,π}
S is a set of 
N
 states
A
 is the 
N
N
 matrix of state transition probabilities
B is a set of 
N
 probability functions, each describing the
observation probability with respect to a state
 
π 
is the vector of initial state probabilities
Gaussian Mixture Model (GMM)
Observation may be continuous. (e.g., mfcc)
Use GMM to model continuous prob. density
function.
Acoustic Model 
P(O|
)
Model of a phone
 (
)
Gaussian
Mixture Model
(2.2)
Markov Model
(2.1, 4.1-4.5)
11
11
Acoustic Model 
P(O|
)
12
12
Acoustic Model
P(O|
)
13
13
t
+
1
(
 
j
)
 
=
 
[
 
 
 
t
(
i
)
 
a
i
j
 
]
 
b
j
(
o
t
+
1
)
   
                        1 
 j 
 N
 
             1 
 t 
 T
1
Acoustic Model 
– State Seq.
14
14
Acoustic Model 
– Training
15
15
Acoustic Model 
– Training
b
1
(
v
1
)=3/4, b
1
(
v
2
)=1/4
b
2
(
v
1
)=1/3, b
2
(
v
2
)=2/3
b
3
(
v
1
)=2/3, b
3
(
v
2
)=1/3
Acoustic Model 
– Training
Initialization
Bad initialization leads to local minimum with higher
probability.
17
17
Model
Initialization:
Segmental K-means
Model
Re-estimation:
Baum-Welch
Segmental K-means
假設今天有四個人發出
這個音
Acoustic Model 
P(O|W)
How to compute P(O|W) ?
19
19
Monophone vs. triphone
Monophone
C
onsider only one phone information per model
Triphone
Consider both left and right neighboring phones
(60)
3
→ 216,000
Triphone
a phone model taking into consideration both left and right
neighboring phones               
(60)
3
→ 216,000
Training Tri-phone Models with Decision Trees
Example Questions:
12: Is left context a vowel?
24: Is left context a back-vowel?
30: Is left context a low-vowel?
32: Is left context a rounded-vowel?
An Example: “( _ ‒ ) b ( +_ )”
03.mono.train.sh
05.tree.build.sh
06.tri.train.sh
Acoustic Model Training
23
23
Training steps
Get features(previous section)
Train monophone model
Use previous model to build decision tree
 
(for triphone).
Train triphone model
Training steps
Get features
 (last time
)
Train monophone model
a. gmm-init-mono
  
I
nitial
ize
 monophone model
b. 
compile-train-graphs
 
G
et train graph
c. 
align-equal-compiled
 
model -> decode
 
&
 
align
 
 
 
(
g
m
m
-
a
l
i
g
n
-
c
o
m
p
i
l
e
d
 
i
n
s
t
e
a
d
 
w
h
e
n
 
l
o
o
p
i
n
g
)
d. gmm-acc-stats-ali
 
EM training: E step
e. gmm-est
  
EM training: M step
f
.  
n
umgauss = numgauss + incgauss
g. Goto step c.                    
Train several times
Use previous model to build decision tree(for triphone).
Train triphone model
Training steps
Get features
 (last time
)
Train monophone model
Use previous model to build decision tree(for triphone).
Train triphone model
a. gmm-init-model
 
 
Initialize GMM (
 from 
decision tree)
b. gmm-mixup
 
 
Gaussian merging (increase #gaussian)
c. 
convert-ali
 
 
Convert alignments(model <-> decisoin tree)
d. compile-train-graphs
 
get train graph
e. 
gmm-align-compiled
 
model -> decode&align
f.  gmm-acc-stats-ali
 
EM training: E step
g. gmm-est
  
EM training: M step
h.
 numgauss = numgauss + incgauss
i.  
Goto step e.
 
 
train several times
How to get Kaldi usage?
source setup.sh
align-equal-compiled
 --help
align-equal-compiled
Write an equally spaced alignment (for getting training started)
Usage:  align-equal-compiled <graphs-rspecifier> <features-
rspecifier> <alignments-wspecifier>
e.g.:
 align-equal-compiled 1.mdl 1.fsts scp:train.scp ark:equal.ali
gmm-align-compiled $scale_opts --beam=$beam --retry-
beam=$[$beam*4] <hmm-model*> ark:$dir/train.graph ark,s,cs:$feat
ark:<alignment*>
For first iteration(in monophone) beamwidth = 6, others = 10;
Only realign at
$realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38”
$realign_iters=“10 20 30”
gmm-acc-stats-ali
Accumulate stats for GMM training.(E step)
Usage:  gmm-acc-stats-ali [options] <model-in> <feature-
rspecifier> <alignments-rspecifier> <stats-out>
e.g.:
 gmm-acc-stats-ali 1.mdl scp:train.scp ark:1.ali 1
.acc
 gmm-acc-stats-ali --binary=false <hmm-model*>
ark,s,cs:$feat ark,s,cs:<alignment*> <stats>
gmm-est
Do Maximum Likelihood re-estimation of GMM-based
acoustic model
Usage:  gmm-est [options] <model-in> 
<stats-in>
 <model-
out>
e.g.: gmm-est 1.mdl 1.acc 2.mdl
gmm-est --binary=false --write-occs=<*.occs> --mix-
up=$numgauss <hmm-model-in> <stats> <hmm-model-
out>
--write-occs :
 
File to write pdf occupation counts to
.
$numgauss increases every time.
03.mono.train.sh
05.tree.build.sh
06.tri.train.sh
Homework
T
ODO
Step1. Execute the following commands.
script/03.mono.train.sh | tee log/03.mono.train.log
script/05.tree.build.sh | tee log/05.tree.build.log
script/06.tri.train.sh | tee log/06.tri.train.log
Step2. finish code in T
ODO
script/03.mono.train.sh
script/06.tri.train.sh
Step3. Observe the output and results.
Step4.
 
(
o
pt.) tune #gaussian and #iteration.
Hint (extremely important!!)
03.mono.train.sh
Use the variables already defined.
Use these formula:
Pipe for error
compute-mfcc-feats … 2> $log
Questions
?
Email:
杜濤
: 
r07922022@ntu.edu.tw
Slide Note
Embed
Share

This presentation delves into the world of speech recognition, covering topics such as Hidden Markov Models, feature extraction, acoustic modeling, and more. Explore the essential elements of processing speech signals, linguistic decoding, constructing language models, and training acoustic models. Gain insights into Gaussian Mixture Models, the application of Hidden Markov Models, and the use of the Forward algorithm. Discover the fascinating process of mapping speech to word sequences and the intricate workings of recognition algorithms in this informative session.

  • Speech recognition
  • Acoustic modeling
  • Hidden Markov Models
  • Feature extraction
  • Language models

Uploaded on Feb 17, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. WEEK2 Prof. Lin-shan Lee TA. Tao Tu

  2. Outline 2 1. Recap 2. Apply HMM to Acoustic Modeling 3. Acoustic Training 4. Homework

  3. Recap 3

  4. 4 Last time Feature Vectors Output Sentence Input Speech Linguistic Decoding and Search Algorithm Front-end Signal Processing Language Model Constructi on Acoustic Model Training Acoustic Models Language Model Text Corpora Lexicon Speech Corpora Today

  5. Feature Extraction 5 Feature Extraction

  6. How to do recognition? 6 How to map speech O to a word sequence W ? P(O|W): acoustic model P(W): language model

  7. Apply HMM to Acoustic Modeling 7

  8. Hidden Markov Model Simplified HMM RGBGGBBGRRR

  9. Hidden Markov Model Elements of an HMM {S,A,B, } S is a set of N states A is the N N matrix of state transition probabilities B is a set of N probability functions, each describing the observation probability with respect to a state is the vector of initial state probabilities 0.6 s1 {R:.3,G:.2,B:.5} 0.3 0.3 0.3 0.1 0.2 s3 s2 0.7 0.5 0.2 {R:.3,G:.6,B:.1} {R:.7,G:.1,B:.2}

  10. Gaussian Mixture Model (GMM) Observation may be continuous. (e.g., mfcc) Use GMM to model continuous prob. density function.

  11. Acoustic Model P(O|) 11 Model of a phone ( ) Markov Model (2.1, 4.1-4.5) Gaussian Mixture Model (2.2)

  12. Acoustic Model P(O|) 12 Forward algorithm (4.0 - basic problem 1) t(i) = P(o1o2 ot, qt= i| ) = P(observing o1o2 ot, state i at time t| ) 1(i) = ibi(o1) , 1 i N t+1(j) = [ ?=1 ? ??? aij] bj(ot+1) ? P(O| ) = ?=1 ??(?)

  13. Acoustic Model P(O|) 13 t+1(j) j t+1( j) = [ t(i) aij] bj(ot+1) i 1 j N 1 t T 1 t(i) t+1 t Forward Algorithm

  14. Acoustic Model State Seq. 14 Viterbi algorithm (4.0 - basic problem 2) Goal: find the optimal state sequences ? = ?1?2 ?? P(O| ) can also be computed based on ?.

  15. Acoustic Model Training 15 Baum-Welch algorithm (4.0 - basic problem 3) Goal: Adjust parameters (?,?,?) to maximize P(O| )

  16. Acoustic Model Training s 3 s 3 s 3 s 3 s 3 s 3 s 3 s 3 s 3 s 3 State s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 s 2 s 1 1 2 3 4 5 6 7 8 9 10 O 4 5 O 1 0 O 1 O 2 O 3 O O 6 O 7 O 8 O 9 v1 v2 b1(v1)=3/4, b1(v2)=1/4 b2(v1)=1/3, b2(v2)=2/3 b3(v1)=2/3, b3(v2)=1/3

  17. Acoustic Model Training 17 Initialization Bad initialization leads to local minimum with higher probability. Model Model Initialization: Segmental K-means Re-estimation: Baum-Welch

  18. Segmental K-means

  19. Acoustic Model P(O|W) 19 How to compute P(O|W) ?

  20. Monophone vs. triphone Monophone Consider only one phone information per model Triphone Consider both left and right neighboring phones (60)3 216,000

  21. Triphone a phone model taking into consideration both left and right neighboring phones (60)3 216,000 Sharing at Model Level Sharing at State Level Shared Distribution Model (SDM) Generalized Triphone

  22. Training Tri-phone Models with Decision Trees An Example: ( _ ) b ( +_ ) 12 yes no sil-b+u 30 a-b+u o-b+u y-b+u Y-b+u 32 46 42 Example Questions: 12: Is left context a vowel? 24: Is left context a back-vowel? 30: Is left context a low-vowel? 32: Is left context a rounded-vowel? i-b+u 24 U-b+u u-b+u e-b+u r-b+u 50 N-b+u M-b+u E-b+u

  23. Acoustic Model Training 23 03.mono.train.sh 05.tree.build.sh 06.tri.train.sh

  24. Training steps Get features(previous section) Train monophone model Use previous model to build decision tree (for triphone). Train triphone model

  25. Training steps Get features (last time) Train monophone model a. gmm-init-mono b. compile-train-graphs c. align-equal-compiled (gmm-align-compiled instead when looping) d. gmm-acc-stats-ali e. gmm-est f. numgauss = numgauss + incgauss g. Goto step c. Train several times Use previous model to build decision tree(for triphone). Train triphone model Initialize monophone model Get train graph model -> decode & align EM training: E step EM training: M step

  26. Training steps Get features (last time) Train monophone model Use previous model to build decision tree(for triphone). Train triphone model a. gmm-init-model Initialize GMM ( from decision tree) b. gmm-mixup Gaussian merging (increase #gaussian) c. convert-ali Convert alignments(model <-> decisoin tree) d. compile-train-graphs get train graph e. gmm-align-compiled model -> decode&align f. gmm-acc-stats-ali EM training: E step g. gmm-est EM training: M step h. numgauss = numgauss + incgauss i. Goto step e. train several times

  27. How to get Kaldi usage? source setup.sh align-equal-compiled --help

  28. align-equal-compiled Write an equally spaced alignment (for getting training started) Usage: align-equal-compiled <graphs-rspecifier> <features- rspecifier> <alignments-wspecifier> e.g.: align-equal-compiled 1.mdl 1.fsts scp:train.scp ark:equal.ali gmm-align-compiled $scale_opts --beam=$beam --retry- beam=$[$beam*4] <hmm-model*> ark:$dir/train.graph ark,s,cs:$feat ark:<alignment*> For first iteration(in monophone) beamwidth = 6, others = 10; Only realign at $realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38 $realign_iters= 10 20 30

  29. gmm-acc-stats-ali Accumulate stats for GMM training.(E step) Usage: gmm-acc-stats-ali [options] <model-in> <feature- rspecifier> <alignments-rspecifier> <stats-out> e.g.: gmm-acc-stats-ali 1.mdl scp:train.scp ark:1.ali 1.acc gmm-acc-stats-ali --binary=false <hmm-model*> ark,s,cs:$feat ark,s,cs:<alignment*> <stats>

  30. gmm-est Do Maximum Likelihood re-estimation of GMM-based acoustic model Usage: gmm-est [options] <model-in> <stats-in> <model- out> e.g.: gmm-est 1.mdl 1.acc 2.mdl gmm-est --binary=false --write-occs=<*.occs> --mix- up=$numgauss <hmm-model-in> <stats> <hmm-model- out> --write-occs : File to write pdf occupation counts to. $numgauss increases every time.

  31. Homework 03.mono.train.sh 05.tree.build.sh 06.tri.train.sh

  32. TODO Step1. Execute the following commands. script/03.mono.train.sh | tee log/03.mono.train.log script/05.tree.build.sh | tee log/05.tree.build.log script/06.tri.train.sh | tee log/06.tri.train.log Step2. finish code in TODO script/03.mono.train.sh script/06.tri.train.sh Step3. Observe the output and results. Step4. (opt.) tune #gaussian and #iteration.

  33. Hint (extremely important!!) 03.mono.train.sh Use the variables already defined. Use these formula: Pipe for error compute-mfcc-feats 2> $log

  34. Questions? Email: : r07922022@ntu.edu.tw

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#