Speech Recognition and Dialog Systems

FLST: Speech Recognition
 
Bernd Möbius
moebius@coli.uni-saarland.de
 
http://www.coli.uni-saarland.de/courses/FLST/2014/
ASR and ASU
 
Automatic speech recognition
recognition of words or word sequences
necessary basis for speech understanding and dialog
systems
Automatic speech understanding
more directly connected with higher linguistic levels,
such as syntax, semantics, and pragmatics
 
Structure of dialog systems
 
[Figure: components of a dialog system: feature extraction, word recognition, syntactic analysis, semantic analysis, pragmatic analysis, dialog control, answer generation, speech synthesis; with the labels ASR, ASU, and NLG]
Acoustic analysis
 
Feature extraction
utterance is analyzed as a
sequence of 10 ms frames
in each frame, spectral information
is coded as a feature vector
(MFCC, here: 12 coefficients)
MFCC = mel frequency
cepstral coefficients
typically 13 static and
26 dynamic features
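
The frame and feature counts above can be made concrete with a minimal sketch. It assumes a 16 kHz sampling rate, a 10 ms frame shift, and a simple two-point difference for the dynamic features. The function names are hypothetical, and the static vectors are placeholders rather than real MFCCs. With 13 static coefficients per frame, the first and second differences add 26 dynamic features, giving 39 per frame.

```python
def num_frames(num_samples, sample_rate=16000, frame_shift_ms=10):
    """Number of complete frames when the analysis advances by frame_shift_ms."""
    shift = sample_rate * frame_shift_ms // 1000  # samples per 10 ms step
    return num_samples // shift


def deltas(frames):
    """First-order differences (frame t+1 minus frame t-1, halved).

    The first and last frame are repeated at the edges.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out


def add_dynamic_features(static):
    """Append delta and delta-delta features to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# One second of audio at 16 kHz gives 100 frames at a 10 ms shift.
n = num_frames(16000)
# Placeholder static vectors (13 values per frame), not real MFCCs.
static = [[float(t * k) for k in range(13)] for t in range(n)]
features = add_dynamic_features(static)
print(n, len(features[0]))
```

Real front ends typically use overlapping analysis windows and a regression over several neighboring frames for the deltas; the two-point difference is only the simplest variant.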
 
Acoustic analysis
 
Word recognition
acoustic model (HMM): probabilities of sequences of
feature vectors, given a sequence of words
stochastic language model: probabilities of word
sequences
 
n-best word sequences (word hypotheses graphs)
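
How the two knowledge sources combine can be sketched as follows: each candidate word sequence W gets the score log P(X|W) + log P(W), and the n best candidates are kept. All words, probabilities, and names below are invented for illustration; in a real recognizer the acoustic scores come from the HMMs and the search runs over a far larger hypothesis space.

```python
import math

# Hypothetical acoustic log-likelihoods log P(X | W) for three candidate
# word sequences; in a real system an HMM acoustic model supplies these.
ACOUSTIC_LOGP = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
    ("recognize", "beach"): -12.5,
}

# Hypothetical bigram language model P(word | previous word).
BIGRAM_P = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("recognize", "beach"): 0.001,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN_P = 1e-6  # crude floor for bigrams missing from the table


def lm_logp(words):
    """log P(W) under the bigram model, starting from the <s> symbol."""
    total = 0.0
    prev = "<s>"
    for w in words:
        total += math.log(BIGRAM_P.get((prev, w), UNSEEN_P))
        prev = w
    return total


def n_best(n, lm_weight=1.0):
    """Rank hypotheses by log P(X|W) + lm_weight * log P(W)."""
    scored = [
        (ac + lm_weight * lm_logp(words), words)
        for words, ac in ACOUSTIC_LOGP.items()
    ]
    scored.sort(reverse=True)
    return scored[:n]


for score, words in n_best(2):
    print(round(score, 2), " ".join(words))
```

With `lm_weight=0.0` the ranking follows the acoustic scores alone, which shows how the language model can overturn an acoustically preferred but unlikely word sequence.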
 
Word hypotheses graph
[Kompe 1997]
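
A word hypotheses graph can be pictured as a directed acyclic graph whose edges are scored word hypotheses between points in time. The sketch below is one possible representation, not the format used in [Kompe 1997]; the words, scores, and function name are invented. It finds the best-scoring path by dynamic programming.

```python
from collections import defaultdict

# Each edge is one word hypothesis: (start node, end node, word, log score).
# Nodes stand for points in time and are numbered in temporal order, so every
# edge goes from a lower to a higher number. Words and scores are invented.
EDGES = [
    (0, 1, "ja", -1.0),
    (0, 1, "da", -1.5),
    (1, 2, "zur", -0.5),
    (1, 2, "zu", -1.0),
    (2, 3, "not", -1.0),
    (2, 3, "nord", -2.0),
    (1, 3, "sonst", -2.0),
]


def best_path(edges, start, end):
    """Highest-scoring word sequence from start to end, or None."""
    outgoing = defaultdict(list)
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))

    best = {start: (0.0, [])}  # node -> (best score so far, words on that path)
    for node in range(start, end):  # node numbers are a topological order
        if node not in best:
            continue  # node cannot be reached from start
        score_so_far, words = best[node]
        for dst, word, score in outgoing[node]:
            candidate = (score_so_far + score, words + [word])
            if dst not in best or candidate[0] > best[dst][0]:
                best[dst] = candidate
    return best.get(end)


print(best_path(EDGES, 0, 3))
```

Keeping the whole graph instead of only the best path lets later modules (syntax, prosody) re-rank competing word sequences.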
Linguistic analysis
 
Syntactic analysis
finds optimal word sequence(s) w.r.t. word recognition
scores and syntactic rules / constraints
determine phrase structure in word sequence
relies on grammar rules and syntactic parsing
Semantic analysis
utterance interpretation (w/o context/domain info)
Pragmatic analysis
disambiguation and anaphora resolution (context info)
 
Relevance of prosody
 
Output of a standard ASR system: WHG
sequences of words without punctuation and prosody
    
ja zur not geht's auch am samstag
Alternative realizations with prosody
(1) Ja, zur Not geht's auch am Samstag.
    'Yes, if necessary it will also be possible on Saturday.'
(2) Ja, zur Not. Geht's auch am Samstag?
    'Yes, if absolutely necessary. Will it also be possible on Saturday?'
(3) - (12) …
… not only in contrived examples!
 
Relevance of prosody
 
Prosodic structure
sentence mode:
  Treffen wir uns bei Ihnen?
    'Do we meet at your place?'
  Treffen wir uns bei Ihnen!
    'Let's meet at your place!'
phrase boundaries:
  Fünfter geht bei mir, nicht aber neunzehnter.
    'The fifth is possible for me, but not the nineteenth.'
  Fünfter geht bei mir nicht, aber neunzehnter.
    'The fifth is not possible for me, but the nineteenth is.'
accents:
  Ich fahre doch nach Hamburg.
    'I will go to Hamburg (as you know).'
  Ich fahre DOCH nach Hamburg.
    'I will go to Hamburg after all.'
 
Prosody in ASR
 
Historical perspective
application domains for ASR
until mid/late 1990s:  information retrieval dialog
since then also: less restricted domains, free dialog
a chance to demonstrate the impact of prosody!
dialog turn segmentation
information structure
user state and affect
first end-to-end dialog system using prosody:
Verbmobil
 
Role model systems: Verbmobil
 
Architecture
multilingual prosody module: German, English,
Japanese
common algorithms, shared features, separate data
input: speech signal, word hypotheses graph (WHG)
output: prosodically annotated WHG (prosody by word),
feeding other dialog system components (incl. MT):
detected boundaries → dialog act segmentation, dialog manager, deep syntactic analysis
detected phrase accents → semantic module
detected questions → semantic module, dialog manager
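
A rough sketch of what "prosody by word" could look like for one path through the graph: every word carries scores for boundary, accent, and question, and a downstream component segments the word chain at detected boundaries. The field names, the threshold, and all scores are assumptions made for illustration and are not the Verbmobil data format. The scores are chosen to reproduce reading (2) of the Saturday example.

```python
from collections import namedtuple

# Hypothetical per-word annotation; the real module's fields are not shown
# on the slides.
ProsodicWord = namedtuple(
    "ProsodicWord", "word p_boundary p_accent p_question"
)


def segment(words, threshold=0.5):
    """Split the word chain at detected boundaries and mark questions."""
    segments, current = [], []
    for i, w in enumerate(words):
        current.append(w.word)
        is_last = i == len(words) - 1
        if w.p_boundary >= threshold or is_last:
            mark = "?" if w.p_question >= threshold else "."
            segments.append(" ".join(current) + mark)
            current = []
    return segments


def accented(words, threshold=0.5):
    """Words whose accent score reaches the threshold."""
    return [w.word for w in words if w.p_accent >= threshold]


# Invented scores: a strong boundary after "not", a question at the end.
utterance = [
    ProsodicWord("ja", 0.3, 0.2, 0.0),
    ProsodicWord("zur", 0.0, 0.1, 0.0),
    ProsodicWord("not", 0.9, 0.8, 0.1),
    ProsodicWord("geht's", 0.0, 0.2, 0.0),
    ProsodicWord("auch", 0.0, 0.1, 0.0),
    ProsodicWord("am", 0.0, 0.0, 0.0),
    ProsodicWord("samstag", 0.9, 0.7, 0.8),
]
print(segment(utterance))
print(accented(utterance))
```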
 
Role model systems: SmartKom
 
Beyond Verbmobil: (emotional) user state
architecture: input and output as in Verbmobil
prosodic events: accents, boundaries, rising BTs
user state as a 7-/4-/2-class problem:
joyful (s/w), surprised, neutral, hesitant, angry (w/s)
joyful, neutral, hesitant, angry
angry vs. not angry
realistic user states evoked in Wizard-of-Oz (WOZ) experiments
large feature vector: 121 features (91 prosodic + 30
part-of-speech), different subsets for events and user state
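
The effect of collapsing the label set can be illustrated with a small sketch. It covers only the 7-class and 2-class problems, because the slide does not say where "surprised" goes in the 4-class set. Reading "s/w" as strong/weak is an assumption, the label strings are hypothetical, and the data are invented. Plain accuracy serves as a stand-in for the evaluation measure.

```python
# Seven user-state labels as listed on the slide; reading "s/w" as
# strong/weak is an assumption, and the label strings are hypothetical.
SEVEN_CLASSES = [
    "joyful_strong", "joyful_weak", "surprised", "neutral",
    "hesitant", "angry_weak", "angry_strong",
]


def to_two_class(label):
    """Collapse a 7-class label to the 2-class problem: angry vs. not angry."""
    if label not in SEVEN_CLASSES:
        raise ValueError(f"unknown user state: {label}")
    return "angry" if label.startswith("angry") else "not angry"


def accuracy(reference, predicted):
    """Share of items whose predicted label matches the reference."""
    hits = sum(r == p for r, p in zip(reference, predicted))
    return hits / len(reference)


# Invented reference labels and classifier output for five turns.
ref = ["neutral", "angry_weak", "joyful_weak", "angry_strong", "hesitant"]
hyp = ["neutral", "angry_strong", "neutral", "angry_strong", "surprised"]

acc7 = accuracy(ref, hyp)
acc2 = accuracy([to_two_class(r) for r in ref], [to_two_class(h) for h in hyp])
print(acc7, acc2)
```

Confusions between neighboring fine-grained classes disappear once the classes are merged, which is one reason coarser label sets tend to yield higher recognition rates.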
 
SmartKom
 
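Classification performance (% correctly recognized)

prosodic events            train    test
  prominent words           81.0    77.0
  phrase boundaries         89.8    88.6
  rising BT                 72.0    66.4

(emotional) user state
  user state (7)            30.8 *
  user state (4)            68.3 **
  user state (2)            66.8 *

 * leave one out
** multimodal

[Zeissler et al. 2006]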
Role model systems: SRI
 
Acoustic feature space of prosodic events
similar to VM/SK approach: features derived from F0
contour, duration (phones, pauses, rate), energy
feature extraction by proprietary toolkit, but claimed to
be feasible with standard software (Praat, Snack)
standard statistical classifiers
all models are probabilistic and trainable to tasks
integration of prosodic and lexical modeling
language-independent: English, Mandarin, Arabic
 
[www.speech.sri.com/people/ees/prosody]
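
As a minimal sketch of the kind of features named above, the following functions derive a few values from an F0 contour and an energy contour sampled every 10 ms. They are not the SRI toolkit's features; the function names, the convention that unvoiced frames have F0 = 0, and the silence threshold are assumptions.

```python
def f0_features(f0, frame_shift_s=0.01):
    """Basic features of an F0 contour; unvoiced frames are given as 0.0."""
    voiced = [(i * frame_shift_s, v) for i, v in enumerate(f0) if v > 0.0]
    if len(voiced) < 2:
        return None  # too few voiced frames to estimate a slope
    times = [t for t, _ in voiced]
    values = [v for _, v in voiced]
    mean_t = sum(times) / len(times)
    mean_f0 = sum(values) / len(values)
    # Least-squares slope of F0 over time (Hz per second).
    num = sum((t - mean_t) * (v - mean_f0) for t, v in voiced)
    den = sum((t - mean_t) ** 2 for t in times)
    return {
        "mean_hz": mean_f0,
        "range_hz": max(values) - min(values),
        "slope_hz_per_s": num / den,
    }


def longest_pause_s(energy, silence_threshold, frame_shift_s=0.01):
    """Duration of the longest run of frames below the energy threshold."""
    longest = run = 0
    for e in energy:
        run = run + 1 if e < silence_threshold else 0
        longest = max(longest, run)
    return longest * frame_shift_s


# Invented contours: F0 rises over four voiced frames; three silent frames.
rising = f0_features([0.0, 100.0, 110.0, 120.0, 130.0, 0.0])
pause = longest_pause_s([0.5, 0.6, 0.01, 0.02, 0.01, 0.7], 0.1)
print(rising, pause)
```

A positive slope near the end of an utterance is the sort of cue a classifier could use for rising boundary tones, and pause duration is a typical cue for phrase boundaries.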
 
Parameters and functions
 
Analysis problem: many-to-many mapping of
parameters to functions

parameters:
F0
duration
intensity
spectral prop.
functions:
lexical tone
lexical stress, word accent
syllabic stress
accenting
prosodic phrasing
sentence mode
information structure
discourse structure
speaking rate
pauses
rhythm
voice quality
phonation type
Prosody recognition
 
Some approaches to exploiting prosody for ASR
recognition of ToBI events [Ostendorf & Ross 1997; ToBI-Lite: Wightman et al. 2000]
resolving syntactic ambiguities using phrase breaks [Hunt 1997]
analysis-by-synthesis detection of Fujisaki model parameters [Hirose 1997; Nakai et al. 1997]
detection of phrase boundaries, sentence mode, and accents [Verbmobil: Hess et al. 1997]
detection of prosodic events to support dialog manager [Verbmobil, SmartKom: Batliner & Nöth et al. 2000-2003]
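
To illustrate the analysis-by-synthesis idea, the sketch below uses the Fujisaki model in its commonly cited form: ln F0(t) is the sum of a base value, phrase components (impulse responses), and accent components (step responses). The constants, the function names, and the one-parameter grid search are assumptions; the procedures in the cited work are not described on the slide and are certainly more elaborate.

```python
import math

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9  # commonly used constants (assumed here)


def phrase_response(t):
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def accent_response(t):
    """Step response of the accent control mechanism, with ceiling GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def ln_f0(t, base_hz, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + phrase components + accent components."""
    value = math.log(base_hz)
    for amp, t0 in phrase_cmds:
        value += amp * phrase_response(t - t0)
    for amp, t1, t2 in accent_cmds:
        value += amp * (accent_response(t - t1) - accent_response(t - t2))
    return value


def fit_accent_amplitude(times, observed_ln_f0, base_hz, phrase_cmds, t1, t2):
    """Analysis by synthesis: synthesize with each amplitude, keep the best."""
    best_amp, best_err = None, float("inf")
    for step in range(0, 101):
        amp = step / 100.0
        err = sum(
            (ln_f0(t, base_hz, phrase_cmds, [(amp, t1, t2)]) - obs) ** 2
            for t, obs in zip(times, observed_ln_f0)
        )
        if err < best_err:
            best_amp, best_err = amp, err
    return best_amp


# Synthetic "observed" contour with a known accent amplitude of 0.4.
times = [i * 0.01 for i in range(150)]
phrase = [(0.5, 0.0)]
target = [ln_f0(t, 100.0, phrase, [(0.4, 0.3, 0.6)]) for t in times]
print(fit_accent_amplitude(times, target, 100.0, phrase, 0.3, 0.6))
```

In practice the command timings and the phrase commands have to be estimated as well, and the observed contour is noisy and interrupted by unvoiced stretches.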
 
Conclusion
 
Prosody is an integral part of natural speech
processed and used extensively by human listeners
Few ASR/ASU systems exploit prosodic structure
Prosody can play an important role in ASR
prosodic features are potentially useful on all levels of
ASR/ASU systems, including affective user state
 
Human-machine dialog
 
 
 
 
Thanks!
 
