Algorithm for Determining Endpoints in Speech Recognition

 
An Algorithm for Determining the Endpoints for Isolated Utterances
 
L.R. Rabiner and M.R. Sambur
 
The Bell System Technical Journal, Vol. 54, No. 2, Feb. 1975, pp. 297-315
 
Outline
 
Intro to problem
Solution
Algorithm
Summary
 
Motivation
 
Word recognition needs to detect word boundaries in speech
Recognizing silence can reduce:
Processing load
(Network not identified as savings source)
(Hands-free operation not identified as convenience)
Relatively easy in a soundproof room, with digitized tape
 
Visual Recognition
 
Easy
Note how quiet beginning is (tape)
 
Eight
 
Slightly Tougher Visual Recognition
 
“sss” starts crossing the ‘zero’ line, so the start can still be detected
 
Six
 
Tough Visual Recognition
 
Eye picks ‘B’, but ‘A’ is real start
/f/ is a weak fricative
 
Four
 
Tough Visual Recognition
 
Eye picks ‘A’, but ‘B’ is real endpoint
Final /v/ becomes devoiced
 
Five
 
Tough Visual Recognition
 
Difficult to say where final trailing off ends
 
Nine
 
The Problem
 
Noisy computer room with background noise
Weak fricatives: /f, th, h/
Weak plosive bursts: /p, t, k/
Final nasals (ex: “nine”)
Voiced fricatives becoming devoiced (ex: “five”)
Trailing off of sounds (ex: “binary”, “three”)
Needs to be done with simple, efficient processing
Avoid hardware costs
 
The Solution
 
Two measurements:
Energy
Zero crossing rate
Show: simple, fast, accurate
 
Energy
 
Sum of magnitudes of 10 ms of sound, centered on interval:
$E(n) = \sum_{i=-50}^{50} |s(n + i)|$
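
The energy measure above can be sketched in a few lines of Python. This is only an illustration: the function name `frame_energy` and the edge clipping are assumptions, and the 101-sample window corresponds to about 10 ms only if the signal is sampled at roughly 10 kHz, which the slide does not state.

```python
def frame_energy(samples, n, half_width=50):
    """Sum of sample magnitudes in a window centered on index n.

    half_width=50 gives the 101-sample window i = -50..50 from the slide.
    That is about 10 ms only under an assumed 10 kHz sampling rate.
    """
    lo = max(0, n - half_width)                 # clip the window at the signal edges
    hi = min(len(samples), n + half_width + 1)
    return sum(abs(x) for x in samples[lo:hi])


# Synthetic signal: 200 low-level samples followed by 200 louder ones.
signal = [1, -1] * 100 + [20, -20] * 100
quiet = frame_energy(signal, 100)   # window lies entirely in the quiet part
loud = frame_energy(signal, 300)    # window lies entirely in the loud part
```

Because the measure uses magnitudes rather than squares, it needs no multiplications, which fits the slide's emphasis on simple processing.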
 
Zero (Level) Crossing Rate
 
Remember, digital audio values are changes in air pressure (higher or lower than base)
Base/midpoint is “zero”
But the midpoint is a positive value if samples are unsigned (e.g., 127 for an unsigned byte)
Zero crossing rate is the number of zero crossings per 10 ms
Normal number of cross-overs during silence
Increase in cross-overs during speech
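
A minimal sketch of counting zero crossings, including the unsigned-sample case mentioned above. The names `zero_crossings` and `zero_crossing_rate`, the `midpoint` parameter, and the 100-sample frame (10 ms at an assumed 10 kHz) are illustrative choices, not part of the original slides.

```python
def zero_crossings(samples, midpoint=0):
    """Count sign changes of (sample - midpoint) between successive samples."""
    count = 0
    prev_sign = 0
    for x in samples:
        centered = x - midpoint
        sign = (centered > 0) - (centered < 0)
        if sign == 0:
            continue                 # skip samples sitting exactly on the midpoint
        if prev_sign != 0 and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count


def zero_crossing_rate(samples, frame_len=100, midpoint=0):
    """Zero crossings per non-overlapping frame of frame_len samples."""
    return [zero_crossings(samples[i:i + frame_len], midpoint)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


# Unsigned-byte style data: subtract the midpoint (127 here) before comparing signs.
fast = zero_crossings([130, 124] * 50, midpoint=127)          # alternates every sample
slow = zero_crossings([130] * 50 + [124] * 50, midpoint=127)  # crosses once
```

Crossings that fall exactly on a frame boundary are not counted in this sketch, since each frame is processed on its own.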
 
The Algorithm: Startup
 
At initialization, record sound for 100 ms
A measure of background noise
Assume ‘silence’
Compute average (IZC’) and std dev (σ) of zero crossing rate
Choose zero-crossing threshold (IZCT)
Threshold for unvoiced speech
IZCT = min(25 / 10 ms, IZC’ + 2σ)
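
The startup threshold can be sketched as follows. The function name and the sample data are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which estimator is meant.

```python
import statistics


def zero_crossing_threshold(silence_zc_rates, fixed_limit=25):
    """IZCT = min(fixed limit, mean + 2 * std dev) over the silent startup frames.

    silence_zc_rates holds one zero-crossing count per 10 ms frame of the
    initial 100 ms, which is assumed to contain no speech.
    """
    mean = statistics.mean(silence_zc_rates)
    std = statistics.pstdev(silence_zc_rates)   # population std dev (an assumption)
    return min(fixed_limit, mean + 2 * std)


# Ten hypothetical frame counts from a quiet room: the statistical limit applies.
izct_quiet = zero_crossing_threshold([4, 6, 5, 5, 4, 6, 5, 5, 4, 6])
# A noisy room where the background alone exceeds the cap: the fixed limit applies.
izct_noisy = zero_crossing_threshold([30] * 10)
```

The `min` keeps the threshold from drifting above 25 crossings per 10 ms when the background itself has a high crossing rate.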
 
The Algorithm: Thresholds
 
Compute energy, E(n), for interval
Get max, IMX
Have ‘silence’ energy, IMN
Compute two values:
I1 = 0.03 * (IMX - IMN) + IMN (3% of peak energy)
I2 = 4 * IMN (4x silent energy)
Get energy thresholds (ITU and ITL)
ITL = MIN(I1, I2)
ITU = 5 * ITL
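
As a worked example of the threshold formulas above, here is a small sketch. The function name `energy_thresholds` and the numeric inputs are made up for illustration.

```python
def energy_thresholds(imx, imn):
    """Return (ITL, ITU) from the peak energy IMX and the silence energy IMN."""
    i1 = 0.03 * (imx - imn) + imn   # 3% of the way from silence up to the peak
    i2 = 4 * imn                    # four times the silence energy
    itl = min(i1, i2)               # lower threshold
    itu = 5 * itl                   # upper threshold
    return itl, itu


# Loud utterance over quiet background: I2 = 40 is smaller than I1 = 309.7.
itl_a, itu_a = energy_thresholds(imx=10000, imn=10)
# Softer utterance: I1 = 39.7 is smaller than I2 = 40.
itl_b, itu_b = energy_thresholds(imx=1000, imn=10)
```

Taking the minimum of I1 and I2 keeps ITL low in both cases, so weak word onsets are less likely to be missed.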
 
The Algorithm: Energy Computation
 
Search sample for energy greater than ITL
Save as tentative start of speech, say s
Search for energy greater than ITU
s is confirmed as start of speech
If energy falls below ITL first, restart
Search for energy less than ITL
Save as end of speech
Results in conservative estimates
True endpoints may lie outside the estimated interval
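
The two-threshold search can be sketched as below, operating on a list with one energy value per frame. This follows the slide's wording literally, including a forward scan for the end point; the function name, the return convention, and the sample data are assumptions.

```python
def energy_endpoints(energy, itl, itu):
    """Return (start, end) frame indices, or None if ITU is never exceeded."""
    start = None
    confirmed_at = None
    for m, e in enumerate(energy):
        if start is None and e > itl:
            start = m                    # tentative start of speech
        if start is not None:
            if e < itl:
                start = None             # fell below ITL before reaching ITU: restart
            elif e > itu:
                confirmed_at = m         # tentative start is confirmed
                break
    if confirmed_at is None:
        return None
    end = len(energy) - 1                # default if energy never drops below ITL
    for m in range(confirmed_at, len(energy)):
        if energy[m] < itl:
            end = m - 1                  # last frame still at or above ITL
            break
    return start, end


# Frame 2 rises above ITL but drops again, so the search restarts at frame 5.
frames = [1, 2, 6, 3, 1, 7, 12, 60, 80, 40, 9, 2, 1]
endpoints = energy_endpoints(frames, itl=5, itu=50)
```

A forward scan stops at the first dip below ITL, so a word with a quiet gap in the middle would be cut short; scanning backward from the end of the recording is another way to read this step.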
 
The Algorithm: Zero Crossing Computation
 
Search back 250 ms from s
Count number of intervals where rate exceeds IZCT
If 3+, set starting point, s, to the first time the threshold was exceeded
Else s remains the same
Do similar search after end
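
A sketch of the zero-crossing refinement, assuming one zero-crossing count per 10 ms frame so that 250 ms is 25 frames. The function names are hypothetical, and moving the end point to the last frame that exceeds IZCT is one reading of "similar search", not something the slide spells out.

```python
def refine_start(zc_rates, start, izct, lookback=25, min_hits=3):
    """Move start back to the first high-ZCR frame if at least min_hits are found."""
    lo = max(0, start - lookback)
    hits = [m for m in range(lo, start) if zc_rates[m] > izct]
    return hits[0] if len(hits) >= min_hits else start


def refine_end(zc_rates, end, izct, lookahead=25, min_hits=3):
    """Mirror image: move end forward to the last high-ZCR frame (an assumption)."""
    hi = min(len(zc_rates), end + 1 + lookahead)
    hits = [m for m in range(end + 1, hi) if zc_rates[m] > izct]
    return hits[-1] if len(hits) >= min_hits else end


# Frames 10, 11 and 13 exceed IZCT = 20 just before the energy-based start at 16.
zc = [3] * 10 + [30, 28, 4, 31, 5, 5] + [8] * 10 + [3] * 10
new_start = refine_start(zc, start=16, izct=20)
# No high-ZCR frames follow the energy-based end at 25, so it stays put.
new_end = refine_end(zc, end=25, izct=20)
```

Requiring three or more frames above the threshold guards against moving an endpoint because of a single noisy frame.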
 
The Algorithm: Example
 
(Word begins with strong fricative)
 
Algorithm: Examples
 
Caught trailing /f/
 
 
Half
 
Algorithm: Examples
 
Four
 
(Notice how different each “four” is)
 
Evaluation: Part 1
 
54-word vocabulary
Read by 2 males, 2 females
No gross errors (off by more than 50 ms)
Some small errors
Losing weak fricatives
None affected recognition
 
Evaluation: Part 2
 
10 speakers
Count 0 to 9
No errors at all
 
Evaluation: Part 3
 
Your Project 1b…

This article discusses an algorithm proposed by L.R. Rabiner and M.R. Sambur in 1975 for determining endpoints in isolated utterances. The algorithm focuses on detecting word boundaries in speech through the recognition of silence, which can lead to reduced processing load and increased convenience, particularly in noisy environments. By utilizing measurements such as energy and zero-crossing rate, the algorithm offers a simple, efficient, and cost-effective solution for speech recognition tasks.


Uploaded on Oct 06, 2024 | 2 Views



