Speech Signal Processing Course Outline and Grading System
This course outline covers various topics related to speech signal processing, including properties of speech signals, quantization, coding techniques, and standards in cellular telephony. The grading system includes midterm exams, homework assignments, a term project, and a final exam. Key reference books are also listed for further study in the field.
Presentation Transcript
SPEECH COMPRESSION
- COURSE OUTLINE AND RULES
- PROPERTIES OF THE SPEECH SIGNAL
A. Enis Cetin
Course Outline
- Week 1: Properties of Speech Signals, Pitch, Formants, Phonemes etc.; Quantization and PCM
- Week 2: Vector Quantization, DPCM and ADPCM
- Week 3: Subband Coding (Wavelet based Speech Coding)
- Week 4: AR random processes and Autoregressive (AR) Modelling of Speech
- Week 5: LPC-10 Standard (Linear Predictive Coding), pitch estimation, and midterm exam
- Week 6: Line Spectral Frequencies, LPC-10e and MELP
- Week 7: Analysis by Synthesis LPC Coding and Code-Excited Linear Predictive Coding (CELP)
- Week 8: Harmonic Speech Coding
- Week 9: European Digital Cellular Telephony Standards (GSM)
- Week 10: North American Digital Cellular Telephony Standards
Grading
- Midterm Exam: 30%
- Homework: 10%
- Term Project (implementation of a speech coding algorithm using MATLAB or on an Android or iPhone platform): 20%
- Final Exam: 40%
Books and References
Key reference:
- Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd Edition, by A. M. Kondoz, Wiley, Oct 29, 2004
Other related books:
- Signal Compression, Coding of Speech, Audio, Image and Video (Selected Topics in Electronics and Systems), by N. S. Jayant, May 1997
- Digital Speech Processing Using Matlab (Signals and Communication Technology), by E. S. Gopi, Springer, Dec 4, 2013
- Voice Compression and Communications: Principles and Applications for Fixed and Wireless Channels, by Lajos L. Hanzo and F. Clare A. Somerville, Sep 11, 2001
- Theory and Applications of Digital Speech Processing, by Lawrence Rabiner and Ronald Schafer, Mar 13, 2010
- Discrete-Time Speech Signal Processing: Principles and Practice, by Thomas F. Quatieri, Prentice Hall, Nov 8, 2001
- Video, Speech, and Audio Signal Processing and Associated Standards (The Digital Signal Processing Handbook), by Vijay Madisetti, Nov 20, 2009
- Speech and Audio Signal Processing: Processing and Perception of Speech and Music, by Ben Gold and Nelson Morgan, Aug 23, 2011
Speech Generation
- Air is pushed out of the lungs
- The vocal folds (cords) in the larynx vibrate
- The shape of the vocal tract introduces resonances
- Tongue, teeth, lips, and velum (nasal passage) modify the sound
Speech Signal (figure)
- Sampling frequency = 8 kHz
Hearing
- Sound waves reach our ears and vibrate the eardrum
- This causes the fluid in the spiral-shaped cochlea to vibrate
- The fluid vibrates the hairs inside the cochlea
- Different frequencies vibrate different hairs
- The cochlea thus converts the time-domain signal to the frequency domain (the Fourier transform magnitude of speech is more important than its phase)
- People hear with their brains
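The claim that the Fourier transform magnitude matters more than the phase can be illustrated with a small sketch. It is only an illustration under assumptions that are not on the slides: the two-tone test signal, the 80-sample frame, and the helper names `dft_magnitude` and `two_tone` are all hypothetical. Shifting the phase of one tone changes the waveform but leaves the magnitude spectrum unchanged.

```python
import cmath
import math

FS = 8000  # sampling frequency in Hz, as on the slides
N = 80     # assumed frame length (10 ms), so each DFT bin is FS / N = 100 Hz wide


def dft_magnitude(x):
    """Return |X[k]| for k = 0..N/2 using the direct DFT definition."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def two_tone(phase):
    """Hypothetical test signal: 500 Hz + 1500 Hz; `phase` shifts only the second tone."""
    return [
        math.sin(2 * math.pi * 500 * t / FS)
        + 0.5 * math.sin(2 * math.pi * 1500 * t / FS + phase)
        for t in range(N)
    ]


signal_a = two_tone(0.0)
signal_b = two_tone(math.pi / 2)  # same tones, different phase
mag_a = dft_magnitude(signal_a)
mag_b = dft_magnitude(signal_b)

# The two largest magnitude bins give the tone frequencies.
peak_bins = sorted(range(len(mag_a)), key=lambda k: mag_a[k], reverse=True)[:2]
peak_freqs = sorted(k * FS / N for k in peak_bins)
print(peak_freqs)
```

The waveforms `signal_a` and `signal_b` differ sample by sample, yet `mag_a` and `mag_b` agree to within rounding error, which is the sense in which a magnitude-only analysis ignores phase.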
Phonemes
- Considered the fundamental units of speech
- Changing a phoneme can change the meaning of the word: pat to fat, pat to rat
- There are more phonemes than letters in the Latin (Roman) alphabet
- The concept of a phonetic alphabet was first invented by the Phoenicians in the Middle East
US English Vowels
- AA: wAshington
- AE: fAt, bAd
- AH: bUt, hUsh
- AO: lAWn, mAll
- AW: hOW, sOUth
- AX: About, cAnoe
- AY: hIde, bUY
- EH: gEt, fEAther
- ER: makER, sEARch
- EY: gAte, EIght
- IH: bIt, shIp
- IY: bEAt, shEEp
- OW: lOne, nOse
- OY: tOY, OYster
- UH: fUll
- UW: fOOl
Vowels
- Almost periodic signals
- They have high amplitudes compared to other phonemes (consonants)
- They are all voiced sounds (vocal cords vibrate)
- Information content is low compared to consonants: Washington = w-sh-n-gt-n versus a-i-o
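The "Washington" example can be reproduced with a few lines of code. This is a rough sketch that works on spelling rather than on phonemes, so it only approximates the slide's point; the helper name `split_letters` is hypothetical.

```python
VOWEL_LETTERS = set("aeiou")  # letters, used here as a crude stand-in for vowel phonemes


def split_letters(word):
    """Split a word into its consonant letters and its vowel letters."""
    lowered = word.lower()
    consonants = "".join(c for c in lowered if c.isalpha() and c not in VOWEL_LETTERS)
    vowels = "".join(c for c in lowered if c in VOWEL_LETTERS)
    return consonants, vowels


consonants, vowels = split_letters("Washington")
print(consonants, vowels)  # wshngtn aio
```

The consonant skeleton is still recognizable as the word, while the vowel string alone is not.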
Period (1/pitch) of Vowels
- Pitch: rate of vibration of the vocal cords during voiced speech
- Males: 80-140 Hz (times a second)
- Females: 130-220 Hz
- Children: 180-320 Hz
- Formant: a peak of the spectrum of the smoothed speech signal of a vowel. The first peak is the first formant. (The symbol F0 conventionally denotes the pitch frequency, and the formants are numbered F1, F2, ...)
- Some consonants can also be periodic, e.g. l, r
US English Consonants
- Stops: P, B, T, D, K, G
- Fricatives: F, V, HH, S, Z, SH, ZH
- Affricates: CH, JH
- Nasals: N, M, NG
- Glides: L, R, Y, W
- They can be voiced or unvoiced
- They carry more information than vowels
Voiced and Unvoiced Speech
- Speech signal segments can also be classified as voiced (e.g., a, e, i, b, r) or unvoiced (e.g., p, t, ch)
- A voiced speech segment has relatively high energy and is almost periodic; its period defines the pitch of voiced speech
- An unvoiced speech segment looks like random noise with no periodicity (but it is not white noise)
- Mixed segments are neither purely voiced nor purely unvoiced but a mixture of the two. They occur in transition regions, when speech changes from voiced to unvoiced or from unvoiced to voiced
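The voiced/unvoiced distinction can be sketched with two simple frame features: short-time energy (high for voiced speech, as stated above) and zero-crossing rate (high for noise-like signals). The zero-crossing feature, both thresholds, and the synthetic test frames are assumptions for illustration and do not come from the slides. In particular, the uniform noise below is white, so it is only a crude stand-in for real unvoiced speech.

```python
import math
import random

FS = 8000  # sampling frequency in Hz, as on the slides


def short_time_energy(frame):
    """Average power of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify(frame, energy_threshold=0.05, zcr_threshold=0.25):
    """Label a frame; both thresholds are hypothetical and would need tuning on real data."""
    energy = short_time_energy(frame)
    zcr = zero_crossing_rate(frame)
    if energy > energy_threshold and zcr < zcr_threshold:
        return "voiced"
    if energy <= energy_threshold and zcr >= zcr_threshold:
        return "unvoiced"
    return "mixed"


# Synthetic 20 ms frames (160 samples at 8 kHz).
voiced_frame = [
    math.sin(2 * math.pi * 100 * n / FS) + 0.5 * math.sin(2 * math.pi * 200 * n / FS)
    for n in range(160)
]
rng = random.Random(0)  # fixed seed so the example is repeatable
unvoiced_frame = [rng.uniform(-0.1, 0.1) for _ in range(160)]

print(classify(voiced_frame), classify(unvoiced_frame))
```

Frames that satisfy only one of the two conditions fall into the "mixed" label, which loosely mirrors the transition regions described above.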
Speech Waveform Examples (figure)
- Unvoiced speech is amplified 5 times
- Sampling frequency is 8 kHz
Voiced and Unvoiced Consonants
- Unvoiced consonants (no vibration): p, t, ch, k, f, th, s, sh, h
- Voiced consonants (vocal cords vibrate): b, d, j of joke, g, v, th of that, z, s of vision, m, n, ng of thing, l, r, w, y of you
- Voiced consonants have low amplitudes compared to vowels
Number of Phonemes in a Language
- US English: 43
- UK English: 44
- Japanese: 25
- Hindi: 81
- Hawaiian: about 12
- Because classifying speech sounds into phonemes is a kind of quantization, phoneme counts can differ from book to book
Prosody of Speech
- Intonation: tune or melody
- Duration: how long or short each phoneme is
- Phrasing: where the breaks are
Narrow Band Speech Compression
Narrow-band speech compression (telephone speech, Skype, Messenger, Google-Talk)
- Sampling frequency = 8 kHz (narrow band)
- Intelligible speech
- We can recognize the speaker
Wide-band speech and audio compression
- Sampling frequency = 44.1 kHz
- MP3 (MPEG-1 Audio Layer III)
- Music and teleconferencing
- CD: almost no compression
We will mainly study narrow-band speech compression in this course
Narrow-band Speech Compression
Waveform coding
- Tries to preserve the shape of the waveform
- e.g., PCM, DPCM, subband coding (wavelet); MP3 is also a waveform coder
- High bit rates: 64 kbit/s down to 16 kbit/s
Vocoders (voice coders)
- Parametric coders: they extract parameters from the speech, and the parameters are transmitted to the receiver
- The shape of the waveform is not important
- e.g., LPC-10, MELP, CELP, GSM vocoder
- Low bit rates, e.g., 2.4 kbit/s
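The 64 kbit/s figure for PCM follows from simple arithmetic, sketched below. The sample sizes are assumptions not stated on the slides: 8 bits per sample for telephone PCM, and 16-bit stereo for CD audio. The function name `bit_rate_kbps` is hypothetical.

```python
def bit_rate_kbps(sampling_rate_hz, bits_per_sample, channels=1):
    """Uncompressed bit rate in kbit/s."""
    return sampling_rate_hz * bits_per_sample * channels / 1000.0


telephone_pcm = bit_rate_kbps(8000, 8)           # assumed 8 bits per sample
cd_audio = bit_rate_kbps(44100, 16, channels=2)  # assumed 16-bit stereo
vocoder = 2.4                                    # low-rate vocoder example from the slide

print(telephone_pcm)            # 64.0
print(cd_audio)                 # 1411.2
print(telephone_pcm / vocoder)  # compression ratio relative to 64 kbit/s PCM
```

Under these assumptions a 2.4 kbit/s vocoder uses roughly 1/27 of the bits of 64 kbit/s PCM, which is why it cannot afford to preserve the waveform shape.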
MOS: Mean Opinion Score Scale
- Measure of the quality of a coder
- It is based on psychological (subjective listening) tests

Grade (MOS)   Subjective opinion                          Quality
5             Excellent: imperceptible                    Transparent
4             Good: perceptible, but not annoying         Toll
3             Fair: slightly annoying                     Communication
2             Poor: annoying                              Synthetic
1             Bad: very annoying                          Bad

- Sound quality testing was carried out by people with "golden ears"
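As a small illustration of how a MOS value arises, the sketch below averages listener grades on the five-point scale above. The ratings are made up for the example, and the function name `mean_opinion_score` is hypothetical. Real listening tests follow a controlled protocol that is not modelled here.

```python
QUALITY_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings):
    """Average a list of integer grades from 1 (Bad) to 5 (Excellent)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in QUALITY_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


listener_ratings = [4, 5, 4, 3, 4, 4, 5, 3]  # hypothetical grades from 8 listeners
mos = mean_opinion_score(listener_ratings)
print(mos)  # 4.0
```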
Comparison of telephone band speech coding standards

Standard     Year   Algorithm        Bit rate (kb/s)   MOS         Delay+
G.711        1972   Companded PCM    64                4.3         0.125
G.726        1991   VBR-ADPCM        16/24/32/40       toll        0.125
G.728        1994   LD-CELP          16                4           0.625
G.729        1995   CS-ACELP         8                 4           15
G.723.1      1995   A/MP-MLQ CELP    5.3/6.3           toll        37.5
ITU 4        -      -                4                 toll        25
GSM FR       1989   RPE-LTP          13                3.7         20
GSM EFR      1995   ACELP            12.2              4           20
GSM/2        1994   VSELP            5.6               3.5         24.375
IS54         1989   VSELP            7.95              3.6         20
IS96         1993   Q-CELP           0.8/2/4/8.5       3.5         20
JDC          1990   VSELP            6.7               commun.     20
JDC/2        1993   PSI-CELP         3.45              commun.     40
Inmarsat-M   1990   IMBE             4.15              3.4         78.75
FS1015       1984   LPC-10           2.4               synthetic   112.5
FS1016       1991   CELP             4.8               3           37.5
New FS 2.4   1997   MELP             2.4               3           45.5
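The "Companded PCM" entry for G.711 refers to compressing the amplitude before uniform quantization and expanding it afterwards. The sketch below uses the continuous mu-law formula with mu = 255 as an assumption. It is not the bit-exact segmented encoder of the standard, and all function names are hypothetical. Input samples are assumed to lie in [-1, 1].

```python
import math

MU = 255.0  # assumed mu-law parameter


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1], stretching small amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def quantize_uniform(y, bits=8):
    """Uniform mid-rise quantizer on [-1, 1]; returns the reconstruction level."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(int((y + 1.0) / step), levels - 1)
    return -1.0 + (index + 0.5) * step


def companded_pcm(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_law_expand(quantize_uniform(mu_law_compress(x), bits))


small_sample = 0.01
error_uniform = abs(quantize_uniform(small_sample) - small_sample)
error_companded = abs(companded_pcm(small_sample) - small_sample)
print(error_uniform, error_companded)
```

For a low-amplitude sample such as 0.01, the companded quantizer gives a noticeably smaller error than plain 8-bit uniform quantization, at the cost of coarser steps for loud samples.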
Experiments
- Speech sampled at 8 kHz and quantized to 1 bit/sample (sign information only) is still intelligible
- Pass the speech through an all-pass filter: you may not notice any difference between input and output
- Speech and audio waveforms are zero-mean signals
- Speech can be assumed to be wide-sense stationary during each phoneme
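The first two experiments can be sketched in code. The first-order all-pass filter H(z) = (-a + z^-1) / (1 - a z^-1) with a = 0.5 is an assumed example, since the slides do not say which all-pass filter is meant, and the function names are hypothetical. The filter changes the waveform (phase) but has unit gain at every frequency.

```python
import cmath
import math


def one_bit_quantize(samples):
    """Keep only the sign of each sample (1 bit per sample)."""
    return [1 if s >= 0 else -1 for s in samples]


def allpass_filter(samples, a=0.5):
    """First-order all-pass filter: y[n] = -a*x[n] + x[n-1] + a*y[n-1]."""
    out = []
    x_prev = 0.0
    y_prev = 0.0
    for x in samples:
        y = -a * x + x_prev + a * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out


def allpass_gain(freq_hz, fs=8000, a=0.5):
    """Magnitude of H(z) on the unit circle at the given frequency."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return abs((-a + z_inv) / (1 - a * z_inv))


print(one_bit_quantize([0.3, -0.2, 0.0, -0.7]))         # [1, -1, 1, -1]
print(allpass_filter([1.0, 0.0, 0.0, 0.0]))             # impulse response, not an impulse
print([allpass_gain(f) for f in (100, 1000, 3000)])     # all approximately 1.0
```

Because the ear is mostly sensitive to the magnitude spectrum, a filter of this kind alters the samples without an obvious audible change.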
References
- Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd Edition, by A. M. Kondoz, Wiley, Oct 29, 2004
- Lecture notes, Speech Processing 11-492/18-492, CMU
- A-law and Mu-law Companding Implementations Using TMS320C54x, Texas Instruments, Application note SPRA163A