Assistive Speech System for Individuals with Speech Impediments Using Neural Networks


Individuals with speech impediments face challenges with conventional speech-to-text software, and this presentation introduces a system that uses artificial neural networks to assist them. Dictated speech is converted into Mel-Frequency Cepstrum Coefficient (MFCC) features, transcribed into ARPAbet phonemes by a deep bidirectional LSTM network trained with Connectionist Temporal Classification, matched to dictionary words by phonetic edit distance, and assembled into sentences with a neural language model. The work, presented at the International Mechanical Engineering Congress and Exposition (IMECE 2017), requires minimal user expertise and aims to significantly enhance communication capabilities for individuals with speech impediments.





Presentation Transcript


1. Speech Assistance for Persons with Speech Impediments using Artificial Neural Networks. Ramy Mounir, Redwan Alqasemi, Rajiv Dubey. Paper IMECE2017-71027, presented by Ramy Mounir at the International Mechanical Engineering Congress and Exposition (IMECE 2017).

2. Introduction. Individuals with speech impediments have difficulty using speech-to-text software; this work introduces a system to help solve that problem. Artificial neural networks need only data and a fast computer, not a deep understanding of the problem, and have shown state-of-the-art performance in many applications: lip reading [1], classification of raw pixels into objects, prediction of depth information from a single camera, autonomous vehicles, and, notably, speech recognition.

3. Feature Extraction: Mel-Frequency Cepstrum Coefficients (MFCC) [2]. Introduced in the 1980s. Each audio frame is converted to 13 coefficients, and those 13 coefficients per frame are the inputs to the neural network. (Slide figure: a sequence of frames, each mapped to 13 coefficients.)
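For illustration only, per-frame MFCC extraction along these lines might look like the following sketch, assuming the librosa library and a hypothetical input file speech.wav:

```python
# Illustrative sketch, not the authors' code: 13 MFCCs per frame with librosa.
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)         # hypothetical input file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
frames = mfcc.T                                          # one 13-coefficient vector per frame, fed to the network
```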

4. Phonetic Transcription: ARPAbet phonetic transcription [3], via the Carnegie Mellon University Pronouncing Dictionary. The dictionary includes lexical stress markers for vowels; there are 39 phonemes, not counting variations due to lexical stress.
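As a hedged sketch, ARPAbet pronunciations can be looked up through NLTK's copy of the CMU Pronouncing Dictionary (NLTK itself is an assumption here, and the corpus must be fetched once with nltk.download("cmudict")):

```python
# Illustrative lookup of ARPAbet phonemes; stress digits are stripped to
# stay within the 39 base phonemes.
from nltk.corpus import cmudict

pronunciations = cmudict.dict()
phones = pronunciations["speech"][0]          # e.g. ['S', 'P', 'IY1', 'CH']
base = [p.rstrip("012") for p in phones]      # drop lexical stress markers
```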

5. Artificial Neural Network: Vanilla Recurrent Neural Network. Accepts sequential data; weights and biases are shared across time steps; does not work well with long-term dependencies.
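A minimal NumPy sketch (illustrative, not from the paper) of a vanilla RNN makes the weight sharing explicit: the same parameters are applied at every time step.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b):
    """x_seq: (T, input_dim) sequence; returns the hidden states (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                               # one frame at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)      # same W_xh, W_hh, b every step
        states.append(h)
    return np.stack(states)
```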

6. Long Short-Term Memory Cell.
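For reference, one step of a standard LSTM cell can be sketched as follows; this is the textbook formulation, and the exact variant in the slide's figure (e.g. peephole connections) may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b stack the parameters of the input, forget, output gates and the cell candidate."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g          # cell state carries long-term information
    h = o * np.tanh(c)              # hidden state (cell output)
    return h, c
```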

7. Unfolded LSTM.

8. Bidirectional LSTM.

9. Deep Bidirectional LSTM.

10. Connectionist Temporal Classification (CTC). Introduced by Graves et al. [4], CTC labels unsegmented sequence data for temporal classification tasks. It is used to align inputs with outputs for RNN training and enables an end-to-end ASR system built from artificial neural networks [5]. The network generates a probability distribution over labels at each time step, and best-path or prefix search decoding converts that distribution into labels.
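A small illustrative sketch of best-path decoding, the simpler of the two decoders named above: pick the most likely label at each time step, collapse repeats, and drop the blank symbol.

```python
import numpy as np

def ctc_best_path(probs, blank=0):
    """probs: (T, num_labels) per-time-step distribution from the network."""
    path = np.argmax(probs, axis=1)                                          # greedy pick per frame
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in collapsed if p != blank]                              # remove blanks
```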

11. LER and LED. Label Error Rate (LER) is the best-known way to measure the accuracy of an ASR system: the mean normalized edit distance between the network's predictions and the labels. Levenshtein Edit Distance (LED) is the minimum number of edits (insertions, substitutions, deletions) needed to turn one list of phonemes into another; dynamic programming is used to compute it efficiently.
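The dynamic-programming computation can be sketched directly (illustrative code, not the authors'):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def label_error_rate(ref_phonemes, hyp_phonemes):
    return edit_distance(ref_phonemes, hyp_phonemes) / len(ref_phonemes)
```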

12. NN Language Model. The language model is a neural network that predicts the probability of a specific word occurring given a sequence of preceding words. It is used with greedy or beam search decoding to form sentences from the potential words.
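A hedged sketch of how beam search over candidate words could use such a model, assuming a hypothetical lm_logprob(context, word) scoring function:

```python
def beam_search(candidate_lists, lm_logprob, beam_width=2):
    """candidate_lists: for each word position, a list of possible words."""
    beams = [([], 0.0)]                                    # (sentence so far, log score)
    for candidates in candidate_lists:
        scored = [(sent + [w], score + lm_logprob(sent, w))
                  for sent, score in beams
                  for w in candidates]
        scored.sort(key=lambda b: b[1], reverse=True)
        beams = scored[:beam_width]                        # keep only the best beams
    return beams[0][0]                                     # highest-scoring sentence
```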

13. Methodology (system block diagram). Components: Dictated Speech, Feature Extraction, DBRNN with CTC, Phonetic Transcription, Edit Distance, Language Model, Beam Width, ASR System; a sketch of how they appear to fit together follows.
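Reading the diagram together with the rest of the slides, the pipeline appears to be: dictated speech is converted to MFCC features, decoded into phonemes by the DBRNN with CTC, matched against dictionary phonetic transcriptions within a chosen edit distance, and turned into a sentence by the language model with beam search. The sketch below is an interpretation of that flow; every function name is hypothetical glue, not the paper's code.

```python
def assist_dictation(audio, dictionary, max_edit_distance, beam_width):
    features = extract_mfcc(audio)                     # 13 MFCCs per frame
    phonemes = dbrnn_ctc_decode(features)              # DBRNN + CTC -> phoneme sequence
    # candidate words whose ARPAbet transcription lies within the chosen edit distance
    candidates = [match_by_edit_distance(chunk, dictionary, max_edit_distance)
                  for chunk in split_into_words(phonemes)]
    return language_model_beam_search(candidates, beam_width)
```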

14. Experimental Setup. Deep bidirectional RNN: two LSTMs (bidirectional) with 100 hidden blocks in each direction were used, in a three-layer architecture to capture higher-level representations, with two fully connected layers attached to the output of the RNN. Training used the Adagrad optimizer with a 0.01 initial learning rate and 0.95 momentum. The network was trained on the LibriSpeech ASR corpus and reached 38.5% LER on new data. Language model: two layers of single-direction LSTM cells were used, each with 1500 hidden blocks and a 1.0 initial learning rate, trained on 5% of the Penn Tree Bank (PTB) dataset for 55 epochs.
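A rough Keras reconstruction of the acoustic model described above, for illustration only: the 100-unit bidirectional layers, three-layer depth, two dense layers, and Adagrad learning rate of 0.01 come from the slide, while the dense layer width, the 40-way output (39 phonemes plus a CTC blank), and the omission of the quoted 0.95 momentum term (Keras Adagrad exposes no momentum) are assumptions.

```python
import tensorflow as tf

NUM_OUTPUTS = 40                                # assumed: 39 ARPAbet phonemes + CTC blank
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),           # variable-length sequence of 13 MFCCs
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Dense(128, activation="relu"),    # width assumed
    tf.keras.layers.Dense(NUM_OUTPUTS),               # per-frame logits for CTC decoding
])
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
```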

15. Results and Analysis. DBRNN + language model: data without a speech impediment shows lower accuracy as the edit distance increases, so the target is to use the smallest edit distance that still guarantees delivery of the correct words to the language model. (Results table: LER and WER across edit distances 1-3 and beam widths; values in original order: 55.3 13.6%, 100.3 25.9%, 70.4 12.6%, 108.7 22.6%, 1, 76.2 10.1%, 110.0 20.5%, 52.0 13.5%, 97.6 28.8%, 2, 64.4 11.4%, 105.4 24.3%, 72.0 8.9%, 108.2 21.4%.)

16. Results and Analysis. DBRNN + language model, human testing: the full system was tested on 8 human subjects with different accents. Results were better at a higher edit distance, especially for users with accents.

17. Conclusions. This method is comparable to how we think our brains understand distorted audio. The user can choose an edit distance that matches the severity of their speech impediment. A better language model is needed to increase the overall accuracy of the system. Acknowledgement: the authors would like to thank the Florida Department of Education, Division of Vocational Rehabilitation, for their support.

18. References.
[1] Assael, Y. M., Shillingford, B., Whiteson, S., and de Freitas, N. (2016). "LipNet: End-to-End Sentence-level Lipreading." arXiv preprint arXiv:1611.01599.
[2] Practical Cryptography, "Mel Frequency Cepstral Coefficient (MFCC) Tutorial," practicalcryptography.com.
[3] Carnegie Mellon University, The CMU Pronouncing Dictionary.
[4] Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." Proceedings of the 23rd International Conference on Machine Learning (ICML '06).
[5] Graves, A., and Jaitly, N. (2014). "Towards End-to-End Speech Recognition with Recurrent Neural Networks." Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

19. Q and A.
