Automated vs. Naïve: Speech Emotion Classification

EMOTION CLASSIFICATION: HOW DOES AN AUTOMATED SYSTEM COMPARE TO NAÏVE HUMAN CODERS?
Sefik Emre Eskimez, Kenneth Imade, Na Yang, Melissa Sturge-Apple, Zhiyao Duan, Wendi Heinzelman
University of Rochester
Motivation
Emotions play a vital role in social interactions
Realistic human-computer interaction requires accurately determining the affective state of the user
How does an automated system compare to naïve human coders?
Can automated systems replace naïve human coders in speech-based emotion classification applications?
Introduction
In this study, an automated system is compared with naïve human coders in terms of speech emotion classification performance
Results show that it is feasible to replace naïve human coders with automatic emotion classification systems
Naïve human coders’ confidence level in classification does not affect their classification accuracy, while the automated system can increase its accuracy by only classifying samples in which it is confident
Automatic Speech Emotion Classification System Overview
Feature Extraction
All features and their 1st-order derivatives (except speaking rate) are calculated in overlapping frames
Statistical values are calculated over all frames: min, max, mean, standard deviation and range (max-min)

Feature name                                          | #   | Feature name  | #
Fundamental Frequency (f0)                            | 10  | Spread        | 10
Energy                                                | 10  | Skewness      | 10
Frequency and bandwidth for the first four Formants   | 80  | Kurtosis      | 10
12 Mel-frequency Cepstral Coefficients (MFCCs)        | 120 | Flatness      | 10
Zero-cross rate                                       | 10  | Entropy       | 10
Roll-off                                              | 10  | Roughness     | 10
Brightness                                            | 10  | Irregularity  | 10
Centroid                                              | 10  | Speaking Rate | 1
Size of feature vector: 331
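
The deck itself contains no code; purely as an illustration, the sketch below shows how the per-utterance statistics listed above could be computed with NumPy from already-extracted frame-level features. The array shapes, the speaking-rate value, and the function name are placeholders; with 33 frame-level features this construction yields the 331-dimensional feature vector referenced on the following slides.

```python
import numpy as np

def utterance_statistics(frames: np.ndarray) -> np.ndarray:
    """Summarize frame-level features into a fixed-length utterance vector.

    frames: array of shape (n_frames, n_features), one row per overlapping frame.
    Returns min, max, mean, standard deviation and range (max - min) for each
    feature and for its first-order (frame-to-frame) derivative.
    """
    deltas = np.diff(frames, axis=0)  # first-order derivative across frames

    def stats(x: np.ndarray) -> np.ndarray:
        return np.concatenate([
            x.min(axis=0),
            x.max(axis=0),
            x.mean(axis=0),
            x.std(axis=0),
            x.max(axis=0) - x.min(axis=0),  # range
        ])

    return np.concatenate([stats(frames), stats(deltas)])

# Hypothetical example: 200 overlapping frames of 33 frame-level features
# -> 33 features * 5 statistics * 2 (raw + derivative) = 330 values,
# plus one utterance-level speaking-rate value = 331 dimensions.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(200, 33))
speaking_rate = np.array([3.5])  # placeholder utterance-level value
feature_vector = np.concatenate([utterance_statistics(frame_features), speaking_rate])
print(feature_vector.shape)  # (331,)
```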
Feature Selection
Support Vector Machine (SVM) Recursive Feature Elimination
Train the SVMs to obtain weights
Eliminate the feature that has the lowest weight value
Continue until there is no feature left
Rank the features according to the reverse of the elimination order to get the top N best features
In our experiments, we use N = 80 (out of 331)
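
As an illustration of the SVM-RFE procedure just described, here is a minimal scikit-learn sketch, not taken from the paper. A linear-kernel SVM is used because per-feature weights are only defined for a linear kernel, and X and y are placeholder arrays standing in for the real features and emotion labels.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Placeholder data: 727 utterances x 331 features, 6 emotion labels (0..5).
rng = np.random.default_rng(0)
X = rng.normal(size=(727, 331))
y = rng.integers(0, 6, size=727)

# Recursive feature elimination: repeatedly train the SVM, drop the feature
# with the smallest weight, and rank features in reverse elimination order.
selector = RFE(
    estimator=SVC(kernel="linear"),  # linear kernel so per-feature weights exist
    n_features_to_select=80,         # keep the top N = 80 features
    step=1,                          # eliminate one feature per iteration (slow but faithful)
)
selector.fit(X, y)

top_features = np.where(selector.support_)[0]
print(len(top_features), "features kept; full ranking available in selector.ranking_")
```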
Automatic Emotion Classification
The system labels each sample with three different labels from the following sub-systems:
6 Emotion Categories: anger, disgust, panic, happy, neutral, sadness.
Arousal Categories: active, passive and neutral (APN).
Valence Categories: positive, negative and neutral (PNN).
Automatic Emotion Classifiers
The system uses binary SVM classifiers with an RBF kernel for each emotion
6 binary SVMs for the first sub-system
3 binary SVMs each for the second and third sub-systems
Total of 12 binary SVMs
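
The sketch below, which is not from the deck, shows one way the 12 binary RBF-kernel SVMs could be organized in scikit-learn, assuming a one-vs-rest formulation within each sub-system; the data arrays, labels, and helper name are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "panic", "happy", "neutral", "sadness"]
AROUSAL = ["active", "passive", "neutral"]     # APN sub-system
VALENCE = ["positive", "negative", "neutral"]  # PNN sub-system

def train_binary_svms(X, labels, classes):
    """Train one binary RBF-kernel SVM per class (class vs. rest)."""
    models = {}
    for cls in classes:
        y_bin = np.array([1 if lab == cls else 0 for lab in labels])
        models[cls] = SVC(kernel="rbf", probability=True).fit(X, y_bin)
    return models

# Placeholder data standing in for the 80 selected features and their labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(727, 80))
emotion_svms = train_binary_svms(X, rng.choice(EMOTIONS, size=727), EMOTIONS)  # 6 SVMs
arousal_svms = train_binary_svms(X, rng.choice(AROUSAL, size=727), AROUSAL)    # 3 SVMs
valence_svms = train_binary_svms(X, rng.choice(VALENCE, size=727), VALENCE)    # 3 SVMs
# 6 + 3 + 3 = 12 binary SVMs in total.
```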
Automatic Emotion Classification Threshold Fusion
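
The fusion step appears only as a diagram in the original slides, so the sketch below is one plausible reading rather than the authors' scheme: each binary SVM is assumed to output a probability, and a sub-system returns a label only when the best score clears a confidence threshold, otherwise the sample is rejected (consistent with the conclusion that the system can skip low-confidence samples). The threshold value and function name are hypothetical.

```python
import numpy as np

def classify_with_threshold(x, models, threshold=0.6):
    """Fuse binary SVM scores for one sample.

    models: dict mapping class name -> fitted SVC trained with probability=True.
    Returns the highest-scoring class, or None if no score reaches the
    confidence threshold (i.e., the sample is rejected as low-confidence).
    """
    scores = {cls: m.predict_proba(x.reshape(1, -1))[0, 1] for cls, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Example usage with the emotion_svms from the previous sketch:
# label = classify_with_threshold(X[0], emotion_svms, threshold=0.6)
```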
LDC Dataset
15 Emotions
Speakers: 4 actresses and 4 actors
Total of 2433 utterances
Acted dataset
In our experiments:
6 Emotions: anger, disgust, panic, happy, neutral and sadness.
Speakers: 4 actresses and 3 actors
727 utterances
Experimental Setup: Automatic Emotion Classification System
7-fold cross-validation
6/7 of the data used for training, 1/7 of the data used for testing
In each fold, training and testing data are randomly chosen
Data are up-sampled to even out all classes
Leave-One-Subject-Out (LOSO) test
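
As an illustration of this evaluation setup, here is a minimal scikit-learn sketch (not from the paper) of 7-fold cross-validation with class up-sampling on the training portion, plus a Leave-One-Subject-Out split over the 7 speakers. The stratified split, the up-sampling helper, and the data arrays are assumptions standing in for the original pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

def upsample(X, y, seed=0):
    """Randomly up-sample every class to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    idx_all = []
    for cls in classes:
        idx = np.where(y == cls)[0]
        idx_all.append(resample(idx, replace=True, n_samples=counts.max(), random_state=seed))
    idx_all = np.concatenate(idx_all)
    return X[idx_all], y[idx_all]

# Placeholder data: 727 utterances, 80 selected features, 6 emotions, 7 speakers.
rng = np.random.default_rng(0)
X = rng.normal(size=(727, 80))
y = rng.integers(0, 6, size=727)
speakers = rng.integers(0, 7, size=727)

# 7-fold cross-validation: 6/7 of the data for training, 1/7 for testing per fold.
accs = []
for tr, te in StratifiedKFold(n_splits=7, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = upsample(X[tr], y[tr])
    accs.append(SVC(kernel="rbf").fit(X_tr, y_tr).score(X[te], y[te]))
print("7-fold mean accuracy:", np.mean(accs))

# Leave-One-Subject-Out (LOSO): each fold holds out all utterances of one speaker.
for tr, te in LeaveOneGroupOut().split(X, y, groups=speakers):
    pass  # train on X[tr] (up-sampled) and evaluate on X[te] as above
```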
Experimental Setup: Amazon’s Mechanical Turk
138 unique workers participated
10-100 random samples per worker
Only one sample per emotion category is presented beforehand
Turkers are asked to listen to, label, and transcribe each audio sample
Turkers are asked for demographic information after they are done labeling
Results: Turkers
[Charts: number of labeled instances and accuracy by Turker age and gender; accuracy by Turker confidence level]
Overall Turker accuracy: 60.4% over 7270 labeled instances (male Turkers 60.1%, female Turkers 61.2%)
Turkers achieved 60.6% accuracy on samples they marked as confident vs. 59.6% on samples marked not sure
Results: Computer System
Turkers vs. Computer System: Emotions
Six-emotion accuracy on all samples: computer system 72.9% vs. 60.4% for all Turkers (female Turkers 61.2%, male Turkers 60.1%)
Turkers vs. Computer System: APN & PNN
Arousal (APN) accuracy on all samples: computer system 89.3% vs. 70.5% for all Turkers
Valence (PNN) accuracy on all samples: computer system 82.9% vs. 71.8% for all Turkers
Conclusion/Discussion
This study compares naïve human coders with a computer emotion classification system
Expressed vs. perceived emotions!
The computer system achieves better accuracy in almost all cases
The computer system can improve classification accuracy by rejecting samples with low confidence
Naïve human coders were not able to improve their accuracy through specifying their confidence in their classification
Results show that it is feasible to replace naïve human coders with automatic emotion classification systems
The End
Thank you!