ABSTRACT
Title of dissertation: TOWARDS BUILDING GENERALIZABLE
SPEECH EMOTION RECOGNITION MODELS
Saurabh Sahu, Doctor of Philosophy, 2019
Dissertation directed by: Professor Carol Espy-Wilson
Department of Electrical and Computer Engineer-
ing
Detecting the mental state of a person has implications in psychiatry, medicine, psy-
chology and human-computer interaction systems among others. It includes (but is not
limited to) a wide variety of problems such as emotion detection, valence-affect-dominance
states prediction, mood detection and detection of clinical depression. In this thesis we fo-
cus primarily on emotion recognition. Like any recognition system, building an emotion
recognition model consists of the following two steps:
1. Extraction of meaningful features that would help in classification
2. Development of an appropriate classifier
Speech data being non-invasive and the ease with which it can be collected has made it
a popular choice to extract features from. However, an ideal system designed should be
agnostic to speaker and channel effects. While feature normalization schemes can counter
these problems to some extent, we still see a drastic drop in performance when the training
and test data-sets are unmatched. In this dissertation we explore some novel ways towards
building models that are more robust to speaker and domain differences.
Training discriminative classifiers involves learning a conditional distribution p(yi|xi),
given a set of feature vectors xi and the corresponding labels yi, i = 1..N. For a classifier
to be generalizable and not overfit to training data, the resulting conditional distribution
p(yi|xi) is desired to be smoothly varying over the inputs xi. Adversarial training proce-
dures enforce this smoothness using manifold regularization techniques. Manifold regular-
ization makes the model’s output distribution more robust to local perturbation added to a
datapoint xi. In the first part of the dissertation, we investigate two training procedures: (i)
adversarial training where we determine the perturbation direction based on the given labels
for the training data and, (ii) virtual adversarial training where we determine the perturba-
tion direction based only on the output distribution of the training data. We demonstrate the
efficacy of adversarial training procedures by performing a k-fold cross validation experi-
ment on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and a cross-corpus
performance analysis on three separate corpora. We compare their performances to that of
a model utilizing other regularization schemes such as L1/L2 and graph based manifold
regularization scheme. Results show improvement over a purely supervised approach, as
well as better generalization capability to cross-corpus settings.
Our second approach leverages multi-modal learning and automated speech recognition
(ASR) systems toward improving the generalizability of an emotion recognition model
that requires only speech as input. Previous studies have shown that emotion recognition
models using only acoustic features do not perform satisfactorily in detecting valence level.
Text analysis has been shown to be helpful for sentiment classification. We compared
classification accuracies obtained from an audio-only model, a text-only model and a multi-
modal system leveraging both by performing a cross-validation analysis on IEMOCAP
dataset. Confusion matrices show it’s the valence level detection thats being improved
by incorporating textual information. In the second stage of experiments, we used three
ASR application programming interfaces (APIs) to get the transcriptions. We compare the
2
performances of multi-modal systems using the ASR transcriptions with each other and
with that of one using ground truth transcription. This is followed by a cross-corpus study.
In the third part of the study we investigate the generalizability of generative adversarial
networks (GANs) based models. GANs have gained a lot of attention from machine learn-
ing community due to their ability to learn and mimic an input data distribution. GANs
consist of a discriminator and a generator working in tandem playing a min-max game to
learn a target underlying data distribution; when fed with data-points sampled from a sim-
pler distribution (like uniform or Gaussian distribution). Once trained, they allow synthetic
generation of examples sampled from the target distribution. We investigate the applicabil-
ity of GANs to get lower dimensional representations from the higher dimensional feature
vectors pertinent for emotion recognition. We also investigate their ability to generate syn-
thetic higher dimensional feature vectors using points sampled from a lower dimensional
prior. Specifically, we investigate two set ups: (i) when the lower dimensional prior from
which synthetic feature vectors are generated is pre-defined, (ii) when the lower dimen-
sional prior is learned from training data. We define the metrics used to measure and
analyze the performance of these generative models in different train/test conditions. We
perform cross validation analyses followed by a cross-corpus study.
Finally we make an attempt towards understanding the relation between two different
sub-problems encompassed under mental state detection namely depression detection and
emotion recognition. We propose approaches that can be investigated to build better de-
pression detection models by leveraging our ability to recognize emotions accurately.
TOWARDS BUILDING GENERALIZABLE SPEECH EMOTION
RECOGNITION MODELS
by
Saurabh Sahu
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2019
Advisory Committee:
Professor Carol Espy-Wilson, Chair/Advisor
Professor Shihab Shamma
Professor Jonathan Simon
Professor Behtash Babadi
Professor William Idsardi (Dean’s Representative)
©c Copyright by
Saurabh Sahu
2019
Acknowledgments
First and foremost I would like to express my gratitude towards my advisor Dr. Carol
Espy-Wilson. This work wouldn’t have been possible without her encouragement, moti-
vation and intuitive inputs. I have learned a lot about speech signal processing as well as
about research in general along with presentation and writing skills that has helped me in
writing this thesis. I would like to thank my parents and siblings and their families for their
constant support and providing me with the confidence in times of need. I would also like
to thank Dr. Vikramjit Mitra and Dr. Rahul Gupta for their ideas, discussions and contri-
butions to my thesis. I am grateful to my peers at Speech Communication Lab at UMD Dr.
Ganesh Sivaraman, Nadee Seneviratne, Vasudha Kowtha, Rahil Parikh, Nirat Saini and my
colleagues from one of the internships Dr. Elizabeth Shriberg and Ben Reeves for interest-
ing discussions and valuable suggestions towards improving my dissertation. I am indebted
to my PhD dissertation talk panel members - Dr. Shihab Shamma, Dr. Jonathan Simon,
Dr. Behtash Babadi and Dr. William Idsardi for taking time out of their busy schedule
and agreeing to be in my panel. I would also like to thank Ravi uncle and his family and
my friends who have supported me in times of need and treated me like their own in this
foreign land. Finally, I am thankful for Almighty’s blessings without which this wouldn’t
have been possible.
ii
Table of Contents
Acknowledgements ii
1 Introduction 1
1.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Continuous features . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Voice quality features . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Spectral features . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Teager energy operator . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Classifiers for emotion recognition . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Objectives of this study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Literature Survey 12
3 Smoothing model predictions using adversarial examples 22
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Understanding adversarial examples . . . . . . . . . . . . . . . . . . . . . 25
3.3 Delving into loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 L1/L2 regularization . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Adversarial training . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.3 Virtual Adversarial training . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Graph based manifold regularization . . . . . . . . . . . . . . . . . 35
3.4 Comparison of various generalization schemes . . . . . . . . . . . . . . . . 36
3.4.1 Single corpora setting . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Cross corpus evaluation . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Results: Single corpus setting . . . . . . . . . . . . . . . . . . . . 40
iii
Results: Cross corpus evaluation . . . . . . . . . . . . . . . . . . . 43
3.5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Multi-modal learning for Speech Emotion Recognition : An Analysis and com-
parison of ASR outputs with ground truth transcriptions 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IEMOCAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
MSP-IMPROV . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3 Classification models . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.4 ASR models employed . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Exploring attention mechanisms . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Multi-modal experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Comparing audio and text modalities . . . . . . . . . . . . . . . . . 57
4.4.2 ASR model output vs ground truth transcriptions used for multi-
modal classification . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.3 Cross-corpus analysis . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5 Generative models to capture the underlying distribution of feature vectors 67
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.1 Adversarial auto-encoders . . . . . . . . . . . . . . . . . . . . . . 71
5.2.2 Data generating GAN . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Adversarial auto-encoder with data generating GAN . . . . . . . . 76
5.3 Comparison of various models’ performance . . . . . . . . . . . . . . . . . 79
5.3.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Projecting higher dimensional points onto lower dimensions . . . . 80
Single corpora setting . . . . . . . . . . . . . . . . . . . . . . . . . 81
Cross corpus setting . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.3 Generative capability of GAN models . . . . . . . . . . . . . . . . 87
Single corpora setting . . . . . . . . . . . . . . . . . . . . . . . . . 90
Lower dimensional visualizations of synthetic data . . . . . . . . . 92
Cross corpus setting . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 Future directions 100
6.1 Appending low volume datasets with adversarial counterparts . . . . . . . . 102
6.2 Emotion recognition on real world datasets . . . . . . . . . . . . . . . . . . 104
6.3 Connecting depression detection and emotion recognition . . . . . . . . . . 109
iv
Bibliography 113
v
Chapter 1: Introduction
Human interactions via speech is the most common and efficient way of interaction
that occurs on a daily basis. Moreover, their non-invasive nature has also resulted in speech
features being popular for various tasks such as emotion recognition. Human listeners have
the ability to pick up on the emotions of the speaker they are listening to and they can
assess the speaker’s mental state. This is an ability that machines lack. Even though we
have made great progress in the field of speech recognition, speech emotion recognition
systems are still not that accurate. Furthermore, such models that have been trained with
few amounts of data, don’t perform well when evaluated under unseen test conditions. If
we want to achieve our goal of building a realistic human-computer interaction system, it is
imperative that the machines understand the emotional state of a person and generalize their
performance across unseen speakers and environments so that the conversation is natural.
Speech emotion recognition systems can help us extract useful semantics from speech
thereby improving the performance of speech recognition systems (El Ayadi et al. [30],
Nicholson et al. [88]). Apart from that, automatic recognition of a speaker’s emotional/mental
state from speech can help us build systems that can play a crucial role in early diagnosis
of psychiatric diseases (France et al. [36]). It can be used in designing web tutorials whose
response depend on the users emotion and can also be useful for pilot stress management if
implemented in cockpits of aircrafts or even in cars or trucks (Schuller et al. [107], Hansen
et al. [52]). Speech emotion recognition has also found applications in call centers (Jin et
al. [61]).
1
The classification of emotions has been researched from two fundamental viewpoints:
one, that emotions are discrete and fundamentally different constructs; or two, that emo-
tions can be characterized on a dimensional basis in groupings. A typical set of emotion
classes may contain contain up to 300 emotional states (El Ayadi et al. [30], Schubiger
[104], O’Connor [90]). Obviously, classifying such a large number of emotions is very
difficult and impractical. Hence, to counter this problem, researchers came up with the
idea that emotions, like colors can be decomposed into primary components (El Ayadi et
al.[30]). The primary emotions are Anger, Disgust, Fear, Joy, Sadness, and Surprise. They
are called the archetypal or categorical emotions (El Ayadi et al. [30], Cowie et al. [25]).
Moreover, there has also been research on classifying emotions by scoring them along
certain pre-defined dimensions. There are several models having their own definition of di-
mensions that reflect the affect state of an individual. Valence-Arousal-Dominance model
has been used predominantly in studies (Valstar et al. [120], Grimm et al. [45]). Each
of these dimensions can take a continuous value from low to high. We can quantify them
as lying in the set [−1,1]. The valence dimension measures how pleasurable an emotion
is. For instance, ’anger’ and ’fear’ are low valence emotions while ’happy’ is a high va-
lence emotion. The arousal scale measures the intensity of an emotion. For example, both
’anger’ and ’rage’ are low valence emotions but ’rage’ has a higher arousal state. Similarly,
we can say ’boredom’ is a low arousal emotion along with being a low valence emotion.
The dominance scale represents the dominant/controlling nature of emotions. While both
anger and fear are low valence emotions, ’anger’ has a higher dominance value than ’fear’.
In this thesis we focus on categorical emotion classification.
Speech emotion recognition is a quite challenging task for several reasons. One of
the main reasons is that it is unclear as to which speech features are most powerful in
distinguishing between emotions (El Ayadi et al. [30]). Further, the acoustic variability
introduced by different sentences, speakers, speaking styles, speaking rates and record-
2
ing conditions adds another obstacle because these properties directly affect most of the
common extracted speech features such as pitch, energy contours, Mel frequency cepstral
coefficients (MFCC) etc. Thus, designing speech recognition systems typically requires
extraction of a considerably large dimensionality of features to reliably capture the emo-
tional traits, followed by training of a machine learning system ideally with a huge amount
of data so that it performs well in unseen conditions.
As this problem has a lot of practical applications it is not surprising that research in the
field of speech emotion recognition has been going on for quite some time (Williams and
Stevens [124]). The degree of naturalness of a dataset is an important factor to be consid-
ered while designing a speech emotion recognizer (El Ayadi et al. [30]). A list of common
databases used for speech emotion/depression research has been listed in (El Ayadi et al.
[30], Ververidis and Kotropoulos [122]). An important concern is whether we should focus
on real world emotions or acted emotions. Acted ones tend to be more exaggerated but
acoustic correlates found with acted emotions do not contradict those found with real ones
(Williams and Stevens [124]) . Some acted corpora play out real life scenarios to elicit
realistic emotions (Busso et al. [13]). Below we describe some features and classification
schemes that have been used in various studies.
1.1 Features
While using features, researchers always ponder on the question of global vs local fea-
tures. Global features are one per utterance and hence its convenient for cross validation.
However, we lose the temporal information contained in speech. It could also be unreason-
able to try and train a complex classifier like SVM or HMM with global features since the
number of training vectors might not be enough (El Ayadi et al. [30]).
3
1.1.1 Continuous features
Prosodic continuous speech features such as frequency and energy have been investi-
gated in a lot of studies involving mental state detection (El Ayadi et al. [30], Cummins
et al. [26]). Continuous features are related to F0, energy, articulation rate and spectral
information in various regions. They can be grouped into the following categories:
• pitch-related features
• formants features
• energy related features
• timing features
• articulation rate features
Continuous speech features have been heavily used in speech emotion recognition. For
example, Banse et al. examined vocal cues for 14 emotion categories [7]. Prosodic features
like F0, energy, etc have been used in various studies like Cowie et al. [25], Williams and
Stevens [124], Murray and Arnott [85], Oster and Risberg [91]. While these features are
quite reliable and used predominantly for emotion detection, there have been contradictory
reports for some of them. To get a global value of these features, researchers use functionals
like mean, median, range, standard deviation, quartile ranges etc (Eyben et al [35]).
1.1.2 Voice quality features
Another class of features commonly used is voice quality features. Voice quality and
perceived emotion in particular full blown emotions that make people take direct actions
are strongly related (Cowie et al. [25]). The same study categorizes the acoustic correlates
related to voice quality into following categories
4
• voice level (amplitude, energy)
• voice pitch
• temporal structures
Voice quality features can be obtained from glottal signals which can be obtained from
speech by filtering out the effects of the vocal tract. Since only voiced signals are gen-
erated form voiced signals, it is imperative that we do a voiced segment detection before
using this method. However, non-uniform vocal fold behavior, presence of noise, formant
ripples are some big problems that researchers have to overcome while inverse filtering. Re-
searchers have used the method mentioned in (El Ayadi et al. [30]) for voice quality feature
extraction. Pitch and the first four formant frequencies and bandwidth are estimated from
the speech signal. The effect of the vocal tract is mitigated by subtracting the vocal tract’s
influence from harmonic amplitudes. Source features which estimate the air flow from the
lungs through glottis are an effective way to capture voice qualities (Cummins et al. [26]).
Voice qualities features can quantify irregular phonation relating to laryngeal qualities such
as breathiness, creakiness or harshness (Klatt and Klatt [66], Gobl et al. [40]). Other source
features include jitter, shimmer and harmonic to noise ratio (HNR). However, like prosodic
features there is disagreement between researchers on how to associate vocal quality fea-
tures with emotions (El Ayadi et al. [30]). For example, according to Scherer [103], tense
voice is associated with anger,joy and fear and lax voice is associated with sadness. On
the other hand, Murray and Arnott [85] were of the opinion that breathy voice is associated
with both anger and happiness while sadness is associated with a resonant voice quality.
1.1.3 Spectral features
Spectral features characterize the spectrum of speech i.e. the frequency domain repre-
sentation of a speech signal at a particular time instant in some high dimensional space.
5
Studies have indicated that different emotions affect the sub-band energy distribution in
speech (El Ayadi et al. [30]. Some depression papers have reported a relative shift in
energies from lower to higher sub-bands while others have focused on sub-band energy
variability (Cummins et al. [26]). Cepstral features which are obtained from spectral fea-
tures have been used extensively for emotion recognition. While Bou-Ghazale and Hansen
[12] report that cepstral features like MFCC, linear predictor cepstral coefficients (LPCC)
outperform linear based features like LPC, Nwe et al. [89] report that linear based feature
like the log frequency power coefficient worked better than MFCC and LPCC for their
experiments. Features like power spectral density, linear prediction coefficients, Mel fre-
quency cepstral coefficients, spectral centroid, spectral flux are some examples of spectral
features. One major drawback of using these kind of features is that spectral features con-
tain both linguistic and paralinguistic information. They also contain speaker dependent
information and hence MFCCs has been widely used in speech and speaker recognition.
Ideally we would like our features for this task to be independent of any linguistic infor-
mation. So features of this sort might hinder speaker emotion recognition systems. Clearly
there are some relationships among the feature types described above. For example, spec-
tral variables relate to voice quality, and the pitch contours relate to the patterns arising
from different tones. But links are rarely made in the literature (El Ayadi et al. [30].
1.1.4 Teager energy operator
Teager energy operator based features characterize the non-linear airflow in the vocal
system (Teager and Teager [116]). Under stressful conditions, the muscle tension of the
speaker affects the air flow in the vocal system producing the sound (El Ayadi et al. [30].
Therefore, nonlinear speech features could be useful for detecting emotions or a speaker’s
mental state.
6
1.2 Classifiers for emotion recognition
Classifiers like hidden Markov models (HMM), Gaussian mixture models (GMM), ar-
tificial neural networks (ANN) and Support vector machines (SVM) have been used for the
task of emotion recognition (El Ayadi et al. [30]). Furthermore, k-nearest neighbor, fuzzy
classifiers, decision trees and systems where multiple classifiers are combined have been
employed by researchers in the past.
1.2.1 Hidden Markov Models
HMMs have been widely used for automatic speech recognition (ASR). However, HMMs
used for emotion recognition are usually fully connected unlike the HMMs used for ASR
which are left to right. Also the output of HMM states for emotion recognition are fea-
ture values extracted from larger time units spanning one or multiple words since it doesn’t
make sense to label smaller units like a phoneme with an emotion category. While de-
signing an HMM system, some design criteria are number of optimal states, type of obser-
vations (discrete or continuous) and the optimal number of observation symbols/optimum
number of Gaussians. Nwe et al. [89] showed that a four state HMM model does better
than humans in recognizing emotion for Burmese and Mandarin databases. However, we
can’t generalize the results until a more comprehensive study is done.
1.2.2 Gaussian Mixture Models
GMMs are probabilistic models used to model multi-modal distributions. Determining
the number of Gaussians is an important design problem and methods such as classifi-
cation error with respect to a cross-validation set, minimum description length, Akaike
information criterion, kurtosis based goodness of fit measures and greedy expectation-
7
maximization have been employed for this task (El Ayadi et al. [30]) .GMMs are trained
using global features and thus they don’t have the ability to model the temporal dynamics of
training data. In order to do that GMMs were employed with vector auto-regressive process
resulting in Gaussian mixture vector auto-regressive models. El Ayadi et al. [31] showed
that GMM performed better than methods like HMM, k-nearest neighbors and ANN.
1.2.3 Artificial neural networks
ANNs have a better ability to model non-linear mappings as compared to HMMs and
GMMs. They also perform better than GMM or HMM when the number of training ex-
amples is low. The design criteria for ANN include the number of hidden layers, number
of neurons in each hidden layer and the activation function. The performance of an ANN
depends heavily on these parameters. Hence, researchers have tried using the aggregate de-
cision of multiple ANN architectures. The ANN can be one-class-in-one network classifier
where we have one neural network for each emotion giving binary output values indicating
the presence/absence of the emotion. The final decision is based on the output of all the
neural networks. In contrast we can also have all-emotion-in-one neural network archi-
tectures by having a softmax output layer. ANNs have performed less well compared to
GMM/HMM and how good they perform is thought to be dependent on the corpus used
for the study (El Ayadi et al. [30]). However, due to more computing power being avail-
able these days, Deep Neural Networks (DNNs) have become quite popular. DNNs are
simply ANNs with multiple hidden layers. Stuhlsatz et al. [113] report some impressive
accuracies using a DNN on 9 corpora using Generalized Discriminant Analysis features to
do a binary classification between positive and negative arousal and positive and negative
valence states.
8
1.2.4 Support Vector Machines
SVMs use kernels to map the non-linear feature space to a high dimensional space
where they are linearly separable. SVMs can be used for multi-class classification by train-
ing one SVM per emotion to give a binary decision and then combining the decisions
from all the SVMs. The design criteria include the type of kernel and the cost parame-
ter C (Chang and Lin [18]). A table summarizing the performances of these classifiers is
presented in (El Ayadi et al. [30]). HMM and GMM are the more popular choice of classi-
fiers for emotion recognition. HMM has the ability to model the transitions between states
thereby capturing the temporal dynamics of how features change. But they need proper
initialization schemes for their training and parameter estimation. Models like ANN and
SVM have been employed widely as well because of their ease of implementation. Even
though the training time for an SVM is larger compared to a GMM or an HMM we used it
for our classification purposes because unlike an HMM/GMM we don’t need to initialize
any parameters.
1.3 Objectives of this study
Performance of speech emotion recognition classifiers deteriorate when evaluated on a
dataset different than the training set. Below we show the mean class-wise accuracies when
we train and evaluate a neural network based classifier on two datasets named IEMOCAP
(Busso et al. [13]) and MSP-IMPROV (Busso et al. [16]). We consider four way classifi-
cation between emotions angry, sad, neutral and angry. As can be seen in Table 1.1 cross-
corpus accuracies are always lower than in-domain speaker independent cross-validation
accuracies. The model trained on MSP-IMPROV under-performs than that of the model
trained on IEMOCAP probably because of its lower size. At the same time we can see that
9
Table 1.1: Mean class-wise accuracies obtained for in-domain speaker independent cross-validation
and cross-corpus evaluation
→ Training IEMOCAP MSP-IMPROV
↓ Evaluation
. IEMOCAP 58.15 47.04
MSP-IMPROV 43.43 49.56
the cross-corpus differences are more evident when using a model trained on IEMOCAP.
The objective of this thesis is to investigate the generalizability of speech emotion
recognition systems trained on limited amount of data. Towards that end we use existing
feature extraction schemes to train classifiers and carry out in-domain speaker independent
cross-validation studies which gives us an idea how well our models perform on unseen
speakers. Additionally, we perform cross-corpus experiments to determine how differ-
ent recording conditions/label space/annotation and data collection procedures affect the
models. We investigate the generalizability of discriminative and generative models. Dis-
criminative models learn a conditional distribution p(yi|xi), given a set of feature vectors
xi and the corresponding labels yi, i = 1..N. In other words given the training data-points
they aim to learn the hard/soft boundary between classes. Generative models on the other
hand model the distribution of the classes. They aim to learn the joint distribution p(xi,yi).
Once the joint distribution has been computed, it can be used to evaluate the p(y|x) in or-
der to classify a new data-point x. Also by sampling a point from the joint distribution it
is possible to generate synthetic examples of data-points x. If our goal is to build classifi-
cation models and we have enough labeled data available for training, then discriminative
models are the way to go. However, the annotation process can be expensive and time
consuming. Generative models can be helpful in such cases when limited labeled training
data are available since they can also exploit vast amount of unlabeled data. However, in
often cases the generalization performance of generative models is found to be poorer than
discriminative models due to differences between the model and the true distribution of
10
the data (Bernardo et al [11]) which is what we aim to investigate in one of the chapters.
Specific contributions of this thesis are as follows:
1. Investigating adversarial/manifold learning based regularization schemes and com-
paring them with L1/L2 regularization that have been known to prevent overfitting in
discriminative models (chapter 3).
2. Leveraging the generalizability of automatic speech recognition systems to get the
transcripts from audio files and using them to build multi-view speech emotion recog-
nition models (chapter 4). We also study the effect of using attention mechanisms in
discriminative models that uses frame-wise features for speech emotion recognition.
3. Studying the capability of generative adversarial networks (GANs) to learn the proba-
bility distribution of the feature vectors used for speech emotion recognition (chapter
5). We perform experiments to determine how well these models can encode higher
dimensional feature vectors to lower dimensions and then use the trained GAN based
models to generate synthetic higher dimensional feature vectors. The synthetic fea-
ture vectors can potentially be used to train models in low resource conditions. To
our knowledge, this is the first such study comparing the GAN based models with
regards to their encoding ability and comparing the quality of the generated synthetic
feature vectors for speech emotion recognition.
4. Finally we talk about future directions. We show some results on how a speech emo-
tion recognition model trained on acted datasets performs on real world samples. We
also propose some experiments that can potentially improve mental heath detection
using information gained from emotion recognition models (chapter 6).
11
Chapter 2: Literature Survey
A generalizable speech emotion recognition system should perform well under various
conditions. Shami and Verhelst [109] showed that aggregating data from different emotion
datasets to train a model can improve the performance as compared to when the model
is trained on just one dataset. More recently, Zhang et al [133] performed a cross-corpus
binary arousal and valence level classification across six databases to explore the effec-
tiveness of unsupervised learning across six emotion databases. The databases they used
corresponded to different languages such as German, English and Danish. They used the
Opensmile toolkit [35] to extract a 6552 dimensional feature set that consists of 39 func-
tionals of 56 low level descriptors and their delta and delta-delta coefficients. This high
dimensional feature vector was used to train a SVM with linear kernel. They performed a
6-way leave one corpus out cross-validation experiments and reported the mean class-wise
accuracies. They experimented with three different normalization techniques namely mean
subtraction, min-max normalization and mean-variance normalization performed before
and after aggregating the datasets. They reported higher accuracies with mean centering
for valence level detection while mean-variance normalization worked the best for arousal
level detection. Also normalizing the features before aggregating was found to be more
beneficial. Their next step was to investigate a unsupervised adaptation technique. They
considered three settings (i) they used only three out of five datasets for training resulting in
ten training set permutations for each of the six test sets. (ii) they trained a model on three
datasets and used it to predict the labels for remaining two training sets. They then consid-
12
ered these predicted labels as ground truth labels and used all five sets for training a model
which was evaluated on the test set. This was the unsupervised framework where availabil-
ity of unlabeled data was assumed (iii) they used data and ground truth labels from all five
sets for training. As expected the performance of unsupervised method lied somewhere in
between case (i) and case (iii). The absolute improvement in the mean accuracy over six
cross-validation splits was 0.4% for arousal detection and 0.8% for valence detection.
Abdelwahan and Busso [2] proposed feature selection based domain adaptation tech-
nique for a 4-class (angry, sad, neutral and happy) speech emotion recognition problem.
Domain adaptation is closely related to transfer learning in which our aim is to learn and fit
a model on a source data distribution that performs ’well’ on a different but related target
data-distribution. The authors’ goal here was to improve the performance of a SVM classi-
fication technique trained on IEMOCAP dataset [13] and tested on MSP-IMPROV [16]. In
most cases as in this paper, different emotion datasets are considered to have different (due
to speaker, channel, recording conditions, label space) but related (since all of them have
emotionally colored utterances) distributions. They extracted 6373 dimensional feature sets
using Opensmile which was reduced to 3000 dimensions by selecting features that corre-
late with class label of the training set but not with each other. They also train an ensemble
of SVM classifiers with each targeted towards classifying a particular emotion in particular
maximizing the f2-score obtained for that class. They then try three different methods to
select data from target set to label and use them for feature selection for each of the SVM
classifiers in the ensemble. They data selection methods include (i) vote entropy where
they select samples in target set which have the highest disagreement over the ensemble of
classifiers (ii) uncertainty sampling where the samples closest to the decision margin of a
SVM trained on source data are considered. Note that in this case only one SVM classifier
is used (iii) random sampling. Once the labeled set from target data is obtained, feature se-
lection was done sequentially by adding features one at a time to the selection set for each
13
of the classifiers. The selection is done on the basis of highest f2-score obtained when the
SVM trained on source data is evaluated on the labeled target set. They showed that vote
entropy works the best when target training set is small whereas random sampling works
better for larger sample size. They also showed that implementing this feature selection
technique more often than not performs better than a baseline model trained on the features
selected using forward feature selection technique.
Abdelwahab and Busso [1] proposed using supervised model adaptation techniques to
obtain better cross-corpus performances. It can also be viewed as a domain adaptation tech-
nique where the classifier trained on a source data is adapted to perform better on a target
data. They used two different English databases for training (source data) and a smaller
French database for evaluation (target data) of their classifiers across high/low arousal and
valence detection. To reduce the effect of data-differences they mean-variance normal-
ized the features obtained for utterances from different datasets using their correspond-
ing dataset specific statistics. To mitigate the differences in their label spaces they mean-
variance normalized the scores obtained from different datasets individually. Samples with
normalized value below -0.3 were considered negative or low arousal/valence while those
with normalized value above 0.3 were considered positive or high arousal/valence. Base-
line classifiers were SVM trained only on one of the two source datasets. They explored
two adaptation techniques (i) adaptive SVM which tries to transform the decision boundary
of the already trained SVM classifier such that it classifies the labeled target data correctly
without changing the decision boundary by a large amount. (ii) incremental SVM where
along with newer target domain training data, a portion of the older source data was used
for training. Specifically, only those datapoints from source domain were retained that cor-
responded to the support vectors of the already trained SVM. This process was repeated
iteratively. They also considered using different amounts of target data to adapt the SVM
parameters. They found that using 35% of target training set for adaptation is as good as
14
using all of the target training set. Furthermore, speaker diversity also didn’t seem to matter
much while selecting the data used for adaptation from target training set. Both adaptation
schemes were found to perform similarly out-performing a non-adapted model.
Sanchez et al. [102] also applied domain adaptation techniques for improving cross-
corpus accuracies. They used two datasets in their experiments namely a 911 call dataset
and LDC dataset (Liberman et al. [74]) for a binary classification task between neutral and
fear classes. Their metric was f-measure score obtained for fearful calls. The 911 call had
95 calls and hence they reported the average 95 fold leave one call out cross-validation
f-scores. The calls were broken down into segments after prosodic features were extracted.
They were used for training and evaluation of a SVM classifier. They compared training
only using the 911 data and found that it always performs similar or worse than a model
trained using 911+LDC. They tried another approach where they trained a classifier only
on LDC and used the prediction probabilities obtained for 911 as additional features along
with the prosodic features. Then they trained a model using this appended feature set ob-
tained for 911. This model was found to perform better than both the above methods. They
also employed a method called nuisance attribute projection (NAP) to compensate for the
differences in channel and utterance length duration between the two datasets. The goal
of this method was to find a projection matrix P to project the original features to a space
more robust to nuisances arising from domain differences. While the f-measure obtained
using a model trained only on 911 dataset was found to be 64.1% the best performance
was achieved by training a model on NAP 911+LDC features with the f-measure being
64.8%. In [112], Song et al. used a similar projection matrix based approach to enhance
the cross-corpus performance of speech emotion recognition models. They proposed a joint
framework to learn a common projection matrix W to project the feature representations X
onto label space or embedding space Y for the source and target datasets. Moreover, they
also attempt to minimize the differences in Y that appear due to differences in X belonging
15
to different datasets. Hence, they add a maximum mean discrepancy (MMD) regulariza-
tion term that tries to minimize the difference between the mean embeddings of features
in source and target domains. At the same time, they also implement feature selection by
minimizing the l2,1 norm of W . Generally, we can say that whichever row of W has the
least l2 norm, that corresponding dimension of the feature vector has the least contribution
towards generating the embeddings. Additionally the also add a graph based regularization
term that minimizes the distance between the embeddings if the corresponding features
lie are close to each other in the feature space. The resulting objective function does not
have a closed form solution due to the presence of l2,1 norm and so an iterative algorithm
is implemented to get the matrix W and the resulting label space embeddings for target
dataset. Compared to a baseline linear classifier that directly learns the matrix W from
source without any of the regularization terms, this method showed an absolute improve-
ment of 17-18% in cross-corpus recognition accuracies and other popular projection matrix
based methods. Liu et al. [76] followed a very similar projection based matrix approach to
get the embeddings in label space except they did not have the graph based regularization
term in their objective function. The two studies used the same databases for their cross-
corpus evaluations and the better accuracies obtained using Song et al’s method shows the
worth of the graph based regularization term.
Now we focus on methods that instead of projection matrices, use auto-encoders (Baldi
[6]) to learn common feature representations for different data sets. Deng et al. [28] ex-
plored a supervised sparse auto-encoder based feature learning scheme. Their task was va-
lence level detection (high/low) in cross-corpus training scenarios. While the target dataset
was kept fixed, they used five different databases as source dataset. There were variations
between target and source datasets due to participant ages, languages and recording condi-
tions. They trained a single hidden layer auto-encoder to minimize the reconstruction error
when the feature vectors were fed to them as input. Moreover, they enforced a sparsity
16
constraint penalizing when the expected activations of the hidden layer exceeds a low fixed
level. Hence, these auto-encoders were termed as sparse auto-encoders. If the hidden layer
dimension is less than that of input layer dimension then the auto-encoder learns a sparse
low dimensional representation of the input feature vectors. The transfer feature learning
procedure starts with randomly choosing a certain number of instances from the target set
belonging to a particular class and training a class specific sparse auto-encoder to minimize
its reconstruction error. Then instances belonging to the same class are selected from the
source database and reconstructed using the trained auto-encoder. It is this reconstructed
vectors that are now used as features to train a SVM classifier. The authors experimented
with different numbers of instances chosen from target set. They showed that even with
only 50 target instances, on an average the classifiers trained on reconstructed source fea-
ture sets exceed the performance of classifiers trained on actual source feature sets by 9%
across the different source datasets. Note that this was a supervised feature transfer learn-
ing method because class information from target dataset was utilized to train the sparse
auto-encoders. Deng et al. [27] extended it to an unsupervised setting as well. They used
a single layer denoising auto-encoder (DAE). While training denoising auto-encoders, the
input feature vectors are first corrupted by either adding Gaussian noise or masking certain
dimensions. Then the corrupted version is fed to auto-encoder while the output’s recon-
struction error is minimized with respect to the clean input. Since the auto-encoder in this
case is made to reconstruct clean input from its noisy version, it learns more robust and per-
tinent representations of the input. To avoid over-fitting a weight decay regularization term
was also added. The authors also implemented an adaptive DAE (A-DAE). The first step
towards training an adaptive DAE is to train a DAE to reconstruct the features in target set.
Once the weights of this target-DAE are learned, a new DAE is trained on source set but
with a modified objective function. The objective function now not only reduces the recon-
struction error for feature vectors belonging to source set, but it also forces the parameters
17
of this new DAE to be closer in values to the parameters of the target-DAE. The importance
of these two terms in the objective function can be controlled by hyper-parameter tuning.
Finally the source and target datasets are encoded using the adaptive DAE and are used
to train and evaluate SVM classifiers. The authors showed that the A-DAE approach per-
forms better than DAE and some other popular feature transformation methods. Mao et al.
[80] implemented a shared hidden layer auto-encoder (SHLA) based scheme for domain
adaptation. A SHLA is an auto-encoder with a common input and hidden layer but with
separate output layers for source and target dataset. It is trained to reconstruct data from
both source and target domains. While the hidden to output layer weights are updated based
on the reconstruction errors obtained on the corresponding dataset, the input to hidden layer
weight matrix is updated in both cases. The authors considered a binary classification task
between high and low valence levels with different databases used for training and testing.
A feed-forward neural network classifier with one hidden layer was implemented for this
purpose. Similar to a SHLA, there were two output layers one for the source training set
and one for target training set. Since it was binary classification problem, each output layer
had two neurons corresponding to high and low valence classes. The input to hidden layer
weights of the classifier were initialized with the input to hidden layer weight matrix of
the SHLA. The hidden layer to output layer weights were initialized in such a way that the
priors between the related classes are shared i.e. the weight vectors from hidden layer to
’high valence’ neurons for both the source and target domains followed a Gaussian distri-
bution with identical mean vectors and covariance matrices. Similarly, the weight vectors
from hidden layer to ’low valence’ neurons for both the source and target domains followed
a Gaussian distribution with identical mean vectors and covariance matrices but different
from that of ’high valence’ neurons. So, we can say that priors between related classes
cross different datasets are shared. The authors showed that in general the shared prior
technique led to better accuracies compared to when the priors weren’t shared.
18
Abdelwahab and Busso [3] trained what they call as domain adversarial neural network
(DANN) to improve the cross-corpus recognition accuracies of speech emotion recognition
models. Their goal was to learn a common representation between the samples from source
and target domain. They train a neural network classifier with two different output layers -
one for emotion classification and one for domain classification. The two tasks share some
portion of the hidden layers which is used to get the aforementioned common representa-
tion for samples from both domains. The loss function includes a term that reduces the
prediction loss obtained for source data for which we have ground truth labels available.
It includes another term that updates the shared hidden layer parameters by reversing the
sign of gradients obtained from domain classification error term. We can leverage both
source and unlabeled target data to get the domain classification loss. Hence its an un-
supervised domain adaptation method. Both loss terms compete against each other in an
adversarial manner. The gradient reversal attempts to make the features similar across do-
mains, so that feature transformation information learned from source domain is retained
for target domain. The t-distributed stochastic neighbour embedding (TSNE) plots show
that the feature transformations obtained from the last layer of the shared hidden layers
from a trained DANN for source and target domains are more indistinguishable compared
to that obtained from a baseline vanilla deep neural network (DNN). They also showed
better performance of DANN models for arousal, valence and dominance value prediction
compared to a vanilla DNN. The relative improvements over the baseline models are 22.8%
for arousal, 33.4% for valence and 15.5% for dominance. Since, they formulated it as a re-
gression problem, the metric used was concordance correlation coefficient (CCC) between
ground truth and predicted values. CCC is defined so that it combines the idea of two more
well known metrics - root mean square error (RMSE) and Pearsons correlation coefficient.
Neumann and Vu [86] investigated the importance of pre-training on source dataset fol-
lowed by fine-tuning on a few training samples from target set. They attempted a binary
19
arousal and valence level classification using two datasets namely IEMOCAP (English) and
RECOLA (French). The features used for this experiment was 26 dimensional logMel fil-
terbank coefficients extracted at a frame rate of 10ms for 7.5 second long utterances. They
applied a 1D convolution over time followed by a max-pooling layer, output from which
was used to compute attention weights over time. The attention coefficients were then used
to compute a weighted sum of the information obtained from different parts of the input.
It was concatenated with the feature maps obtained after max-pooling layer and was fed
to softmax layer to get the classification prediction probabilities. They consider 4 training
scenarios : (i) within corpus speaker independent cross-validation (ii) multi-lingual cross-
validation by combining the data in IEMOCAP and RECOLA and using them for training
and validation (iii) cross-lingual (CL) where they trained on one corpus and tested on an-
other (iv) CL followed by fine-tuning on a smaller target training set (CL+FT). The mean
class-wise accuracies for both arousal and valence detection showed an increase in CL+FT
as compared to CL showing leveraging information from few target training samples can
be beneficial in cross-corpus experiments. As we have seen from previous studies this
was expected. The in-domain cross-validation studies in general performed better than CL
and CL+FT showing cross-domain differences can still matter even with fine-tuning. Kim
et al. [62] used a different transfer learning approach namely multi-task learning where
the knowledge gained from auxiliary tasks was used for improving the performance on the
main task. Their main task was emotion detection while the auxiliary tasks were gender de-
tection and naturalness (acted or natural) detection for samples in database. They used six
corpora in their experiments. Features such as F0, MFCC coefficients, voicing probability
were extracted after normalizing the gain of the utterance. They first performed in-domain
cross-validation studies on each dataset separately using long short-term memory (LSTM)
and DNN architectures both implemented under (i) a single task learning (STL) framework
where they recognized only emotions (ii) a MTL framework with shared hidden layers but
20
different output layers for each of the tasks. While DNN-MTL did not provide any im-
provement over DNN-STL, LSTM-MTL improved the average of mean class-wise accura-
cies over the six cross-validation experiments by 1.7%. They then performed cross-corpus
experiments using utterances from five of the corpora for training and one for testing. The
MTL framework showed more significant improvement compared to STL framework in
his scenario. They achieved an average absolute gain of 7.4% and 5.4% when comparing
the MTL with STL frameworks in case of DNN and LSTM respectively. They concluded
that MTL is potentially more effective for larger corpora. Furthermore, the TSNE of high
level features showed better clustering according to emotions in case of MTL than STL
frameworks.
As mentioned above most of the research done to improve cross-corpus generalizability
of speech emotion recognition models involve domain adaptation techniques that try to
learn a data representation that is nuisance free across source and target datasets. The model
adaptation methods we discussed used SVM classifier with some of them leveraging the
labels of the target domain data. In this thesis, we focus our effort on training discriminative
neural network based classifiers without leveraging any information from target domain
data. Furthermore, none of the past works have explored the generalization of generative
methods. One of the aims of this thesis is to try and close that gap.
21
Chapter 3: Smoothing model predictions using adversarial examples
3.1 Introduction
A speech emotion recognition system design involves extracting cues from speech and
depicting them as feature representations. This is followed by training a classification
algorithm using existing supervised/semi-supervised methods (Busso et al. [14], Koelstra
et al. [68]). We consider a setting with N training examples {xi,yi}, i = 1, ..,N, where xi
is the obtained feature representation for example i and yi is the corresponding label. Let
x and y denote the random variables of which xi and yi are instances. A typical supervised
learning approach involves modeling the probability p(y|x) using a chosen functional form
(e.g. neural networks or a support vector machine classifier). For the chosen model (trained
on finite training data) to generalize well to unseen data, the probability p(y|x) is desired
to have certain properties (Chapelle et al. [19]). One such property is the smoothness of
the distribution p(y|x) which states that if two points xi and xj are close to each other in the
feature space (based on some distance metric), then so should be their corresponding model
outputs p(yi|xi) and p(y j|xj). The underlying idea is for the classifier to be generalizable
and not overfit to training data. Enforcing this smoothness can be particularly useful for
low resource tasks such as emotion recognition, where collecting a large number of labeled
data instances may not always be possible.
Szegedy et al. [114] showed that neural network models are vulnerable to something
called adversarial examples. These are examples only slightly different from the training
22
examples but the trained model fails in recognizing their class correctly. Clearly, this is
a roadblock towards building generalizable models. Goodfellow et al. [44] suggested an
improved method called adversarial training (AT) in which the perturbations are added
along a specific adversarial direction. The adversarial direction for a certain training data
point is the direction along which the label probability of the model for that data point is
most sensitive. Miyato et al. [? ] proposed an extension of adversarial training, termed
Virtual Adversarial Training (VAT) wherein determining the adversarial direction doesn’t
depend on the availability of labels. Both of these methods are implemented by adding an
extra regularization term to the vanilla loss term. We refer to the training methods proposed
by Goodfellow et al. [44] and Miyato et al. [? ] as adversarial training procedures,
and investigate their applicability for improving the performance of emotion recognition
systems.
Similarly, manifold regularization methods impose this smoothness by modifying the
optimization objective (Belkin et al. [10]). Manifold regularization exploits the distribu-
tion p(x) as available through a set of labeled/unlabeled points to better estimate p(y|x),
thereby leveraging the concept of manifold learning to enforce model smoothness. In the
past, researchers have investigated manifold learning methods for speech-based emotion
recognition. Most of these methods attempt to learn the manifold by reducing the di-
mensionality of the input feature space and subsequently feeding them to a classifier. For
example, Kim et al. [64] and Ping et al. [94] employed isometric feature embedding for
deriving the manifolds and then used Gaussian Mixture Models as classifiers. You et al.
[129] employed Lipschitz embedding for non-linear manifold learning in an unsupervised
way followed by using support vector machines for classification. Qian et al. [96] applied a
supervised manifold learning method by considering the difference between feature subsets
of different classes and reported improvement in recognition accuracy. However, none of
them have investigated manifold regularization techniques that jointly optimize a manifold
23
regularization loss along with supervised classification loss. In particular, jointly optimiz-
ing the two losses has shown promise with deep neural networks (DNNs) for improving
ASR (Tomar and Rose [117]) and sentiment classification (Zhou et al. [134]). Researchers
have proposed several manifold regularization techniques, starting from Belkin et al. [10]
and Geng et al. [37]. These methods make use of available labeled/unlabeled data points
for regularization for better performance of classification models.
In this paper we compare the performance of AT and VAT to that of a baseline DNN
model for emotion recognition. After training the model using the aforementioned pro-
cedures, we evaluate its performance under two settings: (i) Running a cross validation
experiment on a single corpora (ii) Doing a cross corpora study. Under the single corpora
setting, we aim to understand the impact of adversarial training on system performance un-
der matched conditions. In the cross corpora setting, we train the model on a single dataset
and evaluate performance on three separate unseen datasets. We also compare the adver-
sarial training procedures with other regularization schemes. Along with the widely known
L1/L2, we also investigate the effect of the graph based manifold regularization scheme
discussed in Tomar and Rose [117]. We hypothesize that since manifold regularization
imposes smoothness constraint on the model’s outputs for the data points that are in the
neighborhood of each other, it can also make the model more robust to noise arising due
to difference in data distributions. In the following sections, we provide a background of
the adversarial training procedures (AT and VAT) and other regularization schemes that we
have employed. This is followed by a detailed explanation of the experiments after which
we present our conclusions.
24
Figure 3.1: Figure representing the sigmoid activation function and its derivative
3.2 Understanding adversarial examples
Even though deep neural networks are highly non-linear functions, Goodfellow et al.
[44] suggest that the existence of adversarial examples can be explained by considering
simple linear models. They argue that complex neural network models such as LSTMs or
feed-forward networks with activation functions such as sigmoid, ReLUs (Rectified Linear
Units) or maxout (Goodfellow et al. [43]) are kept in their linear region of operation for
efficient and easier optimization. For example, the derivative of sigmoid function gets
saturated in its non-linear region, thereby not providing enough gradients for the network
to learn as shown in the figure 3.1. However it might come as a surprise that these models
have very different outputs for a small deviation in the input. But that is usually the case
when we work in high dimensional spaces. Consider a linear model with weights w and
input x. Let x̃ = x+η be the adversarial input with η being the adversarial error term
added to input. Assume that ‖η‖∞≤ ε where ε is a small quantity. So, the activation of the
25
linear model for the perturbed input is given by wT x̃ = wT x+wT η . Assuming w ∈Rn and
the average absolute value of the elements in w is m, wT η ≤ εmn. Hence, even though the
input changes by a small amount ε the output of the model changes by an amount of the
order of εmn which can be a substantial quantity when working in higher dimensions.
One way of mitigating the effect of adversarial examples is to find such examples and
train the model with them. Let us consider a binary classification problem with ground truth
labels y = {−1,1}. We train a model that estimates the probability for a given datapoint x
using the equation : P(y = 1) = σ(wT x+b), where σ is the sigmoid function. To estimate
the parameters w, loss function L(x,y) is minimized over the training data. Since the log
function is monotonic, formally L(x,y) can be defined as:
−log(σ(w
T x+b)), if y = 1
L(x,y) =  (3.1)−log(1−σ(wT x+b)), if y =−1 −(wT x+b)
= 
log(1+ e ), if y = 1
(3.2)
T
log(1+ e(w x+b)), if y =−1
T
= log(1+ e−y(w x+b)) (3.3)
= T (−y(wT x+b)) (3.4)
where, T (z) = log(1+ ez). To derive the perturbation term η constrained by ‖η‖∞≤ ε for
the above loss function, we can follow the fast gradient sign method as mentioned in [44].
Basically it states:
η = εsign((∇xL((x,y)) ) ) (3.5)
e−y(w
T x+b)
= εsign − − wT x yw (3.6)1+ e y( +b)
= −εsign(y)sign(w) (3.7)
26
Since, y · sign(y) = |y|= 1 and (wT sign(w)) = ‖w‖1, the loss term L(x̃,y) for the perturbed
input x̃ = x+η can be written as:
L(x̃,y) = T (−y(wT x̃+b)) (3.8)
= T (−y(wT x+b)− y(wT η)) (3.9)
= T (−y(wT x+b)+ ε(y · sign(y))(wT sign(w))) (3.10)
= T (−y(wT x+b)+ ε‖w‖1) (3.11)
It can be observed that to some extent minimizing the loss function for adversarial inputs
is akin to L1 regularization. However, it is less stringent than L1 regularization penalty
because in this case as the model starts making confident predictions, y(wT x+b) attains a
high value thereby making the function T (z) enter its saturation region. This in turn does
not provide enough gradients for w to change as can be seen in Figure 3.2. Hence the effect
of term ε‖w‖1 disappears in this case unlike vanilla L1 regularization.
An alternative way to counter the effect of adversarial examples is via adding a reg-
ularization term to the overall loss function as has been done in Goodfellow et al. [44]
and Miyato et al. [? ]. This is the approach that we have followed for our task of speech
emotion recognition. In the following section, we explain the loss functions for a baseline
neural network without any regularization and the loss terms obtained after employing the
various regularization techniques.
3.3 Delving into loss functions
Let {xi,yi}, i = 1, ..,N be the set of N labeled data points that will be used to train a
neural network model. We represent the parameters of the neural network by θ . Then the
27
Figure 3.2: T (x) = log(1+ ex). Note how the graph saturates when x→−∞
output for the point xi is given by θ(xi). θ(xi) is a vector of probabilities that the neural
network assigns to each class in the label space spanned by y. It is computed using softmax
activation. A loss function is defined based on the neural network outputs and the one
hot-vectors yi corresponding to labels yi, as shown below. V (θ(xi),yi) is the loss for the
data point xi and ground truth label yi . Choices for the loss function include cross-entropy
(usually for classification tasks), mean squared error (usually for regression tasks) or the
hinge loss. We choose cross-entropy as the loss function for our baseline neural network.
1 N
L = ∑V (θ(xi),yi) (3.12)N i=1
With a small number of training instances N, its hard to generalize the performance
of a model trained solely on the loss above. The trained model performs well on training
set but not so much on unseen data suggesting overfitting. Studies have used L1 or L2
regularizers on neural network parameters or dropout to prevent overfitting on the training
28
set (Tripathi and Jadeja [118]). Manifold learning and smoothing is another way to pre-
vent overfitting and build models that generalize better. Along with the methods mentioned
above, we also implemented a graph based manifold regularization technique that lever-
ages the smoothness assumption that was mentioned in Section 3.1 Another approach is to
add a regularization term that penalizes large differences in model outputs when a small
perturbation is added to a data point. This mitigates the effect of adversarial examples on
a trained model. We determine the perturbation based on two existing methods: (i) adver-
sarial training and, (ii) virtual adversarial training. A brief overview of these methods are
given below.
3.3.1 L1/L2 regularization
L1/L2 regularization techniques belong to a larger class of parameter norm based penal-
ties which have been used for regularization for quite some time (Goodfellow et al. [41]).
Denoting training data points as xi, ground truth labels as yi and the parameters of the
model as Θ, the modified loss function consists of the supervised loss function V and a
regularization term f (Θ).
1 N
L ′ = ∑V (θ(xi),yi)+α f (Θ) (3.13)N i=1
The hyperparameter α controls the weight given to the regularization term. For a neural
network based model, usually only the weights (denoted by W) are modified and the biases
(denoted by b) are unaffected by the regularization term. Weights involve the dynamics be-
tween two variables while biases control a single variable. Hence, fitting biases accurately
requires less data than fitting weights and hence we would not gain much by regularizing
the biases.
For L2 regularization f (Θ) is chosen as squared L2 norm. L2 norm for a vector is
29
defined as the square root of the sum of squares of its components. So, the loss function
becomes
N
L ′
1
=
N ∑V (θ(xi),y
2
i)+α‖W‖2 (3.14)
i=1
The weights W are updated by taking the derivative of the loss function with respect to
the weight vector and implementing stochastic gradient descent. L2 regularization doesn’t
have much affect on the components of the weight vector that have a larger impact on the
objective function. On the contrary, the components of the weight vector which do not
affect the gradient undergo decay much faster. This leads to mitigating any training noise
induced along those components and prevent overfitting.
For L1 regularization f (Θ) is chosen as L1 norm. L1 norm for a vector is defined as
the sum of the absolute value of its components. So, the loss function becomes
1 N
L ′ = ∑V (θ(xi),yi)+α‖W‖1 (3.15)N i=1
Unlike L2 regularization, L1 regularization makes the model more more generalizable by
inducing sparsity in the parameter space. A technique called ’least Absolute shrinkage and
selection operator’ or LASSO leverages this property and removes less important feature’s
coefficients to zero thereby doing feature selection. While L2 regularization is the same
as Bayesian MAP (Maximum a posteriori) inference with Gaussian prior on weights, L1
regularization is the same as MAP with Laplacian prior on weights.
3.3.2 Adversarial training
Adversarial training (Goodfellow et al. [44], Miyato et al. [? ]) modifies the loss
function in such a way that it penalizes large deviations in model outputs when small per-
turbations are added to the training data points xi. A perturbation vector r ai is determined
30
for every datapoint xi followed by optimizing the modified loss Ladv to train a neural net-
work, as shown in Equation 3.16. D is a non-negative function that quantifies the distance
between the predictions θ(xi + r ai ) and targets yi. D is usually chosen to be cross entropy
or Kullback-Leibler divergence. α is a tunable hyper-parameter, determining the trade-off
between L and the adversarial loss.
1 N
Ladv = L +α× ∑ D(y aN i,θ(xi + ri )) (3.16)i=1
The perturbation r ai is determined based on Equation 3.17. The hyper-parameter ε
determines the search neighborhood for r ai .
r ai = arg max D(yi,θ(xi + r)) (3.17)
r:‖r‖≤ε
Considering ||r|| to be the Euclidean norm, r ai in Equation 3.17 can be approximated as
shown below.
g
radv ≈ ε ,where g = ∇xiD(yi,θ(xi)) (3.18)‖g‖2
If ||r|| is considered to be the infinity norm, then r ai is computed using Equation 3.5. The
gradient term in both the equations is obtained by differentiating the baseline loss func-
tion with respect to the input. It can be easily computed during back-propagation. It can
be observed that the regularization term added to the baseline loss function in adversarial
training depends on the ground truth labels. Hence, it can be considered as a supervised
regularization scheme unlike the other methods discussed here. We note that this opti-
mization has two hyper-parameters to tune, α and ε . We investigate the impact of these
hyper-parameters on the model performance in one of our experiments.
31
3.3.3 Virtual Adversarial training
Virtual adversarial training (Miyato et al. [? ]) also modifies the loss function in such a
way that it penalizes large deviations in model outputs for small perturbations in the input.
A perturbation vector r vi is determined for every datapoint xi followed by optimizing the
modified loss Ladv to train a neural network, as shown in Equation 3.19.
N
Ladv = L +α×
1 ∑ D(θ(xi),θ(xi + r vi )) (3.19)N i=1
The perturbation r vi is determined based on Equation 3.20. The hyper-parameter ε
determines the search neighborhood for r vi .
r vi = arg max D(θ(xi),θ(xi + r)) = arg max D(xi,r, θ̂) (3.20)
r:‖r‖≤ε r:‖r‖≤ε
where θ̂ is the current estimate of model parameters. Assuming D is KL divergence, we
can observe that D(xi,r, θ̂) = 0 for r = 0. Hence we can’t find an expression for r vi as we
did for r ai using Equation 3.18. Since the minimum value a KL divergence can attain is 0,
the derivative ∇rD(xi,r, θ̂) = 0 at r = 0. Using these along with Taylor’s approximation
we get:
D(xi,r, θ̂) ≈ D(xi,r, θ̂)| +rTr=0 ∇rD(xi,r, θ̂)| +rTr=0 H(xi, θ̂)r (3.21)
= rT H(xi, θ̂)r (3.22)
where H(xi, θ̂) = ∇∇rD(xi,r, θ̂)|r=0. Assuming we have a symmetric Hessian matrix
H(xi, θ̂) (which would be the case if D is twice differentiable at r = 0) would imply its
unit length eigenvectors ei (associated with the i-th biggest eigenvalue λi) are orthogonal.
32
Therefore, any unit vector r can be expressed as sum of these basis vectors i.e.
K K
r = ∑ αiei such that ∑ α2i = 1 (3.23)
i=1 i=1
Hence, for ‖r‖= 1
K
rT H(xi, θ̂)r = ∑ α2i e Ti H(xi, θ̂)ei (3.24)
i=1
K K
= ∑ α2i λ 2i ≤∑ αi λ1 = λ1 (3.25)
i=1 i=1
And the maximum is obtained when r is the dominant eigenvector e1. The perturbation
term can therefore be computed as:
r vi = arg max rT H(xi, θ̂)r = ε · e1(xi, θ̂) (3.26)
r:‖r‖≤ε
The dominant eigenvector for H(xi, θ̂) can be computed by initializing a randomly sampled
unit vector r̃0 and using the power iteration method.
H r̃
r̃ mm+1 = (3.27)‖H r̃m‖
The power iteration method will converge as long as the random initialization isn’t orthog-
onal to the dominant eigenvector e1 and the rate of convergence would depend on the ratios
λk
λ for k 6= 1. Expressing r̃0 as mentioned in Equation 3.23 we can work out the convergence1
33
of power iteration method.
r1 = H r̃0 (3.28)
K
= H(∑ αiei) (3.29)
i=1
K
= ∑ αiHei (3.30)
i=1
K
= ∑ αiλiei (3.31)
i=1
Pre-multiplying H to both sides of the above equation m times, where m is large we get,
K
rm = ∑ αiλ mi ei (3.32)
i=1
K (λ )k m
= λ m1 [α1e1 +∑ αi ei] (3.33)
i=2 λ1
≈ λ m1 α1e1 (3.34)
Equation 3.34 follows from the fact that |λ1|< |λ2|< ... < |λk| and therefore the ratios→ 0
as m→ 0. Hence, power iteration method converges to a vector lying along the dominant
eigenvector direction. The number of iterations m is a hyper-parameter for VAT. We didn’t
see any major differences in the model’s performance for different values of m and so it
was fixed at m = 1. In VAT, it can be seen that the adversarial perturbation term depends
on the model parameters θ . While updating the model parameters using backpropagation,
we do not take into account the gradient flow from the perturbation term. Further details
about the algorithm to compute r vi can be obtained from (Miyato et al. [? ]).
Equation 3.20 is very similar to Equation 3.17 except instead of ground truth labels yi
we use the ”virtual” labels θ(xi) which are probabilistic estimates obtained from a neural
network model. Since the regularization loss term is independent of ground truth labels,
34
it can be used in semi-supervised training scenarios where the first term L is computed
using labeled data and the second term is computed using both labeled and unlabeled data.
3.3.4 Graph based manifold regularization
We also try a graph based manifold regularization scheme that penalizes the model
for producing very different outputs for input datapoints within a certain neighborhood.
However unlike using KL divergence as in case of VAT, we consider Euclidean distance
in this case. Manifold regularization was proposed by Belkin et al. [10] and similar to
above methods we add a regularization penalty term to the cross entropy loss function. Let
us consider training datapoints xi(i = 1, ..,N), with corresponding labels yi. For a choice
of Reproducible Kernel Hilbert Space (RHKS) Hk and a loss function V , they optimize
equation 3.35 to yield a classifier function f ∗ belonging to the space Hk. In the equation,
V (xi,yi, f ) is considered to be cross-entropy loss function as ours is a classification prob-
lem. || f ||2I is a regularization term modifying the parameters of the classifier depending on
the distribution of the set of given data points in the training set (please refer to Section 2
in Belkin et al. [10] for more details). γI is the hyper-parameter controlling the trade-off
between the losses in the equation 3.35.
N
f ∗
1
= arg min ∑V (xi,yi, f )+ γI|| f ||2I (3.35)f∈Hk N i=1
|| f ||2I is computed as shown in equation 3.36.
N
2 ∑ ∑ || f (xi)− f (x
u
i )||2|| f ||I= 2 (3.36)
i=1 x u∈ ||xi−xiu||i 2
Neighborhood of xi
The loss minimizes the Euclidean distance between the outputs for labeled instance
xi: f (xi) and a set of data-points in the neighborhood of xi: f (x ui ). Neighborhood x ui for
35
any point xi is defined as the set of points lying within a L2 norm ball centered at xi. The
distance between the outputs is inversely weighted by the distance between xi and x ui , so
that the loss function weights the distance || f (xi)− f (x ui )||2 u2 more when xi is closer to xi
when considering Euclidean distance. For fast computation, we compute the loss term in
Equation 3.35 iteratively, updating the weights based on cross-entropy loss first and then
updating them based on the regularization loss. A similar approach was followed in Gupta
et al. [48] where the authors explored semi-supervised learning on twitter sentiment dataset
using doc2vec features. Since, the regularization term is independent of ground truth labels,
it can be used in a semi-supervised setting like VAT.
3.4 Comparison of various generalization schemes
We perform experimental investigations under two settings: (i) a single corpora setting
using a cross validation setup and, (ii) a cross corpora setting involving training on one
corpus and testing on the other. In the single corpora setting, we aim to test improvements
in the generalized performance of the model under matched dataset conditions. However,
in the case of cross-corpora evaluation, representations for emotional utterances tend to be
dissimilar due to factors such as differences in data collection protocol and noise conditions.
Through cross-corpora evaluation, we aim to investigate if manifold regularization can
yield models robust to the corpus specific variations.
3.4.1 Single corpora setting
We use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset (Busso
et al. [13]) for our single corpora evaluation. The dataset consists of five sessions of scripted
and improvised interactions between two actors acting out real world situations. No two
sessions have the same set of actors, enabling us to do a speaker independent leave-one-
36
session-out five-fold cross validation. The database comes with the dyadic conversation
segmented into utterances which are on an average about 5 seconds in duration. The utter-
ances were then labeled by three annotators for emotion labels such as happy, sad, angry,
excitement and neutral. We only use utterances for which we obtain a majority vote re-
garding the ground truth label. Following the work of Kim and Provost [65], we combine
the utterances in the happy and excited classes to get a “combined happy” class for our
experiments. This was done to obtain a more balanced dataset, given there are only a small
number of “happy” class instances. For our classification experiments we focused on a set
of 5531 utterances shared amongst four emotional labels: neutral (1708), angry (1103), sad
(1084), and happy (1636). Overall, this amounts to approximately 7 hours of data.
3.4.2 Cross corpus evaluation
We use a set of four datasets for the cross corpora evaluation. We train a DNN on the
IEMOCAP dataset to identify four classes of emotion, followed by predictions on these
datasets.
Surrey Audio Visual Expressed Emotion (SAVEE) database: Surrey Audio-Visual
Expressed Emotion (SAVEE) database (Haq et al. [53]) has recordings of four male speak-
ers reciting IEEE sentences in seven different emotions. For the purpose of our evaluation,
we only select the subset of utterances belonging to one of the four target emotions, as pre-
dicted by the model trained on the IEMOCAP dataset. The dataset consists of 60 utterances
each belonging to the angry, sad, happy classes and 120 neutral utterances. We acknowl-
edge that transfer of models across corpora spanning different label spaces is a challenge.
By selecting a subset of utterances in our experiments, we simulate a study that assumes
that the two datasets span the same label space.
37
Electromagnetic Articulography (EMA) database: Electromagnetic Articulography
(EMA) database (Lee et al. [72]) contains a set of 680 utterances spoken in four differ-
ent target emotions, such as anger, happiness, sadness and neutrality. Speakers are na-
tive speakers of American English: two females and one male. Note that the label space
spanned by this dataset is equivalent to the one spanned by utterances in the training set.
Linguistic Data Consortium’s (LDC) emotional prosody dataset: This database
(Liberman et al. [74]) was developed by LDC and contains the recordings of professional
actors reading a series of semantically neutral utterances (dates and numbers) spanning
fourteen distinct emotional categories. We select a subset of 714 utterances from the dataset
that span the four emotion labels as modeled using training on the IEMOCAP dataset.
MSP-IMPROV dataset: MSP-IMPROV (Busso et al. [16]) has actors participating
in dyadic conversations across six sessions and like IEMOCAP they also have been seg-
mented into utterances. But unlike IEMOCAP, it also includes a set of pre-defined 20 target
sentences that are spoken with different emotions depending on the context of conversation.
There are 7798 utterances belonging to the same four emotion classes. The class distribu-
tion is unbalanced with the number of utterances belonging to happy/neutral class more
than three times that of angry/sad.
We note that there are several dissimilarities between the IEMOCAP dataset and the
datasets used in the cross corpora study. Whereas the speakers in EMA and LDC have an
American accent, SAVEE has speakers having a British accent. Unlike IEMOCAP, these
databases aren’t dyadic conversations. While EMA and SAVEE have speakers speaking
different sentences emulating different emotions, in the LDC database we have speakers
reading out numbers while emulating different emotions. While MSP-IMPROV is more
similar to IMEOCAP than others in terms of how it was collected, the data distribution in
both is very different. We next discuss the features extracted on these datasets.
38
3.4.3 Features
We use the openSMILE toolkit to extract 1582 dimensional feature vector (Eyben et al.
[35]). This feature set consists of various functionals computed for spectral, prosody and
energy based features. The same feature set has also been used in several previous works
including the INTERSPEECH Paralinguistic Challenges (Schuller et al. [108]). Similar
sets of spectral, prosodic and energy based features have shown considerable success in
emotion classification and affect tracking (Gupta et al. [47]). However an increased fea-
ture count leads to the “curse of dimensionality”, a problem that manifold learning and
smoothing can mitigate.
3.4.4 Experimental setup
We use a DNN as our classification model, such that the output layer consists of four
nodes (each corresponding to an emotion), with softmax activation function. The DNN has
three hidden layers with the number of neurons in each layer set to 64. The objective func-
tion V (θ(xi),yi) is chosen to be the cross entropy loss in our experiments (Goodfellow et al.
[41]). We performed L1/L2 regularizations and compared their performance with the other
regularization techniques mentioned above. While performing AT, we chose the D function
to be the cross entropy between yi and θ(x + rai i ), while in the case of VAT, D is set to be
the cross entropy between θ(x vi) and θ(xi+ri ). Miyato et al. [? ] considered two different
distance functions D for VAT training: (i) Kullback-Leibler divergence between θ(xi) and
θ(xi + rvi ). (ii) cross entropy between θ(xi) and θ(xi + r
v
i ). We also experimented with
the Kullback-Leibler divergence as the distance function D, without observing significant
differences in the model performances. We also replaced the adversarial error terms rai and
rvi with a random error term and analyzed if adding perturbations along targeted directions
rather than random has any advantage. While performing graph based regularization, we
39
consider the neighborhood of each point xi to be its two nearest neighbors.
We implemented the models in Keras (Chollet et al. [22]) with a Tensorflow backend
and performed optimization using stochastic gradient descent (Zhang [132]). Our evalua-
tion metric is Unweighted Accuracies (UWA) which has been used previously in emotion
classification tasks (Sahu et al. [101]). Since the distribution of emotion classes are un-
balanced in the datasets of interest, the UWA metric assigns equal weight to each emotion
class during evaluation. Next, we present further details regarding the single corpus and
cross corpus evaluation.
Results: Single corpus setting
We perform a leave one session out cross validation experiment on IEMOCAP. Through
this experiment, we aim to understand the impact of the hyper-parameters, mainly the im-
pact of ε and α for adversarial training procedures on the model performance. We first
study the impact of regularization factor α as mentioned in Equations 3.14 and 3.15. The
plots are shown in Figure 3.3. It was observed that with regularizations, increasing the
weight of the regularization term to a higher value decreases model performance. This
makes sense because we still want the cross entropy error term V (θ(xi),yi) to dictate the
training of the DNN and not the regularization terms. The optimum performance was ob-
tained for α = 0.005 for L1 and α = 0.05 for L2 regularizations.
For adversarial training procedures, in order to study their impact individually, we per-
form evaluation by perturbing one of the two parameters, while keeping the other constant.
By altering ε , we aim to understand the impact of smoothing radius around the data-points
on the model performance and perturbing α impacts the weight of the adversarial loss on
the overall optimization. The plots comparing the UWA of baseline DNN with that of DNN
with adversarial training procedures for different values of hyper-parameters is shown in
40
Figure 3.4. It is evident that DNN trained with adversarial training procedures perform
better than the baseline DNN. First, the value of α was kept fixed at 2 and ε was varied.
For DNN trained with AT regularization term, the model shows a higher performance for
lower value of ε peaking at ε = 0.5. As we increase the value of ε , the model’s perfor-
mance starts deteriorating. This is expected since ε defines the neighborhood around an
input feature vector over which the conditional distribution p(y|x) is smoothed. Increasing
the radius of this neighborhood forces our model to learn smoother functions that cannot
capture the complexity of the conditional distribution function p(y|x) thereby decreasing
its performance on the validation set. For lower values of ε , AT outperforms VAT which
may be due to the fact that AT is a supervised learning scheme where we use actual labels
to find the adversarial direction. However, for higher values of ε , the trend reverses which
leads us to believe that for larger values of search radius we are better off smoothening the
output of the perturbed input with respect to the output of the actual input rather than the
label. Similar observations can be made when we compare targeted perturbations versus
random perturbation term added to input. For lower values of ε , targeted adversarial train-
ing procedures are better while for higher values, adding a perturbation in random direction
performs better than both the adversarial training procedures; all of which perform worse
than baseline DNN with no regularization. Changing the weight α while keeping ε fixed
at 0.5 did not seem to affect the accuracies of AT very much. For VAT however, increas-
ing the weight of the VAT loss parameter in the loss function decreases the performance
of the system. These experiments showed that with the right value of hyper-parameters,
using adversarial training procedures that add perturbation along a targeted direction per-
form better than adding perturbations along a random direction. It was observed that for
α = 2 and ε = 0.5 performance of AT was the best. We also implemented the graph based
manifold regularization scheme. The results obtained using the above hyper-parameters
for various regularization schemes are mentioned in Table 4.1. The hyper-parameters so
41
Table 3.1: Unweigthed accuracies obtained for different regularization schemes from the cross-
validation experiment
Model UWA
Baseline DNN 58.15
L1 regularization 59.02
L2 regularization 59.21
.
AT 59.54
VAT 58.17
Graph based manifold regularization 58.35
L2 + AT 60.33
obtained have been tuned via the cross-validation scheme. It was seen that AT performs the
best compared to other regularization schemes. Implementing it along with L2 regulariza-
tion further seemed to improve the results by a relative amount of approximately 4%. It is
observed that AT performs better than the other regularization schemes where the ground
truth label is not taken into account.
We further analyze the posterior probability distribution of the labels given the fea-
ture vectors (expressed by p(y|x)) by projecting and visualizing the four dimensional out-
put vector θ(xi) using t-Stochastic Neighbor Embedding (t-SNE) approach proposed by
Maaten and Hinton [78]. t-SNE is a dimension reduction technique that clusters similar
vector values together. The four dimensional output is projected to a two dimensional
space using t-SNE and plotted in Figure 3.5 for one of the cross-validation sets. The results
shown are with the hyperparameter values ε and α values fixed at 0.5 and 2, respectively.
We observe that compared to baseline DNN, neural networks trained with adversarial train-
ing procedures are better able to distinguish the ’happy’ samples. While for the baseline
DNN most of the pink ’happy’ samples overlap with blue ’neutral’ samples, for the other
two regularized models especially the one trained with AT, we see a few clusters formed
more or less entirely of the pink samples. Analyzing the confusion matrix suggests that this
leads to less confusion between utterances belonging to other classes with ’happy’ class.
42
Figure 3.3: Unweighted accuracies vs the hyper-parameter α for baseline DNN (green) and DNN
with L1 regularization on left (blue) and with L2 regularization on right (blue).
Table 3.2: Cross-corpus accuracies (%) obtained using baseline DNN and DNN models trained with
different regularization schemes. The training was performed using IEMOCAP in all cases.
Test Dataset Baseline DNN DNN with L2 DNN with AT DNN with L2+AT
MSP-IMPROV 43.43 43.57 45.22 45.37
SAVEE 47.29 52.29 53.13 52.5
EMA 57.77 58.32 64.51 64.1
LDC 43.66 43.28 45.64 45.97
Results: Cross corpus evaluation
Since the adversarial training procedure make the model robust to small perturbations
to the input training points, we hypothesize that the regularized models are also robust to
variation across datasets due to dissimilar noise conditions. Hence a model trained on an
external corpus can achieve better performance on a dataset of interest. To verify this, we
did a cross corpus analysis where the whole of IEMOCAP dataset was used for training and
a different corpus was used for testing. We extract the openSMILE features for the four
external corpora, followed by mean-variance normalization using in-corpus statistics. We
compare the UWA for three datasets as shown in Table 3.2 and show the superior perfor-
43
Figure 3.4: Unweighted accuracies vs the hyper-parameters ε (left) and α (right)
Figure 3.5: TSNE plots comparing the output of the baseline DNN model (left), DNN trained with
AT (center) and DNN trained with VAT (right)
mance of models trained with regularization procedures than baseline DNN. This indicates
that the adversarial procedures increase model robustness to cross-corpus differences. We
also note that the IEMOCAP trained models perform better on EMA compared to the other
three datasets. This can be explained by domain variabilities. While SAVEE has British
accented speech, in LDC the actors are reading out just numbers instead of actual English
sentences. EMA being an American English corpus where participants are reading out
sentences, comes closest to IEMOCAP which has actors having conversations in English.
44
The worst performance on MSP-IMPROV can probably be explained by the fact that this
dataset is more realistic compared to SAVEE and EMA that are more extreme in terms
of the emotional dialogs. This observation suggests that despite better model generaliza-
tion across datasets, data specific characteristics still play a part in determining the model
performance.
3.5 Conclusion and future work
In this chapter, we show the effectiveness of adversarial training procedures for emotion
classification using a DNN model. The regularization schemes enforce the smoothness of
the output probabilities p(y|x), a case particularly applicable to low resource tasks such as
emotion classification. We perform two sets of evaluation, a single corpus evaluation on
the IEMOCAP dataset and four evaluations using a cross-corpus setup. In both the cases,
we observe an improvement in the classification performance using adversarial training.
Regularization methods such as VAT and graph based manifold learning scheme that do
not leverage the ground truth labels do not show significant improvement probably because
of the less amount of data available to us while training. We perform further investigation
to understand the impact of the model hyper-parameters on the model performance and
analyze the model outputs using t-SNE projections of the model outputs.
In the future, we aim to conduct further investigations using the adversarial loss. In
particular, the VAT training procedure and the graph based manifold learning scheme
can be used for semi-supervised training schemes. This can be performed using an in-
domain/external source for unlabeled data. Another interesting area to further investigate
would be to study the effects of one regularization scheme on another when multiple regu-
larization schemes are implemented together. As can be seen from the cross-corpus results,
using L2 regularization with AT doesn’t always give an improvement. We also aim to
45
investigate other distance metrics D and its impact on the performance. Another perti-
nent problem is making the cross-corpus study compatible to different output label spaces
across the datasets. Finally, one can also test the adversarial methods to other low resource
problems.
46
Chapter 4: Multi-modal learning for Speech Emotion Recognition : An
Analysis and comparison of ASR outputs with ground truth
transcriptions
4.1 Introduction
Speech is the most common and efficient way of interaction that occurs on a daily basis
and its non-invasive nature has also resulted in speech features being popular for various
tasks one of them being emotion recognition. It has applications in several fields including
building intelligent voice-assistants, psychiatry, analysis of human interaction and other
behavioral studies (El Ayadi et al. [30]). Affect recognition or emotion recognition is a
well-researched field and the results demonstrate that using speech features does a better
job at predicting arousal levels (intensity) than valence (pleasantness) level of the utterance.
Valstar et al. [119] employed a support vector machine based regressor and found that the
metric concordance correlation coefficient (CCC) is higher for predicting arousal levels
than valence. They managed to improve the valence prediction task using information
from other modalities such as video and physiological signals. The work by Yang and
Hirschberg [127] shows similar results on a couple of databases after extracting features
from raw waveform and spectrogram using a convolutional neural network and passing
them through a neural network based regressor to get the predicted arousal and valence
scores. Li and Akagi [73] employed a fuzzy inference based system and their results show
47
a lower mean absolute error and a higher CCC in predicting arousal than valence across
three different languages. From the results shown by Lotfian and Busso [77] it can be
observed that the same is still true even after employing curriculum learning. The work by
Kim et al. [63] compared different neural network based systems in classifying between
angry, sad, neutral and happy and it was observed that all of them struggled in classifying
the ’happy’ samples correctly.
These results indicate that audio-based systems can be improved in predicting valence
levels by leveraging information from other modalities. Since our aim was to build an
emotion recognition model that only uses speech as input and modern state of the art ASR
models can generate good transcriptions, we looked at previous works using audio and
text features. Metallinou et al. [81] combined audio, video and phoneme level transcripts
for multi-modal emotion classification and showed an improvement as compared to a uni-
modal classifier. Zadeh et al. [130] used word level acoustic, vision and text features to
implement an attention architecture that captures cross-modal dynamics. In the work by
Hazarika et al. [54], similar features were input to a deep neural framework that was im-
plemented to capture the dynamics between speakers in a dyadic conversation. However,
none of these works have provided an insight of why multi-modal learning helps for emo-
tion classification and what is the contribution from each modality. Furthermore, all of
them have used ground truth text transcription for their experiments which can be time-
consuming and expensive to obtain. Schuller et al. [105, 106] trained an ASR model on
the dataset at hand and used the spoken words along with acoustic features for emotion
recognition. However, due to unavailability of ground truth transcriptions they were unable
to compare how much is the loss in performance when they use ASR transcriptions instead
of ground truth. In this paper we analyze the performance of a multi-modal system em-
ploying audio and text features, with the hypothesis that while audio features help us with
detecting arousal levels, the text features help us with valence prediction. We also devel-
48
oped a system that uses transcriptions obtained from different ASR models and compare its
performance with that of a system that uses only audio features and a multi-modal system
using ground truth transcriptions. In the following sections, we provide a background of
the datasets used for our cross-validation and cross-corpus study. Before we carry out the
experiments analyzing audio and text modality we explore a couple of different attention
mechanisms (Chorowski et al. [23]) in neural network architectures with audio features as
input for emotion classification. We wanted to see if they could improve accuracies when
we are training with limited data. We then carry out the multi-modal experiments where we
use both audio and textual features and show our results and analysis. Finally we present
our conclusions and future directions.
4.2 Methodology
In this section we explain the databases, feature sets and classifiers used for our exper-
iments. We then talk about the different ASR models employed to get the transcriptions to
be used instead of ground truth transcriptions.
4.2.1 Datasets
IEMOCAP
We use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset (Busso
el al. [13]) as one of the datasets in our experiments. The dataset consists of five sessions.
In each session, two actors act out scenarios which are either scripted or improvised. No
two sessions have the same actor participating in them. This enabled us to perform a five
fold leave-one-session out cross-validation analysis on IEMOCAP. The conversations have
been segmented into utterances which are then labeled by three annotators for emotions
49
Figure 4.1: Distribution of valence (top) and arousal (bottom) values for utterances in IEMOCAP
belonging to classes angry (red), sad (green), neutral (blue) and happy (black) classes
such as happy, sad, angry, excitement and, neutral. Manual transcriptions provided with
the dataset are considered as ground truth transcriptions. For our experiments, we only
use utterances for which we could obtain a majority vote and assign that as the ground
truth label. We used approximately 7 hours of data from the dataset which amounts to
5530 utterances : neutral (1708), angry (1103), sad (1083), and happy (1636). Apart from
annotating for categorical emotions, the utterances were also rated on a scale of 1-5 in terms
of their arousal and valence; 1 being low arousal/valence and 5 being high arousal/valence.
In Figure 4.1 we show a class-wise distribution of arousal and valence values. It can be seen
that while ’anger’ is low valence and high arousal, ’sad’ is low both in terms of valence and
arousal. ’Neutral’ is more or less symmetrical along the mean for arousal and valence. The
emotion ’happy’ is high in valence and and has more percentage of utterances with higher
arousal than ’neutral’ and ’sad’ but lesser than that of ’angry’. Following the observations
in the work by Neumann and Vu [87] we set the length of utterances as 7.5 seconds. Shorter
utterances were pre-padded with zeros while longer ones were clipped.
50
MSP-IMPROV
MSP-IMPROV (Busso et al. [16]) has actors participating in dyadic conversations
across six sessions and like IEMOCAP they also have been segmented into utterances.
But unlike IEMOCAP, it also includes a set of pre-defined 20 target sentences that are
spoken with different emotions depending on the context of conversation. There are 7798
utterances belonging to the same four emotion classes : neutral (3477), angry (792), sad
(885), and happy (2644). The class distribution is unbalanced with the number of utterances
belonging to happy/neutral class more than three times that of angry/sad. We didn’t have
ground truth transcriptions available for this dataset. We used MSP-IMPROV to perform a
cross-corpus study where we used it as a test set while IEMOCAP was used as training set.
4.2.2 Feature extraction
We extracted two sets of features for the speech based model and compared their per-
formances. The first set was the Extended Geneva Minimalistic Acoustic Parameter Set
(EGeMAPS) extracted using the openSMILE toolkit (Eyben et al. [34]). It is a 23 dimen-
sional feature set consisting of prosodic features like pitch, loudness, jitter, shimmer and
spectral parameters. These features were computed for every 20 ms window with a 10 ms
overlap. To reduce the computation time, we took the expectation of every ten such con-
secutive frames so that we have a smoother feature summary vector every 100ms which
was then fed to the LSTM. A similar approach was employed in the work by Zadeh et al.
[130] to get word level acoustic features from frame level features. The second feature set
was computed using the toolkit pyAudioAnalysis (Giannakopoulos [39]). This feature set
was also used by Chernykh et al. [21] for speech emotion recognition. The motivation
behind using such a feature set is the expectation that it would be more helpful towards
building a speaker agnostic emotion recognition model since they don’t include prosodic
51
features. Speaker-based normalization was applied to reduce speaker specific effects using
only the neutral speech as proposed by Busso et al [15]. Real world emotion recognition
systems usually have access to such samples, so its a fair assumption to make that they can
normalize the utterances from test speakers (Le and Provost [69]).
We used 100 dimensional Glove embeddings (Pennington et al. [92]) to initialize the
embedding layer of the text based neural network model. The embeddings are computed by
essentially factorizing the logarithm of a word-word co-occurrence count matrix obtained
from a 2014 dump of Wikipedia (glove.6B). The embedding layer was then fine-tuned for
the task at hand by backpropagating the error values obtained from the output layer.
4.2.3 Classification models
We used recurrent cells to compute a sequence of high-level representations from the
time-series of feature vectors capturing their contextual information as has been done by
Lee and Tashev [71] and Huang and Narayanan [59]. For the audio modality, we had two
long short-term memory (LSTM) layers with 256 and 128 hidden units, respectively, fol-
lowed by a dense layer of 64 neurons with rectified linear unit (ReLU) activation which was
connected to the output layer consisting of four neurons with softmax activation. Our text
based model had a similar architecture except there was an embedding layer that matched
the words with their corresponding Glove vector which was input to the first LSTM layer.
For our multi-modal system, the summary feature vectors obtained from the second LSTM
layer of the audio modality and the text based model were concatenated to form a 256 di-
mensional vector. This was followed by a ReLU activated dense layer with 64 neurons and
finally the output layer. A recurrent dropout probability of 0.3 was applied to all the recur-
rent layers in all the models. The hyper-parameters such as the number of recurrent/dense
layers, number of recurrent units, batch size, dropout probability etc. were decided based
52
on the cross-validation study done on IEMOCAP. Since the audio features were computed
every 100 ms for a 7.5 second segment, we have 75 time-steps for audio modality. For text-
based LSTM, we had 40 time-steps meaning the transcriptions were limited to 40 words if
they were longer than that, otherwise zero-padding was applied.
4.2.4 ASR models employed
Our next experiments involved running two free ASR applications to generate the tran-
scriptions and using them in our experiments instead of ground truth transcriptions. This
automatically generated transcription enables us to have an emotion recognition model that
only requires speech as its input so that we can do away with the manual transcription of the
utterances. We used the codes from Zhang’s github repository [131] to get the transcrip-
tions by implementing the models from Wit.ai (a Facebook company) and Google. We
note that the ASR engines from Google and Wit.ai were not able to generate transcriptions
for all the utterances mainly due to troubles with communicating with the API’s server.
For IEMOCAP, Google and Wit.ai could transcribe 89.9% and 78.3% of the samples, re-
spectively. For MSP-IMPROV, the percentage of utterances for which we could obtain
transcriptions using APIs from Google and Wit.ai are 90.55% and 60.24%, respectively.
Below we show a few ground truth (GT) annotations and their ASR transcriptions.
1. GT: You’re going to fill out a form on your desk
Google: fill out a form on your desk
WIT.ai: out a form on your desk
2. GT: you have to tell me
Google: you have to tell me
WIT.ai: you have to tell me
53
3. GT: Really you don’t work for anybody it’s just you
Google: really you don’t work for anybody it’s just you
WIT.ai: really don’t work for anybody is up
4.3 Exploring attention mechanisms
In IEMOCAP and MSP-IMPROV the annotations for emotion are given at an utter-
ance level. However all the frames of an utterance might not be relevant at determining
the emotion that it should belong to. Hence, we might need to weigh the frame/window
level features appropriately based on their importance. This is what the attention mecha-
nism does. Our goal here was to see if attention mechanisms help us with our experiments
where we have limited data. Weighing the frame-wise features might be a good idea but at
the same time computing the attention weights leads to an increase in number of parameters
which might lead to over-fitting in such cases. We considered only the audio modality for
these experiments. The baseline model was as described in section 4.2.3. We incorporated
our attention mechanisms between the first and second layer of LSTMs i.e. the sequential
output of the first layer of LSTM was weighed by the attention weights before being fed to
the second layer of LSTM. Let’s denote the output for time-step t obtained from the first
LSTM layer as h(t) ∈ Rn, where n is the number of hidden units in the LSTM layer. If we
denote the attention weight for time t by α(t) ∈ R, then the output of the attention mech-
anism is denoted by c(t) = α(t)h(t) We investigated two ways to compute the attention
weights as described below:
1. att1: a scalar attention weight α(t) ∈ R computed based on the root mean square
energy (RMSE) of the audio signal. For a segment of an audio signal, its RMSE
is defines as the square root of the sum of squares of the samples occurring in that
segment. Let e(t) ∈ R be the RMSE value obtained for the t-th window of an audio
54
signal. To make sure it has the same temporal resolution as the time-series of feature
vectors being fed to LSTMs, the RMSE was computed for a window of 200 ms with
a frame interval of 100 ms. The attention weight for time t was then computed using
the following equation:
e(t)
α(t) = (4.1)
∑T ′t ′=1 e(t )
So, h(t) from segments with lower RMSE will have lower contribution towards mak-
ing the final decision than the ones with higher RMSE. Note that employing this at-
tention mechanism doesn’t lead to any increase in the number of trainable parameter
2. att2: a scalar attention weight αt ∈ R computed from the output of the first LSTM
layer h(t). This is similar to the way the attention weights were computed using
equation 9 in Huang and Narayanan [59] and also in Neumann and Vu [87]. We
define a weight layer w ∈ Rn which was shared across the time-steps. The attention
weight for time t was then computed using the following equation:
wT h(t)
α(t) =
∑T
(4.2)
t ′=1 wT h(t′)
Note that in this case the number of trainable parameters increase by a value of n.
In Figure4.2, we plot the raw waveform and the attention waveforms obtained using the
two mechanisms along with the energy waveform obtained from h(t). Note that as h(t) ∈
Rn, we simply squared and summed its different dimensions to get the energy waveform
for h(t). We can see that while the profile of attention time-series att1 computed from
RMSE e(t) follows the intensity of signal, att2 follows the envelope of the energy waveform
obtained from h(t). This is expected behavior given how they have been computed. So, to
get the best of the both worlds our final attention att was considered to be a weighted sum
55
Figure 4.2: Plots of raw waveform (top left), att1 (bottom left), att2 (top right), energy of h(t)
(bottom right) for two audio files. Note that while att1 follows the envelope of the raw waveform,
att2 follows the envelope of energy waveform of h(t)
of att1 and att2 i.e.
att = (1−β )att1 +βatt2 (4.3)
We varied the parameter β = [0,0.2,0.8,1] to see the effect of weighting the two attention
mechanisms differently. From Figure4.3 we observe that the mean class-wise accuracy
obtained over the five cross-validation splits of IEMOCAP doesn’t vary significantly from
baseline. This is similar to the results obtained in Huang and Narayanan [59] and also
in Neumann and Vu [87] where implementing attention mechanisms didn’t lead to any
significant changes in accuracies from the model without attention mechanism. At the
same time we can see that it decreases as we increase the value of β . Note that as β
increases, contribution of att2 increases. We believe it could be because of limited data
and any advantage gained by implementing the attention mechanism is overshadowed by
over-fitting. This means the attention profile so computed is specific to the training set and
does not generalize well to unseen validation sets. It would be an interesting experiment to
56
Figure 4.3: Unweighted accuracies vs beta. As we increase the parameter ’beta’ accuracy decreases
which is possibly because of overfitting
see their effects on emotion recognition on a larger dataset.
4.4 Multi-modal experiments
Here we show the results and our analysis for the experiments performed. Our met-
ric would be un-weighted accuracy (UWA) which is the average of class-wise accuracies.
Since our datasets are not perfectly balanced, we believe it would be a better metric to use
than the overall accuracy or weighted accuracy. The results shown have been averaged
across four runs with different random seeds.
4.4.1 Comparing audio and text modalities
Our initial set of experiments were carried out to show the worth of multi-modal sys-
tems. We compared the two different audio feature sets EGeMAPS and the ones obtained
using pyAudioAnalysis but didn’t notice a big difference in the accuracies. We believe that
since the feature sets have undergone speaker based normalization prior to being fed to
57
the neural network model, we are getting rid of speaker specific characteristics and hence
the speaker-specific prosodic features used in EGeMAPS don’t deteriorate the performance
of the audio-only model.We chose to use the pyAudio feature set for further experiments.
Next we investigated the performance of a text-based system and a multi-modal system.
It can be seen from Table 4.1 that both of those models perform better than an audio-only
model. To verify our assumption that the audio modality is better for detecting arousal
while the text modality is better at detecting valence, we provide the confusion matrices
in Figure 4.4. It can be seen that the audio-based model (left) performs better than the
text-based model (center) in detecting ’anger’ which is a high arousal emotion. From the
first row of the matrices, we can also see that the text based model is more likely to confuse
the ’angry’ and ’sad’ classes than is the audio based model. This is because both anger and
sadness are low valence emotions but they differ in their arousal level, thereby making it
easier for the audio modality to distinguish between the two. Both the models perform sim-
ilarly when it comes to identifying the ’neutral’ speech samples. This is probably because
the ’neutral’ class lies somewhere in the middle of the arousal and valence axes and not at
one of the extremes. Hence neither of the modalities end up having any advantage over the
other. However, text based models do a much better job in identifying the ’happy’ sam-
ples than the audio based model. While the audio based model classifies 26% of ’happy’
samples as angry, our text based model does a better job at distinguishing between the two
classes. It further strengthens our hypotheses that text based models are better than audio
based models in distinguishing between high and low valence utterances. While anger is
a low valence emotion, happiness is high valence. Combining the two modalities we see
that class-wise accuracies either improve or remain almost the same for all of the classes.
Accuracies for ’sad’ and ’neutral’ obtained using multi-modal system are better than that
of uni-modal systems indicating that speech and text features supplement each other while
identifying samples from these two classes.
58
To further identify which modality helps with classification of what emotions, we im-
plemented an attention based multi-modal fusion for emotion recognition as has been done
by Hori et al [57] for video description. However, our implementation was a simpler ver-
sion of their implementation. One reason for a simpler implementation was because we had
lesser amount of data-points to work with and hence a simple implementation with lesser
number of parameters would prevent over-fitting. Another reason for this was because of
the way these problems are formulated. For a video description model, input to a recurrent
network are video samples and their output would be sentences which is a sequence of
words. This architecture is known as many-to-many recurrent model because both input
and output are sequential. In our case the input to LSTM models is a time-series consist-
ing of audio/text features for an utterance whereas the output is just one emotion label per
utterance. These models are known as many-to-one recurrent models. Implementing an
attention model for a many-to-one recurrent model is more simplified than many-to-many
recurrent models as we don’t have to consider the effect of a sequential output on comput-
ing the attention weights. But the core idea still remains the same. We want to implement
an attention mechanism which assigns relevance weights to the summary feature vectors
obtained from each modality. The summary feature vectors are then multiplied by the cor-
responding relevance weights and added before being fed to the following layers. Please
note that in the multi-modal system explained in section 4.2.3 the two summary vectors
were being concatenated instead of being added. While Hori et al. reported an improve-
ment in performance from multimodal fusion, we didn’t notice any significant change in
our metric UWA. However it did give us some important insights as to how the audio and
text modalities work towards classifying the four emotions. We now describe our imple-
mentation to compute the relevance weights for the two modalities. Let the summary vector
obtained for the time series of audio features after passing them through LSTM layers be
denoted by a ∈ Rm and the summary vector obtained for the sequential text features be
59
t ∈ Rm where m is the number of units in the final LSTM layer (assuming its the same for
both audio and text modalities as in our case). We define a weight layer w ∈ Rm which
was shared across the two modalities. The relevance weights γ and τ for audio and text
modalities respectively were computed using the following equations:
wT a
γ =
wT
(4.4)
a+wT t
wT t
τ = 1− γ = T T (4.5)w a+w t
Hence, if for an utterance γ < 0.5, it implies τ > 0.5 and so text modality was weighed
more in classifying that particular utterance. In Figure 4.5 we plot the histograms depicting
the number of utterances having a certain value of audio relevance γ as computed from
the multi-modal attention mechanism. We observe that while for ’angry’ and ’sad’ classes
γ > 0.5 for most utterances, its the other way around for ’neutral’ and ’happy’. This means
that for classifying most of the ’angry’ and ’sad’ utterances audio modality was found to
be more relevant while for most ’neutral’ and ’happy’ utterances text modality was more
helpful. Referring to Figure 4.1 gives us an insight as to why this could have been the case.
We note that the histogram is obtained for the training set of one of the cross-validation
splits so as to make sure that most of the utterances were assigned to the correct class
although the accuracy wasn’t 100% because the training was stopped once the validation
error started rising. We see that while most of ’angry’ utterances have high arousal, most
of ’sad’ utterances have low arousal but they have similar valence distribution. Hence, text
features being better at detecting valence would get confused to distinguish between the two
while audio features can easily make that distinction. At the same time both ’neutral and
’happy’ classes have a good amount of their utterances with neither high nor low arousal
values (concentrated around that 3-3.5 region on the x-axis) and hence audio modality can
be helpful in classifying ’angry’ or ’sad’ from these two classes. Similarly, ’neutral’ is
60
Table 4.1: UWA obtained from 5-fold cross-validation on IEMOCAP. Ground truth text transcrip-
tions are used here.
Model pyAudio Egemaps Glove pyAudio + Glove
UWA 56.94 56.85 61.89 68.18
Figure 4.4: Confusion matrices for one of the validation splits showing the class-wise accuracies
of audio-only (left), text-only (center) and multi-modal (right) systems. Numbers shown are in
percentages
the only class with most utterances having neither high nor low valence unlike the other
three classes which leads to text features being more helpful in classifying the ’neutral’
utterances. Similarly,’happy’ is the only class with utterances having higher valence than
the rest of the classes so text features play a greater role in identifying them. However, we
also see audio modality being assigned higher relevance weight for some of the ’happy’
utterances. This is probably because some ’happy’ utterances also have high arousal values
causing them to be confused with ’angry’ as seen from the multi-modal confusion matrix
in Figure 4.4. Even though that matrix was computed for one of the validation sets, the
same confusion must also be occurring in training set. This experiment further verified our
claim that audio helps with arousal level classification while text helps with valence level
classification.
61
Figure 4.5: Histogram plots showing the relevance weights assigned to audio modality (γ) for dif-
ferent utterances belonging to different classes namely angry (green), sad (black), neutral (blue) and
happy (red). The vertical black line in each plot shows x=0.5 i.e. for samples lying along that line
there is equal contribution from audio and video. Note that for most angry and sad utterances, audio
modality contributes more towards classifying them while for neutral and happy utterances its the
other way round.
4.4.2 ASR model output vs ground truth transcriptions used for multi-
modal classification
Having performed the multi-modal experiments on ground truth transcriptions, we now
created a pipeline where we only used audio data as input. We used the ASR transcriptions
generated from audio in the multi-modal system instead of ground truth transcriptions.
Since, the different APIs were able to transcribe different numbers of utterances, we ran
the experiments comparing the models with a different train/test file-list for each API. This
resulted in different accuracies even when only audio features were used or when they were
used along with text features obtained from ground-truth transcriptions. Figure 4.6(a)
shows the comparison between the cross-validation UWAs obtained from the audio only
model, the multi-modal system using ground truth text and the multi-modal system using
the API’s transcription. It indicates that the model trained on ground truth transcriptions
62
Figure 4.6: (a) Left figure compares the performance of an audio-only model with multi-modal
systems using ground truth text or the ASR transcriptions for the different ASR systems (b) In the
right we show the performance of the different ASR modules used in our experiment on IEMOCAP.
Lower is better
perform better than the ASR transcriptions as expected. We get a relative loss of 4% and
5.3% in accuracy compared to ground truth transcriptions when using Google’s and Wit.ai’s
ASR engine, respectively. To compare the quality of the transcriptions generated, we com-
puted the word error rate (WER) by measuring the Levenshtein distance (Heeringa [55])
(LD) between the generated transcriptions and the ground truth ones for each IEMOCAP
utterance and then averaging it over the entire dataset. Levenshtein distance between two
sentences measures the minimum number of insertions/deletions/substitutions of words re-
quired to convert one sentence to another. In general, longer utterances are more likely
to have a higher LD when compared to shorter utterances because there are more words
where the ASR model can make an error in transcribing. Since the different API’s tran-
scribed different numbers of utterances, this measure could provide us with a skewed idea
about the performance of APIs. Hence, we also computed a normalized Levenshtein dis-
tance (NLD) where we divide LD by the number of words in the ground truth transcription.
Figure 4.6 compares the performance of the two ASR APIs in terms of those two metrics.
We see that the difference is less stark in case of NLD, however both the metrics show sim-
ilar trends. The lower drop in UWA compared to ground truth transcriptions was obtained
63
using Google’s system. This can be explained by its lower word error rate as obtained for
the IEMOCAP dataset. Google’s and Wit.ai’s APIs have probably been trained on a large
amount of data so that the deep learning models used for ASR in both the APIs were more
generalizable giving us satisfactory performance on an unseen dataset. Wit.ai’s API seems
to perform worse than Google’s API in terms of UWA, but we should also keep in mind that
we are using different subsets of the dataset to evaluate the models. Also we are using less
data to train the pipeline using Wit.ai’s transcriptions (as explained in section 4.2.4) which
could also be one of the reasons for its worse performance. Having looked at the WER
of the two APIs, we now compare the average confusion matrix obtained over five cross-
validation sets for multi-modal systems using ground truth transcription vs ASR outputs
in Figure 4.7. It can be observed that class-wise accuracies are higher when ground truth
transcriptions are used as expected. Comparing models using Google API’s output with
that of using ground truth transcriptions, the absolute increase in percentage of ’happy’
samples being classified as ’angry’ and ’neutral’ is more than that of the ’sad’ class. This
could be because the arousal value distribution of ’happy’ utterances is more similar to
’angry’ and ’neutral’ than that of ’sad’ utterances (from Figure 4.1). When using Wit.ai
API’s output instead of the ground truth transcriptions we see more ’angry’ samples being
miss-classified as ’happy’ probably for a similar reason. The same is true when there are
more ’sad’ samples being miss-classified as ’neutral’ and ’happy’ when Wit.ai is used to
transcribe. These observations show that using worse quality transcriptions leads to more
confusion between classes with similar arousal values which points to the fact that audio
features contribute to the classification to a greater extent in such cases.
64
Figure 4.7: Confusion matrix obtained from multi-modal systems using ground truth transcriptions
(left) and ASR transcriptions (right) using (a) Google’s and (b) Wit.ai’s API
4.4.3 Cross-corpus analysis
To verify the generizability of our model, we did a cross-corpus analysis where we
trained our model using IEMOCAP and tested it on MSP-IMPROV. We preferred IEMO-
CAP for training because it is more balanced. We have compared the performance between
an audio-only system and a multi-modal system using the generated transcriptions. The
tokenizer used in these experiments were generated from the IEMOCAP dataset. Doing
so would allow us to capture the cross-domain difference in their vocabulary. Utterances
in MSP-IMPROV for which we could not find any of the words in the tokenizer were not
used in the experiment. We see a similar trend where using ASR transcriptions along with
audio results in a better emotion recognition model. The Google based system gives a rela-
tive improvement of 9.8% and using Wit.ai’s ASR API results in a relative improvement of
65
Table 4.2: Cross corpus results with IEMOCAP as training set and MSP-IMPROV as test set.
Model Google Wit.ai
Audio only 35.93 38.06
Audio + ASR output 39.45 40.08
5.2% compared to an audio-only model. However, the improvements weren’t as much as
observed in cross-validation experiments, possibly due to cross-domain differences in the
vocabulary of IEMOCAP and MSP-IMPROV.
4.5 Conclusion
Our experiments demonstrate that acoustic features help in detecting level of arousal
whereas the text based model helps in detecting valence level. Combining information
from both to build a multi-modal system seems to increase the class-wise accuracies. When
using ASR transcriptions instead of ground truth ones, audio features seem to contribute
more towards deciding which class an utterance should belong to. Deep learning based
ASR models trained on thousands of hours of data (Prabhavalkar et al. [95]) improves their
generalizability thereby giving us meaningful transcriptions for unseen datasets which we
can leverage to get higher cross-corpus accuracies. Hence, we can take advantage of the
genralizability of ASR models to improve the generalizability of emotion classification
models. In the future we plan to investigate the utility of articulatory features by incorpo-
rating them in our models. We also aim to explore various word embeddings other than
Glove or sub-word embeddings which are better at handling out of domain vocabulary
words. We also plan to look at ways we can get word embeddings specific for an emotion
recognition/sentiment classification task (Tang et al. [115]). It would also be interesting to
explore text features obtained using dictionaries used specifically for an emotion recogni-
tion/sentiment classification tasks. Additionally, we plan to explore novel ways to combine
the information from the audio and text modes in the multi-modal learning framework.
66
Chapter 5: Generative models to capture the underlying distribution of
feature vectors
5.1 Introduction
Emotion recognition is a fairly widely researched topic. Some of the previous works
done by Williams and Stevens [124] and by Banziger and Schere [8] include use of F0
contours. In their survey paper, El Ayadi et al. [30] mention several features such as for-
mant features, energy related features, timing features, articulation features, TEO features,
voice quality features and spectral features useful for emotion recognition . Researchers
have also investigated various machine learning algorithms such as Hidden Markov Mod-
els (Lin and Wei [75]), Gaussian Mixture Models (GMM) (Hu et al [58]), Artificial Neural
Networks (Singh et al [111]), Support Vector Machines (SVM) (Ververidis and Kotropou-
los [122]) and binary decision trees (Lee et al [70]) for emotion classification. Recently,
researchers have also proposed several deep learning based approaches for emotion recog-
nition (Huang and Narayanan [59]). Stuhlsatz et al. [113] reported accuracies using a Deep
Neural Network on 9 corpora using Generalized Discriminant Analysis features to do a bi-
nary classification between positive and negative arousal and positive and negative valence
states. Xia and Liu [125] implemented a denoising auto-encoder for emotion recognition.
They captured the neutral and emotional information by mapping the input to two hid-
den representations, and later using an SVM model for further classification. Ghosh et al.
[38] used denoising auto-encoders and showed that the bottleneck layer representations are
67
highly discriminative of activation intensity and at distinguishing negative versus positive
valence.
A typical setup in several of these studies involves using a large dimensionality of fea-
tures and using a machine learning algorithm to learn class boundaries in the corresponding
feature space. This design renders a joint feature analysis in the high dimensional space
rather difficult. Methods such as principal component decomposition (PCA) and linear
discriminant analysis (LDA) have been known to compress high dimensional feature into
lower dimensions. PCA aims to de-correlate the features by finding the axes with maximum
variance where the data is most spread, and then projecting the original feature vectors onto
those dimensions. LDA projects data-points onto axes so as to minimize the within class
covariance of the projected data-points but maximize the between class co-variances. More
details about these methods can be found in the book ”Pattern classification” by Duda et
al [29]. Auto-encoders (Baldi [6]) have also been used for similar tasks. The input high
dimensional feature vector is passed as input to a stack of neural network layers. The initial
part of an auto-encoder is known as encoder. It consists of a series of hidden layers with
the number of neurons in them decreasing from one layer to another. The final layer of
encoder called the bottleneck layer, has the same number of neurons as the dimension of
the compressed space. Assuming the encoding function is denoted by E, for an input x
the output of an encoder can be denoted by E(x). The second part of auto-encoder follow-
ing the encoder is called a decoder which renders the compressed representation back to
the original dimension through a series of neural network layers. Denoting the decoding
function by D, the final transform that the auto-encoder applies on an input x is given by
D(E(x)). The weights of the neural network layers are then updated by backpropagating
errors from a loss function which intends to make D(E(x)) as similar to x as possible either
by minimizing the mean square error between them or using a cross entropy loss func-
tion. PCA, LDA and auto-encoders have been investigated as dimensionality reduction
68
techniques for speech emotion recognition (You et al [128] and Cibau et al [24]). However
the drawback of these methods is that even in the lower dimensions, it is hard to see struc-
tures or clusters being formed by the feature vectors belonging to same class. We train a
model that leverages label information to cluster the input vectors and then apply the com-
pressed representation on a test/validation set. We believe this type of a framework can
be useful in deciding the worth of features being used for a particular task. For example,
if the compressed features from both training and test sets cluster well according to the
categories they belong to, it means the features are well suited for the classification task. If
on the other hand it doesn’t cluster well for test set, it means they are not. We compare this
method with other traditional methods both qualitatively and quantitatively and investigate
their generalizability.
The next part of this chapter is focused on generative models. As mentioned before,
discriminative models learn a conditional distribution p(yi|xi), given a set of feature vectors
xi and the corresponding labels yi, i = 1..N. Generative models on the other hand model the
distribution of the classes. They aim to learn the joint distribution p(xi,yi). We implement
generative adversarial networks (GANs, proposed by Goodfellow et al [42]) based models
to learn the distribution of feature vectors used for speech emotion recognition. We define
metrics on how to compare the different models and investigate their generalizability and
applicability.
5.2 Generative adversarial networks
A generative adversarial network consists of two components: a generator, G and a
discriminator, D. Given a random sample z from a random probability distribution pz, the
generator is responsible for generating a fake data-point G(z). The discriminator attempts
to classify real samples x (drawn from a distribution pdata) against the one generated by
69
the generator. Probability distribution pz is usually considered to be of lower dimensional
and simpler than the data distribution pdata. Popular choices include Gaussian or a uniform
distribution. The objective of training a GAN is to obtain a generator that can mimic
real data such that the discriminator is incapable of differentiating between real and fake
samples. GAN is trained using the following optimization on the GAN loss V (D,G).
minmaxV (D,G) = Ex∼pdata[logD(x)]+Ez∼pz [log(1−D(G(z)))] (5.1)G D
In the equation above, D(x) and D(G(z)) are the probabilities that x and G(z) are inferred
to be real sample by the discriminator. Note that in the optimization in equation 5.1, the
generator attempts to fool the discriminator as it tries to minimize V (D,G). During GAN
training, optimization of the loss function is achieved by updating the parameters of the
discriminator and generator networks in an iterative way. We minimize the discriminator
and generator losses as defined below and track them separately. Note that for discriminator
loss, y is 1 if input is x and 0 if input is G(z).
Disc. loss: − y log(D(x))− (1− y) log(1−D(G(z)))
(5.2)
Gen. loss: − log(D(G(z))),where x∼ pdata, z∼ pz
Figure 5.1 provides a block diagram of a GAN architecture. While a lot of work has
been done exploring the applicability of GANs for vision tasks, there are only a few such
works that has explored their utility for speech emotion recognition in recent years. In
[51], Han et al propose adding an extra GAN based adversarial loss term along with the
usual categorical cross entropy loss term to predict emotions from speech. Eskimez et al
[32] investigate unsupervised feature learning using various GAN and auto-encoder based
architectures for speech emotion recognition. In the next sections we discuss the variations
of GAN architectures we have used followed by our experiments and results.
70
Figure 5.1: Block representation of a GAN architecture. A GAN requires access to real samples
from a dataset and samples from a probability density.
5.2.1 Adversarial auto-encoders
Adversarial auto-encoders (AAE) proposed by Makhzani et al [79] have been shown
to perform quite well in digit recognition and face recognition tasks. We use adversarial
auto-encoders for emotion recognition in this paper motivated by their performances on
other tasks for feature compression as well as data generation from random noise sam-
ples. Speech emotion recognition involves working with high dimensional features which
can render a joint feature analysis in a high dimensional space rather difficult. Adversar-
ial auto-encoders address this issue by encoding a high dimensional feature vector onto a
code vector, which can be further enforced to follow a pre-defined probability distribution
function. This has been termed as mapping space distribution (MSD) in Figure 5.2. To the
best of our knowledge, this is the first such application of adversarial auto-encoders to the
domain of emotion recognition. We borrow a specific setup of adversarial auto-encoders
with adversarial regularization to incorporate class label information as has been shown
in Figure 5.2. An adversarial auto-encoder broadly consists of two major components: a
generator and a discriminator. In Figure 5.2, we show the generator at the top, which given
a sample x from the real data (e.g. pixels from an image, features from a speech sample)
learns a code vector for the data sample. We model an auto-encoder for this purpose, where
71
the model learns to reconstruct x through a bottleneck layer. We represent the reconstruc-
tion for x as x′ in Figure 5.2. The discriminator (in the bottom half of Figure 5.2) obtains
the code vectors encoded by the auto-encoder as well as samples from MSD, and learns
to discriminate the encoded real samples from the MSD samples. The generator and the
discriminator operate against each other, where the discriminator attempts to accurately
classify real samples against MSD samples and the generator produces code vectors re-
sembling MSD samples to confuse the discriminator (so that the discriminator is not able
to distinguish real from synthetic inputs). They further proposed tricks such as, in a setting
where the samples x belong to different classes, the MSD is a mixture of Probability Dis-
tribution Functions with as many components as the number of classes. In our case it was
chosen to be a 2-dimensional 4 component (since our task is a 4-way classification) Gaus-
sian mixture with same co-variance matrices and their mean vectors orthogonal to each
other. The orthogonal means ensure that the different mixture components are maximally
separated. Furthermore, to enforce each component of the mixture PDF to correspond to
a class, the authors regularized the hidden code vector generation by providing a one-hot
encoding for the classes to the discriminator.
Our model is trained while the following two adversarial losses converge: (i) cross-
entropy is minimized for code vectors to be classified as MSD samples (implying encoder
is able to generate code vectors resembling MSD), and (ii) cross-entropy is minimized so
that the discriminator is able to classify between encoded samples and samples from MSD.
More specifically, while adversarial losses converge the parameters of the adversarial auto-
encoder are updated in the following iterative way:
• Weights of the auto-encoder (both encoder and decoder) are updated based on a re-
construction loss function. We chose this function to be Mean Squared Error (MSE)
between the inputs x and the reconstruction x′.
72
Figure 5.2: A summarization of the adversarial auto-encoders. The generator at the top creates code
vectors. The discriminator learns to classify the code vectors generated from real data from the
synthetic samples. Label information is provided to discriminator so that samples from a particular
class are mapped to a specific mixture component of MSD
• The data is transformed by the encoder and we sample an equal number of samples
from MSD p(z). Weights of the discriminator are updated to minimize cross-entropy
to classify between encoded samples and samples from MSD.
• We then freeze the discriminator weights. The weights of encoder are updated based
on its ability to fool the discriminator (equivalently minimizing the cross-entropy for
real samples to be labeled as MSD samples).
Once trained, we can use the encoder to get compressed representations of higher dimen-
sional feature vectors. At the same time we can sample points from MSD, pass it through
decoder and generate synthetic feature vectors. The performances of an adversarial auto-
encoder in this regard are discussed later in this chapter.
73
5.2.2 Data generating GAN
The main purpose of implementing an adversarial auto-encoder was to generate mean-
ingful lower dimensional representations from higher dimensional features. The synthetic
samples generated from the decoder of the auto-encoder was a useful by-product. While
we mapped the encoded samples to match a mapping space distribution with maximally
separated mixture components until adversarial losses converge, the only updates done to
the decoder (that generates synthetic samples given samples from MSD) were based on
the auto-encoder reconstruction error. A question that comes to mind is how realistic the
synthetic samples would be if we trained a GAN based model where a discriminator differ-
entiates between synthetic and real samples until adversarial losses converge. We discuss
such an implementation in this section. The architecture is similar to what has been shown
in Figure 5.1 with a few modifications. The real samples consisted of the high dimensional
feature vectors obtained from real data. pz was considered to be same as the pdf of MSD
in case of an adversarial auto-encoder i.e. a 2-dimensional 4 component Gaussian mixture
with orthogonal means and same co-variance matrix to ensure the mixture components are
maximally separated. Since, we consider our dataset to have four classes, we train the GAN
in such a way that each component of the mixture pdf when passed through the generator,
generates a synthetic sample belonging to a specific class. To ensure correspondence be-
tween the mixture components and classes, we provide the discriminator with an additional
input of one-hot label vector in the same way we did in case of adversarial auto-encoder’s
discriminator. For real data-points the one-hot label vector depicted its class whereas in
case of synthetic samples it depicted the mixture index of the 4 component Gaussian mix-
ture from which it was generated. The generator in our data generating GAN had the same
architecture as the decoder of our adversarial auto-encoder set up. To make sure the adver-
sarial errors converge, we had to incorporate some tricks. The changes incorporated were
74
mainly to improve the generator: (i) Initializing the generator’s weight with the decoder’s
weight of a trained adversarial auto-encoder (ii) keeping the generator’s learning rate higher
than the discriminator (0.01 vs 0.001 respectively) and, (iii) training the generator for two
iterations for every iteration of discriminator training. The effects of these methods has
been discussed in more detail in Sahu et al [100]. We call this architecture dGAN 1.
One thing to note in this architecture is that pz was considered to be a mixture pdf based
on the number of classes we have. However, in most GAN applications it is considered to
be a simpler pdf like a normal or a uniform distribution. We modified our above GAN
architecture so that pz was now a normal distribution. We also modified it so that now it
was a conditional GAN (Mirza et al [82]) where along with providing the generator with a
sample z from pz we also provide it the class label that we want the generator’s output to
belong to. This was required as we did not have different mixture components to generate
data from for different classes. In addition we added an extra term to the GAN loss function
that maximizes the mutual information between the class label provided at the generator
input and the discriminator output. Since, implementing the actual mutual information loss
was intractable, we implement an approximation using variational technique as has been
discussed in Chen et al [20]. We note that only the parameters of the generator are modified
based on this mutual information loss function. We call this architecture dGAN 2.
While adversarial losses converge the parameters of the dGANs are updated in the
following iterative way:
• Points are sampled from pz and fed to generator to generate synthetic samples (along
with class labels in case of dGAN 2). Weights of the discriminator are updated to
minimize cross-entropy to classify between synthetic samples and real samples.
• We then freeze the discriminator weights. The weights of generator are updated
based on its ability to fool the discriminator (equivalently minimizing the cross-
75
Figure 5.3: Architectures for dGAN 1 (left) and dGAN 2 (right). Note that in case of dGAN 2 the
discriminator has a second output layer which predicts the class of the synthetic samples generated.
Mutual info. based loss is added while optimization so that the predictions are as close to the class
label being provided to the generator.
entropy for synthetic samples to be labeled as real samples). In case of dGAN 2,
an additional loss term based on mutual information is also considered to update
generator’s weights.
Once trained, we can sample points from pz and feed it to the generator (along with the
class labels in case of dGAN 2) to generate synthetic samples.
5.2.3 Adversarial auto-encoder with data generating GAN
The next set of GAN architectures we implemented was to get the best of both worlds by
combining adversarial auto-encoders with data generating GANs discussed in last sections.
Since, the generator of the data generating GANs in our case had the same architecture as
the decoder of adversarial auto-encoders, we simply combined the the two and trained them
jointly as shown in Figure5.4. Since the adversarial auto-encoder and the data generating
GANs are trained simultaneously, there was no need to initialize with the decoder’s weight
of a trained adversarial auto-encoder as was done in case of dGAN 1. Also instead of five,
the generator was trained for three iterations for every iteration of discriminator training.
We call the two architectures as AAE dGAN 1 and AAE dGAN 2 corresponding to the
76
Figure 5.4: Architectures for AAE dGAN 1 (top) and AAE dGAN 2 (bottom). Note that there are
two discriminators now, one to learn the encoding space and one to generate data samples. While
in AAE dGAN 1, the encoding space is pre-defined to be a mixture of 4 maximally separated
Gaussians, in case of AAE dGAN 2 it is being learned from the training data provided using a code
generator block
data generating GANs they have been derived from. Note that AAE dGAN 2 has been
implemented by Wang et al. for computer vision tasks [123].
While adversarial losses converge the parameters of the AAE dGANs are updated in
an iterative way alternating between an AAE training phase and a dGAN training phase. In
the AAE training phase:
• Weights of the auto-encoder (both encoder and decoder) are updated based on a re-
construction loss function. We chose this function to be Mean Squared Error (MSE)
between the inputs x and the reconstruction x′.
• The data is transformed by the encoder and we sample an equal number of samples
77
from pz. In case of AAE dGAN 2, the sampled points are also passed through the
code generator: block CG. Weights of the discriminator (D 1 in pictures) are up-
dated to minimize cross-entropy to classify between encoded samples and samples
obtained/derived from pz.
• We then freeze the discriminator (D 1) weights. The weights of encoder are up-
dated based on its ability to fool the discriminator (equivalently minimizing the cross-
entropy for real samples to be labeled as MSD samples).
In the dGAN training phase:
• Points are sampled from pz and fed to decoder (in case of AAE dGAN 1) or to
CG + decoder along with a class label (in case of AAE dGAN 2). Weights of the
discriminator (D 2 in pictures) are updated to minimize cross-entropy to classify
between synthetic samples and real samples.
• We then freeze the discriminator (D 2) weights. The weights of decoder (in case of
AAE dGAN 1) or to CG + decoder (in case of AAE dGAN 2) are updated based
on its ability to fool the discriminator (equivalently minimizing the cross-entropy for
synthetic samples to be labeled as real samples). In case of dGAN 2, an additional
loss term based on mutual information is also considered to update the weights.
Once trained, we can use the encoder to get compressed representations of higher dimen-
sional feature vectors. At the same time we can sample points from pz, pass it through
decoder (in case of AAE dGAN 1) or through CG + decoder along with a class label(in
case of AAE dGAN 2) to get synthetic feature vectors.
One thing to note is that unlike in case of AAE dGAN 1 where the coding space was
specified to be a mixture of 4 maximally separated Gaussians, in case of AAE dGAN 2
78
the code generator block (CG) learns the coding space from the training data provided. We
next evaluate these models’ coding and synthetic data generating capabilities.
5.3 Comparison of various models’ performance
In this section we discuss the performances of the various various GAN based models.
The adversarial auto-encoder and the derived models (AAE dGAN 1 and AAE dGAN 2)
have an encoder that can project the higher dimensional features onto a lower dimensional
code space. At the same time they can also be used to generate synthetic features by
sampling points from a prior pz and passing it through the decoder. The other two GAN
based models dGAN 1 and dGGAN 2 have only the capability to generate synthetic data
points when we feed their generator with points sampled from a prior pz. After training the
models on utterances with emotions, we conduct two specific experiments: (i) judging the
encoder’s capability to project higher dimensional feature vectors onto lower dimensions in
AAE based models, (ii) judging the GAN based models’ capabilities to generate synthetic
data. We compare them with each other and with a traditional approach like fitting a GMM.
5.3.1 Features
We use the openSMILE toolkit to extract a set of 1582 dimensional feature vector [35].
This feature set consists of various functionals computed for spectral, prosody and energy
based features. Same feature set has also been used in several previous works including
the INTERSPEECH Paralinguistic Challenges [108]. Similar sets of spectral, prosodic and
energy based features has shown considerable success in emotion classification and affect
tracking [47].
79
5.3.2 Projecting higher dimensional points onto lower dimensions
In this section we look at the encoders’ performance to project higher dimensional fea-
ture vectors onto lower dimensions. The goal of this experiment is to quantify the loss in
discriminability after compressing the original feature to a smaller feature subspace. We
compare the different AAE based models as well as using more traditional methods for
feature compression like PCA and LDA. We also implement a vanilla auto-encoder and
compare its compression quality with others. The encoder of the AAE based models im-
plemented have 1582 neurons in their input layer corresponding to the feature dimension.
It is followed by three hidden layers with 1000, 500 and 100 neurons respectively before
the samples are fed to a bottleneck layer with K neurons corresponding to the dimension
of the code space. This is followed by a decoder with 100, 500 and 1000 neurons in three
hidden layers followed by an output layer of 1582 neurons. The code generator in case
of AAE dGAN 2 had an input layer of 24 neurons which was fed samples from a 20 di-
mensional zero mean unit variance Gaussian distribution. The remaining 4 neurons were
used to provide the class info using a four dimensional one-hot vector, each neuron corre-
sponding to an emotion. This was followed by a hidden layer with 40 neurons followed
by an output layer with K neurons corresponding to the dimension of the code space. The
hidden layers in all these blocks had regularized linear (ReLU) activation while the bottle-
neck and output layers in all these blocks had linear activation. K was fixed at 2 for AAE
and AAE dGAN 1 where the prior pz was explicitly defined as mixture of four maximally
separated Gaussians, each mixture corresponding to an emotion. In case of AAE dGAN 2
where the code space is learned from data using a code generator block, we swept the value
of K among 2, 8, 64 and 256 to find the dimension of the code space that achieves the
best separation of the compressed feature vectors. We also experimented with training the
encoder and decoder of AAE dGAN 2 with and without L2 regularization. We discuss the
80
effect of regularization below. The discriminator in AAE and the discriminators D 1 in
the AAE dGANs all had the same architecture. The input layer has K + 4 neurons, and
was fed with K dimensional samples from either the bottleneck layer of auto-encoder or
the samples derived from prior pz along with class label/mixture index using a four dimen-
sional one hot vector. This was followed by three hidden layers with 1000,500 and 100
neurons with ReLU activation and finally an output layer with one neuron with sigmoid
activation indicating whether the sample fed to the discriminator was from bottleneck layer
or whether it was a sample derived from pz. The discriminators D 2 in AAE dGANs has
the same architecture as D 1 except the input layer which takes in high dimensional feature
sets as input. There is also an auxiliary output layer with four neurons to predict the class
of synthetic samples which is used to compute the mutual information based loss. The
vanilla auto-encoder implemented for comparison purposes had the same architecture as
the architecture of the auto-encoder blocks in the AAE based models.
Single corpora setting
We use IEMOCAP dataset (Busso et al [13]) to run a leave-one-session out five fold
cross validation analysis. Since each session had independent speakers we ensure there is
no speaker overlap between training and validation sets. Since the individual feature values
lie over a wide range of values we do mean-variance normalization of both training and
validation sets. We use training data statistics to normalize the validation set.
For a specific cross-validation iteration, we train an AAE based model with the train-
ing set. Then we obtain the lower dimensional representations of the raw features for both
training and validation set using the trained encoder. Note that the AAE based models
aren’t being provided any label information while getting the lower dimensional repre-
sentations. In Figure 5.5, we look at the two adversarial cross-entropy losses namely the
81
Figure 5.5: Discriminator’s (blue) and Generator’s (red) loss curves for training (left) and one of the
validation sets (right) for (a) AAE (b) AAE dGAN 1 (c) AAE dGAN 2
discriminator and generator losses for the three AAE based models. We plot these errors
per epoch on the training and the validation set during one specific cross-validation set for
all three AAE based models. We observe that the adversarial losses converge indicating
that the discriminator’s ability to discriminate is countered by generator’s ability to con-
fuse it. This trend is observed for both, training and validation sets, indicating that the
learnt parameters generalize well to data unseen during model training. We also observe
that the error seems to be converging the best for AAE dGAN 1 indicating their superior
coding ability. After we train the AAE based models, we use the encoder in auto-encoder to
compute the code vectors for the training set as well as the validation set. We then train an
SVM classifier on the openSMILE features as well as lower dimensional representation of
these features as obtained using the following techniques: Principal Component Analysis
(PCA), Linear Discriminant Analysis (LDA), an auto-encoder and finally the code vector
representations learned using the AAE based models. We learn and obtain these lower di-
mensional representations (PCA, LDA, auto-encoder and AAE based) of the openSMILE
features on the training set, which are then used to train the SVM model. The SVM param-
eters (box-constraint and kernel) are tuned using an inner-cross validation on the training
set. Since our goal here is dimension reduction, we keep the maximum dimension of these
representations during our experiments to be 256. We use Unweighted Accuracy (UWA)
82
Table 5.1: Cross-validation accuracies when the raw opensmile features or its compressed represen-
tations are fed to SVM. Metric used was UWA or mean class-wise accuracies
Features Dimension UWA
Raw opensmile features 1582 59.46
2 45.76
8 53.62
PCA on raw features
64 57.61
256 56.49
2 50.22
LDA on raw features
3 53.42
2 46.78
8 56.02
Vanilla auto-encoder
64 57.36
256 57.41
AAE with 4-mixture Gaussian prior 2 57.01
AAE dGAN 1 with 4-mixture Gaussian prior 2 57.78
2 33.89
8 50.59
AAE dGAN 2
64 56.14
256 57.06
as our evaluation metric which is basically the average of class-wise accuracies. This is
especially helpful in cases where all the training set doesn’t have equal number fo samples
from all classes. We list the results of the classification experiment in Table 5.1.
From the results, we observe that the performances of SVMs trained on the openSMILE
features and the code vectors are fairly close especially in case of AAE and AAE dGAN 1
where we compress the feature sets from 1582 to just 2 dimensions. This indicates that
the compressed code vectors capture the differences between the emotional labels in the
openSMILE feature space to a fairly high degree. We do not observe as high a performance
from any of the other feature compression techniques when we limit their dimensionality
to 2. AAE dGAN 1 slightly outperforms AAE which further confirms our observation
from Figure 5.5 that it has better coding capability. This is probably because while in case
of AAE, the parameters of the decoder are updated based only on reconstruction error,
83
Figure 5.6: Effect of regularization on coding capability of AAE dGAN 2
in case of AAE dGAN 1 they are also updated based on the adversarial losses so as to
make their output resemble real features. Updating the parameters of encoder then based
on reconstruction error of the auto-encoder, makes it better at compressing the real sam-
ples onto K dimensions because they can probably generate more realistic samples when
passed through the decoder in case of AAE dGAN 1. We will verify this claim in a later
section. Methods such a PCA and compressing using vanilla auto-encoder achieve similar
performance when we increase the dimension of the compressed feature vectors. Same
is the case with AAE dGAN 2. The superior performance of AAE and AAE dGAN 1 as
compared to AAE dGAN 2 can be explained by the fact that while in the former cases
we explicitly define the code space pz to be maximally separated Gaussians, in the latter
there is no such explicit definition and learning the code space is data driven. Note that
the results for AAE dGAN 2 shown in Table 5.1 are when we use L2 regularization on en-
coder and decoder parameters while training them. Figure 5.6 shows the accuracies when
regularization is applied vs when its not and we can see clearly that regularization helps.
Although the results havent been reported, we note that regularization did not provide us
with any benefits in the other two AAE based models. This is probably because without
regularization, the code space learned while training in AAE dGAN 2 is too specific to the
84
Figure 5.7: TSNE/scatter plots for samples in training (left) and validation (right) set for one of
corss-validation IEMOCAP splits (a) 1582-D raw Opensmile features (b) their 64-D PCA encodings
(c) 256-D encodings obtained from a vanilla auto-encoder (d) 2-D encodings obtained from AAE
(e) 2-D encodings obtained from AAE dGAN 1 (f) 256-D encodings obtained from AAE dGAN 2.
Note that the 2-D encoding from AAE and AAE dGAN 1 resemble the matching space distribution
which is a mixture of four 2-D Gaussians with orthogonal mean vectors.
training set and fails to generalize well for an unseen validation set. We don’t face this is-
sue with the other two AAE based models because the mapping space distribution has been
pre-defined to be a mixture of four maximally separated Gaussians. Figure 5.7 we show
the scatter/TSNE plots of clusterings obtained from models shown in bold in Table 5.1 for
one of the training and validation sets. We observe that AAE based models show almost
perfect clustering for training set unlike the non-AAE based methods. This is because of
providing label information to the discriminator explicitly while training them. The scatter
85
plot of 2-dimensional encodings show that the validation set samples are also fairly sepa-
rable in case of AAE and AAE dGAN 1. Figure 5.7 provides a sense of the separability
of emotion labels by plotting the scatter/TSNE plots of the compressed feature space of
the 1582-dimensional openSMILE features. The classification experiments in Table 5.1
quantify this separability.
This low dimensional representation retaining the discriminability across classes pro-
vides a powerful tool for analysis in a low dimensional subspace, which is otherwise not
possible with a large feature dimensionality. The low dimensional representation could be
used for applications such as clustering as well as an “experimentation by observation”, as a
low dimensional code vector (in particular 2-D) allows plotting the emotion utterances and
analyzing them. We also note the fact that the auto-encoder allows reconstructing the fea-
tures from these code vectors. Therefore, a recovery of the actual utterance representations
is also possible, which is otherwise more lossy in other dimension reduction techniques.
Cross corpus setting
Having studied the ability of GAN based architectures to encode higher dimensional
features onto a lower dimensional space in a single corpora setting, we now move to per-
forming cross-corpus evaluations. The objective of this experiment is to investigate if a
GAN based model can produce meaningful lower dimensional encodings for an external
corpus. We use IEMOCAP for training and MSP-IMPROV [16] as our testing set. MSP-
IMPROV, like IEMOCAP, also has actors participating in dyadic conversations which has
then been segmented into utterances and annotated by evaluators. There are 7798 utterances
in total spanned across the same four emotion classes. However, the distribution across
classes was highly unbalanced with the number of utterances belonging to happy/neutral
class more than three times the number of angry/sad samples. This prompted us to use it
86
Table 5.2: Cross-corpus accuracies obtained on MSP-IMPROV. Training of SVM is done using
either raw opensmile features or encoded version of higher dimensional raw features extracted from
utterances in IEMOCAP
Raw PCA Vanilla AAE AAE dGAN 1 AAE dGAN 2
opensmile auto-encoder
Dimension 1582 64 256 2 2 256
UWA 45.14 44.29 44.53 41.51 41.83 44.88
as a test set rather than training set. We only select a few of the models for cross-corpus
comparison, specifically the ones which performed better than others and whose results are
mentioned in bold in Table 5.1. From results in Table 5.2 we can make similar observa-
tions that using AAE and AAE dGAN 1 we can reduce the dimensionality significantly
without much loss in accuracy. However the loss in accuracy in cross-corpus case is more
compared to what we saw in cross-validation cases due to domain differences. Unlike
cross-validation experiments, higher dimensional encodings obtained from a vanilla auto-
encoder or AAE dGAN 2 or by using PCA seem to be a better representation than the 2D
encodings obtained from AAE and AAE dGAN 1. In Figure 5.8 we show the scatter plots
of the encoded features obtained from AAE and AAE dGAN 1. It shows the some amount
of separability is still maintained when features are compressed onto two dimensions. How-
ever the scatter plot for test set is highly populated with blue and magenta samples because
of higher number of neutral and happy samples in MSP-IMPROV.
5.3.3 Generative capability of GAN models
We next examine the possibility of synthetically creating samples representative of
utterances with emotions using the GAN based models we have trained. We generate
the synthetic vectors using the models discussed above i.e. AAE, dGAN 1, dGAN 2,
AAE dGAN 1 and AAE dGAN 2. For AAE, dGAN 1 and AAE dGAN 1 we randomly
sample points from prior pz which is a mixture of four Gaussians with orthogonal means.
87
Figure 5.8: Scatter plots for 2D encodings obtained for MSP-IMPROV utterances for (a) AAE (b)
AAE dGAN 1. Models are trained using IEMOCAP.
For dGAN 2 and AAE dGAN 2 the points are sampled from a zero mean uni-variate 20
dimensional Gaussian distribution. Once the points are sampled they are passed through
the trained decoder of the auto-encoder in case of AAE and AAE dGAN 1 and through the
trained generator in case of dGAN 1 and dGAN 2. For AAE dGAN 2 it is passed through
the code generator block followed by feeding its output to the trained decoder. Note that
in case of dGAN 2 and AAE dGAN 2 the generator/code generator is also given the label
as input along with samples from the Gaussian prior. Please refer to Figures 5.2, 5.3, 5.4
for a visualization of the work flow of how the synthetic feature vectors were generated.
This synthetically generated vector thus is an openSMILE-like feature vector obtained by
passing a randomly sampled 2-dimensional code vector through the decoder/generator of
the GAN based models (and not directly obtained from an utterance from the database).
Note that in case of AAE, dGAN 1 and AAE dGAN 1 each GMM component was en-
forced to pertain to a specific emotion label through discriminator regularization using the
one hot label vector. The labels for the synthetically generated samples is assigned to be
the same as the GMM component label used to sample the code vector. In case of dGAN 2
and AAE dGAN 2, the synthetic vectors are assigned the label which was fed to the gen-
erator/code generator block along with samples from the prior. Note that while in case
88
of AAE the decoder’s parameter is only updated based on reconstruction error, in case of
dGANs the generator’s parameter are updated based on the adversarial losses. In case of
AAE dGANs we use both reconstruction error and adversarial losses to update the decoder
parameters. In Figure 5.9 we show the reconstruction error curves for AAE, AAE dGAN 1
and AAE dGAN 2 and adversarial losses for dGANs and AAE dGANs for training and
validation set belonging to one of the cross-validation experiments. We observe that while
the reconstruction errors decreases, the adversarial error converges indicating that the dis-
criminators ability to discriminate between real and synthetic data points is countered by
generators ability to confuse it. This trend is observed for both, training and validation sets,
indicating that the learned parameters generalize well to data unseen during model training.
While the convergence of loss functions are a helpful tool to judge the training of GANs,
we can’t judge the quality of synthetic samples being generated. To this end we perform
two experiments explained below:
• Using real samples as training set and synthetic data as test set. The objective of
this experiment is to assess the similarity between real and synthetic data by using
a model trained on real data to classify synthetic data. This would give us an idea
about the quality of the synthetic data. However, it may so happen that all or majority
of the generated samples belong to the same or only a few of the class (called mode
collapse Arjovsky et al. [5]) or even if we get samples from all modes/classes there is
not much variance within samples belonging to the same class i.e. they are not a good
representation of the class to which they belong because it will be under represented.
• Using synthetic samples as training set and real data as test set. This experiment
will give us an idea which model produces samples that are good representations of
all the classes. This measure would reflect the diversity of synthetic data. In other
words, the classifier trained using synthetic samples form a GAN model that’s liable
89
Figure 5.9: Reconstruction or adversarial errors (discriminator’s (blue) and generator’s (red) er-
rors) for one of the cross-validation splits (a) AAE (b) dGAN 1 (c) dGAN 2 (d) AAE dGAN 1 (e)
AAE dGAN 2. a(i), b(i), c(i), d(i,iii), e(i,iii) belong to training set while a(ii), b(ii), c(ii), d(ii,iv),
e(ii,iv) belong to validation set
to mode collapse would perform poorly. Note that the test set in this case should have
real data samples from all classes.
Single corpora setting
We perform the above experiments on IEMOCAP in a speaker independent cross-
validation setting. For each cross-validation split, lets call the training set used to train
the GAN model as set-1 and the validation set as set-2. The real samples mentioned in
above three steps can either come from set-1 or set-2. We can use synthetic data points as
90
training set and use wither set-1 or set-2 as test set. Similarly, we can use set-1 or set-2 as
training set and synthetic samples as test set. Finally we can append set-1 with synthetic
data points to train a classifier and evaluate it on set-2. The purpose of using set-1 for ex-
periments is to judge how the synthetic samples compare to the real samples that were used
to train the GAN generating them. With set-2 we intend to determine how the synthetic
samples compare to real samples obtained from utterances involving new speakers. We
choose SVM as classifier and as before the SVM parameters (box-constraint and kernel)
are tuned using an inner-cross validation on the training set. The metric used is UWA.
We report the average UWA over the five cross-validation splits. We generate 6000 syn-
thetic data-points which is almost the same number of data-points in the entire IEMOCAP
set that we consider for our experiments. We show the results for the different train and
test conditions in Table 5.3. Since, it is a 4-way classification chance accuracy should be
1
4 = 25%. We note that the accuracies obtained using the dGAN models are either equal to
or very close to that number which indicates that the data generated using those models are
as good as sampling random points from 1582 dimensional space. The results are however
more interesting for AAE based models which shows the importance of auto-encoders in
the GAN based models. It can be seen that results obtained for set-1 are better than that for
set-2. This is expected as set-1 was used to train the GAN models and hence the generated
data should be more like set-1 than set-2 which has independent speakers than set-1. It
can be observed that using synthetic data produced by AAE dGAN 1 gives us better re-
sults than that produced by AAE which is probably because how the decoder is trained.
While in case of AAE decoder parameters are updated based only on reconstruction loss,
in case of AAE dGAN 1 the parameters are updated based on an additional adversarial
loss that determines how close it is to real data. We hypothesize that this extra update is
what leads to better synthetic sample generation by AAE dGAN 1 that also generalizes
better for unseen speakers. Another interesting thing to note is the characteristic of syn-
91
Table 5.3: Cross-validation accuracies (%) obtained using different combinations of data-sets for
training and evaluating a classifier. Set-1 refers to the training set of a cross-validation split used to
train the GAN model while set-2 refers to the validation set.
Train : Set-1 Train : Synthetic Train : Set-2 Train : Synthetic
Test : Synthetic Test : Set-1 Test : Synthetic Test : Set-2
AAE 85.60 48.58 72.52 45.89
AAE dGAN 1 88.41 49.91 74.75 46.96
AAE dGAN 2 55.20 52.24 45.63 51.58
dGAN 1 25.26 25 25 25.28
dGAN 2 25 25 25 25
thetic data generated using AAE dGAN 2. When used as test set, the classifiers trained
on real data are unable to classify them as good as they classify samples generated from
AAE and AAE dGAN 1. However, training a classifier on synthetic data obtained from
AAE dGAN 2 performs better at classifying real data-points than a classifier trained on
samples generated from AAE and AAE dGAN 1. This indicates that data generated using
the model AAE dGAN 2 has more diverse samples than data generated using AAE and
AAE dGAN 1. The difference lies in the prior pz from which points are sampled to be fed
into the decoder to generate synthetic samples. While in case of AAE and AAE dGAN 1 it
is pre-defined to be a mixture of four Gaussians, in case of AAE dGAN 2 the GAN model
learns it during training. The coding space of AAE dGAN 2 also has more dimensions
(256) than AAE and AAE dGAN 1 (2). This provides the decoder with a wider range of
input samples which probably leads to diverse synthetic data-points.
Lower dimensional visualizations of synthetic data
To further analyze the synthetic data samples, we plot the t-SNE embeddings of real
data and data generated from AAE, AAE dGAN 1 and AAE dGAN 2. From Figure ??
we observe that while majority of the synthetic data lies in the space spanned by the real
data, we also see some samples in the space not spanned by the real data. We believe
92
Figure 5.10: Comparison of t-SNE embeddings of real data with synthetic data generated using the
three AAE based GAN models for one of the cross-validation splits of IMEOCAP.
such samples when used to train a classifier along with real data, will supplement the per-
formance of the classifier by providing it with extra information. We look at this aspect
briefly when we do cross-corpus experiments. We observe that the data generated by AAE
models with pz as mixture of four Gaussians has similar t-SNE plots and quite different
from the data generated by AAE dGAN 2 where pz is Gaussian. We note that synthetic
data generated using our models also come with their corresponding labels. In Figure ??
we observe the class-wise clustering of the synthetic samples with each color representing
one class. We note that the clustering of samples obtained from AAE and AAE dGAN 1
is quite different from that of obtained from AAE dGAN 2. This is probably because of
the way we are enforcing the condition that generated samples should fall into four distinct
classes. While in case of AAE and AAE dGAN 1 it is enforced by selecting a mixture
of Gaussian prior with orthogonal means, in case of AAE dGAN 2 it is enforced by max-
imizing the mutual information between generated data and the labels being fed to the
code generator block which generates those samples. The distinct clusters seen in case of
AAE dGAN 2 suggest that it produces samples that have higher between-class discrim-
93
Figure 5.11: Class-wise clustering of the synthetic data generated using AAE (left), AAE dGAN 1
(center) and AAE dGAN 2 (right). Each of the four color represents a specific class
inability to AAE and AAE dGAN 1. On the other hand in case of AAE dGAN 2, the
embeddings of the synthetic samples belonging to the same class are concentrated over a
small circular region as compared to the embeddings of samples generated from the other
two models suggesting lower within-class variability in case of AAE dGAN 2. We also
pass the synthetic samples through the trained encoder of AAE dGAN 1. We re-iterate
that the encoder of AAE dGAN 1 was trained to project the higher dimensional feature
vectors onto two dimensions with the coded space resembling mixture of four Gaussian
with orthogonal means. The scatter plot for various models are shown in Figure 5.12. We
see a significant amount of overlap across classes for data-points obtained from dGAN
models which explains their substandard performance. Looking at the ’angry’ (red) and
’sad’ (green) synthetic samples obtained from AAE and AAE dGAN 1, we observe that
their variance is only along one dimension unlike the ’angry’ and ’sad’ samples obtained
from AAE dGAN 2. The ’happy’ (magenta) samples also seem to have more variance in
case of AAE dGAN 2 compared to the other two models. The ’neutral’ (blue) samples
seem to be overlapping to a greater extent with ’happy’ and ’sad’ samples in case of AAE
94
Figure 5.12: Scatter plot for the encoded points obtained for synthetic data generated using (a) AAE
(b) dGAN 1 (c) dGAN 2 (d) AAE dGAN 1 (e) AAE dGAN 2
and AAE dGAN 1 than AAE dGAN 2. Hence, we hypothesize that a classifier trained
with the more diverse synthetic samples obtained from AAE dGAN 2 is better at classify-
ing real data points than the other two AAE based models.
Cross corpus setting
Having studied the convergence of GAN architectures and evaluating the quality of syn-
thetically generated samples produced by them in a single corpora setting, we now move to
performing cross-corpus evaluations. The objective of this experiment is to investigate how
well the synthetically generated samples generalize for classification tasks on an external
corpus (as opposed to being applicable for only in-domain tasks). We generate the synthetic
samples from GAN models trained using the entire IEMOCAP dataset. Since, synthetic
95
data obtained from trained dGAN models didn’t give us better than chance accuracy in
cross-validation experiments, for cross-corpus experiments we only train AAE based mod-
els. As before, we conduct two experiments (Table 5.4). First, use the synthetic dataset as
training set and MSP-IMPROV as test set. This was followed with using MSP-IMPROV to
train a classifier and evaluating it on synthetic data. We observe that evaluating a classifier
that has been trained on MSP-IMPROV to classify the synthetic sets shows higher accura-
cies when the synthetic samples are generated using AAE dGAN 1 and AAE dGAN 2 that
the AAE based models where decoder is receiving an extra adversarial error to update its
parameters produces more generalizable samples. On the other hand, evaluating different
classifiers which has been trained using synthetic samples generated from different AAE
based GAN models perform almost similarly in classifying samples from MSP-IMPROV.
This is probably because of the differences in distributions of utterances belonging to dif-
ferent classes in MSP-IMPROV and the synthetic data sets. While the synthetic data is
balanced with respect to all the classes, MSP-IMPROV has more happy and neutral sam-
ples compared to angry and sad. Hence, the classifiers trained on different synthetic sets
make similar mistakes when it comes to classifying the evaluation set leading to similar
mean class-wise accuracies.
Finally we investigate the feasibility if using the synthetic feature vetors along with real
data in low resource conditions. To start with, for our baseline classifier we train a SVM
on IEMOCAP data and tested it on MSP-IMPROV. Then we added 600 synthetic data
samples generated from the AAE based models and look for any improvement in accuracy.
As Table 5.5 shows we indeed see a minor improvement in accuracy when synthetic data
is used along with real data. Further experiments are needed to determine the applicability
of synthetic data for training models in low resource conditions
96
Table 5.4: Cross-corpus accuracies obtained on MSP-IMPROV. Synthetic data is generated from
GAN based models trained on IEMOCAP
Train : MSP-IMPROV Train : Synthetic
Test : Synthetic Test : MSP-IMPROV
AAE 42.07 38.3
AAE dGAN 1 51.93 38.61
AAE dGAN 2 49.53 37.72
Table 5.5: Training using IEMOCAP with and without synthetic data and test on MSP-IMPROV.
Note that we see aminor improvement in accuracy when synthetic data is used along with IEMO-
CAP. Each column represents the model from which synthetic data was generated.
Baseline AAE AAE dGAN 1 AAE dGAN 2
31.01 32.08 32.52 32.32
5.4 Conclusion and future work
Automatic emotion recognition is a problem of wide interest with implications on un-
derstanding human behavior and interaction. A typical emotion recognition system de-
sign involves use of high dimensional features on a curated dataset. We implemented
GAN based models which can encode the higher dimensional features onto a lower di-
mensional space. At the same time, they are also generative models that can provide us
with synthetic feature vectors. We establish that the code vectors learnt by the adversarial
auto-encoder can be obtained in a low dimensional subspace without losing much class
discriminability in the higher dimensional feature space. Having a pre-defined code space
pz with maximally separated components seem to encode the higher dimensional features
more efficiently than if we try to learn the encoding space from data. We also observe that
synthetically generated samples from these models do seem to retain relevant class infor-
mation. Additionally from our experiments we found that updating the decoder parameters
with adversarial error along with reconstruction error seems to generate ”better” synthetic
samples. It is encouraging to see these results given limited datasets that we have exper-
97
imented with (IEMOCAP has around only 7 hours of training data). With more data it is
expected that GANs will be able to learn a more generalized distribution/manifold where
the openSMILE feature vectors lie. The experiments show that a generator’s job to esti-
mate a more complex PDF from a simpler PDF is more complex than a discriminator’s
job which is to distinguish between fake and real samples. Hence, we had to incorporate
tricks like updating the generator more times for a single update of discriminator or keep-
ing the learning rate of generator more than that of a discriminator. All our GAN models
enforce the conditio that the generated data should belong to four distinct classes. We ob-
serve that enforcing that clustering by implementing an infoGAN framework rather than
specifying a Gaussian mixture prior with orthogonal means leads to generation of samples
with inter-class variability but low within class variability. Cross corpus results showing an
improvement in accuracy when real data is appended with synthetic data suggests that the
synthetic samples could be generalizable across datasets with different priors. This opens
up the possibility of using them in low resource conditions.
In future, we plan to investigate auto-encoder architectures that can be fed frame level
features instead of utterance level features. We believe temporal dynamics of feature con-
tours can lead to better classification results. It will also be interesting to study the effects
of weighting the two updates that the decoder in AAE dGAN 1 and AAE dGAN 2 re-
ceives (one from reconstruction error of the auto-encoder and one from the adversarial
error determining how well it can fool a discriminator into believing its output is are real
datapoints). We can also investigate trying different loss functions to train the GAN based
models. Wasserstein GAN [5] has been very popular in recent years. It uses earth-mover
distance instead of Jensen-Shannon divergence (used by traditional GANs, Arjovsky et al.
[4]) to learn a complex PDF from a simpler one and has been shown to be better at handling
the mode collapse problem. Another interesting avenue could be to explore the usage of
synthetic data generated as additional training samples in low resource settings. A more
98
detailed study needs to be done to determine their applicability in low resource conditions.
For example, we can vary the amount of synthetic data in the training set and see how it
affects accuracy. While too less of synthetic data probably wouldn’t give us any advantage,
using too much of synthetic data might not be ideal because we want the model to be trained
mostly on real data. Finally, the GAN based architecture can also be used in analysis of
other behavioral traits such as engagement [46] jointly with the emotional states.
99
Chapter 6: Future directions
This dissertation investigates the ways in which we can enhance the generalizability of
speech emotion recognition models so that they perform better on evaluating the emotions
of unseen speakers and cross-corpus tasks. We start off with discussing ways to improve
discriminative models and then move on to talk about generative models. Our first approach
to boost the performance of discriminative models using only audio features was based on
regularization approaches and manifold learning techniques. We found that adversarial
examples based manifold learning techniques can lead to more generalizable models. Fur-
thermore, supervised approach led to better results than semi-supervised approaches. At
the same semi-supervised approaches where the adversarial examples were found without
using the label information did not lead to significant improvements. This could be because
of the limited training data available to us (approximately 7 hours as compared to hundreds
or thousands of hours used to build models used for commercial applications). However,
audio features can only help so much. The reason behind this was because while audio
features can help detect the arousal or intensity level of a speaker, they fail to do well in
detecting the valence or pleasantness of an utterance. Since, the words used by a speaker
can tell us about their sentiments, we leveraged the text transcriptions and used them to
build a multi-modal system utilizing both audio and text features for emotion recognition.
We observed that while audio features helped with arousal detection, text features indeed
helped with valence detection with the multi-modal system out-performing both these uni-
modal systems. Since, our end goal was to come up with a pipeline that utilizes only
100
speech as input, we used existing deep learning based ASR API’s trained on thousands of
hours of data to get the transcriptions. We then compared the performances of systems
using audio features along with ASR transcriptions with that of using audio features with
ground truth transcriptions. While we see a deterioration in performance as expected, using
ASR transcriptions along with audio features as opposed to using only audio features gave
us an absolute improvement of around 10% in our within-corpus cross-validation study.
Moreover, we also saw improvements in cross-corpus study showing that these models
are indeed generalizable across cross-corpus differences appearing due to recording con-
ditions, label space, speakers and annotators. Having investigated ways to improve the
generlizability of discriminative models, our next step was to focus on generative models.
Generative models could be helpful in understanding the underlying process that generated
the data. We can set the model parameters so that we can generate synthetic data that can
potentially be used to train models in low resource conditions. We used auto-encoder and
GAN based architectures to do so. Not only did we measure the quality of synthetic data
generated, we also investigated their ability to encode higher dimensional feature vectors
onto lower dimensions. The quality of these encodings was judged by how distinguishable
they are as compared to the original higher dimensional raw feature vectors. We compared
different architectures with different loss functions to generate synthetic feature vectors
from samples obtained from lower dimensional priors. While one of them only used the
reconstruction error as seen in auto-encoders, couple of them used only GAN based ad-
versarial errors. Finally, we implemented architectures using both of them. Additionally
we also investigated the effect of having a pre-defined prior vs having a data-driven prior
from which synthetic samples are generated. We also defined metrics and visualizations
to judge the quality of synthetic feature vectors generated. We found that the quality of
the synthetic vectors obtained from architectures using only reconstruction error was better
than those using only adversarial error. But it was best for the architectures whose parame-
101
ters were updated using both these errors showing that they were complimentary. From our
within-corpus speaker independent cross-validation experiments we concluded that using
a data-driven prior learned from training data produces more generalizable samples. How-
ever, with cross-corpus experiments it did not matter as much. The error functions used to
update the parameters seemed to matter more in this case. We next list out some potential
future directions we can extend this work to.
6.1 Appending low volume datasets with adversarial counterparts
More the amount of training data, better is the performance of deep learning algorithms.
To that end, researchers in computer vision community resort to data augmentation tech-
niques such as cropping, rotating and flipping input images etc (Wang and Perez [93]). In
speech emotion recognition where limited data-sets are usually an issue, data augmentation
techniques can be handy. One method could be augmenting the training data set with their
respective adversarial counterparts. As mentioned in chapter 3, we can find the perturbation
vector r ai for every data-point xi using the equations 3.17 and 3.18. We can then add the
perturbation vector to the actual data-point to get its adversarial counterpart. The ground
truth label for the adversarial counterpart can be considered to be same as that of the actual
datapoint following the smoothness assumption which states that the resulting conditional
distribution p(yi|xi) is desired to be smoothly varying over the inputs xi (Huang et al. [60]).
In chapter 3, instead of appending the dataset, we enforced this assumption by regularizing
the loss function of the neural network. It would be interesting to see how appending a
dataset with its adversarial counterparts instead of loss function regularization affects the
training and generalization of the model. We can also compare the effect of augmenting
the dataset by adding random perturbations to actual data-points rather than finding the
adversarial perturbation. We think data augmentation using adversarial counterparts would
102
be more beneficial than augmentation by adding random noise vectors. We saw a similar
trend in Figure 3.4 with the model performing better when regularization was carried out by
adding adversarial perturbation as opposed to random perturbation to the input datapoint.
Another interesting experiment could be to generate such adversarial examples using a
GAN based framework as proposed by Xiao et al. [126] and augment the training dataset
using such examples. The purpose of [126] was however not data augmentation but to
generate adversarial examples of high perceptual quality that a traditional classifier trained
on real unperturbed images would fail to classify. They compare their method with other
methods that generate such adversarial examples. In their approach, instead of computing
the perturbation vector r ai for every data-point xi explicitly, they train a GAN to do so.
While the generator generates the perturbation given an input image, the discriminator
tries to distinguish between the input image and the perturbed image obtained by adding
the perturbation to the input. Training this GAN architecture would make the perturbed
images look like real images. At the same time, they also include another term to their loss
function that encourages these perturbed images to be miss-classified (not being classified
as input image’s class) by a trained classifier. This can be done by maximizing the distance
between the prediction of the classifier and the ground truth. Note that the parameters of
the trained classifier aren’t updated while back-propagation. Only the parameters of the
generator is updated during this process. This would encourage the generated images to
be an adversary of the actual image. They trained their network using image datasets and
claim once the generator is trained, it can generate perturbations efficiently for any image.
They show perturbed images which look exactly like the images they have been generated
from which is miss-classified by a classifier trained only un-augmented dataset. It would
be interesting to investigate the effect of such a data augmentation technique for speech
emotion recognition.
103
6.2 Emotion recognition on real world datasets
The experiments presented in this thesis till now has been on acted datasets. While it
is easier to collect it would be more realistic to train and evaluate our models on real word
datasets. We have done a pilot study on a couple of call center conversations provided
to us by Hughes telecommunications that involved affect detection based on audio and
sentiment analysis based on text to analyze how the calls went. For audio based affect level
detection, we trained two convolutional neural network based architectures, one for valence
and another for arousal detection using 3 second speech samples from two acted corpora
available to us namely, IEMOCAP and MSP-IMPROV. The arousal and valence levels of
the resulting utterances have been annotated by various annotators on a Likert scale of 1-
5, with their average being considered as the ground truth rating. For our experiments,
speech samples with arousal/valence rating < 2.5 were labelled as low arousal/valence
and anything with arousal/valence rating > 2.5 as high arousal/valence. In total we have
approximately 7 hours of speech that was used for training the models and performing a
binary classification between high/low arousal or high/low valence with the two soft-max
activated neurons in the final layer of the model giving us the probability of a speech sample
belonging to each of the two classes . We performed a speaker independent cross-validation
(i.e. there is no overlap of speakers between training and validation set) to fix the hyper-
parameters of the network. The cross-validation accuracy obtained for arousal detection is
around 85%, whereas for valence it is only around 63%. These results are consistent with
the findings in chapter 4 where we have shown that models trained using audio data work
better for arousal level detection than for valence level detection.
The first step in analyzing the Hughes data consisted of speaker diarization which was
done by hand for the purpose of this study. Second, the data for the customers and the
representatives was divided into 3 second chunks that were then passed through the model
104
separately. We extract Mel filterbank based spectrograms for the utterances with a frame
interval of 0.5 seconds. This data is fed to our pre-trained models to obtain the probability
of the utterance being high arousal/high valence. Since, there were no ground truth anno-
tations of the call center conversations for their arousal and valence levels, our analysis are
based on our own perceptions of affect and sentiment. The representatives in the conversa-
tions were always calm. An example is shown in Figure 6.1 where the spectrogram and the
arousal plot are shown for a 3 second segment of speech from the representative in one of
the phone calls. Note that there appears to be a shift in the data plot vs the audio because
each point in the former is plotted at the center of the 3 second window. That is, a data
point at 1.5 seconds shows the value for a window ranging from 0-3 seconds. As can be
seen, the probability of being aroused never exceeds 0.2, indicating that the representative
was calm while speaking, a result we have verified by listening to the utterances. As in the
example above, the representatives in both of the calls provided are perceived to be calm
by us throughout the call and the arousal values we get confirm this impression except in 6
out of 1087 3-second turns when the probability of being aroused exceeded 0.5 (for these
cases, the values always lie between 0.55 to 0.77). This results gives an accuracy of 99.4%
in detecting low arousal calls belonging to the representatives. Being able to detect that the
representatives are calm based on the audio is especially important. The speech of the rep-
resentative in one of the calls was not recognized correctly by Google’s ASR API for most
of the call, presumably because of his accent. It is likely the case that many representatives
will have an accent which may make textual analysis ineffective.
In Figure 6.2, we show an example from the customer in the 25 minute phone call who
was sometimes annoyed. For this particular segment, she is annoyed about her internet bill
increasing and we can see the arousal detector evaluating it to be a high arousal utterance
towards the end of the utterance. Listening to the turn confirms the fact that her voice in-
deed becomes louder and annoyed towards the end of the conversation. The transcription
105
Figure 6.1: Spectrogram and plot of arousal probability of a call center representatives utterance.
The higher the value, the more aroused is the speaker. Its fairly low values indicate the person
remained calm during the call
Figure 6.2: Customer’s response where the arousal tends to rise towards the end of her sentence.
The arousal values are marked with an arrow at the bottom for the boxed region
of the audio with the more aroused portion written in bold is: ”Okay (stammers) they had
somebody come out here (stammers) that was just a thing (unclear), I called yall coz my
bill went up”. Altogether, the customer in the 25-minute call had 28 conversation sides
that were at least 3 seconds long. Out of the 28 utterances, 14 were judged by us as being
low arousal, while we felt the remaining had some portions that we would consider high
arousal. Of the 14 low arousal calls, our model predicted 12 of them correctly with it show-
ing low arousal probabilities throughout the utterance. For one of the other two calls, it is
the case that while the customer does not appear to be aroused, she was clearly bothered
and it is this part of the utterance where the arousal detector had high values. Of the 14
106
Figure 6.3: Valence probability plot for two utterances for when the customer laughs (left) vs when
she doesn’t (right). Higher value shows the person is more pleasant. Note that there is an increase
in valence in case of laughter.
conversation sides that had certain high arousal parts within them, our model failed to find
any high arousal regions in four of them. In all of these four utterances, the relevant aroused
portion was only about 3 seconds long which means the algorithm had little data to predict
the arousal values. For the second phone call, the arousal detection of half of 6 conversa-
tion sides for the customer confirm our perception. The arousal detector gives high values
in the beginning of the call when she introduces herself and during an utterance when she
laughs. On the other hand, the arousal value declines when she faintly says ”okay thank
you”. However, in the remaining three conversation sides, the model mistakenly detects
high arousal during portions of the utterances. The performance of valence detector wasnt
as good as we would like. However, there was an interesting result that we show in Fig-
ure 6.3 where we contrast two situations, one where the valence profile rises and one where
the valence profile stays low. In the one to the left, the valence profile rises when she starts
to laugh. In the one on the right, she is making a statement of why she has made the call.
Transcriptions are provided in the boxes. Overall we felt the arousal detector performed
107
reasonably well especially given it was trained on acted and improvised speech. Thus, we
feel it would be significantly more accurate if it is trained on actual call center conver-
sations. This should also be the case for the valence detector and we believe it is worth
exploring whether using matched data would improve its performance to a level where it
will be useful. Furthermore, we also obtained ASR transcriptions from Google’s API and
used it for sentiment analysis of the conversations. We used term frequency-inverse docu-
ment frequency (Ramos [97]) also called tf-idf that depicts the importance of a word in a
corpus as the value of it is increased proportionally to the number of occurrences of that
word in the document (in our case, one side of the conversation), but will be offset by how
many documents contains that particular word. For our purpose, a document is one side
of a telephone conversation. Then we used the SentiWordNet corpus (Esuli and Sebastiani
[33]) in Pythons Natural Language Toolkit to evaluate the sentiment of a particular word
in the given context by assigning them a positivity or negativity score. The overall positive
and negative score for a response was computed by weighing each of its constituent word’s
sentiment score by their tf-idf score and adding them. The sentiment analysis metric was
computed by taking the difference between the positive and negative scores of the response.
If the metric is less than 0, the response is classified as having a negative sentiment, and as
having a positive sentiment otherwise and the magnitude gives an idea of the strength of
that overall sentiment. Based on these values, we concluded that the customer in one of the
calls had a negative sentiment while the representative in that conversation had a slightly
positive sentiment. From listening, we feel that the representative handled the situation
well. For the other call, both the customer and the representative had positive sentiment
scores indicating the call went through without a hiccup which was indeed the case.
Our next step in this study is to use in-domain call center conversations to train the affect
recognition models. We have been provided with 75 phone conversations which we had
diarized using pyAudioAnalysis [39] and chunked into 5 second segments. Currently we
108
are in the process of getting annotations that we can use as ground truth labels to train the
emotion recognition models. Apart from getting the annotations for arousal and valence,
we are also asking the annotators to categorize the utterances into one of the four classes
relevant to the task at hand namely angry, calm, frustrated and pleasant. Moreover we
have also asked the annotators if they can hear just one speaker or they can hear multiple
speakers in that 5 second segment which can be used to evaluate the diarization system.
6.3 Connecting depression detection and emotion recognition
In [99], we talk about an Average Magnitude Difference Function (AMDF) based fea-
ture that quantifies the voice quality features such as jitter (change in pitch across pitch
periods), shimmer (change in amplitude across pitch periods) and breathiness (amount of
aperiodicity in voiced regions) that showed promise for the task of depression detection.
We believe these features capture the psychomotor retardation present in a depressed per-
sons speech. It was found that in general, depressed voice has more jitter, shimmer and
sounds, more breathy. The dataset we used was the Mundt database [84] which include
speech data collected from 35 physician-referred patients undergoing treatment for depres-
sion. The patients were assessed weekly once over a period of 6 weeks. We used both
the sustained vowel sounds (four vowels a, i, ae, u held for 4-5 seconds each) and the free
flowing utterances where the patients talk about their emotional state, physical state and
their ability to function for our study. The decision whether a person was depressed or
not was determined by using the Hamilton Depression (HAM-D) score [50] which ranges
between 0-26, with higher score implying more severe depression. Sessions with scores
greater than 17 were considered to be ones where the patient was depressed and sessions
with score less than 7 were considered to be ones where the patient was not depressed,
while the session with intermediate scores were considered ones where the subjects mental
109
state was ambiguous and so their data was not included in the study as has also been done
by Helfer et al. [56]. We only used the data from six patients who underwent a change
in their depressive state which was reflected by a decrease in their HAM-D scores. Since,
we didnt have a lot of data we build a speaker dependent classifier i.e. the training and
test set had speaker overlap. We used support vector machines (SVM) as classifiers. We
used voiced frames from the free flowing utterances for training the SVMs and reported an
utterance wise classification accuracy of 77.8%. At the same time, it was also found that
the AMDF based features can be useful for the task of emotion classification by Ko and
Espy-Wilson [67]. Using the emotion database from USC that also contains electromag-
netic articulography measurements [72] it was shown that using the AMDF based features
together with Mel Frequency Cepstrum Coefficients(MFCC) and pitch related features led
to a better performance in comparison to the openSMILE toolkits emobase feature set. It
not only improved the overall accuracy by 3.3% but also greatly reduced the feature space
by 72%. Caruana [17] defines two tasks to be similar if they use the same features to make
a decision. Based on the results we got from the above two experiments we can conclude
that the task of emotion recognition and depression detection are similar because the same
AMDF features helped us with both the classification tasks.
In another study conducted by Gupta et al. [49], it was found that incorporating depres-
sion severity as a parameter in Deep Neural Networks (DNNs) by altering the activation
functions using the depression score show improvements in arousal and valence prediction
compared to a DNN with a vanilla tanh activation function. We performed experiments on
affect prediction using the Audio-Visual Depressive language Corpus [120], which involves
subjects with varying degree of depression. The dataset consists of both free flowing ut-
terances where the participant is answering a question and read speech where the person is
reading a pre-defined script. Each video is continuously rated for three affective dimensions
of valence, arousal and dominance at a frame rate of 30 Frames Per Second by a set of 3-5
110
annotators. The final ground truth affect ratings are computed as the frame-wise mean over
the annotator ratings for a given session. The subjects in the sessions also complete the
standardized self-assessment based Beck Depression Inventory-II (BDI-II) questionnaire
[9]. The score ranges between 0-63, with a higher score implying more severe depression.
Our results on both the sessions show that using depression severity can improve arousal
and valence prediction, thereby suggesting a link between the two tasks.
Since, the tasks of emotion recognition and depression detection have similar charac-
teristics we hypothesize that we can use the information from one task to help with the task
of the other. Our goal is to use a model’s knowledge in recognizing emotions/affect state
to better estimate the presence/severity of depression. One such approach could be to use
transfer learning. The core idea of transfer learning is utilizing the knowledge learned from
solving one problem to solve a different but related problem. While the concept of transfer
learning can be useful to build more generalizable models, it can also assist us with related
tasks. For example, Razavian et al. [110] reported state of the art results when they used
a convolutional neural network (CNN) trained on Imagenet dataset for object classification
to classify bird images from Caltech-UCSD Birds (CUB) 200-2011 dataset. Note that birds
formed one of the categories in the Imagenet dataset. Pre-training has also applications in
audio music classification (Van den Oord et al. [121]) and for natural language processing
tasks (Mou et al. [83]). Now we describe a typical transfer learning set up. We have a
dataset A and a task for which it was collected TA. We have another dataset B and corre-
sponding task TB. We aim to use the knowledge that the model has gained while learning
TA to perform TB. The steps involved in pre-training are:
1. We have a neural network model M initialized randomly.
2. Pre-training: We have a large dataset A on which we train M for task TA.
3. We have the dataset of interest B and we are interested in Task TB. Instead of initializing
the weights randomly and training on B, we take the model M already trained on A as our
111
initialization.
4. Fine-tuning: We fine-tune the weights by training on B. The fine-tuning can be done for
all/some of the hidden layers.
Since, we believe emotion recognition and depression detection are related, we can pre-
train a network to recognize emotions/affect and then fine-tune the weights of the layers on
the depression dataset. So dataset A is a dataset which has the emotion/affect labels avail-
able and task TA for which the model M is trained is emotion/affect recognition. B is the
dataset of interest which in our case is the depression database and hence task TB is depres-
sion detection. One potential roadblock we could face in such an approach is the size of
pre-training dataset. Usually pre-training dataset A is large compared to dataset of interest
B. Researchers working in object detection problems have access to huge amounts of data.
For example, Imagenet has 1.2 million images in it. But it is difficult to get such a huge
dataset for the task of emotion recognition/depression detection. For, example, IEMO-
CAP database that we have used for our experiments has around 10 hours of data (around
10,000 utterances) if we consider all 15 emotional categories. It would be interesting to
explore how good of a pre-trained model we can achieve from 10 hours of emotional data.
We could combine several emotion recognition datasets to mitigate this problem. We can
pre-train using all of them together or using them sequentially one at a time. Apart from
pre-training we can also explore multi-task learning (MTL) frameworks that leverages the
use of auxiliary tasks to improve the performance of a model for a target task. One such
neural network based MTL framework could be where we train the same neural network
for emotion classification and depression detection simultaneously with a different output
layer for each of these tasks while the other layers are shared. Ruder [98] discusses some
MTL framework that could be worth exploring.
112
Bibliography
[1] Mohammed Abdelwahab and Carlos Busso. Supervised domain adaptation for emo-
tion recognition from speech. In 2015 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5058–5062. IEEE, 2015.
[2] Mohammed Abdelwahab and Carlos Busso. Ensemble feature selection for domain
adaptation in speech emotion recognition. In 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 5000–5004. IEEE,
2017.
[3] Mohammed Abdelwahab and Carlos Busso. Domain adversarial for acoustic emo-
tion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, 26(12):2423–2435, 2018.
[4] Martı́n Arjovsky and Léon Bottou. Towards principled methods for training genera-
tive adversarial networks. CoRR, abs/1701.04862, 2017.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adver-
sarial networks. In International Conference on Machine Learning, pages 214–223,
2017.
[6] Pierre Baldi. Autoencoders, unsupervised learning, and deep architectures. ICML
unsupervised and transfer learning, 27(37-50):1, 2012.
[7] Rainer Banse and Klaus R Scherer. Acoustic profiles in vocal emotion expression.
Journal of personality and social psychology, 70(3):614, 1996.
[8] Tanja Bänziger and Klaus R Scherer. The role of intonation in emotional expres-
sions. Speech communication, 46(3):252–267, 2005.
[9] Aaron T Beck, Robert A Steer, and Gregory K Brown. Beck depression inventory-ii.
San Antonio, 78(2):490–498, 1996.
[10] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A
geometric framework for learning from labeled and unlabeled examples. Journal of
machine learning research, 7(Nov):2399–2434, 2006.
113
[11] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and
M West. Generative or discriminative? getting the best of both worlds. Bayesian
statistics, 8(3):3–24, 2007.
[12] Sahar E Bou-Ghazale and John HL Hansen. A comparative study of traditional and
newly proposed features for recognition of speech under stress. IEEE Transactions
on speech and audio processing, 8(4):429–442, 2000.
[13] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower,
Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemo-
cap: Interactive emotional dyadic motion capture database. Language resources and
evaluation, 42(4):335, 2008.
[14] Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe
Kazemzadeh, Sungbok Lee, Ulrich Neumann, and Shrikanth Narayanan. Analysis
of emotion recognition using facial expressions, speech and multimodal information.
In Proceedings of the 6th international conference on Multimodal interfaces, pages
205–211. ACM, 2004.
[15] Carlos Busso, Angeliki Metallinou, and Shrikanth S Narayanan. Iterative feature
normalization for emotional speech detection. In Acoustics, Speech and Signal
Processing (ICASSP), 2011 IEEE International Conference on, pages 5692–5695.
IEEE, 2011.
[16] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab,
Najmeh Sadoughi, and Emily Mower Provost. Msp-improv: An acted corpus of
dyadic interactions to study emotion perception. IEEE Transactions on Affective
Computing, (1):67–80, 2017.
[17] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, Jul 1997.
[18] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector ma-
chines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27,
2011.
[19] Olivier Chapelle, Bernhard Schlkopf, and Alexander Zien. Semi-Supervised Learn-
ing. The MIT Press, 1st edition, 2010.
[20] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter
Abbeel. Infogan: Interpretable representation learning by information maximizing
generative adversarial nets. In Advances in neural information processing systems,
pages 2172–2180, 2016.
[21] Vladimir Chernykh, Grigoriy Sterling, and Pavel Prihodko. Emotion recognition
from speech with recurrent neural networks. arXiv preprint arXiv:1701.08071,
2017.
114
[22] François Chollet et al. Keras, 2015.
[23] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and
Yoshua Bengio. Attention-based models for speech recognition. In Advances in
neural information processing systems, pages 577–585, 2015.
[24] Neri E Cibau, Enrique M Albornoz, and Hugo L Rufiner. Speech emotion recog-
nition using a deep autoencoder. Anales de la XV Reunion de Procesamiento de la
Informacion y Control, 16:934–939, 2013.
[25] Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos
Kollias, Winfried Fellenz, and John G Taylor. Emotion recognition in human-
computer interaction. IEEE Signal processing magazine, 18(1):32–80, 2001.
[26] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien
Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment
using speech analysis. Speech Communication, 71:10–49, 2015.
[27] Jun Deng, Zixing Zhang, Florian Eyben, and Björn Schuller. Autoencoder-based
unsupervised domain adaptation for speech emotion recognition. IEEE Signal Pro-
cessing Letters, 21(9):1068–1072, 2014.
[28] Jun Deng, Zixing Zhang, Erik Marchi, and Björn Schuller. Sparse autoencoder-
based feature transfer learning for speech emotion recognition. In 2013 Humaine
Association Conference on Affective Computing and Intelligent Interaction, pages
511–516. IEEE, 2013.
[29] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John
Wiley & Sons, 2012.
[30] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on speech emotion
recognition: Features, classification schemes, and databases. Pattern Recognition,
44(3):572–587, 2011.
[31] Moataz MH El Ayadi, Mohamed S Kamel, and Fakhri Karray. Speech emotion
recognition using gaussian mixture vector autoregressive models. In Acoustics,
Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference
on, volume 4, pages IV–957. IEEE, 2007.
[32] Sefik Emre Eskimez, Zhiyao Duan, and Wendi Heinzelman. Unsupervised learn-
ing approach to feature analysis for automatic speech emotion recognition. In
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 5099–5103. IEEE, 2018.
[33] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical
resource for opinion mining. In LREC, volume 6, pages 417–422. Citeseer, 2006.
115
[34] Florian Eyben, Klaus R Scherer, Björn W Schuller, Johan Sundberg, Elisabeth
André, Carlos Busso, Laurence Y Devillers, Julien Epps, Petri Laukka, Shrikanth S
Narayanan, et al. The geneva minimalistic acoustic parameter set (gemaps) for
voice research and affective computing. IEEE Transactions on Affective Comput-
ing, 7(2):190–202, 2016.
[35] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. Recent develop-
ments in opensmile, the munich open-source multimedia feature extractor. In Pro-
ceedings of the 21st ACM international conference on Multimedia, pages 835–838.
ACM, 2013.
[36] Daniel Joseph France, Richard G Shiavi, Stephen Silverman, Marilyn Silverman,
and M Wilkes. Acoustical properties of speech as indicators of depression and sui-
cidal risk. IEEE transactions on Biomedical Engineering, 47(7):829–837, 2000.
[37] Bo Geng, Dacheng Tao, Chao Xu, Linjun Yang, and Xian-Sheng Hua. Ensemble
manifold regularization. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 34(6):1227–1233, 2012.
[38] Sayan Ghosh, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. Learn-
ing representations of affect from speech. arXiv preprint arXiv:1511.04747, Inter-
national Conference on Learning Repre-sentations (ICLR) 2016 Workshop, 2015.
[39] Theodoros Giannakopoulos. pyaudioanalysis: An open-source python library for
audio signal analysis. PloS one, 10(12):e0144610, 2015.
[40] Christer Gobl, Ailbhe Ni, et al. The role of voice quality in communicating emotion,
mood and attitude. Speech communication, 40(1):189–212, 2003.
[41] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press,
2016.
[42] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680, 2014.
[43] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua
Bengio. Maxout networks. In Proceedings of the 30th International Conference on
Machine Learning, volume 28, pages 1319–1327. PMLR, 17–19 Jun 2013.
[44] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harness-
ing adversarial examples. arXiv preprint arXiv:1412.6572, ICLR 2015, 2014.
[45] Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. The vera am mittag
german audio-visual emotional speech database. In Multimedia and Expo, 2008
IEEE International Conference on, pages 865–868. IEEE, 2008.
116
[46] Rahul Gupta, Daniel Bone, Sungbok Lee, and Shrikanth Narayanan. Analysis of
engagement behavior in children during dyadic interactions using prosodic cues.
Computer Speech & Language, 37:47–66, 2016.
[47] Rahul Gupta, Nikolaos Malandrakis, Bo Xiao, Tanaya Guha, Maarten Van Seg-
broeck, Matthew Black, Alexandros Potamianos, and Shrikanth Narayanan. Mul-
timodal prediction of affective dimensions and depression in human-computer inter-
actions. In Proceedings of the 4th International Workshop on Audio/Visual Emotion
Challenge, pages 33–40. ACM, 2014.
[48] Rahul Gupta, Saurabh Sahu, Carol Espy-Wilson, and Shrikanth Narayanan. Semi-
supervised and transfer learning approaches for low resource sentiment classifica-
tion. arXiv preprint arXiv:1806.02863, ICASSP 2018.
[49] Rahul Gupta, Saurabh Sahu, Carol Y Espy-Wilson, and Shrikanth S Narayanan. An
affect prediction approach through depression severity parameter incorporation in
neural networks. In INTERSPEECH, pages 3122–3126, 2017.
[50] Max Hamilton. A rating scale for depression. Journal of Neurology, Neurosurgery
& Psychiatry, 23(1):56–62, 1960.
[51] Jing Han, Zixing Zhang, Zhao Ren, Fabien Ringeval, and Björn Schuller. Towards
conditional adversarial training for predicting emotions from speech. In 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 6822–6826. IEEE, 2018.
[52] John HL Hansen and Douglas A Cairns. Icarus: Source generator based real-time
recognition of speech in noisy stressful and lombard effect environments. Speech
Communication, 16(4):391–422, 1995.
[53] Sanaul Haq, Philip JB Jackson, and J Edge. Speaker-dependent audio-visual emotion
recognition. In AVSP, pages 53–58, 2009.
[54] Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe
Morency, and Roger Zimmermann. Conversational memory network for emotion
recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of
the North American Chapter of the Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long Papers), volume 1, pages 2122–2132,
2018.
[55] Wilbert Jan Heeringa. Measuring dialect pronunciation differences using Leven-
shtein distance. PhD thesis, Citeseer, 2004.
[56] Brian S Helfer, Thomas F Quatieri, James R Williamson, Daryush D Mehta,
Rachelle Horwitz, and Bea Yu. Classification of depression state based on artic-
ulatory precision. In Interspeech, pages 2172–2176, 2013.
117
[57] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R
Hershey, Tim K Marks, and Kazuhiko Sumi. Attention-based multimodal fusion for
video description. In Proceedings of the IEEE international conference on computer
vision, pages 4193–4202, 2017.
[58] Hao Hu, Ming-Xing Xu, and Wei Wu. Gmm supervector based svm with spec-
tral features for speech emotion recognition. In Acoustics, Speech and Signal Pro-
cessing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages
IV–413. IEEE, 2007.
[59] Che-Wei Huang and Shrikanth S Narayanan. Attention assisted discovery of sub-
utterance structure in speech emotion recognition. In INTERSPEECH, pages 1387–
1391, 2016.
[60] Gao Huang, Shiji Song, Jatinder ND Gupta, and Cheng Wu. Semi-supervised
and unsupervised extreme learning machines. IEEE transactions on cybernetics,
44(12):2405–2417, 2014.
[61] Hai Jin, Laurence T Yang, and Jeffrey J-P Tsai. Ubiquitous Intelligence and Com-
puting: Third International Conference, UIC 2006, Wuhan, China, September 3-6,
2006, Proceedings, volume 4159. Springer, 2006.
[62] Jaebok Kim, Gwenn Englebienne, Khiet P Truong, and Vanessa Evers. Towards
speech emotion recognition” in the wild” using aggregated corpora and deep multi-
task learning. arXiv preprint arXiv:1708.03920, Interspeech, 2017.
[63] Jaebok Kim, Khiet P Truong, Gwenn Englebienne, and Vanessa Evers. Learning
spectro-temporal features with 3d cnns for speech emotion recognition. In Affective
Computing and Intelligent Interaction (ACII), 2017 Seventh International Confer-
ence on, pages 383–388. IEEE, 2017.
[64] Jangwon Kim, Sungbok Lee, and Shrikanth S Narayanan. An exploratory study
of manifolds of emotional speech. In Acoustics Speech and Signal Processing
(ICASSP), 2010 IEEE International Conference on, pages 5142–5145. IEEE, 2010.
[65] Yelin Kim and Emily Mower Provost. Emotion recognition during speech using
dynamics of multiple regions of the face. ACM Transactions on Multimedia Com-
puting, Communications, and Applications (TOMM), 12(1s):25, 2015.
[66] Dennis H Klatt and Laura C Klatt. Analysis, synthesis, and perception of voice
quality variations among female and male talkers. the Journal of the Acoustical
Society of America, 87(2):820–857, 1990.
[67] Yi-Chun Ko. A study of feature sets for emotion recognition from speech signals.
Master’s thesis, 2015.
118
[68] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan
Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. Deap:
A database for emotion analysis; using physiological signals. IEEE Transactions on
Affective Computing, 3(1):18–31, 2012.
[69] Duc Le and Emily Mower Provost. Emotion recognition from spontaneous speech
using hidden markov models with deep belief networks. In Automatic Speech Recog-
nition and Understanding (ASRU), 2013 IEEE Workshop on, pages 216–221. IEEE,
2013.
[70] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth
Narayanan. Emotion recognition using a hierarchical binary decision tree approach.
Speech Communication, 53(9):1162–1171, 2011.
[71] Jinkyu Lee and Ivan Tashev. High-level feature representation using recurrent neural
network for speech emotion recognition. In Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[72] Sungbok Lee, Serdar Yildirim, Abe Kazemzadeh, and Shrikanth Narayanan. An
articulatory study of emotional speech production. In Interspeech, pages 497–500,
2005.
[73] Xingfeng Li and Masato Akagi. A three-layer emotion perception model for va-
lence and arousal-based detection from multilingual speech. Proc. Interspeech 2018,
pages 3643–3647, 2018.
[74] Mark Liberman. Emotional prosody speech and transcripts. http://www. ldc. upenn.
edu/Catalog/CatalogEntry. jsp? catalogId= LDC2002S28, 2002.
[75] Yi-Lin Lin and Gang Wei. Speech emotion recognition based on hmm and svm. In
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Con-
ference on, volume 8, pages 4898–4901. IEEE, 2005.
[76] Na Liu, Yuan Zong, Baofeng Zhang, Li Liu, Jie Chen, Guoying Zhao, and Junchao
Zhu. Unsupervised cross-corpus speech emotion recognition using domain-adaptive
subspace learning. In 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 5144–5148. IEEE, 2018.
[77] Reza Lotfian and Carlos Busso. Curriculum learning for speech emotion recog-
nition from crowdsourced labels. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(4):815–826, 2019.
[78] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal
of Machine Learning Research, 9(Nov):2579–2605, 2008.
119
[79] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan
Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, ICLR 2016.
[80] Qirong Mao, Wentao Xue, Qiru Rao, Feifei Zhang, and Yongzhao Zhan. Domain
adaptation for speech emotion recognition by sharing priors between related source
and target classes. In 2016 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pages 2608–2612. IEEE, 2016.
[81] Angeliki Metallinou, Sungbok Lee, and Shrikanth Narayanan. Decision level com-
bination of multiple modalities for recognition and analysis of emotional expression.
In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Con-
ference on, pages 2462–2465. IEEE, 2010.
[82] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv
preprint arXiv:1411.1784, 2014.
[83] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How transfer-
able are neural networks in nlp applications? In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language Processing (EMNLP). Association for
Computational Linguistics, pages 479–489, 2016.
[84] James C Mundt, Peter J Snyder, Michael S Cannizzaro, Kara Chappie, and Dayna S
Geralts. Voice acoustic measures of depression severity and treatment response col-
lected via interactive voice response (ivr) technology. Journal of neurolinguistics,
20(1):50–64, 2007.
[85] Iain R Murray and John L Arnott. Toward the simulation of emotion in synthetic
speech: A review of the literature on human vocal emotion. The Journal of the
Acoustical Society of America, 93(2):1097–1108, 1993.
[86] Michael Neumann et al. Cross-lingual and multilingual speech emotion recognition
on english and french. In 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 5769–5773. IEEE, 2018.
[87] Michael Neumann and Ngoc Thang Vu. Attentive convolutional neural network
based speech emotion recognition: A study on the impact of input features, signal
length, and acted speech. arXiv preprint arXiv:1706.00612, Interspeech, 2017.
[88] Joy Nicholson, Kaxuhiko Takahashi, and Ryohei Nakatsu. Emotion recognition
in speech using neural networks. In Neural Information Processing, 1999. Proceed-
ings. ICONIP’99. 6th International Conference on, volume 2, pages 495–501. IEEE,
1999.
[89] Tin Lay Nwe, Say Wei Foo, and Liyanage C De Silva. Speech emotion recognition
using hidden markov models. Speech communication, 41(4):603–623, 2003.
120
[90] Joseph D O’Connor. Intonation of colloquial english. 1984.
[91] AM Oster and Arne Risberg. The identification of the mood of a speaker by hearing
impaired listeners. SLT-Quarterly Progress Status Report, 4:79–90, 1986.
[92] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vec-
tors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP), pages 1532–1543, 2014.
[93] Luis Perez and Jason Wang. The effectiveness of data augmentation in image clas-
sification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
[94] Fan Ping, Jiang Dongmei, Wang Fengna, Ravyse Ilse, and Sahli Hichem. Manifold
analysis for subject independent dynamic emotion recognition in video sequences.
In Image and Graphics, 2009. ICIG’09. Fifth International Conference on, pages
896–901. IEEE, 2009.
[95] Rohit Prabhavalkar, Tara N Sainath, Bo Li, Kanishka Rao, and Navdeep Jaitly. An
analysis of attention in sequence-to-sequence models,. In Proc. of Interspeech, 2017.
[96] Yu Qian, Li Ying, and Jia Pingping. Speech emotion recognition using supervised
manifold learning based on all-class and pairwise-class feature extraction. In Con-
ference Anthology, IEEE, pages 1–5. IEEE, 2013.
[97] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In
Proceedings of the first instructional conference on machine learning, volume 242,
pages 133–142, 2003.
[98] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv
preprint arXiv:1706.05098, 2017.
[99] S Sahu and C. Y. Espy-Wilson. Speech features for depression detection. In Inter-
speech, pages 1928–1932, 2016.
[100] Saurabh Sahu, Rahul Gupta, and Carol Espy-Wilson. On enhancing speech
emotion recognition using generative adversarial networks. arXiv preprint
arXiv:1806.06626, Interspeech, 2018.
[101] Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Wael AbdAlmageed, and Carol
Espy-Wilson. Adversarial auto-encoders for speech based emotion recognition. In
Interspeech, pages 1243–1247, 2017.
[102] Michelle Hewlett Sanchez, Gokhan Tur, Luciana Ferrer, and Dilek Hakkani-Tür.
Domain adaptation and compensation for emotion detection. In Eleventh Annual
Conference of the International Speech Communication Association, 2010.
121
[103] Klaus R Scherer. Vocal affect expression: a review and a model for future research.
Psychological bulletin, 99(2):143, 1986.
[104] Maria Schubiger. English intonation, its form and function. M. Niemeyer Verlag,
1958.
[105] Bjorn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. Emotion recognition
from speech: putting asr in the loop. In Acoustics, Speech and Signal Processing,
2009. ICASSP 2009. IEEE International Conference on, pages 4585–4588. IEEE,
2009.
[106] Björn Schuller, Ronald Müller, Manfred Lang, and Gerhard Rigoll. Speaker in-
dependent emotion recognition by early fusion of acoustic and linguistic features
within ensembles. In Ninth European Conference on Speech Communication and
Technology, 2005.
[107] Björn Schuller, Gerhard Rigoll, and Manfred Lang. Speech emotion recognition
combining acoustic features and linguistic information in a hybrid support vector
machine-belief network architecture. In Acoustics, Speech, and Signal Process-
ing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on, volume 1,
pages I–577. IEEE, 2004.
[108] Björn W Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Dev-
illers, Christian A Müller, Shrikanth S Narayanan, et al. The interspeech 2010 par-
alinguistic challenge. In Interspeech, volume 2010, pages 2795–2798, 2010.
[109] Mohammad Shami and Werner Verhelst. Automatic classification of expressiveness
in speech: a multi-corpus study. In Speaker classification II, pages 43–56. Springer,
2007.
[110] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson.
Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of
the IEEE conference on computer vision and pattern recognition workshops, pages
806–813, 2014.
[111] Mandeep Singh, Mr Mooninder Singh, and Nikhil Singhal. Ann based emotion
recognition. Emotion, (1):56–60, 2013.
[112] Peng Song, Wenming Zheng, Shifeng Ou, Yun Jin, Wenming Ma, and Yanwei Yu.
Joint transfer subspace learning and feature selection for cross-corpus speech emo-
tion recognition. In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2018.
[113] André Stuhlsatz, Christine Meyer, Florian Eyben, Thomas Zielke, Günter Meier,
and Björn Schuller. Deep neural networks for acoustic emotion recognition: raising
122
the benchmarks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE
International Conference on, pages 5688–5691. IEEE, 2011.
[114] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv
preprint arXiv:1312.6199, ICLR, 2014.
[115] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In Proceed-
ings of the 52nd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), volume 1, pages 1555–1565, 2014.
[116] HM Teager and SM Teager. Evidence for nonlinear sound production mechanisms
in the vocal tract. In Speech production and speech modelling, pages 241–261.
Springer, 1990.
[117] Vikrant Singh Tomar and Richard C Rose. Graph based manifold regularized deep
neural networks for automatic speech recognition. arXiv preprint arXiv:1606.05925,
2016.
[118] Nishtha Tripathi and Avani Jadeja. A survey of regularization methods for deep
neural network. International Journal of Computer Science and Mobile Computing,
IJCSMC, 3(11):429–436, 2014.
[119] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne,
Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja
Pantic. Avec 2016: Depression, mood, and emotion recognition workshop and chal-
lenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion
Challenge, pages 3–10. ACM, 2016.
[120] Michel Valstar, Björn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek
Krajewski, Roddy Cowie, and Maja Pantic. Avec 2014: 3d dimensional affect and
depression recognition challenge. In Proceedings of the 4th International Workshop
on Audio/Visual Emotion Challenge, pages 3–10. ACM, 2014.
[121] Aäron Van Den Oord, Sander Dieleman, and Benjamin Schrauwen. Transfer learn-
ing by supervised pre-training for audio-based music classification. In Conference
of the International Society for Music Information Retrieval (ISMIR 2014), 2014.
[122] Dimitrios Ververidis and Constantine Kotropoulos. Emotional speech recognition:
Resources, features, and methods. Speech communication, 48(9):1162–1181, 2006.
[123] Hui-Po Wang, Wei-Jan Ko, and Wen-Hsiao Peng. Learning priors for adversarial
autoencoders. In 2018 Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC), pages 1388–1396. IEEE, 2018.
123
[124] Carl E Williams and Kenneth N Stevens. Emotions and speech: Some acoustical
correlates. The Journal of the Acoustical Society of America, 52(4B):1238–1250,
1972.
[125] Rui Xia and Yang Liu. Using denoising autoencoder for emotion recognition, pages
2886–2889. International Speech and Communication Association, 2013.
[126] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn
Song. Generating adversarial examples with adversarial networks. arXiv preprint
arXiv:1801.02610, 2018.
[127] Zixiaofan Yang and Julia Hirschberg. Predicting arousal and valence from wave-
forms and spectrograms using deep neural networks. Proc. Interspeech 2018, pages
3092–3096, 2018.
[128] Mingyu You, Chun Chen, Jiajun Bu, Jia Liu, and Jianhua Tao. Emotion recognition
from noisy speech. In Multimedia and Expo, 2006 IEEE International Conference
on, pages 1653–1656. IEEE, 2006.
[129] Mingyu You, Chun Chen, Jiajun Bu, Jia Liu, and Jianhua Tao. Manifolds based
emotion recognition in speech. Computational Linguistics and Chinese Language
Processing, 12(1):49–64, 2007.
[130] Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-
Philippe Morency. Multi-attention recurrent network for human communication
comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[131] A Zhang. Speech recognition (version 3.8). https://github.com/Uberi/
speech_recognition#readme, 2017.
[132] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient
descent algorithms. In Proceedings of the twenty-first international conference on
Machine learning, page 116. ACM, 2004.
[133] Zixing Zhang, Felix Weninger, Martin Wöllmer, and Björn Schuller. Unsupervised
learning in cross-corpus acoustic emotion recognition. In 2011 IEEE Workshop on
Automatic Speech Recognition & Understanding, pages 523–528. IEEE, 2011.
[134] Shusen Zhou, Qingcai Chen, and Xiaolong Wang. Active deep networks for semi-
supervised sentiment classification. In Proceedings of the 23rd International Con-
ference on Computational Linguistics: Posters, pages 1515–1523. Association for
Computational Linguistics, 2010.
124