Beyond words – What can we get from speech?

Gestures, facial expressions, and speech are the key ways we communicate and convey thoughts to others. However, the content of speech (i.e., the words) accounts for only around seven percent of communication. The rest lies in HOW we talk (38%) and in how we move our body and use facial expressions (55%) [1].

In conversations where the addressee is not physically present (such as telephone calls), the way of talking has an essential influence on how a message is conveyed. Besides that, our speech carries other information that is not easily recognisable to an untrained ear. This includes age, gender, nativeness, language, cultural background, personality, emotions, deception, psychological disorders (e.g., depression), stress, and so on. We can obtain this information from paralinguistic features of speech, such as tone, jitter, formants, pitch, and pauses. Indeed, speech is a rich source of information about a speaker.

Machine learning techniques

MixedEmotions applies machine learning techniques to extract such information by processing a combination of speech, video, and textual content. Analysis of speech signals is beneficial in scenarios such as call centers, where it can track the emotional state of a caller during a conversation, and it can enhance interaction between humans and machines or robots. Despite great progress in machine learning and data analysis, many challenges remain on the way to an optimal emotion recognition system, among them speaker variability, variability in recording conditions, background noise, and the combination of modalities.

In the following, we briefly discuss three challenges of emotion analysis: speaker variability, data annotation, and multimodal sensing.


Speaker variability

Older people tend to be less emotionally expressive for negative emotions than young people

Variations in how emotions are expressed make it difficult to reach high performance in emotion recognition. These variations stem from individual factors such as gender, age, and personality. For example, it has been shown that women are more expressive than men [2], and that older people express fewer negative emotions than young people [3]. Personality traits play a great role in emotion expression; e.g., extroverts express positive emotions more strongly [4]. Another essential individual difference that causes variations in the expression of emotions is cultural background (e.g., Asian-Americans are less expressive than other ethnic groups [4]).

Nevertheless, the good news is that these individual factors can be recognised from speech itself!

In MixedEmotions, we aim to improve emotion recognition by automatically detecting the speaker’s individual factors and adapting the system accordingly. For example, one approach is to first recognise the cultural background (e.g., by using a language identifier) and then, based on the detected background, select a correspondingly tuned model for emotion recognition [5]. Another approach is to adapt a trained system to the new data [6]. Both approaches improve the performance of an emotion recogniser.


Improving a multilingual emotion recogniser through language identification and model selection [5]
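
The model-selection idea described above can be illustrated with a minimal Python sketch. It is not the system from [5]: the toy threshold "models", the language codes, the `MODELS` table, and the `min_lid_confidence` parameter are all hypothetical placeholders, chosen only to show how the output of a language identifier could route an utterance to a language-specific emotion model, with a fallback when the identifier is unsure.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]
EmotionModel = Callable[[Features], str]


def make_threshold_model(threshold: float) -> EmotionModel:
    """Build a toy emotion model (assumption, stands in for a trained classifier)."""
    def predict(features: Features) -> str:
        if not features:
            raise ValueError("features must not be empty")
        mean = sum(features) / len(features)
        return "aroused" if mean > threshold else "calm"
    return predict


# Hypothetical per-language models, each "tuned" with its own threshold.
MODELS: Dict[str, EmotionModel] = {
    "de": make_threshold_model(0.6),
    "it": make_threshold_model(0.8),
}
# Hypothetical language-independent model used when no tuned model applies.
FALLBACK_MODEL: EmotionModel = make_threshold_model(0.7)


def recognise_emotion(
    features: Features,
    language_scores: Dict[str, float],
    min_lid_confidence: float = 0.5,
) -> str:
    """Pick the emotion model matching the most likely language, then predict."""
    # Step 1: take the language the identifier considers most likely.
    language, score = max(language_scores.items(), key=lambda item: item[1])
    # Step 2: use the tuned model only if the identifier is confident enough
    # and a model for that language exists; otherwise fall back.
    if score >= min_lid_confidence and language in MODELS:
        model = MODELS[language]
    else:
        model = FALLBACK_MODEL
    # Step 3: run emotion recognition with the selected model.
    return model(features)
```

With these toy settings, `recognise_emotion([0.75, 0.75], {"de": 0.9, "it": 0.1})` is routed to the "de" model, while the same features with scores `{"de": 0.4, "it": 0.35}` fall back to the language-independent model because the identifier is not confident enough.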


Big Data annotation

Let the computer do the work

In the Big Data era, tremendous amounts of data are collected within seconds. How can we get good annotations for creating a model? Recruiting people and paying them to make annotations can be expensive, time-consuming, and frustrating. Platforms such as iHEARu-Play [7] try to encourage annotators by gamifying the annotation task. Nevertheless, given a huge amount of data, these platforms may still not be efficient enough.

Therefore, in MixedEmotions, we try to automate annotation and put this burden on the computer.

One way is to use semi-supervised approaches, where the computer guesses a label and computes a confidence for the guess. If the confidence is low, it asks a human annotator for help [8]. Another way is to match the data distributions of a labelled corpus and an unlabelled corpus, train a model on the matched labelled data, and apply it to classify the matched unlabelled data [6].
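
The confidence-based scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of [8]: a simple nearest-centroid classifier stands in for the real model, the confidence is a margin between the two closest class centroids, and `ask_human` and `threshold` are hypothetical names for the human annotator callback and the confidence cut-off. At least two classes must be present in the initial labelled data.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Sequence[float]
Labelled = Tuple[Sample, str]


def euclidean(a: Sample, b: Sample) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroids(labelled: List[Labelled]) -> Dict[str, List[float]]:
    """'Train' the stand-in model: the mean feature vector of each label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in labelled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_confidence(
    model: Dict[str, List[float]], features: Sample
) -> Tuple[str, float]:
    """Guess the nearest-centroid label and a margin-based confidence in [0, 1]."""
    ranked = sorted((euclidean(features, c), label) for label, c in model.items())
    (d_best, label), (d_second, _) = ranked[0], ranked[1]
    if d_second == 0.0:
        return label, 0.0  # two centroids coincide with the sample: no margin
    return label, 1.0 - d_best / d_second


def annotate(
    labelled: List[Labelled],
    unlabelled: List[Sample],
    ask_human: Callable[[Sample], str],
    threshold: float = 0.5,
) -> Tuple[List[Labelled], int]:
    """Label samples automatically, asking a human only when confidence is low."""
    pool = list(labelled)
    human_queries = 0
    for features in unlabelled:
        model = centroids(pool)  # retrain on everything labelled so far
        label, confidence = predict_with_confidence(model, features)
        if confidence < threshold:
            label = ask_human(features)  # low confidence: defer to the annotator
            human_queries += 1
        pool.append((features, label))
    return pool, human_queries
```

In this sketch the machine labels the clear cases itself and the human is consulted only for ambiguous samples, so the number of returned `human_queries` gives a rough idea of the annotation effort saved.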


Semi-supervised learning scheme [8]


Mixing it all together

In the MixedEmotions project, we aim to extract emotions from speech along with other modalities such as facial gestures and spoken words.

Certainly, the combination of these modalities will bring more confidence to the final decision of the system. However, the challenge is how to combine these sources of information, and which tactics to use if some of the modalities are not present (e.g., if the face is not detected in a video) or if the decisions of the modalities are incongruent (e.g., in deceptive speech, the words may convey false information, while the way the speaker talks provides more reliable evidence of the actual facts).

Finally, how can we measure a confidence value that indicates the level of trust in each modality as well as in the final decision? One approach is to measure the distance between the features of the test data and those of the training data, where a larger distance means less confidence. Another is to pass the data through an autoencoder that has been trained on the training data and to calculate the reconstruction error [9].
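
The distance-based idea can be illustrated with a small Python sketch. The text only states that a larger distance should mean less confidence; the use of the nearest training sample, the exponential mapping, and the `scale` parameter are assumptions made here for illustration.

```python
import math
from typing import Sequence

Sample = Sequence[float]


def euclidean(a: Sample, b: Sample) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_distance(sample: Sample, training_set: Sequence[Sample]) -> float:
    """Distance from a test sample to its closest training sample."""
    if not training_set:
        raise ValueError("training_set must not be empty")
    return min(euclidean(sample, t) for t in training_set)


def distance_confidence(
    sample: Sample, training_set: Sequence[Sample], scale: float = 1.0
) -> float:
    """Map distance to a confidence in (0, 1]: 1.0 at distance zero, lower when farther.

    `scale` (hypothetical) controls how quickly confidence decays with distance.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.exp(-nearest_distance(sample, training_set) / scale)
```

A test sample that coincides with a training sample receives a confidence of 1.0, and the value shrinks towards 0 as the sample moves away from everything seen in training. The same interface could wrap an autoencoder's reconstruction error in place of the distance.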


Speech, apart from its textual content, carries a lot of information about the individual and mental characteristics of a speaker. Within the MixedEmotions project, we use this information and apply techniques to overcome the challenges of multilinguality, adaptivity, and multimodality. With these new data insights, we aim to enhance customer service (e.g., in a call center) and to create TV recommendations based on the mood of a viewer.


The movie “Her” is one of the movies that present perfectly emotionally intelligent human-computer interaction through voice analysis


MixedEmotions is a European research project, an innovative two-year research program. It involves five companies and four European universities. With a budget of more than 3.5 million euros, it aims to search, identify, classify, and characterize emotions in large volumes of data and across data sources by applying Big Data analysis technologies.

The University of Passau is the 5th best university in Germany in the field of computer science. It is located on the banks of the Inn in Passau, the city of three rivers. The chair of Complex & Intelligent Systems (CIS) does research and development in speech processing, particularly for affective computing. They work on projects funded by the European Union (H2020, FP7), German funding institutes (BMBF), and companies (such as Huawei). The target applications of their systems are health (e.g., monitoring depression), emotionally intelligent human-computer interaction (e.g., for autistic children), and big data.

Image source: Martin Luther King


[1] Mehrabian, A. (1981). Silent messages: Implicit communication of emotions and attitudes. Belmont, CA: Wadsworth.
[2] Kring, A. M., & Gordon, A. H. (1998). Sex differences in emotion: Expression, experience, and physiology. Journal of Personality and Social Psychology, 74(3), 686.
[3] Gross, J. J., Carstensen, L. L., Pasupathi, M., Tsai, J., Götestam Skorpen, C., & Hsu, A. Y. (1997). Emotion and aging: Experience, expression, and control. Psychology and Aging, 12(4), 590.
[4] Gross, J. J., & John, O. P. (1995). Facets of emotional expressivity: Three self-report factors and their correlates. Personality and Individual Differences, 19(4), 555-568.
[5] Sagha, H., Matejka, P., Gavryukova, M., Povolny, F., Marchi, E., & Schuller, B. Enhancing multilingual recognition of emotion in speech by language identification.
[6] Sagha, H., Deng, J., Gavryukova, M., & Han, J. (2016, March). Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5800-5804). IEEE.
[7] Hantke, S., Marchi, E., & Schuller, B. (2016). Introducing the weighted trustability evaluator for crowdsourcing exemplified by speaker likability classification.
[8] Zhang, Z., Ringeval, F., Dong, B., Coutinho, E., & Marchi, E. (2016, March). Enhanced semi-supervised learning for multimodal emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5185-5189). IEEE.
[9] Marchi, E., Vesperini, F., Weninger, F., Eyben, F., Squartini, S., & Schuller, B. (2015, July). Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection. In International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.