Emotions are an integral aspect of human mental processes and everyday experience. They drive much of our behaviour and play an important part in communication. Emotions are often intertwined with mood, temperament, personality, disposition, and motivation.
To understand human emotions, to react to them, and to intentionally induce them has been a long-standing dream of researchers in human-computer interaction. How much better could our lives be if computers, search engines, or smart personal assistants were able to sense when we start getting annoyed or frustrated with them, and could adapt accordingly?
Today, these goals are getting closer than ever. For example, consider the following video of BabyX.
Emotions are not always directly observable, but they manifest in our actions and communication: in our voice, gaze, facial expressions, body posture and movement, and of course in our use of language. To understand emotions reliably, even humans have to observe as many modalities as possible. In MixedEmotions we focus on the three main modalities: text, sound, and video. This article provides a brief glimpse into the state of the art in automatic emotion understanding from video and how this task is addressed in MixedEmotions.
Can machines beat humans?
Most people assume that humans are very good at recognizing faces, expressions, and emotions. Many also believe that current computer programs cannot achieve similar results in these ambiguous tasks, or even that they will never reach human performance. However, such assumptions are far from the truth. In fact, computer programs have already achieved super-human performance in a large number of visual perception tasks, mostly thanks to the availability of large datasets, cheap computing power provided by GPUs, and convolutional neural networks.
For example, FaceNet achieved a face verification accuracy of 99.63% on the Labeled Faces in the Wild dataset (the task: is it the same person in two photographs?), whereas human accuracy was assessed at 97.53% for tight facial crops and 99.2% for looser crops with context. Similarly, a system created by researchers from Ohio State University is able to distinguish 21 distinct facial expressions, including some seemingly contradictory ones such as “happily disgusted” or “sadly angry”. Not directly relevant but definitely impressive are the image classification results in the ImageNet challenge, where the goal is to identify the dominant object in a picture from 1000 possible object classes. The leading systems from 2015 by Microsoft Research and Google Research achieved error rates of about 3.5% on this task, whereas the error rate achieved by a highly trained human is 5.1%.
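Face verification with learned embeddings can be sketched in a few lines. The snippet below is a minimal illustration, not FaceNet itself: it assumes some network has already mapped each photograph to a feature vector, and the toy vectors, the function names, and the threshold value of 1.0 are all hypothetical. Two faces are declared the same person when the squared distance between their unit-length embeddings falls below the threshold.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that distances are comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_person(emb_a, emb_b, threshold=1.0):
    """Decide whether two face embeddings belong to one identity.

    The threshold is a hypothetical value; in practice it would be
    tuned on a validation set of matching and non-matching pairs.
    """
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    return squared_distance(a, b) < threshold


# Made-up 3-dimensional embeddings standing in for real network outputs.
anna_photo_1 = [0.9, 0.1, 0.2]
anna_photo_2 = [0.8, 0.2, 0.1]
boris_photo = [0.1, 0.9, 0.3]

print(same_person(anna_photo_1, anna_photo_2))  # similar vectors
print(same_person(anna_photo_1, boris_photo))   # dissimilar vectors
```

Verification accuracy, as quoted above, is then simply the fraction of photograph pairs for which such a decision is correct.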
How to understand faces
In faces, emotions are expressed mostly through facial expressions, gaze direction, and head movements. Facial expressions are driven by a relatively small set of muscles, whose movements are described as action units in the Facial Action Coding System (FACS). For example, happiness (a smile) is expressed using the Cheek Raiser and the Lip Corner Puller. Automated systems can estimate the activations of these individual action units and combine them into estimates of expressions and emotions, but it is more common to estimate the expressions directly from facial appearance and shape.
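The idea of combining action unit activations into an expression estimate can be illustrated with a small rule-based sketch. It assumes that some detector has already produced an activation between 0 and 1 for each action unit. Only the happiness combination comes from the text above; the sadness and surprise combinations are commonly cited prototypes included here as assumptions, and the scoring rule (mean activation) and the 0.5 cut-off are arbitrary illustrative choices.

```python
# Prototype action-unit (AU) combinations, keyed by expression name.
# Only "happiness" is taken from the article; the rest are assumptions.
PROTOTYPES = {
    "happiness": (6, 12),        # Cheek Raiser + Lip Corner Puller
    "sadness": (1, 4, 15),       # Inner Brow Raiser + Brow Lowerer + Lip Corner Depressor
    "surprise": (1, 2, 5, 26),   # Brow Raisers + Upper Lid Raiser + Jaw Drop
}


def score_expressions(au_activations):
    """Score each expression as the mean activation of its action units.

    au_activations maps AU numbers to values in [0, 1]; AUs that are
    missing from the mapping are treated as inactive (0.0).
    """
    scores = {}
    for name, units in PROTOTYPES.items():
        total = sum(au_activations.get(unit, 0.0) for unit in units)
        scores[name] = total / len(units)
    return scores


def most_likely_expression(au_activations, min_score=0.5):
    """Return the best-scoring expression, or "neutral" if none is strong."""
    scores = score_expressions(au_activations)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "neutral"


# Hypothetical detector output for a smiling face.
smile = {6: 0.8, 12: 0.9, 1: 0.1}
print(most_likely_expression(smile))
```

A hand-written table like this is easy to inspect, which is its main appeal; the direct appearance-based estimation mentioned above replaces it with a learned classifier.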
Before faces can be analysed, they must first be detected and their basic pose established. Face detection is now considered mostly solved, with a number of real-time methods able to detect faces in low-resolution images and in a large range of poses. Methods based on boosted simple classifiers, deformable part models, and convolutional neural networks all provide speed and detection quality sufficient for most applications. A face can be aligned by estimating its pose from the positions of facial landmark points, which can be localized by a number of methods. The state of the art in facial landmark localization is currently provided by various convolutional networks [10,11], which predict the 2D landmark coordinates from an image.
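As a rough illustration of alignment from landmarks, the sketch below uses only the two eye centres, which is one simple option among many. It assumes the landmark coordinates have already been predicted by some detector; the function names and the example coordinates are hypothetical. The face is rotated and scaled so that the eyes end up on a horizontal line at a fixed distance, with the midpoint between them at the origin.

```python
import math


def alignment_transform(left_eye, right_eye, target_distance=1.0):
    """Return (angle, scale) that makes the eye line horizontal.

    left_eye and right_eye are (x, y) positions as they appear in the
    image, so left_eye is the one with the smaller x in a frontal face.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    distance = math.hypot(dx, dy)
    if distance == 0.0:
        raise ValueError("eye positions must differ")
    angle = -math.atan2(dy, dx)          # undo the tilt of the eye line
    scale = target_distance / distance   # normalise the inter-eye distance
    return angle, scale


def align_points(points, left_eye, right_eye, target_distance=1.0):
    """Rotate and scale landmarks about the midpoint between the eyes."""
    angle, scale = alignment_transform(left_eye, right_eye, target_distance)
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    aligned = []
    for x, y in points:
        tx, ty = x - cx, y - cy
        aligned.append((scale * (cos_a * tx - sin_a * ty),
                        scale * (sin_a * tx + cos_a * ty)))
    return aligned


# Hypothetical eye centres of a tilted face, in pixel coordinates.
left, right = (30.0, 40.0), (70.0, 70.0)
aligned_left, aligned_right = align_points([left, right], left, right)
print(aligned_left, aligned_right)
```

The same rotation and scale would then be applied to the image pixels, so that every face reaches the classifier in a comparable pose.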
Once a face is detected and aligned, its appearance can be processed directly by a classifier. Such an approach provides extremely good results when enough labeled training data is available, as was demonstrated by the results in face recognition. However, the available datasets for expression and emotion recognition are quite limited in size, which also limits the results of this direct approach.
Two possible solutions to this problem exist: either unlabeled data has to be used to help build the classifiers, or additional outside knowledge has to be introduced, preferably in the form of a 3D deformable facial model. 3D facial models aim to infer how an observed image was created: they try to estimate the shape of the face, its texture, and the illumination which gave rise to the image. Such models have reached a level of modeling accuracy which is sufficient, for example, to transfer facial performance between different actors, as shown in the following video.
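To give a feeling for what a deformable model does, the following sketch covers only the shape part, in its simplest linear form: a face shape is the mean shape plus a weighted sum of deformation directions, and fitting means finding the weights that reproduce an observation. It makes strong simplifying assumptions that real systems do not: the 3D vertex positions are observed directly rather than through an image, the deformation basis is orthonormal, and texture, illumination, and camera pose are ignored. All names and numbers are hypothetical.

```python
def synthesize(mean_shape, basis, coeffs):
    """Generate a shape as mean_shape + sum_k coeffs[k] * basis[k].

    Shapes are flat lists of coordinates: [x0, y0, z0, x1, y1, z1, ...].
    """
    return [
        m + sum(c * direction[i] for c, direction in zip(coeffs, basis))
        for i, m in enumerate(mean_shape)
    ]


def fit_coefficients(mean_shape, basis, observed):
    """Recover the coefficients that best explain an observed shape.

    Assumes the basis vectors are orthonormal, so the least-squares
    solution is the projection of the residual onto each basis vector.
    """
    residual = [o - m for o, m in zip(observed, mean_shape)]
    return [sum(r * d for r, d in zip(residual, direction)) for direction in basis]


# Toy model with two vertices and two orthonormal deformation directions.
mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # first vertex moves along y
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.8],   # second vertex moves in the y-z plane
]

observed = synthesize(mean_shape, basis, [0.5, -1.0])
recovered = fit_coefficients(mean_shape, basis, observed)
print(recovered)
```

Once the coefficients are known, they can be reused with a different mean shape or basis, which is the intuition behind transferring a performance from one face to another.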
The approach to understanding faces in MixedEmotions
In the MixedEmotions project, we are pursuing the first of the two directions above: the use of unlabeled data. By modeling a large amount of unlabeled data using convolutional networks, we are able to create a general representation of facial expressions that is independent of person and pose. This representation can then be efficiently associated with specific expressions and emotions using a relatively small labeled dataset. We are creating a visual system which will gather information about facial expressions, gaze, and head movement in order to estimate emotions. This visual information will be further fused with audio cues in order to provide even higher accuracy.
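The two steps described here, attaching labels to a fixed representation with little labeled data and then fusing modalities, can be sketched as follows. This is an illustrative stand-in, not the project's actual method: the feature vectors play the role of outputs from a network trained on unlabeled data, the nearest-centroid classifier and the weighted-average fusion are simple choices made for this example, and every name, number, and weight is hypothetical.

```python
import math


def fit_centroids(features, labels):
    """Average the feature vectors of each label (nearest-centroid model)."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def class_probabilities(centroids, vec):
    """Softmax over negative squared distances to each class centroid."""
    logits = {
        label: -sum((a - b) ** 2 for a, b in zip(vec, centre))
        for label, centre in centroids.items()
    }
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def late_fusion(visual, audio, visual_weight=0.6):
    """Weighted average of per-class probabilities from two modalities."""
    return {
        label: visual_weight * visual[label] + (1.0 - visual_weight) * audio[label]
        for label in visual
    }


# A deliberately tiny labeled set of made-up 2-dimensional features.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = ["happy", "happy", "sad", "sad"]
centroids = fit_centroids(features, labels)

visual = class_probabilities(centroids, [0.7, 0.3])
audio = {"happy": 0.4, "sad": 0.6}   # hypothetical output of an audio model
fused = late_fusion(visual, audio)
print(max(fused, key=fused.get))
```

Because the representation is fixed, only the class centroids depend on labeled examples, which is why a small labeled dataset can be enough in this kind of setup.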
The developed capabilities will allow us to automatically analyse the emotional content of television debates, interviews, product reviews, and other video content, which will provide a foundation for novel approaches to video search, user studies, and media content analysis.
MixedEmotions is an innovative two-year European research project which involves five companies and four European universities. With a budget of more than 3.5 million euros, it aims to search, identify, classify, and characterize emotions across large data volumes and data sources by applying Big Data analysis technologies.
Brno University of Technology is one of the leading higher education and research institutions in the Czech Republic. Its Faculty of Information Technology (FIT) conducts top quality research in computer vision, speech analysis, flight control avionics, hardware design, security, formal languages, and many other areas.
[1] Florian Schroff, Dmitry Kalenichenko, James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. arXiv:1503.03832, 2015.
[2] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365–372, 2009.
[3] Shichuan Du, Yong Tao, and Aleix M. Martinez. Compound facial expressions of emotion. PNAS 111 (15) E1454-E1462, 2014.
[4] ImageNet Challenge. http://image-net.org/challenges/LSVRC/2015/results
[5] Andrej Karpathy. What I learned from competing against a ConvNet on ImageNet. 2014.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
[7] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[8] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, 2014.
[9] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In CVPR, Boston, 2015.
[10] Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang. Facial Landmark Detection by Deep Multi-task Learning. In Proceedings of European Conference on Computer Vision (ECCV), 2014.
[11] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning Deep Representation for Face Alignment with Auxiliary Attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918-930, May 2016.