In computer vision, classifying facial attributes has attracted considerable interest from researchers and corporations. Approaches based on Deep Neural Networks are now widespread for such tasks and have reached higher detection accuracies than earlier manually-designed approaches. Our paper reports how preprocessing and face image alignment influence accuracy when detecting face attributes. More importantly, it demonstrates how combining a representation of the shape of a face with its appearance, organized as a sequence of convolutional neural networks, improves classification scores of facial attributes compared with previous work on the FER+ dataset.
While most studies in the field have tried to improve detection accuracy by averaging multiple very deep networks, the presented work concentrates on building efficient models while maintaining high accuracy scores. By taking advantage of the face shape component and relying on an efficient shallow CNN architecture, we unveil the first available, highly accurate real-time implementation on mobile browsers.
Link to the paper: AMDO 2018, pp. 73–84 — Nicolas Livet, George Berkowski
Exploring human face attributes from images and image sequences has been a topic of interest over the years. For decades, a number of approaches were carefully engineered to solve this problem with the highest possible accuracy. However, most manually-crafted approaches become inefficient when dealing with real-life “face-in-the-wild” problems. Manually-crafted features are often combined with Machine Learning techniques, for example LBP+SVM [1] or SIFT/HOG+SVM [2].
In contrast with most recent approaches, which evaluate ever deeper architectures or average multiple deep models, our approach classifies emotions using shallow CNN architectures and prior information. Our objective is to achieve robust, real-time face attribute (emotion, in the examples shown) detection in mobile browsers.
The presented architecture is trained and evaluated on the FER+ dataset [FER+], which was first released in 2013 and later improved with better annotations. The FER dataset contains about 30,000 samples, making it one of the most complete in-the-wild face emotion datasets available.
Note that for our training and testing phases, we excluded the traditional CK+ dataset [CK+], as accuracy scores on it have saturated. CK+ and other traditional datasets only provide very constrained images, such as frontal, posed expressions captured under controlled lab conditions.
However, the FER+ dataset is far from perfect and includes several biases and issues, such as low-resolution 48×48 grayscale images, noise in the original annotations, and a small number of non-face samples.
To reduce the impact of such biases, the dataset is preprocessed before training, notably by aligning the face crops, as sketched below.
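To make the alignment step concrete, here is a minimal sketch of a standard similarity-transform alignment driven by the two eye landmarks. The function name, the reference eye positions, and the output size are our own illustrative assumptions, not the exact preprocessing used in the paper.

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=64):
    """Rotate, scale and translate a face so the eyes land on fixed
    reference positions. `left_eye`/`right_eye` are the (x, y) centers
    of the image-left and image-right eyes (assumed landmark inputs)."""
    # Reference eye positions in the aligned crop (assumed values).
    ref_left = (0.30 * out_size, 0.40 * out_size)
    ref_right = (0.70 * out_size, 0.40 * out_size)

    # In-plane rotation angle and scale from the current eye segment.
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    scale = (ref_right[0] - ref_left[0]) / np.hypot(dx, dy)

    # Rotate/scale around the eye midpoint, then translate that
    # midpoint onto its reference position.
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    M[0, 2] += (ref_left[0] + ref_right[0]) / 2.0 - center[0]
    M[1, 2] += (ref_left[1] + ref_right[1]) / 2.0 - center[1]
    return cv2.warpAffine(image, M, (out_size, out_size))
```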
Instead of relying on ever deeper and averaged CNN architectures, we decided to rely on modern MobileNet architectures [MOBNET]. Such architectures replace standard convolutions with depthwise separable filters (a depthwise 3×3 convolution followed by a pointwise 1×1 convolution), thus optimizing computations at inference time (see illustration).
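For reference, such a block is only a few lines in a modern framework. The PyTorch sketch below shows the depthwise 3×3 + pointwise 1×1 decomposition; it illustrates the building block of [MOBNET], not the exact layer configuration of our network.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # groups=in_ch restricts each 3x3 filter to a single channel.
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # The 1x1 convolution recombines the channels.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```

Compared with a standard 3×3 convolution, this factorization cuts the multiply-accumulate count by a factor of roughly 1/out_ch + 1/9, which is what makes real-time mobile inference feasible.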
However, shallower architectures cannot compete with very deep VGG or ResNet-50 architectures. That is why supplemental information is fed to the network as prior knowledge, with the objective of approaching, as closely as possible, the results obtained with a deep architecture.
A supplemental channel is constructed and added to the RGB or grayscale input image. This channel encodes the shape of the face as depicted by its internal landmarks. We named this image the shape prior heatmap, as it contains a (rescaled) Gaussian peak response in the vicinity of each face landmark (see illustration).
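One possible way to render this channel is to splat a small Gaussian at every landmark and keep the strongest response per pixel. The following numpy sketch assumes landmarks in pixel coordinates and an arbitrary σ; both are illustrative choices, not the paper’s exact parameters.

```python
import numpy as np

def shape_prior_heatmap(landmarks, size=64, sigma=2.0):
    """Render a single-channel heatmap with a Gaussian peak centered
    on each face landmark. `landmarks` is an (N, 2) array of (x, y)
    pixel coordinates in the aligned crop."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float32)
    for x, y in landmarks:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest peak per pixel
    return heatmap  # rescale as needed before stacking with the image
```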
Our architecture for facial emotion detection is decomposed into two steps: first, face landmarks are localized (not described in this work; see www.deepar.ai for some results), then emotions are detected based on the landmark locations. The following illustration describes our complete architecture.
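The glue between the two stages then amounts to stacking the aligned face with its heatmap before feeding the classifier. A minimal sketch, reusing shape_prior_heatmap from the snippet above (emotion_cnn is a hypothetical stand-in for the second-stage network):

```python
import numpy as np

def build_input(gray_face, landmarks):
    """Stack an aligned grayscale face and its shape prior heatmap
    into a 2-channel input for the emotion classifier."""
    heatmap = shape_prior_heatmap(landmarks, size=gray_face.shape[0])
    face = gray_face.astype(np.float32) / 255.0
    return np.stack([face, heatmap], axis=0)  # shape: (2, H, W)

# Usage (hypothetical second-stage model):
# x = build_input(aligned_face, detected_landmarks)
# scores = emotion_cnn(x[None])  # batch of one
```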
Not surprisingly, the shape prior helps improve accuracy scores on the FER+ dataset, especially when inferring with shallow CNN architectures.
We have benchmarked our approach on different architectures. To make our application real-time on mobile browsers, an optimized implementation of the pointwise 1×1 and depthwise 3×3 convolutions has been developed, relying on the Emscripten toolchain to compile our native code to JavaScript. Even though our implementation could be further optimized (e.g., by taking advantage of SIMD instructions), our native Android application reached 300 fps on a Google Pixel 2 and nearly 100 fps on the same device using the Chrome web browser (refer to the last columns of Table 1 for more results).
Our work discusses existing datasets, their respective drawbacks, and how to prepare the data to improve the quality of the information passed to the learning process. It is shown how to transform the output of a facial feature detector into a face shape heatmap image, and how to combine the shape of a face with its appearance to learn modified CNN models. Using this approach, accuracy scores on the FER+ dataset are substantially improved.
The choice of a smooth loss (a Huber loss) evaluated on non-discrete label distributions gives our system the capability to interpolate between different emotion attributes. Our architecture and in-house implementation take advantage of efficient separable pointwise 1×1 and depthwise 3×3 convolutional filters. We were able to deploy a face tracker combined with a face emotion detector that runs in real time on mobile browser architectures.
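As an illustration, training against FER+’s crowd-sourced label distributions with a Huber loss could look like the following PyTorch sketch; the softmax-on-logits formulation and the delta value are our assumptions, not the paper’s exact training recipe.

```python
import torch
import torch.nn.functional as F

def emotion_loss(logits, target_dist, delta=1.0):
    """Huber (smooth) loss between the predicted emotion distribution
    and the non-discrete, crowd-sourced target distribution."""
    pred_dist = torch.softmax(logits, dim=1)
    return F.huber_loss(pred_dist, target_dist, delta=delta)
```

Because the targets are soft distributions rather than one-hot labels, the trained model can output blends of emotions, which is what enables the interpolation behavior described above.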
[1] Dynamic texture recognition using local binary patterns with an application to facial expressions — G. Zhao et al.
[2] Facial expression recognition and histograms of oriented gradients: a comprehensive study — P. Carcagnì et al.
[FER+] Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution — E. Barsoum et al.
[CK+] The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression — P. Lucey et al.
[SRCNN] Image Super-Resolution Using Deep Convolutional Networks — C. Dong, C. C. Loy, K. He, X. Tang
[MOBNET] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications — A. G. Howard et al.