Structured Output SVM Prediction of Apparent Age, Gender and Smile From Deep Features

CMP+ETH team:
Michal Uricar1, Radu Timofte2, Rasmus Rothe2, Jiri Matas1, Luc Van Gool2
1Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
2Computer Vision Lab, D-ITET, ETH Zurich
ChaLearn LAP and FotW Challenge and Workshop @ CVPR2016
3rd place of Track 1: Apparent Age Estimation


We propose a structured output SVM for predicting apparent age, gender and smile from a single face image represented by deep features. We pose apparent age estimation as an instance of a multi-class structured output SVM classifier followed by a softmax expected value refinement. Gender and smile prediction are treated as binary classification problems. The proposed solution first detects the face in the image and then extracts deep features from the image cropped around the detected face. We use a convolutional neural network with the VGG-16 architecture [1] for learning deep features. The network is pretrained on the ImageNet [2] database and then fine-tuned on the IMDB-WIKI [3] and ChaLearn 2015 LAP [4] datasets. We validate our methods on the ChaLearn 2016 LAP dataset [5]. Our structured output SVMs are trained solely on ChaLearn 2016 LAP data [5]. We achieve excellent results for apparent age prediction as well as for gender and smile classification.
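The softmax expected value refinement mentioned above can be illustrated with a minimal sketch. It assumes the classifier yields one real-valued score per discretized age; the age range 0..100 and the toy scores are hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(scores, ages):
    """Probability-weighted mean of the discretized ages."""
    probs = softmax(scores)
    return sum(p * a for p, a in zip(probs, ages))


# Hypothetical setup: one class per integer age from 0 to 100.
ages = list(range(0, 101))
# Toy scores that peak between the classes 30 and 31 (assumption).
scores = [-abs(a - 30.4) for a in ages]

# Hard decision: the single best-scoring class.
hard = ages[max(range(len(ages)), key=scores.__getitem__)]
# Refined decision: expectation under the softmax distribution.
soft = expected_age(scores, ages)
print(hard, round(soft, 2))
```

The hard decision is restricted to the discrete age classes, whereas the expectation can fall between them, which is one plausible reason to apply such a refinement after a multi-class classifier.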

Overview of the solution

An overview of our apparent age estimation pipeline is depicted in Figure 1. Our solution builds on the winning solution [3] of the first edition of the ChaLearn LAP competition on Apparent Age Estimation [4]: the VGG-16 CNN architecture [1], pre-trained on ImageNet and subsequently fine-tuned on IMDB-WIKI [3] (annotated with the real, biological age) and on ChaLearn LAP 2015 [4], is used for feature extraction.

On top of these deep features, we build a Structured Output SVM (SO-SVM) classifier, which is formulated as a multi-class classifier. Each class corresponds to one discretized apparent age. Faces are detected with the detector of [6].
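A multi-class linear classifier of this kind can be sketched as follows. This is only an illustration under stated assumptions: one weight vector per age class, prediction by the highest linear score, and a margin-rescaled structured hinge loss with the absolute age difference as the label loss. The page does not specify the loss function or the training procedure, so `abs_age_error`, the weights `W` and the feature vector `x` are hypothetical.

```python
def class_scores(W, x):
    """Linear score <w_y, x> for every class y (one weight vector per class)."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in W]


def predict(W, x):
    """Index of the class with the highest linear score."""
    s = class_scores(W, x)
    return max(range(len(s)), key=s.__getitem__)


def abs_age_error(y_true, y_pred):
    """Assumed label loss: absolute difference between age classes."""
    return abs(y_true - y_pred)


def structured_hinge(W, x, y, delta=abs_age_error):
    """Margin-rescaled structured hinge loss for one example (x, y)."""
    s = class_scores(W, x)
    # Loss-augmented inference: most violating class under delta + score.
    worst = max(delta(y, yp) + s[yp] for yp in range(len(s)))
    return worst - s[y]


# Toy example: 3 age classes, 2-dimensional features (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.2, 1.0]

y_hat = predict(W, x)
loss_if_true_is_1 = structured_hinge(W, x, 1)
loss_if_true_is_2 = structured_hinge(W, x, 2)
print(y_hat, loss_if_true_is_1, loss_if_true_is_2)
```

With a distance-based label loss, confusing an age with a distant one is penalized more than confusing it with a neighbouring one, which an ordinary 0/1 multi-class loss would not capture.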

Figure 1: Pipeline overview.

For more details, please consult our paper:

title = {Structured Output {SVM} Prediction of Apparent Age, Gender and Smile From Deep Features},
author = {Michal U\v{r}i\v{c}\'{a}\v{r} and Radu Timofte and Rasmus Rothe and Ji\v{r}\'{i} Matas and Luc Van Gool},
booktitle = {Proceedings of IEEE conference on Computer Vision and Pattern Recognition Workshops},
address = {Las Vegas, USA},
year = {2016}


  1. K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014.
  2. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al. ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, 2015.
  3. R. Rothe, R. Timofte, L. Van Gool. DEX: Deep EXpectation of apparent age from a single image, International Conference on Computer Vision ChaLearn Looking at People Workshop, 2015.
  4. S. Escalera, J. Fabian, P. Pardo, X. Baro, J. Gonzalez et al. ChaLearn Looking at People 2015: Apparent age and cultural event recognition datasets and results, International Conference on Computer Vision Workshops, 2015.
  5. S. Escalera, M. Torres, B. Martinez, X. Baro, H. J. Escalante et al. ChaLearn Looking at People and Faces of the World: Face Analysis Workshop and Challenge 2016, Computer Vision and Pattern Recognition Workshop, 2016.
  6. M. Mathias, R. Benenson, M. Pedersoli, L. Van Gool. Face Detection Without Bells and Whistles, European Conference on Computer Vision, 2014.