Head pose estimation without keypoints in MXNET/GLUON

By the Blueprint Team

In retail, sales data is commonly used to identify hot products in stores for marketing. For instance, products that sell well in established stores are usually marketed heavily in new stores. With the recent success of deep learning, especially Convolutional Neural Networks (CNNs) in computer vision, companies have been combining insights extracted from images with sales data to refine marketing strategies. In this post, we discuss the particulars of deriving such data from head pose estimation with Euler angles in MXNET/GLUON.

Head pose is currently estimated from an image in one of two ways: with or without facial keypoints such as the eyes, ears, and nose. The accuracy of the keypoint approach depends on a correct generic 3D body model, which is usually difficult to obtain. The no-keypoints approach avoids this depth complexity and learns directly from 2D images using multiple losses. For instance, Ruiz et al., 2018 and Shao et al., 2019 developed no-keypoints models in PyTorch and TensorFlow, respectively, and their models outperform traditional facial landmark algorithms on several widely used data sets.

We adopted the no-keypoints approach and reimplemented the algorithm of Ruiz et al., 2018 in MXNET/GLUON (hereafter gazenet). Compared to other deep learning platforms, MXNET (with the GLUON API) provides the same simplicity and flexibility as PyTorch, but also allows data scientists to hybridize their networks to leverage the performance optimizations of a symbolic graph. Moreover, MXNET/GLUON does not require the input size of each layer to be specified, lets activation functions be specified directly in the fully connected and convolutional layers, and can create a name scope that attaches a unique name to each layer. Finally, its scalability and stability attract many retail companies to the MXNET/GLUON platform for their product deployment.

The gazenet algorithm takes in 3-channel (RGB) images and outputs a three-element vector describing a person's gazing direction, that is, yaw, roll, and pitch (as illustrated below). The bounding box of that person's face is provided by a face detector that we modified and trained based on the paper of Najibi et al., 2017. Given the bounding box of a face, gazenet can detect that person's gazing direction even when the person is looking sideways or when the video or images are in relatively low resolution, conditions that make facial landmarks hard to detect.
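To make the three angles concrete, here is a minimal sketch that turns yaw, pitch, and roll into a 3D facing direction. The axis convention is an assumption made for illustration (pitch about x, yaw about y, roll about z, a forward axis of +z, and rotations composed as Ry·Rx·Rz), as are the function names; gazenet's own sign and axis conventions may differ.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    """Build R = Ry(yaw) * Rx(pitch) * Rz(roll) (assumed convention)."""
    y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    rx = [[1, 0, 0],
          [0, math.cos(p), -math.sin(p)],
          [0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0, math.sin(y)],
          [0, 1, 0],
          [-math.sin(y), 0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0],
          [math.sin(r), math.cos(r), 0],
          [0, 0, 1]]
    return matmul(ry, matmul(rx, rz))


def gaze_direction(yaw_deg, pitch_deg, roll_deg):
    """Rotate the forward axis (0, 0, 1); this is the third column of R."""
    rot = rotation_matrix(yaw_deg, pitch_deg, roll_deg)
    return tuple(row[2] for row in rot)


if __name__ == "__main__":
    print(gaze_direction(30.0, -20.0, 10.0))
```

Under this assumed convention, roll is applied about the forward axis itself, so it tilts the head without changing where the face points; only yaw and pitch move the facing direction.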

Similar to Ruiz et al., 2018 and Shao et al., 2019, gazenet employs a pre-trained ResNet50 (He et al., 2015) architecture followed by a fully connected layer. A softmax function is then used to derive the class scores, and multiple loss functions are used to both classify and regress each angle. Its architecture is illustrated below. Gazenet achieves performance comparable to Ruiz et al., 2018 on the public AFLW2000 data set, with average errors of approximately 6.5 degrees for yaw, roll, and pitch. Its open-source MXNET/GLUON implementation is here: https://github.com/Cjiangbpcs/gazenet_mxJiang/blob/master/README.md. We also adopted gazenet in our video analytics product, Reflect.
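The way classification and regression can be combined for a single angle is sketched below in plain Python, without any deep learning framework. It assumes the binning used in the public implementation of Ruiz et al., 2018 (66 bins of 3 degrees starting at -99 degrees) and a weight `alpha` on the regression term; whether gazenet uses the same values is an assumption, and the function names are hypothetical.

```python
import math

NUM_BINS = 66      # assumed number of angle classes
BIN_WIDTH = 3.0    # assumed bin width in degrees
MIN_ANGLE = -99.0  # assumed lower edge of the first bin


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def angle_to_bin(angle_deg):
    """Map a continuous angle to its class index, clamped to valid bins."""
    idx = int((angle_deg - MIN_ANGLE) // BIN_WIDTH)
    return min(max(idx, 0), NUM_BINS - 1)


def expected_angle(probs):
    """Continuous angle as the probability-weighted bin index, in degrees."""
    return sum(p * i for i, p in enumerate(probs)) * BIN_WIDTH + MIN_ANGLE


def multi_loss(logits, true_angle_deg, alpha=1.0):
    """Cross-entropy on the bin plus alpha times squared error on the angle."""
    probs = softmax(logits)
    # Floor the probability to avoid log(0) for extremely confident errors.
    ce = -math.log(max(probs[angle_to_bin(true_angle_deg)], 1e-12))
    mse = (expected_angle(probs) - true_angle_deg) ** 2
    return ce + alpha * mse


if __name__ == "__main__":
    scores = [0.0] * NUM_BINS
    scores[angle_to_bin(30.0)] = 5.0  # a moderately confident prediction
    print(expected_angle(softmax(scores)), multi_loss(scores, 30.0))
```

In a full model, the same loss would be computed separately for yaw, pitch, and roll, each with its own fully connected output, and the three losses would be summed. The classification term gives a coarse, stable signal, while the regression term on the expected angle refines the prediction within and across bins.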

The estimated Euler angles can be further aggregated to provide insights on promising new products and to drive marketing programs. Together with traditional sales data, this new data can give retail stores valuable information for designing experiments and driving meaningful business impact.
