A deep learning approach to detection and classification of face masks in a surveillance dataset

Frederik Westerhout
Jun 18, 2021 · 13 min read
Source: https://www.aclu.org/blog/privacy-technology/surveillance-technologies/army-robot-surveillance-guards-coming

In this blog, we propose a neural network architecture that can differentiate whether people are wearing face masks in a real-time setting. This can potentially be used to monitor individuals who are not wearing face masks in a crowd, which is useful for many applications, such as research on the contagiousness of the virus and for supervision purposes. We show how to practically train and test a neural network model on a non-standard, self-augmented dataset for image recognition in computer vision, covering the choice of network architecture, the processing and ordering of the data, and the implementation and results of the training process.

Since the arrival of COVID-19, the demand for face mask recognition has grown rapidly in areas like video surveillance and security camera footage, as wearing face masks may help reduce the transfer of the virus from one person to another [1]. Deep learning techniques have already shown promising results in classifying and detecting face masks on persons and faces in large datasets [2][3][4]. To gain further hands-on understanding of computer vision and deep learning, we chose to examine a real-time video dataset recorded in a university environment that has not yet been used on a large scale [5].

Figure 1 Sample from dataset showing bounding boxes with classes MW (green) and NM (red)

The dataset consists of a video, i.e. a sequence of images, recorded from a security camera perspective and showing multiple subjects walking by. The video is only a couple of minutes long and contains 4357 frames. Each frame is annotated with the coordinates of a bounding box and a state that is either mask-wearing (MW) or no-mask (NM). In addition, persons whose face cannot be seen, for example because they are walking away from the camera, are also classified as no-mask.

This dataset therefore contains all the information we need to examine both the object detection part and the classification part, both of which are interesting to look at. This blog post has the following structure:

  • Choice and explanation of network architecture
  • Dataset and data-processing
  • Training network and results
  • Discussion and conclusion

The preliminary goal of our research was to train a self-composed neural network pipeline on the chosen dataset. The data is transformed and augmented so that it fits the requirements of the networks. Regarding the neural network itself, we split the detection task and the classification task into two separate neural networks, so that each can be fine-tuned individually.

Proposed pipeline
Figure 2 Proposed pipeline

To narrow the scope of this blog, we decided to focus on the classification task, visible in the figure above as “Facemask classifier”. Different existing neural network backbones are implemented for classification and ranked on their performance. In this way, we aim to show the reader the practical steps to follow when using a non-standard dataset with different neural networks. We hope to give valuable insights into the process we followed and to pass on our lessons learned.

Network architecture

Looking at our chosen dataset of persons walking by with or without a face mask, we chose to divide the task the neural network has to perform into a detection part and a classification part. The intention of this division was to design a pipeline consisting of two neural networks, each optimized for detection and classification respectively. First, every person in the image needs to be localized; this is done by annotating every detected person with a bounding box. Secondly, a classification has to be carried out on every detected person to determine whether that person is mask-wearing (MW) or has no mask (NM). Our initial plan was to implement both the detection and the classification part to gain insight into both segments. However, our main goal was to build a network that could train on the self-processed data of the chosen dataset, and we learned from a previous project that it is better to start simple. So, to make sure we had enough time to work on the data itself and produce results, we decided to focus on the classification part of this two-stage approach.

For the classification task, the following networks were used and eventually compared; a code sketch of the custom ones follows the list:

  • Simple convolutional network with maxpooling (Simple Conv):
Figure 3 Model summary of Simple Conv
  • Simple convolutional network with batch-normalisation and ReLU (Conv+bn):
Figure 4 Model summary of Conv+bn
  • LeNet-5 as defined in [6]:
Figure 5 Model summary of LeNet-5
  • ResNet-50 as defined in [7]:
Figure 6 Concise model summary of ResNet-50 [7]
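For illustration, a minimal PyTorch sketch of the two custom networks could look like the snippet below. The filter counts and depths here are assumptions for readability, not the exact configurations from the model summaries above. LeNet-5 follows [6], and ResNet-50 can be taken directly from torchvision with its final layer replaced for our two classes.

```python
import torch.nn as nn
import torchvision.models as models

class SimpleConv(nn.Module):
    """Plain convolutions with max-pooling (illustrative layer sizes)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, num_classes))

    def forward(self, x):
        return self.head(self.features(x))

class ConvBn(nn.Module):
    """Same idea, but with batch normalisation before every ReLU."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, num_classes))

    def forward(self, x):
        return self.head(self.features(x))

# ResNet-50 from torchvision, with the final layer adapted to MW/NM:
resnet50 = models.resnet50(pretrained=False)
resnet50.fc = nn.Linear(resnet50.fc.in_features, 2)
```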

We used cross entropy loss as the loss function and stochastic gradient descent (SGD) with momentum=0.9 and a learning rate still to be determined as the optimizer.
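In PyTorch terms this amounts to the following sketch, where `model` stands for any of the four networks above:

```python
import torch.nn as nn
import torch.optim as optim

model = ConvBn()                           # or SimpleConv(), LeNet-5, ResNet-50
criterion = nn.CrossEntropyLoss()          # cross entropy loss
optimizer = optim.SGD(model.parameters(),
                      lr=0.01,             # placeholder; tuned further below
                      momentum=0.9)
```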

Data processing

As said before, we intend to design a pipeline in which the first step is an object detector that predicts the bounding box of a person, and the second step is a classifier that predicts whether the person inside the bounding box is wearing a face mask or not.

So for the classifier we need to create a dataset containing the image regions cropped to the bounding-box annotations. Per frame, the annotation, delivered as an XML element tree, has the form depicted below and has variable length depending on the number of persons detected in that frame.

Figure 7 Annotations in dataset

To process this data we first created a list of the frames we wanted to include in the set, and a Python dictionary containing all the annotations. We then wrote a loop that iterates over the files, retrieves the bounding-box coordinates of each detected person, and, depending on the classification NM or MW, assigns the crop to its designated dataset.

Figure 8 Data processing
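A condensed sketch of the same processing loop is shown below. The tag names assume a Pascal-VOC-style annotation tree and the file paths are hypothetical; Figure 8 shows the code we actually used.

```python
import xml.etree.ElementTree as ET
from PIL import Image

cropped_im_mask, cropped_im_no_mask = [], []

for frame_id in frame_ids:                       # frames selected for the set
    tree = ET.parse(f"annotations/{frame_id}.xml")
    image = Image.open(f"frames/{frame_id}.jpg")
    for obj in tree.getroot().iter("object"):    # one element per detected person
        label = obj.find("name").text            # "MW" or "NM"
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(box.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        crop = image.crop((xmin, ymin, xmax, ymax))
        (cropped_im_mask if label == "MW" else cropped_im_no_mask).append(crop)
```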

The cropped_im_no_mask and cropped_im_mask sets now contain images of persons without and with a mask, respectively, as depicted in the two rows below. One can imagine that, since persons whose mask is not visible are also labeled no-mask, the no-mask set is much larger. This is equalized by deleting a certain amount of images from the larger set. The sets are then split into a training set (80%) and a test set (20%) in chronological frame order; a sketch of both steps follows Figure 9. This ordering is necessary to prevent consecutive frames, which are highly similar, from ending up in both the train and the test set.

Figure 9 Samples from dataset, above MW under NM
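Continuing the sketch above, the balancing and the chronological split could look like this. How exactly the surplus images are deleted is an assumption; we only require that chronological order is preserved:

```python
# Balance the classes by evenly discarding crops from the larger (no-mask) set,
# keeping the remaining crops in chronological order.
n = min(len(cropped_im_mask), len(cropped_im_no_mask))
step = len(cropped_im_no_mask) / n
cropped_im_no_mask = [cropped_im_no_mask[int(i * step)] for i in range(n)]

# 80/20 split in chronological frame order, so that highly similar
# consecutive frames never straddle the train/test boundary.
split = int(0.8 * n)
train_set = [(img, 1) for img in cropped_im_mask[:split]] + \
            [(img, 0) for img in cropped_im_no_mask[:split]]
test_set  = [(img, 1) for img in cropped_im_mask[split:]] + \
            [(img, 0) for img in cropped_im_no_mask[split:]]
```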

When the results of the first training cycle with a simple convolutional network came out (depicted below; pink: train accuracy, blue: test accuracy), we concluded that the network was overfitting. One reason for this overfitting is that the dataset consists of many sequential frames from a video, so consecutive frames yield very similar cropped images, which makes the network prone to overfit. Another reason lies in the nature of the images: we want to classify whether a person is wearing a mask, but instead of cropping around the facial region, where you would expect the face mask, we feed images of the whole person, which makes it possible for the network to focus on patterns other than the face mask itself. This is why we implemented some data augmentation.

Figure 10 Train accuracy (pink) and test accuracy (blue)

For the data augmentation we first tackled the occurrence of many similar consecutive frames by extracting not every frame but every i-th frame, with i in [20, 40, 60]; a code sketch follows the figures. The results are shown in Figures 11 and 12. This resulted in less overfitting for i=20 and i=40 and enabled the networks to achieve an accuracy between 70 and 80 percent. However, the network remained untrainable on the dataset with frames extracted at every 60th frame. For the remainder of the project we chose to work with the dataset extracted at every 40th frame, which we call 40th_dataset from now on.

Figure 11: Accuracy of Conv+bn (lr=0.01, momentum=0.9, epochs=40, batch_size=64)
Figure 12: Loss of Conv+bn (lr=0.01, momentum=0.9, epochs=40, batch_size=64)
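In the processing sketch above, this subsampling is a one-line change to the frame selection:

```python
# all_frame_ids is the full chronological list of frame ids;
# keep only every i-th frame to thin out near-duplicate consecutive crops.
i = 40                          # we compared i = 20, 40 and 60
frame_ids = all_frame_ids[::i]
```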

As the 40th_dataset only contains a total of about 400 instances, we implemented data augmentation in the form of mirroring every instance along the vertical axis (see the sketch after Figures 13 and 14). Since the dataset now contains twice as many instances and more variety, we expected to see less overfitting. Unfortunately, the data augmentation did not result in an accuracy boost, as can be seen in Figures 13 and 14. We therefore continue with the plain dataset for the remainder of the project.

Figure 13: Accuracy of Conv+bn (lr=0.01, momentum=0.9, epochs=40, batch_size=64)
Figure 14: Loss of Conv+bn (lr=0.01, momentum=0.9, epochs=40, batch_size=64)
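The flip augmentation itself is a one-liner on the training set from the earlier sketch; PIL's ImageOps.mirror flips an image left to right, i.e. along its vertical axis:

```python
from PIL import ImageOps

# Double the training set by mirroring every crop along the vertical axis;
# the label is unchanged by the flip.
train_set += [(ImageOps.mirror(img), label) for img, label in train_set]
```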

Next we compare different values of batch_size: [4, 8, 32, 64]. The results are shown below in Figures 15 and 16; from these we concluded that a batch_size of 8, 32 or 64 suffices for our network. We continue with a batch_size of 8 for the remainder of the project.

Figure 15: Accuracy of Conv+bn (lr=0.01, momentum=0.9, epochs=40)
Figure 16: Loss of Conv+bn (lr=0.01, momentum=0.9, epochs=40)

Next we look for the right learning rate, or lr for short, for our network. We compared lr values of [0.01, 0.005, 0.001]; the results are shown below in Figures 17 and 18. From these results we conclude that the learning rate does not affect the learning curve much in our case. As lr=0.001 results in a smoother learning curve, we stick to this value for the remainder of the project.

Figure 17: Accuracy of Conv+bn (momentum=0.9, epochs=40, batch_size=8)
Figure 18: Loss of Conv+bn (momentum=0.9, epochs=40, batch_size=8)
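All batch-size and learning-rate comparisons above follow the same recipe: retrain Conv+bn from scratch with one hyperparameter varied. Below is a minimal sketch of such a sweep, assuming the datasets yield (image tensor, label) pairs and reusing the `ConvBn` sketch from earlier:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

def accuracy(model, loader):
    """Fraction of correctly classified samples."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

def run(model, train_ds, test_ds, lr, batch_size, epochs=40):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: train acc {accuracy(model, train_loader):.3f}, "
              f"test acc {accuracy(model, test_loader):.3f}")

# Vary one hyperparameter per sweep, retraining from scratch each run:
for lr in [0.01, 0.005, 0.001]:
    run(ConvBn(), train_ds, test_ds, lr=lr, batch_size=8)
```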

We now compare this network, with the found hyperparameters of batch_size=8 and learning rate 0.001 on the plain dataset extracted at every 40th frame, against the “Simple Conv” network, LeNet-5 and ResNet-50. The results are shown in Figures 19 and 20. We conclude that none of the networks converge towards a definite value. In addition, the more complex networks, ResNet-50 and “Conv+bn”, result in a higher accuracy than the simpler networks “Simple Conv” and LeNet-5. This may be due to the complexity of the dataset: the region of interest, namely the face mask, is quite small compared to the remaining image, so large parts of the image contain useless information.

Figure 19: Accuracy of the different networks (lr=0.001, momentum=0.9, epochs=80, batch_size=8)
Figure 20: Loss of the different networks (lr=0.001, momentum=0.9, epochs=80, batch_size=8)

From this we conclude that ResNet-50 is eventually the better choice, although none of the networks converge towards one value.

Discussion

Looking at the results of the different networks we tested, we were pleasantly surprised that the accuracy improved and the loss decreased during training, as this was our first successful implementation of a classifier network on a non-standard dataset. This means the models we used were somewhat suited to optimizing training and test results, which is a good takeaway. However, there are also results we did not expect, which leaves room for improving the setup for building such networks when linking them to a particular dataset.

A first point to make is that one of our main goals was to produce a working classifying network that could train successfully and demonstrate this on the test set. In pursuing this goal we started with one network, “Conv+bn”, and tuned the hyperparameters so that this network performed optimally. When we then compared it to the other networks, the comparison was not equally weighted, since only “Conv+bn” was optimized: the same hyperparameters were reused for the other networks, which by definition gives an unfair advantage to the first network we used.

Secondly, the training graphs show a big gap between the training and the testing curve, which normally implies that the model is overfitting on the dataset. We tried to reduce this gap by subsampling frames from the video to lessen the effect of training on very similar images. This data-processing step helped somewhat, but a considerable gap between the training and testing curves remains, so the model still seems to overfit. A possible explanation is that the model is not training on the right features. In the current setup it is easy for the model to miss the features that matter for face mask detection, as the bounding boxes are placed around a person instead of around a face. Also, the model still trains on many very similar images, because the video contains only a couple of minutes of footage; the dataset may simply not be large enough to train the classifier effectively. Lastly, since pedestrians whose face mask is not detectable due to occlusion are classified as not wearing a face mask, those occlusion features may well be learned as decisive for the no-mask class. This should be prevented, as occlusion says nothing about the real mask-wearing state.

Therefore, we have the following recommendations for potential future work:

  • First, for future research in detecting and recognizing face masks, we believe it is better to focus on face detection instead of person detection. This is more likely to train the network on the right features, such as the occlusion of face parts by the mask, rather than on the relatively few differing pixels in a full-person crop. In the current configuration the network is prone to train on irrelevant features of the whole person, like shading and brightness levels.
  • Secondly, it would be interesting to examine whether using an image sequence, as opposed to separate unrelated images (for example, static images of persons wearing or not wearing face masks), matters for the choice of architecture. We expect a substantial effect, because many images in a sequence are very similar, so it is uncertain whether the distribution is well proportioned and whether the network trains on the right features.
  • Furthermore, we recommend introducing a third class, “non-detectable”, assigned to every data instance for which it cannot be determined whether a face mask is worn due to occlusion.
  • Lastly, we encourage future work to optimize hyperparameters on a validation set rather than towards the test set, as the latter can yield hyperparameters that do not generalize to unseen data.

Conclusion

Looking back on this project, we conclude that we successfully implemented a classifier network within the proposed pipeline on a self-augmented, rarely used dataset [5]. Of the networks we compared, ResNet-50 performed best among the chosen models, but there is still room for improvement.

We have learned a great deal about data processing, implementing a classifier, detection techniques, and the practicalities of the training process and its optimization. We hope to have provided some guidance and clarity for doing a deep learning computer vision project yourself, and that our endeavours make your life easier when you take on a similar project.

References

  • [1] Hussain, G. J., Priya, R., Rajarajeswari, S., Prasanth, P., & Niyazuddeen, N. (2021). The Face Mask Detection Technology for Image Analysis in the Covid-19 Surveillance System. Journal of Physics: Conference Series, 1916(1), 012084. IOP Publishing.
  • [2] Militante, S. V., & Dionisio, N. V. (2020). Real-Time Facemask Recognition with Alarm System using Deep Learning. In 2020 11th IEEE Control and System Graduate Research Colloquium (ICSGRC) (pp. 106–110). IEEE.
  • [3] Jiang, X., Xiang, F., Lv, M., Wang, W., Zhang, Z., & Yu, Y. (2021). YOLOv3_slim for face mask recognition. Journal of Physics: Conference Series, 1771(1), 012002. IOP Publishing.
  • [4] Loey, M., Manogaran, G., Taha, M. H. N., & Khalifa, N. E. M. (2021). A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement, 167, 108288.
  • [5] Nawaz, F., Khan, W., Yasen, S., & Hussain, A. (2020). Face Mask Detection Video Dataset. Mendeley Data, V1. doi: 10.17632/v3kry8gb59.1
  • [6] Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. doi: 10.1109/5.726791
  • [7] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
