
2018 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT)

Machine Learning in Video Surveillance for Fall Detection
Lesya Anishchenko
Bauman Moscow State Technical University
Moscow, Russia
anishchenko@rslab.ru

Abstract—The present paper considers the use of deep learning and transfer learning techniques in fall detection by means of surveillance camera data processing. As a dataset, an open dataset gathered by the Laboratory of Electronics and Imaging of the National Center for Scientific Research in Chalon-sur-Saone was used. The architecture of the CNN AlexNet, which was used as a starting point for the classifier, was adapted to solve the fall detection problem. The proposed method was tested on a dataset of 30 records containing a single fall episode each. We achieved Cohen's kappa of 0.93 and 0.60 for the fall – non-fall classification under surrounding conditions known and unknown to the classifier, respectively.

Index Terms—deep learning, machine learning, intelligent video analysis, fall detection
I. INTRODUCTION

At present, surveillance cameras are widely used. They are utilized to identify suspicious persons, as well as persons in a state of alcohol or drug intoxication, which may be dangerous to the life and health of others. The developers of closed-circuit television (CCTV) systems offer complexes consisting of IP cameras for the automated detection of suspicious persons. These systems can be based on biometric identification (for example, the NeoFace system or the R7 smart glasses) [1], [2], as well as on the recognition of emotions or facial expressions (for example, DeepFace) [3]–[6]. The main disadvantage of this type of system is that information about a possible trespasser may be absent from the database used for biometric identification.

It should be noted that in the majority of cases, data from CCTV cameras are used only for creating archives of video records, and the possibilities of intelligent video analysis of data from remote objects are practically not used. This situation is caused, in particular, by the technical difficulties associated with the use of intelligent video surveillance systems in practice. For example, classification algorithms are usually extremely sensitive to lighting conditions [7].

As a specific problem of intelligent video analysis, we can define the problem of detecting an abnormal situation when a person is in danger and there are no passers-by around who can help (for example, a person has fallen and cannot get up or call for help). If such a situation takes place in the cold season, it is essential to provide first aid as soon as possible in order to prevent the negative consequences associated with hypothermia. Therefore, an up-to-date task is to design technical means that will detect an emergency situation in an automated mode and inform the operator if help is needed.

In the present work, the problem of developing such a system for recognizing human movements through the records of CCTV cameras and detecting falls has been considered. The fall of a person can be caused by many factors: loss of balance due to a lack of cerebral blood supply, muscle weakness, etc. In any case, a situation when a person has fallen and cannot get up without assistance is dangerous and requires an immediate response.

The problem of detecting falls using video analysis has been studied in a large number of works [8], [9], which were based on analyzing the shape and position of the person in the frame, gradients in the vertical and horizontal directions, and changes in images in the time domain.

In the majority of papers on the analysis of CCTV camera records, the effectiveness of the fall event classifiers is artificially overestimated because of the limitations of the datasets used for the testing and training procedures. These limitations may be described as follows:
• The dataset is usually formed from data recorded in unchanged conditions (most often in laboratory conditions rather than realistic ones) and with uniform illumination of the entire analyzed scene.
• One and the same person acts as the test subject.
• Movement artifacts of the "fall" type are similar in the preceding actions of the subject and in her/his relative position toward the camera at the time of the fall.
• In almost all cases, falls are performed on a cushioning mat, which most often has a significant color contrast with respect to the clothing of the subject.

All these factors result in a significant overestimation of the classifiers presented in papers based on the analysis of datasets with the above-listed disadvantages.

The purpose of this paper was to evaluate the applicability of deep learning and transfer learning techniques in the automated detection of falls by the analysis of surveillance camera data gathered in realistic conditions.

The research was supported by the grant of the Russian Foundation for Basic Research (17-20-03034).


II. METHODS

As is known, the deep learning technique is based on the use of convolutional neural networks (CNN) [10], which are biologically inspired variants of multilayer perceptrons. The main drawback of this technique is that its training requires a huge amount of data (more than 100,000 examples) for successful implementation in practice. As a result, the training process can last for weeks or months, which is not always acceptable.

Transfer learning [11] is a technique which allows overcoming the disadvantages of deep learning. It may be described as follows: to solve a new problem, the researcher selects a CNN already trained for another, similar problem. That allows transferring knowledge obtained as a result of solving one task (for example, recognition of images of various animal species) to solve another (for example, recognition of interior objects in an image). This approach allows significantly reducing the time required for CNN training on a new dataset compared to training a neural network from scratch (i.e., initializing the neural network connections with random weights).

In the present paper the transfer learning procedure was done utilizing MATLAB R2017b. We used the pre-trained CNN AlexNet [12], which is available by installing the Neural Network Toolbox Model for AlexNet Network support package. This CNN was trained by its authors on 1.2 million images to classify 1000 different classes and in 2012 significantly outperformed all earlier CNN versions by utilizing more filters per layer than the previously proposed LeNet-5 [13], which was a pioneering CNN designed by Yann LeCun in 1998.

In order to use this network to recognize only two classes of images (the fall and non-fall classes), we need to make changes to the AlexNet CNN architecture. Namely, the 23rd layer is replaced with a fully connected layer with two outputs, since there are two classes, and the 25th (classification output) layer is replaced in accordance with the class names "fall" and "non-fall". After making the appropriate changes, the network architecture has the appearance shown in Fig. 1; the layers that differ from the original AlexNet are underlined. After the CNN is designed, a supervised training procedure should be carried out using a dataset with images of both classes (fall and non-fall).

Fig. 1. CNN architecture.
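The layer replacement described above can be written as a short MATLAB sketch following the standard Neural Network Toolbox transfer learning workflow; the layer names used below are illustrative assumptions and are not taken from the paper.

% Load the pre-trained AlexNet (requires the AlexNet support package).
net = alexnet;
layers = net.Layers;

% Replace the 23rd layer (fully connected, 1000 outputs) with a
% two-output fully connected layer for the fall / non-fall classes.
layers(23) = fullyConnectedLayer(2, 'Name', 'fc_fall');

% Replace the 25th (classification output) layer; the class names are
% taken from the training labels when the network is trained.
layers(25) = classificationLayer('Name', 'fall_output');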
Transfer learning [11] is the technique which allows over-
coming disadvantages of deep learning. It may be described III. E XPERIMENTAL DATA
as following: the researcher selects for solving a new problem In present work, an open database of video recordings
CNN already trained for another similar problem. That allows provided by the Laboratory of Electronics and Imaging of the
transferring knowledge obtained as a result of solving one task National Center for Scientific Research in Chalon-sur-Saone
(for example, recognition of images of various animal species) was used [14]. In Fig. 2 and 3 examples of frames extracted
to solve another (for example, recognition of interior objects from the video records both corresponding to fall and non-fall
on the image). This approach allows reducing significantly the episodes are presented. The merits of this database include the
time required for CNN training on a new dataset comparing following factors:
to training a neural network from scratch (when initializing • Video signals are recorded for various environmental
the neural network connection with random weights). conditions.
In present paper transfer learning procedure was done utiliz- • The illumination of the experimental scene is irregular.
ing MATLAB2017b. We used pre-trained CNN AlexNet [12] The dataset contains records for which contrast of the
which is available by installing Neural Network ToolboxTM human over the background objects is rather low due to
Model for AlexNet Network support package. This CNN was the limits of the camera dynamic range and presence of
trained by its authors on 1.2 millions of images to classify regions with high brightness (for example, window area
1000 different classes and significantly outperformed in 2012 in Fig. 3).
all earlier CNN versions by utilizing more filters per layer that • 4 different subjects (3 men and 1 woman) participated in
previously proposed LeNet-5 [13], which was a pioneering the experiments.
CNN designed by Yann LeCunn in 1998. • Falls were performed at different viewing angles, both
In order to use this network to recognize only two classes of from the standing position and from the sitting position.

Fig. 2. Frame classified as a fall episode.
Fig. 3. Frame classified as a non-fall episode.

30 records from the dataset [14] were used in this paper to train and test the classifier. All records contain at least one fall episode. For each record, an operator visually detected the fall episodes and provided the numbers of the frames corresponding to the start and the end of each fall episode. The duration of the fall episodes was estimated as 22±9 frames (mean±SD). Taking into account the sampling frequency of 25 frames/s, the duration of fall artifacts was 0.7±0.3 s.

IV. RESULTS AND DISCUSSION

Each record from the dataset [14] was divided into separate frames with a known class (fall or non-fall), which were used as training data for the classifier. After that, each frame was re-sampled to make it compatible with the AlexNet CNN, namely to 227 by 227 pixels. Thereupon the dataset was randomly divided into training and testing sets in the ratio 80:20.
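A minimal MATLAB sketch of this preparation step is given below; the folder layout (frames sorted into fall and non-fall subfolders) and the variable names are assumptions made for illustration, not details taken from the paper.

% Assumed layout: extracted frames stored in frames/fall and frames/non-fall.
imds = imageDatastore('frames', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

% Random 80:20 split into training and testing sets.
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized');

% Re-sample every frame to the AlexNet input size (227-by-227 RGB).
readAndResize = @(f) imresize(imread(f), [227 227]);
imdsTrain.ReadFcn = readAndResize;
imdsTest.ReadFcn  = readAndResize;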

As the transfer learning technique was used to train the CNN, the parameters of the input layer remained unchanged. That allows using the values of the connection weights available after CNN pre-training. Moreover, the initial learning rate parameter was set equal to 0.001 to keep the changes of the initial weight values small. As is known, an epoch is a full training cycle over the whole training dataset. The maximum number of training epochs was set to 20, and the batch size to 64. The classifier performance during the training process was evaluated on the validation dataset.
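These settings correspond to a training call of roughly the following form; the choice of the SGDM solver and the use of the held-out subset as validation data are assumptions, since the paper does not state them explicitly.

% Training options reported above: initial learning rate 0.001,
% 20 epochs, mini-batch size 64; solver and validation set are assumed.
options = trainingOptions('sgdm', ...
    'InitialLearnRate', 0.001, ...
    'MaxEpochs', 20, ...
    'MiniBatchSize', 64, ...
    'ValidationData', imdsTest, ...
    'Plots', 'training-progress');

% Fine-tune the modified AlexNet on the fall / non-fall frames.
trainedNet = trainNetwork(imdsTrain, layers, options);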

As a result of the CNN training, we obtained a graph of the classification accuracy increasing with each iteration of the training (Fig. 4), while the classification error for a batch became smaller with each iteration, as shown in Fig. 5.

Fig. 4. Classification accuracy.
Fig. 5. Classification error.

After completing the training process, the efficiency of the neural network was evaluated on the test sample. The results of the classification in the form of a confusion matrix are shown in Fig. 6.

Fig. 6. Confusion matrix (known surrounding conditions).

Since the number of images corresponding to the fall class is more than ten times smaller than the number corresponding to the non-fall class (88 and 1345, respectively), the present dataset is an unbalanced one. Therefore, the accuracy of the classification (in our case 99.23%) overestimates the efficiency of the trained classifier. In such a case, it is also useful to use the sensitivity, specificity, precision (positive predictive value) and negative predictive value, as well as measures of inter-rater agreement (Cohen's kappa [15]), as additional estimates of the classifier's performance. For the trained neural network these parameters are given below:
• Cohen's kappa = 0.93
• Accuracy = 0.99
• Sensitivity = 0.93
• Specificity = 0.99
• Positive predictive value = 0.94
• Negative predictive value = 0.99
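All of the measures listed above can be computed directly from the 2×2 confusion matrix; the MATLAB sketch below illustrates the calculation, treating "fall" as the positive class (the class ordering and variable names are assumptions).

% Predicted labels on the test set and the resulting confusion matrix.
predLabels = classify(trainedNet, imdsTest);
C = confusionmat(imdsTest.Labels, predLabels);  % rows: true, columns: predicted

% Assumed ordering: first class is "fall" (positive), second is "non-fall".
TP = C(1,1); FN = C(1,2); FP = C(2,1); TN = C(2,2);
N  = sum(C(:));

sensitivity = TP / (TP + FN);
specificity = TN / (TN + FP);
ppv = TP / (TP + FP);            % positive predictive value (precision)
npv = TN / (TN + FN);            % negative predictive value
accuracy = (TP + TN) / N;

% Cohen's kappa: observed agreement corrected for chance agreement.
po = accuracy;
pe = ((TP + FN)*(TP + FP) + (FP + TN)*(FN + TN)) / N^2;
kappa = (po - pe) / (1 - pe);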


We have also used the trained classifier for new surrounding conditions (another 30-record subset of the dataset [14]) to understand whether the CNN will need re-training in case the environmental conditions change. When the trained classifier is tested on this new dataset, the classification results are significantly lower. Although the accuracy of the classification is still high (0.98), Cohen's kappa is 0.60. The confusion matrix is shown in Fig. 7; all estimates of the classifier performance for unknown conditions are given below:
• Cohen's kappa = 0.60
• Accuracy = 0.99
• Sensitivity = 0.61
• Specificity = 0.99
• Positive predictive value = 0.61
• Negative predictive value = 0.99

Fig. 7. Confusion matrix (unknown surrounding conditions).

Such differences in the effectiveness of the classification may be due to over-training of the classifier, or to the fact that the training sample was not representative; enriching the dataset should help to solve this issue.

Also, it should be noted that the classification was carried out for each frame regardless of the classes of the antecedent and consequent frames. In other words, the relative position of the frames in time was not taken into account. Using this information and additional heuristics should improve the accuracy of the final classification by removing, for example, false positive errors (for instance, when the classifier detects a potential fall episode that lasts less than a certain a priori predetermined threshold).
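One possible form of such a heuristic is sketched below: runs of per-frame "fall" decisions shorter than a minimum duration are discarded. The function name and the threshold parameter are illustrative assumptions, not part of the published method.

% Hypothetical post-processing of per-frame decisions: suppress candidate
% fall episodes shorter than minFrames consecutive frames.
function cleaned = suppressShortFalls(isFall, minFrames)
    % isFall: logical vector of per-frame "fall" decisions.
    cleaned = isFall;
    d = diff([0; isFall(:); 0]);        % +1 at run starts, -1 after run ends
    runStarts = find(d == 1);
    runEnds   = find(d == -1) - 1;
    for k = 1:numel(runStarts)
        if runEnds(k) - runStarts(k) + 1 < minFrames
            cleaned(runStarts(k):runEnds(k)) = false;  % too short: discard as false positive
        end
    end
end

With the 25 frames/s sampling rate used here, a threshold of, say, 10 frames would correspond to discarding candidate falls shorter than 0.4 s; the actual value would have to be chosen from the observed distribution of fall durations.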
V. CONCLUSION

In this paper, we presented a feasibility study of the use of deep learning and transfer learning techniques in fall detection by means of surveillance camera data processing. The proposed method allows detecting fall events in video records. We achieved Cohen's kappa of 0.93 and 0.60 for the fall – non-fall classification under surrounding conditions known and unknown to the classifier, respectively.

Future activity will consider enriching the dataset. Moreover, we are planning to improve the classifier performance by using additional heuristics that consider the relative positions of the frames in time and knowledge about the typical duration of fall episodes.

The results might be used in creating new noncontact fall detectors for both outdoor and indoor applications.
REFERENCES

[1] www.necam.com/docs/?id=6c812b4d-2a12-40ed-9fea-fae81550c7aa
[2] www.osterhoutgroup.com/pub/static/version1515417478/frontend/Infortis/ultimo/en_US/pdf/R-7-TechSheet.pdf
[3] Taigman, Y., Yang, M., Ranzato, M., Wolf, L., DeepFace: closing the gap to human-level performance in face verification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[4] www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf
[5] Mohammadian, A., Aghaeinia, H., Towhidkhah, F., Video-based facial expression recognition by removing the style variations, IET Image Processing, 2015, vol. 9, no. 7, pp. 596–603.
[6] Iosifidis, A., Tefas, A., Pitas, I., Class-specific reference discriminant analysis with application in human behavior analysis, IEEE Transactions on Human-Machine Systems, 2015, vol. 45, no. 3, pp. 315–326.
[7] Rice, D., Evaluating camera performance in challenging lighting situations, 2014. www.sdmmag.com/articles/90525-evaluating-camera-performance-in-challenging-lighting-situations
[8] Rougier, C., Meunier, J., St-Arnaud, A., Rousseau, J., Fall detection from human shape and motion history using video surveillance, Proc. 21st Int. Conf. AINAW, 2007, vol. 2, pp. 875–880.
[9] Lee, T., Mihailidis, A., An intelligent emergency response system: preliminary development and testing of automated fall detection, J. Telemed. Telecare, 2005, vol. 11, no. 4, pp. 194–198.
[10] https://en.wikipedia.org/wiki/Convolutional_neural_network
[11] https://en.wikipedia.org/wiki/Transfer_learning
[12] Krizhevsky, A., Sutskever, I., Hinton, G. E., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012.
[13] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., Gradient-based learning applied to document recognition, Proceedings of the IEEE, November 1998.
[14] Charfi, I., Miteran, J., Dubois, J., Atri, M., Tourki, R., Optimised spatio-temporal descriptors for real-time fall detection: comparison of SVM and Adaboost-based classification, Journal of Electronic Imaging (JEI), vol. 22, no. 4, pp. 17, October 2013.
[15] Cohen, J., A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 1960, vol. 20, no. 1, pp. 37–46.

