
sensors

Article
Violence Detection Using Spatiotemporal Features
with 3D Convolutional Neural Network
Fath U Min Ullah 1, Amin Ullah 1, Khan Muhammad 2, Ijaz Ul Haq 1 and Sung Wook Baik 1,*
1 Intelligent Media Laboratory, Digital Contents Research Institute, Sejong University, Seoul 143-747, Korea;
fath3797@gmail.com (F.U.M.U.); qamin3797@gmail.com (A.U.); hijaz3797@gmail.com (I.U.H.)
2 Department of Software, Sejong University, Seoul 143-747, Korea; Khan.muhammad@ieee.org
* Correspondence: sbaik@sejong.ac.kr

Received: 1 April 2019; Accepted: 24 May 2019; Published: 30 May 2019 

Abstract: The worldwide utilization of surveillance cameras in smart cities has enabled researchers
to analyze a gigantic volume of data to ensure automatic monitoring. An enhanced security system
in smart cities, schools, hospitals, and other surveillance domains is mandatory for the detection
of violent or abnormal activities to avoid any casualties which could cause social, economic, and
ecological damages. Automatic detection of violence, which enables a quick response, is highly significant
and can efficiently assist the concerned departments. In this paper, we propose a triple-staged end-to-end
deep learning violence detection framework. First, persons are detected in the surveillance video
stream using a light-weight convolutional neural network (CNN) model to reduce and overcome the
voluminous processing of useless frames. Second, a sequence of 16 frames with detected persons is
passed to 3D CNN, where the spatiotemporal features of these sequences are extracted and fed to the
Softmax classifier. Furthermore, we optimized the 3D CNN model using an open visual inference
and neural networks optimization toolkit developed by Intel, which converts the trained model into
intermediate representation and adjusts it for optimal execution at the end platform for the final
prediction of violent activity. After detection of a violent activity, an alert is transmitted to the nearest
police station or security department to take prompt preventive actions. We found that our proposed
method outperforms the existing state-of-the-art methods for different benchmark datasets.

Keywords: abnormal activity; deep learning; 3D convolutional neural network; violence detection;
surveillance cameras

1. Introduction
In the past decade, with the growth and advancements in the field of computer vision, an
enormous number of modern techniques have emerged and gained much attention among researchers
due to their vast surveillance applications [1–5]. For instance, in 2017, about 954,261 CCTV cameras
were installed in public places in South Korea, an increase of 12.9% compared to the previous
year [6]. The purpose of these cameras is to ensure security in public places. For this purpose, we focus
on the detection of violence using these cameras. Violence is an abnormal behavior that involves
physical force intended to damage property or to kill or hurt a human or an animal; such actions can be
identified through a smart surveillance system and prevented before they lead to fatal accidents.
One of the main functions of surveillance systems deployed
on a large scale in different areas, such as schools, streets, parks, and medical centers, is to facilitate
the authorities by alerting them to the violent activity. However, the response of human operators
monitoring the surveillance footage is very slow, causing loss of human life and property; thus, there is
a demand for an automated violence detection system [7]. Hence, this field of study is growing steadily
and gaining interest in the computer vision society. Many techniques based on deep features [8–10]
and handcrafted features have emerged.

1.1. Handcrafted Features-Based Approaches


In these approaches, researchers develop methods around hand-engineered features. For instance,
Datta et al. [11] used the trajectory of motion information and limb orientation of a person in the scene
to detect violence. Similarly, Nguyen et al. [12] suggested the use of the hierarchical hidden Markov
model (HHMM) to recognize violent activities. Their main contribution involves the utilization of
a shared structure of HHMM for violence detection. Some of the researchers integrated audio and
video modalities for the detection of violent activities. For instance, Mahadevan et al. [13] developed
a system to recognize violent scenes via detecting blood and flames combined with the degree of
motion and sound. A research work proposed by Hassner et al. [14] considered the flow vector
magnitude represented by violent flow descriptors (ViF). Using a support vector machine (SVM),
these ViF descriptors were then classified into violent and non-violent in crowd scenes. Furthermore,
Huang et al. [15] presented a method for violent crowd behavior analysis by considering only the
statistical properties of the optical flow field in video data. These properties were then classified into
normal or abnormal activity classes using SVM. To detect and localize the violence in a surveillance
video stream, Zhang et al. [16] presented a Gaussian model of optical flow for violent region extraction
and used an orientation histogram of optical flow to distinguish the violent from non-violent class via
linear SVM. Similar to this method, Gao et al. [17] proposed an oriented violent flow descriptor (OViF),
which depicts both motion magnitude and orientation information.

1.2. Deep Learning-Based Approaches


Violence detection in video data is a challenging task due to the presence of complex patterns in
the form of sequential information. For this purpose, numerous methods are developed, for instance,
Chen et al. [18] used spatiotemporal interest points, including Harris corner detector, space–time
interest points (STIP) [19], and motion scale-invariant feature transform (MoSIFT) [7,20], for violence
detection. Similarly, Lloyd et al. [21] developed new descriptors called grey level co-occurrence texture
measures (GLCM), where changes in crowded texture are encoded by temporal summaries to detect
violent and abnormal crowds. In addition to this, Fu et al. [22] developed a model to detect fight scenes;
it searches for a series of features based on motion analysis using three attributes, namely motion
acceleration, motion magnitude, and the motion region. These features are collectively called the
motion signal, which is obtained by summing over the motion regions. Similarly, Sudhakaran et al. [23]
proposed a method where they used long short-term memory (LSTM) and the adjacent frame difference
as an input into the model by encoding the changes that occur in the videos. Mahmoodi et al. [24]
used a histogram of optical flow magnitude and orientation (HOMO) for violence detection. Recently,
a violent activity recognition framework was presented by Fenil et al. [25] for a soccer game. They
extracted histogram of oriented gradient (HoG) features from each frame. These features were used to
train bidirectional long short-term memory (BD-LSTM) and ensure its usage for both forward and
backward information access. This generated output contains information about violent scenes.
The approaches mentioned above tried to tackle many challenges in violence detection, including
camera views, complex crowd patterns, and intensity variations. However, they fail to extract
discriminative and effective features when the appearance of the human body varies; these variations
occur due to viewpoint, significant mutual occlusion, and scale [26]. Next, when considering ViF only,
the method of [14] encounters a problem: if the flow vector
for one pixel in two consecutive frames has the same magnitude and different direction, then the ViF’s
effect is restricted because ViF detects no difference between these two flow vectors. Furthermore,
earlier methods used flames, explosions, and blood for violence detection; these are limited because
of low detection rates and can produce false alarms. Moreover, the HHMM based method [12] and
HOMO [24] failed for complex crowd behavior recognition.

Recently, convolutional neural networks (CNNs) evolved to have higher accuracy and better
results for various computer vision techniques, such as behavior recognition and security [10,27,28],
object tracking and activity recognition [29,30], video summarization [31], and disaster management [8].
Inspired by the performance of CNNs in the mentioned domains, we tackle the problems mentioned
above by proposing 3D CNN-based violence detection in surveillance. The key contributions of the
proposed method are summarized in the following bullet points:

• Violence detection from video data is a challenging problem because of complex sequential
visual patterns’ identification. The mainstream techniques use traditional low-level features for
this task, which are inefficient at recognizing such complex patterns as well as being hard to
implement in real-time surveillance. Considering the limitations of the existing techniques, we
present a deep-learning-based 3D CNN model to learn complex sequential patterns to predict
violence accurately.
• Most violence detection algorithms suffer from the problem of processing a massive number
of unimportant frames, which results in occupying more memory and is very time-consuming.
Considering this major limitation, we first detected the persons in the video stream using a
pre-trained MobileNet CNN model. Only the sequence of 16 frames containing persons was
passed to the 3D CNN model for final prediction, which helped achieve efficient processing.
• The current mainstream methods do not learn effective patterns due to lack of data in violence
detection benchmark datasets and an often low accuracy rate. Inspired by the concept of transfer
learning, the 3D-CNN was fine-tuned using publicly available benchmark datasets for violence
detection in both indoor and outdoor surveillance. It experimentally outperforms conventional
hand-engineered feature extraction algorithms in terms of accuracy.
• After obtaining the trained deep learning model, it was optimized using an OPENVINO toolkit to
speed up and improve its performance at the model deployment stage. Using this strategy, the
trained model was converted into an intermediate representation (IR) based on trained weights
and topology.

The rest of the manuscript is organized as follows: Section 2 covers the proposed method, and
the experimental evaluation is discussed in Section 3. A conclusion and future work are provided in
Section 4.

2. Proposed Method
In this section, we discuss our proposed method in detail where a violent activity ĂI is detected
using an end-to-end deep learning framework. First, the camera captures the video stream VI , which
is directly passed to a trained MobileNet CNN model to detect the people. When a person in the video
stream is detected, the sequence Š of 16 frames is passed to the 3D CNN model for spatiotemporal
features extraction. These features are fed to the Softmax classifier CS to analyze the activity features
at the end and give predictions. An alert is sent to the nearest security department when violence is
detected so that they can take immediate action accordingly. The proposed method is further discussed
in detail in the sub-sections, where each step is given in Figure 1. The input and output parameters
are described in Table 1 with symbols.
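
As a compact illustration of this three-stage flow, the short Python sketch below gates incoming frames
with a person detector and only forwards complete 16-frame sequences to the classifier. It is a minimal
sketch, not the authors' implementation: the helpers read_frames, contains_person, classify_clip, and
send_alert are hypothetical stand-ins for the MobileNet-SSD detector, the 3D CNN with Softmax
classifier, and the alerting mechanism described above.

from collections import deque

def monitor(stream, read_frames, contains_person, classify_clip, send_alert, clip_len=16):
    # Stage 1: person detection filters out useless frames.
    # Stage 2: a 16-frame sequence is classified by the 3D CNN.
    # Stage 3: an alert is sent when the predicted class is violent.
    buffer = deque(maxlen=clip_len)
    for frame in read_frames(stream):
        if not contains_person(frame):
            continue
        buffer.append(frame)
        if len(buffer) == clip_len:
            label, confidence = classify_clip(list(buffer))
            if label == "violent":
                send_alert(confidence)
            buffer.clear()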

Figure 1. The framework of the proposed violence detection method. In the first phase, a video stream
from a surveillance camera is acquired in which persons are detected. The second phase extracts deep
features by feeding a selected sequence of frames to a 3D CNN model which detects the violent activity.
Lastly, if a violent activity is detected, then we report this information to the nearest station to take
immediate action before any injury or disaster occurs.

Table 1. Description of the input and output parameters used in the proposed method.

Symbols   Description          Symbols   Description
ĂI        Violent activity     Ď         Dataset
VI        Violent video        CS        Softmax classifier
Ƒ         Number of frames     Fc        Fully connected layer
Š         Sequence of frames   Ň         Number of clips
ĹTr       Training list        ĹTe       Testing list

2.1. Pre-Processing

Person detection is an essential step in our proposed method to ensure efficient processing before the
violence detection step. Instead of processing the whole video stream, we process only those sequences
that contain persons, avoiding unimportant frames. The video stream is fed into the MobileNet-SSD
CNN model [32] for person detection. We chose this CNN architecture because it keeps latency and
model size low. MobileNet uses depthwise separable convolutions instead of regular convolutions to
detect objects. If depthwise and pointwise convolutions are counted separately, there are 28 layers,
where every layer is followed by batch normalization and a ReLU nonlinearity except the final fully
connected layer. The first convolutional layer has a stride of two with a filter shape of 3 × 3 × 3 × 32
and an input size of 224 × 224 × 3; the next depthwise convolution has a stride of one, a filter shape of
3 × 3 × 32, and an input size of 112 × 112 × 32. MobileNet is mainly used for classification, while its
SSD (single shot multibox detector) extension is used for localization, and their combination performs
object detection. For this purpose, the SSD head is added at the end of the network; it performs
feedforward convolution and produces a fixed-size set of bounding boxes, confirming the presence of
object instances in those boxes by extracting the feature map and applying the convolution filters.
Each bounding box carries a predicted probability for every class. The class with the highest probability
indicates the object, while zero represents no object. A demonstration of person detection in some
samples of the hockey fight dataset is shown in Figure 2.

Figure 2. Persons detected in the frames of both classes from video clips of the hockey fight dataset
using MobileNet-SSD.
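
The person-detection stage can be sketched with OpenCV's DNN module and a publicly available
MobileNet-SSD Caffe model. This is only an assumption-laden example: the file names and the person
class id (15 in the common 20-class PASCAL VOC release of MobileNet-SSD) are not taken from
the paper.

import cv2

# Hypothetical file names for a pre-trained MobileNet-SSD Caffe model.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt", "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15  # assumption: person class in the 20-class VOC variant

def frame_contains_person(frame, conf_threshold=0.5):
    # Resize to the 300 x 300 input expected by this SSD variant and normalize.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843, size=(300, 300), mean=127.5)
    net.setInput(blob)
    detections = net.forward()  # shape (1, 1, N, 7): [_, class_id, confidence, x1, y1, x2, y2]
    for i in range(detections.shape[2]):
        class_id = int(detections[0, 0, i, 1])
        confidence = float(detections[0, 0, i, 2])
        if class_id == PERSON_CLASS_ID and confidence >= conf_threshold:
            return True
    return False

Only sequences of frames for which frame_contains_person returns True would then be buffered and
passed to the 3D CNN.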

2.2. Learning with 3D CNN
A 3D CNN is well-suited to extracting spatiotemporal features and can preserve temporal information
better owing to its 3D convolution and pooling operations. In addition, 2D CNNs capture spatial
information only, while a 3D CNN can capture all temporal information regarding the input sequence.
Some of the existing methods use 2D ConvNets to extract the spatial correlation in video data, which
also possess temporal correlation. For instance, in [33,34], the 2D CNN processes multiple frames, and
all the temporal feature information is collapsed. The 3D convolution operates by convolving a 3D mask
over a cube formed by assembling adjacent frames. The feature maps obtained from the convolution
layer are linked to multiple adjacent frames in the prior layer, capturing the motion information.
Hence, the value at position x, y, z of the qth feature map in the pth layer with bias t_pq is given by

N_{pq}^{xyz} = \tanh\left( t_{pq} + \sum_{k} \sum_{a=0}^{A_p - 1} \sum_{b=0}^{B_p - 1} \sum_{c=0}^{C_p - 1} w_{pqk}^{abc} \, N_{(p-1)k}^{(x+a)(y+b)(z+c)} \right)    (1)

where A_p, B_p, and C_p are the spatial and temporal dimensions of the 3D mask and w_{pqk}^{abc} is
the (a, b, c)th value of the mask attached to the kth feature map in the prior layer. Only one type of
feature is extracted by a 3D convolutional mask from the frame cube, since the weights of the kernel are
replicated over the entire cube. In Figure 3, the feature maps of the 3D CNN obtained from the two
layers conv3a and conv5a are provided. The input sequence is taken from the violence category of the
movies dataset. A principle of CNNs is to increase the number of feature maps in late layers by creating
several kinds of features from the same feature maps. The input to this network is a sequence of frames.
Before starting the training process, the volume mean of the training and testing data is calculated.
The architecture of the network is fine-tuned to accept these sequences as inputs. The final prediction at
the Softmax layer is calculated as belonging to the violent or non-violent class.

Figure 3. The input sequence is taken from the violence in movies dataset. Feature maps of conv3a and
conv5a are shown. As the convolution proceeds, deeper features are extracted.
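
Equation (1) can be transcribed directly into NumPy for a single output position; the array shapes and
the toy input below are illustrative only and do not correspond to the actual C3D layer sizes.

import numpy as np

def conv3d_value(prev_maps, weights, bias, x, y, z):
    # prev_maps: (K, X, Y, Z) feature maps of layer p-1
    # weights:   (K, A, B, C) one 3D mask per connected feature map
    # bias:      scalar t_pq
    K, A, B, C = weights.shape
    total = bias
    for k in range(K):
        for a in range(A):
            for b in range(B):
                for c in range(C):
                    total += weights[k, a, b, c] * prev_maps[k, x + a, y + b, z + c]
    return np.tanh(total)

rng = np.random.default_rng(0)
prev = rng.standard_normal((2, 8, 8, 8))     # two toy feature maps
mask = rng.standard_normal((2, 3, 3, 3))     # 3 x 3 x 3 mask per map
print(conv3d_value(prev, mask, bias=0.1, x=2, y=2, z=2))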
2.3. Data Preparation and Usage

This section specifies the preparation of data and their usage for learning violence activity patterns.
First, a violence dataset Ď was used, containing Ň short video clips of different durations. Each dataset
contains two categories, i.e., a violent class and a non-violent class. Before the learning process, the
whole dataset Ď was divided into sequences Š of 16 frames with an 8-frame overlap between two
successive clips. Subsequently, having obtained the frames, we split the whole data into training and
testing sets, using 75% and 25% of the data for training and testing, respectively. Once the training and
testing data were obtained, we generated file lists containing the paths of the training list
ĹTr = {S1, S17, S33, ..., SN} and the testing list ĹTe = {S1, S17, S33, ..., SN}. The subscript of S is the
starting frame number of the sequence; each path given in the list points towards the extracted frames
in the directories.

2.4. C3D Network Architecture

Inspired by the performance of 3D CNNs in [35–38], we also fine-tuned the 3D CNN model proposed
in [36]. A starting version of the C3D model [36] was developed in 2014 with a version of Caffe [39].
This network consists of eight convolutional, five pooling, and two fully connected layers, followed by
a Softmax output layer. Each convolutional layer has 3 × 3 × 3 kernels with a stride of one, and all the
pooling layers are max pooling with a 2 × 2 × 2 kernel size except for the first pooling layer, where the
kernel size is 1 × 2 × 2 with a stride of two, preserving the time-based information. The number of
filters in each convolution is 64, 128, and 256 for the first, second, and third layers, respectively.
The kernels for each convolution have a defined temporal depth of size D. The kernel size and padding
used to apply the convolution were kept as 3 and 1, respectively. The two fully connected layers
(fc6 and fc7) contain 4096 neurons each, and the Softmax layer contains N outputs depending on the
classes of the dataset. In our case, the output is only two because we have only two classes, i.e., violent
and non-violent scenes. The overall detailed architecture is illustrated in Figure 4.

Figure 4. The architecture of C3D containing eight convolutional, five max pooling, and two fully
connected layers, followed by a SoftMax layer (Output). The three color cones represent the different
size of filters for each layer.

This 3D convolutional network takes a short sequence of 16 frames as an input of size 128 × 171, but
we used random crops of size 3 × 16 × 112 × 112 from the original input sequence at training time to
avoid overfitting and to achieve effective learning. After this, the sequence of frames is processed by
3D convolution and pooling operations. Once trained, the network acts as a generic feature extractor.
In fact, diverse features are learned at each layer of the hierarchy in the network. The bottom activation
layers have smaller receptive fields, making them sensitive to patterns such as corners, edges, and
shapes, while the top activation layers have larger receptive fields, learning high-level and global
features that capture complex invariances. Finally, the output label is predicted as violent or
non-violent at the end.
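
The paper fine-tunes the original Caffe C3D model; purely to make the layer configuration above
concrete (eight 3 × 3 × 3 convolutions, five max-pooling layers with a 1 × 2 × 2 first pool, and two
4096-unit fully connected layers before the two-way Softmax), the following PyTorch sketch
re-implements an equivalent architecture. It is an illustrative stand-in, not the authors' code, and the
channel counts of the deeper layers follow the public C3D description rather than the paper.

import torch
import torch.nn as nn

class C3DViolence(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),          # first pool keeps temporal length
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2, stride=2),
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2, stride=2),
            nn.Conv3d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2, stride=2),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),              # fc7
            nn.Linear(4096, num_classes),                                               # feeds Softmax
        )

    def forward(self, clip):                      # clip: (batch, 3, 16, 112, 112)
        return torch.softmax(self.classifier(self.features(clip)), dim=1)

model = C3DViolence()
print(model(torch.randn(1, 3, 16, 112, 112)).shape)   # torch.Size([1, 2])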
2.5. Model Optimization

Model optimization is the process used to generate an optimal and fine-tuned design model based on
some prioritized constraints while keeping the model strength, efficiency, and reliability maximized.
Optimizing the model enables CNN network inference at the end and speeds up the process by using
pre-optimized kernels and functions. Inspired by these strategies, we used an open source toolkit
known as OPENVINO provided by the Intel Corporation. This toolkit extends the work process across
the hardware by maximizing its performance. It works on Intel hardware and takes pre-trained models,
such as Caffe, ONNX, MXNet, and TensorFlow, as inputs and converts these into an IR using a model
optimizer. The model optimizer is used to enable a transition between the training and deployment
floor to adjust
training the model floor
and deployment for optimal execution
to adjust the model onfor
theoptimal
end platform.
executionFigure
on the5 end
shows the flow
platform. and
Figure
process
5 showsofthe theflow
model optimization,
and taking
process of the modelthe trained model as
optimization, inputthe
taking andtrained
producing an as
model intermediate
input and
model. At the end platform, this output is deployed for further analysis.
producing an intermediate model. At the end platform, this output is deployed for further analysis.

Figure 5. The flow of converting the trained model into intermediate representation (IR) format using
the model optimizer, where the IR format of the model is further used at the end platform.

3. Results

We conducted various experiments to evaluate the performance of the proposed method on three
publicly available datasets for violence detection: violent crowd [14], hockey fight [7], and violence in
movies [7]. To perform the experiments, we used different parameters and learning rates to achieve the
greatest accuracy. Detailed descriptions of the datasets are given in Table 2. Furthermore, we compared
our method with different handcrafted and deep-learning-based state-of-the-art methods to evaluate its
accuracy and performance over the three datasets. The Caffe toolbox was used to extract deep features
on a GeForce Titan X GPU. The operating system was Ubuntu 16.04 with a Core i5-6600 CPU and
64 GB RAM.

Table 2. Detailed description and statistics of the used datasets.

                                                   Violent Scenes               Non-Violent Scenes
Datasets                 Samples   Resolution      No. of Clips   Frame Rate    No. of Clips   Frame Rate
Violent Crowd [14]       246       320 × 240       123            25            123            25
Violence in Movies [7]   200       360 × 250       100            25            100            29.97
Hockey Fight [7]         1000      360 × 288       500            25            500            25
3.1. Datasets

This section describes the datasets used in the experiments. Each dataset has a different number of
samples. A detailed explanation is given as follows:
is given the experiments. Each dataset has a different number
of samples. A detailed explanation is given as follows:
3.1.1. Violent Crowd
3.1.1.The
Violent Crowd
violent crowd dataset was presented by Hassner et al. [14]. This dataset contains 246 videos
takenThe
from YouTube,
violent crowdpresenting different
dataset was typesby
presented ofHassner
scenes and scenarios.
et al. [14]. ThisAt first, the
dataset dataset
contains contains
246 videos
five
takensets of video
from clips.
YouTube, In each set,
presenting there types
different are two of categories: i.e., violent
scenes and scenarios. Atand
first,non-violent. For the
the dataset contains
experiments, we merged
five sets of video clips. Inthese
eachfive
set,sets to form
there are twotwocategories:
categoriesi.e.,
where 123 video
violent clips are related
and non-violent. to
For the
violent events, and 123 videos are related to non-violent clips. Each video clip
experiments, we merged these five sets to form two categories where 123 video clips are related to has a resolution of
320 × 240
violent pixelsand
events, with123
lengths
videosvarying fromto
are related 50non-violent
to 150 frames. Some
clips. Eachsample
videoframes
clip has from this dataset
a resolution are
of 320
given in Figure 6.
× 240 pixels with lengths varying from 50 to 150 frames. Some sample frames from this dataset are
given in Figure 6.

Figure 6. Sample video frames randomly selected from: (a) violent crowd [14], (b) violence in movies [7]
and (c) hockey fight [7].

3.1.2. Violence in Movies

This dataset was introduced by Nievas et al. [7] for fight detection, and it consists of 200 video clips in
which person-on-person fight videos have been taken from action movies, while non-fight videos have
been extracted from publicly available action recognition datasets. This dataset covers a variety of
scenes, with an average resolution of 360 × 250 pixels, and each clip is limited to 50 frames. In this
dataset, the camera motion in the sequences is low or absent. Some sample frames from this dataset are
given in Figure 6.

3.1.3. Hockey Fight

This dataset was introduced by Nievas et al. [7] and contains 1000 short video clips taken from the
National Hockey League (NHL). In this dataset, 500 video clips are labeled as fight, and 500 are labeled
as non-fight. Each clip consists of 50 frames with a resolution of 360 × 288 pixels. In the fight class, all
the clips are related to fights on hockey grounds, and the non-fight class is also related to the same
environment, containing non-fight clips so as to reliably detect violent scenes in sports videos. Some
sample frames from this dataset are given in Figure 6.

3.2. Discussion

Table 3 explains the experiments performed on the violent crowd dataset, where the highest achieved
accuracy was 98%, with a loss of 1.89 × 10−9 at the maximum iteration of 5000 with a base learning rate
of 0.001 (loss values are reported in scientific notation). We kept the learning rate moderate because two
considerations govern its choice. First, the learning rate should not be very large, because the
optimization then oscillates around the minimum and drastic updates can lead to divergent behavior.
Second, the learning rate should not be very small, because this slows down convergence towards the
minimum and requires too many updates before reaching it. At first, the learning rate is large and the
random weights are far from the optimal point; it then decreases slowly and gradually as the iterations
proceed.

Table 3. Classification accuracies of the proposed method on the violent crowd dataset [14].

Learning Rate (Batch Size = 20)   Iterations   Loss          Accuracy
0.01                              1000         1.30
                                  3000         8.28 × 10−1   55%
                                  5000         7.07 × 10−1
0.001                             1000         1.52 × 10−5
                                  3000         1.79 × 10−8   98%
                                  5000         1.89 × 10−9
Testing the obtained model on violence in movies dataset [7] 65%
Testing the obtained model on hockey fight dataset [7] 47%

Table 4 explains the experiments performed on the violence in movies dataset [7], where the
highest achieved accuracy was 99.9% with 1.67 × 10−7 loss at a maximum iteration of 5000 with
the base learning rate of 0.001. After conducting experiments on the violence in movies dataset,
we made various observations. For instance, detecting the fights in the movies dataset footage
was easier than detecting it in the crowd dataset because when we tested the obtained model on
the violent crowd dataset, we achieved 54% accuracy, which is low because fights in the violent
crowd dataset are very varied in appearance or cinematography. In addition, the clips included a
large number of people; however, in the violence in movies dataset, a majority of the videos clips
contained person-to-person violence. Notwithstanding this, the hockey fight dataset was relatively
very consistent. The same model was tested using the hockey fight dataset [7], in which the obtained
accuracy was 63%, which is better than the accuracy obtained for the violent crowd dataset. We also
tested the model obtained from the violent crowd on the other two datasets, i.e., violence in movies
and hockey fight dataset, which gave an accuracy of 65% and 47%, respectively. The obtained accuracy
on these two datasets is lower due to pattern footage because the hockey fight and violence in movies
datasets contained person-to-person fights and the violent crowd dataset contained multiple numbers
of persons. The graphical representation for the experiments performed in Table 4 is given in Figure 7.

Table 4. Classification accuracies of the proposed method on violence in movies dataset [7].

Learning Rate (Batch Size = 20)   Iterations   Loss          Accuracy
0.001                             1000         0
                                  3000         0             99.9%
                                  5000         1.67 × 10−7
1 × 10−5                          1000         1.21 × 10−2
                                  3000         1.99 × 10−3   99.9%
                                  5000         5.4 × 10−4
Testing the obtained model on violent crowd dataset [14] 54%
Testing the obtained model on hockey fight dataset [7] 63%

Table 5 explains the experiment’s performance in relation to the hockey fight dataset [7], where
the highest achieved accuracy was 96% with a 5.77 × 10−4 loss at the maximum iteration of 5000 and
the base learning rate of 0.001. Furthermore, we evaluated the accuracy of the fine-tuned model of the
hockey fight dataset [7] on the violent crowd dataset [14] and violence in movies, giving 52% and 49%
accuracy, respectively.

Table 5. Classification accuracies of the proposed method on the hockey fight dataset [7].

Learning Rate (Batch Size = 20)   Iterations   Loss          Accuracy
0.001                             1000         1.59 × 10−4
                                  3000         0             96%
                                  5000         2.31 × 10−7
0.0001                            1000         9.1 × 10−2
                                  3000         2.27 × 10−6   96%
                                  5000         5.77 × 10−4
Testing the obtained model on violent crowd dataset [14]          52%
Testing the obtained model on violence in movies dataset [7]      49%

In addition, we observed that changing the learning rate has an effect on the loss across iterations.
In Figure 7a, the graph shows the change in loss with the variation in the number of iterations with a
base learning rate of 0.001 for the hockey fight dataset. At the iteration of 500, the loss obtained is
1.97 × 10−2, which decreases as the number of iterations proceeds; at the maximum iteration of 5000,
the obtained loss is 2.32 × 10−7. Keeping the same experiment and only changing the learning rate to
0.0001, the obtained loss at the initial iteration of 500 is 7.39 × 10−2, and at the maximum iteration of
5000 the obtained loss is 5.77 × 10−4.

Figure 7. (a) Variation of loss with different iterations on the hockey fight dataset [7] with a learning
rate of 0.001; the horizontal axis shows the iterations growing from zero towards the final iteration of
5000, while the vertical axis shows the loss, which decreases as the iterations proceed. (b) Variation of
loss with different iterations on the violence in movies dataset [7] with a learning rate of 0.00001; the
loss decreases as the iterations proceed. (c) Variation of loss with different iterations on the violent
crowd dataset [14] with a learning rate of 0.001; at the 500th iteration the loss is very high, but with
further iterations it decreases.

The loss to iteration comparison for violent crowd is given in Figure 7c, where the loss decreases from
the start and drops close to zero after 1000 iterations. The loss for the violence in movies dataset in the
initial stages is high; then, it decreases as iterations proceed. In this way, the loss obtained at the 5000th
iteration becomes 5.4 × 10−4. The decrease in loss for the violence in movies dataset is graphically
presented in Figure 7b, where the vertical axis represents the loss, and the horizontal axis represents the
training iterations. We also evaluated the performance of the proposed method by examining precision,
recall, and the comparison among the datasets by providing the values of area under the curve (AUC)
in Table 6, which show the effectiveness of the proposed method on each dataset. In addition, the
obtained confusion matrix is given in Table 7. The precision and recall values for each dataset range
between Xmin, Ymin and Xmax, Ymax, respectively, where X represents the precision and Y represents
the recall for each dataset. The precision obtained for the hockey fight, violence in movies, and violent
crowd datasets is 0.9597, 1.0, and 0.9815, respectively, while the recall is 0.9667, 1.0, and 0.9876,
respectively. We also calculated the time complexity of the proposed method, considering the testing
phase during this experiment. For each 16-frame sequence, the average calculated time is 1.85 s, while,
for a one-minute clip at 25 FPS, it takes about 2 min and 54
the proposed method by plotting the receiver operating characteristic (ROC) curve across the true
positive rate and false positive rate. This is briefly illustrated in Figure 8, where the AUC values are
compared for each dataset.

Figure 8. The ROC curve and comparison amongst the datasets based on AUC value, i.e., (a) Hockey
fight dataset; (b) Violent crowd dataset; and (c) Violence in movies dataset.
Table 6. Precision and recall with AUC values are compared for each dataset.

Datasets                 TP    TN    FP   FN   Precision     Recall        AUC
Hockey Fight [7]         262   230   11   9    0.95970696    0.966789668   0.970
Violence in Movies [7]   50    57    0    0    1.0           1.0           0.997
Violent Crowd [14]       160   128   3    2    0.981595092   0.987654321   0.98
Table 7. Confusion matrix for each dataset.

Classes\Datasets   Hockey Fight             Violence in Movies       Violent Crowd
                   Violent   Non-violent    Violent   Non-violent    Violent   Non-violent
Violent            262       11             50        0              160       3
Non-violent        9         230            0         57             2         128
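
The precision and recall values reported in Table 6 follow directly from the confusion-matrix counts of
Table 7, with precision = TP/(TP + FP) and recall = TP/(TP + FN). A short check using the hockey
fight counts:

def precision_recall(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Hockey fight: TP = 262, TN = 230, FP = 11, FN = 9
print(precision_recall(262, 230, 11, 9))   # ~(0.9597, 0.9668, 0.9609)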
We also compared the accuracies for the benchmark datasets in Figure 9, where the highest achieved
accuracy is 99.9%, obtained on the movies dataset; 98% accuracy is obtained on the violent crowd
dataset, and 96% is obtained on the hockey fight dataset.

Figure 9. Comparative analysis of the proposed method on various datasets based on accuracy.
3.3. Comparative Analysis

In this section, we compare the results on each dataset with existing state-of-the-art methods. The comparative analysis is summarized in Table 8. In the first row, we present the results of the method in [17], which used oriented violent flows (OViF) to capture motion magnitude, AdaBoost for feature extraction, and SVM for classification; with these settings, they obtained accuracies of 88% and 87.50% on the violent crowd and hockey fight datasets, respectively. Recently, another method [40] used Hough forests with a 2D CNN to detect violence and obtained 99% accuracy on the violence in movies dataset and 94.6% on the hockey fight dataset. Apart from this, the method in [7] detected violence in videos using a spatiotemporal descriptor, the space–time interest point (STIP), together with bag-of-words (BoW) and SVM to classify the output classes; it was evaluated only on the violence in movies dataset and obtained 89.5% accuracy. Furthermore, we compared our results with another method [41], which used motion blobs and random forests for fast fight detection; it was also evaluated only on the violence in movies dataset and obtained 96.9% accuracy. Moreover, in [42], two descriptors were used to detect and localize abnormal behaviors: a simplified histogram of oriented tracklets (sHOT) combined with dense optical flow, which obtained an accuracy of 82.2% on the violent crowd dataset. In [14], the authors used ViF features classified with an SVM, with five-fold cross-validation for testing, and obtained 82.90% accuracy on the hockey fight dataset and 81.3% on the violent crowd dataset. In [43], the authors used a sliding window approach with an improved Fisher vector representation to detect violence and obtained accuracies of 99.5%, 96.4%, and 93.7% on the violence in movies, violent crowd, and hockey fight datasets, respectively. Finally, in the last row, we present our approach, which obtained 99.9%, 98%, and 96% accuracy on the violence in movies, violent crowd, and hockey fight datasets, respectively.

Table 8. Comparative analysis of the proposed method with state-of-the-art methods based on overall accuracy (%).

Methods                                   Violence in Movies [7]   Violent Crowd [14]   Hockey Fight [7]
ViF, OViF, AdaBoost and SVM [17]          -                        88                   87.50
Hough Forests and 2D CNN [40]             99                       -                    94.6
STIP, BoW, and SVM [7]                    89.5                     -                    -
Motion Blobs and Random Forests [41]      96.9                     -                    -
ViF [14]                                  -                        81.3                 82.90
sHOT [42]                                 -                        82.2                 -
Improved Fisher Vectors [43]              99.5                     96.4                 93.7
Proposed Method                           99.9                     98                   96

4. Conclusions and Future Work


In this paper, a three-staged end-to-end framework is proposed for violence detection in a surveillance video stream. In the first stage, persons are detected using an efficient CNN model to discard unwanted frames, which reduces the overall processing time. Next, frame sequences containing persons are fed into a 3D CNN model trained on three benchmark datasets, where spatiotemporal features are extracted and forwarded to the Softmax classifier for the final prediction. Finally, the OpenVINO toolkit is used to optimize the model, speeding up its execution and increasing its performance on the end platform. Experimental results on various benchmark datasets confirm that our method is well suited for violence detection in surveillance and achieves better accuracy than the existing techniques considered. In the future, we intend to implement our system on resource-constrained devices. Furthermore, we plan to explore edge intelligence for violence recognition in IoT environments, using smart devices for quick responses.
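To make the three stages concrete, the sketch below summarizes the inference flow described above. It is a minimal illustration under our own naming: person_detector, violence_model, and send_alert are hypothetical callables standing in for the light-weight person-detection CNN, the optimized 3D CNN, and the alert mechanism, and the 0.5 threshold and non-overlapping buffering are our assumptions, not details taken from the paper.

    from collections import deque

    SEQ_LEN = 16  # frames per 3D CNN input, as described above

    def violence_monitor(frame_source, person_detector, violence_model, send_alert):
        """Stream frames, keep only those with persons, classify 16-frame sequences."""
        buffer = deque(maxlen=SEQ_LEN)
        for frame in frame_source:
            if not person_detector(frame):   # stage 1: skip frames without persons
                continue
            buffer.append(frame)
            if len(buffer) < SEQ_LEN:
                continue
            prob_violent = violence_model(list(buffer))  # stage 2: spatiotemporal classification
            if prob_violent > 0.5:           # stage 3: alert the security department
                send_alert(prob_violent)
            buffer.clear()                   # assumption: sequences do not overlap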

Author Contributions: Conceptualization, F.U.M.U., A.U. and S.W.B.; Methodology, F.U.M.U., A.U. and S.W.B.;
Software, F.U.M.U.; Validation, F.U.M.U., K.M. and A.U.; Formal analysis, F.U.M.U. and K.M.; Investigation,
F.U.M.U., A.U. and I.U.H.; Writing—review and editing, F.U.M.U., A.U., K.M. and I.U.H.; Visualization, F.U.M.U.
and I.U.H.; Supervision, S.W.B.; Project administration, S.W.B.
Funding: This work was supported by Institute of Information & communications Technology Planning &
Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-00136, Development of AI-Convergence
Technologies for Smart City Industry Productivity Innovation).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Batchuluun, G.; Kim, Y.; Kim, J.; Hong, H.; Park, K. Robust behavior recognition in intelligent surveillance
environments. Sensors 2016, 16, 1010. [CrossRef] [PubMed]
2. Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; Albuquerque, V. Activity Recognition using Temporal
Optical Flow Convolutional Features and Multi-Layer LSTM. IEEE Trans. Ind. Electron. 2018. [CrossRef]
3. Muhammad, K.; Ahmad, J.; Lv, Z.; Bellavista, P.; Yang, P.; Baik, S.W. Efficient Deep CNN-Based Fire Detection
and Localization in Video Surveillance Applications. IEEE Trans. Syst. Man Cybern. Syst. 2018. [CrossRef]
4. Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action recognition using optimized deep autoencoder
and CNN for surveillance data streams of non-stationary environments. Future Gener. Comput. Syst. 2019.
[CrossRef]
5. Muhammad, K.; Khan, S.; Elhoseny, M.; Ahmed, S.H.; Baik, S.W. Efficient Fire Detection for Uncertain
Surveillance Environment. IEEE Trans. Ind. Inform. 2019. [CrossRef]
6. Greenfield, M. Change in the Number of Closed-Circuit Television (CCTV) Cameras in Public Places in South
Korea. Available online: https://www.statista.com/statistics/651509/south-korea-cctv-cameras/ (accessed on
25 April 2018).
7. Nievas, E.B.; Suarez, O.D.; García, G.B.; Sukthankar, R. Violence detection in video using computer vision
techniques. In Proceedings of the International Conference on Computer Analysis of Images and Patterns,
Seville, Spain, 29–31 August 2011; pp. 332–339.
8. Khan, S.; Muhammad, K.; Mumtaz, S.; Baik, S.W.; de Albuquerque, V.H.C. Energy-Efficient Deep CNN for
Smoke Detection in Foggy IoT Environment. IEEE Internet Things J. 2019. [CrossRef]
9. Sajjad, M.; Khan, S.; Muhammad, K.; Wu, W.; Ullah, A.; Baik, S.W. Multi-grade brain tumor classification
using deep CNN with extensive data augmentation. J. Comput. Sci. 2019, 30, 174–182. [CrossRef]
10. Sajjad, M.; Khan, S.; Hussain, T.; Muhammad, K.; Sangaiah, A.K.; Castiglione, A.; Esposito, C.; Baik, S.W.
CNN-based anti-spoofing two-tier multi-factor authentication system. Pattern Recognit. Lett. 2018. [CrossRef]
11. Datta, A.; Shah, M.; Lobo, N.D.V. Person-on-person violence detection in video data. In Proceedings of the
16th International Conference on Pattern Recognition, Quebec, QC, Canada, 11–15 August 2002; pp. 433–438.
12. Nguyen, N.T.; Phung, D.Q.; Venkatesh, S.; Bui, H. Learning and detecting activities from movement
trajectories using the hierarchical hidden Markov model. In Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 955–960.
13. Mahadevan, V.; Li, W.; Bhalodia, V.; Vasconcelos, N. Anomaly detection in crowded scenes. In Proceedings
of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA,
13–18 June 2010; pp. 1975–1981.
14. Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior.
In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012; pp. 1–6.
15. Huang, J.-F.; Chen, S.-L. Detection of violent crowd behavior based on statistical characteristics of the optical
flow. In Proceedings of the 2014 11th International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD), Xiamen, China, 19–21 August 2014; pp. 565–569.
16. Zhang, T.; Yang, Z.; Jia, W.; Yang, B.; Yang, J.; He, X. A new method for violence detection in surveillance
scenes. Multimed. Tools Appl. 2016, 75, 7327–7349. [CrossRef]
17. Gao, Y.; Liu, H.; Sun, X.; Wang, C.; Liu, Y. Violence detection using oriented violent flows. Image Vis. Comput.
2016, 48, 37–41. [CrossRef]
18. Chen, D.; Wactlar, H.; Chen, M.-Y.; Gao, C.; Bharucha, A.; Hauptmann, A. Recognition of aggressive
human behavior using binary local motion descriptors. In Proceedings of the 2008 30th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society, Vancouver, BC, Canada,
20–25 August 2008; pp. 5238–5241.
19. De Souza, F.D.; Chavez, G.C.; do Valle Jr, E.A.; Araújo, A.d.A. Violence detection in video using spatio-temporal
features. In Proceedings of the 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI),
Gramado, Brazil, 30 August–3 September 2010; pp. 224–230.
20. Xu, L.; Gong, C.; Yang, J.; Wu, Q.; Yao, L. Violent video detection based on MoSIFT feature and sparse coding.
In Proceedings of the ICASSP, Florence, Italy, 4–9 May 2014; pp. 3538–3542.
21. Lloyd, K.; Rosin, P.L.; Marshall, D.; Moore, S.C. Detecting violent and abnormal crowd activity using
temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures. Mach. Vis. Appl. 2017,
28, 361–371. [CrossRef]
22. Fu, E.Y.; Leong, H.V.; Ngai, G.; Chan, S.C. Automatic fight detection in surveillance videos. Int. J. Pervasive
Comput. Commun. 2017, 13, 130–156. [CrossRef]
23. Sudhakaran, S.; Lanz, O. Learning to detect violent videos using convolutional long short-term memory.
In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
24. Mahmoodi, J.; Salajeghe, A. A classification method based on optical flow for violence detection.
Expert Syst. Appl. 2019, 127, 121–127. [CrossRef]
25. Fenil, E.; Manogaran, G.; Vivekananda, G.; Thanjaivadivel, M.; Jeeva, S.; Ahilan, A. Real time Violence
Detection Framework for Football Stadium comprising of Big Data Analysis and Deep Learning through
Bidirectional LSTM. Comput. Netw. 2019, 151, 191–200.
26. Zhou, P.; Ding, Q.; Luo, H.; Hou, X. Violence detection in surveillance video using low-level features.
PLoS ONE 2018, 13, e0203668. [CrossRef]
27. Batchuluun, G.; Kim, J.H.; Hong, H.G.; Kang, J.K.; Park, K.R. Fuzzy system based human behavior recognition
by combining behavior prediction and recognition. Expert Syst. Appl. 2017, 81, 108–133. [CrossRef]
28. Sajjad, M.; Nasir, M.; Ullah, F.U.M.; Muhammad, K.; Sangaiah, A.K.; Baik, S.W. Raspberry Pi assisted facial
expression recognition framework for smart security in law-enforcement services. Inf. Sci. 2018, 479, 416–431.
[CrossRef]
29. Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action recognition in video sequences using
deep Bi-directional LSTM with CNN features. IEEE Access 2018, 6, 1155–1166. [CrossRef]
30. Lee, S.; Kim, E. Multiple Object Tracking via Feature Pyramid Siamese Networks. IEEE Access 2019, 7,
8181–8194. [CrossRef]
31. Haq, I.U.; Muhammad, K.; Ullah, A.; Baik, S.W. DeepStar: Detecting Starring Characters in Movies.
IEEE Access 2019, 7, 9265–9272. [CrossRef]
32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H.
Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017,
arXiv:1704.04861.
33. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos.
In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada,
8–13 December 2014; pp. 568–576.
34. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification
with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
35. Shou, Z.; Wang, D.; Chang, S.-F. Temporal action localization in untrimmed videos via multi-stage cnns.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
26 June–1 July 2016; pp. 1049–1058.
36. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d
convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago,
Chile, 7–13 December 2015; pp. 4489–4497.
37. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Deep end2end voxel2voxel prediction. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA,
26 June–1 July 2016; pp. 17–24.
38. Muhammad, K.; Hussain, T.; Baik, S.W. Efficient CNN based summarization of surveillance videos for
resource-constrained devices. Pattern Recognit. Lett. 2018. [CrossRef]
39. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe:
Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International
Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678.
40. Serrano, I.; Deniz, O.; Espinosa-Aranda, J.L.; Bueno, G. Fight Recognition in video using Hough Forests and
2D Convolutional Neural Network. IEEE Trans. Image Process. 2018, 27, 4787–4797. [CrossRef]
41. Gracia, I.S.; Suarez, O.D.; Garcia, G.B.; Kim, T.-K. Fast fight detection. PLoS ONE 2015, 10, e0120448.
42. Rabiee, H.; Mousavi, H.; Nabi, M.; Ravanbakhsh, M. Detection and localization of crowd behavior using a
novel tracklet-based model. Int. J. Mach. Learn. Cybern. 2018, 9, 1999–2010. [CrossRef]
43. Bilinski, P.; Bremond, F. Human violence recognition and detection in surveillance videos. In Proceedings
of the 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS),
Colorado Springs, CO, USA, 23–26 August 2016; pp. 30–36.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
