Automatic threat detection using dual and multi-view X-ray images

Automatic Threat Detection in Single, Dual and
Multi View X-Ray Images

Abhinav Tuli
Department of Computer Science
Birla Institute of Technology and Science, Pilani
Pilani, Rajasthan
f20170048@pilani.bits-pilani.ac.in

Rohit Bohra
Department of Computer Science
Birla Institute of Technology and Science, Pilani
Pilani, Rajasthan
f20170225@pilani.bits-pilani.ac.in

Tanmay Moghe
Department of Computer Science
Birla Institute of Technology and Science, Pilani
Pilani, Rajasthan
f20170184@pilani.bits-pilani.ac.in

Nitin Chaturvedi
Department of Electrical Engineering
Birla Institute of Technology and Science, Pilani
Pilani, Rajasthan
nitin80@pilani.bits-pilani.ac.in

Dhiraj
Cognitive Computing Group
CSIR-Central Electronics Engineering Research Institute,
Pilani, Rajasthan
dhiraj@ceeri.res.in

Abstract - Accurate X-ray screening systems are of paramount
importance in the present day. Most existing systems predict only
on the basis of a single image, which can lead to false positives
and false negatives because of the limited information present. In
this paper, we explore and modify existing work in this field. We
implemented several approaches using single, dual and multiple
X-ray views to build a reliable and practical model, with varying
levels of success. These approaches include long-established methods
such as Bag of Visual Words, 3D Object Recognition, Adaptive Implicit
Shape Model and Deep Neural Networks. We have made all these
approaches more relevant for existing scenarios by taking in dual
inputs to make more informed predictions. Results ranged from 73%
using Bag of Visual Words to 87% using a Deep CNN. We show that when
a dual view of an object is considered, we gain an improvement of
5% to 15% (depending on the approach) compared to using a single
view.

Index Terms - Threat Detection, X-Ray, Bag of Visual Words,
Deep Learning, Transfer Learning, Adaptive Implicit Shape Model

I. INTRODUCTION

In recent years, especially after the New York plane crashes
in 2001, X-ray screening systems have been used worldwide
to protect and secure places such as airports [1], railway
stations, malls, stadiums and government buildings. These are
places where safety is of paramount importance because
of heavy crowd movement, as shown in Figure 1. The main
aim of these screening systems is to detect dangerous objects
such as knives, handguns and explosives in baggage items
carried by the public during their travel.

All over the world, this detection is done manually by
security officers present at the physical location. This is
a very demanding job that requires a lot of concentration
from the security officers, as only a minuscule percentage of
bags contain weapons or explosives. Detecting these weapons
is in general a very difficult task because items are closely
packed and the view is blocked by other objects [2]. Because
of the difficulty of these identification tasks, some threat
objects are bound to be missed by even elaborate security
systems on account of human error. The detection rates of
humans as well as computers are severely affected by the
complexity: all types of objects are present in the bags of
passengers. In addition, worldwide air travel and leisure
tourism are constantly increasing. This further reduces the time
spent by security officers per person, which in turn reduces the
chance of detecting a threat object. Reducing the security check
time per passenger even by a small amount can be very cost
effective [3].

Fig. 1. Huge Queues at Airport Security
In spite of this, human security officers receive minimal
technological support in identifying threat objects. Other than
airport baggage security, there are many areas where X-ray
image inspection has been automated and has proven useful.
One obvious example is in the medical field: detection of
diseases, even Covid-19, through the chest X-ray of a
patient [37]. Less obvious applications are food quality
inspection [38] and inspection of welds and cargo [39].
Considering all this, an automated system to detect threat
objects in passenger baggage and cargo would greatly
streamline the process of airport security and reduce the
workload of security officers. A lot of research has been
going on towards automatic detection of threat objects. Most
of these methods improve the quality of the image through
novel pseudo-colour algorithms and image segmentation and
use methods like Bag of Visual Words [4], Support Vector
Machines [5], 3D re-projections, etc. Moderate recall and
precision have been achieved using the SURF [6] and SIFT
[7] descriptors through the Bag of Visual Words method. In
all of these methods, threat detection in a single view image
of threat objects is the key feature.

In all of the above approaches, the chances of
misinterpretation are high as there is just a single view.
However, dual view X-ray machines (as shown in Figure 2)
are becoming more common. Having multiple views enhances
the capability to detect a threat object compared to having
a single view. Following the current trend, most airports,
railway stations and other high-threat zones with security
infrastructure are moving towards machines which capture
X-ray images from more than one view. This has proved to be
a boon for security officers in terms of detecting threat
objects. Incorporating multiple views in automated threat
detection systems is the scope of this paper. We analyse how
using multiple views with the methods of 3D Space Carving [8],
Bag of Visual Words, Dual Input CNN and Adaptive Implicit
Shape Model [9] impacts the process of threat detection, and
whether it is feasible to implement considering the current
infrastructural bottlenecks.

Fig. 2. A Common Dual-View X-Ray Machine

The remaining content of the paper is organized as follows.
In section II, we analyze the method of 3D space carving [8].
A 3D model is generated with the help of X-ray images from
multiple views and features are extracted from this model.
3D models are also generated from the images in which threats
are to be detected, and features from these models are compared
to features from the already developed training models of
threat objects. The feasibility of this method is also
addressed. Following this, in section III, we analyze the Bag
of Visual Words method using multiple feature extractors and
compare the performance when a single view image is given as
input and when two views are given. In section IV, we design
various dual-input Convolutional Neural Network architectures
on top of well known image classification networks like
InceptionV3 [10], VGG19 [11], ResNet50 [12], etc. and report
their performance. In section V, we study the Adaptive Implicit
Shape Model with two images fed simultaneously instead of one.
Finally, in section VI, we offer conclusions on our experiments
and observations and discuss the future scope of the proposed
techniques.

II. 3D OBJECT RECOGNITION

This method was originally proposed by Riffo in his paper
on handgun threat detection [13]. It proposes the creation of
3D models of the threat object in the training stage and the
extraction of descriptors from these. In the testing phase, 3D
models of the bags of interest are created and their descriptors
are extracted. Subsequently, the two sets of descriptors are
compared to find matches. The 3D reconstruction in both the
training and testing phases is done using a voxel cutting
technique called "Space Carving" [8].

The publicly available GDXray database [14] contains
more than 8,000 images for baggage inspection. For both
training and testing, we have used X-ray images selected
from the GDXray database.

In the subsequent sections we describe the method in detail,
starting from the acquisition of training images. Refer to
Figure 3 for a complete overview of the training and testing
procedures of this method.
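As a rough illustration of the voxel-carving idea behind Space Carving [8], the sketch below keeps a voxel only if it falls inside the silhouette in each of two orthogonal views. It is a minimal sketch under strong assumptions that are not from the paper: the projections are orthographic (no projection matrices), the two binary silhouettes share the same height, and the function name `carve_two_views` is hypothetical.

```python
import numpy as np

def carve_two_views(front_mask: np.ndarray, side_mask: np.ndarray) -> np.ndarray:
    """Carve a voxel grid from two orthogonal silhouettes (orthographic assumption).

    front_mask[y, x] is the silhouette seen along the z axis (shape H x W).
    side_mask[y, z]  is the silhouette seen along the x axis (shape H x D).
    Returns a boolean grid of shape (H, W, D), indexed [y, x, z].
    """
    if front_mask.shape[0] != side_mask.shape[0]:
        raise ValueError("both views must share the same height")
    front = front_mask.astype(bool)[:, :, None]  # broadcast over z
    side = side_mask.astype(bool)[:, None, :]    # broadcast over x
    # A voxel survives only if it projects inside both silhouettes.
    return front & side

# Toy example: a 2x2 square seen from the front and from the side.
front = np.zeros((4, 4), dtype=bool)
front[1:3, 1:3] = True
side = np.zeros((4, 4), dtype=bool)
side[1:3, 0:2] = True
voxels = carve_two_views(front, side)
print(voxels.shape, int(voxels.sum()))
```

With real projection matrices, each voxel would instead be projected into every view and removed as soon as it lands outside one silhouette; the intersection logic stays the same.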
Fig. 3. 3D Object Recognition Flowchart

A. Training Images Acquisition

The threat object to be detected is placed inside a sphere
of expanded polystyrene (chosen for its low X-ray absorption
coefficient). The referred system [13] allows users to acquire
images of the threat object in many poses by varying the
rotation angles associated with each axis (X, Y and Z) of the
sphere, as seen in Figure 4.

Fig. 4. Image Acquisition

This system uses single-spectrum X-rays and thus does not
take into account the material of the object, so it will flag
objects that resemble a gun (such as harmless toy guns) as a
threat, increasing false alarms compared to existing systems.
Moreover, in order to use the images obtained from this
proposed method, the projection matrix associated with each
image is required, which was not available in the public
domain.

We have instead used two orthogonal images taken from freely
available online stock images for training and constructing
the 3D model. These were taken from the same distance and with
the same focal length. The resulting 3D model is shown in
Figure 5. As the threat object (gun) is a smooth object with
few imperfections, a satisfactory 3D model was created in the
training stage using just two orthogonal images.

B. Training

Training is done using this 3D model (Figure 5), built from
the acquired images using Space Carving.

Fig. 5. 3D model of gun generated from 2 gun images (Using Space Carving)

Space Carving works by carving out a 'virtual sculpture'
from a cube shaped voxel array. The 3D object to be
constructed is enclosed within this cube. Space Carving
proceeds to iteratively carve out portions of the cube that
do not match the silhouettes of the 2D images. At the end,
the sculpture of the 3D model is left and the background has
been carved out.

Subsequently, we extract keypoints from the reconstructed
object. These points are the locations where invariant features
of the 3D model can be described. 3D local descriptors
around these keypoints are then extracted using the OpenCV
library. These are local characteristics that describe the
object, and they must be invariant to transformations such as
rotation or scaling.

We separately experimented with four 3D descriptors,
namely 3DSC [15], USC [17], FPFH [18] and SHOT [16].

C. Testing Images Acquisition

For testing purposes, 90 X-ray images from 0 to 178
degrees were taken by rotating about the vertical axis
(Figure 6). This was done for 19 distinct bags from GDXray
that contained 0, 1 or 2 handguns. From these images, the three
darkest contours were selected and the background was ignored.
In this way we were able to segment all objects of interest
and ignore the insignificant objects.

D. Testing

From the acquisition phase, we had obtained 90 images for
each bag at two-degree intervals, along with their respective
projection matrices. Complete space carving was done using
these projection matrices. As multiple objects are present
in the bags, each significant object was clustered separately to
obtain multiple 3D models from each bag. A generated 3D
model is shown in Figure 7. For each object obtained,
keypoints and descriptors were extracted, as in training.

Fig. 6. Testing Image Acquisition

Fig. 7. Separate 3D objects obtained by clustering

The 3D descriptors of the obtained test models were compared
with the descriptors of the training model using Euclidean
distance [25] as the difference metric. Two descriptors
matched if their Euclidean distance was smaller than a
specified threshold. The Euclidean distance formula is given
below.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

E. Results

Across the 19 bags, we achieved an overall accuracy of
72.22% for gun detection using the 3DSC descriptor [15], which
is less than the 97% accuracy reported in the original
paper [13]. We were unable to replicate the results shown in
the original paper because projection matrix data for the gun
model was not available. Despite this, the method demonstrated
the advantages of using multiple views over a single view and
gave us insights that we used in subsequent experiments.

Here accuracy is defined as the number of guns detected
out of all guns present in the nineteen bags. Some bags
contained more than one gun, giving a total of 18 guns. As
this system could potentially be deployed in real-time
systems, speed is another important factor to be considered.
SHOT was the fastest descriptor: it brought the time per bag
down from 33 seconds with 3DSC to 18 seconds, although the
accuracy dropped to 66.67%.

Descriptor   3DSC [15]   SHOT [16]   USC [17]   FPFH [18]
Accuracy     72.22%      66.67%      55.56%     55.56%

A table showing the results from 3D Object Recognition
using various 3D feature descriptors.

III. BAG OF VISUAL WORDS

The X-ray images of bags usually involve high levels of
clutter, and identifying contraband in a short span of time is
very difficult. A reliable automated system to detect threats
would be a huge breakthrough. A single viewpoint of a bag is
not enough because of occlusion.

Before moving on to our BOVW approach, we tried simple
matching of SIFT descriptors between different types of guns
in GDXray and a target bag, which may or may not contain
guns. We used a FLANN (Fast Library for Approximate
Nearest Neighbors) [19] based matcher to compare a gun and
a bag. Refer to Figure 8 for a sample of descriptor matching
using FLANN. This library contains a collection of algorithms
optimized for fast nearest neighbor search in large datasets
and for high dimensional features. As Figure 8 shows, the gun
features get matched to other objects in the image as well. We
did not experiment much further with this approach since the
work of Chan et al. [20] shows that it is not suitable for
X-ray images. They demonstrate that SIFT-based matching on
X-ray images is not very feasible as it gives a high False
Positive Rate.

We use the Bag of Visual Words (BOVW) approach [4]
to tackle this image classification problem. This technique
is a modification of the Bag of Words (BOW) [21] approach
used in NLP and information retrieval. In the BOW approach,
we count the number of times each word appears in a document
and use the frequency of each word to identify the important
keywords in the document. Then we
create a frequency histogram from it. Figure 9 shows the
SIFT descriptors extracted from a sample bag and the histogram
generated from them. We treat each document as a bag of words
[21]. However, in the bag of visual words, instead of words,
we use image features as the words to create the frequency
histogram. Image features are unique patterns that we can find
in an image. There are a variety of feature types that can be
extracted from an image. Keypoints are the "stand out" points
in an image. Even if the image is rotated, shrunk or expanded,
its keypoints remain the same. The descriptor can be
considered the description of the keypoint.

Fig. 8. Descriptor Matching using FLANN

Fig. 9. SIFT Descriptors extracted from a Bag in GDXray and Histogram
generated

Following similar work by Toby Breckon [22], we use
dual-view single-energy X-ray images. Both the front and
side views of the object are considered. We use the GDXray
dataset [14], which consists of images of bags containing
hazardous objects. We treat this problem as a standard
binary classification problem: our aim is to differentiate bags
containing guns from bags without guns.

We train our classifier using four different detectors:
• Scale Invariant Feature Transform (SIFT) [7]
• Binary Robust Invariant Scalable Keypoints (BRISK) [23]
• Speeded-Up Robust Features (SURF) [6]
• Oriented FAST and Rotated BRIEF (ORB) [24]

A. Single Input Approach

1) Training: The training set consisted of 700 cropped
gun images from GDXray. These included 307 front views,
302 side views and 91 top and bottom views. Examples of
side and bottom images are shown in Figures 10 and 11.

Fig. 10. Side-View Image of Gun from GDXRAY Dataset

Fig. 11. Bottom-View Image of Gun from GDXRAY Dataset

We then extracted keypoints and descriptors for all
the images in the dataset. We separately experimented with
four descriptors, namely SIFT, SURF, ORB and BRISK.
Since the number of keypoint-descriptor pairs obtained
for most images was more than 100, we applied k-means to
obtain the visual codebook for the main keypoints (top 30)
in the image. Once this is done across all images,
we take an average of the 30 descriptors extracted from all
images to essentially get a histogram of the average features
of a gun.

We find the Euclidean distance [?] between the average
histogram and the histogram of each training image. We then
manually set a threshold value such that the distances of the
majority of training images lie below it.

2) Testing: The test set consisted of twenty images
of different bags (taken from GDXray). Ten of these bags
contained one or more guns while the rest had no guns. These
bags contain multiple objects, and the guns were partially
obstructed from view in many images. Figures 12 and 13
show the two types of bags used for this method.
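The single-input training procedure of Section III-A (visual-word histograms, an average reference histogram and a distance threshold) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the codebook is assumed to be already available (for example from k-means), the descriptors are toy 2-D vectors rather than SIFT output, and the quantile rule in `fit_reference` is a stand-in for the manually chosen threshold. All function names are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Normalised count of descriptors assigned to their nearest visual word."""
    # Pairwise Euclidean distances, shape (n_descriptors, n_words).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def fit_reference(train_hists: np.ndarray, quantile: float = 0.9):
    """Average histogram plus a threshold that most training images fall under.

    The quantile is an assumption replacing the manual threshold selection.
    """
    reference = train_hists.mean(axis=0)
    dists = np.linalg.norm(train_hists - reference, axis=1)
    return reference, float(np.quantile(dists, quantile))

def looks_like_gun(hist: np.ndarray, reference: np.ndarray, threshold: float) -> bool:
    """Predict a gun when the histogram is close enough to the reference."""
    return float(np.linalg.norm(hist - reference)) <= threshold

# Toy data: two visual words and four 2-D descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.1, 0.2], [9.5, 10.2], [10.1, 9.9], [0.3, -0.1]])
hist = bovw_histogram(descriptors, codebook)

train_hists = np.array([[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
reference, threshold = fit_reference(train_hists)
print(hist, looks_like_gun(hist, reference, threshold))
```

In practice the descriptors would come from a keypoint detector such as SIFT, and the threshold would need re-tuning for each descriptor type.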
We apply a sliding window method on each bag individually
to check whether a portion of the bag contains a gun. We take
a window size of 256x256. Similar to training, the top 30
keypoints and descriptors are then extracted from each
window.

We then calculate the distance of that window's
descriptors from the average histogram obtained in training.
We experimented with both Euclidean [25] and Manhattan
distance, but got better results using Euclidean distance as
the metric [26]. If the distance is less than the predetermined
threshold, the model predicts the presence of a gun.

Once this is done for all windows of a particular image,
if all the distances are more than the threshold, the model
predicts that no gun is present. Otherwise the model detects a
gun.

Fig. 12. Example 1 of Bag from GDXRAY Dataset - Does not contain a gun

3) Results: The accuracy obtained using different
descriptors is tabulated below.

Measure     SIFT     SURF     ORB      BRISK
Precision   54.54%   53.84%   41.6%    42.86%
Recall      60%      70%      50%      60%
F1 Score    0.57     0.61     0.45     0.50

Here we see that SURF gives the best overall result.

B. Dual Input Approach

A single image of a bag is not sufficient to accurately
make a prediction due to possible obstructions. We thus tried
the same method with a dual input approach.

1) Training: The same dataset as in the Single Input
Approach was used; however, the 91 top and bottom view
images were removed. This was done because we were unable
to get an accurate histogram for these views owing to the
limited number of top and bottom view images in our dataset.

Similar to the Single Input Approach, we again created
the visual codebook by calculating the average histogram;
however, this time we created two codebooks, one for the front
view and one for the side view. This was done by segregating
the dataset and calculating histograms separately.

2) Testing: The test set consisted of two orthogonal images
from each of 20 bags. Ten of these bags contained one or more
guns while the rest had no guns. These bags contained multiple
objects, and the guns were partially obstructed from view in
many images.

Similar to the Single Input Approach, we again used the sliding
window method for each bag. However, this time we took two
images from each bag into consideration before making the
prediction about the presence of a gun. Each window from
both images was compared to both the front-view and
side-view histograms to check for a match with the gun.
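The dual-view testing rule described above (slide a window over both images, compare each window against both the front-view and side-view references, and report a gun if any comparison falls below its threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the stride is not given in the text and is chosen arbitrarily, `window_distances` is a hypothetical callable standing in for the histogram comparison, and the toy images are tiny arrays rather than X-ray scans.

```python
from typing import Callable, Iterator, Sequence

import numpy as np

def sliding_windows(image: np.ndarray, size: int = 256, stride: int = 128) -> Iterator[np.ndarray]:
    """Yield square windows over the image; the stride value is an assumption."""
    height, width = image.shape[:2]
    for top in range(0, max(height - size, 0) + 1, stride):
        for left in range(0, max(width - size, 0) + 1, stride):
            yield image[top:top + size, left:left + size]

def view_has_gun(image: np.ndarray,
                 window_distances: Callable[[np.ndarray], Sequence[float]],
                 thresholds: Sequence[float],
                 size: int = 256, stride: int = 128) -> bool:
    """True if any window is closer than the threshold to any reference histogram."""
    for window in sliding_windows(image, size, stride):
        distances = window_distances(window)  # one distance per codebook
        if any(d < t for d, t in zip(distances, thresholds)):
            return True
    return False

def bag_has_gun(view_a: np.ndarray, view_b: np.ndarray,
                window_distances: Callable[[np.ndarray], Sequence[float]],
                thresholds: Sequence[float],
                size: int = 256, stride: int = 128) -> bool:
    """A gun is reported if either view contains a matching window."""
    return (view_has_gun(view_a, window_distances, thresholds, size, stride)
            or view_has_gun(view_b, window_distances, thresholds, size, stride))

def toy_distances(window: np.ndarray) -> Sequence[float]:
    """Hypothetical stand-in: bright windows are 'close' to both references."""
    d = 1.0 - float(window.mean())
    return (d, d)

empty = np.zeros((4, 4))
with_gun = np.zeros((4, 4))
with_gun[0:2, 0:2] = 1.0
detected = bag_has_gun(empty, with_gun, toy_distances, (0.5, 0.5), size=2, stride=2)
clean = bag_has_gun(empty, empty, toy_distances, (0.5, 0.5), size=2, stride=2)
print(detected, clean)
```

Because a single matching window in either view triggers a detection, this rule tends to raise recall at the cost of extra false positives; requiring agreement between the two views would be the stricter alternative.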
3) Results: The accuracy obtained using different
descriptors is tabulated below.

Measure     SIFT     SURF     ORB      BRISK
Precision   66.67%   62.5%    50%      50%
Recall      80%      50%      70%      60%
F1 Score    0.73     0.55     0.58     0.54

Here we see that SIFT gives the best overall result.
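As a worked check of how the three measures relate, the sketch below recomputes the dual-input SIFT column from raw counts. The counts are inferred rather than reported: 10 of the 20 test bags contain guns, so 80% recall implies 8 true positives and 2 false negatives, and 66.67% precision then implies 4 false positives. The function name is hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions: F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts for dual-input SIFT: 8 hits, 4 false alarms, 2 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=4, fn=2)
print(f"precision={precision:.4f} recall={recall:.2f} f1={f1:.2f}")
```

The same function applied to the single-input SIFT column (6 true positives, 5 false positives, 4 false negatives under the same inference) gives an F1 of about 0.57.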
Fig. 13. Example 2 of Bag from GDXRAY Dataset - Contains two guns

C. Limitations of the Approach

Figuring out the threshold to be used for gun detection is a
manual process that requires a lot of trial and error. Moreover,
the same threshold might not give good results for images
from a slightly different dataset.

This method also gives a lot of false positives, even for the
dual input model. Even with the SIFT approach, we got 4 false
positives in 20 bags. This is because the method reports a gun
as soon as a single view detects one and does not verify the
detection against the other view. Doing it this way does,
however, help in increasing the true positives.
IV. DUAL INPUT CNN

A. Overview

With tremendous progress in the field of Deep Learning
since the creation of the VGG network [19], we wanted to test
whether a deep neural network could give us a better result
than the methods discussed above. The VGG architecture is
shown in Figure 14.

VGG was the next step after the huge success of AlexNet.
VGG used small receptive fields (3x3 compared to 11x11). 1x1
convolutional layers were also incorporated in the network.
These properties may seem trivial, but they were very important
changes in the field of using Convolutional Neural Networks
for Image Recognition and Object Detection.

Fig. 14. VGG Architecture

The Inception network [35] was an important development
for the advancement of the field of Computer Vision. This
network introduced the idea of making the network wider
rather than deeper to reduce computational expense as well as
prevent overfitting. This is done using Inception modules,
which allow more efficient computation and deeper networks
using stacked 1×1 convolutions for dimensionality reduction.
The solution takes multiple kernel filter sizes within the
CNN and arranges them so that they operate on the same
level rather than one after the other. Figure 15 shows the
Inception Net architecture.

Fig. 15. Inception Network Architecture

ResNet (shown in Figure 17) successfully tackled the problem
of vanishing gradients in very deep networks. This was done
with the help of residual units (Figure 16). A residual unit
consists of an identity shortcut connection, which skips one
or more layers. This does not degrade the performance of the
network, as identity mappings are simply being stacked upon
the current network.

Fig. 16. Residual Unit

Fig. 17. ResNet Architecture

We had previously tested our models only on guns. Here
we used Gun, Shuriken and Knife as our target labels. We
developed three dual-input architectures. These architectures
were built on top of InceptionV3, VGG19 and ResNet50, all
of which are well known models in the field of deep learning
and image classification.

B. Dataset

From the GDXray dataset, we have X-ray images of 19
bags, each of the bags having different threat objects - gun,
blade, shuriken and knife. For the scope of this problem, we
detect whether the bag contains a gun or not, whether
it contains a blade or not and whether it contains a shuriken
or not. This problem is very specific, but it can be generalized
to all types of threat objects given suitable datasets of
dual-view X-ray images. Each of these 19 bags has 180 images,
each of the same bag but from a different angle. The angles
range from 0 to 360 degrees with an interval of 2 degrees. If
we consider the image shot at 0 degrees and the one shot at
90 degrees, they are orthogonal to each other and can be
captured by a dual-view X-ray machine (one example is shown
in Figure 18). The same goes for images shot at 0 degrees
and 270 degrees. Thus, we have generated 180 pairs from the
180 images of each bag, for a total of 19*180 = 3420 pairs.
For training the network, we used 174 pairs from each of
16 bags, i.e. a total of 2784 pairs. For testing the network,
we used the 6 remaining pairs from each of these 16 bags and
30 pairs from each of the three remaining bags, which the
network has not seen, giving a total of 186 pairs.

Fig. 18. A pair of orthogonal images

The dataset has images from 19 bags: 14 of them contain
guns, 10 contain blades and 5 contain shurikens.

C. Baseline Model

Before experimenting with the dual input architecture,
we implemented a vanilla Inception-v3 [10] architecture
(with only a single input) as our baseline model. We fed
our network 2784 images for training. Then we applied our
model to 186 images for testing (the same number as in the
dual-input case, but with single images instead of pairs). This
CNN architecture gave us a decent accuracy of around 65%. This
metric is explained in detail in the next section.

D. Dual Input Model

Consider the Inception model: each of the two orthogonal
input images was fed to one Inception-v3 module with its
final fully connected layer omitted and all the weights fixed
to those trained on 'imagenet'. These weights were frozen,
i.e. all the layers were set to non-trainable. The outputs of
the two modules were concatenated and two dense layers were
added on top. The first dense layer had 512 neurons and the
second had 128 neurons. The activation function for both of
these layers is 'relu'. Finally, the output layer had three
neurons, each giving the prediction of whether the bag
contains a gun, a blade or a shuriken. This Dual Input CNN
model is shown in Figure 19.

To illustrate this, if the output vector for a given test
input is [1 0 1], this means that the bag contains a gun,
does not contain a blade and contains a shuriken. If even
one of the predictions is one, a threat is detected. The
activation function for the output layer was 'sigmoid'. The
loss function for the model was 'binary-crossentropy' [28]
and the optimizer used was 'Adam' [29]. The loss function was
chosen because the problem is now a multi-class, multi-label
classification problem. The same architecture and layer
parameters were used for the other two models (VGG19 and
ResNet50). The models were run for a total of 50 epochs.

E. Implementation Details

The networks were trained in a Google Colab environment
with 25GB of RAM and the default GPUs. The ResNet50
model had a total of 179,362,563 parameters, of which
132,187,139 were trainable and the remaining 47,175,424
non-trainable. The VGG19 model had a total of 69,475,459
parameters, with 40,048,768 non-trainable and just 29,426,691
trainable. Finally, the InceptionV3 model had 117,072,451
parameters, 73,466,883 of them trainable and 43,605,568
non-trainable.

The input images originally had dimensions 2208x2688 but
were scaled down to 224x270 because of local computational
bottlenecks while training. Also, the images were originally
in gray-scale and had only one input channel, the resulting
dimension being 224x270x1. This channel was stacked three
times to make the input dimension 224x270x3, which is a
requirement for all these common transfer learning models.

The models were built and trained using functions from the
TensorFlow library. Some of the code in the library had to be
edited in order for the models to run.

F. Results

Because of the limited dataset, accuracy in this case
is a slightly flawed but still the best available way to
measure the performance of the models. Since all of the data
points in the training set have threat objects, the job of the
model is to accurately predict the presence of the predefined
threat objects on which the models were trained, namely gun,
blade and shuriken. The result is counted as correct only
when the presence of each threat object is predicted correctly.
Considering this, the models have a validation accuracy
between 75 and 90 percent in identifying the correct threats,
which is promising since the test set also includes images of
bags which the network has never seen before. Out of the
three, the best accuracy is seen on the model built on top of
ResNet50, which is no surprise as it has the most parameters
of the three. Since the number of types of threat objects
is not very large (different types of guns, explosives, etc.),
this method can be extended to detect all types of threat
objects in dual-view and, if possible, multi-view X-ray images.

Fig. 19. Dual Input Architecture

The exact results (the measure being accuracy) are given
in the table below.

Baseline   InceptionV3   ResNet50   VGG19
65.09%     78.07%        87.72%     77.19%

V. ADAPTIVE IMPLICIT SHAPE MODEL

A. Overview

This approach is inspired by the work of Domingo Mery
[30]. The AISM (Adaptive Implicit Shape Model) is derived
from the well-known Implicit Shape Model (ISM) approach [31].
Initially, the ISM method was developed for category
recognition of various objects. The implicit shape
model works on two basic ideas: learning an appearance
codebook and learning a star-topology [36] structural model,
i.e. the features are considered independent given the
object center. The star-topology model is shown in Figure
20. The model for a given object category consists of a
codebook of local appearances that are representative of that
category. The codebook can be thought of as a
class-specific alphabet for the given category. The model also
contains a spatial probability distribution. This distribution
specifies where (as a displacement from the central position)
each codebook entry (letter) may be found on the object (a
word made from the alphabet). The probabilistic Generalized
Hough Transform [32] algorithm is used for detection.

Fig. 20. Star Topology

B. Visual Vocabulary - Codebook

The visual vocabulary (codebook) is used to index votes for
object position. A visual word in the codebook can be thought
of as a 'part' of the image. The part is defined by its center
of mass in the codebook. For codebook generation, a training
database of 200 images each of a handgun, a blade and a
shuriken is used. The SIFT approach [7] is used to extract
keypoints and their neighbouring visual descriptors from all
training images of the target object. A keypoint can
be thought of as a distinguishable point in an image, i.e. it
represents a unique feature of the target object.

C. Difference Between AISM and ISM

The basic distinction between the original methodology (ISM)
and the proposed method (AISM) [9] is that in ISM, only the
detection with the highest similarity score is considered a
valid detection. This is why multiple threat objects cannot be
detected in a single image. AISM, on the other hand, merges
all the occurrences in the image whose similarity scores
exceed a certain predefined threshold.

D. Testing and Detection

For each test image, probable centers of target objects are
extracted from the image. These points are called interest
points. Each interest point is matched to the closest
visual word in the codebook and a probability is calculated.
This probability expresses whether the interest point
corresponds to the center of mass of the target object. Each
matched interest point then casts votes for specific
instances of the given object category at different points
on the image according to the learnt spatial probability
distribution. After this step, many subwindows are collected
and overlapping subwindows among these are merged. Each
subwindow above a pre-decided threshold probability is then
selected. If no subwindow in the target image meets this
requirement, no potential target object is
detected. Such detection is carried out for each threat object
(gun, blade and shuriken) separately.
The following three images (Figure 21, Figure 22 and Figure 23) show the working of the method in detail. They show the stage in which we test for the threat of a blade; detection of a gun and a shuriken is performed after this. Three stages are shown: extraction of keypoints, merging of overlapping subwindows and final detection. Figure 23 shows one correct detection along with one false positive.
Fig. 23. Detection of Blade Stage III - final detection - one true detection
and one false positive
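The merging and final-selection stages can be sketched as follows. This is an illustrative sketch only: the paper does not state the overlap criterion or how scores are combined, so the intersection-over-union test, the `min_iou` value and the summing of scores are assumptions.

```python
def iou(a, b):
    """Intersection over union of two windows given as (x0, y0, x1, y1, score)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_windows(windows, min_iou=0.3):
    """Repeatedly replace an overlapping pair by its bounding box (scores summed)."""
    merged = [tuple(w) for w in windows]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if iou(merged[i], merged[j]) >= min_iou:
                    a, b = merged[i], merged[j]
                    box = (min(a[0], b[0]), min(a[1], b[1]),
                           max(a[2], b[2]), max(a[3], b[3]), a[4] + b[4])
                    merged = [m for k, m in enumerate(merged) if k not in (i, j)]
                    merged.append(box)
                    changed = True
                    break
            if changed:
                break
    return merged

def select(windows, threshold):
    """Final detection: keep windows whose score reaches the threshold."""
    return [w for w in windows if w[4] >= threshold]

windows = [(0, 0, 10, 10, 0.5), (2, 2, 12, 12, 0.25), (50, 50, 60, 60, 0.5)]
final = select(merge_windows(windows), threshold=0.6)
print(final)
```

Here the first two windows overlap and are merged into one window that passes the threshold, while the isolated third window is discarded.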
All the training images for the codebook generation and the
starter AISM Matlab code are available at Domingo Mery’s
university page.
E. Codebook Generation
We used 200 gun images, 100 blade images and 100 shuriken images to generate the codebook. These were obtained in the same way as in Section II-B, using single-spectrum X-rays.
Fig. 21. Detection of Blade Stage I - extraction of keypoints
Next, we created a visual codebook using the SIFT descriptors, similar to Section III-A.1. However, here we use an agglomerative clustering strategy [34] instead of K-means [33] clustering.
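To make the difference from K-means concrete, the sketch below builds a codebook by bottom-up (agglomerative) clustering. It is a minimal illustration, not the code used in this work: the centroid linkage, the `max_dist` stopping rule and the toy 2-D descriptors (real SIFT descriptors are 128-dimensional) are assumptions.

```python
import math

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def agglomerative_codebook(descriptors, max_dist):
    """Merge the two closest clusters until the closest pair is farther than max_dist.

    Unlike K-means, the number of codebook entries is not fixed in advance;
    it follows from the distance cut-off.
    """
    clusters = [[tuple(d)] for d in descriptors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_dist:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Each codebook entry is the centroid of one cluster.
    return [centroid(c) for c in clusters]

descriptors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0), (50.0, 50.0)]
codebook = agglomerative_codebook(descriptors, max_dist=5.0)
print(codebook)
```

The two nearby pairs collapse into one entry each and the outlier remains its own entry, giving three codebook entries without having to choose the number of clusters beforehand.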
F. Results
We created a custom test set of 180 pairs of orthogonal images from the GDXray dataset. In our test set, each target image contains one or more of the three threats: gun, blade and shuriken. A detection is counted as correct when the correct threat is detected at the correct position in the image. We test the target images for each of the three threat objects individually.
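The scoring rule can be sketched as below. The exact matching criterion is not given in the text, so the rule used here (a detection is correct if its center falls inside a ground-truth box with the same label) and all names are assumptions for illustration.

```python
def inside(point, box):
    """True if point (x, y) lies within box (x0, y0, x1, y1)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def score_image(detections, ground_truth):
    """Return (threats correctly found, false detections) for one image.

    detections:   list of (label, (x, y)) detection centers
    ground_truth: list of (label, (x0, y0, x1, y1)) annotated threats
    """
    found = set()
    false_detections = 0
    for label, center in detections:
        hit = None
        for idx, (gt_label, box) in enumerate(ground_truth):
            if gt_label == label and inside(center, box):
                hit = idx
                break
        if hit is None:
            false_detections += 1  # detected where no such threat exists
        else:
            found.add(hit)
    return len(found), false_detections

# One blade in the image; one detection on it and one elsewhere.
ground_truth = [("blade", (10, 10, 50, 30))]
detections = [("blade", (30, 20)), ("blade", (200, 200))]
result = score_image(detections, ground_truth)
print(result)
```

This toy case mirrors the situation in Figure 23: one true detection and one false positive in the same image.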
1) Single View: Only one image from each pair in our custom dataset was used. In the 180 single-view images, AISM correctly detected threat objects in 145 images. However, the number of false detections was high: 43 were recorded.
Fig. 22. Detection of Blade Stage II - merging of overlapping subwindows
2) Dual View: In the 180 dual-view pairs from the custom dataset, AISM correctly detected the threats in 165 images. However, the number of false positives also increased: 60 false detections were recorded.
A False Detection - A target image can have more than one threat object detection. A false detection is counted when the place where the threat is detected in the image actually contains no threat object. Figure 23 shows one true detection and one false positive.
VI. CONCLUSION
In the desire to improve threat detection using X-ray images, we explored 4 different approaches.
First, we experimented with the novel work of Dr. Vladimir Riffo, which gave good results but was not a practical approach, as 180 images were needed from each bag to make an accurate detection. Due to the unavailability of projection matrices of gun images, we created our own implementation of 3D space carving. Using this, we generated gun models using only two gun images. Considering our limitations, this approach shows promise. However, the need to have 90 images of the target bag to make a detection makes this infeasible for real-life scenarios, as most modern X-ray scanners have at most 2 views.
Secondly, using the Bag of Visual Words technique, we made a classifier to detect guns in a bag. We experimented with both single-view and dual-view approaches and got much better precision and recall for the dual-view approach. However, even the dual-input approach resulted in false positives, as both views were considered separately and no combined inference was drawn. We explored the idea of a combined inference over both views while implementing our CNN models.
We experimented with variants of the Inception-V3, VGG-19 and ResNet50 models with transfer learning. Instead of the standard CNN implementation, we used a Dual Input CNN model with custom top layers to consider both the front and side views of an object. For this process, we segregated orthogonal pairs from the original dataset. Out of these three, the model built on top of ResNet50 performed the best. This approach should work better if a more uniform dataset of dual-view images is provided for training.
Finally, we discussed the Adaptive Implicit Shape Model (AISM) approach, which performs better in terms of detecting threats when we use dual-view instead of single-view. Accurate threat detection increased from about 80% (145 of 180 images) to about 91% (165 of 180 images) on moving from the single-view to the dual-view approach. However, the dual-view mode also generated more false positives. In the case of threat detection, false positives do not matter a lot, as the percentage of bags containing threats is minuscule. The aim is to reduce the number of false negatives, which is achieved through the AISM dual-view approach.
ACKNOWLEDGMENT
We would also like to thank Professor Vladimir Riffo and Professor Domingo Mery for their novel work and also for providing us with valuable data for our project. All our work was done on the GDXray dataset, taken from Domingo Mery's website [40]. In addition to that, both authors were generous in providing us with additional information on the dataset. We have adopted the AISM implementation from Dr. Riffo's work.
REFERENCES
[1] E. Parliament, "Aviation security with a special focus on security scanners," in Proc. Eur. Parliament Resolut. (INI), Oct. 2012, pp. 1–10.
[2] A. Bolfing, T. Halbherr, and A. Schwaninger, "How image based factors and human factors contribute to threat detection performance in X-ray aviation security screening," in HCI and Usability for Education and Work (LNCS 5298), A. Holzinger, Ed. Heidelberg, Germany: Springer, 2008, pp. 419–438.
[3] G. Blalock, V. Kadiyali, and D. H. Simon, "The impact of post-9/11 airport security measures on the demand for air travel," J. Law Econ., vol. 50, no. 4, pp. 731–755, Nov. 2007.
[4] C.-F. Tsai, "Bag-of-words representation in image annotation: A review," ISRN Artif. Intell., vol. 2012, 2012, Art. no. 376804.
[5] Chapelle, Olivier, Patrick Haffner, and Vladimir N. Vapnik. "Support vector machines for histogram-based image classification." IEEE transactions on Neural Networks 10.5 (1999): 1055-1064.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "Surf: Speeded up robust features." European conference on computer vision. Springer, Berlin, Heidelberg, 2006.
[7] Lowe, David G. "Distinctive image features from scale-invariant keypoints." International journal of computer vision 60.2 (2004): 91-110.
[8] Kutulakos, Kiriakos N., and Steven M. Seitz. "A theory of shape by space carving." International journal of computer vision 38.3 (2000): 199-218.
[9] V. Riffo and D. Mery, "Automated detection of threat objects using adapted implicit shape model," IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 4, pp. 472–482, Apr. 2016.
[10] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[11] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[12] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[13] Riffo, Vladimir, Ivan Godoy, and Domingo Mery. "Handgun Detection in Single-Spectrum Multiple X-ray Views Based on 3D Object Recognition." Journal of Nondestructive Evaluation 38.3 (2019): 66.
[14] Mery, D., Riffo, V., Zscherpel, U., Mondragón, G., Lillo, I., Zuccar, I., Lobel, H., Carrasco, M.: GDXray: The database of X-ray images for non-destructive testing. Journal of Nondestructive Evaluation 34(4) (2015) 1–12.
[15] Frome, A., Huber, D., Kolluri, R., Bülow, T., & Malik, J. (2004, May). Recognizing objects in range data using regional point descriptors. In European conference on computer vision (pp. 224-237). Springer, Berlin, Heidelberg.
[16] Tombari, F., Salti, S., & Di Stefano, L. (2010, September). Unique signatures of histograms for local surface description. In European conference on computer vision (pp. 356-369). Springer, Berlin, Heidelberg.
[17] Tombari, F., Salti, S., & Di Stefano, L. (2010, October). Unique shape context for 3D data description. In Proceedings of the ACM workshop on 3D object retrieval (pp. 57-62).
[18] Rusu, R. B., Blodow, N., & Beetz, M. (2009, May). Fast point feature histograms (FPFH) for 3D registration. In 2009 IEEE international conference on robotics and automation (pp. 3212-3217). IEEE.
[19] Muja, Marius, and David G. Lowe. "Fast approximate nearest neighbors with automatic algorithm configuration." VISAPP (1) 2.331-340 (2009): 2.
[20] J.W. Chan, A. Omar, J.P.O. Evans, D. Downes, X. Wang, and Y. Liu, "Feasibility of SIFT to synthesise KDEX imagery for aviation luggage security screening," in International Conference on Crime Detection and Prevention, 2009, pp. 1–6.
[21] Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework." International Journal of Machine Learning and Cybernetics 1.1-4 (2010): 43-52.
[22] Turcsany, Diana, Andre Mouton, and Toby P. Breckon. "Improving feature-based object recognition for X-ray baggage security screening using primed visual words." 2013 IEEE International Conference on Industrial Technology (ICIT). IEEE, 2013.
[23] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. "BRISK: Binary robust invariant scalable keypoints." 2011 International conference on computer vision. IEEE, 2011.
[24] Rublee, Ethan, et al. "ORB: An efficient alternative to SIFT or SURF." 2011 International conference on computer vision. IEEE, 2011.
[25] Danielsson, Per-Erik. "Euclidean distance mapping." Computer Graphics and image processing 14.3 (1980): 227-248.
[26] Kong, Weihao, Wu-Jun Li, and Minyi Guo. "Manhattan hashing for large-scale image retrieval." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. 2012.
[27] Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with gumbel-softmax." arXiv preprint arXiv:1611.01144 (2016).
[28] Zhang, Zhilu, and Mert Sabuncu. "Generalized cross entropy loss for training deep neural networks with noisy labels." Advances in neural information processing systems. 2018.
[29] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
[30] Mery, Domingo, et al. "Modern computer vision techniques for x-ray testing in baggage inspection." IEEE Transactions on Systems, Man, and Cybernetics: Systems 47.4 (2016): 682-692.
[31] Leibe, Bastian, Ales Leonardis, and Bernt Schiele. "Combined object categorization and segmentation with an implicit shape model." Workshop on statistical learning in computer vision, ECCV. Vol. 2. No. 5. 2004.
[32] Ballard, Dana H. "Generalizing the Hough transform to detect arbitrary shapes." Readings in computer vision. Morgan Kaufmann, 1987. 714-725.
[33] Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. "The global k-means clustering algorithm." Pattern recognition 36.2 (2003): 451-461.
[34] Beeferman, Doug, and Adam Berger. "Agglomerative clustering of a search engine query log." Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000.
[35] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
[36] Habbab, I. S. A. M., Mohsen Kavehrad, and C. Sundberg. "Protocols for very high-speed optical fiber local area networks using a passive star topology." Journal of Lightwave Technology 5.12 (1987): 1782-1794.
[37] Rajpurkar, Pranav, et al. "Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning." arXiv preprint arXiv:1711.05225 (2017).
[38] Brosnan, Tadhg, and Da-Wen Sun. "Improving quality inspection of food products by computer vision - a review." Journal of food engineering 61.1 (2004): 3-16.
[39] Brosnan, Tadhg, and Da-Wen Sun. "Improving quality inspection of food products by computer vision - a review." Journal of food engineering 61.1 (2004): 3-16.
[40] https://domingomery.ing.puc.cl/