Abstract—Accurate X-ray screening systems are of paramount importance in the present day. Most existing systems predict only on the basis of a single image, which can lead to false positives and false negatives because of the limited information present. In this paper, we explore and modify the existing work in this field. We implemented several approaches using single, dual and multiple X-Ray views to make a reliable and practical model, with varying levels of success. These approaches include long-established methods such as Bag of Visual Words, 3D Object Recognition, Adaptive Implicit Shape Model and Deep Neural Networks. We have made all these approaches more relevant for existing scenarios by taking in dual inputs to make more informed predictions. We obtained varying levels of success with these methods, ranging from 73% using Bag of Visual Words to 87% using Deep CNN. We show that when a dual view of an object is considered, we gain an improvement of 5% to 15% (depending on the approach) compared to a single view approach.

Index Terms—Threat Detection, X-Ray, Bag of Visual Words, Deep Learning, Transfer Learning, Adaptive Implicit Shape Model

I. INTRODUCTION
In recent years, especially after the New York plane crashes in 2001, X-Ray screening systems have been used worldwide to protect and secure many places like airports [1], railway stations, malls, stadiums, government buildings, etc. These are the places where safety is of paramount importance because of a lot of crowd movement, as shown in Figure 1. The main aim of these screening systems is to detect dangerous objects such as knives, handguns and explosives in baggage items being carried by the public during their travel.

Screening is a very demanding job and requires a lot of concentration from the security officers, as only a minuscule percentage of bags contain weapons or explosives. Detecting these weapons is in general a very difficult task because the items are closely packed and the view is blocked by other objects [2]. Because of the huge difficulty of these identification tasks, some threat objects are bound to be missed by the elaborate security systems on account of human error. The detection rates of humans as well as computers are severely affected by the complexity: all types of objects are present in the bags of passengers. In addition to this, worldwide air travel and leisure tourism are constantly increasing. This further reduces the time spent by the security officers per person, which in turn reduces the chance of detecting a threat object. It can be very cost effective if the security check per customer is shortened even by a small amount of time [3].
In spite of this, human security officers receive minimal technological support in identifying threat objects. Other than airport baggage security, there are many areas where X-Ray image inspection has been automated and has proven useful. One obvious example is in the medical field: detection of diseases, even Covid-19, through the chest X-Ray of a patient [37]. Less obvious applications are food quality inspection [38] and inspection of welds and cargo [39]. Considering all this, an automated system to detect threat objects in passenger baggage and cargo would greatly streamline the process of airport security and reduce the workload of security officers. A lot of research has been going on towards automatic detection of threat objects. Most of these methods improve the quality of the image through novel pseudo-colour algorithms and image segmentation and use methods like Bag of Visual Words [4], Support Vector Machines [5], 3D re-projections, etc. Moderate recall and precision have been achieved using the SURF [6] and SIFT [7] descriptors through the Bag of Visual Words method. In all of these methods, threat detection in a single view image of threat objects is the key feature.

[10], VGG19 [11], ResNet50 [12], etc. and reported their performance. In section V, we study the Adaptive Implicit Shape Model with two images fed simultaneously instead of one. Finally, in section VI, we offer conclusions on our experiments and observations and discuss the future scope of the proposed techniques.
a satisfactory 3D model was created in the training stage.
B. Training
is less than the proposed accuracy in the original paper [13])
of 97%. We were unable to replicate the results shown in the
original paper because projection matrix data for the gun model
were not available. Despite this, the method demonstrated the
advantages of using multiple views over a single view and
gave us insights that we used in subsequent experiments.
Fig. 9. SIFT Descriptors extracted from a Bag in GDXray and Histogram generated

Following the similar work of Toby Breckon [22], we use dual-view single-energy X-Ray images. Both the front and side views of the bag are considered. We use the GDXray Dataset [14], which consists of images of bags containing hazardous objects. We treat this as a standard binary classification problem. Our aim is to differentiate bags containing guns from bags without guns.

We train our classifier using four different detectors:
• Scale Invariant Feature Transform (SIFT) [7]
• Binary Robust Invariant Scalable Keypoints (BRISK) [23]
• Speeded-Up Robust Features (SURF) [6]
• Oriented FAST and Rotated BRIEF (ORB) [24]

A. Single Input Approach
1) Training: The training set consisted of 700 cropped gun images from GDXray. These included 307 front views, 302 side views and 91 top and bottom views. Examples of side and bottom images are shown in Figures 10 and 11. Since the number of keypoint-descriptor pairs obtained for most images was more than 100, we applied k-means to obtain the visual codebook for the main keypoints (top 30) in the image. Once this is done across all images, we take an average of the 30 descriptors extracted from all images to essentially get a histogram of the average features of a gun.

We find the Euclidean distance [25] between the average histogram and the histogram of each training image. We then manually set a threshold value such that the distance of the majority of training images lies below it.

2) Testing: The test set consisted of twenty images of different bags (taken from GDXray). Ten of these bags contained one or more guns while the rest had no guns. These bags contain multiple objects, and the guns were partially obstructed from view in many images. Figures 12 and 13 show the two types of bags used for this method.
We apply a sliding window method on each bag individually to check whether a portion of the bag has a gun. We take a window size of 256x256. Similar to training, the top 30 keypoints and descriptors were then extracted from each window.

We then calculated the distance of that window's descriptors from the average histogram obtained in training. We experimented with both Euclidean [25] and Manhattan distance, but got better results using Euclidean distance as the metric [26]. If the distance is less than the predetermined threshold, the model predicts the presence of a gun.

Once this is done for all windows of a particular image, if all the distances are more than the threshold, the model predicts that no gun is present. Otherwise the model detects a gun.

3) Results: The results obtained using the different descriptors are tabulated below.

Measure     SIFT     SURF     ORB      BRISK
Precision   54.54%   53.84%   41.6%    42.86%
Recall      60%      70%      50%      60%
F1 Score    0.57     0.61     0.45     0.50

Here we see that SURF gives the best overall result.

B. Dual Input Approach
A single image of a bag is not sufficient to accurately make a prediction due to possible obstructions. We thus tried the same method with a dual input approach.

1) Training: The same dataset mentioned in the Single Input Approach was used; however, the 91 top and bottom view images were removed. This was done because we were unable to get an accurate histogram for these views, owing to the limited number of top and bottom view images in our dataset.
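The window-level test described above, which the dual input approach reuses for each view, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: `describe_window` is a hypothetical stand-in for the keypoint extraction and histogram step, the stride of 128 pixels is an assumed value (the text gives only the 256x256 window size), and images are plain nested lists so that the example runs without extra libraries.

```python
import math

WINDOW = 256  # window size stated in the text
STRIDE = 128  # assumed value; the text does not state the stride


def window_origins(height, width, window=WINDOW, stride=STRIDE):
    """Top-left corners of the windows scanned over an image."""
    rows = range(0, max(height - window, 0) + 1, stride)
    cols = range(0, max(width - window, 0) + 1, stride)
    return [(r, c) for r in rows for c in cols]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_gun(image, avg_histogram, threshold, describe_window):
    """Flag a gun if any window is closer than `threshold` to the average."""
    height, width = len(image), len(image[0])
    for r, c in window_origins(height, width):
        patch = [row[c:c + WINDOW] for row in image[r:r + WINDOW]]
        if euclidean(describe_window(patch), avg_histogram) < threshold:
            return True   # a single matching window is enough
    return False          # no window fell below the threshold


def mean_intensity(patch):
    """Toy descriptor (hypothetical): a one-bin 'histogram' of mean intensity."""
    return [sum(map(sum, patch)) / (len(patch) * len(patch[0]))]


# Toy data: an empty bag and a bag with one bright 256x256 region.
empty_bag = [[0.0] * 512 for _ in range(512)]
bag_with_blob = [row[:] for row in empty_bag]
for r in range(256, 512):
    for c in range(256, 512):
        bag_with_blob[r][c] = 1.0

print(detect_gun(empty_bag, [1.0], 0.1, mean_intensity))      # False
print(detect_gun(bag_with_blob, [1.0], 0.1, mean_intensity))  # True
```

In a real pipeline `describe_window` would be replaced by one of the four descriptors named in the text, and the threshold would be the manually chosen value from training.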
from a slightly different dataset.
This method also gives a lot of false positives, even for the
dual input model. Even with the SIFT approach, we got 4 false
positives in 20 bags. This is because the model reports a gun if
even a single view detects one and does not verify it with
the other view. This does, however, help in increasing the
true positives.
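The behaviour described above corresponds to combining the two per-view decisions with a logical OR. The sketch below contrasts that rule with an AND rule that would require both views to agree. The AND variant is included only for comparison and is not something the paper reports implementing, and the per-view decisions are made-up toy values.

```python
def fuse_or(front_detects, side_detects):
    """Rule described in the text: flag a gun if either view flags one."""
    return front_detects or side_detects


def fuse_and(front_detects, side_detects):
    """Hypothetical stricter rule: both views must agree."""
    return front_detects and side_detects


def count_outcomes(decisions, labels, fuse):
    """Count (true positives, false positives) over (front, side) decisions."""
    true_pos = false_pos = 0
    for (front, side), has_gun in zip(decisions, labels):
        if fuse(front, side):
            if has_gun:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos, false_pos


# Toy example: four bags, the first two actually contain a gun.
decisions = [(True, False), (True, True), (False, True), (False, False)]
labels = [True, True, False, False]

print(count_outcomes(decisions, labels, fuse_or))   # (2, 1)
print(count_outcomes(decisions, labels, fuse_and))  # (1, 0)
```

On this toy data the OR rule recovers the gun seen in only one view, at the cost of a false positive triggered by a single view, which is the trade-off the text describes.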
VGG was the next step after the huge success of AlexNet.
VGG used small receptive fields (3x3 filters, compared to the
11x11 filters in AlexNet's first layer). 1x1 convolutional layers
were also incorporated in the network. These properties may seem
trivial, but they were very important changes in the use of
Convolutional Neural Networks for Image Recognition and Object Detection.
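One way to see why small filters matter is to compare a stack of 3x3 convolutions with a single large filter that covers the same receptive field. The sketch below is an idealized worked example, assuming stride-1 convolutions, equal input and output channel counts and no bias terms; the function names and the channel count of 64 are illustrative choices.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count for `num_layers` convolutions with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


channels = 64  # illustrative channel count

# Five stacked 3x3 layers see the same 11x11 region as one 11x11 filter.
print(stacked_receptive_field(5))                # 11
print(conv_weights(3, channels, num_layers=5))   # 184320
print(conv_weights(11, channels))                # 495616
```

Under these assumptions the stacked small filters need well under half the weights of the single large filter, while also applying more non-linearities along the way.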
blade, shuriken and knife. For the scope of this problem, we detect whether the bag contains a gun, whether it contains a blade and whether it contains a shuriken. This problem is very specific, but it can be generalized to all types of threat objects on the acquisition of suitable datasets of dual-view X-Ray images. Each of these 19 bags has 180 images, each of the same bag but from a different angle. The angles range from 0 to 360 degrees with an interval of 2 degrees. If we consider the image shot at 0 degrees and the one shot at 90 degrees, they are orthogonal to each other and can be captured by a dual-input X-Ray machine (one example is shown in Figure 18). The same goes for images shot at 0 degrees and 270 degrees. Thus, we have generated 180 pairs from the 180 images of each bag, for a total of 19*180 = 3420 pairs. For training the network, we used 174 pairs from each of 16 bags, i.e. a total of 2784 pairs. For testing the network, we used the 6 remaining pairs from each of these 16 bags and 30 pairs from each of the three remaining bags, which the network has not seen, giving a total test set size of 186 pairs.

Fig. 18. A pair of orthogonal images

The dataset has images from 19 bags: 14 of them contain guns, 10 contain blades and 5 contain shurikens.

C. Baseline Model
Before experimenting with the dual input architecture, we implemented a vanilla Inception-v3 [10] architecture (with only a single input) as our baseline model. We fed our network 2784 images for training and then applied the model to 186 images for testing (the same number as for the dual-input model, but with single images instead of pairs). This CNN architecture gave us a decent accuracy of around 65%. This metric is explained in detail in the next section.

D. Dual Input Model
Consider the Inception model: each of the two orthogonal input images was fed to one Inception-v3 module with its final connected layer omitted and all the weights fixed as those used in the 'imagenet' competition. These weights were frozen, i.e. all the layers were set to non-trainable. The outputs of the modules were concatenated and two dense layers were added on top of that. The first dense layer had 512 neurons and the second had 128 neurons. The activation function for both of these layers is 'relu'. Finally, the output layer had three neurons, each giving the prediction of whether the bag contains a gun, a blade or a shuriken. This Dual Input CNN Model is shown in Figure 19.

To illustrate this, if the output vector for a given test input is [1 0 1], this means that the bag contains a gun, does not contain a blade and contains a shuriken. Even if some of the predictions are zero, a threat is detected as long as at least one prediction is one. The activation function for the output layer was 'sigmoid'. The loss function for the model was 'binary-crossentropy' [28] and the optimizer used was 'Adam' [29]. The loss function was chosen as such because the problem is now a multi-class multi-label classification problem. The same architecture and layer parameters were used for the other two models (VGG19 and ResNet50). The models were run for a total of 50 epochs.

E. Implementation Details
The networks were trained in a Google Colab environment with 25GB of RAM and the default GPUs. The ResNet50 model had a total of 179,362,563 parameters, of which 132,187,139 were trainable and the remaining 47,175,424 non-trainable. The VGG19 model had a total of 69,475,459 parameters, with 40,048,768 non-trainable and just 29,426,691 trainable. Finally, the InceptionV3 model had 117,072,451 parameters, 73,466,883 of them trainable and 43,605,568 non-trainable.

The input images originally had dimensions 2208x2688 but were scaled down to 224x270 because of local computational bottlenecks while training. Also, the images were originally in gray-scale and had only one input channel, the resulting dimension being 224x270x1. This channel was stacked three times to make the input dimension 224x270x3, which is a requirement for all these common transfer learning models.

The models were built and trained using functions from the TensorFlow library. Some of the code in the library had to be edited in order for the models to run.

F. Results
Because of the limited dataset, accuracy in this case is a slightly flawed but still the best way to measure the performance of the models. Since all of the data points in the training set have threat objects, the job of the model is to accurately predict the presence of the predefined threat objects on which the models were trained: gun, blade and shuriken. The result is counted as correct only when the presence of each threat object is detected correctly. Considering this, the models have a validation accuracy between 75 and 90 percent in identifying the correct threats, which is promising since the test set also includes images of bags which the network has never seen before. Out of the three, the best accuracy is seen on the model built on top of ResNet50, which is no surprise as it has the most parameters of the three. Since the number and types of threat objects is not very large (different types of guns, explosives, etc.), this method can be extrapolated to detect all types of threat objects in dual-view and, if possible, multi-view X-Ray images.

B. Visual Vocabulary - Codebook
The visual vocabulary (codebook) is used to index votes for object position. A visual word in the codebook can be thought of as a 'part' of the image. The part is defined by its center of mass in the codebook. For codebook generation, a training database of 200 images each of a handgun, a blade and a shuriken is used. The SIFT approach [7] is used to extract keypoints and their neighbouring visual descriptors from all training images obtained of the target object. A keypoint can be thought of as a distinguishable point in an image, i.e. it represents a unique feature of the target object.
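The pair counts in the dataset description can be reproduced with a few lines of arithmetic. The sketch assumes, as the text implies but does not spell out, that every view is paired with the view 90 degrees further around the bag (wrapping past 360 degrees); the variable names are illustrative.

```python
ANGLE_STEP = 2                                   # degrees between views
angles = list(range(0, 360, ANGLE_STEP))         # 180 views per bag

# Assumed pairing: each view with the one 90 degrees further on.
pairs_per_bag = [(a, (a + 90) % 360) for a in angles]

num_bags = 19
train_bags = 16
unseen_bags = num_bags - train_bags              # 3 bags held out entirely
train_pairs_per_bag = 174
unseen_pairs_per_bag = 30

total_pairs = num_bags * len(pairs_per_bag)
train_pairs = train_bags * train_pairs_per_bag
test_pairs = (train_bags * (len(pairs_per_bag) - train_pairs_per_bag)
              + unseen_bags * unseen_pairs_per_bag)

print(len(pairs_per_bag), total_pairs, train_pairs, test_pairs)
# 180 3420 2784 186
```

The test set therefore mixes 96 pairs from bags seen during training with 90 pairs from bags the network has never seen.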
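The trainable parameter counts reported above can be cross-checked by hand. The sketch below assumes (this is not stated in the text) that the final feature map of each frozen backbone is flattened before the two branches are concatenated, and that a 224x270 input yields feature maps of 7x9x2048 for ResNet50, 7x8x512 for VGG19 and 5x7x2048 for InceptionV3, sizes that follow from the usual downsampling in those architectures. The function names are illustrative.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def head_params(feature_map, branches=2, hidden=(512, 128), outputs=3):
    """Trainable parameters of the dense head on top of the frozen branches."""
    height, width, depth = feature_map
    size = branches * height * width * depth   # concatenated, flattened features
    total = 0
    for units in hidden + (outputs,):
        total += dense_params(size, units)
        size = units
    return total


# Assumed feature-map sizes for a 224x270 input (see the note above).
assumed_feature_maps = {
    "ResNet50": (7, 9, 2048),
    "VGG19": (7, 8, 512),
    "InceptionV3": (5, 7, 2048),
}

# Trainable counts as reported in the text.
reported_trainable = {
    "ResNet50": 132_187_139,
    "VGG19": 29_426_691,
    "InceptionV3": 73_466_883,
}

for name, shape in assumed_feature_maps.items():
    print(name, head_params(shape), head_params(shape) == reported_trainable[name])
```

Under these assumptions all three reported counts are matched, which suggests that almost all trainable weights sit in the first 512-unit dense layer and that the non-trainable counts are simply two copies of each frozen backbone.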
(gun, blade and shuriken) separately.
Fig. 23. Detection of Blade Stage III - final detection - one true detection
and one false positive
All the training images for the codebook generation and the
starter AISM Matlab code are available at Domingo Mery’s
university page.
E. Codebook Generation
We used 200 gun images, 100 blade images and 100
shuriken images to generate the codebook. These were
obtained in the same way as section II-B using single
spectrum X-Rays.
Fig. 21. Detection of Blade Stage I - extraction of keypoints
F. Results
We created a custom test set of 180 pairs of orthogonal
images from the GDXray dataset. In our test set, each of
the target images has one or more of the three threats: gun,
blade and shuriken. We count a detection as correct when the
correct threat is detected at the correct position in the image.
We test the target images for each of the three threat objects
individually.
1) Single View: Only one image was used from the pair
of images for each data point in our custom dataset. In the
180 single view images, AISM correctly classified threat
objects in 145 images. However, the number of false
detections was high: 43 were recorded.
Fig. 22. Detection of Blade Stage II - merging of overlapping subwindows
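For reference, the single-view counts translate into the rates computed below. The same arithmetic applied to the dual-view counts reported in the next subsection (165 correctly detected images and 60 false detections) gives roughly the 80% and 91% figures quoted in the conclusion. The helper name is illustrative.

```python
def detection_rate(correct_images, total_images):
    """Fraction of test images in which the threat was correctly detected."""
    return correct_images / total_images


TOTAL = 180  # size of the custom test set

single_view = detection_rate(145, TOTAL)
dual_view = detection_rate(165, TOTAL)

print(f"single view: {single_view:.1%}")   # 80.6%
print(f"dual view:   {dual_view:.1%}")     # 91.7%

# False detections per test image for each setting.
print(f"false detections per image: {43 / TOTAL:.2f} vs {60 / TOTAL:.2f}")
# 0.24 vs 0.33
```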
2) Dual View: In the 180 dual view pairs from the custom dataset, AISM correctly detected the threats in 165 images. Along with this, the number of false positives also increased: 60 false detections were recorded.

A False Detection - A target image can have more than one threat object detection. A false detection is counted when the place where the threat is detected in the image actually has no threat object. Figure 20 shows one true detection and one false positive.

VI. CONCLUSION
In the desire to improve Threat Detection using X-Ray images, we explored 4 different approaches.

First, we experimented with the novel work of Dr. Vladimir Riffo, which gave good results but was not a practical approach, as 180 images were needed from each bag to make an accurate detection. Due to the unavailability of projection matrices of gun images, we created our own implementation of 3D space carving. Using this, we generated gun models using only two gun images. Considering our limitations, this approach shows promise. However, the need to have 90 images of the target bag to make a detection makes this infeasible for real life scenarios, as most modern X-Ray scanners have at most 2 views.

Secondly, using the Bag of Visual Words technique, we made a classifier to detect guns in a bag. We experimented with both single view and dual view approaches and got much better precision and recall for the dual view approach. However, even the dual input approach resulted in false positives, as both views were being considered separately and no combined inferences were taken. To take the combined inference of both views, we experimented with this idea while implementing our CNN models.

We experimented with variants of the Inception-V3, VGG-19 and ResNet50 models with transfer learning. Instead of the standard CNN implementation, we used a Dual Input CNN model with custom top layers to consider both Front and Side Views of an Object. For this process, we segregated orthogonal pairs from the original dataset. Out of these three, the model built on top of ResNet50 performed the best. This approach will definitely work better if a more uniform dataset of dual-view images is provided for training.

Finally, we discussed the Adaptive Implicit Shape Model (AISM) approach, which performs better in terms of detecting threats when we use dual-view instead of single-view. Accurate threat detection increased from 80% to 91% on moving from the single view to the dual view approach. However, using the dual-view mode also resulted in a larger number of false positives. In the case of threat detection, false positives do not matter a lot, as the percentage of bags containing threats is minuscule. The aim is to reduce the number of false negatives, which is achieved through the AISM dual-view approach.

ACKNOWLEDGMENT
We would also like to thank Professor Vladimir Riffo and Professor Domingo Mery for their novel work and for providing us with valuable data for our project. All our work was done on the GDXray dataset, taken from Domingo Mery's website [40]. In addition, both authors were generous in providing us with additional information on the dataset. We have adopted the AISM implementation from Dr. Riffo's work.

REFERENCES
[1] E. Parliament, “Aviation security with a special focus on security scanners,” in Proc. Eur. Parliament Resolut. (INI), Oct. 2012, pp. 1–10.
[2] A. Bolfing, T. Halbherr, and A. Schwaninger, “How image based factors and human factors contribute to threat detection performance in X-ray aviation security screening,” in HCI and Usability for Education and Work (LNCS 5298), A. Holzinger, Ed. Heidelberg, Germany: Springer, 2008, pp. 419–438.
[3] G. Blalock, V. Kadiyali, and D. H. Simon, “The impact of post-9/11 airport security measures on the demand for air travel,” J. Law Econ., vol. 50, no. 4, pp. 731–755, Nov. 2007.
[4] C.-F. Tsai, “Bag-of-words representation in image annotation: A review,” ISRN Artif. Intell., vol. 2012, 2012, Art. no. 376804.
[5] Chapelle, Olivier, Patrick Haffner, and Vladimir N. Vapnik. “Support vector machines for histogram-based image classification.” IEEE transactions on Neural Networks 10.5 (1999): 1055-1064.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. “Surf: Speeded up robust features.” European conference on computer vision. Springer, Berlin, Heidelberg, 2006.
[7] Lowe, David G. “Distinctive image features from scale-invariant keypoints.” International journal of computer vision 60.2 (2004): 91-110.
[8] Kutulakos, Kiriakos N., and Steven M. Seitz. “A theory of shape by space carving.” International journal of computer vision 38.3 (2000): 199-218.
[9] V. Riffo and D. Mery, “Automated detection of threat objects using adapted implicit shape model,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 4, pp. 472–482, Apr. 2016.
[10] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[11] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
[12] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[13] Riffo, Vladimir, Ivan Godoy, and Domingo Mery. “Handgun Detection in Single-Spectrum Multiple X-ray Views Based on 3D Object Recognition.” Journal of Nondestructive Evaluation 38.3 (2019): 66.
[14] Mery, D., Riffo, V., Zscherpel, U., Mondragón, G., Lillo, I., Zuccar, I., Lobel, H., Carrasco, M.: GDXray: The database of X-ray images for non-destructive testing. Journal of Nondestructive Evaluation 34(4) (2015) 1–12
[15] Frome, A., Huber, D., Kolluri, R., Bülow, T., & Malik, J. (2004, May). Recognizing objects in range data using regional point descriptors. In European conference on computer vision (pp. 224-237). Springer, Berlin, Heidelberg.
[16] Tombari, F., Salti, S., & Di Stefano, L. (2010, September). Unique signatures of histograms for local surface description. In European conference on computer vision (pp. 356-369). Springer, Berlin, Heidelberg.
[17] Tombari, F., Salti, S., & Di Stefano, L. (2010, October). Unique shape context for 3D data description. In Proceedings of the ACM workshop on 3D object retrieval (pp. 57-62).
[18] Rusu, R. B., Blodow, N., & Beetz, M. (2009, May). Fast point feature histograms (FPFH) for 3D registration. In 2009 IEEE international conference on robotics and automation (pp. 3212-3217). IEEE.
[19] Muja, Marius, and David G. Lowe. “Fast approximate nearest neighbors with automatic algorithm configuration.” VISAPP (1) 2.331-340 (2009): 2.
[20] J.W. Chan, A. Omar, J.P.O. Evans, D. Downes, X. Wang, and Y. Liu, “Feasibility of SIFT to synthesise KDEX imagery for aviation luggage security screening,” in International Conference on Crime Detection and Prevention, 2009, pp. 1–6.
[21] Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. “Understanding bag-of-words model: a statistical framework.” International Journal of Machine Learning and Cybernetics 1.1-4 (2010): 43-52.
[22] Turcsany, Diana, Andre Mouton, and Toby P. Breckon. “Improving feature-based object recognition for X-ray baggage security screening using primed visual words.” 2013 IEEE International Conference on Industrial Technology (ICIT). IEEE, 2013.
[23] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. “BRISK: Binary robust invariant scalable keypoints.” 2011 International conference on computer vision. IEEE, 2011.
[24] Rublee, Ethan, et al. “ORB: An efficient alternative to SIFT or SURF.” 2011 International conference on computer vision. IEEE, 2011.
[25] Danielsson, Per-Erik. “Euclidean distance mapping.” Computer Graphics and image processing 14.3 (1980): 227-248.
[26] Kong, Weihao, Wu-Jun Li, and Minyi Guo. “Manhattan hashing for large-scale image retrieval.” Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. 2012.
[27] Jang, Eric, Shixiang Gu, and Ben Poole. “Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).
[28] Zhang, Zhilu, and Mert Sabuncu. “Generalized cross entropy loss for training deep neural networks with noisy labels.” Advances in neural information processing systems. 2018.
[29] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
[30] Mery, Domingo, et al. “Modern computer vision techniques for x-ray testing in baggage inspection.” IEEE Transactions on Systems, Man, and Cybernetics: Systems 47.4 (2016): 682-692.
[31] Leibe, Bastian, Ales Leonardis, and Bernt Schiele. “Combined object categorization and segmentation with an implicit shape model.” Workshop on statistical learning in computer vision, ECCV. Vol. 2. No. 5. 2004.
[32] Ballard, Dana H. “Generalizing the Hough transform to detect arbitrary shapes.” Readings in computer vision. Morgan Kaufmann, 1987. 714-725.
[33] Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. “The global k-means clustering algorithm.” Pattern recognition 36.2 (2003): 451-461.
[34] Beeferman, Doug, and Adam Berger. “Agglomerative clustering of a search engine query log.” Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000.
[35] Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
[36] Habbab, I. S. A. M., Mohsen Kavehrad, and C. Sundberg. “Protocols for very high-speed optical fiber local area networks using a passive star topology.” Journal of Lightwave Technology 5.12 (1987): 1782-1794.
[37] Rajpurkar, Pranav, et al. “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning.” arXiv preprint arXiv:1711.05225 (2017).
[38] Brosnan, Tadhg, and Da-Wen Sun. “Improving quality inspection of food products by computer vision - a review.” Journal of food engineering 61.1 (2004): 3-16.
[39] Brosnan, Tadhg, and Da-Wen Sun. “Improving quality inspection of food products by computer vision - a review.” Journal of food engineering 61.1 (2004): 3-16.
[40] https://domingomery.ing.puc.cl/