
Monitoring the railroad using Neural Networks

CS870 project report. August 2017


Vyacheslav Derevyanko
David R. Cheriton School of Computer Science
University of Waterloo
vderevya@uwaterloo.ca

ABSTRACT

In this project various Neural Network architectures are surveyed performing a single task: locating a small object in a complex scene image. Experiments are performed on a large number of NN architectures, gradually increasing in complexity: starting with training a Logistic Regression classifier and simple 3-layered NNs, moving on to Convolutional NNs, and concluding with state-of-the-art CNNs used for object detection and localization. Source code for the NN models used is available online (https://github.com/slavik112211/cs870_neural_networks).

The experiments demonstrate the superiority of modern CNNs for image classification/object detection tasks over other, non-CNN approaches.

Keywords

Convolutional Neural Networks, Machine Learning, image classification, supervised learning, object detection.
1. INTRODUCTION

Following our class discussions on how CNNs can be applied to computer vision tasks [6, 13], I decided it would be interesting to base my project around the application of neural networks to image recognition, and the evaluation of how various neural network architectures, parameters and techniques affect the quality of image recognition and classification. As the basis for this experimentation, I've come up with the following setup: a toy model of a railroad is pictured from above, and the image recognition tasks revolve around recognizing the location of a train on the track. This simulates a real-world scenario of satellite imagery of on-ground activity. Additional experiments would be based around relative locations of multiple trains, such as whether it's safe for a train to proceed in a particular direction with a given velocity, given the locations of other trains on the track.

As NN classification heavily depends on the quality and quantity of training data, all of the experiments require a large collection of training images. Images for the experiments were gathered at the CS452 class lab, which has a large model of a railroad track and control systems that allow running trains on the track. The lab also has a camera mounted on the ceiling, allowing pictures of the track to be taken from above.
2. EXPERIMENTS: DETECTING A TRAIN

Initial experiments were based around the task of finding a train's location on the track, given that only one train is running on it. As with any supervised learning system, a neural network needs to be trained on training data before it can be used for image classification tasks on a real data stream. I've initially collected around 1500 images of a single train running on different parts of the track. The task of finding the train's location was organized as a classification task for a neural network: the track was divided into small segments, and the NN was used to classify which segment the locomotive is currently in. Compared to image datasets available on the web, this number of training images might sound quite low: even a simpler dataset of handwritten digits, MNIST (http://yann.lecun.com/exdb/mnist/), has 50000 images in the training set for 10 output categories. I still decided to start experimenting with a small training dataset, thinking that it would provide good feedback on what can be considered a good size for a dataset. The images collected were of 960x480 resolution, which may sound large, as a neural network input layer would need to consist of 460,800 neurons. I've halved the image resolution to 480x240 and converted the images to grayscale to lower the number of neurons in the input layer to 115,200. Nonetheless, the actual train in such an image is small - 40x10 pixels - and at that zoom level the distinctive features of a train are quite blurry, so it wasn't completely clear how a neural network would be able to identify and distinguish trains on input images.
2.1 Implementing a linear classifier

To be able to train a neural network on the collected images, these images needed to be classified and labelled according to predetermined track segments. The more fine-grained the division into track segments, the more precisely each segment represents the location of the train. Considering that it wasn't clear how many images were to be labelled, and what would be a good number of categories, I realized that manual labelling wouldn't be such a great idea. Instead, I've implemented a custom image classifier based on luminosity (trains_location_linear_classifier.py). After converting the images to grayscale, a train can be recognized in the image as a dark rectangle of size 40x10. To evaluate whether a train is present at a particular location, a 40x10 bounding box is positioned at that location along the track, and a threshold is applied to the summed luminosity of all pixels within that 40x10 rectangle. I've estimated the summed luminosity to be lower than 35000 when a train is within the bounding box, and above 43000 when it's not. The list of pixels within a bounding box is retrieved by positioning that 40x10 bounding box at the origin point (0,0) and multiplying each pixel's coordinate vector by a transformation (rotation) matrix to get the pixel's new coordinates:

    \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}    (1)

This way, each train position on the track can be described with only three values: the location of the center of the bounding box (x, y) and its rotation angle \theta.

I've defined 88 potential train locations, as that's the number of categories required to express the train's location with a precision of about half a train's length. Sometimes an image would be classified into multiple categories, and the category with the smaller luminosity value would be picked as the ground truth (i.e. a bigger part of the train shows in one category than in the other). The output of the linear classifier was inspected visually to ensure that the threshold luminosity values correctly classify images and cut out images that don't belong in such categories.
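The core of this check fits in a few lines. Below is a minimal Python/Numpy sketch of it, assuming a grayscale frame stored as a 2D array of 0-255 values; the function, variable and threshold names are mine, while the report's actual implementation lives in trains_location_linear_classifier.py.

    import numpy as np

    # Thresholds as estimated in the text: summed luminosity of the 40x10
    # box is below ~35000 when a train is present, above ~43000 when not.
    TRAIN_PRESENT_MAX = 35000

    def box_luminosity(image, cx, cy, theta, width=40, height=10):
        """Sum pixel luminosities inside a width x height box centered at
        (cx, cy) and rotated by theta radians, per equation (1)."""
        total = 0.0
        for x in range(-width // 2, width // 2):
            for y in range(-height // 2, height // 2):
                # Rotate the box-local coordinate, then translate to the center.
                xr = int(round(cx + x * np.cos(theta) - y * np.sin(theta)))
                yr = int(round(cy + x * np.sin(theta) + y * np.cos(theta)))
                if 0 <= yr < image.shape[0] and 0 <= xr < image.shape[1]:
                    total += image[yr, xr]
        return total

    def train_present(image, cx, cy, theta):
        return box_luminosity(image, cx, cy, theta) < TRAIN_PRESENT_MAX

    # Placeholder frame (480x240 resolution -> array shape (240, 480)):
    image = np.random.randint(0, 256, (240, 480))
    print(train_present(image, cx=100, cy=100, theta=0.0))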
Figure 1: Railroad track divided into 10 segments

2.2 Training simple neural networks

Using the output of the linear classifier as ground-truth labels, I've come up with 1436 labelled images depicting a train at 88 different locations on the track, which in total completely cover the area of the track. After reserving 1/6 of this dataset as testing data (239 images), the training dataset was further reduced to a total of 1197 labelled images. Given that the images are divided across categories in a uniform manner, each category was represented by over 10 training images. These images were provided as input to the neural networks as 2-dimensional arrays, with each pixel represented by its luminosity score - a real value ranging from 0 to 1, where 0 represents black and 1 represents white.

2.2.1 Custom 3-layer NN

Before using any deep learning libraries, I decided to train a custom implementation of a simple 3-layer neural network (simple_neural_net.py) taken from [9], implemented using Python's Numpy linear algebra package (http://www.numpy.org/). Before attempting to train the network on my dataset, I've ensured that it correctly recognizes digits from the MNIST dataset. This network explicitly programs the code for backpropagation and gradient descent, which is good for debugging and learning purposes. To simplify the classification case, the 88 categories were joined together to form 10 larger categories (track path chunks), as shown on Fig. 1. This network had the following configuration: 3 layers - a 115,200-neuron input layer (480x240), a 30-neuron hidden layer and a 10-neuron output layer; sigmoid activation function and a sigmoid output layer (not softmax); mean squared error as the cost function. The network's performance was validated on the test dataset after each training epoch, i.e. each time after training the network over 100% of the training data (1197 images).

The network didn't show any capacity for learning, with accuracy not getting higher than 10-20%. Considering that this network implementation works fine with the MNIST dataset, I've tried to identify the reason for the bad performance. I've observed that the network outputs 1 for multiple categories per input image. This behavior is undesirable, as the network's output should signify a single category that the image belongs to, not multiple. The network has no way to distinguish between multiple categories that receive a score of 1. This issue slows down the learning process: the error cost isn't attributed properly, and it can't correctly estimate how wrong the model is in each particular case. It would be possible to prevent this effect by using the softmax function as the output layer - a logistic function that represents the NN output as a probability distribution, where each output category receives a probability score, with the total across all categories summing to 1:

    \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K    (2)

where K is the number of output categories, and z is a K-dimensional vector of arbitrary real values that is fed into the output layer.
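A quick numerical check makes this normalization property concrete (a sketch; the input scores are arbitrary):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))  # shift for numerical stability
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    print(softmax(scores))         # [0.659 0.242 0.099]
    print(softmax(scores).sum())   # 1.0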
Other than that, I believe I've hit a classic case of learning slowdown, which occurs when a sigmoid activation function is used together with a quadratic cost function. The sigmoid function

    \sigma(x) = \frac{1}{1 + e^{-x}}; \quad \sigma'(x) = \sigma(x)(1 - \sigma(x))    (3)

gets very flat in the extremes (when its input is either much larger than 0, or much smaller than 0). This means that its derivative tends to get very small when the sigmoid input is not close to 0 (Fig. 2).

Figure 2: Sigmoid and its derivative at x = 6.5

The quadratic cost function, defined as

    C = \frac{(y - a)^2}{2},    (4)

where a is the neuron's actual output and y is its desired output, tends to cause learning slowdown, as the sigmoid derivative \sigma'(x) is present in the \partial C/\partial w and \partial C/\partial b gradients when quadratic cost is used:

    \frac{\partial C}{\partial w} = a\,\sigma'(x); \quad \frac{\partial C}{\partial b} = a\,\sigma'(x)    (5)

Given that backpropagation is based on consequent application of the chain rule (multiplication of derivatives), a small \sigma'(x) makes the resulting weight/bias updates negligible. The cross-entropy cost function, defined as

    C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right],    (6)

summing over the training inputs x (n in total), doesn't exhibit similar problems, as its \partial C/\partial w and \partial C/\partial b gradients do not depend on \sigma'(x), and aren't affected by it being close to 0:

    \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y); \quad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y)    (7)
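The slowdown is easy to see numerically. The sketch below follows the single-neuron example from [9], assuming input x = 1 and target y = 0:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    z = 6.5                        # badly saturated: sigmoid(6.5) ~ 0.9985
    a, y, x = sigmoid(z), 0.0, 1.0

    # Quadratic cost gradient (5) carries the tiny sigmoid'(z) factor:
    grad_quadratic = (a - y) * sigmoid(z) * (1 - sigmoid(z)) * x   # ~0.0015
    # Cross-entropy gradient (7) does not:
    grad_cross_entropy = (a - y) * x                               # ~0.9985
    print(grad_quadratic, grad_cross_entropy)

The cross-entropy update is roughly 600 times larger here, which matches the observed difference in training behavior.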
After changing the cost function to cross-entropy, and simplifying the model to two layers by removing the hidden layer, the network was successfully trained to classify the railroad images, achieving 95% accuracy after training for about 200 epochs. A comparable 2-layer network using the quadratic cost function wasn't successful: the classification accuracy didn't increase at all from epoch to epoch, while the cost function output would decrease, but incredibly slowly. Given that this same network provides over 90% accuracy on the MNIST dataset after just a single training epoch, I concluded that quadratic cost can still be used on simpler datasets, but it's far less effective than cross-entropy.

Instead of continuing to improve this custom network implementation, I decided to reimplement this simple network on top of TensorFlow, as that would allow simply reusing existing implementations of softmax, cross-entropy and other functions from the vast TensorFlow collection.

2.2.2 2-layer NN using TensorFlow & Theano

I've started with a very simple implementation of the neural network using TensorFlow: no hidden layers, all input neurons directly linked to the output layer. This implementation is provided as part of the TensorFlow tutorials (https://www.tensorflow.org/get_started/mnist/beginners), where it's used to classify MNIST digits, and I only needed to adapt it a little to feed my own training data into the model (trains_tensorflow_classifier.py).

The fact that this model doesn't use hidden layers makes it identical to another very commonly used classification model - Multinomial Logistic Regression. Neural networks differ in that a sigmoid activation function is used at each layer of neurons, and the backpropagation algorithm [11] is used to acquire cost function gradients for every parameter of the network (weights and biases). But when there are no hidden layers the model gets radically simpler: the backpropagation algorithm isn't required anymore, as the cost function gradients can be acquired by applying differential calculus. An activation function is never applied to the input layer of an NN (only to subsequent layers), and when there are no hidden layers the input is fed directly into the output layer. The output layer of the neural network can use softmax as its activation function, the same as Logistic Regression. Finally, both models use the same method to gradually adjust model parameters: Stochastic Gradient Descent.

In the first experiment I've used the following configuration: 2 layers, 10 output categories, softmax on the output layer, cross-entropy cost function, 1197 training images, 239 testing images. Again, I've first verified that the model is able to train on the MNIST dataset. Since the training dataset is that small, I've set up the training to proceed for 300 epochs (approx. 5 minutes), where 1 epoch signifies that 100% of the training data has been fed into the network. The accuracy was verified on the test dataset every 50 epochs. Accuracy goes over 50% after the first 100 epochs, and rises to over 95% after 300 epochs.

The model was successfully able to classify images into 10 categories, i.e. it's able to correctly identify which path chunk (color-coded on Fig. 1) the train is located on. Nonetheless, the original problem was identifying the precise location of the train, and thus it's more interesting to see the accuracy of the NN model when train locations are classified into the original 88 categories, instead of the 10 larger categories these 88 were combined into. As the custom luminosity classifier assigned image labels for the full 88 categories, changing the setup was simple. With this setup the model doesn't learn as rapidly, but after training for 400 epochs the testing data was categorized with an acceptable 80% accuracy (Fig. 3).

Figure 3: Accuracy and cost function charts for the 2-layer NN using TensorFlow (88 categories, 600 epochs)
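For reference, a minimal sketch of this 2-layer model in the TensorFlow 1.x style of that tutorial is shown below. The constants match the experiment (115,200 grayscale input pixels, 88 location categories), but the variable names, the learning rate and the omitted data-loading code are illustrative:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 480 * 240])
    y_true = tf.placeholder(tf.float32, [None, 88])

    # Input directly linked to the softmax output layer.
    W = tf.Variable(tf.zeros([480 * 240, 88]))
    b = tf.Variable(tf.zeros([88]))
    logits = tf.matmul(x, W) + b

    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_true, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # for batch_xs, batch_ys in training_batches:  # dataset loading omitted
        #     sess.run(train_step, {x: batch_xs, y_true: batch_ys})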
To cross-validate this successful result I've re-implemented this 2-layer Neural Network (logistic_sgd_theano.py) using Theano (http://deeplearning.net/tutorial/logreg.html), a deep learning framework by the UMontreal lab MILA. This network had the following parameters: 2 layers, input - 480x240 neurons, output - 10 categories, softmax activation, cross-entropy as the cost function. The input consisted of the same collection of 1197 training images and 239 testing images, the network weights were updated after each batch of 10 input images, and training continued for over 300 epochs. At first, the training procedure didn't provide any meaningful results: the network's accuracy was verified every 10 epochs, and the classification of the training data didn't seem to improve. Moreover, the cross-entropy cost didn't decrease as the training progressed; its value was unstable and jumped back and forth. This clearly indicated that the weight updates weren't being applied correctly. Adjusting the input batch size (10), the gradient descent step size (0.15), and other hyperparameters didn't provide any measurable improvement. One major difference between the TensorFlow and Theano implementations was the fact that TensorFlow was randomly reshuffling the order of the input images after each training epoch. I've noticed that my training data was highly ordered, with input images belonging to one category being put together. It turned out that this was the reason why the network's training procedure couldn't show progress. It was enough to reshuffle the input images just once (at the beginning of training) to get the Theano network to properly update its weights and reach levels of accuracy comparable to the TensorFlow implementation: 95% after 300 epochs. Another experiment, with the input data separated into the more fine-grained 88 categories, confirmed the previous result (achieved with the TensorFlow implementation), and the network reached over 85% accuracy after training for about 500 epochs.
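The fix itself amounts to a few lines: permute the images and labels with the same random order before training starts. A sketch, with dummy arrays standing in for the real dataset:

    import numpy as np

    train_images = np.arange(10).reshape(10, 1)   # placeholder for 1197 images
    train_labels = np.arange(10)                  # placeholder labels

    # One-time shuffle, applied identically to images and labels.
    perm = np.random.permutation(len(train_images))
    train_images, train_labels = train_images[perm], train_labels[perm]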
2.2.3 Enlarging the dataset using Gaussian noise

Although the 2-layered networks did relatively well on the classification task, first attempts at training 3-layered models didn't show any signs of incremental accuracy improvement. I assumed that one of the main reasons for the bad training results could have been the lack of input training data, a phenomenon known as underfitting. An NN with 1 hidden layer has a tremendous number of hidden parameters (weights and biases) that we're trying to learn. Back-of-the-envelope calculations provide the following numbers: the input layer size is 480*240 = 115,200, and each input neuron is linked to each neuron in the hidden layer (size 30), which is 3,456,000 connections between the input and hidden layers alone. And to learn the correct values for this massive number of parameters I've only provided a total of 1197 labelled images.

To increase the size of the dataset, I've decided to apply a Gaussian random noise filter to the images. For that, each pixel's luminosity value was slightly modified using a random value drawn from a normal (Gaussian) distribution. Considering that each pixel's initial value is in the range 0-255, the random variable was generated with a mean of 0 and a standard deviation of 8. The noise filter was applied 10 times to each original image, thus generating 10x more training data (14360 images). Labels were assigned according to the labels of the non-noisy original images. The results were confirmed visually: the noise distorted the images, but not much - the objects in the images were still easily recognizable.
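A sketch of this augmentation step is shown below; the clipping of noisy values back into the 0-255 range is my assumption rather than something stated about the original script:

    import numpy as np

    def augment(image, copies=10, sigma=8.0):
        """Generate `copies` noisy variants of a 0-255 grayscale image,
        each with per-pixel Gaussian noise (mean 0, std. dev. 8)."""
        noisy = []
        for _ in range(copies):
            noise = np.random.normal(0.0, sigma, image.shape)
            noisy.append(np.clip(image + noise, 0, 255).astype(np.uint8))
        return noisy

    image = np.random.randint(0, 256, (240, 480), dtype=np.uint8)  # placeholder
    augmented = augment(image)   # 10 variants, same label as the original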
Both the TensorFlow-based and Theano-based NN implementations displayed classification accuracy improvements. The networks were configured as follows: 2 layers - input/output, 14360 input images (11967/2393 training/test), 88 output categories. Both networks reached an accuracy of 90% after 40 epochs, and over 95% after 100 epochs. Naturally, the training took a bit longer, as 11967 images had to be processed each epoch instead of 1197. Both frameworks were configured to do the processing on a GPU (NVidia GeForce GTX 750 Ti, 2GB), which significantly improved training performance - approx. 10 times in most experiments.

2.2.4 Training a NN with 1 hidden layer

After increasing the volume of training data tenfold, I again tried to train 3-layered networks. It has been proven that neural nets with even a single hidden layer are able to approximate any function at all - the universality theorem [1]. According to this theorem, we can be certain that there exists a 1-hidden-layer NN that would be able to classify any dataset with 100% accuracy, including my custom dataset. Whereas 2-layered models are quite straightforward - the input is directly mapped to the output - introducing 1 hidden layer increases the number of parameters (weights/biases) to learn and makes the model much more complex, and thus harder to train. At the same time, it allows learning more subtle non-linear dependencies between input data categories. A 1-hidden-layer network architecture doesn't have too many hyperparameters to adjust: the input layer size is predetermined by the input images' resolution, and the output layer size is determined by the number of categories we want to classify the images into. Thus, this architecture is only flexible in terms of the number of neurons in the hidden layer. Nonetheless, it still proved impossible to achieve any meaningful classification accuracy. The following configuration was used: dataset - 11967/2393 training/test images, hidden layer size - 1000 neurons (couldn't do much more due to the 2GB GPU RAM limit), 88 output categories, tanh as the activation function in the hidden layer, and negative log-likelihood as the cost function. The larger dataset didn't seem to make much difference, with classification accuracy lingering at about 5% no matter how long the network was trained (it ran for 400 epochs). This was confirmed with two implementations - using TensorFlow and Theano. The real reason, of course, is that a hidden layer of 1000 neurons is simply too small to hold any meaningful features of an input image of over 100,000 pixels. It's generally recommended for a hidden layer to be at least 1/3 of the size of the input layer, but when an NN is used to classify images this gives an enormous and prohibitive number of parameters: 100,000 x 33,000, or over 3 billion weights, whereas most modern CNNs don't exceed even 50 million. Thus I concluded that 1-hidden-layer nets aren't applicable to the image classification/object detection task (except for very small images, such as MNIST's 28x28), and moved on to experimenting with CNNs.
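For reference, the back-of-the-envelope arithmetic behind the parameter counts quoted in this section:

    input_size = 480 * 240          # 115,200 input neurons
    print(input_size * 30)          # 3,456,000 weights, input -> 30-neuron hidden layer
    print(100_000 * 33_000)         # ~3.3 billion weights for a 1/3-sized hidden layer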
3. DETECTING OBJECTS USING CNN

3.1 Classification using LeNet-5

Convolutional Neural Networks handle the aforementioned problem of an exploding number of network weights by creating a hierarchy of hidden layers. The usage of CNNs for image classification tasks was popularized in 1998, when the LeNet-5 CNN [7] was created for the classification of images of handwritten digits. Later, in 2012, a breakthrough result in image classification was achieved by a UToronto group of researchers [6], who created AlexNet - a deep neural net with 5 convolutional and 3 fully-connected layers - and won the large-scale image classification competition ILSVRC by a wide margin compared to non-CNN approaches (http://www.image-net.org/challenges/LSVRC/2012/results.html).

In comparison to fully-connected NNs, CNNs aren't simply classifying image pixels, but are rather able to recognize features such as lines, geometric shapes and color blobs in the input image, and such concepts increase in complexity with each next layer of feature kernels, as they combine the features learned in previous convolutional layers. Such feature kernels function like image filters that are applied to an input image in a sliding-window manner. On the first convolutional layer the size of the sliding window is relatively small (say 5x5 pixels), but the size of the filters effectively doubles with each pooling layer, which merges 2x2 groups of nearby filters to form a filter that covers 10x10 pixels of the original image. Thus, stacking combinations of convolutional/pooling layers one after another allows the creation of larger and larger filters with each consequent layer. An important aspect of CNNs is that the weights learned for each filter at one layer are the same across all sliding-window positions. This greatly reduces the number of NN parameters that need to be learned in the process of training.

Given that LeNet-5 is a simpler CNN than AlexNet, I decided to try applying it to my dataset first. Both TensorFlow and Theano provide reference implementations of LeNet-5 as part of their tutorials (https://www.tensorflow.org/get_started/mnist/pros, http://deeplearning.net/tutorial/lenet.html), so I only needed to implement a data adapter to import my dataset into both models, and to adjust the sizes of the convolutional and fully-connected filters according to the size of the input images.

The network had the following configuration. Input layer: 115,200 neurons; 1st convlayer: 20 kernels (feature filters) of size 5x5 pixels, 2x2 pooling, tanh activation; 2nd convlayer: 40 kernels of size 5x5 (over the 1st layer kernels, effectively 10x10 over the original image), 2x2 pooling, tanh activation; 1st fully-connected layer: 500 neurons, tanh activation; output layer: 88 categories, negative log-likelihood cost function; SGD learning rate: 0.1; batch size: 10 images. The networks were trained with varying dataset sizes (the clean dataset and the 10x-sized dataset with Gaussian-noise images), and a varying number of output categories: 88 (more specific train location) and 10 (larger chunks of track specifying a train location area).
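A sketch of this configuration in TensorFlow 1.x is shown below; the kernel counts, tanh activations, layer sizes, cost function and learning rate follow the text, while the 'same' padding and the default weight initialization are my assumptions:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 240, 480, 1])   # grayscale 480x240
    y_true = tf.placeholder(tf.int64, [None])             # category index, 0..87

    conv1 = tf.layers.conv2d(x, filters=20, kernel_size=5,
                             padding='same', activation=tf.tanh)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
    conv2 = tf.layers.conv2d(pool1, filters=40, kernel_size=5,
                             padding='same', activation=tf.tanh)
    pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)

    flat = tf.reshape(pool2, [-1, 60 * 120 * 40])   # two 2x2 poolings: 240x480 -> 60x120
    fc1 = tf.layers.dense(flat, 500, activation=tf.tanh)
    logits = tf.layers.dense(fc1, 88)               # 88 train-location categories

    # Negative log-likelihood over softmax; plain SGD with learning rate 0.1.
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)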
None of the attempts to train LeNet-5 on this dataset succeeded, and both the TensorFlow and Theano implementations were showing a classification accuracy of about 3%, even after training for over 100 epochs. In retrospect, I realized that trying to classify my dataset using LeNet-5 or even AlexNet would never succeed. Both of these CNNs are designed to classify images which contain a single object, and that object spans the whole image. In comparison, the images in my dataset depict a table with a model of a railroad, and the train in each image is simply a small dark rectangle somewhere on the tracks. All of the images in my dataset are practically the same, the only difference being the location of that dark rectangle. Thus, the LeNet-5/AlexNet architectures serve a somewhat different purpose. A second reason why classifying with LeNet-5 wouldn't work is as follows: the original application of LeNet-5 was classifying small 28x28 images from the MNIST dataset, with little variation across images within 1 category. AlexNet handles larger and more complex images of size 224x224 from the ImageNet dataset. This provides the intuition that the LeNet-5 architecture is overly simplistic to handle the 480x240 images from my dataset, and AlexNet would be more appropriate in this case. For example, one important aspect is the size of the feature kernels in the 1st convlayer - with larger images it makes sense to increase their size as well (AlexNet uses 11x11 kernels for the 1st layer).

To make use of the AlexNet CNN with my dataset I'd need to reformulate my train localization problem as an image classification problem: a sliding window of 50x50 pixels would scan the input image, and each 50x50 tile would be run through AlexNet to identify whether any of the tiles classify as a train. It turned out that object detection is a separate, well-researched topic, and there exist appropriate CNN models that tackle this task.

3.2 Object localization

One of the most fundamental approaches to localizing smaller objects in a larger image using CNNs is based on Linear Regression. It works as follows. A pretrained CNN such as AlexNet has 5 convlayers which output feature detectors, and 3 fully-connected layers which classify input images into categories using the feature detectors generated by the convlayers. To localize an object in an image, such a CNN has to also output the coordinates of a bounding box (in addition to the object class). That is achieved by substituting the 3 fully-connected layers classifying images (the classification head) with 3 fully-connected layers that output the 4 coordinates of a bounding box (the regression head). The CNN is then re-trained with this regression head, where the training data contains the 4 ground-truth coordinates of a bounding box for each image. A more advanced version of this architecture is introduced by OverFeat [12], which uses a sliding-window approach to improve localization precision. As can be seen, this Linear Regression approach only works when the number of bounding boxes in all images is fixed (usually 1), as the number of output nodes of the regression head is determined by the number of coordinates being predicted, and it's hard to generalize this approach to an object detection scenario where the image might contain a varying number of objects in a scene. Since I wanted to be able to detect multiple trains in a scene, I skipped this model and continued searching for a model that better suits this scenario.
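For concreteness, here is a minimal TensorFlow sketch of the single-box regression head described above. The 4096-dimensional feature vector, the 1024-neuron hidden layers and the L2 loss are illustrative assumptions, not the exact OverFeat/AlexNet configuration:

    import tensorflow as tf

    features = tf.placeholder(tf.float32, [None, 4096])   # convlayer output
    box_true = tf.placeholder(tf.float32, [None, 4])      # ground-truth box

    # 3 fully-connected layers ending in 4 box coordinates instead of class scores.
    fc1 = tf.layers.dense(features, 1024, activation=tf.nn.relu)
    fc2 = tf.layers.dense(fc1, 1024, activation=tf.nn.relu)
    box_pred = tf.layers.dense(fc2, 4)                    # (xmin, ymin, xmax, ymax)

    # L2 regression loss between predicted and ground-truth coordinates.
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(box_pred - box_true), axis=1))
    train_step = tf.train.MomentumOptimizer(0.001, 0.9).minimize(loss)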
3.3 Object detection using R-CNN

A rather computationally expensive CNN architecture, R-CNN [3] (proposed in 2014), is able to detect multiple objects of different classes in a single image. This architecture uses a class-agnostic region proposal algorithm, Selective Search [14], as the first step of the object detection pipeline. The algorithm isn't based on any machine learning techniques; it analyzes image features such as color and texture, looking for monotonic blobs to detect potential object candidates, and generates a large number of bounding box proposals (image regions) - up to 2000 per image. After that, each such region is treated as a separate input image for a CNN object localization pipeline similar to the one described above. Even though this architecture works well, it's quite obvious that this method is tremendously slow - 47 seconds per image. Fast R-CNN [2], an improved architecture released by the R-CNN authors a year later, swaps the order of the pipeline operations: the potential regions pre-calculated using Selective Search are applied directly to the feature kernels produced by the CNN convolutions. This saves the time spent on running each proposed region through the convolutional layers, and improves object detection speed almost 25x at test time, with most of the time still spent on the region proposal algorithm. In their next architecture (Faster R-CNN [10]), the authors were able to get rid of the region proposal algorithm altogether. They introduced the Region Proposal Network (RPN), a class-agnostic object detector that is trained on the feature kernels produced by the CNN convlayers and generates potential regions. Thus, the whole pipeline becomes a monolithic CNN architecture. This latest architecture achieves almost real-time object detection: 0.2 seconds per image. Faster R-CNN paired with the recently released deep CNN ResNet (101 layers) provides an impressive precision in object detection - 59.0 mAP [4] (mean Average Precision metric, 0-100 scale) when trained on the COCO dataset (a dataset used for object detection evaluation, with an average of 7 objects per image). This result was state of the art at the beginning of 2016.

I've decided to try experimenting with this architecture. Google has recently released the TensorFlow Object Detection API (github.com/tensorflow/models/tree/master/object_detection), which includes an implementation of the Faster R-CNN/ResNet101 object detector. This model is pretrained on the COCO dataset, so there's no need to spend weeks training the network. Nonetheless, since I wanted the network to classify my own dataset, the network had to be retrained a bit with my custom dataset (create_trains_tf_record.py) using a technique called transfer learning: the output classification is augmented by introducing new object classes from an additional dataset.

To use the model, I've reformulated the problem from image classification (88 categories denoting train location) into a train object detection problem: each image is put through an object detector CNN, which checks whether the image contains any trains and outputs their specific locations. To import my dataset into this model, I've converted the images from grayscale to RGB, since modern CNNs work with 3 color channels. The model allows annotating each training image with a list of the objects in it (and their bounding boxes). Since my images only contain one train (not multiple), I've provided a single bounding box annotation for each image, and introduced 1 additional object class - train. The ground-truth bounding box was calculated as a 50x50 box around the center of the category that the image belongs to (out of the previously defined 88 image categories - e.g. the center point for category a05 is [271, 44]).
Figure 4: SSD (green) vs Faster R-CNN (violet) detection precision. Both models achieve over 95% mAP after training for 6000 batches (5 epochs)

Training progress can be seen on Fig. 4. In general, the model achieves over 95% accuracy in about 5 epochs, which seems really good compared to the 2-layer NNs tried previously. Nonetheless, training this CNN is only possible on a high-end GPU.
3.4 Object detection using SSD

The Single-Shot MultiBox Detector CNN [8], introduced by Google at the end of 2015, uses a slightly simpler approach than Faster R-CNN. Region proposals work in a similar fashion to the Region Proposal Network introduced in Faster R-CNN, i.e. a fixed number of potential regions (anchor boxes) is considered. But rather than feeding these potential regions further through the feature extractor, as Faster R-CNN does to assign an object class, SSD uses these regions directly and evaluates a score for each region for each object class. This simplifies the model, and allows for faster training/evaluation time without sacrificing much detection quality.

The SSD model is also provided as part of the TensorFlow Object Detection API, where it's paired with the GoogLeNet Inception2 CNN [13]. This model requires data in the same format as Faster R-CNN, and thus no additional setup was required. The authors of the Object Detection API have written a thorough SSD vs Faster R-CNN comparison paper [5], where they mention that Inception2 is a much more compact network compared to ResNet101 (10 million vs 42 million parameters respectively), requires up to 3 times fewer GPU FLOPs, and needs far less GPU RAM. This was confirmed in my tests, where SSD was easier to work with and generally faster. An object detection accuracy comparison with Faster R-CNN/ResNet101 can be seen on Fig. 4. In general, the precision improvement with SSD seems to go a little slower compared to Faster R-CNN (in terms of training epochs), but both converge to over 95% precision in less than 5 epochs. SSD seems faster in wall-clock time though, due to being a simpler model.

Train detection results can be seen on Fig. 5. An interesting observation is that the trains located off-track (in the right track circle) aren't detected at all. This is the desired outcome, and the reason is that these trains were never provided as ground-truth values during network training, and thus it didn't learn to recognize them. As I provided 50x50-pixel rectangles as the ground-truth boxes that represent a train, instead of the dark 40x10-pixel stripes, the network didn't learn to recognize the real-world trains of size 40x10 pixels, but rather these 50x50 image tiles.

Figure 5: Train detection, SSD/Inception2 CNN

4. CONCLUSION

This project allowed me to survey various Neural Network models, and apply them to the task of object detection. In my experiments, I've gradually increased the complexity of the NNs - from the simplest 2-layer ones to present-day state-of-the-art CNNs used for object detection. The project allowed me to get hands-on experience with training NNs, and to get a good understanding of their capabilities and limitations. The experiments confirmed that simple NN architectures aren't applicable to the case of object detection, and that recent progress in this field is significant.

I haven't experimented with trying to detect multiple train objects on the track, but the SSD/R-CNN object detection models can easily be used for the detection of multiple objects in a scene, and thus I consider this goal accomplished as well.
5. REFERENCES
[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, Dec. 1989.
[2] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[5] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[9] M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–538, 1986.
[12] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[14] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
