
Fast Sliding Window Classification with Convolutional Neural Networks

Henry G. R. Gouk
Department of Computer Science
University of Waikato
Private Bag 3105, Hamilton 3240, New Zealand
hgouk@waikato.ac.nz

Anthony M. Blake
Department of Computer Science
University of Waikato
Private Bag 3105, Hamilton 3240, New Zealand
ablake@waikato.ac.nz

ABSTRACT

Convolutional Neural Networks (CNNs) have repeatedly been shown to be the state of the art method for natural signal classification – image classification in particular. Unfortunately, due to their high model complexity, CNNs often cannot be used for object detection tasks with real-time constraints, where many predictions have to be made on sub-windows of a large input image. We demonstrate how two recent advances in CNN efficiency can be combined, with modifications, to provide a substantial speedup for sliding window classification. An in depth analysis of the various factors that can impact performance is presented.

Categories and Subject Descriptors

I.5.1 [Pattern Recognition]: Neural Nets

Keywords

Convolutional Neural Networks, Object Detection, Sliding Window, FFT

1. INTRODUCTION

Convolutional Neural Networks (CNNs) are powerful models and have become very popular for image classification tasks over the last few years, where they have been shown to produce state of the art results on several benchmark datasets [10, 5]. The primary drawback of using CNNs is their substantial complexity, which leads to slow training and inference procedures, making some applications unfeasible due to real-time constraints. In this paper we aim to address this problem of slow inference in the context of sliding window classification by extending a scalable inference procedure to evaluate all the sub-windows in a large input image simultaneously, and by applying computer architecture aware optimisations. We provide a comparison between several modern methods for applying CNNs in a sliding window fashion.

It is known that computing convolutions by transforming the image and kernel into the frequency domain and performing point-wise complex multiplication is an effective way to gain a speedup with many computer vision and image processing algorithms [9]. However, it has only been applied in the context of CNNs recently, where the main motivation was to accelerate the training process [6]. The focus of our work has been on optimisation of the entire deployed sliding window CNN classification system, as opposed to the CNN training process. The forward propagation procedure developed as part of the frequency domain method scales better to larger image sizes than the traditional space domain method, so it forms the basis of our sliding window inference procedure.

In [8] it was observed that the feature maps computed by CNNs for adjacent sub-windows during sliding window classification contain large overlapping sections, allowing a significant number of redundant computations to be avoided. The method originally presented describes the computations in the spatial domain, and the performance is not quantified. We use the frequency domain forward propagation algorithm from [6], perform the same extension described by [8], and apply architecture aware optimisations to greatly accelerate the sliding window classification process to a point where it can be used in real-time with modest sized networks. Neither of these methods impacts the final accuracy of the resulting classification system. That is to say, they produce results that are equivalent to the algorithms conventionally used for performing training and inference on CNNs. Thus, our work was focused on the speedup that can be attained.

We also show how the memory requirements of the algorithm presented in [6] can be reduced. The relationship between the reduced memory usage and the performance is demonstrated.

The paper is structured as follows: in Section 2 we give an overview of the methods presented in [6] and [8], followed by an explanation of how the methods can be combined. Section 3 discusses what modifications must be made to prevent several potential pitfalls, and also provides an exploration of the tradeoff between memory usage and run time. Section 4 discusses the applicability of the approach to backpropagation, and Section 5 gives implementation details. Sections 6 and 7 provide an evaluation of how efficient the different algorithms are in a range of scenarios. Section 8 provides some final comments with suggestions for future work in this area.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
IVCNZ '14, November 19–21, 2014, Hamilton, New Zealand
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3184-5/14/11 $15.00
http://dx.doi.org/10.1145/2683405.2683429

2. BACKGROUND

Exploiting the convolution theorem to accelerate CNNs is not as trivial as it is to use it to speed up other image

processing and computer vision algorithms. If one were to simply replace all occurrences of spatial convolution with frequency domain convolution then, in the vast majority of cases, one would see a dramatic slowdown, because the kernels used in CNNs are usually quite small – 5 × 5 pixels is a very common size. To get around this, [6] leverages the linearity property of the Fourier transform to greatly reduce the number of inverse transforms required to compute each feature map, by performing the summation in the frequency domain. In addition to this, it was observed that one can simply precompute the frequency domain copy of the kernels and store them, further cutting down the number of transforms required.

Although these optimisations are not particularly difficult to see once the task is approached with frequency domain methods in mind, it has taken over twenty years since the introduction of CNNs for them to be applied. It should be noted that [6] thoroughly explores the factors that determine the run-time performance of these frequency domain methods versus the traditional spatial domain methods, but the performance of sliding window classification was not within the scope of their investigation.

Recently [8] introduced OverFeat, a CNN based image recogniser and feature extractor. The authors also included a small section addressing the performance of sliding window classification, where they identified that the feature maps in convolutional layers produced by propagating several adjacent sub-windows through the same CNN contain a large amount of overlapping data. To exploit this fact they applied the convolutional layers to the entire input image rather than processing each sub-window individually, preventing the repeated calculation of the same values. The pooling layers are left to continue pooling in the same pattern, but since the pools are extending over the much larger feature maps one is forced into using a particular stride, determined by the product of the sizes of each pooling layer in the network. Finally, the fully connected layers can be transformed into convolutional layers with 1 × 1 kernels and used in the same manner as the other convolutional layers.

The derivation of the fast sliding window inference algorithm presented by [8] can easily be modified to use the frequency domain methods presented by [6]. In the fast sliding window method each kernel in a convolutional layer is applied to the entirety of each input feature map, as opposed to each sub-window, so by simply using the frequency domain convolution procedure with the CNN specific optimisations from [6], the algorithms can be combined. However, one will quickly encounter a major problem when attempting to perform inference on a large image. The convolution theorem specifically states that point-wise multiplication in the frequency domain is equivalent to circular convolution in the space domain. The way to get around this limitation and still perform linear convolution is to zero-pad the image and the kernel enough that the border effects caused by circular convolution are not present [2]. The problem with this, in the context of sliding window CNN classifiers, is that the input images one would like to apply the sliding window classifier to can be several megapixels. CNNs usually have a very large number of kernels, so padding out all these kernels to be at least as big as a large input image causes the memory requirements to skyrocket – a problem the fast spatial domain sliding window classification algorithm does not suffer from, since the padding is not required for space domain convolution.

Additionally, the fast sliding window algorithm transforms fully connected layers into convolutional layers. If one were to indiscriminately modify all convolutional layers to undertake computations in the frequency domain, huge memory and computational overheads would be incurred, since the kernel consists of only a single value. To gain a good speedup from combining these two methods a slightly more sophisticated approach will need to be employed.

3. OPTIMISING SLIDING CNNS

Excessive memory use has the consequence that the overall memory available to the rest of the system is reduced. If the CNN is part of a larger system then this could be catastrophic, as it could end up leaving the system unable to even operate on a computer with limited memory. We propose a very simple way to avoid the first problem: coarsely divide the large input image up into several smaller input images and apply the frequency domain sliding window inference to each of these images. Note that these smaller sub-images must have an overlap to compensate for the cropping effect introduced by the convolutional layers in the CNN, so if one were to divide the input image into four sub-images then each of those sub-images would contain slightly more than one quarter of the input image. This overlap reintroduces some of the redundant computations that first provided the inspiration for the fast space domain forward propagation algorithm, and as the number of sub-images grows the performance will tend towards the standard method of extracting and performing inference on each sub-window individually. As an example of how this changes the run-time performance, the technique was applied to an image of 1024 × 1024 pixels using the shallow model described in Section 6. Figure 1 shows how this method impacts the speed compared to the simple sliding window frequency domain approach that applies a model to the entire input image at once – when the number of sub-images is one, this approach is equivalent to the standard frequency domain sliding window inference algorithm. The speeds have very little correlation with the number of divisions. This is because the FFT needs to be computed on a signal of a size that is not a power of two; in most modern libraries this causes a significant degradation in performance [1]. Due to the sacrifice in speed, this optimisation is intended to be used in situations where memory consumption needs to be minimised, such as in embedded environments or mobile devices.

Figure 1 (plot omitted): This figure illustrates how dividing up the input image in order to reduce the model size impacts the speed. The network used here is the shallow network described in Section 6. (Axes: time in seconds against number of sub-images, 1 to 256.)

The second problem – of incurring needless overheads for fully connected layers – can also be mitigated. Once again, the solution is rather simple: only convert convolutional layers with sufficiently large kernels to use the frequency domain forward propagation method. We use the very simple heuristic of performing frequency domain based convolutions for any layer where the kernels have more than one element. Thus, the computations for the fully connected layers are still performed in the spatial domain. However, there is still something that can be improved upon. If one truly treats the fully connected layers as convolutional layers with 1 × 1 kernels then a huge number of redundant branching instructions will be executed. To avoid this we make a variant of the convolutional layer where the inner loops inside the convolution methods have been unrolled to take into account that the kernels have only a single element. Figure 2 gives an idea of how much speedup the removal of these superfluous branch instructions provides.

Figure 2 (plot omitted): Relative speedup of the specialised sliding fully connected layer over using a convolutional layer, for different square input image sizes, when using the shallow network described in Section 6. (Axes: relative speedup against input image sizes 32 to 1024.)

4. BACKPROPAGATION

It is worth noting that the fast sliding window algorithm is also capable of speeding up backpropagation, by applying the same extension idea to the correlation and convolution operations used for calculating the derivatives. However, the accuracy is likely to be very poor unless two criteria are satisfied. Firstly, it must make sense to use very large minibatch sizes, as each sub-window is essentially a new instance. Secondly, the class distribution of the sub-windows in a single input image must be varied enough to prevent catastrophic forgetting (a phenomenon explained very well in [3]) for every new input image. Unfortunately, in the object detection setting there are no guarantees about how frequently an object class will occur, so it is unlikely the fast sliding window backpropagation algorithm will be of any use in these cases. The task of pixel-level scene labeling might see a benefit from this backpropagation method, but that is outside the scope of this work.

5. IMPLEMENTATION DETAILS

The algorithms were implemented in C++ and extensive use was made of Intel's AVX (Advanced Vector eXtensions) SIMD (Single Instruction, Multiple Data) intrinsics to provide a significant performance boost. FFTW [4], a popular fast Fourier transform library, was used for computing the Fourier transforms required in the convolutional layers. This library was chosen because of its relatively good performance and because it also utilises AVX instructions.

The implementation is packaged as a library that can be included in other projects and is available for download at http://github.com/henrygouk/nnet/. In addition to the fast sliding window forward propagation algorithm, we implemented the sliding window backpropagation methods as well. We have, however, not yet found any application that would be suitable to use these capabilities, due to the limitations outlined earlier. A GPU version of the library is also in development, and we plan to release this at a later date.

6. PERFORMANCE EVALUATION

From the outset, the goal of this paper was to quantify the relative performance of several recently introduced methods for accelerating CNNs – particularly in the context of sliding window classification. This section provides an in depth evaluation of how each of these algorithms performs under a representative variety of typical scenarios. We aim to show how the performance of each algorithm is impacted by kernel size, input image size, network depth, and network breadth. All experiments are run on a machine with an Intel i7 4770 processor and 16GB of RAM. The performance measurements reported for each parameter configuration are the average over 10 runs. Five different algorithms are evaluated:

• Standard forward propagation in the space domain (FP-S);

• Standard forward propagation in the frequency domain (FP-F);

• Sliding window forward propagation in the space domain (SWFP-S);

• Sliding window forward propagation in the frequency domain (SWFP-F);

• Sliding window forward propagation in the frequency domain with the optimisations to the fully connected layers described in Section 3 (OSWFP-F).

Note that the optimisations introduced for OSWFP-F do not include the technique for reducing memory usage by coarsely dividing up the input image.

The first experiment we conducted provides empirical evidence for how fast the different inference procedures are

on a shallow network. We measure the average time taken to run the CNN on randomly generated synthetic images at several different resolutions. Figure 3 shows these measurements for the five different inference techniques. The network architecture used in this experiment contains four layers: a convolutional layer with 9 × 9 kernels producing 64 feature maps, a max pooling layer with a 3 × 3 pool size, a fully connected layer with 64 units, and a binary output layer. Rectified Linear Units [7] are used for all the hidden layers and softmax is used for the output layer, as this has become the typical combination of activation functions in deep neural networks applied to classification tasks.

Figure 3 (plot omitted): Speed of the five algorithms for different square input image resolutions using a typical shallow network architecture. (Axes: log-scale time in seconds against input image sizes 32 to 1024.)

Next, we investigate how the different methods cope with a significantly larger number of kernels by introducing additional convolutional and pooling layers. The network used for this experiment is as follows: a convolutional layer with 5 × 5 kernels producing 32 feature maps, a pooling layer with a pool size of 2 × 2, another convolutional layer with 5 × 5 kernels that also produces 32 feature maps, another 2 × 2 pooling layer, a fully connected layer with 32 units, and a binary output layer. Once again, we run several different sized images through the network and use all five algorithms to perform forward propagation. The results are presented in Figure 4.

Figure 4 (plot omitted): Speed of the five algorithms for different square input image resolutions using a typical deep network architecture. (Axes: log-scale time in seconds against input image sizes 32 to 1024.)

The primary benefit of the frequency domain method of convolution in the context of CNNs is that the computation time becomes completely dependent on the feature map size; the kernel size plays no role, aside from determining the dimensions of feature maps used as inputs for subsequent layers. To see how the frequency domain algorithms compare to the space domain algorithms, we experimented with the same two networks previously used to show how the performance changes for each of the five techniques when the kernel size is varied, as opposed to the input image size. For these experiments we use a fixed input image size of 1024 × 1024 pixels. Figure 5 shows the results for the shallow network and Figure 6 shows the results for the deep network. When the feature maps output by a convolutional layer are not divisible by the pool sizes due to the varying kernel sizes, the feature maps are cropped to the largest size that is divisible by the pool size without any remainder.

7. DISCUSSION

As can be seen from the experiments presented in Section 6, the frequency domain sliding window inference method that takes advantage of computer architecture aware optimisations is the fastest method in most cases, the exception being where the image size is identical to the receptive field size of the original CNN classifier. The main controlling factor in the performance of the frequency domain approach is the speed of the FFT implementation. The erratic deviation in performance demonstrated in Figure 1, caused by the FFT execution time, can also be seen when observing how the speed varies with different kernel sizes. In Figures 5 and 6 the speedup of the frequency domain algorithms exhibits the same, seemingly random, variations in speed with changes to the kernel size. This means great care must be taken when using the method for reducing the memory usage of the frequency domain algorithm described in Section 3.

The difference between OSWFP-F and SWFP-F is most noticeable when a large fraction of the running time is spent on computing the fully connected layer activations, as that is what the performance optimisations targeted. This difference is most obvious in the shallow network, particularly when the image size is quite large, as shown in Figure 5. In the deep network the vast majority of computations are devoted to the convolutional layers, and because of this there is very little difference between OSWFP-F and SWFP-F.

The fast space domain sliding window classification method presented in [8] (labeled SWFP-S in the figures) does indeed provide a massive speedup of over an order of magnitude compared to the naive approach of extracting each sub-window and running the inference procedure on them individually. This speedup is particularly noticeable when the input image becomes quite large.

8. CONCLUSIONS

We have presented an algorithm that improves upon the state of the art performance of sliding window inference with convolutional neural networks by exploiting a combination of frequency domain methods and knowledge of computer architecture. The performance of this method and of a recently introduced spatial domain method was quantified in a range of scenarios representative of the typical CNN architectures used. Our analysis also shows that the quality of the FFT implementation used impacts the performance in a major way. In the future we plan to investigate how the issues with the fast sliding window backpropagation can be overcome, and also to demonstrate that complex models can be applied to tasks with real-time constraints.

Figure 5 (plot omitted): Speed of the five algorithms for different square kernel sizes using a typical shallow network architecture. The legend has been omitted due to space constraints; however, the colour coding is the same as in Figure 3. (Axes: log-scale time in seconds against kernel sizes 3 to 9.)

Figure 6 (plot omitted): Speed of the five algorithms for different square kernel sizes using a typical deep network architecture. The legend has been omitted due to space constraints; however, the colour coding is the same as in Figure 3. (Axes: log-scale time in seconds against kernel sizes 3 to 9.)

9. REFERENCES

[1] Anthony Martin Blake. Computing the fast Fourier transform on SIMD microprocessors. PhD thesis, University of Waikato, 2012.
[2] C. S. Burrus and Thomas W. Parks. DFT/FFT and Convolution Algorithms: Theory and Implementation. John Wiley & Sons, Inc., 1991.
[3] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[4] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[6] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[7] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[8] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[9] Richard Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
[10] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.

