3 views

Uploaded by sharkie

Squeezenet a deep neural network algorithm

- 1-s2.0-S0266353806002788-main
- On the Study of the Ethernet
- Tutorial 4
- Turban Online Tutorial 2
- neural network
- Neural Network Lec1
- Deep Learning Technique for Music Generation - A Survey
- Artificial Intelligence in UAVs, Airplanes & Autopilots
- 10.1109@ijcnn.1993.713922
- Cavaco-090519
- 021 M03 Ivanova
- Thesis Ieee
- scimakelatex.29021.George+B.+Silvestri.Arnaldo+D.+Brescia
- 1112.1473
- Sc Industrial Applications
- Singh 2015
- 10. Data mining and data warehouse.pdf
- Neural Networks
- Intelligent approach for efficient operation of electrical distribution automation systems
- Vehicle Detection and Road Scene Segmentation using Deep Learning

You are on page 1of 9

Forrest N. Iandola1 , Song Han2 , Matthew W. Moskewicz1 , Khalid Ashraf1 , William J. Dally2 , Kurt Keutzer1

1

DeepScale∗ & UC Berkeley 2 Stanford University

{forresti, moskewcz, kashraf, keutzer}@eecs.berkeley.edu {songhan, dally}@stanford.edu

arXiv:1602.07360v3 [cs.CV] 6 Apr 2016

models require less communication, making frequent

Recent research on deep convolutional neural networks updates more feasible.

(CNNs) has focused primarily on improving accuracy. For

a given accuracy level, it is typically possible to identify • Feasible FPGA and embedded deployment. FPGAs

multiple CNN architectures that achieve that accuracy level. often have less than 10MB1 of on-chip memory and

With equivalent accuracy, smaller CNN architectures offer at no off-chip memory or storage. For inference, a suf-

least three advantages: (1) Smaller CNNs require less com- ficiently small model could be stored directly on the

munication across servers during distributed training. (2) FPGA instead of being bottle necked by memory band-

Smaller CNNs require less bandwidth to export a new model width [25], while video frames stream through the

from the cloud to an autonomous car. (3) Smaller CNNs FPGA in real time.

are more feasible to deploy on FPGAs and other hardware

with limited memory. To provide all of these advantages,

we propose a small CNN architecture called SqueezeNet. As you can see, there are several advantages of smaller

SqueezeNet achieves AlexNet-level accuracy on ImageNet CNN architectures. With this in mind, we focus directly on

with 50x fewer parameters. Additionally, with model com- the problem of identifying a CNN architecture with fewer pa-

pression techniques we are able to compress SqueezeNet to rameters but equivalent accuracy compared to a well-known

less than 0.5MB (510× smaller than AlexNet). model. We have discovered such an architecture, which we

The SqueezeNet architecture is available for download call SqueezeNet.

here: https://github.com/DeepScale/SqueezeNet The rest of the paper is organized as follows. In Section 2

we review the related work. Then, in Sections 3 and 4 we

describe and evaluate the SqueezeNet architecture. After

1. Introduction and Motivation that, we turn our attention to understanding how CNN ar-

Much of the recent research on deep convolutional neu-

chitectural design choices impact model size and accuracy.

ral networks (CNNs) has focused on increasing accuracy on

We gain this understanding by exploring the design space of

computer vision datasets. For a given accuracy level, there

SqueezeNet-like architectures. In Section 5, we do design

typically exist multiple CNN architectures that achieve that

space exploration on the CNN microarchitecture, which we

accuracy level. Given equivalent accuracy, a CNN architec-

define as the organization and dimensionality of individual

ture with fewer parameters has several advantages:

layers and modules. In Section 6, we do design space ex-

• More efficient distributed training. Communication ploration on the CNN macroarchitecture, which we define

among servers is the limiting factor to the scalability of as high-level organization of layers in a CNN. In Section 7,

distributed CNN training. For distributed data-parallel we do design space exploration on model compression tech-

training, communication overhead is directly propor- niques for compressing CNNs such as SqueezeNet. Finally,

tional to the number of parameters in the model [15]. we conclude in Section 8. In short, Sections 3 and 4 are use-

In short, small models train faster due to requiring less ful for CNN researchers as well as practitioners who simply

communication. want to apply SqueezeNet to a new application. The remain-

ing sections are aimed at advanced researchers who intend to

• Less overhead when exporting new models to clients.

design their own CNN architectures.

For autonomous driving, companies such as Tesla pe-

riodically copy new models from their servers to cus-

tomers’ cars. With AlexNet, this would require 240MB 1 For example, the Xilinx Vertex-7 FPGA has a maximum of 8.5 MBytes

∗ http://deepscale.ai (i.e. 68 Mbits) of on-chip memory and does not provide off-chip memory.

1

2. Related Work Perhaps the mostly widely studied CNN macroarchitec-

2.1. Model Compression ture topic in the recent literature is the impact of depth (i.e.

The overarching goal of our work is to identify a model number of layers) in networks. Simoyan and Zisserman pro-

that has very few parameters while preserving accuracy. To posed the VGG [26] family of CNNs with 12 to 19 layers

address this problem, a sensible approach is to take an exist- and reported that deeper networks produce higher accuracy

ing CNN model and compress it in a lossy fashion. In fact, a on the ImageNet-1k dataset [4]. K. He et al. proposed deeper

research community has emerged around the topic of model CNNs with up to 30 layers that deliver even higher ImageNet

compression, and several approaches have been reported. A accuracy [14].

fairly straightforward approach by Denton et al. is to apply The choice of connections across multiple layers or mod-

singular value decomposition (SVD) to a pretrained CNN ules is an emerging area of CNN macroarchitectural re-

model [5]. Han et al. developed Network Pruning, which be- search. Residual Networks (ResNet) [13] and Highway Net-

gins with a pretrained model, then replaces parameters that works [29] each propose the use of connections that skip over

are below a certain threshold with zeros to form a sparse multiple layers, for example additively connecting the acti-

matrix, and finally performs a few iterations of training on vations from layer 3 to the activations from layer 6. We refer

the sparse CNN [11]. Recently, Han et al. extended their to these connections as bypass connections. The authors of

work by combining Network Pruning with quantization (to ResNet provide an A/B comparison of a 34-layer CNN with

8 bits or less) and huffman encoding to create an approach and without bypass connections; adding bypass connections

called Deep Compression [10], and further designed a hard- delivers a 2 percentage-point improvement on Top-5 Ima-

ware accelerator called EIE [9] that operates directly on the geNet accuracy.

compressed model, achieving substantial speedups and en-

ergy savings. 2.4. Neural Network Design Space Exploration

Neural networks (including deep and convolutional NNs)

2.2. CNN Microarchitecture have a large design space, with numerous options for mi-

Convolutions have been used in artificial neural networks

croarchitectures, macroarchitectures, solvers, and other hy-

for at least 25 years; LeCun et al. helped to popularize CNNs

perparameters. It seems natural that the community would

for digit recognition applications in the late 1980s [21]. In

want to gain intuition about how these factors impact a NN’s

neural networks, convolution filters are typically 3D, with

accuracy (i.e. the shape of the design space). Much of the

height, width, and channels as the key dimensions. When

work on design space exploration (DSE) of NNs has fo-

applied to images, CNN filters typically have 3 channels in

cused on developing automated approaches for finding NN

their first layer (i.e. RGB), and in each subsequent layer

architectures that deliver higher accuracy. These automated

Li the filters have the same number of channels as Li−1

DSE approaches include bayesian optimization [27], simu-

has filters. The early work by LeCun et al. [21] uses

lated annealing [23], randomized search [2], and genetic al-

5x5xChannels2 filters, and the recent VGG [26] architec-

gorithms [30]. To their credit, each of these papers provides

tures extensively use 3x3 filters. Models such as Network-

a case in which the proposed DSE approach produces a NN

in-Network [22] and the GoogLeNet family of architec-

architecture that achieves higher accuracy compared to a rep-

tures [32, 18, 33, 31] use 1x1 filters in some layers.

resentative baseline. However, these papers make no attempt

With the trend of designing very deep CNNs, it becomes

to provide intuition about the shape of the NN design space.

cumbersome to manually select filter dimensions for each

Later in this paper, we eschew automated approaches – in-

layer. To address this, various higher level building blocks,

stead, we refactor CNNs in such a way that we can do princi-

or modules, comprised of multiple convolution layers with

pled A/B comparisons to investigate how CNN architectural

a specific fixed organization have been proposed. For ex-

decisions influence model size and accuracy.

ample, the GoogLeNet papers propose Inception modules,

In the following sections, we first propose and evaluate

which are comprised of a number of different dimensional-

the SqueezeNet architecture with and without model com-

ities of filters, usually including 1x1 and 3x3, plus some-

pression. Then, we explore the impact of design choices in

times 5x5 [32] and sometimes 1x3 and 3x1 [33]. Many such

microarchitecture, macroarchitecture, and model compres-

modules are then combined, perhaps with additional ad-hoc

sion for SqueezeNet-like CNN architectures.

layers, to form a complete network. We use the term CNN

microarchitecture to refer to the particular organization and

dimensions of the individual modules. 3. SqueezeNet: preserving accuracy with

few parameters

2.3. CNN Macroarchitecture In this section, we begin by outlining our design strategies

While the CNN microarchitecture refers to individual lay- for CNN architectures with few parameters. Then, we intro-

ers and modules, we define the CNN macroarchitecture as duce the Fire module, our new building block out of which to

the big-picture organization of multiple modules into an end- build CNN architectures. Finally, we use our design strate-

to-end CNN architecture. gies to construct SqueezeNet, which is comprised mainly of

2 From now on, we will simply abbreviate HxWxChannels to HxW. Fire modules.

2

3.1. Architectural Design Strategies eze8

Our overarching objective in this paper is to identify CNN sq u e

1x18convolu.on8ﬁlters8

architectures which have few parameters while maintaining

competitive accuracy. To achieve this, we employ three main

strategies when designing CNN architectures: ReLU8

n d8

Strategy 1. Replace 3x3 filters with 1x1 filters. Given a expa 1x18and83x38convolu.on8ﬁlters8

budget of a certain number of convolution filters, we will

choose to make the majority of these filters 1x1, since a 1x1

filter has 9X fewer parameters than a 3x3 filter.

Strategy 2. Decrease the number of input channels to 3x3 ReLU8

Figure 1. Microarchitectural view: Organization of convolution fil-

tirely of 3x3 filters. The total quantity of parameters in this ters in the Fire module. In this example, s1x1 = 3, e1x1 = 4, and

layer is (number of input channels) * (number of filters) * e3x3 = 4. We illustrate the convolution filters but not the activa-

(3*3). So, to maintain a small total number of parameters tions.

in a CNN, it is important not only to decrease the number

of 3x3 filters (see Strategy 1 above), but also to decrease the

number of input channels to the 3x3 filters. We decrease the The liberal use of 1x1 filters in Fire modules is an applica-

number of input channels to 3x3 filters using squeeze layers, tion of Strategy 1 from Section 3.1. We expose three tunable

which we describe in the next section. dimensions (hyperparameters) in a Fire module: s1x1 , e1x1 ,

and e3x3 . In a Fire module, s1x1 is the number of filters in

Strategy 3. Downsample late in the network so that con-

the squeeze layer (all 1x1), e1x1 is the number of 1x1 filters

volution layers have large activation maps. In a convo-

in the expand layer, and e3x3 is the number of 3x3 filters in

lutional network, each convolution layer produces an output

the expand layer. When we use Fire modules we set s1x1

activation map with a spatial resolution that is at least 1x1

to be less than (e1x1 + e3x3 ), so the squeeze layer helps to

and often much larger than 1x1. The height and width of

limit the number of input channels to the 3x3 filters, as per

these activation maps are controlled by: (1) the size of the

Strategy 2 from Section 3.1.

input data (e.g. 256x256 images) and (2) the choice of lay-

ers in which to downsample in the CNN architecture. Most 3.3. The SqueezeNet architecture

commonly, downsampling is engineered into CNN architec- We now describe the SqueezeNet CNN architecture. We

tures by setting the (stride > 1) in some of the convolution illustrate in Figure 2 that SqueezeNet begins with a stan-

or pooling layers. If early3 layers in the network have large dalone convolution layer (conv1), followed by 8 Fire mod-

strides, then most layers will have small activation maps. ules (fire2-9), ending with a final conv layer (conv10). We

Conversely, if most layers in the network have a stride of gradually increase the number of filters per fire module from

1, and the strides greater than 1 are concentrated toward the the beginning to the end of the network. SqueezeNet per-

end4 of the network, then many layers in the network will forms max-pooling with a stride of 2 after conv1, fire4,

have large activation maps. Our intuition is that large activa- fire8, and conv10; these relatively late placements of pool-

tion maps (due to delayed downsampling) can lead to higher ing are per Strategy 3 from Section 3.1. We present the full

classification accuracy, with all else held equal. Indeed, K. SqueezeNet architecture in Table 1.

He and H. Sun applied delayed downsampling to four differ-

ent CNN architectures, and in each case delayed downsam-

3.3.1 Other SqueezeNet details

pling led to higher classification accuracy [12].

• So that the output activations from 1x1 and 3x3 filters

Strategies 1 and 2 are about judiciously decreasing the

have the same height and width, we add a 1-pixel border

quantity of parameters in a CNN while attempting to pre-

of zero-padding in the input data to 3x3 filters of expand

serve accuracy. Strategy 3 is about maximizing accuracy on

modules.

a limited budget of parameters. Next, we describe the Fire

module, which is our building block for CNN architectures • ReLU [24] is applied to activations from squeeze and

that enables us to successfully employ Strategies 1, 2, and 3. expand layers.

• Dropout [28] with a ratio of 50% is applied after the

3.2. The Fire Module

We define the Fire module as follows. A Fire module is fire9 module.

comprised of: a squeeze convolution layer (which has only • Note the lack of fully-connected layers in SqueezeNet;

1x1 filters), feeding into an expand layer that has a mix of this design choice was inspired by the NiN [22] archi-

1x1 and 3x3 convolution filters; we illustrate this in Figure 1. tecture.

3 In our terminology, an “early” layer is close to the input data. • When training SqueezeNet, we use a polynomial learn-

4 In our terminology, the “end” of the network is the classifier. ing rate much like the one described in [15]. For

3

conv1 conv1 conv1

96 96 96

maxpool/2 maxpool/2 maxpool/2

96

128 128 128

fire3 fire3 fire3

128 128 128

fire4 fire4 fire4 conv1x1

256 256 256

maxpool/2 maxpool/2 maxpool/2

256 256 256

fire6 fire6 fire6 conv1x1

384 384 384

fire7 fire7 fire7

384 384 384

fire8 fire8 fire8 conv1x1

512 512 512

maxpool/2 maxpool/2 maxpool/2

512 512 512

conv10 conv10 conv10

1000 1000 1000

global avgpool global avgpool global avgpool

"labrador

softmax

retriever

softmax softmax

dog"

Figure 2. Macroarchitectural view of our SqueezeNet architecture. Left: SqueezeNet (Section 3.3); Middle: SqueezeNet with simple bypass

(Section 6); Right: SqueezeNet with complex bypass (Section 6).

details on the training protocol (e.g. batch size, Table 1. SqueezeNet architectural dimensions. (The formatting of

this table was inspired by the Inception2 paper [18].)

learning rate, parameter initialization), please refer filter size /

s1x1 e1x1 e3x3 #parameter #parameter

to our Caffe [19] configuration files located here: layer

output size

stride

depth (#1x1 s1x1 e1x1 e3x3 # bits before after

name/type (if not a fire (#1x1 (#3x3

sparsity sparsity sparsity pruning pruning

layer) squeeze) expand) expand)

https://github.com/DeepScale/SqueezeNet. input image 224x224x3 - -

conv1 111x111x96 7x7/2 (x96) 1 100% (7x7) 6bit 14,208 14,208

• The Caffe framework does not natively support a convo- maxpool1 55x55x96 3x3/2 0

fire2 55x55x128 2 16 64 64 100% 100% 33% 6bit 11,920 5,746

lution layer that contains multiple filter resolutions (e.g. fire3 55x55x128 2 16 64 64 100% 100% 33% 6bit 12,432 6,258

1x1 and 3x3). To get around this, we implement our ex- fire4 55x55x256 2 32 128 128 100% 100% 33% 6bit 45,344 20,646

maxpool4 27x27x256 3x3/2 0

pand layer with two separate convolution layers: a layer fire5 27x27x256 2 32 128 128 100% 100% 33% 6bit 49,440 24,742

with 1x1 filters, and a layer with 3x3 filters. Then, we fire6 27x27x384 2 48 192 192 100% 50% 33% 6bit 104,880 44,700

fire7 27x27x384 2 48 192 192 50% 100% 33% 6bit 111,024 46,236

concatenate the outputs of these layers together in the fire8 27x27x512 2 64 256 256 100% 50% 33% 6bit 188,992 77,581

maxpool8 13x12x512 3x3/2 0

channel dimension. This is numerically equivalent to fire9 13x13x512 2 64 256 256 50% 100% 30% 6bit 197,184 77,581

implementing one layer that contains both 1x1 and 3x3 conv10 13x13x1000 1x1/1 (x1000) 1 20% (3x3) 6bit 513,000 103,400

avgpool10 1x1x1000 13x13/1 0

filters. 1,248,424 421,098

(total) (total)

4. Evaluation of SqueezeNet

We now turn our attention to evaluating SqueezeNet. In

is able to compress a pretrained AlexNet model by a factor

each of the CNN model compression papers reviewed in Sec-

of 5X, while diminishing top-1 accuracy to 56.0% [5]. Net-

tion 2.1, the goal was to compress an AlexNet [20] model

work Pruning achieves a 9X reduction in model size while

that was trained to classify images using the ImageNet [4]

maintaining the baseline of 57.2% top-1 and 80.3% top-5

(ILSVRC 2012) dataset. Therefore, we use AlexNet5 and

accuracy on ImageNet [11]. Deep Compression achieves

the associated model compression results as a basis for com-

a 35x reduction in model size while still maintaining the

parison when evaluating SqueezeNet.

baseline accuracy level [10]. Now, with SqueezeNet, we

In Table 2, we review SqueezeNet in the context of re-

achieve a 50X reduction in model size compared to AlexNet,

cent model compression results. The SVD-based approach

while meeting or exceeding the top-1 and top-5 accuracy of

5 Our baseline is bvlc alexnet from the Caffe codebase [19]. AlexNet. We summarize all of the aforementioned results in

4

Table 2. Comparing SqueezeNet to model compression approaches. By model size, we mean the number of bytes required to store all of the

parameters in the trained model.

CNN Compression Data Original → Reduction in Top-1 Top-5

architecture Approach Type Compressed Model Size Model Size vs. ImageNet ImageNet

AlexNet Accuracy Accuracy

AlexNet None (baseline) 32 bit 240MB 1x 57.2% 80.3%

AlexNet SVD [5] 32 bit 240MB → 48MB 5x 56.0% 79.4%

AlexNet Network 32 bit 240MB → 27MB 9x 57.2% 80.3%

Pruning [11]

AlexNet Deep Compres- 5-8 bit 240MB → 6.9MB 35x 57.2% 80.3%

sion [10]

SqueezeNet None 32 bit 4.8MB 50x 57.5% 80.3%

(ours)

SqueezeNet Deep 8 bit 4.8MB → 0.66MB 363x 57.5% 80.3%

(ours) Compression

SqueezeNet Deep 6 bit 4.8MB → 0.47MB 510x 57.5% 80.3%

(ours) Compression

Table 2. space. We divide this exploration into three main topics: mi-

It appears that we have surpassed the state-of-the-art re- croarchitectural exploration (per-module layer dimensions

sults from the model compression community: even when and configurations), macroarchitectural exploration (high-

using uncompressed 32-bit values to represent the model, level end-to-end organization of modules and other layers),

SqueezeNet has a 1.4× smaller model size than the best ef- and exploration of model compression configurations.

forts from the model compression community while main-

In this section, we design and execute experiments with

taining or exceeding the baseline accuracy. Until now, an

the goal of providing intuition about the shape of the mi-

open question has been: are small models amenable to com-

croarchitectural design space with respect to the design

pression, or do small models “need” all of the representa-

strategies that we proposed in Section 3.1. Note that our goal

tional power afforded by dense floating-point values? To

here is not to maximize accuracy in every experiment, but

find out, we applied Deep Compression [10] to SqueezeNet,

rather to understand the impact of CNN architectural choices

using 33% sparsity6 and 8-bit quantization. This yields a

on model size and accuracy.

0.66 MB model (363× smaller than 32-bit AlexNet) with

equivalent accuracy to AlexNet. Further, applying Deep

Compression with 6-bit quantization and 33% sparsity on

SqueezeNet, we produce a 0.47MB model (510× smaller 5.1. CNN Microarchitecture metaparameters

than 32-bit AlexNet) with equivalent accuracy. It appears

In SqueezeNet, each Fire module has three dimensional

that our small model is indeed amenable to compression.

hyperparameters which we defined in Section 3.2: s1x1 ,

In addition, these results demonstrate that Deep Com-

e1x1 , and e3x3 . SqueezeNet has 8 Fire modules with a to-

pression [10] not only works well on CNN architectures

tal of 24 dimensional hyperparameters. To do broad sweeps

with many parameters (e.g. AlexNet and VGG), but it is

of the design space of SqueezeNet-like architectures, we de-

also able to compress the already compact, fully convolu-

fine the following set of higher level metaparameters which

tional SqueezeNet architecture. Deep Compression com-

control the dimensions of all Fire modules in a CNN. We

pressed SqueezeNet by 10× while preserving the baseline

define basee as the number of expand filters in the first

accuracy. In summary: by combining CNN architectural

Fire module in a CNN. After every f req Fire modules, we

innovation (SqueezeNet) with state-of-the-art compression

increase the number of expand filters by incre . In other

techniques (Deep Compression), we achieved a 510× reduc-

tion in model size with no decrease in accuracy compared to

words, for Fire module j i, the

k number of expand filters is

i

the baseline. ei = basee + (incre ∗ f req ). In the expand layer of a Fire

module, some filters are 1x1 and some are 3x3; we define

5. CNN Microarchitecture Design Space Explo- ei = ei,1x1 + ei,3x3 with pct3x3 (in the range [0, 1], shared

over all Fire modules) as the percentage of expand filters that

ration

are 3x3. In other words, ei,3x3 = ei ∗ pct3x3 , and ei,1x1 =

So far, we have proposed architectural design strate-

ei ∗ (1 − pct3x3 ). Finally, we define the number of filters

gies for small models, followed these principles to cre-

in the squeeze layer of a Fire module using a metaparam-

ate SqueezeNet, and discovered that SqueezeNet is 50x

eter called the squeeze ratio (SR) (again, in the range [0, 1],

smaller than AlexNet with equivalent accuracy. However,

shared by all Fire modules): si,1x1 = SR∗ei (or equivalently

SqueezeNet and other models reside in a broad and largely

si,1x1 = SR ∗ (ei,1x1 + ei,3x3 )). SqueezeNet (Table 1) is an

unexplored design space of CNN architectures. Now, in Sec-

example architecture that we generated with the aforemen-

tions 5, 6, and 7, we explore several aspects of the design

tioned set of metaparameters. Specifically, SqueezeNet has

6 Note that, due to the storage overhead of storing sparse matrix indices, the following metaparameters: basee = 128, incre = 128,

33% sparsity leads to somewhat less than a 3× decrease in model size. pct3x3 = 0.5, f req = 2, and SR = 0.125.

5

5.2. Squeeze Ratio

In Section 3.1, we proposed decreasing the number of SqueezeNet# 85.3%# 86.0%#

parameters by using squeeze layers to decrease the num- 80.3%# accuracy# accuracy#

accuracy#

ber of input channels seen by 3x3 filters. We defined the 13#MB#of# 19#MB#of#

4.8#MB#of# weights# weights#

squeeze ratio (SR) as the ratio between the number of filters weights#

in squeeze layers and the number of filters in expand layers.

We now design an experiment to investigate the effect of the

squeeze ratio on model size and accuracy.

In these experiments, we use SqueezeNet (Figure 2) as

a starting point. As in SqueezeNet, these experiments use

the following metaparameters: basee = 128, incre = 128,

pct3x3 = 0.5, and f req = 2. We train multiple models,

where each model has a different squeeze ratio (SR)7 in the

range [0.125, 1.0]. In Figure 3(a), we show the results of (a) Exploring the impact of the squeeze ratio (SR) on model

this experiment, where each point on the graph is an inde- size and accuracy.

pendent model that was trained from scratch. SqueezeNet is

the SR=0.125 point in this figure. From this figure, we learn

that increasing SR beyond 0.125 can further increase Ima- 85.3%# 85.3%#

76.3%# accuracy# accuracy#

geNet top-5 accuracy from 80.3% (i.e. AlexNet-level) with accuracy#

13#MB#of# 21#MB#of#

a 4.8MB model to 86.0% with a 19MB model. Accuracy 5.7#MB#of# weights# weights#

plateaus at 86.0% with SR=0.75 (a 19MB model), and set- weights#

ting SR=1.0 further increases model size without improving

accuracy.

5.3. Trading off 1x1 and 3x3 filters

In Section 3.1, we proposed decreasing the number of pa-

rameters in a CNN by replacing some 3x3 filters with 1x1

filters. An open question is, how important is spatial reso-

lution in CNNs? The VGG [26] architectures have 3x3 spa-

tial resolution in most layers’ filters; GoogLeNet [32] and (b) Exploring the impact of the ratio of 3x3 filters in expand lay-

ers (pct3x3 ) on model size and accuracy.

Network-in-Network (NiN) [22] have 1x1 filters in some lay-

ers. In GoogLeNet and NiN, the authors simply propose a Figure 3. Microarchitectural design space exploration.

specific quantity of 1x1 and 3x3 filters without further anal-

ysis.8 Here, we attempt to shed light on how the proportion 6. CNN Macroarchitecture Design Space Ex-

of 1x1 and 3x3 filters affects model size and accuracy.

ploration

We use the following metaparameters in this experiment:

So far we have explored the design space at the microar-

basee = incre = 128, f req = 2, SR = 0.500, and we vary

chitecture level, i.e. the contents of individual modules of the

pct3x3 from 1% to 99%. In other words, each Fire module’s

CNN. Now, we explore design decisions at the macroarchi-

expand layer has a predefined number of filters partitioned

tecture level concerning the high-level connections among

between 1x1 and 3x3, and here we turn the knob on these fil-

Fire modules. Inspired by ResNet [13], we explored three

ters from “mostly 1x1” to “mostly 3x3”. As in the previous

different architectures:

experiment, these models have 8 Fire modules, following the

same organization of layers as in Figure 2. We show the re- • Vanilla SqueezeNet (as per the prior sections).

sults of this experiment in Figure 3(b). Note that the 13MB

models in Figure 3(a) and Figure 3(b) are the same architec- • SqueezeNet with simple bypass connections between

ture: SR = 0.500 and pct3x3 = 50%. We see in Figure 3(b) some Fire modules.

that the top-5 accuracy plateaus at 85.6% using 50% 3x3 fil- • SqueezeNet with complex bypass connections between

ters, and further increasing the percentage of 3x3 filters leads the remaining Fire modules.

to a larger model size but provides no improvement in accu-

racy on ImageNet. We illustrate these three variants of SqueezeNet in Fig-

ure 2.

Our simple bypass architecture adds bypass connections

7 Note that, for a given model, all Fire layers share the same squeeze

around Fire modules 3, 5, 7, and 9, requiring these modules

ratio.

to learn a residual function between input and output. As in

8 To be clear, each filter is 1x1xChannels or 3x3xChannels, which we ResNet, to implement a bypass connection around Fire3, we

abbreviate to 1x1 and 3x3. set the input to Fire4 equal to (output of Fire2 + output of

6

Table 3. SqueezeNet accuracy and model size using different Top1-accuracy Top5-accuracy

82.0%

macroarchitecture configurations

71.7%

Architecture Top-1 Top-5 Model

Accuracy Accuracy Size 61.3%

fire2_s1x1

fire3_s1x1

fire4_s1x1

fire5_s1x1

fire6_s1x1

fire7_s1x1

fire8_s1x1

fire9_s1x1

conv1

fire2_e1x1

fire2_e3x3

fire3_e1x1

fire3_e3x3

fire4_e1x1

fire4_e3x3

fire5_e1x1

fire5_e3x3

fire6_e1x1

fire6_e3x3

fire7_e1x1

fire7_e3x3

fire8_e1x1

fire8_e3x3

fire9_e1x1

fire9_e3x3

conv10

SqueezeNet + Simple 60.4% 82.5% 4.8MB

Bypass

SqueezeNet + Complex 58.8% 82.0% 7.7MB

Bypass Figure 4. Sensitivity analysis: accuracy of the SqueezeNet after

pruning each layer by 50%. There are two dents for fire7 e1x1

and fire9 e1x1. Those layers are really sensitive to pruning. They

Fire3), where the + operator is elementwise addition. This should be sized larger; Those 3x3 kernels are flat meaning less sen-

changes the regularization applied to the parameters of these sitive, so they can be pruned more.

Fire modules, and, as per ResNet, can improve the final ac-

curacy and/or ability to train the full model.

One limitation is that, in the straightforward case, the perparameters include the choice of bit width and degree of

number of input channels and number of output channels has pruning (i.e. sparsity level) of each layer of the CNN. To

to be the same; as a result, only half of the Fire modules can address this, we propose sensitivity analysis, which is our

have simple bypass connections, as shown in the middle dia- systematic strategy for broadly exploring the design space of

gram of Fig 2. When the “same number of channels” require- per-layer parameter pruning settings.

ment can’t be met, we use a complex bypass connection, as

illustrated on the right of Figure 2. While a simple bypass 7.1. Sensitivity Analysis: Where to Prune or Add

is “just a wire,” we define a complex bypass as a bypass that parameters

includes a 1x1 convolution layer with the number of filters Broadly, we define Sensitivity Analysis as any technique

set equal to the number of output channels that are needed. that provides insight into how important a layer’s parame-

Note that complex bypass connections add extra parameters ters are to the network’s accuracy. Our sensitivity analysis

to the model, while simple bypass connections do not. approach is quite straightforward. In SqueezeNet, there are

In addition to changing the regularization, it is intuitive 8 Fire modules and 2 standalone convolution layers. As we

to us that adding bypass connections would help to alleviate described in Section 3.3.1, a Fire module has 3 filter banks:

the representational bottleneck introduced by squeeze lay- squeeze1x1 , expand1x1 , and expand3x3 . Thus, at the per-

ers. In SqueezeNet, the squeeze ratio (SR) is 0.125, meaning layer level (rather than the per-module level), there are a to-

that every squeeze layer has 8x fewer output channels than tal of 26 layers. To perform sensitivity analysis, we perform

the accompanying expand layer. Due to this severe dimen- 26 separate experiments: in each experiment, we select one

sionality reduction, a limited amount of information can pass layer in SqueezeNet and zero out (i.e. prune) the 50% of

through squeeze layers. However, by adding bypass connec- the parameters with the smallest numerical values. In Fig-

tions to SqueezeNet, we open up avenues for information to ure 4, each point is one of these 26 experiments. The exper-

flow around the squeeze layers. iments in Figure 4 had no further training after parameters

We trained SqueezeNet with the three macroarchitec- were pruned – observe that in many layers, deleting 50% of

tures in Figure 2 and compared the accuracy and model the parameters leads to no accuracy drop compared to a non-

size in Table 3. We fixed the microarchitecture to match pruned baseline of 80.3% top-5 accuracy.

SqueezeNet as described in Table 1 throughout the macroar- Sensitivity analysis applied to model compression.

chitecture exploration. Complex and simple bypass connec- Broadly, we observe in Figure 4 that the 1x1 filters tend to

tions both yielded an accuracy improvement over the vanilla be more sensitive to pruning than the 3x3 filters. With this

SqueezeNet architecture. Interestingly, the simple bypass in mind, we configure Deep Compression to prune 3x3 filters

enabled a higher accuracy accuracy improvement than com- quite aggressively (see “sparsity #bits” in Table 1). We prune

plex bypass. Adding the simple bypass connections yielded 1x1 filters less aggressively and in some cases not at all. The

an increase of 2.9 percentage-points in top-1 accuracy and insights from sensitivity analysis enabled us to preserve the

2.2 percentage-points in top-5 accuracy without increasing baseline accuracy while pruning by a factor of 3× and fur-

model size. ther decreasing model size by using narrower bit widths. In

Figure 5, we show the per-layer distribution of parameters

7. Model Compression Design Space Explo- before and after applying Deep Compression.

ration Sensitivity analysis applied to increasing accuracy. If

Deep Compression [10] made SqueezeNet 10× smaller sensitivity analysis can expose the parameters that are most

without losing accuracy. Earlier in the paper we made harmless to prune, we wondered – can sensitivity analy-

this look simple, but in reality there are a number of de- sis also suggest layers in which adding parameters would

sign choices and hyperparameters that need to be consid- be particularly helpful for accuracy? Our intuition is that,

ered when applying Deep Compression to a CNN. These hy- adding parameters to the most sensitive in layers in Fig-

7

# Weights remaining # Weights pruned away Table 4. Improving accuracy by using sensitivity analysis to deter-

5E+05 mine where to add parameters in a CNN.

Architecture Top-1 Accuracy Top-5 Accuracy Model Size

4E+05 SqueezeNet 57.5% 80.3% 4.8MB

SqueezeNet++ 59.5% 81.5% 7.1MB

3E+05

Table 5. Improving accuracy with dense→sparse→dense (DSD)

training.

1E+05

Architecture Top-1 Top-5 Model

Accuracy Accuracy Size

0E+00 SqueezeNet 57.5% 80.3% 4.8MB

fire2_s1x1

fire3_s1x1

fire4_s1x1

fire5_s1x1

fire6_s1x1

fire7_s1x1

fire8_s1x1

fire9_s1x1

conv10

conv1

fire2_e1x1

fire2_e3x3

fire3_e1x1

fire3_e3x3

fire4_e1x1

fire4_e3x3

fire5_e1x1

fire5_e3x3

fire6_e1x1

fire6_e3x3

fire7_e1x1

fire7_e3x3

fire8_e1x1

fire8_e3x3

fire9_e1x1

fire9_e3x3

SqueezeNet with 61.8% 83.5% 4.8MB

DSD

Figure 5. Parameters pruned away in each layer. 3x3 kernels has work.

more redundancy and could be pruned more than 1x1 kernel. Over-

all SqueezeNet can be removed 3x parameters without hurting ac- 8. Conclusions

curacy. We have presented SqueezeNet, a CNN architecture

that has 50× fewer parameters than AlexNet and main-

ure 5 is likely to improve model accuracy.9 By looking tains AlexNet-level accuracy on ImageNet. We also com-

at Figure 4, we found f ire7 e1x1 and f ire9 e1x1 each pressed SqueezeNet to less than 0.5MB, or 510× smaller

have a “dent” in the accuracy plot, and are therefore par- than AlexNet without compression. In this paper, we focused

ticularly sensitive. With this in mind, we increased the num- on ImageNet as a target dataset. However, it has become

ber of channels in f ire7 e1x1 and f ire9 e1x1 a factor of common practice to apply ImageNet-trained CNN represen-

3, and we call this model “SqueezeNet++”. We trained the tations to a variety of applications such as fine-grained object

SqueezeNet++ model from scratch, and we report accuracy recognition [34, 6], logo identification in images [17], and

results in Table 4. Compared to vanilla SqueezeNet, we find generating sentences about images [7]. ImageNet-trained

that SqueezeNet++ delivers a top-5 accuracy increase of 1.2 CNNs have also been applied to a number of applications

percentage points and a top-1 accuracy increase of over 2 pertaining to autonomous driving, including pedestrian and

percentage points. vehicle detection in images [16, 8] and videos [3], as well as

segmenting the shape of the road [1]. We think SqueezeNet

7.2. Improving Accuracy by Densifying Sparse will be a good candidate CNN architecture for a variety of

Models applications, especially those in which small model size is

We discovered an interesting byproduct of model com-

of importance.

pression: re-densifying and retraining from a sparse model

SqueezeNet is one of several new CNNs that we have dis-

can improve the accuracy. That is, compared to a dense CNN

covered while broadly exploring the design space of CNN ar-

baseline, dense→sparse→dense (DSD) training yielded

chitectures. We hope that SqueezeNet will inspire the reader

higher accuracy.

to consider and explore the broad range of possibilities in the

We now explain our DSD training strategy. On top of

design space of CNN architectures.

the sparse SqueezeNet (pruned 3x), we let the killed weights

recover, initializing them from zero. We let the survived References

weights keeping their value. We retrained the whole network

using learning rate of 1e − 4. After 20 epochs of training, we [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Seg-

observed that the top-1 ImageNet accuracy improved by 4.3 Net: A deep convolutional encoder-decoder architec-

percentage-points; we show this in Table 5. ture for image segmentation. arxiv:1511.00561, 2015.

Sparsity is a powerful form of regularization. Our intu- 8

ition is that, once the network arrives at a local minimum [2] J. Bergstra and Y. Bengio. An optimization method-

given the sparsity constraint, relaxing the constraint gives ology for neural network weights and architectures.

the network more freedom to escape the saddle point and ar- JMLR, 2012. 2

rive at a higher-accuracy local minimum. So far, we trained [3] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma,

in just three stages of density (dense→sparse→dense), but S. Fidler, and R. Urtasun. 3d object proposals for accu-

regularizing models by intermittently pruning parameters10 rate object class detection. In NIPS, 2015. 8

throughout training would be an interesting area of future

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and

9 One might naively think that adding parameters anywhere in a CNN L. Fei-Fei. ImageNet: A large-scale hierarchical im-

would improve accuracy. However, as we observed in Figure 3(b), there are age database. In CVPR, 2009. 2, 4

certainly cases where adding parameters does not improve accuracy.

10 This is somewhat like Dropout, which prunes activations instead of pa- [5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and

rameters. R. Fergus. Exploiting linear structure within convolu-

8

tional networks for efficient evaluation. In NIPS, 2014. [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Ima-

2, 4, 5 geNet Classification with Deep Convolutional Neural

[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, Networks. In NIPS, 2012. 4

E. Tzeng, and T. Darrell. Decaf: A deep convolu- [21] Y. LeCun, B.Boser, J. Denker, D. Henderson,

tional activation feature for generic visual recognition. R. Howard, W. Hubbard, and L. Jackel. Backpropaga-

arXiv:1310.1531, 2013. 8 tion applied to handwritten zip code recognition. Neu-

ral Computation, 1989. 2

[7] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng,

P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. [22] M. Lin, Q. Chen, and S. Yan. Network in network.

Zitnick, and G. Zweig. From captions to visual con- arXiv:1312.4400, 2013. 2, 3, 6

cepts and back. In CVPR, 2015. 8 [23] T. Ludermir, A. Yamazaki, and C. Zanchettin. An opti-

mization methodology for neural network weights and

[8] R. B. Girshick, F. N. Iandola, T. Darrell, and J. Ma-

architectures. IEEE Trans. Neural Networks, 2006. 2

lik. Deformable part models are convolutional neural

networks. In CVPR, 2015. 8 [24] V. Nair and G. E. Hinton. Rectified linear units improve

restricted boltzmann machines. In ICML, 2010. 3

[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A.

Horowitz, and W. J. Dally. Eie: Efficient inference en- [25] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu,

gine on compressed deep neural network. International T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang. Going

Symposium on Computer Architecture (ISCA), 2016. 2 deeper with embedded fpga platform for convolutional

neural network. In ACM International Symposium on

[10] S. Han, H. Mao, and W. Dally. Deep compression: FPGA, 2016. 1

Compressing DNNs with pruning, trained quantization

[26] K. Simonyan and A. Zisserman. Very deep convo-

and huffman coding. arxiv:1510.00149v3, 2015. 2, 4,

lutional networks for large-scale image recognition.

5, 7

arXiv:1409.1556, 2014. 2, 6

[11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both [27] J. Snoek, H. Larochelle, and R. Adams. Practical

weights and connections for efficient neural networks. bayesian optimization of machine learning algorithms.

In NIPS, 2015. 2, 4, 5 In NIPS, 2012. 2

[12] K. He and J. Sun. Convolutional neural networks at [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever,

constrained time cost. In CVPR, 2015. 3 and R. Salakhutdinov. Dropout: a simple way to pre-

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep resid- vent neural networks from overfitting. JMLR, 2014. 3

ual learning for image recognition. arXiv:1512.03385, [29] R. K. Srivastava, K. Greff, and J. Schmidhuber. High-

2015. 2, 6 way networks. In ICML Deep Learning Workshop,

[14] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep 2015. 2

into rectifiers: Surpassing human-level performance on [30] K. Stanley and R. Miikkulainen. Evolving neural net-

imagenet classification. In ICCV, 2015. 2 works through augmenting topologies. Neurocomput-

ing, 2002. 2

[15] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and

K. Keutzer. FireCaffe: near-linear acceleration of [31] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4,

deep neural network training on compute clusters. inception-resnet and the impact of residual connections

arxiv:1511.00175, 2015. 1, 3 on learning. arXiv:1602.07261, 2016. 2

[16] F. N. Iandola, M. W. Moskewicz, S. Karayev, R. B. [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,

Girshick, T. Darrell, and K. Keutzer. DenseNet: D. Anguelov, D. Erhan, V. Vanhoucke, and

Implementing efficient convnet descriptor pyramids. A. Rabinovich. Going deeper with convolutions.

arXiv:1404.1869, 2014. 8 arXiv:1409.4842, 2014. 2, 6

[33] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and

[17] F. N. Iandola, A. Shen, P. Gao, and K. Keutzer. DeepL-

Z. Wojna. Rethinking the inception architecture for

ogo: Hitting logo recognition with the deep neural net-

computer vision. arXiv:1512.00567, 2015. 2

work hammer. arXiv:1510.02131, 2015. 8

[34] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. De-

[18] S. Ioffe and C. Szegedy. Batch normalization: Accel- formable part descriptors for fine-grained recognition

erating deep network training by reducing internal co- and attribute prediction. In ICCV, 2013. 8

variate shift. JMLR, 2015. 2, 4

[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long,

R. Girshick, S. Guadarrama, and T. Darrell. Caffe:

Convolutional architecture for fast feature embedding.

In ACM Multimedia, 2014. 4

- 1-s2.0-S0266353806002788-mainUploaded byfficici
- On the Study of the EthernetUploaded bycarrerare
- Tutorial 4Uploaded byFaten Clubistia
- Turban Online Tutorial 2Uploaded byMarta Kharja
- neural networkUploaded byAB Anshu Bhardwaj
- Neural Network Lec1Uploaded byاحمد مظفر
- Deep Learning Technique for Music Generation - A SurveyUploaded bySatoshi Suzuki
- Artificial Intelligence in UAVs, Airplanes & AutopilotsUploaded byValupa
- 10.1109@ijcnn.1993.713922Uploaded byBryan Peñaloza
- Cavaco-090519Uploaded byMochammad Muzakhi
- 021 M03 IvanovaUploaded bymarusya_smokova
- Thesis IeeeUploaded byneethurj9
- scimakelatex.29021.George+B.+Silvestri.Arnaldo+D.+BresciaUploaded bymassimoriserbo
- 1112.1473Uploaded byOpeyemi Dada
- Sc Industrial ApplicationsUploaded byNathan Malone
- Singh 2015Uploaded byKMBA
- 10. Data mining and data warehouse.pdfUploaded byMaryus Steaua
- Neural NetworksUploaded byabhinav pandey
- Intelligent approach for efficient operation of electrical distribution automation systemsUploaded byapi-3697505
- Vehicle Detection and Road Scene Segmentation using Deep LearningUploaded byabeerarshad
- PartB.pdfUploaded byMADDY
- Deep learning for monaural speech separationUploaded byTroy
- Deep Learning VideosUploaded byAdityaGarg
- IJISA-V7-N12-8Uploaded byhfarrukhn
- 002Uploaded byTester
- NNFL Exp 1Uploaded byOmkar Nagvekar
- SPSS Neural Network 16.0Uploaded byTanu Priya
- The Convergence of Budding Reckoning Soft Computing TechnologiesUploaded byEditor IJRITCC
- DownloadUploaded byjimakosjp
- Automatic Signature Verification the State of the Arte 1993Uploaded bycasesilva

- tesorbox1506.04878v3Uploaded bysharkie
- adam.pdfUploaded byWander Scheidegger
- AnalysisOfDeepLearning1605.07678v2Uploaded bysharkie
- DP.pdfUploaded bysharkie
- Distributed DL1602.06709v1Uploaded bysharkie
- SocherBauerManningNg_ACL2013.pdfUploaded bySr. RZ
- 1502.01852v1Uploaded bysmisat
- NervanaFastConv1509.09308v2.pdfUploaded bysharkie
- An overview of gradient descent optimization algorithms.pdfUploaded bysharkie
- Reducing Branch Divergence in GPU ProgramsUploaded bysharkie
- Deep Tracking Seeing Beyond Seeing Using Recurrent Neural Networks 2016AAAI_ondruska.pdfUploaded bysharkie
- Accurate Scale Estimation for Robust Visual TrackingUploaded bysharkie
- Masi Pose-Aware Face Recognition CVPR 2016 PaperUploaded bysharkie
- Generalized Haar Filter Based Deep Networks for Real-Time Object Detecti...Uploaded bysharkie
- NervanaFastConv1509.09308v2.pdfUploaded bysharkie

- Stabality TheoryUploaded byArun Singh
- Carpentry LessonUploaded bymaria nia feruelo
- GRP IN AS 2885.pdfUploaded byshafeeqm3086
- Adaptive ReuseUploaded byayeez28
- RN171_UserGuideUploaded bylucasmqs
- Customizing Outlook 2000Uploaded byanand98765
- acnp-class-6-140725151300-phpapp01Uploaded byben911t
- Instalacion job sheduller sosUploaded byosgosa
- UAP Corporate Profile 2010Uploaded byBrian Griffin
- The Damage, Cause of Takoma Bridge FailureUploaded byeliasjamhour
- EPIC - Project VisionUploaded byEuropean Network of Living Labs
- 2426_cottodestecatalogokerlite2013Uploaded byZlatko Krsic
- ATJB QuizUploaded byminh
- Frangible RoofUploaded byamevaluaciones
- Cnp-pla200 User GuideUploaded byTamás Nagy
- MC68HC705KJ1 Data sheetUploaded byMicrotech Solutions
- 15_AttackDetectionUploaded byBerrezeg Mahieddine
- Architecture June 2015Uploaded byArtdata
- EVOLVING CORE NETWORKS Motorola Softswitch SolutionUploaded byShashank S. Prasad
- Accessibility for the DisabledUploaded byIsmail Ibrahim
- Analysis of the Attack Surface of Microsoft Office From User Perspective FinalUploaded byV Xange
- Page-CUploaded bybogner22
- ccna3-lec1-9.pdfUploaded byDe2791Vi
- Oracle IDM InstallationUploaded byGourav Chouhan
- Design of Diagonal Strap BracingUploaded bysandeepsharmafj
- ZEC Thermoplastic HosesUploaded byCamilo Andres Noy
- jQueryNotes.pdfUploaded byLuis Poma Sarzuri
- Tds - Stucco Antica - English - Issued.18.11.2002Uploaded byB
- Brocade Hitachi Data Systems QrgUploaded byJhovic Zuñiga
- 3 CataloguesUploaded byNaveenKrishnaS