across translated receptive fields. This weight-tying serves both as a constraint on the translational structure of image statistics, and as an effective technique to reduce the number of learnable parameters (see Figure 1). In essence, translational equivariance has been baked into the architecture of existing CNN models. We do the same for rotation and refer to it as hard-baking.

The current widely accepted practice to cope with rotation is to train with aggressive data augmentation [16]. This certainly improves generalization, but is not exact, fails to capture local equivariances, and does not ensure equivariance at every layer within a network. How to maintain the richness of local rotation information is what we present in this paper. Another disadvantage of data augmentation is that it leads to the so-called black-box problem, where there is a lack of feature map interpretability. Indeed, close inspection of first-layer weights in a CNN reveals that many of them are rotated, scaled, and translated copies of one another [34]. Why waste computation learning all of these redundant weights?

In this paper, we present Harmonic Networks, or H-Nets. They design patch-wise 360°-rotational equivariance into deep image representations, by constraining the filters to the family of circular harmonics. The circular harmonics are steerable filters [7], which means that we can represent all rotated versions of a filter using just a finite, linear combination of steering bases. This overcomes the issue of learning multiple filter copies in CNNs, guarantees rotational equivariance, and produces feature maps that transform predictably under input rotation.

2. Related Work

Multiple existing approaches seek to encode rotational equivariance into CNNs. Many of these follow a broad approach of introducing filter or feature map copies at different rotations. None has dominated as standard practice.

Steerable filters. At the root of H-Nets lies the property of filter steerability [7]. Filters exhibiting steerability can be constructed at any rotation as a finite, linear combination of base filters. This removes the need to learn multiple filters at different rotations, and has the bonus of constant memory requirements. As such, H-Nets could be thought of as using an infinite bank of rotated filter copies. A work which combines steerable filters with learning is [23]. They build shallow features from steerable filters, which are fed into a kernel SVM for object detection and rigid pose regression. H-Nets use the same filters with an added rotation offset term, so that filters in different layers can have orientation-selectivity relative to one another.

Hard-baked transformations in CNNs. While H-Nets hard-bake patch-wise 360°-rotation into the feature representation, numerous related works have encoded equivariance to discrete rotations. The following works can be grouped into those which encode global equivariance versus patch-wise equivariance, and those which rotate filters versus feature maps. [3] introduce equivariance to 90°-rotations and dihedral flips in CNNs by copying the transformed filters at different rotation-flip combinations. More recently they generalized this theory to all group-structured transformations in [4], but they only demonstrated applications on finite groups; an extension to continuous transformations would require a treatment of anti-aliasing and bandlimiting. [24] use a larger number of rotations for texture classification, and [26] also use many rotated handcrafted filter copies, opting not to learn the filters. To achieve equivariance to a greater number of rotations, these methods would need an infinite amount of computation; H-Nets achieve equivariance to all rotations, but with finite computation. [6] feed in multiple rotated copies of the CNN input and fuse the output predictions. [17] do the same for a broader class of global image transformations, and propose a novel per-pixel pooling technique for output fusion. As discussed, these techniques lead to global equivariances only and do not produce interpretable feature maps. [5] go one step further and copy each feature map at four 90°-rotations. They propose 4 different equivariance-preserving feature map transformations. Their CNN is similar to [3] in terms of what is being computed, but rotating feature maps instead of filters. A downside of this is that all inputs and feature maps have to be square; whereas we can use any sized input.

Learning generalized transformations. Others have tried to learn the transformations directly from the data. While this is an appealing idea, as we have said, for certain transformations it makes more sense to hard-bake these in for interpretability and reliability. [25] construct a higher-order Boltzmann machine, which learns tuples of transformed linear filters in input-output pairs. Although powerful, this has only been shown to work on shallow architectures. [9] introduced capsules, units of neurons designed to mimic the action of cortical columns. Capsules are designed to be invariant to complicated transformations of the input. Their outputs are merged at the deepest layer, and so are only invariant to global transformation. [22] present a method to regress equivariant feature detectors using an objective which penalizes representations that lie far from the equivariant manifold. Again, this only encourages global equivariance, although this work could be adapted to encourage equivariance at every layer of a deep pipeline.

3. Problem analysis

Many computer vision systems strive to be view-independent, such as object recognition, which is invariant to affine transformations, or boundary detection, which is equivariant to non-rigid deformations. H-Nets hard-bake 360°-rotation equivariance into their feature representation, by constraining the convolutional filters of a CNN to be from the family of circular harmonics. Below, we outline the formal definition of equivariance (Section 3.1), how the circular harmonics exhibit rotational equivariance (Section 3.2), and some properties of the circular harmonics, which we must heed for successful integration into the CNN framework (Section 3.3).

Continuous domain feature maps. In deep learning we use
Figure 2. Real and imaginary parts of the complex Gaussian filter $W_m(r,\phi;e^{-r^2},0)=e^{-r^2}e^{im\phi}$, for some rotation orders. As a simple example, we have set $R(r)=e^{-r^2}$ and $\beta=0$, but in general we learn these quantities. Cross-correlation of a feature map of rotation order $n$ with one of these filters of rotation order $m$ results in a feature map of rotation order $m+n$. Note the negative rotation order filters have flipped imaginary parts compared to the positive orders.
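The filters in this caption are easy to reproduce numerically. Below is a minimal NumPy sketch (the grid size, the Gaussian radial profile $e^{-r^2}$, and $\beta=0$ are the caption's illustrative choices; in H-Nets $R(r)$ and $\beta$ are learned), which also verifies the caption's remark that negative rotation orders flip the imaginary part:

```python
import numpy as np

def circular_harmonic(m, n=64, sigma=0.5):
    """Sample W_m(r, phi) = exp(-r^2 / (2 sigma^2)) * exp(i * m * phi)
    on an n x n grid centered at the origin (illustrative radial profile)."""
    c = (n - 1) / 2.0
    y, x = np.mgrid[0:n, 0:n]
    xc, yc = (x - c) / c, (y - c) / c          # coordinates in [-1, 1]
    r = np.hypot(xc, yc)
    phi = np.arctan2(yc, xc)
    return np.exp(-r**2 / (2 * sigma**2)) * np.exp(1j * m * phi)

W_pos = circular_harmonic(+1)
W_neg = circular_harmonic(-1)
# Negative rotation orders have flipped imaginary parts (complex conjugates),
# exactly as noted in the caption; the m = 0 filter is purely real.
assert np.allclose(W_neg, np.conj(W_pos))
```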
Method                        Test error (%)    # params
P4CNN rotation pooling [3]    3.21              25k
P4CNN [3]                     2.28              25k
H-Net (Ours)                  1.69              33k
[Figure 10 plot: normalized test accuracy (0.88-0.98) vs. training set size (2000-12000) for CNN, CNN + data aug., and H-Net (ours).]

Figure 10. Data ablation study. On the rotated MNIST dataset, we experiment with test accuracy for varying sizes of the training set. We normalize the maximum test accuracy of each method to 1, for direct comparison of the falloff with training size. Clearly H-Nets are more data-efficient than regular CNNs, which need more data to discover rotational equivariance unaided.

Figure 12. View best in color. Orientated feature maps for the H-DSN. The color wheel shows orientation coding. Note that between layers boundary orientations are colored differently because each feature has a different $\beta$. This visualization demonstrates the consistency of orientation within a feature map and across multiple layers. The images are taken from layers 2, 4, 6, 8, and 10 in a clockwise order from largest to smallest.
References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898-916, May 2011.
[2] G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 4380-4389, 2015.
[3] T. S. Cohen and M. Welling. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016.
[4] T. S. Cohen and M. Welling. Steerable CNNs. CoRR, abs/1612.08498, 2016.
[5] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[6] B. Fasel and D. Gatica-Perez. Rotation-invariant neoperceptron. In 18th International Conference on Pattern Recognition, ICPR 2006, pages 336-339, 2006.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891-906, 1991.
[8] Y. Ganin and V. S. Lempitsky. N4-fields: Neural network nearest neighbor fields for image transforms. CoRR, abs/1406.6558, 2014.
[9] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning, ICANN 2011, pages 44-51, 2011.
[10] J. Hwang and T. Liu. Pixel-wise deep learning for contour detection. CoRR, abs/1504.01989, 2015.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 448-456, 2015.
[12] J. Jacobsen, J. C. van Gemert, Z. Lou, and A. W. M. Smeulders. Structured receptive fields in CNNs. CoRR, abs/1605.02971, 2016.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] J. J. Kivinen, C. K. I. Williams, and N. Heess. Visual boundary prediction: A deep neural prediction network and quality dissection. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, pages 512-521, 2014.
[15] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. arXiv preprint arXiv:1511.07386, 2015.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114, 2012.
[17] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. TI-POOLING: Transformation-invariant pooling for feature learning in convolutional neural networks. CoRR, abs/1604.06318, 2016.
[18] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, ICML 2007, pages 473-480, 2007.
[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, pages 396-404, 1989.
[20] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, 2015.
[21] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 991-999, 2015.
[22] K. Lenc and A. Vedaldi. Learning covariant feature detectors. CoRR, abs/1605.01224, 2016.
[23] K. Liu, Q. Wang, W. Driever, and O. Ronneberger. 2D/3D rotation-invariant detection using equivariant filters and kernel weighted mapping. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012, pages 917-924, 2012.
[24] D. Marcos, M. Volpi, and D. Tuia. Learning rotation invariant convolutional filters for texture classification. arXiv preprint arXiv:1604.06720, 2016.
[25] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473-1492, 2010.
[26] E. Oyallon and S. Mallat. Deep roto-translation scattering for object classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 2865-2873, 2015.
[27] U. Schmidt and S. Roth. Learning rotation-aware features: From invariant priors to equivariant descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012, pages 2050-2057, 2012.
[28] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 3982-3991, 2015.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[30] S. Soatto. Actionable information in vision. In IEEE 12th International Conference on Computer Vision, ICCV 2009, pages 2138-2145, 2009.
[31] K. Sohn and H. Lee. Learning invariant representations with local transformations. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.
[32] M. Tygert, J. Bruna, S. Chintala, Y. LeCun, S. Piantino, and A. Szlam. A mathematical motivation for complex-valued convolutional networks. Neural Computation, 28(5):815-825, 2016.
[33] S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE International Conference on Computer Vision, ICCV 2015, pages 1395-1403, 2015.
[34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision, ECCV 2014, pages 818-833, 2014.
Supplementary Material
Abstract
We include some proofs and derivations of the rotational equivariance properties of the circular harmonics, along with a
demonstration of how we calculate the number of parameters for various network architectures.
A. Equivariance properties
In Section 3.2 we mentioned that cross-correlation with the circular harmonics is a 360°-rotation equivariant feature transform. Here we provide the proof, along with some of the properties mentioned in Arithmetic and Equivariance Condition.
A.1. Equivariance of the Circular Harmonics
We are interested in proving that there exists a filter $W_m$, such that cross-correlation of $F$ with $W_m$ yields a rotationally equivariant feature map. The proof requires us to introduce two different kinds of transformation: rotation $\mathcal{R}^\theta$ and translation $\mathcal{T}_t$. To simplify the math, we use vector notation, so the spatial domain of the filter/image is $\mathbb{R}^2$. We write the filter as $W_m(\mathbf{x})$ and the image as $F(\mathbf{x})$ for $\mathbf{x}\in\mathbb{R}^2$. We define the transformation operators $\mathcal{R}^\theta$ and $\mathcal{T}_t$, such that $\mathcal{R}^\theta F = F(\mathbf{R}_{-\theta}\mathbf{x})$ and $\mathcal{T}_t F = F(\mathbf{x}-t)$, where $\mathbf{R}_\theta$ is a 2D rotation matrix for a counter-clockwise rotation by $\theta$. We introduce the rotational cross-correlation $\star$. This is defined as
$$[W_m \star F] = \int_0^{2\pi}\!\!\int_0^{\infty} W_m(r\mathbf{R}_\phi\hat{\mathbf{x}})\,F(r\mathbf{R}_\phi\hat{\mathbf{x}})\,\mathrm{d}r\,\mathrm{d}\phi, \tag{11}$$
where we have used the decomposition $\mathbf{x}=r\hat{\mathbf{x}}$, with $r=\|\mathbf{x}\|_2\geq 0$ and $\hat{\mathbf{x}}=\mathbf{x}/r$. The rotational cross-correlation is performed about the origin of the image. If we rotate the image, then we have
$$[W_m \star \mathcal{R}^\theta F] = \int_0^{2\pi}\!\!\int_0^{\infty} W_m(r\mathbf{R}_\phi\hat{\mathbf{x}})\,F(r\mathbf{R}_{-\theta}\mathbf{R}_\phi\hat{\mathbf{x}})\,\mathrm{d}r\,\mathrm{d}\phi \tag{12}$$
$$= \int_0^{2\pi}\!\!\int_0^{\infty} W_m(r\mathbf{R}_\phi\hat{\mathbf{x}})\,F(r\mathbf{R}_{\phi-\theta}\hat{\mathbf{x}})\,\mathrm{d}r\,\mathrm{d}\phi \tag{13}$$
$$= \int_0^{2\pi}\!\!\int_0^{\infty} W_m(r\mathbf{R}_{\phi'+\theta}\hat{\mathbf{x}})\,F(r\mathbf{R}_{\phi'}\hat{\mathbf{x}})\,\mathrm{d}r\,\mathrm{d}\phi', \tag{14}$$
where in the last line we substituted $\phi' = \phi - \theta$.
And so rotational cross-correlation is rotationally equivariant about the origin of rotation. In the next part, we build up to a result
needed for proving the chained cross-correlation result.
Cross-correlation about t. To perform the rotational cross-correlation about another point $t$, we first have to translate the image such that $t$ is the new origin, so $F_t(\mathbf{x})=F(\mathbf{x}-t)=[\mathcal{T}_t F](\mathbf{x})$, then perform the rotational cross-correlation; the response is $[W_m \star \mathcal{T}_t F]$.

Cross-correlation about t with rotated F about origin. Say we wish to perform the rotational cross-correlation about a point $t$, when the image has been rotated about the origin. Denoting $F^\theta = \mathcal{R}^\theta F$, the response is
$$[W_m \star \mathcal{T}_t \mathcal{R}^\theta F] = e^{im\theta}\,[W_m \star \mathcal{T}_{\mathbf{R}_{-\theta}t} F].$$
Thus we see that cross-correlation of the rotated signal $F^\theta$ with the circular harmonic filter $W_m = R(r)e^{i(m\phi+\beta)}$ is equal to the response at zero rotation, $[W_m \star \mathcal{T}_{\mathbf{R}_{-\theta}t}F]$, evaluated at the back-rotated point and multiplied by a complex phase shift $e^{im\theta}$. In the notation of the paper, we denote this multiplication by $e^{im\theta}$. Thus cross-correlation with $W_m$ yields a rotationally equivariant feature mapping.
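The equivariance result above can be checked numerically. The sketch below samples $W_m$ and a random image on a polar grid, where rotation about the origin is an exact circular shift along the angular axis; the grid sizes and the random test image are hypothetical choices, and the integral of Eq. (11) is discretized as a plain sum:

```python
import numpy as np

n_r, n_phi = 32, 128
r = np.linspace(0.1, 1.0, n_r)[:, None]                          # radial samples
phi = np.linspace(0, 2 * np.pi, n_phi, endpoint=False)[None, :]  # angular samples

m = 2
W_m = np.exp(-r**2) * np.exp(1j * m * phi)   # circular harmonic, R(r) = e^{-r^2}, beta = 0
rng = np.random.default_rng(0)
F = rng.normal(size=(n_r, n_phi))            # arbitrary image in polar coordinates

def rot_xcorr(W, F):
    # Discretized rotational cross-correlation about the origin, Eq. (11).
    return np.sum(W * F)

steps = 10                                   # rotate by an exact multiple of the grid step
theta = steps * 2 * np.pi / n_phi
F_rot = np.roll(F, steps, axis=1)            # (R^theta F)(r, phi) = F(r, phi - theta)

# [W_m * R^theta F] = e^{i m theta} [W_m * F], up to floating point error.
assert np.allclose(rot_xcorr(W_m, F_rot),
                   np.exp(1j * m * theta) * rot_xcorr(W_m, F))
```

Because the rotation angle is a multiple of the angular grid spacing, the shift is exact and no interpolation error enters the check.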
A.2. Properties
A.2.1 Chained cross-correlation
We claimed in Arithmetic and Equivariance Condition that the rotation order of a feature map resulting from chained cross-correlations is equal to the sum of the rotation orders of the filters in the chain. We prove this for a chain of two filters, and the rest follows by induction. Consider taking a $\theta$-rotated image $\mathcal{R}^\theta F$ about the origin, then cross-correlating it with a filter $W_m$ at every point in the image plane $t\in\mathbb{R}^2$, followed by cross-correlation with $W_n$ at a point $s\in\mathbb{R}^2$. We already know that the response to the rotation is $[W_m \star \mathcal{T}_t \mathcal{R}^\theta F] = e^{im\theta}[W_m \star \mathcal{T}_{\mathbf{R}_{-\theta}t}F]$, for all rotations of the input and all points $t$ in the response plane, so we can write the chained cross-correlation as
$$\big[W_n \star \mathcal{T}_s \big[W_m \star \mathcal{T}_{(\cdot)} \mathcal{R}^\theta F\big]\big] = e^{i(m+n)\theta}\,\big[W_n \star \mathcal{T}_{\mathbf{R}_{-\theta}s} \big[W_m \star \mathcal{T}_{(\cdot)} F\big]\big].$$
Thus we see that the chained cross-correlation results in a summation of the rotation orders of the individual filters $W_m$ and $W_n$. Setting $s=0$, such that we evaluate the cross-correlation at the center of rotation, we regain an equation similar to (18).
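The order arithmetic can also be illustrated numerically. The sketch below demonstrates the single-step statement (not the full chained proof): correlating an order-$n$ feature map, which transforms as $G \mapsto e^{in\theta}\mathcal{R}^\theta G$, with an order-$m$ filter yields a response of order $m+n$. Grid sizes and the random feature map are hypothetical choices:

```python
import numpy as np

n_r, n_phi = 32, 128
r = np.linspace(0.1, 1.0, n_r)[:, None]
phi = np.linspace(0, 2 * np.pi, n_phi, endpoint=False)[None, :]

m, n = 2, 3
W_m = np.exp(-r**2) * np.exp(1j * m * phi)   # filter of rotation order m
rng = np.random.default_rng(1)
# G plays the role of a feature map of rotation order n: under an input
# rotation theta it transforms as G -> e^{i n theta} * (rotated G).
G = rng.normal(size=(n_r, n_phi)) + 1j * rng.normal(size=(n_r, n_phi))

def rot_xcorr(W, F):
    return np.sum(W * F)                     # discretized rotational cross-correlation

steps = 7
theta = steps * 2 * np.pi / n_phi
G_rot = np.exp(1j * n * theta) * np.roll(G, steps, axis=1)

# The response picks up the summed phase e^{i (m + n) theta}.
assert np.allclose(rot_xcorr(W_m, G_rot),
                   np.exp(1j * (m + n) * theta) * rot_xcorr(W_m, G))
```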
since $g$ only acts on magnitudes. Since for fixed $F$ the output is a function of $m$ and $\theta$ only, the point-wise magnitude-acting nonlinearity preserves rotational equivariance.

which for fixed $F_1, F_2, \beta_1, \beta_2$ is a function of $m$ and $\theta$ only, and so is also rotationally equivariant with order $m$.
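The magnitude-acting nonlinearity claim is easy to verify directly: a point-wise map of the form $g(|z|)\,e^{i\arg z}$ commutes with the phase shift $e^{im\theta}$, so rotation order is preserved. A minimal sketch (the complex ReLU `mag_relu` and its bias are illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

def mag_relu(z, b=-0.5):
    """Point-wise nonlinearity acting only on magnitudes:
    g(|z|) * z / |z| with g(t) = max(0, t + b)."""
    mag = np.abs(z)
    return np.maximum(0.0, mag + b) * z / np.where(mag == 0, 1.0, mag)

rng = np.random.default_rng(2)
z = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))   # order-m feature map

m, theta = 2, 0.7
# Applying the nonlinearity before or after the rotation-induced phase shift
# gives the same result, so equivariance of order m is preserved.
assert np.allclose(mag_relu(np.exp(1j * m * theta) * z),
                   np.exp(1j * m * theta) * mag_relu(z))
```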
B. Number of parameters
Here we list a breakdown of how we computed the number of parameters for the various network architectures in the experiments section. The network architectures used are shown in Figure 13. Red boxes are cross-correlations, blue boxes are pooling (average for H-Nets, max for regular CNNs), and green boxes are 1×1 cross-correlations.
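As a baseline for the counts in this section, the parameter count of a standard convolutional layer can be computed as below (a generic sketch for the regular-CNN baselines; the H-Net layers instead parameterize complex radial profiles, which this helper does not model):

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameters of a standard k x k convolutional layer
    with c_in input channels and c_out output channels."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# e.g. a 3x3 layer mapping 20 -> 20 channels:
assert conv_params(3, 20, 20) == 3620      # 3*3*20*20 weights + 20 biases
```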