
Deep learning and weak supervision for image classification

Matthieu Cord
Joint work with Thibaut Durand, Nicolas Thome

Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d'Informatique de Paris 6 (LIP6) - MLIA Team
UMR CNRS

June 09, 2016

1/35
Outline

Context: Visual classification

1. MANTRA: Latent variable model to boost classification performance
2. WELDON: extension to Deep CNN

2/35
Motivations

Working on datasets with complex scenes (large and cluttered background), objects that are not centered, variable object size, ...

Figure: example images from VOC07/12, MIT67, 15 Scene, COCO, VOC12 Action

Select relevant regions → better prediction

ImageNet: centered objects
- Efficient transfer: needs bounding boxes [Oquab, CVPR14]

Full annotations are expensive → training with weak supervision

3/35
Motivations
How to learn without bounding boxes?
Multiple-Instance Learning / latent variables for missing information [Felzenszwalb, PAMI10]
Latent SVM and extensions => MANTRA
How to learn deep without bounding boxes?
Learning invariance with input image transformations
- Spatial Transformer Networks [Jaderberg, NIPS15]
Attention models: to select relevant regions
- Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
Parts model
- Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
Deep MIL
- Is object localization for free? [Oquab, CVPR15]
- Deep extension of MANTRA: WELDON

4/35
Notations

Variable   Notation   Space           Train        Test         Example
Input      $x$        $\mathcal{X}$   observed     observed     image
Output     $y$        $\mathcal{Y}$   observed     unobserved   label
Latent     $h$        $\mathcal{H}$   unobserved   unobserved   region

Model missing information with latent variables h
Most popular approach in Computer Vision: Latent SVM
[Felzenszwalb, PAMI10] [Yu, ICML09]
5/35
Latent Structural SVM [Yu, ICML09]

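Prediction function:

$$(\hat{y}, \hat{h}) = \arg\max_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \langle w, \Psi(x, y, h) \rangle \qquad (1)$$

- $\Psi(x, y, h)$: joint feature map
- Joint inference in the $(\mathcal{Y} \times \mathcal{H})$ space
Training: a set of $N$ labeled training pairs $(x_i, y_i)$
- Objective function: upper bound of $\Delta(y_i, \hat{y}_i)$

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \underbrace{\left[ \max_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \left[ \Delta(y_i, y) + \langle w, \Psi(x_i, y, h) \rangle \right] - \max_{h \in \mathcal{H}} \langle w, \Psi(x_i, y_i, h) \rangle \right]}_{\geq \Delta(y_i, \hat{y}_i)}$$

- Difference of convex functions, solved with CCCP
- LAI (loss-augmented inference): $\max_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \left[ \Delta(y_i, y) + \langle w, \Psi(x_i, y, h) \rangle \right]$
- Challenge exacerbated in the latent case, $(\mathcal{Y} \times \mathcal{H})$ space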
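
The joint inference and the loss term above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes the scores $\langle w, \Psi(x, y, h) \rangle$ are already available as a small array `scores[y, h]` (a hypothetical name), uses a 0/1 loss for $\Delta$, and solves both maximizations exhaustively.

```python
import numpy as np

def lssvm_predict(scores):
    """Joint inference: argmax over the whole (Y x H) grid.

    scores[y, h] stands for <w, Psi(x, y, h)> (assumed precomputed).
    """
    y_hat, h_hat = np.unravel_index(np.argmax(scores), scores.shape)
    return int(y_hat), int(h_hat)

def lssvm_loss(scores, y_true):
    """max_{y,h} [Delta(y_i, y) + score] - max_h score[y_i, h], with 0/1 Delta."""
    n_classes = scores.shape[0]
    delta = (np.arange(n_classes) != y_true).astype(float)  # 0/1 loss per class
    augmented = scores + delta[:, None]                     # loss-augmented scores
    return float(augmented.max() - scores[y_true].max())

# Toy example: 2 classes, 3 regions.
scores = np.array([[0.2, 1.5, -0.3],
                   [0.9, 0.1,  0.4]])
print(lssvm_predict(scores))   # best (class, region) pair
print(lssvm_loss(scores, 0))   # loss if the ground truth is class 0
print(lssvm_loss(scores, 1))   # loss if the ground truth is class 1
```

With ground truth 1 the prediction is wrong (class 0 wins), and the loss value is at least $\Delta = 1$, which is the upper-bound property stated on the slide.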

6/35
MANTRA: Minimum Maximum Latent Structural SVM

Classifying only with the max-scoring latent value is not always relevant

MANTRA model:

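Pair of latent variables $(h^+_{i,y}, h^-_{i,y})$
- max scoring latent value: $h^+_{i,y} = \arg\max_{h \in \mathcal{H}} \langle w, \Psi(x_i, y, h) \rangle$
- min scoring latent value: $h^-_{i,y} = \arg\min_{h \in \mathcal{H}} \langle w, \Psi(x_i, y, h) \rangle$
New scoring function:

$$D_w(x_i, y) = \langle w, \Psi(x_i, y, h^+_{i,y}) \rangle + \langle w, \Psi(x_i, y, h^-_{i,y}) \rangle \qquad (2)$$

Prediction function: find the output with maximum score

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} D_w(x_i, y) \qquad (3)$$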
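
Equations (2) and (3) reduce to a max and a min over regions once the region scores are known. The sketch below assumes a hypothetical precomputed array `scores[y, h]` holding $\langle w, \Psi(x, y, h) \rangle$; the toy values are made up to show a case where max-only scoring and max+min scoring disagree.

```python
import numpy as np

def mantra_scores(scores):
    """D_w(x, y) = max_h scores[y, h] + min_h scores[y, h], for every class y."""
    return scores.max(axis=1) + scores.min(axis=1)

def mantra_predict(scores):
    """Equation (3): the class with the highest max+min score."""
    return int(np.argmax(mantra_scores(scores)))

# Toy example: 2 classes, 3 regions (values are illustrative only).
scores = np.array([[2.0, 0.5, -0.1],   # class 0: strong support, no strong counter-evidence
                   [2.2, 0.0, -1.5]])  # class 1: higher max, but one region votes against it
D = mantra_scores(scores)
max_only_prediction = int(np.argmax(scores.max(axis=1)))
print(D, mantra_predict(scores), max_only_prediction)
```

Here max-only scoring picks class 1, while the max+min score picks class 0 because the negative region lowers $D_w$ for class 1.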
7/35
MANTRA: Model & Training Rationale
Intuition of the max+min prediction function
$x$: image, $h$: image region, $y$: image class
$\langle w, \Psi(x, y, h) \rangle$: score of region $h$ for class $y$
$D_w(x, y) = \langle w, \Psi(x, y, h^+_y) \rangle + \langle w, \Psi(x, y, h^-_y) \rangle$
- $h^+_y$: presence of class $y$ → large for $y_i$
- $h^-_y$: localized evidence of the absence of class $y$
  - Not too low for $y_i$ → latent space regularization
  - Low for $y \neq y_i$ → tracking negative evidence [Parizi, ICLR15]

Example (street image $x$): $D_w(x, \text{street}) = 2$, $D_w(x, \text{highway}) = 0.7$, $D_w(x, \text{coast}) = 1.5$
8/35
MANTRA: Model Training

Learning formulation
Loss function:

$$\ell_w(x_i, y_i) = \max_{y \in \mathcal{Y}} \left[ \Delta(y_i, y) + D_w(x_i, y) \right] - D_w(x_i, y_i)$$

- (Margin rescaling) upper bound of $\Delta(y_i, \hat{y})$, constraints:

$$\forall y \neq y_i, \quad \underbrace{D_w(x_i, y_i)}_{\text{score for ground truth output}} \geq \underbrace{\Delta(y_i, y)}_{\text{margin}} + \underbrace{D_w(x_i, y)}_{\text{score for other output}}$$

Non-convex optimization problem

$$\min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \ell_w(x_i, y_i) \qquad (4)$$

Solver: non-convex one-slack cutting plane [Do, JMLR12]
- Fast convergence
- Direct optimization ≠ CCCP for LSSVM
- Still needs to solve LAI: $\max_{y} \left[ \Delta(y_i, y) + D_w(x_i, y) \right]$
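
The per-example loss and its loss-augmented inference (LAI) step can be sketched as follows. This is an illustration under stated assumptions, not the cutting-plane solver from the slide: the region scores `scores[y, h]` are a hypothetical precomputed array, $\Delta$ is taken to be the 0/1 loss, and LAI is solved exhaustively over classes.

```python
import numpy as np

def mantra_loss(scores, y_true):
    """l_w(x_i, y_i) = max_y [Delta(y_i, y) + D_w(x_i, y)] - D_w(x_i, y_i).

    Returns the loss value and the class selected by loss-augmented inference.
    """
    D = scores.max(axis=1) + scores.min(axis=1)          # max+min score per class
    delta = (np.arange(len(D)) != y_true).astype(float)  # 0/1 loss (assumption)
    y_lai = int(np.argmax(delta + D))                    # exhaustive LAI
    return float(delta[y_lai] + D[y_lai] - D[y_true]), y_lai

scores = np.array([[2.0, 0.5, -0.1],
                   [2.2, 0.0, -1.5]])
loss_correct, _ = mantra_loss(scores, y_true=0)    # margin constraint satisfied
loss_wrong, y_lai = mantra_loss(scores, y_true=1)  # margin constraint violated
print(loss_correct, loss_wrong, y_lai)
```

When the ground-truth score exceeds every other class score by at least the margin, the loss is zero; otherwise it grows with the size of the violation.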

9/35
MANTRA: Optimization

MANTRA Instantiation: define $(x, y, h)$, $\Psi(x, y, h)$, $\Delta(y_i, y)$

Instantiations: binary & multi-class classification, AP ranking

                    Binary                 Multi-class                                                   AP Ranking
$x$                 bag (set of regions)   bag (set of regions)                                          set of bags (of regions)
$y$                 $\pm 1$                $\{1, \ldots, K\}$                                            ranking matrix
$h$                 instance (region)      region                                                        regions
$\Psi(x, y, h)$     $y \cdot \Phi(x, h)$   $[\mathbb{1}(y=1)\Phi(x, h), \ldots, \mathbb{1}(y=K)\Phi(x, h)]$   joint latent ranking feature map
$\Delta(y_i, y)$    0/1 loss               0/1 loss                                                      AP loss
LAI                 exhaustive             exhaustive                                                    exact and efficient

Solve inference $\max_y D_w(x_i, y)$ & LAI $\max_y \left[ \Delta(y_i, y) + D_w(x_i, y) \right]$
- Exhaustive for binary/multi-class classification
- Exact and efficient solutions for ranking
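
The multi-class joint feature map in the table places the region descriptor $\Phi(x, h)$ in the block of class $y$ and zeros elsewhere, so that $\langle w, \Psi(x, y, h) \rangle$ equals a per-class linear score. The sketch below checks this identity; the function name, the dimensions and the random values are illustrative assumptions, and classes are indexed from 0 rather than 1.

```python
import numpy as np

def joint_feature_map(phi, y, n_classes):
    """Psi(x, y, h): copy Phi(x, h) into block y of an otherwise zero vector."""
    d = phi.shape[0]
    psi = np.zeros(n_classes * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

rng = np.random.default_rng(0)
K, d = 3, 4                      # number of classes, descriptor size (made up)
w = rng.normal(size=K * d)       # stacked weight vector
phi = rng.normal(size=d)         # descriptor of one region

# <w, Psi(x, y, h)> matches the score of the y-th block of w on Phi(x, h).
joint_scores = np.array([w @ joint_feature_map(phi, y, K) for y in range(K)])
block_scores = w.reshape(K, d) @ phi
print(np.allclose(joint_scores, block_scores))
```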

10/35
WELDON
Weakly supErvised Learning of Deep cOnvolutional Nets
MANTRA extension for training deep CNNs
Learning $\Psi(x, y, h)$: end-to-end learning of deep CNNs with structured prediction and latent variables
- Incorporating multiple positive & negative evidence
- Training deep CNNs with structured loss

11/35
Standard deep CNN architecture: VGG16

Simonyan et al. Very deep convolutional networks for large-scale image recognition.
ICLR 2015
12/35
MANTRA adaptation for deep CNN
Problem
Fixed-size image as input

Adapt architecture to weakly supervised learning

1. Fully connected layers → convolution layers
- sliding window approach
2. Spatial aggregation
- Perform object localization prediction
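
Step 1 relies on the fact that a fully connected layer trained on a fixed-size input can be reshaped into a convolution kernel and slid over a larger feature map, which yields one class-score vector per window. The sketch below verifies this equivalence with NumPy on random data; the shapes and names (`fc_weights`, `fc_as_conv`) are illustrative assumptions and not taken from the WELDON code.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, bias, k):
    """Slide a fully connected layer over an (H, W, d) feature map.

    fc_weights has shape (C, k*k*d): it was defined for flattened k x k x d inputs.
    Returns score maps of shape (H-k+1, W-k+1, C), one score vector per window.
    """
    H, W, d = feature_map.shape
    C = fc_weights.shape[0]
    kernels = fc_weights.reshape(C, k, k, d)   # same weights, viewed as conv kernels
    out = np.empty((H - k + 1, W - k + 1, C))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = feature_map[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2])) + bias
    return out

rng = np.random.default_rng(0)
k, d, C = 2, 3, 4
feature_map = rng.normal(size=(5, 6, d))       # larger than the k x k training size
fc_weights = rng.normal(size=(C, k * k * d))
bias = rng.normal(size=C)

score_maps = fc_as_conv(feature_map, fc_weights, bias, k)
# Reference: apply the original FC layer to one flattened window.
reference = fc_weights @ feature_map[1:3, 2:4, :].reshape(-1) + bias
print(score_maps.shape, np.allclose(score_maps[1, 2], reference))
```

The spatial aggregation of step 2 then reduces each of the C score maps to a single class score.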

13/35
WELDON: deep architecture

C : number of classes
14/35
Aggregation function

[Oquab, 2015]
Region aggregation = max
Select the highest-scoring window

Figure: original image → motorbike feature map → max → prediction

Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015
15/35
WELDON: region aggregation
Aggregation strategy:
max+min pooling (MANTRA prediction function)
k-instances
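- Single region to multiple high scoring regions:

$$\max \;\rightarrow\; \frac{1}{k} \sum_{i=1}^{k} i\text{-th} \max \qquad\qquad \min \;\rightarrow\; \frac{1}{k} \sum_{i=1}^{k} i\text{-th} \min$$

- More robust region selection [Vasconcelos CVPR15]

Figure panels: max | max + min | 3 max + 3 min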
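
The k-instance aggregation can be sketched as a pooling function over one class score map: the mean of the k highest scores plus the mean of the k lowest scores. The function name and the toy score map are assumptions for illustration; k = 1 recovers the max+min prediction function.

```python
import numpy as np

def weldon_pool(score_map, k=3):
    """Mean of the k highest region scores plus mean of the k lowest ones."""
    flat = np.sort(score_map.reshape(-1))      # ascending order
    return float(flat[-k:].mean() + flat[:k].mean())

# Toy 3x3 score map for one class (values are illustrative only).
score_map = np.array([[ 0.1, 2.0,  1.8],
                      [-1.0, 0.0,  1.0],
                      [-0.8, 0.3, -1.2]])
pooled_k1 = weldon_pool(score_map, k=1)   # max + min
pooled_k3 = weldon_pool(score_map, k=3)   # average of 3 max + average of 3 min
print(pooled_k1, pooled_k3)
```

Averaging over several regions makes the class score less dependent on a single, possibly spurious, window.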


16/35
WELDON: architecture

17/35
WELDON: learning
Objective function for multi-class task and k = 1:
$$\min_w \; R(w) + \frac{1}{N} \sum_{i=1}^{N} \ell\left(f_w(x_i), y_i^{gt}\right)$$

$$f_w(x_i) = \arg\max_{y} \left( \max_{h} L^{w}_{conv}(x_i, y, h) + \min_{h'} L^{w}_{conv}(x_i, y, h') \right)$$

How to learn the deep architecture?

Stochastic gradient descent training.
Back-propagation of the error through the selected windows.
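
Because the aggregation only reads the selected windows, its gradient with respect to the score map is sparse: only the selected positions receive an error signal. The sketch below writes the forward and backward passes of the pooling by hand in NumPy; it is an illustration of that sparsity under the k-instance formulation, not the Torch7 implementation mentioned in the experiments, and the function names are hypothetical.

```python
import numpy as np

def pool_forward(score_map, k=1):
    """Top-k mean plus bottom-k mean; also returns the selected positions."""
    flat = score_map.reshape(-1)
    order = np.argsort(flat)
    top, bottom = order[-k:], order[:k]
    value = flat[top].mean() + flat[bottom].mean()
    return float(value), (top, bottom, score_map.shape)

def pool_backward(grad_out, cache):
    """Route the incoming gradient to the selected windows only."""
    top, bottom, shape = cache
    k = len(top)
    grad = np.zeros(int(np.prod(shape)))
    np.add.at(grad, top, grad_out / k)      # each top window gets 1/k of the gradient
    np.add.at(grad, bottom, grad_out / k)   # same for each bottom window
    return grad.reshape(shape)

score_map = np.array([[ 0.1, 2.0,  1.8],
                      [-1.0, 0.0,  1.0],
                      [-0.8, 0.3, -1.2]])
value, cache = pool_forward(score_map, k=1)
grad = pool_backward(1.0, cache)
print(value)
print(grad)   # non-zero only at the max and the min positions
```

In an SGD step the sign of the incoming gradient then decides whether the scores of the selected windows go up or down.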
18/35
WELDON: learning

Class is present
Increase the score of the selected windows.

Figure: Car map

19/35
WELDON: learning

Class is absent
Decrease the score of the selected windows.

Figure: Boat map

20/35
Experiments

VGG16 pre-trained on ImageNet


Torch7 implementation

Datasets
Object recognition: Pascal VOC 2007, Pascal VOC 2012
Scene recognition: MIT67, 15 Scene
Visual recognition, where context plays an important role:
COCO, Pascal VOC 2012 Action

Figure: example images from VOC07/12, MIT67, 15 Scene, COCO, VOC12 Action


21/35
Experiments

Dataset        Train    Test     Classes   Classification
VOC07          5,000    5,000    20        multi-label
VOC12          5,700    5,800    20        multi-label
15 Scene       1,500    2,985    15        multi-class
MIT67          5,360    1,340    67        multi-class
VOC12 Action   2,000    2,000    10        multi-label
COCO           80,000   40,000   80        multi-label

22/35
Experiments

Multi-scale: 8 scales (combination with Object Bank strategy)

23/35
Object recognition

                          VOC 2007   VOC 2012
VGG16 (online code) [1]   84.5       82.8
SPP net [2]               82.4       -
Deep WSL MIL [3]          -          81.8
WELDON                    90.2       88.5
Table: mAP results on object recognition datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015
24/35
Scene recognition

                          15 Scene   MIT67
VGG16 (online code) [1]   91.2       69.9
MOP CNN [2]               -          68.9
Negative parts [3]        -          77.1
WELDON                    94.3       78.0
Table: Multi-class accuracy results on scene categorization datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation
Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015
25/35
Context datasets

                          VOC 2012 action   COCO
VGG16 (online code) [1]   67.1              59.7
Deep WSL MIL [2]          -                 62.8
Our WSL deep CNN          75.0              68.8
Table: mAP results on context datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] Oquab et al. Is object localization for free? CVPR 2015

26/35
Visual results

Aeroplane model (1.8) Bus model (-0.4)

27/35
Visual results

Motorbike model (1.1) Sofa model (-0.8)

28/35
Visual results

Sofa model (1.2) Horse model (-0.6)

29/35
Visual results (failing examples)

Buffet Restaurant kitchen

30/35
Visual results (failing examples)

Kindergarden Classroom

31/35
Analysis
Impact of the different improvements
a) max b) +k=3 c) +min d) +AP VOC07 VOC12 action
X 83.6 53.5
X X 86.3 62.6
X X 87.5 68.4
X X X 88.4 71.7
X X X 87.8 69.8
X X X X 88.9 72.6

WSL detection results on VOC 2012 Action

Method                  IoU
max (a) [Oquab, 2015]   25.6
WELDON                  30.4

32/35
Analysis
Impact of the number of regions k

k=1 k=3

33/35
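Connections to other latent variable models
Hidden CRF (HCRF) [Quattoni, PAMI07]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \left[ \log \sum_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \exp \langle w, \Psi(x_i, y, h) \rangle - \log \sum_{h \in \mathcal{H}} \exp \langle w, \Psi(x_i, y_i, h) \rangle \right]$$

Latent Structural SVM (LSSVM) [Yu, ICML09]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \left[ \max_{(y,h) \in \mathcal{Y} \times \mathcal{H}} \left\{ \Delta(y_i, y) + \langle w, \Psi(x_i, y, h) \rangle \right\} - \max_{h \in \mathcal{H}} \langle w, \Psi(x_i, y_i, h) \rangle \right]$$

Marginal Structural SVM (MSSVM) [Ping, ICML14]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \left[ \max_{y} \left\{ \Delta(y_i, y) + \log \sum_{h \in \mathcal{H}} \exp \langle w, \Psi(x_i, y, h) \rangle \right\} - \log \sum_{h \in \mathcal{H}} \exp \langle w, \Psi(x_i, y_i, h) \rangle \right]$$

WELDON

$$\frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \left[ \max_{y} \left\{ \Delta(y_i, y) + \sum_{h \in \mathcal{H}} \langle w, \Psi(x_i, y, h) \rangle \right\} - \sum_{h \in \mathcal{H}} \langle w, \Psi(x_i, y_i, h) \rangle \right]$$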
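
One way to read the difference between these objectives is through the operator used to aggregate region scores: HCRF and MSSVM use a log-sum-exp over $h$, while LSSVM uses a hard max. The short sketch below compares the two on a toy score vector (values are made up) and checks the standard bound $\max \leq \text{log-sum-exp} \leq \max + \log|\mathcal{H}|$; it is only meant to illustrate that the soft and hard aggregations stay close, not to reproduce any of the cited models.

```python
import numpy as np

def log_sum_exp(s):
    """Numerically stable log(sum(exp(s)))."""
    m = s.max()
    return float(m + np.log(np.exp(s - m).sum()))

# Toy region scores for one class (illustrative values).
s = np.array([2.0, 0.5, -0.1, -1.5])
soft = log_sum_exp(s)    # aggregation used by HCRF / MSSVM
hard = float(s.max())    # aggregation used by LSSVM
print(hard, soft, hard + np.log(len(s)))
```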
34/35
Thibaut Durand Nicolas Thome Matthieu Cord

MLIA Team (Patrick Gallinari)


Sorbonne Universités - UPMC Paris 6 - LIP6

MANTRA project page


http://webia.lip6.fr/~durandt/project/mantra.html

Thibaut Durand, Nicolas Thome, and Matthieu Cord.


MANTRA: Minimum Maximum LSSVM for Image Classification and Ranking.
In IEEE International Conference on Computer Vision (ICCV), 2015.
Thibaut Durand, Nicolas Thome, and Matthieu Cord.
WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
35/35