
Stacked Autoencoders for Semi-Supervised Learning


Ivan Papusha, Jiquan Ngiam, Andrew Ng

Unsupervised Learning
Train from unlabeled examples (cheap and plentiful). Use the available labels to push the bases in a direction amenable to the classification task from the start.
Why an autoencoder and not, say, an RBM? The objective is non-stochastic, so we can estimate the Hessian using limited-memory BFGS for fast convergence!
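Because the loss is a deterministic function of the full parameter vector, any batch quasi-Newton solver applies. A minimal sketch of the idea, using SciPy's L-BFGS in place of the minFunc implementation cited in the references, with a toy quadratic standing in for the autoencoder loss (both substitutions are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for the autoencoder objective: any deterministic function of the
# full parameter vector works, since the loss is non-stochastic.
def objective(theta):
    value = 0.5 * np.sum((theta - 1.0) ** 2)   # toy quadratic loss
    grad = theta - 1.0                         # its analytic gradient
    return value, grad

theta0 = np.zeros(10)                          # initial parameters
# L-BFGS approximates the Hessian from a limited history of gradients,
# which is what gives the fast convergence noted above.
result = minimize(objective, theta0, jac=True, method="L-BFGS-B",
                  options={"maxiter": 400})
print(result.x)                                # converges to all ones
```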

Stacked Sparse Autoencoder


[Architecture diagram: the input v feeds a stack of sigmoid layers with tied decoder weights W_l^T; a softmax classifier over the digit classes '0'-'9' sits on top.]

  h_1 = \mathrm{sigm}(W_1 v + b_1)
  h_2 = \mathrm{sigm}(W_2 h_1 + b_2)
  \vdots
  h_L = \mathrm{sigm}(W_L h_{L-1} + b_L)
  \mathrm{output} = \mathrm{softmax}(W_0 h_L + b_0)
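A small NumPy sketch of the forward pass drawn above; the 784-500-500 sizes match the MNIST architecture described later in the poster, while the random initialization and single-example input are illustrative assumptions:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [784, 500, 500]                      # v -> h1 -> h2 (illustrative)
W = [rng.normal(0, 0.01, (n_out, n_in))      # encoder weights W_l
     for n_in, n_out in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n_out) for n_out in sizes[1:]] # encoder biases b_l
W0 = rng.normal(0, 0.01, (10, sizes[-1]))    # softmax weights over digits 0-9
b0 = np.zeros(10)

v = rng.random(784)                          # one flattened input image
h = v
for Wl, bl in zip(W, b):                     # h_l = sigm(W_l h_{l-1} + b_l)
    h = sigm(Wl @ h + bl)
y = softmax(W0 @ h + b0)                     # class posterior over '0'..'9'
print(y.argmax(), y.sum())                   # predicted digit; probs sum to 1
```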

Stacked Sparse Autoencoder


Setup: (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) = training examples, with or without labels. p = target activation level (e.g., 0.05 for a 5%-active unit). Minimize the objective over the parameters \theta (all weights and biases):
\min_{\theta}\quad
  \lambda_1 \sum_{i=1}^{m} \sum_{l=1}^{L-1} \big\| \hat{h}_l^{(i)} - h_l^{(i)} \big\|_2^2
+ \lambda_2 \sum_{i=1}^{m} \sum_{\mathrm{units}} \mathrm{KL}(p \,\|\, q)
- \lambda_3 \sum_{i \,:\, y^{(i)} \text{ known}} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)
+ \lambda_4 \sum_{l=1}^{L} \mathrm{tr}\big(W_l^T W_l\big)

Layerwise reconstruction error: \hat{h}_l^{(i)} is h_l^{(i)} decoded back from h_{l+1}^{(i)}.
Sparsity: KL divergence between the actual activations q and the desired activations p,
  \mathrm{KL}(p \| q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}.
Our contribution: negative log-likelihood of the parameters, summed only over existing labels.
Entrywise sum of squares (Frobenius) weight regularization for better generalization accuracy.

Subject to \lambda \succeq 0, \mathbf{1}^T \lambda = 1, \lambda \in \mathbb{R}^4.
The metaparameter \lambda controls the tradeoff between the objectives.
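A sketch of how the four terms might be combined for a single hidden layer, written in NumPy. The tied-weight decoder, the batch estimate of the actual activation q, and all variable names are assumptions for illustration, not the poster's exact formulation:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W1, b1, c1, W0, b0, X, y, labeled, lam, p=0.05):
    """One-hidden-layer version of the four-term objective.
    X: (m, 784) inputs; y: (m,) labels; labeled: boolean mask of known labels;
    lam: length-4 weights with lam >= 0 and sum(lam) == 1 (assumption)."""
    H = sigm(X @ W1.T + b1)                        # hidden activations h_1
    X_hat = sigm(H @ W1 + c1)                      # tied-weight reconstruction (assumption)
    recon = np.sum((X_hat - X) ** 2)               # layerwise reconstruction error

    q = H.mean(axis=0)                             # actual activation level per unit
    kl = np.sum(p * np.log(p / q)
                + (1 - p) * np.log((1 - p) / (1 - q)))   # sparsity penalty KL(p||q)

    Z = H[labeled] @ W0.T + b0                     # softmax scores, labeled rows only
    Z -= Z.max(axis=1, keepdims=True)
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(labeled.sum()), y[labeled]].sum()  # -log p(y|x) on existing labels

    frob = np.trace(W1.T @ W1) + np.trace(W0.T @ W0)  # entrywise sum of squares

    return lam[0] * recon + lam[1] * kl + lam[2] * nll + lam[3] * frob
```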

MNIST dataset
60,000 training examples, 10,000 test examples, 28x28 grayscale images of the digits 0-9.
Architecture: 784-500-500 (784 grayscale input units, 500 hidden units at each hidden layer).
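For reference, a hypothetical way to pull the dataset and confirm the 784-dimensional input, assuming scikit-learn's OpenML mirror of MNIST (the poster does not say how the data were loaded):

```python
from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML (assumption: scikit-learn is available).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                       # scale grayscale pixels to [0, 1]
print(X.shape, y.shape)             # (70000, 784), (70000,)

# Conventional split: 60,000 training examples, 10,000 test examples.
X_train, y_train = X[:60_000], y[:60_000]
X_test,  y_test  = X[60_000:], y[60_000:]
```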

Global Finetune (2D embeddings)

[Figure: two 2D embeddings plotted against the 1st and 2nd principal components. Left panel: pretrained pushed bases. Right panel: bases finetuned with global backprop.]

[Pipeline diagram:]
Randomly throw away 90% of the labels.
Unsupervised learning on images (MNIST digits) gives the learned bases.
Supervised learning on the known labels pushes the bases toward the classification task.
Global finetune.
Test digit in, classification result out.
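A short sketch of the label-ablation step in the pipeline above: hide 90% of the training labels at random and keep a boolean mask of the survivors, which is all the supervised term of the objective gets to see. Only the 90% figure comes from the poster; the mask-based bookkeeping and placeholder labels are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 60_000                                   # MNIST training set size
y = rng.integers(0, 10, size=m)              # placeholder labels (stand-in for real MNIST)

# Randomly throw away 90% of the labels: only ~10% remain visible.
labeled = rng.random(m) < 0.10
y_visible = np.where(labeled, y, -1)         # -1 marks "label unknown"

print(labeled.mean())                        # ~0.10 of examples keep their label
# Unsupervised terms (reconstruction, sparsity) still use all m examples;
# the supervised log-likelihood term sums only over the `labeled` rows.
```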

MNIST classification error on the test set (0.1% differences are significant)


Network depth        1        2        3        4
  100% labels      1.61%    1.38%    1.36%    1.66%
  10% labels       3.37%    3.19%    3.36%    5.56%
  1% labels        7.89%    6.59%    6.87%     --

Reference models                                             MNIST error on 100% labels
  SVM, Gaussian kernel                                       1.4%
  Autoencoder, greedy layerwise + finetune, 784-500-500      1.93%
  RBM pretrain + finetune, 784-1000-500-250-30
    (Hinton & Salakhutdinov, 2006)                           1.6%
  Autoencoder, Hessian-free, 784-1000-500-250-30
    (Martens, 2010)                                          2.46%
  Conv. net + elastic distortions (Ranzato et al., 2006)     0.39%

Selected References
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, July 2006.
Martens, J. Deep learning via Hessian-free optimization. ICML 2010.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient Learning of Sparse Representations with an Energy-Based Model. NIPS 2006.
Schmidt, M. minFunc (L-BFGS implementation). http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html