
OWL to the rescue of LASSO

IISc IBM day 2018


Joint work with R. Sankaran and Francis Bach
AISTATS '17

Chiranjib Bhattacharyya

Professor, Department of Computer Science and Automation


Indian Institute of Science, Bangalore

March 7, 2018
Identifying groups of strongly correlated variables
through Smoothed Ordered Weighted L1-norms

LASSO: The method of choice for feature selection

Ordered Weighted L1 (OWL) norm and Submodular penalties

ΩS: Smoothed OWL (SOWL)


LASSO: The method of choice for feature selection
Linear Regression in High dimension

Question
Let x ∈ R^d, y ∈ R, and

    y = w*^⊤ x + ε,   ε ∼ N(0, σ²).

From D = {(x_i, y_i) | i ∈ [n], x_i ∈ R^d, y_i ∈ R}, can we find w*?

Stacking the samples row-wise,

    X = [x_1^⊤; x_2^⊤; … ; x_n^⊤] ∈ R^{n×d},   y ∈ R^n.
Linear Regression in High dimension

Least Squares Linear Regression


X ∈ R^{n×d}, y ∈ R^n:

    w_LS = argmin_{w ∈ R^d} (1/2) ‖Xw − y‖₂²

Assumptions
I Labels centered: Σ_{j=1}^n y_j = 0.
I Features normalized: x_i ∈ R^d, ‖x_i‖₂ = 1, x_i^⊤ 1_d = 0.
Linear Regression in High dimension

Least Squares Linear Regression


    w_LS = (X^⊤ X)^{-1} X^⊤ y,   and   E(w_LS) = w*

Variance of predictive error: (1/n) E(‖X(w_LS − w*)‖²) = σ² d/n.
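A minimal NumPy sketch of this closed-form estimator and of the σ²d/n error scaling; the sample size, dimension, noise level, and random design below are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 120, 100, 1.0                      # illustrative: d close to n
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + sigma * rng.standard_normal(n)

# Closed-form least squares: w_LS = (X^T X)^{-1} X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical predictive error (1/n) ||X (w_LS - w*)||^2 versus its expectation
# sigma^2 * d / n, which degrades as d approaches n.
print(np.linalg.norm(X @ (w_ls - w_star)) ** 2 / n, sigma**2 * d / n)
```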
Linear Regression in High dimension

Case rank(X) = d
I w_LS is unique.
I Poor predictive performance when d is close to n.

Case rank(X) < d (e.g., d > n)
I w_LS is not unique.
Regularized Linear Regression

Regularized Linear Regression. X ∈ R^{n×d}, y ∈ R^n, Ω : R^d → R:

    min_{w ∈ R^d} (1/2) ‖Xw − y‖₂²   s.t.   Ω(w) ≤ t

Regularizer Ω(w):
I Non-negative.
I Convex, typically a norm.
I Possibly non-differentiable.
Lasso regression [Tibshirani, 1994]

Ridge: Ω(w) = ‖w‖₂²
I Does not promote sparsity.
I Closed-form solution.

Lasso: Ω(w) = ‖w‖₁
I Encourages sparse solutions.
I Requires solving a convex optimization problem.
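A small sketch of this contrast, assuming scikit-learn is available; the data, regularization strengths, and threshold are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge    # assumes scikit-learn is installed

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:5] = 3.0                                  # sparse ground truth (illustrative)
y = X @ w_star + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)                # Omega(w) = ||w||_2^2: no exact zeros
lasso = Lasso(alpha=0.1).fit(X, y)                # Omega(w) = ||w||_1: sparse solution
print("ridge nonzeros:", np.sum(np.abs(ridge.coef_) > 1e-8))
print("lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-8))
```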
Regularized Linear Regression

Regularized Linear Regression. X ∈ R^{n×d}, y ∈ R^n, Ω : R^d → R:

    min_{w ∈ R^d} (1/2) ‖Xw − y‖₂² + λ Ω(w)

I Equivalent to the constrained version (for a corresponding λ).
I Unconstrained formulation.
Lasso: Properties at a glance

Computational
I Proximal methods: IST, FISTA, Chambolle-Pock [Chambolle and Pock, 2011, Beck and Teboulle, 2009].
I Convergence rate: O(1/T²) in T iterations.
I Assumption: availability of the proximal operator of Ω (easy for ℓ₁).

Statistical properties [Wainwright, 2009]
I Support recovery (will it recover the true support?).
I Sample complexity (how many samples are needed?).
I Prediction error (what is the expected prediction error?).
Lasso Model Recovery [Wainwright, 2009, Theorem 1]

Setup

    y = Xw* + ε,   ε_i ∼ N(0, σ²),   X ∈ R^{n×d}

Support of w*: S = {j | w*_j ≠ 0, j ∈ [d]}.

Lasso

    min_{w ∈ R^d} (1/2) ‖Xw − y‖₂² + λ ‖w‖₁
Lasso Model Recovery [Wainwright, 2009, Theorem 1]
Conditions¹:

    ‖X_{S^c}^⊤ X_S (X_S^⊤ X_S)^{-1}‖_∞ ≤ 1 − γ   (incoherence, with γ ∈ (0, 1]),

    Λ_min( (1/n) X_S^⊤ X_S ) ≥ C_min,

    λ > λ₀ := (2/γ) √(2σ² log d / n).

W.h.p., the following holds:

    ‖ŵ_S − w*_S‖_∞ ≤ λ ( ‖(X_S^⊤ X_S / n)^{-1}‖_∞ + 4σ/√C_min )

¹ Define ‖M‖_∞ = max_i Σ_j |M_ij|.
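Both conditions are easy to check numerically for a given design. A minimal NumPy sketch, where the random design and the assumed support S are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0)                    # normalize columns
S = np.array([0, 1, 2, 3, 4])                     # assumed true support
Sc = np.setdiff1d(np.arange(d), S)

XS, XSc = X[:, S], X[:, Sc]
M = XSc.T @ XS @ np.linalg.inv(XS.T @ XS)
incoherence = np.abs(M).sum(axis=1).max()         # ||X_Sc^T X_S (X_S^T X_S)^{-1}||_inf
C_min = np.linalg.eigvalsh(XS.T @ XS / n).min()   # Lambda_min(X_S^T X_S / n)

gamma = 1.0 - incoherence                         # the theorem needs gamma in (0, 1]
print(f"incoherence = {incoherence:.3f}, gamma = {gamma:.3f}, C_min = {C_min:.4f}")
```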
Lasso Model Recovery: Special cases

Case X_{S^c}^⊤ X_S = 0:
I The incoherence condition trivially holds, with γ = 1.
I The threshold λ₀ is smaller ⇒ the recovery error is smaller.

Case X_S^⊤ X_S = I:
I C_min = 1/n, the largest possible for a given n.
I Larger C_min ⇒ smaller recovery error.

When does Lasso work well?
I Lasso prefers low correlation between support and non-support columns.
I Low correlation among the columns within the support leads to better recovery.
Lasso Model Recovery: Implications
Setting: strongly correlated columns in X.
I Correlation between feature i and feature j: ρ_ij ≈ x_i^⊤ x_j.
I Large correlation between X_S and X_{S^c} ⇒ γ is small.
I Large correlation within X_S ⇒ C_min is small.
I The r.h.s. of the bound is large (loose bound).
I Hence, w.h.p., Lasso fails in model recovery.

In other words:
I Lasso solutions differ with the solver used.
I The solution is typically not unique.
I The prediction error may not be as bad, though [Hebiri and Lederer, 2013].

Requirements:
I Need consistent estimates, independent of the solver.
I Preferably select all the correlated variables as a group.


Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]
Setting: strongly correlated features.
I {1, . . . , d} = G₁ ∪ · · · ∪ G_k, with G_m ∩ G_l = ∅ for all l ≠ m.
I ρ_ij ≡ |x_i^⊤ x_j| very high (≈ 1) for pairs i, j ∈ G_m.

Toy example
I d = 40, k = 4.
I G₁ = [1:10], G₂ = [11:20], G₃ = [21:30], G₄ = [31:40].
[Figure: Original signal]

Lasso: Ω(w) = ‖w‖₁.
I Sparse recovery.
I Arbitrarily selects the variables within a group.
[Figure: Recovered signal]
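A NumPy/scikit-learn sketch in the spirit of this toy example; the latent-factor construction, noise levels, and the choice of active groups are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso            # assumes scikit-learn is installed

rng = np.random.default_rng(0)
n, d, k = 100, 40, 4
# Four latent factors; each group of 10 columns is a near-copy of one factor,
# so within-group correlations are close to 1.
Z = rng.standard_normal((n, k))
X = np.repeat(Z, 10, axis=1) + 0.01 * rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[10:20] = 4.0                               # assume groups G2 and G4 are active
w_star[30:40] = 4.0
y = X @ w_star + rng.standard_normal(n)

w_hat = Lasso(alpha=0.1).fit(X, y).coef_
# Within each active group, Lasso typically keeps only a few arbitrary representatives.
for j in range(k):
    print(f"G{j+1}: {np.sum(np.abs(w_hat[10*j:10*(j+1)]) > 1e-6)} of 10 variables selected")
```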



Possible solutions: 2-stage procedures

Cluster Group Lasso [Buhlmann et al., 2013]
I Identify strongly correlated groups G = {G₁, . . . , G_k} via canonical correlation.
I Group selection: Ω(w) = Σ_{j=1}^k α_j ‖w_{G_j}‖₂.
I Select all or no variables from each group.

Goal: Learn G and w simultaneously?
Ordered Weighted ℓ1 (OWL) norms

OSCAR² [Bondell and Reich, 2008]

    Ω_O(w) = Σ_{i=1}^d c_i |w|_(i),   with c_i = c₀ + (d − i) µ,   c₀, µ, c_d > 0.

[Figure: Recovered signal, OWL]

OWL [Figueiredo and Nowak, 2016]:
I c₁ ≥ · · · ≥ c_d ≥ 0.

² Notation: |w|_(i) denotes the i-th largest entry of |w|.
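A small NumPy helper for evaluating Ω_O; the function and the chosen weights are illustrative, not from the original implementations.

```python
import numpy as np

def owl_norm(w, c):
    """Omega_O(w) = sum_i c_i |w|_(i), with |w|_(i) the i-th largest entry of |w|."""
    return np.dot(np.sort(c)[::-1], np.sort(np.abs(w))[::-1])

d = 5
w = np.array([0.3, -2.0, 1.5, 0.0, -0.1])
c_oscar = 1.0 + (d - np.arange(1, d + 1)) * 0.5   # OSCAR weights: c_i = c0 + (d - i) * mu
print(owl_norm(w, c_oscar))

# Sanity checks: c = all-ones gives the l1 norm, c = e_1 gives the l_inf norm.
print(owl_norm(w, np.ones(d)), np.abs(w).sum())
print(owl_norm(w, np.eye(d)[0]), np.abs(w).max())
```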
OSCAR: Sparsity Illustrated

[Figure: Examples of solutions, from Bondell and Reich, 2008]

I Solutions encouraged towards vertices.


I Encourages blockwise constant solutions.
I See [Bondell and Reich, 2008].
OWL-Properties

Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]

    |w_i| = |w_j|   if λ ≥ λ⁰_ij.

I λ⁰_ij ∝ √(1 − ρ²_ij).
I Strongly correlated pairs are grouped early in the regularization path.
I Groups: G_j = {i : |w_i| = α_j}.
OWL-Issues

[Figure: True model (left) vs. recovered by OWL (right)]

I Bias towards piecewise-constant ŵ.
I Easily understood through the norm balls.
I Requires more samples for consistent estimation.
I Lack of interpretation for choosing c.
Ordered Weighted L1 (OWL) norm and Submodular penalties
Preliminaries: Penalties on the Support

Goal
Encourage w to have a desired support structure, where supp(w) = {i | w_i ≠ 0}.

Idea
Penalty on the support [Obozinski and Bach, 2012]:

    pen(w) = F(supp(w)) + ‖w‖_p^p,   p ∈ [1, ∞].

Relaxation of pen(w): Ω_p^F(w)
The tightest positively homogeneous, convex lower bound.

Example: F(supp(w)) = |supp(w)| ⇒ Ω_p^F(w) = ‖w‖₁. Familiar!

Message: The cardinality function always relaxes to the ℓ₁ norm.
Nondecreasing Submodular Penalties on Cardinality

Assumptions. Denote F ∈ F if F : A ⊆ {1, . . . , d} → R is:

1. Submodular [Bach, 2011].
   I For all A ⊆ B and k ∉ B: F(A ∪ {k}) − F(A) ≥ F(B ∪ {k}) − F(B).
   I Lovász extension f : R^d → R (the convex extension of F to R^d).
2. Cardinality-based.
   I F(A) = g(|A|) (invariant to permutations).
3. Non-decreasing.
   I g(0) = 0, g(x) ≥ g(x − 1).

Implication: F ∈ F ⇒ F is completely specified through g.

Example: Let V = {1, . . . , d} and define F(A) = |A| · |V \ A|.
Then f(w) = Σ_{i<j} |w_i − w_j|.
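A quick numerical check of this example, assuming the standard formula for the Lovász extension of a cardinality-based F: sort w in decreasing order and weight the entries by the increments g(i) − g(i − 1).

```python
import numpy as np

def lovasz_cardinality(w, g):
    """Lovász extension of F(A) = g(|A|): sum_i (g(i) - g(i-1)) * w_(i), w sorted decreasing."""
    w_sorted = np.sort(w)[::-1]
    increments = np.array([g(i) - g(i - 1) for i in range(1, len(w) + 1)])
    return np.dot(increments, w_sorted)

d = 6
w = np.random.default_rng(0).random(d)
g = lambda a: a * (d - a)                         # F(A) = |A| * |V \ A|
pairwise = sum(abs(w[i] - w[j]) for i in range(d) for j in range(i + 1, d))
print(lovasz_cardinality(w, g), pairwise)         # the two values should agree
```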
Ω_∞^F and the Lovász extension

Result, case p = ∞ [Bach, 2010]:

    Ω_∞^F(w) = f(|w|)

I The ℓ_∞ relaxation coincides with the Lovász extension on the positive orthant.
I To work with Ω_∞^F, one may use existing results on submodular function minimization.
I Ω_p^F is not known in closed form for p < ∞.
Equivalence of OWL and Lovász extensions: Statement

Proposition [Sankaran et al., 2017]: for F ∈ F, Ω_∞^F(w) ⇔ Ω_O(w)

1. Given F(A) = f(|A|): Ω_∞^F(w) = Ω_O(w), with c_i = f(i) − f(i − 1).
2. Given c₁ ≥ · · · ≥ c_d ≥ 0: Ω_O(w) = Ω_∞^F(w), with f(i) = c₁ + · · · + c_i.

Interpretations
I Gives alternate interpretations for OWL.
I Ω_∞^F has undesired extreme points [Bach, 2011].
I Explains the piecewise-constant solutions of OWL.
I Motivates Ω_p^F(w) for p < ∞.
ΩS : Smoothed OWL (SOWL)
SOWL: Definition

Smoothed OWL

    Ω_S(w) := Ω₂^F(w).

Variational form for Ω₂^F [Obozinski and Bach, 2012]:

    Ω_S(w) = min_{η ∈ R^d_+} (1/2) ( Σ_{i=1}^d w_i²/η_i + f(η) ).

Using the OWL equivalence, f(η) = Ω_∞^F(η) = Σ_{i=1}^d c_i η_(i), hence

    Ω_S(w) = min_{η ∈ R^d_+} (1/2) Σ_{i=1}^d ( w_i²/η_i + c_i η_(i) ) = min_{η ∈ R^d_+} Ψ(w, η),   (SOWL)

where Ψ(w, η) := (1/2) Σ_{i=1}^d ( w_i²/η_i + c_i η_(i) ).
OWL vs SOWL
Case c = 1_d:
I Ω_S(w) = ‖w‖₁.
I Ω_O(w) = ‖w‖₁.

Case c = [1, 0, . . . , 0]^⊤ (d − 1 zeros):
I Ω_S(w) = ‖w‖₂.
I Ω_O(w) = ‖w‖_∞.

Norm Balls
[Figure: Norm balls for OWL (left) and SOWL (right), for different values of c]
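A short NumPy check of these two special cases, plugging the analytically optimal η into the variational objective Ψ defined above; the closed-form choices of η are worked out here for illustration and are not stated on the slides.

```python
import numpy as np

def psi(w, eta, c):
    """Psi(w, eta) = 0.5 * (sum_i w_i^2 / eta_i + sum_i c_i * eta_(i)), for eta > 0."""
    return 0.5 * (np.sum(w**2 / eta) + np.dot(np.sort(c)[::-1], np.sort(eta)[::-1]))

w = np.array([1.0, -2.0, 0.5, 3.0])
d = len(w)

# c = 1_d: each eta_i can be minimized separately, giving eta_i = |w_i| and Psi = ||w||_1.
print(psi(w, np.abs(w), np.ones(d)), np.abs(w).sum())

# c = e_1: only the largest eta_i is penalized, so the optimum takes all eta_i equal;
# minimizing 0.5 * (||w||^2 / t + t) over t gives t = ||w||_2 and Psi = ||w||_2.
print(psi(w, np.full(d, np.linalg.norm(w)), np.eye(d)[0]), np.linalg.norm(w))
```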


Group Lasso and ΩS

SOWL objective (eliminating η):

    Ω_S(w) = Σ_{j=1}^k √(Σ_{i∈G_j} c_i) ‖w_{G_j}‖₂.

Let η_w denote the optimal η for a given w.

Key differences:
I The groups are defined through η_w = [δ₁, . . . , δ₁, . . . , δ_k, . . . , δ_k], constant on each of G₁, . . . , G_k.
I Influenced by the choice of c.
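A minimal sketch of this group-wise formula, assuming the groups are already known and are listed in the order induced by η_w, so that c is partitioned consecutively across groups; the helper and the data are illustrative.

```python
import numpy as np

def sowl_with_groups(w, c, groups):
    """Omega_S(w) = sum_j sqrt(sum_{i in G_j} c_i) * ||w_{G_j}||_2.

    Assumes `groups` is ordered as induced by eta_w, so that the non-increasing
    weights c are consumed consecutively, group by group.
    """
    total, start = 0.0, 0
    for G in groups:                              # each G is a list of coordinate indices
        stop = start + len(G)
        total += np.sqrt(np.sum(c[start:stop])) * np.linalg.norm(w[G])
        start = stop
    return total

c = np.linspace(1.0, 0.1, 6)                      # non-increasing weights (illustrative)
w = np.array([3.0, -3.0, 0.5, 0.4, 0.1, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]                 # assumed grouping, by decreasing magnitude
print(sowl_with_groups(w, c, groups))
```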
Open Questions

1. Does Ω_S promote grouping of correlated variables, as Ω_O does?
   I Are there any benefits over Ω_O?
2. Is using Ω_S computationally feasible?
3. Theoretical properties of Ω_S vs. Group Lasso?
Grouping property of Ω_S: Statement
Learning problem (LS-SOWL):

    min_{w ∈ R^d, η ∈ R^d_+}  (1/(2n)) ‖Xw − y‖₂² + (λ/2) Σ_{i=1}^d ( w_i²/η_i + c_i η_(i) )  =: Γ^(λ)(w, η).

Theorem [Sankaran et al., 2017]
Define the following:
I (ŵ^(λ), η̂^(λ)) = argmin_{w,η} Γ^(λ)(w, η).
I ρ_ij = x_i^⊤ x_j.
I c̃ = min_i (c_i − c_{i+1}).

Then there exists 0 ≤ λ₀ ≤ (‖y‖₂/√n) (4 − 4ρ²_ij)^{1/4} such that, for all λ > λ₀, η̂_i^(λ) = η̂_j^(λ).
Grouping property of Ω_S: Interpretation

1. Variables η_i, η_j are grouped if ρ_ij ≈ 1 (even for small λ).
   I Similar to Figueiredo and Nowak [2016, Theorem 1], which is stated for the absolute values of w.
2. SOWL distinguishes the grouping variable η from the model variable w.
3. Allows model variance within a group.
   I ŵ_i^(λ) ≠ ŵ_j^(λ) as long as c has distinct values.
Illustration: Group Discovery using SOWL
Aim: Illustrate group discovery of SOWL.
I Consider z ∈ R^d and compute prox_{λΩ_S}(z).
I Study the regularization path.

[Figure: Original signal]
[Figure: Regularization paths (x-axis: λ, y-axis: ŵ) for Lasso w, OWL w, SOWL w, and SOWL η]

I Early group discovery.
I Model variation.
Proximal Methods: A brief overview

Proximal operator

    prox_Ω(z) = argmin_w (1/2) ‖w − z‖₂² + Ω(w)

I Easy to evaluate for many simple norms.
I prox_{λℓ₁}(z) = sign(z) ⊙ (|z| − λ)₊.
I Generalization of projected gradient descent.
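For concreteness, a minimal NumPy sketch of the ℓ₁ proximal operator (soft thresholding):

```python
import numpy as np

def prox_l1(z, lam):
    """Proximal operator of lam * ||.||_1: elementwise soft thresholding.

    Entries with |z_i| <= lam become exactly zero; the rest are shrunk towards 0 by lam.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.5, -0.3, 0.0, -1.7, 0.8])
print(prox_l1(z, 1.0))   # zeros at positions where |z_i| <= 1
```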
FISTA [Beck and Teboulle, 2009]

Initialization
I t^(1) = 1, w̃^(1) = w^(1) = 0.

Steps (k > 1)
I w^(k) = prox_Ω( w̃^(k−1) − (1/L) ∇f(w̃^(k−1)) ).
I t^(k) = ( 1 + √(1 + 4 (t^(k−1))²) ) / 2.
I w̃^(k) = w^(k) + ((t^(k−1) − 1)/t^(k)) (w^(k) − w^(k−1)).

Guarantees
I Convergence rate O(1/T²).
I No additional assumptions beyond IST.
I Known to be optimal for this class of minimization problems.
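A minimal NumPy sketch of FISTA applied to the Lasso objective; the Lipschitz estimate, regularization strength, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def fista_lasso(X, y, lam, T=500):
    """FISTA for min_w 0.5 * ||Xw - y||_2^2 + lam * ||w||_1 (a minimal sketch)."""
    L = np.linalg.norm(X, 2) ** 2                 # Lipschitz constant of the smooth gradient
    w = w_prev = np.zeros(X.shape[1])
    w_tilde = w.copy()
    t = 1.0
    for _ in range(T):
        grad = X.T @ (X @ w_tilde - y)            # gradient of the smooth part at w_tilde
        z = w_tilde - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox of (lam/L) * ||.||_1
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        w_tilde = w + ((t - 1.0) / t_next) * (w - w_prev)       # momentum / extrapolation
        w_prev, t = w, t_next
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(50)
print(np.flatnonzero(np.abs(fista_lasso(X, y, lam=0.5)) > 1e-6))
```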
Computing prox_{Ω_S}
Problem:

    prox_{λΩ}(z) = argmin_w (1/2) ‖w − z‖₂² + λ Ω(w),

    w^(λ) = prox_{λΩ_S}(z),   η_w^(λ) = argmin_η Ψ(w^(λ), η).

Key idea: the ordering of η_w^(λ) remains the same for all λ.

[Figure: η along the regularization path, plotted against log(λ) and against λ; the ordering of the η_i is preserved]

I η_w^(λ) = (η_z − λ)₊.
I Same complexity as computing the norm Ω_S: O(d log d).
I True for all cardinality-based F.
Random Design
Problem setting: LS-SOWL.
I True model: y = Xw* + ε, with rows x_i ∼ N(µ, Σ) and ε_i ∼ N(0, σ²).
I Notation: J = {i | w*_i ≠ 0}; η_{w*} = [δ*₁, . . . , δ*₁, . . . , δ*_k, . . . , δ*_k], constant on each group G₁, . . . , G_k.

Assumptions:
I Σ_{J,J} is invertible, λ → 0, and λ√n → ∞.

Irrepresentability conditions³
1. δ*_k = 0 if J^c ≠ ∅.
2. ‖Σ_{J^c,J} (Σ_{J,J})^{-1} D w_J‖₂ < 1.

Result [Sankaran et al., 2017]
1. ŵ →_p w*.
2. P(Ĵ = J) → 1.

I Similar to Group Lasso [Bach, 2008, Theorem 2].
I Learns the weights without explicit group information.

³ D: a diagonal matrix defined using the groups G₁, . . . , G_k and w*.
Quantitative Simulation: Predictive Accuracy

Aim: Learn ŵ using LS-SOWL and evaluate the prediction error.

Generate samples:
I x ∼ N(0, Σ), ε ∼ N(0, σ²),
I y = x^⊤ w* + ε.

Metric: E[‖x^⊤(w* − ŵ)‖²] = (w* − ŵ)^⊤ Σ (w* − ŵ).

Data:⁴ w* = [0₁₀^⊤, 2₁₀^⊤, 0₁₀^⊤, 2₁₀^⊤]^⊤ (alternating blocks of ten 0's and ten 2's).
I n = 100, σ = 15, and Σ_{i,j} = 0.5 if i ≠ j, 1 if i = j.

Models with within-group variance:
I Measure E[‖x^⊤(w̃* − ŵ)‖²],
I where w̃* = w* + ε̃,  ε̃ ∼ U[−τ, τ],  τ = 0, 0.2, 0.4.

⁴ The experiments followed the setup of Bondell and Reich [2008].
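A NumPy sketch of this data-generating setup; whether y is drawn from w* or from the perturbed w̃* when τ > 0 is an assumption here, as are the seed and the trivial baseline estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, tau = 40, 100, 15.0, 0.2

# Covariance as on the slide: Sigma_ij = 0.5 for i != j, 1 on the diagonal.
Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)
w_star = np.concatenate([np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)])
w_tilde = w_star + rng.uniform(-tau, tau, size=d)       # within-group variation

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ w_tilde + sigma * rng.standard_normal(n)        # assumed: y generated from w_tilde

def prediction_error(w_hat, w_ref=w_tilde):
    """Metric from the slide: (w_ref - w_hat)^T Sigma (w_ref - w_hat)."""
    diff = w_ref - w_hat
    return diff @ Sigma @ diff

print(prediction_error(np.zeros(d)))                    # error of the all-zero estimate, for scale
```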
Predictive accuracy results

Algorithm   Median MSE            MSE (10th perc.)      MSE (90th perc.)

LASSO       46.1 / 45.2 / 45.5    32.8 / 32.7 / 33.2    60.0 / 61.5 / 61.4
OWL         27.6 / 27.0 / 26.4    19.8 / 19.2 / 19.2    42.7 / 40.4 / 39.2
El. Net     30.8 / 30.7 / 30.6    21.9 / 22.6 / 23.0    42.4 / 43.0 / 41.4
ΩS          23.9 / 23.3 / 23.4    16.9 / 16.8 / 16.8    35.2 / 35.4 / 33.2

Table: Each cell reports values for τ = 0 / 0.2 / 0.4.


Summary

1. Proposed a new family of norms, Ω_S.
2. Properties:
   I Equivalent to OWL in group identification.
   I Efficient computational tools.
   I Equivalences to Group Lasso.
3. Illustrated performance through simulations.
Questions?
Thank you!
References I

F. Bach. Consistency of the group lasso and multiple kernel


learning. Journal of Machine Learning Research, 9:1179–1225,
2008.
F. Bach. Structured sparsity-inducing norms through submodular
functions. In Neural Information Processing Systems, 2010.
F. Bach. Learning with submodular functions: A convex
optimization perspective. Foundations and Trends in Machine
Learning, 6-2-3:145–373, 2011.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding
algorithm for linear inverse problems. SIAM journal on imaging
sciences, 2(1):183–202, March 2009.
H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage,
variable selection, and supervised clustering of predictors with
oscar. Biometrics, 64:115–123, 2008.
References II
P. Buhlmann, P. Rütimann, Sara van de Geer, and Cun-Hui Zhang.
Correlated variables in regression : Clustering and sparse
estimation. Journal of Statistical Planning and Inference, 43:
1835–1858, 2013.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for
convex problems with applications to imaging. Journal of
mathematical imaging and vision, 40(1):120–145, May 2011.
M. A. T. Figueiredo and R. D. Nowak. Ordered weighted `1
regularized regression with strongly correlated covariates:
Theoretical aspects. In International Conference on Artificial
Intelligence and Statistics (AISTATS), 2016.
Mohamed Hebiri and Johannes Lederer. How correlations influence
lasso prediction. IEEE Transactions on Information Theory, 59
(3):1846–1854, 2013.
G. Obozinski and F. Bach. Convex relaxation for combinatorial
penalties. Technical Report 00694765, HAL, 2012.
References III

R. Sankaran, F. Bach, and C. Bhattacharyya. Identifying groups of
strongly correlated variables through smoothed ordered weighted
ℓ1-norms. In Artificial Intelligence and Statistics, pages
1123–1131, 2017.
R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58:267–288,
1994.
M.J. Wainwright. Sharp thresholds for high-dimensional and noisy
sparsity recovery using `1 -constrained quadratic programming
(lasso). IEEE Transactions On Information Theory, 55(5):
2183–2202, 2009.
X. Zeng and M. Figueiredo. The ordered weighted `1 norm:
Atomic formulation, projections, and algorithms. ArXiv
preprint:1409.4271v5, 2015.
