Chiranjib Bhattacharyya
March 7, 2018
Identifying groups of strongly correlated variables through Smoothed Ordered Weighted $\ell_1$-norms
LASSO: The method of choice for feature selection
Question?
Let $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, and
$$y = w^{*\top} x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$
Stacking the $n$ samples,
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad y \in \mathbb{R}^n.$$
Linear Regression in High dimension
$$w_{LS} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{2}\|Xw - y\|_2^2$$

Assumptions
- Labels centered: $\sum_{j=1}^n y_j = 0$.
- Features normalized: $x_i \in \mathbb{R}^d$, $\|x_i\|_2 = 1$, $x_i^\top \mathbf{1}_d = 0$.
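As an illustration, a minimal NumPy sketch of the least-squares estimator on synthetic data (the sizes and names here are illustrative, not from the talk):

```python
import numpy as np

# Synthetic data: n samples, d features (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# w_LS = argmin_w (1/2) ||Xw - y||_2^2, computed via lstsq.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```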
Case $\operatorname{rank}(X) = d$:
- $w_{LS}$ is unique.
- Poor predictive performance when $d$ is close to $n$.

Case $\operatorname{rank}(X) < d$, or $d$ close to $n$: constrain the estimate,
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|Xw - y\|_2^2 \quad \text{s.t.} \quad \Omega(w) \le t.$$
Regularizer $\Omega(w)$:
- Non-negative.
- Convex; typically a norm.
- Possibly non-differentiable.
Lasso regression [Tibshirani, 1994]

Lasso: $\Omega(w) = \|w\|_1$
- Encourages sparse solutions.
- Requires solving a convex optimization problem.

Ridge: $\Omega(w) = \|w\|_2^2$
- Does not promote sparsity.
- Has a closed-form solution.
Regularized Linear Regression
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|Xw - y\|_2^2 + \lambda\,\Omega(w)$$
Setup
$$y = Xw^* + \epsilon, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad X \in \mathbb{R}^{n \times d}.$$
Let the support of $w^*$ be $S = \{ j \in [d] \mid w_j^* \neq 0 \}$.
Lasso
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$$
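As a concrete sketch, this problem can be solved with scikit-learn (synthetic data below is illustrative). Scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|Xw - y\|_2^2 + \alpha\|w\|_1$, so its `alpha` plays the role of $\lambda/n$ in the formulation above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 40
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:4] = 1.0                                 # sparse ground truth
y = X @ w_star + 0.1 * rng.standard_normal(n)

# alpha corresponds to lambda (up to the 1/n scaling noted above).
model = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)
print(np.nonzero(model.coef_)[0])                # recovered support
```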
Lasso Model Recovery [Wainwright, 2009, Theorem 1]

Conditions:
$$\left\| X_{S^c}^\top X_S \big( X_S^\top X_S \big)^{-1} \right\|_\infty \le 1 - \gamma \quad \text{(incoherence, with } \gamma \in (0, 1]\text{)},$$
$$\Lambda_{\min}\!\left( \tfrac{1}{n} X_S^\top X_S \right) \ge C_{\min},$$
$$\lambda > \lambda_0, \qquad \lambda_0 = \frac{2}{\gamma}\sqrt{\frac{2\sigma^2 \log d}{n}}.$$

Then
$$\| \hat{w}_S - w_S^* \|_\infty \le \lambda \left( \left\| \big( X_S^\top X_S / n \big)^{-1} \right\|_\infty + 4\sigma / \sqrt{C_{\min}} \right).$$

Note: here $\|M\|_\infty = \max_i \sum_j |M_{ij}|$.
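Both conditions are easy to check numerically for a concrete design $X$ and support $S$; a sketch (the function name and data are illustrative):

```python
import numpy as np

def check_wainwright_conditions(X, S):
    """Check the incoherence and minimum-eigenvalue conditions of
    [Wainwright, 2009] for design X and support index array S."""
    n, d = X.shape
    Sc = np.setdiff1d(np.arange(d), S)
    XS, XSc = X[:, S], X[:, Sc]
    M = XSc.T @ XS @ np.linalg.inv(XS.T @ XS)
    incoherence = np.abs(M).sum(axis=1).max()     # ||M||_inf = max_i sum_j |M_ij|
    c_min = np.linalg.eigvalsh(XS.T @ XS / n).min()
    return incoherence, c_min

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))
inc, c_min = check_wainwright_conditions(X, S=np.array([0, 1, 2, 3]))
print(f"incoherence = {inc:.3f} (want <= 1 - gamma), C_min = {c_min:.3f}")
```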
Lasso Model Recovery: Special cases

Case $X_{S^c}^\top X_S = 0$:
- The incoherence condition trivially holds, with $\gamma = 1$.
- The threshold $\lambda_0$ is smaller, so the recovery error is smaller.

Case $X_S^\top X_S = I$:
- $C_{\min} = 1/n$, the largest possible for a given $n$.
- Larger $C_{\min}$ means smaller recovery error.
Correlated designs: $\rho_{ij} \approx x_i^\top x_j$.

Example:
- $d = 40$, $k = 4$.
- $G_1 = [1:10]$, $G_2 = [11:20]$, …

Lasso: $\Omega(w) = \|w\|_1$.
- Sparse recovery.
- Arbitrarily selects variables within a correlated group.

[Figure: true signal and Lasso-recovered signal]
Ordered Weighted $\ell_1$ (OWL) norms

$$\Omega_O(w) = \sum_{i=1}^d c_i\, |w|_{(i)}, \qquad c_i = c_0 + (d - i)\mu, \qquad c_0, \mu, c_d > 0.$$

Notation: $|w|_{(i)}$ denotes the $i$-th largest entry of $|w|$.

[Figure: Recovered: OWL]
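Evaluating $\Omega_O$ is just a sort followed by a dot product; a minimal sketch with the OSCAR-style weights above (sizes illustrative):

```python
import numpy as np

def owl_norm(w, c):
    """Ordered weighted l1 norm: sum_i c_i * |w|_(i),
    where |w|_(i) is the i-th largest entry of |w|."""
    return np.sort(np.abs(w))[::-1] @ c

d, c0, mu = 40, 1.0, 0.1
c = c0 + (d - 1 - np.arange(d)) * mu   # c_i = c0 + (d - i) * mu, i = 1..d
w = np.zeros(d)
w[:10] = 2.0
print(owl_norm(w, c))
```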
OSCAR: Sparsity Illustrated
- $\lambda_{ij}^0 \propto \sqrt{1 - \rho_{ij}^2}$.
- Strongly correlated pairs are grouped early in the regularization path.
- Groups: $G_j = \{ i \mid |w_i| = \alpha_j \}$.
OWL-Issues

[Figure: four recovered-signal panels illustrating OWL's shortcomings]
Preliminaries: Penalties on the Support

Goal
Encourage $w$ to have a desired support structure, where $\operatorname{supp}(w) = \{ i \mid w_i \neq 0 \}$.

Idea
Penalize the support [Obozinski and Bach, 2012]:
$$\operatorname{pen}(w) = F(\operatorname{supp}(w)) + \|w\|_p^p, \qquad p \in [1, \infty].$$

Relaxation of $\operatorname{pen}(w)$: $\Omega_p^F(w)$, the tightest positively homogeneous convex lower bound.
SOWL: Definition

Smoothed OWL: with $\Omega_\infty^F(w) = f(|w|)$, define
$$\Omega_S(w) := \Omega_2^F(w) = \min_{\eta \in \mathbb{R}_+^d} \frac{1}{2} \left( \sum_{i=1}^d \frac{w_i^2}{\eta_i} + f(\eta) \right).$$

Using the OWL equivalence $f(\eta) = \Omega_\infty^F(\eta) = \sum_{i=1}^d c_i\, \eta_{(i)}$ for $\eta \in \mathbb{R}_+^d$,
$$\Omega_S(w) = \min_{\eta \in \mathbb{R}_+^d} \underbrace{\frac{1}{2} \sum_{i=1}^d \left( \frac{w_i^2}{\eta_i} + c_i\, \eta_{(i)} \right)}_{\Psi(w,\, \eta)}. \tag{SOWL}$$
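The variational form can be evaluated numerically by minimizing $\Psi(w, \eta)$ over $\eta > 0$; a brute-force sketch for intuition only, not the efficient computation discussed later:

```python
import numpy as np
from scipy.optimize import minimize

def sowl_norm(w, c, eps=1e-8):
    """Evaluate Omega_S(w) = min_{eta > 0} (1/2) sum_i (w_i^2/eta_i + c_i eta_(i))
    by direct numerical minimization over eta (brute force, illustrative)."""
    d = len(w)

    def psi(eta):
        eta_sorted = np.sort(eta)[::-1]          # eta_(i): i-th largest entry
        return 0.5 * (np.sum(w**2 / eta) + c @ eta_sorted)

    res = minimize(psi, x0=np.abs(w) + 1.0,
                   bounds=[(eps, None)] * d, method="L-BFGS-B")
    return res.fun

c = np.ones(4)                                    # c = 1_d
w = np.array([3.0, -1.0, 2.0, 0.5])
print(sowl_norm(w, c), np.abs(w).sum())
```

With $c = \mathbf{1}_d$ the two printed values should agree, matching the special case on the next slide.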
OWL vs SOWL

Case $c = \mathbf{1}_d$:
- $\Omega_S(w) = \|w\|_1$.
- $\Omega_O(w) = \|w\|_1$.

Case $c = [1, \underbrace{0, \ldots, 0}_{d-1}]^\top$:
- $\Omega_S(w) = \|w\|_2$.
- $\Omega_O(w) = \|w\|_\infty$.
Norm Balls

[Figure: unit balls in $\mathbb{R}^2$ (left: OWL, right: SOWL)]
SOWL as a group norm: for groups $G_1, \ldots, G_k$,
$$\Omega_S(w) = \sum_{j=1}^k \| w_{G_j} \|_2 \sqrt{ \sum_{i \in G_j} c_i }\,.$$
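Given the groups, this form is immediate to evaluate; a sketch where the alignment of the weights $c$ with the groups is an assumption of the illustration:

```python
import numpy as np

def sowl_group_form(w, c, groups):
    """Omega_S(w) = sum_j ||w_{G_j}||_2 * sqrt(sum_{i in G_j} c_i).
    `groups` is a list of index arrays; aligning c with the groups
    this way is an assumption of this sketch."""
    return sum(np.linalg.norm(w[g]) * np.sqrt(c[g].sum()) for g in groups)

c = np.linspace(2.0, 0.1, 40)                     # nonincreasing weights
w = np.zeros(40)
w[:10], w[10:20] = 2.0, -1.0                      # two active groups
groups = [np.arange(0, 10), np.arange(10, 20)]
print(sowl_group_form(w, c, groups))
```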
Grouping property of $\Omega_S$: Statement

Learning Problem (LS-SOWL):
$$\min_{w \in \mathbb{R}^d,\ \eta \in \mathbb{R}_+^d} \underbrace{ \frac{1}{2n} \|Xw - y\|_2^2 + \frac{\lambda}{2} \sum_{i=1}^d \left( \frac{w_i^2}{\eta_i} + c_i\, \eta_{(i)} \right) }_{\Gamma^{(\lambda)}(w,\, \eta)}.$$

Notation:
- $\rho_{ij} = x_i^\top x_j$.
- $\tilde{c} = \min_i (c_i - c_{i+1})$.

Theorem. There exists $\lambda_0$ with $0 \le \lambda_0 \le \frac{\|y\|_2}{\sqrt{\tilde{c}}} \big( 4 - 4\rho_{ij}^2 \big)^{1/4}$ such that $\hat{\eta}_i^{(\lambda)} = \hat{\eta}_j^{(\lambda)}$ for all $\lambda > \lambda_0$.
Grouping property of $\Omega_S$: Interpretation

[Figure: original signal]
[Figure: regularization paths ($x$-axis: $\lambda$, $y$-axis: $\hat{w}$) for Lasso $w$, OWL $w$, SOWL $w$, and SOWL $\eta$]

- Early group discovery.
- Model variation.
Proximal Methods: A brief overview

Proximal operator
$$\operatorname{prox}_\Omega(z) = \operatorname*{argmin}_w \frac{1}{2} \|w - z\|_2^2 + \Omega(w)$$

Initialization
- $t^{(1)} = 1$, $\tilde{w}^{(1)} = w^{(1)} = 0$.

Steps ($k > 1$)
- $w^{(k)} = \operatorname{prox}_\Omega\!\big( \tilde{w}^{(k-1)} - \frac{1}{L} \nabla f(\tilde{w}^{(k-1)}) \big)$.
- $t^{(k)} = \big( 1 + \sqrt{1 + 4\, (t^{(k-1)})^2} \big) / 2$.
- $\tilde{w}^{(k)} = w^{(k)} + \frac{t^{(k-1)} - 1}{t^{(k)}} \big( w^{(k)} - w^{(k-1)} \big)$.

Guarantee
- Convergence rate $O(1/T^2)$.
- No assumptions beyond those of IST.
- Known to be optimal for this class of minimization problems.
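A minimal sketch of the steps above for the Lasso case, where $f(w) = \frac{1}{2}\|Xw - y\|_2^2$ and the prox of $\lambda\|\cdot\|_1$ is elementwise soft-thresholding (the SOWL prox from the next slide would slot in the same way):

```python
import numpy as np

def fista_lasso(X, y, lam, T=500):
    """FISTA for min_w (1/2)||Xw - y||_2^2 + lam * ||w||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of grad f
    w = w_prev = np.zeros(d)
    w_tilde, t = np.zeros(d), 1.0
    for _ in range(T):
        grad = X.T @ (X @ w_tilde - y)
        z = w_tilde - grad / L
        # prox step: soft-thresholding at level lam / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        w_tilde = w + ((t - 1.0) / t_next) * (w - w_prev)
        w_prev, t = w, t_next
    return w

# Example usage: w_hat = fista_lasso(X, y, lam=0.1)
```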
Computing $\operatorname{prox}_{\Omega_S}$

Problem:
$$\operatorname{prox}_{\lambda \Omega}(z) = \operatorname*{argmin}_w \frac{1}{2} \|w - z\|_2^2 + \lambda\, \Omega(w).$$
Write $w^{(\lambda)} = \operatorname{prox}_{\lambda \Omega_S}(z)$ and $\eta_w^{(\lambda)} = \operatorname*{argmin}_\eta \Psi(w^{(\lambda)}, \eta)$.

Key idea: the ordering of $\eta_w^{(\lambda)}$ remains the same for all $\lambda$.

[Figure: coordinates of $\eta$ along the path, plotted against $\log(\lambda)$ and $\lambda$]

- $\eta_w^{(\lambda)} = (\eta_z - \lambda)_+$.
- Same complexity as computing the norm $\Omega_S$: $O(d \log d)$.
- True for all cardinality-based $F$.
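As a sanity check on small problems, the prox can also be obtained by jointly minimizing over $(w, \eta)$, since $\frac{1}{2}\|w - z\|_2^2 + \lambda \Psi(w, \eta)$ is jointly convex; a brute-force scipy sketch, not the $O(d \log d)$ algorithm above:

```python
import numpy as np
from scipy.optimize import minimize

def prox_sowl_bruteforce(z, c, lam, eps=1e-8):
    """prox_{lam * Omega_S}(z) via joint minimization of
    (1/2)||w - z||^2 + lam * Psi(w, eta) over (w, eta),
    with Psi(w, eta) = (1/2) sum_i (w_i^2/eta_i + c_i * eta_(i))."""
    d = len(z)

    def obj(v):
        w, eta = v[:d], v[d:]
        eta_sorted = np.sort(eta)[::-1]
        psi = 0.5 * (np.sum(w**2 / eta) + c @ eta_sorted)
        return 0.5 * np.sum((w - z)**2) + lam * psi

    x0 = np.concatenate([z, np.abs(z) + 1.0])
    bounds = [(None, None)] * d + [(eps, None)] * d
    res = minimize(obj, x0, bounds=bounds, method="L-BFGS-B")
    return res.x[:d]

z = np.array([3.0, 2.9, -0.2, 0.1])
print(prox_sowl_bruteforce(z, c=np.array([2.0, 1.5, 1.0, 0.5]), lam=0.5))
```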
Random Design

Problem setting: LS-SOWL.
- True model: $y = Xw^* + \varepsilon$, with rows $X_i^\top \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
- Notation: $J = \{ i \mid w_i^* \neq 0 \}$, $\eta_{w^*} = [\underbrace{\delta_1^*, \ldots, \delta_1^*}_{G_1}, \ldots, \underbrace{\delta_k^*, \ldots, \delta_k^*}_{G_k}]$.

Assumptions:
- $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda \sqrt{n} \to \infty$.

Irrepresentability conditions:
1. $\delta_k^* = 0$ if $J^c \neq \emptyset$.
2. $\big\| \Sigma_{J^c,J} (\Sigma_{J,J})^{-1} D\, w_J^* \big\|_2 < 1$.

Note: $D$ is a diagonal matrix defined using the groups $G_1, \ldots, G_k$ and $w^*$.
Quantitative Simulation: Predictive Accuracy

The experiments followed the setup of Bondell and Reich [2008].

Predictive accuracy results