
Speeding up Greedy Forward Selection for Regularized Least-Squares

Tapio Pahikkala, Antti Airola, and Tapio Salakoski


Department of Information Technology
University of Turku and Turku Centre for Computer Science,
Turku, Finland,
e-mail: firstname.lastname@utu.fi.

Abstract: We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm compared to previously proposed implementations.

I. INTRODUCTION

In this paper, we propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification. RLS (see e.g. [1]), also known as the least-squares support vector machine (LS-SVM) [2] and ridge regression [3], is a state-of-the-art machine learning method suitable both for regression and classification, and it has also been extended for ranking [4], [5].

In the literature, there are many approaches for the task of feature selection (see e.g. [6]). We consider the wrapper type of feature selection [7], in which the feature selection and the training of the RLS predictor are done simultaneously. Performing feature selection for linear RLS results in a linear predictor which is sparse, that is, the predictor has nonzero parameters only for the selected features. Sparse predictors are beneficial for many reasons. For example, when constructing an instrument for diagnosing a disease according to a medical sample, sparse predictors can be deployed in the instrument more easily than dense ones, because they require less memory for storing and less information from the sample in order to perform the diagnosis. Another benefit of sparse representations is in their interpretability. If the predictor consists of only a small number of nonzero parameters, it is easier for a human expert to explain the underlying concept.

The number of possible feature sets grows exponentially with the number of available features. Therefore, wrapper type feature selection methods need a search strategy over the power set of features. As a search strategy, we use greedy forward selection, which adds one feature at a time to the set of selected features, while no features are removed from the set at any stage. In addition to the search strategy, a heuristic for assessing the goodness of the feature subsets is required. Measuring the prediction performance on the training set is known to be unreliable. Therefore, as suggested by [8], we use the leave-one-out cross-validation (LOO) approach, where each example in turn is left out of the training set and used for testing, as a search heuristic. The merits of using greedy forward selection as a search strategy and the LOO criterion as a heuristic when doing feature selection for RLS have been empirically shown by [9], [10].

An important property of the RLS algorithm is that it has a closed form solution, which can be fully expressed in terms of matrix operations. This allows developing efficient computational shortcuts for the method, since small changes in the training data matrix correspond to low-rank changes in the learned predictor. In particular, it makes possible the development of efficient cross-validation algorithms. An updated predictor, corresponding to one learned from a training set from which one example has been removed, can be obtained via a well-known computationally efficient short-cut (see e.g. [11]), which in turn enables the fast computation of LOO-based performance estimates. Analogously to removing the effects of training examples from learned RLS predictors, the effects of a feature can be added or removed by updating the learned RLS predictors via similar computational short-cuts.

Learning a linear RLS predictor with k features and m training examples requires O(min{k^2 m, km^2}) time, since the training can be performed either in primal or dual form depending on whether m > k or vice versa. Let n be the overall number of features available for selection. Given that the computation of LOO performance requires m retrainings, that the forward selection tests O(n) possibilities in each iteration, and that the forward selection has k iterations, the overall time complexity of the forward selection with LOO criterion becomes O(min{k^3 m^2 n, k^2 m^3 n}) in case RLS is used as a black-box method.
In machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO have been used to speed up the evaluation of feature subset quality for RLS (see e.g. [9]). However, the considered approaches are still computationally quite demanding, since the LOO estimate needs to be re-calculated from scratch for each considered subset of features. Recently, [10] introduced a method which uses additional computational short-cuts for efficient updating of the LOO predictions when adding new features to those already selected. The computational cost of the incremental forward selection procedure is only O(km^2 n) for the method. The speed improvement is notable especially in cases where the aim is to select a large number of features but the training set is small. However, the method is still impractical for large training sets due to its quadratic scaling. In [12] we proposed computational short-cuts for speeding up feature selection for RLS with LOO criterion, and in [13] we proposed a feature selection algorithm for RankRLS, an RLS-based algorithm for learning to rank.

This paper concerns the construction of a greedy forward selection algorithm for RLS that takes advantage of our previously proposed computational short-cuts. The algorithm, which we call greedy RLS, can be carried out in O(kmn) time, where k is the number of selected features. The method is computationally faster than the previously proposed implementations.

The contributions of this paper can be summarized as follows:

- We present greedy RLS, a linear time feature selection algorithm for RLS that uses LOO as a selection criterion.
- We show that the predictors learned by greedy RLS are exactly equivalent to those obtained via a standard wrapper feature selection approach which uses a greedy selection strategy, the LOO selection criterion, and RLS as a black-box method.
- We show both theoretically and empirically that greedy RLS is computationally faster than the previously proposed algorithms that produce equivalent predictors. Namely, it is shown that the computational complexity of greedy RLS is O(kmn), which makes it faster than the low-rank updated LS-SVM proposed by [10], which requires O(km^2 n) time.

II. REGULARIZED LEAST-SQUARES

We start by introducing some notation. Let R^m and R^{n × m}, where n, m ∈ N, denote the sets of real valued column vectors and n × m matrices, respectively. To denote real valued matrices and vectors we use bold capital letters and bold lower case letters, respectively. Moreover, index sets are denoted with calligraphic capital letters. By denoting M_i, M_{:,j}, and M_{i,j}, we refer to the ith row, the jth column, and the i,jth entry of the matrix M ∈ R^{n × m}, respectively. Similarly, for index sets R ⊆ {1, . . . , n} and L ⊆ {1, . . . , m}, we denote the submatrices of M having their rows indexed by R, their columns by L, and their rows by R and columns by L as M_R, M_{:,L}, and M_{R,L}, respectively. We use an analogous notation also for column vectors, that is, v_i refers to the ith entry of the vector v.

Let X ∈ R^{n × m} be a matrix containing the whole feature representation of the examples in the training set, where n is the total number of features and m is the number of training examples. The i,jth entry of X contains the value of the ith feature in the jth training example. Moreover, let y ∈ R^m be a vector containing the labels of the training examples. In binary classification, the labels can be restricted to be either -1 or 1, for example, while they can be any real numbers in regression tasks.

In this paper, we consider linear predictors of the type

f(x) = w^T x_S,    (1)

where w is the |S|-dimensional vector representation of the learned predictor and x_S can be considered as a mapping of the data point x into the |S|-dimensional feature space.¹ Note that the vector w only contains entries corresponding to the features indexed by S. The rest of the features of the data points are not used in the prediction phase. The computational complexity of making predictions with (1) and the space complexity of the predictor are both O(|S|), provided that the feature vector representation x_S for the data point x is given.

¹ In the literature, the formula of the linear predictors often also contains a bias term. Here, we assume that if such a bias is used, it will be realized by using an extra constant valued feature in the data points.

Given training data and a set of feature indices S, we find w by minimizing the RLS risk. This can be expressed as the following problem:

argmin_{w ∈ R^{|S|}} { ((w^T X_S)^T - y)^T ((w^T X_S)^T - y) + λ w^T w },    (2)

where λ > 0 is a regularization parameter. The first term in (2), called the empirical risk, measures how well the prediction function fits the training data. The second term is called the regularizer, and it controls the tradeoff between the loss on the training set and the complexity of the prediction function.

A straightforward approach to solve (2) is to set the derivative of the objective function with respect to w to zero. Then, by solving with respect to w, we get

w = (X_S (X_S)^T + λI)^{-1} X_S y,    (3)

where I is the identity matrix. We note (see e.g. [14]) that an equivalent result can be obtained from

w = X_S ((X_S)^T X_S + λI)^{-1} y.    (4)

If the size of the set S of currently selected features is smaller than the number of training examples m, it is computationally beneficial to use the form (3), while using (4) is faster in the opposite case. Namely, the computational complexity of the former is O(|S|^3 + |S|^2 m), while that of the latter is O(m^3 + m^2 |S|), and hence the complexity of training a predictor is O(min{|S|^2 m, m^2 |S|}).
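To make the primal/dual trade-off concrete, the following NumPy sketch trains a linear RLS predictor on a given feature subset using whichever of (3) or (4) is cheaper. The function name train_rls, the data layout (rows of X_S are the selected features, columns are the training examples, as in the notation above), and the argument lam standing for the regularization parameter λ are our own illustrative choices; this is a minimal sketch under those assumptions, not the implementation used by the authors.

import numpy as np

def train_rls(X_S, y, lam=1.0):
    """Minimal RLS training sketch; X_S has shape (|S|, m), y has shape (m,)."""
    s, m = X_S.shape
    if s < m:
        # Primal form, eq. (3): w = (X_S X_S^T + lam*I)^{-1} X_S y, cost O(|S|^3 + |S|^2 m).
        return np.linalg.solve(X_S @ X_S.T + lam * np.eye(s), X_S @ y)
    # Dual form, eq. (4): w = X_S (X_S^T X_S + lam*I)^{-1} y, cost O(m^3 + m^2 |S|).
    return X_S @ np.linalg.solve(X_S.T @ X_S + lam * np.eye(m), y)

Both branches return the same weight vector up to numerical error; only the size of the linear system that has to be formed and solved differs.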
To support the following considerations, we introduce the dual form of the prediction function and some extra notation. According to [15], the prediction function (1) can be represented in dual form as follows:

f(x) = a^T (X_S)^T x_S.

Here a ∈ R^m is the vector of so-called dual variables, which can be obtained from a = Gy, where G = ((X_S)^T X_S + λI)^{-1}.

Next, we consider a well-known efficient approach for evaluating the LOO performance of a trained RLS predictor (see e.g. [11]). Provided that we have the vector of dual variables a and the diagonal elements of G available, the LOO prediction for the jth training example can be obtained in a constant number of floating point operations from

y_j - (G_{j,j})^{-1} a_j.    (5)
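As an illustration of the short-cut (5), the sketch below computes all m LOO predictions of a dual RLS predictor at roughly the cost of a single training. The function name and the explicit matrix inversion are our own simplifications for clarity; a practical implementation would reuse a decomposition of the matrix instead of forming the inverse.

import numpy as np

def loo_predictions(X_S, y, lam=1.0):
    """Leave-one-out predictions for dual RLS via eq. (5): y_j - (G_{j,j})^{-1} a_j."""
    m = X_S.shape[1]
    G = np.linalg.inv(X_S.T @ X_S + lam * np.eye(m))  # G = ((X_S)^T X_S + lam*I)^{-1}
    a = G @ y                                          # dual variables a = G y
    return y - a / np.diag(G)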
III. ALGORITHM DESCRIPTIONS

A. Wrapper Approach

Here, we consider greedy forward feature selection for RLS with LOO criterion. In this approach, feature sets up to size k are searched and the goodness of a feature subset is measured via LOO classification error. The method is greedy in the sense that it adds one feature at a time to the set of selected features, and no features are removed from the set at any stage. A high level pseudo code of this approach is presented in Fig. 1. In the algorithm description, the outermost loop adds one feature at a time into the set of selected features S until the size of the set has reached the desired number of selected features k. The inner loop goes through every feature that has not yet been added into the set of selected features and, for each feature, computes the LOO performance of the RLS predictor trained using the feature and the previously added features. With LOO(X_R, y, λ), we denote the LOO performance obtained with a data matrix X_R, a label vector y, and a regularization parameter λ for RLS. At the end of the algorithm description, t(X_S, y, λ) denotes the black-box training procedure for RLS, which takes a data matrix, a label vector, and a value of the regularization parameter as input and returns a vector representation of the learned predictor w.

Algorithm 1: Standard wrapper algorithm for RLS
  Input: X ∈ R^{n × m}, y ∈ R^m, k, λ
  Output: S, w
   1  S ← ∅;
   2  while |S| < k do
   3      e ← ∞;
   4      b ← 0;
   5      foreach i ∈ {1, . . . , n} \ S do
   6          R ← S ∪ {i};
   7          e_i ← LOO(X_R, y, λ);
   8          if e_i < e then
   9              e ← e_i;
  10              b ← i;
  11      S ← S ∪ {b};
  12  w ← t(X_S, y, λ);

Figure 1. Standard wrapper algorithm for RLS
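The sketch below spells out the procedure of Fig. 1 in NumPy, realizing LOO(·) with the dual-form short-cut (5) and the black-box trainer t(·) with the dual solution (4). The function names and the squared-error LOO criterion are our own illustrative choices for this sketch; it is not the implementation benchmarked later in the paper.

import numpy as np

def loo_squared_error(X_R, y, lam):
    """LOO squared error of dual RLS on the feature subset R, via eq. (5)."""
    m = X_R.shape[1]
    G = np.linalg.inv(X_R.T @ X_R + lam * np.eye(m))
    a = G @ y
    return np.sum((a / np.diag(G)) ** 2)   # sum over j of (LOO prediction - y_j)^2

def wrapper_selection(X, y, k, lam=1.0):
    """Standard wrapper forward selection (Fig. 1) with RLS treated as a black box."""
    n, m = X.shape                         # X: one row per feature, one column per example
    S = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in S]
        errors = [loo_squared_error(X[S + [i]], y, lam) for i in candidates]
        S.append(candidates[int(np.argmin(errors))])
    # Final training step t(X_S, y, lam), realized here with the dual form (4).
    w = X[S] @ np.linalg.solve(X[S].T @ X[S] + lam * np.eye(m), y)
    return S, w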
Training a linear RLS predictor with k features and m training examples requires O(min{k^2 m, km^2}) time. Given that the computation of LOO performance requires m retrainings, that the selection of a new feature is done from a set of size O(n) in each iteration, and that k features are chosen, the overall time complexity of the forward selection with LOO criterion is O(min{k^3 m^2 n, k^2 m^3 n}). Thus, the wrapper approach is feasible with small training sets and in cases where the aim is to select only a small number of features. However, the at least quadratic complexity with respect to both the size of the training set and the number of selected features makes the standard wrapper approach impractical in large scale learning settings. The space complexity of the wrapper approach is O(nm), which is equal to the cost of storing the data matrix X.

An immediate reduction of the above considered computational complexities can be achieved via the short-cut methods for calculating the LOO performance presented in Section II. In machine learning and statistics literature, there have been several studies in which the computational short-cuts for LOO, as well as for other types of cross-validation, have been used to speed up the evaluation of feature subset quality for RLS. For the greedy forward selection, the approach proposed by [9] has a computational complexity of O(k^2 m^2 n + km^3 n), since they train RLS in dual form and use the LOO short-cut (5), enabling the calculation of LOO as efficiently as training the predictor itself.

We note that an analogous short-cut for the LOO performance exists also for the primal form of RLS (see e.g. [11]), that is, the LOO performance can be computed for RLS as efficiently as training RLS in the primal. Accordingly, we can remove a factor m corresponding to the LOO computation from the time complexity of the wrapper approach, which then becomes O(min{k^3 mn, k^2 m^2 n}). However, we are not aware of any studies in which the LOO short-cut for the primal formulation would be used for greedy forward feature selection for RLS.

Recently, [10] proposed an algorithm implementation which they call low-rank updated LS-SVM. The features selected by the algorithm are equal to those selected by the standard wrapper algorithm and by the method proposed by [9], but it uses certain additional computational short-cuts to speed up the selection process. The overall time complexity of the low-rank updated LS-SVM algorithm is O(km^2 n). The complexity with respect to k is linear, which is much better than that of the standard wrapper approach, and hence the selection of large feature sets is made possible. However, feature selection with large training sets is still infeasible because of the quadratic complexity with respect to m. The space complexity of the low-rank updated LS-SVM algorithm is O(nm + m^2), because it stores the matrices X and G requiring O(nm) and O(m^2) space, respectively. Due to the quadratic dependence on m, this space complexity is worse than that of the standard wrapper approach.

B. Greedy Regularized Least-Squares

Here, we present our novel algorithm for greedy forward selection for RLS with LOO criterion. We refer to our algorithm as greedy RLS, since, in addition to the feature selection point of view, it can also be considered as a greedy algorithm for learning sparse RLS predictors. Pseudo code of greedy RLS is presented in Fig. 2.

Algorithm 2: Greedy RLS
  Input: X ∈ R^{n × m}, y ∈ R^m, k, λ
  Output: S, w
   1  a ← λ^{-1} y;
   2  d ← λ^{-1} 1;
   3  C ← λ^{-1} X^T;
   4  S ← ∅;
   5  while |S| < k do
   6      e ← ∞;
   7      b ← 0;
   8      foreach i ∈ {1, . . . , n} \ S do
   9          u ← C_{:,i} (1 + X_i C_{:,i})^{-1};
  10          ã ← a - u(X_i a);
  11          e_i ← 0;
  12          foreach j ∈ {1, . . . , m} do
  13              d̃_j ← d_j - u_j C_{j,i};
  14              p ← y_j - (d̃_j)^{-1} ã_j;
  15              e_i ← e_i + (p - y_j)^2;
  16          if e_i < e then
  17              e ← e_i;
  18              b ← i;
  19      u ← C_{:,b} (1 + X_b C_{:,b})^{-1};
  20      a ← a - u(X_b a);
  21      foreach j ∈ {1, . . . , m} do
  22          d_j ← d_j - u_j C_{j,b};
  23      C ← C - u(X_b C);
  24      S ← S ∪ {b};
  25  w ← X_S a;

Figure 2. Greedy RLS algorithm proposed by us

In order to take advantage of the computational short-cuts, greedy RLS maintains the current set of selected features S ⊆ {1, . . . , n}, the vectors a, d ∈ R^m, and the matrix C ∈ R^{m × n} whose values are defined as

a = Gy,
d = diag(G),    (6)
C = GX^T,

where

G = ((X_S)^T X_S + λI)^{-1}

and diag(G) denotes a vector consisting of the diagonal entries of G.

In the initialization phase of the greedy RLS algorithm (lines 1-4 in Algorithm 2) the set of selected features is empty, and hence the values of a, d, and C are initialized to λ^{-1} y, λ^{-1} 1, and λ^{-1} X^T, respectively, where 1 ∈ R^m is a vector having every entry equal to 1. The time complexity of the initialization phase is dominated by the O(mn) time required for initializing C. Thus, the initialization phase is no more complex than one pass through the features.

We now consider finding the optimal update direction given the set of n - |S| available directions, that is, the direction having the lowest LOO error. Computing the LOO performance with formula (5) for the modified feature set S ∪ {i}, where i is the index of the feature to be tested, requires the vectors ã = G̃y and d̃ = diag(G̃), where

G̃ = ((X_S)^T X_S + (X_i)^T X_i + λI)^{-1}.

Let us define a vector

u = C_{:,i} (1 + X_i C_{:,i})^{-1},    (7)

which is computable in O(m) time provided that C is available. Then, the matrix G̃ can be rewritten as

G̃ = G - G(X_i)^T (1 + X_i G(X_i)^T)^{-1} X_i G
  = G - u X_i G,    (8)

where the first equality is due to the well-known Sherman-Morrison-Woodbury formula. Accordingly, the vector ã can be written as

ã = G̃y
  = (G - u X_i G)y
  = a - u(X_i a),    (9)

in which the lowermost expression is also computable in O(m) time. Further, the entries of d̃ can be computed from

d̃_j = G̃_{j,j}
    = (G - u X_i G)_{j,j}
    = (G - u(C_{:,i})^T)_{j,j}
    = d_j - u_j C_{j,i}    (10)

in constant time, the overall time needed for computing d̃ again becoming O(m).

Thus, provided that we have all the necessary caches available, evaluating each feature requires O(m) time, and hence one pass through the whole set of n features needs O(mn) time.
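The following NumPy fragment is one way to write this O(m) candidate evaluation, given the caches a, d, and C of (6) and the data matrix X with one row per feature. The function name and the use of the summed squared LOO residual as the error measure are our own illustrative choices.

import numpy as np

def candidate_loo_error(X, a, d, C, i):
    """LOO squared error if feature i were added, via eqs. (7), (9), (10); O(m) work."""
    xi = X[i]                              # values of feature i over the m examples
    u = C[:, i] / (1.0 + xi @ C[:, i])     # eq. (7)
    a_new = a - u * (xi @ a)               # eq. (9): updated dual variables
    d_new = d - u * C[:, i]                # eq. (10): updated diagonal of G
    return np.sum((a_new / d_new) ** 2)    # squared LOO residuals from eq. (5)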
But we still have to ensure that the caches can be initialized and updated efficiently enough. When a new feature is added into the set of selected features, the vector a is updated according to (9) and the vector d according to (10). Putting together (6), (7), and (8), the cache matrix C can be updated via

C ← C - u(X_b C),

where b is the index of the feature having the best LOO value, requiring O(mn) time. This is equally expensive as the above introduced fast approach for trying each feature at a time using LOO as a selection criterion.

Finally, if we are to select altogether k features, the overall time complexity of greedy RLS becomes O(kmn). The space complexity is dominated by the matrices X and C, which both require O(mn) space.
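Putting the pieces together, the following self-contained NumPy sketch mirrors Algorithm 2: it initializes the caches in O(mn) time, evaluates every remaining feature in O(m) time per candidate using (7), (9), and (10), and commits the updates of a, d, and C once the best feature of the iteration is known. Variable and function names are our own, and the sketch assumes the paper's data layout (X with one row per feature); it is an illustration of the method rather than the authors' RLScore implementation.

import numpy as np

def greedy_rls(X, y, k, lam=1.0):
    """Greedy RLS feature selection sketch; X has shape (n, m), y has shape (m,)."""
    n, m = X.shape
    a = y / lam                    # a = G y with S empty, since G = (lam*I)^{-1}
    d = np.full(m, 1.0 / lam)      # d = diag(G)
    C = X.T / lam                  # C = G X^T, shape (m, n)
    selected = []
    for _ in range(k):
        best_err, best_i = np.inf, -1
        for i in range(n):
            if i in selected:
                continue
            xi = X[i]
            u = C[:, i] / (1.0 + xi @ C[:, i])   # eq. (7)
            a_new = a - u * (xi @ a)             # eq. (9)
            d_new = d - u * C[:, i]              # eq. (10)
            err = np.sum((a_new / d_new) ** 2)   # LOO squared error via eq. (5)
            if err < best_err:
                best_err, best_i = err, i
        xb = X[best_i]
        u = C[:, best_i] / (1.0 + xb @ C[:, best_i])
        a = a - u * (xb @ a)                     # commit the update of a
        d = d - u * C[:, best_i]                 # commit the update of d
        C = C - np.outer(u, xb @ C)              # cache update C <- C - u (X_b C)
        selected.append(best_i)
    w = X[selected] @ a                          # w = X_S a
    return selected, w

Each outer iteration costs O(mn): O(m) per candidate in the inner loop plus one O(mn) cache update, so selecting k features takes O(kmn) time, in line with the analysis above.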
IV. EXPERIMENTAL RESULTS

We recall that our greedy RLS algorithm leads to results that are equivalent to those of the algorithms proposed by [9] and by [10], while being computationally much more efficient. The aforementioned authors have in their work shown that the greedy forward selection with LOO criterion approach compares favorably to other commonly used feature selection techniques. Hence, in our experiments, we focus on the scalability of our algorithm implementation and compare it with the best previously proposed algorithm, namely the low-rank updated LS-SVM introduced by [10]. We will not compare greedy RLS to the implementation proposed by [9], because it was already shown by [10] to be slower than the low-rank updated LS-SVM algorithm.

To test the scalability of the method, we use randomly generated data from two normal distributions with 1000 features, of which 50 are selected. The number of examples is varied. In the first experiment we vary the number of training examples between 500 and 5000, and provide a comparison to the low-rank updated LS-SVM method [10]. In the second experiment the training set size is varied between 1000 and 50000. We do not consider the baseline method in the second experiment, as it does not scale up to the considered training set sizes. The runtime experiments were run on a modern desktop computer with a 2.4 GHz Intel Core 2 Duo E6600 processor, 8 GB of main memory, and a 64-bit Ubuntu Linux 9.10 operating system.

We note that the running times of these two methods are not affected by the choice of the regularization parameter, or by the distribution of the features or the class labels. This is in contrast to iterative optimization techniques commonly used to train, for example, support vector machines [16]. Thus we can draw general conclusions about the scalability of the methods from experiments with a fixed value for the regularization parameter and synthetic data.

The runtime experiments are presented in Fig. 3. The results are consistent with the algorithmic complexity analysis of the methods. The low-rank updated LS-SVM method of [10] shows quadratic scaling with respect to the number of training examples, while the proposed method scales linearly. Moreover, the figure shows the running times of greedy RLS for up to 50000 training examples, in which case selecting 50 features out of 1000 took a bit less than twelve minutes.

V. CONCLUSION

We propose greedy regularized least-squares, a novel training algorithm for sparse linear predictors. The predictors learned by the algorithm are equivalent to those obtained by performing a greedy forward feature selection with the leave-one-out (LOO) criterion for regularized least-squares (RLS), also known as the least-squares support vector machine or ridge regression. That is, the algorithm works like a wrapper type of feature selection method which starts from the empty feature set, and on each iteration adds the feature whose addition provides the best LOO performance. Training a predictor with greedy RLS requires O(kmn) time, where k is the number of non-zero entries in the predictor, m is the number of training examples, and n is the original number of features in the data. This is in contrast to the computational complexity O(min{k^3 m^2 n, k^2 m^3 n}) of using the standard wrapper method with the LOO selection criterion in case RLS is used as a black-box method, and the complexity O(km^2 n) of the method proposed by [10], which is the most efficient of the previously proposed speed-ups. Hence, greedy RLS is computationally more efficient than the previously known feature selection methods for RLS. We demonstrate experimentally the computational efficiency of greedy RLS compared to the best previously proposed implementation.

We have made freely available a software package called RLScore consisting of our previously proposed RLS based machine learning algorithms². An implementation of the greedy RLS algorithm is also available as a part of this software.

² Available at http://www.tucs.fi/RLScore

The study presented in this paper opens several directions for future research. For example, greedy RLS can quite straightforwardly be generalized to use different types of cross-validation criteria, such as N-fold or repeated N-fold. These are motivated by their smaller variance compared to leave-one-out, and they have also been shown to have better asymptotic convergence properties for feature subset selection for ordinary least-squares [17]. This generalization can be achieved by using the short-cut methods developed by us [18] and by [19]. In [13] we investigated this approach for the RankRLS algorithm by developing a selection criterion based on leave-query-out cross-validation.

ACKNOWLEDGMENTS

This work has been supported by the Academy of Finland.
[Figure 3: three panels of runtime curves; the x-axis of each panel is the training set size and the y-axis is the running time in CPU seconds.]

Figure 3. Running times in CPU seconds for the proposed greedy RLS method and the low-rank updated LS-SVM of [10], with linear scaling on the y-axis (left) and logarithmic scaling on the y-axis (middle). Running times in CPU seconds for the proposed greedy RLS method alone (right).

REFERENCES

[1] R. Rifkin, "Everything old is new again: A fresh look at historical approaches in machine learning," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, 2002.
[2] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific Pub. Co., Singapore, 2002.
[3] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, pp. 55-67, 1970.
[4] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and T. Salakoski, "Learning to rank with pairwise regularized least-squares," in SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, T. Joachims, H. Li, T.-Y. Liu, and C. Zhai, Eds., 2007, pp. 27-33.
[5] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Boberg, and J. Järvinen, "An efficient algorithm for learning to rank from preference graphs," Machine Learning, vol. 75, no. 1, pp. 129-165, 2009.
[6] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[7] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, pp. 273-324, 1997.
[8] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the Eleventh International Conference on Machine Learning, W. W. Cohen and H. Hirsch, Eds. San Francisco, CA: Morgan Kaufmann Publishers, 1994, pp. 121-129.
[9] E. K. Tang, P. N. Suganthan, and X. Yao, "Gene selection algorithms for microarray data based on least squares support vector machine," BMC Bioinformatics, vol. 7, p. 95, 2006.
[10] F. Ojeda, J. A. Suykens, and B. D. Moor, "Low rank updated LS-SVM classifiers for fast variable selection," Neural Networks, vol. 21, no. 2-3, pp. 437-449, 2008, Advances in Neural Networks Research: IJCNN '07.
[11] R. Rifkin and R. Lippert, "Notes on regularized least squares," Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2007-025, 2007.
[12] T. Pahikkala, A. Airola, and T. Salakoski, "Feature selection for regularized least-squares: New computational short-cuts and fast algorithmic implementations," in Proceedings of the Twentieth IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2010), S. Kaski, D. J. Miller, E. Oja, and A. Honkela, Eds. IEEE, 2010.
[13] T. Pahikkala, A. Airola, P. Naula, and T. Salakoski, "Greedy RankRLS: a linear time algorithm for learning sparse ranking models," in SIGIR 2010 Workshop on Feature Generation and Selection for Information Retrieval, E. Gabrilovich, A. J. Smola, and N. Tishby, Eds. ACM, 2010, pp. 11-18.
[14] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley and Sons, Inc., 1982.
[15] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 515-521.
[16] L. Bottou and C.-J. Lin, "Support vector machine solvers," in Large-Scale Kernel Machines, ser. Neural Information Processing, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 1-28.
[17] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486-494, 1993.
[18] T. Pahikkala, J. Boberg, and T. Salakoski, "Fast n-fold cross-validation for regularized least-squares," in Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence (SCAI 2006), T. Honkela, T. Raiko, J. Kortela, and H. Valpola, Eds. Espoo, Finland: Otamedia, 2006, pp. 83-90.
[19] S. An, W. Liu, and S. Venkatesh, "Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression," Pattern Recognition, vol. 40, no. 8, pp. 2154-2162, 2007.
