
Restricted Boltzmann Machines for Collaborative Filtering

Ruslan Salakhutdinov rsalakhu@cs.toronto.edu


Andriy Mnih amnih@cs.toronto.edu
Geoffrey Hinton hinton@cs.toronto.edu
University of Toronto, 6 King's College Rd., Toronto, Ontario M5S 3G4, Canada

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

Most of the existing approaches to collaborative filtering cannot handle very large data sets. In this paper we show how a class of two-layer undirected graphical models, called Restricted Boltzmann Machines (RBMs), can be used to model tabular data, such as users' ratings of movies. We present efficient learning and inference procedures for this class of models and demonstrate that RBMs can be successfully applied to the Netflix data set, containing over 100 million user/movie ratings. We also show that RBMs slightly outperform carefully-tuned SVD models. When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system.

1. Introduction

A common approach to collaborative filtering is to assign a low-dimensional feature vector to each user and a low-dimensional feature vector to each movie so that the rating that each user assigns to each movie is modeled by the scalar product of the two feature vectors. This means that the N x M matrix of ratings that N users assign to M movies is modeled by the matrix X which is the product of an N x C matrix U whose rows are the user feature vectors and a C x M matrix V' whose columns are the movie feature vectors. The rank of X is C, the number of features assigned to each user or movie.

Low-rank approximations based on minimizing the sum-squared distance can be found using Singular Value Decomposition (SVD). In the collaborative filtering domain, however, most of the data sets are sparse, and as shown by Srebro and Jaakkola (2003), this creates a difficult non-convex problem, so a naive solution is not going to work.[1]

[1] We describe the details of the SVD training procedure in section 7.

In this paper we describe a class of two-layer undirected graphical models that generalize Restricted Boltzmann Machines to modeling tabular or count data (Welling et al., 2005). Maximum likelihood learning is intractable in these models, but we show that learning can be performed efficiently by following an approximation to the gradient of a different objective function called Contrastive Divergence (Hinton, 2002).

2. Restricted Boltzmann Machines (RBMs)

Suppose we have M movies, N users, and integer rating values from 1 to K. The first problem in applying RBMs to movie ratings is how to deal efficiently with the missing ratings. If all N users rated the same set of M movies, we could treat each user as a single training case for an RBM which had M "softmax" visible units symmetrically connected to a set of binary hidden units. Each hidden unit could then learn to model a significant dependency between the ratings of different movies. When most of the ratings are missing, we use a different RBM for each user (see Fig. 1). Every RBM has the same number of hidden units, but an RBM only has visible softmax units for the movies rated by that user, so an RBM has few connections if that user rated few movies. Each RBM only has a single training case, but all of the corresponding

weights and biases are tied together, so if two users have rated the same movie, their two RBMs must use the same weights between the softmax visible unit for that movie and the hidden units. The binary states of the hidden units, however, can be quite different for different users. From now on, to simplify the notation, we will concentrate on getting the gradients for the parameters of a single user-specific RBM. The full gradients with respect to the shared weight parameters can then be obtained by averaging over all N users.

Figure 1. A restricted Boltzmann machine with binary hidden units and softmax visible units. For each user, the RBM only includes softmax units for the movies that user has rated. In addition to the symmetric weights between each hidden unit and each of the K = 5 values of a softmax unit, there are 5 biases for each softmax unit and one for each hidden unit. When modeling user ratings with an RBM that has Gaussian hidden units, the top layer is composed of linear units with Gaussian noise.

Suppose a user rated m movies. Let V be a K x m observed binary indicator matrix with v_i^k = 1 if the user rated movie i as k and 0 otherwise. We also let h_j, j = 1, ..., F, be the binary values of hidden (latent) variables, that can be thought of as representing stochastic binary features that have different values for different users.

2.1. The Model

We use a conditional multinomial distribution (a "softmax") for modeling each column of the observed "visible" binary rating matrix V and a conditional Bernoulli distribution for modeling the "hidden" user features h (see Fig. 1):

p(v_i^k = 1 | h) = \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k)}{\sum_{l=1}^{K} \exp(b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l)}    (1)

p(h_j = 1 | V) = \sigma\Big( b_j + \sum_{i=1}^{m} \sum_{k=1}^{K} v_i^k W_{ij}^k \Big)    (2)

where \sigma(x) = 1/(1 + e^{-x}) is the logistic function, W_{ij}^k is a symmetric interaction parameter between feature j and rating k of movie i, b_i^k is the bias of rating k for movie i, and b_j is the bias of feature j. Note that the b_i^k can be initialized to the logs of their respective base rates over all users.
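As a concrete illustration (not the authors' code), the two conditionals above can be evaluated with a few lines of vectorized Python. The sketch below assumes, for one user, a dense K x m one-hot rating matrix V, a weight array W of shape (m, F, K), visible biases b_vis of shape (m, K), and hidden biases b_hid of shape (F,); these names and layouts are our own assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def p_hidden_given_visible(V, W, b_hid):
        # Eq. 2: p(h_j = 1 | V) = sigma(b_j + sum_{i,k} v_i^k W_ij^k)
        return sigmoid(b_hid + np.einsum('ki,ifk->f', V, W))

    def p_visible_given_hidden(h, W, b_vis):
        # Eq. 1: softmax over the K rating values of each rated movie
        scores = b_vis + np.einsum('f,ifk->ik', h, W)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)       # shape (m, K)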
The marginal distribution over the visible ratings V is:

p(V) = \sum_{h} \frac{\exp(-E(V, h))}{\sum_{V', h'} \exp(-E(V', h'))}    (3)

with an energy term given by:

E(V, h) = - \sum_{i=1}^{m} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ij}^k h_j v_i^k + \sum_{i=1}^{m} \log Z_i - \sum_{i=1}^{m} \sum_{k=1}^{K} v_i^k b_i^k - \sum_{j=1}^{F} h_j b_j    (4)

where Z_i = \sum_{l=1}^{K} \exp(b_i^l + \sum_j h_j W_{ij}^l) is the normalization term that ensures that \sum_{l=1}^{K} p(v_i^l = 1 | h) = 1. The movies with missing ratings do not make any contribution to the energy function.

2.2. Learning

The parameter updates required to perform gradient ascent in the log-likelihood can be obtained from Eq. 3:

\Delta W_{ij}^k = \epsilon \frac{\partial \log p(V)}{\partial W_{ij}^k} = \epsilon \big( \langle v_i^k h_j \rangle_{data} - \langle v_i^k h_j \rangle_{model} \big)    (5)

where \epsilon is the learning rate. The expectation \langle v_i^k h_j \rangle_{data} defines the frequency with which movie i with rating k and feature j are on together when the features are being driven by the observed user-rating data from the training set using Eq. 2, and \langle \cdot \rangle_{model} is an expectation with respect to the distribution defined by the model. The expectation \langle \cdot \rangle_{model} cannot be computed analytically in less than exponential time. MCMC methods (Neal, 1993) can be employed to approximate this expectation. These methods, however, are quite slow and suffer from high variance in their estimates.

To avoid computing \langle \cdot \rangle_{model}, we follow an approximation to the gradient of a different objective function

called Contrastive Divergence (CD) (Hinton, 2002):

\Delta W_{ij}^k = \epsilon \big( \langle v_i^k h_j \rangle_{data} - \langle v_i^k h_j \rangle_T \big)    (6)

The expectation \langle \cdot \rangle_T represents a distribution of samples from running the Gibbs sampler (Eqs. 1, 2), initialized at the data, for T full steps. T is typically set to one at the beginning of learning and increased as the learning converges. By increasing T to a sufficiently large value, it is possible to approximate maximum likelihood learning arbitrarily well (Carreira-Perpinan & Hinton, 2005), but large values of T are seldom needed in practice. When running the Gibbs sampler, we only reconstruct (Eq. 1) the distribution over the non-missing ratings. The approximate gradients of CD with respect to the shared weight parameters of Eq. 6 can then be averaged over all N users.

It was shown (Hinton, 2002) that CD learning is quite efficient and greatly reduces the variance of the estimates used for learning. The learning rule for the biases is just a simplified version of Eq. 6.
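A minimal sketch of one CD update with T = 1 for a single user's RBM, reusing the two helper functions from the earlier sketch and the same assumed array layouts; momentum and weight decay are omitted, and the names are ours rather than the authors'.

    import numpy as np

    def cd1_update(V, W, b_vis, b_hid, rng, lr=0.01):
        # Positive phase: hidden probabilities driven by the data (Eq. 2).
        h_data = p_hidden_given_visible(V, W, b_hid)
        h_sample = (rng.random(h_data.shape) < h_data).astype(float)

        # Negative phase: one Gibbs step, reconstructing only the rated movies (Eq. 1).
        V_recon = p_visible_given_hidden(h_sample, W, b_vis).T   # back to (K, m)
        h_recon = p_hidden_given_visible(V_recon, W, b_hid)

        # Eq. 6: Delta W_ij^k = lr * (<v_i^k h_j>_data - <v_i^k h_j>_T)
        W += lr * (np.einsum('ki,f->ifk', V, h_data) -
                   np.einsum('ki,f->ifk', V_recon, h_recon))
        b_vis += lr * (V - V_recon).T
        b_hid += lr * (h_data - h_recon)
        return W, b_vis, b_hid

Per-user gradients computed this way would be accumulated over a mini-batch of users and then applied to the shared weights, as described above.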

2.3. Making Predictions

Given the observed ratings V, we can predict a rating for a new query movie q in time linear in the number of hidden units:

p(v_q^k = 1 | V) \propto \sum_{h_1,...,h_F} \exp(-E(v_q^k, V, h))    (7)
\propto \Gamma_q^k \prod_{j=1}^{F} \sum_{h_j \in \{0,1\}} \exp\Big( \sum_{il} v_i^l h_j W_{ij}^l + v_q^k h_j W_{qj}^k + h_j b_j \Big)
= \Gamma_q^k \prod_{j=1}^{F} \Big( 1 + \exp\Big( \sum_{il} v_i^l W_{ij}^l + v_q^k W_{qj}^k + b_j \Big) \Big)

where \Gamma_q^k = \exp(v_q^k b_q^k). Once we obtain unnormalized scores, we can either pick the rating with the maximum score as our prediction, or perform normalization over the K values to get probabilities p(v_q = k | V) and take the expectation E[v_q] as our prediction. The latter method works better.

When asked to predict ratings for n movies q_1, q_2, ..., q_n, we can also compute

p(v_{q_1}^{k_1} = 1, v_{q_2}^{k_2} = 1, ..., v_{q_n}^{k_n} = 1 | V)    (8)

This, however, requires us to make K^n evaluations for each user.

Alternatively, we can perform one iteration of the mean field updates to get the probability distribution over the K ratings of a movie q:

\hat{p}_j = p(h_j = 1 | V) = \sigma\Big( b_j + \sum_{i=1}^{m} \sum_{k=1}^{K} v_i^k W_{ij}^k \Big)    (9)

p(v_q^k = 1 | \hat{p}) = \frac{\exp\big( b_q^k + \sum_{j=1}^{F} \hat{p}_j W_{qj}^k \big)}{\sum_{l=1}^{K} \exp\big( b_q^l + \sum_{j=1}^{F} \hat{p}_j W_{qj}^l \big)}    (10)

and take an expectation as our prediction. In our experience, Eq. 7 makes slightly more accurate predictions, although one iteration of the mean field equations is considerably faster. We use the mean field method in the experiments described below.
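A sketch of the mean-field prediction of Eqs. 9 and 10, with Wq (shape (F, K)) and bq (shape (K,)) denoting the assumed weight slice and biases of the query movie q, and sigmoid reused from the earlier sketch:

    import numpy as np

    def predict_rating(V, W, b_hid, Wq, bq):
        # Eq. 9: deterministic hidden probabilities given the observed ratings.
        p_hat = sigmoid(b_hid + np.einsum('ki,ifk->f', V, W))    # shape (F,)

        # Eq. 10: softmax over the K rating values of the query movie.
        scores = bq + p_hat @ Wq
        scores -= scores.max()
        probs = np.exp(scores)
        probs /= probs.sum()

        # Expected rating over the values 1..K as the final prediction.
        return probs @ np.arange(1, len(probs) + 1)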
3. RBMs with Gaussian Hidden Units

We can also model hidden user features h as Gaussian latent variables (Welling et al., 2005). This model represents an undirected counterpart of pLSI (Hofmann, 1999):

p(v_i^k = 1 | h) = \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k)}{\sum_{l=1}^{K} \exp(b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l)}

p(h_j = h | V) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\Big( -\frac{\big( h - b_j - \sigma_j \sum_{ik} v_i^k W_{ij}^k \big)^2}{2\sigma_j^2} \Big)

where \sigma_j^2 is the variance of hidden unit j.

The marginal distribution over the visible units V is given by Eq. 3, with an energy term:

E(V, h) = - \sum_{ijk} W_{ij}^k h_j v_i^k + \sum_i \log Z_i - \sum_{ik} v_i^k b_i^k + \sum_j \frac{(h_j - b_j)^2}{2\sigma_j^2}    (11)

We fix the variances at \sigma_j^2 = 1 for all hidden units j, in which case the parameter updates are the same as those defined in Eq. 6.

4. Conditional RBMs

Suppose that we add w to each of the K weights from the K possible ratings to each hidden feature and subtract w from the bias of the hidden feature. So long as one of the K ratings is present, this does not have any effect on the behaviour of the hidden or visible units because the softmax is over-parameterized. If, however, the rating is missing, there is an effect of w on the total input to the hidden feature. So by using the over-parameterization of the softmax, the RBM can learn to use missing ratings to influence its hidden features, even though it does not try to reconstruct these

missing ratings and it does not perform any computations that scale with the number of missing ratings.

Figure 2. Conditional RBM. The binary vector r, indicating rated/unrated movies, affects binary states of the hidden units.

There is a more subtle source of information in the Netflix database that cannot be captured by the standard multinomial RBM. Netflix tells us in advance which user/movie pairs occur in the test set, so we have a third category: movies that were viewed but for which the rating is unknown. This is a valuable source of information about users who occur several times in the test set, especially if they only gave a small number of ratings in the training set. If, for example, a user is known to have rated "Rocky 5", we already have a good bet about the kinds of movies he likes.

The conditional RBM model takes this extra information into account. Let r \in \{0, 1\}^M be a binary vector of length M (the total number of movies), indicating which movies the user rated (even if these ratings are unknown). The idea is to define a joint distribution over (V, h) conditional on r. In the proposed conditional model, a vector r will affect the states of the hidden units (see Fig. 2):

p(v_i^k = 1 | h) = \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k)}{\sum_{l=1}^{K} \exp(b_i^l + \sum_{j=1}^{F} h_j W_{ij}^l)}

p(h_j = 1 | V, r) = \sigma\Big( b_j + \sum_{i=1}^{m} \sum_{k=1}^{K} v_i^k W_{ij}^k + \sum_{i=1}^{M} r_i D_{ij} \Big)

where D_{ij} is an element of a learned matrix that models the effect of r on h. Learning D using CD is similar to learning the biases and takes the form:

\Delta D_{ij} = \epsilon \big( \langle h_j \rangle_{data} - \langle h_j \rangle_T \big) r_i    (12)

We could instead define an arbitrary nonlinear function f(r | \theta). Provided f is differentiable with respect to \theta, we could use backpropagation to learn \theta:

\Delta \theta = \epsilon \big( \langle h_j \rangle_{data} - \langle h_j \rangle_T \big) \frac{\partial f(r | \theta)}{\partial \theta}    (13)

In particular, f(r | \theta) can be parameterized as a multi-layer neural network.

Conditional RBM models have been successfully used for modeling temporal data, such as motion capture data (Taylor et al., 2006), or video sequences (Sutskever & Hinton, 2006). For the Netflix task, conditioning on a vector of rated/unrated movies proves to be quite helpful: it significantly improves performance.
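The only change relative to Eq. 2 is the extra term over r in the total input to each hidden unit. A sketch in our own notation, with r a length-M 0/1 vector and D of assumed shape (M, F); the commented line shows the corresponding Eq. 12 update:

    import numpy as np

    def p_hidden_conditional(V, W, b_hid, r, D):
        # r marks every movie the user has seen, rated or not; D models its effect on h.
        return sigmoid(b_hid + np.einsum('ki,ifk->f', V, W) + r @ D)

    # Eq. 12, given data-driven and reconstructed hidden probabilities:
    # D += lr * np.outer(r, h_data - h_recon)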
Instead of using a conditional RBM, we can impute the missing ratings from the ordinary RBM model. Suppose a user rated a movie t, but his/her rating is missing (i.e. it was provided as a part of the test set). We can initialize v_t to the base rate of movie t, and compute the gradient of the log-probability of the data with respect to this input (Eq. 3). The CD learning takes the form:

\Delta v_t^k = \epsilon \Big( \big\langle \sum_j W_{tj}^k h_j \big\rangle_{data} - \big\langle \sum_j W_{tj}^k h_j \big\rangle_T \Big)

After updating v_t^k, for k = 1, ..., K, the values v_t^k are renormalized to obtain a probability distribution over the K values. The imputed values v_t will now contribute to the energy term of Eq. 4 and will affect the states of the hidden units. Imputing missing values by following an approximate gradient of CD works quite well on a small subset of the Netflix data set, but is slow for the complete data set. Alternatively, we can use the mean field equations, Eqs. 9 and 10, to impute the missing values. The imputed values will be quite noisy, especially at the early stages of training. Nevertheless, in our experiments, the model performance was significantly improved by using imputations and was comparable to the performance of the conditional RBM.
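A rough sketch of one CD-style imputation step for a single viewed-but-unrated movie t, under the same assumed layouts (Wt is the (F, K) weight slice of movie t, and v_t is initialized to the movie's base rates); the clipping before renormalization is our own safeguard, not something specified in the paper:

    import numpy as np

    def impute_step(v_t, Wt, h_data, h_recon, lr=0.01):
        # Delta v_t^k = lr * (<sum_j W_tj^k h_j>_data - <sum_j W_tj^k h_j>_T)
        v_t = v_t + lr * (h_data @ Wt - h_recon @ Wt)     # shape (K,)
        # Renormalize to keep a probability distribution over the K ratings.
        v_t = np.clip(v_t, 1e-8, None)
        return v_t / v_t.sum()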
Figure 3. Performance of various models on the validation data. Left panel: RBM vs. RBM with Gaussian hidden units. Middle panel: RBM vs. conditional RBM. Right panel: conditional RBM vs. conditional factored RBM. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes through the entire training dataset.

5. Conditional Factored RBMs

One disadvantage of the RBM models we have described so far is that their current parameterization of W \in R^{M \times K \times F} results in a large number of free parameters. In our current implementation, with F = 100 (the number of hidden units), M = 17770, and K = 5, we end up with about 9 million free parameters. By using proper weight-decay to regularize the model, we are still able to avoid serious overfitting. However, if we increase the number of hidden features or the number of movies,[2] learning this huge parameter matrix W becomes problematic. Reducing the number of free parameters by simply reducing the number of hidden units does not lead to a good model because the model cannot express enough information about each user in its hidden state.

[2] Netflix's own database contains about 65000 movie titles.

We address this problem by factorizing the parameter matrix W into a product of two lower-rank matrices A and B. In particular:

W_{ij}^k = \sum_{c=1}^{C} A_{ic}^k B_{cj}    (14)

where typically C \ll M and C \ll F. For example, setting C = 30, we reduce the number of free parameters by a factor of three. We call this model a factored RBM. Learning the matrices A and B is quite similar to learning W of Eq. 6:

\Delta A_{ic}^k = \epsilon \Big( \big\langle \big( \sum_j B_{cj} h_j \big) v_i^k \big\rangle_{data} - \big\langle \big( \sum_j B_{cj} h_j \big) v_i^k \big\rangle_T \Big)

\Delta B_{cj} = \epsilon \Big( \big\langle \big( \sum_{ik} A_{ic}^k v_i^k \big) h_j \big\rangle_{data} - \big\langle \big( \sum_{ik} A_{ic}^k v_i^k \big) h_j \big\rangle_T \Big)

In our experimental results section we show that a conditional factored RBM converges considerably faster than a conditional unfactored RBM.
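Under the factorization of Eq. 14 the full weight tensor never has to be formed explicitly. A sketch with assumed shapes A: (M, K, C) and B: (C, F), computing the total input to the hidden units for one user directly from the factors:

    import numpy as np

    M, K, C, F = 17770, 5, 30, 500
    rng = np.random.default_rng(0)
    A = 0.01 * rng.standard_normal((M, K, C))
    B = 0.01 * rng.standard_normal((C, F))

    def hidden_input_factored(V, movie_ids, b_hid):
        # sum_{i,k} v_i^k W_ij^k = sum_c (sum_{i,k} v_i^k A_ic^k) B_cj   (Eq. 14)
        s = np.einsum('ki,ikc->c', V, A[movie_ids])   # shape (C,)
        return b_hid + s @ B                          # shape (F,)

With these settings, A and B together hold roughly 2.7 million parameters, compared with about 44 million for an unfactored W with F = 500.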
6. Experimental Results

6.1. Description of the Netflix Data

According to Netflix, the data were collected between October, 1998 and December, 2005 and represent the distribution of all ratings Netflix obtained during this period. The training data set consists of 100,480,507 ratings from 480,189 randomly-chosen, anonymous users on 17,770 movie titles. As part of the training data, Netflix also provides validation data, containing 1,408,395 ratings. In addition to the training and validation data, Netflix also provides a test set containing 2,817,131 user/movie pairs with the ratings withheld. The pairs were selected from the most recent ratings from a subset of the users in the training data set, over a subset of the movies. To reduce the unintentional fine-tuning on the test set that plagues many empirical comparisons in the machine learning literature, performance is assessed by submitting predicted ratings to Netflix, who then post the root mean squared error (RMSE) on an unknown half of the test set. As a baseline, Netflix provided the score of its own system trained on the same data, which is 0.9514.

6.2. Details of RBM Training

We train the RBM with F = 100, and the conditional factored RBM with F = 500 and C = 30. To speed up the training, we subdivided the Netflix dataset into small mini-batches, each containing 1000 cases (users), and updated the weights after each mini-batch. All models were trained for between 40 and 50 passes (epochs) through the entire training dataset.

The weights were updated using a learning rate of 0.01/batch-size, momentum of 0.9, and a weight decay of 0.001. The weights were initialized with small random values sampled from a zero-mean normal distribution with standard deviation 0.01. CD learning was started with T = 1 and increased in small steps during training.
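For concreteness, the training schedule above can be summarized as a small configuration block; the field names are illustrative (not from the authors' code), and the CD schedule values beyond T = 1 are taken from the annotations in Figure 3.

    # Hyper-parameters reported for RBM training (one update per mini-batch of users).
    rbm_training_config = {
        "num_hidden": 100,              # F = 500 for the conditional factored model
        "minibatch_size": 1000,         # users per weight update
        "epochs": (40, 50),             # passes through the full training set
        "learning_rate": 0.01 / 1000,   # 0.01 / batch-size
        "momentum": 0.9,
        "weight_decay": 0.001,
        "init_std": 0.01,               # zero-mean Gaussian initialization
        "cd_steps": [1, 3, 5, 9],       # T increased in small steps during training
    }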
6.3. Results

We compare different models based on their performance on the validation set. The error that Netflix reports on the test set is typically larger than the error we get on the validation set by about 0.0014. When the validation set is added to the training set, RMSE on the test set is typically reduced by about 0.005.

Figure 3 (left panel) shows performance of the RBM and the RBM with Gaussian hidden units. The y-axis displays RMSE, and the x-axis shows the number of epochs. Clearly, the nonlinear model substantially outperforms its linear counterpart. Figure 3 (middle panel) also reveals that conditioning on rated/unrated information significantly improves model performance. It also shows (right panel) that, when using a conditional RBM, factoring the weight matrix leads to much faster convergence.
Figure 4. Performance of the conditional factored RBM vs. SVD with C = 40 factors. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes through the entire training dataset.

7. Singular Value Decomposition (SVD)

SVD seeks a low-rank matrix X = UV', where U \in R^{N \times C} and V \in R^{M \times C}, that minimizes the sum-squared distance to the fully observed target matrix Y. The solution is given by the leading singular vectors of Y. In the collaborative filtering domain, most of the entries in Y will be missing, so the sum-squared distance is minimized with respect to the partially observed entries of the target matrix Y. Unobserved entries of Y are then predicted using the corresponding entries of X.

Let X = UV', where U \in R^{N \times C} and V \in R^{M \times C} denote the low-rank approximation to the partially observed target matrix Y \in R^{N \times M}. Matrices U and V are initialized with small random values sampled from a zero-mean normal distribution with standard deviation 0.01. We minimize the following objective function:

f = \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \big( u_i v_j' - Y_{ij} \big)^2 + \lambda \sum_{ij} I_{ij} \big( \| u_i \|_{Fro}^2 + \| v_j \|_{Fro}^2 \big)    (15)

where \| \cdot \|_{Fro}^2 denotes the Frobenius norm, and I_{ij} is the indicator function, taking on value 1 if user i rated movie j, and 0 otherwise. We then perform gradient descent in U and V to minimize the objective function of Eq. 15.
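A sketch of one mini-batch gradient step on the objective of Eq. 15, with illustrative names; momentum is omitted, and the constant factor from differentiating the squared term is absorbed into the learning rate.

    import numpy as np

    def svd_minibatch_step(U, V, users, movies, ratings, lr=0.005, lam=0.01):
        # users, movies, ratings: parallel arrays for one mini-batch of observed entries.
        err = np.sum(U[users] * V[movies], axis=1) - ratings   # u_i . v_j - Y_ij
        grad_U = err[:, None] * V[movies] + lam * U[users]
        grad_V = err[:, None] * U[users] + lam * V[movies]
        np.add.at(U, users, -lr * grad_U)   # handles repeated users within the batch
        np.add.at(V, movies, -lr * grad_V)
        return U, V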
To speed up the training, we subdivided the Netflix data into mini-batches of size 100,000 (user/movie pairs), and updated the weights after each mini-batch. The weights were updated using a learning rate of 0.005, momentum of 0.9, and regularization parameter \lambda = 0.01. Regularization, particularly for the Netflix dataset, makes quite a significant difference in model performance. We also experimented with various values of C and report the results with C = 40, since it resulted in the best model performance on the validation set. Values of C in the range of [20, 60] also give similar results.

We compared the conditional factored RBM with an SVD model (see Fig. 4). The conditional factored RBM slightly outperforms SVD, but not by much. Both models could potentially be improved by more careful tuning of learning rates, batch sizes, and weight-decay. More importantly, the errors made by various versions of the RBM are significantly different from the errors made by various versions of SVD, so linearly combining the predictions of several different versions of each method, using coefficients tuned on the validation data, produces an error rate that is well over 6% better than Netflix's own baseline score.
8. Future extensions

There are several extensions to our model that we are currently pursuing.

8.1. Learning Autoencoders

An alternative way of using an RBM is to treat this learning as a pretraining stage that finds a good region of the parameter space (Hinton & Salakhutdinov, 2006). After pretraining, the RBM is "unrolled" as shown in Figure 5 to create an autoencoder network in which the stochastic activities of the binary "hidden" features are replaced by deterministic, real-valued probabilities. Backpropagation, using the squared error objective function, is then used to fine-tune the weights for optimal reconstruction of each user's ratings. However, overfitting becomes an issue and more careful model regularization is required.

Figure 5. The unrolled RBM used to create an autoencoder network which is then fine-tuned using backpropagation of error derivatives.
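A sketch of the unrolled forward pass: the encoder uses the RBM weights and the deterministic hidden probabilities of Eq. 2, the decoder reconstructs the user's ratings through the tied weights, and the squared reconstruction error is what backpropagation would minimize. Shapes and names follow the earlier sketches and are our assumptions.

    import numpy as np

    def autoencoder_forward(V, W, b_hid, b_vis):
        # Encoder: deterministic real-valued probabilities replace stochastic binary states.
        h = sigmoid(b_hid + np.einsum('ki,ifk->f', V, W))
        # Decoder: softmax reconstruction of the rated movies with tied weights.
        scores = b_vis + np.einsum('f,ifk->ik', h, W)
        scores -= scores.max(axis=1, keepdims=True)
        V_hat = np.exp(scores)
        V_hat /= V_hat.sum(axis=1, keepdims=True)
        # Squared-error objective to be minimized by backpropagation.
        loss = np.sum((V_hat - V.T) ** 2)
        return V_hat, loss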

8.2. Learning Deep Generative Models

Recently, Hinton et al. (2006) derived a way to perform fast, greedy learning of deep belief networks one layer at a time, with the top two layers forming an undirected bipartite graph which acts as an associative memory.

The learning procedure consists of training a stack of RBMs, each having only one layer of latent (hidden) feature detectors. The learned feature activations of one RBM are used as the data for training the next RBM in the stack.

An important aspect of this layer-wise training procedure is that, provided the number of features per layer does not decrease, each extra layer increases a lower bound on the log probability of the data. So layer-by-layer training can be recursively applied several times[3] to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below.

[3] In fact, one can proceed learning recursively for as many layers as desired.

Learning multi-layer models has been successfully applied in the domain of dimensionality reduction (Hinton & Salakhutdinov, 2006), with the resulting models significantly outperforming Latent Semantic Analysis, a well-known document retrieval method based on SVD (Deerwester et al., 1990). It has also been used for modeling temporal data (Taylor et al., 2006; Sutskever & Hinton, 2006) and learning nonlinear embeddings (Salakhutdinov & Hinton, 2007). We are currently exploring this kind of learning for the Netflix data. For classification of the MNIST digits, deep networks reduce the error significantly (Hinton & Salakhutdinov, 2006), and our hope is that they will be similarly helpful for the Netflix data.

9. Summary and Discussion

We introduced a class of two-layer undirected graphical models (RBMs), suitable for modeling tabular or count data, and presented efficient learning and inference procedures for this class of models. We also demonstrated that RBMs can be successfully applied to a large dataset containing over 100 million user/movie ratings.

A variety of models have recently been proposed for minimizing the loss corresponding to a specific probabilistic model (Hofmann, 1999; Canny, 2002; Marlin & Zemel, 2004). All these probabilistic models can be viewed as graphical models in which hidden factor variables have directed connections to variables that represent user ratings. Their major drawback (Welling et al., 2005) is that exact inference is intractable due to explaining away, so they have to resort to slow or inaccurate approximations to compute the posterior distribution over hidden factors.

Instead of constraining the rank or dimensionality of the factorization X = UV', i.e. the number of factors, Srebro et al. (2004) proposed constraining the norms of U and V. This problem formulation, termed Maximum Margin Matrix Factorization, could be seen as constraining the overall "strength" of the factors rather than their number. However, learning MMMF requires solving a sparse semi-definite program (SDP). Generic SDP solvers run into difficulties with more than about 10,000 observations (user/movie pairs), so

direct gradient-based optimization methods have been proposed in an attempt to make MMMF scale up to larger problems. The Netflix data set, however, contains over 100 million observations, and none of the above-mentioned approaches can easily deal with such large data sets.

Acknowledgments

We thank Vinod Nair, Tijmen Tieleman and Ilya Sutskever for many helpful discussions. We thank Netflix for making such nice data freely available and for providing a free and rigorous model evaluation service.

References

Canny, J. F. (2002). Collaborative filtering with privacy via factor analysis. SIGIR (pp. 238-245). ACM.

Carreira-Perpinan, M., & Hinton, G. (2005). On contrastive divergence learning. 10th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2005).

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41, 391-407.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771-1800.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Hofmann, T. (1999). Probabilistic latent semantic analysis. Proceedings of the 15th Conference on Uncertainty in AI (pp. 289-296). San Francisco, California: Morgan Kaufmann.

Marlin, B., & Zemel, R. S. (2004). The multiple multiplicative factor model for collaborative filtering. Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004. ACM.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Technical Report CRG-TR-93-1). Department of Computer Science, University of Toronto.

Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.

Srebro, N., & Jaakkola, T. (2003). Weighted low-rank approximations. Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA (pp. 720-727). AAAI Press.

Srebro, N., Rennie, J. D. M., & Jaakkola, T. (2004). Maximum-margin matrix factorization. Advances in Neural Information Processing Systems.

Sutskever, I., & Hinton, G. E. (2006). Learning multilevel distributed representations for high-dimensional sequences (Technical Report UTML TR 2006-003). Dept. of Computer Science, University of Toronto.

Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2006). Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems. MIT Press.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. NIPS 17 (pp. 1481-1488). Cambridge, MA: MIT Press.
