May 5, 2011
Abstract

This status report describes the ideas and experiments that we have performed, or are currently performing, on the KDD Cup 2011 dataset. The dataset is the biggest of its kind, with some unique features such as hierarchical relations among items, different types of items, and dates/timestamps for ratings. We have implemented several variants of matrix factorization approaches. We are currently at 25.0426 RMSE on the test set using the Alternating Least Squares approach, which places us at 92nd position on the leaderboard. After parallelizing the training, one epoch takes roughly 200-400 seconds.
Contents

1 Introduction
2 Dataset
3 Experiments and Results
  3.1 Notation
  3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
  3.3 Sigmoid based Matrix Factorization (SMF)
    3.3.1 Adding Temporal Term
  3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
  3.5 Alternating least squares based Matrix Factorization (ALS)
    3.5.1 Adding Temporal Term
  3.6 Latent Feature log linear model
  3.7 Neighborhood Based Correction
  3.8 Results
    3.8.1 Timing Information
4 Ideas for further exploration
5 Parallelism
  5.1 Alternating update and grouping strategy
  5.2 Joint SGD Update by grouping strategy
6 Software
1 Introduction
This report investigates different collaborative filtering methods on the KDD Cup 2011 dataset. The dataset was provided by Yahoo! and was collected from their music service. It is the biggest of its kind, which restricts our choice of algorithms to the ones that scale. Apart from the typical (user, item, rating) triplets, there is hierarchical information among the items (tracks/albums/artists/genres) and time stamps, both of which need to be exploited. So far, we have been able to parallelize several variants of matrix factorization approaches and run them in the order of minutes per epoch. We have also analyzed the dataset in terms of the types of items and found that there is significant overfitting for tracks and albums. Furthermore, on the validation set we found that the majority of the error is on items that are rated fewer times. The rest of the report contains our current progress and a description of the dataset.
2 Dataset
The KDD Cup 2011 competition has two tracks. This report presents the experiments we performed on the Track 1 dataset. The statistics for the dataset are presented in Table 1. The ratings range from 0-100 and the dates range over roughly [0-5xxx] days. There is also session information present in the dataset along with the days.
[Figure: dataset record fields — USER ID, NO: RATINGS, ITEM IDS, DAY, TIME STAMP]
#Users              1,000,990
#Items                624,961
#Ratings          262,810,175
#TrainRatings     252,800,275
#ValidationRatings  4,003,960
#TestRatings        6,005,940

Table 1. KDD Cup 2011 Track 1 Dataset.
[Figure: histogram of ratings (Counts vs. Rating, 0-100)]
[Figure: example item hierarchy — artists, their albums, and the albums' tracks]
[Figure: number of items and number of ratings by item type (Genre, Artist, Album, Track)]
[Figure: histograms of log(#ratings) by item type, e.g. Counts vs. log(#ratings of Tracks)]
3 Experiments and Results

3.1 Notation

Throughout, $U_u$ denotes the latent feature vector for user $u$, $I_i$ the latent feature vector for item $i$, $r_{u,i}$ the observed rating, $\eta$ the learning rate, and $\lambda$ the regularization parameter.

3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
The objective function that we are minimizing here is the squared loss. The updates for the U and I matrices are simultaneous, and we fix the first column of U and the second column of I to 1, so that the second column of U and the first column of I can be interpreted as the bias terms. We calculate the RMSE on the validation set after each epoch using the trained U and I matrices, and terminate either when the epoch limit has been reached or when the RMSE diverges.

Objective Function: $E = \sum_{(u,i)} (r_{u,i} - U_u I_i^T)^2 + \lambda (\|U_u\|^2 + \|I_i\|^2)$

Optimization Type: SGD

Derivative with respect to each example:
$\frac{\partial}{\partial U_{uk}} (r_{u,i} - U_u I_i^T)^2 = -2 (r_{u,i} - U_u I_i^T) I_{ik}$
$\frac{\partial}{\partial I_{ik}} (r_{u,i} - U_u I_i^T)^2 = -2 (r_{u,i} - U_u I_i^T) U_{uk}$

Update Rule:
$U_{uk} \leftarrow U_{uk} - \eta ((\nabla_{U_{uk}} E) + \lambda U_{uk})$
$I_{ik} \leftarrow I_{ik} - \eta ((\nabla_{I_{ik}} E) + \lambda I_{ik})$
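As a concrete sketch, the update rule above can be written as a short Python loop (the function names and the toy-sized data are illustrative; our actual implementation is in C++):

```python
def brismf_epoch(ratings, U, I, lr=0.001, reg=0.001):
    """One SGD epoch over (user, item, rating) triplets.

    Following the text, U[u][0] and I[i][1] are held at 1 so that
    I[i][0] and U[u][1] act as the item and user bias terms.
    """
    for u, i, r in ratings:
        err = r - sum(a * b for a, b in zip(U[u], I[i]))
        for k in range(len(U[u])):
            uk, ik = U[u][k], I[i][k]
            if k != 0:                       # first column of U is fixed to 1
                U[u][k] += lr * (2 * err * ik - reg * uk)
            if k != 1:                       # second column of I is fixed to 1
                I[i][k] += lr * (2 * err * uk - reg * ik)

def rmse(ratings, U, I):
    """Root mean squared error of the current U, I on a rating set."""
    se = sum((r - sum(a * b for a, b in zip(U[u], I[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5
```

In the full system the same loop runs once per epoch over all training ratings, with the validation RMSE checked after each epoch as described above.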
3.3 Sigmoid based Matrix Factorization (SMF)
The motivation for SMF is to keep the predicted rating in the range [0, 100]. Here the objective function is similar to BRISMF. There are two ways of parallelizing SGD, both of which are discussed in the Parallelism section.

Objective Function: $E = \sum_{(u,i)} (r_{u,i} - 100\,\sigma(U_u I_i^T))^2 + \lambda (\|U_u\|^2 + \|I_i\|^2)$

Optimization Type: SGD

Derivative with respect to each example:
$\frac{\partial}{\partial U_{uk}} (r_{u,i} - 100\,\sigma(U_u I_i^T))^2 = -2 (r_{u,i} - 100\,\sigma(U_u I_i^T)) \cdot 100\,\sigma(U_u I_i^T)(1 - \sigma(U_u I_i^T))\, I_{ik}$
$\frac{\partial}{\partial I_{ik}} (r_{u,i} - 100\,\sigma(U_u I_i^T))^2 = -2 (r_{u,i} - 100\,\sigma(U_u I_i^T)) \cdot 100\,\sigma(U_u I_i^T)(1 - \sigma(U_u I_i^T))\, U_{uk}$

Update Rule:
$U_{uk} \leftarrow U_{uk} - \eta ((\nabla_{U_{uk}} E) + \lambda U_{uk})$
$I_{ik} \leftarrow I_{ik} - \eta ((\nabla_{I_{ik}} E) + \lambda I_{ik})$

3.3.1 Adding Temporal Term

The objective function after adding the temporal term is $E = \sum (r_{u,i} - 100\,\sigma(\sum_k U_{uk} I_{ik} T_{tk}))^2 + \lambda (\|U_u\|^2 + \|I_i\|^2)$. The rest of the derivation is similar to the previous section.
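The single-example gradient for the sigmoid objective can be sketched as below (pure Python, illustrative names); a finite-difference check is an easy way to validate such derivations:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smf_grad(r, Uu, Ii):
    """Gradients of (r - 100*sigmoid(Uu . Ii))^2 with respect to Uu and Ii
    for a single (user, item, rating) example."""
    s = sigmoid(sum(a * b for a, b in zip(Uu, Ii)))
    # shared factor: -2 * error * d(100*sigmoid)/d(dot product)
    common = -2.0 * (r - 100.0 * s) * 100.0 * s * (1.0 - s)
    return [common * ik for ik in Ii], [common * uk for uk in Uu]
```

Comparing this against a central finite difference of the loss confirms the derivative given above.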
3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
The algorithm is similar to SMF except for two key differences:

a) SGD training proceeds in a hierarchical fashion as shown below; here we use an alternating training method instead of simultaneous updates.

b) A regularization term makes $I_i$ similar for the items within the same hierarchy, motivated by the fact that users tend to rate items in the same hierarchy similarly; for example, the ratings for a track and its corresponding album tend to be close.

for each epoch:
    Update U using I
    Update $I_i$ using U ($i \in$ Genres)
    Update $I_i$ using U ($i \in$ Artists)
    Update $I_i$ using U ($i \in$ Albums), with regularization by $\lambda (I_i - I_{artist(i)})$
    Update $I_i$ using U ($i \in$ Tracks), with regularization by $\lambda (I_i - I_{album(i)})$
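The hierarchical regularizer in step b) amounts to adding a pull toward the parent's latent vector inside each item update; a minimal sketch (function and argument names are illustrative):

```python
def hier_item_step(Ii, Iparent, grad, lr, mu):
    """One SGD step for an item's latent vector with the hierarchical term:
    the data gradient `grad` plus mu * (Ii - Iparent), which pulls a track
    toward its album (or an album toward its artist)."""
    return [x - lr * (g + mu * (x - p)) for x, g, p in zip(Ii, grad, Iparent)]
```

With a zero data gradient the step simply shrinks the gap between the item and its parent, which is exactly the effect the regularizer is meant to have.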
3.5 Alternating least squares based Matrix Factorization (ALS)
This method was first presented in [4]. The main differences compared to the previously discussed methods are: a) the update rule for $U_u$ or $I_i$ is the least squares solution, and b) the regularization parameter is multiplied by the number of ratings for that user ($n_u$) or item ($n_i$).

Objective Function: $E = \sum_{(u,i)} (r_{u,i} - U_u I_i^T)^2 + \lambda (n_u \|U_u\|^2 + n_i \|I_i\|^2)$

Least squares solution for $U_u$: $(M_{I(u)} M_{I(u)}^T + \lambda n_u E)\, U_u = V_u$, where $M_{I(u)}$ is the submatrix of I whose columns are chosen based on the items that user u has rated, E is the identity matrix, and $V_u = M_{I(u)} R^T(u, I(u))$.

Optimization Type: LS

Update Rule: $U_u \leftarrow A_u^{-1} V_u$, where $A_u = M_{I(u)} M_{I(u)}^T + \lambda n_u E$; $I_i \leftarrow B_i^{-1} Y_i$, with a derivation similar to that for $U_u$.

3.5.1 Adding Temporal Term
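A per-user ALS solve can be sketched in pure Python as below (small-k Gaussian elimination for illustration; names are ours, and a production implementation would use a proper linear-algebra routine):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def als_user_update(u_items, u_ratings, I, lam):
    """ALS update for one user: solve (M M^T + lam*n_u*E) U_u = V_u,
    where M stacks the latent vectors of the items this user rated."""
    k = len(I[0])
    n_u = len(u_items)
    A = [[lam * n_u if a == b else 0.0 for b in range(k)] for a in range(k)]
    v = [0.0] * k
    for i, r in zip(u_items, u_ratings):
        for a in range(k):
            v[a] += I[i][a] * r
            for b in range(k):
                A[a][b] += I[i][a] * I[i][b]
    return solve(A, v)
```

The item update is symmetric: the same solve with the roles of U and I swapped and $n_i$ in place of $n_u$.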
The objective function after adding the temporal term is $E = \sum (r_{u,i} - \sum_k U_{uk} I_{ik} T_{tk})^2 + \lambda (n_u \|U_u\|^2 + n_i \|I_i\|^2)$. The rest of the derivation is similar to the previous section. We first learn the U and I matrices by fixing all elements of T to 1; T is estimated at the end, after U and I have been estimated.
3.6 Latent Feature log linear model
In the LFL model [1] we restrict the output ratings to the set $R_c = \{0, 10, 20, 30, \ldots, 100\}$, each corresponding to one of the classes $c \in \{0, \ldots, 10\}$, and learn latent features for each rating class. We fix $U^0$ for the base class.

Objective Function: $E = \sum_{u,i} \left(r_{u,i} - \sum_c R_c\, p(c \mid u, i)\right)^2$, where $p(c \mid u, i) \propto \exp\left(\sum_k U^c_{uk} I^c_{ik}\right)$
Another scheme to cut down on parameters is to keep a single $U_u$ for all rating classes. Although we have implemented this scheme, we have skipped running experiments with it, since we did not see a significant difference between the schemes in the initial runs.
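One plausible readout for such a model is the expected rating under the per-class softmax; the sketch below assumes a simple dot-product score per class and is not necessarily the exact scheme in [1]:

```python
import math

def lfl_expected_rating(Uu, Ii, classes):
    """Softmax over per-class scores Uu[c] . Ii[c], then the expectation
    of the class ratings {0, 10, ..., 100} under that distribution."""
    scores = [sum(a * b for a, b in zip(Uu[c], Ii[c]))
              for c in range(len(classes))]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(r * e for r, e in zip(classes, exps)) / z
```

With equal scores the prediction is the mean of the class ratings; raising the score of a class pulls the expectation toward that class's rating.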
3.7 Neighborhood Based Correction
We use the method in [3], which is a post-processing step after learning the latent features for users and items. The NB correction is as follows:

$r^c_{ui} = r^p_{ui} + \gamma\, \frac{\sum_{j, j \neq i} sim(item_i, item_j)(r_{uj} - r^p_{uj})}{\sum_{j, j \neq i} sim(item_i, item_j)}$

where $r^c_{ui}$ is the corrected rating for user u and item i, $r^p_{ui}$ is the predicted rating, and sim is the similarity metric. $\gamma$ is learned through regression on the validation set. The summation over j runs over all items the user has rated in the training set.

$sim(item_i, item_j) = \max\left\{0,\ \frac{\sum_k I_{ik} I_{jk}}{\sqrt{\sum_k I_{ik}^2}\,\sqrt{\sum_k I_{jk}^2}}\right\}$
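The correction and similarity above can be sketched as follows (pure Python, illustrative names):

```python
import math

def item_sim(Ii, Ij):
    """Nonnegative cosine similarity between two item latent vectors."""
    dot = sum(a * b for a, b in zip(Ii, Ij))
    ni = math.sqrt(sum(a * a for a in Ii))
    nj = math.sqrt(sum(b * b for b in Ij))
    return max(0.0, dot / (ni * nj))

def nb_correct(pred_ui, i, rated, preds, I, gamma):
    """r^c = r^p + gamma * sum_j sim(i,j)*(r_uj - r^p_uj) / sum_j sim(i,j).

    `rated` maps item j -> the user's training rating r_uj;
    `preds` maps item j -> the model's prediction r^p_uj."""
    num = den = 0.0
    for j, ruj in rated.items():
        if j == i:
            continue
        s = item_sim(I[i], I[j])
        num += s * (ruj - preds[j])
        den += s
    return pred_ui + gamma * num / den if den > 0.0 else pred_ui
```

Items with zero (or negative) latent-space similarity contribute nothing, so the correction is driven by the residuals on the neighbors that the factorization considers close.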
3.8 Results

The figure below shows the RMSE on the train and validation sets for sigmoid matrix factorization.

[Figure 6: SMF: RMSE on train and validation sets]

The results show overfitting after the second epoch. Item type specific RMSE is shown in Figure 7.
[Figure 7: item type specific RMSE vs. epochs, train and validation panels]
[Figure 8 panels: (a) RMSE vs. log(#ratings for Tracks), (b) RMSE vs. log(#ratings for Albums), (c) RMSE vs. log(#ratings for Artists), (d) RMSE vs. log(#ratings for Genres) — validation error]
Figure 8: From the ALS run: RMSE vs. number of ratings (note: the red line shows the overall RMSE for that specific item type). The results are summarized in the table below. The results from tensor factorization have been excluded since we are not confident about the training scheme. The LFL run was not tuned for the best parameters. Regularization is similar in all the methods except ALS, where it is multiplied by $n_u$ or $n_i$.
Method (λ/η/k)           RMSE (Test set)   RMSE with NB correction (Test set)
BRISMF (.001/.001/100)   28.6200
SMF (10/.0001/100)       25.6736           25.4884
SHMF (10/.0001/100)      25.1183
ALS (1/-/50)             25.0426
LFL (10/.0001/100)       26.5238

Table 4. Current Results on Test Set.
3.8.1 Timing Information
All these runs used 7 cores on the same node. It takes around 250 seconds to load all the files into memory for Track 1 on a single compute node. On vSMP the loading time is around 400 seconds.

Method (k)   Time in sec per epoch
SMF (100)    200
ALS (50)     400

Table 4. Time.
4 Ideas for further exploration

There are multiple schemes for residual fitting mentioned in [5] which need attention. Another idea is to exploit the hierarchy in the constrained method described in [2], where $U_u = Y_u + \frac{\sum_i I_i W_i}{\sum_i I_i}$ and $I_i$ here denotes the indicator that is 1 if user u has rated item i in the training set. We see that the NB correction method on SMF does improve the test RMSE, and NB correction seems to share the essence of the constrained method: users who have rated items similarly tend to rate items in a similar fashion. Both the NB correction and the constrained method try to bring the latent user features of similar users closer. The authors of [2] have noted that for users with sparse ratings the constrained method provides a considerable improvement. We first need to think of a way to parallelize the updates for $W_i$, and then ponder how to exploit the hierarchy here. One of the other contestants has claimed an RMSE of 23.97 using ALS with the validation set included in training and 100 latent features [6]; he is currently at rank 47. We could optimistically assume that after adding the constrained feature terms and including the validation set in training we should reach the top 20.
Another scheme to try out is to blend different results. Since we are currently aiming at learning more about the dataset and achieving a better RMSE on the validation set using a single method, we feel we should leave this till the end. Another note is that the alternating update and grouping strategy is faster, but the RMSE diverges for some initializations of the latent matrices. The results of ALS tensor factorization are similar to ALS without the temporal term on the validation set. One major difference, however, is that the tensor factorization achieves significantly lower RMSE on the training set (18.xx compared to 20.xx).
5 Parallelism

5.1 Alternating update and grouping strategy
In this scheme, the SGD updates for U and I are decoupled: the U matrix is updated while fixing I, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups and each group is updated independently.
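A sketch of the grouping idea in Python (threads here stand in for the pthreads workers of the C++ code, and the user-id modulo split is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_by_user(ratings, n_groups):
    """Split (u, i, r) triplets into n_groups by user id. With I fixed,
    each row of U is touched only by that user's own ratings, so the
    groups can be updated independently."""
    groups = [[] for _ in range(n_groups)]
    for u, i, r in ratings:
        groups[u % n_groups].append((u, i, r))
    return groups

def parallel_half_epoch(ratings, update_group, n_groups=4):
    """One half of an alternating epoch: update U with I fixed, one
    worker per group (the I half-epoch is symmetric, grouped by item)."""
    groups = partition_by_user(ratings, n_groups)
    with ThreadPoolExecutor(max_workers=n_groups) as ex:
        list(ex.map(update_group, groups))
```

Because no two groups share a row of the matrix being updated, the workers need no locking within a half-epoch.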
[Figure: U split into independent groups and updated in parallel while I is fixed]
5.2 Joint SGD Update by grouping strategy
In this scheme, the SGD updates for U and I are parallelized by creating two disjoint sets of (u, i) pairs, as illustrated in the figure below. The scheme can be applied recursively to each of the disjoint sets for further levels of parallelism. However, since the alternating update strategy seems to work for all the algorithms discussed, this scheme has not been implemented yet.
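The disjoint-set construction can be sketched as below (a single split level with a midpoint cut we chose for illustration; the leftover cross blocks would be split recursively, as noted above):

```python
def two_disjoint_blocks(ratings, n_users, n_items):
    """Split (u, i, r) triplets into two blocks that share no users and
    no items (low-user/low-item and high-user/high-item quadrants), so
    joint SGD can run on both concurrently; `rest` holds the two cross
    quadrants, which would be split the same way at the next level."""
    hu, hi = n_users // 2, n_items // 2
    b1 = [(u, i, r) for u, i, r in ratings if u < hu and i < hi]
    b2 = [(u, i, r) for u, i, r in ratings if u >= hu and i >= hi]
    rest = [(u, i, r) for u, i, r in ratings if (u < hu) != (i < hi)]
    return b1, b2, rest
```

Since b1 and b2 touch disjoint rows of U and disjoint rows of I, simultaneous SGD updates on them never conflict.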
[Figure: partition of the (user, item) rating matrix into disjoint blocks]
6 Software
We initially chose Matlab for implementing the baseline methods, but parallel processing in Matlab turned out to be slower than sequential processing on a single node. We suspect the slowness is caused by some communication overhead in our code which we haven't been able to debug. After spending several days trying to figure out the problem, we gave up and decided to code in C++, which turned out to be a good choice. The pthreads library on GNU/Linux is currently being used for parallelism. As far as we know, there exist no efficient collaborative filtering packages online. As a byproduct of the competition, we are also trying to build a robust and efficient package along the lines of liblinear for regression.
References

1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.
2. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2008.
3. Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).
4. Y. Zhou, D. M. Wilkinson, R. Schreiber, R. Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM (2008) 337-348.
5. A. Paterek. Improving regularized Singular Value Decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.
6. http://groups.google.com/group/graphlab-kdd