

2012 Fourth International Conference on Computational and Information Sciences

Content-Based Filtering Recommendation Algorithm Using HMM

Hui Li, Fei Cai, Zhifang Liao
School of Software, Central South University, Changsha, China
fzlihui@gmail.com, csucaifei@126.com, zhifang.liao@gmail.com

Abstract—In this paper, we combine a probabilistic model with classical content-based filtering recommendation algorithms to propose a new algorithm for recommendation systems, which we call the content-based filtering recommendation algorithm using HMM. We use HMMs of the recommended items to match the user model and recommend items based on user data. According to the experimental results, the new method is more effective at describing a user's interest than the VSM-based algorithm.

Keywords-HMM; content-based filtering; recommendation system

I. INTRODUCTION

The rapid development of the internet brings the problem of information overload: a user receives so much potentially valuable information that reaching the most useful part becomes difficult, which reduces efficiency [1]. Many network applications (web portals, search engines, data indexes, etc.) help users filter information, but these methods only satisfy basic needs and do not provide the personalized services required to solve the information overload problem. As an important information filtering method, recommendation systems provide personalized services by identifying and predicting user preferences, and they have become one of the most useful tools for helping users cope with information overload [2].

The most typical application of recommendation systems is in e-commerce, which has promising development prospects. An online mall can recommend products that may attract customers' attention or satisfy their needs (such as books, videos, etc.) according to their interests. This can increase product sales by meeting the potential needs of users, which are usually unclear or fuzzy. Recommendation systems have become extremely common in recent years; almost all large e-business systems, such as Amazon and eBay, apply them in one way or another.

II. CONTENT-BASED FILTERING ALGORITHM

Content-based recommendation algorithms derive from information retrieval and information filtering research [3]. Content-based recommendation is a continuation and development of the early collaborative filtering methods: it recommends items similar to those the user has selected in the past, rather than relying on users' comments or ratings of items. Many current content-based systems build user profiles and item profiles. A user profile contains information about the user's tastes, preferences, and needs, which can be elicited through questionnaires or learned from transactional behavior over time, while an item profile contains a set of attributes of the item. The system then calculates the similarity between the user profile and the profile of each item and recommends the items that may satisfy the user's needs or tastes. Combining the item features with the user interest model, the utility function is usually defined as:

u(c, s) = score(ContentBasedProfile(c), Content(s))

There are many ways to calculate this utility, for example the cosine similarity measure:

u(c, s) = cos(w_c, w_s) = (w_c · w_s) / (||w_c|| ||w_s||)
        = Σ_{i=1..k} w_{i,c} w_{i,s} / ( √(Σ_{i=1..k} w²_{i,c}) √(Σ_{i=1..k} w²_{i,s}) )

Finally, the system sorts the items by their utility values and recommends to the user the items with the highest degree of similarity to the user's profile.
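To make the computation concrete, the following is a minimal Python sketch of this cosine utility (our own illustration, not code from the paper), assuming the user profile w_c and the item profile w_s are already represented as equal-length keyword-weight vectors:

import math

def cosine_utility(user_weights, item_weights):
    # u(c, s): cosine similarity between the user's keyword-weight
    # vector w_c and the item's keyword-weight vector w_s.
    dot = sum(c * s for c, s in zip(user_weights, item_weights))
    norm_c = math.sqrt(sum(c * c for c in user_weights))
    norm_s = math.sqrt(sum(s * s for s in item_weights))
    if norm_c == 0.0 or norm_s == 0.0:
        return 0.0  # an all-zero profile matches nothing
    return dot / (norm_c * norm_s)

The system would compute this value for every candidate item and recommend the highest-scoring ones.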
III. HIDDEN MARKOV MODEL

The basic theory of hidden Markov models was published in a series of classic papers by Baum and his colleagues in the late 1960s and early 1970s and was first applied to speech processing in the mid-1970s [4]. Since then, HMMs have been widely used in many fields.

An HMM is a probabilistic statistical model. A discrete hidden Markov model consists of a set of states and an alphabet of output symbols [4]. States can both transition from one to another and emit output symbols; the probabilities of these two processes make up the transition distribution over states and the emission distributions over the output symbols, respectively.

A discrete hidden Markov model can be characterized by:

• A set of hidden states Q = {Q_1, Q_2, ..., Q_N}, where Q_t denotes the state at time t.

• A set of output symbols X = {X_1, X_2, ..., X_M} emitted by the hidden states.
• The state-transition probability distribution A = {a_ij}, where a_ij = P[Q_j | Q_i], 1 ≤ i, j ≤ N.

• The output-symbol probability distribution in state j, B = {b_j(k)}, where b_j(k) = P[X_k | Q_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M.

• The initial state probability distribution π = {π_i}, where π_i = P[Q_i], 1 ≤ i ≤ N.

The parameters of an HMM can thus be written compactly as λ = <A, B, π>. Figure I shows the hidden Markov model.

An HMM produces different sequences with different probabilities, and we can compute the probability that a given sequence is produced by a given HMM. Therefore, we can recognize a group of sequences sharing the same pattern by training an HMM. First, we estimate the state-transition probability distribution, the output-symbol probability distribution, and the initial state probability distribution to build the initial HMM. Next, we use the chosen training algorithm to revise the parameters to fit the given data. The trained HMM then represents the group of sequences well, and we can use it to classify unknown sequences: the more similar a sequence is to those in the group, the higher the probability that it is produced by the trained HMM. By computing this output probability, we can cluster sequences.
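The output probability used in the rest of the paper is P(O | λ), which the Forward algorithm computes efficiently. A minimal numpy sketch (our own illustration, following the λ = <A, B, π> notation above):

import numpy as np

def forward_probability(obs, A, B, pi):
    # P(O | lambda) for a discrete HMM lambda = <A, B, pi>, where
    # obs is a sequence of output-symbol indices in 0..M-1,
    # A is the (N, N) transition matrix with A[i, j] = P[Q_j | Q_i],
    # B is the (N, M) emission matrix with B[j, k] = P[X_k | Q_j],
    # and pi is the (N,) initial state distribution.
    alpha = pi * B[:, obs[0]]          # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step over the sequence
    return float(alpha.sum())          # termination: sum over the final states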
IV. CONTENT-BASED FILTERING RECOMMENDATION ALGORITHM USING HMM

A. Theory of the algorithm

This algorithm operates on initial data in the form of sequences of equal length. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ; in other words, it is the minimum number of substitutions required to change one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2.
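In code, this distance is a one-line function (a sketch; the helper name is ours):

def hamming(a, b):
    # Number of positions at which two equal-length strings differ,
    # e.g. hamming("1011101", "1001001") == 2.
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(x != y for x, y in zip(a, b))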
We then use K-Means with Hamming distance as the similarity measure to cluster these sequences preliminarily and build k initial clusters. Next, we train an HMM for each cluster by applying the Forward-Backward algorithm, obtaining k HMMs. After that, we compute the probability of each sequence being produced by each of the k HMMs using the Forward algorithm and reallocate each sequence to the HMM with the largest probability of producing it: the higher that probability, the closer the sequence is to the HMM [5]. In the next step, we retrain each HMM to adapt the models to the new clustering result. We repeat the reallocation and retraining procedures until the clustering result no longer changes or the number of iterations reaches a preset maximum.

We use the initial K-Means clustering result as the input to HMM training in order to increase the similarity within each cluster and the differences between clusters; this initial K-Means step therefore improves the accuracy of the subsequent HMM training and reallocation procedures.
B. Steps of the algorithm

We use data downloaded from GroupLens (http://www.grouplens.org), collected via MovieLens (http://movielens.umn.edu). It contains 100,000 ratings (1-5) from 943 users on 1682 movies, with each user having rated at least 20 movies. The format of the movie data is as follows:

32|Crumb (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Crumb%20(1994)|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0

Each movie record is a pipe-separated list of fields: movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western. The last 19 fields are the genres: 1 indicates that the movie is of that genre, while 0 indicates it is not. A movie can belong to several genres at once.

We transform the movie data into a format we call the "genre string," without taking the release date into account: each character indicates whether the movie belongs to the corresponding genre, so 0000000100000000000 represents the movie Crumb.
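This transformation can be sketched as follows (our own illustration), assuming each record is one pipe-separated line in the format shown above:

def genre_string(item_line):
    # Build the 19-character "genre string" from one movie record:
    # the last 19 pipe-separated fields are the 0/1 genre flags,
    # and the release date is deliberately ignored.
    fields = item_line.rstrip("\n").split("|")
    return "".join(fields[-19:])

# For the Crumb record shown above, genre_string(...) returns
# "0000000100000000000".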
After initializing the data, the subsequent steps are:

a) Use K-Means with Hamming distance as the similarity measure to cluster the genre strings into k clusters. Save the clustering result in the array clusterResult[i][j], 1 ≤ i ≤ k, where j is the number of sequences in that cluster.

b) Train an HMM for each cluster saved in clusterResult[i][j] by applying the Forward-Backward algorithm, obtaining k HMMs, each of which represents its cluster well.

c) Compute the probability of each genre string being produced by each of the k HMMs using the Forward algorithm, reallocate it to the HMM with the largest probability of producing it, and save the new result in clusterResult[i][j].

d) Repeat steps b) and c) until the clustering result no longer changes or the number of iterations reaches the preset maximum.

Through the steps above (sketched in code below), we cluster movies using HMMs. After a user has watched a movie, we can recommend another movie with the same genres.
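Steps b) through d) can be sketched as the following loop (our own illustration; train_hmm stands for a Forward-Backward training routine that fits and returns the parameters (A, B, pi) for a list of sequences, forward_probability is the Forward-algorithm sketch from Section III, and each genre string is treated as a sequence over the two-symbol alphabet {0, 1}):

import numpy as np

def cluster_by_hmm(genre_strings, k, initial_labels, train_hmm, max_iter=50):
    # initial_labels comes from the Hamming-distance K-Means of step a);
    # for brevity this sketch assumes no cluster ever becomes empty.
    seqs = [np.array([int(ch) for ch in s]) for s in genre_strings]
    labels = np.asarray(initial_labels)
    models = []
    for _ in range(max_iter):
        # step b): train one HMM (A, B, pi) per cluster
        models = [train_hmm([s for s, l in zip(seqs, labels) if l == i])
                  for i in range(k)]
        # step c): reallocate each sequence to its most likely HMM
        new_labels = np.array([
            int(np.argmax([forward_probability(s, *m) for m in models]))
            for s in seqs])
        # step d): stop once the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, models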
V. EXPERIMENT

In general, precision and recall rate are used to assess the quality of recommendation results. They can be defined as:

precision = C / T

recall rate = C / S

where C is the number of correctly recommended items, T is the total number of recommended items, and S is the number of items that should be recommended.
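These two measures translate directly into code (a sketch; representing the item lists as Python sets is our choice):

def precision_recall(recommended, relevant):
    # precision = C / T and recall rate = C / S, where C is the number of
    # correctly recommended items, T the total number of recommended
    # items, and S the number of items that should be recommended.
    correct = len(set(recommended) & set(relevant))  # C
    precision = correct / len(recommended) if recommended else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall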

We compare the precision and recall rates of the vector-space-model-based algorithm (VSM-based algorithm) and the HMM-based algorithm; Figure II shows the experimental results. According to the results, the average precision of the HMM-based algorithm is higher than that of the VSM-based algorithm. The main reason is that the VSM-based algorithm requires exact matches, and the number of similar keywords shared between items and user interests is generally so small that its precision drops sharply. By contrast, the HMM-based algorithm avoids this problem by using probability distributions to measure the similarity between items and user interests. The average precision of the HMM-based algorithm is therefore higher.

VI. CONCLUSION

We propose a content-based filtering recommendation algorithm using HMM for personalized recommendation systems. Compared with the VSM-based algorithm, the new method is more effective at describing a user's interest. However, it is difficult to devise an all-purpose recommendation algorithm that describes user interests precisely across different settings. In future work, we will focus on developing related techniques and on further improving the average precision of our algorithm.

REFERENCES

[1] D. Bawden, C. Holtham, and N. Courtney, "Perspectives on information overload," Aslib Proceedings, vol. 51, Sep. 1999, pp. 249-255.
[2] G. Adomavicius and A. Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Transactions on Knowledge and Data Engineering, vol. 17, Jun. 2005, pp. 734-749.
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: Addison-Wesley, 1999, pp. 271-350.
[4] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, Feb. 1989, pp. 257-286.
[5] S. Zhong and J. Ghosh, "A Unified Framework for Model-based Clustering," Journal of Machine Learning Research, vol. 4, Dec. 2003, pp. 1001-1037.

Figure I. The hidden Markov model.

Figure II. Results comparison: precision vs. recall rate for the VSM-based and HMM-based algorithms.

