Академический Документы
Профессиональный Документы
Культура Документы
ABSTRACT 1. INTRODUCTION
Web users display their preferences implicitly by navigating Web recommendation systems have been widely used to
through a sequence of pages or by providing numeric rat- help users locate information on the Web. In many cases,
ings to some items. Web usage mining techniques are used Web users’ interests or preferences are not directly accessi-
to extract useful knowledge about user interests from such ble; instead, they are implicitly captured during users’ in-
data. The discovered user models are then used for a vari- teraction with the Web site, such as navigating through a
ety of applications such as personalized recommendations. sequence of pages, or reflected through numeric ratings to
Web site content or semantic features of objects provide an- items.
other source of knowledge for deciphering users’ needs or Many data mining and machine learning approaches have
interests. We propose a novel Web recommendation system been used in building Web recommendation systems. Such
in which collaborative features such as navigation or rating systems, generally take the Web users’ navigation or rating
data as well as the content features accessed by the users data as input, and applying data mining or machine learn-
are seamlessly integrated under the maximum entropy prin- ing approaches to discover usage patterns that represent ag-
ciple. Both the discovered user patterns and the semantic gregate user models. When a new user comes to the site,
relationships among Web objects are represented as sets of his/her activity will be matched against these patterns to
constraints that are integrated to fit the model. In the case find like-minded users and select potential interesting items
of content features, we use a new approach based on Latent as recommendations.
Dirichlet Allocation (LDA) to discover the hidden seman- Typically, recommendation algorithms rely only on user
tic relationships among items and derive constraints used information (collaborative features) such as navigational his-
in the model. Experiments on real Web site usage data sets tory or rating data, and additional information such as con-
show that this approach can achieve better recommendation tent features of the items, which may provide valuable source
accuracy, when compared to systems using only usage infor- of complementary knowledge about user’s activities, is usu-
mation. The integration of semantic information also allows ally ignored. By incorporating content information with a
for better interpretation of the generated recommendations. user’s navigation or rating behavior, we may be able to gain
a deeper understanding of her underlying interests. For in-
Categories and Subject Descriptors stance, we may find an association pattern “page A ⇒ page
B” with high support and confidence in navigation data, or
H.2.8 [Database Management]: Database Applications— a pattern such as “movie C and movie D are always rated
Data Mining; I.2.6 [Artificial Intelligence]: Learning; I.5.1 similarly” in rating data. The discovery of such patterns,
[Pattern Recognition]: Models—Statistical by itself, does not explain the underlying reasons for their
existence.
General Terms To address this issue considerable work has been done to
Algorithms enhance traditional recommendation systems by integrating
data from other sources such as content attributes, link-
Keywords age structure, and user demographics [13, 8, 14, 17, 5]. In
these approaches, content information such as keywords or
Maximum Entropy, Recommendation, Web Usage Mining, attributes of items, or user demographic data is collected
User Profiling as well as the usage or rating data. In order to make use
of available data sources, different combination methods are
tried to make recommendations more effective and inter-
pretable. Generally, an integrated approach is prefered dur-
Permission to make digital or hard copies of all or part of this work for ing the mining or model learning phase to avoid subjective
personal or classroom use is granted without fee provided that copies are or ad hoc ways of combining evidence. This is also one of
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, our main motivations behind the maximum entropy recom-
to republish, to post on servers or to redistribute to lists, requires prior mendation system introduced in this paper.
specific permission and/or a fee. Maximum entropy model is a powerful statistical model
KDD’05, August 21–24, 2005, Chicago, Illinois, USA. which has been widely applied in many domains such as sta-
Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.
tistical language learning [15], information retrieval [4] and • For rating data, each user is represented as a set as:
text mining [10]. The goal of a maximum entropy model is ui = {< ti1 , r1i >, < ti2 , r2i >, . . . }, where tik ∈ T and rki
to find a probability distribution which satisfies all the con- takes ordinal values, such as from {1,2,3,4,5}.
straints in the observed data while maintaining maximum
entropy. One of the advantages of such a model is that it • H(ui ): recent navigation/rating history of user i, a
enables the unification of information from multiple knowl- subset of ui . Since the most recent activities have
edge sources in one framework. Each knowledge source can greater influence on reflecting a user’s present inter-
be considered as a set of constraints in the model. From ests, we only use the last several pages to represent a
the intersection of all these constraints, a probability distri- user’s history.
bution with the highest entropy can be learned. Recently,
we have seen applications of maximum entropy in Web rec- 2.2 The Maximum Entropy Model
ommendation systems [12, 18, 6]. In these systems, only The system consists of two components. The offline com-
statistics from users’ navigation or rating data are used as ponent accepts constraints from the navigation/rating data
features to train a maximum entropy model. Therefore, the and Web site content information, and estimates the model
advantage of integrating multiple knowledge sources is not parameters. The online part reads an active user session
fully exploited. and runs the recommendation algorithm to generate recom-
In this paper, we propose a novel Web recommendation mendations (a set of pages or rating predictions for unrated
system, in which users’ navigational or rating data and the items) for the user.
semantic content features associated with items are seam- We use the following approach to generate predictions or
lessly integrated under the maximum entropy principle. First, recommendations for active users. In case of navigation
we discover statistics from users’ navigation data and use data, the conditional probability P r(td |H(ui )) of a page td
them as one set of constraints. Secondly, for content infor- being visited next given a user’s recent navigational history
mation, we use a new approach based on Latent Dirichlet H(ui ). In case of rating data, the conditional probability
Allocation (LDA) [1] to discover the hidden semantic rela- P r(< td , rd > |H(ui )) of an item td receiving rating rd next
tionships among visited or rated items and specify another given a user’s recent rating history H(ui ).
set of constraints based on these item association patterns. In our model, we use two sources of knowledge about Web
The LDA model is a proven approach that can uncover the users’ navigational behavior, namely features based on item-
hidden relationships among co-occurring objects. An exam- level usage patterns, and features based on item content
ple of such hidden patterns are “topics” connecting docu- associations. Features will be presented in Section 3. Based
ments and words [1], and the author-topic model in [9]. In on each feature fs , we represent a constraint as:
our framework, the two sets of constraints are combined to
fit the maximum entropy model, based on which we generate
dynamic recommendations for active users. P r(td |H(ui ))fs (H(ui ), td ) = fs (H(ui ), D(H(ui )))
ui td ∈T ui
(1)
2. RECOMMENDATIONS BASED ON MAX- where D(H(ui )) denotes the page following ui ’s history in
IMUM ENTROPY the training data. Each constraint specifies that the ex-
In this section, we present our maximum entropy recom- pected value of each feature w.r.t. the model distribution
mendation model and its algorithm. The scalability of this should always equal its observation value in the training
system is also discussed. data. After we have defined a set of features, F = {f1 , f2 , · · · , ft },
2.1 Notation
and accordingly, generated constraints for each feature, it’s
guaranteed that, under all the constraints, a unique distri-
In this paper, we will use the following notations: bution exists with maximum entropy [3]. This distribution
has the form:
• U = {u1 , u2 , . . . , um }: a set of m users.
exp( λs fs (H(ui ), td ))
• T = {t1 , t2 , . . . , tn }: a set of n pages/items. P r(td |H(ui )) = s
(2)
Z(H(ui ))
• A = {a1 , a2 , . . . , al }: a set of l attributes for pages/items where Z(H(ui )) = td ∈T P r(td |H(ui )) is the normaliza-
within a site. In a content-intensive site, these at- tion constant ensuring that the distribution sums to 1, and
tributes can be keywords from the site. While in a λs are the parameters needed to be estimated . Thus, each
database-driven site, each web page/item represents source of knowledge can be represented as features (and
an object which corresponds to a record in the under- constraints) with associated weights. By using Equation 2,
lying database. In this case, we extract the attributes all knowledge sources (represented by various features) are
of these records. Each page/item then can be repre- taken into account to make predictions about users’ next
sented as an attribute vector tj =< aj1 , aj2 , · · · >. action given their past navigation.
There have been several algorithms which can be applied
• For navigation data, each user is represented as a se- to estimate the λs. Here we use the Sequential Conditional
quence of visited pages as: ui =< ti1 , ti2 , · · · >, where Generalized Iterative Scaling (SCGIS) [2], which seems to be
tik ∈ T . We use binary representation here, assuming very efficient. To boost efficiency techniques such as dimen-
each page is either visited or not visited. (Our model sionality reduction using clustering [2, 11], efficient training
can be easily extended to handle non-binary data as algorithms [7], and automatic feature selection methods can
well.) be used.
2.3 Algorithm for Generating Recommenda- select highly correlated item pairs as features. Here we
tions use item rating similarity to measure the associations
After we have estimated the parameters associated with between items. To offset the difference of rating scales,
each feature, we can use Equation 2 to compute the probabil- we adopt the Adjusted Cosine Similarity measure:
ity of any unvisited page ti being visited next given certain
user’s history, and then pick the pages with highest proba-
m
bilities as recommendations, i.e. we compute the conditional (rak − rk ) × (rbk − rk )
probability P r(ti |ua ) given an active user ua . Then we sort
all pages based on the probabilities and pick the top N pages
to get a recommendation set. The algorithm is as follows:
sim(ta , tb ) =
m
k=1
(rak − rk )2 ×
m
(rbk − rk )2
k=1 k=1
Input: an active user session ua , parameters λs estimated
from the training data. where, rak denotes user k’s rating on item a, and rk
Output: a list of N pages sorted by probability of being stands for the average rating of user k.
visited next given ua .
For each item-pair < ta , tb >, where sim(ta , tb ) ≥ µ (µ
1. Consider the last pages of the active user session, for is the user-specified threshold, used to filter out some
each page ti that does not appear in the active session, less related item pairs), we define a feature as:
assume it is the next page to be visited, and evaluate
⎧⎪
all the features based on their definitions. (In this
⎨ 1 if user rated item ta
⎪⎩
paper, we use bigram as usage features, so we only
look at the last visited page. Apparently, high-order with rating ra and
fta ,ra ,tb ,rb (H(ui ), tb , rb ) =
page transitions can be used as features as well, with item tb with rating rb ,
higher computation complexity.) 0 otherwise
⎧⎨ zi |θti
φ
∼
∼
Discrete(θti )
Dirichlet(β)
(4)
(5)
⎩
1 if page ta is the last page
fta ,tb (H(ui ), td ) = in H(ui ), and tb = td , aj |zi , φ zi
∼ Discrete(φzi ) (6)
0 otherwise
Here, z represents a set of hidden “classes”. θti denotes
item ti ’s association with multiple “classes” and φzi specifies
• For rating data, generally the order of ratings is not “class” zi as a distribution over attributes. α and β are
important. We have to resort to other approaches to hyperparameters of the prior of θ and φ.
We use the Variational Bayes technique to estimate each Movie Class 8
Movie Name Attribute Probability
item’s association with multiple “classes” (θ), and the as- Mighty Morphin Power Rangers Genre: Adventure 0.290
sociations between “classes” and attributes (β). As shown Stargate Genre: Action 0.151
in [1], these “classes” themselves can be used as derived at- 2001: A Space Odyssey Genre: Sci-Fi 0.087
20,000 Leagues Under the Sea A: James Doohan 0.023
tributes for text classification task. We observe that most E.T.: The Extraterrestrial A: Walter Koenig 0.023
items are only strongly associated with one “class” (based Raiders of the Lost Ark A: William Shatner 0.023
on θ vector for each user), so we assign each item to its A Clockwork Orange D: Steven Spielberg 0.023
Back to the Future A: DeForest Kelley 0.020
dominant “class”. Therefore, each “class” comprises of a
Jaws … …
set of items which are semantically similar. To better inter- Star Trek: Generation
pret these “classes”, we can easily identify their prominent Star Trek: The Motion Picture
attributes based on β. and other Star Trek series
…
After assigning each item to a “class”, we can define our Movie Class 11
features. In the case of navigation data, for each item pair Movie Name Attribute Probability
< ta , tb > where ta and tb both belong to the same “class” Sense and Sensibility Genre: Romance 0.073
⎧⎨
Restoration Genre: Drama 0.039
I.Q. A: Hugh Grant 0.034
Bitter Moon A: Meg Ryan 0.034
⎩
Four Weddings and a Funeral A: Emma Thompson 0.027
1 if page ta is the last page The Englishman Who Went up a Hill A: Billy Crystal 0.023
fta ,tb ,z (H(ui ), td ) = in H(ui ), tb = td , But Came Down a Mountain
0 otherwise The Remains of the Day A: Roger Ashton-Griffiths 0.023
Sirens … …
In the case of rating data, similarly, if ta and tb both Sleepless in Seattle A: Rob Reiner 0.012
When Harry Met Sally … …
belong to the same “class” z, we define a feature as:
⎧⎪
Top Gun
⎨
Courage Under Fire
…
1 if user rated
fta ,ra ,tb ,rb ,z (H(ui ), tb , rb ) =
⎪⎩ item ta with rating ra ,
item tb with rating rb ,
0 otherwise
Figure 1: Examples of movie classes and their top at-
tributes
MAE Comparison
ItemCF 1.13
Maximum
Figure 2: Examples of real estate property classes and 1.08
their top attributes Entropy
4.3 Quantitative Evaluation sets to which they are applied. Data sparsity has a negative
impact on the accuracy and predictability of recommenda-
For “Realty data”, to exploit the ordering information tions. This is one area in which, we believe, the integration
in users’ navigation data, we built another recommenda- of semantic knowledge with ratings data can provide signifi-
tion system based on the standard first-order Markov model cant advantage. Here we created multiple training/test data
to predict and recommend which page to visit next. The sets in which the proportion of the training data to the com-
Markov-based system models each page as a state in a state plete ratings data set was changed from 90% to 10% (For
diagram with each state transition labeled by the condi- each user, we randomly select certain percentage of ratings
tional probabilities estimated from the actual navigational as training data and the rest as test data). These propor-
data from the server log data. It generates recommendations tions have a direct correspondence with the level of sparsity
based on the state transition probability of the last page in in the ratings data.
the active user session. Figure 5 indicates that the advantage of our system over
Figure 3 depicts the comparison of the Hit Ratio measures the item-based CF system tends to be more distinctive when
for these two recommender systems in “Realty data”. The the data gets sparser. One possible explanation is that when
experiments show that the maximum entropy recommenda- the training data gets sparser, we are less likely to find movie
tion system has a clear overall advantage in terms of accu- neighbors with very high rating similarity and make accurate
racy over the first-order Markov recommendation system on predictions based on neighbor movies’ ratings. At this time,
this data set. for maximum entropy recommendation system, some fea-
For the movie data, we implement a standard item-based tures based on movie rating similarity also get lower weights,
Sixth Conference on Natural Language
Learning(2002), 2002.
[8] B. Mobasher, H. Dai, T. Luo, Y. Sun, and J. Zhu.
Integrating web usage and content mining for more
effective personalization. In E-Commerce and Web
Technologies: Proceedings of the EC-WEB 2000
Conference, Lecture Notes in Computer Science
(LNCS) 1875, pages 165–176. Springer, September
2000.
[9] M.Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths.
Probabilistic author-topic models for information
discovery. In Proceedings of the International
Conference on Knowledge Discovery and Data Mining,
2004.
[10] K. Nigram, J. Lafferty, and A. McCallum. Using
maximum entropy for text classification. In
Figure 5: Impact of Data Sparsity on Recommendations Proceedings of IJCAI-1999, 1999.
[11] D. Pavlov, E. Manavoglu, D. Pennock, and C. Giles.
Collaborative filtering with maximum entropy. IEEE
but features based on movie content similarity will be less Intelligent Systems, Special Issue on Mining the Web
affected and still contribute remarkable weights to make ac- Actionable Knowledge, 2004.
curate predictions. [12] D. Pavlov and D. Pennock. A maximum entropy
approach to collaborative filtering in dynamic, sparse,
5. CONCLUSION high-dimensional domains. In Proceedings of Neural
In this paper, we have proposed a novel Web recommenda- Information Processing Systems(2002), 2002.
tion system in which users’ navigation or rating data as well [13] M. Pazzani. A framework for collaborative,
as the content features accessed by the users are seamlessly content-based and demographic filtering. Artificial
integrated under the maximum entropy principle. In the Intelligence Review, 13(5-6):393–408, 1999.
case of content information, we use a new approach based [14] A. Popescul, L. Ungar, D. Pennock, and S. Lawrence.
on Latent Dirichlet Allocation (LDA) to discover the hidden Probabilistic models for unified collaborative and
semantic relationships among items. The resulting models content-based recommendation in sparse-data
are used to generate dynamic recommendations for active environments. In Proceedings of 17th UAI, Seattle,
users. Experiments show that this approach can achieve WA, 2001.
better recommendation accuracy with small number of rec- [15] R. Rosenfeld. Adaptive statistical language modeling:
ommendations. Furthermore, the proposed framework pro- A maximum entropy approach. Phd dissertation,
vides reasonable accuracy in predictions in the face of high CMU, 1994.
data sparsity. [16] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Item-based collaborative filtering recommendation
6. REFERENCES algorithms. In Proceedings of the 10th International
WWW Conference, Hong Kong, May 2001.
[1] D. Blei, A. Ng, and M. Jordan. Latent dirichlet [17] K. Yu, A. Schwaighofer, V. Tresp, W. Ma, and
allocation. Journal of Machine Learning Research, H. Zhang. Collaborative ensembling learning:
3:993–1022, 2003. Combining collaborative and content-based
[2] J. Goodman. Sequential conditional generalized information filtering. In Proceedings of 19th UAI, 2003.
iterative scaling. In Proceedings of NAACL-2002, 2002. [18] C. Zitnick and T. Kanade. Maximum entropy for
[3] F. Jelinek. Statistical Methods for Speech Recognition. collaborative filtering. In Proceedings of 20th
MIT Press, MA, 1998. International Conference on Uncertainty in Artificial
[4] J. Jeon and R. Manmatha. Using maximum entropy Intelligence (UAI’04), Banff, Canada, July 2004.
for automatic image annotation. In Proceedings of the
International Conference on Image and Video
Retrieval (CIVR-2004), 2004.
[5] X. Jin, Y. Zhou, and B. Mobasher. A unified approach
to personalization based on probabilistic latent
semantic models of web usage and content. In
Proceedings of the AAAI 2004 Workshop on Semantic
Web Personalization (SWP’04), San Jose, CA, 2004.
[6] X. Jin, Y. Zhou, and B. Mobasher. Task-oriented web
user modeling for recommendation. In Proceedings of
the 10th International Conference on User Modeling
(UM’05), UK, April 2005.
[7] R. Malouf. A comparison of algorithms for maximum
entropy parameter estimation. In Proceedings of the